Recently Quora put out a question-similarity competition on Kaggle. This was the first time I attempted an NLP problem, so there was a lot to learn. The one thing that blew my mind was word2vec embeddings.
Until now, whenever I heard the term word2vec, I visualized it as a way to create a bag-of-words vector for a sentence.
For those who don’t know bag of words: suppose we have a series of sentences (documents).
Bag of words would encode them using the vocabulary 0:This 1:is 2:good 3:bad 4:awesome — each sentence becomes a vector of word counts over this vocabulary.
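A minimal bag-of-words sketch using that vocabulary (the three example sentences are my own illustration):

```python
sentences = ["This is good", "This is bad", "This is awesome"]

# Build the vocabulary: word -> index, in order of first appearance
vocab = {}
for s in sentences:
    for w in s.lower().split():
        vocab.setdefault(w, len(vocab))
print(vocab)  # {'this': 0, 'is': 1, 'good': 2, 'bad': 3, 'awesome': 4}

# Encode a sentence as a vector of word counts over the vocabulary
def bow(sentence):
    vec = [0] * len(vocab)
    for w in sentence.lower().split():
        vec[vocab[w]] += 1
    return vec

print(bow("This is good"))  # [1, 1, 1, 0, 0]
```

Note that the encoding only records which words occur, not what they mean — which is exactly the limitation word2vec addresses.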
But it is much more powerful than that.
What word2vec does is create vectors for words. What I mean by that is that every word (and common bigrams too) in the dictionary gets its own 300-dimensional vector.
We can use this for multiple scenarios but the most common are:
A. Using word2vec embeddings we can find the similarity between words. Assume you have to answer whether these two statements signify the same thing:
If we use a plain sentence-similarity metric or a bag-of-words approach to compare these two sentences, we will get a pretty low score, since they share few exact words.
But with word embeddings, the vectors for the differing words sit close together, so we can tell the two sentences mean roughly the same thing.
B. Encode sentences: I read a post by Abhishek Thakur, a prominent Kaggler (a must-read). He used these word embeddings to create a 300-dimensional vector for every sentence.
His approach: let’s say the sentence is “What is this”, and let’s say the embedding for every word is given in 4 dimensions (normally a 300-dimensional embedding is given).
Then the vector for the sentence is the element-wise sum of the word vectors, normalized to unit length, i.e.
Element-wise addition: [0.25+1+0.5, 0.25+0+0, 0.25+0+0, 0.25+0+0.5] = [1.75, 0.25, 0.25, 0.75], divided by math.sqrt(1.75^2 + 0.25^2 + 0.25^2 + 0.75^2) ≈ 1.94, gives approximately [0.90, 0.13, 0.13, 0.39].
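The sum-and-normalize step can be sketched in a few lines of Python. The toy 4-dimensional word vectors below are illustrative values chosen to reproduce the element-wise sums in the worked example:

```python
import math

# Toy 4-d embeddings (illustrative values, not real word2vec output)
embedding = {
    "what": [0.25, 0.25, 0.25, 0.25],
    "is":   [1.0,  0.0,  0.0,  0.0],
    "this": [0.5,  0.0,  0.0,  0.5],
}

def sentence_vector(sentence):
    words = sentence.lower().split()
    # Element-wise sum of the word vectors
    total = [sum(embedding[w][i] for w in words) for i in range(4)]
    # Divide by the L2 norm of the summed vector
    norm = math.sqrt(sum(x * x for x in total))
    return [x / norm for x in total]

print(sentence_vector("What is this"))
# sum = [1.75, 0.25, 0.25, 0.75], norm = sqrt(3.75) ≈ 1.94
# result ≈ [0.90, 0.13, 0.13, 0.39]
```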
Thus I can convert any sentence into a vector of a fixed dimension (decided by the embedding). To find the similarity between two sentences, I can use a variety of distance/similarity metrics.
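The most common such metric is cosine similarity. A minimal sketch, with two made-up vectors standing in for sentence vectors:

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the L2 norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = [1.0, 0.0, 0.5]   # made-up "sentence vectors"
v2 = [0.8, 0.2, 0.4]
print(cosine_similarity(v1, v2))  # ≈ 0.976: nearly parallel vectors
```

A score near 1 means the vectors point the same way (similar sentences); near 0 means they are unrelated.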
C. It also enables us to do algebraic manipulations on words, which was not possible before. For example: what is king - man + woman?
Guess what it comes out to be: Queen.
Now that we know a little bit of the fundamentals, let’s get down to the coding part.
First of all, we download a pretrained word embedding from Google (the GoogleNews vectors). There are many other embeddings too.
The above file is pretty big, so the download might take some time. Then on to the code.
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin.gz', binary=True)
What is king - man + woman?
model.most_similar(positive = ['king','woman'],negative = ['man'])
You can do plenty of freaky/cool things using this:
model.most_similar(positive = ['emma','he','male','mr'],negative = ['she','mrs','female'])
model.doesnt_match("math shopping reading science".split(" "))
I think shopping doesn’t belong in this list!
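Roughly speaking, doesnt_match averages the word vectors and returns the word least similar to that mean. A toy sketch of the idea with made-up 2-d vectors:

```python
import math

# Made-up 2-d vectors: three "school" words and one outlier
vectors = {
    "math":     [0.9,  0.1],
    "reading":  [0.8,  0.2],
    "science":  [0.85, 0.15],
    "shopping": [0.1,  0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Average all the vectors, then pick the word least similar to the mean
mean = [sum(v[i] for v in vectors.values()) / len(vectors) for i in range(2)]
odd = min(vectors, key=lambda w: cosine(vectors[w], mean))
print(odd)  # → shopping
```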
In this paper, the authors have shown that item-based CF can be cast in the same framework as word embeddings.
Library - Books = Hall
Obama + Russia - USA = Putin
Iraq - Violence = Jordan
President - Power = Prime Minister (Not in India Though)
Is this model sexist?
model.most_similar(positive = ["donald_trump"],negative = ['brain'])
Whatever it is doing, it surely feels like magic. Next time I will try to write more on how it works, once I understand it fully.