Recently Quora put out a question similarity competition on Kaggle. This was my first attempt at an NLP problem, so there was a lot to learn. The one thing that blew my mind was the word2vec embeddings.
Until now, whenever I heard the term word2vec, I pictured it as a way to create a bag-of-words vector for a sentence.
For those who don’t know bag of words: suppose we have a series of sentences (documents):
- This is good - [1,1,1,0,0]
- This is bad - [1,1,0,1,0]
- This is awesome - [1,1,0,0,1]
Bag of words encodes each sentence against the vocabulary 0:This 1:is 2:good 3:bad 4:awesome, putting a 1 at the position of every word that is present.
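The encoding above can be reproduced in a few lines of plain Python. This is only a minimal sketch: the helper name `bag_of_words` is my own, and it assumes words are separated by whitespace and that case does not matter.

```python
def bag_of_words(sentences):
    """Return (vocabulary, vectors) using a binary bag-of-words encoding."""
    # Assign each new word the next free index, in order of first appearance.
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))

    # Mark a 1 at the index of every word present in the sentence.
    vectors = []
    for sentence in sentences:
        vec = [0] * len(vocab)
        for word in sentence.lower().split():
            vec[vocab[word]] = 1
        vectors.append(vec)
    return vocab, vectors


sentences = ["This is good", "This is bad", "This is awesome"]
vocab, vectors = bag_of_words(sentences)
print(vocab)
print(vectors)
```

Note that the vector only records which words occur; it carries no notion of what the words mean.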
But word2vec is much more powerful than that.
What word2vec does is create vectors for words. That is, we have a 300-dimensional vector for every word (and for common bigrams too) in a dictionary.
How does that help?
We can use this for multiple scenarios but the most common are:
A. Using word2vec embeddings we can find the similarity between words. Suppose you have to decide whether these two statements mean the same thing:
- President greets press in Chicago
- Obama speaks to media in Illinois.
If we compare these two sentences with a word-overlap similarity metric or a bag-of-words approach, we get a pretty low score.
But with a word embedding we can say that
- President is similar to Obama
- greets is similar to speaks
- press is similar to media
- Chicago is similar to Illinois
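To see how low the bag-of-words score for those two sentences really is, here is a small sketch that computes the cosine similarity of their word-count vectors. The function name `bow_cosine` and the choice of cosine similarity are my own assumptions; any overlap-based metric would tell the same story.

```python
import math
from collections import Counter


def bow_cosine(a, b):
    """Cosine similarity between the word-count vectors of two sentences."""
    # Lowercase and drop full stops so "Illinois." and "illinois" match.
    ca = Counter(a.lower().replace(".", "").split())
    cb = Counter(b.lower().replace(".", "").split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(c * c for c in ca.values()))
    norm_b = math.sqrt(sum(c * c for c in cb.values()))
    return dot / (norm_a * norm_b)


s1 = "President greets press in Chicago"
s2 = "Obama speaks to media in Illinois."
score = bow_cosine(s1, s2)
print(round(score, 2))  # 0.18
```

The only word the two sentences share is "in", so the score is close to zero even though they describe much the same event.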
B. Encode sentences: I read a post by Abhishek Thakur, a prominent Kaggler (a must-read). He used these word embeddings to create a 300-dimensional vector for every sentence.
His approach: let's say the sentence is “What is this”, and let's say the embedding for every word has 4 dimensions (normally a 300-dimensional embedding is used):
- what : [.25 ,.25 ,.25 ,.25]
- is : [ 1 , 0 , 0 , 0]
- this : [ .5 , 0 , 0 , .5]
Then the vector for the sentence is the normalized elementwise sum of the word vectors, i.e.
Elementwise addition: [.25+1+.5, .25+0+0, .25+0+0, .25+0+.5] = [1.75, .25, .25, .75]. Dividing by math.sqrt(1.75^2 + .25^2 + .25^2 + .75^2) ≈ 1.94 gives: [0.90, 0.13, 0.13, 0.39]
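The same arithmetic can be checked in plain Python. This is a sketch of the sum-then-normalize idea as I understood it, not the code from the original post; `sentence_vector` is a name I made up, and the 4-dimensional embeddings are the toy values from the example.

```python
import math


def sentence_vector(word_vectors):
    """Sum word vectors elementwise, then scale the result to unit length."""
    summed = [sum(components) for components in zip(*word_vectors)]
    norm = math.sqrt(sum(x * x for x in summed))
    return [x / norm for x in summed]


# Toy 4-dimensional embeddings from the example (real ones have 300 dims).
embeddings = {
    "what": [0.25, 0.25, 0.25, 0.25],
    "is": [1.0, 0.0, 0.0, 0.0],
    "this": [0.5, 0.0, 0.0, 0.5],
}

sent_vec = sentence_vector([embeddings[w] for w in "what is this".split()])
print([round(x, 2) for x in sent_vec])  # [0.9, 0.13, 0.13, 0.39]
```

The summed vector is [1.75, 0.25, 0.25, 0.75] and its length is sqrt(3.75), roughly 1.94, so the result has unit length.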
Thus I can convert any sentence to a vector of fixed dimension (decided by the embedding). To find the similarity between two sentences I can then use a variety of distance/similarity metrics.
C. It also lets us do algebraic manipulations on words, which was not possible before. For example: what is king - man + woman?
Guess what it comes out to be: Queen
Now that we know a bit of the fundamentals, let's get down to the coding part.
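To get a feel for why this kind of arithmetic can work, here is a toy sketch. The 3-dimensional vectors below are invented by me purely for illustration (roughly "royalty", "gender" and "fruit" axes); real word2vec dimensions are learned and not interpretable like this. The `analogy` helper is also my own and is simpler than what gensim does internally.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Hypothetical hand-made vectors: [royalty, gender, fruit]. Not real embeddings.
toy = {
    "king": [0.9, 0.9, 0.0],
    "queen": [0.9, -0.9, 0.0],
    "man": [0.1, 0.9, 0.0],
    "woman": [0.1, -0.9, 0.0],
    "apple": [0.0, 0.0, 1.0],
}


def analogy(positive, negative, vectors):
    """Add the positive words, subtract the negative ones, return the nearest other word."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for w in positive:
        target = [t + x for t, x in zip(target, vectors[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, vectors[w])]
    # Skip the query words themselves, otherwise "king" would often win.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


answer = analogy(["king", "woman"], ["man"], toy)
print(answer)  # queen
```

Subtracting man from king removes the "male" component and adding woman puts the "female" one in, so the result lands on queen.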
First of all we download a pretrained word embedding from Google. There are many other embeddings too.
The file is pretty big, so the download might take some time. Then we move on to the code.
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin.gz', binary=True)
1. Starting simple, let's find similar words. Want to find words similar to python?
model.most_similar('python')
2. Now we can use this model to find the solution to the equation:
What is king - man + woman?
model.most_similar(positive = ['king','woman'],negative = ['man'])
You can do plenty of freaky/cool things using this:
3. Let's say you wanted a girl and had a girl's name like emma in mind, but you got a boy. So what is the male version of emma?
model.most_similar(positive = ['emma','he','male','mr'],negative = ['she','mrs','female'])
4. Find which word doesn’t belong in a list:
model.doesnt_match("math shopping reading science".split(" "))
I think shopping doesn’t belong in this list!
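As far as I can tell, the odd-one-out query works by averaging the unit-length vectors of all the words and returning the word least similar to that average. Here is a sketch of that idea in plain Python. The 2-dimensional vectors are invented by me for illustration (roughly "academic" and "commerce" axes), and `odd_one_out` is my own name, not a gensim function.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the mean unit vector."""
    units = {w: unit(vectors[w]) for w in words}
    mean = [sum(col) / len(words) for col in zip(*units.values())]
    # Dot product with the mean ranks words the same way cosine similarity does here.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Hypothetical hand-made vectors: [academic, commerce]. Not real embeddings.
toy = {
    "math": [0.9, 0.1],
    "shopping": [0.1, 0.9],
    "reading": [0.8, 0.2],
    "science": [0.95, 0.05],
}

odd = odd_one_out("math shopping reading science".split(" "), toy)
print(odd)  # shopping
```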
Other Cool Things
1. In this paper, the authors have shown that item-based collaborative filtering (CF) can be cast in the same framework as word embedding.
2. Some other examples that people have seen after using their own embeddings:
Library - Books = Hall
Obama + Russia - USA = Putin
Iraq - Violence = Jordan
President - Power = Prime Minister (Not in India Though)
3. Seeing the above, I started playing with it a little.
Is this model sexist?
model.most_similar(positive = ["donald_trump"],negative = ['brain'])
Whatever it is doing, it surely feels like magic. Next time I will try to write more about how it works, once I understand it fully.