Today I Learned This Part I: What are word2vec Embeddings?
Recently Quora put out a question similarity competition on Kaggle. This is the first time I have attempted an NLP problem, so there was a lot to learn. The one thing that blew my mind was word2vec embeddings.
Until now, whenever I heard the term word2vec I visualized it as just another way to create a bag-of-words vector for a sentence.
For those who don’t know bag of words: if we have a series of sentences (documents) like these
This is good - [1,1,1,0,0]
This is bad - [1,1,0,1,0]
This is awesome - [1,1,0,0,1]
Bag of words would encode them using the vocabulary 0:This 1:is 2:good 3:bad 4:awesome.
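Just to make that concrete, here is a minimal bag-of-words sketch using scikit-learn’s CountVectorizer (my choice of library here, nothing to do with word2vec itself); note that it orders the vocabulary alphabetically, so the columns come out in a different order than my hand-written mapping above.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["This is good", "This is bad", "This is awesome"]
vectorizer = CountVectorizer(binary=True)  # binary=True gives 0/1 presence vectors
bow = vectorizer.fit_transform(docs)
print(vectorizer.vocabulary_)  # maps each word to its column index (alphabetical)
print(bow.toarray())
# [[0 0 1 1 1]    This is good
#  [0 1 0 1 1]    This is bad
#  [1 0 0 1 1]]   This is awesome
# columns: awesome, bad, good, is, this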
But word2vec is much more powerful than that.
What word2vec does is create vectors for individual words. What I mean by that is that we get a 300-dimensional vector for every word (and common bigrams too) in a dictionary.
How does that help?
We can use this in multiple scenarios, but the most common are:
A. Using word2vec embeddings we can find the similarity between words. Assume you have to decide whether these two statements signify the same thing:
President greets press in Chicago
Obama speaks to media in Illinois.
If we compare these two sentences with a bag-of-words approach we will get a pretty low similarity score, since they share almost no words.
But with word embeddings we can say that:
President is similar to Obama
greets is similar to speaks
press is similar to media
Chicago is similar to Illinois
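Here is a rough sketch of that intuition, assuming the pretrained Google News model we load in the coding section below (the exact scores will depend on the embedding you use).
# pairwise similarities between the aligned words of the two sentences
# (assumes `model` is the KeyedVectors object loaded further down;
#  the Google News vectors are case-sensitive, hence the capitalised names)
pairs = [('President', 'Obama'),
         ('greets', 'speaks'),
         ('press', 'media'),
         ('Chicago', 'Illinois')]
for w1, w2 in pairs:
    print(w1, w2, model.similarity(w1, w2))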
B. Encode sentences: I read a post by Abhishek Thakur, a prominent Kaggler (a must read). What he did was use these word embeddings to create a 300-dimensional vector for every sentence.
His approach: let’s say the sentence is “What is this”, and let’s say the embedding for every word is given in 4 dimensions (normally the embedding is 300-dimensional):
what : [.25 ,.25 ,.25 ,.25]
is : [ 1 , 0 , 0 , 0]
this : [ .5 , 0 , 0 , .5]
Then the vector for the sentence is the elementwise sum of the word vectors, normalized to unit length, i.e.
Elementwise addition : [.25+1+0.5, 0.25+0+0 , 0.25+0+0, .25+0+.5] = [1.75, .25, .25, .75]
divided by its length
math.sqrt(1.75**2 + .25**2 + .25**2 + .75**2) ≈ 1.94
gives: [0.90, 0.13, 0.13, 0.39]
Thus I can convert any sentence into a vector of a fixed dimension (decided by the embedding). To find the similarity between two sentences I can then use any of a variety of distance/similarity metrics, for example cosine similarity, as sketched below.
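Here is a minimal sketch of that sentence-to-vector idea, my own rough re-implementation rather than Abhishek’s exact code, again assuming the Google News model loaded in the coding section below.
import numpy as np

def sent2vec(sentence, model, dim=300):
    # keep only the words the embedding actually knows about
    words = [w for w in sentence.split() if w in model]
    if not words:
        return np.zeros(dim)
    # elementwise sum of the word vectors...
    v = np.sum([model[w] for w in words], axis=0)
    # ...normalized to unit length
    return v / np.sqrt((v ** 2).sum())

s1 = sent2vec("President greets the press in Chicago", model)
s2 = sent2vec("Obama speaks to the media in Illinois", model)
# dot product of two unit vectors = cosine similarity
print(np.dot(s1, s2))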
C. It also enables us to do algebraic manipulations on words, which was not possible before. For example: what is king - man + woman?
Guess what it comes out to be: queen.
Application/Coding:
Now that we know a little bit of the fundamentals, let’s get down to the coding part.
First of all we download a pretrained word embedding from Google (trained on Google News). There are many other embeddings available too.
wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
The above file is pretty big, so the download might take some time. Then, moving on to the code.
from gensim.models import KeyedVectors
# load the pretrained Google News vectors (binary word2vec format)
model = KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin.gz', binary=True)
1. Starting simple, let’s find some similar words. Want to find words similar to python?
model.most_similar('python')
[(u'pythons', 0.6688377261161804),
(u'Burmese_python', 0.6680364608764648),
(u'snake', 0.6606293320655823),
(u'crocodile', 0.6591362953186035),
(u'boa_constrictor', 0.6443519592285156),
(u'alligator', 0.6421656608581543),
(u'reptile', 0.6387745141983032),
(u'albino_python', 0.6158879995346069),
(u'croc', 0.6083582639694214),
(u'lizard', 0.601341724395752)]
2. Now we can use this model to find the solution to the equation:
What is king - man + woman?
model.most_similar(positive = ['king','woman'],negative = ['man'])
[(u'queen', 0.7118192315101624),
(u'monarch', 0.6189674139022827),
(u'princess', 0.5902431011199951),
(u'crown_prince', 0.5499460697174072),
(u'prince', 0.5377321839332581),
(u'kings', 0.5236844420433044),
(u'Queen_Consort', 0.5235946178436279),
(u'queens', 0.5181134343147278),
(u'sultan', 0.5098593235015869),
(u'monarchy', 0.5087412595748901)]
You can do plenty of freaky/cool things using this:
3. Let’s say you wanted a girl and had a girl’s name like emma in mind, but you got a boy. So what is the male version of emma?
model.most_similar(positive = ['emma','he','male','mr'],negative = ['she','mrs','female'])
[(u'sanchez', 0.4920658469200134),
(u'kenny', 0.48300960659980774),
(u'alves', 0.4684845209121704),
(u'gareth', 0.4530612826347351),
(u'bellamy', 0.44884198904037476),
(u'gibbs', 0.445194810628891),
(u'dos_santos', 0.44508373737335205),
(u'gasol', 0.44387346506118774),
(u'silva', 0.4424275755882263),
(u'shaun', 0.44144102931022644)]
4. Find which word doesn’t belong in a list?
model.doesnt_match("math shopping reading science".split(" "))
I think shopping doesn’t belong in this list!
Other Cool Things
1. Recommendations:
In this paper, the authors have shown that item-based CF (collaborative filtering) can be cast in the same framework as word embeddings.
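As I understand it, the rough idea is to treat each user’s sequence of items as a sentence and each item ID as a word, and then train word2vec on those “sentences”. A toy sketch with gensim (the item IDs and sequences are made up for illustration; the size argument is called vector_size in newer gensim versions):
from gensim.models import Word2Vec

# each "sentence" is one user's interaction history; item IDs play the role of words
baskets = [
    ['item_12', 'item_7', 'item_33', 'item_7'],
    ['item_7', 'item_33', 'item_98'],
    ['item_12', 'item_98', 'item_7'],
]

item_model = Word2Vec(baskets, size=50, window=5, min_count=1, sg=1)  # sg=1: skip-gram
print(item_model.wv.most_similar('item_7'))  # items that tend to co-occur with item_7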
2. Some other examples that people have seen after using their own embeddings:
Library - Books = Hall
Obama + Russia - USA = Putin
Iraq - Violence = Jordan
President - Power = Prime Minister (Not in India Though)
3. Seeing the above, I started playing with it a little.
Is this model sexist?
model.most_similar(positive = ["donald_trump"],negative = ['brain'])
[(u'novak', 0.40405112504959106),
(u'ozzie', 0.39440611004829407),
(u'democrate', 0.39187556505203247),
(u'clinton', 0.390536367893219),
(u'hillary_clinton', 0.3862358033657074),
(u'bnp', 0.38295692205429077),
(u'klaar', 0.38228923082351685),
(u'geithner', 0.380607008934021),
(u'bafana_bafana', 0.3801495432853699),
(u'whitman', 0.3790769875049591)]
Whatever it is doing, it surely feels like magic. Next time I will try to write more about how word2vec actually works, once I understand it fully.