# word2vec

## Everything you know about word2vec is wrong

The classic explanation of `word2vec`, in skip-gram, with negative sampling, in the paper and countless blog posts on the internet is as follows:

```
while(1) {
   1. vf = vector of focus word
   2. vc = vector of context word
   3. train such that (vc . vf = 1)
   4. for(0 <= i < negative samples):
        vneg = vector of word *not* in context
        train such that (vf . vneg = 0)
}
```

Indeed, if I google "word2vec skipgram", the results I get are:

- The Wikipedia page, which describes the algorithm at a high level
- The TensorFlow page, with the same explanation
- The Towards Data Science blog, which describes the same algorithm

The list goes on. However, **every single one of these implementations is wrong**.

The original word2vec `C` implementation does *not* do what's explained above, and is *drastically different*. Most serious users of word embeddings, who use embeddings generated from `word2vec`, do one of the following things:

- They invoke the original C implementation directly.
- They invoke the `gensim` implementation, which is *transliterated* from the C source to the extent that the variable names are the same.

Indeed, the `gensim` implementation is the *only one that I know of* which is faithful to the C implementation.

## The C implementation

The C implementation in fact maintains *two vectors for each word*: one where it appears as a focus word, and one where it appears as a context word. (Is this sounding familiar? Indeed, it appears that GloVe actually took this idea from `word2vec`, which has never mentioned this fact!)

The setup is incredibly well done in the C code:

- An array called `syn0` holds the vector embedding of a word when it occurs as a *focus word*. This is *randomly initialized*.

  https://github.com/tmikolov/word2vec/blob/20c129af10659f7c50e86e3be406df663beff438/word2vec.c#L369

  ```c
  for (a = 0; a < vocab_size; a++)
    for (b = 0; b < layer1_size; b++) {
      next_random = next_random * (unsigned long long)25214903917 + 11;
      syn0[a * layer1_size + b] =
          (((next_random & 0xFFFF) / (real)65536) - 0.5) / layer1_size;
    }
  ```

- Another array called `syn1neg` holds the vector of a word when it occurs as a *context word*. This is *zero initialized*.

  https://github.com/tmikolov/word2vec/blob/20c129af10659f7c50e86e3be406df663beff438/word2vec.c#L365

  ```c
  for (a = 0; a < vocab_size; a++)
    for (b = 0; b < layer1_size; b++)
      syn1neg[a * layer1_size + b] = 0;
  ```

- During training (skip-gram, negative sampling, though other cases are also similar), we first pick a focus word. This is held constant throughout the positive and negative sample training. The gradients of the focus vector are accumulated in a buffer, and are applied to the focus word *after* it has been affected by both positive and negative samples.

  ```c
  if (negative > 0) for (d = 0; d < negative + 1; d++) {
    // if we are performing negative sampling, in the 1st iteration,
    // pick a word from the context and set the dot product target to 1
    if (d == 0) {
      target = word;
      label = 1;
    } else {
      // for all other iterations, pick a word randomly and set the dot
      // product target to 0
      next_random = next_random * (unsigned long long)25214903917 + 11;
      target = table[(next_random >> 16) % table_size];
      if (target == 0) target = next_random % (vocab_size - 1) + 1;
      if (target == word) continue;
      label = 0;
    }
    l2 = target * layer1_size;
    f = 0;

    // find dot product of focus vector with the sample's context vector,
    // store in f
    for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1neg[c + l2];

    // set g = sigmoid(f) (roughly; the actual formula is slightly more complex)
    if (f > MAX_EXP) g = (label - 1) * alpha;
    else if (f < -MAX_EXP) g = (label - 0) * alpha;
    else g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha;

    // 1. update the vector syn1neg,
    // 2. DO NOT UPDATE syn0
    // 3. STORE THE syn0 gradient in a temporary buffer neu1e
    for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2];
    for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * syn0[c + l1];
  }

  // Finally, after all samples, update syn0 from neu1e
  // https://github.com/tmikolov/word2vec/blob/20c129af10659f7c50e86e3be406df663beff438/word2vec.c#L541
  // Learn weights input -> hidden
  for (c = 0; c < layer1_size; c++) syn0[c + l1] += neu1e[c];
  ```

## Why random and zero initialization?

Once again, since none of this is actually explained in the original papers or anywhere on the web, I can only hypothesize.

My hypothesis is that since the negative samples come from all over the text and are not really weighted by frequency, you can wind up picking *any* word, and more often than not, *a word whose vector has not been trained much at all*. If this vector actually had a value, then it could move the genuinely important focus word randomly.

The solution is to set all negative samples to zero, so that *only vectors that have occurred somewhat frequently* will affect the representation of another vector.

It's quite ingenious, really, and until this, I'd never really thought about how important initialization strategies can be.

## Why I'm writing this

I spent two months of my life trying to reproduce `word2vec`: following the paper exactly, reading countless articles, and simply not succeeding. I was unable to reach the same scores that `word2vec` did, and it was not for lack of trying.

I could not have imagined that the paper would have literally fabricated an algorithm that doesn't work, while the implementation does something completely different.

Eventually, I decided to read the sources, and spent three whole days convinced I was reading the code wrong, since literally everything on the internet told me otherwise.

I don't understand why the original paper and the internet contain zero explanations of the *actual* mechanism behind `word2vec`, so I decided to put it up myself.

This also explains GloVe's radical choice of having a separate vector for the negative context --- they were just doing what `word2vec` does, but they told people about it `:)`.

Is this academic dishonesty? I don't know the answer, and that's a heavy question. But I'm frankly incredibly pissed, and this is probably the last time I take a machine learning paper's explanation of an algorithm seriously --- from next time, I read the source *first*.
