In the previous post, we saw three word embedding models: skip-gram and CBOW (the two word2vec models) and GloVe. Now let's have a look at negative sampling and how it is used to make training skip-gram faster.
The idea originates from this paper:
"Distributed Representations of Words and Phrases and their Compositionality" (Mikolov et al., 2013)
In the previous example, we saw that with a vocabulary of 10K words and word vectors of size 300, the number of parameters we have to estimate in each layer is 10K x 300 = 3 million. This number is big: it makes training prone to over-fitting, and training puts too much focus on words that appear often and too little on rare words.
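To make the size concrete, here is a small sketch of the arithmetic. It assumes the usual skip-gram setup with two weight matrices (an input layer and an output layer); the variable names are only for illustration.

```python
vocab_size = 10_000  # vocabulary size from the example above
embed_dim = 300      # word vector size from the example above

# One weight matrix of shape (vocab_size, embed_dim) per layer.
weights_per_layer = vocab_size * embed_dim

# Assumption: two such layers (input and output), as in standard skip-gram.
total_weights = 2 * weights_per_layer

print(weights_per_layer)  # 3000000
print(total_weights)      # 6000000
```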
Subsampling of frequent words
The idea of subsampling is to keep the "real outside words" around a center word and to cut down on the "random words". Real outside words are words that characterize the meaning of the center word, while "random words" occur very often and go together with many other different words (e.g., "the", "an", "a"), so they tell us little about any particular center word. Each occurrence of a word in the training text therefore has a probability of being "kept": the more frequent the word, the lower its probability of being kept, and vice versa.
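As a rough sketch of the "keep" probability: the paper discards each occurrence of a word w with probability 1 - sqrt(t / f(w)), where f(w) is the word's relative frequency and t is a threshold (the paper suggests around 1e-5). The helper names below are hypothetical, and the released word2vec code uses a slightly different formula, so treat this as an illustration only.

```python
import math
import random
from collections import Counter


def keep_probability(word_freq, t=1e-5):
    """Probability of keeping one occurrence of a word.

    word_freq is the word's relative frequency in the corpus.
    Follows the paper's formula: keep = sqrt(t / f), capped at 1.
    """
    return min(1.0, math.sqrt(t / word_freq))


def subsample(tokens, t=1e-5, rng=None):
    """Return the tokens that survive subsampling (hypothetical helper)."""
    if rng is None:
        rng = random.Random(0)
    counts = Counter(tokens)
    total = len(tokens)
    return [w for w in tokens
            if rng.random() < keep_probability(counts[w] / total, t)]


# A word making up 5% of the corpus (e.g. "the") is kept only ~1.4% of the time,
# while a word with frequency 1e-6 is always kept.
p_frequent = keep_probability(0.05)
p_rare = keep_probability(1e-6)
```

Note that t = 1e-5 is meant for very large corpora. On a toy corpus almost every word has a relative frequency far above that, so nearly everything would be dropped unless t is raised.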
Negative Sampling
In our model, we have a very large number of weights. Every time we process a new training sample, we have to go back and update all of the weights in the model, which makes training very slow. We use negative sampling to address this problem: instead of modifying all of the weights, we modify only a small percentage of them.
In particular, for each training sample we draw a few random negative words (i.e., words that are not in the context) and update the weights only for these negative words and for our positive word (i.e., the word that is in the context).
So how many "random negative words" should we draw? The paper suggests 2-5 words for large datasets and 5-20 words for small datasets.
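Here is a minimal NumPy sketch of one negative-sampling update, to show how few weights are touched. It is a simplification under several assumptions: the function name and word indices are hypothetical, the negatives are drawn uniformly (the paper draws them from the unigram distribution raised to the 3/4 power), and the output matrix starts at zero.

```python
import numpy as np


def negative_sampling_step(W_in, W_out, center, context, negatives, lr=0.025):
    """One SGD step for a single (center, context) pair.

    Updates only W_in[center] and the W_out rows of the positive word and
    the sampled negatives. Returns the indices of those output rows.
    Assumes the indices in `negatives` are distinct and exclude `context`.
    """
    v = W_in[center].copy()                    # center word vector
    targets = np.array([context, *negatives])  # 1 positive + k negatives
    labels = np.zeros(len(targets))
    labels[0] = 1.0                            # positive word has label 1
    u = W_out[targets]                         # copy of the (k+1) output rows
    scores = 1.0 / (1.0 + np.exp(-(u @ v)))    # sigmoid of the dot products
    errors = scores - labels                   # gradient w.r.t. each dot product
    W_in[center] -= lr * (errors @ u)          # update the center vector
    W_out[targets] -= lr * np.outer(errors, v) # update only k+1 output rows
    return targets


rng = np.random.default_rng(0)
vocab_size, dim, k = 10_000, 300, 5
W_in = rng.normal(scale=0.01, size=(vocab_size, dim))
W_out = np.zeros((vocab_size, dim))

center, context = 42, 7  # hypothetical word indices
candidates = np.setdiff1d(np.arange(vocab_size), [context])
negatives = rng.choice(candidates, size=k, replace=False)

before = W_out.copy()
touched = negative_sampling_step(W_in, W_out, center, context, negatives)
changed_rows = np.flatnonzero(np.any(W_out != before, axis=1))
print(len(changed_rows) * dim, "of", W_out.size, "output weights changed")
```

With 5 negatives plus the positive word, only 6 rows of the output layer change: 6 x 300 = 1,800 of the 3 million output weights, or 0.06%. In the input layer, only the row of the center word is updated.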