Skip-gram model and negative sampling

In the previous post, we looked at three word-embedding models: skip-gram, CBOW and GloVe. Now let's have a look at negative sampling and how it is used to make training skip-gram faster.
The idea originates from this paper:
"Distributed Representations of Words and Phrases and their Compositionality" (Mikolov et al., 2013)
In the previous example, we saw that with a vocabulary of size 10K and word vectors of size 300, the number of parameters we have to estimate in each layer is 10K x 300 = 3 million. This number is big: it makes training slow and prone to over-fitting, and it puts too much focus on words that appear often and too little on rare words.
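To make that size concrete, here is a minimal sketch of the two weight matrices skip-gram has to learn (the names W_in and W_out, and the random initialization, are just illustrative assumptions for a 10K vocabulary and 300-dimensional vectors):

import numpy as np

vocab_size = 10_000    # number of words in the vocabulary
embedding_dim = 300    # size of each word vector

# Skip-gram learns two weight matrices: one holding the center-word
# (input) vectors and one holding the outside/context-word (output) vectors.
W_in = np.random.randn(vocab_size, embedding_dim) * 0.01
W_out = np.random.randn(vocab_size, embedding_dim) * 0.01

print(W_in.size)               # 3,000,000 weights in one matrix
print(W_in.size + W_out.size)  # 6,000,000 weights in total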

Subsampling of frequent words

So the idea is that we try to maximize the probability that a "real outside word" appears around the center word, and minimize the probability that "random words" do. Real outside words are words that characterize the meaning of the center word, while "random words" tend to occur very often and go together with many different words (e.g., "the", "an", "a").
So each word gets a probability of being "kept": the more frequent a word is, the lower its probability of being kept, and vice versa.
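As a rough sketch, here is one way to compute that keep probability. It assumes the discard rule from the paper, P(discard w) = 1 - sqrt(t / f(w)), where f(w) is the word's fraction of all tokens and t is a threshold around 1e-5 (the released word2vec code uses a slightly different variant):

import math
import random

def keep_probability(word_freq, t=1e-5):
    # word_freq is the word's fraction of all tokens in the corpus.
    # Frequent words get a low keep probability, rare words keep 1.0.
    return min(1.0, math.sqrt(t / word_freq)) if word_freq > 0 else 1.0

def subsample(tokens, freqs, t=1e-5):
    # Randomly drop frequent words before building (center, outside) pairs.
    return [w for w in tokens if random.random() < keep_probability(freqs[w], t)]

print(keep_probability(0.05))      # e.g. "the": ~0.014, almost always dropped
print(keep_probability(0.00001))   # a rare word: 1.0, always kept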

Negative Sampling

In our model, we have a very large number of weights. Every time we process a new training sample, we have to go back and update all of those weights, which makes training very slow.
We use negative sampling to address this problem: instead of modifying all of the weights, we modify only a small percentage of them.
In particular, we draw a few random "negative" words (i.e., words that are not in the context) and update the weights only for those negative words and for our "positive" word (i.e., the word that actually appears in the context), instead of for the whole vocabulary.
So how many random negative words should we draw? In the paper, the authors suggest 2-5 words for large datasets and 5-20 words for small datasets.
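Here is a minimal NumPy sketch of one such update for a single (center, outside) pair. The function name negative_sampling_step, the toy counts, and the learning rate are illustrative assumptions, not the paper's exact implementation; the sketch also shows the paper's trick of drawing negatives from the unigram distribution raised to the 3/4 power:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_step(W_in, W_out, center, outside, unigram_probs,
                           k=5, lr=0.025):
    # One SGD step for a single (center, outside) training pair.
    # Only k + 2 vector rows are touched instead of the whole vocabulary.
    negatives = rng.choice(len(unigram_probs), size=k, p=unigram_probs)

    v_c = W_in[center]                        # center word vector
    ids = np.concatenate(([outside], negatives))
    u = W_out[ids]                            # outside + negative vectors
    labels = np.zeros(k + 1)
    labels[0] = 1.0                           # only the real pair is "positive"

    scores = sigmoid(u @ v_c)                 # predicted "real pair" probabilities
    grad = scores - labels                    # gradient of the logistic loss

    W_in[center] -= lr * (grad @ u)           # update the one center row
    W_out[ids] -= lr * np.outer(grad, v_c)    # update the k + 1 output rows

# Negatives are drawn from the unigram distribution raised to the 3/4 power.
counts = np.array([500, 100, 50, 10, 5], dtype=float)   # toy word counts
unigram_probs = counts ** 0.75 / (counts ** 0.75).sum()

W_in = rng.normal(scale=0.01, size=(5, 8))
W_out = np.zeros((5, 8))
negative_sampling_step(W_in, W_out, center=0, outside=1, unigram_probs=unigram_probs)

Each step only touches the center word's row plus k + 1 rows of the output matrix, which is exactly where the speed-up over updating all 10K x 300 weights comes from.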

