Skip-gram model and negative sampling

In the previous post, we looked at three word-embedding models: skip-gram, CBOW and GloVe (skip-gram and CBOW are the two word2vec architectures; GloVe is a separate model). Now let's have a look at negative sampling and how it is used to make training skip-gram faster.
The idea originates from this paper:
"Distributed Representations of Words and Phrases and their Compositionality" (Mikolov et al. 2013)
In the previous example, we saw that with a vocabulary of 10K words and word vectors of size 300, the number of parameters to estimate in each layer is 10K x 300 = 3 million. This number is large: it makes training prone to over-fitting, and the model gives too much focus to words that appear often and too little to rare words.
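To make the arithmetic concrete, here is a small sketch in Python. It assumes the usual skip-gram layout with two weight matrices of the same shape (one for center words, one for outside words); the function name is made up for illustration.

```python
def skipgram_param_counts(vocab_size, embedding_dim):
    """Return (parameters per layer, parameters in both layers)."""
    per_layer = vocab_size * embedding_dim
    # Assumption: two layers of identical shape (input and output embeddings).
    return per_layer, 2 * per_layer


per_layer, total = skipgram_param_counts(vocab_size=10_000, embedding_dim=300)
print(per_layer)  # 3000000
print(total)      # 6000000
```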

Subsampling of frequent words

The idea of subsampling is this: we want training to focus on "real outside words" rather than on "random words" that merely happen to appear around the center word. Real outside words are words that characterize the meaning of the center word, while "random words" occur very often and go together with many different words (e.g., "the", "an", "a"), so they tell us little about any particular center word.
To reduce their influence, each occurrence of a word in the training text is randomly kept or discarded. Each word has a probability of being "kept": more frequent words have a lower probability of being kept, and rare words a higher one.
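The text above does not give a formula for the "keep" probability, so the sketch below borrows the one from the Mikolov et al. paper: a word with relative frequency f(w) is discarded with probability 1 - sqrt(t / f(w)), where t is a small threshold (the paper mentions values around 1e-5). The function names and the toy corpus are only illustrative.

```python
import math
import random
from collections import Counter


def keep_probability(word_freq, threshold=1e-5):
    """Probability of keeping one occurrence of a word.

    word_freq is the word's relative frequency in the corpus (0 < f <= 1).
    Words at or below the threshold are always kept.
    """
    return min(1.0, math.sqrt(threshold / word_freq))


def subsample(tokens, threshold=1e-5, seed=0):
    """Randomly drop occurrences of frequent words from a list of tokens."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    return [
        word
        for word in tokens
        if rng.random() < keep_probability(counts[word] / total, threshold)
    ]


# Toy usage: "the" is far more frequent than "cat", so it is dropped more often.
# The threshold is much larger than 1e-5 because the toy corpus is tiny.
tokens = ["the"] * 90 + ["cat"] * 10
kept = subsample(tokens, threshold=0.01)
print(Counter(kept))
```

With a realistic corpus, a word like "the" would have a keep probability well below 1, while most rare words would always be kept.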

Negative Sampling

Our model has a very large number of weights. Every time we process a new training sample, we have to go back and update all of the weights in the model, which makes training very slow.
We use negative sampling to address this problem: instead of modifying all of the weights, we modify only a small percentage of them.
In particular, for each training pair we draw a few random negative words (i.e., words that are not in the context) and update the weights only for the positive word (i.e., the word that is in the context) and for these negative words. The weights of all the other words in the vocabulary are left untouched.
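Here is a minimal sketch of one such update in plain Python. It is an illustration under a few assumptions, not the reference implementation: the two weight matrices are plain lists of lists, the loss is the negative-sampling objective from the paper (a sigmoid pushed toward 1 for the positive word and toward 0 for each negative word), and negatives are drawn in proportion to the word count raised to the 3/4 power, which is the noise distribution the paper reports working well. All names (`draw_negatives`, `negative_sampling_step`, `w_in`, `w_out`) are made up for this example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def draw_negatives(counts, exclude, k, rng):
    """Draw k word ids with probability proportional to count ** 0.75.

    Ids in `exclude` (e.g. the center and context words) are never drawn.
    The same id may be drawn more than once.
    """
    candidates = [i for i in range(len(counts)) if i not in exclude]
    weights = [counts[i] ** 0.75 for i in candidates]
    return rng.choices(candidates, weights=weights, k=k)


def negative_sampling_step(center, context, negatives, w_in, w_out, lr=0.05):
    """One SGD step; returns the loss measured before the update.

    Only w_in[center] and the rows of w_out for the context word and the
    sampled negatives are modified. Every other row is left untouched.
    """
    v = w_in[center]
    grad_v = [0.0] * len(v)
    loss = 0.0
    targets = [(context, 1.0)] + [(n, 0.0) for n in negatives]
    for word, label in targets:
        u = w_out[word]
        p = sigmoid(sum(a * b for a, b in zip(u, v)))
        loss -= math.log(p if label == 1.0 else 1.0 - p)
        g = p - label  # gradient of the loss with respect to the score
        for j in range(len(v)):
            grad_v[j] += g * u[j]   # accumulate using the old u[j]
            u[j] -= lr * g * v[j]   # update the output vector in place
    for j in range(len(v)):
        v[j] -= lr * grad_v[j]      # update the center word vector last
    return loss


# Toy usage: 10 words, 4-dimensional vectors, hypothetical word counts.
rng = random.Random(0)
vocab_size, dim = 10, 4
w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
counts = [50, 30, 20, 10, 5, 5, 3, 2, 1, 1]

negatives = draw_negatives(counts, exclude={0, 1}, k=3, rng=rng)
loss = negative_sampling_step(center=0, context=1, negatives=negatives,
                              w_in=w_in, w_out=w_out)
print(negatives, loss)
```

The point to notice is that the cost of a step depends on the number of negatives drawn, not on the vocabulary size.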
So how many "random negative words" should we draw? The paper suggests 2-5 words for large datasets and 5-20 words for small datasets.
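To get a feeling for what "a small percentage" means with these numbers, the sketch below counts how many output-layer weights are touched per training pair, reusing the 10K vocabulary and 300-dimensional vectors from the earlier example. It assumes one output vector is updated for the positive word plus one per negative word, all distinct; the function name is hypothetical.

```python
def updated_output_weights(num_negatives, vocab_size=10_000, embedding_dim=300):
    """Return (weights updated, total weights) for the output layer."""
    # One positive word plus num_negatives negative words, one vector each.
    updated = (num_negatives + 1) * embedding_dim
    total = vocab_size * embedding_dim
    return updated, total


for k in (2, 5, 20):
    updated, total = updated_output_weights(k)
    print(f"{k:>2} negatives: {updated} of {total} weights ({updated / total:.2%})")
# Output:
#  2 negatives: 900 of 3000000 weights (0.03%)
#  5 negatives: 1800 of 3000000 weights (0.06%)
# 20 negatives: 6300 of 3000000 weights (0.21%)
```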

