
Word embeddings

In this post, we are going to talk about word embeddings (or word vectors), which are how we represent words in NLP. Word embeddings are used in many higher-level applications such as sentiment analysis, Q&A, etc. Let's have a look at the most widely used models today.

One-hot vector

A one-hot vector is a vector of size V, where V is the vocabulary size. It has the value 1 in one position (the position of the word it represents) and 0 in all other positions.
[0, 0, ..., 1, ..., 0]
This is usually used as the input of a word2vec model, where it simply acts as a lookup table: multiplying a one-hot vector by a weight matrix selects the row that belongs to that word.
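The lookup behaviour can be sketched in a few lines of plain Python. The toy vocabulary, the weight values, and the helper names `one_hot` and `vec_times_matrix` are made up for illustration; they are not part of word2vec itself.

```python
def one_hot(index, vocab_size):
    """Return a list of length vocab_size with a single 1 at `index`."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec


def vec_times_matrix(vec, matrix):
    """Multiply a row vector of length V by a V x N matrix."""
    n_cols = len(matrix[0])
    return [
        sum(vec[i] * matrix[i][j] for i in range(len(vec)))
        for j in range(n_cols)
    ]


# Hypothetical toy vocabulary (V = 4) and weight matrix (4 x 2).
vocab = ["cat", "dog", "fish", "bird"]
weights = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

x = one_hot(vocab.index("fish"), len(vocab))
looked_up = vec_times_matrix(x, weights)  # same values as weights[2]
```

Because only one entry of `x` is non-zero, the product just copies one row of `weights`, which is why real implementations index the matrix directly instead of multiplying.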
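One-hot encoding therefore treats words as independent units: any two different one-hot vectors are orthogonal, so they carry no information about how related two words are. In practice, we want to capture the "similarity" between words for many higher-level tasks such as document classification, Q&A, etc.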
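A small sketch makes the limitation concrete. The words and the two-dimensional dense vectors below are invented for illustration; only the cosine similarity formula is standard.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# One-hot vectors for two different words in a hypothetical 4-word vocabulary.
hotel_one_hot = [0, 1, 0, 0]
motel_one_hot = [0, 0, 1, 0]
sim_one_hot = cosine_similarity(hotel_one_hot, motel_one_hot)

# Hypothetical dense vectors, chosen by hand so the two words look related.
hotel_dense = [0.9, 0.1]
motel_dense = [0.8, 0.2]
sim_dense = cosine_similarity(hotel_dense, motel_dense)
```

Here `sim_one_hot` is exactly zero for any pair of distinct words, while dense vectors can place related words close together.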
The idea is: to capture the meaning of a word, we look at the words that frequently appear close to it. Let's have a look at some well-known architectures that produce word vectors in this way.

The Skip-gram model


http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

The intuition behind this model is to predict the "context" words given the "center" word.
Given a center word, we want to estimate the probability of another word appearing in its context. To do that, we can use the architecture illustrated in the picture.
  • Input: a one-hot vector whose size equals the size of the whole vocabulary
  • Training examples: word pairs (words that appear together within a window of a certain size)
  • Output: a vector representation of each word. More specifically, in the testing phase, given an input word, the output is a probability vector (how likely each word in the vocabulary is to be a context word of the input word).
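The training pairs described above can be generated with a sliding window. This is a minimal sketch; the function name `skipgram_pairs`, the whitespace tokenization, and the example sentence are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the quick brown fox".split()
pairs = skipgram_pairs(sentence, window=1)
# pairs == [("the", "quick"), ("quick", "the"), ("quick", "brown"),
#           ("brown", "quick"), ("brown", "fox"), ("fox", "brown")]
```

A larger window yields more pairs per center word and tends to capture broader, more topical context.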
Now let's have a look at the architecture:
  • The hidden layer: Let's say we want to learn word vectors with 500 features (i.e., 500 dimensions) and our vocabulary size is 10K. Then the hidden layer is represented by a weight matrix of size (10K, 500), with one row per word. That row is the word vector.
  • The output layer (softmax classifier): The output layer is a softmax regression classifier. For a given input word, it outputs a vector of size 10K with one value between 0 and 1 per vocabulary word, and these values sum to 1. Each value is the probability that the corresponding word appears in the context of the input word.
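The two layers can be sketched as a forward pass in plain Python. This is an illustrative sketch with tiny sizes (V = 6, N = 3 instead of 10K and 500) and random, untrained weights. The names `w_in`, `w_out`, and `skipgram_forward` are assumptions, and the output weights are stored with one row per output word, which is the transpose of the usual (N, V) layout.

```python
import math
import random


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def skipgram_forward(center_index, w_in, w_out):
    """Return P(context word | center word) for every word in the vocabulary."""
    hidden = w_in[center_index]  # hidden layer = row lookup, length N
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in w_out]
    return softmax(scores)


random.seed(0)
V, N = 6, 3  # toy vocabulary size and embedding size
w_in = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
w_out = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]

probs = skipgram_forward(2, w_in, w_out)  # one probability per vocabulary word
```

After training, the rows of `w_in` are kept as the word vectors, and the output layer is usually discarded.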
More details about this model can be found here.

The CBOW (Continuous Bag of Words) model

Its architecture is very similar to the skip-gram model. The difference is that skip-gram predicts the context words from a center word, while CBOW predicts a "center" word given a bag of "context" words.
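To see how the direction of prediction flips, here is a minimal sketch of CBOW training examples. The names `cbow_examples` and `average_vectors` and the two-dimensional embeddings are invented for illustration. Averaging the context vectors is one common way to combine them, used here as an assumption rather than the only option.

```python
def cbow_examples(tokens, window=2):
    """Return (context_words, center_word) examples from a token list."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:  # skip words that have no neighbours at all
            examples.append((context, center))
    return examples


def average_vectors(vectors):
    """Element-wise mean of equally sized vectors (order of words is ignored)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


sentence = "the quick brown fox".split()
examples = cbow_examples(sentence, window=1)

# Hypothetical 2-dimensional embeddings for the context of "quick".
embeddings = {"the": [1.0, 0.0], "brown": [0.0, 1.0]}
hidden = average_vectors([embeddings[w] for w in ["the", "brown"]])
```

The averaged vector `hidden` plays the role that the single looked-up row plays in skip-gram, and the softmax output then scores candidates for the center word.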

GloVe model

This model combines traditional count-based models with direct prediction models such as the ones we have seen above (Skip-gram and CBOW). It is one of the most popular models in use today.
The model is fast to train, scales to huge corpora, and performs well even with a small corpus and small vectors.
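The count-based half can be sketched as a co-occurrence counter, and the prediction half as a weighted squared error between a dot product and the log count. This is a simplified sketch, not the reference implementation: the function names are made up, plain counts are used without any distance weighting, and the defaults `x_max=100` and `alpha=0.75` are the values commonly cited from the GloVe paper, so treat them as assumptions.

```python
import math
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count how often each (word, neighbour) pair occurs within the window."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Down-weight rare pairs and cap the weight of very frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error between w_i . w_j + biases and log(x_ij)."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    return glove_weight(x_ij) * (dot + b_i + b_j - math.log(x_ij)) ** 2


tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens, window=1)
```

Training sums `glove_pair_loss` over all pairs with a non-zero count, so the cost depends on the number of distinct co-occurring pairs rather than on the raw corpus length.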
