
openNLP: getting started, code example

openNLP is an interesting tool for Natural Language Processing. It took me a while to get started today, so I am writing this down so that next time I (or anyone else) want to try it out, it takes less time.

Here we go:

1. Download and build:
This is the main website: http://opennlp.sourceforge.net/

You can either download the source package there and build the .jar file yourself (which requires setting your $JAVA_HOME environment variable to point to your JDK installation - see the note below), or download the pre-built .jar file directly from this link.

This step took me a while because I didn't know how to set $JAVA_HOME, and I didn't realize that there was already a pre-built .jar file to download.
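If you are not sure whether $JAVA_HOME is set, here is a quick way to check from Java itself (a small sketch of my own, not part of the openNLP distribution):

public class CheckJavaHome {
    public static void main(String[] args) {
        // Prints the JAVA_HOME environment variable (null if it is not set)
        System.out.println("JAVA_HOME = " + System.getenv("JAVA_HOME"));
        // Prints the JRE that this program is actually running on
        System.out.println("java.home = " + System.getProperty("java.home"));
    }
}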


2. Some code


So now you want to start with some code. Here is a small sample that does Sentence Detection and Tokenization.

Note that you can either download the pre-trained models from the website above or train your own models if you have a training dataset.

In this example, I used two of the pre-trained openNLP models: EnglishSD.bin.gz (sentence detection) and EnglishTok.bin.gz (tokenization).

// Imports needed (openNLP 1.4.x):
//   opennlp.tools.lang.english.SentenceDetector
//   opennlp.tools.lang.english.Tokenizer

// This is the path to your model files
SentenceDetector sendet = new SentenceDetector("opennlp-tools-1.4.3/Models/EnglishSD.bin.gz");
Tokenizer tok = new Tokenizer("opennlp-tools-1.4.3/Models/EnglishTok.bin.gz");

// Sentence detection
String[] sens = sendet.sentDetect("This is sentence one. This is sentence two.");

for (int i = 0; i < sens.length; i++) {
    System.out.println("Sentence " + i + ": ");
    // Tokenize each detected sentence
    String[] tokens = tok.tokenize(sens[i]);
    for (int j = 0; j < tokens.length; j++)   // loop over the tokens, not the sentences
        System.out.print(tokens[j] + " - ");
    System.out.println();
}
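
If you end up with a newer Apache OpenNLP release instead of 1.4.3, keep in mind that the API changed: models are loaded through SentenceModel/TokenizerModel and used via SentenceDetectorME/TokenizerME. Here is a rough sketch assuming the 1.5+ API and the newer pre-trained English model files (en-sent.bin and en-token.bin) - adjust the paths and file names to whatever you actually downloaded:

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class NewApiSketch {
    public static void main(String[] args) throws Exception {
        // en-sent.bin and en-token.bin are the pre-trained English models
        // from the newer OpenNLP models download page
        try (InputStream sentIn = new FileInputStream("en-sent.bin");
             InputStream tokIn = new FileInputStream("en-token.bin")) {
            SentenceDetectorME sentenceDetector = new SentenceDetectorME(new SentenceModel(sentIn));
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokIn));

            String[] sentences = sentenceDetector.sentDetect("This is sentence one. This is sentence two.");
            for (String sentence : sentences) {
                // Tokenize each detected sentence and print the tokens
                String[] tokens = tokenizer.tokenize(sentence);
                System.out.println(String.join(" - ", tokens));
            }
        }
    }
}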



3. Other notes
If you get an exception (e.g., a NoClassDefFoundError) when running the code above, it is most likely because you didn't put the .jar files from the /lib folder of the distribution (e.g., maxent.jar and trove.jar) on your classpath.

Good luck!
