
openNLP: getting started, code example

openNLP is an interesting tool for Natural Language Processing. It took me a while to get started with it today, so I am writing the steps down so that next time I (or anyone else) want to try it out, it will take less time.

Here we go:

1. Download and build:
This is the main website: http://opennlp.sourceforge.net/

You can either download the package there and build the .jar file yourself (which requires the $JAVA_HOME environment variable to be set), or you can directly download the .jar file from this link.

This step took me a while because I didn't know how to set $JAVA_HOME and hadn't noticed that a prebuilt .jar file was already available for download.
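If the build complains about $JAVA_HOME, a quick check like the one below may help. It is only a sketch: the class name JavaHomeCheck and the method describe are made up for this post, and it simply reports whether the variable is set and points to an existing directory. It does not set the variable for you.

```java
import java.io.File;

public class JavaHomeCheck {
    // Describes the state of a JAVA_HOME value (hypothetical helper, not part of openNLP).
    static String describe(String javaHome) {
        if (javaHome == null || javaHome.isEmpty()) {
            return "JAVA_HOME is not set";
        }
        if (!new File(javaHome).isDirectory()) {
            return "JAVA_HOME points to a missing directory: " + javaHome;
        }
        return "JAVA_HOME looks fine: " + javaHome;
    }

    public static void main(String[] args) {
        // Value of the environment variable as seen by this process
        System.out.println(describe(System.getenv("JAVA_HOME")));
        // Home directory of the JVM running this program, for comparison
        System.out.println("Running JVM: " + System.getProperty("java.home"));
    }
}
```

Note that the "Running JVM" directory can be a JRE folder inside the JDK on older installations, so treat it as a hint for where your Java lives rather than the exact value to use.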


2. Some code


So now, you want to start with some code. Here is some sample code for doing Sentence Detection and Tokenization.

Note that you can either download the models from the website above or train your own models on your own training dataset.

In this example, I used two openNLP models (EnglishSD.bin.gz and EnglishTok.bin.gz).

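// Imports for the top of your file. In the 1.4.x releases these classes
// should live in opennlp.tools.lang.english; check your jar if the import fails.
import opennlp.tools.lang.english.SentenceDetector;
import opennlp.tools.lang.english.Tokenizer;

// Paths to your model files (loading a model can throw an IOException)
SentenceDetector sendet = new SentenceDetector("opennlp-tools-1.4.3/Models/EnglishSD.bin.gz");
Tokenizer tok = new Tokenizer("opennlp-tools-1.4.3/Models/EnglishTok.bin.gz");

// Sentence detection
String[] sens = sendet.sentDetect("This is sentence one. This is sentence two.");

for (int i = 0; i < sens.length; i++) {
    System.out.println("Sentence " + i + ": ");
    // Tokenize the current sentence
    String[] tokens = tok.tokenize(sens[i]);
    // Loop over the tokens of this sentence (not over the sentences)
    for (int j = 0; j < tokens.length; j++)
        System.out.print(tokens[j] + " - ");
    System.out.println();
}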



3. Other notes
If you get an exception when running the code above, the most likely cause is that you didn't include the dependency .jar files from the /lib folder (e.g., maxent.jar and trove.jar) on your classpath.
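To find out which .jar is missing, you can try loading one class from each of them. This is a rough sketch: ClasspathCheck and missing are names I made up, and the three class names in main are my assumptions about what each .jar contains for the 1.4.x era, so adjust them to your versions.

```java
import java.util.ArrayList;
import java.util.List;

public class ClasspathCheck {
    // Returns the class names that cannot be loaded from the current classpath.
    static List<String> missing(String... classNames) {
        List<String> notFound = new ArrayList<>();
        for (String name : classNames) {
            try {
                // initialize=false: only locate the class, do not run its static initializers
                Class.forName(name, false, ClasspathCheck.class.getClassLoader());
            } catch (ClassNotFoundException | LinkageError e) {
                notFound.add(name);
            }
        }
        return notFound;
    }

    public static void main(String[] args) {
        // Assumed class names, one per jar; change them if your versions differ.
        List<String> notFound = missing(
            "opennlp.tools.sentdetect.SentenceDetectorME", // opennlp-tools jar
            "opennlp.maxent.GISModel",                     // maxent jar
            "gnu.trove.TIntArrayList");                    // trove jar
        System.out.println(notFound.isEmpty() ? "All classes found" : "Missing: " + notFound);
    }
}
```

Run it with the same classpath you use for your own program. Any name listed as missing tells you which .jar still has to be added.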

Good luck!
