Unix pipelines for basic spelling error detection

Topics: Unix,spelling,tr,sort,uniq,find,awk

Introduction

We can of course write programs to do most anything we want, but often the Unix command line has everything we need to perform a series of useful operations without writing a line of code. In my Applied NLP class today, I show how one can get a high-confidence dictionary out of a body of raw text with a series of Unix pipes, and I’m posting that here so students can refer back to it later and see some pointers to other useful Unix resources.

Note: for help with any of the commands, just type “man <command>” at the Unix prompt.

Checking for spelling errors

We are working on automated spelling correction as an in-class exercise, with a particular emphasis on the following sentence:

This Facebook app shows that she is there favorite acress in tonw

So, this has a contextual spelling error (there), an error that could be a valid English word but isn’t (acress) and an error that violates English sound patterns (tonw).

One of the key ingredients for spelling correction is a dictionary of words known to be valid in the language. Let’s assume we are working with English here. On most Unix systems, you can pick up an English dictionary in /usr/share/dict/words, though the words you find may vary from one platform to another. If you can’t find anything there, there are many word lists available online, e.g. check out the Wordlist project for downloads and links.
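
If you want a quick peek at what your system dictionary contains before going further, the commands below show the first few entries and the total count; both the contents and the size vary quite a bit from platform to platform.

[sourcecode language="bash"]
$ head -5 /usr/share/dict/words
$ wc -l /usr/share/dict/words
[/sourcecode]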

We can easily use the dictionary and Unix to check for words in the above sentence that don’t occur in the dictionary. First, save the sentence to a file.

[sourcecode language=”bash”]
$ echo "This Facebook app shows that she is there favorite acress in tonw" > sentence.txt
[/sourcecode]

Next, we need to get the unique word types (rather than tokens) in sorted lexicographic order. The following Unix pipeline accomplishes this.

[sourcecode language=”bash”]
$ cat sentence.txt | tr ' ' '\n' | sort | uniq > words.txt
[/sourcecode]

To break it down:

  • The cat command spills the file to standard output.
  • The tr command “translates” all spaces to new lines. So, this gives us one word per line.
  • The sort command sorts the lines lexicographically.
  • The uniq command removes adjacent duplicate lines, so each word type appears only once. (This doesn’t do anything for this particular sentence, but I’m putting it in there in case you try other sentences that have multiple tokens of the type “the”, for example.)

You can see these effects by doing each in turn, building up the pipeline incrementally.

[sourcecode language=”bash”]
$ cat sentence.txt
This Facebook app shows that she is there favorite acress in tonw
$ cat sentence.txt | tr ' ' '\n'
This
Facebook
app
shows
that
she
is
there
favorite
acress
in
tonw
$ cat sentence.txt | tr ' ' '\n' | sort
Facebook
This
acress
app
favorite
in
is
she
shows
that
there
tonw
[/sourcecode]

Note: the use of cat above is a UUOC (unnecessary use of cat) that is dispreferred to sending the input directly into tr at the start. I do it this way in the tutorial so that everything flows left-to-right. However, if you want to avoid cat abuse, here’s how you’d do it.

[sourcecode language=”bash”]

$ tr ' ' '\n' < sentence.txt | sort | uniq
[/sourcecode]

We can now use the comm command to compare the file words.txt and the dictionary. It produces three columns of output: the first gives the lines only in the first file, the second are lines only in the second file, and the third are those in common. So, the first column has what we need, because those are words in our sentence that are not found in the dictionary. Here’s the command to get that.

[sourcecode language=”bash”]
$ comm -23 words.txt /usr/share/dict/words
Facebook
This
acress
app
shows
tonw
[/sourcecode]

The -23 options indicate we should suppress columns 2 and 3 and show only column 1. If we just use -2, we get the words in the sentence with the non-dictionary words on the left and the dictionary words on the right (try it).
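
Here is that -2 variant, along with -12, which suppresses columns 1 and 2 and so keeps only the sentence words that were found in the dictionary.

[sourcecode language="bash"]
$ comm -2 words.txt /usr/share/dict/words   # column 1 at the left margin, column 3 indented
$ comm -12 words.txt /usr/share/dict/words  # only the words present in both files
[/sourcecode]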

The problem of course is that any word list will have gaps. This dictionary doesn’t have more recent terms like Facebook and app. It also doesn’t have upper-case This. You can ignore case with comm using the -i option and this goes away. It doesn’t have shows, which is not in the dictionary since it is an inflected form of the verb stem show. We could fix this with some morphological analysis, but instead of that, let’s go the lazy route and just grab a larger list of words.
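
(One caveat on -i: it isn’t available in every comm implementation; it’s in the BSD comm that ships with macOS but not in GNU coreutils. If your comm lacks it, a rough but portable alternative is to lower-case both inputs first. The words_lc.txt and dict_lc.txt file names below are arbitrary.)

[sourcecode language="bash"]
$ tr 'A-Z' 'a-z' < words.txt | sort | uniq > words_lc.txt
$ tr 'A-Z' 'a-z' < /usr/share/dict/words | sort | uniq > dict_lc.txt
$ comm -23 words_lc.txt dict_lc.txt
[/sourcecode]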

Extracting a high-confidence dictionary from a corpus

Raw text often contains spelling errors, but errors don’t tend to happen with very high frequency, so we can often get pretty good expanded word lists by computing frequencies of word types on lots of text and then applying reasonable cutoffs. (There are much more refined methods, but this will suffice for current purposes.)

First, let’s get some data. The Open American National Corpus has just released v3.0.0 of its Manually Annotated Sub-Corpus (MASC), which you can get from this link.

– http://www.anc.org/masc/MASC-3.0.0.tgz

Do the following to get it and set things up for further processing:

[sourcecode language=”bash”]
$ mkdir masc
$ cd masc
$ wget http://www.anc.org/masc/MASC-3.0.0.tgz
$ tar xzf MASC-3.0.0.tgz
[/sourcecode]

(If you don’t have wget, you can just download the MASC file in your browser and then move it over.)

Next, we want all the text from the data/written directory. The find command is very handy for this.

[sourcecode language=”bash”]
$ find data/written -name "*.txt" -exec cat {} \; > all-written.txt
[/sourcecode]

To see how much is there, use the wc command.

[sourcecode language=”bash”]
$ wc all-written.txt
43061 400169 2557685 all-written.txt
[/sourcecode]

So, there are 43k lines, and 400k tokens. That’s a bit small for what we are trying to do, but it will suffice for the example.
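
(The three numbers wc reports are lines, words, and bytes; if you only care about one of them, you can ask for it directly.)

[sourcecode language="bash"]
$ wc -l all-written.txt   # line count only
$ wc -w all-written.txt   # word (token) count only
[/sourcecode]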

Again, I’ll build up a Unix pipeline to extract the high-confidence word types from this corpus. I’ll use the head command to show just part of the output at each stage.

Here are the raw contents.

[sourcecode language=”bash”]
$ cat all-written.txt | head

I can’t believe I wrote all that last year.
Acephalous

Friday, 07 May 2010

[/sourcecode]

Now, get one word per line.

[sourcecode language=”bash”]
$ cat all-written.txt | tr -cs 'A-Za-z' '\n' | head

I
can
t
believe
I
wrote
all
that
last
[/sourcecode]

The tr command is used very crudely here: basically, anything that is not an ASCII letter character is turned into a new line. The -cs options indicate to take the complement (opposite) of the 'A-Za-z' argument and to squeeze duplicates (e.g. “A42,” becomes “A” followed by a single newline rather than three).
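
You can see the complement-and-squeeze behavior in isolation on a tiny input: the digits, the comma, and the trailing newline are all non-letters, and each run of them collapses to a single newline.

[sourcecode language="bash"]
$ echo "A42,b" | tr -cs 'A-Za-z' '\n'
A
b
[/sourcecode]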

Next, we sort and uniq, as above, except that we use the -c option to uniq so that it produces counts.

[sourcecode language=”bash”]
$ cat all-written.txt | tr -cs 'A-Za-z' '\n' | sort | uniq -c | head
1
737 A
22 AA
1 AAA
1 AAF
1 AAPs
21 AB
3 ABC
1 ABDULWAHAB
1 ABLE
[/sourcecode]

Because the MASC corpus includes tweets and blogs and other unedited text, we don’t trust words that have low counts, e.g. four or fewer tokens of that type. We can use awk to filter those out.

[sourcecode language=”bash”]
$ cat all-written.txt | tr -cs 'A-Za-z' '\n' | sort | uniq -c | awk '{ if($1>4) print $2 }' | head
A
AA
AB
ACR
ADDRESS
ADP
ADPNP
AER
AIG
ALAN
[/sourcecode]

Awk makes it easy to process lines of files, and gives you indexes into the first column ($1), second ($2), and so on. There’s much more you can do, but this shows how you can conditionally output some information from each line using awk.
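
For example, if you keep the counts in awk’s output and then sort numerically, you get the vocabulary ordered by frequency rather than alphabetically (the threshold of 4 is the same arbitrary choice as above).

[sourcecode language="bash"]
$ cat all-written.txt | tr -cs 'A-Za-z' '\n' | sort | uniq -c | awk '{ if($1>4) print $1, $2 }' | sort -rn | head
[/sourcecode]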

You can of course change the threshold. You can also turn all words to lower-case by inserting another tr call into the pipe, e.g.:

[sourcecode language=”bash”]
$ cat all-written.txt | tr 'A-Z' 'a-z' | tr -cs 'a-z' '\n' | sort | uniq -c | awk '{ if($1>8) print $2 }' | head
a
aa
ab
abandoned
abbey
ability
able
abnormal
abnormalities
aboard
[/sourcecode]

It all comes down to what you need out of the text.

Combining and using the dictionaries

Let’s do the check on the sentence above, but using both the standard dictionary and the one derived from MASC. Run the following command first.

[sourcecode language=”bash”]
$ cat all-written.txt | tr -cs 'A-Za-z' '\n' | sort | uniq -c | awk '{ if($1>4) print $2 }' > /tmp/masc_vocab.txt
[/sourcecode]

Then in the directory where you saved words.txt, do the following.

[sourcecode language=”bash”]
$ cat /usr/share/dict/words /tmp/masc_vocab.txt | sort | uniq > big_vocab.txt
$ comm -23 words.txt big_vocab.txt
acress
tonw
[/sourcecode]

Ta-da! The MASC corpus provided us with enough examples of other words that This, Facebook, app, and shows are no longer detected as errors. Of course, detecting there as an error is much more difficult and requires language models and more.

Conclusion

Learn to use the Unix command line! This post is just a start on the many cool things you can do with Unix pipes.

Happy (Unix) hacking!

A walk-through for the Twitter streaming API

Topics: Twitter, streaming API

Introduction

Analyzing tweets is all the rage, and if you are new to the game you want to know how to get them programmatically. There are many ways to do this, but a great start is to use the Twitter streaming API, a RESTful service that allows you to pull tweets in real time based on criteria you specify. For most people, this will mean having access to the spritzer, which provides only a very small percentage of all the tweets going through Twitter at any given moment. For access to more, you need to have a special relationship with Twitter or pay Twitter or an affiliate like Gnip.

This post provides a basic walk-through for using the Twitter streaming API. You can get all of this based on the documentation provided by Twitter, but this will be slightly easier going for those new to such services. (This post is mainly geared for the first phase of the course project for students in my Applied Natural Language Processing class this semester.)

You need to have a Twitter account to do this walk-through, so obtain one now if you don’t have one already.

Accessing a random sample of tweets

First, try pulling a random sample of tweets using your browser by going to the following link.

  • https://stream.twitter.com/1/statuses/sample.json

You should see a growing, unwieldy list of raw tweets flowing by. It should look something like the following image.

[Image: a browser window filling up with raw streaming tweets]

Here’s an example of a “raw” tweet (which comes in JSON, or JavaScript Object Notation):

[sourcecode language=”json”]
{"text":"#LetsGoMavs til the end RT @dallasmavs: Are You ALL IN?","truncated":false,"retweeted":false,"geo":null,"retweet_count":0,"source":"web","in_reply_to_status_id_str":null,"created_at":"Wed Apr 25 15:47:39 +0000 2012","in_reply_to_user_id_str":null,"id_str":"195177260792299521","coordinates":null,"in_reply_to_user_id":null,"favorited":false,"entities":{"hashtags":[{"text":"LetsGoMavs","indices":[0,11]}],"urls":[],"user_mentions":[{"indices":[27,38],"screen_name":"dallasmavs","id_str":"22185437","name":"Dallas Mavericks","id":22185437}]},"contributors":null,"user":{"show_all_inline_media":true,"statuses_count":3101,"following":null,"profile_background_image_url_https":"https://si0.twimg.com/profile_background_images/285480449/AAC_med500.jpg","profile_sidebar_border_color":"eeeeee","screen_name":"flyingcape","follow_request_sent":null,"verified":false,"listed_count":2,"profile_use_background_image":true,"time_zone":"Mountain Time (US &amp; Canada)","description":"HUGE ROCKETS &amp; MAVS fan. Lets take down the Lakers &amp; beat up on the East. Inaugural member of the FC Dallas – Fort Worth fan club.","profile_text_color":"333333","default_profile":false,"profile_background_image_url":"http://a0.twimg.com/profile_background_images/285480449/AAC_med500.jpg","created_at":"Thu Oct 21 15:40:21 +0000 2010","is_translator":false,"profile_link_color":"1212cc","followers_count":35,"url":null,"profile_image_url_https":"https://si0.twimg.com/profile_images/1658982184/204970_10100514487859080_7909803_68807593_5366704_o_normal.jpg","profile_image_url":"http://a0.twimg.com/profile_images/1658982184/204970_10100514487859080_7909803_68807593_5366704_o_normal.jpg","id_str":"205774740","protected":false,"contributors_enabled":false,"geo_enabled":true,"notifications":null,"profile_background_color":"0a2afa","name":"Mandy","default_profile_image":false,"lang":"en","profile_background_tile":true,"friends_count":48,"location":"ATX / FDub. From Galveston !","id":205774740,"utc_offset":-25200,"favourites_count":231,"profile_sidebar_fill_color":"efefef"},"id":195177260792299521,"place":{"bounding_box":{"type":"Polygon","coordinates":[[[-97.938383,30.098659],[-97.56842,30.098659],[-97.56842,30.49685],[-97.938383,30.49685]]]},"country":"United States","url":"http://api.twitter.com/1/geo/id/c3f37afa9efcf94b.json","attributes":{},"full_name":"Austin, TX","country_code":"US","name":"Austin","place_type":"city","id":"c3f37afa9efcf94b"},"in_reply_to_screen_name":null,"in_reply_to_status_id":null}
[/sourcecode]

There is a lot of information in there beyond the tweet text itself, which is simply “#LetsGoMavs til the end RT @dallasmavs: Are You ALL IN?” It is basically a map from attributes to values (and values may themselves be such a map, e.g. for the “user” attribute above). You can see how many times the tweet has been retweeted (which will be zero when the tweet is first published), what time it was created, the unique tweet id, the geo-coordinates (if available), and more. If an attribute does not have a value for the tweet, it is ‘null’.

I will return to JSON processing of tweets in a later tutorial, but you can get a head start by seeing my tutorial on using Scala to process JSON in general.

Command line access to tweets

Assuming you were successful in being able to view tweets in the browser, we can now proceed to using the command line. For this, it will be convenient to first set environment variables for your Twitter username and password.

[sourcecode language=”bash”]
$ export TWUSER=foo
$ export TWPWD=bar
[/sourcecode]

Obviously, you need to provide your Twitter account details instead of foo and bar…

Next, we’ll use the program curl to interact with the API. Try it out by downloading this blog post.

[sourcecode language=”bash”]
$ curl http://bcomposes.wordpress.com/2013/01/25/a-walk-through-for-the-twitter-streaming-api/ > bcomposes-twitter-api.html
$ less bcomposes-twitter-api.html
[/sourcecode]

Given that you pulled tweets from the API using your web browser, and that curl can access web pages in this way, it is simple to use curl to get tweets and direct them straight to a file.

[sourcecode language=”bash”]
$ curl https://stream.twitter.com/1/statuses/sample.json -u$TWUSER:$TWPWD > tweets.json
[/sourcecode]

That’s it: you now have an ever-growing file with randomly sampled tweets. Have a look and try not to lose your faith in humanity. 😉
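
The stream delivers roughly one JSON object per line (plus the occasional blank keep-alive line), so wc gives you a quick idea of how many tweets you have collected so far, and, assuming you have Python installed, its built-in json.tool module will pretty-print a single raw tweet so you can actually read it.

[sourcecode language="bash"]
$ wc -l tweets.json
$ head -1 tweets.json | python -m json.tool
[/sourcecode]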

Pulling tweets with specific properties

You might want to get the tweets from specific users rather than a random sample. This requires user ids rather than the user names we usually see. The id for a user can be obtained from the Twitter API by looking at the /users/show endpoint. For example, the following gives my information:

  • https://api.twitter.com/1/users/show.xml?screen_name=jasonbaldridge

Which gives:

[sourcecode language=”xml”]

<user>
<id>119837224</id>
<name>Jason Baldridge</name>
<screen_name>jasonbaldridge</screen_name>
<location>Austin, Texas</location>
<description>
Assoc. Prof., Computational Linguistics, UT Austin. Senior Data Scientist, Converseon. OpenNLP developer. Scala, Java, R, and Python programmer.
</description>
…MORE…

[/sourcecode]

So, to follow @jasonbaldridge via the Twitter API, you need user id 119837224. You can pull my tweets via the API using the “follow” query parameter.

[sourcecode language=”bash”]
$ curl -d follow=119837224 https://stream.twitter.com/1/statuses/filter.json -u$TWUSER:$TWPWD
[/sourcecode]

There is a good chance I’m not tweeting right now, so you’ll probably not see anything. Let’s follow more users, which we can do by adding more ids separated by commas.

[sourcecode language=”bash”]
$ curl -d follow=1344951,5988062,807095,3108351 https://stream.twitter.com/1/statuses/filter.json -u$TWUSER:$TWPWD
[/sourcecode]

This will follow Wired Magazine (@wired), The Economist (@theeconomist), the New York Times (@nytimes), and the Wall Street Journal (@wsj).

You can also write those ids to a file and read them from the file. For example:

[sourcecode language=”bash”]
$ echo "follow=1344951,5988062,807095,3108351" > following
$ curl -d @following https://stream.twitter.com/1/statuses/filter.json -u$TWUSER:$TWPWD
[/sourcecode]

You can of course edit the file “following” rather than using echo to create it. Also, the file can be named whatever you like (the name “following” is not important here).

You can search for a particular term in tweets, such as “Scala”, using the “track” query parameter.

[sourcecode language=”bash”]
$ curl -d track=scala https://stream.twitter.com/1/statuses/filter.json -u$TWUSER:$TWPWD
[/sourcecode]

And, no surprise, you can search for multiple items by using commas to separate them.

[sourcecode language=”bash”]
$ curl -d track=scala,python,java https://stream.twitter.com/1/statuses/filter.json -u$TWUSER:$TWPWD
[/sourcecode]

However, this only requires that a tweet match at least one of these terms. If you want to ensure that multiple terms match, you’ll need to write them to a file and then refer to that file. For example, to get tweets that have both “sentiment” and “analysis” OR both “machine” and “learning” OR both “text” and “analytics”, you could do the following:

[sourcecode language=”bash”]
$ echo "track=sentiment analysis,machine learning,text analytics" > tracking
$ curl -d @tracking https://stream.twitter.com/1/statuses/filter.json -u$TWUSER:$TWPWD
[/sourcecode]

You can pull tweets from a specific rectangular area (bounding box) on the Earth’s surface. For example, the following pulls geotagged tweets from Austin, Texas.

[sourcecode language=”bash”]
$ curl -d locations=-97.8,30.25,-97.65,30.35 https://stream.twitter.com/1/statuses/filter.json -u$TWUSER:$TWPWD
[/sourcecode]

The bounding box is given as the longitude and latitude of the bottom-left (south-west) corner, followed by the longitude and latitude of the top-right (north-east) corner. You can add further bounding boxes to capture more locations. For example, the following captures tweets from Austin, San Francisco, and New York City.

[sourcecode language=”bash”]
$ curl -d locations=-97.8,30.25,-97.65,30.35,-122.75,36.8,-121.75,37.8,-74,40,-73,41 https://stream.twitter.com/1/statuses/filter.json -u$TWUSER:$TWPWD
[/sourcecode]

Conclusion

It’s all pretty straightforward, and quite handy for many kinds of tweet-gathering needs. One of the problems is that Twitter will drop the connection at times, and you’ll end up missing tweets until you start a new process. If you need constant monitoring,  see UT Austin’s Twools (Twitter tools) for obtaining a steady stream of tweets that picks up whenever Twitter drops your connection.
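
If you just need something quick and dirty in the meantime, a crude shell loop that restarts curl whenever the connection drops looks something like the sketch below; it appends to the file so you keep what you already have, but unlike proper tools it makes no attempt at sensible backoff or at recovering the tweets missed while it was disconnected.

[sourcecode language="bash"]
$ while true; do
>   curl https://stream.twitter.com/1/statuses/sample.json -u$TWUSER:$TWPWD >> tweets.json
>   echo "Connection dropped; reconnecting in 10 seconds..."
>   sleep 10
> done
[/sourcecode]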

In a later post, I’ll detail how to use an API like twitter4j to pull tweets and interact with Twitter at a more fundamental level.

Incorporating and using OpenNLP in Scalabha’s SBT build system

Topics: natural language processing, OpenNLP, SBT, Maven, resources, sentence detection, tokenization, part-of-speech tagging

Introduction

Natural language processing involves a wide range of methods and tasks. However, we usually start with some raw text and begin by demarcating what the sentences and the tokens are. We then go on to further levels of processing, such as predicting part-of-speech tags, syntactic chunks, named entities, syntactic structures, and more.

This tutorial has two goals. First, it shows how to use the OpenNLP Tools as an API for doing sentence detection, tokenization, and part-of-speech tagging. It also shows how to add new dependencies and resources to a system like Scalabha and then use them to add new functionality. As prerequisites, see previous tutorials on getting used to working with the SBT build system of Scalabha and adding new code to the existing build system. To see the other tutorials in this series, check out the list on the links page of my Applied Text Analysis course. Of particular relevance is the one on SBT, Scalabha, packages, and build systems.

To do this tutorial, you should be working with Scalabha version 0.2.3. By the end, you should have recreated version 0.2.4, allowing you to check your progress if you run into any problems.

Adding OpenNLP Tools as a dependency

To use OpenNLP’s API, we need to have access to its jar (Java ARchive) files such that our code can compile using classes from the API and then later be executed. It is important at this point to distinguish between explicitly putting a jar file in your build system versus making it available as a managed dependency. To see some explicitly added (unmanaged) dependencies in Scalabha, look at $SCALABHA_DIR/lib.

[sourcecode lang=”bash”]
$ ls $SCALABHA_DIR/lib
Jama-1.0.2.jar          pca_transform-0.7.2.jar
crunch-0.2.0.jar        scrunch-0.1.0.jar
[/sourcecode]

These have been added to the Scalabha repository and are available even before you do any compilation. You can even see them listed in the Scalabha repository on Github.

In contrast, there are many managed dependencies. When you first download Scalabha, you won’t see them, but once you compile, you can look in the lib_managed directory and will find that it is populated.

[sourcecode lang=”bash”]
$ ls $SCALABHA_DIR/lib_managed
bundles jars    poms
[/sourcecode]

You can go looking into the jars sub-directory to see some of the jars that have been brought in.

To see where these came from, look in the file $SCALABHA_DIR/build.sbt, which declares much of the information that the SBT program needs in order to build the Scalabha system. The dependencies are given in the following declaration.

[sourcecode lang=”scala”]
libraryDependencies ++= Seq(
"org.apache.opennlp" % "opennlp-maxent" % "3.0.2-incubating",
"org.clapper" %% "argot" % "0.3.8",
"org.apache.commons" % "commons-lang3" % "3.0.1",
"commons-logging" % "commons-logging" % "1.1.1",
"log4j" % "log4j" % "1.2.16",
"org.scalatest" % "scalatest_2.9.0" % "1.6.1" % "test",
"junit" % "junit" % "4.10" % "test",
"com.novocode" % "junit-interface" % "0.6" % "test->default") //switch to ScalaTest at some point…
[/sourcecode]

Notice that the OpenNLP Maxent toolkit is in there (along with others), but not the OpenNLP Tools. The Maxent toolkit is used by the OpenNLP Tools (and is part of the same software group/effort), but it can be used independently of it. For example, it is used for the classification homework for the Applied Text Analysis class I’m teaching this semester, which is in fact why the dependency is already in Scalabha v0.2.3.

So, how does one know to write the following to get the OpenNLP Maxent Toolkit as a dependency?

[sourcecode lang=”scala”]
"org.apache.opennlp" % "opennlp-maxent" % "3.0.2-incubating",
[/sourcecode]

I’m not going to go into lots of detail on this, but basically this is what is known as a Maven dependency. On the OpenNLP home page, there is a page for the OpenNLP Maven dependency. Look on that page for where it defines the OpenNLP Maxent dependency, repeated here.

[sourcecode lang=”xml”]
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-maxent</artifactId>
<version>3.0.2-incubating</version>
</dependency>
[/sourcecode]

The group ID indicates the organization that is responsible for the artifact (e.g. a given organization can have many different systems that it develops and deploys in this manner). The artifact ID is the name of that particular artifact to distinguish it from others by the same organization, and the version is obviously the particular version number of that artifact. (This makes it possible to use older versions as and when needed.)

The XML above is what one needs if one is using the Maven build system, which many Java projects use. SBT is compatible with such dependencies, but the terser format given above is used instead of XML.

We now want to add the OpenNLP Tools as a dependency. From the OpenNLP dependencies page we see that it is declared this way in XML.

[sourcecode lang=”xml”]
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-tools</artifactId>
<version>1.5.2-incubating</version>
</dependency>
[/sourcecode]

That means we just need to add the following line to build.sbt in the libraryDependencies declaration.

[sourcecode lang=”scala”]
"org.apache.opennlp" % "opennlp-tools" % "1.5.2-incubating",
[/sourcecode]

And, we can remove the Maxent declaration because OpenNLP Tools depends on it (though it isn’t necessarily a problem if it stays in Scalabha’s build.sbt). The library dependencies should now look as follows.

[sourcecode lang=”scala”]
libraryDependencies ++= Seq(
"org.apache.opennlp" % "opennlp-tools" % "1.5.2-incubating",
"org.clapper" %% "argot" % "0.3.8",
"org.apache.commons" % "commons-lang3" % "3.0.1",
"commons-logging" % "commons-logging" % "1.1.1",
"log4j" % "log4j" % "1.2.16",
"org.scalatest" % "scalatest_2.9.0" % "1.6.1" % "test",
"junit" % "junit" % "4.10" % "test",
"com.novocode" % "junit-interface" % "0.6" % "test->default") //switch to ScalaTest at some point…
[/sourcecode]

The next time you run scalabha build, SBT will read the new dependency declaration and retrieve the dependency. At this point, you might say “What?” How is that sufficient to get the required jars? Here’s how, briefly and at a high level. The OpenNLP artifacts are available on the Maven2 site, and SBT already knows to look there. Put simply, it knows to check this site:

http://repo1.maven.org/maven2

And given that the organization is org.apache.opennlp it knows to then look in this directory:

http://repo1.maven.org/maven2/org/apache/opennlp/

Given that we want the opennlp-tools artifact, it looks here:

http://repo1.maven.org/maven2/org/apache/opennlp/opennlp-tools/

And finally, given that we want the version 1.5.2-incubating, it looks here:

http://repo1.maven.org/maven2/org/apache/opennlp/opennlp-tools/1.5.2-incubating/

In that directory are all the files that SBT needs to pull down to your local machine, plus information about any dependencies of OpenNLP Tools that it needs to grab. Here is the main jar:

http://repo1.maven.org/maven2/org/apache/opennlp/opennlp-tools/1.5.2-incubating/opennlp-tools-1.5.2-incubating.jar

And here is the POM (“Project Object Model”), for OpenNLP Tools:

http://repo1.maven.org/maven2/org/apache/opennlp/opennlp-tools/1.5.2-incubating/opennlp-tools-1.5.2-incubating.pom

Notice that it includes a reference to OpenNLP Maxent in it, which is why Scalabha’s build.sbt no longer needs to include it explicitly. In fact, it is better to not have it in Scalabha’s build.sbt so that we ensure that the version used by OpenNLP Tools is the one we are using (which matters when we update to, say, a later version of OpenNLP Tools).
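
You can verify this yourself by pulling down the POM and grepping for the Maxent artifact. (The -L flag is there because the repository may redirect plain http requests these days, and the exact context lines you see depend on how the POM is formatted.)

[sourcecode lang="bash"]
$ curl -sL http://repo1.maven.org/maven2/org/apache/opennlp/opennlp-tools/1.5.2-incubating/opennlp-tools-1.5.2-incubating.pom | grep -B1 -A2 opennlp-maxent
[/sourcecode]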

In many cases, such artifacts are not hosted at repo1.maven.org. In such cases, you must add a “resolver” that points to another site that contains artifacts. This is done by adding to the resolvers declaration, which is shown here for Scalabha v0.2.3.

[sourcecode lang=”scala”]
resolvers ++= Seq(
"Cloudera Hadoop Releases" at "https://repository.cloudera.com/content/repositories/releases/",
"Thrift location" at "http://people.apache.org/~rawson/repo/"
)
[/sourcecode]

So, when dependencies are declared, SBT will also search through those locations, in addition to its defaults, to find them and pull them down to your machine. As it turns out, OpenNLP has a dependency on the Java WordNet Library, which is hosted on a non-standard Maven repository (which is associated with OpenNLP’s old development site on Sourceforge). You should update build.sbt to be the following:

[sourcecode lang=”scala”]
resolvers ++= Seq(
"Cloudera Hadoop Releases" at "https://repository.cloudera.com/content/repositories/releases/",
"Thrift location" at "http://people.apache.org/~rawson/repo/",
"opennlp sourceforge repo" at "http://opennlp.sourceforge.net/maven2"
)
[/sourcecode]

That was a lot of description, but note that it was a simple change to build.sbt and now we can use the OpenNLP Tools API.

Tip: if you already had SBT running (e.g. via scalabha build), then you must use the reload command at the SBT prompt after you change build.sbt in order for SBT to know about the changes.

What do you do if the library you want to use isn’t available as a Maven artifact? In that case, you need to put the jar (or jars) for that library, plus any jars it depends on, into the $SCALABHA_DIR/lib directory. Then SBT will see that they are there and add them to your classpath, enabling you to use them just as if they were a managed dependency. The downside is that you must put it there explicitly, which means a bit more hassle when you want to update to later versions, and a fair amount more hassle if that library has lots of dependencies that you also need to manage.
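
For example, if you had downloaded a jar for some library not published to a Maven repository (the file name below is made up purely for illustration), installing it is just a copy.

[sourcecode lang="bash"]
$ cp ~/Downloads/somelib-1.0.jar $SCALABHA_DIR/lib/
$ ls $SCALABHA_DIR/lib
[/sourcecode]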

Obtaining and installing the OpenNLP sentence detector model

Now on to the processing of language. Sentence detection simply refers to the basic process of taking a text and identifying the character positions that indicate sentence breaks. As a running example, we’ll use the first several sentences from the Penn Treebank.

Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.

Note that the “.” character is not a reliable indicator of the end of a sentence. While one can build a regular-expression-based sentence detector, machine learned models are typically used to figure this out, based on a reasonable number of example sentences identified as such by a human.
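
To see why the regular-expression route is fragile, here is a deliberately naive splitter that breaks after every period-plus-space; run on the example text, it happily splits inside “Nov. 29” and right after “Mr.”.

[sourcecode lang="bash"]
$ echo "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group." | awk '{ gsub(/\. /, ".\n"); print }'
[/sourcecode]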

Roughly and somewhat crudely speaking, a machine learned model is a set of features that are associated with real-valued weights which have been determined from some training material. Once these weights have been learned, the model can be saved and reused (e.g. see the classification homework for Applied Text Analysis).

OpenNLP has pretrained models available for several NLP tasks, including sentence detection. Note also that there is an effort I’m heading to make it possible to distribute and, where possible, rebuild models — see the OpenNLP Models Github repository.

We want to do English sentence detection, so the model we need right now is the en | Sentence Detector. Rather than putting it in some random place on your computer, we’ll add it as part of the Scalabha build system and exploit this to simplify the loading of models (more on this later). Recall that the $SCALABHA_DIR/src/main/scala directory is where the actual code of Scalabha is kept (and is also where you can add additional code to do your own tasks, as covered in the previous tutorials). If you look at the $SCALABHA_DIR/src/main directory, you’ll see an additional resources directory. Go there and list the directory contents:

[sourcecode lang=”bash”]
$ cd $SCALABHA_DIR/src/main/resources
$ ls
log4j.properties
[/sourcecode]

All that is there now is a properties file that defines default logging behavior (which is a good way to output debugging information, e.g. as it is done in the opennlp.scalabha.cluster package used in the clustering homework of Applied Text Analysis). What is very nice about the resources directory is that any files in it are accessible in the classpath of the application we are building. That won’t make total sense right away, but it will be clear as we go along — the end result is that it simplifies a number of things a great deal, so bear with me.

What we are going to do now is place the sentence detector model in a subdirectory of resources that will give us access to it, and also organize things for future additions (wrt languages and systems). So, do the following:

[sourcecode lang=”bash”]
$ cd $SCALABHA_DIR/src/main/resources
$ mkdir -p lang/eng/opennlp
$ cd lang/eng/opennlp/
$ wget http://opennlp.sourceforge.net/models-1.5/en-sent.bin
–2012-04-10 12:24:42–  http://opennlp.sourceforge.net/models-1.5/en-sent.bin
Resolving opennlp.sourceforge.net… 216.34.181.96
Connecting to opennlp.sourceforge.net|216.34.181.96|:80… connected.
HTTP request sent, awaiting response… 200 OK
Length: 98533 (96K) [application/octet-stream]
Saving to: `en-sent.bin’

100%[======================================>] 98,533       411K/s   in 0.2s

2012-04-10 12:24:43 (411 KB/s) – `en-sent.bin’ saved [98533/98533]
[/sourcecode]

Note: the last command uses the program wget, which may not be available on your machine. If that is the case, you can download en-sent.bin in your browser (using the link given after wget above) and move it to the directory $SCALABHA_DIR/src/main/resources/lang/eng/opennlp. (Better yet, install wget since it is so useful…)

Status check: you should now see en-sent.bin when you do the following:

[sourcecode lang=”bash”]
$ ls $SCALABHA_DIR/src/main/resources/lang/eng/opennlp
en-sent.bin
[/sourcecode]

Using the sentence detector

Let’s now use the model! That requires creating an example application that will read in the model, construct a sentence detector object from it, and then apply it to some example text. Do the following:

[sourcecode lang=”bash”]
$ touch $SCALABHA_DIR/src/main/scala/opennlp/scalabha/tag/OpenNlpTagger.scala
[/sourcecode]

This creates an empty file at that location that you should now open in a text editor. Add the following Scala code (to be explained) to that file:

[sourcecode lang=”scala”]
package opennlp.scalabha.tag

object OpenNlpTagger {
import opennlp.tools.sentdetect.SentenceDetectorME
import opennlp.tools.sentdetect.SentenceModel

lazy val sentenceDetector =
new SentenceDetectorME(
new SentenceModel(
this.getClass.getResourceAsStream("/lang/eng/opennlp/en-sent.bin")))

def main (args: Array[String]) {
val test = io.Source.fromFile(args(0)).mkString
sentenceDetector.sentDetect(test).foreach(println)
}

}
[/sourcecode]

Here are the relevant bits of explanation needed to understand what is going on. We need to import the SentenceDetectorME and SentenceModel classes (you should verify that you can find them in the OpenNLP API). The former is a class for sentence detectors that are based on trained maximum entropy models, and the latter is for holding such models. We then must create our sentence detector. This is where we get the advantage of having put it into the resources directory of Scalabha. We obtain it by getting the Class of the object (via this.getClass) and then using the getResourceAsStream method of the Class class. That’s a bit meta, but it boils down to enabling you to just follow this recipe for getting the resource. The return value of getResourceAsStream is an InputStream, which is what is needed to construct a SentenceModel.

Once we have a SentenceModel, that can be used to create a SentenceDetectorME. Note that the sentenceDetector object is declared as a lazy val. By doing this, the model is only loaded when we need it. For a small program like this one, this doesn’t matter much, but in a larger system with many components, using lazy vals allows the application to get fired up much more quickly and then load things like models on demand. (You’ll actually see a nice, concrete example of this by the end of the tutorial.) In general, using lazy vals is a good idea.

We then just need to get some text and use the sentence detector. The application gets a file name from the command line and then reads in its contents. The sentence detector has a method sentDetect (see the API) that takes a String and returns an Array[String], where each element of the Array is a sentence. So, we run sentDetect on the input text and then print out each sentence on its own line.

Once you have added the above code to OpenNlpTagger.scala, you should compile in SBT (I recommend using ~compile so that it compiles every time you make a change). Then, do the following:

[sourcecode lang=”bash”]
$ cd /tmp
$ echo "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate." > vinken.txt
$ scalabha run opennlp.scalabha.tag.OpenNlpTagger vinken.txt
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.
[/sourcecode]

So, the model does perfectly on these sentences (but don’t expect it to do quite so well on other domains, such as Twitter). We are now ready to do the next step of splitting up the characters in each sentence into tokens.

Tokenizing

Once we have identified the sentences, we need to tokenize them to turn them into a sequence of tokens where each token is a symbol or word (conforming to some predefined notion of what is a “word”). For example, the tokens for the first sentence of the running example are the following, where a token is indicated via space:

Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .

Most NLP tools then build on these units.

To enable tokenization, we must first make the English tokenizer available as a resource in Scalabha.

[sourcecode lang=”bash”]
$ cd $SCALABHA_DIR/src/main/resources/lang/eng/opennlp
$ wget http://opennlp.sourceforge.net/models-1.5/en-token.bin
–2012-04-10 14:21:14–  http://opennlp.sourceforge.net/models-1.5/en-token.bin
Resolving opennlp.sourceforge.net… 216.34.181.96
Connecting to opennlp.sourceforge.net|216.34.181.96|:80… connected.
HTTP request sent, awaiting response… 200 OK
Length: 439890 (430K) [application/octet-stream]
Saving to: `en-token.bin’

100%[========================================================================>] 439,890      592K/s   in 0.7s

2012-04-10 14:21:16 (592 KB/s) – `en-token.bin’ saved [439890/439890]
[/sourcecode]

Then, change OpenNlpTagger.scala to have the following contents.

[sourcecode lang=”scala”]
package opennlp.scalabha.tag

object OpenNlpTagger {
import opennlp.tools.sentdetect.SentenceDetectorME
import opennlp.tools.sentdetect.SentenceModel
import opennlp.tools.tokenize.TokenizerME
import opennlp.tools.tokenize.TokenizerModel

lazy val sentenceDetector =
new SentenceDetectorME(
new SentenceModel(
this.getClass.getResourceAsStream("/lang/eng/opennlp/en-sent.bin")))

lazy val tokenizer =
new TokenizerME(
new TokenizerModel(
this.getClass.getResourceAsStream("/lang/eng/opennlp/en-token.bin")))

def main (args: Array[String]) {
val test = io.Source.fromFile(args(0)).mkString
val sentences = sentenceDetector.sentDetect(test)
val tokenizedSentences = sentences.map(tokenizer.tokenize(_))
tokenizedSentences.foreach(tokens => println(tokens.mkString(" ")))
}

}
[/sourcecode]

The process is very similar to what was done for the sentence detector. The only difference is that we now use the tokenizer’s tokenize method on each sentence. This method returns an Array[String], where each element is a token. We thus map the Array[String] of sentences to the Array[Array[String]] of tokenizedSentences. Simple!

Make sure to test that everything is working.

[sourcecode lang=”bash”]
$ cd /tmp
$ scalabha run opennlp.scalabha.tag.OpenNlpTagger vinken.txt
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .
Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a director of this British industrial conglomerate .
[/sourcecode]

Now that we have these tokens, the input is ready for part-of-speech tagging.

Part-of-speech tagging

Part-of-speech (POS) tagging involves identifying whether each token is a noun, verb, determiner, and so on. Some part-of-speech tag sets have more detail, such as NN for a singular noun and NNS for a plural one. See the previous tutorial on iteration for more details and pointers.

The OpenNLP POS tagger is trained on the Penn Treebank, so it uses that tagset. As with the other models, we must download it and place it in the resources directory.

[sourcecode lang=”bash”]
$ cd $SCALABHA_DIR/src/main/resources/lang/eng/opennlp
$ wget http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin
–2012-04-10 14:31:33–  http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin
Resolving opennlp.sourceforge.net… 216.34.181.96
Connecting to opennlp.sourceforge.net|216.34.181.96|:80… connected.
HTTP request sent, awaiting response… 200 OK
Length: 5696197 (5.4M) [application/octet-stream]
Saving to: `en-pos-maxent.bin’

100%[========================================================================>] 5,696,197    671K/s   in 8.2s

2012-04-10 14:31:42 (681 KB/s) – `en-pos-maxent.bin’ saved [5696197/5696197]
[/sourcecode]

Then, update OpenNlpTagger.scala to have the following contents, which involve some additional output over what you saw the previous times.

[sourcecode lang=”scala”]
package opennlp.scalabha.tag

object OpenNlpTagger {

import opennlp.tools.sentdetect.SentenceDetectorME
import opennlp.tools.sentdetect.SentenceModel
import opennlp.tools.tokenize.TokenizerME
import opennlp.tools.tokenize.TokenizerModel
import opennlp.tools.postag.POSTaggerME
import opennlp.tools.postag.POSModel

lazy val sentenceDetector =
new SentenceDetectorME(
new SentenceModel(
this.getClass.getResourceAsStream("/lang/eng/opennlp/en-sent.bin")))

lazy val tokenizer =
new TokenizerME(
new TokenizerModel(
this.getClass.getResourceAsStream("/lang/eng/opennlp/en-token.bin")))

lazy val tagger =
new POSTaggerME(
new POSModel(
this.getClass.getResourceAsStream("/lang/eng/opennlp/en-pos-maxent.bin")))

def main (args: Array[String]) {

val test = io.Source.fromFile(args(0)).mkString

println("n*********************")
println("Showing sentences.")
println("*********************")
val sentences = sentenceDetector.sentDetect(test)
sentences.foreach(println)

println("n*********************")
println("Showing tokens.")
println("*********************")
val tokenizedSentences = sentences.map(tokenizer.tokenize(_))
tokenizedSentences.foreach(tokens => println(tokens.mkString(" ")))

println("n*********************")
println("Showing POS.")
println("*********************")
val postaggedSentences = tokenizedSentences.map(tagger.tag(_))
postaggedSentences.foreach(postags => println(postags.mkString(" ")))

println("n*********************")
println("Zipping tokens and tags.")
println("*********************")
val tokposSentences =
tokenizedSentences.zip(postaggedSentences).map { case(tokens, postags) =>
tokens.zip(postags).map { case(tok,pos) => tok + "/" + pos }
}
tokposSentences.foreach(tokposSentence => println(tokposSentence.mkString(" ")))

}

}
[/sourcecode]

Everything is as before, so it should be pretty much self-explanatory. Just note that the tagger’s tag method takes a token sequence (Array[String], written as String[] in OpenNLP’s Javadoc) as its input and it returns an Array[String] of the tags for each token. Thus, when we output the postaggedSentences in the “Showing POS” part, it prints only the tags. We can then bring the tokens and their corresponding tags together by zipping the tokenizedSentences with the postaggedSentences and then zipping the word and POS tokens in each sentence together, as shown in the “Zipping tokens and tags” portion.

When this is run, you should get the following output.

[sourcecode lang=”bash”]
$ cd /tmp
$ scalabha run opennlp.scalabha.tag.OpenNlpTagger vinken.txt

*********************
Showing sentences.
*********************
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.

*********************
Showing tokens.
*********************
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .
Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a director of this British industrial conglomerate .

*********************
Showing POS.
*********************
NNP NNP , CD NNS JJ , MD VB DT NN IN DT JJ NN NNP CD .
NNP NNP VBZ NN IN NNP NNP , DT JJ NN NN .
NNP NNP , CD NNS JJ CC JJ NN IN NNP NNP NNP NNP , VBD VBN DT NN IN DT JJ JJ NN .

*********************
Zipping tokens and tags.
*********************
Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
Mr./NNP Vinken/NNP is/VBZ chairman/NN of/IN Elsevier/NNP N.V./NNP ,/, the/DT Dutch/JJ publishing/NN group/NN ./.
Rudolph/NNP Agnew/NNP ,/, 55/CD years/NNS old/JJ and/CC former/JJ chairman/NN of/IN Consolidated/NNP Gold/NNP Fields/NNP PLC/NNP ,/, was/VBD named/VBN a/DT director/NN of/IN this/DT British/JJ industrial/JJ conglomerate/NN ./.
[/sourcecode]

Note: You’ll probably notice a pause just after it says “Showing POS” — that is because the tagger is defined as a lazy val, so the model is loaded at that time since it is the first point where it is needed. Try removing “lazy” from the declarations of sentenceDetector, tokenizer, and tagger, recompiling and then running it again — you’ll now see that the pause before anything is done is greater, but that once it starts processing everything goes very quickly. That’s a fairly good way of seeing part of why lazy values are quite handy.

And that’s it. To see the output on a longer example, you can run it on any text you like, e.g. the ones in the Scalabha’s data directory, like the Federalist Papers:

[sourcecode lang=”bash”]

$ scalabha run opennlp.scalabha.tag.OpenNlpTagger $SCALABHA_DIR/data/cluster/federalist/federalist.txt

[/sourcecode]

Now as an exercise, turn the standalone application, defined as the object OpenNlpTagger, into a class, OpenNlpTagger, that takes a raw text as input (not via the command line, but as an argument to a method) and returns a List[List[(String,String)]] that contains the sentences and, for each sentence, a sequence of (token,tag) pairs. For example, after running it on the Vinken text, you should produce the following.

[sourcecode lang=”scala”]
List(List((Pierre,NNP), (Vinken,NNP), (,,,), (61,CD), (years,NNS), (old,JJ), (,,,), (will,MD), (join,VB), (the,DT), (board,NN), (as,IN), (a,DT), (nonexecutive,JJ), (director,NN), (Nov.,NNP), (29,CD), (.,.)), List((Mr.,NNP), (Vinken,NNP), (is,VBZ), (chairman,NN), (of,IN), (Elsevier,NNP), (N.V.,NNP), (,,,), (the,DT), (Dutch,JJ), (publishing,NN), (group,NN), (.,.)), List((Rudolph,NNP), (Agnew,NNP), (,,,), (55,CD), (years,NNS), (old,JJ), (and,CC), (former,JJ), (chairman,NN), (of,IN), (Consolidated,NNP), (Gold,NNP), (Fields,NNP), (PLC,NNP), (,,,), (was,VBD), (named,VBN), (a,DT), (director,NN), (of,IN), (this,DT), (British,JJ), (industrial,JJ), (conglomerate,NN), (.,.)))
[/sourcecode]

Spans

You may notice that the sentence detector and tokenizer APIs both include methods that return Array[Span] (note: Span[] in OpenNLP’s Javadoc). These are preferable in many contexts since they don’t lose information from the original text, unlike the ones we used above which turned the original text into sequences of portions of the original. Spans just record the character offsets at which the sentences start and end, or at which tokens start and end. This is quite handy for further processing and is what is generally used in non-trivial applications. But, for many cases, the methods that return Array[String] will be just fine and require learning a bit less.

Conclusion

This tutorial has taken you from a version of Scalabha that does not have the OpenNLP Tools API available to a version which does have it and also has several pretrained models available and an example application to use the API for part-of-speech tagging. You can of course follow similar recipes for bringing in other libraries and using them in your code, so this setup gives you a lot of power and is easy to use once you’ve done it a few times. If you have any trouble, or want to check it against a definitely working version, get Scalabha v0.2.4, which differs from v0.2.3 primarily only with respect to this tutorial.

A final note: you may be wondering what the heck OpenNLP is, given that Scalabha’s classpath starts with opennlp.scalabha, but we were adding the OpenNLP Tools as a dependency. Basically, Gann Bierner and I started OpenNLP in 1999, and part of the goal of that was to provide a high-level organizational domain name so that we could ensure uniqueness in classpaths. So, we have opennlp.tools, opennlp.maxent, opennlp.scalabha, and there are others. These are thus clearly different, in terms of their unique classpaths, from foo.tools, foo.maxent, and so on. So, when I started Scalabha, I used opennlp.scalabha (though in all likelihood, no one else would pick scalabha as a top-level for a class path). Nonetheless, when one speaks of OpenNLP generally, it usually refers to the OpenNLP Tools, the first of the projects to be in the OpenNLP “family”.

Copyright 2012 Jason Baldridge

The text of this tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original tutorial.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.