Smoke and Mirrors: the odd question of a civil rights leader’s race

Questioning the race of Shaun King is an ad hominem attack based on flawed assumptions that distracts from the real issues.

Shaun King has been one of the most visible and vocal leaders of the Black Lives Matter movement over the past year. He’s done a great deal to raise awareness of police misconduct and brutality, with a particular emphasis on the disproportionate targeting of black Americans. (Though it is worth noting that he and others have called attention to people of other races killed by police, even while the supposed #AllLivesMatter folks seemed oddly silent.)

An unsurprising development regarding Black Lives Matter is that its leaders are coming under character attacks. There is a long tradition of privileged segments of society and even the government doing this, including to the now revered and respected Martin Luther King. And Shaun King has recently come under a very odd sort of attack from the conservative media: they are claiming he is not truly black and accusing him of being duplicitous, like Rachel Dolezal.

This issue has a particular resonance for me because my family is a tangible example of the complexities of the concept of race. I’m white and my wife is black. We have two sons, both our biological children. The picture on the right is of our four hands. Our older son has darker skin and dark curly hair. He’s absolutely beautiful. Most people see him and think of him as “black”. Our younger son has light skin and blonde hair with just a hint of curl. He’s absolutely beautiful. Most people see him and think of him as “white”. In fact, when my wife is in public with our younger son (and without me), most people assume she’s his nanny. (And white people monitor her to make sure she’s treating him well, but that’s another story.)

So, we have these two children who are perceived very differently by others. Are you to tell me that the younger one isn’t “black” or is less “black” than his older brother? Just like Shaun King isn’t black because his skin is too light? What if both of my sons strongly identify with their black heritage and become leaders of some future “black” movement that seeks to reduce racial disparities? Would my younger son be attacked for not being “black” enough? With his older brother standing right by his side and no one questioning his blackness? One gets to speak for the black community because the genetic dice gave him the darker skin and hair, while the other is unsuitable? That would be pure and utter bullshit.

Let’s step back for a moment. It’s important to consider what “race” means, how any given individual might define it differently from others, and how one’s own notion of racial categories might shift over time, as applied to others or even to oneself. Can we even operationalize racial categories? It’s rather tricky. I wrote about this in the context of machine learning, and there’s good recent academic work on figuring out what the notion of race fundamentally encompasses. As Sen and Wasow argue in their article “Race as a bundle of sticks”, we should look at race as a multi-faceted group of properties, some of which are immutable (like genes) and many of which are mutable (such as location, religion, diet, etc.). The very notion of racial categorization shifts over time—for example, there was a time not long ago when southern Europeans were not considered “white”. All this is not to say that race isn’t a thing, but that it is very, very complicated. In fact, it is far more complicated than most people have ever stopped to really consider.

Returning to the attacks on Shaun King, here’s the thing: I personally don’t care if he is “black” or not, or is somewhat “black” or not. He could be Asian or white and it wouldn’t matter. I think he is doing what he’s doing because he is a caring human being who believes it is right and necessary. He wants to raise awareness of and reduce police violence and reduce racial disparities. That’s a laudable goal no matter who you are, no matter what race you identify with, no matter what. Period.

To me, this is clearly an ad hominem attack based on the flawed premise that race is a concept that we can clearly and objectively delineate. It has nothing to do with the facts and arguments that surround questions of racism in the USA, police conduct and related issues. There is plenty to debate there and, for what it’s worth, I don’t agree with Shaun King on many things. We all must do our best to learn, consider and reflect on the information we have. Ideally, we also seek new perspectives and keep an open mind while doing so. As it is, this attack is a distraction designed to deflect attention away from the real issues. It’s just smoke and mirrors.

And if you think there aren’t real issues here… Ask yourself if you think our country should support a truly Kafkaesque institution like Rikers Island. Ask yourself if you are comfortable with the Sandra Bland traffic stop (even FOX News and Donald Trump aren’t, as Larry Wilmore noted). Ask yourself if people should be threatened by the police when they are in their own driveway, hooking up their own boat to their own car. Ask whether the police should be outfitted with military-grade vehicles and weapons (see also John Oliver’s serious/humorous take on this). These are just a few (important) examples, and there are unfortunately many more. They do not reflect the United States of America that I believe in—a great country based on a civil society that protects the rights of individuals without prejudice as to their race, religion, political affiliation, etc. You are ignoring much evidence if you think there isn’t a problem. Pay attention, please.

Regarding actions by the police in McKinney, Texas

This is a horrible video of police in McKinney, Texas treating a bunch of kids — I stress, KIDS — at a pool party in a very heavy-handed way, way out of proportion to the situation (the “incident”). One officer, Eric Casebolt, pulls his gun as a threat (and he is now on leave because of it). Kids who had nothing to do with the situation are handcuffed, yelled at, and called motherfuckers. I can’t imagine this happening at a similar party in my (almost entirely white) hometown of Rockford, Michigan.

For more context, see this article.

I find this all very upsetting, and I took up Joshua Dubois’ suggestion to write to the police chief. My letter is below.

Dear Police Chief Conley,

I’m writing to express my extreme disapproval and concern regarding the incident in McKinney involving very heavy-handed behavior by police, and in particular Corporal Eric Casebolt, against a group of teens.

I have reviewed the videos and read many different reports on the matter, and I realize that there may be more information yet to come to light. Regardless of how things transpired prior to the police force arriving, the actions of Corporal Casebolt are incredibly disturbing: yanking a 14-year-old girl by her hair, pinning her to the ground, chasing other teens with a gun, and swearing and cursing at teens. Many of the teens were interacting very respectfully, yet he tells them to “sit your asses down on the ground”. Many of the other teens appear incredibly scared — wanting to help their friends, but not wanting to escalate the situation (which is probably wise given recent events in the country and Corporal Casebolt’s disposition and his brandishing of his gun).

This is not behavior befitting an officer of the law. I fully realize that the police have an important and difficult job to do, and I’m thankful to those who serve and keep the peace. I believe a big part of that job is to show respect to the people that the police serve, and to apply rules and force consistently, regardless of the age, race, or socio-economic status of the individuals involved. Sadly, recent events in the country, including Saturday’s incident in McKinney, indicate that this is far from the case currently.

I’m not writing this just as a concerned citizen from afar. I live in Austin, Texas. My wife is African-American and we have two biracial sons, currently two and six years old. My six-year-old likes dinosaurs, tennis, and math. He’s going to do amazing things, but I fear that society—including the authorities—will view him as a threat by the time he becomes a teenager in 2022. My wife has family who live in Lewisville, less than 30 minutes from McKinney. If my son goes to a pool party with his cousin in seven years, should I worry that he will be handcuffed just for being present? And that no matter how polite and respectful he is, he’ll be told to sit his ass down? I certainly hope not, but seven years isn’t very much time. I sincerely hope that you and others in similar positions will do whatever you can to help reduce the likelihood of these sorts of incidents and to ensure that the members of the police force are respectful of the rights of all citizens. A good start to this would be for you to dismiss Corporal Casebolt.

Sincerely,

Dr. Jason Baldridge

Associate Professor of Computational Linguistics, The University of Texas at Austin

Co-founder and Chief Scientist, People Pattern

I’m not at all sure it will do any good, but it’s a start to trying to effect some change. If you feel the same, please consider writing, and getting involved. Follow Shaun King and DeRay Mckesson for much, much more on what is going on with the police and racism. We need to find a better way forward, as a society.

Incorporating and using OpenNLP in Scalabha’s SBT build system

Topics: natural language processing, OpenNLP, SBT, Maven, resources, sentence detection, tokenization, part-of-speech tagging

Introduction

Natural language processing involves a wide range of methods and tasks. However, we usually begin with some raw text and start by demarcating what the sentences and the tokens are. We then go on to further levels of processing, such as predicting part-of-speech tags, syntactic chunks, named entities, syntactic structures, and more.

This tutorial has two goals. First, it shows how to use the OpenNLP Tools as an API for doing sentence detection, tokenization, and part-of-speech tagging. Second, it shows how to add new dependencies and resources to a system like Scalabha and then use them to add new functionality. As prerequisites, see the previous tutorials on getting used to working with the SBT build system of Scalabha and adding new code to the existing build system. To see the other tutorials in this series, check out the list on the links page of my Applied Text Analysis course. Of particular relevance is the one on SBT, Scalabha, packages, and build systems.

To do this tutorial, you should be working with Scalabha version 0.2.3. By the end, you should have recreated version 0.2.4, allowing you to check your progress if you run into any problems.

Adding OpenNLP Tools as a dependency

To use OpenNLP’s API, we need to have access to its jar (Java ARchive) files such that our code can compile using classes from the API and then later be executed. It is important at this point to distinguish between explicitly putting a jar file in your build system versus making it available as a managed dependency. To see some explicitly added (unmanaged) dependencies in Scalabha, look at $SCALABHA_DIR/lib.

[sourcecode lang=”bash”]
$ ls $SCALABHA_DIR/lib
Jama-1.0.2.jar          pca_transform-0.7.2.jar
crunch-0.2.0.jar        scrunch-0.1.0.jar
[/sourcecode]

These have been added to the Scalabha repository and are available even before you do any compilation. You can even see them listed in the Scalabha repository on Github.

In contrast, there are many managed dependencies. When you first download Scalabha, you won’t see them, but once you compile, you can look in the lib_managed directory and will find it is populated:

[sourcecode lang=”bash”]
$ ls $SCALABHA_DIR/lib_managed
bundles jars    poms
[/sourcecode]

You can go looking into the jars sub-directory to see some of the jars that have been brought in.

To see where these came from, look in the file $SCALABHA_DIR/build.sbt, which declares much of the information that the SBT program needs in order to build the Scalabha system. The dependencies are given in the following declaration.

[sourcecode lang=”scala”]
libraryDependencies ++= Seq(
"org.apache.opennlp" % "opennlp-maxent" % "3.0.2-incubating",
"org.clapper" %% "argot" % "0.3.8",
"org.apache.commons" % "commons-lang3" % "3.0.1",
"commons-logging" % "commons-logging" % "1.1.1",
"log4j" % "log4j" % "1.2.16",
"org.scalatest" % "scalatest_2.9.0" % "1.6.1" % "test",
"junit" % "junit" % "4.10" % "test",
"com.novocode" % "junit-interface" % "0.6" % "test->default") //switch to ScalaTest at some point…
[/sourcecode]

Notice that the OpenNLP Maxent toolkit is in there (along with others), but not the OpenNLP Tools. The Maxent toolkit is used by the OpenNLP Tools (and is part of the same software group/effort), but it can be used independently of it. For example, it is used for the classification homework for the Applied Text Analysis class I’m teaching this semester, which is in fact why the dependency is already in Scalabha v0.2.3.

So, how does one know to write the following to get the OpenNLP Maxent Toolkit as a dependency?

[sourcecode lang=”scala”]
"org.apache.opennlp" % "opennlp-maxent" % "3.0.2-incubating",
[/sourcecode]

I’m not going to go into lots of detail on this, but basically this is what is known as a Maven dependency. On the OpenNLP home page, there is a page for the OpenNLP Maven dependency. Look on that page for where it defines the OpenNLP Maxent dependency, repeated here.

[sourcecode lang=”xml”]
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-maxent</artifactId>
<version>3.0.2-incubating</version>
</dependency>
[/sourcecode]

The group ID indicates the organization that is responsible for the artifact (e.g. a given organization can have many different systems that it develops and deploys in this manner). The artifact ID is the name of that particular artifact to distinguish it from others by the same organization, and the version is obviously the particular version number of that artifact. (This makes it possible to use older versions as and when needed.)

The XML above is what one needs if one is using the Maven build system, which many Java projects use. SBT is compatible with such dependencies, but the terser format given above is used instead of XML.
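
As an aside, it is worth knowing how the terse SBT format maps onto the Maven coordinates, and what the double %% used for the argot dependency means. This is general SBT convention rather than anything OpenNLP-specific; the sketch below uses the dependencies already in Scalabha’s build.sbt.

[sourcecode lang="scala"]
// General form: "groupId" % "artifactId" % "version"
"org.apache.opennlp" % "opennlp-maxent" % "3.0.2-incubating"

// %% appends the Scala version to the artifact ID, so on a Scala 2.9.x
// build this resolves to an artifact like "argot_2.9.1". It is used for
// libraries that are cross-compiled against multiple Scala versions.
"org.clapper" %% "argot" % "0.3.8"
[/sourcecode]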

We now want to add the OpenNLP Tools as a dependency. From the OpenNLP dependencies page we see that it is declared this way in XML.

[sourcecode lang=”xml”]
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-tools</artifactId>
<version>1.5.2-incubating</version>
</dependency>
[/sourcecode]

That means we just need to add the following line to build.sbt in the libraryDependencies declaration.

[sourcecode lang=”scala”]
"org.apache.opennlp" % "opennlp-tools" % "1.5.2-incubating",
[/sourcecode]

And, we can remove the Maxent declaration because OpenNLP Tools depends on it (though it isn’t necessarily a problem if it stays in Scalabha’s build.sbt). The library dependencies should now look as follows.

[sourcecode lang=”scala”]
libraryDependencies ++= Seq(
"org.apache.opennlp" % "opennlp-tools" % "1.5.2-incubating",
"org.clapper" %% "argot" % "0.3.8",
"org.apache.commons" % "commons-lang3" % "3.0.1",
"commons-logging" % "commons-logging" % "1.1.1",
"log4j" % "log4j" % "1.2.16",
"org.scalatest" % "scalatest_2.9.0" % "1.6.1" % "test",
"junit" % "junit" % "4.10" % "test",
"com.novocode" % "junit-interface" % "0.6" % "test->default") //switch to ScalaTest at some point…
[/sourcecode]

The next time you run scalabha build, SBT will read the new dependency declaration and retrieve the dependency. At this point, you might say “What?” How is that sufficient to get the required jars? Here’s how, briefly and at a high level. The OpenNLP artifacts are available on the Maven2 site, and SBT already knows to look there. Put simply, it knows to check this site:

http://repo1.maven.org/maven2

And given that the organization is org.apache.opennlp it knows to then look in this directory:

http://repo1.maven.org/maven2/org/apache/opennlp/

Given that we want the opennlp-tools artifact, it looks here:

http://repo1.maven.org/maven2/org/apache/opennlp/opennlp-tools/

And finally, given that we want the version 1.5.2-incubating, it looks here:

http://repo1.maven.org/maven2/org/apache/opennlp/opennlp-tools/1.5.2-incubating/

In that directory are all the files that SBT needs to pull down to your local machine, plus information about any dependencies of OpenNLP Tools that it needs to grab. Here is the main jar:

http://repo1.maven.org/maven2/org/apache/opennlp/opennlp-tools/1.5.2-incubating/opennlp-tools-1.5.2-incubating.jar

And here is the POM (“Project Object Model”), for OpenNLP Tools:

http://repo1.maven.org/maven2/org/apache/opennlp/opennlp-tools/1.5.2-incubating/opennlp-tools-1.5.2-incubating.pom

Notice that it includes a reference to OpenNLP Maxent in it, which is why Scalabha’s build.sbt no longer needs to include it explicitly. In fact, it is better to not have it in Scalabha’s build.sbt so that we ensure that the version used by OpenNLP Tools is the one we are using (which matters when we update to, say, a later version of OpenNLP Tools).

In many cases, such artifacts are not hosted at repo1.maven.org. In such cases, you must add a “resolver” that points to another site that contains artifacts. This is done by adding to the resolvers declaration, which is shown here for Scalabha v0.2.3.

[sourcecode lang=”scala”]
resolvers ++= Seq(
"Cloudera Hadoop Releases" at "https://repository.cloudera.com/content/repositories/releases/",
"Thrift location" at "http://people.apache.org/~rawson/repo/"
)
[/sourcecode]

So, when dependencies are declared, SBT will also search through those locations, in addition to its defaults, to find them and pull them down to your machine. As it turns out, OpenNLP has a dependency on the Java WordNet Library, which is hosted on a non-standard Maven repository (associated with OpenNLP’s old development site on Sourceforge). You should update the resolvers declaration in build.sbt to the following:

[sourcecode lang=”scala”]
resolvers ++= Seq(
"Cloudera Hadoop Releases" at "https://repository.cloudera.com/content/repositories/releases/",
"Thrift location" at "http://people.apache.org/~rawson/repo/",
"opennlp sourceforge repo" at "http://opennlp.sourceforge.net/maven2"
)
[/sourcecode]

That was a lot of description, but note that it was a simple change to build.sbt and now we can use the OpenNLP Tools API.

Tip: if you already had SBT running (e.g. via scalabha build), then you must use the reload command at the SBT prompt after you change build.sbt in order for SBT to know about the changes.

What do you do if the library you want to use isn’t available as a Maven artifact? In that case, you need to put the jar (or jars) for that library, plus any jars it depends on, into the $SCALABHA_DIR/lib directory. Then SBT will see that they are there and add them to your classpath, enabling you to use them just as if they were a managed dependency. The downside is that you must put it there explicitly, which means a bit more hassle when you want to update to later versions, and a fair amount more hassle if that library has lots of dependencies that you also need to manage.

Obtaining and installing the OpenNLP sentence detector model

Now on to the processing of language. Sentence detection simply refers to the basic process of taking a text and identifying the character positions that indicate sentence breaks. As a running example, we’ll use the first several sentences from the Penn Treebank.

Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.

Note that the “.” character is not a reliable indicator of the end of a sentence. While one can build a regular expression based sentence detector, machine learned models are typically used to figure this out, based on a reasonable number of example sentences whose boundaries have been identified by a human.
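
To see why a period-based rule falls short, here is a deliberately naive splitter (purely illustrative, not part of Scalabha or OpenNLP) that breaks after every period followed by whitespace:

[sourcecode lang="scala"]
// Naive rule: a sentence ends at any period followed by whitespace.
val naiveSplit = (text: String) => text.split("""(?<=\.)\s+""").toList

naiveSplit("Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman.")
// -> List(Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov.,
//         29., Mr., Vinken is chairman.)
// The abbreviations "Nov." and "Mr." are wrongly treated as sentence ends.
[/sourcecode]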

Roughly and somewhat crudely speaking, a machine learned model is a set of features that are associated with real-valued weights which have been determined from some training material. Once these weights have been learned, the model can be saved and reused (e.g. see the classification homework for Applied Text Analysis).

OpenNLP has pretrained models available for several NLP tasks, including sentence detection. Note also that there is an effort I’m heading to make it possible to distribute and, where possible, rebuild models — see the OpenNLP Models Github repository.

We want to do English sentence detection, so the model we need right now is the en | Sentence Detector. Rather than putting it in some random place on your computer, we’ll add it as part of the Scalabha build system and exploit this to simplify the loading of models (more on this later). Recall that the $SCALABHA_DIR/src/main/scala directory is where the actual code of Scalabha is kept (and is also where you can add additional code to do your own tasks, as covered in the previous tutorials). If you look at the $SCALABHA_DIR/src/main directory, you’ll see an additional resources directory. Go there and list the directory contents:

[sourcecode lang=”bash”]
$ cd $SCALABHA_DIR/src/main/resources
$ ls
log4j.properties
[/sourcecode]

All that is there now is a properties file that defines default logging behavior (which is a good way to output debugging information, e.g. as it is done in the opennlp.scalabha.cluster package used in the clustering homework of Applied Text Analysis). What is very nice about the resources directory is that any files in it are accessible in the classpath of the application we are building. That won’t make total sense right away, but it will be clear as we go along — the end result is that it simplifies a number of things a great deal, so bear with me.

What we are going to do now is place the sentence detector model in a subdirectory of resources that will give us access to it, and also organize things for future additions (wrt languages and systems). So, do the following:

[sourcecode lang=”bash”]
$ cd $SCALABHA_DIR/src/main/resources
$ mkdir -p lang/eng/opennlp
$ cd lang/eng/opennlp/
$ wget http://opennlp.sourceforge.net/models-1.5/en-sent.bin
--2012-04-10 12:24:42--  http://opennlp.sourceforge.net/models-1.5/en-sent.bin
Resolving opennlp.sourceforge.net... 216.34.181.96
Connecting to opennlp.sourceforge.net|216.34.181.96|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 98533 (96K) [application/octet-stream]
Saving to: `en-sent.bin'

100%[======================================>] 98,533       411K/s   in 0.2s

2012-04-10 12:24:43 (411 KB/s) - `en-sent.bin' saved [98533/98533]
[/sourcecode]

Note: the last command uses the program wget, which may not be available on your machine. If that is the case, you can download en-sent.bin in your browser (using the link given after wget above) and move it to the directory $SCALABHA_DIR/src/main/resources/lang/eng/opennlp. (Better yet, install wget since it is so useful…)

Status check: you should now see en-sent.bin when you do the following:

[sourcecode lang=”bash”]
$ ls $SCALABHA_DIR/src/main/resources/lang/eng/opennlp
en-sent.bin
[/sourcecode]

Using the sentence detector

Let’s now use the model! That requires creating an example application that will read in the model, construct a sentence detector object from it, and then apply it to some example text. Do the following:

[sourcecode lang=”bash”]
$ touch $SCALABHA_DIR/src/main/scala/opennlp/scalabha/tag/OpenNlpTagger.scala
[/sourcecode]

This creates an empty file at that location that you should now open in a text editor. Add the following Scala code (to be explained) to that file:

[sourcecode lang=”scala”]
package opennlp.scalabha.tag

object OpenNlpTagger {
import opennlp.tools.sentdetect.SentenceDetectorME
import opennlp.tools.sentdetect.SentenceModel

lazy val sentenceDetector =
new SentenceDetectorME(
new SentenceModel(
this.getClass.getResourceAsStream("/lang/eng/opennlp/en-sent.bin")))

def main (args: Array[String]) {
val test = io.Source.fromFile(args(0)).mkString
sentenceDetector.sentDetect(test).foreach(println)
}

}
[/sourcecode]

Here are the relevant bits of explanation needed to understand what is going on. We need to import the SentenceDetectorME and SentenceModel classes (you should verify that you can find them in the OpenNLP API). The former is a class for sentence detectors that are based on trained maximum entropy models, and the latter is for holding such models. We then must create our sentence detector. This is where we get the advantage of having put it into the resources directory of Scalabha. We obtain it by getting the Class of the object (via this.getClass) and then using the getResourceAsStream method of the Class class. That’s a bit meta, but it boils down to enabling you to just follow this recipe for getting the resource. The return value of getResourceAsStream is an InputStream, which is what is needed to construct a SentenceModel.

Once we have a SentenceModel, that can be used to create a SentenceDetectorME. Note that the sentenceDetector object is declared as a lazy val. By doing this, the model is only loaded when we need it. For a small program like this one, this doesn’t matter much, but in a larger system with many components, using lazy vals allows the application to get fired up much more quickly and then load things like models on demand. (You’ll actually see a nice, concrete example of this by the end of the tutorial.) In general, using lazy vals is a good idea.
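
If the lazy distinction is new to you, this tiny REPL session (independent of OpenNLP) shows the difference in when the right-hand side gets evaluated:

[sourcecode lang="scala"]
scala> val eager = { println("initialized now"); 42 }
initialized now
eager: Int = 42

scala> lazy val deferred = { println("initialized on first use"); 42 }
deferred: Int = <lazy>

scala> deferred
initialized on first use
res0: Int = 42
[/sourcecode]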

We then just need to get some text and use the sentence detector. The application gets a file name from the command line and then reads in its contents. The sentence detector has a method sentDetect (see the API) that takes a String and returns an Array[String], where each element of the Array is a sentence. So, we run sentDetect on the input text and then print out each sentence.

Once you have added the above code to OpenNlpTagger.scala, you should compile in SBT (I recommend using ~compile so that it compiles every time you make a change). Then, do the following:

[sourcecode lang=”bash”]
$ cd /tmp
$ echo "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate." > vinken.txt
$ scalabha run opennlp.scalabha.tag.OpenNlpTagger vinken.txt
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.
[/sourcecode]

So, the model does perfectly on these sentences (but don’t expect it to do quite so well on other domains, such as Twitter). We are now ready to do the next step of splitting up the characters in each sentence into tokens.

Tokenizing

Once we have identified the sentences, we need to tokenize them to turn them into a sequence of tokens, where each token is a symbol or word (conforming to some predefined notion of what a “word” is). For example, the tokens for the first sentence of the running example are the following, with the tokens separated by spaces:

Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .

Most NLP tools then build on these units.

To enable tokenization, we must first make the English tokenizer available as a resource in Scalabha.

[sourcecode lang=”bash”]
$ cd $SCALABHA_DIR/src/main/resources/lang/eng/opennlp
$ wget http://opennlp.sourceforge.net/models-1.5/en-token.bin
--2012-04-10 14:21:14--  http://opennlp.sourceforge.net/models-1.5/en-token.bin
Resolving opennlp.sourceforge.net... 216.34.181.96
Connecting to opennlp.sourceforge.net|216.34.181.96|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 439890 (430K) [application/octet-stream]
Saving to: `en-token.bin'

100%[========================================================================>] 439,890      592K/s   in 0.7s

2012-04-10 14:21:16 (592 KB/s) - `en-token.bin' saved [439890/439890]
[/sourcecode]

Then, change OpenNlpTagger.scala to have the following contents.

[sourcecode lang=”scala”]
package opennlp.scalabha.tag

object OpenNlpTagger {
import opennlp.tools.sentdetect.SentenceDetectorME
import opennlp.tools.sentdetect.SentenceModel
import opennlp.tools.tokenize.TokenizerME
import opennlp.tools.tokenize.TokenizerModel

lazy val sentenceDetector =
new SentenceDetectorME(
new SentenceModel(
this.getClass.getResourceAsStream("/lang/eng/opennlp/en-sent.bin")))

lazy val tokenizer =
new TokenizerME(
new TokenizerModel(
this.getClass.getResourceAsStream("/lang/eng/opennlp/en-token.bin")))

def main (args: Array[String]) {
val test = io.Source.fromFile(args(0)).mkString
val sentences = sentenceDetector.sentDetect(test)
val tokenizedSentences = sentences.map(tokenizer.tokenize(_))
tokenizedSentences.foreach(tokens => println(tokens.mkString(" ")))
}

}
[/sourcecode]

The process is very similar to what was done for the sentence detector. The only difference is that we now use the tokenizer’s tokenize method on each sentence. This method returns an Array[String], where each element is a token. We thus map the Array[String] of sentences to the Array[Array[String]] of tokenizedSentences. Simple!

Make sure to test that everything is working.

[sourcecode lang=”bash”]
$ cd /tmp
$ scalabha run opennlp.scalabha.tag.OpenNlpTagger vinken.txt
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .
Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a director of this British industrial conglomerate .
[/sourcecode]

Now that we have these tokens, the input is ready for part-of-speech tagging.

Part-of-speech tagging

Part-of-speech (POS) tagging involves identifying whether each token is a noun, verb, determiner, and so on. Some part-of-speech tag sets have more detail, such as NN for a singular noun and NNS for a plural one. See the previous tutorial on iteration for more details and pointers.

The OpenNLP POS tagger is trained on the Penn Treebank, so it uses that tagset. As with the other models, we must download it and place it in the resources directory.

[sourcecode lang=”bash”]
$ cd $SCALABHA_DIR/src/main/resources/lang/eng/opennlp
$ wget http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin
--2012-04-10 14:31:33--  http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin
Resolving opennlp.sourceforge.net... 216.34.181.96
Connecting to opennlp.sourceforge.net|216.34.181.96|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5696197 (5.4M) [application/octet-stream]
Saving to: `en-pos-maxent.bin'

100%[========================================================================>] 5,696,197    671K/s   in 8.2s

2012-04-10 14:31:42 (681 KB/s) - `en-pos-maxent.bin' saved [5696197/5696197]
[/sourcecode]

Then, update OpenNlpTagger.scala to have the following contents, which involve some additional output over what you saw the previous times.

[sourcecode lang=”scala”]
package opennlp.scalabha.tag

object OpenNlpTagger {

import opennlp.tools.sentdetect.SentenceDetectorME
import opennlp.tools.sentdetect.SentenceModel
import opennlp.tools.tokenize.TokenizerME
import opennlp.tools.tokenize.TokenizerModel
import opennlp.tools.postag.POSTaggerME
import opennlp.tools.postag.POSModel

lazy val sentenceDetector =
new SentenceDetectorME(
new SentenceModel(
this.getClass.getResourceAsStream("/lang/eng/opennlp/en-sent.bin")))

lazy val tokenizer =
new TokenizerME(
new TokenizerModel(
this.getClass.getResourceAsStream("/lang/eng/opennlp/en-token.bin")))

lazy val tagger =
new POSTaggerME(
new POSModel(
this.getClass.getResourceAsStream("/lang/eng/opennlp/en-pos-maxent.bin")))

def main (args: Array[String]) {

val test = io.Source.fromFile(args(0)).mkString

println("n*********************")
println("Showing sentences.")
println("*********************")
val sentences = sentenceDetector.sentDetect(test)
sentences.foreach(println)

println("n*********************")
println("Showing tokens.")
println("*********************")
val tokenizedSentences = sentences.map(tokenizer.tokenize(_))
tokenizedSentences.foreach(tokens => println(tokens.mkString(" ")))

println("n*********************")
println("Showing POS.")
println("*********************")
val postaggedSentences = tokenizedSentences.map(tagger.tag(_))
postaggedSentences.foreach(postags => println(postags.mkString(" ")))

println("n*********************")
println("Zipping tokens and tags.")
println("*********************")
val tokposSentences =
tokenizedSentences.zip(postaggedSentences).map { case(tokens, postags) =>
tokens.zip(postags).map { case(tok,pos) => tok + "/" + pos }
}
tokposSentences.foreach(tokposSentence => println(tokposSentence.mkString(" ")))

}

}
[/sourcecode]

Everything is as before, so it should be pretty much self-explanatory. Just note that the tagger’s tag method takes a token sequence (Array[String], written as String[] in OpenNLP’s Javadoc) as its input and it returns an Array[String] of the tags for each token. Thus, when we output the postaggedSentences in the “Showing POS” part, it prints only the tags. We can then bring the tokens and their corresponding tags together by zipping the tokenizedSentences with the postaggedSentences and then zipping the word and POS tokens in each sentence together, as shown in the “Zipping tokens and tags” portion.
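
If zip is new to you, here is what it does on a toy example (an illustrative REPL session, not part of the tutorial code):

[sourcecode lang="scala"]
scala> val tokens = List("Pierre", "Vinken", ",")
tokens: List[java.lang.String] = List(Pierre, Vinken, ,)

scala> val tags = List("NNP", "NNP", ",")
tags: List[java.lang.String] = List(NNP, NNP, ,)

scala> tokens.zip(tags).map { case (tok, pos) => tok + "/" + pos }
res0: List[java.lang.String] = List(Pierre/NNP, Vinken/NNP, ,/,)
[/sourcecode]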

When the full program is run, you should get the following output.

[sourcecode lang=”bash”]
$ cd /tmp
$ scalabha run opennlp.scalabha.tag.OpenNlpTagger vinken.txt

*********************
Showing sentences.
*********************
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.

*********************
Showing tokens.
*********************
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .
Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a director of this British industrial conglomerate .

*********************
Showing POS.
*********************
NNP NNP , CD NNS JJ , MD VB DT NN IN DT JJ NN NNP CD .
NNP NNP VBZ NN IN NNP NNP , DT JJ NN NN .
NNP NNP , CD NNS JJ CC JJ NN IN NNP NNP NNP NNP , VBD VBN DT NN IN DT JJ JJ NN .

*********************
Zipping tokens and tags.
*********************
Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
Mr./NNP Vinken/NNP is/VBZ chairman/NN of/IN Elsevier/NNP N.V./NNP ,/, the/DT Dutch/JJ publishing/NN group/NN ./.
Rudolph/NNP Agnew/NNP ,/, 55/CD years/NNS old/JJ and/CC former/JJ chairman/NN of/IN Consolidated/NNP Gold/NNP Fields/NNP PLC/NNP ,/, was/VBD named/VBN a/DT director/NN of/IN this/DT British/JJ industrial/JJ conglomerate/NN ./.
[/sourcecode]

Note: You’ll probably notice a pause just after it says “Showing POS” — that is because the tagger is defined as a lazy val, so the model is loaded at that time since it is the first point where it is needed. Try removing “lazy” from the declarations of sentenceDetector, tokenizer, and tagger, recompiling and then running it again — you’ll now see that the pause before anything is done is greater, but that once it starts processing everything goes very quickly. That’s a fairly good way of seeing part of why lazy values are quite handy.

And that’s it. To see the output on a longer example, you can run it on any text you like, e.g. the ones in the Scalabha’s data directory, like the Federalist Papers:

[sourcecode lang=”bash”]

$ scalabha run opennlp.scalabha.tag.OpenNlpTagger $SCALABHA_DIR/data/cluster/federalist/federalist.txt

[/sourcecode]

Now as an exercise, turn the standalone application, defined as the object OpenNlpTagger, into a class, OpenNlpTagger, that takes a raw text as input (not via the command line, but as an argument to a method) and returns a List[List[(String,String)]] that contains the sentences, and for each sentence a sequence of (token,tag) pairs. For example, after running it on the Vinken text, you should produce the following.

[sourcecode lang=”scala”]
List(List((Pierre,NNP), (Vinken,NNP), (,,,), (61,CD), (years,NNS), (old,JJ), (,,,), (will,MD), (join,VB), (the,DT), (board,NN), (as,IN), (a,DT), (nonexecutive,JJ), (director,NN), (Nov.,NNP), (29,CD), (.,.)), List((Mr.,NNP), (Vinken,NNP), (is,VBZ), (chairman,NN), (of,IN), (Elsevier,NNP), (N.V.,NNP), (,,,), (the,DT), (Dutch,JJ), (publishing,NN), (group,NN), (.,.)), List((Rudolph,NNP), (Agnew,NNP), (,,,), (55,CD), (years,NNS), (old,JJ), (and,CC), (former,JJ), (chairman,NN), (of,IN), (Consolidated,NNP), (Gold,NNP), (Fields,NNP), (PLC,NNP), (,,,), (was,VBD), (named,VBN), (a,DT), (director,NN), (of,IN), (this,DT), (British,JJ), (industrial,JJ), (conglomerate,NN), (.,.)))
[/sourcecode]
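
If you would like a starting point, here is one possible skeleton (the method name tagText is just a suggestion, and the body is deliberately left for you to fill in):

[sourcecode lang="scala"]
package opennlp.scalabha.tag

class OpenNlpTagger {

  // Reuse the same lazy val members as the object version:
  // sentenceDetector, tokenizer, and tagger, each loaded via
  // getResourceAsStream as shown above.

  def tagText(text: String): List[List[(String, String)]] = {
    // Detect sentences, tokenize each one, tag each token sequence,
    // zip tokens with their tags, and convert the Arrays to Lists.
    sys.error("left as an exercise")
  }

}
[/sourcecode]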

Spans

You may notice that the sentence detector and tokenizer APIs both include methods that return Array[Span] (note: Span[] in OpenNLP’s Javadoc). These are preferable in many contexts since they don’t lose information from the original text, unlike the methods we used above, which reduce the original text to sequences of its pieces. Spans just record the character offsets at which the sentences start and end, or at which tokens start and end. This is quite handy for further processing and is what is generally used in non-trivial applications. But, for many cases, the methods that return Array[String] will be just fine and require learning a bit less.
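
For instance, a sketch along these lines (using the sentPosDetect and getCoveredText methods from the 1.5.2 API) recovers each sentence along with its character offsets:

[sourcecode lang="scala"]
val text = io.Source.fromFile("vinken.txt").mkString

// sentPosDetect returns Array[Span] rather than Array[String].
val sentenceSpans = sentenceDetector.sentPosDetect(text)

// Each Span records where its sentence starts and ends in the text.
sentenceSpans.foreach { span =>
  println(span.getStart + "-" + span.getEnd + ": " + span.getCoveredText(text))
}
[/sourcecode]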

Conclusion

This tutorial has taken you from a version of Scalabha that does not have the OpenNLP Tools API available to a version that does have it, along with several pretrained models and an example application that uses the API for part-of-speech tagging. You can of course follow similar recipes for bringing in other libraries and using them in your code, so this setup gives you a lot of power and is easy to use once you’ve done it a few times. If you have any trouble, or want to check it against a definitely working version, get Scalabha v0.2.4, which differs from v0.2.3 primarily with respect to the changes made in this tutorial.

A final note: you may be wondering what the heck OpenNLP is, given that Scalabha’s package names start with opennlp.scalabha, but we were adding the OpenNLP Tools as a dependency. Basically, Gann Bierner and I started OpenNLP in 1999, and part of the goal of that was to provide a high-level organizational domain name so that we could ensure uniqueness in package names. So, we have opennlp.tools, opennlp.maxent, opennlp.scalabha, and there are others. These are thus clearly different, in terms of their unique package paths, from foo.tools, foo.maxent, and so on. So, when I started Scalabha, I used opennlp.scalabha (though in all likelihood, no one else would pick scalabha as the top level of a package name). Nonetheless, when one speaks of OpenNLP generally, it usually refers to the OpenNLP Tools, the first of the projects in the OpenNLP “family”.

Copyright 2012 Jason Baldridge

The text of this tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original tutorial.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.

First steps in Scala for beginning programmers, Part 2

Topics: Tuples, Lists, methods on Lists and Strings

Preface

This is the second in a planned series of tutorials on programming in Scala for first-time programmers, with specific reference to my Fall 2011 course Introduction to Computational Linguistics. You can see the other tutorials here on this blog; they are also listed on the course’s links page.

This tutorial focuses on Tuples and Lists, which are two constructs for working with groups of elements. You won’t get much done without the latter, and the former are so incredibly useful that you’ll probably find yourself using them a lot.

Tuples

We saw in the previous tutorial how a single value can be assigned to a variable and then used in various contexts. A Tuple is a generalization of that: a collection of two, three, four, or more values. Each value can have its own type.

[sourcecode language=”scala”]
scala> val twoInts = (3,9)
twoInts: (Int, Int) = (3,9)

scala> val twoStrings = ("hello", "world")
twoStrings: (java.lang.String, java.lang.String) = (hello,world)

scala> val threeDoubles = (3.14, 11.29, 1.5)
threeDoubles: (Double, Double, Double) = (3.14,11.29,1.5)

scala> val intAndString = (7, "lucky number")
intAndString: (Int, java.lang.String) = (7,lucky number)

scala> val mixedUp = (1, "hello", 1.16)
mixedUp: (Int, java.lang.String, Double) = (1,hello,1.16)
[/sourcecode]

The elements of a Tuple can be recovered in a few different ways. One way is to use a Tuple when initializing some variables, each of which takes on the value of the corresponding position in the Tuple on the right side of the equal sign.

[sourcecode language=”scala”]
scala> val (first, second) = twoInts
first: Int = 3
second: Int = 9

scala> val (numTimes, thingToSay, price) = mixedUp
numTimes: Int = 1
thingToSay: java.lang.String = hello
price: Double = 1.16
[/sourcecode]

Scala peels off the values and assigns them to each of the single variables. This becomes very useful in the context of functions that return Tuples. For example, consider a function that provides the left and right edges of a range when you give it the midpoint of the range and the size of the interval on each side of the midpoint.

[sourcecode language=”scala”]
scala> def rangeAround(midpoint: Int, size: Int) = (midpoint - size, midpoint + size)
rangeAround: (midpoint: Int, size: Int)(Int, Int)
[/sourcecode]

Since rangeAround returns a Tuple (specifically, a Pair), we can call it and set variables for the left and right directly from the function call.

[sourcecode language=”scala”]
scala> val (left, right) = rangeAround(21, 3)
left: Int = 18
right: Int = 24
[/sourcecode]

Another way to access the values in a Tuple is via indexation, using “_n” where n is the index of the item you want.

[sourcecode language=”scala”]
scala> print(mixedUp._1)
1
scala> print(mixedUp._2)
hello
scala> print(mixedUp._3)
1.16
[/sourcecode]

The syntax on this is a bit odd, but you’ll get used to it.

Tuples are an amazingly useful feature in a programming language. You’ll see some examples of their utility as we progress.

Lists

Lists are collections of ordered items that will be familiar to anyone who has done any shopping. Tuples are obviously related to lists, but they are less versatile in that they must be created in a single statement, they have a bounded length (at most 22 elements in Scala), and they don’t support operations that perform computations on all of their elements, as the example below illustrates.
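
For instance, a List of numbers can be summed (using the sum method we’ll meet later in this tutorial), but a Tuple cannot. An illustrative REPL session, with the error output slightly abbreviated:

[sourcecode lang="scala"]
scala> List(1, 2, 3).sum
res0: Int = 6

scala> (1, 2, 3).sum
<console>:8: error: value sum is not a member of (Int, Int, Int)
       (1, 2, 3).sum
[/sourcecode]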

In Scala, we can create lists of Strings, Ints, and Doubles (and more).

[sourcecode language=”scala”]
scala> val groceryList = List("apples", "milk", "butter")
groceryList: List[java.lang.String] = List(apples, milk, butter)

scala> val odds = List(1,3,5,7,9)
odds: List[Int] = List(1, 3, 5, 7, 9)

scala> val multinomial = List(.2, .4, .15, .25)
multinomial: List[Double] = List(0.2, 0.4, 0.15, 0.25)
[/sourcecode]

We see that Scala responds that a List has been created, along with brackets around the type of the elements it contains. So, List[Int] is read as “a List of Ints” and so on. This is to say that List is a parameterized data structure: it is a container that holds elements of specific types. We’ll see how knowing this allows us to do different things with Lists parameterized by different types.

We can also create Lists with mixtures of types.

[sourcecode language=”scala”]
scala> val intsAndDoubles = List(1, 1.5, 2, 2.5)
intsAndDoubles: List[Double] = List(1.0, 1.5, 2.0, 2.5)

scala> val today = List("August", 23, 2011)
today: List[Any] = List(August, 23, 2011)
[/sourcecode]

Types are sometimes autoconverted, such as converting Ints to Doubles for intsAndDoubles, but often there is no obvious generalizable type. For example, today is a List[Any], which means it is a List of Anys — and Any is the most general type in Scala, the supertype of all types. It’s sort of like saying “Yeah, I have a list of… well, you know… stuff.”
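
A consequence is that the elements of a List[Any] support only what Any itself supports, so type-specific methods are off limits without extra work. A quick illustrative REPL check (using the element-access notation covered below):

[sourcecode lang="scala"]
scala> today(0)
res0: Any = August

scala> today(0).length
<console>:9: error: value length is not a member of Any
       today(0).length
[/sourcecode]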

Lists can also contain Lists (and Lists of Lists, and Lists of Lists of Lists…).

[sourcecode language=”scala”]
scala> val embedded = List(List(1,2,3), List(10,30,50), List(200,400), List(1000))
embedded: List[List[Int]] = List(List(1, 2, 3), List(10, 30, 50), List(200, 400), List(1000))
[/sourcecode]

The type of embedded is List[List[Int]], which you can read as “a List of Lists of Ints.”

List methods

Okay, so now that we have some lists, what can we do with them? A lot, actually. One of the most basic properties of a list is its length, which you can get by using “.length” after the variable that refers to the list.

[sourcecode language=”scala”]
scala> groceryList.length
res19: Int = 3

scala> odds.length
res20: Int = 5

scala> embedded.length
res21: Int = 4
[/sourcecode]

Notice that the length of embedded is 4, which is the number of Lists it contains (not the number of elements in those lists).

The notation variable.method indicates that you are invoking a function that is specific to the type of that variable on the value in that variable. Okay, that was a mouthful. Scala is an object-oriented language, which means that every value has a set of actions that comes with it. Which actions are available depends on its type. So, above, we called the length method that is available to Lists on each of the list values given above. You didn’t realize it in the previous tutorial, but you were using methods when you added Ints or concatenated Strings — it’s just that Scala allows us to go without “.” and parentheses in certain cases. If we don’t drop them, here’s what it looks like.

[sourcecode language=”scala”]
scala> (2).+(3)
res25: Int = 5

scala> "Portis".+("head")
res26: java.lang.String = Portishead
[/sourcecode]

What is going on is that Ints have a method called “+” and Strings have a different method called “+“. They could have been called “bill” and “bob”, but that would be harder to remember, among other things. Ints have other methods, such as “-“, “*“, and “/“, that Strings don’t have. (Note: I’m now returning to omitting the “.” and parentheses.)

[sourcecode language=”scala”]
scala> 5-3
res27: Int = 2

scala> "walked" – "ed"
<console>:8: error: value – is not a member of java.lang.String
"walked" – "ed"
[/sourcecode]

Scala complains that we tried to use the “-” method on a String, since Strings don’t have such a method. On the other hand, Ints don’t have a method called length, while Strings do.

[sourcecode language=”scala”]
scala> 5.length
<console>:8: error: value length is not a member of Int
5.length
^

scala> "walked".length
res31: Int = 6
[/sourcecode]

With Strings, length returns the number of characters, whereas with Lists, it is the number of elements. The String length method could have been called “numberOfCharacters”, but “length” is easier to remember and it allows us to treat Strings like other sequences and think of them similarly.

Let’s return to Lists and what we can do with them. “Addition” of two lists is their concatenation and is indicated with “++“.

[sourcecode language=”scala”]
scala> val evens = List(2,4,6,8)
evens: List[Int] = List(2, 4, 6, 8)

scala> val nums = odds ++ evens
nums: List[Int] = List(1, 3, 5, 7, 9, 2, 4, 6, 8)
[/sourcecode]

We can prepend a single item to the front of a List with “::“.

[sourcecode language=”scala”]
scala> val zeroToNine = 0 :: nums
zeroToNine: List[Int] = List(0, 1, 3, 5, 7, 9, 2, 4, 6, 8)
[/sourcecode]

We can sort a list with sorted, reverse it with reverse, and even do both in sequence.

[sourcecode language=”scala”]
scala> zeroToNine.sorted
res42: List[Int] = List(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)

scala> zeroToNine.reverse
res43: List[Int] = List(8, 6, 4, 2, 9, 7, 5, 3, 1, 0)

scala> zeroToNine.sorted.reverse
res44: List[Int] = List(9, 8, 7, 6, 5, 4, 3, 2, 1, 0)
[/sourcecode]

What the last line says is “take zeroToNine, get a new sorted list from it, and then reverse that list.” Notice that calling these functions never changes zeroToNine itself! That is because List is immutable: you cannot change it, so all of these operations return new Lists. This property of Lists brings with it many benefits that we’ll return to later.

Note: immutability is different from the val/var distinction. It is common to think that a val variable is immutable, but strictly speaking a val is fixed: the variable itself cannot be reassigned, regardless of whether the value it refers to can change. The following examples all involve immutable Lists, but the fixed variable is a val while the reassignable variable is a var.

[sourcecode language=”scala”]
scala> val fixed = List(1,2)
fixed: List[Int] = List(1, 2)

scala> fixed = List(3,4)
<console>:8: error: reassignment to val
fixed = List(3,4)
^

scala> var reassignable = List(5,6)
reassignable: List[Int] = List(5, 6)

scala> reassignable = List(7,8)
reassignable: List[Int] = List(7, 8)
[/sourcecode]

One of the things one frequently wants to do with a list is access its elements directly. This is done via indexation into the list, starting with 0 for the first element, 1 for the second element, and so on.

[sourcecode language=”scala”]
scala> odds
res48: List[Int] = List(1, 3, 5, 7, 9)

scala> odds(0)
res49: Int = 1

scala> odds(1)
res50: Int = 3
[/sourcecode]

Starting with 0 for the index of the first element is standard practice in computer science. It might seem strange at first, but you’ll get used to it fairly quickly.

We can of course use any Int expression to access an item in a list.

[sourcecode language=”scala”]
scala> zeroToNine(3)
res63: Int = 5

scala> zeroToNine(5-2)
res64: Int = 5

scala> val index = 3
index: Int = 3

scala> zeroToNine(index)
res65: Int = 5
[/sourcecode]

If we ask for an index that is equal to or greater than the number of elements in the list, we get an error.

[sourcecode language=”scala”]
scala> odds(10)
java.lang.IndexOutOfBoundsException: 10
at scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:51)
at scala.collection.immutable.List.apply(List.scala:45)
at .<init>(<console>:9)
at .<clinit>(<console>)
at .<init>(<console>:11)
at .<clinit>(<console>)
at $export(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:592)
at scala.tools.nsc.interpreter.IMain$Request$$anonfun$10.apply(IMain.scala:828)
at scala.tools.nsc.interpreter.Line$$anonfun$1.apply$mcV$sp(Line.scala:43)
at scala.tools.nsc.io.package$$anon$2.run(package.scala:31)
at java.lang.Thread.run(Thread.java:680)
[/sourcecode]

Looking at all that, you might be thinking “WTF?” It’s called the stack trace, and it gives you a detailed breakdown of where problems happened in a bit of code. For beginning programmers, this is likely to look overwhelming and intimidating — you can safely skim over it for now, but before long, it will be necessary to be able to use the stack trace to identify problems in your code and address them.

Another useful method is slice, which gives you a sublist from one index up to, but not including, another.

[sourcecode language=”scala”]
scala> zeroToNine
res55: List[Int] = List(0, 1, 3, 5, 7, 9, 2, 4, 6, 8)

scala> zeroToNine.slice(2,6)
res56: List[Int] = List(3, 5, 7, 9)
[/sourcecode]

So, the slice gave us a list with the elements from index 2 (the third element) up to index 5 (the sixth element).

Returning briefly to Strings — List methods other than length work with them too.

[sourcecode language=”scala”]
scala> val artist = "DJ Shadow"
artist: java.lang.String = DJ Shadow

scala> artist(3)
res0: Char = S

scala> artist.slice(3,6)
res1: String = Sha

scala> artist.reverse
res2: String = wodahS JD

scala> artist.sorted
res3: String = " DJSadhow"
[/sourcecode]

On lists that contain numbers, we can use the sum method.

[sourcecode language=”scala”]
scala> odds.sum
res59: Int = 25

scala> multinomial.sum
res60: Double = 1.0
[/sourcecode]

However, if the list contains non-numeric values, sum isn’t valid.

[sourcecode language=”scala”]
scala> groceryList.sum
<console>:9: error: could not find implicit value for parameter num: Numeric[java.lang.String]
groceryList.sum
^
[/sourcecode]

What is going on is some very cool and useful automagical behavior by Scala involving implicits. We’ll come back to that later, but for now you can happily use sum on Lists of Ints and Doubles.

One thing we often want to do with lists is obtain a String representation of their contents in some visually useful way. For example, we might want a grocery list to be a String with one item per line, or a list of Ints to have a comma between each element. The mkString method does just what we need.

[sourcecode language=”scala”]
scala> groceryList.mkString("\n")
res22: String =
apples
milk
butter

scala> odds.mkString(",")
res23: String = 1,3,5,7,9
[/sourcecode]

Want to know if a list contains a particular element? Use contains on the list.

[sourcecode language=”scala”]
scala> groceryList.contains("milk")
res4: Boolean = true

scala> groceryList.contains("coffee")
res5: Boolean = false
[/sourcecode]

And now we arrive at Booleans, another of the most important basic types. They play a major role in conditional execution, which we’ll cover in the next tutorial.

There are actually many more methods available for lists, which you can see by going to the entry for List in the Scala API. API stands for Application Programming Interface — in other words a collection of specifications for what you can do with various components of the Scala programming language. I’m going to do my best to give you the methods you need for now, but eventually you will need to be able to look at the API entries for Scala types to see what methods are available, what they do and how to use them.

Some of the most important methods on Lists we haven’t covered are map, filter, foldLeft, and reduce. We’ll come back to them in detail later, but for now here is a teaser that should give you an intuitive sense of what they do.

[sourcecode language=”scala”]
scala> val odds = List(1,3,5,7,9)
odds: List[Int] = List(1, 3, 5, 7, 9)

scala> odds.map(1+)
res6: List[Int] = List(2, 4, 6, 8, 10)

scala> odds.filter(4<)
res7: List[Int] = List(5, 7, 9)

scala> odds.foldLeft(10)(_ + _)
res8: Int = 35

scala> odds.filter(6>).map(_.toString).reduce(_ + "," + _)
res9: java.lang.String = 1,3,5
[/sourcecode]

Now we’re getting functional. 🙂

Copyright 2011 Jason Baldridge

The text of this tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original tutorial.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.