Bcomposes is now self-hosted!

I finally got around to moving my blog from the free WordPress world to my own self-hosted WP instance! I’ve changed the theme and plugins, and the code examples broke, so I’ll be fixing those over time. (I already fixed the TensorFlow tutorial since that is my most popular post these days.)

Now that I’m officially gone from the University of Texas at Austin (to focus on my startup, People Pattern), and am no longer conducting courses at the university, I will probably return to blogging a bit more to satisfy my teaching itch. I may even be updating and packaging up some old posts and homeworks as books and/or courses. Stay tuned, and let me know in the comments section below if there is anything in particular you are interested in!

Simple end-to-end TensorFlow examples

A walk-through with code for using TensorFlow on some simple simulated data sets.

I’ve been reading papers about deep learning for several years now, but until recently hadn’t dug in and implemented any models using deep learning techniques for myself. To remedy this, I started experimenting with Deeplearning4J a few weeks ago, but with limited success. I read more books, primers and tutorials, especially the amazing series of blog posts by Chris Olah and Denny Britz. Then, with incredible timing for me, Google released TensorFlow to much general excitement. So, I figured I’d give it a go, especially given Delip Rao’s enthusiasm for it—he even said that moving from Theano to TensorFlow felt like going from “a Honda Civic to a Ferrari.”

Here’s a quick prelude before getting to my initial simple explorations with TensorFlow. As most people (hopefully) know, deep learning encompasses ideas going back many decades (done under the names of connectionism and neural networks) that only became viable at scale in the past decade with the advent of faster machines and some algorithmic innovations. I was first introduced to them in a class taught by my PhD advisor, Mark Steedman, at the University of Pennsylvania in 1997. He was especially interested in how they could be applied to language understanding, which he wrote about in his 1999 paper “Connectionist Sentence Processing in Perspective.” I wish I understood more about that topic (and many others) back then, but then again that’s the nature of being a young grad student. Anyway, Mark’s interest in connectionist language processing arose in part from being on the dissertation committee of James Henderson, who completed his thesis “Description Based Parsing in a Connectionist Network” in 1994. James was a post-doc in the Institute for Research in Cognitive Science at Penn when I arrived in 1996. As a young grad student, I had little idea of what connectionist parsing entailed, and my understanding from more senior (and far more knowledgeable) students was that James’ parsers were really interesting but that he had trouble getting the models to scale to larger data sets—at least compared to the data-driven parsers that others like Mike Collins and Adwait Ratnaparkhi were building at Penn in the mid-1990s. (Side note: for all the kids using logistic regression for NLP out there, you probably don’t know that Adwait was the one who first applied LR/MaxEnt to several NLP problems in his 1998 dissertation “Maximum Entropy Models for Natural Language Ambiguity Resolution”, in which he demonstrated how amazingly effective it was for everything from classification to part-of-speech tagging to parsing.)

Back to TensorFlow and the present day. I flew from Austin to Washington DC last week, and the morning before my flight I downloaded TensorFlow, made sure everything compiled, downloaded the necessary datasets, and opened up a bunch of tabs with TensorFlow tutorials. My goal was, while on the airplane, to run the tutorials, get a feel for the flow of TensorFlow, and then implement my own networks for doing some made-up classification problems. I came away from the exercise extremely pleased. This post explains what I did and gives pointers to the code to make it happen. My goal is to help out people who could use a bit more explicit instruction and guidance using a complete end-to-end example with easy-to-understand data. I won’t give lots of code examples in this post as there are several tutorials that already do that quite well—the value here is in the simple end-to-end implementations, the data to go with them, and a bit of explanation along the way.

As a preliminary, I recommend going to the excellent TensorFlow documentation, installing TensorFlow, and running the first example. If you can do that, you should be able to run the code I’ve provided to go along with this post in my try-tf repository on Github.

Simulated data

As a researcher who works primarily on empirical methods in natural language processing, my usual tendency is to try new software and ideas out on language data sets, e.g. text classification problems and the like. However, after hanging out with a statistician like James Scott for many years, I’ve come to appreciate the value of using simulated datasets early on to reduce the number of unknowns while getting the basics right. So, when sitting down with TensorFlow, I wanted to try three simulated data sets: linearly separable data, moon data and saturn data. The first is data that linear classifiers can handle easily, while the latter two require the introduction of non-linearities enabled by models like multi-layer neural networks. Here’s what they look like, with brief descriptions.

The linear data has two clusters that can be separated by a diagonal line from top left to bottom right:

linear_data_train.jpg

Linear classifiers like perceptrons, logistic regression, linear discriminant analysis, support vector machines and others do well with this kind of data because learning these lines (hyperplanes) is exactly what they do.

The moon data has two clusters in crescent shapes that are tangled up such that no line can keep all the orange dots on one side without also including blue dots.

moon_data_train.jpg

Note: see Implementing a Neural Network from Scratch in Python for a discussion of working with the moon data using Theano.

The saturn data has a core cluster representing one class and a ring cluster representing the other.

saturn_data_train.jpg

With the saturn data, a line is catastrophically bad. Perhaps the best one can do is draw a line that has all the orange points to one side. This ensures a small, entirely blue side, but it leaves the majority of blue dots in orange territory.

Example data has been generated in try-tf/simdata for each of these datasets, including a training set and test set for each. These are for the two-dimensional cases visualized above, but you can use the scripts in that directory to generate data with other parameters, including more dimensions, greater variances, etc. See the commented-out code for help visualizing the outputs, or adapt plot_data.R, which visualizes 2-d data in CSV format. See the README for instructions.
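If you’d rather see the idea than read the scripts, here’s a minimal sketch (in Scala, not the repo’s actual generation code, and with made-up radii and noise levels) of how data like the saturn set can be simulated: a Gaussian blob at the origin for one class, and points at a noisy fixed radius with random angles for the other.

[sourcecode language=”scala”]
import scala.util.Random

object SaturnSim {
  val rng = new Random(42)

  // label 0: Gaussian core at the origin; label 1: noisy ring around it
  def sample(label: Int): (Int, Double, Double) = {
    val radius =
      if (label == 0) math.abs(rng.nextGaussian) * 0.5
      else 2.5 + rng.nextGaussian * 0.25
    val theta = rng.nextDouble * 2 * math.Pi
    (label, radius * math.cos(theta), radius * math.sin(theta))
  }

  def main(args: Array[String]) {
    for (i <- 1 to 10) {
      val (label, x, y) = sample(i % 2)
      println(label + "," + x + "," + y) // label plus features, one row per instance
    }
  }
}
[/sourcecode]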

Related: check out Delip Rao’s post on learning arbitrary lambda expressions.

Softmax regression

Let’s start with a network that can handle the linear data, which I’ve written in softmax.py. The TensorFlow page has pretty good instructions for how to define a single layer network for MNIST, but no end-to-end code that defines the network, reads in data (consisting of label plus features), trains and evaluates the model. I found writing this to be a good way to familiarize myself with the TensorFlow Python API, so I recommend trying it yourself before looking at my code and then referring to it if you get stuck.

Let’s run it and see what we get.

$ python softmax.py --train simdata/linear_data_train.csv --test simdata/linear_data_eval.csv
Accuracy: 0.99

This performs one pass (epoch) over the training data, so parameters were only updated once per example. 99% is good held-out accuracy, but allowing two training epochs gets us to 100%.

$ python softmax.py --train simdata/linear_data_train.csv --test simdata/linear_data_eval.csv --num_epochs 2
Accuracy: 1.0

There’s a bit of code in softmax.py to handle options and read in data. The most important lines are the ones that define the input data, the model, and the training step. I simply adapted these from the MNIST beginners tutorial, but softmax.py puts it all together and provides a basis for transitioning to the network with a hidden layer discussed later in this post.
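If it helps to see the math those lines express spelled out, here it is written by hand in plain Scala (a sketch of the computation, not the TensorFlow code in softmax.py): the forward pass softmax(Wx+b) and one gradient-descent update of W and b on the cross-entropy loss, for a single made-up instance.

[sourcecode language=”scala”]
object SoftmaxStep {

  def softmax(z: Array[Double]): Array[Double] = {
    val exps = z.map(math.exp)
    exps.map(_ / exps.sum)
  }

  def main(args: Array[String]) {
    // 2 features x 2 classes, zero-initialized, and one made-up instance
    val w = Array(Array(0.0, 0.0), Array(0.0, 0.0))
    val b = Array(0.0, 0.0)
    val x = Array(0.5, -1.2)
    val label = 0
    val lr = 0.1

    // forward pass: p = softmax(Wx + b)
    val z = Array(0, 1).map(k => x(0) * w(0)(k) + x(1) * w(1)(k) + b(k))
    val p = softmax(z)

    // backward pass: for cross-entropy loss the gradient w.r.t. z_k is
    // (p_k - y_k), so each update is that error signal times the input
    for (k <- 0 to 1) {
      val err = p(k) - (if (k == label) 1.0 else 0.0)
      w(0)(k) -= lr * err * x(0)
      w(1)(k) -= lr * err * x(1)
      b(k) -= lr * err
    }
    println("p = " + p.mkString(", ") + "; updated b = " + b.mkString(", "))
  }
}
[/sourcecode]

TensorFlow derives the equivalent gradient computations automatically from the model definition, which is much of its appeal.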

To see a little more, let’s turn on the verbose flag and run for 5 epochs.

$ python softmax.py --train simdata/linear_data_train.csv --test simdata/linear_data_eval.csv --num_epochs 5 --verbose True
Initialized!

Training.
0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49

Weight matrix.
[[-1.87038445 1.87038457]
[-2.23716712 2.23716712]]

Bias vector.
[ 1.57296884 -1.57296848]

Applying model to first test instance.
Point = [[ 0.14756215 0.24351828]]
Wx+b = [[ 0.7521798 -0.75217938]]
softmax(Wx+b) = [[ 0.81822371 0.18177626]]

Accuracy: 1.0

Consider first the weights and bias. Intuitively, the classifier should find a separating hyperplane between the two classes, and it probably isn’t immediately obvious how W and b define that. For now, consider only the first column, with w1=-1.87038445, w2=-2.23716712 and b=1.57296884. Recall that w1 is the parameter for the `x` dimension and w2 is for the `y` dimension. The separating hyperplane satisfies Wx+b=0, from which we can derive the standard y=mx+b form.

Wx + b = 0
w1*x + w2*y + b = 0
w2*y = -w1*x - b
y = (-w1/w2)*x - b/w2

For the parameters learned above, we have the line:

y = -0.8360504*x + 0.7031074
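As a quick sanity check, a few lines of Scala (paste them into the Scala REPL) recover both numbers from the learned parameters, and also show why the softmax output for the first test instance behaves like a logistic function: with two roughly symmetric scores, only their difference matters.

[sourcecode language=”scala”]
val (w1, w2, b) = (-1.87038445, -2.23716712, 1.57296884)
println(-w1 / w2) // slope: -0.8360504...
println(-b / w2)  // intercept: 0.7031074...

// the two scores printed above are z and -z (up to rounding), and
// softmax over (z, -z) reduces to the logistic function 1 / (1 + exp(-2z))
val z = 0.7521798 // Wx+b for the first test instance above
println(1.0 / (1.0 + math.exp(-2 * z))) // 0.8182..., matching softmax(Wx+b)
[/sourcecode]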

Here’s the plot with the line, showing it is an excellent fit for the training data.

linear_data_hyperplane.jpg

The second column of weights and bias separates the data at the same boundary as the first, but with the orientation flipped 180 degrees (the class labels swap sides). Strictly speaking, it is redundant to have two output nodes, since a multinomial distribution with n outputs can be represented with n-1 parameters (see section 9.3 of Andrew Ng’s notes on supervised learning for details). Nonetheless, it’s convenient to define the network this way.

Finally, let’s try the softmax network on the moon and saturn data.

$ python softmax.py --train simdata/moon_data_train.csv --test simdata/moon_data_eval.csv --num_epochs 2
Accuracy: 0.856

$ python softmax.py --train simdata/saturn_data_train.csv --test simdata/saturn_data_eval.csv --num_epochs 2
Accuracy: 0.45

As expected, it doesn’t work very well!

Network with a hidden layer

The program hidden.py implements a network with a single hidden layer, and you can set the size of the hidden layer from the command line. The essential addition over softmax.py is sketched below.
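In scalar form, the forward computation looks something like this (a plain-Scala sketch of the math, not the TensorFlow code in hidden.py): the input passes through a tanh hidden layer, whose output feeds the same softmax layer as before.

[sourcecode language=”scala”]
// Forward pass for one instance through a single tanh hidden layer (sketch).
def softmax(z: Array[Double]): Array[Double] = {
  val exps = z.map(math.exp)
  exps.map(_ / exps.sum)
}

def forward(x: Array[Double],
            w1: Array[Array[Double]], b1: Array[Double], // input -> hidden
            w2: Array[Array[Double]], b2: Array[Double]  // hidden -> output
           ): Array[Double] = {
  // hidden activations: h_j = tanh(sum_i x_i * w1_ij + b1_j)
  val h = b1.indices.toArray.map { j =>
    math.tanh(x.indices.map(i => x(i) * w1(i)(j)).sum + b1(j))
  }
  // output: softmax over the hidden layer, exactly as in the softmax network
  softmax(b2.indices.toArray.map { k =>
    h.indices.map(j => h(j) * w2(j)(k)).sum + b2(k)
  })
}
[/sourcecode]

Let’s try it first with a two-node hidden layer on the moon data.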

$ python hidden.py --train simdata/moon_data_train.csv --test simdata/moon_data_eval.csv --num_epochs 100 --num_hidden 2
Accuracy: 0.88

So, that was an improvement over the softmax network. Let’s run it again, exactly the same way.

$ python hidden.py --train simdata/moon_data_train.csv --test simdata/moon_data_eval.csv --num_epochs 100 --num_hidden 2
Accuracy: 0.967

Very different! What we are seeing is the effect of random initialization, which has a large effect on the learned parameters given the small, low-dimensional data we are dealing with here. (The network uses Xavier initialization for the weights.) Let’s try again but using three nodes.

$ python hidden.py --train simdata/moon_data_train.csv --test simdata/moon_data_eval.csv --num_epochs 100 --num_hidden 3
Accuracy: 0.969

If you run this several times, the results don’t vary much and hover around 97%. The additional node increases the representational capacity and makes the network less sensitive to initial weight settings.

Adding more nodes doesn’t change results much—see the WildML post using the moon data for some nice visualizations of the boundaries being learned between the two classes for different hidden layer sizes.

So, a hidden layer does the trick! Let’s see what happens with the saturn data.

$ python hidden.py --train simdata/saturn_data_train.csv --test simdata/saturn_data_eval.csv --num_epochs 50 --num_hidden 2
Accuracy: 0.76

With just two hidden nodes, we already have a substantial boost from the 45% achieved by softmax regression. With 15 hidden nodes, we get 100% accuracy. There is considerable variation from run to run (due to random initialization). As with the moon data, there is less variation as nodes are added. Here’s a plot showing the increase in performance from 1 to 15 nodes, including ten accuracy measurements for each node count.

hidden_node_curve.jpg

The line through the middle is the average accuracy measurement for each node count.

Initialization and activation functions are important

My first attempt at a network with a hidden layer was to merge what I had done in softmax.py with the network in mnist.py, provided with the TensorFlow tutorials. This was a useful exercise that gave me a better feel for the TensorFlow Python API and its programming model. However, I found that I needed upwards of 25 hidden nodes in order to reliably get >96% accuracy on the moon data.

I then looked back at the WildML moon example and figured something was quite wrong since just three hidden nodes were sufficient there. The differences were that the MNIST example initializes its hidden layers with truncated normals instead of normals divided by the square root of the input size, initializes biases at 0.1 instead of 0 and uses ReLU activations instead of tanh. By switching to Xavier initialization (using Delip’s handy function), 0 biases, and tanh, everything worked as in the WildML example. I’m including my initial version in the repo as truncnorm_hidden.py so that others can see the difference and play around with it. (It turns out that what matters most is the initialization of the weights.)
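To make the weight scheme concrete, here is a back-of-the-envelope sketch in plain Scala of the Xavier-style initialization described above (Delip’s actual helper is TensorFlow code; this just illustrates the scaling):

[sourcecode language=”scala”]
import scala.util.Random

// Xavier-style initialization (sketch): weights drawn from a normal
// distribution scaled by 1/sqrt(number of inputs), biases started at zero.
def xavierInit(numInputs: Int, numOutputs: Int, rng: Random = new Random) = {
  val scale = 1.0 / math.sqrt(numInputs.toDouble)
  val weights = Array.fill(numInputs, numOutputs)(rng.nextGaussian * scale)
  val biases = Array.fill(numOutputs)(0.0)
  (weights, biases)
}
[/sourcecode]

The point of the scaling is to keep the variance of each node’s summed input roughly constant no matter how many inputs feed it, which keeps tanh units away from their flat, saturated regions early in training.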

This is a simple example of what is often discussed with deep learning methods: they can work amazingly well, but they are very sensitive to initialization and choices about the sizes of layers, activation functions, and the influence of these choices on each other. They are a very powerful set of techniques, but they (still) require finesse and understanding, compared to, say, many linear modeling toolkits that can effectively be used as black boxes these days.

Conclusion

I walked away from this exercise very encouraged! I’ve been programming in Scala mostly for the last five years, so it required dusting off my Python (which I taught in my classes at UT Austin from 2005 to 2011, e.g. Computational Linguistics I and Natural Language Processing), but I found it quite straightforward. Since I work primarily with language processing tasks, I’m perfectly happy with Python: it’s a great language for munging language data into the inputs needed by packages like TensorFlow. Also, Python works well as a DSL for working with deep learning (it seems like there is a new Python deep learning package announced every week these days). It took me less than four hours to go through initial examples, and then build the softmax and hidden networks and apply them to the three data sets. (And a bunch of that time was me remembering how to do things in Python.)

I’m now looking forward to trying deep learning models, especially convnets and LSTMs, on language and image tasks. I’m also going to go back to my Scala code for trying out Deeplearning4J to see if I can get these simulation examples to run as I’ve shown here with TensorFlow. (I would welcome pull requests if someone else gets to that first!) As a person who works primarily on the JVM, it would be very handy to be able to work with DL4J as well.

After that, maybe I’ll write up the recurring rant going on in my head about deep learning not removing the need for feature engineering (as many backpropagandists seem to like to claim), but instead changing the nature of feature engineering, as well as providing a really cool set of new capabilities and tricks.

Using Twitter4j with Scala to perform user actions

Topics: twitter, twitter4j, word clouds

Introduction

My previous post showed how to use Twitter4j in Scala to access Twitter streams. This post shows how to control a Twitter user’s actions using Twitter4j. The primary purpose of this functionality is perhaps to create interfaces for Twitter like TweetDeck, but it can also be used to create bots that take automated actions on Twitter (one bot I’m playing around with is @tshrdlu, using the code in this tutorial and the code in the tshrdlu repository).

This post will only cover a small portion of the things you can do, but they are some of the more common things and I include a couple of simple but interesting use cases. Once you have these things in place, it is straightforward to figure out how to use the Twitter4j API docs (and Stack Overflow) to do the rest.

Getting set up: code and authorization

Rather than having the reader build the code up while going through the tutorial, I’ve set up the code in the repository twitter4j-tutorial. The version needed for this tutorial is v0.2.0. You can download a tarball of that version, which may be easier to work with if there have been further developments to the repository since the writing of this tutorial. Check out or download that code now. The main file of interest is:

  • src/main/scala/TwitterUser.scala

This tutorial is mainly a walk-through of that file in blog form, with some additional pointers and explanations here and there.

You also need to set up the authorization details. See “Setting up authorization” section of the previous post to do this if you haven’t already.

READ THE FOLLOWING

IMPORTANT: for this tutorial you must set the permissions for your application to be “Read and Write”. This does NOT mean to use ‘chmod’. It means going to the Twitter developers application site, signing in with your Twitter account, clicking on “Settings” and setting the permissions to read and write.

OKAY, THANKS FOR PAYING ATTENTION

In the previous tutorial, authorization details were put into code. This time, we’ll use a twitter4j.properties file. This is easy: just add a file with that name to the twitter4j-tutorial directory with the following contents, substituting your details as appropriate.

[sourcecode language=”bash”]
oauth.consumerKey=[your consumer key here]
oauth.consumerSecret=[your consumer secret here]
oauth.accessToken=[your access token here]
oauth.accessTokenSecret=[your access token secret here]
[/sourcecode]

Rate limits and a note of caution

Unlike streaming access to Twitter, performing user actions via the API is subject to rate limits. Once you hit your limit, Twitter will throw an exception and refuse to comply with your requests until a period of time has passed (usually 15 minutes). Twitter does this to limit bad bots and also preserve their computational resources. For more information on rate limits, see Twitter’s page about rate limiting.

I’ll discuss how to manage rate limits later in the post, but I mention them up front in case you exceed them while messing around with things early on.

A word of caution is also in order: since you are going to be able to take actions automatically, like following users, posting a status, and retweeting, you could end up doing many of these actions in rapid succession. This will (a) use up your rate limit very quickly, (b) probably not be interesting behavior, and (c) could get your account suspended. Make sure to follow the rules, especially those on following users.

If you are going to mess around quite a bit with actual posting, you may also want to consider creating an account that is not your primary Twitter account so that you don’t annoy your actual followers. (Suggestion: see the paragraph on “Create account” in part one of project phase one of my Applied NLP course for tips on how to add multiple accounts with the same gmail address.)

Basic interactions: searching, timelines, posting

All of the examples below are implemented as objects with main methods that do something using a twitter4j.Twitter object. To make it so we don’t have to call the TwitterFactory repeatedly, we first define a trait that gets a Twitter instance set up and ready to use.

[sourcecode language=”scala”]
trait TwitterInstance {
  val twitter = new TwitterFactory().getInstance
}
[/sourcecode]

By extending this trait, our objects can access the twitter object conveniently.

As a first simple example, we can search for tweets that match a query by using the search method. The following object takes a query string given on the command line, searches for tweets using that query, and prints them.

[sourcecode language=”scala”]
object QuerySearch extends TwitterInstance {

  def main(args: Array[String]) {
    val statuses = twitter.search(new Query(args(0))).getTweets
    statuses.foreach(status => println(status.getText + "\n"))
  }

}
[/sourcecode]

Note that this uses a Query object, whereas when using a TwitterStream, a FilterQuery was needed. Also, for this to work, we must have the following import available:

[sourcecode language=”scala”]
import collection.JavaConversions._
[/sourcecode]

This ensures that we can use the java.util.List returned by the getTweets method (of twitter4j.QueryResult) as if it were a Scala collection with the method foreach (and map, filter, etc). This is done via implicit conversions that make working with Java libraries far nicer than it would be otherwise.

To run this, go to the twitter4j-tutorial directory, and do the following (some example output shown):

[sourcecode]
$ ./build
> run-main bcomposes.twitter.QuerySearch scala
[info] Running bcomposes.twitter.QuerySearch scala
E’ avvilente non sentirsi all’altezza di qualcosa o qualcuno, se non si possiede quella scala interiore sulla quale l’autostima può issarsi

Scala workshop will run with ECOOP, July 2nd in Montpellier, South of France. Call for papers is out. http://t.co/3WS6tHQyiF

#scala http://t.co/JwNrzXTwm8 Even two of them in #cologne #germany . #thumbsup

RT @MILLIB2DAL: @djcameo Birthday bash 30th march @ Scala nightclub 100 artists including myself make sur u reach its gonna be #Legendary

@kot_2010 I think it’s the same case with Scala: with macros it will tend to "outsource" things to macro libs, keeping a small lang core.

RT @waxzce: #scala hiring or job ? go there : http://t.co/NeEjoqwqwT

@esten That’s not only a front-end problem. Scala devs should use scalaz.Equal and === for type safe equality. /cc @sharonw

<…more…>

[success] Total time: 1 s, completed Feb 26, 2013 1:54:44 PM
[/sourcecode]

You might see some extra communications from SBT, which will probably need to download dependencies and compile the code. For the rest of the examples below, you can run them in a similar manner, substituting the right object name and providing any necessary arguments.

There are various timelines available for each user, including the home timeline, mentions timeline, and user timeline. They are accessible as twitter4j.api.TimelineResources. For example, the following object shows the most recent statuses on the authenticating user’s home timeline (which are the tweets by people the user follows).

[sourcecode language=”scala”]
object GetHomeTimeline extends TwitterInstance {

  def main(args: Array[String]) {
    val num = if (args.length == 1) args(0).toInt else 10
    val statuses = twitter.getHomeTimeline.take(num)
    statuses.foreach(status => println(status.getText + "\n"))
  }

}
[/sourcecode]

The number of tweets to show is given as the command-line argument.

You can also update the status of the authenticating user from the command line using the following object. Calling it will post to the authenticating user’s account (so only do it if you are comfortable with the command-line argument you give it going onto your timeline).

[sourcecode language=”scala”]
object UpdateStatus extends TwitterInstance {
  def main(args: Array[String]) {
    twitter.updateStatus(new StatusUpdate(args(0)))
  }
}
[/sourcecode]

There are plenty of other useful methods that you can use to interact with Twitter, and if you have successfully run the above three, you should be able to look at the Twitter4j javadocs and start using them. Some examples doing more interesting things are given below.

Replying to tweets written to you

The following object goes through the most recent tweets that have mentioned the authenticating user, and replies “OK.” to them. It includes the author of the original tweet and any other entities that were mentioned in it.

[sourcecode language=”scala”]
object ReplyOK extends TwitterInstance {

  def main(args: Array[String]) {
    val num = if (args.length == 1) args(0).toInt else 10
    val userName = twitter.getScreenName
    val statuses = twitter.getMentionsTimeline.take(num)
    statuses.foreach { status => {
      val statusAuthor = status.getUser.getScreenName
      val mentionedEntities = status.getUserMentionEntities.map(_.getScreenName).toList
      val participants = (statusAuthor :: mentionedEntities).toSet - userName
      val text = participants.map(p => "@" + p).mkString(" ") + " OK."
      val reply = new StatusUpdate(text).inReplyToStatusId(status.getId)
      println("Replying: " + text)
      twitter.updateStatus(reply)
    }}
  }

}
[/sourcecode]

This should be mostly self-explanatory, but there are a couple of things to note. First, you can find all the entities that have been mentioned (via @-mentions) in the tweet via the method getUserMentionEntities of the twitter4j.Status class. The code ensures that the author of the original tweet (who isn’t necessarily mentioned in it) is included as a participant for the response, and also we take out the authenticating user. So, if the message “@tshrdlu What do you think of @tshrdlc?” is sent from @jasonbaldridge, the response will be “@jasonbaldridge @tshrdlc OK.” Note how the screen names do not have the @ symbol, so that must be added in the tweet text of the reply.

Second, notice that StatusUpdate objects can be created by chaining methods that add more information to them, e.g. setInReplyToStatusId and setLocation, which incrementally build up the StatusUpdate object that gets actually posted. (This is a common Java strategy that basically helps get around the fact that parameters to classes can neither be specified by name in Java nor have defaults, the way Scala does.)

Checking and managing rate limit information

None of the above code makes many requests from Twitter, so there was little danger of exceeding rate limits. These limits are a mixture of both time and number of requests: you get a certain number of requests per rate-limit window (15 minutes, as noted earlier) for each authenticating user and request type. Because of these limits, you should consider accessing tweets, timelines, and such using the streaming methods when you can.

Every response you get from Twitter comes back as a sub-class of twitter4j.TwitterResponse, which not only gives you what you want (like a QueryResult) but also gives you information about your connection to Twitter. For rate limit information, you can use the getRateLimitStatus method, which can then inform you about the number of requests you can still make and the time until your limit resets.

The trait RateChecker below has a function checkAndWait that, when given a TwitterResponse object, checks whether the rate limit has been exceeded and waits if it has. When the rate is exceeded, it finds out how much time remains until the rate limit is reset and makes the thread sleep until that time (plus 10 seconds) has passed.

[sourcecode language=”scala”]
trait RateChecker {

  def checkAndWait(response: TwitterResponse, verbose: Boolean = false) {
    val rateLimitStatus = response.getRateLimitStatus
    if (verbose) println("RLS: " + rateLimitStatus)

    if (rateLimitStatus != null && rateLimitStatus.getRemaining == 0) {
      println("*** You hit your rate limit. ***")
      val waitTime = rateLimitStatus.getSecondsUntilReset + 10
      println("Waiting " + waitTime + " seconds (" + waitTime/60.0 + " minutes) for rate limit reset.")
      Thread.sleep(waitTime*1000)
    }
  }

}
[/sourcecode]

Managing rate limits is actually more complex than this. For example, this strategy ignores the fact that different request types have different limits. It is surely not an optimal solution, but it does the trick for present purposes.

Note also that you can directly ask for rate limit information from the twitter4j.Twitter instance itself, using the getRateLimitStatus method. Unlike the results for the same method on a TwitterResponse, this gives a Map from various request types to the current rate limit statuses for each one. In a real application, you’d want to control each of these different limits at a more fine-grained level using this information.
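Here’s a sketch of dumping all of the per-request-type statuses from that Map (this assumes the collection.JavaConversions._ import shown earlier; the map keys name the endpoints, e.g. something like “/followers/ids”):

[sourcecode language=”scala”]
object ShowRateLimits extends TwitterInstance {

  def main(args: Array[String]) {
    // One entry per request type: each value reports the total limit, the
    // remaining calls, and the seconds until the window resets.
    for ((endpoint, status) <- twitter.getRateLimitStatus)
      println(endpoint + ": " + status.getRemaining + "/" + status.getLimit
        + " (resets in " + status.getSecondsUntilReset + "s)")
  }

}
[/sourcecode]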

Not all of the methods of Twitter4j classes actually hit the Twitter API. To see whether a given method does, look at its Javadoc: if its description says “This method calls http://api.twitter.com/1.1/some/method.json”, then it does hit the API. Otherwise, it doesn’t and you don’t need to guard it.

Examples using the checkAndWait function are given below.

Creating a word cloud from followers’ descriptions

Here’s a more interesting task: given a Twitter user, compute the counts of the words in the descriptions given in the bios of their followers and build a word cloud from them. The following code does this, outputting the resulting counts in a file, the contents of which can be pasted into Wordle’s advanced word cloud input.

[sourcecode language=”scala”]
object DescribeFollowers extends TwitterInstance with RateChecker {

  def main(args: Array[String]) {
    val screenName = args(0)
    val maxUsers = if (args.length == 2) args(1).toInt else 500
    val followerIds = twitter.getFollowersIDs(screenName, -1).getIDs

    val descriptions = followerIds.take(maxUsers).flatMap { id => {
      val user = twitter.showUser(id)
      checkAndWait(user)
      if (user.isProtected) None else Some(user.getDescription)
    }}

    val tword = """(?i)[a-z#@]+""".r.pattern
    val words = descriptions.flatMap(_.toLowerCase.split("\\s+"))
    val filtered = words.filter(_.length > 3).filter(tword.matcher(_).matches)
    val counts = filtered.groupBy(x => x).mapValues(_.length)
    val rankedCounts = counts.toSeq.sortBy(- _._2)

    import java.io._
    val wordcountFile = "/tmp/follower_wordcount.txt"
    val writer = new BufferedWriter(new FileWriter(wordcountFile))
    for ((w, c) <- rankedCounts)
      writer.write(w + ":" + c + "\n")
    writer.flush
    writer.close
  }

}
[/sourcecode]

The thing to consider is that if you are pointing this at a person with several hundred followers, you will exceed the rate limit. The call to getFollowersIDs is a single hit, and then each call to showUser is a hit. Because the showUser calls come in rapid succession, we check the rate limit status after each one using checkAndWait (which is available because we mixed in the RateChecker trait) and it waits for the limit to reset as previously discussed, keeping us from exceeding the rate limit and getting an exception from Twitter.

The number of users returned by getFollowersIDs is at most 5000. If you run this on a user who has more followers, followers beyond 5000 won’t be considered. If you want to tackle such a user, you’ll need to use the cursor, which is the long value provided as the second argument to getFollowersIDs, and make multiple calls, passing along the next cursor returned by each response, to get more.
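Here’s a sketch of that cursor loop (not code from the tutorial repository; it relies on the getNextCursor method of the twitter4j.IDs result, which is 0 when there are no further pages):

[sourcecode language=”scala”]
object AllFollowerIds extends TwitterInstance with RateChecker {

  def main(args: Array[String]) {
    val screenName = args(0)
    var cursor = -1L
    var allIds = Vector[Long]()
    while (cursor != 0) {
      val ids = twitter.getFollowersIDs(screenName, cursor)
      checkAndWait(ids) // IDs is a TwitterResponse, so we can rate-check it
      allIds ++= ids.getIDs
      cursor = ids.getNextCursor // 0 when there are no more pages to fetch
    }
    println("Total follower ids fetched: " + allIds.length)
  }

}
[/sourcecode]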

Most of the rest of the code in DescribeFollowers is just standard Scala stuff for computing the word counts and writing them to a file. Note that a small effort is made to filter out non-alphabetic characters (while allowing # and @) and to drop short words.

As an example of the output, when put into Wordle, here is the word cloud for my followers.

jasonbaldridge_wordcloud

This looks about right for me—completely expected in fact—but it is still cool that it comes out of my followers’ self descriptions. One could start thinking of some fun algorithms for exploiting this kind of representation of a user to look into how well different users align or don’t align with their followers, or to look for clusters of different types of followers, etc.

Retweeting automatically

Tired of actually reading those tweets in your timeline and retweeting some of them? The following code gets some of the accounts the authenticating user follows, grabs twenty of those users, filters them to get interesting ones, and then takes up to 10 of the remaining ones and retweets their most recent statuses (provided they aren’t replies to someone else).

[sourcecode language=”scala”]
object RetweetFriends extends TwitterInstance with RateChecker {

  def main(args: Array[String]) {
    val friendIds = twitter.getFriendsIDs(-1).getIDs
    val friends = friendIds.take(20).map { id => {
      val user = twitter.showUser(id)
      checkAndWait(user)
      user
    }}

    val filtered = friends.filter(admissible)
    val ranked = filtered.map(f => (f.getFollowersCount, f)).sortBy(- _._1).map(_._2)

    ranked.take(10).foreach { friend => {
      val status = friend.getStatus
      if (status != null && status.getInReplyToStatusId == -1) {
        println("\nRetweeting " + friend.getName + ":\n" + status.getText)
        twitter.retweetStatus(status.getId)
        Thread.sleep(30000)
      }
    }}
  }

  def admissible(user: User) = {
    val ratio = user.getFollowersCount.toDouble / user.getFriendsCount
    user.getFriendsCount < 1000 && ratio > 0.5
  }

}
[/sourcecode]

The getFriendsIDs method is used to get the users that the authenticating user is following (but who do not necessarily follow the authenticating user, despite the use of the word “friend”). We again take care with the rate limiting on gathering the users. We filter these users, looking for those who follow fewer than 1000 users and those who have a follower/friend ratio of greater than .5, in a simple attempt to filter out some less interesting (or spammy) accounts. The remaining users are then ranked according to their number of followers (most first). Finally, we take (up to) 10 of these (take returns at most the number you ask for; if only 3 are available, you get all 3), look at their most recent status, and if it is not null and isn’t a reply to someone, we retweet it. Between each of these, we wait for 30 seconds so that anyone following our account doesn’t get an avalanche of retweets.

Conclusion

This post and the related code should provide enough to get a decent feel for working with Twitter4j, including necessary setup and using some of the methods to start creating applications with it in Scala. See project phase three of my Applied NLP course to see exercises and code that takes this further to do interesting things for automated bots, including mixing streaming access and user access to get more complex behaviors.

Using twitter4j with Scala to access streaming tweets

Topics: twitter, twitter4j, sbt

Introduction

My previous post provided a walk-through for using the Twitter streaming API from the command line, but tweets can be more flexibly obtained and processed using an API for accessing Twitter using your programming language of choice. In this tutorial, I walk through basic setup and some simple uses of the twitter4j library with Scala. Much of what I show here should be useful for those using other JVM languages like Clojure and Java. If you haven’t gone through the previous tutorial, have a look now before going on, as this tutorial covers much of the same material but uses twitter4j rather than HTTP requests.

I’ll introduce code, bit by bit, for accessing the Twitter data in different ways. If you get lost with what should go where, all of the code necessary to run the commands is available in this github gist, so you can compare to that as you move through the tutorial.

Update: The tutorial is set up to take you from nothing to being able to obtain tweets in various ways, but you can also get all the relevant code by looking at the twitter4j-tutorial repository. For this tutorial, the tag is v0.1.0, and you can also download a tarball of that version.

Getting set up

An easy way to use the twitter4j library in the context of a tutorial like this is for the reader to set up a new SBT project, declare it as a dependency, and then compile and run code within SBT. (See my tutorial on using Jerkson for processing JSON with Scala for another example of this.) This sorts out the process of obtaining external libraries and setting up the classpath so that they are available. Follow the instructions in this section to do so.

[sourcecode language=”bash”]
$ mkdir ~/twitter4j-tutorial
$ cd ~/twitter4j-tutorial/
$ wget http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.12.2/sbt-launch.jar
[/sourcecode]

Now, save the following as the file ~/twitter4j-tutorial/build.sbt. Be aware that it is important to keep the empty lines between each of the declarations.

[sourcecode language=”scala”]
name := "twitter4j-tutorial"

version := "0.1.0"

scalaVersion := "2.10.0"

libraryDependencies += "org.twitter4j" % "twitter4j-stream" % "3.0.3"
[/sourcecode]

Then save the following as the file ~/twitter4j-tutorial/build.

[sourcecode language=”bash”]
java -Xms512M -Xmx1536M -Xss1M -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=384M -jar `dirname $0`/sbt-launch.jar "$@"
[/sourcecode]

Make that file executable and run it, which will show SBT doing a bunch of work and then leave you with the SBT prompt. At the SBT prompt, invoke the update command.

[sourcecode language=”bash”]
$ cd ~/twitter4j-tutorial
$ chmod a+x build
$ ./build
[info] Set current project to twitter4j-tutorial (in build file:/Users/jbaldrid/twitter4j-tutorial/)
> update
[info] Updating {file:/Users/jbaldrid/twitter4j-tutorial/}default-570731…
[info] Resolving org.twitter4j#twitter4j-core;3.0.3 …
[info] Done updating.
[success] Total time: 1 s, completed Feb 8, 2013 12:55:41 PM
[/sourcecode]

To test whether you have access to twitter4j now, go to the SBT console and import the classes from the main twitter4j package.

[sourcecode language=”scala”]
> console
[info] Starting scala interpreter…
[info]
Welcome to Scala version 2.10.0 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_37).
Type in expressions to have them evaluated.
Type :help for more information.

scala> import twitter4j._
import twitter4j._
[/sourcecode]

If nothing further is output, then you are all set (exit the console using CTRL-D). If things are amiss (or if you are running in the default Scala REPL), you’ll instead see something like the following.

[sourcecode language=”scala”]
scala> import twitter4j._
<console>:7: error: not found: value twitter4j
import twitter4j._
^
[/sourcecode]

If this is what you got, try to follow the instructions above again to make sure that your setup is exactly as above (check the versions, etc).

If you just want to see some examples of using twitter4j as an API and are happy adding its jars by hand to your classpath or are using an IDE like Eclipse, then it is unnecessary to do the SBT setup — just read on and adapt the examples as necessary.

Write, compile and run a simple main method

To set the stage for how we’ll run programs in this tutorial, let’s create a simple main method and ensure it can be run in SBT. Do the following:

[sourcecode language=”bash”]
$ mkdir -p ~/twitter4j-tutorial/src/main/scala/
[/sourcecode]

Next, save the following code as ~/twitter4j-tutorial/src/main/scala/TwitterStream.scala.

[sourcecode language=”scala”]
package bcomposes.twitter

import twitter4j._

object StatusStreamer {
  def main(args: Array[String]) {
    println("hi")
  }
}

Next, at the SBT prompt for the twitter4j-tutorial project, use the run-main command as follows.

[sourcecode language=”scala”]
> run-main bcomposes.twitter.StatusStreamer
[info] Compiling 1 Scala source to /Users/jbaldrid/twitter4j-tutorial/target/scala-2.10/classes…
[info] Running bcomposes.twitter.StatusStreamer
hi
[success] Total time: 2 s, completed Feb 8, 2013 1:36:32 PM
[/sourcecode]

SBT compiles the code, and then runs it. This is a generally handy way of running code with all the dependencies available without having to worry about explicitly handling the classpath.

In what comes below, we’ll flesh out that main method so that it does more interesting work.

Setting up authorization

When using the Twitter streaming API to access tweets via HTTP requests, you must supply your Twitter username and password. To use twitter4j, you also must provide authentication details; however, for this you need to set up OAuth authentication. This is straightforward:

  1. Go to https://dev.twitter.com/apps and click on the button that says “Create a new application” (of course, you’ll need to log in with your Twitter username and password in order to do this)
  2. Fill in the name, description and website fields. Don’t worry too much about this: put in whatever you like for the name and description (e.g. “My example application” and “Tutorial app for me”). For the website, give the URL of your Twitter account if you don’t have anything better to use.
  3. A new screen will come up for your application. Click on the button at the bottom that says “Create my access token”.
  4. Click on the “OAuth tool” tab and you’ll see four fields for authentication which you need in order to use twitter4j to access tweets and other information from Twitter: Consumer key, Consumer secret, Access token, and Access token secret.

Based on these authorization details, you now need to create a twitter4j.conf.Configuration object that will allow twitter4j to access the Twitter API on your behalf. This can be done in a number of different ways, including environment variables, properties files, and in code. To keep it as simple as possible for this tutorial, we’ll go with the latter option.

Add the following object after the definition of StatusStreamer, providing your details rather than the descriptions given below.

[sourcecode language=”scala”]
object Util {
  val config = new twitter4j.conf.ConfigurationBuilder()
    .setOAuthConsumerKey("[your consumer key here]")
    .setOAuthConsumerSecret("[your consumer secret here]")
    .setOAuthAccessToken("[your access token here]")
    .setOAuthAccessTokenSecret("[your access token secret here]")
    .build
}
[/sourcecode]

You should of course be careful not to let your details be known to others, so make sure that this code stays on your machine. When you start developing for real, you’ll use other means to get the authorization information injected into your application.

Pulling tweets from the sample stream

In the previous tutorial, the most basic sort of access was to get a random sample of tweets from https://stream.twitter.com/1/statuses/sample.json, so let’s use twitter4j to do the same.

To do this, we are going to create a TwitterStream instance that gives us an authorized connection to the Twitter API. To see all the methods associated with the TwitterStream class, see the API documentation for TwitterStream.  A TwitterStream instance is able to get tweets (and other information) and then provide them to any listeners that have registered with it. So, in order to do something useful with the tweets, you need to implement the StatusListener interface and connect it to the TwitterStream.

Before showing the code for creating and using the stream, let’s create a StatusListener that will perform a simple action based on tweets streaming in. Add the following code to the Util object created earlier.

[sourcecode language=”scala”]
def simpleStatusListener = new StatusListener() {
  def onStatus(status: Status) { println(status.getText) }
  def onDeletionNotice(statusDeletionNotice: StatusDeletionNotice) {}
  def onTrackLimitationNotice(numberOfLimitedStatuses: Int) {}
  def onException(ex: Exception) { ex.printStackTrace }
  def onScrubGeo(arg0: Long, arg1: Long) {}
  def onStallWarning(warning: StallWarning) {}
}
[/sourcecode]

This method creates objects that implement StatusListener (though it only does something useful for the onStatus method and otherwise ignores all other events sent to it). Clearly, what it is going to do is take a Twitter status (which is all of the information associated with a tweet, including author, retweets, geographic coordinates, etc) and output the text of the status—i.e., what we usually think of as a “tweet”.

The following code puts it all together. We create a TwitterStream object by using the TwitterStreamFactory and the configuration, add a simpleStatusListener to the stream, and then call the sample method of TwitterStream to start receiving tweets. If that were the last line of the program, it would just keep receiving tweets until the process was killed. Here, I’ve added a 2-second sleep so that we can see some tweets before cleaning up the connection and shutting it down. (We could let it run indefinitely, but then to kill the process, we would need to use CTRL-C, which will kill not only that process, but also the process that is running SBT.)

[sourcecode language=”scala”]
object StatusStreamer {
def main(args: Array[String]) {
val twitterStream = new TwitterStreamFactory(Util.config).getInstance
twitterStream.addListener(Util.simpleStatusListener)
twitterStream.sample
Thread.sleep(2000)
twitterStream.cleanUp
twitterStream.shutdown
}
}
[/sourcecode]

To run this code, simply put in the same run-main command in SBT as before.

[sourcecode language=”scala”]
> run-main bcomposes.twitter.StatusStreamer
[/sourcecode]

You should see tweets stream by for a couple of seconds and then you’ll be returned to the SBT prompt.

Pulling tweets with specific properties

As with the HTTP streaming, it’s easy to use twitter4j to follow a particular set of users, particular search terms, or tweets produced within certain geographic regions. All that is required is creating appropriate FilterQuery objects and then using the filter method of TwitterStream rather than the sample method.

FilterQuery has several constructors, one of which allows an Array of Long values to be provided, each of which is the id of a Twitter user who is to be followed by the stream. (See the previous tutorial to see one easy way to get the id of a user based on their username.)

[sourcecode language=”scala”]
object FollowIdsStreamer {
def main(args: Array[String]) {
val twitterStream = new TwitterStreamFactory(Util.config).getInstance
twitterStream.addListener(Util.simpleStatusListener)
twitterStream.filter(new FilterQuery(Array(1344951,5988062,807095,3108351)))
Thread.sleep(10000)
twitterStream.cleanUp
twitterStream.shutdown
}
}
[/sourcecode]

These are the IDs for Wired Magazine (@wired), The Economist (@theeconomist), the New York Times (@nytimes), and the Wall Street Journal (@wsj). Add the code to TwitterStream.scala and then run it in SBT. Note that I’ve made the program sleep for 10 seconds in order to give more time for tweets to arrive (since these are just four accounts and will have varying activity). If you are not seeing anything show up, increase the sleep time.

[sourcecode language=”scala”]
> run-main bcomposes.twitter.FollowIdsStreamer
[/sourcecode]

To track tweets that contain particular terms, create a FilterQuery with the default constructor and then call the track method with an Array of strings that contains the query terms you are interested in. The object below does this, and uses the args Array as the container for the query terms.

[sourcecode language=”scala”]
object SearchStreamer {
def main(args: Array[String]) {
val twitterStream = new TwitterStreamFactory(Util.config).getInstance
twitterStream.addListener(Util.simpleStatusListener)
twitterStream.filter(new FilterQuery().track(args))
Thread.sleep(10000)
twitterStream.cleanUp
twitterStream.shutdown
}
}
[/sourcecode]

With things set up this way, you can track arbitrary queries by specifying them on the command line.

[sourcecode language=”scala”]
> run-main bcomposes.twitter.SearchStreamer scala
> run-main bcomposes.twitter.SearchStreamer scala python java
> run-main bcomposes.twitter.SearchStreamer "sentiment analysis" "machine learning" "text analytics"
[/sourcecode]

If the search terms are not particularly common, you’ll need to increase the sleep time.

To filter by location, again create a FilterQuery with the default constructor, but then use the locations method, with an Array[Array[Double]] argument — basically an Array of two-element Arrays, each of which contains the longitude and latitude of a corner of a bounding box. Here’s an example that creates a bounding box for Austin and uses it.

[sourcecode language=”scala”]
object AustinStreamer {
def main(args: Array[String]) {
val twitterStream = new TwitterStreamFactory(Util.config).getInstance
twitterStream.addListener(Util.simpleStatusListener)
val austinBox = Array(Array(-97.8,30.25),Array(-97.65,30.35))
twitterStream.filter(new FilterQuery().locations(austinBox))
Thread.sleep(10000)
twitterStream.cleanUp
twitterStream.shutdown
}
}
[/sourcecode]

To make things more flexible, we can take the bounding box information on the command line, convert the Strings into Doubles and pair them up.

[sourcecode language=”scala”]
object LocationStreamer {
def main(args: Array[String]) {
val boundingBoxes = args.map(_.toDouble).grouped(2).toArray
val twitterStream = new TwitterStreamFactory(Util.config).getInstance
twitterStream.addListener(Util.simpleStatusListener)
twitterStream.filter(new FilterQuery().locations(boundingBoxes))
Thread.sleep(10000)
twitterStream.cleanUp
twitterStream.shutdown
}
}
[/sourcecode]

We can call LocationStreamer with multiple bounding boxes, e.g. as follows for Austin, San Francisco, and New York City.

[sourcecode language=”scala”]
> run-main bcomposes.twitter.LocationStreamer -97.8 30.25 -97.65 30.35 -122.75 36.8 -121.75 37.8 -74 40 -73 41
[/sourcecode]

Conclusion

This shows the start of how you can use twitter4j with Scala for streaming. It also supports programmatic access to the actions that any Twitter user can take, including posting messages, retweeting, following, and more. I’ll cover that in a later tutorial. Also, some examples of using twitter4j will start showing up soon in the tshrdlu project.

Unix pipelines for basic spelling error detection

Topics: Unix, spelling, tr, sort, uniq, find, awk

Introduction

We can of course write programs to do most anything we want, but often the Unix command line has everything we need to perform a series of useful operations without writing a line of code. In my Applied NLP class today, I showed how one can get a high-confidence dictionary out of a body of raw text with a series of Unix pipes, and I’m posting that here so students can refer back to it later and see some pointers to other useful Unix resources.

Note: for help with any of the commands, just type “man <command>” at the Unix prompt.

Checking for spelling errors

We are working on automated spelling correction as an in-class exercise, with a particular emphasis on the following sentence:

This Facebook app shows that she is there favorite acress in tonw

So, this has a contextual spelling error (there), an error that could be a valid English word but isn’t (acress) and an error that violates English sound patterns (tonw).

One of the key ingredients for spelling correction is a dictionary of words known to be valid in the language. Let’s assume we are working with English here. On most Unix systems, you can pick up an English dictionary in /usr/share/dict/words, though the words you find may vary from one platform to another. If you can’t find anything there, there are many word lists available online, e.g. check out the Wordlist project for downloads and links.

We can easily use the dictionary and Unix to check for words in the above sentence that don’t occur in the dictionary. First, save the sentence to a file.

[sourcecode language=”bash”]
$ echo "This Facebook app shows that she is there favorite acress in tonw" > sentence.txt
[/sourcecode]

Next, we need to get the unique word types (rather than tokens) in sorted lexicographic order. The following Unix pipeline accomplishes this.

[sourcecode language=”bash”]
$ cat sentence.txt | tr ' ' '\n' | sort | uniq > words.txt
[/sourcecode]

To break it down:

  • The cat command spills the file to standard output.
  • The tr command “translates” all spaces to newlines, giving us one word per line.
  • The sort command sorts the lines lexicographically.
  • The uniq command removes adjacent duplicate lines. (This doesn’t do anything for this particular sentence, but I’m putting it in there in case you try other sentences that have multiple tokens of a type, such as “the”.)

You can see these effects by doing each in turn, building up the pipeline incrementally.

[sourcecode language=”bash”]
$ cat sentence.txt
This Facebook app shows that she is there favorite acress in tonw
$ cat sentence.txt | tr ' ' '\n'
This
Facebook
app
shows
that
she
is
there
favorite
acress
in
tonw
$ cat sentence.txt | tr ' ' '\n' | sort
Facebook
This
acress
app
favorite
in
is
she
shows
that
there
tonw
[/sourcecode]

Note: the use of cat above is a UUOC (unnecessary use of cat) that is dispreferred to sending the input directly into tr at the start. I do it this way in the tutorial so that everything flows left-to-right. However, if you want to avoid cat abuse, here’s how you’d do it.

[sourcecode language=”bash”]

$ tr ' ' '\n' < sentence.txt | sort | uniq
[/sourcecode]

We can now use the comm command to compare the file words.txt and the dictionary. It produces three columns of output: the first gives the lines only in the first file, the second are lines only in the second file, and the third are those in common. So, the first column has what we need, because those are words in our sentence that are not found in the dictionary. Here’s the command to get that.

[sourcecode language=”bash”]
$ comm -23 words.txt /usr/share/dict/words
Facebook
This
acress
app
shows
tonw
[/sourcecode]

The -23 option indicates that columns 2 and 3 should be suppressed, showing only column 1. If we just use -2, we get two columns: the non-dictionary words of the sentence on the left and the dictionary words on the right (try it).

The problem of course is that any word list will have gaps. This dictionary doesn’t have more recent terms like Facebook and app. It also doesn’t have upper-case This. You can ignore case with comm using the -i option and this goes away. It doesn’t have shows, which is not in the dictionary since it is an inflected form of the verb stem show. We could fix this with some morphological analysis, but instead of that, let’s go the lazy route and just grab a larger list of words.

Extracting a high-confidence dictionary from a corpus

Raw text often contains spelling errors, but errors don’t tend to happen with very high frequency, so we can often get pretty good expanded word lists by computing frequencies of word types on lots of text and then applying reasonable cutoffs. (There are much more refined methods, but this will suffice for current purposes.)

First, let’s get some data. The Open American National Corpus has just released v3.0.0 of its Manually Annotated Sub-Corpus (MASC), which you can get from this link.

– http://www.anc.org/masc/MASC-3.0.0.tgz

Do the following to get it and set things up for further processing:

[sourcecode language=”bash”]
$ mkdir masc
$ cd masc
$ wget http://www.anc.org/masc/MASC-3.0.0.tgz
$ tar xzf MASC-3.0.0.tgz
[/sourcecode]

(If you don’t have wget, you can just download the MASC file in your browser and then move it over.)

Next, we want all the text from the data/written directory. The find command is very handy for this.

[sourcecode language=”bash”]
$ find data/written -name "*.txt" -exec cat {} \; > all-written.txt
[/sourcecode]

To see how much is there, use the wc command.

[sourcecode language=”bash”]
$ wc all-written.txt
43061 400169 2557685 all-written.txt
[/sourcecode]

So, there are 43k lines, and 400k tokens. That’s a bit small for what we are trying to do, but it will suffice for the example.

Again, I’ll build up a Unix pipeline to extract the high-confidence word types from this corpus. I’ll use the head command to show just part of the output at each stage.

Here are the raw contents.

[sourcecode language=”bash”]
$ cat all-written.txt | head

I can’t believe I wrote all that last year.
Acephalous

Friday, 07 May 2010

[/sourcecode]

Now, get one word per line.

[sourcecode language=”bash”]
$ cat all-written.txt | tr -cs 'A-Za-z' '\n' | head

I
can
t
believe
I
wrote
all
that
last
[/sourcecode]

The tr translator is used very crudely here: anything that is not an ASCII letter character is turned into a newline. The -cs options indicate to take the complement (opposite) of the ‘A-Za-z’ argument and to squeeze duplicates (e.g. “A42,” becomes “A” followed by a single newline rather than three newlines).

Next, we sort and uniq, as above, except that we use the -c option to uniq so that it produces counts.

[sourcecode language=”bash”]
$ cat all-written.txt | tr -cs 'A-Za-z' '\n' | sort | uniq -c | head
1
737 A
22 AA
1 AAA
1 AAF
1 AAPs
21 AB
3 ABC
1 ABDULWAHAB
1 ABLE
[/sourcecode]

Because the MASC corpus includes tweets and blogs and other unedited text, we don’t trust words that have low counts, e.g. four or fewer tokens of that type. We can use awk to filter those out.

[sourcecode language=”bash”]
$ cat all-written.txt | tr -cs 'A-Za-z' '\n' | sort | uniq -c | awk '{ if($1>4) print $2 }' | head
A
AA
AB
ACR
ADDRESS
ADP
ADPNP
AER
AIG
ALAN
[/sourcecode]

Awk makes it easy to process lines of files, and gives you indexes into the first column ($1), second ($2), and so on. There’s much more you can do, but this shows how you can conditionally output some information from each line using awk.

You can of course change the threshold. You can also turn all words to lower-case by inserting another tr call into the pipe, e.g.:

[sourcecode language=”bash”]
$ cat all-written.txt | tr 'A-Z' 'a-z' | tr -cs 'a-z' '\n' | sort | uniq -c | awk '{ if($1>8) print $2 }' | head
a
aa
ab
abandoned
abbey
ability
able
abnormal
abnormalities
aboard
[/sourcecode]

It all comes down to what you need out of the text.

Combining and using the dictionaries

Let’s do the check on the sentence above, but using both the standard dictionary and the one derived from MASC. Run the following command first.

[sourcecode language=”bash”]
$ cat all-written.txt | tr -cs 'A-Za-z' '\n' | sort | uniq -c | awk '{ if($1>4) print $2 }' > /tmp/masc_vocab.txt
[/sourcecode]

Then in the directory where you saved words.txt, do the following.

[sourcecode language=”bash”]
$ cat /usr/share/dict/words /tmp/masc_vocab.txt | sort | uniq > big_vocab.txt
$ comm -23 words.txt big_vocab.txt
acress
tonw
[/sourcecode]

Ta-da! The MASC corpus provided us with enough examples of other words that This, Facebook, app, and shows are no longer detected as errors. Of course, detecting there as an error is much more difficult and requires language models and more.

Conclusion

Learn to use the Unix command line! This post is just a start into the many cool things you can do with Unix pipes.

Happy (Unix) hacking!

Processing JSON in Scala with Jerkson

Topics: JSON, Jerkson, SBT quick start, running the Scala REPL in SBT, Java implicit conversions, @transient annotation, SBT run and run-main, Avro

Introduction

The previous tutorial covered basic XML processing in Scala, but as I noted, XML is not the primary choice for data serialization these days. Instead, JSON (JavaScript Object Notation) is more widely used for data interchange, in part because it is less verbose and better captures the core data structures (such as lists and maps) that are used in defining many objects. It was originally designed for working with JavaScript, but turned out to be quite effective as a language neutral format. A very nice feature of it is that it is straightforward to translate objects as defined in languages like Java and Scala into JSON and back again, as I’ll show in this tutorial. If the class definitions and the JSON structures are appropriately aligned, this transformation turns out to be entirely trivial to do — given a suitable JSON processing library.

In this tutorial, I cover basic JSON processing in Scala using the Jerkson library, which itself is essentially a Scala wrapper around the Jackson library (written in Java).  Note that other libraries like lift-json are perfectly good alternatives, but Jerkson seems to have some efficiency advantages for streaming JSON due to Jackson’s performance. Of course, since Scala plays nicely with Java, you can directly use whichever JVM-based JSON library you like, including Jackson.

This post also shows how to do a quick start with SBT that will allow you to easily access third-party libraries as dependencies and start writing code that uses them and can be compiled with SBT.

Note: As a “Jason” I insist that JSON should be pronounced Jay-SAHN (with stress on the second syllable) to distinguish it from the name. 🙂

Getting set up

An easy way to use the Jerkson library in the context of a tutorial like this is for the reader to set up a new SBT project, declare Jerkson as a dependency, and then fire up the Scala REPL using SBT’s console action. This sorts out the process of obtaining external libraries and setting up the classpath so that they are available in an SBT-initiated Scala REPL. Follow the instructions in this section to do so.

Note: if you have already been working with Scalabha version 0.2.5 (or later), skip to the bottom of this section to see how to run the REPL using Scalabha’s build. Alternatively, if you have an existing project of your own, you can of course just add Jerkson as a dependency, import its classes as necessary and use it in your normal programming setup. The examples below will then help as some straightforward recipes for using it in your project.

First, create a directory to work in and download the SBT launch jar.

$ mkdir ~/json-tutorial
$ cd ~/json-tutorial/
$ wget http://typesafe.artifactoryonline.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.11.3/sbt-launch.jar

Note: If you don’t have wget installed on your machine, you can download the above sbt-launch.jar file in your browser and move it to the ~/json-tutorial directory.

Now, save the following as the file ~/json-tutorial/build.sbt. Be aware that it is important to keep the empty lines between each of the declarations.

name := "json-tutorial"

version := "0.1.0 "

scalaVersion := "2.9.2"

resolvers += "repo.codahale.com" at "http://repo.codahale.com"

libraryDependencies += "com.codahale" % "jerkson_2.9.1" % "0.5.0"

Then save the following in the file ~/json-tutorial/runSbt.

java -Xms512M -Xmx1536M -Xss1M -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=384M -jar `dirname $0`/sbt-launch.jar "$@"

Make that file executable and run it, which will show SBT doing a bunch of work and then leave you with the SBT prompt.

$ cd ~/json-tutorial
$ chmod a+x runSbt
$ ./runSbt update
Getting org.scala-sbt sbt_2.9.1 0.11.3 ...
downloading http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt_2.9.1/0.11.3/jars/sbt_2.9.1.jar ...
[SUCCESSFUL ] org.scala-sbt#sbt_2.9.1;0.11.3!sbt_2.9.1.jar (307ms)
...
... more stuff including getting the Jerkson library ...
...
[success] Total time: 25 s, completed May 11, 2012 10:22:42 AM
$

You should be back in the Unix shell at this point, and now we are ready to run the Scala REPL using SBT. The important thing is that this instance of the REPL will have the Jerkson library and its dependencies in the classpath so that we can import the classes we need.

./runSbt console
[info] Set current project to json-tutorial (in build file:/Users/jbaldrid/json-tutorial/)
[info] Starting scala interpreter...
[info]
Welcome to Scala version 2.9.2 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_31).
Type in expressions to have them evaluated.
Type :help for more information.

scala> import com.codahale.jerkson.Json._
import com.codahale.jerkson.Json._

If nothing further is output, then you are all set. If things are amiss (or if you are running in the default Scala REPL), you’ll instead see something like the following.

scala> import com.codahale.jerkson.Json._
<console>:7: error: object codahale is not a member of package com
import com.codahale.jerkson.Json._

If this is what you got, try to follow the instructions above again to make sure that your setup is exactly as above. However, if you continue to experience problems, an alternative is to get version 0.2.5 of Scalabha (which already has Jerkson as a dependency), follow the instructions for setting it up and then run the following commands.

$ cd $SCALABHA_DIR
$ scalabha build console

If you just want to see some examples of using Jerkson as an API and not use it interactively, then it is entirely unnecessary to do the SBT setup — just read on and adapt the examples as necessary.

Processing a simple JSON example

As usual, let’s begin with a very simple example that shows some of the basic properties of JSON.

{"foo": 42
"bar": ["a","b","c"],
"baz": { "x": 1, "y": 2 }}

This describes a data structure with three fields, foo, bar and baz. The field foo's value is the integer 42, bar's value is a list of strings, and baz's value is a map from strings to integers. These are language-neutral (and essentially universal) types.

Let’s first consider deserializing each of these values individually as Scala objects, using Jerkson’s parse method. Keep in mind that JSON in a file is a string, so the inputs in all of these cases are strings (at times I’ll use triple-quoted strings when there are quotes themselves in the JSON). In each case, we tell the parse method what type we expect by providing a type specification before the argument.

scala> parse[Int]("42")
res0: Int = 42

scala> parse[List[String]]("""["a","b","c"]""")
res1: List[String] = List(a, b, c)

scala> parse[Map[String,Int]]("""{ "x": 1, "y": 2 }""")
res2: Map[String,Int] = Map(x -> 1, y -> 2)

So, in each case, the string representation is turned into a Scala object of the appropriate type. If we aren’t sure what the type is or if we know for example that a List is heterogeneous, we can use Any as the expected type.

scala> parse[Any]("42")
res3: Any = 42

scala> parse[List[Any]]("""["a",1]""")
res4: List[Any] = List(a, 1)

If you give an expected type that the input can't be parsed as, you'll get an error.

scala> parse[List[Int]]("""["a",1]""")
com.codahale.jerkson.ParsingException: Can not construct instance of int from String value 'a': not a valid Integer value
at [Source: java.io.StringReader@2bc5aea; line: 1, column: 2]
<...many more lines of stack trace...>

How about parsing all of the attributes and values together? Save the whole thing in a variable simpleJson as follows.

scala> :paste
// Entering paste mode (ctrl-D to finish)

val simpleJson = """{"foo": 42,
"bar": ["a","b","c"],
"baz": { "x": 1, "y": 2 }}"""

// Exiting paste mode, now interpreting.

simpleJson: java.lang.String =
{"foo": 42,
"bar": ["a","b","c"],
"baz": { "x": 1, "y": 2 }}

Since it is a Map from Strings to different types of values, the best we can do is deserialize it as a Map[String, Any].

scala> val simple = parse[Map[String,Any]](simpleJson)
simple: Map[String,Any] = Map(bar -> [a, b, c], baz -> {x=1, y=2}, foo -> 42)

To get these out as more specific types than Any, you need to cast them to the appropriate types.

scala> val fooValue = simple("foo").asInstanceOf[Int]
fooValue: Int = 42

scala> val barValue = simple("bar").asInstanceOf[java.util.ArrayList[String]]
barValue: java.util.ArrayList[String] = [a, b, c]

scala> val bazValue = simple("baz").asInstanceOf[java.util.LinkedHashMap[String,Int]]
bazValue: java.util.LinkedHashMap[String,Int] = {x=1, y=2}

Of course, you might want to be working with Scala types, which is easy if you import the implicit conversions from Java types to Scala types.

scala> import scala.collection.JavaConversions._
import scala.collection.JavaConversions._

scala> val barValue = simple("bar").asInstanceOf[java.util.ArrayList[String]].toList
barValue: List[String] = List(a, b, c)

scala> val bazValue = simple("baz").asInstanceOf[java.util.LinkedHashMap[String,Int]].toMap
bazValue: scala.collection.immutable.Map[String,Int] = Map(x -> 1, y -> 2)

Voila! When you are working with Java libraries in Scala, the JavaConversions usually prove to be extremely handy.

Deserializing into user-defined types

Though we were able to parse the simple JSON expression above and even cast values into appropriate types, things were still a bit clunky. Fortunately, if you have defined your own case class with the appropriate fields, you can provide that as the expected type instead. For example, here’s a simple case class that will do the trick.

case class Simple(val foo: String, val bar: List[String], val baz: Map[String,Int])

Clearly this has all the right fields (with variables named the same as the fields in the JSON example), and the variables have the types we’d like them to have.

Unfortunately, due to class loading issues with SBT, we cannot carry on the rest of this exercise solely in the REPL and must define this class in code. This code can be compiled and then used in the REPL or by other code. To do this, save the following as ~/json-tutorial/Simple.scala.

case class Simple(val foo: String, val bar: List[String], val baz: Map[String,Int])

object SimpleExample {
  def main(args: Array[String]) {
    import com.codahale.jerkson.Json._
    val simpleJson = """{"foo":42, "bar":["a","b","c"], "baz":{"x":1,"y":2}}"""
    val simpleObject = parse[Simple](simpleJson)
    println(simpleObject)
  }
}

Then exit the Scala REPL session you were in for the previous section using the command :quit, and do the following. (If anything has gone amiss you can restart SBT (with runSbt) and do the following commands.)

> compile
[info] Compiling 1 Scala source to /Users/jbaldrid/json-tutorial/target/scala-2.9.2/classes...
[success] Total time: 2 s, completed May 11, 2012 9:24:00 PM
> run
[info] Running SimpleExample
Simple(42,List(a, b, c),Map(x -> 1, y -> 2))
[success] Total time: 1 s, completed May 11, 2012 9:24:03 PM

You can make changes to the code in Simple.scala, compile it again (you don’t need to exit SBT to do so), and run it again. Also, now that you’ve compiled, if you start up the Scala REPL using the console action, then the Simple class is now available to you and you can carry on working in the REPL. For example, here are the same statements that are used in the SimpleExample main method given previously.

scala> import com.codahale.jerkson.Json._
import com.codahale.jerkson.Json._

scala> val simpleJson = """{"foo":42, "bar":["a","b","c"], "baz":{"x":1,"y":2}}"""
simpleJson: java.lang.String = {"foo":42, "bar":["a","b","c"], "baz":{"x":1,"y":2}}

scala> val simpleObject = parse[Simple](simpleJson)
simpleObject: Simple = Simple(42,List(a, b, c),Map(x -> 1, y -> 2))

scala> println(simpleObject)
Simple(42,List(a, b, c),Map(x -> 1, y -> 2))

Another nice feature of JSON serialization is that if the JSON string has more information than you need to construct the object you want to build from it, the extra material is ignored. For example, consider deserializing the following example, which has an extra field eca in the JSON representation.

scala> val ecaJson = """{"foo":42, "bar":["a","b","c"], "baz":{"x":1,"y":2}, "eca": true}"""
ecaJson: java.lang.String = {"foo":42, "bar":["a","b","c"], "baz":{"x":1,"y":2}, "eca": true}

scala> val noEcaSimpleObject = parse[Simple](ecaJson)
noEcaSimpleObject: Simple = Simple(42,List(a, b, c),Map(x -> 1, y -> 2))

The eca information silently slips away and we still get a Simple object with all the information we need. This property is very handy for ignoring irrelevant information, which I’ll show to be quite useful in a follow-up post on processing JSON formatted tweets from Twitter’s API.

Another thing to note about the above example is that the Boolean values true and false are valid JSON (they are not quoted strings, but actual Boolean values). Parsing a Boolean is even quite forgiving as Jerkson will give you a Boolean even when it is defined as a String.

scala> parse[Map[String,Boolean]]("""{"eca":true}""")
res0: Map[String,Boolean] = Map(eca -> true)

scala> parse[Map[String,Boolean]]("""{"eca":"true"}""")
res1: Map[String,Boolean] = Map(eca -> true)

And it will convert a Boolean into a String if you happen to ask it to do so.

scala> parse[Map[String,String]]("""{"eca":true}""")
res2: Map[String,String] = Map(eca -> true)

But it (sensibly) won’t convert any String other than true or false into a Boolean.

scala> parse[Map[String,Boolean]]("""{"eca":"brillig"}""")
com.codahale.jerkson.ParsingException: Can not construct instance of boolean from String value 'brillig': only "true" or "false" recognized
at [Source: java.io.StringReader@6b2739b8; line: 1, column: 2]
<...stacktrace...>

And it doesn’t admit unquoted values other than a select few, including true and false.

scala> parse[Map[String,String]]("""{"eca":brillig}""")
com.codahale.jerkson.ParsingException: Malformed JSON. Unexpected character ('b' (code 98)): expected a valid value (number, String, array, object, 'true', 'false' or 'null') at character offset 7.
<...stacktrace...>

In other words, your JSON needs to be grammatical.
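Since bad input surfaces as a ParsingException, it is easy to guard parse when you are handling JSON you don't control. Here's a minimal sketch (safeParse is just a hypothetical helper for illustration, not part of Jerkson):

import com.codahale.jerkson.Json._
import com.codahale.jerkson.ParsingException

// Hypothetical helper: returns None instead of throwing on ungrammatical input.
def safeParse[T: Manifest](json: String): Option[T] =
  try {
    Some(parse[T](json))
  } catch {
    case e: ParsingException => None
  }

safeParse[Map[String, Boolean]]("""{"eca":true}""")    // Some(Map(eca -> true))
safeParse[Map[String, String]]("""{"eca":brillig}""")  // None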

Generating JSON from an object

If you have an object in hand, it is very easy to create JSON from it (serialize) using the generate method.

scala> val simpleJsonString = generate(simpleObject)
simpleJsonString: String = {"foo":"42","bar":["a","b","c"],"baz":{"x":1,"y":2}}

This is much easier than the XML solution, which required explicitly declaring how an object was to be turned into XML elements. The restriction is that any such objects must be instances of a case class. If you don’t have a case class, you’ll need to do some special handling (not discussed in this tutorial).
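As a quick sanity check on the whole cycle, you can round-trip an object through JSON and back; with the simpleObject from earlier, the deserialized result should be equal to the original.

// Round trip: generate JSON from the object, then parse it back.
val roundTripped = parse[Simple](generate(simpleObject))
println(roundTripped == simpleObject)  // expected: true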

A richer JSON example

In the vein of the previous tutorial on XML, I’ve created the JSON corresponding to the music XML example used there. You can find it as the Github gist music.json:

https://gist.github.com/2668632

Save that file as /tmp/music.json.

Tip: you can easily format condensed JSON to be more human-readable by piping it through Python's json.tool module.

$ cat /tmp/music.json | python -mjson.tool
[
{
"albums": [
{
"description": "ntThe King of Limbs is the eighth studio album by English rock band Radiohead, produced by Nigel Godrich. It was self-released on 18 February 2011 as a download in MP3 and WAV formats, followed by physical CD and 12" vinyl releases on 28 March, a wider digital release via AWAL, and a special "newspaper" edition on 9 May 2011. The physical editions were released through the band's Ticker Tape imprint on XL in the United Kingdom, TBD in the United States, and Hostess Entertainment in Japan.n      ",
"songs": [
{
"length": "5:15",
"title": "Bloom"
},
<...etc...>

Next, save the following code as ~/json-tutorial/MusicJson.scala.

package music {

  case class Song(val title: String, val length: String) {
    @transient lazy val time = {
      val Array(minutes, seconds) = length.split(":")
      minutes.toInt*60 + seconds.toInt
    }
  }

  case class Album(val title: String, val songs: Seq[Song], val description: String) {
    @transient lazy val time = songs.map(_.time).sum
    @transient lazy val length = (time / 60)+":"+(time % 60)
  }

  case class Artist(val name: String, val albums: Seq[Album])
}

object MusicJson {
  def main(args: Array[String]) {
    import com.codahale.jerkson.Json._
    import music._
    val jsonInput = io.Source.fromFile("/tmp/music.json").mkString
    val musicObj = parse[List[Artist]](jsonInput)
    println(musicObj)
  }
}

A couple of quick notes. The Song, Album, and Artist classes are the same as I used in the previous tutorial on XML processing, with two changes. The first is that I've wrapped them in a package music. This is only necessary to get around an issue with running Jerkson in SBT as we are doing here. The other is that the fields that are not in the constructor are marked as @transient: this ensures that they are not included in the output when we generate JSON from objects of these classes. An example showing how this matters is the way that I created the music.json file: I read in the XML as in the previous tutorial and then used Jerkson to generate the JSON — without the @transient annotation, those fields would be included in the output. For reference, here's the code to do the conversion from XML to JSON (which you can add to MusicJson.scala if you like).

object ConvertXmlToJson {
  def main(args: Array[String]) {
    import com.codahale.jerkson.Json._
    import music._
    val musicElem = scala.xml.XML.loadFile("/tmp/music.xml")

    val artists = (musicElem \ "artist").map { artist =>
      val name = (artist \ "@name").text
      val albums = (artist \ "album").map { album =>
        val title = (album \ "@title").text
        val description = (album \ "description").text
        val songList = (album \ "song").map { song =>
          Song((song \ "@title").text, (song \ "@length").text)
        }
        Album(title, songList, description)
      }
      Artist(name, albums)
    }

    val musicJson = generate(artists)
    val output = new java.io.BufferedWriter(new java.io.FileWriter(new java.io.File("/tmp/music.json")))
    output.write(musicJson)
    output.flush()
    output.close()
  }
}

There are other serialization strategies (e.g. binary serialization of objects), and the @transient annotation is similarly respected by them.
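To make that concrete, here's a small self-contained sketch (a standalone file, separate from the music package, so the Song class is duplicated) showing that Java's built-in binary serialization also skips the @transient cached value, which is simply recomputed on first access after deserialization:

import java.io._

// Duplicated here so the sketch is self-contained.
case class Song(title: String, length: String) {
  @transient lazy val time = {
    val Array(minutes, seconds) = length.split(":")
    minutes.toInt * 60 + seconds.toInt
  }
}

object TransientDemo {
  def main(args: Array[String]) {
    val bloom = Song("Bloom", "5:15")
    println(bloom.time)  // 315; forces the lazy val to be computed and cached

    // Write the object out; the cached time is skipped because it is @transient.
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(bloom)
    out.close()

    // Read it back in; time is simply recomputed on first access.
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    val bloomCopy = in.readObject().asInstanceOf[Song]
    println(bloomCopy.time)  // 315 again
  }
}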

Given the code in MusicJson.scala, we can now compile and run it. In SBT, you can use either run or run-main. If you choose run and there is more than one main method in your project, SBT will give you a choice.

> run

Multiple main classes detected, select one to run:

[1] SimpleExample
[2] MusicJson
[3] ConvertXmlToJson

Enter number: 2

[info] Running MusicJson
List(Artist(Radiohead,List(Album(The King of Limbs,List(Song(Bloom,5:15), Song(Morning Mr Magpie,4:41), Song(Little by Little,4:27), Song(Feral,3:13), Song(Lotus Flower,5:01), Song(Codex,4:47), Song(Give Up the Ghost,4:50), Song(Separator,5:20)),
The King of Limbs is the eighth studio album by English rock band Radiohead, produced by Nigel Godrich. It was self-released on 18 February 2011 as a download in MP3 and WAV formats, followed by physical CD and 12" vinyl releases on 28 March, a wider digital release via AWAL, and a special "newspaper" edition on 9 May 2011. The physical editions were released through the band's Ticker Tape imprint on XL in the United Kingdom, TBD in the United States, and Hostess Entertainment in Japan.
), Album(OK Computer,List(Song(Airbag,4:44), Song(Paranoid
<...more printed output...>
[success] Total time: 3 s, completed May 12, 2012 11:52:06 AM

With run-main, you just explicitly provide the name of the object whose main method you wish to run.

> run-main MusicJson
[info] Running MusicJson
<...same output as above...>

So, either way, we have successfully de-serialized the JSON description of the music data. (You can also get the same result by entering the code of the main method of MusicJson into the REPL when you run it from the SBT console.)

Conclusion

This tutorial has shown how easy it is to serialize (generate) and deserialize (parse) objects to and from JSON format. Hopefully, this has demonstrated the relative ease of doing this with the Jerkson library and Scala, and especially the relative ease in comparison with working with XML for similar purposes.

In addition to this ease, JSON is generally more compact than the equivalent XML. However, it still is far from being a truly compressed format, and there is a lot of obvious “waste”, like having the field names repeated again and again for each object. This matters a lot when data is represented as JSON strings and is being sent over networks and/or used in distributed processing frameworks like Hadoop. The Avro file format is an evolution of JSON that performs such compression: it includes a schema with each file and then each object is represented in a binary format that only specifies the data and not the field names. In addition to being more compact, it retains the properties of being easily splittable, which matters a great deal for processing large files in Hadoop.

Copyright 2012 Jason Baldridge

The text of this tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original tutorial.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.

Basic XML processing with Scala

Topics: XML, Scala XML API, XML literals, marshalling

Introduction

Pretty much everybody knows what XML is: it is a structured, machine-readable text format for representing information that can be easily checked for the "grammaticality" of the tags, attributes, and their relationship to each other (e.g. using DTDs). This contrasts with HTML, which can have elements that don't close (e.g. <p>foo<p>bar rather than <p>foo</p><p>bar</p>) and still be processed. XML was only ever meant to be a format for machines, but it morphed into a data representation that many people ended up (unfortunately, for them) editing by hand. However, even as a machine-readable format it has problems, such as being far more verbose than is really required, which matters quite a bit when you need to transfer lots of data from machine to machine — in the next post, I'll discuss JSON and Avro, which can be viewed as evolutions of what XML was intended for and which work much better for lots of the applications that matter in the "big data" context. Regardless, there is plenty of legacy data that was produced as XML, and there are many communities (e.g. the digital humanities community) who still seem to adore XML, so people doing any reasonable amount of text analysis work will likely find themselves eventually needing to work with XML-encoded data.

There are a lot of tutorials on XML and Scala — just do a web search for “Scala XML” and you’ll get them. As with other blog posts, this one is aimed at being very explicit so that beginners can see examples with all the steps in them, and I’ll use it to set up a JSON processing post.

A simple example of XML

To start things off, let’s consider a very basic example of creating and processing a bit of XML.

The first thing to know about XML in Scala is that Scala can process XML literals. That is, you don't need to put quotes around XML strings — instead, you can just write them directly, and Scala will automatically interpret them as XML elements (of type scala.xml.Elem).

scala> val foo = <foo><bar type="greet">hi</bar><bar type="count">1</bar><bar type="color">yellow</bar></foo>
foo: scala.xml.Elem = <foo><bar type="greet">hi</bar><bar type="count">1</bar><bar type="color">yellow</bar></foo>

Now let’s do a little bit of processing on this. You can get all the text by using the text method.

scala> foo.text
res0: String = hi1yellow

So, that munged all the text together. To get them printed out with spaces between, let's first get all the bar nodes and then get their texts and use mkString on that sequence. To get the bar nodes, we can use the \ selector.

scala> foo \ "bar"
res1: scala.xml.NodeSeq = NodeSeq(<bar type="greet">hi</bar>, <bar type="count">1</bar>, <bar type="color">yellow</bar>)

This gives us back a sequence of the bar nodes that occur directly under the foo node. Note that the \ operator (selector) is just a mirror image of the / selector used in XPath.

Of course, now that we have such a sequence, we can map over it to get what we want. Since the text method returns the text under a node, we can do the following.

scala> (foo \ "bar").map(_.text).mkString(" ")
res2: String = hi 1 yellow

To grab the value of the type attribute on each node, we can use the \ selector followed by "@type".

scala> (foo \ "bar").map(_ \ "@type")
res3: scala.collection.immutable.Seq[scala.xml.NodeSeq] = List(greet, count, color)

scala> (foo \ "bar").map(barNode => (barNode \ "@type", barNode.text))
res4: scala.collection.immutable.Seq[(scala.xml.NodeSeq, String)] = List((greet,hi), (count,1), (color,yellow))

Note that the \ selector can only retrieve children of the node you are selecting from. To dig arbitrarily deep to pull out all nodes of a given type no matter where they are, use the \\ selector. Consider the following (bizarre) XML snippet with 'z' nodes at different levels of embedding.

<a>
  <z x="1"/>
  <b>
    <z x="2"/>
    <c>
      <z x="3"/>
    </c>
    <z x="4"/>
  </b>
</a>

Let’s first put it into the REPL.

scala> val baz = <a><z x="1"/><b><z x="2"/><c><z x="3"/></c><z x="4"/></b></a>
baz: scala.xml.Elem = <a><z x="1"></z><b><z x="2"></z><c><z x="3"></z></c><z x="4"></z></b></a>

If we want to get all of the ‘z’ nodes, we do the following.

scala> baz \\ "z"
res5: scala.xml.NodeSeq = NodeSeq(<z x="1"></z>, <z x="2"></z>, <z x="3"></z>, <z x="4"></z>)

And we can of course easily dig out the values of the x attributes on each of the z’s.

scala> (baz \\ "z").map(_ \ "@x")
res6: scala.collection.immutable.Seq[scala.xml.NodeSeq] = List(1, 2, 3, 4)

Throughout all of the above, we have used XML literals — that is, expressions typed directly into Scala, which interprets them as XML types. However, we usually need to process XML that is saved in a file, or a string, so the scala.xml.XML object has several methods for creating scala.xml.Elem objects from other sources. For example, the following allows us to create XML from a string.

scala> val fooString = """<foo><bar type="greet">hi</bar><bar type="count">1</bar><bar type="color">yellow</bar></foo>"""
fooString: java.lang.String = <foo><bar type="greet">hi</bar><bar type="count">1</bar><bar type="color">yellow</bar></foo>

scala> val fooElemFromString = scala.xml.XML.loadString(fooString)
fooElemFromString: scala.xml.Elem = <foo><bar type="greet">hi</bar><bar type="count">1</bar><bar type="color">yellow</bar></foo>

This Elem is the same as the one created using the XML literal, as shown by the following test.

scala> foo == fooElemFromString
res7: Boolean = true

See the Scala XML object for other ways to create XML elements, e.g. from InputStreams, Files, etc.

A richer XML example

As a more interesting example of some XML to process, I've created the following short XML string describing artists, albums, and songs, which you can see in the github gist music.xml.

https://gist.github.com/2597611

I haven’t put any special care into this, other than to make sure it has embedded tags, some of which have attributes, and some reasonably interesting content (and some great songs).

You should save this in a file called /tmp/music.xml. Once you’ve done that, you can run the following code, which just prints out each artist, album and song, with an indent for each level.

val musicElem = scala.xml.XML.loadFile("/tmp/music.xml")

(musicElem \ "artist").foreach { artist =>
  println((artist \ "@name").text + "\n")
  (artist \ "album").foreach { album =>
    println("  " + (album \ "@title").text + "\n")
    (album \ "song").foreach { song =>
      println("    " + (song \ "@title").text)
    }
    println()
  }
}

Converting objects to and from XML

One of the use cases for XML is to provide a machine-readable serialization format for objects that can still be easily read, and at times edited, by humans. The process of shuffling objects from memory into a disk-format like XML is called marshalling. We’ve started with some XML, so what we’ll do is define some classes and “unmarshall” the XML into objects of those classes. Put the following into the REPL. (Tip: You can use “:paste” to enter multi-line statements like those below. These will work without paste, but it is necessary to use it in some contexts, e.g. if you define Artist before Song.)

case class Song(val title: String, val length: String) {
  lazy val time = {
    val Array(minutes, seconds) = length.split(":")
    minutes.toInt*60 + seconds.toInt
  }
}

case class Album(val title: String, val songs: Seq[Song], val description: String) {
  lazy val time = songs.map(_.time).sum
  lazy val length = (time / 60)+":"+(time % 60)
}

case class Artist(val name: String, val albums: Seq[Album])

Pretty simple and straightforward. Note the use of lazy vals for defining things like the time (length in seconds) of a song. The reason for this is that if we create a Song object but never ask for its time, then the code needed to compute it from a string like "4:38" is never run; however, if we had left lazy off, then it would be computed when the Song object is created. Also, we don't want to use a def here (i.e. make time a method) because its value is fixed given the length string; using a method would mean recomputing time every time it is asked for on a particular object.
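If the val/lazy val/def distinction is new to you, here is a tiny illustrative class (just for demonstration, not part of the music example) that you can paste into the REPL; the println side effects show exactly when each kind of member is computed.

class Demo {
  val eager = { println("computing eager"); 1 }          // runs when the object is constructed
  lazy val cached = { println("computing cached"); 2 }   // runs once, on first access
  def everyTime = { println("computing everyTime"); 3 }  // runs on every call
}

val d = new Demo   // prints "computing eager"
d.cached           // prints "computing cached"
d.cached           // prints nothing more: the value was cached
d.everyTime        // prints "computing everyTime"
d.everyTime        // prints "computing everyTime" again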

Given the classes above, we can create and use objects from them by hand.

scala> val foobar = Song("Foo Bar", "3:29")
foobar: Song = Song(Foo Bar,3:29)

scala> foobar.time
res0: Int = 209

Using the native Scala XML API

Of course, we’re more interested in constructing Artist, Album, and Song objects from information specified in files like the music example. Though I don’t show the REPL output here, you should enter all of the commands below into it to see what happens.

To start off, make sure you have loaded the file.

val musicElem = scala.xml.XML.loadFile("/tmp/music.xml")

Now we can work with the file to select various elements, or create objects of the classes defined above. Let's start with just Songs. We can ignore all the artists and albums and dig straight in with the \\ operator.

val songs = (musicElem \\ "song").map { song =>
  Song((song \ "@title").text, (song \ "@length").text)
}

scala> songs.map(_.time).sum
res1: Int = 11311

And, we can go all the way and construct Artist, Album and Song objects that directly mirror the data stored in the XML file.

val artists = (musicElem \ "artist").map { artist =>
  val name = (artist \ "@name").text
  val albums = (artist \ "album").map { album =>
    val title = (album \ "@title").text
    val description = (album \ "description").text
    val songList = (album \ "song").map { song =>
      Song((song \ "@title").text, (song \ "@length").text)
    }
    Album(title, songList, description)
  }
  Artist(name, albums)
}

With the artists sequence in hand, we can do things like showing the length of each album.

val albumLengths = artists.flatMap { artist =>
  artist.albums.map(album => (artist.name, album.title, album.length))
}
albumLengths.foreach(println)

Which gives the following output.

(Radiohead,The King of Limbs,37:34)
(Radiohead,OK Computer,53:21)
(Portished,Dummy,48:46)
(Portished,Third,48:50)

Marshalling objects to XML

In addition to constructing objects from XML specifications (also referred to as de-serializing and un-marshalling), it is often necessary to marshal objects one has constructed in code to XML (or other formats). The use of XML literals is actually quite handy in this regard. To see this, let's start with the first song of the first album of the first artist (Bloom, by Radiohead).

scala> val bloom = artists(0).albums(0).songs(0)
bloom: Song = Song(Bloom,5:15)

We can construct an Elem from this as follows.

scala> val bloomXml = <song title={bloom.title} length={bloom.length}/>
bloomXml: scala.xml.Elem = <song length="5:15" title="Bloom"></song>

The thing to note here is that an XML literal is used, but when we want to use values from variables, we can escape from literal-mode with curly brackets. So, {bloom.title} becomes “Bloom”, and so on. In contrast, one could do it via a String as follows.

scala> val bloomXmlString = "<song title=\""+bloom.title+"\" length=\""+bloom.length+"\"/>"
bloomXmlString: java.lang.String = <song title="Bloom" length="5:15"/>

scala> val bloomXmlFromString = scala.xml.XML.loadString(bloomXmlString)
bloomXmlFromString: scala.xml.Elem = <song length="5:15" title="Bloom"></song>

So, the use of literals is a bit more readable (though it comes at the cost of making it hard in Scala to use “<” as an operator for many use cases, which is one of the reasons XML literals are considered by many to be not a great idea).

We can create the whole XML for all of the artists and albums in one fell swoop. Note that one can have XML literals in the escaped bracketed portions of an XML literal, which allows the following to work. Note: you need to use the :paste mode in the REPL in order for this to work.

val marshalled =
  <music>
  { artists.map { artist =>
    <artist name={artist.name}>
    { artist.albums.map { album =>
      <album title={album.title}>
      { album.songs.map(song => <song title={song.title} length={song.length}/>) }
      <description>{album.description}</description>
      </album>
    }}
    </artist>
  }}
</music>

Note that in this case, the for-yield syntax is perhaps a bit more readable since it doesn’t require the extra curly braces.

val marshalledYield =
<music>
  { for (artist <- artists) yield
    <artist name={artist.name}>
    { for (album <- artist.albums) yield
      <album title={album.title}>
      { for (song <- album.songs) yield <song title={song.title} length={song.length}/> }
        <description>{album.description}</description>
      </album>
    }
    </artist>
  }
</music>

One could of course instead add a toXml method to each of the Song, Album, and Artist classes such that at the top level you’d have something like the following.

val marshalledWithToXml =  <music> { artists.map(_.toXml) } </music>

This is a fairly common strategy. However, note that the problem with this solution is that it produces a very tight coupling between the program logic (e.g. of what things like Songs, Albums and Artists can do) with other, orthogonal logic, like serializing them. To see a way of decoupling such different needs, check out Dan Rosen’s excellent tutorial on type classes.
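For a flavor of what that decoupling might look like, here's a minimal sketch under the type-class pattern (my own illustration, not Rosen's exact formulation), assuming the Song class from earlier is in scope. The serialization logic lives in XmlFormat instances rather than in the domain classes:

// Serialization logic lives here, not in Song/Album/Artist.
trait XmlFormat[T] {
  def toXml(value: T): scala.xml.Elem
}

object XmlFormats {
  implicit object SongXmlFormat extends XmlFormat[Song] {
    def toXml(song: Song) = <song title={song.title} length={song.length}/>
  }

  // Works for any T that has an XmlFormat instance in scope.
  def marshal[T](value: T)(implicit format: XmlFormat[T]): scala.xml.Elem =
    format.toXml(value)
}

With this in place, XmlFormats.marshal(bloom) produces the same Elem as bloomXml above, and Song itself never has to know that XML exists; adding formats for Album and Artist, or an entirely different JsonFormat type class, requires no changes to the domain classes.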

Conclusion

The standard Scala XML API comes packaged with Scala, and it is actually quite nice for some basic XML processing. However, it caused some “controversy” in that it was felt by many that the core language has no business providing specialized processing for a format like XML. Also, there are some efficiency issues. Anti-XML is a library that seeks to do a better job of processing XML (especially in being more scalable and more flexible in allowing programmatic editing of XML). As I understand things, Anti-XML may become a sort of official XML processing library in the future, with the current standard XML library being phased out. Nonetheless, many of the ways of interacting with an XML document shown above are similar, so being familiar with the standard Scala XML API provides the core concepts you’ll need for other such libraries.

Copyright 2012 Jason Baldridge

The text of this tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original tutorial.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.

Incorporating and using OpenNLP in Scalabha’s SBT build system

Topics: natural language processing, OpenNLP, SBT, Maven, resources, sentence detection, tokenization, part-of-speech tagging

Introduction

Natural language processing involves a wide range of methods and tasks. However, we usually start with some raw text and begin by demarcating what the sentences and the tokens are. We then go on to further levels of processing, such as predicting part-of-speech tags, syntactic chunks, named entities, syntactic structures, and more.

This tutorial has two goals. First, it shows how to use the OpenNLP Tools as an API for doing sentence detection, tokenization, and part-of-speech tagging. Second, it shows how to add new dependencies and resources to a system like Scalabha and then use those to add new functionality. As prerequisites, see previous tutorials on getting used to working with the SBT build system of Scalabha and adding new code to the existing build system. To see the other tutorials in this series, check out the list on the links page of my Applied Text Analysis course. Of particular relevance is the one on SBT, Scalabha, packages, and build systems.

To do this tutorial, you should be working with Scalabha version 0.2.3. By the end, you should have recreated version 0.2.4, allowing you to check your progress if you run into any problems.

Adding OpenNLP Tools as a dependency

To use OpenNLP’s API, we need to have access to its jar (Java ARchive) files such that our code can compile using classes from the API and then later be executed. It is important at this point to distinguish between explicitly putting a jar file in your build system versus making it available as a managed dependency. To see some explicitly added (unmanaged) dependencies in Scalabha, look at $SCALABHA_DIR/lib.

[sourcecode lang=”bash”]
$ ls $SCALABHA_DIR/lib
Jama-1.0.2.jar          pca_transform-0.7.2.jar
crunch-0.2.0.jar        scrunch-0.1.0.jar
[/sourcecode]

These have been added to the Scalabha repository and are available even before you do any compilation. You can even see them listed in the Scalabha repository on Github.

In contrast, there are many managed dependencies. When you first download Scalabha, you won't see them, but once you compile, you can look in the lib_managed directory and will find it populated:

[sourcecode lang=”bash”]
$ ls $SCALABHA_DIR/lib_managed
bundles jars    poms
[/sourcecode]

You can go looking into the jars sub-directory to see some of the jars that have been brought in.

To see where these came from, look in the file $SCALABHA_DIR/build.sbt, which declares much of the information that the SBT program needs in order to build the Scalabha system. The dependencies are given in the following declaration.

[sourcecode lang=”scala”]
libraryDependencies ++= Seq(
  "org.apache.opennlp" % "opennlp-maxent" % "3.0.2-incubating",
  "org.clapper" %% "argot" % "0.3.8",
  "org.apache.commons" % "commons-lang3" % "3.0.1",
  "commons-logging" % "commons-logging" % "1.1.1",
  "log4j" % "log4j" % "1.2.16",
  "org.scalatest" % "scalatest_2.9.0" % "1.6.1" % "test",
  "junit" % "junit" % "4.10" % "test",
  "com.novocode" % "junit-interface" % "0.6" % "test->default") //switch to ScalaTest at some point...
[/sourcecode]

Notice that the OpenNLP Maxent toolkit is in there (along with others), but not the OpenNLP Tools. The Maxent toolkit is used by the OpenNLP Tools (and is part of the same software group/effort), but it can be used independently of it. For example, it is used for the classification homework for the Applied Text Analysis class I’m teaching this semester, which is in fact why the dependency is already in Scalabha v0.2.3.

So, how does one know to write the following to get the OpenNLP Maxent Toolkit as a dependency?

[sourcecode lang=”scala”]
"org.apache.opennlp" % "opennlp-maxent" % "3.0.2-incubating",
[/sourcecode]

I'm not going to go into lots of detail on this, but basically this is what is known as a Maven dependency. On the OpenNLP home page, there is a page for the OpenNLP Maven dependency. Look on that page for where it defines the OpenNLP Maxent dependency, repeated here.

[sourcecode lang=”xml”]
<dependency>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp-maxent</artifactId>
  <version>3.0.2-incubating</version>
</dependency>
[/sourcecode]

The group ID indicates the organization that is responsible for the artifact (e.g. a given organization can have many different systems that it develops and deploys in this manner). The artifact ID is the name of that particular artifact to distinguish it from others by the same organization, and the version is obviously the particular version number of that artifact. (This makes it possible to use older versions as and when needed.)

The XML above is what one needs if one is using the Maven build system, which many Java projects use. SBT is compatible with such dependencies, but the terser format given above is used instead of XML.

We now want to add the OpenNLP Tools as a dependency. From the OpenNLP dependencies page we see that it is declared this way in XML.

[sourcecode lang=”xml”]
<dependency>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp-tools</artifactId>
  <version>1.5.2-incubating</version>
</dependency>
[/sourcecode]

That means we just need to add the following line to build.sbt in the libraryDependencies declaration.

[sourcecode lang=”scala”]
"org.apache.opennlp" % "opennlp-tools" % "1.5.2-incubating",
[/sourcecode]

And, we can remove the Maxent declaration because OpenNLP Tools depends on it (though it isn’t necessarily a problem if it stays in Scalabha’s build.sbt). The library dependencies should now look as follows.

[sourcecode lang=”scala”]
libraryDependencies ++= Seq(
  "org.apache.opennlp" % "opennlp-tools" % "1.5.2-incubating",
  "org.clapper" %% "argot" % "0.3.8",
  "org.apache.commons" % "commons-lang3" % "3.0.1",
  "commons-logging" % "commons-logging" % "1.1.1",
  "log4j" % "log4j" % "1.2.16",
  "org.scalatest" % "scalatest_2.9.0" % "1.6.1" % "test",
  "junit" % "junit" % "4.10" % "test",
  "com.novocode" % "junit-interface" % "0.6" % "test->default") //switch to ScalaTest at some point...
[/sourcecode]

The next time you run scalabha build, SBT will read the new dependency declaration and retrieve the dependency. At this point, you might say “What?” How is that sufficient to get the required jars? Here’s how, briefly and at a high level. The OpenNLP artifacts are available on the Maven2 site, and SBT already knows to look there. Put simply, it knows to check this site:

http://repo1.maven.org/maven2

And given that the organization is org.apache.opennlp it knows to then look in this directory:

http://repo1.maven.org/maven2/org/apache/opennlp/

Given that we want the opennlp-tools artifact, it looks here:

http://repo1.maven.org/maven2/org/apache/opennlp/opennlp-tools/

And finally, given that we want the version 1.5.2-incubating, it looks here:

http://repo1.maven.org/maven2/org/apache/opennlp/opennlp-tools/1.5.2-incubating/

In that directory are all the files that SBT needs to pull down to your local machine, plus information about any dependencies of OpenNLP Tools that it needs to grab. Here is the main jar:

http://repo1.maven.org/maven2/org/apache/opennlp/opennlp-tools/1.5.2-incubating/opennlp-tools-1.5.2-incubating.jar

And here is the POM (“Project Object Model”), for OpenNLP Tools:

http://repo1.maven.org/maven2/org/apache/opennlp/opennlp-tools/1.5.2-incubating/opennlp-tools-1.5.2-incubating.pom

Notice that it includes a reference to OpenNLP Maxent in it, which is why Scalabha’s build.sbt no longer needs to include it explicitly. In fact, it is better to not have it in Scalabha’s build.sbt so that we ensure that the version used by OpenNLP Tools is the one we are using (which matters when we update to, say, a later version of OpenNLP Tools).

In many cases, such artifacts are not hosted at repo1.maven.org. In such cases, you must add a “resolver” that points to another site that contains artifacts. This is done by adding to the resolvers declaration, which is shown here for Scalabha v0.2.3.

[sourcecode lang=”scala”]
resolvers ++= Seq(
  "Cloudera Hadoop Releases" at "https://repository.cloudera.com/content/repositories/releases/",
  "Thrift location" at "http://people.apache.org/~rawson/repo/"
)
[/sourcecode]

So, when dependencies are declared, SBT will also search through those locations, in addition to its defaults, to find them and pull them down to your machine. As it turns out, OpenNLP has a dependency on the Java WordNet Library, which is hosted on a non-standard Maven repository (which is associated with OpenNLP’s old development site on Sourceforge). You should update build.sbt to be the following:

[sourcecode lang=”scala”]
resolvers ++= Seq(
  "Cloudera Hadoop Releases" at "https://repository.cloudera.com/content/repositories/releases/",
  "Thrift location" at "http://people.apache.org/~rawson/repo/",
  "opennlp sourceforge repo" at "http://opennlp.sourceforge.net/maven2"
)
[/sourcecode]

That was a lot of description, but note that it was a simple change to build.sbt and now we can use the OpenNLP Tools API.

Tip: if you already had SBT running (e.g. via scalabha build) then you must use the reload command at the SBT command after you change build.sbt in order for SBT to know about the changes.

What do you do if the library you want to use isn’t available as a Maven artifact? In that case, you need to put the jar (or jars) for that library, plus any jars it depends on, into the $SCALABHA_DIR/lib directory. Then SBT will see that they are there and add them to your classpath, enabling you to use them just as if they were a managed dependency. The downside is that you must put it there explicitly, which means a bit more hassle when you want to update to later versions, and a fair amount more hassle if that library has lots of dependencies that you also need to manage.

Obtaining and installing the OpenNLP sentence detector model

Now on to the processing of language. Sentence detection simply refers to the basic process of taking a text and identifying the character positions that indicate sentence breaks. As a running example, we’ll use the first several sentences from the Penn Treebank.

Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.

Note that the "." character is not a reliable indicator of the end of a sentence. While one can build a regular-expression-based sentence detector, machine-learned models are typically used to figure this out, based on a reasonable number of example sentences identified as such by a human.

Roughly and somewhat crudely speaking, a machine learned model is a set of features that are associated with real-valued weights which have been determined from some training material. Once these weights have been learned, the model can be saved and reused (e.g. see the classification homework for Applied Text Analysis).

OpenNLP has pretrained models available for several NLP tasks, including sentence detection. Note also that there is an effort I’m heading to make it possible to distribute and, where possible, rebuild models — see the OpenNLP Models Github repository.

We want to do English sentence detection, so the model we need right now is the en | Sentence Detector. Rather than putting it in some random place on your computer, we’ll add it as part of the Scalabha build system and exploit this to simplify the loading of models (more on this later). Recall that the $SCALABHA_DIR/src/main/scala directory is where the actual code of Scalabha is kept (and is also where you can add additional code to do your own tasks, as covered in the previous tutorials). If you look at the $SCALABHA_DIR/src/main directory, you’ll see an additional resources directory. Go there and list the directory contents:

[sourcecode lang=”bash”]
$ cd $SCALABHA_DIR/src/main/resources
$ ls
log4j.properties
[/sourcecode]

All that is there now is a properties file that defines default logging behavior (which is a good way to output debugging information, e.g. as it is done in the opennlp.scalabha.cluster package used in the clustering homework of Applied Text Analysis). What is very nice about the resources directory is that any files in it are accessible in the classpath of the application we are building. That won’t make total sense right away, but it will be clear as we go along — the end result is that it simplifies a number of things a great deal, so bear with me.

What we are going to do now is place the sentence detector model in a subdirectory of resources that will give us access to it, and also organize things for future additions (wrt languages and systems). So, do the following:

[sourcecode lang=”bash”]
$ cd $SCALABHA_DIR/src/main/resources
$ mkdir -p lang/eng/opennlp
$ cd lang/eng/opennlp/
$ wget http://opennlp.sourceforge.net/models-1.5/en-sent.bin
--2012-04-10 12:24:42--  http://opennlp.sourceforge.net/models-1.5/en-sent.bin
Resolving opennlp.sourceforge.net... 216.34.181.96
Connecting to opennlp.sourceforge.net|216.34.181.96|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 98533 (96K) [application/octet-stream]
Saving to: `en-sent.bin'

100%[======================================>] 98,533       411K/s   in 0.2s

2012-04-10 12:24:43 (411 KB/s) - `en-sent.bin' saved [98533/98533]
[/sourcecode]

Note: the last command uses the program wget, which may not be available on your machine. If that is the case, you can download en-sent.bin in your browser (using the link given after wget above) and move it to the directory $SCALABHA_DIR/src/main/resources/lang/eng/opennlp. (Better yet, install wget since it is so useful…)

Status check: you should now see en-sent.bin when you do the following:

[sourcecode lang=”bash”]
$ ls $SCALABHA_DIR/src/main/resources/lang/eng/opennlp
en-sent.bin
[/sourcecode]

Using the sentence detector

Let's now use the model! That requires creating an example application that will read in the model, construct a sentence detector object from it, and then apply it to some example text. Do the following:

[sourcecode lang=”bash”]
$ touch $SCALABHA_DIR/src/main/scala/opennlp/scalabha/tag/OpenNlpTagger.scala
[/sourcecode]

This creates an empty file at that location that you should now open in a text editor. Add the following Scala code (to be explained) to that file:

[sourcecode lang=”scala”]
package opennlp.scalabha.tag

object OpenNlpTagger {

  import opennlp.tools.sentdetect.SentenceDetectorME
  import opennlp.tools.sentdetect.SentenceModel

  lazy val sentenceDetector =
    new SentenceDetectorME(
      new SentenceModel(
        this.getClass.getResourceAsStream("/lang/eng/opennlp/en-sent.bin")))

  def main(args: Array[String]) {
    val test = io.Source.fromFile(args(0)).mkString
    sentenceDetector.sentDetect(test).foreach(println)
  }

}
[/sourcecode]

Here are the relevant bits of explanation needed to understand what is going on. We need to import the SentenceDetectorME and SentenceModel classes (you should verify that you can find them in the OpenNLP API). The former is a class for sentence detectors that are based on trained maximum entropy models, and the latter is for holding such models. We then must create our sentence detector. This is where we get the advantage of having put it into the resources directory of Scalabha. We obtain it by getting the Class of the object (via this.getClass) and then using the getResourceAsStream method of the Class class. That’s a bit meta, but it boils down to enabling you to just follow this recipe for getting the resource. The return value of getResourceAsStream is an InputStream, which is what is needed to construct a SentenceModel.

Once we have a SentenceModel, that can be used to create a SentenceDetectorME. Note that the sentenceDetector object is declared as a lazy val. By doing this, the model is only loaded when we need it. For a small program like this one, this doesn't matter much, but in a larger system with many components, using lazy vals allows the application to get fired up much more quickly and then load things like models on demand. (You'll actually see a nice, concrete example of this by the end of the tutorial.) In general, using lazy vals is a good idea.

We then just need to get some text and use the sentence detector. The application gets a file name from the command line and then reads in its contents. The sentence detector has a method sentDetect (see the API) that takes a String and returns an Array[String], where each element of the Array is a sentence. So, we run sentDetect on the input text and then print out each line.

Once you have added the above code to OpenNlpTagger.scala, you should compile in SBT (I recommend using ~compile so that it compiles every time you make a change). Then, do the following:

[sourcecode lang=”bash”]
$ cd /tmp
$ echo "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate." > vinken.txt
$ scalabha run opennlp.scalabha.tag.OpenNlpTagger vinken.txt
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.
[/sourcecode]

So, the model does perfectly on these sentences (but don’t expect it to do quite so well on other domains, such as Twitter). We are now ready to do the next step of splitting up the characters in each sentence into tokens.

Tokenizing

Once we have identified the sentences, we need to tokenize them to turn them into a sequence of tokens where each token is a symbol or word (conforming to some predefined notion of what is a “word”). For example, the tokens for the first sentence of the running example are the following, where a token is indicated via space:

Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .

Most NLP tools then build on these units.

To enable tokenization, we must first make the English tokenizer available as a resource in Scalabha.

[sourcecode lang=”bash”]
$ cd $SCALABHA_DIR/src/main/resources/lang/eng/opennlp
$ wget http://opennlp.sourceforge.net/models-1.5/en-token.bin
--2012-04-10 14:21:14--  http://opennlp.sourceforge.net/models-1.5/en-token.bin
Resolving opennlp.sourceforge.net... 216.34.181.96
Connecting to opennlp.sourceforge.net|216.34.181.96|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 439890 (430K) [application/octet-stream]
Saving to: `en-token.bin'

100%[========================================================================>] 439,890      592K/s   in 0.7s

2012-04-10 14:21:16 (592 KB/s) - `en-token.bin' saved [439890/439890]
[/sourcecode]

Then, change OpenNlpTagger.scala to have the following contents.

[sourcecode lang=”scala”]
package opennlp.scalabha.tag

object OpenNlpTagger {
import opennlp.tools.sentdetect.SentenceDetectorME
import opennlp.tools.sentdetect.SentenceModel
import opennlp.tools.tokenize.TokenizerME
import opennlp.tools.tokenize.TokenizerModel

lazy val sentenceDetector =
new SentenceDetectorME(
new SentenceModel(
this.getClass.getResourceAsStream("/lang/eng/opennlp/en-sent.bin")))

lazy val tokenizer =
new TokenizerME(
new TokenizerModel(
this.getClass.getResourceAsStream("/lang/eng/opennlp/en-token.bin")))

def main (args: Array[String]) {
val test = io.Source.fromFile(args(0)).mkString
val sentences = sentenceDetector.sentDetect(test)
val tokenizedSentences = sentences.map(tokenizer.tokenize(_))
tokenizedSentences.foreach(tokens => println(tokens.mkString(" ")))
}

}
[/sourcecode]

The process is very similar to what was done for the sentence detector. The only difference is that we now use the tokenizer’s tokenize method on each sentence. This method returns an Array[String], where each element is a token. We thus map the Array[String] of sentences to the Array[Array[String]] of tokenizedSentences. Simple!

Make sure to test that everything is working.

[sourcecode lang=”bash”]
$ cd /tmp
$ scalabha run opennlp.scalabha.tag.OpenNlpTagger vinken.txt
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .
Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a director of this British industrial conglomerate .
[/sourcecode]

Now that we have these tokens, the input is ready for part-of-speech tagging.

Part-of-speech tagging

Part-of-speech (POS) tagging involves identifying whether each token is a noun, verb, determiner and so on. Some part-of-speech tag sets have more detail, such as NN for a singular noun and NNS for a plural one. See the previous tutorial on iteration for more details and pointers.

The OpenNLP POS tagger is trained on the Penn Treebank, so it uses that tagset. As with the other models, we must download it and place it in the resources directory.

[sourcecode lang=”bash”]
$ cd $SCALABHA_DIR/src/main/resources/lang/eng/opennlp
$ wget http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin
--2012-04-10 14:31:33--  http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin
Resolving opennlp.sourceforge.net... 216.34.181.96
Connecting to opennlp.sourceforge.net|216.34.181.96|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5696197 (5.4M) [application/octet-stream]
Saving to: `en-pos-maxent.bin'

100%[========================================================================>] 5,696,197    671K/s   in 8.2s

2012-04-10 14:31:42 (681 KB/s) - `en-pos-maxent.bin' saved [5696197/5696197]
[/sourcecode]

Then, update OpenNlpTagger.scala to have the following contents, which add some output beyond what you saw the previous times.

[sourcecode lang=”scala”]
package opennlp.scalabha.tag

object OpenNlpTagger {

import opennlp.tools.sentdetect.SentenceDetectorME
import opennlp.tools.sentdetect.SentenceModel
import opennlp.tools.tokenize.TokenizerME
import opennlp.tools.tokenize.TokenizerModel
import opennlp.tools.postag.POSTaggerME
import opennlp.tools.postag.POSModel

lazy val sentenceDetector =
new SentenceDetectorME(
new SentenceModel(
this.getClass.getResourceAsStream("/lang/eng/opennlp/en-sent.bin")))

lazy val tokenizer =
new TokenizerME(
new TokenizerModel(
this.getClass.getResourceAsStream("/lang/eng/opennlp/en-token.bin")))

lazy val tagger =
new POSTaggerME(
new POSModel(
this.getClass.getResourceAsStream("/lang/eng/opennlp/en-pos-maxent.bin")))

def main (args: Array[String]) {

val test = io.Source.fromFile(args(0)).mkString

println("n*********************")
println("Showing sentences.")
println("*********************")
val sentences = sentenceDetector.sentDetect(test)
sentences.foreach(println)

println("n*********************")
println("Showing tokens.")
println("*********************")
val tokenizedSentences = sentences.map(tokenizer.tokenize(_))
tokenizedSentences.foreach(tokens => println(tokens.mkString(" ")))

println("n*********************")
println("Showing POS.")
println("*********************")
val postaggedSentences = tokenizedSentences.map(tagger.tag(_))
postaggedSentences.foreach(postags => println(postags.mkString(" ")))

println("n*********************")
println("Zipping tokens and tags.")
println("*********************")
val tokposSentences =
tokenizedSentences.zip(postaggedSentences).map { case(tokens, postags) =>
tokens.zip(postags).map { case(tok,pos) => tok + "/" + pos }
}
tokposSentences.foreach(tokposSentence => println(tokposSentence.mkString(" ")))

}

}
[/sourcecode]

Everything is as before, so it should be pretty much self-explanatory. Just note that the tagger’s tag method takes a token sequence (Array[String], written as String[] in OpenNLP’s Javadoc) as its input and it returns an Array[String] of the tags for each token. Thus, when we output the postaggedSentences in the “Showing POS” part, it prints only the tags. We can then bring the tokens and their corresponding tags together by zipping the tokenizedSentences with the postaggedSentences and then zipping the word and POS tokens in each sentence together, as shown in the “Zipping tokens and tags” portion.

When this is run, you should get the following output.

[sourcecode lang=”bash”]
$ cd /tmp
$ scalabha run opennlp.scalabha.tag.OpenNlpTagger vinken.txt

*********************
Showing sentences.
*********************
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.

*********************
Showing tokens.
*********************
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .
Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a director of this British industrial conglomerate .

*********************
Showing POS.
*********************
NNP NNP , CD NNS JJ , MD VB DT NN IN DT JJ NN NNP CD .
NNP NNP VBZ NN IN NNP NNP , DT JJ NN NN .
NNP NNP , CD NNS JJ CC JJ NN IN NNP NNP NNP NNP , VBD VBN DT NN IN DT JJ JJ NN .

*********************
Zipping tokens and tags.
*********************
Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
Mr./NNP Vinken/NNP is/VBZ chairman/NN of/IN Elsevier/NNP N.V./NNP ,/, the/DT Dutch/JJ publishing/NN group/NN ./.
Rudolph/NNP Agnew/NNP ,/, 55/CD years/NNS old/JJ and/CC former/JJ chairman/NN of/IN Consolidated/NNP Gold/NNP Fields/NNP PLC/NNP ,/, was/VBD named/VBN a/DT director/NN of/IN this/DT British/JJ industrial/JJ conglomerate/NN ./.
[/sourcecode]

Note: You’ll probably notice a pause just after it says “Showing POS” — that is because the tagger is defined as a lazy val, so the model is loaded at that time since it is the first point where it is needed. Try removing “lazy” from the declarations of sentenceDetector, tokenizer, and tagger, recompiling and then running it again — you’ll now see that the pause before anything is done is greater, but that once it starts processing everything goes very quickly. That’s a fairly good way of seeing part of why lazy values are quite handy.

And that’s it. To see the output on a longer example, you can run it on any text you like, e.g. the ones in Scalabha’s data directory, like the Federalist Papers:

[sourcecode lang=”bash”]
$ scalabha run opennlp.scalabha.tag.OpenNlpTagger $SCALABHA_DIR/data/cluster/federalist/federalist.txt
[/sourcecode]

Now as an exercise, turn the standalone application, defined as the object OpenNlpTagger, into a class, OpenNlpTagger, that takes a raw text as input (not via the command line, but as an argument to a method) and returns a List[List[(String,String)]] that contains, for each sentence, a sequence of (token,tag) pairs. For example, after running it on the Vinken text, you should produce the following.

[sourcecode lang=”scala”]
List(List((Pierre,NNP), (Vinken,NNP), (,,,), (61,CD), (years,NNS), (old,JJ), (,,,), (will,MD), (join,VB), (the,DT), (board,NN), (as,IN), (a,DT), (nonexecutive,JJ), (director,NN), (Nov.,NNP), (29,CD), (.,.)), List((Mr.,NNP), (Vinken,NNP), (is,VBZ), (chairman,NN), (of,IN), (Elsevier,NNP), (N.V.,NNP), (,,,), (the,DT), (Dutch,JJ), (publishing,NN), (group,NN), (.,.)), List((Rudolph,NNP), (Agnew,NNP), (,,,), (55,CD), (years,NNS), (old,JJ), (and,CC), (former,JJ), (chairman,NN), (of,IN), (Consolidated,NNP), (Gold,NNP), (Fields,NNP), (PLC,NNP), (,,,), (was,VBD), (named,VBN), (a,DT), (director,NN), (of,IN), (this,DT), (British,JJ), (industrial,JJ), (conglomerate,NN), (.,.)))
[/sourcecode]

Spans

You may notice that the sentence detector and tokenizer APIs both include methods that return Array[Span] (note: Span[] in OpenNLP’s Javadoc). These are preferable in many contexts since they don’t lose information from the original text, unlike the methods we used above, which turn the original text into sequences of substrings of the original. Spans just record the character offsets at which the sentences start and end, or at which tokens start and end. This is quite handy for further processing and is what is generally used in non-trivial applications. But, for many cases, the methods that return Array[String] will be just fine and require learning a bit less.
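
Here’s a minimal sketch of the Span-based route, reusing the sentenceDetector from above: sentPosDetect returns character offsets, from which the covered text can always be recovered.

[sourcecode lang=”scala”]
// A sketch using Spans: sentPosDetect gives offsets rather than substrings.
val text = io.Source.fromFile("vinken.txt").mkString
val sentenceSpans = sentenceDetector.sentPosDetect(text)
sentenceSpans.foreach { span =>
println("(" + span.getStart + "," + span.getEnd + ") " + text.substring(span.getStart, span.getEnd))
}
[/sourcecode]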

Conclusion

This tutorial has taken you from a version of Scalabha that does not have the OpenNLP Tools API available to a version which does have it and also has several pretrained models available and an example application to use the API for part-of-speech tagging. You can of course follow similar recipes for bringing in other libraries and using them in your code, so this setup gives you a lot of power and is easy to use once you’ve done it a few times. If you have any trouble, or want to check it against a definitely working version, get Scalabha v0.2.4, which differs from v0.2.3 primarily only with respect to this tutorial.

A final note: you may be wondering what the heck OpenNLP is, given that Scalabha’s packages start with opennlp.scalabha, but we were adding the OpenNLP Tools as a dependency. Basically, Gann Bierner and I started OpenNLP in 1999, and part of the goal of that was to provide a high-level organizational domain name so that we could ensure uniqueness in package names. So, we have opennlp.tools, opennlp.maxent, opennlp.scalabha, and there are others. These are thus clearly different, in terms of their unique package names, from foo.tools, foo.maxent, and so on. So, when I started Scalabha, I used opennlp.scalabha (though in all likelihood, no one else would pick scalabha as the top level of a package name). Nonetheless, when one speaks of OpenNLP generally, it usually refers to the OpenNLP Tools, the first of the projects to be in the OpenNLP “family”.

Copyright 2012 Jason Baldridge

The text of this tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original tutorial.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.

Student Questions about Scala, Part 2

Topics: toMap, accessing directory contents, calling R from Scala, Java/Scala comparisons and interop, supporting libraries, object-oriented + functional programming, NLP and Scala

Preface

This is the second post answering questions from students in my course on Applied Text Analysis. You can see the first one here. This post generally covers higher level questions, starting off with one basic question that didn’t make it into the first post.

Basic Question

Q. When I was working with Maps for the homework and tried to turn a List[List[Int]] into a map, I often got the error message that Scala “cannot prove that Int <:< (T, U)”. What does that mean?

A. So, you were trying to do the following.

[sourcecode lang=”scala”]
scala> val foo = List(List(1,2),List(3,4))
foo: List[List[Int]] = List(List(1, 2), List(3, 4))

scala> foo.toMap
<console>:9: error: Cannot prove that List[Int] <:< (T, U).
foo.toMap
^
[/sourcecode]

This happens because toMap needs each element of the collection to be a pair (a Tuple2), and here each element is a List instead. The problem is easier to see at the level of a single two-element list.

[sourcecode lang=”scala”]
scala> List(1,2).toMap
<console>:8: error: Cannot prove that Int <:< (T, U).
List(1,2).toMap
^
[/sourcecode]

So, you need to convert each two-element list to a tuple, and then you can call toMap on the list of tuples.

[sourcecode lang=”scala”]
scala> foo.map{case List(a,b)=>(a,b)}.toMap
<console>:9: warning: match is not exhaustive!
missing combination            Nil

foo.map{case List(a,b)=>(a,b)}.toMap
^
res3: scala.collection.immutable.Map[Int,Int] = Map(1 -> 2, 3 -> 4)
[/sourcecode]

You can avoid the warning messages by flatMapping (which is safer anyway).

[sourcecode lang=”scala”]
scala> foo.flatMap{case List(a,b)=>Some(a,b); case _ => None}.toMap
res4: scala.collection.immutable.Map[Int,Int] = Map(1 -> 2, 3 -> 4)
[/sourcecode]

If you need to do this sort of thing a lot, you could use implicits to make the conversion from two-element Lists into Tuples, as discussed in the previous post about student questions.

File system access

Q. How can I make a script or program pull in every file (or every file in a certain format) from a directory that is given as a command line argument and perform operations on it?

A. Easy. Let’s say you have a directory example_dir with the following files.

[sourcecode lang=”bash”]
$ ls example_dir/
file1.txt      file2.txt      file3.txt      program1.scala program2.scala program3.py    program4.py
[/sourcecode]

I created these with some simple contents. Here’s a bash command that will print out each file and its contents so you can recreate them (and also see a handy command line for loop).

[sourcecode lang=”bash”]
$ for i in `ls example_dir`; do echo "File: $i"; cat example_dir/$i; echo; done
File: file1.txt
Hello.

File: file2.txt
Nice to meet you.

File: file3.txt
Goodbye.

File: program1.scala
println("Hello.")

File: program2.scala
println("Goodbye.")

File: program3.py
print("Hello.")

File: program4.py
print("Goodbye.")
[/sourcecode]

So, here’s how we can do the same using Scala. In the same directory that contains example_dir, save the following as ListDir.scala.

[sourcecode lang=”scala”]
val mydir = new java.io.File(args(0))
val allfiles = mydir.listFiles
val contents = allfiles.map { file => io.Source.fromFile(file).mkString }

allfiles.zip(contents).foreach { case(file,content) =>
println("File: " + file.getName)
println(content)
}
[/sourcecode]

You can now run it as scala ListDir.scala example_dir.

If you want to look at only files of a particular type, use filter on the list of files returned by mydir.listFiles. For example, the following gets the Scala files and prints their names.

[sourcecode lang=”scala”]
val scalaFiles = mydir.listFiles.filter(_.getName.endsWith(".scala"))
println(scalaFiles.mkString("\n"))
[/sourcecode]

As an exercise, now consider what you would need to do to recursively explore a directory that contains subdirectories and list the contents of all the files in it. Tip: you’ll need to use the isDirectory() method of java.io.File.

Q. Is it possible to run an R program within a Scala program? Like writing a Scala program that performs R operations using R. If so, how? Are there directory requirements of some sort?

A. Though I haven’t used them, you could look at the JRI (Java-R Interface) or RCaller.

For some simple things, you can always take the strategy of saving some data to a file, calling an R program that processes that file and produces some output in one or more files, which you then read back into Scala. This is useful for other things you might want to do, including invoking arbitrary applications to compute and output some values based on data created by your program.

Here’s an example of doing something like this. Save the following as something like CallR.scala, and then run scala CallR.scala. It assumes you have R installed.

[sourcecode lang=”scala”]
import java.io._

val data = List((4,1000), (3,1500), (2,1500), (2,6000), (1,14000), (0,18000))

val outputFilename = "vague.dat"
val bwriter = new BufferedWriter(new FileWriter(outputFilename))

val dataLine = data.map {
case(numAdjectives, price) => "c("+numAdjectives+","+price+")"
}.mkString(",")

bwriter.write(
"""data = rbind(""" + dataLine + ")" + "\n" +
"""pdf("vague_lm.pdf")""" + "\n" +
"""plot(data)""" + "\n" +
"""data.lm = lm(data[,2] ~ data[,1])""" + "\n" +
"""abline(data.lm)""" + "\n" +
"""dev.off()""" + "\n")
bwriter.flush
bwriter.close

val command = List("R", "-f", outputFilename)
scala.sys.process.stringSeqToProcess(command).lines.foreach(println)
[/sourcecode]

It takes a set of points as a Scala List[(Int,Int)] and creates a set of R commands to plot the points, fit a linear regression model to the points, plot the regression line, and then output a PDF. I took the particular set of points used here from the example in Jurafsky and Martin in the chapter on maximum entropy (multinomial logistic regression), which is based on a study of how vague adjectives in a house listing affect its purchase price. For example, houses that had four vague adjectives in their listing sold for $1000 over their list price, while ones with one vague adjective sold for $14,000 over list price (read the book Freakonomics for some fascinating discussion of this).

Here’s the R code that is produced.

[sourcecode lang=”r”]
data = rbind(c(4,1000),c(3,1500),c(2,1500),c(2,6000),c(1,14000),c(0,18000))
pdf("vague_lm.pdf")
plot(data)
data.lm = lm(data[,2] ~ data[,1])
abline(data.lm)
dev.off()
[/sourcecode]

Here is the image produced in vague_lm.pdf.

To recap, the basic logic of this process is the following.

  1. Have or create some set of points in Scala (which, in a real use case, would come from some computation you ran and now need R to complete).
  2. Use this data to create an R script programmatically using Scala code.
  3. Run the R script using scala.sys.process.

You could also have the R script output text information to a file which you could then read back into Scala and parse to get your results.
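
Here’s a minimal sketch of that read-back step, assuming the (hypothetical) R script wrote one whitespace-separated pair of numbers per line to a file named results.txt:

[sourcecode lang=”scala”]
// Assumption: the R script wrote lines like "1.0 2.5" to results.txt.
val results = io.Source.fromFile("results.txt").getLines.map { line =>
val Array(x, y) = line.split("\\s+")
(x.toDouble, y.toDouble)
}.toList
[/sourcecode]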

Note that this is not necessarily the most robust way to do this in general, but it does demonstrate a way to do things like calling system commands from within a Scala program.

Another alternative is to look at frameworks like ScalaLab, which aims to support a Matlab-like environment for Scala. It’s on my stack of things to look at, and it would allow one to use Scala to directly do much of what one would want to call out to R and other such languages for.

High level questions

Q. Since Scala runs on the JVM, can we conclude that anything that was written in Scala can be written in Java (with loss of performance and maybe with lengthier code)?

A. For any two sufficiently expressive languages X and Y, one can write anything in X using Y and vice versa. So, yes. However, in terms of the ease of doing this, it is very easy to translate Java code to Scala, since the latter supports mutable, imperative programming of the kind usually done in Java. If you have Scala code that is functional in nature, it will be much harder to translate easily to Java (though it can of course be done).

Efficiency is a different question. Sometimes the functional style can be less efficient (especially if you are limiting yourself to a single machine), so at times it can be advantageous to use while loops and the like. However, for most cases, efficiency of programmer time matters more than efficiency of running time, so quickly putting together a solution using functional means and then optimizing it later — even at the “cost” of being less functional — is, in my mind, the right way to go. Josh Suereth has a nice blog post about this, Macro vs Micro Optimization, highlighting his experiences at Google.
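
To make the trade-off concrete, here’s a small illustrative sketch of the same computation in both styles:

[sourcecode lang=”scala”]
// Functional style: concise and clear.
def sumOfSquares(xs: List[Int]) = xs.map(x => x * x).sum

// Imperative style: more code, but avoids intermediate collections.
def sumOfSquaresImperative(xs: Array[Int]) = {
var total = 0
var i = 0
while (i < xs.length) {
total += xs(i) * xs(i)
i += 1
}
total
}
[/sourcecode]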

Compared to Scala, the amount of code written will almost always be longer in Java, due both to the large amount of boilerplate code and to the higher-level nature of functional programming. I find that Scala programs (written in idiomatic, functional style) converted from Java are generally 1/4th to 1/3rd the number of characters of their Java counterparts. Going from Python to Scala also tends to produce less lengthy code, perhaps 3/4ths to 5/6ths or so in my experience. (Though this depends a great deal on what kind of Scala style you are using, functional or imperative or a mix).

Q. Scala seems to be relatively new — so, does it have supporting libraries for common tasks in NLP, like good JSON/XML parsers that you know of?

A. Sure. Basically anything that has been written for the JVM is quite straightforward to use with Scala. For natural language processing, we’ll be using the Apache OpenNLP library (which I and Gann Bierner began in 1999 while at the University of Edinburgh), but you can also use other toolkits like the Stanford NLP software, Mallet, Weka, and others. In fact, using Scala often makes it much easier to use these toolkits. There are also Scala specific toolkits that are beginning to appear, including Factorie, ScalaNLP, and Scalabha (which we are using in the class).

Scala has native XML support that I find pretty handy, though others wish it weren’t in the language. It is covered in most of the books on Scala, and Dan Spiewak has a nice blog post on it: Working with Scala’s XML Support.
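
For example, XML literals and the \ selector work directly in the REPL:

[sourcecode lang=”scala”]
scala> val doc = <person><name>Pierre Vinken</name><age>61</age></person>
doc: scala.xml.Elem = <person><name>Pierre Vinken</name><age>61</age></person>

scala> (doc \ "name").text
res0: String = Pierre Vinken
[/sourcecode]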

The native JSON support isn’t great, but Java libraries for JSON work just fine.

Q. General question/comment: Scala lies in the region between object-oriented and functional programming languages. My question is: why? Is it because it makes coding a lot simpler and reduces the number of lines? In that case, I guess Python achieves this goal reasonably well, and it has a rich library for processing strings. I am able to appreciate certain things, and the ease of getting things done in Scala, but I am not exactly sure why this was even introduced, and in a somewhat non-standard way at that (such a mixture of the OOP and functional programming paradigms is the first that I have heard of).

A. I’ll defer to Odersky, the creator of Scala. This is from his blog post “Why Scala?“:

Scala took a risk in that, before it came out, the object-oriented and functional approaches to programming were largely disjoint; even today the two communities are still sometimes antagonistic to each other. But what the team and I have learned in our daily programming practice since then has fully confirmed our initial hopes. Objects and functions play extremely well together; they enable new, expressive programming styles which lend themselves to high-level domain modeling and embedded domain-specific languages. Whether it’s log-analysis at NASA, contract modelling at EDF, or risk analysis at many of the largest financial institutions, Scala-based DSLs seem to spring up everywhere these days.


Q. Do you see any distinct advantage of using Scala for NLP-related stuff? I know this is not a very specific question, but it would be great if you continue highlighting the difference between scala and other languages (like Java, Python) so that our understanding becomes clearer and clearer with more examples.

A. In many ways, such questions are a matter of personal taste. I used Python and Java before I switched to primarily using Scala. I liked Python for rapid prototyping, and Java for large-scale system development. I find Scala to be as good, or better, for prototyping than Python, and it is every bit as good, or better, than Java for large-scale development. Now, I can use a single language — Scala — for most development. The exception is that I still use R for plotting data sets and also doing certain statistical analyses. The transition from Java to Scala was straightforward, and I went from writing Java-as-Scala to a more and more functional style as I got more comfortable with the language. The resulting code is far better designed, making it more robust, more extensible, and more fun.

Specifically with respect to NLP, a definite advantage of Scala is that, as mentioned previously, it is really easy to use existing Java libraries (or any JVM library, for that matter). Another is that a more functional style makes it easier to transition (in terms of both thinking and actual coding) to certain kinds of distributed computing architectures, such as MapReduce. As a really interesting example of Scala and distributed computing, check out Spark. With so much of text analytics being performed on massive datasets, this capability has become increasingly important. Another thing is that the actor-based computing model supported by the Akka library (which is closely tied to the core Scala libraries) holds many attractions for building language processing systems that need to deal with asynchronous information flows and data processing (FWIW, Akka can be used from Java, though it is far less enjoyable than from Scala). It is also quite handy for creating distributed versions of many classes of machine learning algorithms that can take better advantage of the structure of the solution than the one-size-fits-all MapReduce strategy can. For examples, you can check out the Akka version of Modified Adsorption and the Hadoop version of the same algorithm in the Junto toolkit.

At the end of the day, though, whether one language is “better” than another will depend on a given programmer’s preferences and abilities. For example, a great alternative to Scala is Clojure, which is dynamically typed, also JVM-based, and also functional — even more so than Scala. So, when evaluating this or that language, ask whether you can get more done more quickly and more maintainably. The outcome will be a function of the capabilities of the language and your skill as a programmer.

Q. In C++ a class is just a blueprint of an object and it has a size of 1 no matter how many members it has. Does the size of a Scala class depend on its members? Also, is there anything corresponding to “sizeof” operator in Scala?

A. I don’t know the answer to this. Any useful responses from readers would be welcome, and I’ll add them to this answer if and when they come in.

Copyright 2012 Jason Baldridge

The text of this post is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original post.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.

Student Questions about Scala, Part 1

Topics: conventions, regexes, mapping, partitioning, vectors vs lists, overloaded constructors, case classes, traits, multiple inheritance, implicits

Preface

I’m currently teaching a course on Applied Text Analysis and am using Scala as the programming language taught and used in the course. Rather than creating more tutorials, I figured I’d take a page from Brian Dunning’s playbook on his Skeptoid podcast (highly recommended) when he takes student questions.  So, I had the students in the course submit questions about Scala that they had, based on the readings and assignments thus far. This post covers over half of them — the rest will be covered in a follow up post.

I start with some of the more basic questions, and the questions and/or answers progressively get into more intermediate level topics. Suggestions and comments to improve any of the answers are very welcome!

Basic Questions

Q. Concerning addressing parts of variables: To address individual parts of lists, the numbering of the items is 0, 1, 2, etc. That is, the first element is called “0”. It seems to be the same for Arrays and Maps, but not for Tuples: to get the first element of a Tuple, I need to use Tuple._1. Why is that?

A. It’s just a matter of convention — tuples have used a 1-based index in other languages like Haskell, and it seems that Scala has adopted the same convention/tradition. See:

http://stackoverflow.com/questions/6241464/why-are-the-indexes-of-scala-tuples-1-based
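
A quick illustration of the contrast:

[sourcecode lang=”scala”]
scala> val triple = ("a", "b", "c")
triple: (java.lang.String, java.lang.String, java.lang.String) = (a,b,c)

scala> triple._1 // 1-based for tuples
res0: java.lang.String = a

scala> List("a", "b", "c")(0) // 0-based for sequences
res1: java.lang.String = a
[/sourcecode]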

Q. It seems that Scala doesn’t recognize the “\b” boundary character as a regular expression. Is there something similar in Scala?

A. Scala does recognize boundary characters. For example, the following REPL session declares a regex that finds “the” with boundaries, and successfully retrieves the three tokens of “the” in the example sentence.

[sourcecode lang=”scala”]
scala> val TheRE = """\bthe\b""".r
TheRE: scala.util.matching.Regex = \bthe\b

scala> val sentence = "She thinks the man is a stick-in-the-mud, but the man disagrees."
sentence: java.lang.String = She thinks the man is a stick-in-the-mud, but the man disagrees.

scala> TheRE.findAllIn(sentence).toList
res1: List[String] = List(the, the, the)
[/sourcecode]

Q. Why doesn’t the method “split” work on args? Example: val arg = args.split(” “). Args are strings right, so split should work?

A. The args variable is an Array, so split doesn’t work on it. Arrays are, in effect, already split.
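
For example, if you run a script as scala MyScript.scala here are three, then args already holds the pieces (simulated below with a stand-in Array):

[sourcecode lang=”scala”]
scala> val args = Array("here", "are", "three") // what the runner hands your script
args: Array[java.lang.String] = Array(here, are, three)

scala> args(0).split("e") // split works on the individual Strings
res0: Array[java.lang.String] = Array(h, r)
[/sourcecode]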

Q. What is the major difference between foo.mapValues(x=>x.length) and foo.map(x=>x.length)? In some places one works and the other does not.

A. The map function works on all sequence types, including Seqs and Maps (note that Maps can be seen as sequences of Tuple2s). The mapValues function, however, only works on Maps. It is essentially a convenience function. As an example, let’s start with a simple Map from Ints to Ints.

[sourcecode lang=”scala”]
scala> val foo = List((1,2),(3,4)).toMap
foo: scala.collection.immutable.Map[Int,Int] = Map(1 -> 2, 3 -> 4)
[/sourcecode]

Now consider the task of adding 2 to each value in the Map. This can be done with the map function as follows.

[sourcecode lang=”scala”]
scala> foo.map { case(key,value) => (key,value+2) }
res5: scala.collection.immutable.Map[Int,Int] = Map(1 -> 4, 3 -> 6)
[/sourcecode]

So, the map function iterates over key/value pairs. We need to match both of them, and then output the key and the changed value to create the new Map. The mapValues function makes this quite a bit easier.

[sourcecode lang=”scala”]
scala> foo.mapValues(2+)
res6: scala.collection.immutable.Map[Int,Int] = Map(1 -> 4, 3 -> 6)
[/sourcecode]

Returning to the question about computing the length using mapValues or map: it is just a question of which values you are transforming, as in the following examples.

[sourcecode lang=”scala”]
scala> val sentence = "here is a sentence with some words".split(" ").toList
sentence: List[java.lang.String] = List(here, is, a, sentence, with, some, words)

scala> sentence.map(_.length)
res7: List[Int] = List(4, 2, 1, 8, 4, 4, 5)

scala> val firstCharTokens = sentence.groupBy(x=>x(0))
firstCharTokens: scala.collection.immutable.Map[Char,List[java.lang.String]] = Map(s -> List(sentence, some), a -> List(a), i -> List(is), h -> List(here), w -> List(with, words))

scala> firstCharTokens.mapValues(_.length)
res9: scala.collection.immutable.Map[Char,Int] = Map(s -> 2, a -> 1, i -> 1, h -> 1, w -> 2)
[/sourcecode]

Q. Is there any function that splits a list into two lists with the elements in the alternating positions of the original list? For example,

MainList =(1,2,3,4,5,6)

List1 = (1,3,5)
List2 = (2,4,6)

A. Given the exact main list you provided, one can use the partition function with the modulo operation to see whether each value is evenly divisible by 2 or not.

[sourcecode lang=”scala”]
scala> val mainList = List(1,2,3,4,5,6)
mainList: List[Int] = List(1, 2, 3, 4, 5, 6)

scala> mainList.partition(_ % 2 == 0)
res0: (List[Int], List[Int]) = (List(2, 4, 6),List(1, 3, 5))
[/sourcecode]

So, partition returns a pair of Lists. The first has all the elements that match the condition and the second has all the ones that do not.

Of course, this wouldn’t work in general for Lists that have Strings, or that don’t have Ints in order, etc. However, the indices of a List are always well-behaved in this way, so we just need to do a bit more work by zipping each element with its index and then partitioning based on indices.

[sourcecode lang=”scala”]
scala> val unordered = List("b","2","a","4","z","8")
unordered: List[java.lang.String] = List(b, 2, a, 4, z, 8)

scala> unordered.zipWithIndex
res1: List[(java.lang.String, Int)] = List((b,0), (2,1), (a,2), (4,3), (z,4), (8,5))

scala> val (evens, odds) = unordered.zipWithIndex.partition(_._2 % 2 == 0)
evens: List[(java.lang.String, Int)] = List((b,0), (a,2), (z,4))
odds: List[(java.lang.String, Int)] = List((2,1), (4,3), (8,5))

scala> evens.map(_._1)
res2: List[java.lang.String] = List(b, a, z)

scala> odds.map(_._1)
res3: List[java.lang.String] = List(2, 4, 8)
[/sourcecode]

Based on this, you could of course write a function that does this for any arbitrary list.
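
For instance, here’s one way such a function might look (a sketch, not the only way to write it):

[sourcecode lang=”scala”]
def alternating[T](xs: List[T]): (List[T], List[T]) = {
val (evens, odds) = xs.zipWithIndex.partition(_._2 % 2 == 0)
(evens.map(_._1), odds.map(_._1))
}

// alternating(List(1,2,3,4,5,6)) gives (List(1, 3, 5),List(2, 4, 6))
[/sourcecode]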

Q. How to convert a List to a Vector and vice-versa?

A. Use toIndexedSeq and toList.

[sourcecode lang=”scala”]
scala> val foo = List(1,2,3,4)
foo: List[Int] = List(1, 2, 3, 4)

scala> val bar = foo.toIndexedSeq
bar: scala.collection.immutable.IndexedSeq[Int] = Vector(1, 2, 3, 4)

scala> val baz = bar.toList
baz: List[Int] = List(1, 2, 3, 4)

scala> foo == baz
res0: Boolean = true
[/sourcecode]

Q. The advantage of a vector over a list is the constant time look-up. What is the advantage of using a list over a vector?

A. A List is slightly faster for operations at the head (front) of the sequence, so if all you are doing is a traversal (accessing each element in order, e.g. when mapping), then Lists are perfectly adequate and may be more efficient. They also have some nice pattern matching behavior for case statements.
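
For example, Lists shine for head operations and pattern matching:

[sourcecode lang=”scala”]
scala> val xs = List(1, 2, 3)
xs: List[Int] = List(1, 2, 3)

scala> 0 :: xs // constant-time prepend at the head
res0: List[Int] = List(0, 1, 2, 3)

scala> xs match { case head :: tail => "head is " + head; case Nil => "empty" }
res1: java.lang.String = head is 1
[/sourcecode]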

However, common wisdom seems to be that you should default to using Vectors. See Daniel Spiewak’s nice answer on Stackoverflow:

http://stackoverflow.com/questions/6928327/when-should-i-choose-vector-in-scala

Q. With splitting strings, like holmes.split(“\\s”): \n and \t just require a single backslash to get their special functionality, but why are two backslashes required for the whitespace character \s?

A. That’s because \n and \t actually mean something in a String.

[sourcecode lang=”scala”]
scala> println("Here is a line with a tabtorttwo, followed byna new line.")
Here is a line with a tab    or    two, followed by
a new line.

scala> println("This will breaks.")
<console>:1: error: invalid escape character
println("This will breaks.")
[/sourcecode]

So, you are supplying a String argument to split, and it uses that to construct a regular expression. Given that \s is not a string escape character, but is a regex metacharacter, you need to escape the backslash. You can of course use split(“””\s”””), though that isn’t exactly better in this case.

Q. I have long been programming in C++ and Java. Therefore, I put semicolons at the end of lines unconsciously. It seems that the standard coding style of Scala doesn’t recommend using semicolons. However, I saw that there are some cases that require semicolons, as you showed last class. Is there any specific reason why the semicolon loses its role in Scala?

A. The main reason is to improve readability since the semicolon is rarely needed when writing standard code in editors (as opposed to one liners in the REPL). However, when you want to do something in a single line, like handling multiple cases, you need the semicolons.

[sourcecode lang=”scala”]
scala> val foo = List("a",1,"b",2)
foo: List[Any] = List(a, 1, b, 2)

scala> foo.map { case(x: String) => x; case(x: Int) => x.toString }
res5: List[String] = List(a, 1, b, 2)
[/sourcecode]

But, in general, it’s best to just split these cases over multiple lines in any actual code.

Q. Is there no way to use _ in map like methods for collections that consist of pairs? For example, List((1,1),(2,2)).map(e => e._1 + e._2) works, but List((1,1),(2,2)).map(_._1 + _._2) does not work.

A. The scope in which the _ remains unambiguous runs out past its first use, so you only get to use it once (a second _ is interpreted as a new argument). It is better anyway to use a case statement that makes it clear what the members of the pairs are.

[sourcecode lang=”scala”]
scala>  List((1,1),(2,2)).map { case(num1, num2) => num1+num2 }
res6: List[Int] = List(2, 4)
[/sourcecode]

Q. I am unsure about the exact meaning of and the difference between “=>” and “->”. They both seem to mean something like “apply X to Y” and I see that each is used in a particular context, but what is the logic behind that?

A. The use of -> simply constructs a Tuple2, as is pretty clear in the following snippet.

[sourcecode lang=”scala”]
scala> val foo = (1,2)
foo: (Int, Int) = (1,2)

scala> val bar = 1->2
bar: (Int, Int) = (1,2)

scala> foo == bar
res11: Boolean = true
[/sourcecode]

Primarily, it is syntactic sugar that provides an intuitive symbol for creating elements of a Map. Compare the following two ways of declaring the same Map.

[sourcecode lang=”scala”]
scala> Map(("a",1),("b",2))
res9: scala.collection.immutable.Map[java.lang.String,Int] = Map(a -> 1, b -> 2)

scala> Map("a"->1,"b"->2)
res10: scala.collection.immutable.Map[java.lang.String,Int] = Map(a -> 1, b -> 2)
[/sourcecode]

The second seems more readable to me.

The use of => indicates that you are defining a function. The basic form is ARGUMENTS => RESULT.

[sourcecode lang=”scala”]
scala> val addOne = (x: Int) => x+1
addOne: Int => Int = <function1>

scala> addOne(2)
res7: Int = 3

scala> val addTwoNumbers = (num1: Int, num2: Int) => num1+num2
addTwoNumbers: (Int, Int) => Int = <function2>

scala> addTwoNumbers(3,5)
res8: Int = 8
[/sourcecode]

Normally, you use it in defining anonymous functions as arguments to functions like map, filter, and such.
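
For example:

[sourcecode lang=”scala”]
scala> List(1, 2, 3, 4).filter(x => x % 2 == 0).map(x => x * 10)
res0: List[Int] = List(20, 40)
[/sourcecode]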

Q. Is there a more convenient way of expressing vowels as [AEIOUaeiou] and consonants as [BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz] in RegExes?

A. You can use Strings when defining regexes, so you can have a variable for vowels and one for consonants.

[sourcecode lang=”scala”]
scala> val vowel = "[AEIOUaeiou]"
vowel: java.lang.String = [AEIOUaeiou]

scala> val consonant = "[BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz]"
consonant: java.lang.String = [BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz]

scala> val MyRE = ("("+vowel+")("+consonant+")("+vowel+")").r
MyRE: scala.util.matching.Regex = ([AEIOUaeiou])([BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz])([AEIOUaeiou])

scala> val MyRE(x,y,z) = "aJE"
x: String = a
y: String = J
z: String = E
[/sourcecode]

Q. The “\b” in RegExes marks a boundary, right? So, it also captures the “-”. But if I have a single string “sdnfeorgn”, it does NOT capture the boundaries of that, is that correct? And if so, why doesn’t it?

A. Because there are no boundaries inside that string! A \b matches only where a word character meets a non-word character (like the “-”) or the edge of the string.

Intermediate questions

Q. The flatMap function takes lists of lists and merges them to single list. But in the example:

[sourcecode lang=”scala”]
scala> (1 to 10).toList.map(x=>squareOddNumber(x))
res16: List[Option[Int]] = List(Some(1), None, Some(9), None, Some(25), None, Some(49), None, Some(81), None)

scala> (1 to 10).toList.flatMap(x=>squareOddNumber(x))
res17: List[Int] = List(1, 9, 25, 49, 81)
[/sourcecode]

Here it is not a list of lists but just a list. In this case, it expects the list to be a list of Options.
I tried running the code with a function returning just a number or None. It showed an error. So is there any way to use flatMap with a plain list rather than a list of Options? For example, List(1, None, 9, None, 25) should be returned as List(1, 9, 25).

A. No, this won’t work because List(1, None, 9, None, 25) mixes Options with Ints.

[sourcecode lang=”scala”]
scala> val mixedup = List(1, None, 9, None, 25)
mixedup: List[Any] = List(1, None, 9, None, 25)
[/sourcecode]

So, you should have your function return an Option which means returning Somes or Nones. Then flatMap will work happily.
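
For reference, here is a definition of squareOddNumber consistent with the output above (an assumption, since the original definition wasn’t shown):

[sourcecode lang=”scala”]
// Assumed definition: squares odd numbers, returns None for even ones.
def squareOddNumber(x: Int): Option[Int] =
if (x % 2 == 1) Some(x * x) else None
[/sourcecode]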

One way to think of Options is that they are like Lists with zero or one element, as can be noted by the parallels in the following snippet.

[sourcecode lang=”scala”]
scala> val foo = List(List(1),Nil,List(3),List(6),Nil)
foo: List[List[Int]] = List(List(1), List(), List(3), List(6), List())

scala> foo.flatten
res12: List[Int] = List(1, 3, 6)

scala> val bar = List(Option(1),None,Option(3),Option(6),None)
bar: List[Option[Int]] = List(Some(1), None, Some(3), Some(6), None)

scala> bar.flatten
res13: List[Int] = List(1, 3, 6)
[/sourcecode]

Q. Does Scala have generic templates (like C++, Java)? E.g. in C++, we can use vector<int>, vector<string>, etc. Is that possible in Scala? If so, how?

A. Yes, every collection type is parameterized. Notice that each of the following variables is parameterized by the type of the elements they are initialized with.

[sourcecode lang=”scala”]
scala> val foo = List(1,2,3)
foo: List[Int] = List(1, 2, 3)

scala> val bar = List("a","b","c")
bar: List[java.lang.String] = List(a, b, c)

scala> val baz = List(true, false, true)
baz: List[Boolean] = List(true, false, true)
[/sourcecode]

You can create your own parameterized classes straightforwardly.

[sourcecode lang=”scala”]
scala> class Flexible[T] (val data: T)
defined class Flexible

scala> val foo = new Flexible(1)
foo: Flexible[Int] = Flexible@7cd0570e

scala> val bar = new Flexible("a")
bar: Flexible[java.lang.String] = Flexible@31b6956f

scala> val baz = new Flexible(true)
baz: Flexible[Boolean] = Flexible@5b58539f

scala> foo.data
res0: Int = 1

scala> bar.data
res1: java.lang.String = a

scala> baz.data
res2: Boolean = true
[/sourcecode]

Q. How can we easily create, initialize and work with multi-dimensional arrays (and dictionaries)?

A. Use the fill function of the Array object to create them.

[sourcecode lang=”scala”]
scala> Array.fill(2)(1.0)
res8: Array[Double] = Array(1.0, 1.0)

scala> Array.fill(2,3)(1.0)
res9: Array[Array[Double]] = Array(Array(1.0, 1.0, 1.0), Array(1.0, 1.0, 1.0))

scala> Array.fill(2,3,2)(1.0)
res10: Array[Array[Array[Double]]] = Array(Array(Array(1.0, 1.0), Array(1.0, 1.0), Array(1.0, 1.0)), Array(Array(1.0, 1.0), Array(1.0, 1.0), Array(1.0, 1.0)))
[/sourcecode]

Once you have these in hand, you can iterate over them as usual.

[sourcecode lang=”scala”]
scala> val my2d = Array.fill(2,3)(1.0)
my2d: Array[Array[Double]] = Array(Array(1.0, 1.0, 1.0), Array(1.0, 1.0, 1.0))

scala> my2d.map(row => row.map(x=>x+1))
res11: Array[Array[Double]] = Array(Array(2.0, 2.0, 2.0), Array(2.0, 2.0, 2.0))
[/sourcecode]

For dictionaries (Maps), you can use mutable HashMaps to create an empty Map and then add elements to it. For that, see this blog post:

http://bcomposes.wordpress.com/2011/09/19/first-steps-in-scala-for-beginning-programmers-part-8/
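
In the meantime, here’s a tiny sketch of the idea:

[sourcecode lang=”scala”]
scala> val counts = scala.collection.mutable.HashMap[String, Int]()
counts: scala.collection.mutable.HashMap[String,Int] = Map()

scala> counts("the") = counts.getOrElse("the", 0) + 1

scala> counts
res1: scala.collection.mutable.HashMap[String,Int] = Map(the -> 1)
[/sourcecode]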

Q. Is the apply function similar to a constructor in C++ or Java? Where would the apply function be practically used? Is it for initialising the values of attributes?

A. No, the apply function is like any other function except that it allows you to call it without writing out “apply”. Consider the following class.

[sourcecode lang=”scala”]
class AddX (x: Int) {
def apply(y: Int) = x+y
override def toString = "My number is " + x
}
[/sourcecode]

Here’s how we can use it.

[sourcecode lang=”scala”]
scala> val add1 = new AddX(1)
add1: AddX = My number is 1

scala> add1(4)
res0: Int = 5

scala> add1.apply(4)
res1: Int = 5

scala> add1.toString
res2: java.lang.String = My number is 1
[/sourcecode]

So, the apply method is just (very handy) syntactic sugar that allows you to specify one function as fundamental to a class you have designed (actually, you can have multiple apply methods as long as each one has a unique parameter list). For example, with Lists, the apply method returns the value at the index provided, and for Maps it returns the value associated with the given key.

[sourcecode lang=”scala”]
scala> val foo = List(1,2,3)
foo: List[Int] = List(1, 2, 3)

scala> foo(2)
res3: Int = 3

scala> foo.apply(2)
res4: Int = 3

scala> val bar = Map(1->2,3->4)
bar: scala.collection.immutable.Map[Int,Int] = Map(1 -> 2, 3 -> 4)

scala> bar(1)
res5: Int = 2

scala> bar.apply(1)
res6: Int = 2
[/sourcecode]

Q. In the SBT tutorial you discuss “Node” and “Value” as being case classes. What is the alternative to a case class?

A. A normal class. Case classes are the special case. They do two things (and more) for you. The first is that you don’t have to use “new” to create a new object. Consider the following otherwise identical classes.

[sourcecode lang=”scala”]
scala> class NotACaseClass (val data: Int)
defined class NotACaseClass

scala> case class IsACaseClass (val data: Int)
defined class IsACaseClass

scala> val foo = new NotACaseClass(4)
foo: NotACaseClass = NotACaseClass@a5c0f8f

scala> val bar = IsACaseClass(4)
bar: IsACaseClass = IsACaseClass(4)
[/sourcecode]

That may seem like a little thing, but it can significantly improve code readability. Consider creating Lists within Lists within Lists if you had to use “new” all the time, for example. This is definitely true for Node and Value, which are used to build trees.

Case classes also support matching, as in the following.

[sourcecode lang=”scala”]
scala> val IsACaseClass(x) = bar
x: Int = 4
[/sourcecode]

A normal class cannot do this.

[sourcecode lang=”scala”]
scala> val NotACaseClass(x) = foo
<console>:13: error: not found: value NotACaseClass
val NotACaseClass(x) = foo
^
<console>:13: error: recursive value x needs type
val NotACaseClass(x) = foo
^
[/sourcecode]

If you mix the case class into a List and map over it, you can match it like you can with other classes, like Lists and Ints. Consider the following heterogeneous List.

[sourcecode lang=”scala”]
scala> val stuff = List(IsACaseClass(3), List(2,3), IsACaseClass(5), 4)
stuff: List[Any] = List(IsACaseClass(3), List(2, 3), IsACaseClass(5), 4)
[/sourcecode]

We can convert this to a List of Ints by processing each element according to its type by matching.

[sourcecode lang=”scala”]
scala> stuff.map { case List(x,y) => x; case IsACaseClass(x) => x; case x: Int => x }
<console>:13: warning: match is not exhaustive!
missing combination              *           Nil             *             *

stuff.map { case List(x,y) => x; case IsACaseClass(x) => x; case x: Int => x }
^

warning: there were 1 unchecked warnings; re-run with -unchecked for details
res10: List[Any] = List(3, 2, 5, 4)
[/sourcecode]

If you don’t want to see the warning in the REPL, add a case for things that don’t match that throws a MatchError.

[sourcecode lang=”scala”]
scala> stuff.map { case List(x,y) => x; case IsACaseClass(x) => x; case x: Int => x; case _ => throw new MatchError }
warning: there were 1 unchecked warnings; re-run with -unchecked for details
res13: List[Any] = List(3, 2, 5, 4)
[/sourcecode]

Better yet, return Options (using None for the unmatched case) and use flatMap instead.

[sourcecode lang=”scala”]
scala> stuff.flatMap { case List(x,y) => Some(x); case IsACaseClass(x) => Some(x); case x: Int => Some(x); case _ => None }
warning: there were 1 unchecked warnings; re-run with -unchecked for details
res14: List[Any] = List(3, 2, 5, 4)
[/sourcecode]

Q. In C++ the default access specifier is private; in Java one needs to specify private or public for each class member; whereas in Scala the default access specifier for a class is public. What could be the design motivation behind this when one of the purposes of a class is data hiding?

A. The reason is that Scala has a much more refined access specification scheme than Java that makes public the rational choice. See the discussion here:

http://stackoverflow.com/questions/4656698/default-public-access-in-scala

Another key aspect of this is that the general emphasis in Scala is on using immutable data structures, so there isn’t any danger of someone changing the internal state of your objects if you have designed them in this way. This in turn gets rid of the ridiculous getter and setter methods that breed and multiply in Java programs. See “Why getters and setters are evil” for more discussion:

http://www.javaworld.com/javaworld/jw-09-2003/jw-0905-toolbox.html

After you get used to programming in Scala, the whole getter/setter thing that is so common in Java code is pretty much gag worthy.

In general, it is still a good idea to use private[this] as a modifier to methods and variables whenever they are only needed by an object itself.
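
For example:

[sourcecode lang=”scala”]
class Counter {
private[this] var count = 0 // accessible only within this particular instance
def increment = { count += 1; count }
}
[/sourcecode]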

Q. How do we define overloaded constructors in Scala?

Q. The way a class is defined in Scala introduced in the tutorial, seems to have only one constructor. Is there any way to provide multiple constructors like Java?

A. You can add additional constructors by defining them with this, as in the following.

[sourcecode lang=”scala”]
class SimpleTriple (x: Int, y: Int, z: String) {
def this (x: Int, z: String) = this(x,0,z)
def this (x: Int, y: Int) = this(x,y,"a")
override def toString = x + ":" + y + ":" + z
}

scala> val foo = new SimpleTriple(1,2,"hello")
foo: SimpleTriple = 1:2:hello

scala> val bar = new SimpleTriple(1,"goodbye")
bar: SimpleTriple = 1:0:goodbye

scala> val baz = new SimpleTriple(1,3)
baz: SimpleTriple = 1:3:a
[/sourcecode]

Notice that you must supply an initial value for every one of the parameters of the class. This contrasts with Java, which allows you to leave some fields uninitialized (and which tends to lead to nasty bugs and bad design).

Note that you can also provide defaults to parameters.

[sourcecode lang=”scala”]
class SimpleTripleWithDefaults (x: Int, y: Int = 0, z: String = "a") {
override def toString = x + ":" + y + ":" + z
}

scala> val foo = new SimpleTripleWithDefaults(1)
foo: SimpleTripleWithDefaults = 1:0:a

scala> val bar = new SimpleTripleWithDefaults(1,2)
bar: SimpleTripleWithDefaults = 1:2:a
[/sourcecode]

However, you can’t omit a middle parameter while specifying the last one.

[sourcecode lang=”scala”]
scala> val foo = new SimpleTripleWithDefaults(1,"xyz")
<console>:12: error: type mismatch;
found   : java.lang.String("xyz")
required: Int
Error occurred in an application involving default arguments.
val foo = new SimpleTripleWithDefaults(1,"xyz")
^
[/sourcecode]

But, you can name the parameters in the initialization if you want to be able to do this.

[sourcecode lang=”scala”]
scala> val foo = new SimpleTripleWithDefaults(1,z="xyz")
foo: SimpleTripleWithDefaults = 1:0:xyz
[/sourcecode]

You then have complete freedom to change the parameters around.

[sourcecode lang=”scala”]
scala> val foo = new SimpleTripleWithDefaults(z="xyz",x=42,y=3)
foo: SimpleTripleWithDefaults = 42:3:xyz
[/sourcecode]

Q. I’m still not clear on the difference between classes and traits.  I guess I see a conceptual difference but I don’t really understand what the functional difference is — how is creating a “trait” different from creating a class with maybe fewer methods associated with it?

A. Yes, they are different. First off, traits are abstract, which means you cannot directly create an instance of one. Consider the following contrast.

[sourcecode lang=”scala”]
scala> class FooClass
defined class FooClass

scala> trait FooTrait
defined trait FooTrait

scala> val fclass = new FooClass
fclass: FooClass = FooClass@1b499616

scala> val ftrait = new FooTrait
<console>:8: error: trait FooTrait is abstract; cannot be instantiated
val ftrait = new FooTrait
^
[/sourcecode]

You can extend a trait to make a concrete class, however.

[sourcecode lang=”scala”]
scala> class FooTraitExtender extends FooTrait
defined class FooTraitExtender

scala> val ftraitExtender = new FooTraitExtender
ftraitExtender: FooTraitExtender = FooTraitExtender@53d26552
[/sourcecode]

This gets more interesting if the trait has some methods, of course. Here’s a trait, Animal, that declares two abstract methods, makeNoise and doBehavior.

[sourcecode lang=”scala”]
trait Animal {
def makeNoise: String
def doBehavior (other: Animal): String
}
[/sourcecode]

We can extend this trait with new class definitions; each extending class must implement both of these methods (or else be declared abstract).

[sourcecode lang=”scala”]
case class Bear (name: String, defaultBehavior: String = "Regard warily…") extends Animal {
def makeNoise = "ROAR!"
def doBehavior (other: Animal) = other match {
case b: Bear => makeNoise + " I’m " + name + "."
case m: Mouse => "Eat it!"
case _ => defaultBehavior
}
override def toString = name
}

case class Mouse (name: String) extends Animal {
def makeNoise = "Squeak?"
def doBehavior (other: Animal) = other match {
case b: Bear => "Run!!!"
case m: Mouse => makeNoise + " I’m " + name + "."
case _ => "Hide!"
}
override def toString = name
}
[/sourcecode]

Notice that Bear and Mouse have different parameter lists, but both can be Animals because they fully implement the Animal trait. We can now start creating objects of the Bear and Mouse classes and have them interact. We don’t need to use “new” because they are case classes (and this also allowed them to be used in the match statements of the doBehavior methods).

[sourcecode lang=”scala”]
val yogi = Bear("Yogi", "Hello!")
val baloo = Bear("Baloo", "Yawn…")
val grizzly = Bear("Grizzly")
val stuart = Mouse("Stuart")

println(yogi + ": " + yogi.makeNoise)
println(stuart + ": " + stuart.makeNoise)
println("Grizzly to Stuart: " + grizzly.doBehavior(stuart))
[/sourcecode]

We can also create a singleton object that is of the Animal type by using the following declaration.

[sourcecode lang=”scala”]
object John extends Animal {
def makeNoise = "Hullo!"
def doBehavior (other: Animal) = other match {
case b: Bear => "Nice bear… nice bear…"
case _ => makeNoise
}
override def toString = "John"
}
[/sourcecode]

Here, John is an object, not a class. Because this object implements the Animal trait, it successfully extends it and can act as an Animal. This means that a Bear like baloo can interact with John.

[sourcecode lang=”scala”]
println("Baloo to John: " + baloo.doBehavior(John))
[/sourcecode]

The output of the above code when run as a script is the following.

Yogi: ROAR!
Stuart: Squeak?
Grizzly to Stuart: Eat it!
Baloo to John: Yawn…

The closer distinction is between traits and abstract classes. In fact, everything shown above could have been done with Animal as an abstract class rather than as a trait. One difference is that an abstract class can have a constructor while traits cannot. Another key difference between them is that traits can be used to support limited multiple inheritance, as shown in the next question/answer.

Q. Does Scala support multiple inheritance?

A. Yes, via traits with implementations of some methods. Here’s an example, with a trait Clickable that has an abstract (unimplemented) method getMessage, an implemented method click, and a private, reassignable variable numTimesClicked (the latter two show clearly that traits are different from Java interfaces).

[sourcecode lang=”scala”]
trait Clickable {
private var numTimesClicked = 0
def getMessage: String
def click = {
val output = numTimesClicked + ": " + getMessage
numTimesClicked += 1
output
}
}
[/sourcecode]

Now let’s say we have a MessageBearer class (that we may have wanted for entirely different reasons having nothing to do with clicking).

[sourcecode lang=”scala”]
class MessageBearer (val message: String) {
override def toString = message
}
[/sourcecode]

A new class can be now created by extending MessageBearer and “mixing in” the Clickable trait.

[sourcecode lang=”scala”]
class ClickableMessageBearer(message: String) extends MessageBearer(message) with Clickable {
def getMessage = message
}
[/sourcecode]

ClickableMessageBearer now has the abilities of both MessageBearers (which is to be able to retrieve its message) and Clickables.

[sourcecode lang="scala"]
scala> val cmb1 = new ClickableMessageBearer("I'm number one!")
cmb1: ClickableMessageBearer = I'm number one!

scala> val cmb2 = new ClickableMessageBearer("I'm number two!")
cmb2: ClickableMessageBearer = I'm number two!

scala> cmb1.click
res3: java.lang.String = 0: I'm number one!

scala> cmb1.message
res4: String = I'm number one!

scala> cmb1.click
res5: java.lang.String = 1: I'm number one!

scala> cmb2.click
res6: java.lang.String = 0: I'm number two!

scala> cmb1.click
res7: java.lang.String = 2: I'm number one!

scala> cmb2.click
res8: java.lang.String = 1: I'm number two!
[/sourcecode]
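To make the “multiple” in multiple inheritance concrete, here’s a hedged sketch (the Hoverable trait is invented for this example) that mixes two traits into one class; a single getMessage implementation satisfies the abstract method declared in both:

[sourcecode lang="scala"]
trait Hoverable {
  def getMessage: String
  def hover = "Hovering over: " + getMessage
}

// One parent class, two mixed-in traits.
class FancyMessageBearer(message: String)
    extends MessageBearer(message) with Clickable with Hoverable {
  def getMessage = message
}

val fmb = new FancyMessageBearer("Hi!")
println(fmb.click) // 0: Hi!
println(fmb.hover) // Hovering over: Hi!
[/sourcecode]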

Q. Why are there toString, toInt, and toList functions, but there isn’t a toTuple function?

A. This is a basic question that leads directly to the more advanced topic of implicits. There are a number of reasons behind this. To start with, it is important to realize that there are many tuple types, from tuples with a single element (Tuple1) up to tuples with 22 elements (Tuple22). Note that when you create a tuple with parentheses and commas, you are implicitly invoking the constructor of the TupleN class with the matching arity.

[sourcecode lang="scala"]
scala> val b = (1,2,3)
b: (Int, Int, Int) = (1,2,3)

scala> val c = Tuple3(1,2,3)
c: (Int, Int, Int) = (1,2,3)

scala> b==c
res4: Boolean = true
[/sourcecode]

Given this, it is clearly not meaningful to have a toTuple function on Seqs (sequences) longer than 22 elements. This means there is no fully generic way to take, say, a List or an Array and call toTuple on it with reliable behavior.
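To see why, note that a tuple’s arity is part of its static type, while a Seq’s length is known only at runtime; a quick sketch:

[sourcecode lang="scala"]
// The type of a List says nothing about its length...
val xs: List[Int] = List(1, 2, 3)

// ...but a tuple's arity is fixed in its type, so the compiler
// cannot know which TupleN a fully generic toTuple should return.
val t: (Int, Int, Int) = (1, 2, 3)
[/sourcecode]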

However, if you want this functionality (even though limited by the above constraint of 22 elements max), Scala allows you to “add” methods to existing classes by using implicit definitions. You can find lots of discussion of implicits by searching for “scala implicits”. But here’s an example that shows how it works for this particular case.

[sourcecode lang="scala"]
val foo = List(1,2)
val bar = List(3,4,5)
val baz = List(6,7,8,9)

foo.toTuple

class TupleAble[X](elements: Seq[X]) {
  def toTuple = elements match {
    case Seq(a) => Tuple1(a)
    case Seq(a,b) => (a,b)
    case Seq(a,b,c) => (a,b,c)
    case _ => throw new RuntimeException("Sequence too long to be handled by toTuple: " + elements)
  }
}

foo.toTuple

implicit def seqToTuple[X](x: Seq[X]) = new TupleAble(x)

foo.toTuple
bar.toTuple
baz.toTuple
[/sourcecode]

If you put this into the Scala REPL, you’ll see that the first invocation of foo.toTuple gets an error:

[sourcecode lang="scala"]
scala> foo.toTuple
<console>:9: error: value toTuple is not a member of List[Int]
              foo.toTuple
                  ^
[/sourcecode]

Note that the TupleAble class takes a Seq in its constructor and then provides the method toTuple using that Seq. It can do so for Seqs with one, two, or three elements; above that, it throws an exception. (We could of course keep listing cases up to 22-element tuples, but this is enough to make the point.)

The second invocation of foo.toTuple still doesn’t work, because foo is a List (a kind of Seq) and there is no toTuple method for Lists. That’s where the implicit function seqToTuple comes in: once it is declared, Scala notes that you are trying to call toTuple on a Seq, finds no such method on Seqs, sees that there is an implicit conversion from Seqs to TupleAbles via seqToTuple, and sees that TupleAble has a toTuple method. Based on that, it compiles and produces the desired behavior. This is a very handy ability of Scala that can really simplify your code, provided you use it well and with care.
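As a side note for readers on newer versions of Scala: since Scala 2.10 (newer than this post), the wrapper class and the implicit conversion can be fused into a single implicit class declaration. A minimal sketch, with the caveat that implicit classes cannot be top-level and so must be wrapped in an object, class, or trait:

[sourcecode lang="scala"]
object TupleSyntax {
  // Defines both the wrapper class and the implicit conversion in one go.
  implicit class TupleAble[X](elements: Seq[X]) {
    def toTuple = elements match {
      case Seq(a) => Tuple1(a)
      case Seq(a,b) => (a,b)
      case Seq(a,b,c) => (a,b,c)
      case _ => throw new RuntimeException("Sequence too long to be handled by toTuple: " + elements)
    }
  }
}

import TupleSyntax._
List(1,2).toTuple // (1,2)
[/sourcecode]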

Copyright 2012 Jason Baldridge

The text of this post is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original post.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.