Gibbs sampler for toy topic model example

Several years ago, I implemented a Gibbs sampler in R for the artificial data of Steyvers and Griffiths (2007), “Probabilistic topic models”. I used it for a class demo and have been meaning to post it as a GitHub gist. Here it is:

The artificial problem provides a very nice, simple test case for seeing the inference of the topic-word and document-topic distributions using Gibbs sampling. The code for the sampling is shorter than the setup code. The comments in the code should make everything self-explanatory if you have read Steyvers and Griffiths.

To run it, you can of course just paste it into an R session. You can also run it from the command line, e.g.:

[sourcecode lang="bash"]

$ R --no-save < topics_gibbs_sg_example.R

[/sourcecode]
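For readers who would rather see the idea in Scala (the language used in the tutorials below), here is a minimal sketch of a collapsed Gibbs sampler on a small made-up corpus. It is not a port of the R gist and not the exact Steyvers and Griffiths data: the corpus, the hyperparameter values alpha and beta, the number of sweeps, and all of the variable names are assumptions chosen for illustration.

```scala
import scala.util.Random

// Toy vocabulary and settings (all values are illustrative assumptions).
val vocab = Vector("money", "loan", "bank", "river", "stream")
val numTopics = 2
val alpha = 1.0 // document-topic smoothing
val beta = 1.0  // topic-word smoothing
val rng = new Random(42)

// Made-up corpus of word ids: half the documents draw from
// {money, loan, bank}, the other half from {bank, river, stream}.
val docs: Vector[Vector[Int]] = Vector.tabulate(8) { d =>
  val words = if (d < 4) Vector(0, 1, 2) else Vector(2, 3, 4)
  Vector.fill(16)(words(rng.nextInt(words.size)))
}

// Count tables and a random initial topic for every token.
val docTopic = Array.ofDim[Int](docs.size, numTopics)
val topicWord = Array.ofDim[Int](numTopics, vocab.size)
val topicTotal = new Array[Int](numTopics)
val assign: Vector[Array[Int]] =
  docs.map(doc => Array.fill(doc.size)(rng.nextInt(numTopics)))

for (d <- docs.indices; i <- docs(d).indices) {
  val (w, t) = (docs(d)(i), assign(d)(i))
  docTopic(d)(t) += 1; topicWord(t)(w) += 1; topicTotal(t) += 1
}

// One sweep: resample each token's topic given all other assignments.
def sweep(): Unit = {
  for (d <- docs.indices; i <- docs(d).indices) {
    val w = docs(d)(i)
    val old = assign(d)(i)
    // Remove the current token from the counts.
    docTopic(d)(old) -= 1; topicWord(old)(w) -= 1; topicTotal(old) -= 1
    // Unnormalized conditional probability of each topic for this token.
    val weights = Array.tabulate(numTopics) { t =>
      (docTopic(d)(t) + alpha) * (topicWord(t)(w) + beta) /
        (topicTotal(t) + vocab.size * beta)
    }
    // Draw a topic with probability proportional to its weight.
    var u = rng.nextDouble() * weights.sum
    var t = 0
    while (t < numTopics - 1 && u >= weights(t)) { u -= weights(t); t += 1 }
    assign(d)(i) = t
    docTopic(d)(t) += 1; topicWord(t)(w) += 1; topicTotal(t) += 1
  }
}

(1 to 200).foreach(_ => sweep())

// Smoothed estimate of each topic's distribution over words.
val phi = Array.tabulate(numTopics, vocab.size) { (t, w) =>
  (topicWord(t)(w) + beta) / (topicTotal(t) + vocab.size * beta)
}
```

Printing each row of phi next to vocab after the sweeps shows which words each topic has come to favor. As in the R version, most of the code is setup and the sampling step itself is short.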

If you are interested in other tutorials that discuss Bayesian learning and samplers (with a definite slant toward natural language processing), check these out:

Happy Mother’s Day to Academic Moms (We need more of you)

Happy Mother’s Day to all my female colleagues around the world, who produce amazing research and do great teaching while being moms!

Being an academic means a lot of hard (and rewarding) work, and being a parent on top of it brings an extensive set of challenges — especially as one effectively competes with others who don’t have kids! Compared to men, women face an additional set of challenges as academic parents, due to a wide variety of factors, including fixed biological ones (e.g. only they can actually bear children) and societal expectations which change ever so slowly (though thankfully generally for the better). It is important to have your perspectives as colleagues, teachers, and researchers, and I don’t think that academia does enough to allow you all to more easily balance the needs of work and family — much to our detriment. And there are still pay gaps between men and women, especially at more senior levels of academia. It all means that many women who may have provided fundamental insights into science sadly never go into academic work based on a very rational choice about the likely costs and benefits such a career brings. Many of my female colleagues feel they must wait until relatively late in their reproductive life to have children, often after tenure or after tenure is pretty much assured. This brings with it additional risks and challenges that women should not feel forced to take.

As it is we still have too few academic women, and even fewer academic moms. I believe the latter are an important group to support, since they are the ones who provide examples and can be role models for young women who are considering academic careers but who know they want children. Carlota Smith, a colleague in the UT Austin Linguistics department who sadly died five years ago, was a trailblazer who was a single mom academic in the 1970s and who I know directly inspired many of the female graduate students in our department. We need more Carlotas.

The less attractive it is to be an academic mom, the fewer women we’ll have in our midst, again to our detriment — this is especially true in fields like computer science. This has big effects on academic women who choose not to have children as it reduces the pool of potential female colleagues they could have. Even in our linguistics department, there are too few female graduate students who study computational linguistics, despite an otherwise reasonably balanced population of male and female graduate students.

So, knowing all the challenges you face on top of the usual ones — thanks again, and keep on being amazing. You all have my respect!

Processing JSON in Scala with Jerkson

Topics: JSON, Jerkson, SBT quick start, running the Scala REPL in SBT, Java implicit conversions, @transient annotation, SBT run and run-main, Avro

Introduction

The previous tutorial covered basic XML processing in Scala, but as I noted, XML is not the primary choice for data serialization these days. Instead, JSON (JavaScript Object Notation) is more widely used for data interchange, in part because it is less verbose and better captures the core data structures (such as lists and maps) that are used in defining many objects. It was originally designed for working with JavaScript, but turned out to be quite effective as a language neutral format. A very nice feature of it is that it is straightforward to translate objects as defined in languages like Java and Scala into JSON and back again, as I’ll show in this tutorial. If the class definitions and the JSON structures are appropriately aligned, this transformation turns out to be entirely trivial to do — given a suitable JSON processing library.

In this tutorial, I cover basic JSON processing in Scala using the Jerkson library, which itself is essentially a Scala wrapper around the Jackson library (written in Java).  Note that other libraries like lift-json are perfectly good alternatives, but Jerkson seems to have some efficiency advantages for streaming JSON due to Jackson’s performance. Of course, since Scala plays nicely with Java, you can directly use whichever JVM-based JSON library you like, including Jackson.

This post also shows how to do a quick start with SBT that will allow you to easily access third-party libraries as dependencies and start writing code that uses them and can be compiled with SBT.

Note: As a “Jason” I insist that JSON should be pronounced Jay-SAHN (with stress on the second syllable) to distinguish it from the name. 🙂

Getting set up

An easy way to use the Jerkson library in the context of a tutorial like this is for the reader to set up a new SBT project, declare Jerkson as a dependency, and then fire up the Scala REPL using SBT’s console action. This sorts out the process of obtaining external libraries and setting up the classpath so that they are available in an SBT-initiated Scala REPL. Follow the instructions in this section to do so.

Note: if you have already been working with Scalabha version 0.2.5 (or later), skip to the bottom of this section to see how to run the REPL using Scalabha’s build. Alternatively, if you have an existing project of your own, you can of course just add Jerkson as a dependency, import its classes as necessary and use it in your normal programming setup. The examples below will then help as some straightforward recipes for using it in your project.

First, create a directory to work in and download the SBT launch jar.

$ mkdir ~/json-tutorial
$ cd ~/json-tutorial/
$ wget http://typesafe.artifactoryonline.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.11.3/sbt-launch.jar

Note: If you don’t have wget installed on your machine, you can download the above sbt-launch.jar file in your browser and move it to the ~/json-tutorial directory.

Now, save the following as the file ~/json-tutorial/build.sbt. Be aware that it is important to keep the empty lines between each of the declarations.

name := "json-tutorial"

version := "0.1.0"

scalaVersion := "2.9.2"

resolvers += "repo.codahale.com" at "http://repo.codahale.com"

libraryDependencies += "com.codahale" % "jerkson_2.9.1" % "0.5.0"

Then save the following in the file ~/json-tutorial/runSbt.

java -Xms512M -Xmx1536M -Xss1M -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=384M -jar `dirname $0`/sbt-launch.jar "$@"

Make that file executable and run it, which will show SBT doing a bunch of work and then leave you with the SBT prompt.

$ cd ~/json-tutorial
$ chmod a+x runSbt
$ ./runSbt update
Getting org.scala-sbt sbt_2.9.1 0.11.3 ...
downloading http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt_2.9.1/0.11.3/jars/sbt_2.9.1.jar ...
[SUCCESSFUL ] org.scala-sbt#sbt_2.9.1;0.11.3!sbt_2.9.1.jar (307ms)
...
... more stuff including getting the Jerkson library ...
...
[success] Total time: 25 s, completed May 11, 2012 10:22:42 AM
$

You should be back in the Unix shell at this point, and now we are ready to run the Scala REPL using SBT. The important thing is that this instance of the REPL will have the Jerkson library and its dependencies in the classpath so that we can import the classes we need.

$ ./runSbt console
[info] Set current project to json-tutorial (in build file:/Users/jbaldrid/json-tutorial/)
[info] Starting scala interpreter...
[info]
Welcome to Scala version 2.9.2 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_31).
Type in expressions to have them evaluated.
Type :help for more information.

scala> import com.codahale.jerkson.Json._
import com.codahale.jerkson.Json._

If nothing further is output, then you are all set. If things are amiss (or if you are running in the default Scala REPL), you’ll instead see something like the following.

scala> import com.codahale.jerkson.Json._
:7: error: object codahale is not a member of package com
import com.codahale.jerkson.Json._

If this is what you got, try to follow the instructions above again to make sure that your setup is exactly as above. However, if you continue to experience problems, an alternative is to get version 0.2.5 of Scalabha (which already has Jerkson as a dependency), follow the instructions for setting it up and then run the following commands.

$ cd $SCALABHA_DIR
$ scalabha build console

If you just want to see some examples of using Jerkson as an API and not use it interactively, then it is entirely unnecessary to do the SBT setup — just read on and adapt the examples as necessary.

Processing a simple JSON example

As usual, let’s begin with a very simple example that shows some of the basic properties of JSON.

{"foo": 42,
"bar": ["a","b","c"],
"baz": { "x": 1, "y": 2 }}

This describes a data structure with three fields, foo, bar and baz. The field foo’s value is the integer 42, bar’s value is a list of strings, and baz’s value is a map from strings to integers. These types are language neutral: virtually every programming language has a counterpart for each.

Let’s first consider deserializing each of these values individually as Scala objects, using Jerkson’s parse method. Keep in mind that JSON in a file is a string, so the inputs in all of these cases are strings (at times I’ll use triple-quoted strings when there are quotes themselves in the JSON). In each case, we tell the parse method what type we expect by providing a type specification before the argument.

scala> parse[Int]("42")
res0: Int = 42

scala> parse[List[String]]("""["a","b","c"]""")
res1: List[String] = List(a, b, c)

scala> parse[Map[String,Int]]("""{ "x": 1, "y": 2 }""")
res2: Map[String,Int] = Map(x -> 1, y -> 2)

So, in each case, the string representation is turned into a Scala object of the appropriate type. If we aren’t sure what the type is or if we know for example that a List is heterogeneous, we can use Any as the expected type.

scala> parse[Any]("42")
res3: Any = 42

scala> parse[List[Any]]("""["a",1]""")
res4: List[Any] = List(a, 1)

If you give an expected type that the input can’t be parsed as, you’ll get an error.

scala> parse[List[Int]]("""["a",1]""")
com.codahale.jerkson.ParsingException: Can not construct instance of int from String value 'a': not a valid Integer value
at [Source: java.io.StringReader@2bc5aea; line: 1, column: 2]
<...many more lines of stack trace...>

How about parsing all of the attributes and values together? Save the whole thing in a variable simpleJson as follows.

scala> :paste
// Entering paste mode (ctrl-D to finish)

val simpleJson = """{"foo": 42,
"bar": ["a","b","c"],
"baz": { "x": 1, "y": 2 }}"""

// Exiting paste mode, now interpreting.

simpleJson: java.lang.String =
{"foo": 42,
"bar": ["a","b","c"],
"baz": { "x": 1, "y": 2 }}

Since it is a Map from Strings to different types of values, the best we can do is deserialize it as a Map[String, Any].

scala> val simple = parse[Map[String,Any]](simpleJson)
simple: Map[String,Any] = Map(bar -> [a, b, c], baz -> {x=1, y=2}, foo -> 42)

To get these out as more specific types than Any, you need to cast them to the appropriate types.

scala> val fooValue = simple("foo").asInstanceOf[Int]
fooValue: Int = 42

scala> val barValue = simple("bar").asInstanceOf[java.util.ArrayList[String]]
barValue: java.util.ArrayList[String] = [a, b, c]

scala> val bazValue = simple("baz").asInstanceOf[java.util.LinkedHashMap[String,Int]]
bazValue: java.util.LinkedHashMap[String,Int] = {x=1, y=2}

Of course, you might want to be working with Scala types, which is easy if you import the implicit conversions from Java types to Scala types.

scala> import scala.collection.JavaConversions._
import scala.collection.JavaConversions._

scala> val barValue = simple("bar").asInstanceOf[java.util.ArrayList[String]].toList
barValue: List[String] = List(a, b, c)

scala> val bazValue = simple("baz").asInstanceOf[java.util.LinkedHashMap[String,Int]].toMap
bazValue: scala.collection.immutable.Map[String,Int] = Map(x -> 1, y -> 2)

Voila! When you are working with Java libraries in Scala, the JavaConversions usually prove to be extremely handy.
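If you prefer the conversions to be visible in the code rather than implicit, scala.collection.JavaConverters offers explicit asScala calls. The following is a minimal sketch that does not need Jerkson: the map simpleSketch is built by hand as a stand-in for the result of parse above, and the helper asStringList is a hypothetical name introduced only for this illustration.

```scala
import scala.collection.JavaConverters._

// Hand-built stand-in for the Map[String,Any] that parse returned above.
val bar = new java.util.ArrayList[String](java.util.Arrays.asList("a", "b", "c"))
val baz = new java.util.LinkedHashMap[String, Int]()
baz.put("x", 1)
baz.put("y", 2)
val simpleSketch: Map[String, Any] = Map("foo" -> 42, "bar" -> bar, "baz" -> baz)

// Explicit conversions: asScala wraps the Java collection, and
// toList / toMap copy it into an immutable Scala collection.
val barScala: List[String] =
  simpleSketch("bar").asInstanceOf[java.util.List[String]].asScala.toList
val bazScala: Map[String, Int] = baz.asScala.toMap

// Hypothetical helper: check the runtime class with a pattern match
// instead of casting, so an unexpected value gives None, not an exception.
def asStringList(value: Any): Option[List[String]] = value match {
  case xs: java.util.List[_] => Some(xs.asScala.toList.map(_.toString))
  case _ => None
}
```

The pattern match is a design choice for data you don’t fully control: asInstanceOf fails with a ClassCastException when the value has a different class, whereas the match lets the caller decide what to do with None.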

Deserializing into user-defined types

Though we were able to parse the simple JSON expression above and even cast values into appropriate types, things were still a bit clunky. Fortunately, if you have defined your own case class with the appropriate fields, you can provide that as the expected type instead. For example, here’s a simple case class that will do the trick.

case class Simple(val foo: String, val bar: List[String], val baz: Map[String,Int])

Clearly this has all the right fields (with variables named the same as the fields in the JSON example), and the variables have the types we’d like them to have.

Unfortunately, due to class loading issues with SBT, we cannot carry on the rest of this exercise solely in the REPL and must define this class in code. This code can be compiled and then used in the REPL or by other code. To do this, save the following as ~/json-tutorial/Simple.scala.

case class Simple(val foo: String, val bar: List[String], val baz: Map[String,Int])

object SimpleExample {
  def main(args: Array[String]) {
    import com.codahale.jerkson.Json._
    val simpleJson = """{"foo":42, "bar":["a","b","c"], "baz":{"x":1,"y":2}}"""
    val simpleObject = parse[Simple](simpleJson)
    println(simpleObject)
  }
}

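Then exit the Scala REPL session you were in for the previous section using the command :quit. If that leaves you at the Unix shell rather than the SBT prompt, start SBT with ./runSbt (no arguments) to get its > prompt, and then do the following. (If anything has gone amiss, you can restart SBT the same way and repeat these commands.)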

> compile
[info] Compiling 1 Scala source to /Users/jbaldrid/json-tutorial/target/scala-2.9.2/classes...
[success] Total time: 2 s, completed May 11, 2012 9:24:00 PM
> run
[info] Running SimpleExample SimpleExample
Simple(42,List(a, b, c),Map(x -> 1, y -> 2))
[success] Total time: 1 s, completed May 11, 2012 9:24:03 PM

You can make changes to the code in Simple.scala, compile it again (you don’t need to exit SBT to do so), and run it again. Also, now that you’ve compiled, if you start up the Scala REPL using the console action, then the Simple class is now available to you and you can carry on working in the REPL. For example, here are the same statements that are used in the SimpleExample main method given previously.

scala> import com.codahale.jerkson.Json._
import com.codahale.jerkson.Json._

scala> val simpleJson = """{"foo":42, "bar":["a","b","c"], "baz":{"x":1,"y":2}}"""
simpleJson: java.lang.String = {"foo":42, "bar":["a","b","c"], "baz":{"x":1,"y":2}}

scala> val simpleObject = parse[Simple](simpleJson)
simpleObject: Simple = Simple(42,List(a, b, c),Map(x -> 1, y -> 2))

scala> println(simpleObject)
Simple(42,List(a, b, c),Map(x -> 1, y -> 2))

Another nice feature of JSON deserialization is that if the JSON string has more information than you need to construct the object you want to build from it, the extra information is ignored. For example, consider deserializing the following example, which has an extra field eca in the JSON representation.

scala> val ecaJson = """{"foo":42, "bar":["a","b","c"], "baz":{"x":1,"y":2}, "eca": true}"""
ecaJson: java.lang.String = {"foo":42, "bar":["a","b","c"], "baz":{"x":1,"y":2}, "eca": true}

scala> val noEcaSimpleObject = parse[Simple](ecaJson)
noEcaSimpleObject: Simple = Simple(42,List(a, b, c),Map(x -> 1, y -> 2))

The eca information silently slips away and we still get a Simple object with all the information we need. This property is very handy for ignoring irrelevant information, which I’ll show to be quite useful in a follow-up post on processing JSON formatted tweets from Twitter’s API.

Another thing to note about the above example is that the Boolean values true and false are valid JSON (they are not quoted strings, but actual Boolean values). Parsing a Boolean is even quite forgiving as Jerkson will give you a Boolean even when it is defined as a String.

scala> parse[Map[String,Boolean]]("""{"eca":true}""")
res0: Map[String,Boolean] = Map(eca -> true)

scala> parse[Map[String,Boolean]]("""{"eca":"true"}""")
res1: Map[String,Boolean] = Map(eca -> true)

And it will convert a Boolean into a String if you happen to ask it to do so.

scala> parse[Map[String,String]]("""{"eca":true}""")
res2: Map[String,String] = Map(eca -> true)

But it (sensibly) won’t convert any String other than true or false into a Boolean.

scala> parse[Map[String,Boolean]]("""{"eca":"brillig"}""")
com.codahale.jerkson.ParsingException: Can not construct instance of boolean from String value 'brillig': only "true" or "false" recognized
at [Source: java.io.StringReader@6b2739b8; line: 1, column: 2]
<...stacktrace...>

And it doesn’t admit unquoted values other than a select few, including true and false.

scala> parse[Map[String,String]]("""{"eca":brillig}""")
com.codahale.jerkson.ParsingException: Malformed JSON. Unexpected character ('b' (code 98)): expected a valid value (number, String, array, object, 'true', 'false' or 'null') at character offset 7.
<...stacktrace...>

In other words, your JSON needs to be grammatical.

Generating JSON from an object

If you have an object in hand, it is very easy to create JSON from it (serialize) using the generate method.

scala> val simpleJsonString = generate(simpleObject)
simpleJsonString: String = {"foo":"42","bar":["a","b","c"],"baz":{"x":1,"y":2}}

This is much easier than the XML solution, which required explicitly declaring how an object was to be turned into XML elements. The restriction is that any such objects must be instances of a case class. If you don’t have a case class, you’ll need to do some special handling (not discussed in this tutorial).
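To make concrete what generate saves you from writing, here is a hand-rolled sketch that produces the same string for this one class. It is not how Jerkson works internally. The class SimpleSketch (a stand-in for Simple) and the helpers quote and toJson are hypothetical names, and the escaping covers only backslashes and double quotes.

```scala
// Hypothetical stand-in for the tutorial's Simple class.
case class SimpleSketch(foo: String, bar: List[String], baz: Map[String, Int])

// Quote a string for JSON, escaping only backslashes and double quotes.
// A complete encoder would also handle control characters.
def quote(s: String): String =
  "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

// Build the JSON text field by field, in constructor order.
def toJson(s: SimpleSketch): String = {
  val bar = s.bar.map(quote).mkString("[", ",", "]")
  val baz = s.baz.map { case (k, v) => quote(k) + ":" + v }.mkString("{", ",", "}")
  "{" + quote("foo") + ":" + quote(s.foo) + "," +
    quote("bar") + ":" + bar + "," +
    quote("baz") + ":" + baz + "}"
}

val sketchJson = toJson(SimpleSketch("42", List("a", "b", "c"), Map("x" -> 1, "y" -> 2)))
```

Every new field or nested class would need its own code here, which is exactly the bookkeeping a library takes over.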

A richer JSON example

In the vein of the previous tutorial on XML, I’ve created the JSON corresponding to the music XML example used there. You can find it as the GitHub gist music.json:

https://gist.github.com/2668632

Save that file as /tmp/music.json.

Tip: you can easily format condensed JSON to be more human-readable by using Python’s json.tool module.

$ cat /tmp/music.json | python -mjson.tool
[
{
"albums": [
{
"description": "\n\tThe King of Limbs is the eighth studio album by English rock band Radiohead, produced by Nigel Godrich. It was self-released on 18 February 2011 as a download in MP3 and WAV formats, followed by physical CD and 12\" vinyl releases on 28 March, a wider digital release via AWAL, and a special \"newspaper\" edition on 9 May 2011. The physical editions were released through the band's Ticker Tape imprint on XL in the United Kingdom, TBD in the United States, and Hostess Entertainment in Japan.\n      ",
"songs": [
{
"length": "5:15",
"title": "Bloom"
},
<...etc...>

Next, save the following code as ~/json-tutorial/MusicJson.scala.

package music {

  case class Song(val title: String, val length: String) {
    @transient lazy val time = {
      val Array(minutes, seconds) = length.split(":")
      minutes.toInt*60 + seconds.toInt
    }
  }

  case class Album(val title: String, val songs: Seq[Song], val description: String) {
    @transient lazy val time = songs.map(_.time).sum
    @transient lazy val length = (time / 60)+":"+(time % 60)
  }

  case class Artist(val name: String, val albums: Seq[Album])
}

object MusicJson {
  def main(args: Array[String]) {
    import com.codahale.jerkson.Json._
    import music._
    val jsonInput = io.Source.fromFile("/tmp/music.json").mkString
    val musicObj = parse[List[Artist]](jsonInput)
    println(musicObj)
  }
}

A couple of quick notes. The Song, Album, and Artist classes are the same as I used in the previous tutorial on XML processing, with two changes. The first is that I’ve wrapped them in a package music. This is only necessary to get around an issue with running Jerkson in SBT as we are doing here. The other is that the fields that are not in the constructor are marked as @transient: this ensures that they are not included in the output when we generate JSON from objects of these classes. An example showing how this matters is the way that I created the music.json file: I read in the XML as in the previous tutorial and then used Jerkson to generate the JSON. Without the @transient annotation, those fields are included in the output. For reference, here’s the code to do the conversion from XML to JSON (which you can add to MusicJson.scala if you like).

object ConvertXmlToJson {
  def main(args: Array[String]) {
    import com.codahale.jerkson.Json._
    import music._
    val musicElem = scala.xml.XML.loadFile("/tmp/music.xml")

    val artists = (musicElem \ "artist").map { artist =>
      val name = (artist \ "@name").text
      val albums = (artist \ "album").map { album =>
        val title = (album \ "@title").text
        val description = (album \ "description").text
        val songList = (album \ "song").map { song =>
          Song((song \ "@title").text, (song \ "@length").text)
        }
        Album(title, songList, description)
      }
      Artist(name, albums)
    }

    val musicJson = generate(artists)
    val output = new java.io.BufferedWriter(new java.io.FileWriter(new java.io.File("/tmp/music.json")))
    output.write(musicJson)
    output.flush
    output.close
  }
}

There are other serialization strategies (e.g. binary serialization of objects), and the @transient annotation is similarly respected by them.

Given the code in MusicJson.scala, we can now compile and run it. In SBT, you can either do run or run-main. If you choose run and there is more than one main method in your project, SBT will give you a choice.

> run

Multiple main classes detected, select one to run:

[1] SimpleExample
[2] MusicJson
[3] ConvertXmlToJson

Enter number: 2

[info] Running MusicJson
List(Artist(Radiohead,List(Album(The King of Limbs,List(Song(Bloom,5:15), Song(Morning Mr Magpie,4:41), Song(Little by Little,4:27), Song(Feral,3:13), Song(Lotus Flower,5:01), Song(Codex,4:47), Song(Give Up the Ghost,4:50), Song(Separator,5:20)),
The King of Limbs is the eighth studio album by English rock band Radiohead, produced by Nigel Godrich. It was self-released on 18 February 2011 as a download in MP3 and WAV formats, followed by physical CD and 12" vinyl releases on 28 March, a wider digital release via AWAL, and a special "newspaper" edition on 9 May 2011. The physical editions were released through the band's Ticker Tape imprint on XL in the United Kingdom, TBD in the United States, and Hostess Entertainment in Japan.
), Album(OK Computer,List(Song(Airbag,4:44), Song(Paranoid
<...more printed output...>
[success] Total time: 3 s, completed May 12, 2012 11:52:06 AM

With run-main, you just explicitly provide the name of the object whose main method you wish to run.

> run-main MusicJson
[info] Running MusicJson
<...same output as above...>

So, either way, we have successfully de-serialized the JSON description of the music data. (You can also get the same result by entering the code of the main method of MusicJson into the REPL when you run it from the SBT console.)

Conclusion

This tutorial has shown how easy it is to serialize (generate) and deserialize (parse) objects to and from JSON format. Hopefully, this has demonstrated the relative ease of doing this with the Jerkson library and Scala, and especially the relative ease in comparison with working with XML for similar purposes.

In addition to this ease, JSON is generally more compact than the equivalent XML. However, it still is far from being a truly compressed format, and there is a lot of obvious “waste”, like having the field names repeated again and again for each object. This matters a lot when data is represented as JSON strings and is being sent over networks and/or used in distributed processing frameworks like Hadoop. The Avro file format is an evolution of JSON that performs such compression: it includes a schema with each file and then each object is represented in a binary format that only specifies the data and not the field names. In addition to being more compact, it retains the properties of being easily splittable, which matters a great deal for processing large files in Hadoop.
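The cost of repeating field names can be made concrete with a few lines of Scala. The sketch below is only an illustration of the idea of stating the schema once. It is not Avro, which uses a binary encoding, and the SongRow class and its three records are made up for this example.

```scala
// Hypothetical records, made up for illustration.
case class SongRow(title: String, length: String)
val rows = List(SongRow("Bloom", "5:15"), SongRow("Feral", "3:13"), SongRow("Codex", "4:47"))

// JSON style: every record repeats its field names.
val perRecord = rows.map { r =>
  "{\"title\":\"" + r.title + "\",\"length\":\"" + r.length + "\"}"
}.mkString("[", ",", "]")

// Schema-once style: the field names appear one time, then only values follow.
val schemaOnce = "title,length\n" + rows.map(r => r.title + "," + r.length).mkString("\n")

println(perRecord.length + " characters with repeated field names")
println(schemaOnce.length + " characters with the schema stated once")
```

The gap grows with the number of records, since the per-record overhead of the field names is constant while the schema is paid for only once.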

Copyright 2012 Jason Baldridge

The text of this tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original tutorial.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.

Basic XML processing with Scala

Topics: XML, Scala XML API, XML literals, marshalling

Introduction

Pretty much everybody knows what XML is: it is a structured, machine-readable text format for representing information that can be easily checked for the “grammaticality” of the tags, attributes, and their relationship to each other (e.g. using DTDs). This contrasts with HTML, which can have elements that don’t close (e.g. <p>foo<p>bar rather than <p>foo</p><p>bar</p>) and still be processed. XML was only ever meant to be a format for machines, but it morphed into a data representation that many people ended up (unfortunately, for them) editing by hand. However, even as a machine-readable format it has problems, such as being far more verbose than is really required, which matters quite a bit when you need to transfer lots of data from machine to machine. In the next post, I’ll discuss JSON and Avro, which can be viewed as evolutions of what XML was intended for and which work much better for lots of the applications that matter in the “big data” context. Regardless, there is plenty of legacy data that was produced as XML, and there are many communities (e.g. the digital humanities community) who still seem to adore XML, so people doing any reasonable amount of text analysis work will likely find themselves eventually needing to work with XML-encoded data.

There are a lot of tutorials on XML and Scala — just do a web search for “Scala XML” and you’ll get them. As with other blog posts, this one is aimed at being very explicit so that beginners can see examples with all the steps in them, and I’ll use it to set up a JSON processing post.

A simple example of XML

To start things off, let’s consider a very basic example of creating and processing a bit of XML.

The first thing to know about XML in Scala is that Scala can process XML literals. That is, you don’t need to put quotes around XML strings. Instead, you can just write them directly, and Scala will automatically interpret them as XML elements (of type scala.xml.Elem).

scala> val foo = <foo><bar type="greet">hi</bar><bar type="count">1</bar><bar type="color">yellow</bar></foo>
foo: scala.xml.Elem = <foo><bar type="greet">hi</bar><bar type="count">1</bar><bar type="color">yellow</bar></foo>

Now let’s do a little bit of processing on this. You can get all the text by using the text method.

scala> foo.text
res0: String = hi1yellow

So, that munged all the text together. To get them printed out with spaces between, let’s first get all the bar nodes and then get their texts and use mkString on that sequence. To get the bar nodes, we can use the \ selector.

scala> foo \ "bar"
res1: scala.xml.NodeSeq = NodeSeq(<bar type="greet">hi</bar>, <bar type="count">1</bar>, <bar type="color">yellow</bar>)

This gives us back a sequence of the bar nodes that occur directly under the foo node. Note that the \ operator (selector) is just a mirror image of the / selector used in XPath.

Of course, now that we have such a sequence, we can map over it to get what we want. Since the text method returns the text under a node, we can do the following.

scala> (foo \ "bar").map(_.text).mkString(" ")
res2: String = hi 1 yellow

To grab the value of the type attribute on each node, we can use the \ selector followed by “@type”.

scala> (foo \ "bar").map(_ \ "@type")
res3: scala.collection.immutable.Seq[scala.xml.NodeSeq] = List(greet, count, color)

scala> (foo \ "bar").map(barNode => (barNode \ "@type", barNode.text))
res4: scala.collection.immutable.Seq[(scala.xml.NodeSeq, String)] = List((greet,hi), (count,1), (color,yellow))

Note that the \ selector can only retrieve children of the node you are selecting from. To dig arbitrarily deep to pull out all nodes of a given type no matter where they are, use the \\ selector. Consider the following (bizarre) XML snippet with ‘z’ nodes at different levels of embedding.

<a>
  <z x="1"/>
  <b>
    <z x="2"/>
    <c>
      <z x="3"/>
    </c>
    <z x="4"/>
  </b>
</a>

Let’s first put it into the REPL.

scala> val baz = <a><z x="1"/><b><z x="2"/><c><z x="3"/></c><z x="4"/></b></a>
baz: scala.xml.Elem = <a><z x="1"></z><b><z x="2"></z><c><z x="3"></z></c><z x="4"></z></b></a>

If we want to get all of the ‘z’ nodes, we do the following.

scala> baz \\ "z"
res5: scala.xml.NodeSeq = NodeSeq(<z x="1"></z>, <z x="2"></z>, <z x="3"></z>, <z x="4"></z>)

And we can of course easily dig out the values of the x attributes on each of the z’s.

scala> (baz \\ "z").map(_ \ "@x")
res6: scala.collection.immutable.Seq[scala.xml.NodeSeq] = List(1, 2, 3, 4)

Throughout all of the above, we have used XML literals — that is, expressions typed directly into Scala, which interprets them as XML types. However, we usually need to process XML that is saved in a file, or a string, so the scala.xml.XML object has several methods for creating scala.xml.Elem objects from other sources. For example, the following allows us to create XML from a string.

scala> val fooString = """<foo><bar type="greet">hi</bar><bar type="count">1</bar><bar type="color">yellow</bar></foo>"""
fooString: java.lang.String = <foo><bar type="greet">hi</bar><bar type="count">1</bar><bar type="color">yellow</bar></foo>

scala> val fooElemFromString = scala.xml.XML.loadString(fooString)
fooElemFromString: scala.xml.Elem = <foo><bar type="greet">hi</bar><bar type="count">1</bar><bar type="color">yellow</bar></foo>

This Elem is the same as the one created using the XML literal, as shown by the following test.

scala> foo == fooElemFromString
res7: Boolean = true

See the Scala XML object for other ways to create XML elements, e.g. from InputStreams, Files, etc.

A richer XML example

As a more interesting example of some XML to process, I’ve created the following short XML string describing artists, albums, and songs, which you can see in the GitHub gist music.xml.

https://gist.github.com/2597611

I haven’t put any special care into this, other than to make sure it has embedded tags, some of which have attributes, and some reasonably interesting content (and some great songs).

You should save this in a file called /tmp/music.xml. Once you’ve done that, you can run the following code, which just prints out each artist, album and song, with an indent for each level.

val musicElem = scala.xml.XML.loadFile("/tmp/music.xml")

(musicElem \ "artist").foreach { artist =>
  println((artist \ "@name").text + "\n")
  (artist \ "album").foreach { album =>
    println("  " + (album \ "@title").text + "\n")
    (album \ "song").foreach { song =>
      println("    " + (song \ "@title").text)
    }
    println()
  }
}

Converting objects to and from XML

One of the use cases for XML is to provide a machine-readable serialization format for objects that can still be easily read, and at times edited, by humans. The process of shuffling objects from memory into a disk-format like XML is called marshalling. We’ve started with some XML, so what we’ll do is define some classes and “unmarshall” the XML into objects of those classes. Put the following into the REPL. (Tip: You can use “:paste” to enter multi-line statements like those below. These will work without paste, but it is necessary to use it in some contexts, e.g. if you define Artist before Song.)

case class Song(val title: String, val length: String) {
  lazy val time = {
    val Array(minutes, seconds) = length.split(":")
    minutes.toInt*60 + seconds.toInt
  }
}

case class Album(val title: String, val songs: Seq[Song], val description: String) {
  lazy val time = songs.map(_.time).sum
  // Zero-pad the seconds so that e.g. 2405 seconds prints as 40:05 rather than 40:5.
  lazy val length = "%d:%02d".format(time / 60, time % 60)
}

case class Artist(val name: String, val albums: Seq[Album])

Pretty simple and straightforward. Note the use of lazy vals for defining things like the time (length in seconds) of a song. The reason for this is that if we create a Song object but never ask for its time, then the code needed to compute it from a string like “4:38” is never run; however, if we had left lazy off, then it would be computed when the Song object is created. Also, we don’t want to use a def here (i.e. make time a method) because its value is fixed by the length string; with a method, the time would be recomputed every time it is requested from a given object.
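To make the difference between val, lazy val and def concrete, here is a small sketch that counts how often each body runs. The Timings class and the counter variables are made up for this illustration and are not part of the tutorial's classes.

```scala
// Counters that record how often each body is evaluated (illustration only).
var eagerRuns, lazyRuns, defRuns = 0

// Hypothetical class, used only to compare the three evaluation strategies.
class Timings(length: String) {
  private def parse(): Int = length.split(":") match {
    case Array(minutes, seconds) => minutes.toInt * 60 + seconds.toInt
  }
  val eagerTime: Int = { eagerRuns += 1; parse() }     // runs at construction
  lazy val lazyTime: Int = { lazyRuns += 1; parse() }  // runs on first access, then cached
  def defTime: Int = { defRuns += 1; parse() }         // runs on every call
}

val t = new Timings("4:38")
val afterConstruction = (eagerRuns, lazyRuns, defRuns)
println(afterConstruction)                 // only the plain val has run so far

val lazySum = t.lazyTime + t.lazyTime      // two accesses, one evaluation
val defSum = t.defTime + t.defTime         // two accesses, two evaluations
println((eagerRuns, lazyRuns, defRuns))
```

The plain val pays the parsing cost up front, the def pays it on every access, and the lazy val pays it at most once, and only if somebody asks.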

Given the classes above, we can create and use objects from them by hand.

scala> val foobar = Song("Foo Bar", "3:29")
foobar: Song = Song(Foo Bar,3:29)

scala> foobar.time
res0: Int = 209

Using the native Scala XML API

Of course, we’re more interested in constructing Artist, Album, and Song objects from information specified in files like the music example. Though I don’t show the REPL output here, you should enter all of the commands below into it to see what happens.

To start off, make sure you have loaded the file.

val musicElem = scala.xml.XML.loadFile("/tmp/music.xml")

Now we can work with the file to select various elements, or create objects of the classes defined above. Let’s start with just Songs. We can ignore all the artists and albums and dig straight in with the \\ operator, which selects matching descendants at any depth (whereas \ only looks at direct children).

val songs = (musicElem \\ "song").map { song =>
  Song((song \ "@title").text, (song \ "@length").text)
}

scala> songs.map(_.time).sum
res1: Int = 11311
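If the two selection operators are easy to mix up, the following sketch shows them side by side on a tiny inline document. The document is made up for this illustration (it is not the contents of music.xml), and the code assumes the scala-xml module is on the classpath; the first line is a scala-cli directive with an assumed version number and is just a comment everywhere else.

```scala
//> using dep "org.scala-lang.modules::scala-xml:2.1.0"
import scala.xml.XML

// A tiny hypothetical document, not the gist's music.xml.
val tiny = XML.loadString(
  """<music>
    |  <artist name="A">
    |    <album title="X">
    |      <song title="one" length="1:05"/>
    |      <song title="two" length="2:10"/>
    |    </album>
    |  </artist>
    |</music>""".stripMargin)

val direct = tiny \ "song"    // direct children of <music> only: nothing matches
val deep = tiny \\ "song"     // descendants at any depth: both songs match

println(direct.length)
println(deep.length)
println(deep.map(song => (song \ "@title").text))
```

The single backslash finds nothing here because the songs sit two levels below the root, which is exactly why the songs example needs the double backslash.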

And, we can go all the way and construct Artist, Album and Song objects that directly mirror the data stored in the XML file.

val artists = (musicElem \ "artist").map { artist =>
  val name = (artist \ "@name").text
  val albums = (artist \ "album").map { album =>
    val title = (album \ "@title").text
    val description = (album \ "description").text
    val songList = (album \ "song").map { song =>
      Song((song \ "@title").text, (song \ "@length").text)
    }
    Album(title, songList, description)
  }
  Artist(name, albums)
}

With the artists sequence in hand, we can do things like showing the length of each album.

val albumLengths = artists.flatMap { artist =>
  artist.albums.map(album => (artist.name, album.title, album.length))
}
albumLengths.foreach(println)

Which gives the following output.

(Radiohead,The King of Limbs,37:34)
(Radiohead,OK Computer,53:21)
(Portished,Dummy,48:46)
(Portished,Third,48:50)

Marshalling objects to XML

In addition to constructing objects from XML specifications (also referred to as de-serializing and un-marshalling), it is often necessary to marshal objects one has constructed in code to XML (or other formats). The use of XML literals is actually quite handy in this regard. To see this, let’s start with the first song of the first album of the first artist (Bloom, by Radiohead).

scala> val bloom = artists(0).albums(0).songs(0)
bloom: Song = Song(Bloom,5:15)

We can construct an Elem from this as follows.

scala> val bloomXml = <song title={bloom.title} length={bloom.length}/>
bloomXml: scala.xml.Elem = <song length="5:15" title="Bloom"></song>

The thing to note here is that an XML literal is used, but when we want to use values from variables, we can escape from literal-mode with curly brackets. So, {bloom.title} becomes “Bloom”, and so on. In contrast, one could do it via a String as follows.

scala> val bloomXmlString = "<song title=\""+bloom.title+"\" length=\""+bloom.length+"\"/>"
bloomXmlString: java.lang.String = <song title="Bloom" length="5:15"/>

scala> val bloomXmlFromString = scala.xml.XML.loadString(bloomXmlString)
bloomXmlFromString: scala.xml.Elem = <song length="5:15" title="Bloom"></song>

So, the use of literals is a bit more readable (though it comes at the cost of making it hard in Scala to use “<” as an operator for many use cases, which is one of the reasons XML literals are considered by many to be not a great idea).
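Readability is not the only difference between the two routes. Here is a small sketch, using a made-up title that contains an ampersand, of how each one copes with reserved XML characters. It assumes scala-xml is on the classpath (the first line is a scala-cli directive with an assumed version number); Utility.escape is the library's own escaping helper.

```scala
//> using dep "org.scala-lang.modules::scala-xml:2.1.0"
import scala.util.Try
import scala.xml.{Utility, XML}

// Hypothetical title containing a reserved XML character.
val title = "Rock & Roll"

// Literal: the value is inserted as data and escaped when serialized.
val viaLiteral = <song title={title}/>
println(viaLiteral)

// String concatenation: the raw & lands in the markup, so parsing fails.
val viaString = "<song title=\"" + title + "\"/>"
val parsed = Try(XML.loadString(viaString))
println(parsed.isFailure)

// Escaping by hand makes the string route work again.
val escaped = "<song title=\"" + Utility.escape(title) + "\"/>"
val reparsed = XML.loadString(escaped)
println((reparsed \ "@title").text)
```

So with literals the escaping comes for free, whereas with strings you have to remember to do it yourself.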

We can create the whole XML for all of the artists and albums in one fell swoop. Note that one can have XML literals in the escaped bracketed portions of an XML literal, which allows the following to work. Note: you need to use the :paste mode in the REPL in order for this to work.

val marshalled =
  <music>
  { artists.map { artist =>
    <artist name={artist.name}>
    { artist.albums.map { album =>
      <album title={album.title}>
      { album.songs.map(song => <song title={song.title} length={song.length}/>) }
      <description>{album.description}</description>
      </album>
    }}
    </artist>
  }}
</music>

Note that in this case, the for-yield syntax is perhaps a bit more readable since it doesn’t require the extra curly braces.

val marshalledYield =
<music>
  { for (artist <- artists) yield
    <artist name={artist.name}>
    { for (album <- artist.albums) yield
      <album title={album.title}>
      { for (song <- album.songs) yield <song title={song.title} length={song.length}/> }
      <description>{album.description}</description>
      </album>
    }
    </artist>
  }
</music>

One could of course instead add a toXml method to each of the Song, Album, and Artist classes such that at the top level you’d have something like the following.

val marshalledWithToXml =  <music> { artists.map(_.toXml) } </music>

This is a fairly common strategy. However, note that the problem with this solution is that it produces a very tight coupling between the program logic (e.g. of what things like Songs, Albums and Artists can do) and other, orthogonal logic, like serializing them. To see a way of decoupling such different needs, check out Dan Rosen’s excellent tutorial on type classes.
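To give a rough idea of what that decoupling can look like, here is a minimal sketch of a type class for XML serialization. It is my own illustration rather than the code from that tutorial: the ToXml trait, the marshal function, and the SimpleSong and SimpleAlbum stand-in classes are all hypothetical names (the stand-ins are used so that the Song and Album definitions above are not clobbered in the REPL). It assumes scala-xml is on the classpath, with the first line being a scala-cli directive with an assumed version number.

```scala
//> using dep "org.scala-lang.modules::scala-xml:2.1.0"
import scala.xml.Elem

// Plain data classes that know nothing about XML (hypothetical stand-ins).
case class SimpleSong(title: String, length: String)
case class SimpleAlbum(title: String, songs: Seq[SimpleSong], description: String)

// Hypothetical type class: evidence that an A can be rendered as an XML element.
trait ToXml[A] { def toXml(a: A): Elem }

object ToXml {
  implicit val songToXml: ToXml[SimpleSong] = new ToXml[SimpleSong] {
    def toXml(s: SimpleSong): Elem = <song title={s.title} length={s.length}/>
  }
  implicit val albumToXml: ToXml[SimpleAlbum] = new ToXml[SimpleAlbum] {
    def toXml(a: SimpleAlbum): Elem =
      <album title={a.title}>
        { a.songs.map(s => songToXml.toXml(s)) }
        <description>{a.description}</description>
      </album>
  }
}

// Serialization lives outside the data classes and is looked up implicitly.
def marshal[A](a: A)(implicit ev: ToXml[A]): Elem = ev.toXml(a)

val album = SimpleAlbum("X", Seq(SimpleSong("one", "1:05")), "hypothetical")
val albumXml = marshal(album)
println((albumXml \ "song" \ "@title").text)
println((albumXml \ "description").text)
```

The data classes stay free of any XML concerns, and a different output format would just be another set of instances.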

Conclusion

The standard Scala XML API comes packaged with Scala, and it is actually quite nice for some basic XML processing. However, it caused some “controversy” in that it was felt by many that the core language has no business providing specialized processing for a format like XML. Also, there are some efficiency issues. Anti-XML is a library that seeks to do a better job of processing XML (especially in being more scalable and more flexible in allowing programmatic editing of XML). As I understand things, Anti-XML may become a sort of official XML processing library in the future, with the current standard XML library being phased out. Nonetheless, many of the ways of interacting with an XML document shown above are similar, so being familiar with the standard Scala XML API provides the core concepts you’ll need for other such libraries.

Copyright 2012 Jason Baldridge

The text of this tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original tutorial.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.