First steps in Scala for beginning programmers, Part 8

Topics: scala.io.Source, accessing files, flatMap, mutable Maps

Preface

This is part 8 of tutorials for first-time programmers getting into Scala. Other posts are on this blog, and you can get links to those and other resources on the links page of the Computational Linguistics course I’m creating these for.

This tutorial is about accessing the file system in order to work with text files. The previous tutorial showed how to build a Map that contains the counts of each word type in a given text. However, it was assumed that the text was available in a String variable, and typically we are interested in knowing things about files that live on the file system, or on the internet. This tutorial shows how to read a file’s contents into Scala for processing, both by building a single String for the file or by consuming it line-by-line in a streaming fashion. Along the way, immutable Maps are introduced as a way to enable word counting without reading an entire file into memory.

Word count on the contents of a file

As an example, we’ll use the complete Sherlock Holmes from project Gutenberg. Download it, put it into a directory, and then start up the Scala REPL in that directory. To access files, we’ll use the Source class, so to start you need to import it.

[sourcecode language=”scala”]
scala> import scala.io.Source
import scala.io.Source
[/sourcecode]

Source provides a number of ways to interact with files and make them accessible to you in your Scala program. The fromFile method is the one you’ll probably need most.

[sourcecode language=”scala”]
scala> Source.fromFile("pg1661.txt")
res3: scala.io.BufferedSource = non-empty iterator
[/sourcecode]

This creates a BufferedSource, from which you can easily get all of file’s contents as a String.

[sourcecode language=”scala”]
scala> val holmes = Source.fromFile("pg1661.txt").mkString
holmes: String =
"Project Gutenberg’s The Adventures of Sherlock Holmes, by Arthur Conan Doyle

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net
<…many more lines…>
[/sourcecode]

With this, you can do the same things as shown it tutorial 7 to get the word counts (except that here we’ll split on white space sequences rather than just a single space).

[sourcecode language=”scala”]
scala> val counts = holmes.split("\s+").groupBy(x=>x).mapValues(x=>x.length)
counts: scala.collection.immutable.Map[java.lang.String,Int] = Map(wood-work, -> 1, "Pray, -> 1, herself. -> 2, stern-post -> 1, "Should -> 1, incident -> 8, serious -> 14, earth–" -> 2, sinister -> 10, comply -> 7, breaks -> 1, forgotten -> 3, precious -> 10, ‘It -> 3, compliment -> 2, suite, -> 1, "DEAR -> 1, summarise. -> 1, "Done -> 1, fine.’ -> 1, lover -> 5, of. -> 2, lead. -> 1, plentiful -> 1, ‘Lone -> 4, malignant -> 1, terrible -> 14, rate -> 1, mole -> 1, assert -> 1, lights -> 2, Stevenson, -> 1, submitted -> 4, tap. -> 1, beard, -> 1, band–a -> 1, force! -> 1, snow -> 7, Produced -> 2, ask, -> 1, purchasing -> 1, Hall, -> 1, wall. -> 5, remarked -> 32, laughing -> 4, member." -> 1, 30,000 -> 2, Redistributing -> 1, coat, -> 6, "’One -> 2, ‘band,’ -> 1, relapsed -> 1, apol…

scala> counts("Holmes")
res2: Int = 197

scala> counts("Watson")
res3: Int = 4
[/sourcecode]

Lest you think it strange that Watson only shows up four times, keep in mind that we split on whitespace, and that means that in a sentence like the following, the token of interest is Watson,” rather than Watson.

“You could not possibly have come at a better time, my dear Watson,” he said cordially.

Looking that and others up shows more tokens containing Watson in the story.

[sourcecode language=”scala”]
scala> counts("Watson,"")
res4: Int = 19

scala> counts("Watson,")
res5: Int = 40

scala> counts("Watson.")
res6: Int = 10
[/sourcecode]

Of course, the real problem is that tokenizing on whitespace is too crude. To do this properly generally takes a good hand-built tokenizer (which is able to keep tokens like e.g. and Mr. and Yahoo! while splitting punctuation off most words) or a machine learned one that is trained on data hand-labeled for tokens. For an example of the latter, see the Apache OpenNLP toolkit tokenizers, which includes pre-trained models for English.

Working line by line

Quite often, you need to work through a file line by line, rather than reading the entire thing in as a single string as we did above. For example, you might need to process each line differently, so just having it as a single String isn’t particular convenient. Or, you might be working with a large file that cannot easily fit into memory (which is what happens when you read in the entire string). You can obtain the lines in the file as an Iterator[String], in which each item is a single line from the file, using the getLines method.

[sourcecode language=”scala”]
scala> Source.fromFile("pg1661.txt").getLines
res4: Iterator[String] = non-empty iterator
[/sourcecode]

This iterator is ready for you to consume lines, but it doesn’t read all of the file into memory right away — instead it buffers it such that each line will be available for you as you ask for it, essentially reading off disk as you demand more lines. You can think of this as streaming the file to your Scala program, much like modern audio and video content is streamed to your computer: it is never actually stored, but is just transferred in parts to where it is needed, when it is needed.

Of course, Iterators share much with sequence data structures like Lists: once we have an Iterator, we can use foreach, for, map, etc. on it. So to print out all of the lines in the file, we can do the following.

[sourcecode language=”scala”]
scala> Source.fromFile("pg1661.txt").getLines.foreach(println)
Project Gutenberg’s The Adventures of Sherlock Holmes, by Arthur Conan Doyle

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net

Title: The Adventures of Sherlock Holmes

Author: Arthur Conan Doyle
<…many more lines…>
[/sourcecode]

That creates a lot of output, but it shows you how you can easily create your own Scala implementation of the Unix cat program: just save the following line in a file called cat.scala:

[sourcecode language=”scala”]
scala.io.Source.fromFile(args(0)).getLines.foreach(println)
[/sourcecode]

And then call that with the name of the file to list its contents.

[sourcecode language=”bash”]
$ scala cat.scala pg1661.txt
[/sourcecode]

Back in the REPL, it is somewhat less-than-ideal to see the entire file. If you just want to see the start of the file, use the take method on the Iterator before the foreach.

[sourcecode language=”scala”]
scala> Source.fromFile("pg1661.txt").getLines.take(5).foreach(println)
Project Gutenberg’s The Adventures of Sherlock Holmes, by Arthur Conan Doyle

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
[/sourcecode]

The take method is quite useful in general with any sequence, and provides the complement of the drop method, as shown in the following examples on a simple List[Int].

[sourcecode language=”scala”]
scala> val numbers = 1 to 10 toList
numbers: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> numbers.take(3)
res12: List[Int] = List(1, 2, 3)

scala> numbers.drop(3)
res13: List[Int] = List(4, 5, 6, 7, 8, 9, 10)

scala> numbers.take(3) ::: numbers.drop(3)
res14: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
[/sourcecode]

Word counting line by line, first try

Now that we’ve seen how to read a file and start working with it line-by-line, how do we count the number of occurrences of each word? Recall from tutorial 7 and above that the starting point was to have a sequence (Array, List, etc) of Strings in which each element is a word token. To start moving toward that, we can simply use the toList method on the Iterator[String] obtained from getLines.

[sourcecode language=”scala”]
scala> val holmes = Source.fromFile("pg1661.txt").getLines.toList
holmes: List[String] = List(The Project Gutenberg EBook of The Adventures of Sherlock Holmes, by Sir Arthur Conan Doyle, (#15 in our series by Sir Arthur Conan Doyle), "", Copyright laws are changing all over the world. Be sure to check the, copyright laws for your country before downloading or redistributing, this or any other Project Gutenberg eBook., "", This header should be the first thing seen when viewing this Project, Gutenberg file.  Please do not remove it.  Do not change or edit the, header without written permission., "", Please read the "legal small print," and other information about the, eBook and Project Gutenberg at the bottom of this file.  Included is, important information about your specific rights and restrictions in, how the file may be used.  You can also find ou…
[/sourcecode]

We now have the contents of the file as a List[String], and may proceed to do useful things with it. For example, we could map each line (Strings) to be sequences of whitespace-separated Strings.

[sourcecode language=”scala”]
scala> val listOfListOfWords = Source.fromFile("pg1661.txt").getLines.toList.map(x => x.split(" ").toList)
listOfListOfWords: List[List[java.lang.String]] = List(List(Project, Gutenberg’s, The, Adventures, of, Sherlock, Holmes,, by, Arthur, Conan, Doyle), List(""), List(This, eBook, is, for, the, use, of, anyone, anywhere, at, no, cost, and, with), List(almost, no, restrictions, whatsoever., "", You, may, copy, it,, give, it, away, or), List(re-use, it, under, the, terms, of, the, Project, Gutenberg, License, included), List(with, this, eBook, or, online, at, www.gutenberg.net), List(""), List(""), List(Title:, The, Adventures, of, Sherlock, Holmes), List(""), List(Author:, Arthur, Conan, Doyle), List(""), List(Posting, Date:, April, 18,, 2011, [EBook, #1661]), List(First, Posted:, November, 29,, 2002), List(""), List(Language:, English), List(""), List(""), List(***, START, OF, THIS, PRO…
[/sourcecode]

And, as we saw in tutorial 7, when we have a List of Lists, we can use flatten to create one big List.

[sourcecode language=”scala”]
scala> val listOfWords = listOfListOfWords.flatten
listOfWords: List[java.lang.String] = List(Project, Gutenberg’s, The, Adventures, of, Sherlock, Holmes,, by, Arthur, Conan, Doyle, "", This, eBook, is, for, the, use, of, anyone, anywhere, at, no, cost, and, with, almost, no, restrictions, whatsoever., "", You, may, copy, it,, give, it, away, or, re-use, it, under, the, terms, of, the, Project, Gutenberg, License, included, with, this, eBook, or, online, at, www.gutenberg.net, "", "", Title:, The, Adventures, of, Sherlock, Holmes, "", Author:, Arthur, Conan, Doyle, "", Posting, Date:, April, 18,, 2011, [EBook, #1661], First, Posted:, November, 29,, 2002, "", Language:, English, "", "", ***, START, OF, THIS, PROJECT, GUTENBERG, EBOOK, THE, ADVENTURES, OF, SHERLOCK, HOLMES, ***, "", "", "", "", Produced, by, an, anonymous, Project, Gut…
[/sourcecode]

But, now you might recognize that this is the map-then-flatten pattern we saw previously, which means we can flatMap it instead.

[sourcecode language=”scala”]
scala> val flatMappedWords = Source.fromFile("pg1661.txt").getLines.toList.flatMap(x => x.split(" "))
flatMappedWords: List[java.lang.String] = List(Project, Gutenberg’s, The, Adventures, of, Sherlock, Holmes,, by, Arthur, Conan, Doyle, "", This, eBook, is, for, the, use, of, anyone, anywhere, at, no, cost, and, with, almost, no, restrictions, whatsoever., "", You, may, copy, it,, give, it, away, or, re-use, it, under, the, terms, of, the, Project, Gutenberg, License, included, with, this, eBook, or, online, at, www.gutenberg.net, "", "", Title:, The, Adventures, of, Sherlock, Holmes, "", Author:, Arthur, Conan, Doyle, "", Posting, Date:, April, 18,, 2011, [EBook, #1661], First, Posted:, November, 29,, 2002, "", Language:, English, "", "", ***, START, OF, THIS, PROJECT, GUTENBERG, EBOOK, THE, ADVENTURES, OF, SHERLOCK, HOLMES, ***, "", "", "", "", Produced, by, an, anonymous, Project,…
[/sourcecode]

But you should be a bit bothered by all this: wasn’t the idea here (in part) not to read all of the lines in at once? Indeed, with what we did above, as soon as we said toList on the Iterator, the whole file was read into memory. However, we can do without the toList step and just directly flatMap the Iterator and get a new Iterator over the tokens rather than the lines.

[sourcecode language=”scala”]
scala> val flatMappedWords = Source.fromFile("pg1661.txt").getLines.flatMap(x => x.split(" "))
flatMappedWords: Iterator[java.lang.String] = non-empty iterator
[/sourcecode]

Now, if we want to count the words, we can convert that to a List and do the groupBy the mapValues trick we’ve seen already (output omitted).

[sourcecode language=”scala”]
scala> val counts = Source.fromFile("pg1661.txt").getLines.flatMap(x => x.split(" ")).toList.groupBy(x=>x).mapValues(x=>x.length)
[/sourcecode]

Oops — that worked, but we once again brought the whole file into memory because the List that was created from toList has all lines for the file. We’ll see next how to use a mutable Map to get around this.

Word counting by streaming with an Iterator and using mutable Maps

In all of the tutorials so far, I’ve pretty much stuck to immutable data structures except when mutable ones show up due to context (like Arrays coming out of the toString method). It’s good to try to make use of immutable data structures where possible, but there are times when mutable ones are more convenient and perhaps more appropriate.

With the immutable Maps we saw in the previous tutorial, you could not change the assignment to a key, nor could you add a new key.

[sourcecode language=”scala”]
lettersToNumbers: scala.collection.immutable.Map[java.lang.String,Int] = Map(A -> 1, B -> 2, C -> 3)

[sourcecode language="scala"]
scala> lettersToNumbers("A") = 4
<console>:9: error: value update is not a member of scala.collection.immutable.Map[java.lang.String,Int]
lettersToNumbers("A") = 4

scala> lettersToNumbers("D") = 5
<console>:9: error: value update is not a member of scala.collection.immutable.Map[java.lang.String,Int]
lettersToNumbers("D") = 5
[/sourcecode]

There is another kind of Map, scala.collection.mutable.Map, that does allow this sort of behavior.

[sourcecode language=”scala”]
scala> import scala.collection.mutable
import scala.collection.mutable

scala> val mutableLettersToNumbers = mutable.Map("A"->1, "B"->2, "C"->3)
mutableLettersToNumbers: scala.collection.mutable.Map[java.lang.String,Int] = Map(C -> 3, B -> 2, A -> 1)

scala> mutableLettersToNumbers("A") = 4

scala> mutableLettersToNumbers("D") = 5

scala> mutableLettersToNumbers
res4: scala.collection.mutable.Map[java.lang.String,Int] = Map(C -> 3, D -> 5, B -> 2, A -> 4)
[/sourcecode]

It also has a handy way to increase the count associated with a key, using the += method.

[sourcecode language=”scala”]
scala> mutableLettersToNumbers("D") += 5

scala> mutableLettersToNumbers
res6: scala.collection.mutable.Map[java.lang.String,Int] = Map(C -> 3, D -> 10, B -> 2, A -> 4)
[/sourcecode]

However, we can’t use that method with a key that doesn’t exist.

[sourcecode language=”scala”]
scala> mutableLettersToNumbers("E") += 1
java.util.NoSuchElementException: key not found: E
<…stacktrace…>
[/sourcecode]

Fortunately, we can provide a default. Here’s an example of starting a new Map with a default of 0.

[sourcecode language=”scala”]
scala> val counts = mutable.Map[String,Int]().withDefault(x=>0)
counts: scala.collection.mutable.Map[String,Int] = Map()

scala> counts("Z") += 1

scala> counts("Y") += 1

scala> counts("Z") += 1

scala> counts
res11: scala.collection.mutable.Map[String,Int] = Map(Z -> 2, Y -> 1)
[/sourcecode]

Note: when you start with some values already in a Map, Scala can infer the types of the keys and the values, but when initializing an empty Map, it is necessary to explicitly declare the key and value types.

With this in hand, here is how we can use flatMap plus a mutable Map to count words in a text without reading the entire text into memory.

[sourcecode language=”scala”]
import scala.collection.mutable
val counts = mutable.Map[String, Int]().withDefault(x=>0)
for (token <- scala.io.Source.fromFile("pg1661.txt").getLines.flatMap(x =>x.split("\s+")))
counts(token) += 1
[/sourcecode]

Having created the counts Map in this way, we can convert it to an immutable Map with the toMap method once we are done adding elements.

[sourcecode language=”scala”]
scala> val fixedCounts = counts.toMap
fixedCounts: scala.collection.immutable.Map[String,Int] = Map(wood-work, -> 1,
<…output truncated…>
[/sourcecode]

Now we can’t modify the values on fixedCounts, which has advantages in many contexts, e.g. we can’t accidentally destroy values or add unwanted keys, and there are (positive) implications for parallel processing.

[sourcecode language=”scala”]
scala> fixedCounts("Holmes") = 0
<console>:13: error: value update is not a member of scala.collection.immutable.Map[String,Int]
fixedCounts("Holmes") = 0
^
[/sourcecode]

Reading a file from a URL

As it turns out scala.io.Source can do a lot more than read from a file. Another example is to read from a URL to access a file on the internet, using the fromURL method.

[sourcecode language=”scala”]
val holmesUrl = """http://www.gutenberg.org/cache/epub/1661/pg1661.txt"""
for (line <- Source.fromURL(holmesUrl).getLines)
println(line)
[/sourcecode]

If you are just going to analyze the same file again and again, this is probably not what you need — just download the file and use it locally. However, it can be quite useful in contexts where you are exploring links within pages (e.g. while processing Wikipedia or Twitter data) and need to read in content from URLs on the fly.

Use (up) the Source

A final note on the Iterators you get with Source.fromFile and Source.fromURL: you can only iterate through them once! This is part of what makes them more efficient — they aren’t holding all thattext in memory. So, don’t be surprised if you get the following behavior.

[sourcecode language=”scala”]

scala> val holmesIterator = Source.fromFile("pg1661.txt").getLines
holmesIterator: Iterator[String] = non-empty iterator

scala> holmesIterator.foreach(println)

Project Gutenberg’s The Adventures of Sherlock Holmes, by Arthur Conan Doyle

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net

<…many lines of output…>

This Web site includes information about Project Gutenberg-tm,
including how to make donations to the Project Gutenberg Literary
Archive Foundation, how to help produce our new eBooks, and how to
subscribe to our email newsletter to hear about new eBooks.

scala> holmesIterator.foreach(println)

<…nothing output!…>

[/sourcecode]

So, the Iterator is used up! If you want to go through the file again, you’ll need to spin up a new Iterator just like you did the first time around. The neat thing about staying with the Iterators and not converting to Lists (and thus bringing everything into memory) is that each mapping operation we do on the Iterator applies only for the current item we are looking at, so we never need to read the whole file into memory.

Of course, if you have a reasonably small file to work with, you should feel absolutely free to toList it and work with it that way if you prefer — it will often be more convenient since you can do the groupBy and mapValue pattern.

Copyright 2011 Jason Baldridge

The text of this tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original tutorial.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.

First steps in Scala for beginning programmers, Part 7

Topics: Maps, Sets, groupBy, Options, flatten, flatMap

Preface

This is part 7 of tutorials for first-time programmers getting into Scala. Other posts are on this blog, and you can get links to those and other resources on the links page of the Computational Linguistics course I’m creating these for.

Lists (and other sequence data structures, like Ranges and Arrays) allow you to group collections of objects in an ordered manner: you can access elements of a list by indexing their position in the list, or iterate over the list elements, one by one, using for expressions and sequence functions like map, filter, reduce and fold. Another important kind of data structure is the associative array, which you’ll come to know in Scala as a Map. (Yes, this has the unfortunate ambiguity with the map function, but their use will be quite clear from context.) Maps allow you to store a collection of key-value pairs and to access the values by the keys associated with them, rather than via an index (as with a List).

Example cases where you could use a Map:

  • Associating English words with their German translations
  • Associating each word with its count in a given text
  • Associating each word with its possible parts-of-speech

You’ll see concrete examples of each of these in this post.

Creating Maps and accessing their elements

Maps are quite intuitive to grasp. Here’s an example with a few English words and their German translations. One easy way of creating a Map is by passing in a list of pairs, where the first element of each pair defines a key and the second defines a corresponding value.

[sourcecode language=”scala”]
scala> val engToDeu = Map(("dog","Hund"), ("cat","Katze"), ("rhinoceros","Nashorn"))
engToDeu: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(dog -> Hund, cat -> Katze, rhinoceros -> Nashorn)
[/sourcecode]

Notice that the Map entries are of the form key -> value. We may then retrieve the German translation for dog by providing the key “dog” to the Map we created.

[sourcecode language=”scala”]
scala> engToDeu("dog")
res0: java.lang.String = Hund
[/sourcecode]

Think for a moment what you would have to do to accomplish this with Lists. You’d need need two Lists, one for each language, and they’d need to be aligned so that each element in one list corresponded to its translation in the other list.

[sourcecode language=”scala”]
scala> val engWords = List("dog","cat","rhinoceros")
engWords: List[java.lang.String] = List(dog, cat, rhinoceros)

scala> val deuWords = List("Hund","Katze","Nashorn")
deuWords: List[java.lang.String] = List(Hund, Katze, Nashorn)
[/sourcecode]

Then, to find the translation of cat, you would have to find the index of cat in engWords, and then look up that index in deuWords.

[sourcecode language=”scala”]
scala> engWords.indexOf("cat")
res2: Int = 1

scala> deuWords(engWords.indexOf("cat"))
res3: java.lang.String = Katze
[/sourcecode]

This is actually quite inefficient, as well as having other problems. Maps are the right thing for what we want here, and they do they job of retrieving values for keys quite efficiently.

It turns out that we can take two lists that are aligned in this way and construct a Map very easily. Recall that zipping two lists together creates one list of pairs, where each pair gives the elements that shared the same index.

[sourcecode language=”scala”]
scala> engWords.zip(deuWords)
res4: List[(java.lang.String, java.lang.String)] = List((dog,Hund), (cat,Katze), (rhinoceros,Nashorn))
[/sourcecode]

By calling the toMap method on such a List of pairs, we get a Map.

[sourcecode language=”scala”]
scala> engWords.zip(deuWords).toMap
res5: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(dog -> Hund, cat -> Katze, rhinoceros -> Nashorn)
[/sourcecode]

Note that even though the REPL is showing the order of the key-value pairs to be the same as the original list we constructed the map from, there is no inherent order to the elements of a Map.

You can add elements to a Map to create a new Map using the + operator and an arrow -> between each key and value pair.

[sourcecode language=”scala”]

scala> engToDeu + "owl" -> "Eule"
res6: (java.lang.String, java.lang.String) = (Map(dog -> Hund, cat -> Katze, rhinoceros -> Nashorn)owl,Eule)

scala> engToDeu + ("owl" -> "Eule", "hippopotamus" -> "Nilpferd")
res7: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(rhinoceros -> Nashorn, dog -> Hund, owl -> Eule, hippopotamus -> Nilpferd, cat -> Katze)
[/sourcecode]

You can add one Map to another using the ++ operator.

[sourcecode language=”scala”]

scala> val newEntries = Map(("hippopotamus", "Nilpferd"),("owl","Eule"))
newEntries: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(hippopotamus -> Nilpferd, owl -> Eule)

scala> val expandedEngToDeu = engToDeu ++ newEntries
expandedEngToDeu: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(rhinoceros -> Nashorn, dog -> Hund, owl -> Eule, hippopotamus -> Nilpferd, cat -> Katze)
[/sourcecode]

You can do the same by passing in a List of tuples to the ++ operator.

[sourcecode language=”scala”]

scala> engToDeu ++ List(("hippopotamus", "Nilpferd"),("owl","Eule"))
res8: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(rhinoceros -> Nashorn, dog -> Hund, owl -> Eule, hippopotamus -> Nilpferd, cat -> Katze)
[/sourcecode]

And you can remove a key from a Map with the – operator.

[sourcecode language=”scala”]

scala> engToDeu – "dog"
res9: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(cat -> Katze, rhinoceros -> Nashorn)
[/sourcecode]

See the Map API for more examples of such functions. Note: throughout this post, I’m sticking to immutable Maps — if you are looking at any other tutorials and are wondering why certain methods from those aren’t working here, they may have been using mutable Maps, which we’ll discuss later.

If we ask for the value associated with a key that doesn’t exist in the Map, we get an error.

[sourcecode language=”scala”]
scala> engToDeu("bird")
java.util.NoSuchElementException: key not found: bird
at scala.collection.MapLike$class.default(MapLike.scala:224)
(etc.)
[/sourcecode]

You can check for whether a key is in the Map using the contains method.

[sourcecode language=”scala”]
scala> engToDeu.contains("bird")
res10: Boolean = false

scala> engToDeu.contains("dog")
res11: Boolean = true
[/sourcecode]

Let’s say you had a list of English words and wanted to look up their corresponding German words, and you want to protect yourself against the NoSuchElementException. One way to do this is to filter the words using contains, and then map the remaining ones through engToDeu.

[sourcecode language=”scala”]
scala> val wordsToTranslate = List("dog","bird","cat","armadillo")
wordsToTranslate: List[java.lang.String] = List(dog, bird, cat, armadillo)

scala> wordsToTranslate.filter(x=>engToDeu.contains(x)).map(x=>engToDeu(x))
res12: List[java.lang.String] = List(Hund, Katze)
[/sourcecode]

This is a useful ways of safely applying a Map to a list of items. However, we’ll see a better way to deal with missing values later on, using Options.

If you there is a sensible default value for any key you might try with your map, you can use the getOrElse method. You provide the key as the first argument, and then the default value as the second.

[sourcecode language=”scala”]

scala> engToDeu.getOrElse("dog","???")
res1: java.lang.String = Hund

scala> engToDeu.getOrElse("armadillo","???")
res2: java.lang.String = ???
[/sourcecode]

It is quite common to use getOrElse with a default of 0 for Maps that contain statistics, such as word counts (see below), where the absence of a key naturally indicates that it has, e.g., a count of zero.

If you have a consistent default value for any keys that aren’t in the Map, you can set it by using the withDefault method.

[sourcecode language=”scala”]

scala> val engToDeu = Map(("dog","Hund"), ("cat","Katze"), ("rhinoceros","Nashorn")).withDefault(x => "???")
engToDeu: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(dog -> Hund, cat -> Katze, rhinoceros -> Nashorn)

scala> engToDeu("armadillo")
res3: java.lang.String = ???
[/sourcecode]

Now you can ask for values in the usual manner, without needing to use getOrElse and providing the default every time.

Keys and values in Maps

You may have observed that Scala tells you more than that you have just created a Map. Like List, Map is a parameterized type, which means that it is a generic way of collecting a bunch of objects of particular types together. Above we saw an instance of a Map[String, String] (leaving off the java.lang part to make it clearer). The first String indicates that the keys are strings and the second that values are Strings. Basically, any type can be used in either position (warning: you should avoid using mutable data structures as keys unless you know what you are doing). Here are some examples (try to ignore the scala.collection.immutable and java.lang parts and just focus on the Map[X,Y] signatures we get).

[sourcecode language=”scala”]
scala> Map((10,"ten"), (100,"one hundred"))
res0: scala.collection.immutable.Map[Int,java.lang.String] = Map(10 -> ten, 100 -> one hundred)

scala> Map(("a",1),("b",2))
res1: scala.collection.immutable.Map[java.lang.String,Int] = Map(a -> 1, b -> 2)

scala> Map((1,3.14), (2,6.28))
res2: scala.collection.immutable.Map[Int,Double] = Map(1 -> 3.14, 2 -> 6.28)

scala> Map((("pi",1),3.14), (("tau",2),6.28))
res3: scala.collection.immutable.Map[(java.lang.String, Int),Double] = Map((pi,1) -> 3.14, (tau,2) -> 6.28)

scala> Map(("the",List("Determiner")),("book",List("Verb","Noun")),("off",List("Preposition","Verb")))
res4: scala.collection.immutable.Map[java.lang.String,List[java.lang.String]] = Map(the -> List(Determiner), book -> List(Verb, Noun), off -> List(Preposition, Verb))
[/sourcecode]

The last two examples show some very useful aspects of key and values types that allow you to use more complex keys and values. The former uses a (String, Int) pair as a key, with signature Map[(String, Int), Double], and the latter uses a List[String] as the value, with signature Map[String, List[String]]. So you can bundle together several types using tuples and you can use parameterized data structures to parameterize another data structure.

A simple translation task

Here is a mini German/English dictionary as a Map.

[sourcecode language=”scala”]
scala> val miniDictionary = Map(("befreit","liberated"),("baeche","brooks"),("eise","ice"),("sind","are"),("strom","river"),("und","and"),("vom","from"))
miniDictionary: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(und -> and, eise -> ice, sind -> are, befreit -> liberated, strom -> river, vom -> from, baeche -> brooks)
[/sourcecode]

We can provide a (very bad) translation of the German sentence “vom eise befreit sind strom und baeche” using this dictionary: we simply split the German sentence and then map over its elements, looking up each word in the dictionary.

[sourcecode language=”scala”]
scala> val example = "vom eise befreit sind strom und baeche"
example: java.lang.String = vom eise befreit sind strom und baeche

scala> example.split(" ").map(deuWord => miniDictionary(deuWord)).mkString(" ")
res0: String = from ice liberated are river and brooks
[/sourcecode]

Okay, not quite “from the ice they are freed, the stream and brook” but then again it’s pretty much the dumbest machine translation approach available…

A danger of course is that we will have words that aren’t in the dictionary, leading to an exception.

[sourcecode language=”scala”]
scala> val example2 = "vom eise befreit sind strom und schiffe"
example2: java.lang.String = vom eise befreit sind strom und schiffe

scala> example2.split(" ").map(deuWord => miniDictionary(deuWord)).mkString(" ")
java.util.NoSuchElementException: key not found: schiffe
[/sourcecode]

We’ll return to this below.

Creating Maps from Lists using groupBy

We frequently have data stored in a particular data structure and would like to work with it using another data structure that organizes the data points in some other manner. Here, we’ll look at how to convert a List into Map using the groupBy method in order to do some useful processing for working with parts-of-speech. We’ll also see the Set data structure along the way.

We’ll start with a very basic example of what groupBy does. Given a list of number tokens, we can obtain a Map from the number types to all of the tokens of each number.

[sourcecode language=”scala”]
scala> val numbers = List(1,4,5,1,6,5,2,8,1,9,2,1)
numbers: List[Int] = List(1, 4, 5, 1, 6, 5, 2, 8, 1, 9, 2, 1)

scala> numbers.groupBy(x=>x)
res19: scala.collection.immutable.Map[Int,List[Int]] = Map(5 -> List(5, 5), 1 -> List(1, 1, 1, 1), 6 -> List(6), 9 -> List(9), 2 -> List(2, 2), 8 -> List(8), 4 -> List(4))
[/sourcecode]

As you can see from the result, groupBy took the anonymous function x=>x, grouped all of the elements of the List that have the same value of x, and then created a Map from each x to the group containing its tokens. So, we get 2 mapping to a List containing 2’s, and so on. This probably seems a bit weird, but it is incredibly useful when we consider Lists that have more interesting elements in them. To do so, let’s go back to the part-of-speech tagging example from Part 4 of these tutorials. Say we have a sentence that is tagged with parts of speech, such as the following (made up) example that ensures some tag ambiguities.

in the dark , a tall man saw the saw that he needed to man to cut the dark tree .

The parts-of-speech could be annotated as follows (with lots of simplifications, and apologies to any offense caused to anyone’s linguistic sensitivities).

in/Prep the/Det dark/Noun ,/Punc a/Det tall/Adjective man/Noun saw/Verb the/Det saw/Noun that/Pronoun he/Pronoun needed/Verb to/Prep man/Verb to/Prep cut/Verb the/Det dark/Adjective tree/Noun ./Punc

See Part 4 for detailed explanation of how the following expression turns a string like this into a List of tuples.

[sourcecode language=”scala”]
scala> val tagged = "in/Prep the/Det dark/Noun ,/Punc a/Det tall/Adjective man/Noun saw/Verb the/Det saw/Noun that/Pronoun he/Pronoun needed/Verb to/Prep man/Verb to/Prep cut/Verb the/Det dark/Adjective tree/Noun ./Punc".split(" ").toList.map(x => x.split("/")).map(x => (x(0), x(1)))
tagged: List[(java.lang.String, java.lang.String)] = List((in,Prep), (the,Det), (dark,Noun), (,,Punc), (a,Det), (tall,Adjective), (man,Noun), (saw,Verb), (the,Det), (saw,Noun), (that,Pronoun), (he,Pronoun), (needed,Verb), (to,Prep), (man,Verb), (to,Prep), (cut,Verb), (the,Det), (dark,Adjective), (tree,Noun), (.,Punc))
[/sourcecode]

Now, let’s use groupBy in various ways on this. The first thing we might be interested in is seeing which parts of speech each word is associated with.

[sourcecode language=”scala”]
scala> val groupedTagged = tagged.groupBy(x => x._1)
groupedTagged: scala.collection.immutable.Map[java.lang.String,List[(java.lang.String, java.lang.String)]] = Map(in -> List((in,Prep)), needed -> List((needed,Verb)), . -> List((.,Punc)), cut -> List((cut,Verb)), saw -> List((saw,Verb), (saw,Noun)), a -> List((a,Det)), man -> List((man,Noun), (man,Verb)), that -> List((that,Pronoun)), dark -> List((dark,Noun), (dark,Adjective)), to -> List((to,Prep), (to,Prep)), , -> List((,,Punc)), tall -> List((tall,Adjective)), he -> List((he,Pronoun)), tree -> List((tree,Noun)), the -> List((the,Det), (the,Det), (the,Det)))
[/sourcecode]

So, now you see that the keys in the Map constructed by groupBy are the words and the values are the groups of the original elements. You can then see that the anonymous function x => x._1 provided to groupBy does two things: it specifies the part of the input elements that will group different items together and it specifies that that part of the input defines the key space.

However, we don’t quite have what we want, which is to have the set of parts of speech associated with each word. Instead we have a List of tuples, e.g.:

[sourcecode language=”scala”]
scala> groupedTagged("saw")
res21: List[(java.lang.String, java.lang.String)] = List((saw,Verb), (saw,Noun))
[/sourcecode]

Focussing on just this for a moment, we can map this and produce a List with just the parts-of-speech, and then turn that List into a Set with the toSet method in order to get just the unique parts-of-speech.

[sourcecode language=”scala”]
scala> groupedTagged("saw").map(x=>x._2)
res24: List[java.lang.String] = List(Verb, Noun)

scala> groupedTagged("saw").map(x=>x._2).toSet
res25: scala.collection.immutable.Set[java.lang.String] = Set(Verb, Noun)
[/sourcecode]

Converting the List to a Set didn’t do much here, but consider the, which has multiple tokens with the same part-of-speech.

[sourcecode language=”scala”]
scala> groupedTagged("the")
res26: List[(java.lang.String, java.lang.String)] = List((the,Det), (the,Det), (the,Det))

scala> groupedTagged("the").map(x=>x._2)
res27: List[java.lang.String] = List(Det, Det, Det)

scala> groupedTagged("the").map(x=>x._2).toSet
res28: scala.collection.immutable.Set[java.lang.String] = Set(Det)
[/sourcecode]

Sets are yet another of the useful data structures you have to work with, along with Maps and Lists. They work just like you would expect Sets to: they contain a collection of unique, unordered elements, and they allow you to see whether an element is in the set, whether one set is a subset of another, iterate over their elements, etc.

Now, back to getting from the word/tag pairs to a mapping from words to possible tags for each word. The keys we got from tagged.groupBy(x => x._1)  are what we want, but we want to transform the values from Lists of word/tag tokens to Sets of tags, which we can do with the mapValues method on Maps.

[sourcecode language=”scala”]
scala> val wordsToTags = tagged.groupBy(x => x._1).mapValues(listOfWordTagPairs => listOfWordTagPairs.map(wordTagPair => wordTagPair._2).toSet)
wordsToTags: scala.collection.immutable.Map[java.lang.String,scala.collection.immutable.Set[java.lang.String]] = Map(in -> Set(Prep), needed -> Set(Verb), . -> Set(Punc), cut -> Set(Verb), saw -> Set(Verb, Noun), a -> Set(Det), man -> Set(Noun, Verb), that -> Set(Pronoun), dark -> Set(Noun, Adjective), to -> Set(Prep), , -> Set(Punc), tall -> Set(Adjective), he -> Set(Pronoun), tree -> Set(Noun), the -> Set(Det))
[/sourcecode]

The bit inside the mapValues(…) part will have some readers scrunching up their eyes, but you just need to look at the line where we got res28 above: if you understood that, then you just need to realize we are doing exactly the same thing, but now in the context of mapping over the values rather than dealing with a single value. Now you know how to map over values that you are mapping over.

Now that it is hand, we can easily query the wordsToTags Map to see whether various words have various tags.

[sourcecode language=”scala”]
scala> wordsToTags("man")("Noun")
res8: Boolean = true

scala> wordsToTags("man")("Det")
res9: Boolean = false

scala> wordsToTags("man")("Verb")
res10: Boolean = true

scala> wordsToTags("saw")("Verb")
res11: Boolean = true
[/sourcecode]

This is an example of how data structures within data structures (here Sets within a Map) are quite useful. (Exercise: think about what a tree is for a moment and how you might implement it using Lists.)

There are a variety of things you can do in computational linguistics with Maps from words to their parts-of-speech. A simple example is to compute the average number of tags per word type.

[sourcecode language=”scala”]
scala> val avgTagsPerType = wordsToTags.values.map(x=>x.size).sum/wordsToTags.size.toDouble
avgTagsPerType: Double = 1.2
[/sourcecode]

If it isn’t clear to you what is going on here, tease it apart in your own REPL!

We can turn our word/tag pairs the other way to find out which words go with each part-of-speech. The only thing we need to do is groupBy on the second element of each pair, and then map the List values to their first element and get a Set from those.

[sourcecode language=”scala”]
scala> val tagsToWords = tagged.groupBy(x => x._2).mapValues(listOfWordTagPairs => listOfWordTagPairs.map(wordTagPair => wordTagPair._1).toSet)
tagsToWords: scala.collection.immutable.Map[java.lang.String,scala.collection.immutable.Set[java.lang.String]] = Map(Prep -> Set(in, to), Det -> Set(the, a), Noun -> Set(dark, man, saw, tree), Pronoun -> Set(that, he), Verb -> Set(saw, needed, man, cut), Punc -> Set(,, .), Adjective -> Set(tall, dark))
[/sourcecode]

This basic paradigm is a powerful one for flipping between different data structures depending on what our needs are. It also demonstrates several important concepts with working with Lists, Maps and Sets. The next section shows a simple application of this idea for counting words in a text.

Counting words

A common task in computational linguistics is to calculate word statistics, and the most basic of those is to count the number of tokens of each word type in a particular text. The most common way to store and access those counts is in a Map, but how do you create such a Map from a given text? If we look at a text as a list of strings, then the groupBy paradigm we did above gives us exactly what we need — in fact it is even simpler than the word/tag manipulations done above.

The example text we’ll use is the tongue-twister about woodchucks.

[sourcecode language=”scala”]
scala> val woodchuck = "how much wood could a woodchuck chuck if a woodchuck could chuck wood ? as much wood as a woodchuck would , if a woodchuck could chuck wood ."
woodchuck: java.lang.String = how much wood could a woodchuck chuck if a woodchuck could chuck wood ? as much wood as a woodchuck would , if a woodchuck could chuck wood .
[/sourcecode]

Given this, here’s how we can compute the number of occurrences of each word type. First we groupBy on the elements. Though a list of strings isn’t as interesting as having a list of Tuples as we had with words and tags, it still produces a useful result: we now have a unique set of keys corresponding to the types of elements found in the Array, and there is a corresponding value to each one that is the Array of tokens of that type.

[sourcecode language=”scala”]
scala> woodchuck.split(" ").groupBy(x=>x)
res29: scala.collection.immutable.Map[java.lang.String,Array[java.lang.String]] = Map(woodchuck -> Array(woodchuck, woodchuck, woodchuck, woodchuck), chuck -> Array(chuck, chuck, chuck), . -> Array(.), would -> Array(would), if -> Array(if, if), a -> Array(a, a, a, a), as -> Array(as, as), , -> Array(,), how -> Array(how), much -> Array(much, much), wood -> Array(wood, wood, wood, wood), ? -> Array(?), could -> Array(could, could, could))
[/sourcecode]

And, we want to do something much simpler than what we did with the part-of-speech example: we just need to count the length of each list, since they each contain every token of the corresponding word type. The function passed to mapValues is thus quite a bit simpler than the ones given in the previous section.

[sourcecode language=”scala”]
scala> val counts = woodchuck.split(" ").groupBy(x=>x).mapValues(x=>x.length)
counts: scala.collection.immutable.Map[java.lang.String,Int] = Map(woodchuck -> 4, chuck -> 3, . -> 1, would -> 1, if -> 2, a -> 4, as -> 2, , -> 1, how -> 1, much -> 2, wood -> 4, ? -> 1, could -> 3)
[/sourcecode]

With counts, we can now access the frequencies of any of the words that were in the text.

[sourcecode language=”scala”]
scala> counts("woodchuck")
res5: Int = 4

scala> counts("could")
res6: Int = 3
[/sourcecode]

Easy!  Of course, we normally want to build word counts for texts that are longer and are stored in a file rather than explicitly added to Scala code. The next tutorial will demonstrate how to do that.

Iterating over the keys and values in a Map

The material above shows some useful aspects of Maps, but of course there is much more you can do with them, often requiring iterating through the key-value pairs in the Map. We’ll use the counts Map created above for demonstrating this.

You can access just the keys, or just the values.

[sourcecode language=”scala”]
scala> counts.keys
res0: Iterable[java.lang.String] = Set(woodchuck, chuck, ., would, if, a, as, ,, how, much, wood, ?, could)

scala> counts.values
res1: Iterable[Int] = MapLike(4, 3, 1, 1, 2, 4, 2, 1, 1, 2, 4, 1, 3)
[/sourcecode]

Notice that these are both Iterable data structures, so we can do all of the usual mapping, filtering, and so on, that we have already done with lists. (You may convert them to Lists if you like using toList, of course.)

You can print out all of the key -> value pairs in the Map in a number of ways. One is to use a for expression.

[sourcecode language=”scala”]
scala> for ((k,v) <- counts) println(k + " -> " + v)
woodchuck -> 4
chuck -> 3
. -> 1
would -> 1
if -> 2
a -> 4
as -> 2
, -> 1
how -> 1
much -> 2
wood -> 4
? -> 1
could -> 3
[/sourcecode]

And here are other ways to achieve the same result (output omitted since it is the same).

[sourcecode language=”scala”]
for (k <- counts.keys) println(k + " -> " + counts(k))
counts.map(kvPair => kvPair._1 + " -> " + kvPair._2).foreach(println)
counts.keys.map(k => k + " -> " + counts(k)).foreach(println)
counts.foreach { case(k,v) => println(k + " -> " + v) }
counts.foreach(kvPair => println(kvPair._1 + " -> " + kvPair._2))
[/sourcecode]

And so on. Basically, you are able to step through the Map one key-value pair at a time, or you can grab the set of keys and then step through those and access the values from the map. Which form you use depends on what you need — for example, the foreach construct doesn’t return a value, but the for expressions and the map expressions do return values. Why would you do that? Well, as an example, consider grouping all words that have occurred the same number of times.

[sourcecode language=”scala”]
scala> val countsToWords = counts.keys.toList.map(k => (counts(k),k)).groupBy(x=>x._1).mapValues(x=>x.map(y=>y._2))
countsToWords: scala.collection.immutable.Map[Int,List[java.lang.String]] = Map(3 -> List(chuck, could), 4 -> List(woodchuck, a, wood), 1 -> List(., would, ,, how, ?), 2 -> List(if, as, much))
[/sourcecode]

We go from a Map to a Set of its keys to a List of those keys to a List of Tuples of the values and the keys to a Map from the values of the original Map to such Tuples, and then we map the values of the new map to just contain the words (the original keys). (That’s a mouthful, so try each step in the REPL to see what is going on in detail.)

Now we can output countsToWords sorted in descending numerical order by count, and then by alphabetical order by word within each count.

[sourcecode language=”scala”]
scala> countsToWords.keys.toList.sorted.reverse.foreach(x => println(x + ": " + countsToWords(x).sorted.mkString(",")))
4: a,wood,woodchuck
3: chuck,could
2: as,if,much
1: ,,.,?,how,would
[/sourcecode]

Options and flatMapping for dealing with missing keys

I pointed out toward the start of this tutorial that we run into trouble if we ask for a key that doesn’t exist in a Map. Let’s go back to the engToDeu Map we began with.

[sourcecode language=”scala”]
scala> val engToDeu = Map(("dog","Hund"), ("cat","Katze"), ("rhinoceros","Nashorn"))
engToDeu: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(dog -> Hund, cat -> Katze, rhinoceros -> Nashorn)

scala> engToDeu("dog")
res0: java.lang.String = Hund

scala> engToDeu("bird")
java.util.NoSuchElementException: key not found: bird
[/sourcecode]

There is another way of accessing the elements of a Map, using the get method.

[sourcecode language=”scala”]
scala> engToDeu.get("dog")
res2: Option[java.lang.String] = Some(Hund)

scala> engToDeu.get("bird")
res3: Option[java.lang.String] = None
[/sourcecode]

Now, the return value is an Option[String]. An Option is either a Some that contains a value or a None, which means there is no value. If you want to get the value out of a Some, you use the get method on Options.

[sourcecode language=”scala”]
scala> val dogTrans = engToDeu.get("dog")
dogTrans: Option[java.lang.String] = Some(Hund)

scala> dogTrans.get
res4: java.lang.String = Hund
[/sourcecode]

If you just use get on a Map to obtain an Option and then immediately call get on the Option, we get the same behavior we had before.

[sourcecode language=”scala”]
scala> engToDeu.get("dog").get
res6: java.lang.String = Hund

scala> engToDeu.get("bird").get
java.util.NoSuchElementException: None.get
[/sourcecode]

So, at this point, you are probably thinking that this sounds like a waste of time that is just making things more complex. Wait! It actually is tremendously useful because of pattern matching and the way many methods on sequences work.

First, here is how you can write a protected form of translating the words in a list without getting an exception.

[sourcecode language=”scala”]
scala> wordsToTranslate.foreach { x => engToDeu.get(x) match {
|   case Some(y) => println(x + " -> " + y)
|   case None =>
| }}
dog -> Hund
cat -> Katze
[/sourcecode]

I know… this probably still isn’t convincing — it still looks more involved than the conditional we used (far) above to check whether engToDeu contained a given key (at least for this particular example). Hold on… because now we are just about ready for things to get simpler, and learn some useful things about Lists in doing so.

First, you should know about a great method on Lists called flatten. If you have a List of Lists of Strings, you can use flatten to get a single List of Strings. Consider the following example, in which we flatten a List of Lists of Strings and make a single String out of the result with mkString. Notice that the empty List in the third spot of the main List just disappears when we flatten it.

[sourcecode language=”scala”]
scala> val sentences = List(List("Here","is","sentence","one","."),List("The","third","sentence","is","empty","!"),List(),List("Lastly",",","we","have","a","final","sentence","."))
sentences: List[List[java.lang.String]] = List(List(Here, is, sentence, one, .), List(The, third, sentence, is, empty, !), List(), List(Lastly, ,, we, have, a, final, sentence, .))

scala> sentences.flatten
res0: List[java.lang.String] = List(Here, is, sentence, one, ., The, third, sentence, is, empty, !, Lastly, ,, we, have, a, final, sentence, .)

scala> sentences.flatten.mkString(" ")
res1: String = Here is sentence one . The third sentence is empty ! Lastly , we have a final sentence .
[/sourcecode]

Flattening in general is pretty useful in its own right. Where it comes to play with Option values is that Options can be thought of a Lists: Somes are like one element Lists and Nones are like empty Lists. So, when you have a List of Options, the flatten method gives you the value in a Some and any Nones just drop away.

[sourcecode language=”scala”]
scala> wordsToTranslate.map(x => engToDeu.get(x))
res12: List[Option[java.lang.String]] = List(Some(Hund), None, Some(Katze), None)

scala> wordsToTranslate.map(x => engToDeu.get(x)).flatten
res13: List[java.lang.String] = List(Hund, Katze)
[/sourcecode]

This is such a generally useful paradigm that there is a function flatMap which does exactly this.

[sourcecode language=”scala”]
scala> wordsToTranslate.flatMap(x => engToDeu.get(x))
res14: List[java.lang.String] = List(Hund, Katze)
[/sourcecode]

So, returning to the translation example above, we can now safely skip on by “schiffe” without fuss.

[sourcecode language=”scala”]
scala> example2.split(" ").flatMap(deuWord => miniDictionary.get(deuWord)).mkString(" ")
res15: String = from ice liberated are river and
[/sourcecode]

Whether this is the desired behavior in this particular case is another question (e.g. you really should be doing some special unknown word handling). Nonetheless, you’ll find that flatMap is quite handy in general for this sort of pattern, in which a list of elements is used to retrieve values from a Map that will be missing some of those values.

An example of the further use of Options and flatMap is that you also may create functions that return Options and are thus amenable to flatMapping. Consider a function that squares only odd numbers and throws evens away (note: the % operator is the modulo operator that finds the remainder of division of one number by another — try it in the REPL).

[sourcecode language=”scala”]

scala> def squareOddNumber (x: Int) = if (x % 2 != 0) Some(x*x) else None
squareOddNumber: (x: Int)Option[Int]

[/sourcecode]

If you map over the numbers 1 to 10, you’ll see the Somes and Nones, and if you flatMap it, you get exactly the desired result of the squares of all the odd numbers without any pollution from the evens.

[sourcecode language=”scala”]
scala> (1 to 10).toList.map(x=>squareOddNumber(x))
res16: List[Option[Int]] = List(Some(1), None, Some(9), None, Some(25), None, Some(49), None, Some(81), None)

scala> (1 to 10).toList.flatMap(x=>squareOddNumber(x))
res17: List[Int] = List(1, 9, 25, 49, 81)
[/sourcecode]

This turns out to be amazingly useful and common, so much so that the expression “just flatMap that shit” has become a common refrain among Scala programmers. Scala programmers even write scripts to remind them to do it. 🙂

Copyright 2011 Jason Baldridge

The text of this tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original tutorial.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.

First steps in Scala for beginning programmers, Part 6

Topics: regular expressions, matching and substitutions with the scala.util.matching API

Preface

This is part 6 of tutorials for first-time programmers getting into Scala. Other posts are on this blog, and you can get links to those and other resources on the links page of the Computational Linguistics course I’m creating these for.

This post is the second of two about regular expressions (regexes), which are essential for a wide range of programming tasks, and for computational linguistics tasks in particular. If you haven’t read it already, you might want to start with the first post about regexes. For what its worth, this post might actually be of some use to programmers who already are reasonably familiar with Scala but who haven’t used regular expressions much yet: it might saving some poking around to figure out how to do things you already know how to do quite well in other languages.

The use of regular expressions for capturing values for variable assignment and cases in match expressions is a very clean, well-thought out and highly useful trait of support for regular expressions in the Scala language. However, their use for more complex string matching and substitution is, frankly, much less straightforward than it is in languages with built-in support for regular expressions, such as Perl (which—speaking as one who has coded a lot in Perl—you do *not* want to use for general programming). Scala is fully capabable in that you can use regular expressions fully, but you’ll need to use it via the Regex API. In other words, you need to use a number of commands, not all of which as as straightforward as they could be. (This is not a rant, though I do obviously wish regular expressions were supported more naturally in Scala.)

Though I’ll refer to what I’m doing below as using the Regex API, I’ll note first that this makes it sound like a bigger deal than it is. It just means you are directly using classes and objects from the scala.util.matching package rather than using some of the special syntax and integration with Scala pattern matching we saw in the previous post.

More extensive matching

First off, let’s do what we did with pattern matching in the previous post, but now using the Regex class and the methods available to it to achieve the same ends. We can then start working with finding multiple matches and performing substitutions.

To recap, recall the name regular expression and how we can use it to initialize a group of variables based on matching a given string.

[sourcecode language=”scala”]
scala> val Name = """(Mr|Mrs|Ms). ([A-Z][a-z]+) ([A-Z][a-z]+)""".r
Name: scala.util.matching.Regex = (Mr|Mrs|Ms). ([A-Z][a-z]+) ([A-Z][a-z]+)

scala> val smith = "Mr. John Smith"
smith: java.lang.String = Mr. John Smith

scala> val Name(title, first, last) = smith
title: String = Mr
first: String = John
last: String = Smith
[/sourcecode]

Instead of doing it this way, let’s instead use the API methods. We start by using the regex to find the matches, if any. The method findAllIn of Regex does this for us.

[sourcecode language=”scala”]
scala> val matchesFound = Name.findAllIn(smith)
matchesFound: scala.util.matching.Regex.MatchIterator = non-empty iterator
[/sourcecode]

The result is an iterator, which is an object that is like a list in that you can iterate over its elements with for expressions and foreach, use map to transform its values, and more.

[sourcecode language=”scala”]
scala> matchesFound.foreach(println)
Mr. John Smith
[/sourcecode]

However, unlike Lists, you can only do this a single time. As the following shows, after you iterate through it once, its elements are used up.

[sourcecode language=”scala”]
scala> val matchesFound = Name.findAllIn(smith)
matchesFound: scala.util.matching.Regex.MatchIterator = non-empty iterator

scala> matchesFound.foreach(println)
Mr. John Smith

scala> matchesFound.foreach(println)

[/sourcecode]

Another difference is that you cannot index into its elements directly.

[sourcecode language=”scala”]
scala> val matchesFound = Name.findAllIn(smith)
matchesFound: scala.util.matching.Regex.MatchIterator = non-empty iterator

scala> matchesFound(0)
<console>:11: error: scala.util.matching.Regex.MatchIterator does not take parameters
matchesFound(0)
^
[/sourcecode]

If you wish to do that, you need to just call toList on the MatchIterator.

[sourcecode language=”scala”]
scala> val matchList = Name.findAllIn(smith).toList
matchList: List[String] = List(Mr. John Smith)

scala> matchList.foreach(println)
Mr. John Smith

scala> matchList.foreach(println)
Mr. John Smith
[/sourcecode]

I’ll primarily work with the match results as a List for the remainder of this tutorial. However, note that when you are programming, you should consider whether you really need to do this—usually, the iterator will be sufficient and it has the advantage of being a more efficient.

Note above that what we have is a List[String]. That means we can see which portions of a string matched, which could include multiple matches.

[sourcecode language=”scala”]
scala> val sentence = "Mr. John Smith said hello to Ms. Jane Hill and then to Mr. Bill Brown."
sentence: java.lang.String = Mr. John Smith said hello to Ms. Jane Hill and then to Mr. Bill Brown.

scala> val matchList = Name.findAllIn(sentence).toList
matchList: List[String] = List(Mr. John Smith, Ms. Jane Hill, Mr. Bill Brown)
[/sourcecode]

This will be useful in many contexts, but it won’t allow us to access the match groups that were defined in the Regex. For that, we need to use the matchData method, which converts the MatchIterator (which offers Strings as its elements) into an Iterator[Match] (which offers Match objects as its elements).

[sourcecode language=”scala”]
scala> val matchList = Name.findAllIn(smith).matchData
matchList: java.lang.Object with Iterator[scala.util.matching.Regex.Match] = non-empty iterator
[/sourcecode]

Let’s convert that to a List and then grab the first element.

[sourcecode language=”scala”]
scala> val matchList = Name.findAllIn(smith).matchData.toList
matchList: List[scala.util.matching.Regex.Match] = List(Mr. John Smith)

scala> val firstMatch = matchList(0)
firstMatch: scala.util.matching.Regex.Match = Mr. John Smith
[/sourcecode]

This Match object contains captured groups that we can access with the group method. The first index, 0, returns the entire match, and the rest access the captured groups.

[sourcecode language=”scala”]
scala> firstMatch.group(0)
res8: String = Mr. John Smith

scala> val title = firstMatch.group(1)
title: String = Mr

scala> val first = firstMatch.group(2)
first: String = John

scala> val last = firstMatch.group(3)
last: String = Smith
[/sourcecode]

We can get a bit closer to the original pattern matched variable assignment by packaging them up as a tuple.

[sourcecode language=”scala”]
scala> val (title, first, last) = (firstMatch.group(1), firstMatch.group(2), firstMatch.group(3))
title: String = Mr
first: String = John
last: String = Smith
[/sourcecode]

Update: There is a more concise way to do this using the range 1 to 3 and map firstMatch.group over that range. This creates a Seq(uence), which we can pattern match on. (Thanks to @missingfaktor.)

[sourcecode language=”scala”]

val Seq(title, first, last) = 1 to 3 map firstMatch.group

[/sourcecode]

This should demonstrate why Scala’s support for Regexes in patterning match is very nice for this. What you gain with the API is the ability to match multiple instances of a pattern in a string and then to perform computations with the Match results on the fly. For example, let’s return to the sentence with multiple names in it and use the Name regex to say hello to every name found in it.

[sourcecode language=”scala”]
scala> Name.findAllIn(sentence).matchData.foreach(m => println("Hello, " + m.group(0)))
Hello, Mr. John Smith
Hello, Ms. Jane Hill
Hello, Mr. Bill Brown
[/sourcecode]

Of course, you can choose to print only subparts of the names, such as the title and the last name.

[sourcecode language=”scala”]
scala> Name.findAllIn(sentence).matchData.foreach(m => println("Hello, " + m.group(1) + ". " + m.group(3)))
Hello, Mr. Smith
Hello, Ms. Hill
Hello, Mr. Brown
[/sourcecode]

Or you can filter the results, e.g. to only the Mr’s, and then print only the first names.

[sourcecode language=”scala”]
scala> Name.findAllIn(sentence).matchData.filter(m=>m.group(1) == "Mr").foreach(m => println("Hello, " + m.group(2)))
Hello, John
Hello, Bill
[/sourcecode]

Notice that in the above lines, I didn’t convert the MatchIterator to a List since I was happy to just go through the list once and do some actions.

Performing substitutions

The other thing you gain is the ability to use regular expressions for substituting once class of expressions with another. For example, let’s say that (for some odd reason) you would like to reverse everyone’s name so that “Mr. John Smith” becomes “Mr. Smith John“. This is accomplished by using the Regex method replaceAllIn, which takes two arguments: the first is the original string and the second is a function that takes a Match object and returns a String.

[sourcecode language=”scala”]
scala> val swapped = Name.replaceAllIn(sentence, m => m.group(1) + ". " + m.group(3) + " " + m.group(2))
swapped: String = Mr. Smith John said hello to Ms. Hill Jane and then to Mr. Brown Bill.
[/sourcecode]

The variable m above is referring to each of the Match objects identified, in turn. That means we can access the groups as we did before. The thing that might feel strange at first is that the anonymous function m => m.group(1) + “. ” + m.group(3) + ” ” + m.group(2) is an argument. It’s not very different from the following, where we first create a named function and then pass it as an argument.

[sourcecode language=”scala”]
scala> def swapFirstLast = (m: scala.util.matching.Regex.Match) => m.group(1) + ". " + m.group(3) + " " + m.group(2)
swapFirstLast: (util.matching.Regex.Match) => java.lang.String

scala> val swapped = Name.replaceAllIn(sentence, swapFirstLast)swapped: String = Mr. Smith John said hello to Ms. Hill Jane and then to Mr. Brown Bill.
[/sourcecode]

Note that now that we’ve defined it, we can use that same function to map the Matches returned by findAllIn to their swapped versions.

[sourcecode language=”scala”]
scala> val swappedNames = Name.findAllIn(sentence).matchData.map(swapFirstLast).toList
swappedNames: List[java.lang.String] = List(Mr. Smith John, Ms. Hill Jane, Mr. Brown Bill)
[/sourcecode]

The difference is that using findAllIn gives us the Match results themselves, whereas replaceAllIn replaces them in the String in situ. Whether you need to do one or the other depends on your programming needs.

Determining whether an entire string matches using the Regex API

If you just want to know whether an entire given string matches a Regex, Scala unfortunately has a somewhat roundabout way for you to do this. First, here is the syntax, testing whether Name matches on the variables smith and sentence.

[sourcecode language=”scala”]
scala> Name.pattern.matcher(smith).matches
res21: Boolean = true

scala> Name.pattern.matcher(sentence).matches
res22: Boolean = false
[/sourcecode]

So, sentence doesn’t match (despite having three names in it) because the entire string is not a single match to Name.

What is going on here is that we are actually using classes defined in Java for working with regular expressions. First, we get the java.util.regex.Pattern object associated with our scala.util.matching.Regex object.

[sourcecode language=”scala”]
scala> Name.pattern
res16: java.util.regex.Pattern = (Mr|Mrs|Ms). ([A-Z][a-z]+) ([A-Z][a-z]+)
[/sourcecode]

Then we use that Pattern to get a java.util.regex.Matcher for the string.

[sourcecode language=”scala”]
scala> Name.pattern.matcher(smith)
res17: java.util.regex.Matcher = java.util.regex.Matcher[pattern=(Mr|Mrs|Ms). ([A-Z][a-z]+) ([A-Z][a-z]+) region=0,14 lastmatch=]
[/sourcecode]

The Matcher class has a matches method that tells us whether there was a match or not for that string.

[sourcecode language=”scala”]
scala> Name.pattern.matcher(smith).matches
res18: Boolean = true
[/sourcecode]

So, long-winded, but you can do it.

Note: there is another way to do this using Scala’s standard pattern matching paradigm discussed in the previous post on regexes.

[sourcecode language=”scala”]
scala> smith match { case Name(_,_,_) => true; case _ => false }
res23: Boolean = true

scala> sentence match { case Name(_,_,_) => true; case _ => false }
res24: Boolean = false
[/sourcecode]

However, this requires the extra work of specifying the capture groups, which are being thrown away anyway.

Simple substitutions with a second regular expression

There is another replaceAllIn method that takes a String defining a (fairly) standard regular expresion substitution as its second argument rather than a function from Matches to Strings. This argument defines a regular expression similar to that used in standard s/// substitutions from the Perl programming language,e.g. the following, which turns strings like “xyzaaaabbb123” int “xyzbbbaaaa123“.

[sourcecode language=”perl”]

s/(a+)(b+)/21/

[/sourcecode]

Unlike Perl (which is the same as the syntax discussed in Jurafsky and Martin’s book), Scala uses $1, $2, etc. As an example, consider the first-last name swap we did before. Here it is repeated:

[sourcecode language=”scala”]
scala> val swapped = Name.replaceAllIn(sentence, m => m.group(1) + ". " + m.group(3) + " " + m.group(2))
swapped: String = Mr. Smith John said hello to Ms. Hill Jane and then to Mr. Brown Bill.
[/sourcecode]

You can get the exact same effect somewhat more easily by constructing the replacement string with $n variables that refer to the groups.

[sourcecode language=”scala”]
scala> val swapped2 = Name.replaceAllIn(sentence, "$1. $3 $2")
swapped2: String = Mr. Smith John said hello to Ms. Hill Jane and then to Mr. Brown Bill.
[/sourcecode]

This is far more concise and readable than the m.group() style above, so it is preferable for cases like this. However, sometimes you’ll want to do some more interesting processing of the values in each group, such as changing the titles to another language and outputing only the first initial of the first name: e.g. “Mr. John Smith” would become “Sr. J. Smith” and “Mrs. Jane Hill” would become “Sra. J. Hill”. It isn’t clear to me how one could do this with the $n substitutions (if some reader is aware, please let me know). To do it with the Match => String function, it is straightforward. First, let’s define a method that maps the titles from English to Spanish.

[sourcecode language=”scala”]
def engTitle2Esp (title: String) = title match {
case "Mr" => "Sr"
case "Mrs" => "Sra"
case "Ms" => "Srta"
}
[/sourcecode]

Then we pass m.group(1) through that function by using engTitle2Esp(m.group(1)), and get just the first character of group 2 by indexing into it as m.group(2)(0).

[sourcecode language=”scala”]
scala> val spanishized = Name.replaceAllIn(sentence, m => engTitle2Esp(m.group(1)) + ". " + m.group(2)(0) + ". " + m.group(3))
spanishized: String = Sr. J. Smith said hello to Srta. J. Hill and then to Sr. B. Brown.
[/sourcecode]

This gives you considerable control over how to process the replacements.

Copyright 2011 Jason Baldridge

The text of this tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original tutorial.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.

First steps in Scala for beginning programmers, Part 5

Topics: regular expressions, matching

Preface

This is part 5 of tutorials for first-time programmers getting into Scala. Other posts are on this blog, and you can get links to those and other resources on the links page of the Computational Linguistics course I’m creating these for.

This post is the first of two about regular expressions (regexes), which are essential for a wide range of programming tasks, and for computational linguistics tasks in particular. This tutorial explains how to use them with Scala, assuming that the reader is already familiar with regular expression syntax. It shows how to create regular expressions in Scala and use them with Scala powerful pattern matching capabilities, in particular for variable assignment and cases in match expressions.

Creating regular expressions

Scala provides a very simple way to create regexes: just define a regex as a string and then call the r method on it. The following defines a regular expression that characterizes the string language a^mb^n (one or more a‘s followed by one or more b’s, not necessarily the same as the number of a‘s).

[sourcecode language=”scala”]
scala> val AmBn = "a+b+".r
AmBn: scala.util.matching.Regex = a+b+
[/sourcecode]

To use meta-characters, like s, w, and d, you must either escape the slashes or use multiquoted strings, which are referred to as raw strings. The following are two equivalent ways to write a regex that covers strings of a sequence of word characters followed by a sequence of digits.

[sourcecode language=”scala”]
scala> val WordDigit1 = "\w+\d+".r
WordDigit1: scala.util.matching.Regex = w+d+

scala> val WordDigit2 = """w+d+""".r
WordDigit2: scala.util.matching.Regex = w+d+
[/sourcecode]

Whether escaping or using raw strings is preferable depends on the context. For example, with the above, I’d go with the raw string. However, for using a regex to split a string on whitespace characters, escaping is somewhat preferable.

[sourcecode language=”scala”]
scala> val adder = "We’re as similar as two dissimilar things in a pod.nt-Blackadder"
adder: java.lang.String =
We’re as similar as two dissimilar things in a pod.
-Blackadder

scala> adder.split("\s+")
res2: Array[java.lang.String] = Array(We’re, as, similar, as, two, dissimilar, things, in, a, pod., -Blackadder)

scala> adder.split("""s+""")
res3: Array[java.lang.String] = Array(We’re, as, similar, as, two, dissimilar, things, in, a, pod., -Blackadder)
[/sourcecode]

A note on naming: the convention in Scala is to use variable names with the first letter uppercased for Regex objects. This makes them consistent with the use of pattern matching in match statements, as shown below.

Matching with regexes

We saw above that using the r method on a String returns a value that is a Regex object (more on the scala.util.matching part below). How do you actually do useful things with these Regex objects? There are a number of ways. The prettiest, and perhaps most common for the non-computational linguist, is to use them in tandem with Scala’s standard pattern matching capabilities. Let’s consider the task of parsing names and turning them into useful data structures that we can do various useful things with.

[sourcecode language=”scala”]
scala> val Name = """(Mr|Mrs|Ms). ([A-Z][a-z]+) ([A-Z][a-z]+)""".r
Name: scala.util.matching.Regex = (Mr|Mrs|Ms). ([A-Z][a-z]+) ([A-Z][a-z]+)

scala> val Name(title, first, last) = "Mr. James Stevens"
title: String = Mr
first: String = James
last: String = Stevens

scala> val Name(title, first, last) = "Ms. Sally Kenton"
title: String = Ms
first: String = Sally
last: String = Kenton
[/sourcecode]

Notice the similarity with pattern matching on types like Array and List.

[sourcecode language=”scala”]
scala> val Array(title, first, last) = "Mr. James Stevens".split(" ")
title: java.lang.String = Mr.
first: java.lang.String = James
last: java.lang.String = Stevens

scala> val List(title, first, last) = "Mr. James Stevens".split(" ").toList
title: java.lang.String = Mr.
first: java.lang.String = James
last: java.lang.String = Stevens
[/sourcecode]

Of course, notice that here the “.” was captured, while the regex excised it. A more substantive difference with the regular expression is that it only accepts strings with the right form and will reject others, unlike simple splitting and matching to Array.

[sourcecode language=”scala”]
scala> val Array(title, first, last) = "221B Baker Street".split(" ")
title: java.lang.String = 221B
first: java.lang.String = Baker
last: java.lang.String = Street

scala> val Name(title, first, last) = "221B Baker Street"
scala.MatchError: 221B Baker Street (of class java.lang.String)
at .<init>(<console>:12)
at .<clinit>(<console>)
at .<init>(<console>:11)
at .<clinit>(<console>)
at $export(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:592)
at scala.tools.nsc.interpreter.IMain$Request$$anonfun$10.apply(IMain.scala:828)
at scala.tools.nsc.interpreter.Line$$anonfun$1.apply$mcV$sp(Line.scala:43)
at scala.tools.nsc.io.package$$anon$2.run(package.scala:31)
at java.lang.Thread.run(Thread.java:680)
[/sourcecode]

That’s a lot of complaining, of course, but actually you would generally be either (a) absolutely sure that you have strings that are in the correct format or (b) you will be checking for such possible exceptions or (c) you’ll be using the regex as one option of many in a match expression.

For now, let’s assume the input is appropriate. This means we can easily convert a list of names as strings into a list of tuples using map and a match expression.

[sourcecode language=”scala”]
scala> val names = List("Mr. James Stevens", "Ms. Sally Kenton", "Mrs. Jane Doe", "Mr. John Doe", "Mr. James Smith")
names: List[java.lang.String] = List(Mr. James Stevens, Ms. Sally Kenton, Mrs. Jane Doe, Mr. John Doe, Mr. James Smith)

scala> names.map(x => x match { case Name(title, first, last) => (title, first, last) })
res11: List[(String, String, String)] = List((Mr,James,Stevens), (Ms,Sally,Kenton), (Mrs,Jane,Doe), (Mr,John,Doe), (Mr,James,Smith))
[/sourcecode]

Note the crucial use of groups in the Name regex: the number of groups equal the number of variables being initialized in the match. The first group is needed for the alternatives Mr, Mrs, and Ms. Without the other groups, we get an error. (From here on, I’ll shorten the MatchError output.)

[sourcecode language=”scala”]
scala> val NameOneGroup = """(Mr|Mrs|Ms). [A-Z][a-z]+ [A-Z][a-z]+""".r
NameOneGroup: scala.util.matching.Regex = (Mr|Mrs|Ms). [A-Z][a-z]+ [A-Z][a-z]+

scala> val NameOneGroup(title, first, last) = "Mr. James Stevens"
scala.MatchError: Mr. James Stevens (of class java.lang.String)
[/sourcecode]

Of course, we can still match to the first group.

[sourcecode language=”scala”]
scala> val NameOneGroup(title) = "Mr. James Stevens"
title: String = Mr
[/sourcecode]

What if we go in the other direction, creating more groups so that we can, for example, share the “M” in the various titles? Here’s an attempt.

[sourcecode language=”scala”]
scala> val NameShareM = """(M(r|rs|s)). ([A-Z][a-z]+) ([A-Z][a-z]+)""".r
NameShareM: scala.util.matching.Regex = (M(r|rs|s)). ([A-Z][a-z]+) ([A-Z][a-z]+)

scala> val NameShareM(title, first, last) = "Mr. James Stevens"
scala.MatchError: Mr. James Stevens (of class java.lang.String)
[/sourcecode]

What happened is that a new group was created, so there are now four groups to match.

[sourcecode language=”scala”]
scala> val NameShareM(title, titleEnding, first, last) = "Mr. James Stevens"
title: String = Mr
titleEnding: String = r
first: String = James
last: String = Stevens

scala> val NameShareM(title, titleEnding, first, last) = "Mrs. Sally Kenton"
title: String = Mrs
titleEnding: String = rs
first: String = Sally
last: String = Kenton
[/sourcecode]

So, there is submatched group capturing. To stop the (r|rs|s) part from creating a match group while still being able to use it to group alternatives in a disjunction, use the ?: operator.

[sourcecode language=”scala”]
scala> val NameShareMThreeGroups = """(M(?:r|rs|s)). ([A-Z][a-z]+) ([A-Z][a-z]+)""".r
NameShareMThreeGroups: scala.util.matching.Regex = (M(?:r|rs|s)). ([A-Z][a-z]+) ([A-Z][a-z]+)

scala> val NameShareMThreeGroups(title, first, last) = "Mr. James Stevens"
title: String = Mr
first: String = James
last: String = Stevens
[/sourcecode]

By this point, sharing the M hasn’t saved anything over (Mr|Mrs|Ms), but there are plenty of situations where this is quite useful.

We can also use regex backreferences. Say we want to match names like “Mr. John Bohn“, “Mr. Joe Doe“, and “Mrs. Jill Hill“.

[sourcecode language=”scala”]
scala> val RhymeName = """(Mr|Mrs|Ms). ([A-Z])([a-z]+) ([A-Z])3""".r
RhymeName: scala.util.matching.Regex = (Mr|Mrs|Ms). ([A-Z])([a-z]+) ([A-Z])3

scala> val RhymeName(title, firstInitial, firstRest, lastInitial) = "Mr. John Bohn"
title: String = Mr
firstInitial: String = J
firstRest: String = ohn
lastInitial: String = B
[/sourcecode]

Then we could piece things together to get the names we wanted.

[sourcecode language=”scala”]
scala> val first = firstInitial+firstRest
first: java.lang.String = John

scala> val last = lastInitial+firstRest
last: java.lang.String = Bohn
[/sourcecode]

But we can do better by using an embedded group and just thowing its match result away with the underscore _.

[sourcecode language=”scala”]
scala> val RhymeName2 = """(Mr|Mrs|Ms). ([A-Z]([a-z]+)) ([A-Z]3)""".r
RhymeName2: scala.util.matching.Regex = (Mr|Mrs|Ms). ([A-Z]([a-z]+)) ([A-Z]3)

scala> val RhymeName2(title, first, _, last) = "Mr. John Bohn"
title: String = Mr
first: String = John
last: String = Bohn
[/sourcecode]

Note: we can’t use the ?: operator with ([a-z]+) to stop the match because we need exactly that string to match with the 3 later.

Using regexes for assignment via pattern matching requires full string match.

[sourcecode language=”scala”]
scala> val Name(title, first, last) = "Mr. James Stevens"
title: String = Mr
first: String = James
last: String = Stevens

scala> val Name(title, first, last) = "Mr. James Stevens walked to the door."
scala.MatchError: Mr. James Stevens walked to the door. (of class java.lang.String)
[/sourcecode]

This is a crucial aspect of using them in match expressions. Consider an application that needs to be able to parse telephone numbers in different formats, like (123)555-5555 and 123-555-5555. Here are regexes for these two patterns and their use to parse these numbers.

[sourcecode language=”scala”]
scala> val Phone1 = """((d{3}))s*(d{3})-(d{4})""".r
Phone1: scala.util.matching.Regex = ((d{3}))s*(d{3})-(d{4})

scala> val Phone2 = """(d{3})-(d{3})-(d{4})""".r
Phone2: scala.util.matching.Regex = (d{3})-(d{3})-(d{4})

scala> val Phone1(area, first3, last4) = "(123) 555-5555"
area: String = 123
first3: String = 555
last4: String = 5555

scala> val Phone2(area, first3, last4) = "123-555-5555"
area: String = 123
first3: String = 555
last4: String = 5555
[/sourcecode]

We could of course use a single regular expression, but we’ll go with these two so that they can be used as separate case statements in a match expression that is part of a function that takes a string representation of a phone number and returns a tuple of three strings (thus normalizing the numbers).

[sourcecode language=”scala”]
def normalizePhoneNumber (number: String) = number match {
case Phone1(x,y,z) => (x,y,z)
case Phone2(x,y,z) => (x,y,z)
}
[/sourcecode]

The action being taken for each match is just to package the separate values up in a Tuple3 — more interesting things could be done if one were looking for country codes, dealing with multiple countries, etc. The point here is to see how the regular expressions are used for the cases to capture values and assign them to local variables, each time appropriate for the form of the string that is brought in. (We’ll see in a later tutorial how to protect such a method from inputs that are not phone numbers and such.)

Now that we have that function, we can easily apply it to a list of strings representing phone numbers and filter out just those in a specific area, for example.

[sourcecode language=”scala”]
scala> val numbers = List("(123) 555-5555", "123-555-5555", "(321) 555-0000")
numbers: List[java.lang.String] = List((123) 555-5555, 123-555-5555, (321) 555-0000)

scala> numbers.map(normalizePhoneNumber)
res16: List[(String, String, String)] = List((123,555,5555), (123,555,5555), (321,555,0000))

scala> numbers.map(normalizePhoneNumber).filter(n => n._1=="123")
res17: List[(String, String, String)] = List((123,555,5555), (123,555,5555))
[/sourcecode]

Building Regexes from Strings

Sometimes one wants to build up a regex from smaller component parts, for example, defining what a noun phrase is and then searching for sequence of noun phrases. To do this, we first must see the longer form of creating a regex.

[sourcecode language=”scala”]
scala> val AmBn = new scala.util.matching.Regex("a+b+")
AmBn: scala.util.matching.Regex = a+b+
[/sourcecode]

This is the first time in these tutorials that we are explicitly creating an object using the reserved word new. We’ll be covering objects in more detail later, but what you need to know now is that Scala has a great deal of functionality that is not available by default. Mostly, we’ve been working with things like Strings, Ints, Doubles, Lists, and so on — and for the most part it has appeared to you as though they are “just” Strings, Ints, Doubles, and Lists. However, that is not the case: actually they are fully specified as:

  • java.lang.String
  • scala.Int
  • scala.Double
  • scala.List

And, in the case of the last one, scala.List is a type that is actually backed by a concrete implementation in scala.collection.immutable.List. So, when you just see “List”, Scala is actually hiding some detail; most importantly, it makes it possible to use extremely common types with very little fuss.

What scala.util.matching.Regex is telling you is that the Regex class is part of the scala.util.matching package (and that scala.util.matching is a subpackage of scala.util, which itself is a subpackage of the scala package). Fortunately, you don’t need to type out scala.util.matching every time you want to use Regex: just use an import statement, and then use Regex without the extra package specification.

[sourcecode language=”scala”]
scala> import scala.util.matching.Regex
import scala.util.matching.Regex

scala> val AmBn = new Regex("a+b+")
AmBn: scala.util.matching.Regex = a+b+
[/sourcecode]

The other thing to explain is the new part. Again, we’ll cover this in more detail later, but for now think about it the following way. The Regex class is like a factory for producing regex objects, and the way you request (order) one of those objects is to say “new Regex(…)“, where the indicates the string that should be used to define the properties of that object. You’ve actually been doing this quite a lot already when creating Lists, Ints, and Doubles, but again, for those core types, Scala has provided special syntax to simplify their creation and use.

Okay, but why would one want to use new Regex(“a+b+”) when “a+b+”.r can be used to do the same? Here’s why: the latter needs to be given a complete string, but the former can be built up from several String variables. As an example, say you want a regex that matches strings of the form “the/a dog/cat/mouse/bird chased/ate the/a dog/cat/mouse/bird” such as “the dog chased the cat” and “a cat chased the bird.” The following might be the first attempt.

[sourcecode language=”scala”]
scala> val Transitive = "(a|the) (dog|cat|mouse|bird) (chased|ate) (a|the) (dog|cat|mouse|bird)".r
Transitive: scala.util.matching.Regex = (a|the) (dog|cat|mouse|bird) (chased|ate) (a|the) (dog|cat|mouse|bird)
[/sourcecode]

This works, but we can also build it without repeating the same expression twice by using a variable that contains a String defining a regular expression (but which is not a Regex object itself) and building the regex with that.

[sourcecode language=”scala”]
scala> val nounPhrase = "(a|the) (dog|cat|mouse|bird)"
nounPhrase: java.lang.String = (a|the) (dog|cat|mouse|bird)

scala> val Transitive = new Regex(nounPhrase + " (chased|ate) " + nounPhrase)
Transitive: scala.util.matching.Regex = (a|the) (dog|cat|mouse|bird) (chased|ate) (a|the) (dog|cat|mouse|bird)
[/sourcecode]

UPDATE: Actually, you can do this with .r rather than new Regex(…).

[sourcecode language=”scala”]

scala> val Transitive = (nounPhrase + " (chased|ate) " + nounPhrase).r
Transitive: scala.util.matching.Regex = (a|the) (dog|cat|mouse|bird) (chased|ate) (a|the) (dog|cat|mouse|bird)

[/sourcecode]

The next tutorial will show how to use the scala.util.matching package API to do more extensive matching with regular expressions, such as finding multiple matches and performing substitutions.

Copyright 2011 Jason Baldridge

The text of this tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original tutorial.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.