A hidden gem in Manning and Schutze: what to call 4+-grams?

I’m a longtime fan of Chris Manning and Hinrich Schutze’s “Foundations of Statistical Natural Language Processing”: I’ve learned from it, I’ve taught from it, and I still find myself thumbing through it from time to time. Last week, I wrote a blog post on SXSW titles that involved looking at n-grams of different lengths, including unigrams, bigrams, trigrams and … well, what do we call the next one up? Manning and Schutze devoted an entire paragraph to the question on page 193, which I absolutely love and thought would be fun to share for those who haven’t seen it.

Before continuing with model-building, let us pause for a brief interlude on naming. The cases of n-gram language models that people usually use are for n=2,3,4, and these alternatives are usually referred to as a bigram, a trigram, and a four-gram model, respectively. Revealing this will surely be enough to cause any Classicists who are reading this book to stop, and leave the field to uneducated engineering sorts: “gram” is a Greek root and so should be put together with Greek number prefixes. Shannon actually did use the term “digram”, but with the declining levels of education in recent decades, this usage has not survived. As non-prescriptive linguists, however, we think that the curious mix of English, Greek, and Latin that our colleagues actually use is quite fun. So we will not try to stamp it out. (1)

And footnote (1) follows this up with a note on four-grams.

1. Rather than “four-gram”, some people do make an attempt at appearing educated by saying “quadgram”, but this is not really correct use of a Latin number prefix (which would be “quadrigram”, cf. “quadrilateral”), let alone correct use of a Greek number prefix, which would give us “a tetragram model.”

In part to be cheeky, I went with “quadrigram” in my post, which was obviously a good choice as it has led to the term being the favorite word of the week for Ken Cho, my People Pattern cofounder, and the office in general. (“Hey Jason, got any good quadrigrams in our models?”)

If you want to try out some n-gram analysis, check out my follow-up blog post on using Unix, Mallet, and BerkeleyLM for analyzing SXSW titles. You can call 4+-grams whatever you like.
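For a quick taste before reaching for those tools, here is a minimal Python sketch of pulling n-grams out of a title and counting them. It assumes plain whitespace tokenization, and the function names, the name table, and the sample title are all made up for illustration; they are not taken from my SXSW analysis.

```python
from collections import Counter

# Hypothetical name table for illustration. The n=4 entry uses the
# Latin-prefixed form from the footnote; swap in "tetragram" if you prefer Greek.
NGRAM_NAMES = {1: "unigram", 2: "bigram", 3: "trigram", 4: "quadrigram"}


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_name(n):
    """Return a name for an n-gram, falling back to the plain 'n-gram' form."""
    return NGRAM_NAMES.get(n, f"{n}-gram")


# Made-up title, tokenized by lowercasing and splitting on whitespace.
title = "how to build a brand with big data and a brand with heart"
tokens = title.lower().split()

for n in range(1, 6):
    counts = Counter(ngrams(tokens, n))
    top, freq = counts.most_common(1)[0]
    print(f"most frequent {ngram_name(n)}: {' '.join(top)} (count {freq})")
```

Anything past n=4 falls back to the plain "5-gram" style here, which sidesteps the Greek-versus-Latin question entirely.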

Author: jasonbaldridge

Co-founder of People Pattern and Associate Professor in the Department of Linguistics at the University of Texas at Austin. My primary specialization is computational linguistics and my core research interests are formal and computational models of syntax, probabilistic models of both syntax and discourse structure, and machine learning for natural language tasks in general.
