Adapting Digital Text Analysis Tools to the Latin Language: Vocabulary in Martial’s De Spectaculis

I explored three different digital text analysis tools, applying them to Martial’s De Spectaculis:  Voyant Tools, Wordle, and AntWordProfiler.  All three programs are completely free to use.  The first two, Voyant Tools and Wordle, are websites that create word clouds to provide a visual representation of word frequency in a given text.  AntWordProfiler is a freeware lexical profile tool, producing data rather than word clouds.  Wordle calls itself a “toy” on its homepage, and its focus is on creating fun and attractive word clouds.  Voyant Tools, on the other hand, calls itself “a web-based reading and analysis environment for digital texts”; alongside a word cloud, it gives you more data on your text, such as a list of each word that appears and its frequency.

Here’s what it looks like when you put the text of Martial’s De Spectaculis on Wordle, and here’s what it looks like on Voyant Tools.

As you can see right away, both graphics are dominated by the most common words in all Latin, such as et, non, and est, along with other conjunctions, adverbs, and pronouns.  This can’t tell us very much about Martial’s text.  How can we remove these common words to see the unique elements of Martial’s vocabulary?

First, Wordle has a Latin language setting which automatically removes the most common words from your word cloud.  Here’s what the word cloud looks like on Wordle after we use their stop words list.

This word cloud is more representative of Martial’s individual text.  The word Caesar stands out and calls attention to all the vocative and nominative occurrences of the name in Martial’s poem.  I specify which cases because Wordle’s Latin language setting does not recognize case endings, let alone verb inflections.  The giant “Caesar” in the middle of the graphic is in fact an underrepresentation.  Other words which appear several times in different cases may not appear on the graph at all.  There are also clear problems with Wordle’s Latin stop words list:  fuit, fuisse, and sub all appear in the altered graphic, so only the most extremely common words are being excluded.  Even if the list were thorough, it would still be a major methodological problem not to know precisely which words are being excluded.

Voyant Tools on the other hand has customizable stop words lists.  The website does include several default stop words lists to customize, but it does not have a Latin stop words list.  I made one specifically for Martial’s De Spectaculis based on the words that appear more than twice.  I excluded conjunctions, prepositions, the most common or non-specific adverbs (such as diu, saepe, and nunc, but not, say, maxime), and all forms of sum, hic, ille, ipse, is, and qui.  You can see the full stop words list I used here.
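To make the idea of a custom stop words list concrete, here is a minimal Python sketch of the same filtering step. It is not how Voyant Tools works internally. The stop word set is a hypothetical, heavily abbreviated stand-in for my full list, and the sample sentence is invented for illustration rather than quoted from Martial.

```python
import re
from collections import Counter

# Hypothetical, abbreviated stop word list (the real list is much longer).
STOP_WORDS = {"et", "non", "est", "in", "nunc", "qui", "quae", "hic"}


def word_frequencies(text, stop_words):
    """Count lowercased word forms, skipping any form found in stop_words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)


# Invented sample text, not a quotation from De Spectaculis.
sample = "Caesar et taurus in harena, non taurus est Caesaris"
freqs = word_frequencies(sample, STOP_WORDS)

for word, count in freqs.most_common():
    print(word, count)
```

Because the list is an ordinary set that you write yourself, you always know exactly which words have been excluded.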

This is what you end up with when you run Martial’s De Spectaculis on Voyant Tools with this stop words list in place.  The result is certainly clearer than the first Voyant Tools word cloud, and clearer than both Wordle graphics.  The word frequency list is also more useful, as it now shows only those words which are immediately relevant to Martial’s project.  The most frequent words are: Caesar (16 times), tibi (9 times), taurus (7 times), tulit (7 times), harena (6 times), astra, fama, fera, unda (5 times each), and Caesaris (4 times), all of which are directly related to Martial’s subject matter.  However, this sampling shows the problem that Voyant Tools can’t work around.  Latin is an inflected language, and neither Wordle nor Voyant Tools can recognize that Caesar and Caesaris are the same word and count them together.

At this point, I set aside word cloud generators and looked for something that could handle Latin’s inflections.  I settled on a free, open-source program called AntWordProfiler, designed by Laurence Anthony, to whom I am indebted for technical help in getting started.  You can download his program (a small executable file) here.

The program was originally intended to find and recognize word families in English texts (such as “left,” “leftist,” and “leftism”).  Rather than create an enormous master list, Anthony made it extremely simple to upload a customized word list reflecting each user’s text and interests.  Anyone can upload a .txt file (such as a Notepad file) with a list of words and word families they want the program to recognize as related.

I treated each Latin word and all its possible forms as a single word family, producing a word level list that recognizes inflected forms as the same word.  I included every word that appears in Martial’s De Spectaculis, 902 unique word forms in a 1,344-word text.  For all nouns, adjectives, and pronouns entered, I included each gender, case, and number.  Verb paradigms are prohibitively large, however, so I included only those verb forms which actually appear in the text.  You can see my complete Martial word level list here.  In this list, the “stopwords” category refers to the set of common words that I consider unimportant, including conjunctions, prepositions, and the most common adverbs.  I maintained separate categories for each of the most common verbs and pronouns, such as sum, qui, hic, and ille.  I kept them separate rather than adding them to the “stopwords” category because it’s more transparent.  I feel strongly that it is better to see this information and remove it later if necessary.
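The word family approach can be sketched in a few lines of Python. This is only an illustration of the idea, not AntWordProfiler’s code or its file format. The `FAMILIES` dictionary is a hypothetical, tiny excerpt standing in for the full word level list, and the sample sentence is invented.

```python
import re
from collections import Counter

# Hypothetical excerpt of a word level list: headword -> inflected forms.
FAMILIES = {
    "caesar": ["caesar", "caesaris", "caesari", "caesarem", "caesare"],
    "taurus": ["taurus", "tauri", "tauro", "taurum", "taure",
               "tauros", "taurorum", "tauris"],
    "sum": ["est", "sunt", "fuit", "fuisse"],
}


def build_form_index(families):
    """Invert the family table so each inflected form points to its headword."""
    index = {}
    for headword, forms in families.items():
        for form in forms:
            index[form] = headword
    return index


def family_frequencies(text, families):
    """Count tokens per headword; collect forms missing from the list."""
    index = build_form_index(families)
    counts, unknown = Counter(), Counter()
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in index:
            counts[index[token]] += 1
        else:
            unknown[token] += 1
    return counts, unknown


# Invented sample text, not a quotation from De Spectaculis.
sample = "Caesar taurum vidit, tauri Caesaris fuit"
counts, unknown = family_frequencies(sample, FAMILIES)
print(counts)
print(unknown)
```

Forms that are not in the list end up in `unknown`, which is a convenient way to check that nothing in the text was missed. Note that this sketch assigns a form shared by two families to whichever family is listed last, so ambiguous forms would still need to be checked by hand.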

Using this data from AntWordProfiler, I returned to Wordle, as it allows you to plug in frequency numbers to generate a word cloud.  I removed the “stopwords” category, as well as all forms of sum, qui, hic, ipse and ille.  You can see the final result here.  It’s not dramatically different from the second Voyant Tools word cloud (which used a stop words list), but this is a much more accurate visual representation of Martial’s vocabulary in De Spectaculis.
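The last step, turning per-word counts into text that a word cloud generator can read, might look like the sketch below. It assumes the generator accepts one `word:count` pair per line; check that assumption against Wordle’s own instructions before relying on it. The counts shown are invented placeholders, not my actual AntWordProfiler results, and `to_weighted_lines` is a hypothetical helper name.

```python
def to_weighted_lines(counts, exclude=()):
    """Format counts as 'word:count' lines, most frequent first.

    Categories named in `exclude` (for example stop words) are dropped.
    """
    excluded = set(exclude)
    rows = [(word, n) for word, n in counts.items() if word not in excluded]
    # Sort by descending count, then alphabetically for a stable order.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return "\n".join(f"{word}:{n}" for word, n in rows)


# Invented placeholder counts, for illustration only.
example_counts = {"caesar": 20, "taurus": 7, "sum": 12, "stopwords": 300}
output = to_weighted_lines(example_counts, exclude={"sum", "stopwords"})
print(output)
```

Keeping the exclusions as an explicit argument mirrors the choice described above: the common verbs and pronouns stay visible in the data and are removed only at the moment the graphic is made.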

Heather Odell (heatherm.odell[at]gmail.com)

