Tuesday, June 17, 2014

The American Language: A Historical Database of English in the U.S.

Robert Frost is reported to have said, “The difference between a job and a career is the difference between 40 and 60 hours a week.”

While this bon mot has the advantage of being admirably succinct, it leaves something to be desired in terms of completeness. What other differences are there between a job and a career?

For one thing, the word job is considerably more common than career, as the former has been used by American writers more than twice as often over the past 200 years. Additionally, jobs tend to be viewed as less desirable, no matter how many additional hours a career may require.

When one looks at the adjectives most commonly used to describe a job, the list includes dirty, lousy, and toughest. The corresponding set of adjectives used to describe a career includes glorious, illustrious, and distinguished. Careers are far more likely to be artistic (literary, operatic, or dramatic), whereas jobs are more likely to be pedestrian (tedious, thankless, or steady).

One rarely hears of anyone having a steady career. While the descriptors affixed to jobs cover a wide range, they much more frequently refer to some sort of necessary and ungratifying work, whereas careers appear to be viewed as much more fulfilling.

How do we know all this? Has a team of researchers been painstakingly keeping tabs on how people use these words ever since Thomas Jefferson was in the White House? Is it through some multi-year, large-scale survey taken of the writing habits of the American people? No, it is by taking 60 seconds or so to run a search on a publicly accessible website, the Corpus of Historical American English (COHA).

COHA was created in 2009 by Mark Davies, a linguistics professor at Brigham Young University, and it is the largest tagged and searchable corpus of historical English available today. Containing hundreds of millions of words spread out from the beginning of the 19th century to the end of the 20th, it is evenly distributed between fiction and non-fiction (and each of these categories is drawn from a variety of genres), and is free to all, requiring nothing more than your inquiry. It allows linguistic researchers, scholars of other fields, and anybody who has more than a passing interest in language to discover subtleties about how we use words that would have been impossible to find until very recently. It is a marvelous trove of linguistic data, and shines light on thousands of aspects of that peculiar variety of language, American English.

Before we look at all that makes COHA unusual and interesting, we should first look at what a linguistic corpus is. Corpus, in Latin, simply means “body,” a sense that is in large measure preserved in many English words that derive from it and are in use today. A corporation is a body of people united in a business sense, a corporal is a non-commissioned military officer who leads a body of troops, and someone who is corpulent has a large body. Hence, a linguistic corpus is just another kind of body: A body of language.

With a few exceptions, linguistic corpora are a relatively recent addition to the study of language. The first ones were of necessity small and limited in usefulness, as they were compiled by hand. In the late 19th century, the German psychologist William Preyer studied early language acquisition by creating a corpus of words that parents had written down when their children used them. In addition to various forms of mother and father, the words bird, sugar, and hair were apparently popular with German infants at this time. Also in the 19th century, the original editors of the Oxford English Dictionary relied on a somewhat corpus-based approach to their dictionary, as the bulk of that work is made up of millions of citations that were all originally written out on little slips of paper, which were then organized into thousands of pigeonholes, built into an enormous unheated iron shed, based on the word each citation was meant to illustrate.

Creating a corpus without a computer required an exhausting commitment of time and energy. Charles F. Meyer, writing in Corpus Linguistics: An International Handbook, points out that one of the earlier attempts to create a searchable body of text, Alexander Cruden’s 18th-century concordance of the Bible, was completed in only two years. Cruden, however, worked 18 hours every day. Otto Jespersen, the great Danish linguist, used a corpus for much of the research on his seven-volume A Modern English Grammar on Historical Principles. It was a corpus that Jespersen had created himself over the course of several decades.

It was not until the 1940s that computers began to play a part in creating a searchable corpus. In 1949, a Jesuit priest named Roberto Busa began work on what he called the Index Thomisticus, which was to be a complete computerized concordance to all the words in the work of Thomas Aquinas. A mere 30 years or so later, the work was completed (initially published in 56 print volumes, it was later released on CD, and is now accessible on the Internet).

In the 1960s, Henry Kucera and W. Nelson Francis built the Brown University Standard Corpus of Present-Day American English. The Brown Corpus, as it is generally known, was the first large body of language to be put in a format that allowed for complex searches. It tagged parts of speech, consisted of words from a wide variety of sources (such as differing genres of fiction, academic and non-academic texts, and specific types of newspaper writing), and permitted users to do far more than simply see how many times a specific word had been used. One could, for example, find all the pronouns that were used after a certain verb. The earliest version of the Brown Corpus had slightly over 1,000,000 words, and became the prototype for most of the corpora to follow in the next few decades.
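To make the idea of a tagged search concrete, here is a minimal sketch in Python of the kind of query described above: finding the pronouns that come directly after a particular verb. The sentences, the tag names (PRON, VERB, and so on), and the function name words_after are all invented for illustration. None of them comes from the Brown Corpus itself, which has its own tag set and file format.

```python
# Toy part-of-speech-tagged text, stored as (word, tag) pairs.
# Both the sentences and the tag names are invented for illustration.
TAGGED = [
    ("She", "PRON"), ("told", "VERB"), ("him", "PRON"),
    ("the", "DET"), ("news", "NOUN"), (".", "PUNCT"),
    ("They", "PRON"), ("told", "VERB"), ("us", "PRON"),
    ("nothing", "PRON"), (".", "PUNCT"),
    ("He", "PRON"), ("told", "VERB"), ("stories", "NOUN"), (".", "PUNCT"),
]


def words_after(tagged, target, wanted_tag):
    """Return the words carrying wanted_tag that immediately follow target."""
    hits = []
    # Walk through adjacent pairs: (current token, next token).
    for (word, _), (next_word, next_tag) in zip(tagged, tagged[1:]):
        if word.lower() == target and next_tag == wanted_tag:
            hits.append(next_word.lower())
    return hits


pronouns_after_told = words_after(TAGGED, "told", "PRON")
print(pronouns_after_told)
```

Without the tags, the same search could only ask which words follow told, and someone would have to pick out the pronouns by hand.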

COHA could be described as a direct descendant of the Brown Corpus. It relies mostly on written American English (there are few corpora of spoken language), is spread over a wide variety of genres, and consists of 400,000,000 words, each of which has been identified by its part of speech. Both Brown and COHA have a large number of words, but COHA is considerably larger than Brown, or almost any of the similar corpora that came before it. It takes evidence from a variety of wells (Project Gutenberg, archive.org, The Making of America site at Cornell University), all of which were capable of providing digital text with high-quality optical character recognition.

Once you begin tossing about numbers in the hundreds-of-millions range, it is very easy to lose perspective on size. After all, Google Books is a corpus, and, by most estimates, has hundreds of billions of words. But Google Books does not allow for many kinds of complex searches a linguist or language lover might want to undertake. Say you want to examine the history of usage behind the controversy over whether it is acceptable to use impact as a verb. Using only Google Books, you would have to dig through thousands of untagged results and filter out by hand all the instances of impact being used as a noun.
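The difference a tag makes can be sketched in a few lines of Python. The example hits below are invented, as are the tag labels and the function name filter_by_tag. The sketch assumes only that each hit for impact arrives already labeled as a noun or a verb, which is what a tagged corpus provides and an untagged one does not.

```python
from collections import Counter

# Invented search hits for "impact", each paired with a part-of-speech label.
# In an untagged collection the second element would simply be missing.
HITS = [
    ("the impact of the war", "NOUN"),
    ("prices will impact farmers", "VERB"),
    ("an impact on trade", "NOUN"),
    ("how tariffs impact trade", "VERB"),
    ("the impact was slight", "NOUN"),
]


def filter_by_tag(hits, tag):
    """Keep only the hits whose label matches tag."""
    return [text for text, hit_tag in hits if hit_tag == tag]


verb_uses = filter_by_tag(HITS, "VERB")
tag_counts = Counter(tag for _, tag in HITS)

print(verb_uses)
print(tag_counts)
```

With five hits the filtering could be done by eye. With thousands, the tag is what makes the question answerable at all.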

When I spoke with Mark Davies on the telephone, he referred to Google Books as “one of the best and the worst ambassadors” for corpora studies. The search function of Google Books and its Ngram Viewer do not allow for much beyond examining word frequency, and Davies says, “word frequency is very interesting, but it’s just a very small glimpse of what words are doing.”

Many linguists have taken the position that corpora needn’t be so large, and that stuffing hundreds of millions of words into a database does naught but create a messy environment. Davies disagreed, and after having built a number of large corpora in other languages (Spanish and Portuguese), he saw a need for a more substantial searchable record of English than was currently available.

One obvious thing that COHA provides is a historic record of the language, rather than simply a thin slice of the fossil record. This allows us to look at how our language has changed (some would say deteriorated) over the past two hundred years. As an example, we can look at the movement and usage of a single word. Let’s start with unique.

Some people hold that unique is a non-modifiable adjective. After all, something is either unique or it is not, and thus such turns of phrase as almost unique, somewhat unique, and most unique are illogical. The apparent misuse of this word is one of the more common bugaboos of people who like to complain about the language use of others, and one frequently hears that this misuse is a recent phenomenon.

When we examine how unique has been used in COHA, we get a somewhat different picture. For the first two decades of data (1800-1820), unique appears very infrequently, and tends to be used without qualification, coming up in such phrases as “the last is unique” and “a unique single principle.” All is well with the world.

But soon enough, Satan comes to Eden, cleverly disguised as semantic drift. By the 1830s, we can see, American writers had begun to qualify unique, and to use it in ways that indicate something not exactly one-of-a-kind. Phrases such as “quite unique,” “more unique,” and “somewhat unique” begin to appear. By the 1850s, Americans are using “almost unique,” and it becomes apparent that this word was conveying shades of meaning in a way that purists have found objectionable for quite some time now. This type of examination is not possible when one has only a few million words to search through.
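A decade-by-decade look of this kind can be sketched roughly as follows. The phrases are the ones quoted above, but the exact decade attached to each record is a placeholder chosen to match the description, not an actual COHA entry. The list of qualifiers and the function name qualifiers_by_decade are likewise assumptions made for the example.

```python
from collections import defaultdict

# (decade, phrase) pairs. The phrases are quoted in the article; the decade
# given to each one is a placeholder, not a record taken from COHA.
RECORDS = [
    (1810, "the last is unique"),
    (1810, "a unique single principle"),
    (1830, "quite unique"),
    (1830, "more unique"),
    (1830, "somewhat unique"),
    (1850, "almost unique"),
]

# A hand-picked, hypothetical list of words that qualify an adjective.
QUALIFIERS = {"quite", "more", "most", "somewhat", "almost", "very"}


def qualifiers_by_decade(records, target="unique"):
    """Group the qualifiers found directly before target by decade."""
    found = defaultdict(list)
    for decade, phrase in records:
        words = phrase.lower().split()
        # Compare each word with the one that follows it.
        for prev, word in zip(words, words[1:]):
            if word == target and prev in QUALIFIERS:
                found[decade].append(prev)
    return dict(found)


result = qualifiers_by_decade(RECORDS)
print(result)
```

The earliest records contribute nothing to the result, since unique appears in them without a qualifier. That absence is the pattern the paragraph above describes.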

Linguists have been using COHA for much more detailed and tricky questions than these. Martin Hilpert used it in his paper “Diachronic Collostructional Analysis Meets the Noun Phrase: Studying Many a Noun in COHA.” Guenther and Martina Lampert used it to research their paper “Where Does Evidentiality Reside? Notes on (Alleged) Limiting Cases: Quotatives and Seem-Verbs.” But not all of the academic inquiries using COHA sound so intimidating. Johanna Wood and Sten Vikner employed COHA in search of the answer to the question posed by the title of a paper they published in 2013, “What’s to the Left of the Indefinite Article?” (Oddly enough, the answer is another indefinite article: The authors found that in some dialects of German, Danish, and English, the construction a such a is used.)

For those of you who have never found yourselves awake at night wondering where evidentiality resides, or what else is to the left of the indefinite article, rest assured that there is still much in COHA to amuse and educate you.

For instance, you can use this corpus to examine how political labels have shifted over the past 200 years. You would find that the noun Democrat did not begin to be affiliated with the adjective liberal until the 1930s, the heyday of the New Deal. Before that, it is most often seen with the words Southern, good, and little (which is apparently not a pejorative use—Little Democrat was the name of a ship that spawned a diplomatic dispute between the U.S. and France). Shifts in how Republican has been used are also visible. For four decades, from 1910 until 1950, progressive Republican was somewhat common, before being eclipsed in use by collocates such as conservative or moderate Republican.

Politician, on the other hand, appears to have relatively stable connotations over this time period. When one searches for the adjectives that have been used to describe such a person, the resultant list reads like a thesaurus entry for something especially unpleasant: unprincipled, crooked, profane, third-rate, unscrupulous, corrupt, cheap, wily, and selfish all make appearances.

Be warned, as it is quite easy to lose hours of your time on this sort of thing. You start off by just looking up a word to see how its frequency has changed since the 1930s. This prompts you to wonder what other words precede or follow it, or whether it was more common in fiction or non-fiction. Then maybe you compose searches for variant forms of the word in passive constructions. Before you know it, you find yourself sitting at a desk, bathed in the light of the computer screen, searching through the historical record of the language and wondering to yourself about what lies to the left of the indefinite article, and what is to the left of that.
_________________

References:

Shea, Ammon. 2014. “The American Language: A Historical Database of English in the U.S.” Pacific Standard. Posted: May 5, 2014. Available online: http://www.psmag.com/navigation/books-and-culture/american-language-historical-database-english-u-s-80748/

