The American National Corpus (ANC) project figured prominently in the August 18, 2002 On Language column, featured in the New York Times Magazine. The article "Corpus Linguistics", written by John Rosenthal, gives a good overview of how linguists can use corpora to describe current usage. Several quotations from the ANC Project Manager, Randi Reppen, are included:
Reppen is the project manager for the American National Corpus, a huge undertaking sponsored by a consortium of publishers, software companies and academics, including Pearson, Microsoft, Sony and the Universities of California, Colorado and Pennsylvania, among many others. When it is completed, the corpus will contain more than 100 million words, chosen from a broad selection of contemporary written and spoken texts -- everything from books, magazines and newspapers to face-to-face conversations in drugstores and Laundromats that have been recorded and transcribed by researchers. Based on a similar corpus of British English created in 1994, the American Corpus will provide a definitive portrait of how the English language is used in the United States today.
The first installment of 10 million words is scheduled for release this fall and will be available to anybody with Internet access. Say, for example, you're writing advertising copy, and you want to know whether most people still use "I couldn't care less" or opt instead for the easier (but nonsensical) "I could care less." You'll simply hop on the Web, enter the phrase "could care less" and count the occurrences in the corpus. Then you'll do the same for "couldn't care less" and compare the number of hits. "You could choose to limit your search to spoken language or to newspapers or even to academic writing," Reppen says.
The article incorrectly indicates that the ANC will be "available to anybody with Internet access". However, while the ANC will indeed be web-accessible, access to the corpus for development of commercial products (dictionaries and other reference publications, language-aware software, etc.) is restricted to ANC Consortium members until the year 2007. Commercial users who are not members of the ANC Consortium can gain access before 2007 by joining the consortium at any time. For the purposes of academic research and education, the ANC will be broadly available from the University of Pennsylvania’s Linguistic Data Consortium for a nominal fee covering part of the costs of distribution.
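The phrase comparison described in the column can be sketched in a few lines of Python. This is an illustration only, not the ANC's search interface: the sample sentences, the genre labels, and the function name count_phrase are all invented here, and a real corpus query would run against the distributed ANC data rather than a hand-typed list.

```python
import re

# Hypothetical mini-corpus of (genre, text) pairs, invented for illustration.
# These sentences are NOT taken from the ANC.
SAMPLE_CORPUS = [
    ("spoken", "Honestly, I could care less about the game."),
    ("spoken", "She said she couldn't care less what they think."),
    ("newspaper", "Voters couldn't care less about the procedural vote."),
    ("academic", "Respondents reported that they could not care less."),
]


def count_phrase(corpus, phrase, genre=None):
    """Count case-insensitive occurrences of a phrase in the corpus.

    If genre is given, only texts carrying that genre label are searched.
    The lookarounds keep the phrase from matching inside a longer word.
    """
    pattern = re.compile(
        r"(?<!\w)" + re.escape(phrase) + r"(?!\w)", re.IGNORECASE
    )
    return sum(
        len(pattern.findall(text))
        for text_genre, text in corpus
        if genre is None or text_genre == genre
    )


# Compare the two variants across the whole sample, then in speech only.
for variant in ("could care less", "couldn't care less"):
    total = count_phrase(SAMPLE_CORPUS, variant)
    spoken = count_phrase(SAMPLE_CORPUS, variant, genre="spoken")
    print(f"{variant!r}: {total} overall, {spoken} in spoken texts")
```

Restricting the count by genre mirrors Reppen's remark about limiting a search to spoken language, newspapers or academic writing. With a sample this small the numbers mean nothing; the point is only the shape of the comparison.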
Linguists hunt and study words in their natural habitat
By Nathan Bierma
Special to the Tribune
March 25, 2004
Sometimes language lovers sound as if they're on a safari. They talk about observing words in their natural habitat and studying their behavior in herds.
With the first release of the American National Corpus, an annotated body of over 10 million words, linguists can hunt like never before.
"Up until now, linguists were kind of like Victorian bug hunters," says Erin McKean, the Chicago-based senior editor of U.S. dictionaries for Oxford University Press and board member of the American National Corpus. "We'd go out with our nets and we'd catch some butterflies and we'd chloroform them and pin them to cards and put them in a drawer."
"But now, when people are really studying an ecosystem -- and English is like an ecosystem -- what they do is, they take a representative square area and report everything that's there: every bug, every plant, every leaf," she said. "And now with the corpus, we can do that for English."
If the dictionary is like the drawer with bugs on cards, the corpus is the jungle. The ANC collects blocks of text from newspapers, books and conversations so words and phrases can be viewed in their natural habitat -- that is, in an American English context.
Readers can search the collection by word, phrase, part of speech or type of source and find their quarry used in a sentence or paragraph.
For students learning English as a second language, a corpus -- Latin for body -- can help teach idioms and tendencies in a way dictionaries cannot, as ANC users around the world have already discovered.
"I hear from language teacher trainers in Egypt, Germany, Japan and Sweden who are really excited to have these data available to them, so they can go in and look at aspects of conversation," said Randi Reppen, English professor at Northern Arizona University and Project Manager for the ANC.
The ANC could also be used by advertising copywriters in search of resonant slogans, or by computer programmers to make automated customer service hotlines sound more natural, McKean said.
The ANC's initial release last October, available on CD-ROM for $75 at www.americannationalcorpus.org, contains 11.5 million words. About one-fourth of the collection is made up of spoken English, including transcribed phone conversations from volunteers who were given phone cards in exchange for being recorded.
The rest of the corpus is written text contributed by The New York Times, the online magazine Slate, Langenscheidt travel guides and books from Oxford University Press on architecture and Abraham Lincoln.
"We want writers to want to be part of the American National Corpus," McKean said. "We're hoping to have an ANC logo that authors can have their publishers put on their books, as a way of saying, 'My work is influencing the study of the English language.'"
By the end of 2005, the ANC, which last year received a grant from the National Science Foundation, hopes to release 100 million words -- 90 million written, 10 million spoken -- evenly balanced among sources as diverse as town meetings, medical journals and novels.
"It's hard to take one area and say, 'This is English,'" Reppen said. "By having different types of writing and speaking situations, the corpus gives a better picture for language researchers, teachers and learners."
Until now, such seekers of untamed English have relied on other corpora such as the British National Corpus, a collection of 100 million words of British English released 10 years ago. But in the last 10 years, new technology has made formatting samples of text faster and cheaper.
"We're lucky that we're doing it today," McKean said. "This is something that would have been insane to do in the 1950s and was barely possible in the 1980s when the British National Corpus [started]."
Meanwhile, demand for corpora has grown in the field of computational linguistics, which uses computer programs to analyze the structure of language.
"The motivation for the ANC came from the fact that many computational linguists were using the BNC to gather statistics about syntactic patterns, [when in fact] British English and American English are not alike in several ways," said Nancy Ide, professor of computer science at Vassar College and Technical Director of the ANC.
Another new wrinkle in corpus linguistics is the Internet. The ANC plans to add e-mails, message boards and Web sites to its collection. McKean has already gotten permission from her message board of fellow "Buffy the Vampire Slayer" fans to use their posts for the ANC.
Copyright (c) 2004, Chicago Tribune