


The ANC Project


The American National Corpus (ANC) project is fostering the development of a corpus comparable to the British National Corpus (BNC), covering American English. Corpus-analytic work has demonstrated that the BNC is inappropriate for the study of American English, due to the numerous differences in use of the language.

The availability of a corpus of American English will contribute significantly to language and linguistic research, to the development of language-understanding computer applications (e.g., language translation and search-and-retrieval software), and to the compilation of reference works such as dictionaries and thesauri. It will also provide a rich national resource for use in education at all levels.

The ANC will contain a core corpus of at least 100 million words, including both written and spoken (transcribed) data comparable across genres to the BNC. The genres in the ANC will be expanded to include "new" types of language data that have become available in recent years, such as web blogs and web pages, chats, email, and rap music lyrics. Beyond the core 100 million words, the ANC will include an additional component of potentially several hundred million words, chosen to provide the broadest and largest selection of data possible.

A consortium of publishers of American English dictionaries and companies with interests in language processing was formed in 1999. Consortium members provided initial financial support for the project and are providing materials for inclusion in the corpus.

In fall 2003, the ANC produced its First Release of over 11 million words of American English. This and all future releases of ANC data are distributed by the Linguistic Data Consortium (LDC).

The LDC distributes all ANC data for non-commercial research purposes at a nominal charge of $75. Commercial use is limited to members of the ANC Consortium (ANCC) until fall 2008. New commercial members can join the ANCC at any time.