
The ANC Second Release



File Structure

The second release of the American National Corpus includes updated versions of all of the files in the first release plus an additional 10 million new words. However, the second release uses standoff annotations to a much greater extent than the first release did. All documents are now stored logically as annotation graphs with a node set and an edge set. The node set consists of a UTF-16 character stream with an implied node between each pair of characters and at the start and end of the stream. The edge set consists of one or more XML documents that describe the annotations.
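The node and edge model described above can be illustrated with a minimal Python sketch. It is not ANC code: the `Edge` class, the labels `tok` and `s`, and the sample text are hypothetical and chosen only for illustration. The sketch also assumes that node positions can be treated as Python string indices, which coincide with UTF-16 code units only when the text contains no characters outside the Basic Multilingual Plane.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One standoff annotation: a labelled span between two nodes (hypothetical type)."""
    start: int  # index of the node where the annotation begins
    end: int    # index of the node where the annotation ends
    label: str  # illustrative label, not taken from the ANC


def node_count(text: str) -> int:
    # One node between each pair of adjacent characters,
    # plus one at the start and one at the end of the stream.
    return len(text) + 1


def covered_text(text: str, edge: Edge) -> str:
    # Node i sits immediately before character i, so an edge from
    # node `start` to node `end` covers text[start:end].
    return text[edge.start:edge.end]


text = "Dogs bark."
edges = [
    Edge(0, 4, "tok"),
    Edge(5, 9, "tok"),
    Edge(9, 10, "tok"),
    Edge(0, 10, "s"),
]

for edge in edges:
    print(edge.label, repr(covered_text(text, edge)))
```

Because the annotations only refer to node positions, the character stream itself is never modified, and any number of edge sets can describe the same text.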

ANC Documents

Each logical document in the ANC is described by a cesHeader that associates one or more edge sets with a node set. In addition, the ANC uses the following naming convention to make the associations human readable.

Given an annotation graph G used to represent the logical document named filename, the following files are used to store the representation of G:

All of the XML files together describe the complete edge set. The annotation graph does not necessarily describe a valid XML document, so combining all of the annotations may result in a malformed XML document. However, the XML document produced by combining the logical markup, the sentence boundaries, and one set of part-of-speech tags will be well-formed XML and will validate against the XCES schema.
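One common reason merged standoff annotations cannot be written as inline XML is that two spans overlap without one containing the other. The page does not say which failure mode applies to the ANC, so the following Python sketch rests on that assumption. The function names and sample spans are hypothetical; spans are (start, end) pairs of node positions.

```python
from itertools import combinations


def crosses(a, b):
    """Return True if spans a and b overlap without either containing the other."""
    # Order the two spans by start (then end) so only one case needs checking.
    (a_start, a_end), (b_start, b_end) = sorted([a, b])
    return a_start < b_start < a_end < b_end


def crossing_pairs(spans):
    """List every pair of spans that could not be nested as XML elements."""
    return [(a, b) for a, b in combinations(spans, 2) if crosses(a, b)]


# A sentence span containing two token spans nests cleanly.
nested = [(0, 10), (0, 4), (5, 9)]

# Adding a span that straddles both tokens introduces crossings.
overlapping = nested + [(3, 7)]

print(crossing_pairs(nested))
print(crossing_pairs(overlapping))
```

A check of this kind passes only for annotation sets whose spans nest, which is a necessary condition for inline serialization. It does not test validity against the XCES schema.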

The ANC's Tools Page offers several tools that generate XML files from the annotation graphs, as well as several Gate processing resources that allow Gate to load and save ANC documents.