The second release of the American National Corpus includes updated versions of all of the files in the first release plus an additional 10 million words. The second release, however, uses standoff annotations to a much greater extent than the first. All documents are now stored logically as annotation graphs with a node set and an edge set. The node set consists of a UTF-16 character stream with an implied node between each pair of characters and at the start and end of the stream. The edge set consists of one or more XML documents that describe the annotations.
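The node and edge model described above can be illustrated with a minimal Python sketch. It is not ANC code: the `Edge` class, the helper names, and the sample labels are hypothetical, and it assumes that node indices can be treated as Python string offsets, which matches UTF-16 offsets only when every character lies in the Basic Multilingual Plane.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One annotation: an edge between two implied nodes (hypothetical type)."""
    start: int  # node just before the first covered character
    end: int    # node just after the last covered character
    label: str


def node_count(text: str) -> int:
    # One node between each pair of characters, plus one at each end of the stream.
    return len(text) + 1


def covered_text(text: str, edge: Edge) -> str:
    # Node i sits immediately before character i, so an edge maps onto a slice.
    if not (0 <= edge.start <= edge.end <= len(text)):
        raise ValueError(f"edge {edge} falls outside the node set")
    return text[edge.start:edge.end]


text = "The cat sat."
edges = [
    Edge(0, 12, "s"),    # sentence boundary (illustrative label)
    Edge(0, 3, "DT"),    # part-of-speech tags (illustrative labels)
    Edge(4, 7, "NN"),
    Edge(8, 11, "VBD"),
]

for edge in edges:
    print(edge.label, repr(covered_text(text, edge)))
```

Because the edges only point into the character stream, any number of annotation sets can describe the same text without modifying it.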
Each logical document in the ANC is described by a cesHeader that associates one or more edge sets with a node set. In addition, the ANC uses the following naming convention to make the associations human readable.
Given an annotation graph G used to represent the logical document file named filename, the following files are used to store the representation of G:
All of the XML files together describe the complete edge set. The annotation graph does not necessarily describe a valid XML document, so combining all of the annotations may result in a malformed XML document. However, the XML document produced by combining the logical markup, the sentence boundaries, and one of the part-of-speech tag sets will be well-formed XML and will validate against the XCES schema.
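One way a combined edge set can fail to be well-formed XML is when two annotations partially overlap, since XML elements must nest. The sketch below checks a list of spans for such crossings. It is an illustration only: the function name and the `(start, end, label)` tuple layout are assumptions made for this example, and partial overlap is assumed to be the source of the problem rather than taken from the ANC documentation.

```python
def find_crossing(spans):
    """Return the first pair of partially overlapping spans, or None.

    Each span is a (start, end, label) tuple of node offsets (hypothetical layout).
    """
    # Sort by start offset; at equal starts, put the longer span first so
    # that it becomes the enclosing element.
    ordered = sorted(spans, key=lambda s: (s[0], -s[1]))
    open_spans = []  # stack of spans that are still "open" at this point
    for span in ordered:
        start, end = span[0], span[1]
        # Close every span that ends at or before the start of this one.
        while open_spans and open_spans[-1][1] <= start:
            open_spans.pop()
        # The innermost open span must fully contain the new span.
        if open_spans and end > open_spans[-1][1]:
            return (open_spans[-1], span)
        open_spans.append(span)
    return None


nested = [(0, 12, "s"), (0, 3, "tok"), (4, 7, "tok")]
crossing = [(0, 7, "a"), (4, 12, "b")]

print(find_crossing(nested))    # spans nest cleanly
print(find_crossing(crossing))  # spans cross, so they cannot be serialized as nested elements
```

A set of annotations that passes this check can be written out as nested elements; one that fails would need to be kept as separate standoff files.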
Several tools on the ANC's Tools Page can be used to generate XML files from the annotation graphs. Several Gate processing resources are also available that allow Gate to load and save ANC documents.