Annotations of the ANC data contributed by members of the community are available as stand-off annotations in the same format as annotations included in the ANC Second Release. Note that these annotations are not included as a part of the ANC Second Release as distributed by the LDC. You must download and install these additional annotations, after which they will be usable in the same way as annotations included in the Second Release distribution.
Please Note: The CLAWS and co-reference annotations cannot be used with the Open ANC, since some OANC texts have been modified as a result of manual validation. Therefore, the CLAWS and co-reference stand-off annotations may contain invalid offsets. Versions of these annotations will be made available for the Open ANC in the future.
The written portion of the ANC has been tagged for part of speech by the University of Lancaster using the C5 tagset (the tagset used in the BNC) and the C7 tagset. The two sets of annotations have been packaged separately so that users can install portions of each tagset; for example, it is possible to install the C5 tags for the Slate corpus and the C7 tags for the New York Times corpus.
Each set of annotations can be installed on your system using either of two installers: a web installer or a stand-alone installer.
The installation process is the same regardless of the installer you use. Each installer is an executable jar file. On most systems either installer can be run simply by double clicking on the installer's jar file. If that does not work, open a command prompt (Windows) or a shell (Unix/Linux/MacOSX) and run the command:
java -jar installer.jar
where installer.jar is the name of the installer you downloaded. For example, to run the web installer for the C7 annotations, the command would be:
java -jar C7-web.jar
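If you would rather start an installer from a script, the command above can be wrapped as in the following minimal Python sketch. The function names installer_command and run_installer are hypothetical and are not part of the ANC distribution; the sketch only assumes that a Java runtime is reachable through the PATH.

```python
import shutil
import subprocess


def installer_command(jar_name):
    """Build the same command line as shown above: java -jar <jar_name>."""
    return ["java", "-jar", jar_name]


def run_installer(jar_name):
    """Run an installer jar, failing early if no Java runtime is found."""
    # shutil.which returns None when 'java' is not on the PATH.
    if shutil.which("java") is None:
        raise RuntimeError("No 'java' executable found on the PATH")
    # check=True raises CalledProcessError if the installer exits with an error.
    subprocess.run(installer_command(jar_name), check=True)
```

For example, run_installer("C7-web.jar") would start the web installer for the C7 annotations, provided the jar file is in the current directory.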
1. If you use the web installer, please note that the installer displays messages indicating that it is "connecting to the internet" while it is downloading the various packages. For this reason it is recommended that you use the stand-alone installer unless only a small subset of the annotations will be installed.
2. When you select the $ANC_HOME directory, the installer will warn you that the directory already exists and ask if you are sure you want to overwrite its contents. Select "Yes".
3. The installers assume that the ANC directory structure as it is on the DVD distributed by LDC has been preserved. The expected directory structure is shown below.
\---data
    +---spoken
    |   +---academic-discourse
    |   |   \---micase
    |   +---face-to-face
    |   |   \---charlotte
    |   \---telephone
    |       +---callhome
    |       \---switchboard
    +---written_1
    |   +---fiction
    |   |   +---eggan
    |   |   \---hargrave
    |   +---journal
    |   |   +---slate
    |   |   \---verbatim
    |   +---leisure
    |   |   \---blog
    |   \---letters
    |       \---icic
    \---written_2
        +---newspapers
        |   \---nytimes
        +---non-fiction
        |   \---OUP
        +---technical
        |   +---911report
        |   +---biomed
        |   +---government
        |   \---plos
        \---travel_guides
            +---berlitz1
            \---berlitz2
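Before running an installer, it may be useful to confirm that this directory structure is still intact. The following is a minimal Python sketch, not part of the ANC distribution: the name missing_dirs is hypothetical, and the sketch assumes that the data directory sits directly under the directory you select as $ANC_HOME.

```python
from pathlib import Path

# Leaf directories copied from the tree shown above, relative to $ANC_HOME.
# Assumption: "data" is an immediate child of the ANC home directory.
EXPECTED_DIRS = [
    "data/spoken/academic-discourse/micase",
    "data/spoken/face-to-face/charlotte",
    "data/spoken/telephone/callhome",
    "data/spoken/telephone/switchboard",
    "data/written_1/fiction/eggan",
    "data/written_1/fiction/hargrave",
    "data/written_1/journal/slate",
    "data/written_1/journal/verbatim",
    "data/written_1/leisure/blog",
    "data/written_1/letters/icic",
    "data/written_2/newspapers/nytimes",
    "data/written_2/non-fiction/OUP",
    "data/written_2/technical/911report",
    "data/written_2/technical/biomed",
    "data/written_2/technical/government",
    "data/written_2/technical/plos",
    "data/written_2/travel_guides/berlitz1",
    "data/written_2/travel_guides/berlitz2",
]


def missing_dirs(anc_home):
    """Return the expected directories that do not exist under anc_home."""
    root = Path(anc_home)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
```

An empty list from missing_dirs means every directory in the tree was found; any names returned point to parts of the structure that have been moved or renamed.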
Shane Bergsma of the University of Alberta has annotated a sub-set of the Slate corpus for coreference (anaphora). Here is what Shane has to say:
We labelled pronoun-antecedent pairs in 118 documents from the Slate section of the American National Corpus. There are 1398 labelled pronouns in 78 documents in the training set and 1381 labelled pronouns in 40 documents in the test set. Most of the Slate documents are "gist" articles which provide factual background information for stories currently in the news. Only pronouns that refer to noun phrases given previously in the text are used in our system. Thus we label and ignore pronouns referring to implicit entities not specifically mentioned, cataphora (e.g., "After he was elected, president Clinton..."), and pleonastic pronouns without antecedent (e.g., "it is raining"). Of the 2779 total pronouns labelled, 219 are so identified.
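The figures in this description can be cross-checked with a few lines of Python. This is only an arithmetic sketch using the numbers quoted above; reading the 219 labelled-and-ignored pronouns as "excluded" and the remainder as "usable" is our interpretation, not a term used by the annotator.

```python
# Figures quoted in the description above.
train_pronouns, train_docs = 1398, 78
test_pronouns, test_docs = 1381, 40
excluded = 219  # labelled but ignored (implicit, cataphoric, pleonastic)

total_docs = train_docs + test_docs
total_pronouns = train_pronouns + test_pronouns
usable = total_pronouns - excluded  # pronouns with a preceding antecedent
excluded_pct = round(100 * excluded / total_pronouns, 1)

print(total_docs, total_pronouns, usable, excluded_pct)
# prints: 118 2779 2560 7.9
```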
The coreference annotations are packaged as a separate corpus and should not be installed into the ANC home directory. The installer can create a new directory if the directory selected for installation does not already exist.
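If you script your setup, a simple guard can catch an accidental choice of the ANC home directory as the installation target. The sketch below is illustrative only: the name is_separate_from_anc is hypothetical, and treating subdirectories of the ANC home directory as unsuitable is an assumption on our part rather than a documented rule of the installer.

```python
from pathlib import Path


def is_separate_from_anc(target_dir, anc_home):
    """Return True if target_dir is neither the ANC home nor inside it."""
    # resolve() normalises relative paths and ".." components; the target
    # does not have to exist yet, since the installer can create it.
    target = Path(target_dir).resolve()
    home = Path(anc_home).resolve()
    return target != home and home not in target.parents
```

For instance, a sibling directory of the ANC home directory passes this check, while the ANC home directory itself does not.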
The coreference annotations are packaged in an executable jar file. To install the annotations run the jar file by double clicking on it, or by opening a command prompt (Windows) or a shell (Unix/Linux/MacOSX) and running the command:
java -jar Slate-coref-install.jar
Once the annotations have been installed you will likely want to process the files with the ANC Tool.