MASC Downloads

 

MASC is a community resource that is freely available for download and use. In return, we ask that you send us any of the following that result from your use of MASC data and/or annotations, which we will make freely available to the user community:

  1. errors or problems

  2. corrections/validations of any part of MASC or the OANC, both text and annotations

  3. additional annotations in any format

  4. derived resources, including word lists, frequency lists, n-grams, extracted entities or other knowledge, statistics of any kind, etc.

DATA DOWNLOAD


MASC data and annotations can be obtained in two ways:

  1. use ANC2Go to select portions of the corpus and annotations and receive a “customized” corpus including only your selections in one of the following output formats:

       a. in-line XML (XCES), suitable for use with the BNC’s XAIRA search and access interface and other XML-aware software

       b. token / part of speech, a common input format for general-purpose concordance software such as MonoConc, as well as the Natural Language Toolkit (NLTK)

       c. CoNLL IOB format

  2. download the data, alone or with all available annotations in the ANC format, below.
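For readers unfamiliar with the IOB output option, the following minimal sketch shows how such a file can be read back into chunks. It assumes the three-column layout used by the CoNLL-2000 chunking data (token, part-of-speech tag, chunk tag, with a blank line between sentences); the exact columns that ANC2Go writes are not stated here, so check a sample of your own output first. The function names and the sample text are illustrative only.

```python
def read_iob_sentences(text):
    """Split CoNLL-style text into sentences of (token, pos, iob) triples.

    Assumption: three whitespace-separated columns per line and a blank
    line between sentences, as in the CoNLL-2000 chunking data.
    """
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                      # a blank line ends the sentence
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos, iob = line.split()[:3]
        current.append((token, pos, iob))
    if current:                           # final sentence without trailing blank line
        sentences.append(current)
    return sentences


def extract_chunks(sentence):
    """Group B-/I- tagged tokens into (chunk_type, [tokens]) pairs."""
    chunks, open_type, open_tokens = [], None, []
    for token, _pos, iob in sentence:
        prefix, _, chunk_type = iob.partition("-")
        # A chunk starts at B-X, or at an I-X that does not continue the open chunk.
        starts_new = prefix == "B" or (prefix == "I" and chunk_type != open_type)
        if prefix == "O" or starts_new:
            if open_tokens:
                chunks.append((open_type, open_tokens))
            open_type, open_tokens = None, []
        if prefix in ("B", "I"):
            open_type = chunk_type
            open_tokens.append(token)
    if open_tokens:
        chunks.append((open_type, open_tokens))
    return chunks


# Invented sample data, not taken from MASC.
SAMPLE = """\
The DT B-NP
corpus NN I-NP
is VBZ B-VP
free JJ O

Download VB B-VP
it PRP B-NP
"""

if __name__ == "__main__":
    for sentence in read_iob_sentences(SAMPLE):
        print(extract_chunks(sentence))
```

Tokens tagged O close any open chunk and are otherwise ignored, so only the chunked material is returned.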

The “core” MASC corpus is divided into three sets:


MASC I                            *** Data and annotations available ***

80K words of data with validated annotations for token, part of speech, sentence boundary, noun chunks, verb chunks, named entities, and Penn Treebank syntax, plus full-text FrameNet annotation for seventeen texts. This portion of the corpus contains 40K words of text annotated by the Unified Linguistic Annotation Project and about 5000 words of license-free English language data from the Language Understanding Corpus.


DOWNLOAD DATA ONLY (82K words UTF-8 textfiles)

masc1_data-only.zip    |    masc1_data-only.tgz


DOWNLOAD DATA AND STANDOFF ANNOTATIONS

Date          Version    Release notes    Download
2010-07-23    1.0.2      1.0.2_notes      MASC-1.0.2.zip  |  MASC-1.0.2.tgz
2010-05-17    1.0.1                       MASC1.zip       |  MASC1.tgz
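The idea behind standoff annotation is that the text files stay untouched and each annotation points into them by character offset. The sketch below illustrates only that idea with plain Python objects. It does not read the XML files in the download, and the `Span` class, the `resolve` function, and the sample sentence and tags are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A hypothetical standoff annotation: a label attached to a text region."""
    start: int   # offset of the first character (inclusive)
    end: int     # offset one past the last character (exclusive)
    label: str


def resolve(text, spans):
    """Return (label, covered text) pairs in document order."""
    ordered = sorted(spans, key=lambda s: (s.start, s.end))
    return [(s.label, text[s.start:s.end]) for s in ordered]


# Invented example; the primary text is never modified.
TEXT = "MASC is freely available."
SPANS = [
    Span(15, 24, "JJ"),
    Span(0, 4, "NNP"),
    Span(5, 7, "VBZ"),
    Span(8, 14, "RB"),
]

if __name__ == "__main__":
    for label, covered in resolve(TEXT, SPANS):
        print(label, covered)
```

Because the annotations live apart from the text, several annotation layers (for example part of speech and named entities) can refer to the same data file without interfering with one another.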

MASC II                              *** Data available ***

120K words of additional data from a range of genres. Annotations produced within the MASC project (token, part of speech, sentence boundary, noun chunks, verb chunks, named entities, FrameNet, plus WordNet sense annotations) will be released in fall 2010.


DOWNLOAD DATA ONLY (140K words UTF-8 textfiles)

masc2_data-only.zip    |   masc2_data-only.tgz


MASC III                             *** Data available early fall 2010 ***

280K words of additional data, filling out the 500K sub-corpus and rounding out the genre distribution. Annotations produced within the MASC project (token, part of speech, sentence boundary, noun chunks, verb chunks, named entities, FrameNet, plus WordNet sense annotations) will be released in 2011.

WORDNET SENSE ANNOTATIONS

One thousand occurrences of each of 100 words chosen by the FrameNet-WordNet harmonization effort have been manually annotated with WordNet 3.1 senses. For 100 of the occurrences of each word, the containing sentences have also been annotated for FrameNet frame elements. The data and annotations are distributed as a separate corpus. See WordNet - FrameNet Annotations for more information.


DOWNLOAD SENTENCE CORPUS WITH STANDOFF ANNOTATIONS, DOCUMENTATION, AND INTER-ANNOTATOR AGREEMENT DATA

masc_wordsense.zip     |     masc_wordsense.tgz

TOOL DOWNLOAD


The ANC project has not developed project-specific software for MASC and OANC data. Instead, we provide the data and annotations in formats compatible with a wide variety of applications and frameworks.


  1. For XML-aware tools and applications, the BNC’s XAIRA, concordancing software such as MonoConc, and NLTK (token/pos only), use ANC2Go to generate the corpora and annotations in the appropriate format.

  2. To use MASC/OANC data and annotations in the General Architecture for Text Engineering (GATE) and/or output annotations created in GATE in GrAF format, DOWNLOAD THE ANC/GrAF GATE PLUGINS. Installation and use instructions are available here.

  3. Available August 1: To use MASC/OANC data and annotations in the Unstructured Information Management Architecture (UIMA), DOWNLOAD UIMA CAS CONSUMER. Installation and use instructions are available here.

  4. Available early fall: To use MASC/OANC data and annotations in the Natural Language Toolkit (NLTK), DOWNLOAD NLTK CORPUS READER. Installation and use instructions are available here.

  5. To access and manipulate GrAF annotations directly from Java programs, USE THE GrAF API. The GrAF API also provides a renderer that generates input to the open source GraphViz graph visualization application.
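To give a sense of what GraphViz input for a small annotation graph looks like, here is a minimal sketch that writes a graph in GraphViz's DOT language. It is not the GrAF API (which is a Java library) and does not reproduce that renderer's output; the `to_dot` function, the node identifiers, and the labels are hypothetical.

```python
def to_dot(nodes, edges, name="annotations"):
    """Serialize a small directed graph as GraphViz DOT text.

    nodes: dict mapping node id -> display label
    edges: iterable of (source id, target id) pairs
    """
    def quote(value):
        # DOT double-quoted strings need backslashes and quotes escaped.
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        return '"' + escaped + '"'

    lines = [f"digraph {name} {{"]
    for node_id, label in nodes.items():
        lines.append(f"  {quote(node_id)} [label={quote(label)}];")
    for source, target in edges:
        lines.append(f"  {quote(source)} -> {quote(target)};")
    lines.append("}")
    return "\n".join(lines)


# Invented example: a noun chunk node dominating two token nodes.
NODES = {"np1": "NP", "t1": "The", "t2": "corpus"}
EDGES = [("np1", "t1"), ("np1", "t2")]

if __name__ == "__main__":
    print(to_dot(NODES, EDGES))
```

Saving the printed text to a file and running GraphViz's `dot` tool on it (for example `dot -Tpng graph.dot -o graph.png`) draws the graph.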