ANC First Release Frequency Data
These are preliminary word frequency counts for the first release of the ANC. The counts will be refined as texts are added and our part-of-speech tagger(s) are fine-tuned. Counts are given for the entire first release and separately for the spoken texts and the written texts.
In addition, three versions of the bigram counts are provided:
- Sorted by frequency
- Sorted by first word of the bigram
- Sorted by second word of the bigram
Trigram counts are available for the written texts and the spoken texts, sorted by frequency and by each word position.
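The three bigram orderings listed above can be illustrated with a minimal Python sketch. It does not read the ANC files, whose layout is not described here; it counts bigrams from a toy token list and sorts the in-memory counts three ways. The function names (count_bigrams, sort_by_frequency, sort_by_first_word, sort_by_second_word) and the tie-breaking rules are assumptions made for illustration, not the procedure used to build the released files.

```python
from collections import Counter


def count_bigrams(tokens):
    """Count adjacent token pairs in a token sequence."""
    return Counter(zip(tokens, tokens[1:]))


def sort_by_frequency(counts):
    # Highest count first; ties are broken alphabetically (an assumption).
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def sort_by_first_word(counts):
    # Alphabetical by first word, then by second word.
    return sorted(counts.items(), key=lambda kv: (kv[0][0], kv[0][1]))


def sort_by_second_word(counts):
    # Alphabetical by second word, then by first word.
    return sorted(counts.items(), key=lambda kv: (kv[0][1], kv[0][0]))


# Toy input for illustration only; this is not ANC data.
tokens = "the cat sat on the mat and the cat slept".split()
counts = count_bigrams(tokens)

by_frequency = sort_by_frequency(counts)
by_first = sort_by_first_word(counts)
by_second = sort_by_second_word(counts)

for (first, second), n in by_frequency:
    print(first, second, n)
```

Sorting by the second word is useful for questions such as "which words most often precede X", which is awkward to answer from a frequency-ordered list alone.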
Lexicons
Lexicons, including frequency counts, for the documents in the first release. A lexicon is provided for the complete first release, along with separate lexicons for the written and spoken texts.
- Complete lexicon : tgz | zip (420KB)
- Lexicon for written texts : tgz | zip (404KB)
- Lexicon for spoken texts : tgz | zip (112KB)
Bigrams
Complete ANC
- Sorted by frequency : tgz | zip (8MB)
- Sorted by the first word of the bigram : tgz | zip (8MB)
- Sorted by the second word of the bigram : tgz | zip (8MB)
Written Texts Only
- Sorted by frequency : tgz | zip (7.5MB)
- Sorted by the first word of the bigram : tgz | zip (7.5MB)
- Sorted by the second word of the bigram : tgz | zip (7.5MB)
Spoken Texts Only
- Sorted by frequency : tgz | zip (1.8MB)
- Sorted by the first word of the bigram : tgz | zip (1.8MB)
- Sorted by the second word of the bigram : tgz | zip (1.8MB)
Trigrams
Written Texts Only
- Sorted by frequency : tgz | zip (21.4 MB)
- Sorted by first word : tgz | zip (22.3 MB)
- Sorted by second word : tgz | zip (21.3 MB)
- Sorted by third word : tgz | zip (21.4 MB)
Spoken Texts Only
- Sorted by frequency : tgz | zip (6.3 MB)
- Sorted by first word : tgz | zip (6.3 MB)
- Sorted by second word : tgz | zip (6.6 MB)
- Sorted by third word : tgz | zip (6.3 MB)