USENET Orthographic Frequencies for 111,627 English Words. (2005-2006)

These frequencies were derived from a corpus of USENET postings. The corpus was collected between October 2005 and August 2006 and covers 47,860 English-language newsgroups that do not carry binary files. The word list is not a complete list of the words found in the corpus; it is a large list that we use for psycholinguistic research.

Processing: All NNTP headers were discarded. Any message body with the same 128-bit SHA-1 hash as another message body was discarded, which reduces the duplication of documents caused by cross-posting. To reduce text duplication further, lines beginning with one of the quote characters (">", "<", "|", ":", "#", "!" and "%") were not processed. This step removed 35% of the lines in the corpus. All text was converted to uppercase, and all punctuation except the apostrophe was removed.
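
The processing steps above can be sketched in Python as follows. This is a minimal illustration, not the code used to build the corpus. The names clean_body and count_words are hypothetical, and several details are assumptions: the first copy of a duplicated body is kept, a quote character counts only when it is the first character of a line, punctuation is replaced by a space rather than deleted outright, and only A-Z, 0-9 and the apostrophe survive. Python's hashlib SHA-1 returns a 160-bit digest, and the sketch compares the full digest.

```python
import hashlib
import re
from collections import Counter

# Quote characters listed in the processing description.
QUOTE_PREFIXES = (">", "<", "|", ":", "#", "!", "%")


def clean_body(body: str) -> str:
    """Drop quoted lines, uppercase, and strip punctuation except apostrophes."""
    # Assumption: a line is "quoted" only if its very first character
    # is one of the quote characters.
    kept = [line for line in body.splitlines()
            if not line.startswith(QUOTE_PREFIXES)]
    text = " ".join(kept).upper()
    # Assumption: punctuation becomes a space, and only A-Z, 0-9,
    # apostrophes and whitespace are retained.
    return re.sub(r"[^A-Z0-9'\s]", " ", text)


def count_words(bodies):
    """Count word frequencies over message bodies, skipping exact duplicates."""
    seen = set()
    counts = Counter()
    for body in bodies:
        digest = hashlib.sha1(body.encode("utf-8")).digest()
        if digest in seen:
            # Assumption: keep the first copy, skip later identical bodies.
            continue
        seen.add(digest)
        counts.update(clean_body(body).split())
    return counts


if __name__ == "__main__":
    sample = [
        "Hello, world!\n> quoted text\nDon't stop.",
        "Hello, world!\n> quoted text\nDon't stop.",  # cross-posted duplicate
        "# comment\nhello again",
    ]
    for word, freq in sorted(count_words(sample).items()):
        print(word, freq)
```

The input here is assumed to be message bodies with the NNTP headers already stripped. Hashing the body before the quoted lines are dropped follows the order in which the steps are described.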

Corpus size:   7,781,959,860 words

Citation: Shaoul, C. & Westbury, C. (2006) USENET Orthographic Frequencies for 111,627 English Words. (2005-2006) Edmonton, AB: University of Alberta (downloaded from

Acknowledgments: This work would not have been possible without the hardware and software provided by the TaPoR project. This research is also supported by NSERC.

If you have any questions about this data, please contact Cyrus Shaoul

