USENET Orthographic Frequencies for 1,618,598 types. (2005-2006)

These frequencies were derived from a corpus of USENET postings collected between October 2005 and August 2006, covering 47,860 English-language, non-binary-file newsgroups. All types that occurred more than 20 times in the corpus were added to the list of types to be counted. This list includes many non-words, because the software used to remove unwanted punctuation sometimes concatenated adjacent words. Other non-words may come from spelling errors and from texts in languages other than English that were posted to mainly English-speaking newsgroups.
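
As an illustration of the inclusion rule above (types occurring more than 20 times), the following is a minimal Python sketch. It is not the authors' software; the function name `frequent_types` and the `min_count` parameter are hypothetical, and the input is assumed to be an iterable of already-tokenized, uppercased strings.

```python
from collections import Counter


def frequent_types(tokens, min_count=20):
    """Return {type: frequency} for types seen more than min_count times.

    `tokens` is any iterable of strings (one string per word token).
    The strict '>' mirrors the wording "more than 20 times".
    """
    counts = Counter(tokens)
    return {t: n for t, n in counts.items() if n > min_count}


if __name__ == "__main__":
    # Toy input: THE occurs 21 times, CAT exactly 20, DOG 5.
    sample = ["THE"] * 21 + ["CAT"] * 20 + ["DOG"] * 5
    print(frequent_types(sample))
```

With the toy input, only THE passes the threshold, because CAT occurs exactly 20 times and the rule is strict.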

Processing: All NNTP headers were discarded. All message bodies that had the same 128-bit SHA-1 hash as other message bodies were discarded (reducing duplication of documents from cross-posts). All lines that began with a quote character (">", "<", "|", ":", "#", "!" or "%") were skipped in order to reduce text duplication. This step removed 35% of the lines in the corpus. All text was converted to uppercase, and all punctuation except intra-word hyphens and apostrophes was removed.
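
The processing steps above can be sketched in Python as follows. This is an approximation under stated assumptions, not the original software: the names `clean_bodies` and `QUOTE_PREFIXES` are hypothetical, headers are assumed to be already stripped, only ASCII letters and digits are treated as word characters, and removed punctuation is replaced by a space (the original tool evidently joined some words instead). Python's `hashlib.sha1` produces a 160-bit digest, so the hash width here may differ from the one described above.

```python
import hashlib
import re

# Quote characters listed in the processing description.
QUOTE_PREFIXES = (">", "<", "|", ":", "#", "!", "%")


def clean_bodies(bodies):
    """Yield cleaned text for each message body not seen before."""
    seen = set()
    for body in bodies:
        # Drop bodies whose hash matches an earlier body (cross-posts).
        digest = hashlib.sha1(body.encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)

        # Skip lines that begin with a quote character.
        kept = [ln for ln in body.splitlines()
                if not ln.startswith(QUOTE_PREFIXES)]
        text = " ".join(kept).upper()

        # Assumption: keep only A-Z, 0-9, whitespace, hyphen, apostrophe.
        text = re.sub(r"[^A-Z0-9\s'-]", " ", text)
        # Remove hyphens/apostrophes that are not between two
        # alphanumeric characters, i.e. not intra-word.
        text = re.sub(r"(?<![A-Z0-9])['-]+|['-]+(?![A-Z0-9])", " ", text)

        # Collapse runs of whitespace.
        yield " ".join(text.split())


if __name__ == "__main__":
    msgs = [
        "> quoted\nHello, world! It's a well-known fact.",
        "> quoted\nHello, world! It's a well-known fact.",  # duplicate
    ]
    for cleaned in clean_bodies(msgs):
        print(cleaned)
```

The duplicate body is dropped by the hash check, the quoted line is skipped, and the hyphen in "well-known" and the apostrophe in "It's" survive because each sits between two word characters.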

Corpus size:   7,772,031,659 words

Citation: Shaoul, C. & Westbury, C. (2006) USENET Orthographic Frequencies for 1,618,598 types. (2005-2006) Edmonton, AB: University of Alberta (downloaded from

Acknowledgments: This work would not have been possible without the hardware and software provided by the TaPoR project. This research is also supported by NSERC.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License.

