Westbury Lab Web Site: Usenet type frequency data Download

These frequencies were derived from a corpus of USENET postings. The corpus was collected between October 2005 and August 2006 and covers 47,860 English-language, non-binary-file newsgroups. Every type that occurred more than 20 times in the corpus was added to the list of types to be counted. The list includes many non-words, because the software used to remove unwanted punctuation also caused some unwanted concatenation of words. Other non-words may come from spelling errors and from texts in languages other than English that were posted to mainly English-speaking newsgroups.
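As an illustration of the inclusion rule described above, the following is a minimal sketch and not the software actually used to build the list. The function name `type_frequencies` and the parameter `min_count` are hypothetical; the only detail taken from the description is that a type is kept when it occurs more than 20 times.

```python
from collections import Counter


def type_frequencies(tokens, min_count=20):
    """Count each type and keep those occurring MORE than min_count times.

    `tokens` is any iterable of already-normalized strings. The default of 20
    mirrors the threshold stated in the description; the rest is an assumption.
    """
    counts = Counter(tokens)
    return {t: n for t, n in counts.items() if n > min_count}


# Toy example: only "THE" crosses the threshold.
sample = ["THE"] * 21 + ["CAT"] * 20 + ["ZQX"]
frequencies = type_frequencies(sample)
```

Note that a non-word such as a concatenated or misspelled string is counted like any other type, which is why such items appear in the list once they pass the threshold.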


Processing: All NNTP headers were discarded. All message bodies that had the same 128-bit SHA-1 hash as another message body were discarded (reducing the duplication of documents caused by cross-posts). All lines that began with one of the quote characters (">", "<", "|", ":", "#", "!" and "%") were left unprocessed in order to reduce text duplication. This step removed 35% of the lines in the corpus. All text was converted to uppercase, and all punctuation, except for intra-word hyphens and apostrophes, was removed.
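The processing steps above can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the original software: all function names are hypothetical, headers are assumed to be stripped already, Python's `hashlib.sha1` produces a full 160-bit digest (used here as is), and "punctuation" is approximated as anything other than A-Z, 0-9, whitespace, hyphens and apostrophes.

```python
import hashlib
import re

# Quote characters listed in the processing description.
QUOTE_PREFIXES = (">", "<", "|", ":", "#", "!", "%")


def unique_bodies(bodies):
    """Yield each message body once, skipping exact duplicates by SHA-1 digest."""
    seen = set()
    for body in bodies:
        digest = hashlib.sha1(body.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield body


def drop_quoted_lines(body):
    """Remove lines that begin with one of the quote characters."""
    kept = [ln for ln in body.splitlines() if not ln.startswith(QUOTE_PREFIXES)]
    return "\n".join(kept)


def tokenize(text):
    """Uppercase, delete punctuation, and split on whitespace."""
    text = text.upper()
    # Assumption: delete every character that is not alphanumeric,
    # whitespace, a hyphen, or an apostrophe.
    text = re.sub(r"[^A-Z0-9\s'-]", "", text)
    # Delete hyphens and apostrophes that are not between two alphanumerics,
    # so only intra-word ones survive.
    text = re.sub(r"(?<![A-Z0-9])['-]|['-](?![A-Z0-9])", "", text)
    return text.split()


def process(bodies):
    """Run deduplication, quote removal, and tokenization over message bodies."""
    tokens = []
    for body in unique_bodies(bodies):
        tokens.extend(tokenize(drop_quoted_lines(body)))
    return tokens


messages = [
    "> quoted reply\nDon't stop - well-known words!",
    "> quoted reply\nDon't stop - well-known words!",  # cross-post duplicate
    "end.Start",
]
tokens = process(messages)
```

Because punctuation is deleted rather than replaced with a space in this sketch, `end.Start` becomes the single token `ENDSTART`, which is one plausible way the unwanted concatenation of words mentioned in the description could arise.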

Corpus size: 7,772,031,659 words

Citation: Shaoul, C. & Westbury, C. (2006) USENET Orthographic Frequencies for 1,618,598 types. (2005-2006) Edmonton, AB: University of Alberta (downloaded from http://www.psych.ualberta.ca/~westburylab/downloads/wlallfreq.download.html)

Acknowledgments: This work would not have been possible without the hardware and software provided by the TaPoR project. This research is also supported by NSERC.

If you have any questions about this data, please contact Cyrus Shaoul.

PLEASE NOTE:

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License.

Download the list:

Please fill out this form so that we can keep track of who has downloaded this file.