Westbury Lab Web Site: Usenet word frequency data Download

These frequencies were derived from a corpus of USENET postings. The corpus was collected between October 2005 and August 2006 and covers 47,860 English-language, non-binary-file newsgroups. The word list is not a complete list of the words found in the corpus, but rather a large list that we use for psycholinguistic research.

Processing: All NNTP headers were discarded. Message bodies that had the same SHA-1 hash as another message body were discarded, which reduced duplication of documents from cross-posts. To reduce text duplication further, lines that began with a quote character (">", "<", "|", ":", "#", "!" or "%") were not processed. This step removed 35% of the lines in the corpus. All text was converted to uppercase, and all punctuation except the apostrophe was removed.
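The processing steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the code used to build the corpus. The function names (clean_body, count_words) are hypothetical, and several details are assumptions: the first copy of a duplicated body is kept, the full SHA-1 digest is compared, a line counts as quoted only if the quote character is its very first character, punctuation is replaced by a space rather than deleted, and only ASCII letters, digits and apostrophes are kept.

```python
import hashlib
import re
from collections import Counter

# Quote prefixes listed in the processing description.
QUOTE_PREFIXES = (">", "<", "|", ":", "#", "!", "%")

# Assumption: anything other than A-Z, 0-9, apostrophe, or whitespace
# is treated as punctuation and replaced by a space.
PUNCTUATION = re.compile(r"[^A-Z0-9'\s]")


def clean_body(body):
    """Return the uppercase words of one message body, skipping quoted lines."""
    words = []
    for line in body.splitlines():
        # Assumption: no leading whitespace is stripped before the check.
        if line.startswith(QUOTE_PREFIXES):
            continue
        cleaned = PUNCTUATION.sub(" ", line.upper())
        words.extend(cleaned.split())
    return words


def count_words(bodies):
    """Count word frequencies over message bodies (headers already removed)."""
    seen_hashes = set()
    counts = Counter()
    for body in bodies:
        digest = hashlib.sha1(body.encode("utf-8")).digest()
        # Assumption: keep the first copy, skip later identical bodies.
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        counts.update(clean_body(body))
    return counts


if __name__ == "__main__":
    sample = [
        "Hello, world!\n> quoted line\nDon't stop.",
        "Hello, world!\n> quoted line\nDon't stop.",  # cross-post duplicate
        "hello again",
    ]
    for word, n in sorted(count_words(sample).items()):
        print(word, n)
```

Hashing whole bodies only catches exact duplicates such as cross-posts; text repeated inside replies is handled separately by the quoted-line filter.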

Corpus size: 7,781,959,860 words

Citation: Shaoul, C. & Westbury, C. (2006). USENET Orthographic Frequencies for 111,627 English Words (2005-2006). Edmonton, AB: University of Alberta (downloaded from http://www.psych.ualberta.ca/~westburylab/downloads/wlfreq.download.html).

Acknowledgments: This work would not have been possible without the hardware and software provided by the TaPoR project. This research is also supported by NSERC.

If you have any questions about these data, please contact Cyrus Shaoul.

PLEASE NOTE:

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License.

Please fill out this form so that we can keep track of who has downloaded this file.