Data
Please read the licence information carefully when downloading data.
USENET Orthographic Frequencies for 111,627 English Words. (2005-2006)
These frequencies were derived from a corpus of USENET postings. The corpus was collected between October 2005 and August 2006 and covers 47,860 English-language, non-binary-file newsgroups. The list is not a complete list of the words found in the corpus, but rather a large list that we use for psycholinguistic research.

Processing: All NNTP headers were discarded. All message bodies that had the same 128-bit SHA-1 hash as other message bodies were discarded (reducing duplication of documents from cross-posts). To reduce text duplication, lines that began with a quote character (">", "<", "|", ":", "#", "!" or "%") were not processed; this step removed 35% of the lines in the corpus. All text was converted to uppercase, and all punctuation except the apostrophe was removed.

Corpus size: 7,781,959,860 words

Citation: Shaoul, C. & Westbury, C. (2006) USENET Orthographic Frequencies for 111,627 English Words. (2005-2006) Edmonton, AB: University of Alberta (downloaded from http://www.psych.ualberta.ca/~westburylab/downloads/wlfreq.download.html)
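For readers who want to apply comparable preprocessing to their own text, the processing steps described above can be sketched roughly as follows. This is an illustrative sketch only, not the code used to build the list. The function name `count_words`, the constants, and the input format (an iterable of message-body strings with headers already stripped) are assumptions. The sketch also assumes that punctuation is replaced by a space rather than deleted outright, that digits count as word characters, and that duplicates are detected with the full SHA-1 digest from Python's `hashlib` (which is 160 bits long).

```python
import hashlib
import re
from collections import Counter

# Quote prefixes listed in the processing description.
QUOTE_PREFIXES = (">", "<", "|", ":", "#", "!", "%")

# Assumption: anything other than A-Z, 0-9, apostrophe, or whitespace
# counts as punctuation and is replaced by a space.
PUNCTUATION = re.compile(r"[^A-Z0-9'\s]")


def count_words(bodies):
    """Count uppercase word frequencies over an iterable of message bodies."""
    seen_digests = set()
    counts = Counter()
    for body in bodies:
        # Skip any body whose SHA-1 digest has already been seen
        # (e.g. cross-posted copies of the same message).
        digest = hashlib.sha1(body.encode("utf-8")).digest()
        if digest in seen_digests:
            continue
        seen_digests.add(digest)

        for line in body.splitlines():
            # Skip lines that begin with a quote character.
            if line.startswith(QUOTE_PREFIXES):
                continue
            # Uppercase, strip punctuation except the apostrophe, then split.
            cleaned = PUNCTUATION.sub(" ", line.upper())
            counts.update(cleaned.split())
    return counts
```

Note that this sketch keeps the first copy of each duplicated body and skips later copies; whether the original processing kept one copy or none is not stated above.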
Please fill out this form so that we can keep track of who has downloaded this file.
©2005, 2006, 2007 WestburyLab
chrisw at ualberta dot ca