Data
Please read the licence information carefully when downloading data.
USENET Orthographic Frequencies for 1,618,598 types. (2005-2006)
These frequencies were derived from a corpus of USENET postings. The corpus was collected between October 2005 and August 2006 and covers 47,860 English-language, non-binary-file newsgroups. All types that occurred more than 20 times in the corpus were added to the list of types to be counted. The list includes many non-words, because the software used to remove unwanted punctuation sometimes concatenated adjacent words. Other non-words may come from spelling errors and from texts in languages other than English that were posted to mainly English-speaking newsgroups.
Processing: All NNTP headers were discarded. All message bodies that had the same 128-bit SHA-1 hash as another message body were discarded (reducing duplication of documents from cross-posts). To reduce text duplication, lines that began with a quote character (">", "<", "|", ":", "#", "!" or "%") were not processed; this step removed 35% of the lines in the corpus. All text was converted to uppercase, and all punctuation except intra-word hyphens and apostrophes was removed.
Corpus size: 7,772,031,659 words
Citation: Shaoul, C. & Westbury, C. (2006) USENET Orthographic Frequencies for 1,618,598 types. (2005-2006) Edmonton, AB: University of Alberta (downloaded from http://www.psych.ualberta.ca/~westburylab/downloads/wlallfreq.download.html)
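The processing steps described above can be sketched in Python as follows. This is a minimal illustration, not the software used to build the list: the function names (`unique_bodies`, `tokenize`, `count_types`) are hypothetical, message bodies are assumed to arrive as already header-free strings, and a quoted line is assumed to be one whose very first character is a quote character. The sketch compares full SHA-1 digests (which are 160 bits long) and splits tokens at removed punctuation, so it will not reproduce the word concatenations mentioned above. It also treats only ASCII letters and digits as word characters, which is an assumption.

```python
import hashlib
import re
from collections import Counter

# Quote characters listed in the processing notes.
QUOTE_PREFIXES = (">", "<", "|", ":", "#", "!", "%")

# A token is a run of ASCII letters/digits, optionally joined by single
# intra-word hyphens or apostrophes (assumed definition of a "type").
TOKEN_RE = re.compile(r"[A-Z0-9]+(?:['-][A-Z0-9]+)*")


def unique_bodies(bodies):
    """Yield each message body once, dropping exact duplicates by SHA-1."""
    seen = set()
    for body in bodies:
        digest = hashlib.sha1(body.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield body


def tokenize(body):
    """Skip quoted lines, uppercase the rest, and extract tokens."""
    tokens = []
    for line in body.splitlines():
        if line.startswith(QUOTE_PREFIXES):
            continue  # quoted text is not processed
        tokens.extend(TOKEN_RE.findall(line.upper()))
    return tokens


def count_types(bodies, min_count=21):
    """Count types over de-duplicated bodies; keep those seen more than 20 times."""
    counts = Counter()
    for body in unique_bodies(bodies):
        counts.update(tokenize(body))
    return {t: n for t, n in counts.items() if n >= min_count}
```

For example, `tokenize("Hello, world!\n> quoted\nIt's a well-known fact.")` skips the quoted line and keeps the apostrophe and the intra-word hyphen. The default `min_count=21` mirrors the "more than 20 times" threshold stated above.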
Download the list: Please fill out this form so that we can keep track of who has downloaded this file.
©2005, 2006, 2007 WestburyLab
chrisw at ualberta dot ca