Westbury Lab Web Site: ELP Download

USENET Orthographic Frequencies for the 40,481 words in the English Lexicon Project. (2005-2006)

These frequencies were derived from a corpus of USENET postings. The corpus was collected between October 2005 and August 2006, and covers 47,860 English-language, non-binary-file newsgroups.

Processing: All NNTP headers were discarded. All message bodies that had a 128-bit SHA-1 hash identical to that of another message were discarded (reducing extraneous cross-posts). To reduce text duplication, all lines that began with a quote character (">", "<", "|", ":", "#", "!" or "%") were left unprocessed; this step removed 36% of the lines in the corpus. All text was converted to uppercase, and all punctuation except the apostrophe was removed.
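The processing steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used to build the corpus: the function names (strip_headers, count_words) are hypothetical, headers are assumed to end at the first blank line, the first copy of a duplicated body is assumed to be kept, punctuation is assumed to be deleted outright rather than replaced with a space, and "punctuation" is taken to mean ASCII punctuation. Note also that a SHA-1 digest is 160 bits long, so the "128-bit" figure may refer to a truncated digest or a different hash; the sketch simply compares full SHA-1 digests.

```python
import hashlib
import string
from collections import Counter

# Quote characters listed in the processing description.
QUOTE_PREFIXES = (">", "<", "|", ":", "#", "!", "%")

# Translation table that deletes all ASCII punctuation except the apostrophe.
# Assumption: punctuation is deleted, not replaced by a space.
DROP_PUNCT = str.maketrans(
    "", "", "".join(ch for ch in string.punctuation if ch != "'")
)


def strip_headers(message: str) -> str:
    """Return the message body, assumed to start after the first blank line.

    A message with no blank line is treated as having an empty body.
    """
    _, _, body = message.partition("\n\n")
    return body


def count_words(messages):
    """Count uppercase word forms across messages, skipping duplicate bodies."""
    seen_digests = set()
    counts = Counter()
    for message in messages:
        body = strip_headers(message)
        digest = hashlib.sha1(body.encode("utf-8")).digest()
        if digest in seen_digests:
            continue  # Assumption: keep the first copy, skip later ones.
        seen_digests.add(digest)
        for line in body.splitlines():
            if line.startswith(QUOTE_PREFIXES):
                continue  # Quoted line: left unprocessed.
            cleaned = line.upper().translate(DROP_PUNCT)
            counts.update(cleaned.split())
    return counts
```

Calling count_words on a list of raw message strings returns a Counter keyed by uppercase word forms, with apostrophes preserved (for example DON'T).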

Corpus size: 7,781,959,860 words

Citation: Shaoul, C., & Westbury, C. (2006). USENET Orthographic Frequencies for the 40,481 words in the English Lexicon Project (2005-2006). Edmonton, AB: University of Alberta (downloaded from http://www.psych.ualberta.ca/~westburylab/downloads/elp.download.html)

Acknowledgments: This work would not have been possible without the hardware and software provided by the TaPoR project. This research is also supported by NSERC.

If you have any questions about this data, please contact Cyrus Shaoul.

PLEASE NOTE:

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License.

Please fill out this form so that we can keep track of who has downloaded this file.


Data: Please read the licence information carefully when downloading data.

Full Name:
Email Address:
Organization:
What do you intend to use the data for?
Comments/Questions: