| Corpus Please read the licence information carefully when downloading data. |
A USENET corpus (2005-2009) [BETA VERSION]
This corpus is a collection of public USENET postings collected between Oct 2005 and Jan 2009. It covers 47860 English-language, non-binary-file newsgroups. Despite our best efforts, the corpus includes a very small number of non-English words, non-words, and spelling errors. The corpus is untagged, raw text. It may be necessary to process it further to put it in a format that suits your needs.

Method of Transmission: For most users, we recommend using a BitTorrent client; this is the most efficient method. For those users on Internet2, the HTTP service works well. If you are NOT on Internet2 (as is the case for most non-academic networks) and cannot use BitTorrent, please try the Limited HTTP service, which has global daily download limits.

Processing: All NNTP headers were discarded (there is no way to recover them; this was done to ensure the privacy of the authors). All message bodies that had the same 128-bit SHA-1 hash as other message bodies were discarded (reducing duplication of documents from cross-posts). To reduce the amount of garbage data and non-English text in the corpus, the following pre-processing steps were taken:
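As an illustration of the duplicate-removal step described under Processing, here is a minimal Python sketch that drops message bodies whose hash has already been seen. It is not the code used to build the corpus. The function name `dedupe_bodies`, the UTF-8 encoding, and the choice to keep the first copy of each body are all assumptions. Note also that a standard SHA-1 digest is 160 bits long, so the exact hash behind the "128bit" description above is unclear; the sketch simply uses `hashlib.sha1`.

```python
import hashlib


def dedupe_bodies(bodies):
    """Yield each message body whose SHA-1 digest has not been seen before.

    Assumption: the first copy of a body is kept and later copies are dropped.
    """
    seen = set()
    for body in bodies:
        # Hash the raw text; the UTF-8 encoding is an assumption of this sketch.
        digest = hashlib.sha1(body.encode("utf-8")).digest()
        if digest in seen:
            continue  # same hash as an earlier body, e.g. a cross-post
        seen.add(digest)
        yield body


# Hypothetical example data: the third post duplicates the first.
posts = ["Hello world", "Cross-posted text", "Hello world"]
unique_posts = list(dedupe_bodies(posts))
```

Storing only the digests rather than the full bodies keeps memory use small, which matters when filtering a collection of this size.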
Corpus size: over 20 billion words

Data size: over 22 GB, compressed

Last Update: Jan, 2009

Citation: Shaoul, C. & Westbury, C. (2009) A USENET corpus (2005-2009) Edmonton, AB: University of Alberta (downloaded from http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html)

Download the corpus: Please fill out this form so that we can keep track of who has downloaded the corpus. The information that you enter below will be kept completely confidential. Please enter a valid e-mail address.
©2008 WestburyLab chrisw at ualberta dot ca