Corpus
Please read the licence information carefully when downloading data.
A reduced redundancy USENET corpus (2005-2011)
This corpus is a collection of public USENET postings. It was collected between Oct 2005 and Jan 2011, and covers 47,860 English-language, non-binary-file newsgroups (see the list of newsgroups included with the corpus for details). Despite our best efforts to clean this corpus, it contains a very small percentage of non-English words and non-words. No automatic spelling correction was performed, and no text was transformed. The corpus is untagged, raw text. It may be necessary to process the corpus further to put it in a format that suits your needs.
Method of Transmission: This file is now hosted on a cloud-based web download service. Please read the licensing information and then fill in the form below to be taken to the download page.
Unlike previous versions of this corpus released between 2005 and 2012,
this corpus had further redundant-text removal algorithms applied to
it. This made the corpus smaller and, we hope, of higher quality.
Data size: over 8 GB, compressed into a single file. Last Update: May 2011.
Citation: Shaoul, C. & Westbury, C. (2013). A reduced redundancy USENET corpus (2005-2011). Edmonton, AB: University of Alberta (downloaded from http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html)
Acknowledgments:
This work would not have been possible without the hardware and
software provided by the TaPoR project. This research is
also supported by NSERC.
Download the corpus: Please fill out this form so that we can keep track of who has downloaded the corpus. The information that you enter below will be kept completely confidential. Please enter a valid e-mail address.
©2011, 2012, 2013 WestburyLab
chrisw at ualberta dot ca