Please read the licence information carefully when downloading data.
A USENET corpus (2005-2010) [BETA VERSION]
This corpus is a collection of public USENET postings. It was collected between Oct 2005 and Jan 2011, and covers 47860 English-language, non-binary-file newsgroups. Despite our best efforts, this corpus includes a very small number of non-English words, non-words, and spelling errors. The corpus is untagged, raw text. It may be necessary to process the corpus further to put it in a format that suits your needs.
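As an illustration of the kind of further processing mentioned above, here is a minimal sketch in Python that tokenizes raw text lines and builds word counts, then drops rare tokens as a crude way to reduce non-words and spelling errors. The token pattern, the function names (`count_words`, `drop_rare`), and the frequency threshold are all assumptions made for this example; they are not part of the corpus distribution, and the sample lines stand in for real corpus text.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed token pattern (not defined by the corpus): runs of ASCII letters,
# optionally joined by internal apostrophes, e.g. "isn't".
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def count_words(lines: Iterable[str]) -> Counter:
    """Lowercase each line, extract tokens, and accumulate their counts."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(TOKEN_RE.findall(line.lower()))
    return counts


def drop_rare(counts: Counter, min_count: int) -> Counter:
    """Keep only tokens seen at least min_count times (hypothetical filter)."""
    return Counter({w: c for w, c in counts.items() if c >= min_count})


# Stand-in sample; in practice `lines` could be any iterable of text lines,
# such as an open file handle over a decompressed bundle.
sample = [
    "The corpus is untagged, raw text.",
    "It's raw; the text isn't tagged.",
]
counts = count_words(sample)
frequent = drop_rare(counts, min_count=2)
```

Because `count_words` accepts any iterable of lines, it can stream over a large file without loading it into memory. A suitable threshold for `drop_rare` depends on your application and would need to be chosen empirically.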
Method of Transmission: For most users, we strongly recommend using a BitTorrent
client. This is the most efficient method for most users. For
those users on Internet2, the HTTP service works well. If you are NOT on Internet2
(most non-academic networks) and cannot use BitTorrent, please try the
Limited HTTP service, which has global daily download limits.
Corpus size: over 30 billion words.
Data size: over 34 GB, compressed (delivered as weekly bundles of about 150 MB each).
Last Update: May, 2011.
Citation: Shaoul, C. & Westbury, C. (2011). A USENET corpus (2005-2010). Edmonton, AB: University of Alberta (downloaded from http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html)
This work would not have been possible without the hardware and
software provided by the TaPoR project. This research is
also supported by NSERC.
Download the corpus: Please fill out this form so that we can keep track of who has downloaded the corpus. The information that you enter below will be kept completely confidential. Please enter a valid e-mail address.
IMPORTANT: Once you click on the above button, you will be able to choose BitTorrent for your download. Please try to use BitTorrent if you are connected to the Internet (and are NOT connected to Internet2). Also, please leave your BitTorrent program running for a few days after the download is complete to help others obtain the corpus. Some people have left BitTorrent running for weeks without any ill effects! :-)
©2011 WestburyLab chrisw at ualberta dot ca