Corpus
Please read the licence information carefully when downloading data.
A USENET corpus (2005-2009) [BETA VERSION]
This corpus is a collection of public USENET postings. It was collected between Oct 2005 and Jan 2010, and covers 47860 English-language, non-binary-file newsgroups. Despite our best efforts, this corpus includes a very small number of non-English words, non-words, and spelling errors. The corpus is untagged, raw text. It may be necessary to process the corpus further to put it in a format that suits your needs.
Method of Transmission: For most users, we strongly recommend using a BitTorrent client, which is the most efficient method. For those users on Internet2, the HTTP service works well. If you are NOT on Internet2 (most non-academic networks) and cannot use BitTorrent, please try the Limited HTTP service, which has global daily download limits.
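Because the corpus is untagged, raw text, some preprocessing is usually needed before analysis. The following is a minimal sketch in Python of one possible first step, a streaming word-frequency count. It is not the authors' method. The tokenization rule, the function names `count_tokens` and `open_bundle`, the UTF-8 decoding, and the assumption that a bundle whose name ends in `.gz` is gzip-compressed are all illustrative assumptions; this page does not specify the compression format or the character encoding.

```python
import gzip
import io
import re
from collections import Counter
from typing import Iterable, TextIO

# Hypothetical tokenization rule (an assumption, not part of the corpus):
# runs of ASCII letters, optionally with one internal apostrophe ("it's").
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def count_tokens(lines: Iterable[str]) -> Counter:
    """Count lowercase word tokens, reading one line at a time."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(TOKEN_RE.findall(line.lower()))
    return counts


def open_bundle(path: str) -> TextIO:
    """Open one downloaded bundle as text.

    Assumption: names ending in ".gz" are gzip-compressed; anything else is
    treated as plain text. Undecodable bytes are replaced rather than raised,
    since the encoding of the postings is not documented here.
    """
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "r", encoding="utf-8", errors="replace")


if __name__ == "__main__":
    # In-memory stand-in for a bundle, so the sketch runs without a download.
    sample = io.StringIO("Raw text, it's RAW.\nUntagged text.\n")
    print(count_tokens(sample).most_common(3))
```

Reading line by line keeps memory use roughly proportional to the vocabulary size, not the bundle size, which matters for a corpus of this scale. To process a real bundle, pass the file object returned by `open_bundle` to `count_tokens` inside a `with` block.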
Corpus size: over 25 billion words, over 28 million documents
Data size: over 28 GB, compressed (delivered as weekly bundles of about 150 MB each)
Last Update: Jan 2010
Citation: Shaoul, C. & Westbury, C. (2009). A USENET corpus (2005-2009). Edmonton, AB: University of Alberta (downloaded from http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html)
Download the corpus: Please fill out this form so that we can keep track of who has downloaded the corpus. The information that you enter below will be kept completely confidential. Please enter a valid e-mail address.
©2010 WestburyLab chrisw at ualberta dot ca