Corpus

Please read the licence information carefully when downloading data.

A reduced redundancy USENET corpus (2005-2011)


This corpus is a collection of public USENET postings. It was collected between Oct 2005 and Jan 2011 and covers 47,860 English-language, non-binary-file newsgroups (see the list of newsgroups included with the corpus for details). Despite our best efforts to clean this corpus, it contains a very small percentage of non-English words and non-words. No automatic spelling correction was performed, and no text was transformed. The corpus is untagged, raw text. It may be necessary to process the corpus further to put it in a format that suits your needs.

Method of Transmission: This file is now hosted on a cloud-based web download service. Please read the licensing information and then fill in the form below to be taken to the download page.

For those who want the original corpus: The full, un-reduced corpus (2005-2009) is still available as a Public Data Set on Amazon Web Services. You will need a free AWS account to gain access. Click here to get the corpus through your AWS account. It is 36 GB, compressed.

Processing: All NNTP headers were discarded. (There is no way to recover them; this was done to protect the privacy of the authors.) All message bodies that had the same 128-bit SHA-1 hash as another message body were discarded (reducing duplication of documents from cross-posts). To reduce the amount of garbage data and non-English text in the corpus, the following pre-processing steps were taken:
  • All documents that were shorter than 500 words or longer than 500,000 words were omitted.
  • Documents that contained less than 90% English words were omitted. (English words were defined as words that are contained in a 100,000-word dictionary of English.)
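
The duplicate removal and the two filters above could be sketched in Python roughly as follows. This is an illustration only, not the code used to build the corpus: the function names, the whitespace tokenizer, the punctuation stripping, and the choice to keep the first copy of each duplicated body are all assumptions. Note also that Python's `hashlib.sha1` returns a standard 160-bit digest.

```python
import hashlib

def body_digest(body):
    """Return the SHA-1 hex digest of a message body (UTF-8 encoding assumed)."""
    return hashlib.sha1(body.encode("utf-8")).hexdigest()

def drop_duplicate_bodies(bodies):
    """Keep the first occurrence of each distinct body (an assumed policy)."""
    seen = set()
    unique = []
    for body in bodies:
        digest = body_digest(body)
        if digest not in seen:
            seen.add(digest)
            unique.append(body)
    return unique

def english_fraction(words, dictionary):
    """Fraction of tokens found in the dictionary after lowercasing and
    stripping surrounding punctuation (a hypothetical normalization)."""
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.lower().strip(".,;:!?\"'()") in dictionary)
    return hits / len(words)

def keep_document(text, dictionary, min_words=500, max_words=500_000,
                  min_english=0.90):
    """Apply the length filter, then the English-word filter."""
    words = text.split()  # whitespace tokenization is an assumption
    if len(words) < min_words or len(words) > max_words:
        return False
    return english_fraction(words, dictionary) >= min_english
```

In practice `dictionary` would be a set loaded from a word list of about 100,000 entries; the specific list used for the corpus is not named on this page.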
To anonymize the text, we also did the following:
  • Replaced all of the obvious e-mail addresses with the token <EMAILADDRESS>.
  • Replaced all of the obvious HTTP URLs with the token <URL>, and news URLs with <NEWSURL>.
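
A minimal sketch of this kind of token replacement is shown below. The regular expressions are assumptions about what counts as an "obvious" address or URL (including treating `https` like `http`); the patterns actually used for the corpus are not documented here.

```python
import re

# Deliberately simple, hypothetical patterns for "obvious" cases only.
HTTP_URL_RE = re.compile(r"\bhttps?://[^\s<>\"]+", re.IGNORECASE)
NEWS_URL_RE = re.compile(r"\bnews:[^\s<>\"]+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

def anonymize(text):
    """Replace URLs first, then e-mail addresses, so that an '@' inside
    a URL is not mistaken for part of an address."""
    text = HTTP_URL_RE.sub("<URL>", text)
    text = NEWS_URL_RE.sub("<NEWSURL>", text)
    text = EMAIL_RE.sub("<EMAILADDRESS>", text)
    return text
```

For example, `anonymize("Mail bob@example.com or see http://example.com/page")` returns `"Mail <EMAILADDRESS> or see <URL>"`.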

Unlike previous versions of this corpus released between 2005 and 2012, this corpus had further redundant-text removal algorithms applied to it. This made the corpus smaller but, we hope, of higher quality.
The two extra steps were the following:
1) Any sections of text that were obviously quotes of other posts were removed (for example lines beginning with the ">" character).
2) A hashing algorithm was applied to all paragraphs, assigning a unique signature to each paragraph. Any paragraphs that appeared more than twice in the corpus were removed.
After applying these redundant-text removal steps, the size of the corpus shrank from 30 billion words to 7 billion words.
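
The two extra steps might look something like the following sketch. Several details are assumptions rather than documented facts: paragraphs are taken to be blocks separated by blank lines, whitespace is normalized before hashing, SHA-1 stands in for the unspecified hashing algorithm, and every copy of a paragraph seen more than twice is dropped.

```python
import hashlib
import re
from collections import Counter

def strip_quoted_lines(text):
    """Drop lines whose first non-space character is '>'."""
    kept = [ln for ln in text.splitlines() if not ln.lstrip().startswith(">")]
    return "\n".join(kept)

def split_paragraphs(text):
    """Split on blank lines (an assumed definition of a paragraph)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def paragraph_signature(paragraph):
    """Hash a whitespace-normalized paragraph; SHA-1 is a stand-in choice."""
    normalized = " ".join(paragraph.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def remove_frequent_paragraphs(documents, max_occurrences=2):
    """Remove every paragraph that occurs more than max_occurrences times
    across all documents (removing all copies is an assumption)."""
    counts = Counter(
        paragraph_signature(p)
        for doc in documents
        for p in split_paragraphs(doc)
    )
    cleaned = []
    for doc in documents:
        kept = [p for p in split_paragraphs(doc)
                if counts[paragraph_signature(p)] <= max_occurrences]
        cleaned.append("\n\n".join(kept))
    return cleaned
```

This version holds one counter entry per distinct paragraph in memory; at the scale of billions of words, a real pipeline would likely need an on-disk or streaming approach.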

Corpus size:   over 7 billion words.
Data size:   over 8 GB, compressed into a single file.
Last Update:  May 2011.

Citation: Shaoul, C. & Westbury, C. (2013). A reduced redundancy USENET corpus (2005-2011). Edmonton, AB: University of Alberta (downloaded from http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html).

Acknowledgments: This work would not have been possible without the hardware and software provided by the TaPoR project. This research is also supported by NSERC.

If you have any questions about this corpus, please contact Cyrus Shaoul.

PLEASE NOTE THE LICENSE THAT THIS CORPUS IS RELEASED UNDER:
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Download the corpus:

Please fill out this form so that we can keep track of who has downloaded the corpus. The information that you enter below will be kept completely confidential. Please enter a valid e-mail address.

Your Full Name:
Your Email Address:
Your Organization:
What do you intend to use the data for?
Comments/Questions:



 

©2011,2012,2013  WestburyLab   chrisw at ualberta dot ca