Wikipedia Corpus

Please read the licence information carefully when downloading data.

The Westbury Lab Wikipedia corpus (2010)

This corpus was created from a snapshot of all the articles in the English part of Wikipedia, taken in April 2010. It was processed, as described in detail below, to remove all links and irrelevant material (navigation text, etc.). The corpus is untagged, raw text. It may be necessary to process the corpus further to put it in a format that suits your needs.

Method of Transmission: This file is now hosted on a cloud-based web download service. Please fill in the form below to be taken to the download page.

Processing: The snapshot was downloaded in April 2010 from the Wikipedia dumper. It was then converted to text using the WikiExtractor. The following pre-processing steps were also taken:
  • All documents that were less than 2000 characters long were omitted.
Corpus size: 990,248,478 words, over 2 million documents
Data size: over 6 GB raw, 1.8 GB bzip compressed (delivered as a single file)
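
As a rough illustration of working with a corpus delivered this way, the Python sketch below streams a bzip-compressed text file without unpacking it to disk, and mirrors the 2000-character length filter described above. It is not the lab's own processing code. The function names, the UTF-8 encoding, and the whitespace-based word count are all assumptions made for this example, so the count it produces need not match the figure reported above.

```python
import bz2
from typing import Iterable, Iterator


def keep_long_documents(docs: Iterable[str], min_chars: int = 2000) -> Iterator[str]:
    """Yield only documents that are at least min_chars characters long.

    Mirrors the stated pre-processing step: documents shorter than
    2000 characters were omitted.
    """
    for doc in docs:
        if len(doc) >= min_chars:
            yield doc


def count_words_bz2(path: str, encoding: str = "utf-8") -> int:
    """Stream a bzip-compressed text file and count whitespace-separated tokens.

    The encoding is an assumption; undecodable bytes are replaced rather
    than raising an error. The file is read line by line, so the multi-GB
    raw text never has to fit in memory.
    """
    total = 0
    with bz2.open(path, mode="rt", encoding=encoding, errors="replace") as handle:
        for line in handle:
            total += len(line.split())
    return total
```

Calling count_words_bz2 with the path of the downloaded file (whatever name it was saved under) returns a token count. How documents are delimited inside the file is not stated here, so splitting the stream into documents is left out of the sketch.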

Citation: Shaoul, C. & Westbury, C. (2010) The Westbury Lab Wikipedia Corpus, Edmonton, AB: University of Alberta (downloaded from

Acknowledgments: This work would not have been possible without the hardware and software provided by the TaPoR project and Dr. Harald Baayen. This research is also supported by NSERC.

If you have any questions about this corpus, please contact Cyrus Shaoul

Creative Commons License
Wikipedia Corpus by Cyrus Shaoul is licensed under a Creative Commons Attribution 3.0 Unported License.
Based on a work at

Also, please read about the Wikipedia CC license.

Download the WestburyLab Wikipedia corpus:

Please fill out this form so that we can keep track of who has downloaded the corpus. The information that you enter below will be kept completely confidential. Please make sure to enter a valid e-mail address.

Your Full Name:
Your Email Address:
Your Organization:
What do you intend to use the data for?


©2010,2011,2012,2013  WestburyLab   chrisw at ualberta dot ca