Word Vector Data

Please read the licence information carefully when downloading data.

Word vectors created from a Wikipedia corpus

This page provides a pre-computed word vector set designed for use with the HiDEx system. It contains co-occurrence vectors for 57,377 words. For more information on HiDEx, or to download the software, please click here.

Processing: The corpus used to create this vector set is the free WestburyLab Wikipedia corpus. It was processed using HiDEx with a 5-word forward window and a 5-word backward window. We used an inverse linear ramp as the weighting scheme, and normalization was done using the PPMI (positive pointwise mutual information) method.
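For readers unfamiliar with this kind of processing, the steps above can be sketched in a few lines of Python. This is a minimal illustration, not the HiDEx implementation: the function names (`ramp_weight`, `cooccurrence_counts`, `ppmi`) are hypothetical, the exact form of the inverse linear ramp is an assumption (here, weight grows linearly with distance from the target word), and the PPMI step uses the standard textbook definition, which may differ in detail from what HiDEx computes.

```python
import math
from collections import defaultdict


def ramp_weight(distance, window):
    # ASSUMPTION: the exact ramp used by HiDEx is not given on this page.
    # Here the weight rises linearly with distance, from 1/window up to 1.0.
    return distance / window


def cooccurrence_counts(tokens, window=5, weight=ramp_weight):
    """Weighted co-occurrence counts, kept separately for each direction."""
    forward = defaultdict(float)   # (target, word that follows it)
    backward = defaultdict(float)  # (target, word that precedes it)
    for i, target in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                forward[(target, tokens[i + d])] += weight(d, window)
            if i - d >= 0:
                backward[(target, tokens[i - d])] += weight(d, window)
    return forward, backward


def ppmi(counts):
    """Standard PPMI: max(0, log2(P(w, c) / (P(w) * P(c))))."""
    total = sum(counts.values())
    row = defaultdict(float)  # marginal weight of each target word
    col = defaultdict(float)  # marginal weight of each context word
    for (w, c), v in counts.items():
        row[w] += v
        col[c] += v
    return {
        (w, c): max(0.0, math.log2(v * total / (row[w] * col[c])))
        for (w, c), v in counts.items()
    }


if __name__ == "__main__":
    tokens = "the cat sat on the mat while the dog sat on the rug".split()
    fwd, bwd = cooccurrence_counts(tokens, window=5)
    for pair, score in sorted(ppmi(fwd).items()):
        print(pair, round(score, 3))
```

Passing a different `weight` function makes it easy to compare the assumed ramp with, for example, a flat window in which every position counts equally.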

Method of Transmission: Direct HTTP download.

Vector Set Data Size: 4.9 GB raw (1 GB compressed with bzip2)

Vector Size: 57,377 vectors of 20,000 elements each  
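As a rough planning aid, the figures above can be turned into an element count and an in-memory size estimate. The on-disk storage format is not stated on this page, so the per-element byte widths below are assumptions chosen only to illustrate the arithmetic.

```python
# Dimensions quoted on this page.
n_vectors = 57_377
n_elements_per_vector = 20_000

n_elements = n_vectors * n_elements_per_vector
print(f"total elements: {n_elements:,}")

# ASSUMPTION: these element widths are illustrative; the actual
# storage format of the vector set is not described here.
for label, width in (("4-byte floats", 4), ("8-byte floats", 8)):
    size_gb = n_elements * width / 1e9
    print(f"as {label}: about {size_gb:.2f} GB in memory")
```

At 4 bytes per element the dense matrix is a little under the raw size quoted above, so loading the full set at once calls for a machine with several gigabytes of free memory.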

Source Corpus Size: over 900 million words in more than 2 million documents (the WestburyLab Wikipedia corpus)

Citation: Shaoul, C. & Westbury, C. (2010). Word Vectors from the 2010 Westbury Lab Wikipedia corpus. Edmonton, AB: University of Alberta (downloaded from

Acknowledgments: This work would not have been possible without the hardware and software provided by the TaPoR project. This research is also supported by NSERC.

If you have any questions about this data, please contact Cyrus Shaoul.

Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 Unported License.

Please fill out this form so that we can keep track of who has downloaded this file.

Your Full Name:
Your Email Address:
Your Organization:
What do you intend to use the data for?

©2005,2006,2007,2008,2009,2010,2011,2012,2013   WestburyLab   chrisw at ualberta dot ca