Word Vector Data

Please read the licence information carefully when downloading data.


Word vectors created from a Wikipedia corpus v.0.1

This page provides a pre-computed word vector set designed for use with the HiDEx system. It contains co-occurrence vectors for 57,377 words. For more info on HiDEx, please click here.

Processing: The corpus used to create this vector set is the free WestburyLab Wikipedia corpus. It was processed using HiDEx with a 5-word forward window and a 5-word backward window. We used an inverse linear ramp for the weighting scheme, and normalization was done using the PPMI (positive pointwise mutual information) method.
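To make the processing steps above more concrete, here is a minimal Python sketch of windowed co-occurrence counting followed by PPMI normalization. It is not the HiDEx implementation. The function names (ramp_weight, cooccurrence, ppmi) are hypothetical, the ramp formula (weight equal to distance from the target word) is only one assumed reading of "inverse linear ramp", and the PPMI formula is the standard textbook definition, which may differ in detail from what HiDEx computes.

```python
import math
from collections import defaultdict


def ramp_weight(distance, window):
    # ASSUMPTION: one possible reading of "inverse linear ramp", where the
    # weight grows linearly with distance from the target word.
    # For the opposite ramp, return (window - distance + 1) instead.
    return float(distance)


def cooccurrence(tokens, window=5, weight=ramp_weight):
    """Weighted co-occurrence counts with separate forward and backward windows."""
    counts = defaultdict(float)
    for i, target in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):  # word d positions after the target
                counts[(target, ("fwd", tokens[i + d]))] += weight(d, window)
            if i - d >= 0:  # word d positions before the target
                counts[(target, ("bwd", tokens[i - d]))] += weight(d, window)
    return counts


def ppmi(counts):
    """Standard PPMI: max(0, log2(p(w, c) / (p(w) * p(c))))."""
    total = sum(counts.values())
    row = defaultdict(float)  # marginal totals per target word
    col = defaultdict(float)  # marginal totals per context column
    for (w, c), n in counts.items():
        row[w] += n
        col[c] += n
    return {
        (w, c): max(0.0, math.log2(n * total / (row[w] * col[c])))
        for (w, c), n in counts.items()
    }


counts = cooccurrence("the cat sat on the mat".split(), window=5)
vectors = ppmi(counts)
```

Keeping forward and backward contexts as separate columns mirrors the two windows described above. A real run over a 900-million-word corpus would need a far more memory-efficient data structure than a Python dictionary.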

Method of Transmission: For most users, we strongly recommend using a BitTorrent client. This is the most efficient method because it downloads many parts in parallel, speeding up the transfer. For users on Internet2, the HTTP service works well. Our download system will detect whether you are on Internet2, and will automatically enable downloads from the IP address that you used to submit this form. If you are NOT on Internet2 (most non-academic networks are not) and cannot use BitTorrent, please try the Limited HTTP service, which has global daily download limits.

Vector Set Data Size: 4.9 GB raw (1 GB compressed with bzip2)

Vector Size: 57,377 vectors of 20,000 elements each  
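If you plan to load the full set into memory, a quick estimate of the dense matrix size may help. The sketch below multiplies the vector count by the vector length given above. The 4 bytes per element is purely an assumption (for example, 32-bit values); this page does not state the storage format of the raw file, so the result need not match the raw size listed above.

```python
N_WORDS = 57_377        # number of vectors, from this page
N_DIMS = 20_000         # elements per vector, from this page
BYTES_PER_ELEMENT = 4   # ASSUMPTION: 32-bit values; the real format is not stated


def dense_size_bytes(n_words, n_dims, bytes_per_element):
    """Bytes needed to hold every element of a dense matrix in memory."""
    return n_words * n_dims * bytes_per_element


elements = N_WORDS * N_DIMS
size = dense_size_bytes(N_WORDS, N_DIMS, BYTES_PER_ELEMENT)

print(f"{elements:,} elements")
print(f"{size / 1e9:.2f} GB (decimal) or {size / 2**30:.2f} GiB")
```

Change BYTES_PER_ELEMENT to 8 to estimate the footprint with 64-bit values.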

Source Corpus Size: over 900 million words in over 2 million documents (The WestburyLab Wikipedia corpus)

Citation: Shaoul, C. & Westbury, C. (2010). Word Vectors from the 2010 Westbury Lab Wikipedia corpus. Edmonton, AB: University of Alberta (downloaded from http://www.psych.ualberta.ca/~westburylab/downloads/HiDEx.vectorset.download.html)

Acknowledgments: This work would not have been possible without the hardware and software provided by the TaPoR project. This research is also supported by NSERC.

If you have any questions about this data, please contact Cyrus Shaoul.

PLEASE NOTE:
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 Unported License.

Please fill out this form so that we can keep track of who has downloaded this file.

Your Full Name:
Your Email Address:
Your Organization:
What do you intend to use the data for?
Comments/Questions:

IMPORTANT: Once you click on the above button, you will be able to choose BitTorrent for your download. Please use BitTorrent if you are on the regular Internet (that is, NOT on Internet2). Also, please leave your BitTorrent client running after the download is complete to help others obtain the data. Some people have left BitTorrent running for weeks without any ill effects! :-)


 


©2005,2006,2007,2008,2009,2010   WestburyLab   chrisw at ualberta dot ca