Google to release N-gram dataset
California, USA (Google): The Google Machine Translation Team has processed 1,011,582,453,213 words of running text and is publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. The data contains 13,653,070 unique words, after discarding words that appear fewer than 200 times. Google Research has been using word n-gram models for a variety of R&D projects, including statistical machine translation, speech recognition, spelling correction, entity detection, and information extraction.
For more information, please visit:
googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
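
For readers unfamiliar with n-gram counts, the following is a minimal Python sketch of the general idea: count every contiguous word sequence of a given length, then keep only those that reach a minimum frequency. It is not Google's pipeline. The function names, the whitespace tokenization, the toy sentence, and the small thresholds are all assumptions chosen for illustration; the announcement above uses thresholds of 40 for five-word sequences and 200 for single words.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every contiguous n-token sequence in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def filter_by_min_count(counts, min_count):
    """Keep only the entries that occur at least min_count times."""
    return {key: c for key, c in counts.items() if c >= min_count}


# Toy corpus, split on whitespace (an assumption; real tokenization is more involved).
text = "the cat sat on the mat and the cat sat on the rug"
tokens = text.split()

# Vocabulary: words seen at least twice (toy stand-in for the 200-occurrence cutoff).
vocab = filter_by_min_count(Counter(tokens), 2)

# Two-word sequences seen at least twice (toy stand-in for five-word sequences
# seen at least 40 times).
bigrams = filter_by_min_count(count_ngrams(tokens, 2), 2)

for ngram, count in sorted(bigrams.items()):
    print(" ".join(ngram), count)
```

At the scale described in the announcement, the counts would not fit in a single in-memory dictionary, so a real system would need a distributed approach; this sketch only shows what is being counted.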