Research Article

Learning to Translate: A Statistical and Computational Analysis

Table 1

Number of sentences, total words, and distinct words in the training, development, and test sets.

Corpus       Language  Statistic         Training     Development  Test

Europarl               No. of sentences  1,259,914    2,000        2,000
             English   Total words       35,284,052   58,762       59,147
                       Distinct words    124,080      6,551        6,429
             Spanish   Total words       36,695,628   60,536       61,160
                       Distinct words    164,920      8,182        8,239

UN corpus              No. of sentences  4,968,857    1,000        10,000
             English   Total words       146,980,344  29,545       295,085
                       Distinct words    485,494      5,210        17,105
             Chinese   Total words       138,045,740  27,764       278,4256
                       Distinct words    530,295      4,353        13,193

Giga corpus            No. of sentences  22,515,400   2,000        3,000
             English   Total words       636,113,866  51,549       90,474
                       Distinct words    2,603,907    8,691        12,580
             French    Total words       772,104,558  62,682       109,197
                       Distinct words    2,512,286    10,124       14,614
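
For readers who wish to compute statistics of the kind reported in Table 1 on their own data, the following is a minimal sketch. It assumes one sentence per line and whitespace tokenization with no case folding. The tokenization actually used for the corpora is not stated in this table, so the function name `corpus_stats` and these preprocessing choices are illustrative assumptions, and the resulting counts may differ from the published ones.

```python
from collections import Counter
from typing import Dict, Iterable


def corpus_stats(sentences: Iterable[str]) -> Dict[str, int]:
    """Count sentences, total words (tokens), and distinct words (types)."""
    vocab: Counter = Counter()
    n_sentences = 0
    for sentence in sentences:
        n_sentences += 1
        # Assumption: tokens are whitespace-separated; case is preserved.
        vocab.update(sentence.split())
    return {
        "sentences": n_sentences,
        "total_words": sum(vocab.values()),
        "distinct_words": len(vocab),
    }


# Toy two-sentence corpus (hypothetical data, not drawn from any corpus above).
toy = ["the cat sat on the mat", "the dog sat"]
print(corpus_stats(toy))
# {'sentences': 2, 'total_words': 9, 'distinct_words': 6}
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, so a large training set can be processed line by line without loading it into memory.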