Research Article

A Systematic Comparison of Data Selection Criteria for SMT Domain Adaptation

Table 2. Corpora statistics.

Data Set      Lang.  Sentences      Tokens  Av. len.
Test set      EN         2,050      60,399     29.46
              ZH                    59,628     29.09
Dev. set      EN         2,000      59,732     29.26
              ZH         2,000      59,064     29.07
In-domain     EN        43,621   1,330,464     29.16
              ZH                 1,321,655     28.97
Training set  EN     1,138,044  28,626,367     25.15
              ZH                28,239,747     24.81