Research Article
A Systematic Comparison of Data Selection Criteria for SMT Domain Adaptation
Table 2
Corpora statistics.
| Data set | Language | Sentences | Tokens | Avg. length (tokens/sentence) |
|---|---|---|---|---|
| Test set | EN | 2,050 | 60,399 | 29.46 |
| Test set | ZH | 2,050 | 59,628 | 29.09 |
| Dev. set | EN | 2,000 | 59,732 | 29.26 |
| Dev. set | ZH | 2,000 | 59,064 | 29.07 |
| In-domain | EN | 43,621 | 1,330,464 | 29.16 |
| In-domain | ZH | 43,621 | 1,321,655 | 28.97 |
| Training set | EN | 1,138,044 | 28,626,367 | 25.15 |
| Training set | ZH | 1,138,044 | 28,239,747 | 24.81 |