| Dataset | Words in training documents | Distinct words | Vocabulary size |
|---|---:|---:|---:|
| A1 | 25,771 | 3,252 | 900 |
| A2 | 128,845 | 48,012 | 1,500 |
| A3 | 25,771 | 3,252 | 900 |
| A4 | 128,845 | 48,012 | 1,500 |
| C1 | 96,052 | 26,654 | 2,400 |
| C2 | 480,250 | 133,256 | 4,000 |
| C3 | 96,052 | 26,654 | 2,400 |
| C4 | 480,250 | 133,256 | 4,000 |
| I1 | 2,353,267 | 137,315 | 4,200 |
| I2 | 11,766,325 | 7,839,471 | 7,000 |
| I3 | 2,353,267 | 137,315 | 4,200 |
| I4 | 11,766,325 | 7,839,471 | 7,000 |