Research Article

iSentenizer-μ: Multilingual Sentence Boundary Detection Model

Table 4

Size of the Brown, WSJ, and Tycho Brahe corpora.

| Corpus | Sentences (training data) | Sentences (test data) | Tokens |
|---|---|---|---|
| WSJ corpus | 41,977 | 4,671 | 1,153,993 |
| Brown corpus | 51,599 | 5,801 | 1,155,242 |
| Tycho Brahe corpus | 38,000 | 5,102 | 953,080 |