Research Article

ChemTok: A New Rule Based Tokenizer for Chemical Named Entity Recognition

Table 4

Comparison of number of tokens, average token length, and number of incorrectly segmented entities for various tokenizers.

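| Data set | Split | ChemSpot NT | ChemSpot ATL | ChemSpot NISE | tmVar NT | tmVar ATL | tmVar NISE | ChemTok NT | ChemTok ATL | ChemTok NISE | White space NT | White space ATL | White space NISE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CHEMDNER | Train | 907405 | 4.62 | 40 | 965056 | 4.35 | 11 | 899343 | 4.66 | 6 | 718244 | 5.84 | 9189 |
| CHEMDNER | Development | 901610 | 4.64 | 36 | 958475 | 4.36 | 11 | 893180 | 4.68 | 3 | 714287 | 5.85 | 9174 |
| CHEMDNER | Test | 779700 | 4.63 | 8 | 828001 | 4.36 | 3 | 772847 | 4.67 | 3 | 513630 | 5.85 | 7804 |
| DrugBank | Train | 127435 | 5.06 | 50 | 135625 | 4.76 | 48 | 126753 | 5.09 | 6 | 107409 | 6.00 | 4623 |
| DrugBank | Test | 3189 | 5.12 | 1 | 3407 | 4.79 | 1 | 3174 | 5.14 | 0 | 2665 | 6.12 | 116 |
| Medline | Train | 32625 | 4.77 | 2 | 34178 | 4.55 | 2 | 32259 | 4.82 | 1 | 27066 | 5.75 | 431 |
| Medline | Test | 12978 | 4.85 | 0 | 13673 | 4.61 | 0 | 12875 | 4.89 | 0 | 10839 | 5.11 | 96 |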

NT: number of tokens, ATL: average token length, and NISE: number of incorrectly segmented entities.
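
As a rough illustration of how the three quantities in the legend could be computed for one tokenizer, the following Python sketch may help. It rests on assumptions that this table does not state: token length is measured in characters, entities are given as character offsets, and an entity counts as incorrectly segmented when its start or end does not coincide with a token boundary. The names `whitespace_spans` and `table_metrics` are hypothetical and are not taken from ChemTok or any of the compared tools.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def whitespace_spans(text: str) -> List[Span]:
    """Token spans produced by a plain white space split (hypothetical helper)."""
    return [m.span() for m in re.finditer(r"\S+", text)]


def table_metrics(
    text: str,
    entities: List[Span],
    tokenize: Callable[[str], List[Span]] = whitespace_spans,
) -> Tuple[int, float, int]:
    """Return (NT, ATL, NISE) for one text under the assumptions stated above."""
    spans = tokenize(text)
    nt = len(spans)
    # ATL: mean token length in characters (assumed unit).
    atl = sum(end - start for start, end in spans) / nt if nt else 0.0
    starts = {start for start, _ in spans}
    ends = {end for _, end in spans}
    # Assumed NISE criterion: an entity boundary falls inside a token.
    nise = sum(1 for s, e in entities if s not in starts or e not in ends)
    return nt, atl, nise


text = "aspirin-induced asthma and NaCl."
entities = [(0, 7), (27, 31)]  # "aspirin" and "NaCl"
print(table_metrics(text, entities))  # (4, 7.25, 2)
```

In this toy sentence the white space split leaves "aspirin" fused with "-induced" and "NaCl" fused with the final period, so both entities are counted as incorrectly segmented. This matches the pattern in the table, where the white space tokenizer yields the fewest, longest tokens and by far the highest NISE.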