Research Article
ChemTok: A New Rule Based Tokenizer for Chemical Named Entity Recognition
Table 4
Comparison of number of tokens, average token length, and number of incorrectly segmented entities for various tokenizers.
| Data set | ChemSpot NT | ChemSpot ATL | ChemSpot NISE | tmVar NT | tmVar ATL | tmVar NISE | ChemTok NT | ChemTok ATL | ChemTok NISE | White space Tokenizer NT | White space Tokenizer ATL | White space Tokenizer NISE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Chem DNER: Train | 907405 | 4.62 | 40 | 965056 | 4.35 | 11 | 899343 | 4.66 | 6 | 718244 | 5.84 | 9189 |
| Chem DNER: Development | 901610 | 4.64 | 36 | 958475 | 4.36 | 11 | 893180 | 4.68 | 3 | 714287 | 5.85 | 9174 |
| Chem DNER: Test | 779700 | 4.63 | 8 | 828001 | 4.36 | 3 | 772847 | 4.67 | 3 | 513630 | 5.85 | 7804 |
| DrugBank: Train | 127435 | 5.06 | 50 | 135625 | 4.76 | 48 | 126753 | 5.09 | 6 | 107409 | 6.00 | 4623 |
| DrugBank: Test | 3189 | 5.12 | 1 | 3407 | 4.79 | 1 | 3174 | 5.14 | 0 | 2665 | 6.12 | 116 |
| Medline: Train | 32625 | 4.77 | 2 | 34178 | 4.55 | 2 | 32259 | 4.82 | 1 | 27066 | 5.75 | 431 |
| Medline: Test | 12978 | 4.85 | 0 | 13673 | 4.61 | 0 | 12875 | 4.89 | 0 | 10839 | 5.11 | 96 |

NT: number of tokens, ATL: average token length, and NISE: number of incorrectly segmented entities.
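
To make the three measures concrete, the following is a minimal Python sketch of how NT, ATL, and NISE could be computed for a plain whitespace tokenizer. It is not the evaluation code used for Table 4. The function names `whitespace_tokenize` and `tokenizer_stats`, the example sentence, and the character offsets are hypothetical, and the sketch assumes that an entity counts as incorrectly segmented when its start or end offset does not coincide with a token boundary.

```python
import re


def whitespace_tokenize(text):
    """Return (token, start, end) triples for whitespace-delimited tokens."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def tokenizer_stats(text, entities, tokenize=whitespace_tokenize):
    """Compute (NT, ATL, NISE) for one text.

    entities: list of (start, end) character offsets of annotated entities.
    Assumption: an entity is incorrectly segmented if its start or end
    offset does not fall on a token boundary.
    """
    tokens = tokenize(text)
    nt = len(tokens)
    # Average token length in characters; 0.0 for empty input.
    atl = sum(end - start for _, start, end in tokens) / nt if nt else 0.0
    starts = {start for _, start, _ in tokens}
    ends = {end for _, _, end in tokens}
    nise = sum(1 for s, e in entities if s not in starts or e not in ends)
    return nt, atl, nise


# Hypothetical example: the entity "aspirin" is glued to "(100".
text = "Treatment with aspirin(100 mg) reduced fever."
entities = [(15, 22)]  # character span of "aspirin"
nt, atl, nise = tokenizer_stats(text, entities)
print(nt, round(atl, 2), nise)
```

In this example the whitespace tokenizer produces the token `aspirin(100`, so the end of the entity does not align with a token boundary and the entity is counted once in NISE. This is the kind of error that inflates the white space tokenizer's NISE column in Table 4.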