Research Article

ChemTok: A New Rule Based Tokenizer for Chemical Named Entity Recognition

Table 5

Features used for training classifiers.

| Feature set | Actual features in the feature set | Number of features used in set |
| --- | --- | --- |
| Space features | Has right space, has left space, and has both right and left space | 3 |
| Context words | One token before and one token after the current token | 2 |
| n-gram affixes | n-gram affixes (prefixes + suffixes) for n = 1 to 4 for each token | 8 |
| Word shapes | Word shape (number of uppercase letters, lowercase letters, digits, punctuation marks, and Greek letters), digital word shape (word shape in digital format), and summarized word shape (combination of the two aforementioned features) | 3 |
| Orthographic features | All uppercase, has slash, has punctuation, has real number, starts with digit, starts with uppercase, has more than 2 uppercase letters | 7 |
| Token length | Number of characters in the token | 1 |
| Common chemical prefixes and suffixes | Contains chemical affixes from the list of chemical affixes in [15] | 1 |
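
As an illustration of how several of the feature sets in Table 5 could be computed for a single token, the following Python sketch covers the space features, n-gram affixes, character-class counts, orthographic features, and token length. It is not the authors' implementation: the function names, the feature keys, the empty-string placeholder for tokens shorter than n, the regular expression used for "has real number", and the Unicode-name test for Greek letters are all assumptions made for this example. Context words, the digital and summarized word shapes, and the chemical affix list from [15] are omitted because their exact definitions are not given in the table.

```python
import re
import string
import unicodedata


def space_features(text, start, end):
    """Whitespace directly left/right of the token text[start:end]."""
    left = start > 0 and text[start - 1].isspace()
    right = end < len(text) and text[end].isspace()
    return {
        "has_left_space": left,
        "has_right_space": right,
        "has_both_spaces": left and right,
    }


def affixes(token, max_n=4):
    """Prefixes and suffixes of length 1..max_n (8 features when max_n=4)."""
    feats = {}
    for n in range(1, max_n + 1):
        # Assumption: tokens shorter than n get an empty-string placeholder.
        feats[f"prefix_{n}"] = token[:n] if len(token) >= n else ""
        feats[f"suffix_{n}"] = token[-n:] if len(token) >= n else ""
    return feats


def word_shape_counts(token):
    """Count characters per class; Greek is tested first (assumed ordering)."""
    counts = {"upper": 0, "lower": 0, "digit": 0, "punctuation": 0, "greek": 0}
    for ch in token:
        if unicodedata.name(ch, "").startswith("GREEK"):
            counts["greek"] += 1
        elif ch.isupper():
            counts["upper"] += 1
        elif ch.islower():
            counts["lower"] += 1
        elif ch.isdigit():
            counts["digit"] += 1
        elif ch in string.punctuation:
            counts["punctuation"] += 1
    return counts


def orthographic_features(token):
    """The seven orthographic flags listed in Table 5."""
    n_upper = sum(ch.isupper() for ch in token)
    return {
        # str.isupper() ignores digits, so "H2O" counts as all uppercase.
        "all_uppercase": token.isupper(),
        "has_slash": "/" in token,
        "has_punctuation": any(ch in string.punctuation for ch in token),
        # Assumption: a "real number" is digits, a dot, then digits.
        "has_real_number": re.search(r"\d+\.\d+", token) is not None,
        "starts_with_digit": token[:1].isdigit(),
        "starts_with_uppercase": token[:1].isupper(),
        "more_than_2_uppercase": n_upper > 2,
    }


def extract_features(text, start, end):
    """Merge the per-token feature sets into one flat dictionary."""
    token = text[start:end]
    feats = {"token_length": len(token)}
    feats.update(space_features(text, start, end))
    feats.update(affixes(token))
    feats.update(orthographic_features(token))
    return feats
```

For example, `extract_features("0.9% NaCl/water", 5, 9)` describes the token `NaCl`, which has a space on its left and a slash (not a space) on its right. Returning a flat dictionary keeps the sketch compatible with common dictionary-based feature vectorizers, though any equivalent encoding would serve.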