ChemTok: A New Rule Based Tokenizer for Chemical Named Entity Recognition
Table 5
Features used for training classifiers.
| Feature set | Actual features in the feature set | Number of features used in set |
| --- | --- | --- |
| Space features | Has right space, has left space, and has both right and left space | 3 |
| Context words | One token before and one token after the current token | 2 |
| n-gram affixes | n-gram affixes (prefixes + suffixes) for n = 1 to 4 for each token | 8 |
| Word shapes | Word shape (number of uppercase letters, lowercase letters, digits, punctuation marks, and Greek letters), digital word shape (word shape in digital format), and summarized word shape (combination of the two aforementioned features) | 3 |
| Orthographic features | All uppercase, has slash, has punctuation, has real number, starts with digit, starts with uppercase, has more than 2 uppercase letters | 7 |
| Token length | Number of characters in the token | 1 |
| Common chemical prefixes and suffixes | Contains chemical affixes from the list of chemical affixes in [15] | |
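Several of the feature sets in Table 5 can be computed directly from a token and its character offsets. The following Python sketch is illustrative only and is not the authors' implementation. It covers the space, context-word, n-gram affix, orthographic, and token-length sets. The function name `token_features`, the feature names, the `<BOS>`/`<EOS>` boundary placeholders, the empty-string value for affixes longer than the token, and the regular expression used for "has real number" are all assumptions. Word shapes and the chemical affix list from [15] are left out because their exact encodings are not given here.

```python
import re
import string

# Hypothetical boundary placeholders; the source does not say how the
# first and last tokens are handled.
BOS, EOS = "<BOS>", "<EOS>"


def token_features(text, spans, i, max_n=4):
    """Return a feature dict for the i-th token.

    `spans` is a list of (start, end) character offsets into `text`.
    """
    start, end = spans[i]
    token = text[start:end]
    feats = {}

    # Space features (3): whitespace immediately around the token.
    left = start > 0 and text[start - 1].isspace()
    right = end < len(text) and text[end].isspace()
    feats["has_left_space"] = left
    feats["has_right_space"] = right
    feats["has_both_spaces"] = left and right

    # Context words (2): one token before and one token after.
    feats["prev_token"] = text[slice(*spans[i - 1])] if i > 0 else BOS
    feats["next_token"] = (
        text[slice(*spans[i + 1])] if i + 1 < len(spans) else EOS
    )

    # n-gram affixes (8): prefixes and suffixes for n = 1 to 4.
    # Assumption: an affix longer than the token is encoded as "".
    for n in range(1, max_n + 1):
        feats[f"prefix_{n}"] = token[:n] if len(token) >= n else ""
        feats[f"suffix_{n}"] = token[-n:] if len(token) >= n else ""

    # Orthographic features (7).
    feats["all_uppercase"] = token.isupper()
    feats["has_slash"] = "/" in token
    feats["has_punctuation"] = any(c in string.punctuation for c in token)
    # Assumption: a "real number" is digits, a period, then digits.
    feats["has_real_number"] = bool(re.search(r"\d+\.\d+", token))
    feats["starts_with_digit"] = token[:1].isdigit()
    feats["starts_with_uppercase"] = token[:1].isupper()
    feats["more_than_2_uppercase"] = sum(c.isupper() for c in token) > 2

    # Token length (1).
    feats["length"] = len(token)
    return feats


# Usage: offsets would normally come from the tokenizer; these are
# hand-written for illustration.
text = "Add 2.5 mg of NaCl/KCl."
spans = [(0, 3), (4, 7), (8, 10), (11, 13),
         (14, 18), (18, 19), (19, 22), (22, 23)]
feats = token_features(text, spans, 4)  # features for the token "NaCl"
```

Keeping character offsets alongside each token is what makes the space features possible, since whitespace is normally discarded once the text has been tokenized.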