ChemTok: A New Rule Based Tokenizer for Chemical Named Entity Recognition
Table 1
Rules used in Step of the algorithm.
| Rule number | Rule explanation | Example: tokens after Step | Example: merged token |
|---|---|---|---|
| 1 | Numeric tokens separated by “.”, “,”, “/”, “-”, or “_” are merged into a single token. | 125 , 12 , 12 | 125,12,12 |
| 2 | If a token concatenated by Rule 1 is surrounded by a balanced pair of containers (parentheses, braces, or brackets), both container tokens are joined to it. | ( 1-3 ) | (1-3) |
| 3 | A single uppercase-letter token that is immediately followed by a token consisting of lowercase letters is recombined with it into a single token. | C ommon | Common |
| 4 | If the concatenation of consecutive tokens is found in the list of known chemical names, the tokens are merged into one token. | | |
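The four merging rules in Table 1 can be sketched in Python as follows. This is an illustrative sketch only, not the authors' implementation. It assumes the rules are applied in table order to a list of already-split tokens. The function names, the `known_names` set, and the `max_span` limit are hypothetical, and recognizing "tokens concatenated by the first rule" with a regular expression is a simplifying assumption.

```python
import re

SEPARATORS = {".", ",", "/", "-", "_"}
CONTAINERS = {"(": ")", "[": "]", "{": "}"}
# Assumption: a token produced by the first rule looks like digits joined by separators.
MERGED_NUMERIC = re.compile(r"\d+(?:[.,/_-]\d+)+")


def merge_numeric(tokens):
    """First rule: join numeric tokens separated by one of SEPARATORS."""
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.isdigit():
            # Absorb every following "<separator> <number>" pair.
            while (i + 2 < len(tokens)
                   and tokens[i + 1] in SEPARATORS
                   and tokens[i + 2].isdigit()):
                tok += tokens[i + 1] + tokens[i + 2]
                i += 2
        out.append(tok)
        i += 1
    return out


def merge_containers(tokens):
    """Second rule: attach a balanced container pair around a merged numeric token."""
    out, i = [], 0
    while i < len(tokens):
        if (i + 2 < len(tokens)
                and CONTAINERS.get(tokens[i]) == tokens[i + 2]
                and MERGED_NUMERIC.fullmatch(tokens[i + 1])):
            out.append(tokens[i] + tokens[i + 1] + tokens[i + 2])
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return out


def merge_capital(tokens):
    """Third rule: rejoin a single uppercase letter with a following lowercase token."""
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if (len(tok) == 1 and tok.isupper() and i + 1 < len(tokens)
                and tokens[i + 1].isalpha() and tokens[i + 1].islower()):
            tok += tokens[i + 1]
            i += 1
        out.append(tok)
        i += 1
    return out


def merge_known_names(tokens, known_names, max_span=4):
    """Fourth rule: merge consecutive tokens whose concatenation is a known name.

    Tries the longest span first; max_span is a hypothetical cap on span length.
    """
    out, i = [], 0
    while i < len(tokens):
        end = i + 1  # default: keep the single token
        for j in range(min(len(tokens), i + max_span), i + 1, -1):
            if "".join(tokens[i:j]) in known_names:
                end = j
                break
        out.append("".join(tokens[i:end]))
        i = end
    return out


def apply_rules(tokens, known_names=frozenset()):
    """Apply the four rules in table order (the order is an assumption)."""
    tokens = merge_numeric(tokens)
    tokens = merge_containers(tokens)
    tokens = merge_capital(tokens)
    return merge_known_names(tokens, known_names)
```

With these definitions, `apply_rules(["(", "1", "-", "3", ")"])` yields `["(1-3)"]`, matching the second example in the table. For the fourth rule, a made-up lexicon such as `{"2-propanol"}` would turn `["2", "-", "propanol"]` into `["2-propanol"]`; the actual list of chemical names used by ChemTok is not specified in this table.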