Research Article

ChemTok: A New Rule Based Tokenizer for Chemical Named Entity Recognition

Table 1

Rules used in Step of the algorithm.

| Rule number | Rule explanation | Example: tokens after Step | Example: merged token |
| --- | --- | --- | --- |
|  | Numeric tokens which are separated by “.” or “,” or “/” or “-” or “_” are integrated into a single token. | “125” “,” “12” “,” “12” | 125,12,12 |
|  | If concatenated tokens from Rule are surrounded by balanced containers such as parentheses, braces, and brackets, both container tokens are conjoined into the token. | “(” “1-3” “)” | (1-3) |
|  | Single uppercase tokens which are followed by a sequence of lowercase letters as the next token are recombined into a single token. | “C” “ommon” | Common |
|  | If the concatenation of consecutive tokens is found in the list of known chemical names, they are merged into one token. | “Na” “CL” | NaCL |
|  | Apply the plurality rule to the tokens. | “Acids” | Acids |
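The merging rules in Table 1 can be illustrated with a minimal Python sketch. This is not the authors' implementation: all function names, the `max_span` limit, and the one-entry `KNOWN_CHEMICALS` set (a stand-in for the list of known chemical names) are assumptions made for illustration, and the rules are assumed to be applied in the order in which the table lists them. The plurality rule is left out because the table does not spell out its details.

```python
import re

SEPARATORS = {".", ",", "/", "-", "_"}
CONTAINERS = {"(": ")", "[": "]", "{": "}"}
# Hypothetical stand-in for the list of known chemical names.
KNOWN_CHEMICALS = {"NaCL"}
# A token that already looks like a merged numeric run, e.g. "1-3".
NUMERIC_RUN = re.compile(r"\d+(?:[.,/_-]\d+)+")


def merge_numeric(tokens):
    """Join runs of number, separator, number, ... into one token."""
    out, i = [], 0
    while i < len(tokens):
        merged = tokens[i]
        if merged.isdigit():
            # Extend only while a separator is followed by another number.
            while (i + 2 < len(tokens)
                   and tokens[i + 1] in SEPARATORS
                   and tokens[i + 2].isdigit()):
                merged += tokens[i + 1] + tokens[i + 2]
                i += 2
        out.append(merged)
        i += 1
    return out


def merge_containers(tokens):
    """Attach a matching pair of containers to the numeric run they enclose."""
    out, i = [], 0
    while i < len(tokens):
        if (i + 2 < len(tokens)
                and CONTAINERS.get(tokens[i]) == tokens[i + 2]
                and NUMERIC_RUN.fullmatch(tokens[i + 1])):
            out.append(tokens[i] + tokens[i + 1] + tokens[i + 2])
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return out


def merge_capital(tokens):
    """Rejoin a single uppercase letter with a following lowercase token."""
    out, i = [], 0
    while i < len(tokens):
        if (i + 1 < len(tokens)
                and len(tokens[i]) == 1 and tokens[i].isupper()
                and tokens[i + 1].isalpha() and tokens[i + 1].islower()):
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def merge_known(tokens, lexicon=KNOWN_CHEMICALS, max_span=4):
    """Merge consecutive tokens whose concatenation is in the lexicon."""
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first (max_span is an assumption).
        for n in range(min(max_span, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + n])
            if candidate in lexicon:
                out.append(candidate)
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out


def apply_rules(tokens):
    """Apply the four sketched rules in table order."""
    for rule in (merge_numeric, merge_containers, merge_capital, merge_known):
        tokens = rule(tokens)
    return tokens
```

Calling `apply_rules(["(", "1", "-", "3", ")"])` first joins the numeric run into `1-3` and then attaches the surrounding parentheses, which mirrors how the second rule builds on the output of the first. A real lexicon lookup would likely need case normalization and a larger span limit; both are simplified here.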