BioMed Research International
Volume 2016, Article ID 4248026, 9 pages
http://dx.doi.org/10.1155/2016/4248026
Research Article

ChemTok: A New Rule Based Tokenizer for Chemical Named Entity Recognition

¹Computer Engineering Department, Eastern Mediterranean University, Famagusta, Northern Cyprus, Mersin 10, Turkey
²Information Technology Department, Eastern Mediterranean University, Famagusta, Northern Cyprus, Mersin 10, Turkey

Received 21 November 2015; Revised 10 December 2015; Accepted 10 December 2015

Academic Editor: Yudong Cai

Copyright © 2016 Abbas Akkasi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Linked References

  1. PubChem, “The PubChem Project,” http://pubchem.ncbi.nlm.nih.gov.
  2. N. Chinchor and P. Robinson, “MUC-7 named entity task definition,” in Proceedings of the 7th Conference on Message Understanding, p. 29, New York, NY, USA, September 1997.
  3. E. F. T. K. Sang and F. De Meulder, “Introduction to the CoNLL-2003 shared task: language-independent named entity recognition,” in Proceedings of the 7th Conference on Natural Language Learning at HLT-NAACL 2003 (CONLL '03), vol. 4, pp. 142–147, Association for Computational Linguistics, 2003.
  4. M. Vazquez, M. Krallinger, F. Leitner, and A. Valencia, “Text mining for drugs and chemical compounds: methods, tools and applications,” Molecular Informatics, vol. 30, no. 6-7, pp. 506–519, 2011.
  5. J. Jiang and C. Zhai, “An empirical study of tokenization strategies for biomedical information retrieval,” Information Retrieval, vol. 10, no. 4-5, pp. 341–363, 2007.
  6. R. Arens, “A preliminary look into the use of named entity information for bioscience text tokenization,” in Proceedings of the Student Research Workshop at HLT-NAACL (HLT-SRWS '04), pp. 37–42, Association for Computational Linguistics, Boston, Mass, USA, May 2004.
  7. N. A. Bennett, Q. He, K. Powell, and B. R. Schatz, “Extracting noun phrases for all of MEDLINE,” in Proceedings of the AMIA Symposium, pp. 671–675, American Medical Informatics Association, Washington, DC, USA, 1999.
  8. K. Seki and J. Mostafa, “An approach to protein name extraction using heuristics and a dictionary,” Proceedings of the American Society for Information Science and Technology, vol. 40, no. 1, pp. 71–77, 2003.
  9. M. Kayaalp, A. R. Aronson, S. M. Humphrey et al., “Methods for accurate retrieval of MEDLINE citations in functional genomics,” in Proceedings of the Notebook of Text REtrieval Conference (TREC '03), vol. 2003, pp. 175–184, 2003.
  10. N. Barrett and J. Weber-Jahnke, “Building a biomedical tokenizer using the token lattice design pattern and the adapted Viterbi algorithm,” BMC Bioinformatics, vol. 12, supplement 3, article S1, 2011.
  11. R. Leaman, C.-H. Wei, and Z. Lu, “tmChem: a high performance approach for chemical named entity recognition and normalization,” Journal of Cheminformatics, vol. 7, supplement 1, article S3, 2015.
  12. ChemSpot Tool, WBI, 2013, https://www.informatik.hu-berlin.de/de/forschung/gebiete/wbi/resources/chemspot/chemspot.
  13. T. Rocktäschel, M. Weidlich, and U. Leser, “Chemspot: a hybrid system for chemical named entity recognition,” Bioinformatics, vol. 28, no. 12, Article ID bts183, pp. 1633–1640, 2012.
  14. C.-H. Wei, B. R. Harris, H.-Y. Kao, and Z. Lu, “TmVar: a text mining approach for extracting sequence variants in biomedical literature,” Bioinformatics, vol. 29, no. 11, pp. 1433–1439, 2013.
  15. Chemical Affixes, Affixes: The building block of English, http://www.affixes.org/themes/index.html.
  16. Y. He and M. Kayaalp, A Comparison of 13 Tokenizers on MEDLINE, The Lister Hill National Center for Biomedical Communications, Bethesda, Md, USA, 2006.
  17. B. Habert, G. Adda, M. Adda-Decker et al., “Towards tokenization evaluation,” in Proceedings of the International Conference on Language Resources and Evaluation (LREC '98), pp. 427–431, Granada, Spain, May 1998.
  18. C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
  19. J. Lafferty, A. McCallum, and F. C. Pereira, “Conditional random fields: probabilistic models for segmenting and labeling sequence data,” in Proceedings of the 18th International Conference on Machine Learning (ICML '01), pp. 282–289, Williamstown, Mass, USA, June 2001.
  20. M. Krallinger, O. Rabal, F. Leitner et al., “The CHEMDNER corpus of chemicals and drugs and its annotation principles,” Journal of Cheminformatics, vol. 7, supplement 1, article S2, 2015.
  21. Sem Eval Data Set, “DDIExtraction 2013: extraction of drug-drug Interactions from biomedical texts,” 2013, http://www.mavir.net/conf/137-ddiextraction2013.
  22. I. Segura Bedmar, P. Martínez, and M. Herrero Zazo, Semeval-2013 task 9: extraction of drug-drug interactions from biomedical texts (DDIextraction 2013), Association for Computational Linguistics, 2013.
  23. J. J. Webster and C. Kit, “Tokenization as the initial phase in NLP,” in Proceedings of the 14th Conference on Computational Linguistics—Volume 4 (COLING '92), pp. 1106–1110, Association for Computational Linguistics, Nantes, France, August 1992.
  24. M. A. Attia, “Arabic tokenization system,” in Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources, pp. 65–72, Association for Computational Linguistics, June 2007.
  25. S. Ramanan and P. S. Nathan, “Adapting cocoa, a multi-class entity detector, for the chemdner task of biocreative IV,” in Proceedings of the BioCreative Challenge Evaluation Workshop, vol. 2, p. 60, Washington, DC, USA, October 2013.
  26. M. Krallinger, F. Leitner, O. Rabal, M. Vazquez, J. Oyarzabal, and A. Valencia, “CHEMDNER: the drugs and chemical names extraction challenge,” Journal of Cheminformatics, vol. 7, supplement 1, article S1, 2015.
  27. Y. Mi, J. Zhao, and S. S. Feng, “Targeted co-delivery of docetaxel, cisplatin and herceptin by vitamin E TPGS-cisplatin prodrug nanoparticles for multimodality treatment of cancer,” Journal of Controlled Release, vol. 169, no. 3, pp. 185–192, 2013.
  28. R. N. Mshana, G. Tadesse, G. Abate, and H. Miörner, “Use of 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide for rapid detection of rifampin-resistant Mycobacterium tuberculosis,” Journal of Clinical Microbiology, vol. 36, no. 5, pp. 1214–1219, 1998.
  29. E. F. T. K. Sang and J. Veenstra, “Representing text chunks,” in Proceedings of the 9th Conference on European Chapter of the Association for Computational Linguistics (EACL '99), pp. 173–179, Association for Computational Linguistics, 1999.
  30. Amino Acids, “Twenty Amino Acids,” http://www.cryst.bbk.ac.uk/education/AminoAcid/the_twenty.html.
  31. Periodic table of elements, Periodic table of elements: LANL, http://periodic.lanl.gov/downloads.shtml.
  32. T. Kudo, Yamcha: Yet Another Multipurpose Chunk Annotator, 2005, http://chasen.org/~taku/software/yamcha.
  33. A. K. McCallum, “MALLET: A Machine Learning for Language Toolkit,” 2002, http://mallet.cs.umass.edu/.
  34. M. Herrero-Zazo, I. Segura-Bedmar, P. Martínez, and T. Declerck, “The DDI corpus: an annotated corpus with pharmacological substances and drug-drug interactions,” Journal of Biomedical Informatics, vol. 46, no. 5, pp. 914–920, 2013.