Advances in Bioinformatics
Volume 2012, Article ID 391574, 17 pages
http://dx.doi.org/10.1155/2012/391574
Review Article

Applications of Natural Language Processing in Biodiversity Science

1Center for Library and Informatics, Marine Biological Laboratory, 7 MBL Street, Woods Hole, MA 02543, USA
2School of Information Resources and Library Science, University of Arizona, Tucson, AZ 85719, USA

Received 4 November 2011; Accepted 15 February 2012

Academic Editor: Jörg Hakenberg

Copyright © 2012 Anne E. Thessen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Linked References

  1. B. Wuethrich, “How climate change alters rhythms of the wild,” Science, vol. 287, no. 5454, pp. 793–795, 2000.
  2. W. E. Bradshaw and C. M. Holzapfel, “Genetic shift in photoperiodic response correlated with global warming,” Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 25, pp. 14509–14511, 2001.
  3. National Academy of Sciences, “New biology for the 21st Century,” Frontiers in Ecology and the Environment, vol. 7, no. 9, article 455, 2009.
  4. A. E. Thessen and D. J. Patterson, “Data issues in life science,” ZooKeys, vol. 150, pp. 15–51, 2011.
  5. A. Hey, The Fourth Paradigm: Data-Intensive Scientific Discovery, 2009, http://iw.fh-potsdam.de/fileadmin/FB5/Dokumente/forschung/tagungen/i-science/TonyHey_-__eScience_Potsdam__Mar2010____complete_.pdf.
  6. L. D. Stein, “Towards a cyberinfrastructure for the biological sciences: progress, visions and challenges,” Nature Reviews Genetics, vol. 9, pp. 678–688, 2008.
  7. P. B. Heidorn, “Shedding light on the dark data in the long tail of science,” Library Trends, vol. 57, no. 2, pp. 280–299, 2008.
  8. Key Perspectives Ltd, “Data dimensions: disciplinary differences in research data sharing, reuse and long term viability,” Digital Curation Centre, 2010, http://scholar.google.com/scholar?hl=en&q=Data+Dimensions:+disciplinary+differences+in+research+data-sharing,+reuse+and+long+term+viability.++&btnG=Search&as_sdt=0,22&as_ylo=&as_vis=0#0.
  9. A. Vollmar, J. A. Macklin, and L. Ford, “Natural history specimen digitization: challenges and concerns,” Biodiversity Informatics, vol. 7, no. 2, 2010.
  10. P. N. Schofield, J. Eppig, E. Huala et al., “Sustaining the data and bioresource commons,” Science, vol. 330, no. 6004, pp. 592–593, 2010.
  11. P. Groth, A. Gibson, and J. Velterop, “Anatomy of a Nanopublication,” Information Services & Use, vol. 30, no. 1-2, pp. 51–56, 2010.
  12. M. Kalfatovic, “Building a global library of taxonomic literature,” in 28th Congresso Brasileiro de Zoologia Biodiversidade e Sustentabilidade, 2010, http://www.slideshare.net/Kalfatovic/building-a-global-library-of-taxonomic-literature.
  13. X. Tang and P. Heidorn, “Using automatically extracted information in species page retrieval,” 2007, http://scholar.google.com/scholar?hl=en&q=Tang+Heidorn+2007+using+automatically+extracted&btnG=Search&as_sdt=0,22&as_ylo=&as_vis=0#0.
  14. H. Cui, P. Selden, and D. Boufford, “Semantic annotation of biosystematics literature without training examples,” Journal of the American Society for Information Science and Technology, vol. 61, pp. 522–542, 2010.
  15. A. Taylor, “Extracting knowledge from biological descriptions,” in Proceedings of 2nd International Conference on Building and Sharing Very Large-Scale Knowledge Bases, pp. 114–119, 1995.
  16. H. Cui, “Competency evaluation of plant character ontologies against domain literature,” Journal of the American Society for Information Science and Technology, vol. 61, no. 6, pp. 1144–1165, 2010.
  17. Y. Miyao, K. Sagae, R. Sætre, T. Matsuzaki, and J. Tsujii, “Evaluating contributions of natural language parsers to protein-protein interaction extraction,” Bioinformatics, vol. 25, no. 3, pp. 394–400, 2009.
  18. K. Humphreys, G. Demetriou, and R. Gaizauskas, “Two applications of information extraction to biological science journal articles: enzyme interactions and protein structures,” in Proceedings of the Pacific Symposium on Biocomputing (PSB '00), vol. 513, pp. 505–513, 2000.
  19. R. Gaizauskas, G. Demetriou, P. J. Artymiuk, and P. Willett, “Protein structures and information extraction from biological texts: the pasta system,” Bioinformatics, vol. 19, no. 1, pp. 135–143, 2003.
  20. A. Divoli and T. K. Attwood, “BioIE: extracting informative sentences from the biomedical literature,” Bioinformatics, vol. 21, no. 9, pp. 2138–2139, 2005.
  21. D. P. A. Corney, B. F. Buxton, W. B. Langdon, and D. T. Jones, “BioRAT: extracting biological information from full-length papers,” Bioinformatics, vol. 20, no. 17, pp. 3206–3213, 2004.
  22. H. Chen and B. M. Sharp, “Content-rich biological network constructed by mining PubMed abstracts,” BMC Bioinformatics, vol. 5, article 147, 2004.
  23. X. Zhou, X. Zhang, and X. Hu, “Dragon toolkit: incorporating auto-learned semantic knowledge into large-scale text retrieval and mining,” in Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI '07), pp. 197–201, October 2007.
  24. D. Rebholz-Schuhmann, H. Kirsch, M. Arregui, S. Gaudan, M. Riethoven, and P. Stoehr, “EBIMed—text crunching to gather facts for proteins from Medline,” Bioinformatics, vol. 23, no. 2, pp. e237–e244, 2007.
  25. Z. Z. Hu, I. Mani, V. Hermoso, H. Liu, and C. H. Wu, “iProLINK: an integrated protein resource for literature mining,” Computational Biology and Chemistry, vol. 28, no. 5-6, pp. 409–416, 2004.
  26. J. Demaine, J. Martin, L. Wei, and B. De Bruijn, “LitMiner: integration of library services within a bio-informatics application,” Biomedical Digital Libraries, vol. 3, article 11, 2006.
  27. M. Lease and E. Charniak, “Parsing biomedical literature,” in Proceedings of the 2nd International Joint Conference on Natural Language Processing (IJCNLP '05), Jeju Island, Korea, 2005.
  28. S. Pyysalo and T. Salakoski, “Lexical adaptation of link grammar to the biomedical sublanguage: a comparative evaluation of three approaches,” BMC Bioinformatics, vol. 7, supplement 3, article S2, 2006.
  29. L. Rimell and S. Clark, “Porting a lexicalized-grammar parser to the biomedical domain,” Journal of Biomedical Informatics, vol. 42, no. 5, pp. 852–865, 2009.
  30. H. Cui, “Converting taxonomic descriptions to new digital formats,” Biodiversity Informatics, vol. 5, pp. 20–40, 2008.
  31. D. Koning, I. N. Sarkar, and T. Moritz, “TaxonGrab: extracting taxonomic names from text,” Biodiversity Informatics, vol. 2, pp. 79–82, 2005.
  32. L. M. Akella, C. N. Norton, and H. Miller, “NetiNeti: discovery of scientific names from text using machine learning methods,” 2011.
  33. M. Gerner, G. Nenadic, and C. M. Bergman, “LINNAEUS: a species name identification system for biomedical literature,” BMC Bioinformatics, vol. 11, article 85, 2010.
  34. N. Naderi and T. Kappler, “OrganismTagger: detection, normalization and grounding of organism entities in biomedical documents,” Bioinformatics, vol. 27, no. 19, pp. 2721–2729, 2011.
  35. R. Abascal and J. A. Sánchez, “X-tract: structure extraction from botanical textual descriptions,” in Proceedings of the String Processing & Information Retrieval Symposium & International Workshop on Groupware, pp. 2–7, IEEE Computer Society, Cancun, Mexico, September 1999.
  36. H. Cui, “CharaParser for fine-grained semantic annotation of organism morphological descriptions,” Journal of the American Society for Information Science and Technology, vol. 63, no. 4, pp. 738–754, 2012.
  37. M. Krauthammer, A. Rzhetsky, P. Morozov, and C. Friedman, “Using BLAST for identifying gene and protein names in journal articles,” Gene, vol. 259, no. 1-2, pp. 245–252, 2000.
  38. L. Lenzi, F. Frabetti, F. Facchin et al., “UniGene tabulator: a full parser for the UniGene format,” Bioinformatics, vol. 22, no. 20, pp. 2570–2571, 2006.
  39. A. Nasr and O. Rambow, “Supertagging and full parsing,” in Proceedings of the 7th International Workshop on Tree Adjoining Grammar and Related Formalisms (TAG '04), 2004.
  40. R. Leaman and G. Gonzalez, “BANNER: an executable survey of advances in biomedical named entity recognition,” in Proceedings of the Pacific Symposium on Biocomputing (PSB '08), pp. 652–663, Kona, Hawaii, USA, January 2008.
  41. M. Schröder, “Knowledge-based processing of medical language: a language engineering approach,” in Proceedings of the 16th German Conference on Artificial Intelligence (GWAI '92), vol. 671, pp. 221–234, Bonn, Germany, August-September 1992.
  42. I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann, 2nd edition, 2005.
  43. C. Blaschke, L. Hirschman, and A. Valencia, “Information extraction in molecular biology,” Briefings in Bioinformatics, vol. 3, no. 2, pp. 154–165, 2002.
  44. A. Jimeno-Yepes and A. R. Aronson, “Self-training and co-training in biomedical word sense disambiguation,” pp. 182–183.
  45. C. Freeland, “An evaluation of taxonomic name finding & next steps in Biodiversity Heritage Library (BHL) developments,” Nature Precedings, 2009, http://precedings.nature.com/documents/3372/version/1.
  46. A. Kornai, “Experimental HMM-based postal OCR system,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97), vol. 4, pp. 3177–3180, April 1997.
  47. A. Kornai, K. Mohiuddin, and S. D. Connell, “Recognition of cursive writing on personal checks,” in Proceedings of the 5th International Workshop on Frontiers in Handwriting Recognition, pp. 373–378, Citeseer, Essex, UK, 1996.
  48. C. Freeland, “Digitization and enhancement of biodiversity literature through OCR, scientific names mapping and crowdsourcing,” in BioSystematics Berlin, 2011, http://www.slideshare.net/chrisfreeland/digitization-and-enhancement-of-biodiversity-literature-through-ocr-scientific-names-mapping-and-crowdsourcing.
  49. A. Willis, D. King, D. Morse, A. Dil, C. Lyal, and D. Roberts, “From XML to XML: the why and how of making the biodiversity literature accessible to researchers,” in Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC '10), pp. 1237–1244, European Language Resources Association (ELRA), Valletta, Malta, May 2010.
  50. F. Bapst and R. Ingold, “Using typography in document image analysis,” in Proceedings of Raster Imaging and Digital Typography (RIDT '98), pp. 240–251, Saint-Malo, France, March-April 1998.
  51. A. L. Weitzman and C. H. C. Lyal, An XML Schema for Taxonomic Literature—TaXMLit, 2004, http://www.sil.si.edu/digitalcollections/bca/documentation/taXMLitv1-3Intro.pdf.
  52. T. Rees, “TAXAMATCH, a “fuzzy” matching algorithm for taxon names, and potential applications in taxonomic databases,” in Proceedings of TDWG, 2008, pp. 35, http://www.tdwg.org/fileadmin/2008conference/documents/Proceedings2008.pdf#page=35.
  53. G. Sautter, K. Böhm, and D. Agosti, “Semi-automated xml markup of biosystematic legacy literature with the goldengate editor,” in Proceedings of the Pacific Symposium on Biocomputing (PSB '07), pp. 391–402, World Scientific, 2007.
  54. B. Settles, “ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text,” Bioinformatics, vol. 21, no. 14, pp. 3191–3192, 2005.
  55. G. A. Pavlopoulos, E. Pafilis, M. Kuhn, S. D. Hooper, and R. Schneider, “OnTheFly: a tool for automated document-based text annotation, data linking and network generation,” Bioinformatics, vol. 25, no. 7, pp. 977–978, 2009.
  56. E. Pafilis, S. I. O'Donoghue, L. J. Jensen et al., “Reflect: augmented browsing for the life scientist,” Nature Biotechnology, vol. 27, no. 6, pp. 508–510, 2009.
  57. M. Kuhn, C. von Mering, M. Campillos, L. J. Jensen, and P. Bork, “STITCH: interaction networks of chemicals and proteins,” Nucleic Acids Research, vol. 36, no. 1, pp. D684–D688, 2008.
  58. J. P. Balhoff, W. M. Dahdul, C. R. Kothari et al., “Phenex: ontological annotation of phenotypic diversity,” PLoS ONE, vol. 5, no. 5, article e10500, 2010.
  59. W. M. Dahdul, J. P. Balhoff, J. Engeman et al., “Evolutionary characters, phenotypes and ontologies: curating data from the systematic biology literature,” PLoS ONE, vol. 5, no. 5, article e10708, 2010.
  60. G. Sautter, K. Böhm, and D. Agosti, “A combining approach to find all taxon names (FAT) in legacy biosystematics literature,” Biodiversity Informatics, vol. 3, pp. 46–58, 2007.
  61. P. R. Leary, D. P. Remsen, C. N. Norton, D. J. Patterson, and I. N. Sarkar, “UbioRSS: tracking taxonomic literature using RSS,” Bioinformatics, vol. 23, no. 11, pp. 1434–1436, 2007.
  62. N. Okazaki and S. Ananiadou, “Building an abbreviation dictionary using a term recognition approach,” Bioinformatics, vol. 22, no. 24, pp. 3089–3095, 2006.
  63. K. Bontcheva, V. Tablan, D. Maynard, and H. Cunningham, “Evolving GATE to meet new challenges in language engineering,” Natural Language Engineering, vol. 10, no. 3-4, pp. 349–373, 2004.
  64. H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, C. Ursu et al., Developing Language Processing Components with GATE (A User Guide), University of Sheffield, 2006.
  65. E. Fitzpatrick, J. Bachenko, and D. Hindle, “The status of telegraphic sublanguages,” in Analyzing Language in Restricted Domains: Sublanguage Description and Processing, pp. 39–51, 1986.
  66. M. Wood, S. Lydon, V. Tablan, D. Maynard, and H. Cunningham, “Populating a database from parallel texts using ontology-based information extraction,” in Natural Language Processing and Information Systems, vol. 3136, pp. 357–365, 2004.
  67. L. Chen, H. Liu, and C. Friedman, “Gene name ambiguity of eukaryotic nomenclatures,” Bioinformatics, vol. 21, no. 2, pp. 248–256, 2005.
  68. H. Yu, W. Kim, V. Hatzivassiloglou, and W. J. Wilbur, “Using MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles,” Journal of Biomedical Informatics, vol. 40, no. 2, pp. 150–159, 2007.
  69. J. T. Chang and H. Schutze, “Abbreviations in biomedical text,” in Text Mining for Biology and Biomedicine, pp. 99–119, 2006.
  70. J. D. Wren and H. R. Garner, “Heuristics for identification of acronym-definition patterns within text: towards an automated construction of comprehensive acronym-definition dictionaries,” Methods of Information in Medicine, vol. 41, no. 5, pp. 426–434, 2002.
  71. S. Lydon and M. Wood, “Data patterns in multiple botanical descriptions: implications for automatic processing of legacy data,” Systematics and Biodiversity, vol. 1, no. 2, pp. 151–157, 2003.
  72. A. Taylor, “Using prolog for biological descriptions,” in Proceedings of the 3rd International Conference on the Practical Application of Prolog, pp. 587–597, 1995.
  73. A. E. Radford, Fundamentals of Plant Systematics, Harper & Row, New York, NY, USA, 1986.
  74. J. Diederich, R. Fortuner, and J. Milton, “Computer-assisted data extraction from the taxonomical literature,” 1999, http://math.ucdavis.edu/~milton/genisys.html.
  75. M. Wood, S. Lydon, V. Tablan, D. Maynard, and H. Cunningham, “Using parallel texts to improve recall in IE,” in Proceedings of Recent Advances in Natural Language Processing (RANLP '03), pp. 505–512, Borovetz, Bulgaria, 2003.
  76. H. Cui and P. B. Heidorn, “The reusability of induced knowledge for the automatic semantic markup of taxonomic descriptions,” Journal of the American Society for Information Science and Technology, vol. 58, no. 1, pp. 133–149, 2007.
  77. Q. Wei, Information fusion in taxonomic descriptions, Ph.D. thesis, University of Illinois at Urbana-Champaign, Champaign, Ill, USA, 2011.
  78. S. Soderland, “Learning information extraction rules for semi-structured and free text,” Machine Learning, vol. 34, no. 1, pp. 233–272, 1999.
  79. H. Cui, S. Singaram, and A. Janning, “Combine unsupervised learning and heuristic rules to annotate morphological characters,” Proceedings of the American Society for Information Science and Technology, vol. 48, no. 1, pp. 1–9, 2011.
  80. P. M. Mabee, M. Ashburner, Q. Cronk et al., “Phenotype ontologies: the bridge between genomics and evolution,” Trends in Ecology and Evolution, vol. 22, no. 7, pp. 345–350, 2007.