The Scientific World Journal
Volume 2012 (2012), Article ID 949247, 8 pages
A Learning-Based Approach for Biomedical Word Sense Disambiguation
University of Houston-Clear Lake, Houston, TX 77058, USA
Received 28 October 2011; Accepted 30 November 2011
Academic Editor: Massimo Cafaro
Copyright © 2012 Hisham Al-Mubaid and Sandeep Gungu. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
In the biomedical domain, word sense ambiguity is a widespread problem, yet the bioinformatics research effort devoted to it is not commensurate with its extent, leaving room for further development. This paper presents and evaluates a learning-based approach for sense disambiguation within the biomedical domain. The main limitation of supervised methods is the need for a corpus of manually disambiguated instances of the ambiguous words. However, advances in automatic text annotation and tagging techniques, helped by the plethora of knowledge sources such as ontologies and text literature in the biomedical domain, will help lessen this limitation. The proposed method utilizes the interaction model (mutual information) between the context words and the senses of the target word to induce reliable learning models for sense disambiguation. The method has been evaluated with the benchmark dataset NLM-WSD under various settings and on biomedical entity species disambiguation. The evaluation results showed that the approach is very competitive and outperforms recently reported results of other published techniques.
1. Introduction
Word sense disambiguation is the task of determining the correct sense of a given word in a given context. In the general language domain, and within natural language processing (NLP), the word sense disambiguation (WSD) problem has been studied extensively over the past few decades [1, 2]. In the biomedical domain, on the other hand, word sense ambiguity is more widespread in biological and medical texts and sometimes has more severe consequences. The amount of WSD research in the biomedical domain is not proportional to the extent of the problem. As an example, in biomedical texts, the term “blood pressure” has three possible senses according to the Unified Medical Language System (UMLS): organism function, diagnostic procedure, and laboratory or test result. Thus, if the term blood pressure is found in a medical text, the reader has to judge manually which of these three senses is intended in that text. Word sense disambiguation contributes to many important applications, including text mining, information extraction, and information retrieval systems [1, 2, 4]. It is also considered a key component in most intelligent knowledge discovery and text mining applications.
The main classes of approaches to word sense disambiguation are supervised methods and unsupervised methods. Supervised methods rely on training and learning phases that require a dataset or corpus containing manually disambiguated instances to train the system [5, 6]. Unsupervised methods, on the other hand, are based on knowledge sources such as ontologies (for example, from UMLS) or text corpora [2, 4, 7, 8]. In this paper, we present and evaluate a supervised method for biomedical word sense disambiguation. The method is based on machine learning and uses feature selection techniques in constructing feature vectors for the words to be disambiguated. We conducted the evaluation using the NLM-WSD benchmark corpus and a species disambiguation dataset. The evaluation results showed the competitiveness of the proposed approach, as it outperforms some recently published techniques, including supervised ones.
2. Related Work
In the biomedical domain, the applications of text mining and machine learning techniques have been quite successful and encouraging. Most of the methods for biomedical entity name recognition, classification, or disambiguation can be roughly divided into three categories: (i) supervised and machine-learning-based techniques, (ii) statistical and corpus-based techniques, and (iii) syntactic and rule-based techniques [9–11]. Moreover, the bioinformatics literature shows that biomedical WSD has been quite an active area of research, with a number of approaches proposed and applied to biomedical data [1, 2, 4, 8, 12, 13].
Agirre et al. proposed a graph-based WSD technique that is considered unsupervised but relies on UMLS. The concepts of UMLS are represented as a graph, and WSD is done using the Personalized PageRank algorithm.
In related research, Jimeno-Yepes and Aronson presented a review and evaluation of four WSD approaches that rely on UMLS as the knowledge source for disambiguation. Stevenson et al. use supervised learners with linguistic features extracted from the context of the word in combination with MeSH terms for disambiguation.
Humphrey et al. used the UMLS as a knowledge source for assigning the correct sense to a given word. They used journal descriptor indexing of the abstract containing the term to assign a semantic type from the UMLS Metathesaurus [3, 13].
In bioinformatics and computational biology, there are quite a few tasks similar to WSD, such as biomedical term disambiguation, gene and protein name disambiguation, and species disambiguation for biomedical named entities [9–11]. The task of biomedical named entity disambiguation or classification extends the well-known task of biomedical named entity recognition (NER). In NER, biomedical entity names, for example, gene names, are recognized and extracted from the text. In biomedical named entity disambiguation, each occurrence of an extracted entity name (e.g., a gene product name) is disambiguated as either a gene name or a protein name, since the same name can refer to a gene or a protein. For example, the biomedical entity name SBP2 can be a gene name or a protein name depending on the context [10, 11]. Furthermore, in species disambiguation, the term c-myc is a gene, but it can be either a human gene (Homo sapiens) or a mouse gene (Mus musculus) depending on the context [9–11, 14–16].
Wang et al. devised a rule-based system to disambiguate biomedical entity names, such as gene products, based on species. In that approach, parsing techniques are used to build a syntactic parse tree, and the paths between words determine whether a path exists between a species word and the entity name. They employed and examined several parsers in the task, including C&C, Enju, Minipar, and Stanford-Genia [9, 15, 16].
3. A Method for WSD
A word sense disambiguation method is an algorithm that assigns the most accurate sense to a given word in a given context. Our method is a supervised method requiring a training corpus that contains manually disambiguated instances of the ambiguous words. The method is based on a word classification and disambiguation technique that we proposed in a preliminary work. In that previous work, we introduced a method for term disambiguation and evaluated it on biomedical terms to disambiguate gene and protein names in medical texts.
The method relies on representing each instance of the word to be disambiguated as a feature vector whose components are neighborhood context words in the training instances. From the context of the target word, we select the words with high discriminating capability as the components of the vectors. As a supervised technique, this method consists of two stages: a learning (or training) stage and a testing (or application) stage. During the learning phase, the constructed feature vectors of the training instances are used as labeled examples to train classifiers. The trained classifiers are then used to disambiguate unseen and unlabeled examples in the testing phase. One of the main strengths of this method is the way the features are selected for learning and classification.
The features selected from the training examples have great impact on the effectiveness of the machine learning technique. Extensive research efforts have been devoted to feature selection in machine learning research [18–21]. The labeled training instances will be used to extract the word features for the feature vectors.
Suppose the word w has two senses, s1 and s2. Let the set C1 be the set of instances of w labeled with s1, and suppose C2 contains the instances of w labeled with sense s2. Each instance of w labeled with sense s1 or s2 (i.e., in the set C1 or in the set C2) can then be viewed as the target word together with the context words surrounding it on both sides, where n is the window size. Next, we collect all the context words of all instances in C1 and C2 in one set W. Each context word wi ∈ W may occur in the contexts of instances labeled with s1, with s2, or with a combination of both, in any distribution. We want to determine, when we see a context word wi in an ambiguous instance, to what extent this occurrence of wi suggests that the example belongs to C1 or to C2. Thus, we use as features those context words that can highly discriminate between C1 and C2. For that, we use feature selection techniques such as mutual information (MI) [19, 20] as follows. For each context word wi ∈ W in the labeled training examples, we compute four values a, b, c, and d as follows:
a = number of occurrences of wi in C1,
b = number of occurrences of wi in C2,
c = number of examples of C1 that do not contain wi,
d = number of examples of C2 that do not contain wi.
The mutual information (MI) is then defined in terms of these values as in (1), where N is the total number of training examples. MI is a well-known concept in information theory and statistical learning. MI is a measure of interaction and common information between two variables. In this work, we adapted MI to represent the interaction between the context words and the class label based on the values a through d as defined above. We utilized the training corpus of the labeled instances of the word to be disambiguated to compile the list of all context words (W) as explained above; all instances of one sense are under one class label. We notice that if the context word wi occurs mostly in class C1 (or mostly in C2), then the MI indicates this, as shown in (2).
Thus, MI can be used as a means to estimate the amount of information interaction between a context word and a class label. So, MI is used to select the context words with the highest discriminating capability between C1 and C2. For simplicity, and without loss of generality, we assume that we have two senses (two class labels). Moreover, following the same intuitive reasoning as mutual information, we define another measure, M2, in (3) for selecting the words to be included as features in the feature vectors.
In the following example, assume that the target word w has 10 instances already labeled with one of two senses, as shown in Table 1. Class C1 contains the instances of w with the first sense, while C2 contains the instances with the second sense. Each instance is shown with its context words within a certain window size. The target word is shown in boldface. In this example, N = 10 is the total number of training examples. The values of a, b, c, d for the context word w1 are (4, 1, 1, 4), respectively. That is, w1 has 4 occurrences in C1 and one occurrence in C2, and so on. The values of a, b, c, d for the context word w2 are (3, 2, 2, 3), respectively. As we can see, w1 is more strongly related to the class C1 than w2, and so it has more discriminating power than w2; this is quantified by their MI values, which are 1.8 and 1.2 for w1 and w2, respectively.
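The counting step can be sketched in a few lines of Python. The exact MI formula of the paper is not reproduced in this text, so the sketch assumes the common text-categorization estimate log2(a·N / ((a + b)(a + c))), in the style of the feature selection literature cited as [19, 20]; the absolute values it produces therefore need not match the 1.8 and 1.2 quoted in the example, and only the ordering of the two context words is illustrated. The function names and the toy instances are hypothetical.

```python
import math

def context_counts(word, class1, class2):
    """Return (a, b, c, d) for one context word.

    class1 and class2 are lists of instances; each instance is the set of
    context words around one labeled occurrence of the target word.
    """
    a = sum(1 for ctx in class1 if word in ctx)  # instances of C1 containing word
    b = sum(1 for ctx in class2 if word in ctx)  # instances of C2 containing word
    c = len(class1) - a                          # instances of C1 without word
    d = len(class2) - b                          # instances of C2 without word
    return a, b, c, d

def mi_estimate(a, b, c, d):
    """Assumed estimate log2(a * N / ((a + b) * (a + c))), N = a + b + c + d."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # word never seen with the first class
    return math.log2(a * n / ((a + b) * (a + c)))

# Toy data (hypothetical) reproducing the counts (4, 1, 1, 4) and (3, 2, 2, 3).
c1 = [{"w1", "w2"}, {"w1", "w2"}, {"w1", "w2"}, {"w1"}, {"x"}]
c2 = [{"w1", "w2"}, {"w2"}, {"x"}, {"x"}, {"x"}]
counts_w1 = context_counts("w1", c1, c2)
counts_w2 = context_counts("w2", c1, c2)
```

Under this assumed estimate, w1 still scores higher than w2, which agrees with the ordering described in the example.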
Then, the MI (or M2) value is computed for every context word in W. The context words are then ordered by their MI values, and the top-ranked words with the highest MI values are selected as features. In this research, we experimented with 100, 200, and 300 selected words. With 100 words, for example, each training example is represented by a vector of 100 entries such that the first entry represents the context word with the highest MI value, the second entry represents the context word with the second highest MI value, and so on.
Then, for a given training example, the feature vector entry is set to +MI if the corresponding feature (context word) occurs in that training example and to −MI otherwise. Table 2 shows the 10 context words with the highest MI values for the ambiguous word “cold” in the NLM-WSD benchmark corpus explained in Section 4.1. These 10 words will be used to compose the feature vectors for training or testing examples of the terms to be disambiguated. For example, consider a simple feature vector of size 5 that represents an instance that has the first, third, and fourth context words available in its context, where 1.23 is the MI value of the context word with the highest MI.
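A minimal sketch of the feature selection and the +MI/−MI encoding described above follows. The context words and all MI values other than 1.23 are hypothetical placeholders (the actual entries of Table 2 are not reproduced here), and the helper names are assumptions made for illustration.

```python
def select_features(mi_by_word, k):
    """Return the k (word, MI) pairs with the highest MI, best first."""
    ranked = sorted(mi_by_word.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

def build_vector(context_words, features):
    """Entry i is +MI if feature i occurs in the context, otherwise -MI."""
    present = set(context_words)
    return [mi if word in present else -mi for word, mi in features]

# Hypothetical MI scores for context words of an ambiguous target word.
mi_scores = {
    "temperature": 1.23, "weather": 0.9, "virus": 0.8,
    "air": 0.5, "winter": 0.4, "study": 0.1,
}
features = select_features(mi_scores, 5)

# An instance whose context contains the first, third, and fourth features.
vector = build_vector(["temperature", "virus", "air", "patient"], features)
# vector == [1.23, -0.9, 0.8, 0.5, -0.4]
```

Because the entries carry the MI magnitude rather than a binary flag, a feature with a higher MI value contributes more to the distance or kernel computed by the learner.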
The Learning Phase
From the labeled training examples of the word, we build the feature vectors using the top context words selected by MI or M2 as features. After that, we use the support vector machine (SVM) as the learner to train the classifier on the training vectors. SVM has been shown to be one of the most successful and efficient machine learning algorithms and is well founded theoretically and experimentally [7, 17, 18, 23]. Applications of SVM abound; in particular, in NLP tasks such as text categorization, relation extraction, and named entity recognition, SVM has proved to be among the best performers. We use the SVM-light (http://svmlight.joachims.org/) implementation with the default parameters and the radial basis function (RBF) kernel.
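As an illustration of how such vectors could be handed to SVM-light, the sketch below serializes labeled feature vectors into the sparse text format that SVM-light reads (one example per line: a target followed by index:value pairs with 1-based, increasing indices). The function names and file names are hypothetical, and the paper does not describe its own data preparation scripts.

```python
def to_svmlight_line(label, vector):
    """Format one example as '<target> 1:<v1> 2:<v2> ...' (1-based indices)."""
    target = "+1" if label > 0 else "-1"
    # Zero-valued features may be omitted in the sparse format.
    pairs = [f"{i}:{v:.6g}" for i, v in enumerate(vector, start=1) if v != 0]
    return " ".join([target] + pairs)

def to_svmlight(labels, vectors):
    """Serialize a whole training or test set, one example per line."""
    return "\n".join(to_svmlight_line(y, x) for y, x in zip(labels, vectors)) + "\n"

# Two toy examples, one per sense (+1 for the first sense, -1 for the second).
example = to_svmlight([1, -1], [[1.23, -0.9, 0.8], [-1.23, 0.9, -0.8]])

# With SVM-light installed, an RBF-kernel run would look roughly like:
#   svm_learn -t 2 train.dat model
#   svm_classify test.dat model predictions
```

Here `-t 2` selects the RBF kernel in `svm_learn`; all other options are left at their defaults, in line with the setup described above.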
The Disambiguation Step
In the testing step, we want to disambiguate a new instance of the word. We construct a feature vector for the instance in the same way as in the learning step. The classifier induced in the learning step is then employed to classify the instance, that is, to assign it to one of the two senses.
4. Evaluation and Experiments
4.1. Biomedical WSD (NLM-WSD)
We used the benchmark dataset NLM-WSD for biomedical word sense disambiguation. This dataset was created as a unified benchmark set of ambiguous medical terms that have been reviewed and disambiguated by reviewers from the field. Most of the previous work on biomedical WSD uses this dataset [1, 2, 4]. The NLM-WSD corpus contains 50 ambiguous terms with 100 instances for each term, for a total of 5000 examples. Each example is basically a Medline abstract containing one or more occurrences of the ambiguous word. The instances of these ambiguous terms were disambiguated by 11 annotators who assigned a sense to each instance. The assigned senses are semantic types from UMLS. When the annotators did not assign any sense to an instance, that instance was tagged with “None”. Only one term, “association”, had all of its 100 instances annotated None, and so it was dropped from the testing.
On this benchmark corpus, we carried out the following text preprocessing steps:
(i) Converting all words to lowercase.
(ii) Removing stopwords, that is, all common function words such as “is”, “the”, “in”, and so forth.
(iii) Performing word stemming using the Porter stemming algorithm.
Moreover, unlike other previous work, words with fewer than 3 or more than 50 characters are currently not ignored (unless dropped by the stopword removal step). Also, words with parentheses or square brackets are not ignored, and part of speech is not used.
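The preprocessing steps above, together with the extraction of a context window around the target word, can be sketched as follows. The stopword list is a tiny illustrative subset, whitespace tokenization is an assumption, and the stemmer is passed in as a function (any Porter stemmer implementation could be supplied); by default no stemming is applied so that the sketch has no external dependencies.

```python
import re

# Illustrative subset only; a real run would use a full stopword list.
STOPWORDS = {"is", "the", "in", "a", "an", "of", "and", "to"}

def preprocess(text, stem=lambda word: word):
    """Lowercase, split on whitespace, drop stopwords, then stem each token.

    No length filter is applied, and tokens containing parentheses or
    square brackets are kept, as in the description above.
    """
    tokens = re.findall(r"\S+", text.lower())
    return [stem(token) for token in tokens if token not in STOPWORDS]

def context_window(tokens, target_index, size=5):
    """Return up to `size` tokens on each side of the target token."""
    left = tokens[max(0, target_index - size):target_index]
    right = tokens[target_index + 1:target_index + 1 + size]
    return left + right

tokens = preprocess("The patient is in the cold room")
# tokens == ["patient", "cold", "room"]
window = context_window(tokens, tokens.index("cold"), size=5)
# window == ["patient", "room"]
```

Removing stopwords before taking the window means the window covers content words only; whether the original system applied the window before or after stopword removal is not stated, so this ordering is an assumption.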
After the text preprocessing is completed, for each word we convert the instances into numeric feature vectors. Then, we use SVM for training and testing with 5-fold cross validation (5FCV) such that 80% of the instances are used for training and the remaining 20% are used for testing; this is repeated five times by changing the training and testing portions of the data. The reported accuracy is the mean accuracy over the five folds, where the accuracy of a fold is computed as the number of correctly disambiguated test instances divided by the total number of test instances. We also use a baseline method that assigns the most frequent sense (mfs) for each word.
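The evaluation protocol can be sketched as below. How the instances were assigned to folds is not stated, so the interleaved split used here is an assumption, as are the function names; the accuracy function uses the standard definition (correct predictions divided by the number of test instances).

```python
from collections import Counter

def five_fold_indices(n_items, n_folds=5):
    """Yield (train, test) index lists; fold f tests items f, f + n_folds, ..."""
    for fold in range(n_folds):
        test = [i for i in range(n_items) if i % n_folds == fold]
        train = [i for i in range(n_items) if i % n_folds != fold]
        yield train, test

def accuracy(predicted, gold):
    """Fraction of test instances whose predicted sense matches the gold sense."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

def mfs_baseline(train_labels, n_test):
    """Predict the most frequent training sense for every test instance."""
    most_common_sense = Counter(train_labels).most_common(1)[0][0]
    return [most_common_sense] * n_test

# With 100 instances per word, each fold trains on 80 and tests on 20.
folds = list(five_fold_indices(100))
```

The mean of the five per-fold accuracies then gives the figure reported for each word, and the same folds can be reused to score the mfs baseline.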
Initially, we evaluated our WSD method with all 49 words (excluding “association” as mentioned previously) such that a word is included in the evaluation only if it has at least two senses, each having at least two instances annotated with it. This led to a total of 31 words tested in this evaluation; 18 words were dropped because they do not have at least two instances annotated for each of two senses. For example, the word “depression” has two senses: mental or behavioral dysfunction and functional concept. Out of the 100 instances of depression, 85 instances are tagged with the first sense, and the remaining 15 instances are tagged with “None” (i.e., no instances are tagged with the second sense), and so it was excluded from this evaluation. Likewise, the word “discharge” was not tested as it has only one instance tagged with the first sense, 74 instances tagged with the second sense, and 25 instances tagged with None. We used , and the window size is 5. The accuracy results of this first evaluation (EV1) are shown in Table 4. The detailed results of this evaluation are included in Table 5.
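The inclusion rule can be expressed as a short check. The sense labels "M1" and "M2" and the function name are illustrative assumptions; the two excluded words use the instance counts quoted above, while the 60/40 word is hypothetical.

```python
from collections import Counter

def is_testable(sense_labels, min_senses=2, min_instances=2):
    """True if at least `min_senses` senses each have `min_instances` instances.

    Instances tagged "None" carry no sense and are not counted.
    """
    counts = Counter(label for label in sense_labels if label != "None")
    qualifying = [sense for sense, n in counts.items() if n >= min_instances]
    return len(qualifying) >= min_senses

depression = ["M1"] * 85 + ["None"] * 15               # no second-sense instances
discharge = ["M1"] * 1 + ["M2"] * 74 + ["None"] * 25   # first sense has one instance
two_sense_word = ["M1"] * 60 + ["M2"] * 40             # hypothetical word

included = [is_testable(w) for w in (depression, discharge, two_sense_word)]
# included == [False, False, True]
```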
In the second evaluation (EV2) and third evaluation (EV3), we changed the parameter and the word/feature selection formula. In EV2, we set , and window size is still 5. In EV3, we kept , window = 5, and changed the word/feature selection formula to M2 defined in (3). Table 5 contains the results of EV2 and EV3. To judge the performance of our method and compare our results with similar techniques, we included several reported results from three recent publications from 2008 to 2010 [1, 2, 4] alongside our results in Table 6 under the same experimental settings.
4.2. Species Disambiguation
In biomedical text, named entities, such as gene names, are used in the same way irrespective of the species of the entity. As a result, it is difficult to extract relevant medical information automatically from texts using an information extraction system. In biomedical named entity species disambiguation, we want to disambiguate a given entity name, for example, c-myc, based on the species (e.g., human versus mouse). In one instance, c-myc might refer to a human gene, while in another instance it refers to a mouse gene.
For example, in Table 3, the biomedical entity name BCL-2 (a protein name) in the first text (no. 1) is a human protein, while in the second one it is a mouse protein. We examined our system on this task of species disambiguation. We obtained the data from the project of Wang et al. From their data, we tested the biomedical entity names that occur in at least two species with at least 3 occurrences in each species. This enables us to use two instances for training and one for testing and to repeat this three times. If the entity has 5 or more occurrences in one species, we repeat five times using 5FCV as in Section 4.1. We extracted and tested our system on a total of 465 instances of entity names with an average of 8 instances per species for each entity name. In the original dataset (gold standard), 90% of the terms have all their instances occurring in only one species and so cannot be tested in our system, which requires that each term have instances in two or more species with at least 3 occurrences in each species. The results of Wang et al. are shown in Table 7, whereas the results of our proposed system are shown in Table 8 in terms of precision, recall, and F1.
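One way to read the selection and fold rule for the species task is sketched below. The text leaves some room for interpretation, so this sketch assumes that an entity qualifies when at least two species have 3 or more occurrences each, and that five folds are used only when every qualifying species has 5 or more occurrences; the function name and the example counts are hypothetical.

```python
def choose_folds(occurrences_per_species, min_species=2, min_occurrences=3):
    """Return 0 if the entity cannot be tested, otherwise 3 or 5 folds."""
    eligible = [n for n in occurrences_per_species.values() if n >= min_occurrences]
    if len(eligible) < min_species:
        return 0  # all (or nearly all) instances fall in a single species
    return 5 if min(eligible) >= 5 else 3

fold_plan = {
    "entity_a": choose_folds({"human": 8, "mouse": 6}),   # 5 folds
    "entity_b": choose_folds({"human": 4, "mouse": 3}),   # 3 folds
    "entity_c": choose_folds({"human": 12}),              # not testable
    "entity_d": choose_folds({"human": 12, "mouse": 2}),  # not testable
}
```

With three folds and exactly 3 occurrences in a species, each round trains on two instances of that species and tests on the third, which matches the procedure described above.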
5. Discussion and Conclusion
The main weakness of supervised and machine-learning-based methods for WSD is their dependency on annotated training text that includes manually disambiguated instances of the ambiguous word [2, 17]. However, over time, the rapidly increasing volume of text and literature, together with new algorithms and techniques for text annotation and concept mapping, will alleviate this problem. Moreover, advances in ontology development and integration in the biomedical domain will further facilitate the process of automatic text annotation.
In this paper, we reported a machine learning approach for biomedical WSD. The approach was evaluated with a benchmark dataset, NLM-WSD, to facilitate comparison with the results of previous work. The average accuracy results of our method, compared to some recently reported results (Table 6), are promising and show that our method outperforms those recently reported methods. Table 6 contains the results for 11 methods: the baseline method (mfs), our method (last column), and 9 other methods from recent work published from 2008 to 2010 [1, 2, 4]. The average accuracy of our method is the highest (90.3%), and the closest one is NB (86.0%).
Our method also outperforms all 10 other methods on 12 out of 31 words, followed by NB, which outperforms the rest on 7 words.
Stevenson et al. report extensive accuracy results of their method (we call it Stevenson-2008) along with four other methods, including Joshi-2005 and McInnes-2007, with various combinations of words from the NLM-WSD corpus used for testing. For example, Joshi-2005 tested their system on 28 words (out of the whole set of 50 words), and other techniques used 22 words, 15 words, or the whole set. In Table 6, the results of the three methods (Joshi-2005, McInnes-2007, and Stevenson-2008) are taken from Stevenson et al. These three methods are supervised and used various machine learning algorithms and wide sets of features. For example, Stevenson-2008 used linguistic features, CUIs, MeSH terms, and combinations of these features. They employed three learners: VSM (vector space model), Naïve Bayes (NB), and SVM. The results included in Table 6 are their best results, obtained with VSM and (linguistic + MeSH) features. The method of Joshi-2005 uses five supervised learning methods and collocation features, while McInnes-2007 uses NB.
Our evaluation is done on 31 words (as explained in Section 4.1). We obtained the results of the other methods on these 31 words from the references shown in Table 6 to allow for direct comparison. The best result reported by Stevenson et al. is 87.8%, using all words with the VSM model, and the best for McInnes-2007 is 85.3%, also with the whole set. The best result of Stevenson-2008 for subsets was 85.1%, using a subset of 22 words defined by Stevenson et al.
The results of the three methods (single, subset, full) in Table 6 are taken directly from Agirre et al. As shown in Table 6, the average accuracy of these three methods (68.8%, 59.7%, and 63.5%) on the 31 words is considerably lower than that of our method (90.3%), as is the average accuracy of their method on the whole set (65.9%, 63.0%, and 65.9%); we note that their method is unsupervised and does not require tagged instances. In another work, Jimeno-Yepes and Aronson evaluate four unsupervised methods on the whole NLM-WSD set, as well as NB and combinations of the four methods. The accuracy of these methods ranges from 58.3% to 88.3% (NB) on the whole set; NB was found to be the best performer, followed by CombSW (76.3%). The average accuracy results of NB and the two combinations (CombSW and CombV) on our 31-word subset are 86%, 73.1%, and 72.1%, respectively, which are lower than our results (see Table 6).
When we applied our system to the species disambiguation task, the results were also encouraging, as shown in Table 8. The evaluation results of our method compare very well with those reported by Wang et al., as shown in Table 7. From their results (Table 7), we notice that the best overall performance was obtained with the ML (machine learning) method, with precision, recall, and F1 all equal to 82.69. Our results in Table 8 are not directly comparable with those in Table 7 because of the difference in the size of the test set. However, we can see that our method performs reasonably well in terms of precision, recall, and F1. The main strength of this method is in using MI values as weights encoded in the feature vectors. These weights enable the learner to induce quite reliable models for sense disambiguation. As the components of the vectors, +MI and −MI, are the common information between context words and class labels, the induced learners are finely calibrated towards the disambiguation task.
All the results showed that the technique is fairly successful and effective in the disambiguation task. Further research is therefore warranted to improve the performance of this technique. In future work, we plan to investigate the possibility of disambiguating entity names when all instances of an entity occur in one species. Currently, our method is supervised and requires annotated instances in both classes to be able to test new samples.
References
1. M. Stevenson, Y. Guo, R. Gaizauskas, and D. Martinez, “Knowledge sources for word sense disambiguation of biomedical text,” in Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing (BioNLP '08), pp. 80–87, 2008.
2. E. Agirre, A. Soroa, and M. Stevenson, “Graph-based word sense disambiguation of biomedical documents,” Bioinformatics, vol. 26, no. 22, Article ID btq555, pp. 2889–2896, 2010.
3. B. L. Humphreys, D. A. B. Lindberg, H. M. Schoolman, and G. O. Barnett, “The unified medical language system: an informatics research collaboration,” Journal of the American Medical Informatics Association, vol. 5, no. 1, pp. 1–11, 1998.
4. A. J. Jimeno-Yepes and A. R. Aronson, “Knowledge-based biomedical word sense disambiguation: comparison of approaches,” BMC Bioinformatics, vol. 11, article 569, 2010.
5. J. W. Son and S. B. Park, “Learning word sense disambiguation in biomedical text with difference between training and test distributions,” in Proceedings of the 3rd ACM International Workshop on Data and Text Mining in Bioinformatics (DTMBIO '09), pp. 59–66, November 2009.
6. H. Xu, M. Markatou, R. Dimova, H. Liu, and C. Friedman, “Machine learning and word sense disambiguation in the biomedical domain: design and evaluation issues,” BMC Bioinformatics, vol. 7, article 334, 2006.
7. H. Al-Mubaid and C. Ping, “Biomedical term disambiguation: an application to gene-protein name disambiguation,” in Proceedings of the 3rd International Conference on Information Technology: New Generations (ITNG '06), pp. 606–612, Las Vegas, Nev, USA, April 2006.
8. G. K. Savova, A. R. Coden, I. L. Sominsky et al., “Word sense disambiguation across two domains: biomedical literature and clinical notes,” Journal of Biomedical Informatics, vol. 41, no. 6, pp. 1088–1100, 2008.
9. X. Wang, J. Tsujii, and S. Ananiadou, “Disambiguating the species of biomedical named entities using natural language parsers,” Bioinformatics, vol. 26, no. 5, Article ID btq002, pp. 661–667, 2010.
10. P. Chen and H. Al-Mubaid, “Context-based term disambiguation in biomedical literature,” in Proceedings of the 19th International Florida Artificial Intelligence Research Society Conference (FLAIRS '06), pp. 62–67, Orlando, Fla, USA, May 2006.
11. H. Al-Mubaid, “Context-based technique for biomedical term classification,” in Proceedings of the IEEE Congress on Evolutionary Computation (CEC '06), pp. 5726–5733, Vancouver, Canada, July 2006.
12. M. Stevenson et al., “Disambiguation of biomedical text using a variety of knowledge sources,” BMC Bioinformatics, vol. 9, supplement 11, article S7, 2008.
13. S. M. Humphrey, W. J. Rogers, H. Kilicoglu, D. Demner-Fushman, and T. C. Rindflesch, “Word sense disambiguation by selecting the best semantic type based on journal descriptor indexing: preliminary experiment,” Journal of the American Society for Information Science and Technology, vol. 57, no. 1, pp. 96–113, 2006.
14. M. Stevenson, E. Agirre, and A. Soroa, “Exploiting domain information for word sense disambiguation of medical documents,” Journal of the American Medical Informatics Association. In press.
15. Y. Miyao and J. Tsujii, “Feature forest models for probabilistic HPSG parsing,” Computational Linguistics, vol. 34, no. 1, pp. 35–80, 2008.
16. Y. Miyao, K. Sagae, R. Sætre, T. Matsuzaki, and J. Tsujii, “Evaluating contributions of natural language parsers to protein-protein interaction extraction,” Bioinformatics, vol. 25, no. 3, pp. 394–400, 2009.
17. P. Chen and H. Al-Mubaid, “Context-based term disambiguation in biomedical literature,” in Proceedings of the 19th International Florida Artificial Intelligence Research Society Conference (FLAIRS '06), pp. 62–67, Orlando, Fla, USA, May 2006.
18. G. Forman, “An extensive empirical study of feature selection metrics for text classification,” Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.
19. L. Galavotti, F. Sebastiani, and M. Simi, “Experiments on the use of feature selection and negative evidence in automated text categorization,” in Proceedings of the 4th European Conference on Research and Advanced Technology for Digital Libraries, 2000.
20. Y. Yang and J. O. Pedersen, “A comparative study on feature selection in text categorization,” in Proceedings of the 14th International Conference on Machine Learning (ICML '97), 1997.
21. Z. Zheng and R. Srihari, “Optimally combining positive and negative feature for text categorization,” in Proceedings of the Workshop on Learning from Imbalanced Data Sets II (ICML '03), 2003.
22. C. D. Manning and H. Schutze, Foundations of Statistical Natural Language Processing, The MIT Press, 1999.
23. T. Joachims, “Text categorization with support vector machines: learning with many relevant features,” in Proceedings of the 10th European Conference on Machine Learning, 1998.
24. M. Weeber, J. Mork, and A. Aronson, “Developing a test collection for biomedical word sense disambiguation,” in Proceedings of the Symposium American Medical Informatics Association (AMIA '01), 2001.
25. M. F. Porter, “An algorithm for suffix stripping,” Program, vol. 14, pp. 130–137, 1980.