Abstract

Computation of semantic similarity between words is a vital issue in many text-understanding applications such as word sense disambiguation, document categorization, and information retrieval. In recent years, different paradigms have been proposed to compute semantic similarity based on different ontologies and knowledge resources. In this paper, we propose a new similarity measure that combines the superconcepts of the evaluated concepts with their common specificity feature. The common specificity feature considers the depth of the Least Common Subsumer (LCS) of two concepts and the depth of the ontology to obtain more semantic evidence. The multiple-inheritance phenomenon in large and complex taxonomies is taken into account by considering all superconcepts of the evaluated concepts. We evaluate and compare the correlation obtained by our measure with human scores against other existing measures, exploiting SNOMED CT as the input ontology. The experimental evaluations show the applicability of the measure on different datasets and confirm the efficiency and simplicity of the proposed measure.

1. Introduction

In the last few years, the amount of available electronic information has increased sharply in many research areas such as biomedicine, education, psychology, linguistics, cognitive science, and artificial intelligence. Most of these information sources are presented in unstructured or semistructured textual formats. Hence, it is an urgent issue to process text information from a semantic perspective. Understood as the degree of taxonomical proximity, semantic similarity quantifies the likeness between words and plays a very important part in fields such as word sense disambiguation [1], word spelling correction [2], automatic language translation [3], document categorization or clustering [4], information extraction and retrieval [5–7], detection of redundancy, and ontology learning [8, 9]. It is worth mentioning that many applications of semantic similarity computation are discussed in the biomedical domain due to the availability of numerous medical ontologies and resources that organize medical concepts into hierarchies. For example, semantic similarity between concepts of ontologies such as the Gene Ontology [10, 11] has been computed with the aim of assessing protein functional similarity [6].

As mentioned above, semantic similarity is relevant to many research areas, and designing accurate computing methods is important for improving the performance of applications that depend on it. Essentially, semantic similarity measures assign a score to a pair of words using information from predefined knowledge sources (such as ontologies or domain corpora) containing the semantic evidence. Therefore, the accuracy of semantic similarity approaches relies on these knowledge sources. Many semantic similarity measures have been proposed so far, and they can be classified according to the knowledge they exploit and their theoretical principles. The measures can be roughly divided into the following categories: (1) measures based on the taxonomical structure of the ontology, which estimate semantic similarity by counting the number of nodes or edges separating two concepts [12–14]; although these methods are the most intuitive and the easiest to implement, they only work properly with consistent and rich ontologies; (2) measures utilizing the information content (IC) of concepts, which exploit the notion of IC, defined as a measure of the amount of semantic information a concept provides and computed by counting the occurrences of words in large corpora [15–17]; their shortcomings are that time-consuming analysis of corpora is necessary and that the IC values depend on the considered corpora; (3) measures using the amount of cooccurrences between word contexts, which construct context vectors of concepts by extracting contextual words (within a fixed window of context) from a corpus of textual documents including the evaluated concepts and compute the similarity of concepts as the cosine of the angle between their context vectors [11, 18, 19]. Similar to the methods mentioned in category (2), the availability and suitability of corpora affect the applicability of these measures.

Usually, these measures obtain good performance when large and general-purpose knowledge bases like WordNet [20] are employed. Some of them have been applied to the biomedical field using domain information extracted from clinical data or from relevant medical ontologies such as SNOMED CT (https://uts.nlm.nih.gov/home.html) [21, 22] or MeSH (http://www.nlm.nih.gov/mesh/meshhome.html) [22, 23] in the Unified Medical Language System (UMLS) [22]. Several authors have compared, analyzed, and evaluated these measures over certain datasets to determine their advantages and limitations with respect to the background knowledge source [24–26].

In this paper, firstly, we review and investigate different measures for semantic similarity computation. Then, we propose a new measure considering the multiple inheritance in ontologies and the common specificity feature of the evaluated concepts in order to obtain a more accurate similarity between concepts. Finally, we evaluate the proposed measure using two datasets of biomedical term pairs scored for similarity by human experts and exploiting SNOMED CT as the input ontology. We compare the correlation obtained by our measure with human scores against other measures. The experimental evaluations confirm the efficiency of the proposed measure.

The rest of the paper is organized as follows. Section 2 reviews the basic methods for semantic similarity, including the taxonomy-based measures, the IC-based measures, and the context vector measures. Section 3 presents the proposed measure for semantic similarity and its main advantages. Section 4 evaluates and compares the measure against the analyzed measures using SNOMED CT as the input ontology. Section 5 analyzes and discusses the experimental results. Section 6 concludes the paper.

2. Existing Measures for Computing Semantic Similarity

The existing measures for semantic similarity are discussed as follows.

2.1. Measures Based on the Taxonomical Structure

The simplest way of computing similarity for concepts is the measure based on path length developed by Rada et al. [13]. The measure quantifies the shortest distance between the two concept nodes $c_1$ and $c_2$:
$$\operatorname{dis}_{\text{Rada}}(c_1, c_2) = N_1 + N_2, \tag{1}$$
where $N_1$ and $N_2$ stand for the minimum number of is-a links from $c_1$ and $c_2$ to their LCS, respectively.

Wu and Palmer [14] introduced a measure based on path length that considers only the depth of the concepts in the hierarchy. It is based on the assumption that concepts lower down in the taxonomy are more similar than those higher up:
$$\operatorname{sim}_{\text{W\&P}}(c_1, c_2) = \frac{2 \times N_3}{N_1 + N_2 + 2 \times N_3}, \tag{2}$$
where $N_3$ is the number of is-a relations from the LCS of the evaluated concepts to the root of the ontology, and $N_1$ and $N_2$ are defined as in formula (1). The similarity value ranges from 1 (for identical concepts) to 0.

Leacock and Chodorow [12] proposed a measure in which the shortest path length between two concepts is scaled by twice the maximum depth of the hierarchy:
$$\operatorname{sim}_{\text{L\&C}}(c_1, c_2) = -\log\frac{\operatorname{minpath}(c_1, c_2)}{2 \times D}, \tag{3}$$
where $\operatorname{minpath}(c_1, c_2)$ is the shortest path length between the two concepts and $D$ is the maximum depth of the hierarchy.
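To make the three path-based measures above concrete, the following minimal Python sketch computes formulas (1)-(3) on a toy is-a hierarchy; the taxonomy, concept names, and helper functions are illustrative assumptions, not part of the original paper or of SNOMED CT.

```python
from collections import deque
import math

# child -> list of direct is-a parents (multiple inheritance allowed); toy data
PARENTS = {
    "root": [],
    "disorder": ["root"],
    "heart_disease": ["disorder"],
    "vascular_disease": ["disorder"],
    "myocardial_infarction": ["heart_disease", "vascular_disease"],
    "angina": ["heart_disease"],
}

def depth(c):
    """Minimum number of is-a links from c up to the root."""
    return 0 if not PARENTS[c] else 1 + min(depth(p) for p in PARENTS[c])

def superconcepts(c):
    """All superconcepts of c, including c itself."""
    seen, queue = {c}, deque([c])
    while queue:
        for p in PARENTS[queue.popleft()]:
            if p not in seen:
                seen.add(p)
                queue.append(p)
    return seen

def lcs(c1, c2):
    """Least common subsumer: the deepest shared superconcept."""
    return max(superconcepts(c1) & superconcepts(c2), key=depth)

def links_to(c, ancestor):
    """Minimum number of is-a links from c up to a given superconcept."""
    if c == ancestor:
        return 0
    return 1 + min(links_to(p, ancestor)
                   for p in PARENTS[c] if ancestor in superconcepts(p))

def dist_rada(c1, c2):
    # formula (1): N1 + N2, the shortest path through the LCS
    a = lcs(c1, c2)
    return links_to(c1, a) + links_to(c2, a)

def sim_wu_palmer(c1, c2):
    # formula (2): 2*N3 / (N1 + N2 + 2*N3)
    a = lcs(c1, c2)
    n1, n2, n3 = links_to(c1, a), links_to(c2, a), depth(a)
    return 2 * n3 / (n1 + n2 + 2 * n3)

def sim_leacock_chodorow(c1, c2, max_depth=3):
    # formula (3): -log(path / (2*D)); node-counting path (+1) avoids log(0)
    return -math.log((dist_rada(c1, c2) + 1) / (2 * max_depth))

print(dist_rada("myocardial_infarction", "angina"))                # 2
print(round(sim_wu_palmer("myocardial_infarction", "angina"), 3))  # 0.667
print(round(sim_leacock_chodorow("myocardial_infarction", "angina"), 3))
```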

Besides, there are other structure-based measures for semantic similarity. Li et al. [27] developed a measure combining the depth of the ontology and the shortest path:
$$\operatorname{sim}_{\text{Li}}(c_1, c_2) = e^{-\alpha l} \cdot \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}}, \tag{4}$$
where $h$ is the minimum depth of the LCS in the hierarchy, $l$ is the shortest path between the two concepts, and $\alpha$ and $\beta$ stand for the contribution of the shortest path and the depth, respectively. The optimal parameters reported for the measure were $\alpha = 0.2$ and $\beta = 0.6$.
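The Li et al. measure can be sketched in a few lines of Python; the default parameters mirror the values reported above, while the toy path and depth inputs are assumptions for illustration only.

```python
import math

def sim_li(shortest_path, lcs_depth, alpha=0.2, beta=0.6):
    # formula (4): exponential decay on the path length times a
    # tanh-shaped factor on the depth of the LCS
    path_term = math.exp(-alpha * shortest_path)
    depth_term = math.tanh(beta * lcs_depth)  # (e^{bh} - e^{-bh}) / (e^{bh} + e^{-bh})
    return path_term * depth_term

# assumed toy values: shortest path of 2 edges, LCS at depth 2
print(round(sim_li(2, 2), 3))
```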

Al-Mubaid and Nguyen [24] proposed a cluster-based measure combining path length and a common specificity feature that considers the depth of the LCS of the two concepts and the depth of the ontology. They defined the clusters as the branches of the ontology with respect to the root node. The common specificity of concepts $c_1$ and $c_2$ is defined as follows:
$$\operatorname{CSpec}(c_1, c_2) = D_c - \operatorname{depth}(\operatorname{LCS}(c_1, c_2)), \tag{5}$$
where $D_c$ is the depth of the cluster including concepts $c_1$ and $c_2$. Thus the feature determines the “common specificity” of two concepts in the cluster. The smaller the common specificity value of two concept nodes is, the more information they share and the more similar they are. The semantic distance measure is defined as follows:
$$\operatorname{SemDist}(c_1, c_2) = \log_2\bigl((\operatorname{Path} - 1)^{\alpha} \times \operatorname{CSpec}(c_1, c_2)^{\beta} + k\bigr), \tag{6}$$
where $\alpha$ and $\beta$ ($\alpha > 0$, $\beta > 0$) are contribution factors of the two features, $k$ is a constant which must be greater than or equal to 1 to ensure that the distance is positive and the combination is nonlinear, and $\operatorname{Path}$ is the length of the shortest path between the two concept nodes.
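A small sketch of our reading of the cluster-based distance in formulas (5)-(6) follows; the toy path, depth, and cluster values (and the unit contribution factors) are assumptions, not values from the cited paper.

```python
import math

def dist_al_mubaid(path, lcs_depth, cluster_depth, alpha=1.0, beta=1.0, k=1.0):
    # formulas (5)-(6): CSpec = cluster depth - depth of the LCS, then a
    # nonlinear (log) combination of path length and common specificity
    cspec = cluster_depth - lcs_depth
    return math.log2((path - 1) ** alpha * cspec ** beta + k)

# assumed toy values: path of 3 nodes, LCS at depth 2, cluster of depth 5
print(round(dist_al_mubaid(path=3, lcs_depth=2, cluster_depth=5), 3))  # ~2.807
```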

Batet et al. [25] proposed a similarity measure which takes into account all the superconcepts (subsumers of the evaluated terms) belonging to all the possible paths between the concept nodes and defined the measure as the ratio between the amount of nonshared information and the sum of shared and nonshared information. The similarity measure of two concepts considering multiple inheritance is defined as follows:
$$\operatorname{sim}_{\text{Batet}}(c_1, c_2) = -\log_2\frac{|T(c_1) \cup T(c_2)| - |T(c_1) \cap T(c_2)|}{|T(c_1) \cup T(c_2)|}, \tag{7}$$
where $T(c_i) = \{c_j \in C \mid c_j \text{ is a superconcept of } c_i\} \cup \{c_i\}$ and $C$ is the concept set.
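The Batet et al. measure of formula (7) reduces to simple set operations over superconcept sets; the sets below are assumed toy values, not concepts from any real ontology.

```python
import math

# superconcept sets T(c), including the concept itself (assumed toy values)
T = {
    "c1": {"c1", "heart_disease", "vascular_disease", "disorder", "root"},
    "c2": {"c2", "heart_disease", "disorder", "root"},
}

def sim_batet(c1, c2):
    # formula (7): -log2 of the ratio of non-shared to total superconcepts
    union, shared = T[c1] | T[c2], T[c1] & T[c2]
    return -math.log2((len(union) - len(shared)) / len(union))

print(round(sim_batet("c1", "c2"), 3))  # -log2(3/6) = 1.0
```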

From the measures introduced above, we can conclude that the LCS of two concepts plays a vital part in the computation of semantic similarity. For some of these measures, the ontology alone is enough as background knowledge (no corpus with domain data is needed), which also makes them depend heavily on the ontology itself. Besides, the minimum (shortest) path also plays an important part. In large ontologies such as SNOMED CT, one or both concepts may inherit from several is-a hierarchies. In this case, Al-Mubaid and Nguyen [24] used the minimum path to obtain the maximum semantic similarity. However, this may omit much other available taxonomical knowledge: multiple possible paths can exist between any two concepts, but only the shortest one is selected among them. To solve this problem, we can take all the superconcepts into account and obtain more semantic evidence in the case of multiple inheritance, which makes the measure for semantic similarity more accurate.

From a domain-independent point of view, most path-based measures rely on large and general-purpose taxonomies to achieve high accuracy. Researchers usually choose WordNet to apply these measures because of its rich and well-developed structure. However, the coverage of biomedical terms in WordNet is so limited that the accuracy of similarity assessments for medical terms is poor [11, 28]. So Pedersen et al. [11], Al-Mubaid and Nguyen [24], and Batet et al. [25] adapted these measures to the biomedical domain by exploiting SNOMED CT as the input ontology.

2.2. Measures Based on Information Content

These measures evaluate the similarity of concepts depending on the amount of information shared between them. According to information theory, concepts are evaluated by their IC, which quantifies the amount of information that a given concept expresses when it appears in a taxonomy. Resnik [17] stated that the semantic similarity of two concepts depends on the amount of information they share.

In Resnik’s seminal work [17], IC is computed from $p(c)$, the probability of occurrence of a concept $c$ in a corpus:
$$IC(c) = -\log p(c). \tag{8}$$

Usually, in a general context, the estimation of $p(c)$ is severely hampered by textual ambiguity and data sparseness problems [29]. In fact, tagged corpora with domain information such as biomedicine are limited. Some authors [15, 17] estimated concept appearance from SemCor [30], a corpus semantically tagged with WordNet senses and consisting of 100 passages from the Brown Corpus. Because the manual tagging scheme is based on the fine-grained structure of word senses covered by WordNet, the estimation is accurate, but the coverage is limited: the corpus covers less than 13% of the word senses available in WordNet.

To guarantee the consistency of the similarity computation, coherence with the taxonomical structure should be taken into account. Hence, to compute $p(c)$, both the explicit appearances of a concept and the appearances of its specializations must be considered. Thus, Resnik [17] computed $p(c)$ as shown in formula (9):
$$p(c) = \frac{\sum_{w \in W(c)} \operatorname{count}(w)}{N}, \tag{9}$$
where $W(c)$ is the set of words subsumed by concept $c$ and $N$ is the total number of observed corpus terms, excluding those that are not subsumed by any WordNet class.

Resnik [17] noted that concepts at lower levels of a taxonomy are usually more specialized, and that the information shared by two concepts is represented by their LCS. Hence, the larger the IC of the subsumer of two concepts is, the more similar the concepts are. Based on this premise, Resnik measures the similarity as the IC of the LCS of the concepts:
$$\operatorname{sim}_{\text{Resnik}}(c_1, c_2) = IC(\operatorname{LCS}(c_1, c_2)). \tag{10}$$
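A compact sketch of corpus-based IC (formulas (8)-(9)) and the Resnik similarity (formula (10)) follows; the corpus counts and the toy taxonomy are hypothetical and stand in for the corpus statistics the measures would normally be computed from.

```python
import math

# word occurrence counts in a tiny, hypothetical corpus
COUNTS = {"myocardial_infarction": 2, "angina": 3, "heart_disease": 5,
          "disorder": 10, "root": 0}
# concept -> concepts it subsumes, including itself (assumed toy taxonomy)
SUBSUMED = {
    "root": {"root", "disorder", "heart_disease", "myocardial_infarction", "angina"},
    "disorder": {"disorder", "heart_disease", "myocardial_infarction", "angina"},
    "heart_disease": {"heart_disease", "myocardial_infarction", "angina"},
    "myocardial_infarction": {"myocardial_infarction"},
    "angina": {"angina"},
}
TOTAL = sum(COUNTS.values())

def ic(c):
    # formulas (8)-(9): IC(c) = -log p(c), where p(c) counts the concept
    # itself and all of its specialisations
    freq = sum(COUNTS[w] for w in SUBSUMED[c])
    return -math.log(freq / TOTAL)

# formula (10): sim_Resnik(c1, c2) = IC(LCS(c1, c2)); here the LCS of the
# two toy leaf concepts is "heart_disease"
print(round(ic("heart_disease"), 3))
```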

To tackle the shortcoming of Resnik’s measure that any two pairs of concepts with the same LCS receive the same similarity value, Jiang and Conrath [15] and Lin [16] improved Resnik’s measure.

Lin measured the similarity as the ratio between the IC of the LCS of the two concepts and the sum of the IC of the two concepts:
$$\operatorname{sim}_{\text{Lin}}(c_1, c_2) = \frac{2 \times IC(\operatorname{LCS}(c_1, c_2))}{IC(c_1) + IC(c_2)}. \tag{11}$$

Jiang and Conrath calculated the distance between concepts, which is the inverse of their similarity, with formula (12):
$$\operatorname{dis}_{\text{J\&C}}(c_1, c_2) = IC(c_1) + IC(c_2) - 2 \times IC(\operatorname{LCS}(c_1, c_2)). \tag{12}$$
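Given precomputed IC values, the Lin and Jiang-Conrath measures of formulas (11) and (12) are one-liners; the IC values below are hypothetical placeholders rather than values derived from any corpus.

```python
def sim_lin(ic_c1, ic_c2, ic_lcs):
    # formula (11): ratio of shared IC to the sum of the concepts' IC
    return 2 * ic_lcs / (ic_c1 + ic_c2)

def dist_jiang_conrath(ic_c1, ic_c2, ic_lcs):
    # formula (12): distance grows as the concepts' IC exceeds the shared IC
    return ic_c1 + ic_c2 - 2 * ic_lcs

# hypothetical IC values for two concepts and their LCS
print(round(sim_lin(2.3, 1.9, 0.7), 3))             # 0.333
print(round(dist_jiang_conrath(2.3, 1.9, 0.7), 3))  # 2.8
```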

As mentioned above, IC-based similarity assessments do not achieve their best performance when general-purpose resources such as WordNet or SemCor are used, due to their limited coverage of biomedical terms. For this reason, Pedersen et al. [11] applied these measures to the biomedical domain by exploiting SNOMED CT and computing the IC of concepts using the Mayo Clinic Corpus of Clinical Notes as a domain corpus.

2.3. Context Vector Measures Computing Semantic Relatedness

Patwardhan and Pedersen [19] proposed a measure of semantic relatedness that represents a concept with a context vector. This approach is more flexible than similarity measures, since the information source for the context vectors is a raw corpus of text and the concepts do not need to be connected by a path of relations in an ontology. They built gloss vectors corresponding to each concept in WordNet using cooccurrence information along with the WordNet definitions. In their experiments, the glosses contain content-rich terms and therefore distinguish concepts much better than text drawn from a more generic corpus. The WordNet glosses can be viewed as a corpus of contexts consisting of about 1.4 million words. The gloss vector measure obtained the highest correlation with respect to human judgment on different benchmarks [19].

Pedersen et al. [11] constructed cooccurrence vectors that represent the contextual profile of concepts and applied the measure to the biomedical field. In their study, they created context vectors corresponding to each concept in SNOMED CT using a set of word vectors. The word vectors for all words occurring in the clinical notes were produced from the Mayo Clinic Corpus of Clinical Notes. In these vectors, the window size of the context is one line of text.

Then, the semantic relatedness of two concepts $c_1$ and $c_2$ is computed as the cosine of the angle between their context vectors with formula (13):
$$\operatorname{rel}(c_1, c_2) = \frac{\vec{v}_1 \cdot \vec{v}_2}{|\vec{v}_1| \times |\vec{v}_2|}, \tag{13}$$
where $\vec{v}_1$ and $\vec{v}_2$ are the context vectors corresponding to $c_1$ and $c_2$, respectively. Note that the context vector measure performs differently depending on the choice of clinical notes in the experiments. In other words, this measure depends heavily on the availability and quality of the corpora.
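A minimal sketch of the cosine relatedness of formula (13) follows; the two cooccurrence vectors are toy values rather than vectors derived from any clinical corpus.

```python
import math

def relatedness(v1, v2):
    # formula (13): cosine of the angle between the two context vectors
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm

# toy cooccurrence counts over a shared 4-word context vocabulary
v_c1 = [3, 0, 1, 2]
v_c2 = [1, 1, 0, 2]
print(round(relatedness(v_c1, v_c2), 3))
```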

3. Proposed Measure for Computing the Semantic Similarity

According to the analysis of the similarity measures above, measures based on IC or context vectors depend on corpora. As a matter of fact, corpora consist of unstructured or semistructured textual data and need to be preprocessed to obtain enough semantic information, which brings a heavy computational burden. In the biomedical domain, it is very difficult to obtain enough clinical data due to the sensitivity of patient data, which may cause data sparseness problems. For these reasons, the applicability of these measures may be hampered by the availability of suitable data. On the contrary, path-based measures only use the structure of ontologies and do not require preprocessing of text data, which gives them low computational complexity. But every coin has two sides: path-based measures are very simple, and they cannot capture enough semantic evidence to perform better than other measures such as IC-based measures and context vector relatedness measures.

From the path-based measures, we know that when the path length between each of the concepts and their LCS is calculated, only the minimum path length is found and kept for use, even though all paths could be computed. However, in the taxonomy, one or both concepts may inherit from several is-a hierarchies. For this reason, there exist cases of multiple inheritance [25, 31], especially in large and complex taxonomies (e.g., SNOMED CT) including thousands of interrelated concepts in the hierarchies. For example, in Figure 1, if we choose the minimum distance among all the paths between the two evaluated concepts, the contributions of their noncommon superconcepts in the taxonomy are omitted, which affects the accuracy of the measures.

In this paper, we significantly improve the measure of Al-Mubaid and Nguyen [24] and propose a new modified measure for semantic similarity combining the superconcepts of the evaluated concepts with their common specificity feature, which can capture more semantic evidence. The proposed measure can achieve better performance than other structure-based measures while keeping their simplicity.

To take into account the contributions of all noncommon superconcepts of the evaluated concepts, we consider the concepts themselves and all their noncommon superconcepts, instead of the minimum path length, to capture more semantic evidence for similarity. Besides, our measure also considers the common specificity feature of the concept nodes, scaled by the depth of their LCS and the depth of the ontology. Thus, we combine the noncommon superconcepts with the common specificity in order to obtain more semantic information for computing the similarity between two concepts.

Let $c_i$ stand for the $i$th concept of an ontology. Then $T(c_i)$ is defined as the set of all superconcepts of $c_i$, including $c_i$ itself. Thus the number of noncommon superconcepts of concepts $c_1$ and $c_2$ can be defined as follows:
$$\operatorname{NonComSub}(c_1, c_2) = |T(c_1) \cup T(c_2)| - |T(c_1) \cap T(c_2)|. \tag{14}$$

Here the NonComSub value can be an indication of the path length between the two concepts. For example, the number of noncommon superconcepts of the two evaluated concepts in Figure 1 is 3.

On the other hand, the common specificity feature is defined as follows [24]:
$$\operatorname{ComSpec}(c_1, c_2) = D - \operatorname{depth}(\operatorname{LCS}(c_1, c_2)), \tag{15}$$
where $D$ is the depth of the ontology. The ComSpec feature determines the common specificity of the two evaluated concepts. The smaller the ComSpec value of two concepts, the more information they share, and thus the more similar they are.

Then, we use a logarithm function of NonComSub and ComSpec to represent the semantic distance, which is inverse to the semantic similarity. Therefore, the semantic distance between concepts $c_1$ and $c_2$ is defined as follows:
$$\operatorname{SemDist}(c_1, c_2) = \log_2\bigl(\operatorname{NonComSub}(c_1, c_2) \times \operatorname{ComSpec}(c_1, c_2) + 1\bigr). \tag{16}$$

It is worth mentioning that any concept can be compared with itself. In this case, the semantic distance is 0. This measure can be applied to all concepts and does not need to check whether or not the two concepts compared are distinct.
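The following sketch illustrates our reading of the proposed measure: NonComSub and ComSpec are computed from superconcept sets and depths and then combined through the logarithm of formula (16). The toy taxonomy, the depth values, and the exact combination inside the logarithm are assumptions for illustration; as required above, the sketch returns a distance of 0 when a concept is compared with itself.

```python
import math

# superconcept sets T(c_i), including the concept itself (toy values)
T = {
    "c1": {"c1", "heart_disease", "vascular_disease", "disorder", "root"},
    "c2": {"c2", "heart_disease", "disorder", "root"},
}
# assumed concept depths and overall ontology depth
DEPTH = {"c1": 3, "c2": 3, "heart_disease": 2, "vascular_disease": 2,
         "disorder": 1, "root": 0}
ONTOLOGY_DEPTH = 3

def non_com_sub(c1, c2):
    # formula (14): number of non-shared superconcepts
    union, shared = T[c1] | T[c2], T[c1] & T[c2]
    return len(union) - len(shared)

def com_spec(c1, c2):
    # formula (15): ontology depth minus the depth of the deepest shared
    # superconcept (the LCS)
    lcs = max(T[c1] & T[c2], key=DEPTH.get)
    return ONTOLOGY_DEPTH - DEPTH[lcs]

def sem_dist(c1, c2):
    # formula (16), as reconstructed here: a nonlinear (log) combination
    return math.log2(non_com_sub(c1, c2) * com_spec(c1, c2) + 1)

print(sem_dist("c1", "c1"))           # 0.0 for identical concepts
print(round(sem_dist("c1", "c2"), 3)) # 2.0 for the toy pair
```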

4. Evaluation

Measures of semantic similarity are usually evaluated by comparing the similarity values computed by the measures against human judgments using a correlation coefficient. The higher the correlation with the human experts’ similarity scores is, the better the measure is.

In the biomedical field, there are no standard human-rated datasets for semantic similarity like the manually rated concept sets created by Rubenstein and Goodenough [32] and Miller and Charles [33]. Pedersen et al. [11] stated that it is necessary to choose sets of words manually scored by experts for the evaluation of concept semantic similarity measures in biomedicine. In their research, they created a set of 30 concept pairs regarding medical disorders with the help of Mayo Clinic experts. The set was annotated by three physicians and nine medical coders; all three physicians are specialists in the area of rheumatology. Finally, after a series of processing steps, the averaged similarity values of the 30 concept pairs were normalized on a scale between 1 and 4. The average correlation between physicians is 0.68, while the average correlation between medical coders is 0.78. The 30 medical term pairs with the averaged experts’ similarity scores are shown in Table 1.

Pedersen et al. [11] used this dataset to evaluate the path-based and the IC-based measures, exploiting the SNOMED CT taxonomy as the domain ontology and the Mayo Clinical Corpus and Thesaurus as corpora, respectively. Medical coders had a better understanding of the notion of similarity because they were pretrained during the construction of the original dataset. So medical coders’ ratings seem to reproduce better the concept of (taxonomic) similarity, whereas physicians’ ratings seem to represent a more general concept of (taxonomic and nontaxonomic) relatedness. Al-Mubaid and Nguyen [24] compared their similarity values only against the coders’ ratings, while Batet et al. [25] compared their similarity values against the scores of both groups.

In this paper, in order to compare the performance of our measure with other measures in the biomedical domain objectively, we used the set of 30 concept pairs from [11] (called Dataset 1) and the set of 36 biomedical term pairs from [6] (called Dataset 2) as experimental datasets. The 36 medical term pairs in Dataset 2 with the averaged human similarity scores are shown in Table 2, in which the human scores are the averaged ratings of reliable doctors. We use the UMLS Knowledge Source (UMLSKS) browser (https://uts.nlm.nih.gov/home.html) for SNOMED CT to get information on the terms in the two datasets. The comparative results using Dataset 1 with respect to both physicians and medical coders and the results using Dataset 2 with respect to human scores are shown in Table 3. Note that our measure computes semantic distance while human ratings represent similarity, so a linear transformation is performed.
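The evaluation step can be sketched as follows: distances are linearly transformed into similarities and correlated with the human ratings; scipy.stats.pearsonr also returns the p value discussed in Section 5. All numbers below are made up for illustration and are not the values from Tables 1-3.

```python
from scipy.stats import pearsonr

# made-up numbers purely for illustration: averaged expert similarity ratings
# and the distances produced by a measure for the same five concept pairs
human_scores = [3.8, 3.1, 2.6, 2.0, 1.3]
measure_dist = [0.4, 0.9, 1.5, 2.1, 3.0]

# distance is inverse to similarity; negation is a linear transformation and
# leaves the magnitude of the Pearson correlation (and its p value) unchanged
measure_sim = [-d for d in measure_dist]
r, p_value = pearsonr(human_scores, measure_sim)
print(round(r, 3), p_value)
```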

5. Discussion

For Dataset 1, 29 out of 30 concept pairs are found in SNOMED CT, and the average correlation between physicians is 0.68 while the average correlation between medical coders is 0.78. The evaluation results show that the path-based similarity measures obtain correlations lower than 0.36 and 0.66 for physicians and coders, respectively, which indicates that the accuracy of path-based measures is limited. These measures utilize the minimum path length without considering multiple inheritance, which causes much useful semantic evidence to be ignored. The IC-based measures generally improve on most of the path-length-based measures, with the highest values of 0.75 for coders and 0.60 for physicians. Their lowest correlation is 0.62 for coders, which outperforms most structure-based measures with the exception of the measures of Al-Mubaid and Nguyen (0.66) and Batet et al. (0.79). However, the coverage of biomedical terms in domain corpora is limited, and due to the high dependency on the availability of domain data, the accuracy of the IC-based similarity approaches is hampered.

The measure of Batet et al. shows good performance, considering both common and noncommon information of the evaluated concepts. However, when using it to compute semantic similarity, we should check whether the two concepts are the same; otherwise we would get an infinitely large similarity value.

With respect to the context vector measure, there are four cases with different corpus sizes and corpus selections. From the results, we can see that the best correlations are 0.84 for the physicians and 0.75 for the coders, obtained with 1 million notes involving only the diagnostic section. That is, the correlation value of the context vector measure is higher than ours only if 1 million notes are used to create the vectors. Moreover, the data corpus used to create the vectors was constructed by physicians of the same clinic, so there were some limitations in the way knowledge was interpreted and formalized. Thus the context vector measure strongly depends on the amount and quality of the background corpus, and it shows good performance only in this particular situation and domain. If we want to obtain a higher correlation, we must carefully choose the size and quality of the information sources. The results show that the accuracy of the approach decreases noticeably for other corpus configurations. For example, when 100,000 notes involving all sections are used, the correlations drop to 0.41 for physicians and 0.53 for coders.

The correlations obtained by our measure are 0.67 for physicians and 0.77 for coders, while the average correlation between physicians is 0.68 and the average correlation between medical coders is 0.78. The proposed measure clearly improves on the measure of Al-Mubaid and Nguyen (0.77 versus 0.66). The correlation for physicians is higher than that of the other measures except for the context vector measure, and the correlation for coders is higher than that of the other measures except for the measure of Batet et al. (0.79). The correlation is 0.75 when both sets of experts are considered, which is rather high among all the measures mentioned in this paper. Our measure obtains higher correlation values than all the IC-based measures shown in Table 3.

From Table 3, we can see that the correlation values for coders are always higher than those for physicians, except for the context vector measure, which suggests that the medical coders’ ratings, obtained with more pretraining, are more reliable than the physicians’ ratings. As a result, many similarity measures are compared against the similarity scores of the medical coders to obtain better correlations.

For Dataset 2, we can find 34 out of 36 concept pairs in SNOMED CT. We compare our measure with the other structure-based similarity measures with respect to the human scores. The correlation of the Al-Mubaid and Nguyen measure is 0.735, which is higher than that of the other measures. However, the correlation value obtained by our measure is 0.774. The comparative result (0.774 versus 0.735) shows that the proposed measure outperforms the other measures shown in Table 3.

The experimental results on the two datasets show that our measure performs better than almost all the similarity measures, including the path-based measures and the IC-based measures. We make significant improvements to the measure of Al-Mubaid and Nguyen in the case of the multiple inheritance that exists in SNOMED CT.

In addition, we adopt the p value of the Pearson correlation coefficient as a measure of the significance of the relation between the human ratings of similarity and the computed values [30]. The smaller the p value is, the more significant the relation is. In all cases, the p values for our results on both Dataset 1 and Dataset 2 are less than 0.001 (a 0.1% chance), which shows that the correlation values obtained by our measure are statistically significant.

By all accounts, our measure is based only on the ontology structure, and it provides comparatively high accuracy without any dependency on data preprocessing and availability. Meanwhile, it keeps the simplicity of structure-based measures and overcomes their shortcoming that the multiple-inheritance phenomenon is not fully considered. The measure nonlinearly combines the evaluated concepts’ noncommon information, considering the different taxonomical hierarchies, with their common specificity feature, which yields an obvious improvement over other measures. Note that multiple inheritance, in which concepts may be subsumed by several superconcepts, exists in the largest and most widely used knowledge sources such as SNOMED CT, MeSH in the UMLS, and WordNet. From the experiments, our measure performs very well on ontologies with multiple inheritance. When the input ontology does not have multiple inheritance of concepts, the measure remains accurate. Therefore, the accuracy and applicability of our approach are significant, especially in the biomedical domain.

6. Conclusions

In this paper, we propose a similarity measure that nonlinearly combines the evaluated concepts’ noncommon information, considering the different taxonomical hierarchies, with their common specificity feature. The measure keeps its simplicity and requires no parameter tuning. In the experiments, we use SNOMED CT, a large and detailed ontology with multiple inheritance between concepts, as the input ontology. The experimental results show that our measure obtains rather high correlation values with respect to both physicians and coders and that it outperforms most approaches based on taxonomical structure, IC, and context vectors.

Recently, measuring the semantic similarity of concepts across multiple ontologies has become increasingly important. As future work, we will extend the measure to multiple ontologies such as SNOMED CT, MeSH, and WordNet.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This paper was sponsored by the Jilin Provincial Science and Technology Department of China (Grant nos. 20130206041GX and 20120302), the Jilin Province Development and Reform Committee of China (Grant no. 2013C036-5, 779), the Changchun Science and Technology Bureau of China (Grant no. 14KT009), and the Doctoral Program of Higher Education of China (no. 20110043110011). The authors appreciate the help of Professor David Sánchez, who provided them with useful information and friendly advice.