Abstract

The integration of ontologies builds knowledge structures that bring new understanding of existing terminologies and their associations. With the steady increase in the number of ontologies, automatic integration of ontologies is preferable to manual solutions in many applications. However, the available work on ontology integration is largely heuristic, without guarantees on the quality of the integration results. In this work, we focus on the integration of ontologies with hierarchical structures. We identify optimal structures in this problem and propose optimal and efficient approximation algorithms for integrating a pair of ontologies. Furthermore, we extend the basic problem to address the integration of a large number of ontologies, and correspondingly we propose an efficient approximation algorithm for integrating multiple ontologies. The empirical study on both real ontologies and synthetic data demonstrates the effectiveness of our proposed approaches. In addition, the results of integrating the Gene Ontology and the National Drug File Reference Terminology suggest that our method provides a novel way to perform association studies between biomedical terms.

1. Introduction

In recent years, ontologies have become increasingly important in knowledge engineering. Generally speaking, an ontology is a collection of concepts and their relations, with wide applications in computer science and the life sciences. For example, in computer science, the semantic web uses the Web Ontology Language (OWL) to represent knowledge bases [1]. In the life sciences, numerous important data structures and tools are built on ontologies. One of the most popular ontologies is the Gene Ontology, which researchers frequently use to measure the enrichment of gene clusters and to identify potential biomarkers. Two of the best-known ontology repositories in the biomedical field are the Unified Medical Language System (UMLS) [2] and the NCBO BioPortal (https://bioportal.bioontology.org/); the former contains more than 100 ontology datasets and the latter more than 300.

Although ontologies can be modeled as directed graphs, many ontologies are in fact hierarchical trees or have hierarchical tree-like structures. On the BioPortal website, users can find basic hierarchical properties of an ontology, such as its maximum depth and maximum number of children. In the UMLS, the hierarchical structure of an ontology is documented in the "MRHIER.RRF" file, with each line being a path from a term to its root. We can build a hierarchical tree from these paths by merging the common nodes starting from the root. Because the hierarchical structures of some ontologies are in fact directed acyclic graphs, the resulting hierarchical tree may contain some duplicated concepts; to simplify our study, we treat them as independent concepts in this work.
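As an illustration, the following C++ sketch builds such a tree from root-to-leaf paths. It is a minimal sketch under the assumption that each input line has already been reduced to a dot-separated root-to-leaf sequence of concept names (extracting and reversing these paths from the actual MRHIER.RRF record layout is assumed to have been done upstream); all names are illustrative.

#include <iostream>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

struct Node {
    std::string concept;
    std::vector<std::unique_ptr<Node>> children;
    Node* child(const std::string& c) {            // find or create a child with this name
        for (auto& ch : children)
            if (ch->concept == c) return ch.get();
        children.push_back(std::make_unique<Node>());
        children.back()->concept = c;
        return children.back().get();
    }
};

void insertPath(Node& root, const std::string& line) {
    std::istringstream ss(line);
    std::string concept;
    Node* cur = &root;
    while (std::getline(ss, concept, '.'))          // walk/extend the path; shared prefixes merge
        cur = cur->child(concept);
}

int main() {
    Node root;
    root.concept = "<virtual root>";
    std::string line;                               // e.g. "disease.infectious disease.flu"
    while (std::getline(std::cin, line))
        if (!line.empty()) insertPath(root, line);
    // traverse `root` here to emit or analyze the hierarchy
    return 0;
}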

An important knowledge discovery task is to identify knowledge associations. In the life sciences, this task includes finding the associations between diseases and genes [3, 4] and between phenotypes and genotypes [5]. With ontologies available, such a task has been extended from identifying associations between terms to identifying associations between ontologies as a whole. For the latter, we should consider not only the term associations but also the term associations in the context of their ontological structures. For example, if the parent and children of a term a from ontology T1 are similar to those of a term b from ontology T2, then b may be a good choice to be associated with a.

Early studies on ontology integration relied on domain experts to manually set up the integration rules [6]. However, this approach cannot keep pace with the ever-increasing volume of ontology datasets, and automatic ontology integration methods have been developed to address the issue. However, as we will see in the discussion of Section 1.3, these methods are often heuristic or have not demonstrated effectiveness in integrating large volumes of ontology datasets. Thus, our goal is to develop an ontology integration method that delivers optimal or close-to-optimal solutions for integrating a large volume of ontology datasets (particularly from the biomedical domain). As discussed above, we focus on ontologies with hierarchical tree-like structures, which are often available in the biomedical domain. In addition, we assume that a measurement of ontology term closeness is available. This assumption is reasonable because many applications are able to identify ontology term similarities via additional data sources; for example, a closeness matrix between two sets of biomedical terms can be generated by UMLS knowledge discovery methods such as KDLS [7]. Our problem can be formally described as follows.

1.1. Problem Formulation

The basic ontology integration problem in our work can be formulated as follows. Given ontology tree structures T1 and T2 and a closeness matrix M, how can we efficiently generate an integrated ontology tree structure T meeting the following two basic criteria?

(1) For any two vertices u and v in T1 (or T2), their lowest common ancestor LCA_T1(u, v) (or LCA_T2(u, v)) is contained by the node LCA_T(u, v) of T.
(2) It holds that H(T) = Σ_t M(t), summed over the nodes t of T, is maximized. Here M(t) is the entry value in the closeness matrix for the corresponding two vertices (one from T1 and the other from T2) contained in the node t; M(t) = 0 if t, a node of T, contains only one vertex from T1 or T2.

We name H the cohesion function of the integrated ontology, and its value is the overall cohesion score of integrating T1 and T2 into T. Correspondingly, we define the function Φ(T1, T2) as the maximum cohesion function for integrating the ontologies T1 and T2, and its value is the maximum overall cohesion score (or, simply, the maximum cohesion score). In a hierarchical ontology, the common part of any two terms can be described by their lowest common ancestor. For example, in Figure 1, flu and hepatitis B are both infectious diseases, and flu and cancer are both diseases. Thus, we use Criterion (1) to ensure that the basic logic of an ontology is preserved after integration.

An example of integrating two hypothetical ontologies that satisfies Criterion (1) is given in Figure 1. To facilitate the understanding of our problem definition, we also provide another example of integrating two ontologies in Figure 2. As we can see in Figure 2, the lowest common ancestor of the nodes containing thyroid cancer and infectious disease is the node containing cancer instead of the node containing disease; this integration therefore violates Criterion (1). In fact, we can easily see that there are multiple pairs of nodes with incorrect lowest common ancestors in Figure 2.

In Section 2.2, we will extend the basic problem definition to handle the integration of multiple (>2) ontologies. The two basic criteria will be extended correspondingly.

1.2. Main Contributions

We made the following major contributions in this work.
(i) We proposed a novel ontology integration problem that optimizes the cohesion function. We identified optimal structures in this problem and developed both optimal and efficient approximation solutions for it.
(ii) We extended the basic problem to handle the integration of a large number of ontologies and developed both a greedy and a fast approximation algorithm for the extended problem.
(iii) We studied the proposed algorithms on both real and synthetic datasets and confirmed their effectiveness in integrating large volumes of ontology data.

1.3. Related Work

Automatic ontology generation and integration are desirable in many applications and have been studied over the past decade. Available methods for automatic ontology generation produce ontologies from a given type of data, such as gene networks [8], textual data [9], dictionaries [10], and schemata [11, 12]; however, they do not address the integration of different ontologies, which can bring innovative results in annotation/knowledge reuse and association studies. To address this issue, a number of studies have focused on ontology integration [13, 14] and its applications in the medical domain [15]. The ontology integration methods used in these works can be generally classified into three categories.

Manual or Semiautomatic Setups. In [6], the authors presented a methodology for ontology integration through custom-tailored integration operations, including 39 algebraically specified basic operations and 12 nonbasic operations derived from them. The authors identified a set of criteria, such as modularizing, specializing, and diversifying each hierarchy, for guiding the knowledge integration. In [16], the authors designed a semiautomatic approach for ontology merging, in which ontology developers are assisted by the system and guided to the tasks needing their intervention.

Using Machine Learning Methods. Reference [17] describes GLUE, an ontology mapping system that uses machine learning techniques to build ontology mappings. Specifically, GLUE uses multiple learning strategies, each handling a different type of information from the ontologies. The authors demonstrate that GLUE works effectively on taxonomy ontologies. Similarly, [18] also used multistrategy learning for matching pair identification. However, the ontologies used in the experiments of [18] contain fewer than 10 nodes. Although [17] studied the integration of larger ontologies, those ontologies contain only around 34 to 176 nodes, much smaller than many ontologies used in the biomedical field.

Using Heuristic Approaches. Many automatic ontology integration methods [19, 20] fall into this category; they perform ontology integration using heuristic approaches from different perspectives. For example, [20] uses heuristic policies for selecting axioms and candidate merging pairs. From a quite different angle, [19] uses view-based queries to guide the ontology integration.

These methods have a few major weaknesses: the lack of a systematic measurement to quantify the goodness of an ontology integration; being generally heuristic, with no theoretical results showing that the proposed integration approach is globally optimal or close to optimal; and not being designed for integrating large volumes of ontologies. These weaknesses motivated us to develop efficient and near-optimal solutions for integrating large ontology datasets.

2. Methods

2.1. Integrating a Pair of Ontologies

In this section, we focus on the basic problem of integrating two ontologies as formulated in Section 1.1. We prove optimal structures in the problem and propose an optimal solution as well as an efficient approximation solution. These solutions are also the basis for solving the problem of integrating a large number of ontologies, as described in Section 2.2.

2.1.1. Brute-Force and Heuristic Solutions

Given Criteria (1) and (2), a brute-force approach would pick the best solution among all solutions that start with an integration involving at least one of the roots of the two ontology trees and iteratively integrate their descendants. Considering an extreme case where each ontology tree is a path of n vertices, we conclude that such a brute-force approach needs to pick the best solution from an exponential number of candidates. The brute-force approach is clearly not acceptable for integrating large ontologies and may not even work for ontologies with only a few dozen vertices.

A heuristic solution can be developed by following an idea similar to the above brute-force approach. However, instead of trying all possibilities, the heuristic solution greedily merges vertices following the topological order. When selecting a matching vertex for a vertex u from ontology T1, the heuristic approach greedily selects a vertex from the allowable candidates in T2 and iteratively applies such selections to the descendants of u. According to Criterion (1), if u is associated with v, then none of u's descendants are allowed to be associated with vertices other than v's descendants. In addition, if one of u's children is associated with v's child v′ or one of its descendants, then none of u's other children are allowed to be associated with v′ or its descendants any more. Given this, a greedy choice may very easily end up in a local optimum by choosing the best matching vertex at one step while denying integration opportunities of other vertices that could lead to a better final solution.

It is easy to see that the deeper a vertex chosen for integration is, the more integration opportunities are lost. To alleviate this, we propose a greedy approach that penalizes the relative depth d of a chosen vertex with regard to the allowable vertex closest to the root. That is, given a vertex u from T1, a vertex w from T2 is chosen when M(u, w) − λ · d(w) is maximized, where λ ≥ 0 is a weighting parameter. When λ = 0, the depth information does not take effect, and as λ → ∞, each vertex will only be associated with an allowable vertex closest to the root.

Algorithm 1 gives the pseudocode of the heuristic integration. It starts by integrating the virtual roots of the two ontologies. After that, the integration is carried out iteratively from top to bottom, following Criterion (1) and the heuristic strategy described above. In the empirical study, we will see that the heuristic algorithm works better when the depth information is considered. However, in terms of the overall cohesion score, it is no match for our optimal and approximation solutions described below.

push (r1, r2) into queue Q; {Integrating the virtual roots of the two ontologies}
while Q ≠ ∅ do
  (u, v) ← pop(Q);
  for all children u′ of u on T1 do
    best ← −∞;
    w* ← NULL;
    v* ← NULL;
    for all children v′ of v on T2 do
      if v′ is chosen then
        continue; {The subtree rooted at v′ can only be chosen once for integration, as illustrated in Figure 3.}
      else
        identify the vertex w in the subtree rooted at v′ such that M(u′, w) − λ · d(w) is maximized; {d(w) is the relative depth of w below v′}
        if M(u′, w) − λ · d(w) > best then
          best ← M(u′, w) − λ · d(w);
          w* ← w;
          v* ← v′;
        end if
      end if
    end for
    if w* = NULL then
      break;
    else
      save the merge pair (u′, w*) in R;
      mark v* as chosen;
      push (u′, w*) into queue Q;
    end if
  end for
end while
return R;
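As an illustration of the selection rule, the following C++ sketch scores the vertices of one allowable subtree of T2 for a vertex u of T1 and returns the depth-penalized best match. It is a toy sketch under assumed data structures (a children-list tree and a dense closeness matrix M), not the implementation used in our experiments.

#include <cstdio>
#include <vector>

// A vertex of T2 with its children (illustrative data structure).
struct Vertex { int id; std::vector<Vertex*> children; };

// Returns the id of the vertex w in the subtree rooted at `root` (an
// allowable vertex of T2) that maximizes M[u][w] - lambda * d(w),
// where d(w) is the relative depth of w below `root`.
int bestMatch(const std::vector<std::vector<double>>& M, int u,
              Vertex& root, double lambda, double& bestScore) {
    int best = -1;
    bestScore = -1e300;
    struct Item { Vertex* v; int depth; };
    std::vector<Item> stack{{&root, 0}};
    while (!stack.empty()) {                          // DFS carrying the relative depth
        Item it = stack.back();
        stack.pop_back();
        double score = M[u][it.v->id] - lambda * it.depth;
        if (score > bestScore) { bestScore = score; best = it.v->id; }
        for (Vertex* c : it.v->children) stack.push_back({c, it.depth + 1});
    }
    return best;  // lambda = 0 ignores depth; large lambda favors shallow vertices
}

int main() {
    Vertex c1{1, {}}, c2{2, {}}, root{0, {&c1, &c2}};        // tiny subtree of T2
    std::vector<std::vector<double>> M = {{0.2, 0.9, 0.3}};  // closeness of u = 0 to T2's vertices
    double s;
    int w = bestMatch(M, 0, root, 0.5, s);
    std::printf("best match: vertex %d (score %.2f)\n", w, s);  // vertex 1: 0.9 - 0.5 = 0.40
}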

2.1.2. Optimal and Approximation Solutions

By dividing the integration of two trees into node merging and subtree integrations, we have identified optimal structures in the basic problem, as stated by Lemmas 1 and 2. These optimal structures make it possible for us to develop efficient algorithms (Algorithms 2 and 3) that achieve optimal or approximation solutions. In the following, we first describe the two important lemmas suggesting the optimal structures and their proofs before describing our proposed algorithms.

Sort the vertices u1, …, u|T1| of T1 and v1, …, v|T2| of T2 in reverse topological order (leaves first);
for i ← 1 to |T1| do
  for j ← 1 to |T2| do
    s1 ← M(ui, vj) + Φf(F(ui), F(vj)); {roots merged; Φf is computed by a weighted matching of the child subtrees (Corollary 3)}
    s2 ← max{Φ(ui, v′) : v′ is a child of vj}; {T1(ui) descends into a subtree of vj}
    s3 ← max{Φ(u′, vj) : u′ is a child of ui}; {T2(vj) descends into a subtree of ui}
    Φ(ui, vj) ← max{s1, s2, s3};
  end for
end for
return Φ;

push (r1, r2, NULL) into queue Q;
while Q ≠ ∅ do
  (u, v, p) ← pop(Q);
  if u = NULL then
    make T2(v) a subtree of T rooted at p; {T2(v) is the subtree of T2 rooted at v}
  else if v = NULL then
    make T1(u) a subtree of T rooted at p; {T1(u) is the subtree of T1 rooted at u}
  else
    let t be a new node and make t a child of p on T; {t becomes the root of T when p = NULL}
    s1 ← M(u, v) + Φf(F(u), F(v));
    s2 ← max{Φ(u, v′) : v′ is a child of v};
    s3 ← max{Φ(u′, v) : u′ is a child of u};
    if s1 ≥ s2 && s1 ≥ s3 then
      let t contain both u and v, and push the matching results of F(u) with F(v) into Q; {a matched subtree pair (A, B) enters Q as (root(A), root(B), t); an unmatched subtree enters with a NULL partner}
    else if s2 ≥ s3 then
      let t contain v only, and push the matching results of {T1(u)} with F(v) into Q;
    else
      let t contain u only, and push the matching results of F(u) with {T2(v)} into Q;
    end if
  end if
end while
return T;

Lemma 1. Let r1 be the root of tree T1 and r2 the root of tree T2. Let S(T1) and S(T2) represent the two sets of subtrees rooted at the vertices of T1 and of T2, respectively; S(T1) does not include T1 and S(T2) does not include T2. One has Φ(T1, T2) = max{M(r1, r2) + Φf(F(r1), F(r2)), max_{S∈S(T2)} Φ(T1, S), max_{S∈S(T1)} Φ(S, T2)}, where F(r1) and F(r2) are the forests obtained by removing r1 from T1 and r2 from T2 (see Lemma 2) and Φf is the maximum cohesion function for integrating two forests.

Proof. We can divide the integration of tree T1 with tree T2 into two cases according to the merging of their roots: (1) the roots of T1 and T2 are merged together; (2) the roots of T1 and T2 are not merged together. For case (1), it is clear that the cohesion score is M(r1, r2) + Φf(F(r1), F(r2)).
For case (2), we conclude that either T1 is integrated with a subtree in S(T2) (r2 is out of the integration) or T2 is integrated with a subtree in S(T1) (r1 is out of the integration). Otherwise, we would have a merged tree with the two roots r1 and r2, a contradiction to the fact that T is a tree. Therefore, the cohesion score is either max_{S∈S(T2)} Φ(T1, S) or max_{S∈S(T1)} Φ(S, T2).
Combining cases (1) and (2) and according to Criterion (2), we have Φ(T1, T2) = max{M(r1, r2) + Φf(F(r1), F(r2)), max_{S∈S(T2)} Φ(T1, S), max_{S∈S(T1)} Φ(S, T2)}.

Lemma 2. Let F(r1) and F(r2) represent the two forests of trees obtained by removing the root vertices r1 from T1 and r2 from T2, respectively. One has Φf(F(r1), F(r2)) = max_P Σ_{(A,B)∈P} Φ(A, B). Here A ∈ F(r1), B ∈ F(r2), and P is a matching of the trees in F(r1) with the trees in F(r2).

Proof. To prove this lemma, we first prove that any tree A ∈ F(r1) can be integrated with no more than one tree in F(r2). We prove this claim by contradiction. Assume two trees B1, B2 ∈ F(r2) integrate with a tree A ∈ F(r1) into an integrated tree T′. There are three cases for the root of T′, as illustrated in Figure 3: (1) the root of T′ contains only the root of A; (2) the root of T′ contains only the root of B1 or B2; (3) the root of T′ contains the root of A together with the root of B1 or B2. For case (1), the lowest common ancestor of the roots of B1 and B2 in the integrated tree T′ will no longer contain their lowest common ancestor in T2, a contradiction to Criterion (1). For cases (2) and (3), the root of B2 (or the root of B1) will be a descendant of the root of B1 (or the root of B2) in the integrated tree T′, again a contradiction to Criterion (1). For integrations involving more than two trees from F(r2), we can follow the above procedure to reach contradictions. Thus, the claim is proven.
Without loss of generality, the claim also holds in the other direction: any tree B ∈ F(r2) can be integrated with no more than one tree in F(r1). Therefore, the integration between F(r1) and F(r2) corresponds to a matching in a weighted bipartite graph in which the two sets of nodes represent the trees from F(r1) and F(r2), respectively, and the edge weights are the corresponding maximum cohesion scores. According to Criterion (2), Φf(F(r1), F(r2)) is to be maximized, and we conclude that it corresponds to the weight of a maximum weighted matching in the above bipartite graph.

Given Lemma 2, we can see that the following corollary is correct.

Corollary 3. Define W(P*) = Σ_{(A,B)∈P*} Φ(A, B), where P* is a maximum weighted matching of the trees in forests F1 and F2 given Φ(A, B) for any tree pair A ∈ F1 and B ∈ F2. One concludes that, for any two forests F1 and F2, Φf(F1, F2) = W(P*).

With Lemma 1 and Corollary 3, we are able to design an efficient dynamic programming algorithm that achieves the global optimum for the ontology integration problem. The pseudocode for calculating the maximum cohesion score is described in Algorithm 2, which visits the ontology vertices in reverse topological order (leaves first) when filling up the cohesion matrix.
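To make the recurrence concrete, the following C++ sketch fills the cohesion matrix in the manner of Algorithm 2. It is a minimal illustration, not our actual implementation: trees are assumed to be given as children lists with vertex 0 as the root, and the forest score Φf is computed with the greedy maximal weighted matching discussed in Section 2.1.3 (plugging in the Hungarian algorithm instead yields the exact optimum).

#include <algorithm>
#include <cstdio>
#include <set>
#include <vector>

using Tree = std::vector<std::vector<int>>;          // children lists; vertex 0 is the root
using Matrix = std::vector<std::vector<double>>;

// Post-order listing so that children are processed before their parents,
// i.e., the reverse topological order used by Algorithm 2.
static void postOrder(const Tree& t, int u, std::vector<int>& out) {
    for (int c : t[u]) postOrder(t, c, out);
    out.push_back(u);
}

// Greedy maximal weighted matching of child subtrees A x B, weighted by phi
// (the (1/2)-approximation discussed in Section 2.1.3).
static double forestScore(const std::vector<int>& A, const std::vector<int>& B,
                          const Matrix& phi) {
    struct E { double w; int a, b; };
    std::vector<E> es;
    for (int a : A) for (int b : B) es.push_back({phi[a][b], a, b});
    std::sort(es.begin(), es.end(), [](const E& x, const E& y) { return x.w > y.w; });
    std::set<int> ma, mb;                             // subtree roots already matched
    double total = 0;
    for (const E& e : es)
        if (!ma.count(e.a) && !mb.count(e.b)) { ma.insert(e.a); mb.insert(e.b); total += e.w; }
    return total;
}

// Fills phi so that phi[u][v] is the (approximate) maximum cohesion score of
// integrating the subtree of T1 rooted at u with the subtree of T2 rooted at v.
Matrix cohesionMatrix(const Tree& t1, const Tree& t2, const Matrix& M) {
    std::vector<int> o1, o2;
    postOrder(t1, 0, o1);
    postOrder(t2, 0, o2);
    Matrix phi(t1.size(), std::vector<double>(t2.size(), 0));
    for (int u : o1)
        for (int v : o2) {
            double best = M[u][v] + forestScore(t1[u], t2[v], phi);  // roots merged
            for (int c : t2[v]) best = std::max(best, phi[u][c]);    // T1(u) descends into T2
            for (int c : t1[u]) best = std::max(best, phi[c][v]);    // T2(v) descends into T1
            phi[u][v] = best;
        }
    return phi;                                       // phi[0][0] is the overall score (Theorem 4)
}

int main() {
    Tree t1 = {{1, 2}, {}, {}};                       // T1: root 0 with leaves 1, 2
    Tree t2 = {{1}, {}};                              // T2: root 0 with leaf 1
    Matrix M = {{0.1, 0.4}, {0.0, 0.8}, {0.5, 0.2}};  // closeness M[T1 vertex][T2 vertex]
    std::printf("max cohesion = %.2f\n", cohesionMatrix(t1, t2, M)[0][0]);
    // prints 0.90: the roots merge (0.1) and the child subtrees 1 and 1 match (0.8)
}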

At the end of Algorithm 2, the cohesion matrix is filled with optimal cohesion scores, and the maximum cohesion score is saved at the entry (r1, r2) for the two roots, as described by Theorem 4.

Theorem 4. It holds that every entry Φ(u, v) computed by Algorithm 2 equals Φ(T1(u), T2(v)) and, in particular, that Φ(r1, r2) = Φ(T1, T2). Here T1(u) and T2(v) denote the subtrees of T1 and T2 rooted at u and v, respectively.

Proof. We will prove this theorem by mathematical induction.
Let u be a leaf of T1 and v a leaf of T2; it is easy to see that Φ(u, v) = M(u, v) is optimal because u and v come first in the reverse topological order. In this case, F(u) and F(v) are empty sets and Φf(F(u), F(v)) = 0.
When u is a leaf and v is an internal vertex, according to Lemma 1, to integrate T1(u) with T2(v), either u (the leaf vertex) is merged with v or T1(u) is integrated with a proper subtree of T2(v); the latter score is covered by max{Φ(u, v′) : v′ is a child of v}, whose entries are available at the time Φ(u, v) is being calculated because of the reverse topological visit. Thus, according to both Lemma 1 and Corollary 3, we conclude that Φ(u, v) = Φ(T1(u), T2(v)). Similarly, we conclude the same when u is an internal vertex and v is a leaf.
When u and v are both internal vertices, again according to Lemma 1 and Corollary 3, we have Φ(u, v) = max{M(u, v) + Φf(F(u), F(v)), max{Φ(u, v′) : v′ is a child of v}, max{Φ(u′, v) : u′ is a child of u}} = Φ(T1(u), T2(v)). Due to the reverse topological visit, Φf(F(u), F(v)), Φ(u, v′), and Φ(u′, v) are all available at the time Φ(u, v) is being calculated.

Given the definition of Φ, Theorem 4 in fact proves that the proposed approach achieves the global optimum. However, the globally optimal solution is built upon the maximum weighted matching (recall Corollary 3). As discussed in Section 2.1.3, the maximum weighted matching is time-consuming, and we therefore propose an approximation alternative in that section.

Although Algorithm 2 builds the cohesion matrix with optimal cohesion scores, it does not construct the integrated ontological knowledge structure. We could save the ontology integration details along with the cohesion scores; however, that would substantially increase the memory cost beyond the n × n cohesion matrix itself (assuming each ontology has n vertices) and significantly reduce the capacity of the algorithm to handle large ontology integrations. Quite interestingly, we find that it is not necessary to save the integration details in order to construct the integrated ontology. The construction can be done by a process reverse to the construction of the cohesion matrix, as described in Algorithm 3.

Algorithm 3 uses the cohesion matrix constructed by Algorithm 2 and builds the integrated ontology tree, still following Lemma 1 and Corollary 3, but in the reverse direction of Algorithm 2. The construction is performed in a breadth-first fashion using a queue of triples. Each triple (u, v, p) is an association of three elements: u is the matched vertex from ontology T1; v is the matched vertex from ontology T2; and p is their parent in the merged ontology T. By following the basic idea of the proof of Theorem 4, we can show that Algorithm 3 builds an optimal integrated ontology from the cohesion matrix provided by Algorithm 2. We omit the proof for succinctness.

2.1.3. Time Complexity Analysis and an Approximation Solution

Assume each ontology has n vertices. The cohesion matrix has n^2 entries to fill, and the computation of each entry is a matching whose time complexity depends on the implementation. The maximum weighted matching takes O(n^3) time using the famous Hungarian algorithm [21]; although it achieves the optimum, it is too costly for large ontologies. The maximal weighted matching, however, takes O(m log m) time for m candidate pairs and, more importantly, results in an overall O(n^2 log n) time complexity for Algorithm 2. The analysis is as follows. For each matching, the algorithm accesses previously filled entries; each entry is accessed only once and is involved only once in a sorting, because each entry corresponds to two vertices whose cohesion score is accessed only when calculating the cohesion scores of their parents. Thus, we conclude that the total time complexity of calculating the cohesion matrix using the maximal weighted matching is O(n^2 log n). Since building the new ontology has the same time complexity as building the cohesion matrix, this is also the total time complexity for integrating two ontologies by maximal weighted matching. The maximal weighted matching also has a guaranteed lower bound on the results: it achieves a (1/2)-approximation (i.e., the overall cohesion score will be at least 1/2 of the optimal cohesion score), as pointed out in [22].

Since the maximal weighted matching delivers a good overall balance between time complexity and approximation rate, we used it in our empirical study for Algorithms 2 and 3. Readers may choose other matching algorithms (such as the one described in [22]) to achieve slightly better approximation rates. However, the weighted matching is a replaceable module in our algorithms, and building a fast and close-to-optimal weighted matching algorithm is not the focus of this work.

Compared to the dynamic programming approach of Algorithms 2 and 3, the heuristic approach described at the beginning of this section has the same worst-case time complexity. However, we conjecture that the heuristic approach has a much smaller average time complexity because, in each step, it may exclude a large number of matching opportunities.

2.2. Integrating Multiple Ontologies

In the previous section, we proposed methods for integrating two ontologies. In some biomedical applications [23, 24], we are interested in associations involving more than two objects; integrating multiple ontologies of these objects will generate an innovative view of such complex relationships. Similar to the basic problem formulation, we formulate the multiple ontology integration problem as follows.

Given k ontology trees T1, …, Tk and a closeness matrix Mij for every pair of trees Ti and Tj, how can we efficiently generate an integrated ontology tree T meeting the following criteria?

(1) For any two vertices u and v in a tree Ti, their lowest common ancestor LCA_Ti(u, v) is contained by the node LCA_T(u, v) of T.
(2) It holds that H(T) = Σ_t w(t), summed over the nodes t of T, is maximized. Here w(t) = Σ M_{ont(u),ont(v)}(u, v), summed over all pairs of vertices u and v (one from tree Ti and the other from tree Tj, i ≠ j) contained in the node t, where M_{ont(u),ont(v)}(u, v) is the entry value in the corresponding closeness matrix. For a vertex u from an original ontology, ont(u) is defined as its original ontology ID.

Again, we name the function H the cohesion of the integrated ontology T, and for each node t in the integrated ontology we define its weight as w(t) above. Correspondingly, we define the function Φ(T1, …, Tk) as the maximum cohesion function for integrating the ontologies T1, …, Tk. As we can see in the above formulation, the overall cohesion score of an integration is the summed weight of its nodes, each of which is a sum of pairwise closeness scores.

The formulation of multiple ontology integration is similar to the basic version, and it is not difficult to show that the optimal structures described in Lemmas 1 and 2 can be extended to higher dimensions. However, extending the algorithms of Section 2.1.2 for integrating two ontologies is not feasible for solving the multiple ontology integration problem: extending Algorithms 2 and 3 would require building a cohesion matrix of k dimensions, implying at least n^k operations to fill the score matrix, assuming each ontology has size n. This is clearly not acceptable for high dimensional ontology integration.

2.2.1. Greedy Approach

From the above discussion we can see that a direct extension of Algorithms 2 and 3 is practically infeasible for integrating a large number of ontologies. However, we can still use these algorithms for integrating multiple ontologies by iteratively integrating two ontologies and generating a new closeness matrix. Given the ontologies T1, …, Tk, we can first integrate T1 and T2 into T12 and then build the closeness matrix between T12 and T3 using the closeness matrices between T1 and T3 and between T2 and T3. Specifically, assume t is a node of the integrated ontology T12 containing a vertex u from T1 and a vertex v from T2; then, for any vertex w of T3, the corresponding entry of the closeness matrix between T12 and T3 is M13(u, w) + M23(v, w). After the new closeness matrix is generated, we can continue integrating T12 and T3 into T123 and generate another new closeness matrix, eventually obtaining the integrated ontology by repeating this process. To facilitate the following discussions, we name this approach the basic multiple integration approach.
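The composition of the new closeness matrix can be sketched as follows (illustrative C++ under assumed types; a node of T12 may also hold a single vertex, in which case the missing member simply contributes nothing).

#include <cstdio>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// A node of the integrated ontology T12: u from T1 and/or v from T2 (-1 if absent).
struct MergedNode { int u; int v; };

// Closeness of each T12 node to each vertex of T3: the contributions of the
// node's member vertices are summed, as described above.
Matrix composeCloseness(const std::vector<MergedNode>& nodes,
                        const Matrix& M13, const Matrix& M23, int sizeT3) {
    Matrix M(nodes.size(), std::vector<double>(sizeT3, 0.0));
    for (size_t t = 0; t < nodes.size(); ++t)
        for (int w = 0; w < sizeT3; ++w) {
            if (nodes[t].u >= 0) M[t][w] += M13[nodes[t].u][w];  // T1-vertex contribution
            if (nodes[t].v >= 0) M[t][w] += M23[nodes[t].v][w];  // T2-vertex contribution
        }
    return M;
}

int main() {
    std::vector<MergedNode> nodes = {{0, 1}};        // one node holding u = 0 and v = 1
    Matrix M13 = {{0.2, 0.0}};                       // closeness of T1's vertex 0 to T3
    Matrix M23 = {{0.0, 0.0}, {0.3, 0.1}};           // closeness of T2's vertices to T3
    Matrix M = composeCloseness(nodes, M13, M23, 2);
    std::printf("%.2f %.2f\n", M[0][0], M[0][1]);    // prints "0.50 0.10"
}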

Although the basic multiple integration approach can finish integrating multiple ontologies, it integrates them blindly, without using any cohesion information between ontologies that may lead to a better integration result. To improve upon it, we propose a greedy approach that uses the cohesion information between ontologies to guide the integration. The basic steps of the greedy approach are outlined in Algorithm 4. To facilitate the understanding of Algorithm 4, Figure 4 illustrates how the InterOntology matrix changes in the first iteration of integrating four ontologies A, B, C, and D.

build the INTERONTOLOGY matrix;
for i ← 1 to k − 1 do
  identify the active tree pair (Ta, Tb) that corresponds
  to the highest score in the INTERONTOLOGY matrix;
  integrate Ta and Tb into Tab;
  mark Ta and Tb as inactive and Tab as active;
  update the relationship matrices;
  update the INTERONTOLOGY matrix;
end for
return the final integrated tree;

The key idea in Algorithm 4 is to maintain an InterOntology matrix that guides the integration. Initially, this matrix is filled with the overall cohesion score of every pair of ontologies. In each step, the matrix is updated with the overall cohesion scores between the newly integrated ontology and the existing active ontologies, and the next integration takes place between the two active ontologies with the highest score in the InterOntology matrix.

When we used the raw overall cohesion score between two ontologies to update the InterOntology matrix, we observed an interesting phenomenon: the integration was, in most cases, a process of continuously expanding a single integrated ontology. Consequently, the greedy approach is likely to yield a result similar to the basic approach.

This phenomenon can be explained by the definition of the maximum cohesion function, which takes into account all pairwise closeness scores between merged terms. Thus, the more ontologies an integrated ontology contains, the more likely it is to have high overall cohesion scores with other ontologies. As a result, the integration selection becomes unfair. To fix this issue, we use adjusted overall cohesion scores in updating the InterOntology matrix, as follows.

Given an ontology TA and an ontology TB, where A and B are the nonempty sets of original ontology IDs they contain, we define the adjusted cohesion score between TA and TB as (H(TAB) − H(TA) − H(TB)) / (|A| · |B|), where TAB is the integrated ontology built by Algorithms 2 and 3. The adjusted cohesion score is in fact the weight increase obtained by integrating TA and TB, divided by the size of A times the size of B. For each node merging, closeness scores are added to the total weight for the pairs formed between vertices from the ontology set A and vertices from the ontology set B. Thus, the weight increase obtained by integrating TA and TB scales with the number of ontologies in A times the number of ontologies in B, and consequently we average the weight increase by |A| · |B|.
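In code, the adjustment is a one-line normalization (an illustrative C++ sketch; hAB, hA, and hB stand for the cohesion weights H defined above):

#include <cstdio>

// Adjusted cohesion score: the weight increase of integrating T_A with T_B,
// normalized by |A| * |B|.
double adjustedScore(double hAB, double hA, double hB, int sizeA, int sizeB) {
    return (hAB - hA - hB) / (static_cast<double>(sizeA) * sizeB);
}

int main() {
    // e.g. T_A integrating A = {T1, T2} merged with T_B = T3:
    // raw increase 12.0 - 7.5 - 2.5 = 2.0, normalized by 2 * 1, gives 1.00
    std::printf("%.2f\n", adjustedScore(12.0, 7.5, 2.5, 2, 1));
}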

2.2.2. Fast Approximation Algorithm

Although the basic multiple integration and the greedy multiple integration approaches discussed above are able to integrate multiple ontologies, neither of them provides any guarantee on the results in comparison with the optimal solution. By studying the maximum cohesion scores between ontologies in a graph setting, we identified an approximation structure and developed an approximation algorithm for integrating multiple ontologies. We name it the fast approximation algorithm because it not only has a lower bound on the results but also runs faster than the greedy multiple integration algorithm proposed above.

The fast approximation algorithm for integrating multiple ontologies is sketched in Algorithm 5. It calculates the maximum cohesion score between every pair of ontologies only once, during the initial stage, and uses this information throughout the integration process, even after it becomes stale. More importantly, this approach not only saves the time of recalculating maximum cohesion scores but also provides a lower bound guarantee, as stated in Theorem 5, whose correctness is built on two important lemmas (Lemmas 6 and 7) described subsequently.

for i ← 1 to k do
  for j ← i + 1 to k do
    push (Φ(Ti, Tj), Ti, Tj) into a set S ordered by the first element in descending order;
  end for
end for
while S ≠ ∅ do
  (s, Ti, Tj) ← pop(S);
  if adding edge (Ti, Tj) does not form a cycle in G′ then
    add edge (Ti, Tj) to G′;
    integrate T(C(Ti)) and T(C(Tj)) into one tree; {C(Ti) is the set of ontologies including Ti that form a connected component in G′, and T(C(Ti)) is the tree integrating them}
  end if
end while
return the final integrated tree;
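The selection loop of Algorithm 5 is essentially Kruskal's procedure on G′ with a union-find structure for the cycle test, as the following C++ sketch shows (illustrative names; the pairwise integration by Algorithms 2 and 3 is left as a stub).

#include <algorithm>
#include <numeric>
#include <vector>

// Union-find for the cycle test in Algorithm 5.
struct UnionFind {
    std::vector<int> parent;
    explicit UnionFind(int n) : parent(n) { std::iota(parent.begin(), parent.end(), 0); }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    bool unite(int a, int b) {                        // false if already connected
        a = find(a); b = find(b);
        if (a == b) return false;
        parent[b] = a;
        return true;
    }
};

struct Edge { double phi; int i, j; };                // pairwise maximum cohesion score

// Placeholder: in the full algorithm this runs Algorithms 2 and 3 on the two
// component trees and composes their new closeness matrices.
void integrate(int i, int j) { (void)i; (void)j; }

void fastMultiInt(int k, std::vector<Edge> edges) {
    // the pairwise scores are computed once up front and never refreshed
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.phi > b.phi; });
    UnionFind uf(k);
    for (const Edge& e : edges)
        if (uf.unite(e.i, e.j))                       // skip edges that would close a cycle
            integrate(e.i, e.j);                      // integrate the two connected components
}

int main() {
    // three ontologies with phi(0,1) = 0.9, phi(0,2) = 0.5, phi(1,2) = 0.7:
    // edges (0,1) and (1,2) trigger integrations; (0,2) would close a cycle
    fastMultiInt(3, {{0.9, 0, 1}, {0.5, 0, 2}, {0.7, 1, 2}});
}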

Theorem 5. The tree weight of the integrated tree obtained by the FASTMULTIINT algorithm (Algorithm 5) is at least 1/(2(k − 1)) of the weight of the optimal solution, where k is the number of ontologies.

Proof. We will use Lemmas 6 and 7 to prove this theorem; their proofs are provided after the proof of this theorem. To facilitate the proof, we build a fully connected weighted graph G in which each node corresponds to a tree for integration and the weight of each edge corresponds to the weight increase obtained by integrating the corresponding trees (initially, this is the pairwise maximum cohesion score, since a single ontology has zero cohesion weight). According to Lemma 6, Φ(T1, …, Tk) (i.e., the optimal cohesion score) is no more than the summed weight of the edges in G (Claim 1).
Given G, the integration by Algorithm 5 is a process of node contractions. After each contraction, the adjacent edge weights (weight increases) are updated accordingly. According to Lemma 7, the weight of an updated edge will only increase over (or at least remain the same as) the maximum weight of the two contracted edges (Figure 5 provides an illustration of vertex/edge contraction and weight updates). Thus, the overall cohesion score of the integration by Algorithm 5 is no less than the weight of the maximum spanning tree of G (Claim 2).
It is easy to see that the weight of a maximum spanning tree is no less than 1/(2(k − 1)) of the summed edge weight of G, given the simple observation that each edge in G is either an edge of the maximum spanning tree or adjacent to an edge of the maximum spanning tree with an equal or heavier weight, while each spanning tree edge is adjacent to at most 2(k − 2) other edges (Claim 3).
Combining Claims 1, 2, and 3, we complete the proof of this theorem.
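Written out, the counting behind Claim 3 and the final bound are as follows (a short derivation in the notation of the proof, with OPT the optimal cohesion score, ALG the score achieved by Algorithm 5, and T* the maximum spanning tree of G):

% Charging argument for Claim 3: every edge e of G is either in the maximum
% spanning tree T* or adjacent to an edge f(e) of T* with w(f(e)) >= w(e).
% A tree edge f is adjacent to at most 2(k-2) non-tree edges, so it is
% charged at most 2k-3 times (including by itself). Hence
\begin{align*}
W(G) = \sum_{e \in G} w(e) \le (2k-3)\, W(T^{*}) < 2(k-1)\, W(T^{*}),
\end{align*}
% and combining with Claims 1 and 2,
\begin{align*}
\mathrm{ALG} \ge W(T^{*}) \ge \frac{W(G)}{2(k-1)} \ge \frac{\mathrm{OPT}}{2(k-1)}.
\end{align*}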

Lemma 6. It holds that Φ(T1, …, Tk) ≤ Σ Φ(Ti, Tj), summed over all pairs 1 ≤ i < j ≤ k.

Proof. Let T* be an optimal integrated tree. According to the problem definition, H(T*) = Σ Hij(T*), summed over all pairs 1 ≤ i < j ≤ k, where Hij(T*) is the cohesion score of the integration of Ti and Tj induced from T*. Each induced integration is a valid integration of the pair, so Hij(T*) ≤ Φ(Ti, Tj), and the lemma follows.

Lemma 7. Let Δ(TX, TY) denote the maximum weight increase obtainable by integrating trees TX and TY, and let TAB be a tree integrating TA and TB. It holds that Δ(TAB, TC) ≥ max{Δ(TA, TC), Δ(TB, TC)}.

Proof. According to the problem definition, integrating TAB with TC results in an integrated tree whose weight equals H(TAB) + H(TC) plus the weight increase contributed by merging vertices across the two trees. Since TAB has already been determined, H(TAB) is fixed. Without loss of generality, let us assume Δ(TA, TC) ≥ Δ(TB, TC). Then, by restricting the integration between TAB and TC to follow the integration that leads to Δ(TA, TC) (i.e., merging only the TA vertices of TAB with TC, in the same way as in the optimal integration of TA and TC), we obtain a weight increase no less than Δ(TA, TC). Thus, Δ(TAB, TC) ≥ max{Δ(TA, TC), Δ(TB, TC)}, which completes the proof.

2.2.3. Time Complexity Analysis

For the fast approximation algorithm (Algorithm 5), the time complexity of generating the graph G (calculating the overall cohesion score for every pair of ontologies) is O(k^2 · n^2 log n), assuming we use the maximal weighted matching and each ontology has n vertices. Each integration takes the pairwise integration time analyzed in Section 2.1.3, plus an update of at most k closeness matrices. There are at most k − 1 integrations; therefore, the total time complexity is still O(k^2 · n^2 log n).

Following the above analysis, we conclude that the time complexity of the greedy multiple integration algorithm is of the same order as that of the fast approximation algorithm. However, GREEDY additionally requires recomputing overall cohesion scores to update the InterOntology matrix after each integration, which takes excessive time in practice. The empirical study shows that the fast approximation algorithm is much faster than the greedy multiple integration algorithm.

Finally, it is easy to see that the basic multiple integration approach, which performs only the k − 1 integrations without computing pairwise scores or maintaining the InterOntology matrix, is the fastest, but its overall cohesion scores are the worst, as we will see in the empirical study.

2.2.4. Limitations

Integrating multiple ontologies may face two practical problems in real applications. First, how can we efficiently generate a closeness matrix for every pair of ontologies to be integrated? Our current methods, KDLS and ONGRID, are efficient for generating the closeness matrix for one pair of ontologies in most cases, but not efficient enough for generating closeness matrices for many pairs of ontologies. Second, not every pair of ontologies can be meaningfully integrated, and it remains an open problem to efficiently identify the feasibility of integrating a pair of ontologies. Therefore, the main purpose of Section 2.2 is to demonstrate that our proposed approach can be extended to integrate multiple ontologies, and we use synthetic datasets in Section 3.3 to study the performance of the algorithms proposed in Section 2.2.

3. Results and Discussion

We study the performance of the proposed ontology integration methods through experiments on both real and synthetic datasets. We implemented five approaches in C++:
(1) HEURISTIC: the heuristic approach for integrating two ontologies, as described in Section 2.1.1;
(2) APPROXIMATE: the approximation approach (Algorithms 2 and 3) with the maximal weighted matching guaranteeing the (1/2)-approximation rate;
(3) BASIC: the basic multiple integration approach, as described in Section 2.2.1;
(4) GREEDY: the greedy multiple integration approach (Algorithm 4);
(5) FASTAPPROXIMATE: the fast approximation multiple integration approach (Algorithm 5).
In the following, we report our study of the performances of HEURISTIC and APPROXIMATE for integrating two ontologies on real datasets, and of BASIC, GREEDY, and FASTAPPROXIMATE for integrating multiple ontologies on synthetic datasets. All experiments were carried out on a Linux cluster with 2.4 GHz AMD Opteron processors.

3.1. Integrating a Pair of Ontologies

The knowledge of drug-gene relationships is desirable in many pharmacology applications [25, 26]. By integrating the Gene Ontology and a drug ontology, we can obtain rich information on the associations between drugs and genes within the ontology structures. Thus, in this set of experiments, we simulate real-world knowledge discovery applications by integrating two real ontologies, the Gene Ontology (GO) and the National Drug File Reference Terminology (NDFRT). Both were obtained from the Unified Medical Language System (version 2012AA). The closeness matrices between GO terms and drug terms were generated using ONGRID [27] with a 4-neighborhood broadcast range (i.e., the corresponding path-length setting of [7]). ONGRID follows the KDLS approach [7] and measures the closeness between two concepts based on the discovered paths (with length greater than one) between them. However, unlike KDLS, ONGRID takes concept semantic types into consideration in the closeness measurement. The study performed in [27] illustrates the advantages of ONGRID over KDLS.

The overall cohesion scores of HEURISTIC and APPROXIMATE on integrating GO and NDFRT are listed in Table 1. To observe the integration over changing ontology sizes, in each experiment we use the ontology tree structures from the roots down to the specified depth (first column in Table 1) for integration. The numbers of ontology terms involved in the integration are listed in the second and third columns of Table 1.

Recall from Section 2.1.1 that λ is used to regulate the selection of vertices from high depths. Thus, we tested HEURISTIC under λ = 0 (the depth information is nullified) and under a very large λ (the vertex depth plays the critical role in the selection). Since nonleaf vertices in these datasets have around 6 children on average, a vertex chosen at relative depth d excludes roughly 6^d vertices from the future integration; we therefore heuristically added a set of experiments with an intermediate λ calibrated so that the depth penalty is close to this count.

From Table 1, we can see that HEURISTIC performs better when using the depth information to regulate the selection of vertices. However, APPROXIMATE is much better than HEURISTIC under all settings. Compared to the best cohesion score of HEURISTIC in each row of Table 1, APPROXIMATE constructs an integrated ontology with an overall cohesion score ranging from 5.4 times to 109.9 times that of HEURISTIC. This clearly demonstrates the effectiveness of the proposed APPROXIMATE approach. Nevertheless, the heuristic approach has a much faster average running time, as a result of excluding a large number of matching opportunities in each step.

Although the running time of APPROXIMATE is longer than that of HEURISTIC, it takes less than two hours to finish integrating the two ontologies at the largest sizes tested (the last row of Table 1). Most biomedical ontologies are smaller than or similar to these sizes, and the APPROXIMATE approach will benefit association studies of these ontologies. For extremely large ontology pairs on which APPROXIMATE is unable to finish the integration within a reasonable time, HEURISTIC may provide a quick view of their integration.

3.2. Understanding the Merged Ontology Terms

To understand what terms are merged in integrating real ontologies, we use the integration of GO and NDFRT at depth 6 as an example. Tables 2 and 3 list the top 5 pairs of merged terms (sorted by their closeness scores) produced by HEURISTIC and APPROXIMATE, respectively. As mentioned above, these scores come from the closeness matrix generated by ONGRID based on the discovered paths between terms. For example, "C1155065: T-Cell Activation --is_physiologic_effect_of_chemical_or_drug--> C0393002: Carcinoembryonic Antigen Peptide 1 --has_target--> C0007082: Carcinoembryonic Antigen" is such a path.

From Tables 2 and 3 we can observe that the APPROXIMATE algorithm merges terms with much higher similarity scores than the HEURISTIC algorithm. Quite interestingly, the top-ranked merging by HEURISTIC in Table 2 is between "biological_process" and "VITAMINS." The term "biological_process" is an abstract term very close to the root of the GO ontology. This fact suggests that top-level terms will likely preempt the merging choices of their descendants. As a result of this greediness, the HEURISTIC algorithm ends at a local optimum that is far from being optimal.

A snapshot of the ontology integration by APPROXIMATE, shown in Figure 10, provides good insight into how the algorithm works. In each bracket of two merged terms, the left part is the closeness score and the right part is the cohesion score. We can observe that most closeness scores are zero or close to zero, while the corresponding cohesion scores are much higher. This is understandable because the snapshot primarily covers the top-level terms of both ontologies. These terms have a large number of subclass (descendant) terms, and optimizing the integration of their subclass terms far outweighs integrating the terms themselves. The result of such an integration provides novel knowledge of the association between ontology terms: even if two terms are not close according to some closeness measurement, they can be structurally associated under their ontology context. For example, the GO term "biological_process" is merged with the NDFRT term "chemical ingredients" even though their closeness score is zero in the ONGRID output. Such an integration is interesting because it links chemical compounds with biological/cellular processes, so that corresponding associations between cellular processes and chemical structures can be established. This demonstrates the purpose of integrating two ontologies, that is, identifying associations with respect to both term similarities and structural contexts.

In fact, multiple studies justify the structural associations seen in Figure 10, such as the association between "signaling" and "carbohydrates" [28] and the association between "extracellular region part" and "skin and connective tissue diseases" [29].

In addition, we have noticed a number of meaningful integrations between GO terms and neurological terms in the NDFRT. For example, the synapse is a brain-related structure, and the term "symmetric synapse" is associated with "trauma," while the term "asymmetric synapse" is associated with "brain neoplasms." Similarly, it is reasonable to see that "neuronal RNA granule" is integrated with "granulomatous disease," a granule-associated disease. As another example, it is very interesting to notice that "zyxin" is associated with "cell adhesion involved in heart morphogenesis," which provides a link to the formation of the heart.

The above observations suggest a novel way of using our ontology integration method to perform association studies between biomedical concepts.

3.3. Integrating Multiple Ontologies

In the following experiments we study the performances of BASIC, GREEDY, and FASTAPPROXIMATE in integrating multiple ontologies. All three approaches are built upon APPROXIMATE, which performed very well in the preceding study on integrating two real ontology datasets.

We use two sets of synthetic datasets in this study. In the first set, we fix the number of ontologies at 10 and vary the size of each ontology from 100 to 1000. In the second set, we fix the size of each ontology at 100 and vary the number of ontologies from 10 to 100. All ontologies are randomly generated by constructing a minimal spanning tree from a random matrix. The closeness matrix between every pair of ontologies is also randomly generated, with entry values ranging from 0 to 1. For each experiment, we generate 10 random datasets, and the results reported below are averages over the 10 random datasets.

The overall cohesion scores of the three approaches over different ontology sizes and over different numbers of ontologies are reported in Figures 6 and 7, respectively. FASTAPPROXIMATE outperforms all the other approaches in Figure 6, which is consistent with the analysis of its approximation rate. However, GREEDY slightly outperforms FASTAPPROXIMATE in Figure 7, especially when the number of ontologies is large. This is understandable because, when the number of ontologies k increases, the approximation rate 1/(2(k − 1)) (as stated in Theorem 5) decreases and becomes less significant. This result also justifies the choice of the adjusted cohesion score for GREEDY, as described at the end of Section 2.2.1.

The integration times of the three approaches over different ontology sizes and over different numbers of ontologies are reported in Figures 8 and 9, respectively. These figures are consistent with the time complexity analysis in Section 2.2.3. In particular, we notice that the integration time of GREEDY deteriorates sharply as the number of ontologies increases. In contrast, FASTAPPROXIMATE is much more scalable and has a time curve similar to BASIC's.

These results suggest that FASTAPPROXIMATE has the best overall performance in integrating multiple ontologies.

4. Conclusions

In this work, we started with a basic problem of integrating a pair of ontology tree structures under a given closeness matrix and then advanced to the problem of integrating a large number of ontologies. We proved optimal structures in the basic problem and developed both an optimal solution and an efficient approximation solution. Although the multiple ontology integration problem has similar optimal structures, it is not feasible to extend these solutions to efficiently handle multiple ontology integration. To tackle the challenge of integrating a large number of ontologies, we developed both an effective greedy approach and a fast approximation approach. The empirical study not only confirms our analysis of the efficiency of the proposed methods but also demonstrates that our method can be used effectively for biomedical association studies.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.