Abstract
Semantic collision is inevitable when building a domain ontology from heterogeneous data sources (semi)automatically. Semantic consistency is therefore an indispensable precondition for building a correct ontology. In this paper, a model-checking-based method is proposed to handle the semantic consistency problem within a middle-model methodology, which can extract a domain ontology from structured and semistructured data sources semiautomatically. The method translates the middle model into a Kripke structure and consistency assertions into CTL formulae, so that a consistency checking problem is reduced to a global model-checking problem. Moreover, the feasibility and correctness of the transformation are proved, and case studies are provided.
1. Introduction
The semantic web has been a promising research area for more than a decade [1]. A main obstacle to the wider adoption of semantic web technologies is the lack of semantic data, that is, ontologies. Many researchers therefore focus on how to transform the mass of legacy web data into ontologies. Legacy web data comes in various forms, which can be classified roughly into three types: structured data, such as relational databases; semistructured data, such as XML documents and emails; and unstructured data, such as general text and video. Recently, technologies that automatically transform semistructured or structured data into a domain ontology through mediate modeling have shown promise [2–6]. In these technologies, semantic collisions inevitably appear when the same domain ontology is built from multiple data sources. A domain ontology is a formal expression of domain knowledge, aiming to unify the general knowledge of a specific domain in order to share content, achieve interoperability, or integrate applications without specific authorization. Unfortunately, this unification is difficult even within a single organization, where the general knowledge exists in very different forms, under distinct interpretations, and with dissimilar usage across applications. Automatically creating a domain ontology from heterogeneous sources is therefore a major challenge. Semantic paradox and ambiguity must be addressed while building the domain ontology, which calls for the validation of semantic consistency. Semantic consistency guarantees a correct and concise domain ontology for all kinds of semantic web applications with multiple data sources.
In [2, 7–11], researchers transform structured data (relational database schemata) into a middle model and then create a domain ontology from the model. They assume that the relational database is the only data source and that the database is well defined (free of ambiguity), which is not always practical. Some semantics-preserving properties of the transformation are proved, but semantic consistency is not addressed and is left to the created domain ontology.
As for semistructured data sources, in [4–6], researchers employ a mediate model language for modeling semistructured data and then transform the middle models into the domain ontology. Validity checking is provided when collisions happen, but it is generally a syntax check; semantic consistency checking is again missing.
For building a domain ontology from heterogeneous data sources, semantic consistency checking is necessary. In [12, 13], the same formal middle-model language has been adopted to model both structured and semistructured data, so the method can be used to build the domain ontology from heterogeneous data sources. In this paper, we focus on the semantic consistency problem based on this method. The work in [14] develops a formal middle language to describe semistructured data for model-checking purposes, and [15] employs a graph-based formalism to model semistructured data and queries based on fixed-point computation. Inspired by model-checking technology [16], we translate the mediate model into a Kripke structure and encode all semantic query problems into CTL formulae, thereby transforming the semantic consistency checking problem into a global model-checking procedure.
Section 2 recalls the mediate-model-based method of building domain ontologies and model-checking technology. In Section 3, the mediate model language is introduced, and model equivalence is analyzed. The model-checking-based consistency checking technology is proposed in detail in Section 4. Some cases are studied in Section 5, and Section 6 concludes.
2. Mediate-Model-Based Technology and Model Checking
In this section, the method of formally creating a domain ontology with the mediate model [12, 13] and model-checking technology [16, 17] are introduced.
2.1. Formally Creating Domain Ontology
Wgraph, a graph-based formal language, is used to model the semantic aspects of relational databases [12]. The main idea is to execute SQL procedures to retrieve the semantic information of the database instances, transform the result sets into a Wgraph model, and then transform the model into the ontology automatically; this not only maps schemata to the middle model but also populates the model with the data stored in the databases. The language is also used in [5] to model XML documents semantically, giving them a semantic interpretation in Wgraph. The Wgraph model created from relational databases or XML documents can be automatically transformed into a domain ontology expressed in the Web Ontology Language (OWL) [18]. A Wgraph is defined as follows.
Definition 1. A Wgraph is a directed labeled graph , where is a finite set of nodes, is a finite set of atomic nodes depicted as ellipses, is a finite set of composite nodes depicted as rectangles, is a set of labeled edges of the form , is defined as , , is a set of labels, and is a symbol for nothing (empty label, can be read as bottom).
Nodes in Wgraph always represent objects, and edges represent relationships between nodes. There are two types of concrete Wgraph: instances and schemata. An instance can be formally defined as follows.
Definition 2. A Winstance is a Wgraph such that for each edge of and and for each node of .
In Figure 1 a Winstance is depicted. , where , , , , , , , .
It describes the information that two teachers, one 37 years old and another 40 years old, teach database course, and the student Smith attends this course. In Wgraph, edge attribute consists of two components, the color and the label, and the function returns a color and a label (possibly empty, ) for each node. Edge labels are stuck close to the corresponding edges, and node labels are written inside the rectangles representing the nodes. The set of colors denotes how the lines of nodes and edges are drawn (solid or dashed), and we also call this information the color of a node or edge. On the other hand, the function can be seen as the composition of the two single valued functions and , so can also be implicitly defined on edges: if , then and . Two nodes may be connected by more than one edge, provided that edge attributes are different.
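As a rough illustration of Definitions 1 and 2, the following Python sketch stores a Wgraph as a label map plus a set of labeled edges and builds an instance resembling the one described for Figure 1. The class name `WGraph`, the node identifiers, and the edge labels `name`, `age`, `teaches`, `attends`, and `cName` are assumptions made for this example; colors (solid or dashed) are omitted.

```python
from dataclasses import dataclass, field

BOTTOM = None  # stands for the empty label


@dataclass
class WGraph:
    # node id -> label (BOTTOM when the node carries no label)
    labels: dict = field(default_factory=dict)
    # set of (source id, edge label, target id) triples
    edges: set = field(default_factory=set)

    def add_node(self, node, label=BOTTOM):
        self.labels[node] = label

    def add_edge(self, src, label, dst):
        if src not in self.labels or dst not in self.labels:
            raise KeyError("both endpoints must be declared first")
        self.edges.add((src, label, dst))

    def successors(self, node):
        """Return the (edge label, target) pairs leaving `node`."""
        return {(l, d) for (s, l, d) in self.edges if s == node}


def figure1_like_instance():
    """Two teachers (37 and 40) teach Database; student Smith attends it."""
    g = WGraph()
    for node, label in [("t1", "Teacher"), ("t2", "Teacher"),
                        ("c1", "Course"), ("s1", "Student"),
                        ("v37", 37), ("v40", 40),
                        ("vdb", "Database"), ("vsmith", "Smith")]:
        g.add_node(node, label)
    for edge in [("t1", "age", "v37"), ("t2", "age", "v40"),
                 ("t1", "teaches", "c1"), ("t2", "teaches", "c1"),
                 ("c1", "cName", "vdb"),
                 ("s1", "name", "vsmith"), ("s1", "attends", "c1")]:
        g.add_edge(*edge)
    return g
```

Keeping edges as a set of triples mirrors the remark that two nodes may be connected by more than one edge as long as the edge attributes differ.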
For two subsets of , and is accessible from if for each node in the set there is a corresponding node in the set such that there exists a path from to in Wgraph . For example, in the Winstance of Figure 1, the set is accessible from the set .
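The accessibility condition can be checked with a plain graph search. The sketch below is one possible reading, not the authors' procedure: it assumes that a path of length zero counts (a node is reachable from itself), and the function names are hypothetical.

```python
from collections import deque


def reachable_from(edges, start):
    """Nodes reachable from `start` by a path of length >= 0."""
    succ = {}
    for src, _label, dst in edges:
        succ.setdefault(src, set()).add(dst)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in succ.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def is_accessible(edges, targets, sources):
    """True if every node in `targets` is reached from some node in `sources`."""
    covered = set()
    for s in sources:
        covered |= reachable_from(edges, s)
    return set(targets) <= covered
```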
The bisimulation semantics of the language is also given as follows.
Definition 3. Given two Wgraphs and , a relation is said to be a bisimulation between and (write ) if and only if:(1)for , , such that ,(2)for ,, , s.t. , and(3)for , , let . Then, , such that ; for , it holds that , .
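For intuition, the sketch below computes the largest relation satisfying the standard bisimulation conditions on node-labeled, edge-labeled graphs: related nodes carry the same label, and every labeled edge of one node is matched by an equally labeled edge of the other node leading to related nodes. This is the textbook notion and may not capture every clause of Definition 3; treating two graphs as bisimilar when the relation covers all nodes on both sides is likewise an assumption of this example.

```python
def greatest_bisimulation(labels1, edges1, labels2, edges2):
    """Largest label-respecting relation closed under the back-and-forth conditions."""
    def succ(edges):
        out = {}
        for s, l, d in edges:
            out.setdefault(s, set()).add((l, d))
        return out

    s1, s2 = succ(edges1), succ(edges2)
    # Start from all pairs with equal node labels, then prune violating pairs.
    rel = {(a, b) for a in labels1 for b in labels2 if labels1[a] == labels2[b]}
    changed = True
    while changed:
        changed = False
        for a, b in list(rel):
            forth = all(any(l == m and (x, y) in rel for m, y in s2.get(b, ()))
                        for l, x in s1.get(a, ()))
            back = all(any(l == m and (x, y) in rel for l, x in s1.get(a, ()))
                       for m, y in s2.get(b, ()))
            if not (forth and back):
                rel.discard((a, b))
                changed = True
    return rel


def bisimilar(labels1, edges1, labels2, edges2):
    """True if every node of each graph is related to some node of the other."""
    rel = greatest_bisimulation(labels1, edges1, labels2, edges2)
    return (all(any((a, b) in rel for b in labels2) for a in labels1)
            and all(any((a, b) in rel for a in labels1) for b in labels2))
```

The pruning loop terminates because the relation is finite and only shrinks, which parallels the termination argument used later for concept redundancy.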
2.2. Model Checking
Model checking is an automated technique that, given a finite-state model of a system and a formal property, systematically checks whether this property holds for (a given state in) the model. Here the finite-state model is a Kripke structure (an automata-like state transition system), and the formal properties are expressed as computation tree logic (CTL, a logic based on a branching-time view) formulae.
Definition 4. A Kripke structure is a transition system $K = \langle \Sigma, Act, R, I \rangle$ over a set $\Pi$ of atomic propositions, where $\Sigma$ is a set of states, $Act$ is a set of actions, $R \subseteq \Sigma \times Act \times \Sigma$ is a transition relation, and $I : \Sigma \to \wp(\Pi)$ is an interpretation.
A path in a Kripke structure is an infinite sequence of states and actions ( denotes the th state in the path ), s.t.. For all it holds that and either , with , or there are no outgoing transitions from and for all it holds that is the special action (which is not in ) and .
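Definition 5. Given the sets $\Pi$ and $Act$ of atomic propositions and actions, computation tree logic (CTL) formulae are recursively defined as follows: (1) each $p \in \Pi$ is a CTL formula; (2) if $\phi_1$ and $\phi_2$ are CTL formulae and $a \subseteq Act$, then $\neg\phi_1$, $\phi_1 \wedge \phi_2$, $AX_a(\phi_1)$, $EX_a(\phi_1)$, $AU_a(\phi_1, \phi_2)$, and $EU_a(\phi_1, \phi_2)$ are CTL formulae.
A and E are the universal and existential path quantifiers, while neXt (X) and Until (U) are the linear-time modalities. Formulae of the form $\phi_1 \vee \phi_2$ and $\phi_1 \rightarrow \phi_2$ can be defined, respectively, by $\neg(\neg\phi_1 \wedge \neg\phi_2)$ and $\neg\phi_1 \vee \phi_2$, and the modalities Finally (F) and Generally (G) can be defined in terms of the other CTL formulae: $F\phi = U(\mathit{true}, \phi)$ and $G\phi = \neg F \neg\phi$.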
Definition 6. Satisfaction of a CTL formula by a state of the Kripke structure is defined recursively as follows: (i)if , then iff . Moreover, , and ;(ii) iif ;(iii) iff and ;(iv) iff there is a path s.t. and ;(v) iff for all paths , implies ;(vi) iff there is a path , and s.t. , , and and ;(vii) iff for all paths , s.t. , , , and and .
Definition 7 (the model-checking problem). Local model checking: given a Kripke structure $K$, a formula $\phi$, and a state $s$ of $K$, the local model-checking problem is to verify whether $K, s \models \phi$. Global model checking: given a Kripke structure $K$ and a formula $\phi$, the global model-checking problem is to find all states $s$ of $K$ such that $K, s \models \phi$.
If $K$ is finite, the global model-checking problem for a CTL formula $\phi$ can be solved in running time linear in $|\phi| \cdot (|\Sigma| + |R|)$, where $|\phi|$ is the length of the formula $\phi$, $|\Sigma|$ is the number of states, and $|R|$ is the number of elements of the transition relation [16].
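To make the global model-checking problem concrete, here is a small explicit-state sketch in Python. It is not the algorithm of [16]: it favors brevity over the linear bound, it encodes formulae as nested tuples (a hypothetical format), and it assumes a simplified semantics in which `EX`/`AX` look only at transitions whose action lies in the given set (`AX` holds vacuously when there are none) and `EU` follows only such transitions.

```python
def check(states, trans, interp, formula):
    """Return the set of states satisfying `formula` (global model checking).

    states : iterable of states
    trans  : set of (source, action, target) triples
    interp : dict state -> set of atomic propositions true in that state
    formula: nested tuples, e.g.
             ("and", ("atom", "Teacher"), ("EX", {"age"}, ("atom", 37)))
    """
    states = set(states)
    op = formula[0]
    if op == "true":
        return set(states)
    if op == "atom":
        return {s for s in states if formula[1] in interp.get(s, set())}
    if op == "not":
        return states - check(states, trans, interp, formula[1])
    if op == "and":
        return (check(states, trans, interp, formula[1])
                & check(states, trans, interp, formula[2]))
    if op == "EX":
        acts, target = formula[1], check(states, trans, interp, formula[2])
        return {s for (s, a, t) in trans if a in acts and t in target}
    if op == "AX":
        acts, target = formula[1], check(states, trans, interp, formula[2])
        bad = {s for (s, a, t) in trans if a in acts and t not in target}
        return states - bad
    if op == "EU":
        acts = formula[1]
        hold = check(states, trans, interp, formula[2])
        sat = set(check(states, trans, interp, formula[3]))
        frontier = list(sat)
        # Backward propagation: add predecessors that satisfy the left operand.
        while frontier:
            t = frontier.pop()
            for (s, a, t2) in trans:
                if t2 == t and a in acts and s in hold and s not in sat:
                    sat.add(s)
                    frontier.append(s)
        return sat
    raise ValueError(f"unknown operator: {op!r}")
```

Because the function returns every satisfying state at once, a local check for a single state reduces to a membership test on the result.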
3. Modeling Semantic Inconsistency
When one language is used to semantically model different data sources for building the same domain ontology, semantic collisions become prominent, so consistency checking is unavoidable. Even with a single data source, the incremental procedure of building the semantic model may fail when a new snippet collides semantically with existing model segments. Therefore, ambiguities should be detected in order to obtain a correct semantic model while building a domain ontology. Semantic consistency checking is a mechanism for checking whether the model is free of semantic ambiguity and paradox.
Two kinds of problems are of concern: redundancy and paradox. Eliminating redundancy reduces the size of the model and accelerates the modeling procedure. For the middle-model language Wgraph, according to Definition 3, two equivalent models (or model segments) cause a redundancy.
As far as paradox is concerned, there are four types of inconsistency: concept inconsistency, relationship inconsistency, attribute inconsistency, and fact inconsistency. Before discussing these inconsistencies in detail, another special Wgraph, the so-called Wschema, is presented. A schema gives a pattern for organizing the data of an instance. The schema of a Winstance is also a Wgraph, and it can be defined formally as follows.
Definition 8. A Wschema is a Wgraph such that for each edge of and and for each node of ; that is, a schema has no values.
For example, Figure 2 depicts the Wschema of the Winstance in Figure 1. So the Wgraph language can be used to model semantic aspects of the knowledge, nodes for concepts, edges for relationships between them, Wschemata for the patterns of the knowledge, and Winstances for the concrete contents of that knowledge. Now we will discuss the details of each inconsistency based on the Wgraph.
Concept Redundancy. While building the semantic model, if a new concept occurs, we add it to the model (that is, add a new node to the Wschema). If a concept paradox occurs, we also add a new concept to the model. Concept redundancy, however, occurs when two concepts (with different names, of course) have the same attribute set and relationship set. Concept redundancy always occurs in the Wschema. For example, Figure 3 shows that Teacher and Tutor are the same concept. Instead of adding a new concept Tutor to the model, Tutor can be recorded as a synonym of Teacher by a single edge labeled “synonym.”
We define concept redundancy as follows.
Definition 9. In the Wgraph , two concepts and (expressed as nodes ) are redundancy when they get the same adjacent nodes set and edges set: , and for all and corresponding such that .
Here, means that after omitting concepts nodes , and their adjacent edges the subgraph including node bisimulates the subgraph including node . Computing is an iterative process. For deciding whether , we delete node and its adjacent edges from and node and its adjacent edges from and then decide whether the two smaller subgraphs are bisimulation. The iterative process will terminate because the nodes and edges are finite.
Relationship Inconsistency. When a new relationship between concepts is found during the modeling process, it cannot be added directly into the model, because an inconsistency may occur. Firstly, if the relationship is redundant according to the bisimulation, it should be ignored. Secondly, before the relationship is added, we must ensure that it does not introduce a paradox into the model; otherwise, a relationship inconsistency occurs. For example, Figure 4 shows that the relationship “Student teaches Teacher” causes a paradox. The relationship may be found in an XML document stating “Teacher teaches students knowledge, and meanwhile teacher also learns something from students.” The relationship “teacher learns from students” may be understood as “Student teaches Teacher,” but this contradicts the relationship “Teacher teaches Student,” which is inferred from “Teacher teaches Course” and “Student attends the Course.”
The paradox relationship can be formally defined as follows.
Definition 10. In the Wgraph , a paradox relationship between node and node is that and , where .
Attribute Inconsistency. The attributes of concepts are another kind of relationship, the so-called “is-a” or “has-a” relationship. So attribute inconsistency is a kind of relationship inconsistency, but simpler.
Fact Inconsistency. This kind of inconsistency occurs in the Winstance. Facts are the individuals of concepts or attributes. When a Winstance is created by populating data from heterogeneous sources, many fact inconsistencies may occur. In Figure 5(a), an attribute fact inconsistency arises if the fact “age is 40” is added into the model, because the teacher named Charley would then have two different ages. In Figure 5(b), Teacher2 has exactly the same attribute values as Teacher1, so Teacher1 and Teacher2 constitute a concept fact inconsistency. Therefore, the two facts “age is 40” and “Teacher2” cannot be added into the model.
Figure 5: (a) attribute fact inconsistency; (b) concept fact inconsistency.
The fact inconsistency is formally defined as follows.
Definition 11. In the Winstance where , two facts are said to be inconsistent if (1)for , s.t. and or(2)for , and s.t. and .
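The two cases of fact inconsistency can be illustrated directly on a small instance, before any model checking is involved. The sketch below is only an illustration of the idea in the Figure 5 examples: it assumes that every edge passed in is an attribute edge with a single admissible value, and the function names and node identifiers are hypothetical.

```python
from collections import defaultdict


def attribute_fact_conflicts(labels, edges):
    """Entities that carry two different values for the same attribute label."""
    values = defaultdict(set)
    for src, attr, dst in edges:
        values[(src, attr)].add(labels[dst])
    return {key: vals for key, vals in values.items() if len(vals) > 1}


def concept_fact_duplicates(labels, edges):
    """Groups of same-concept instances whose attribute/value sets coincide."""
    profile = defaultdict(set)
    for src, attr, dst in edges:
        profile[src].add((attr, labels[dst]))
    groups = defaultdict(list)
    for node in profile:
        groups[(labels[node], frozenset(profile[node]))].append(node)
    return [sorted(nodes) for nodes in groups.values() if len(nodes) > 1]
```

For instance, a teacher node with `age` edges to both 37 and 40 is reported by the first function, and two Teacher nodes that share the same name and age are grouped by the second.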
For all these inconsistencies, the core problem is how to discover the redundancy and paradox in the model. In fact, the discovery procedure is a subgraph query problem in Wgraph.
Definition 12. For a Wgraph , a subgraph of is also a Wgraph , where and . Furthermore, given two sets of nodes , is accessible from if for each there is a node such that there is a path in from to .
Definition 13. A Wquery is a pointed Wgraph, namely, a pair with as a node of (the point). A Wquery is accessible if the set of nodes of is accessible from .
For example, Figure 6 depicts three Wqueries, whose points are marked by thick arrows. Intuitively, the meaning of the first query is to collect all the teachers aged 37. The second query asks for all the teachers that have declared an age (observe the use of an undefined node). The third query, instead, collects all the teachers that teach some course other than Database; dashed nodes and lines are introduced to allow a form of negation.
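Figure 6: (a) teachers aged 37; (b) teachers with a declared age; (c) teachers who teach some course other than Database.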
According to Definition 13, an inconsistency checking problem is indeed a query problem in the partial Wgraph model. Meanwhile, the semantic equivalence must be of concern during query processing. We call this a semantical equivalence query problem. For the incremental procedure of building domain ontology, a semantical equivalence querying should be executed as soon as each new element (concept, relationship, attribute or fact) is added into the model. So, the semantic equivalence querying problem is the basic problem for the consistency checking.
However, the semantical equivalence querying problem for a Wgraph model is a subgraph-isomorphism problem, which is NP-complete in general. In this paper, we employ model-checking technology to handle the semantical equivalence query problem in order to avoid computing subgraph isomorphism.
4. Consistency Checking
In order to employ model-checking technology, we view a Wgraph model as a Kripke structure and a semantical equivalence query as a formula of the temporal logic CTL. In this way, the inconsistency checking problem is reduced to the problem of finding the states of the model that satisfy the formula (the model-checking problem), which can be done in linear time.
4.1. WGraph as Kripke Structure
With the following definition, we can build Kripke structure from Wgraph.
Definition 14. Let be a Wgraph, the Kripke structure over the set of atomic properties is defined as follows: (i) is the set of all node labels of , .(ii)The set of states .(iii)The set of actions ; that is, the set of actions includes all the edge labels, negative labels (express as ), and their inverse labels (express as ); note that negative labels are very different from inverse labels. A negative label edge expresses that there is no special relationship (i.e., “label” relationship) between two nodes, but an inverse label edge expresses that there is a special relationship (i.e., inverse relationship) between this two nodes.(iv)The ternary transition relation . Moreover, assume that, for each state with no outgoing edge in (a leaf in ), a selfloop edge labeled by the special action is added.(v)The interpretation function , where . That is, in each state the only formulae that hold are the unary atom and .
For instance, applying Definition 14 to the Wgraph of Figure 1 yields the Kripke structure shown in Figure 7.
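The translation of Definition 14 can be sketched in a few lines of Python. This is a simplified reading rather than the authors' implementation: negative labels are left out, the spelling `label^-1` for inverse labels and the name `loop` for the special self-loop action are placeholders chosen for this example, and leaves are determined from the original edges only.

```python
LOOP = "loop"  # hypothetical name for the special self-loop action


def inverse(label):
    return f"{label}^-1"  # hypothetical spelling of an inverse label


def to_kripke(labels, edges):
    """Build (props, states, actions, trans, interp) from a Wgraph.

    labels: dict node -> label (None for an unlabeled node)
    edges : set of (source, edge label, target) triples
    """
    states = set(labels)
    props = {lab for lab in labels.values() if lab is not None}
    trans = set()
    for src, lab, dst in edges:
        trans.add((src, lab, dst))           # the edge itself
        trans.add((dst, inverse(lab), src))  # its inverse
    has_out = {src for src, _lab, _dst in edges}
    for leaf in states - has_out:
        trans.add((leaf, LOOP, leaf))        # self-loop on leaves of the graph
    actions = {a for _s, a, _t in trans}
    interp = {n: ({lab} if lab is not None else set())
              for n, lab in labels.items()}
    return props, states, actions, trans, interp
```

Each node becomes a state whose only atomic proposition is its own label, so queries about node labels turn into atomic formulae and queries about edges turn into next-state modalities indexed by actions.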
4.2. Query as CTL Formula
Reviewing the Wquery in Figure 6(a), intuitively, the CTL formula must express the statement “the state Teacher formula is true and there is one next state reachable by an edge labeled age, where the 37 formula is true,” this is to say For the query in Figure 6(b), the CTL formula should be written as For the query of Figure 6(c), we get the following formula: This formula is true if “there is a node labeled Teacher and there exists one next node labeled Course, which is reachable by an edge labeled teaches, such that for all next to Course nodes labeled Database, the relation cName is not fulfilled”. Let us then consider how to encode Wqueries in CTL formulae. We study the situation that Wgraph and Wquery are both acyclic. Firstly, we define an auxiliary function to simply handle labels of nodes and edges in Wgraph.
Definition 15. Let be a Wgraph, for all nodes and for all edges ; an auxiliary function is defined as
And the query translation can be formally defined as follows.
Definition 16. Let be a acyclic Wgraph, , and let be an accessible query. The formula associated with is , where is defined recursively as follows: (i)let be the successors of , s.t. ,(ii)let be the successors of , s.t. (iii)for and , let be the edge which links to and the one which links to . If , then
The construction of the formula involves unfolding a directed acyclic graph, so the size of the formula can grow exponentially with respect to the size of the query. Nevertheless, it is easy to compute the formula without repeating subformulae, keeping the memory allocation linear in the size of the query, so the formula can be represented compactly using a linear amount of memory. This compact representation is supported by the model checker NuSMV [19].
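A minimal sketch of the recursive encoding, restricted to solid edges, may help fix ideas. It is an approximation of Definition 16, not a transcription: dashed (negated) parts of a query are not handled, an unlabeled node is rendered as `true`, and the textual syntax `EX_label(...)` is an assumption of this example. The function unfolds shared subqueries, so it exhibits exactly the growth discussed above.

```python
def encode_query(labels, edges, point):
    """Return a CTL formula string for the acyclic query rooted at `point`.

    labels: dict node -> label (None for an undefined node)
    edges : iterable of (source, edge label, target) triples
    """
    succ = {}
    for src, lab, dst in edges:
        succ.setdefault(src, []).append((lab, dst))

    def phi(node, path):
        if node in path:
            raise ValueError("query must be acyclic")
        label = labels.get(node)
        parts = ["true" if label is None else str(label)]
        ordered = sorted(succ.get(node, []), key=lambda e: (str(e[0]), str(e[1])))
        for lab, dst in ordered:
            # One existential next-step conjunct per outgoing edge.
            parts.append(f"EX_{lab}({phi(dst, path | {node})})")
        return " & ".join(parts)

    return phi(point, frozenset())
```

For the query "teachers aged 37", the sketch produces `Teacher & EX_age(37)`, which matches the informal reading given for Figure 6(a).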
Theorem 17. Given a Winstance and a Wquery , let be the Kripke structure associated with and the CTL formula associated with . Consistency checking with can be done in linear time on .
Proof. For achieving the consistency checking procedure, there are three steps to follow, translating the Wgraph into a Kripke structure, encoding the Wquery into a CTL formula, and executing a modelchecking process.
Assuming that , where is the number of nodes in Winstance and is the number of edges in this graph, notating , expressing as the number of states, actions, and transition relations in , and writing as the length of the CTL formula , then the complexity issues can be analyzed.
Firstly, when creating the Kripke structure, for each node, action, and transitional relation that must be created one by one, the time complexity should be , according to Definition 14. Secondly, we consider the length of the encoded CTL formula , because the worst situation is to encode the whole Wgraph into a formula, where is the total number of edges and nodes in this query. The last step is to execute modelchecking procedure; the time complexity is according to [16]. So the total time complexity is , which is a linear time on .
If we think that the is always less than or equal to (intuitively, it is always so), then , which is a polynomial time over .
The equivalence between two segments of Wgraph is the equivalence between two CTL formulae, which can be formally proved in linear time. The consistency checking problem for the Wgraph model is promoted to the equivalence proof problem for the CTL formulae with a Kripke structure.
4.3. Checking Consistency
While a domain ontology is extracted gradually from heterogeneous data sources, semantic consistency has to be checked. According to Definitions 14 and 16, after encoding a Wgraph as a Kripke structure and a Wquery as a CTL formula, the semantical equivalence query problem on the Wgraph is reduced to a global model-checking problem.
Definition 18. Given a Winstance and a Wquery , let be the Kripke structure associated with and the CTL formula associated with . The Consistency checking with amounts to solve the global modelchecking problem with the Kripke structure and the CTL formula , namely, to find all the states of such that . Algorithm 1 describes this procedure.
So semantic consistency checking can be done with modelchecking technology. Let us consider each type of inconsistency defined above. As for concepts inconsistency, it is to find out whether there is another concept (equivalent to ) in Wschema , which has been built to be extended, when a new concept has been added into the Wschema . This is to say, a Wquery should be imposed on the for consistency checking. Let be the Kripke structure of and let be the CTL formula of and we should find all states of such that .
As far as relationship inconsistency is concerned, it is to query a Wgraph with a Wquery before a new relationship is added into the Wgraph model. Let be the Kripke structure of , and the CTL formula is The consistency checking is to find all states of such that . If the states like this cannot be found, then the model is consistent after adding the new relationship ; if any state has been found, then the model will be inconsistent and the new relationship cannot be added into the model. This is the relationship redundant inconsistency. As for paradox relationship paradox inconsistency, the CTL formula is
As for fact consistency checking, the Kripke structure of the Winstance is , and the CTL formula for the attribute fact (the instance of a concept has the concrete attribute value , i.e., ) inconsistency is The CTL formula for the concept fact (the instance of one concept to be added into the model) inconsistency is where are all successors of . If some states of satisfy the above formula, then the inconsistency occurs. This is to say, a fact cannot be added into the model if an equivalent fact has already existed in the model.
5. Case Study
In this section, we test the semantic consistency checking technique presented above by using the model checker NuSMV 2.5.4 [20]. NuSMV allows for the representation of finite-state machines (FSMs) and for the analysis of specifications expressed in computation tree logic (CTL) using symbolic model-checking techniques. To use the tool, we first rewrite the Kripke structure into a finite-state machine whose edges are not labeled, as follows: given a Kripke structure related to a Winstance, replace every labeled edge by two unlabeled edges that pass through a new intermediate node, where the new node is labeled with the action of the original edge.
The input of the NuSMV tool is an SMV program, which can express both FSMs and CTL formulae. Algorithm 2 is the SMV program that describes the Kripke structure of the Winstance shown in Figure 7 and some CTL specifications of the consistency checks mentioned above; lines starting with the symbol “--” are comments.
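The rewriting of labeled edges into unlabeled ones can be sketched as follows. The naming scheme `act0`, `act1`, ... for the fresh intermediate nodes is hypothetical and assumes that no existing node already uses such a name.

```python
def split_labeled_edges(labels, trans):
    """Replace each (src, action, dst) by src -> mid -> dst, with mid labeled by action.

    labels: dict node -> label
    trans : set of (source, action, target) triples
    Returns the extended label map and a set of unlabeled (source, target) edges.
    """
    new_labels = dict(labels)
    new_edges = set()
    # Sort for a deterministic numbering of the fresh nodes.
    for i, (src, action, dst) in enumerate(sorted(trans, key=repr)):
        mid = f"act{i}"  # hypothetical naming scheme for the fresh nodes
        new_labels[mid] = action
        new_edges.add((src, mid))
        new_edges.add((mid, dst))
    return new_labels, new_edges
```

After this step, actions are ordinary states, so a formula that previously constrained the action of a transition constrains the label of the intermediate state instead.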

In this SMV program, the set of states of the Kripke structure is chosen by declaring the state variable state to assume values , where actions have also been seen as states. The transition relation of the Kripke structure is expressed by assigning (ASSIGN), for each value of the variable state, the list of nodes that can be reached from it through one edge. The variables label and nl are introduced to define the node label of each state identified by the value of the variable state. And CTL formulae have been defined by CTLSPEC. The formula & & says when a new concept Teacher, which has an age attribute, has a teaches action, and can teach some Course, is to be added into the model whether there exists an inconsistency. And & expresses whether a new relationship can be added into the model without semantic inconsistency. And & says whether the attribute relationship “the Teacher is 40 years old” can be added into the model without any redundancy and paradox. And & & expresses whether the fact “the 37year Teacher teaches Database Course” can be added into the model.
The results obtained on a 64-bit Windows 7 operating system are shown in Figure 8. The output true for a CTL formula means that at least one inconsistency exists when the Wgraph model is checked against the new semantic segment (expressed by this formula), so the new segment cannot be added into the model. Otherwise, we choose another initial state and check again, until all elements of the set of initial states have been processed. If the final output is always false, then no inconsistency has occurred, and the new semantic segment can be added into the model.
This test confirms the feasibility of solving the semantic consistency checking problem with polynomial-time model-checking methods.
6. Conclusion
To validate semantic consistency during the incremental procedure of building a domain ontology from heterogeneous sources, we employ model-checking technology to avoid the subgraph isomorphism problem, which is NP-hard. In order to adopt the model-checking method, we formally transform the semantic model into a Kripke structure and the semantical equivalence querying problem into CTL formulae, so that semantic consistency checking is reduced to the global model-checking problem. An experiment with the model-checking tool NuSMV has also been presented. In future work, the reasoning problem should be considered; for example, some implicit semantic elements can be inferred from the existing model. If a new semantic segment is equivalent to such implicit semantic elements, an inconsistency also occurs, and this type of consistency checking should be addressed as well.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This paper is supported by the Natural Science Foundation of Guangxi under Grants nos. 2011GXNSFA018154 and 2012GXNSFGA060003, the Science and Technology Foundation of Guangxi under Grant no. 101691, Guangxi Scientific Research Project no. 201012MS274, and the starting fund of GXUN under Grant no. 2011QD017. This paper is supported also by Grant (2012HCIC04) of Guangxi Key Laboratory of Hybrid Computation and IC Design Analysis Open Fund.