Abstract

Semantic collisions are inevitable when a domain ontology is built (semi-)automatically from heterogeneous data sources. Semantic consistency is therefore an indispensable precondition for building a correct ontology. In this paper, a model-checking-based method is proposed to handle the semantic consistency problem within a middle-model methodology that extracts a domain ontology semiautomatically from structured and semistructured data sources. The method translates the middle model into a Kripke structure and consistency assertions into CTL formulae, so that the consistency checking problem is reduced to a global model-checking problem. Moreover, the feasibility and correctness of the transformation are proved, and case studies are provided.

1. Introduction

The semantic web has been a great idea and a promising research area for about a dozen years [1]. A main obstacle to the wider adoption of semantic web technologies is the lack of semantic data, that is, of ontologies. Many researchers therefore focus on how to transform the mass of legacy web data into ontologies. Legacy web data can be classified roughly into three types: structured data, such as relational databases; semistructured data, such as XML documents and emails; and unstructured data, such as general text and video. Recently, technologies that automatically transform semistructured or structured data into a domain ontology through an intermediate model have shown promise [2–6]. With these technologies, semantic collisions inevitably appear when the same domain ontology is built from multiple data sources.

A domain ontology is a formal expression of domain knowledge. It aims to unify the general knowledge of a specific domain in order to share content, achieve interoperability, or integrate applications without specific authorization. Unfortunately, this unification is hard even within a single organization, where the knowledge exists in very different forms, with distinct interpretations and dissimilar usage across applications. Automatically creating a domain ontology from heterogeneous sources is therefore a major challenge. Semantic paradoxes and ambiguities must be addressed while the domain ontology is being built, which requires validating semantic consistency. Semantic consistency guarantees a correct and concise domain ontology for semantic web applications that draw on multiple data sources.

In [2, 7–11], researchers transform structured data (relational database schemata) into a middle model and then create a domain ontology from the model. They assume that the relational database is the only data source and that the database is well defined (free of ambiguity), which is not always realistic. Some semantics-preserving properties of the transformation are proved, but semantic consistency is not addressed and is left to the created domain ontology.

As for semistructured data sources, in [4–6], researchers employ a mediate model language to model semistructured data and then transform the middle models into a domain ontology. Validity checking is provided for the case where collisions happen, but it is generally a syntactic check; semantic consistency checking is again missing.

For building a domain ontology from heterogeneous data sources, semantic consistency checking is necessary. In [12, 13], the same formal middle-model language is adopted to model both structured and semistructured data, so the method can be used to build a domain ontology from heterogeneous data sources. In this paper, we focus on the semantic consistency problem in this method. The work in [14] develops a formal middle language to describe semistructured data for model-checking purposes, and [15] employs a graph-based formalism to model semistructured data and queries based on fixed-point computation. Inspired by model-checking technology [16], we translate the mediate model into a Kripke structure and encode semantic queries as CTL formulae, which transforms the semantic consistency checking problem into a global model-checking procedure.

Section 2 recalls the mediate-model-based method of building domain ontologies and the model-checking technology. In Section 3, the mediate model language is introduced, and model equivalence is analyzed. The model-checking-based consistency checking technology is presented in detail in Section 4. Some cases are studied in Section 5, and Section 6 concludes.

2. Mediate-Model-Based Technology and Model Checking

In this section, the method of formally creating a domain ontology with the mediate model [12, 13] and the model-checking technology [16, 17] are introduced.

2.1. Formally Creating Domain Ontology

W-graph, a graph-based formal language, is used to model the semantic aspects of relational databases [12]. The main idea is to execute SQL procedures to retrieve the semantic information of the database instances, transform the result sets into a W-graph model, and then transform the model into an ontology automatically. This not only maps schemata to the middle model but also populates the model with the data stored in the databases. The language is also used in [5] to give XML documents a semantic interpretation. A W-graph model created from relational databases or XML documents can be automatically transformed into a domain ontology expressed in the Web Ontology Language (OWL) [18]. A W-graph is defined as follows.

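Definition 1. A W-graph is a directed labeled graph $G = \langle N, E, \ell \rangle$, where $N$ is a finite set of nodes, $N_a \subseteq N$ is a finite set of atomic nodes depicted as ellipses, $N_c \subseteq N$ is a finite set of composite nodes depicted as rectangles, $E$ is a set of labeled edges of the form $\langle m, label, n \rangle$ with $m, n \in N$, $\ell$ is defined as $\ell : N \to C \times (L \cup \{\bot\})$, $C = \{solid, dashed\}$ is the set of colors, $L$ is a set of labels, and $\bot$ is a symbol for nothing (the empty label, which can be read as bottom).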
Nodes in W-graph always represent objects, and edges represent relationships between nodes. There are two types of concrete W-graph: instances and schemata. An instance can be formally defined as follows.

Definition 2. A W-instance is a W-graph $G$ such that $\ell_C(e) = solid$ for each edge $e$ of $G$ and $\ell_C(n) = solid$ and $\ell_L(n) \neq \bot$ for each node $n$ of $G$.

In Figure 1 a W-instance is depicted., where,, , , , , , .

It describes the information that two teachers, one 37 years old and the other 40 years old, teach the Database course, and that the student Smith attends this course. In a W-graph, an edge attribute consists of two components, the color and the label, and the function $\ell$ returns a color and a label (possibly empty, $\bot$) for each node. Edge labels are placed close to the corresponding edges, and node labels are written inside the rectangles representing the nodes. The set of colors $C$ denotes how the lines of nodes and edges are drawn (solid or dashed); we also call this information the color of a node or edge. The function $\ell$ can be seen as the composition of the two single-valued functions $\ell_C$ and $\ell_L$, so $\ell$ can also be implicitly defined on edges: if $e = \langle m, \langle c, k \rangle, n \rangle$, then $\ell_C(e) = c$ and $\ell_L(e) = k$. Two nodes may be connected by more than one edge, provided that the edge attributes are different.
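To make the roles of the color and label components concrete, the following minimal Python sketch stores a W-graph as plain records. The names `Node`, `Edge`, `WGraph`, `make_graph`, and `is_instance` are illustrative assumptions rather than part of the W-graph language. `None` stands for the empty label, and the instance test assumes the usual reading that a W-instance contains only solid edges and solid, labeled nodes. The example data follow the prose description of Figure 1; the assignment of the two ages to the two teachers is chosen arbitrarily.

```python
from dataclasses import dataclass
from typing import Optional

SOLID, DASHED = "solid", "dashed"

@dataclass(frozen=True)
class Node:
    name: str
    label: Optional[str] = None   # None models the empty label (bottom)
    color: str = SOLID

@dataclass(frozen=True)
class Edge:
    source: str
    label: Optional[str]
    target: str
    color: str = SOLID

@dataclass
class WGraph:
    nodes: dict   # node name -> Node
    edges: set    # set of Edge; parallel edges must differ in color or label

def make_graph(labels, triples):
    """Build an all-solid W-graph from {name: label} and (source, label, target) triples."""
    nodes = {name: Node(name, label) for name, label in labels.items()}
    edges = {Edge(s, lab, t) for s, lab, t in triples}
    return WGraph(nodes, edges)

def is_instance(g):
    """Assumed W-instance test: all edges solid; all nodes solid and labeled."""
    edges_ok = all(e.color == SOLID for e in g.edges)
    nodes_ok = all(n.color == SOLID and n.label is not None
                   for n in g.nodes.values())
    return edges_ok and nodes_ok

# Hypothetical encoding of the W-instance described for Figure 1.
figure1 = make_graph(
    {"n1": "Teacher", "n2": "Teacher", "n3": "Course", "n4": "Student",
     "n5": "37", "n6": "40", "n7": "Database", "n8": "Smith"},
    [("n1", "teaches", "n3"), ("n2", "teaches", "n3"),
     ("n1", "age", "n5"), ("n2", "age", "n6"),
     ("n3", "cName", "n7"), ("n4", "attends", "n3"), ("n4", "name", "n8")],
)

# A graph with an unlabeled node is still a W-graph but not a W-instance.
query_like = make_graph({"q0": "Teacher", "q1": None}, [("q0", "age", "q1")])
```

Storing edges in a set reflects the rule that two nodes may be joined by several edges only when the edge attributes differ.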

For two subsets of , and is accessible from if for each node in the set there is a corresponding node in the set such that there exists a path from to in W-graph . For example, in the W-instance of Figure 1, the set is accessible from the set .
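Accessibility is a plain reachability condition, so it can be sketched with a breadth-first search. The function names and the edge list below are assumptions made for illustration (the edges encode the prose description of Figure 1, with the ages assigned to the teachers arbitrarily); paths of length zero are accepted, so every node is accessible from itself.

```python
from collections import deque

# Hypothetical edge list for the W-instance described for Figure 1.
EDGES = [("n1", "teaches", "n3"), ("n2", "teaches", "n3"),
         ("n1", "age", "n5"), ("n2", "age", "n6"),
         ("n3", "cName", "n7"), ("n4", "attends", "n3"), ("n4", "name", "n8")]

def reachable_from(sources, edges):
    """Return every node reachable from `sources` by a directed path (length >= 0)."""
    successors = {}
    for source, _label, target in edges:
        successors.setdefault(source, set()).add(target)
    seen = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def is_accessible(targets, sources, edges):
    """True if each node of `targets` is reachable from some node of `sources`."""
    return set(targets) <= reachable_from(sources, edges)
```

With this encoding, `is_accessible({"n5", "n6"}, {"n1", "n2"}, EDGES)` holds, while the Student node `n4` is not accessible from `{"n1"}`.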

The bisimulation semantics of the language is also given as follows.

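Definition 3. Given two W-graphs $G_0 = \langle N_0, E_0, \ell_0 \rangle$ and $G_1 = \langle N_1, E_1, \ell_1 \rangle$, a relation $b \subseteq N_0 \times N_1$ is said to be a bisimulation between $G_0$ and $G_1$ (write $G_0 \stackrel{b}{\sim} G_1$) if and only if: (1) for $i = 0, 1$, $\forall n_i \in N_i$, $\exists n_{1-i} \in N_{1-i}$ such that $n_0 \, b \, n_1$, (2) for $\forall n_0 \in N_0$, $\forall n_1 \in N_1$ s.t. $n_0 \, b \, n_1$, $\ell_0(n_0) = \ell_1(n_1)$, and (3) for $i = 0, 1$, $\forall n \in N_i$, let $M_i(n) = \{\langle m, label \rangle : \langle n, label, m \rangle \in E_i\}$. Then, $\forall n_0 \in N_0$, $\forall n_1 \in N_1$ such that $n_0 \, b \, n_1$; for $i = 0, 1$, it holds that $\forall \langle m_i, label_i \rangle \in M_i(n_i)$, $\exists \langle m_{1-i}, label_{1-i} \rangle \in M_{1-i}(n_{1-i})$ with $m_0 \, b \, m_1$ and $label_0 = label_1$.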
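A bisimulation between two small W-graphs can be computed by starting from all pairs of nodes with equal labels and repeatedly discarding pairs whose outgoing edges cannot be matched. The sketch below is a naive fixed-point refinement written for illustration only; it is not the authors' procedure, and the graph encoding (a pair of a label dictionary and a list of `(source, label, target)` triples) and all names are assumptions.

```python
def successors(edges):
    """Map each node to its set of (edge label, target) pairs."""
    out = {}
    for source, label, target in edges:
        out.setdefault(source, set()).add((label, target))
    return out

def bisimilar(g0, g1):
    """True if some bisimulation relates every node of g0 and every node of g1."""
    labels0, edges0 = g0
    labels1, edges1 = g1
    succ0, succ1 = successors(edges0), successors(edges1)
    # Start from all pairs with equal node labels, then refine.
    rel = {(a, b) for a in labels0 for b in labels1 if labels0[a] == labels1[b]}
    changed = True
    while changed:
        changed = False
        for a, b in list(rel):
            forward = all(
                any(l0 == l1 and (m0, m1) in rel for l1, m1 in succ1.get(b, ()))
                for l0, m0 in succ0.get(a, ()))
            backward = all(
                any(l0 == l1 and (m0, m1) in rel for l0, m0 in succ0.get(a, ()))
                for l1, m1 in succ1.get(b, ()))
            if not (forward and backward):
                rel.discard((a, b))
                changed = True
    # Condition (1): every node of each graph is related to some node of the other.
    total0 = all(any((a, b) in rel for b in labels1) for a in labels0)
    total1 = all(any((a, b) in rel for a in labels0) for b in labels1)
    return total0 and total1

one_teacher = ({"t": "Teacher", "a": "37"}, [("t", "age", "a")])
two_teachers = ({"t1": "Teacher", "t2": "Teacher", "a1": "37"},
                [("t1", "age", "a1"), ("t2", "age", "a1")])
older_teacher = ({"t": "Teacher", "a": "40"}, [("t", "age", "a")])
```

Here `one_teacher` and `two_teachers` are bisimilar although they are not isomorphic, which is exactly the kind of redundancy the bisimulation semantics is meant to expose.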

2.2. Model Checking

Model checking is an automated technique that, given a finite-state model of a system and a formal property, systematically checks whether this property holds for (a given state in) the model. Here the finite-state model is a Kripke structure (an automaton-like state transition system), and the formal properties are expressed as formulae of computation tree logic (CTL, a logic based on a branching-time view).

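Definition 4. A Kripke structure is a transition system $K = \langle \Sigma, Act, R, I \rangle$ over a set $\Pi$ of atomic propositions, where $\Sigma$ is a set of states, $Act$ is a set of actions, $R \subseteq \Sigma \times Act \times \Sigma$ is a transition relation, and $I : \Sigma \to 2^{\Pi}$ is an interpretation.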

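A path in a Kripke structure $K$ is an infinite sequence $\pi = \langle s_0, x_0, s_1, x_1, \ldots \rangle$ of states and actions ($\pi_i$ denotes the $i$th state in the path $\pi$) such that for all $i \geq 0$ it holds that $s_i \in \Sigma$ and either $\langle s_i, x_i, s_{i+1} \rangle \in R$, with $x_i \in Act$, or there are no outgoing transitions from $s_i$ and for all $j \geq i$ it holds that $x_j$ is the special action $\tau$ (which is not in $Act$) and $s_{j+1} = s_i$.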

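Definition 5. Given the sets $\Pi$ and $Act$ of atomic propositions and actions, computation tree logic (CTL) formulae are recursively defined as follows: (1) each $p \in \Pi$ is a CTL formula; (2) if $\varphi_1$ and $\varphi_2$ are CTL formulae, $a \subseteq Act$, then $\neg\varphi_1$, $\varphi_1 \wedge \varphi_2$, $AX_a(\varphi_1)$, $EX_a(\varphi_1)$, $AU_a(\varphi_1, \varphi_2)$, and $EU_a(\varphi_1, \varphi_2)$ are CTL formulae.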

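A and E are the universal and existential path quantifiers, while neXt (X) and Until (U) are the linear-time modalities. Formulae of the form $\varphi_1 \vee \varphi_2$ and $\varphi_1 \to \varphi_2$ can be, respectively, defined by $\neg(\neg\varphi_1 \wedge \neg\varphi_2)$ and $\neg\varphi_1 \vee \varphi_2$, and the modalities Finally (F) and Generally (G) can be defined in terms of the CTL formulae: $F\varphi \equiv U(true, \varphi)$ and $G\varphi \equiv \neg F \neg\varphi$ (with the path quantifier dualized, e.g., $AG\varphi \equiv \neg EF \neg\varphi$).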

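Definition 6. Satisfaction of a CTL formula by a state $s$ of the Kripke structure $K$ is defined recursively as follows: (i) if $p \in \Pi$, then $s \models p$ iff $p \in I(s)$. Moreover, $s \models true$, and $s \not\models false$; (ii) $s \models \neg\varphi$ iff $s \not\models \varphi$; (iii) $s \models \varphi_1 \wedge \varphi_2$ iff $s \models \varphi_1$ and $s \models \varphi_2$; (iv) $s \models EX_a(\varphi)$ iff there is a path $\pi = \langle s, x, s_1, \ldots \rangle$ s.t. $x \in a$ and $s_1 \models \varphi$; (v) $s \models AX_a(\varphi)$ iff for all paths $\pi = \langle s, x, s_1, \ldots \rangle$, $x \in a$ implies $s_1 \models \varphi$; (vi) $s \models EU_a(\varphi_1, \varphi_2)$ iff there is a path $\pi = \langle s_0, x_0, s_1, x_1, \ldots \rangle$, and $\exists j \in \mathbb{N}$ s.t. $s_0 = s$, $s_j \models \varphi_2$, and $\forall i < j$, $s_i \models \varphi_1$ and $x_i \in a$; (vii) $s \models AU_a(\varphi_1, \varphi_2)$ iff for all paths $\pi = \langle s_0, x_0, s_1, x_1, \ldots \rangle$ s.t. $s_0 = s$, $\exists j \in \mathbb{N}$, $s_j \models \varphi_2$, and $\forall i < j$, $s_i \models \varphi_1$ and $x_i \in a$.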

Definition 7 (the model-checking problem). Local model checking: given a Kripke structure $K$, a formula $\varphi$, and a state $s$ of $K$, the local model-checking problem is to verify whether $s \models \varphi$. Global model checking: given a Kripke structure $K$ and a formula $\varphi$, the global model-checking problem is to find all states $s$ of $K$ such that $s \models \varphi$.

If $K$ is finite, the global model-checking problem for a CTL formula $\varphi$ can be solved in time linear in $|\varphi| \cdot (|\Sigma| + |R|)$, where $|\varphi|$ is the length of the formula $\varphi$, $|\Sigma|$ is the number of elements in the set $\Sigma$, and $|R|$ is the number of elements of the transition relation $R$ [16].
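The global model-checking problem can be illustrated with a small explicit-state evaluator that returns the set of states satisfying a formula. This is a sketch under stated assumptions: formulae are nested tuples, only the operators true, atom, not, and, EX, AX, and EU are covered (AU is omitted), leaves are assumed to carry an explicit `tau` self-loop, and the naive fixed-point loop for EU does not achieve the linear bound cited above. The Kripke structure `K` is a hypothetical fragment, not the one of any figure.

```python
def sat(K, f):
    """Return the set of states of K = (states, transitions, interp) that satisfy f."""
    states, trans, interp = K
    op = f[0]
    if op == "true":
        return set(states)
    if op == "atom":
        return {s for s in states if f[1] in interp.get(s, ())}
    if op == "not":
        return set(states) - sat(K, f[1])
    if op == "and":
        return sat(K, f[1]) & sat(K, f[2])
    if op == "EX":      # ("EX", actions, phi): some allowed step reaches phi
        target = sat(K, f[2])
        return {s for (s, a, t) in trans if a in f[1] and t in target}
    if op == "AX":      # ("AX", actions, phi): every allowed step reaches phi
        target = sat(K, f[2])
        bad = {s for (s, a, t) in trans if a in f[1] and t not in target}
        return set(states) - bad
    if op == "EU":      # ("EU", actions, phi1, phi2): least fixed point
        hold, result = sat(K, f[2]), set(sat(K, f[3]))
        while True:
            new = {s for (s, a, t) in trans
                   if a in f[1] and s in hold and t in result} - result
            if not new:
                return result
            result |= new
    raise ValueError(f"unsupported operator: {op}")

K = (
    {"n1", "n2", "n3", "n5", "n6", "n7"},
    {("n1", "age", "n5"), ("n2", "age", "n6"), ("n1", "teaches", "n3"),
     ("n3", "cName", "n7"),
     ("n5", "tau", "n5"), ("n6", "tau", "n6"), ("n7", "tau", "n7")},
    {"n1": {"Teacher"}, "n2": {"Teacher"}, "n3": {"Course"},
     "n5": {"37"}, "n6": {"40"}, "n7": {"Database"}},
)

teacher_aged_37 = ("and", ("atom", "Teacher"), ("EX", {"age"}, ("atom", "37")))
teacher_with_age = ("and", ("atom", "Teacher"), ("EX", {"age"}, ("true",)))
reaches_database = ("EU", {"teaches", "cName"}, ("true",), ("atom", "Database"))
```

Calling `sat(K, teacher_aged_37)` yields `{"n1"}`, and `sat(K, teacher_with_age)` yields both teacher states.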

3. Modeling Semantic Inconsistency

When one language is used to semantically model different data sources for building the same domain ontology, semantic collisions become prominent, so consistency checking is unavoidable. Even with a single data source, the incremental procedure of building the semantic model may fail when a new snippet collides semantically with existing model segments. Therefore, ambiguities should be detected in order to obtain a correct semantic model while building a domain ontology. Semantic consistency checking is a mechanism for checking whether the model is free of semantic ambiguity and paradox.

Two kinds of problems are of concern: redundancy and paradox. Eliminating redundancy reduces the size of the model and accelerates the modeling procedure. For the middle-model language W-graph, according to Definition 3, two equivalent (bisimilar) models or model segments constitute a redundancy.

As far as paradox is concerned, there are four types of inconsistency: concept inconsistency, relationship inconsistency, attribute inconsistency, and fact inconsistency. Before discussing the details of these inconsistencies, another special W-graph, the so-called W-schema, is presented. The schema gives a pattern for organizing the data of an instance. The schema of a W-instance is also a W-graph, and it can be defined formally as follows.

Definition 8. A W-schema is a W-graph such that for each edge of and and for each node of ; that is, a schema has no values.

For example, Figure 2 depicts the W-schema of the W-instance in Figure 1. So the W-graph language can be used to model semantic aspects of the knowledge, nodes for concepts, edges for relationships between them, W-schemata for the patterns of the knowledge, and W-instances for the concrete contents of that knowledge. Now we will discuss the details of each inconsistency based on the W-graph.

Concept Redundancy. During the procedure of building the semantic model, if a new concept occurs, we add this concept to the model (add a new node to the W-schema). If a concept paradox occurs, we also add a new concept to the model. Concept redundancy, however, occurs if two concepts (with different names, of course) have the same attribute set and relationship set. Concept redundancy always occurs in the W-schema. For example, Figure 3 shows that Teacher and Tutor are the same concept here. Tutor should not be added to the model as a new concept; instead, it can be annotated as a synonym of Teacher by a single edge labeled “synonym.”

We define concept redundancy as follows.

Definition 9. In the W-graph , two concepts and (expressed as nodes ) are redundancy when they get the same adjacent nodes set and edges set:, and for all and corresponding such that .

Here, means that after omitting concepts nodes , and their adjacent edges the subgraph including node bisimulates the subgraph including node . Computing is an iterative process. For deciding whether , we delete node and its adjacent edges from and node and its adjacent edges from and then decide whether the two smaller subgraphs are bisimulation. The iterative process will terminate because the nodes and edges are finite.

Relationship Inconsistency. When a new relationship between concepts is found during the modeling process, it cannot be added directly to the model, because an inconsistency may occur. Firstly, if the relationship is redundant according to the bisimulation, it should be ignored. Secondly, if the relationship is added to the model, we must ensure that it does not introduce a paradox; otherwise, a relationship inconsistency occurs. For example, Figure 4 shows that the relationship “Student teaches Teacher” causes a paradox. The relationship may be derived from an XML document stating “Teacher teaches students knowledge, and meanwhile teacher also learns something from students.” The relationship “teacher learns from students” may be understood as “Student teaches Teacher,” but this contradicts the relationship “Teacher teaches Student,” which is inferred from “Teacher teaches Course” and “Student attends the Course.”

The paradox relationship can be formally defined as follows.

Definition 10. In the W-graph , a paradox relationship between node and node is that and , where .

Attribute Inconsistency. The attributes of concepts are another kind of relationship, so-called “isa” or “hasa” relationship. So attribute inconsistency is a kind of relationship inconsistency, but simpler.

Fact Inconsistency. This kind of inconsistency occurs in the W-instance. Facts are the individuals of concepts or attributes. When a W-instance is created by populating data from heterogeneous sources, many fact inconsistencies may occur. In Figure 5(a), an attribute fact inconsistency arises if the fact “age is 40” is added to the model, because the teacher named Charley then has two different ages. In Figure 5(b), Teacher2 has exactly the same attribute values as Teacher1, so Teacher1 and Teacher2 constitute a concept fact inconsistency. Therefore, the two facts “age is 40” and “Teacher2” cannot be added to the model.

The fact inconsistency is formally defined as follows.

Definition 11. In the W-instance where , two facts are said to be inconsistent if (1)for , s.t. and or(2)for , and s.t. and .
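The two kinds of fact inconsistency described for Figure 5 can be sketched as direct checks on an instance. This is only an illustration of the informal description (a second, different value for an attribute, and a new individual that repeats all attribute values of an existing one); it does not use model checking, and the node names, the helper names, and the tiny instance are hypothetical.

```python
def attribute_conflicts(labels, edges, subject, attribute, value):
    """Values already stored for (subject, attribute) that differ from `value`."""
    return sorted(labels[t] for s, a, t in edges
                  if s == subject and a == attribute and labels[t] != value)

def duplicate_individuals(labels, edges, concept, attributes):
    """Existing individuals of `concept` with exactly the attribute/value pairs given."""
    def profile(node):
        return {(a, labels[t]) for s, a, t in edges if s == node}
    wanted = set(attributes.items())
    return sorted(n for n, lab in labels.items()
                  if lab == concept and profile(n) == wanted)

# Hypothetical instance: one teacher named Charley, aged 37.
LABELS = {"t1": "Teacher", "v1": "Charley", "v2": "37"}
EDGES = [("t1", "name", "v1"), ("t1", "age", "v2")]
```

Adding the fact “age is 40” for `t1` is flagged because `attribute_conflicts(LABELS, EDGES, "t1", "age", "40")` returns the stored value, and a second teacher with the same name and age is flagged by `duplicate_individuals`.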

For all these inconsistencies, the core problem is how to discover the redundancy and paradox in the model. In fact, the discovery procedure is a subgraph query problem on the W-graph.

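Definition 12. For a W-graph $G = \langle N, E, \ell \rangle$, a subgraph of $G$ is also a W-graph $G' = \langle N', E', \ell|_{N'} \rangle$, where $N' \subseteq N$ and $E' \subseteq \{\langle m, label, n \rangle \in E : m, n \in N'\}$. Furthermore, given two sets of nodes $S, T \subseteq N$, $S$ is accessible from $T$ if for each $s \in S$ there is a node $t \in T$ such that there is a path in $G$ from $t$ to $s$.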

Definition 13. A W-query is a pointed W-graph, namely, a pair $\langle G, \nu \rangle$ with $\nu$ as a node of $G$ (the point). A W-query $\langle G, \nu \rangle$ is accessible if the set $N$ of nodes of $G$ is accessible from $\{\nu\}$.

For example, in Figure 6 three W-queries are depicted. The thick arrows are their points, and they are W-queries. Intuitively, the meaning of the first query is to collect all the teachers aged 37. The second query asks for all the teachers that have declared an age (observe the use of an undefined node). The third query, instead, requires collecting all the teachers that teach some courses but not Database, where dashed nodes and lines are introduced to allow a form of negation.

According to Definition 13, an inconsistency checking problem is indeed a query problem in the partial W-graph model. Meanwhile, the semantic equivalence must be of concern during query processing. We call this a semantical equivalence query problem. For the incremental procedure of building domain ontology, a semantical equivalence querying should be executed as soon as each new element (concept, relationship, attribute or fact) is added into the model. So, the semantic equivalence querying problem is the basic problem for the consistency checking.

However, the semantic equivalence querying problem for a W-graph model is a subgraph-isomorphism problem, which is NP-complete in general. In this paper, we employ model-checking technology to handle the semantic equivalence query problem in order to avoid computing subgraph isomorphism.

4. Consistency Checking

In order to employ model-checking technology, we can see a W-graph model as a Kripke structure and a semantical equivalency query as a formula of the temporal logic CTL. In this way, the inconsistency checking problem is reduced to the problem of finding out the states of the model which satisfy the formula (the model-checking problem) that can be done in linear time.

4.1. W-Graph as Kripke Structure

With the following definition, we can build Kripke structure from W-graph.

Definition 14. Let be a W-graph, the Kripke structure over the set of atomic properties is defined as follows: (i) is the set of all node labels of , .(ii)The set of states .(iii)The set of actions ; that is, the set of actions includes all the edge labels, negative labels (express as ), and their inverse labels (express as ); note that negative labels are very different from inverse labels. A negative label edge expresses that there is no special relationship (i.e., “label” relationship) between two nodes, but an inverse label edge expresses that there is a special relationship (i.e., inverse relationship) between this two nodes.(iv)The ternary transition relation . Moreover, assume that, for each state with no outgoing edge in (a leaf in ), a self-loop edge labeled by the special action is added.(v)The interpretation function , where . That is, in each state the only formulae that hold are the unary atom and .

For instance, consider the W-graph of Figure 1. It holds the following:(i), , , , , , ,(ii), , , ,(iii),, , , , , , , , , , , , , , , ,(iv), , , , , , , , , , , , , , , , , , ,, , ,,, ,, ,,, , , , , , , , , , , , , , , , , ,,, , ,(v), , , , , , , , , , , , , , .And this Kripke structure is shown in Figure 7.
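The construction can be sketched in a few lines of Python. The sketch covers only part of the definition and rests on assumptions: states are the nodes, each edge contributes a forward transition and an inverse transition (named here with an `inv_` prefix, following the SMV listing in Section 5), leaves of the W-graph receive a `tau` self-loop, and each state is interpreted by its node label. Negative labels are deliberately left out, and the encoding of Figure 1 (in particular which teacher has which age) is hypothetical.

```python
TAU = "tau"

def to_kripke(labels, edges):
    """Translate a W-instance into (states, actions, transitions, interpretation)."""
    states = set(labels)
    transitions = set()
    for source, label, target in edges:
        transitions.add((source, label, target))            # original edge
        transitions.add((target, "inv_" + label, source))   # inverse edge
    # Leaves are nodes without outgoing edges in the W-graph itself.
    non_leaves = {source for source, _label, _target in edges}
    for leaf in states - non_leaves:
        transitions.add((leaf, TAU, leaf))                   # special self-loop
    actions = {a for _s, a, _t in transitions if a != TAU}   # tau is not in Act
    interpretation = {node: {label} for node, label in labels.items()}
    return states, actions, transitions, interpretation

LABELS = {"n1": "Teacher", "n2": "Teacher", "n3": "Course", "n4": "Student",
          "n5": "37", "n6": "40", "n7": "Database", "n8": "Smith"}
EDGES = [("n1", "teaches", "n3"), ("n2", "teaches", "n3"),
         ("n1", "age", "n5"), ("n2", "age", "n6"),
         ("n3", "cName", "n7"), ("n4", "attends", "n3"), ("n4", "name", "n8")]

states, actions, transitions, interpretation = to_kripke(LABELS, EDGES)
```

For this encoding the result has 8 states, 10 actions (five labels and their inverses), and 18 transitions, four of which are `tau` self-loops on the value nodes.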

4.2. Query as CTL Formula

Reviewing the W-query in Figure 6(a), intuitively, the CTL formula must express the statement “the formula Teacher is true in the state, and there is a next state, reachable by an edge labeled age, where the formula 37 is true,” that is, $Teacher \wedge EX_{age}(37)$. For the query in Figure 6(b), the CTL formula should be written as $Teacher \wedge EX_{age}(true)$. For the query of Figure 6(c), the formula is true if “there is a node labeled Teacher and there exists a next node labeled Course, reachable by an edge labeled teaches, such that for all nodes labeled Database next to Course, the relation cName is not fulfilled.” Let us then consider how to encode W-queries as CTL formulae. We study the situation in which the W-graph and the W-query are both acyclic. Firstly, we define an auxiliary function to handle the labels of nodes and edges in a W-graph.

Definition 15. Let be a W-graph, for all nodes and for all edges ; an auxiliary function is defined as

And the query translation can be formally defined as follows.

Definition 16. Let be a acyclic W-graph, , and let be an accessible query. The formula associated with is , where is defined recursively as follows: (i)let be the successors of , s.t. ,(ii)let be the successors of , s.t. (iii)for and , let be the edge which links to and the one which links to . If , then

The construction of the formula involves the unfolding of a directed acyclic graph, so the size of the formula can grow exponentially with respect to the size of the query graph. Nevertheless, it is easy to compute the formula without repeating subformulae and to keep the memory allocation linear in the size of the query graph, so it is natural to represent the formula compactly using a linear amount of memory. This compact representation is supported by the model checker NuSMV [19].
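The recursive unfolding can be sketched for the simplest case. The sketch assumes an acyclic, accessible query with solid nodes and edges only: each node contributes its label (or `true` when the label is undefined, modeled here by `None`), and each outgoing edge labeled `p` contributes a conjunct of the form `EX_p(...)` built from the target node. Dashed nodes and edges are not handled, the textual output format is chosen for illustration, and all names are assumptions.

```python
def query_to_ctl(labels, edges, point):
    """Unfold an acyclic, all-solid W-query rooted at `point` into a CTL string."""
    atom = labels[point] if labels[point] is not None else "true"
    parts = [atom]
    # Visit outgoing edges in a fixed order so that the output is deterministic.
    for _source, label, target in sorted(e for e in edges if e[0] == point):
        parts.append(f"EX_{label}({query_to_ctl(labels, edges, target)})")
    return " & ".join(parts)

# Hypothetical encodings of simple queries in the style of Figure 6(a) and 6(b).
query_a = ({"q0": "Teacher", "q1": "37"}, [("q0", "age", "q1")])
query_b = ({"q0": "Teacher", "q1": None}, [("q0", "age", "q1")])
query_chain = ({"q0": "Teacher", "q1": "Course", "q2": "Database"},
               [("q0", "teaches", "q1"), ("q1", "cName", "q2")])

formula_a = query_to_ctl(*query_a, "q0")          # Teacher & EX_age(37)
formula_b = query_to_ctl(*query_b, "q0")          # Teacher & EX_age(true)
formula_chain = query_to_ctl(*query_chain, "q0")
```

Because a node shared by several paths is unfolded once per path, the string can grow exponentially on a directed acyclic graph; caching the subformula of each node is one way to obtain the compact, linear-size representation mentioned above.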

Theorem 17. Given a W-instance and a W-query , let be the Kripke structure associated with and the CTL formula associated with . Consistency checking with can be done in linear time on .

Proof. For achieving the consistency checking procedure, there are three steps to follow, translating the W-graph into a Kripke structure, encoding the W-query into a CTL formula, and executing a model-checking process.
Assuming that , where is the number of nodes in W-instance and is the number of edges in this graph, notating , expressing as the number of states, actions, and transition relations in , and writing as the length of the CTL formula , then the complexity issues can be analyzed.
Firstly, when creating the Kripke structure, for each node, action, and transitional relation that must be created one by one, the time complexity should be , according to Definition 14. Secondly, we consider the length of the encoded CTL formula , because the worst situation is to encode the whole W-graph into a formula, where is the total number of edges and nodes in this query. The last step is to execute model-checking procedure; the time complexity is according to [16]. So the total time complexity is , which is a linear time on .
If we think that the is always less than or equal to (intuitively, it is always so), then , which is a polynomial time over .

The equivalence between two segments of W-graph is the equivalence between two CTL formulae, which can be formally proved in linear time. The consistency checking problem for the W-graph model is promoted to the equivalence proof problem for the CTL formulae with a Kripke structure.

4.3. Checking Consistency

During the procedure of gradually extracting a domain ontology from heterogeneous data sources, semantic consistency has to be checked. According to Definitions 14 and 16, after a W-graph is encoded as a Kripke structure and a W-query as a CTL formula, the semantic equivalence query problem on the W-graph is reduced to a global model-checking problem.

Definition 18. Given a W-instance $G$ and a W-query $Q$, let $K_G$ be the Kripke structure associated with $G$ and $\varphi_Q$ the CTL formula associated with $Q$. Consistency checking of $G$ with $Q$ amounts to solving the global model-checking problem for the Kripke structure $K_G$ and the CTL formula $\varphi_Q$, namely, finding all the states $s$ of $K_G$ such that $s \models \varphi_Q$. Algorithm 1 describes this procedure.

Require: $G$, // W-graph semantic model
  $Q$ // semantic segment expressed by W-query
Ensure: consistent or inconsistent
encode $G$ as $K_G$; // according to Definition 14
encode $Q$ as $\varphi_Q$; // according to Definition 16
$S := \{s \in \Sigma_G : s \models \varphi_Q\}$; // model checking
if $S$ is the empty set then
  return consistent;
else
  return inconsistent;
end if
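The procedure can be sketched end to end for queries that use solid nodes and edges only. Instead of building the Kripke structure and the formula explicitly, the hypothetical function `matches` directly computes the set of model nodes that would satisfy the corresponding conjunction of labels and EX operators; the verdict then follows the rule of the algorithm (an empty set means consistent). The model encodes the prose description of Figure 1, with the ages assigned to the teachers arbitrarily, and all names are assumptions.

```python
def matches(g_labels, g_edges, q_labels, q_edges, q_node):
    """Model nodes satisfying the EX-only formula of the query rooted at q_node."""
    wanted = q_labels[q_node]
    result = {n for n, lab in g_labels.items() if wanted is None or lab == wanted}
    for _source, label, child in (e for e in q_edges if e[0] == q_node):
        child_states = matches(g_labels, g_edges, q_labels, q_edges, child)
        # Keep only nodes with a `label` edge into a node matching the child query.
        result &= {s for s, a, t in g_edges if a == label and t in child_states}
    return result

def check_consistency(model, query, point):
    """Return the verdict of the algorithm together with the witnessing states."""
    found = matches(*model, *query, point)
    return ("consistent" if not found else "inconsistent"), found

MODEL = (
    {"n1": "Teacher", "n2": "Teacher", "n3": "Course", "n4": "Student",
     "n5": "37", "n6": "40", "n7": "Database", "n8": "Smith"},
    [("n1", "teaches", "n3"), ("n2", "teaches", "n3"),
     ("n1", "age", "n5"), ("n2", "age", "n6"),
     ("n3", "cName", "n7"), ("n4", "attends", "n3"), ("n4", "name", "n8")],
)

aged_40 = ({"q0": "Teacher", "q1": "40"}, [("q0", "age", "q1")])
aged_50 = ({"q0": "Teacher", "q1": "50"}, [("q0", "age", "q1")])

verdict_40, witnesses_40 = check_consistency(MODEL, aged_40, "q0")
verdict_50, witnesses_50 = check_consistency(MODEL, aged_50, "q0")
```

The segment “a Teacher aged 40” is reported as inconsistent because an equivalent segment already exists in the model, whereas “a Teacher aged 50” has no witness and can be added.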

So semantic consistency checking can be done with model-checking technology. Let us consider each type of inconsistency defined above. For concept inconsistency, when a new concept $c$ is to be added to a W-schema $G_S$ that has already been built and is being extended, the task is to find out whether $G_S$ contains another concept $c'$ equivalent to $c$. That is, a W-query $Q_c$ should be imposed on $G_S$ for consistency checking. Let $K_{G_S}$ be the Kripke structure of $G_S$ and let $\varphi_{Q_c}$ be the CTL formula of $Q_c$; we should find all states $s$ of $K_{G_S}$ such that $s \models \varphi_{Q_c}$.

As far as relationship inconsistency is concerned, it is to query a W-graph with a W-query before a new relationship is added into the W-graph model. Let be the Kripke structure of , and the CTL formula is The consistency checking is to find all states of such that . If the states like this cannot be found, then the model is consistent after adding the new relationship ; if any state has been found, then the model will be inconsistent and the new relationship cannot be added into the model. This is the relationship redundant inconsistency. As for paradox relationship paradox inconsistency, the CTL formula is

As for fact consistency checking, the Kripke structure of the W-instance is , and the CTL formula for the attribute fact (the instance of a concept has the concrete attribute value , i.e., ) inconsistency is The CTL formula for the concept fact (the instance of one concept to be added into the model) inconsistency is where are all successors of . If some states of satisfy the above formula, then the inconsistency occurs. This is to say, a fact cannot be added into the model if an equivalent fact has already existed in the model.

5. Case Study

In this section, we test the semantic consistency checking technique presented above by using the model checker NuSMV 2.5.4 [20]. NuSMV allows for the representation of finite-state machines (FSMs) and for the analysis of specifications expressed in computation tree logic (CTL) using symbolic model-checking techniques. To use the tool, we first rewrite the Kripke structure into a finite-state machine in which edges are not labeled, as follows: given a Kripke structure related to a W-instance, replace every labeled edge $\langle m, a, n \rangle$ by the two edges $\langle m, \mu \rangle$ and $\langle \mu, n \rangle$, where $\mu$ is a new node labeled by the action $a$.
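The rewriting step can be sketched as follows. The function name and the naming scheme for the intermediate nodes are assumptions made for illustration; one fresh node is created per labeled edge, and the action is recorded as the label of that node.

```python
def split_labeled_edges(transitions):
    """Replace each labeled edge (m, a, n) by unlabeled edges (m, mu) and (mu, n)."""
    edges = []
    node_labels = {}
    for index, (m, action, n) in enumerate(sorted(transitions)):
        mu = f"{action}_{index}"       # fresh intermediate node for this edge
        node_labels[mu] = action       # the new node is labeled by the action
        edges.append((m, mu))
        edges.append((mu, n))
    return edges, node_labels

# Hypothetical fragment of a Kripke structure with two labeled transitions.
TRANSITIONS = {("n1", "age", "n5"), ("n1", "teaches", "n3")}
fsm_edges, fsm_labels = split_labeled_edges(TRANSITIONS)
```

Each labeled transition thus becomes a two-step path through a state that carries the action name, which is how actions can be tested as ordinary state properties in an FSM-based checker.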

The input of the NuSMV tool is an SMV program, which can express both FSMs and CTL formulae. Algorithm 2 is the SMV program that describes the Kripke structure shown in Figure 7 together with some of the CTL specifications for consistency checking mentioned above; lines starting with the symbol “--” are comments.

-- SMV program for consistency checking
MODULE main
VAR
  state : {n1, n2, n3, n4, n5, n6, n7, n8, teaches, inv_teaches, attends,
           inv_attends, age, inv_age, name, inv_name, cName, inv_cName, tao};
  label : {Teacher, Course, Student, Database, Smith, 37, 40};
ASSIGN
  init(state) := n1;
  next(state) := case
    state = n1 | state = n2 : {teaches, age};
    state = teaches : n3;
    state = age : {n5, n6};
    state = n3 : {inv_attends, inv_teaches, cName};
    state = inv_attends : n4;
    state = inv_teaches : {n1, n2};
    state = cName : n7;
    state = n4 : {attends, name};
    state = attends : n3;
    state = name : n8;
    state = n5 : {inv_age, tao};
    state = inv_age : {n1, n2};
    state = tao : {n5, n6, n7, n8};
    state = n6 : {inv_age, tao};
    state = n7 : {inv_cName, tao};
    state = inv_cName : n3;
    state = n8 : {inv_name, tao};
    state = inv_name : n4;
    TRUE : state;
  esac;
DEFINE
  nl := case
    state = n1 | state = n2 : Teacher;
    state = n3 : Course;
    state = n4 : Student;
    state = n5 : 37;
    state = n6 : 40;
    state = n7 : Database;
    state = n8 : Smith;
    TRUE : state;
  esac;
-- concept redundancy checking
CTLSPEC (nl = Teacher) & EX(state = age) &
        EX(state = teaches & EX(nl = Course))
-- relationship inconsistency checking
-- need to be changed to init(state) := n4;
CTLSPEC (nl = Student) & EX(state = attends & EX(nl = Course))
-- attribute inconsistency checking
CTLSPEC (nl = Teacher) & EX(state = age & EX(nl = 40))
-- fact inconsistency checking
CTLSPEC (nl = Teacher) & EX(state = age & EX(nl = 37)) &
        EX(state = teaches & EX(nl = Course &
        EX(state = cName & EX(nl = Database))))

In this SMV program, the set of states of the Kripke structure is chosen by declaring the variable state to assume the values n1 to n8 together with the action names (teaches, inv_teaches, attends, inv_attends, age, inv_age, name, inv_name, cName, inv_cName, and tao), so that actions are also treated as states. The transition relation of the Kripke structure is expressed by assigning (ASSIGN), for each value of the variable state, the list of nodes that can be reached from it through one edge. The variable label and the definition nl are introduced to give the node label of each state identified by the value of the variable state. The CTL formulae are defined by CTLSPEC. The formula (nl = Teacher) & EX(state = age) & EX(state = teaches & EX(nl = Course)) asks whether an inconsistency arises when a new concept Teacher, which has an age attribute and a teaches action and can teach some Course, is to be added to the model. The formula (nl = Student) & EX(state = attends & EX(nl = Course)) expresses whether a new relationship can be added to the model without semantic inconsistency. The formula (nl = Teacher) & EX(state = age & EX(nl = 40)) asks whether the attribute relationship “the Teacher is 40 years old” can be added to the model without any redundancy or paradox. Finally, the formula (nl = Teacher) & EX(state = age & EX(nl = 37)) & EX(state = teaches & EX(nl = Course & EX(state = cName & EX(nl = Database)))) expresses whether the fact “the 37-year-old Teacher teaches the Database Course” can be added to the model.

The results obtained on a 64-bit Windows 7 operating system are shown in Figure 8. The output true for a CTL formula means that at least one inconsistency exists when the W-graph model is checked against the new semantic segment (expressed by this formula), so the new segment cannot be added to the model. Otherwise, we choose another initial state and check again, until all elements of the set of initial states have been tried. If the final output is always false, then no inconsistency has occurred, and the new semantic segment can be added to the model.

This test confirms that the semantic consistency checking problem can be solved in polynomial time by using model-checking methods.

6. Conclusion

To validate semantic consistency during the incremental procedure of building a domain ontology from heterogeneous sources, we employ model-checking technology to avoid the subgraph isomorphism problem, which is NP-complete. In order to adopt the model-checking method, we formally transform the semantic model into a Kripke structure and the semantic equivalence querying problem into CTL formulae, so that semantic consistency checking is reduced to the global model-checking problem. An experiment with the model-checking tool NuSMV has also been presented. In the future, the reasoning problem should be considered explicitly; for example, some implicit semantic elements can be inferred from the existing model. If a new semantic segment is equivalent to such implicit semantic elements, an inconsistency also occurs. This type of consistency checking should be addressed in the near future.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This paper is supported by the Natural Science Foundation of Guangxi under Grants nos. 2011GXNS-FA018154 and 2012GXNS-FGA060003, the Science and Technology Foundation of Guangxi under Grant no. 10169-1, Guangxi Scientific Research Project no. 201012MS274, and the starting fund of GXUN under Grant no. 2011QD017. This paper is supported also by Grant (2012HCIC04) of Guangxi Key Laboratory of Hybrid Computation and IC Design Analysis Open Fund.