Abstract

More and more linked data (taken as knowledge) is automatically generated from unstructured data such as text and images via learning, and such data is often uncertain in practice. However, most existing approaches to processing linked data are designed for certain data, so a theoretical treatment of uncertain linked data is increasingly important. In this paper, we present a query language framework, called pSPARQL, for probabilistic RDF data (an important kind of uncertain linked data), where each triple has a probability. pSPARQL is built on SPARQL, the query language for RDF databases recommended by W3C. pSPARQL supports full SPARQL and satisfies some important properties such as well-definedness, uniqueness, and some equivalences. Finally, we illustrate that pSPARQL is feasible for expressing practical real-world queries.

1. Introduction

The Resource Description Framework (RDF) [1] is the standard data model of the Semantic Web. In practice, RDF data (as a knowledge base) may contain uncertain data because it is automatically extracted from diverse sources, as in YAGO [2]. For instance, some RDF data is generated from raw data via knowledge extraction and machine learning [3]. Indeed, uncertainty is generally a basic feature of data [4–6]. However, the RDF model itself provides little support for uncertain data [7]. SPARQL [8], the query language for RDF data officially recommended by W3C [9], is unable to process uncertain data [4].

There are many approaches to processing probabilistic RDF [10]. Reference [4] proposes a probabilistic model for SQL over relational data. Reference [5] presents a Bayesian network to represent probabilistic relations in RDF. Reference [11] develops a framework for evaluating SPARQL conjunctive queries (i.e., basic graph patterns, BGP) on probabilistic RDF databases. Reference [12] proposes answering SPARQL queries with RDFS reasoning on probabilistic models that encode statistical relationships among correlated triples, where the proposed probability models are based on either a probability distribution function or a disjunctive normal form probability problem. Reference [13] presents effective structural and probabilistic pruning mechanisms for answering SPARQL conjunctive queries (i.e., BGP) over probabilistic RDF data graphs. Reference [14] presents a RESCAL-based approach to query processing in relational data via factorization. Reference [14] presents a heuristic algorithm for answering SPARQL conjunctive queries (i.e., BGP) over incomplete and uncertain RDF. Reference [15] presents a framework for SPARQL query answering over probabilistic databases by extending the rich semantics offered by ontologies with probabilistic information. Reference [16] presents a probabilistic knowledge base system, ARCHIMEDESONE, for query answering with inference by scaling up the knowledge expansion and statistical inference algorithms. Reference [17] proposes a framework based on probabilistic automata for efficient query evaluation in the presence of uncertainty.

Although these approaches can query probabilistic RDF, most of them handle only SPARQL conjunctive queries, that is, BGP queries. The existing probability models offer little support for more expressive operators such as OPTIONAL, which is the least conventional operator of SPARQL [18], and DIFF, a difference operator in SPARQL 1.1 [19] that brings more expressivity [20]. For instance, neither [13] nor [16] discusses OPTIONAL queries for RDF.

In this paper, we present an extended query language (called pSPARQL: probabilistic SPARQL) for probabilistic RDF databases that supports the full SPARQL fragment. We show that the semantics of pSPARQL satisfies some important properties such as well-definedness, uniqueness, and some equivalences. Compared with our earlier poster at ISWC 2016 [21], this paper presents a completely new probabilistic representation model and proves that the new model preserves important properties such as uniqueness and the distributive law of equivalence.

The remainder of this paper is structured as follows: the next section recalls RDF and SPARQL. Section 3 introduces the syntax and semantics of pSPARQL, Section 4 discusses some important properties, and Section 5 illustrates pSPARQL with a practical example. Finally, we summarize our work in the last section.

2. RDF and SPARQL

In this section, we briefly recall the syntax and semantics of SPARQL. For further details, please refer to the core SPARQL formalization in [22].

2.1. RDF Graphs

Let $I$ and $L$ be infinite sets of IRIs and literals, respectively, with $I \cap L = \emptyset$. Let $U = I \cup L$. A triple $(s, p, o) \in U \times U \times U$ is called an RDF triple. An RDF graph is a finite set of RDF triples.

2.2. Syntax of SPARQL

Let $V$ be a set of variables. SPARQL patterns are inductively defined as follows:
(i) Any triple from $(U \cup V) \times (U \cup V) \times (U \cup V)$ is a pattern (called a triple pattern).
(ii) If $P_1$ and $P_2$ are patterns, then so are the following: $P_1$ UNION $P_2$, $P_1$ AND $P_2$, $P_1$ DIFF $P_2$, and $P_1$ OPT $P_2$.
(iii) If $P$ is a pattern and $C$ is a constraint (defined next), then $P$ FILTER $C$ is a pattern; we call $C$ the filter, which is a Boolean combination of atomic constraints. An atomic constraint has one of the three following forms: $\mathrm{bound}(?x)$ (bound), $?x = ?y$ (equality), and $?x = c$ (constant equality), for $?x, ?y \in V$ and $c \in U$.

2.3. Semantics of SPARQL

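Now, given a graph $G$ and a pattern $P$, we define the semantics of $P$ on $G$, denoted by $[\![P]\!]_G$, as a set of mappings (i.e., partial functions from the variables $V$ to the terms $U$), in the following manner, where we use $\mathrm{dom}(\mu)$ to denote the domain of a mapping $\mu$:
(i) $[\![(u, v, w)]\!]_G = \{\mu : \{u, v, w\} \cap V \to U \mid (\mu(u), \mu(v), \mu(w)) \in G\}$, where $\mu(c) = c$ for any $c \in U$.
(ii) $[\![P_1 \text{ UNION } P_2]\!]_G = [\![P_1]\!]_G \cup [\![P_2]\!]_G$.
(iii) $[\![P_1 \text{ AND } P_2]\!]_G = \{\mu_1 \cup \mu_2 \mid \mu_1 \in [\![P_1]\!]_G,\ \mu_2 \in [\![P_2]\!]_G,\ \mu_1 \sim \mu_2\}$. Here, two mappings $\mu_1$ and $\mu_2$ are called compatible, denoted by $\mu_1 \sim \mu_2$, if $\mu_1(?x) = \mu_2(?x)$ for any $?x \in \mathrm{dom}(\mu_1) \cap \mathrm{dom}(\mu_2)$.
(iv) $[\![P_1 \text{ DIFF } P_2]\!]_G = \{\mu_1 \in [\![P_1]\!]_G \mid \neg\exists \mu_2 \in [\![P_2]\!]_G : \mu_1 \sim \mu_2\}$.
(v) $[\![P_1 \text{ OPT } P_2]\!]_G = [\![(P_1 \text{ AND } P_2) \text{ UNION } (P_1 \text{ DIFF } P_2)]\!]_G$.
(vi) $[\![P \text{ FILTER } C]\!]_G = \{\mu \in [\![P]\!]_G \mid \mu(C) = \mathit{true}\}$.

Here, for any mapping $\mu$ and filter $C$, the evaluation of $C$ on $\mu$, denoted by $\mu(C)$, is defined in terms of a three-valued logic with truth values true, false, and error. Recall that $C$ is a Boolean combination of atomic constraints.

For a bound constraint $\mathrm{bound}(?x)$, we define
$$\mu(\mathrm{bound}(?x)) = \begin{cases} \mathit{true} & \text{if } ?x \in \mathrm{dom}(\mu), \\ \mathit{false} & \text{otherwise.} \end{cases}$$

For an equality constraint $?x = ?y$, we define
$$\mu(?x = ?y) = \begin{cases} \mathit{true} & \text{if } ?x, ?y \in \mathrm{dom}(\mu) \text{ and } \mu(?x) = \mu(?y), \\ \mathit{false} & \text{if } ?x, ?y \in \mathrm{dom}(\mu) \text{ and } \mu(?x) \neq \mu(?y), \\ \mathit{error} & \text{otherwise.} \end{cases}$$

Thus, when $?x$ and $?y$ do not both belong to $\mathrm{dom}(\mu)$, the equality constraint evaluates to error. Similarly, for a constant-equality constraint $?x = c$, we define
$$\mu(?x = c) = \begin{cases} \mathit{true} & \text{if } ?x \in \mathrm{dom}(\mu) \text{ and } \mu(?x) = c, \\ \mathit{false} & \text{if } ?x \in \mathrm{dom}(\mu) \text{ and } \mu(?x) \neq c, \\ \mathit{error} & \text{otherwise.} \end{cases}$$

A Boolean combination is evaluated using the truth tables given in Table 1.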
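The set-based operations recalled in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: mappings are modelled as frozensets of (variable, value) pairs, any term starting with "?" is treated as a variable, DIFF is assumed to keep the mappings that have no compatible partner, OPT is assumed to be the union of the AND and DIFF results, and the graph and all function names are hypothetical.

```python
from itertools import product

ERROR = "error"  # third truth value of the filter logic


def mapping(**bindings):
    """Build a hashable mapping, e.g. mapping(x="Ann") binds ?x to "Ann"."""
    return frozenset(("?" + name, value) for name, value in bindings.items())


def compatible(m1, m2):
    """True if m1 and m2 agree on every variable they share."""
    d1, d2 = dict(m1), dict(m2)
    return all(d1[x] == d2[x] for x in d1.keys() & d2.keys())


def eval_triple(pattern, graph):
    """All mappings over the variables of `pattern` that send it into `graph`."""
    result = set()
    for triple in graph:
        mu = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if mu.setdefault(term, value) != value:
                    break  # same variable would need two different values
            elif term != value:
                break  # a constant of the pattern does not match
        else:
            result.add(frozenset(mu.items()))
    return result


def join(o1, o2):
    """AND: unions of all compatible pairs of mappings."""
    return {m1 | m2 for m1, m2 in product(o1, o2) if compatible(m1, m2)}


def diff(o1, o2):
    """DIFF (assumed form): mappings of o1 with no compatible mapping in o2."""
    return {m1 for m1 in o1 if not any(compatible(m1, m2) for m2 in o2)}


def opt(o1, o2):
    """OPT (assumed form): union of the AND result and the DIFF result."""
    return join(o1, o2) | diff(o1, o2)


def eval_equality(mu, x, y):
    """Equality constraint ?x = ?y; ERROR unless both variables are bound."""
    d = dict(mu)
    if x not in d or y not in d:
        return ERROR
    return d[x] == d[y]


def filter_mappings(omega, constraint):
    """FILTER: keep only mappings on which the constraint is exactly true."""
    return {mu for mu in omega if constraint(mu) is True}


# Hypothetical graph, used only for illustration.
graph = {
    ("Ann", "SufferFrom", "Flu"),
    ("Ann", "TreatedBy", "DrLee"),
    ("Bob", "SufferFrom", "Cold"),
}
suffers = eval_triple(("?x", "SufferFrom", "?y"), graph)
treated = eval_triple(("?x", "TreatedBy", "?z"), graph)
print(join(suffers, treated))
print(opt(suffers, treated))
```

In this toy graph, the AND result contains only the mapping for Ann, while OPT additionally keeps the mapping for Bob, who has no treatment triple. Representing error as a distinct value makes it explicit that FILTER discards both false and error outcomes.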

3. Probabilistic RDF and pSPARQL

In this section, we present probabilistic RDF and introduce the syntax and semantics of pSPARQL.

3.1. Probabilistic RDF

A probabilistic RDF is a pair where is an RDF graph and is a total function from . Intuitively speaking, is a probability function mapping each triple to a probability.

For instance, let be a probabilistic RDF with and is a function from defined in Table 2.

Note that we assign a probability to each triple so that triples can be treated as atoms, analogously to [13, 16]. This treatment does not directly characterize the probability of individual subjects or objects in triples.
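One simple way to picture a probabilistic RDF is as a graph together with a dictionary that gives every triple a probability. The following Python sketch is only an illustration: the helper name `make_probabilistic_rdf`, the sample triples, and the assumption that probabilities lie in the closed interval [0, 1] are ours and are not taken from Table 2.

```python
from typing import Dict, FrozenSet, Tuple

Triple = Tuple[str, str, str]


def make_probabilistic_rdf(
    weighted: Dict[Triple, float],
) -> Tuple[FrozenSet[Triple], Dict[Triple, float]]:
    """Return (graph, probability function) after basic validation.

    The probability function is total on the graph by construction,
    because the graph is exactly the set of keys of `weighted`.
    """
    for triple, p in weighted.items():
        if len(triple) != 3:
            raise ValueError(f"not a triple: {triple!r}")
        # Assumption: probabilities are restricted to [0, 1].
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range for {triple!r}: {p}")
    return frozenset(weighted), dict(weighted)


# Hypothetical data, used only for illustration.
graph, prob = make_probabilistic_rdf({
    ("Ann", "SufferFrom", "Flu"): 0.8,
    ("Bob", "SufferFrom", "Cold"): 0.4,
    ("Ann", "TreatedBy", "DrLee"): 0.5,
})
print(len(graph), prob[("Ann", "SufferFrom", "Flu")])
```

Deriving the graph from the dictionary keys keeps the two components of the pair consistent, so no triple can be left without a probability.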

3.2. pSPARQL: A Probabilistic SPARQL

In this section, we introduce a probabilistic SPARQL (for short, pSPARQL).

The Syntax of pSPARQL. The syntax of pSPARQL differs from the syntax of SPARQL [22] only in filters, where we introduce a new fixed variable ?p to express constraints on probability.

A probabilistic atomic filter is one of the four following forms: and , where . The filter of pSPARQL is a Boolean combination of atomic filters and probabilistic atomic filters.

All patterns are called probabilistic patterns in pSPARQL.

The Semantics of pSPARQL. The semantics of probabilistic patterns are defined in terms of sets of pairs of the form (called a solution (with probability), denoted by ), where is a solution of probabilistic patterns and . Note that we only consider pairs of form where .

Now, given a probabilistic RDF graph and a probabilistic pattern , we define the semantics of on , denoted by , as a set of solutions with probability, in the following manner:

By default, we set .
(i) For a nonprobabilistic filter , .
(ii) For a Boolean combination , .
(iii) For a probabilistic filter , we define
(iv) For a probabilistic filter , we define
(v) For a probabilistic filter , we define
(vi) For a probabilistic filter , we define

Example 1. Given a pattern FILTER? (i.e., we query those persons who have suffered from some illness with probability over 0.5), we can compute that where and . However, let , since .

Example 2. Given a pattern AND (i.e., we query those who have suffered from some illness and have been treated), we can compute that , where(i);(ii).

Example 3. Given a pattern UNION , (i.e., we query those who have suffered from schizophrenia or those who are treated by psychiatrists); we can compute that .

Note that ?p differs slightly from ordinary variables: the value of ?p is determined by probability computation, whereas the values of the other variables are fixed. Moreover, we disallow the comparison of probabilities in filters.
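The idea of solutions with probability can be sketched in Python as a dictionary from mappings to probabilities, which also reflects that each mapping carries a single probability. This is a hedged illustration rather than the semantics defined in this paper: the rule that AND multiplies the probabilities of the joined solutions is an independence assumption of ours (exposed as the replaceable `combine` parameter), the four comparison operators are assumed forms of the probabilistic filter, and all names and data are hypothetical.

```python
import operator

# Assumed comparison forms for the probabilistic filter on ?p.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def compatible(m1, m2):
    """True if two mappings agree on every shared variable."""
    d1, d2 = dict(m1), dict(m2)
    return all(d1[x] == d2[x] for x in d1.keys() & d2.keys())


def eval_triple_p(pattern, prob):
    """Solutions of a triple pattern: each mapping inherits the probability
    of the single triple it matches."""
    out = {}
    for triple, p in prob.items():
        mu = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if mu.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            out[frozenset(mu.items())] = p
    return out


def join_p(s1, s2, combine=operator.mul):
    """AND on solutions with probability.

    Assumption: probabilities are combined by `combine` (product by default).
    If two different pairs produce the same joined mapping, the later pair
    overwrites the earlier one in this sketch.
    """
    out = {}
    for m1, p1 in s1.items():
        for m2, p2 in s2.items():
            if compatible(m1, m2):
                out[m1 | m2] = combine(p1, p2)
    return out


def filter_p(solutions, op, c):
    """Probabilistic filter '?p op c': keep solutions whose probability passes."""
    return {mu: p for mu, p in solutions.items() if OPS[op](p, c)}


# Hypothetical probabilistic RDF, used only for illustration.
prob = {
    ("Ann", "SufferFrom", "Flu"): 0.8,
    ("Bob", "SufferFrom", "Cold"): 0.4,
    ("Ann", "TreatedBy", "DrLee"): 0.5,
}
suffers = eval_triple_p(("?x", "SufferFrom", "?y"), prob)
likely = filter_p(suffers, ">", 0.5)
joined = join_p(suffers, eval_triple_p(("?x", "TreatedBy", "?z"), prob))
print(likely)
print(joined)
```

Keeping the combination rule as a parameter makes it easy to substitute whichever rule the formal semantics prescribes without touching the rest of the sketch.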

4. Well-Definedness, Uniqueness, and Equivalence of pSPARQL

In this section, we discuss some important properties of pSPARQL.

Firstly, we introduce a property called well-definedness, which can ensure that the semantics of pSPARQL are well defined.

Proposition 4 (well-definedness). For any pSPARQL pattern , for any probabilistic RDF , for any solution, we have .

Proof. By induction on the structure of :
If is a triple pattern (), then .
If is of the form , then let us discuss the three cases:
(i) If but , then .
(ii) If but , then .
(iii) If and , then by induction.
If is of the form AND , then this claim holds by induction, since there exists some solution and some solution with such that
If is of the form or , then this claim holds by induction, since
Finally, if is of the form , then this claim holds by the cases of , , and by induction.

Proposition 5 (uniqueness). For any pSPARQL pattern , for any probabilistic RDF , for any two solutions , if , then .

Proof. By induction on the structure of , we have the following.
If is a triple pattern (), then this claim directly holds by definition, since .
If is of the form UNION , then let us discuss the three cases:
(i) If but , then this claim holds by induction.
(ii) If but , then this claim holds by induction.
(iii) If there exist and , then this claim holds by induction, since .
If is of the form , then this claim holds by induction, since there exists some solution and some solution with such that and . Therefore, is unique.
If is of the form or , then this claim holds by induction, since

Finally, we discuss the equivalence of patterns in pSPARQL. Let and be two patterns in pSPARQL. We say that is equivalent to , denoted by , if for any probabilistic RDF .

Next, we show that pSPARQL satisfies the distributive law of equivalence, which is proven to be important in SPARQL.

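Proposition 6 (distributive law). Let $P_1$, $P_2$, and $P_3$ be three patterns in pSPARQL and let $C$ be a filter. The following holds:
(1) $(P_1 \text{ UNION } P_2) \text{ FILTER } C \equiv (P_1 \text{ FILTER } C) \text{ UNION } (P_2 \text{ FILTER } C)$;
(2) $P_1 \text{ AND } (P_2 \text{ UNION } P_3) \equiv (P_1 \text{ AND } P_2) \text{ UNION } (P_1 \text{ AND } P_3)$;
(3) $(P_1 \text{ UNION } P_2) \text{ AND } P_3 \equiv (P_1 \text{ AND } P_3) \text{ UNION } (P_2 \text{ AND } P_3)$;
(4) $(P_1 \text{ UNION } P_2) \text{ DIFF } P_3 \equiv (P_1 \text{ DIFF } P_3) \text{ UNION } (P_2 \text{ DIFF } P_3)$;
(5) $(P_1 \text{ UNION } P_2) \text{ OPT } P_3 \equiv (P_1 \text{ OPT } P_3) \text{ UNION } (P_2 \text{ OPT } P_3)$.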

Proof (sketch). The first claim directly holds by the definition.
Now, we show the second item. Let be a probabilistic RDF of the form . If , then there must exist some solution . By Proposition 5, we can conclude that . Then .
On the other hand, ; then there must exist some solution . By Proposition 5, we can conclude that . Then .
Analogously, we can prove the third item and the fourth item.
Finally, we could prove the fifth item by using the third item and the fourth item.
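As a sanity check on the shape of such laws, the Python sketch below exhaustively tests distributivity of AND, DIFF, and OPT over UNION on a small universe of mappings. It is an illustration under explicit assumptions: it covers only the mapping part of solutions and ignores probabilities, it assumes DIFF keeps mappings without a compatible partner and OPT is the union of the AND and DIFF results, and the universe `MAPPINGS` is hypothetical. A finite check of this kind is evidence, not a proof.

```python
from itertools import combinations, product


def compatible(m1, m2):
    """True if two mappings agree on every shared variable."""
    d1, d2 = dict(m1), dict(m2)
    return all(d1[x] == d2[x] for x in d1.keys() & d2.keys())


def join(o1, o2):
    return {m1 | m2 for m1, m2 in product(o1, o2) if compatible(m1, m2)}


def diff(o1, o2):
    return {m1 for m1 in o1 if not any(compatible(m1, m2) for m2 in o2)}


def opt(o1, o2):
    return join(o1, o2) | diff(o1, o2)


# Small hypothetical universe of mappings over ?x and ?y.
MAPPINGS = [
    frozenset({("?x", "a")}),
    frozenset({("?x", "b")}),
    frozenset({("?y", "a")}),
    frozenset({("?x", "a"), ("?y", "a")}),
]


def subsets(items):
    """Yield every subset of `items` as a set."""
    for size in range(len(items) + 1):
        for chosen in combinations(items, size):
            yield set(chosen)


def check_distributive_laws():
    """Check the laws for every triple of subsets of MAPPINGS."""
    all_sets = list(subsets(MAPPINGS))
    for o1, o2, o3 in product(all_sets, repeat=3):
        if join(o1, o2 | o3) != join(o1, o2) | join(o1, o3):
            return False
        if join(o1 | o2, o3) != join(o1, o3) | join(o2, o3):
            return False
        if diff(o1 | o2, o3) != diff(o1, o3) | diff(o2, o3):
            return False
        if opt(o1 | o2, o3) != opt(o1, o3) | opt(o2, o3):
            return False
    return True


print(check_distributive_laws())
```

The OPT case follows from the AND and DIFF cases with the union on the left, which mirrors how the fifth item is derived from the third and fourth items.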

5. A Practical Example

In this section, we illustrate the application of pSPARQL with a practical example over the probabilistic RDF introduced in [11] and shown in Figure 1.

Consider the following four queries (Q1, Q2, Q3, Q4) in pSPARQL.

(1) Q1: What causes fatigue associated with some illness over 0.65 probability?

Q1 is formally expressed in pSPARQL as follows:

SELECT ?x ((Fatigue, CauseOf, ?x) AND ((?x, AssociatedWith, ?z) FILTER ?p > 0.65)).

The solution of Q1 is as follows:Note that = ,. Thus . Then , since .

(2) Q2: What is associated with cough with probability over 0.7?

Q2 is formally expressed in pSPARQL as follows:

SELECT ?x ((?x, AssociatedWith, ?y) FILTER ?y = ‘Cough’ ∧ ?p > 0.7).

The solution of Q2 is as follows:

Note that is shown in Table 3.

Thus FILTER? y = = , , .

Then .

(3) Q3: What is the probability that bronchitis is associated with cough, directly or indirectly?

Q3 is formally expressed as follows:

SELECT ?x ((?x, AssociatedWith, ?y) UNION ((?x, AssociatedWith, ?z) AND (?z, AssociatedWith, ?y)) FILTER ?x = ‘Bronchitis’ ∧ ?y = ‘Cough’).

The solution of Q3 is as follows:Note that is shown in Table 4.

Note that AND is shown in Table 5.

Thus UNION ((? AND is shown in Table 6.

Then

(4) Q4: What is associated with cough, excluding bronchitis?

Q4 is expressed in pSPARQL as follows:

SELECT ?x (((?x, AssociatedWith, ?y) FILTER ?y = ‘Cough’) DIFF ((?x, AssociatedWith, ?y) FILTER ?x = ‘Bronchitis’ ∧ ?y = ‘Cough’)).

The solution of Q4 is as follows:Note that FILTER? is shown in Table 7.

Note that FILTER? is shown in Table 8.

Then .

In short, pSPARQL can express many interesting queries over probabilistic RDF that are useful in practice. Whereas SPARQL querying yields only whether a connection exists, pSPARQL lets us quantify the connection and thus obtain more specific solutions.

6. Conclusions

In this paper, we extended SPARQL to support querying over probabilistic RDF. In the future, we will study further foundational properties of pSPARQL and implement it in a prototype that provides full SPARQL query answering services for probabilistic RDF. As future work, we are also interested in presenting the probabilistic semantics of RDF graphs in a unified framework that could support many applications.

Data Availability

No data were used to support this study.

Disclosure

An earlier version of this work was presented at “International Conference on Big Scientific Data Management 2018.”

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the program of the key discipline “Applied Mathematics” of Shanghai Polytechnic University (XXKPY1604).