pSPARQL: A Querying Language for Probabilistic RDF Data

Fang, Hong

doi:https://doi.org/10.1155/2019/8258197

Complexity

Special Issue

Analysis and Applications of Location-Aware Big Complex Network Data

Research Article | Open Access

Volume 2019 | Article ID 8258197 | https://doi.org/10.1155/2019/8258197

Hong Fang

Guest Editor: Jianxin Li

Received 19 Dec 2018

Accepted 19 Feb 2019

Published 26 Mar 2019

Abstract

More and more linked data (taken as knowledge) can be generated automatically from unstructured data such as text and images via learning, and such data are often uncertain in practice. On the other hand, most existing approaches to processing linked data are designed for certain data. It is therefore increasingly important to study, from a theoretical perspective, how to process uncertain linked data. In this paper, we present a querying language framework, called pSPARQL, for probabilistic RDF data (an important kind of uncertain linked data), where each triple has a probability. pSPARQL is built on SPARQL, which is recommended by W3C as a querying language for RDF databases. pSPARQL supports the full SPARQL and satisfies some important properties such as well-definedness, uniqueness, and some equivalences. Finally, we illustrate that pSPARQL is feasible for expressing practical real-world queries.

1. Introduction

Resource Description Framework (RDF) [1] is the standard data model of the Semantic Web. In the real world, RDF data (as a knowledge base) may contain uncertain data because of the diversity of data sources: RDF data are often automatically extracted from different sources, as in YAGO [2]. For instance, some RDF data are generated from raw data via knowledge extraction and machine learning [3]. Indeed, uncertainty is generally a basic feature of data [4–6]. However, the RDF model itself provides little support for uncertain data [7]. SPARQL [8], the querying language for RDF data officially recommended by W3C [9], is unable to process uncertain data [4].

There are many approaches to processing probabilistic RDF [10]. Reference [4] proposes a probabilistic model for SQL over relational data. Reference [5] presents a Bayesian network to represent probabilistic relations in RDF. Reference [11] develops a framework for evaluating SPARQL conjunctive queries (i.e., basic graph patterns, BGP) on probabilistic RDF databases. Reference [12] proposes answering SPARQL queries with RDFS reasoning on probabilistic models that encode statistical relationships among correlated triples, where the proposed probability models are based on either a probability distribution function or a disjunctive normal form probability problem. Reference [13] presents effective pruning mechanisms, both structural and probabilistic, for answering SPARQL conjunctive queries (i.e., BGP) over probabilistic RDF data graphs. Reference [14] presents a RESCAL-based approach to query processing in relational data via factorization. Reference [14] presents a heuristic algorithm for answering SPARQL conjunctive queries (i.e., BGP) over incomplete and uncertain RDF. Reference [15] presents a framework for SPARQL query answering over probabilistic databases by extending the rich semantics offered by ontologies with probabilistic information. Reference [16] presents a probabilistic knowledge base system, ArchimedesOne, for query answering with inference by scaling up the knowledge expansion and statistical inference algorithms. Reference [17] proposes a framework based on probabilistic automata for efficient query evaluation in the presence of uncertainty.

Although those approaches can query probabilistic RDF, most of them mainly process SPARQL conjunctive queries, that is, BGP queries. The existing probability models have little support for more expressive operators such as OPTIONAL, which is the least conventional operator of SPARQL [18], and DIFF, a difference operator in SPARQL 1.1 [19] that brings more expressivity [20]. For instance, neither [13] nor [16] discusses OPTIONAL queries for RDF.

In this paper, we present an extended querying language (called pSPARQL: probabilistic SPARQL) for probabilistic RDF databases that supports the full SPARQL fragment. We show that the semantics of pSPARQL satisfies some important properties such as well-definedness, uniqueness, and some equivalences. Compared with the earlier poster at ISWC 2016 [21], this paper presents a totally new probabilistic representation model and proves that the new model preserves important properties such as uniqueness and the distributive law of equivalence.

The remainder of this paper is structured as follows: the next section recalls RDF and SPARQL. Section 3 introduces the syntax and semantics of pSPARQL, and Section 4 discusses some important properties. Section 5 illustrates pSPARQL with a practical example. Finally, we summarize our work in the last section.

2. RDF and SPARQL

In this section, we briefly recall the syntax and semantics of SPARQL. For further details, please refer to the core SPARQL formalization in [22].

2.1. RDF Graphs

Let $I$ and $L$ be infinite sets of IRIs and literals, respectively, with $I \cap L = \emptyset$. Let $U = I \cup L$. A triple $(s, p, o) \in I \times I \times U$ is called an RDF triple. An RDF graph is a finite set of RDF triples.

2.2. Syntax of SPARQL

Let $V$ be a set of variables. SPARQL patterns are inductively defined as follows:
(i) Any triple from $(I \cup V) \times (I \cup V) \times (U \cup V)$ is a pattern (called a triple pattern).
(ii) If $P_1$ and $P_2$ are patterns, then so are the following: $P_1$ UNION $P_2$, $P_1$ AND $P_2$, $P_1$ DIFF $P_2$, and $P_1$ OPT $P_2$.
(iii) If $P$ is a pattern and $C$ is a constraint (defined next), then $P$ FILTER $C$ is a pattern; we call $C$ the filter, which is a Boolean combination of atomic constraints, each of one of the three following forms: $\mathrm{bound}(?x)$ (bound), $?x = ?y$ (equality), and $?x = c$ (constant equality), for $?x, ?y \in V$ and $c \in U$.
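The inductive definition above can be mirrored by a small abstract syntax tree. The following is a minimal sketch, not part of the original formalization: the class names `TriplePattern`, `Binary`, and `Filter`, the tuple encoding of constraints, and the helper `variables` are all hypothetical, and variables are assumed to be strings starting with `?`.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class TriplePattern:
    """Case (i): a triple whose positions hold constants or variables."""
    s: str
    p: str
    o: str

@dataclass(frozen=True)
class Binary:
    """Case (ii): P1 UNION P2, P1 AND P2, P1 DIFF P2, or P1 OPT P2."""
    op: str
    left: "Pattern"
    right: "Pattern"

    def __post_init__(self):
        if self.op not in {"UNION", "AND", "DIFF", "OPT"}:
            raise ValueError(f"unknown operator: {self.op}")

@dataclass(frozen=True)
class Filter:
    """Case (iii): P FILTER C, with C encoded as a tuple (hypothetical encoding),
    e.g. ("bound", "?x"), ("eq", "?x", "?y"), or ("const_eq", "?x", "Cough")."""
    pattern: "Pattern"
    constraint: tuple

Pattern = Union[TriplePattern, Binary, Filter]

def variables(p):
    """Collect the variables occurring in the triple patterns of p."""
    if isinstance(p, TriplePattern):
        return {t for t in (p.s, p.p, p.o) if t.startswith("?")}
    if isinstance(p, Binary):
        return variables(p.left) | variables(p.right)
    return variables(p.pattern)

# ((?x, AssociatedWith, ?y) AND (?y, AssociatedWith, ?z)) FILTER ?z = Cough
example = Filter(
    Binary("AND",
           TriplePattern("?x", "AssociatedWith", "?y"),
           TriplePattern("?y", "AssociatedWith", "?z")),
    ("const_eq", "?z", "Cough"),
)
print(sorted(variables(example)))
```

Keeping the four binary operators in one class reflects that they share the same shape in the grammar; only their semantics differ.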

2.3. Semantics of SPARQL

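Now, given a graph $G$ and a pattern $P$, we define the semantics of $P$ on $G$, denoted by $[\![P]\!]_G$, as a set of mappings (i.e., partial functions from $V$ to $U$), in the following manner, where we use $\mathrm{dom}(\mu)$ to denote the domain of a mapping $\mu$ and let $\mu(c) = c$ for any $c \in U$:
(i) $[\![(u, v, w)]\!]_G = \{\mu : \{u, v, w\} \cap V \to U \mid (\mu(u), \mu(v), \mu(w)) \in G\}$.
(ii) $[\![P_1 \text{ UNION } P_2]\!]_G = [\![P_1]\!]_G \cup [\![P_2]\!]_G$.
(iii) $[\![P_1 \text{ AND } P_2]\!]_G = \{\mu_1 \cup \mu_2 \mid \mu_1 \in [\![P_1]\!]_G,\ \mu_2 \in [\![P_2]\!]_G,\ \mu_1 \sim \mu_2\}$. Here, two mappings $\mu_1$ and $\mu_2$ are called compatible, denoted by $\mu_1 \sim \mu_2$, if $\mu_1(?x) = \mu_2(?x)$ for any $?x \in \mathrm{dom}(\mu_1) \cap \mathrm{dom}(\mu_2)$.
(iv) $[\![P_1 \text{ DIFF } P_2]\!]_G = \{\mu_1 \in [\![P_1]\!]_G \mid \neg\exists \mu_2 \in [\![P_2]\!]_G : \mu_1 \sim \mu_2\}$.
(v) $[\![P_1 \text{ OPT } P_2]\!]_G = [\![(P_1 \text{ AND } P_2) \text{ UNION } (P_1 \text{ DIFF } P_2)]\!]_G$.
(vi) $[\![P \text{ FILTER } C]\!]_G = \{\mu \in [\![P]\!]_G \mid \mu(C) = \text{true}\}$.

Here, for any mapping $\mu$ and filter $C$, the evaluation of $C$ on $\mu$, denoted by $\mu(C)$, is defined in terms of a three-valued logic with truth values true, false, and error. Recall that $C$ is a Boolean combination of atomic constraints. For a bound constraint $\mathrm{bound}(?x)$, we define

$$\mu(\mathrm{bound}(?x)) = \begin{cases} \text{true} & \text{if } ?x \in \mathrm{dom}(\mu), \\ \text{false} & \text{otherwise.} \end{cases}$$

For an equality constraint $?x = ?y$, we define

$$\mu(?x = ?y) = \begin{cases} \text{true} & \text{if } ?x, ?y \in \mathrm{dom}(\mu) \text{ and } \mu(?x) = \mu(?y), \\ \text{false} & \text{if } ?x, ?y \in \mathrm{dom}(\mu) \text{ and } \mu(?x) \neq \mu(?y), \\ \text{error} & \text{otherwise.} \end{cases}$$

Thus, when $?x$ and $?y$ do not both belong to $\mathrm{dom}(\mu)$, the equality constraint evaluates to error. Similarly, for a constant-equality constraint $?x = c$, we define

$$\mu(?x = c) = \begin{cases} \text{true} & \text{if } ?x \in \mathrm{dom}(\mu) \text{ and } \mu(?x) = c, \\ \text{false} & \text{if } ?x \in \mathrm{dom}(\mu) \text{ and } \mu(?x) \neq c, \\ \text{error} & \text{otherwise.} \end{cases}$$

A Boolean combination is evaluated using the truth tables given in Table 1.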
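To make the six evaluation rules concrete, here is a minimal sketch of an evaluator for the core (non-probabilistic) semantics recalled in this section, following the standard formalization in [22]. Several things are assumptions made only for illustration: patterns are encoded as nested tuples, mappings as frozensets of (variable, value) pairs, and the graph `GRAPH` is hypothetical data. Boolean combinations of constraints (Table 1) are left out, so only atomic constraints are handled.

```python
def is_var(term):
    return term.startswith("?")

def compatible(m1, m2):
    """Mappings (as dicts) are compatible if they agree on all shared variables."""
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())

def holds(constraint, mu):
    """Evaluate an atomic constraint on mapping mu: True, False, or "error"."""
    kind, x, *rest = constraint
    if kind == "bound":
        return x in mu
    if kind == "eq":          # ?x = ?y
        y = rest[0]
        return mu[x] == mu[y] if x in mu and y in mu else "error"
    if kind == "const_eq":    # ?x = c
        return mu[x] == rest[0] if x in mu else "error"
    raise ValueError(f"unknown constraint {kind!r}")

def solutions(pattern, graph):
    """Evaluate a pattern on a graph; returns a set of frozenset mappings."""
    op = pattern[0]
    if op == "TRIPLE":
        result = set()
        for triple in graph:
            mu = {}
            for term, value in zip(pattern[1], triple):
                if is_var(term):
                    if mu.setdefault(term, value) != value:
                        break     # variable already bound to another value
                elif term != value:
                    break         # constant does not match
            else:
                result.add(frozenset(mu.items()))
        return result
    if op == "FILTER":
        return {m for m in solutions(pattern[1], graph)
                if holds(pattern[2], dict(m)) is True}
    left = solutions(pattern[1], graph)
    right = solutions(pattern[2], graph)
    if op == "UNION":
        return left | right
    if op == "AND":
        return {m1 | m2 for m1 in left for m2 in right
                if compatible(dict(m1), dict(m2))}
    if op == "DIFF":
        return {m1 for m1 in left
                if not any(compatible(dict(m1), dict(m2)) for m2 in right)}
    if op == "OPT":           # (P1 AND P2) UNION (P1 DIFF P2)
        return (solutions(("AND", pattern[1], pattern[2]), graph)
                | solutions(("DIFF", pattern[1], pattern[2]), graph))
    raise ValueError(f"unknown operator {op!r}")

# Hypothetical graph, for illustration only.
GRAPH = {
    ("Bronchitis", "AssociatedWith", "Cough"),
    ("Flu", "AssociatedWith", "Cough"),
    ("Flu", "AssociatedWith", "Fever"),
}
cough = ("TRIPLE", ("?x", "AssociatedWith", "Cough"))
fever = ("TRIPLE", ("?x", "AssociatedWith", "Fever"))
print(solutions(("DIFF", cough, fever), GRAPH))
```

Only mappings whose filter value is exactly true survive FILTER, so both false and error discard a mapping, as rule (vi) requires.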

3. Probabilistic RDF and pSPARQL

In this section, we present probabilistic RDF and introduce the syntax and semantics of pSPARQL.

3.1. Probabilistic RDF

A probabilistic RDF is a pair $\mathcal{G} = (G, \rho)$, where $G$ is an RDF graph and $\rho$ is a total function from $G$ to $[0, 1]$. Intuitively speaking, $\rho$ is a probability function mapping each triple to a probability.

For instance, let be a probabilistic RDF with and is a function from defined in Table 2.

Note that we assign a probability to each triple, so that triples are treated as atoms in our scenario, analogously to the treatment in [13, 16]. This treatment does not directly characterize the probability of individual subjects or objects in triples.
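As a rough illustration of this data model, a probabilistic RDF can be stored as a dictionary from triples to probabilities, which makes the probability function total over the graph by construction. The sketch below assumes that probabilities lie in [0, 1]; the function name `make_probabilistic_rdf` and the sample triples and values are hypothetical and are not the contents of Table 2.

```python
def make_probabilistic_rdf(weighted_triples):
    """Build a triple -> probability map from (triple, probability) pairs.

    Assumes probabilities lie in [0, 1] and each triple has one probability.
    """
    prdf = {}
    for triple, prob in weighted_triples:
        if len(triple) != 3:
            raise ValueError(f"not a triple: {triple!r}")
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"probability out of range for {triple!r}: {prob}")
        if triple in prdf and prdf[triple] != prob:
            raise ValueError(f"conflicting probabilities for {triple!r}")
        prdf[triple] = prob
    return prdf

# Hypothetical data, for illustration only.
prdf = make_probabilistic_rdf([
    (("Alice", "SufferFrom", "Schizophrenia"), 0.8),
    (("Alice", "TreatedBy", "Psychiatrist"), 0.9),
    (("Bob", "SufferFrom", "Flu"), 0.4),
])
graph = set(prdf)   # the underlying RDF graph is the set of keys
print(len(graph), prdf[("Bob", "SufferFrom", "Flu")])
```

Because each triple is a dictionary key, the probability is attached to the triple as a whole rather than to its subject or object, which matches the atom-level treatment described above.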

3.2. pSPARQL: A Probabilistic SPARQL

In this section, we introduce a probabilistic SPARQL (for short, pSPARQL).

The Syntax of pSPARQL. The syntax of pSPARQL differs slightly from the syntax of SPARQL [22] in filters, where we newly introduce a fixed variable ?p to express constraints on probability.

A probabilistic atomic filter is one of the four following forms: and , where . The filter of pSPARQL is a Boolean combination of atomic filters and probabilistic atomic filters.

All patterns are called probabilistic patterns in pSPARQL.

The Semantics of pSPARQL. The semantics of probabilistic patterns are defined in terms of sets of pairs of the form (called a solution (with probability), denoted by ), where is a solution of probabilistic patterns and . Note that we only consider pairs of form where .

Now, given a probabilistic RDF graph and a probabilistic pattern , we define the semantics of on , denoted by , as a set of solutions with probability, in the following manner:By default, we set .(i)For a nonprobabilistic filter , .(ii)For a Boolean combination , .(iii)For a probabilistic filter , we define(iv)For a probabilistic filter , we define(v)For a probabilistic filter , we define(vi)For a probabilistic filter , we define

Example 1. Given a pattern FILTER? (i.e., we query those persons who have suffered from some illness with probability over 0.5), we can compute that where and . However, let , since .
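A query of the kind in Example 1 (a triple pattern followed by a threshold on ?p) can be sketched as follows. This is an illustration under stated assumptions rather than the paper's definition: a solution of a single triple pattern is assumed to carry the probability of the triple it matches, the four comparison operators in `COMPARATORS` are an assumed reading of the probabilistic atomic filters, and the data in `PRDF` are hypothetical.

```python
import operator

# Assumed comparison operators for the probability variable ?p.
COMPARATORS = {">": operator.gt, ">=": operator.ge,
               "<": operator.lt, "<=": operator.le}

# Hypothetical probabilistic RDF: triple -> probability.
PRDF = {
    ("Alice", "SufferFrom", "Schizophrenia"): 0.8,
    ("Bob", "SufferFrom", "Flu"): 0.4,
    ("Alice", "TreatedBy", "Psychiatrist"): 0.9,
}

def match_triple(pattern, prdf):
    """Return (mapping, probability) pairs for one triple pattern.

    Assumption: the probability of a solution is that of the matched triple.
    """
    found = []
    for triple, prob in prdf.items():
        mu = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if mu.setdefault(term, value) != value:
                    break     # variable already bound to another value
            elif term != value:
                break         # constant does not match
        else:
            found.append((mu, prob))
    return found

def filter_probability(pairs, comparator, threshold):
    """Keep the solutions whose probability satisfies ?p <comparator> threshold."""
    compare = COMPARATORS[comparator]
    return [(mu, p) for mu, p in pairs if compare(p, threshold)]

suffers = match_triple(("?x", "SufferFrom", "?y"), PRDF)
likely = filter_probability(suffers, ">", 0.5)
print(likely)
```

With these hypothetical values, the solution binding ?x to Bob is dropped because its probability 0.4 does not exceed the threshold 0.5.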

Example 2. Given a pattern AND (i.e., we query those who have suffered from some illness and have been treated), we can compute that , where(i);(ii).

Example 3. Given a pattern UNION , (i.e., we query those who have suffered from schizophrenia or those who are treated by psychiatrists); we can compute that .

Note that ?p differs slightly from ordinary variables: the value of ?p is determined by probability computation, while the values of other variables are fixed. Moreover, we disallow the comparison of probabilities in filters.

4. Well-Definedness, Uniqueness, and Equivalence of pSPARQL

In this section, we discuss some important properties of pSPARQL.

Firstly, we introduce a property called well-definedness, which can ensure that the semantics of pSPARQL are well defined.

Proposition 4 (well-definedness). For any pSPARQL pattern , for any probabilistic RDF , for any solution, we have .

Proof. By induction on the structure of , if is a triple pattern (), then ; if is of the form , then let us discuss the three cases:(i)if but , then ;(ii)if but , then ;(iii)if and , then by induction.If is of the form AND , then this claim holds by induction, since there exists some solution and some solution with such thatIf is of the form or , then this claim holds by induction, sinceFinally, if is of the form , then this claim holds by the cases of , , and by induction.

Proposition 5 (uniqueness). For any pSPARQL pattern , for any probabilistic RDF , for any two solutions , if , then .

Proof. By induction on the structure of , we have the following.
If is a triple pattern (), then this claim directly holds by definition, since .
If is of the form UNION , then let us discuss the three cases:(i)If but , then this claim holds by induction.(ii)If but , then this claim holds by induction.(iii)If there exist and , then this claim holds by induction, since .If is of the form , then this claim holds by induction, since there exists some solution and some solution with such that and . Therefore, is unique.
If is of the form or , then this claim holds by induction, since

Finally, we discuss the equivalence of patterns in pSPARQL. Let and be two patterns in pSPARQL. We say that is equivalent to , denoted by , if for any probabilistic RDF .

Next, we show that pSPARQL satisfies the distributive law of equivalence, which is proven to be important in SPARQL.

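Proposition 6 (distributive law). Let $P_1$, $P_2$, and $P_3$ be three patterns in pSPARQL and let $C$ be a filter. The following holds:
(1) $(P_1 \text{ UNION } P_2) \text{ FILTER } C \equiv (P_1 \text{ FILTER } C) \text{ UNION } (P_2 \text{ FILTER } C)$;
(2) $P_1 \text{ AND } (P_2 \text{ UNION } P_3) \equiv (P_1 \text{ AND } P_2) \text{ UNION } (P_1 \text{ AND } P_3)$;
(3) $(P_1 \text{ UNION } P_2) \text{ AND } P_3 \equiv (P_1 \text{ AND } P_3) \text{ UNION } (P_2 \text{ AND } P_3)$;
(4) $(P_1 \text{ UNION } P_2) \text{ DIFF } P_3 \equiv (P_1 \text{ DIFF } P_3) \text{ UNION } (P_2 \text{ DIFF } P_3)$;
(5) $(P_1 \text{ UNION } P_2) \text{ OPT } P_3 \equiv (P_1 \text{ OPT } P_3) \text{ UNION } (P_2 \text{ OPT } P_3)$.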

Proof (sketch). The first claim directly holds by the definition.
Now, we show the second item. Let be a probabilistic RDF of the form . If , then there must exist some solution . By Proposition 5, we can conclude that . Then .
On the other hand, ; then there must exist some solution . By Proposition 5, we can conclude that . Then .
Analogously, we can prove the third item and the fourth item.
Finally, we could prove the fifth item by using the third item and the fourth item.
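The distributive laws for AND, DIFF, and OPT over UNION can be sanity-checked on the mapping part of solutions with a few lines of code. The sketch below works directly on sets of mappings and deliberately ignores probabilities, so it illustrates only the non-probabilistic side of the proposition; the helper names and the sample solution sets `S1`, `S2`, `S3` are hypothetical, and variable names are written without the leading `?`.

```python
from itertools import product

def mapping(**bindings):
    """Build a hashable mapping from keyword bindings, e.g. mapping(x="a")."""
    return frozenset(bindings.items())

def compatible(m1, m2):
    """Mappings are compatible if they agree on every shared variable."""
    d1, d2 = dict(m1), dict(m2)
    return all(d1[v] == d2[v] for v in d1.keys() & d2.keys())

def join(a, b):    # AND
    return {m1 | m2 for m1, m2 in product(a, b) if compatible(m1, m2)}

def diff(a, b):    # DIFF
    return {m1 for m1 in a if not any(compatible(m1, m2) for m2 in b)}

def opt(a, b):     # OPT = (AND) UNION (DIFF)
    return join(a, b) | diff(a, b)

def check_laws(s1, s2, s3):
    """Check the AND, DIFF, and OPT distributive laws on three solution sets."""
    return {
        "AND over UNION (right operand)": join(s1, s2 | s3) == join(s1, s2) | join(s1, s3),
        "AND over UNION (left operand)": join(s1 | s2, s3) == join(s1, s3) | join(s2, s3),
        "DIFF over UNION": diff(s1 | s2, s3) == diff(s1, s3) | diff(s2, s3),
        "OPT over UNION": opt(s1 | s2, s3) == opt(s1, s3) | opt(s2, s3),
    }

# Hypothetical solution sets, for illustration only.
S1 = {mapping(x="a"), mapping(x="b", y="c")}
S2 = {mapping(x="b"), mapping(y="d")}
S3 = {mapping(x="a", z="e"), mapping(y="c", z="f")}

for law, ok in check_laws(S1, S2, S3).items():
    print(law, ok)
```

The OPT check passes as a consequence of the AND and DIFF checks, which mirrors how the fifth item of the proposition follows from the third and fourth.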

5. A Practical Example

In this section, we illustrate the application of pSPARQL with a practical example, using the probabilistic RDF introduced in [11] and shown in Figure 1.

Consider the following four queries (Q₁, Q₂, Q₃, Q₄) in pSPARQL.

(1) Q₁: What causes fatigue and is associated with some illness with probability over 0.65?

Q₁ is formally expressed in pSPARQL as follows:

SELECT ?x ((Fatigue, CauseOf, ?x) AND ((?x, AssociatedWith, ?z) FILTER ?p > 0.65)).

The solution of Q₁ is as follows:Note that = , . Thus . Then , since .

(2) Q₂: What is associated with cough with probability over 0.7?

Q₂ is formally expressed in pSPARQL as follows:

SELECT ?x ((?x, AssociatedWith, ?y) FILTER ?y = ‘Cough’ ∧ ?p > 0.7).

The solution of Q₂ is as follows:

Note that is shown in Table 3.

Thus FILTER? y = = , , .

Then .

(3) Q₃: What is the probability that bronchitis is associated with cough, directly or indirectly?

Q₃ is formally expressed as follows:

SELECT ?x ((?x, AssociatedWith, ?y) UNION ((?x, AssociatedWith, ?z) AND (?z, AssociatedWith, ?y)) FILTER ?x = ‘Bronchitis’ ∧ ?y = ‘Cough’).

The solution of Q₃ is as follows:Note that is shown in Table 4.

Note that AND is shown in Table 5.

Thus UNION ((? AND is shown in Table 6.

Then

(4) Q₄: What is associated with cough, excluding bronchitis?

Q₄ is expressed in pSPARQL as follows:

SELECT ?x (((?x, AssociatedWith, ?y) FILTER ?y = ‘Cough’) DIFF ((?x, AssociatedWith, ?y) FILTER ?x = ‘Bronchitis’ ∧ ?y = ‘Cough’)).

The solution of Q₄ is as follows:Note that FILTER? is shown in Table 7.

Note that FILTER? is shown in Table 8.

Then .

In short, pSPARQL can express many interesting queries over probabilistic RDF that are useful in practice. Whereas SPARQL querying only tells us whether a connection exists, pSPARQL quantifies the connection, so that we obtain more specific solutions.

6. Conclusions

In this paper, we extended SPARQL to support querying over probabilistic RDF. In the future, we will study further foundational properties of pSPARQL and implement it in a prototype to provide full SPARQL query answering services for probabilistic RDF. We are also interested in presenting a probabilistic semantics of RDF graphs in a unified framework that can support many applications.

Data Availability

No data were used to support this study.

Disclosure

An earlier version of this work was presented at “International Conference on Big Scientific Data Management 2018.”

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the program of the key discipline “Applied Mathematics” of Shanghai Polytechnic University (XXKPY1604).

References

[1] “RDF primer, W3C Recommendation, February 2004”.
[2] F. M. Suchanek, G. Kasneci, and G. Weikum, “Yago: a core of semantic knowledge,” in Proceedings of the 16th International World Wide Web Conference (WWW '07), pp. 697–706, Alberta, Canada, May 2007.
[3] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “From data mining to knowledge discovery in databases,” AI Magazine, vol. 17, no. 3, pp. 37–53, 1996.
[4] N. Dalvi and D. Suciu, “Efficient query evaluation on probabilistic databases,” in Proceedings of the VLDB’04, pp. 864–875, 2004.
[5] Y. Fukushige, “Representing probabilistic relations in RDF,” in Proceedings of the ISWC-URSW’05, pp. 106-107, 2005.
[6] D. Suciu, Probabilistic Databases, Encyclopedia of Database Systems, Springer, 2009.
[7] O. Udrea, V. Subrahmanian, and Z. Majkic, “Probabilistic RDF,” in Proceedings of the 2006 IEEE International Conference on Information Reuse & Integration, pp. 172–177, Waikoloa Village, HI, USA, September 2006.
[8] “SPARQL query language for RDF, W3C Recommendation, January 2008”.
[9] P. T. Wood, “Query languages for graph databases,” SIGMOD Record, vol. 41, no. 1, pp. 50–60, 2012.
[10] A. Khan and L. Chen, “On uncertain graphs modeling and queries,” in Proceedings of the PVLDB Endowment, vol. 8, pp. 2042-2043, 2015.
[11] H. Huang and C. Liu, “Query evaluation on probabilistic RDF databases,” in Proceedings of the WISE’09, pp. 307–320, 2009.
[12] C. Szeto, E. Hung, and Y. Deng, “SPARQL query answering with RDFS reasoning on correlated probabilistic data,” in Proceedings of the WAIM’11, pp. 56–67, 2011.
[13] X. Lian and L. Chen, “Efficient query answering in probabilistic RDF graphs,” in Proceedings of the 2011 international conference, p. 157, Athens, Greece, June 2011.
[14] D. Krompaß, M. Nickel, and V. Tresp, “Querying factorized probabilistic triple databases,” in Proceedings of the ISWC’14, pp. 114–129, 2014.
[15] J. Schoenfisch, “Querying probabilistic ontologies with SPARQL,” in Proceedings of the KI’14, pp. 2245–2256, 2014.
[16] X. Zhou, Y. Chen, and D. Z. Wang, “ArchimedesOne: Query processing over probabilistic knowledge bases,” Proceedings of the VLDB Endowment, vol. 9, no. 13, pp. 1461–1464, 2016.
[17] T. Andronikos, A. Singh, K. Giannakis, and S. Sioutas, “Computing probabilistic queries in the presence of uncertainty via probabilistic automata,” in Proceedings of the ALGOCLOUD’17, pp. 106–120, 2017.
[18] X. Zhang and J. Van den Bussche, “On the primitivity of operators in SPARQL,” Information Processing Letters, vol. 114, no. 9, pp. 480–485, 2014.
[19] “SPARQL 1.1 query language, W3C Recommendation, March 2013”.
[20] X. Zhang, J. Van den Bussche, K. Wang, and Z. Wang, “On the satisfiability problem of patterns in SPARQL 1.1,” in Proceedings of the AAAI’18, pp. 2054–2061, 2018.
[21] H. Fang and X. Zhang, “pSPARQL: a querying language for probabilistic RDF (extended abstract),” in Proceedings of the ISWC’16, Posters, 2016.
[22] J. Pérez, M. Arenas, and C. Gutierrez, “Semantics and complexity of SPARQL,” ACM Transactions on Database Systems (TODS), vol. 34, no. 3, pp. 1–45, 2009.

Copyright

Copyright © 2019 Hong Fang. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
