Abstract

Subgraph matching on large graphs has become a popular research topic in graph analysis, with a wide range of applications including question answering and community detection. However, the traditional edge-cutting strategy destroys the structure of indivisible knowledge in a large RDF graph. Under the premise of load-balanced subgraph division, a dominance-partitioned strategy is proposed to divide a large RDF graph without compromising the knowledge structure. Firstly, a dominance-connected pattern graph is extracted from a pattern graph to construct a dominance-partitioned pattern hypergraph, which divides the pattern graph into multiple fish-shaped pattern subgraphs. Secondly, a dominance-driven spectrum clustering strategy is used to gather the pattern subgraphs into multiple clusters. Thirdly, a dominance-partitioned subgraph matching algorithm is designed to find all isomorphic subgraphs on a cluster-partitioned RDF graph. Finally, experimental evaluation verifies that our strategy achieves higher time efficiency on complex queries and better scalability across multiple machines and different data scales.

1. Introduction

Subgraph matching is a fundamental problem in graph search and is NP-complete [1]. Specifically, given a query graph q and a large data graph, the problem of subgraph matching is to extract all subgraphs of the data graph that are isomorphic to q. On the one hand, the complex structure of a query graph degrades query accuracy and performance on large data graphs as the scale of real-world data grows explosively. On the other hand, social network data tend to be organized in a semantically rich structure. In this paper, we study the subgraph matching problem on large, semantically rich RDF graphs.

Despite the complexity of knowledge structures and the NP-completeness of subgraph matching, recent research has made significant advances in improving the performance of subgraph matching on large knowledge graphs in distributed environments.

One line of work encapsulates RDF data into triple-based relational tables [2, 3], which ensures the completeness of triple-based indivisible knowledge. Since relational methods ignore the inherent graph-like structure of RDF data, they incur an expensive cost from excessive join operations over relational tables. Another line of work manages RDF data in native graph formats, typically employing adjacency lists to index RDF data [4–6]. Since minimum edge-cutting strategies on large graphs damage the structure of indivisible knowledge, the enormous intermediate results make it difficult to balance the load of the partitioned RDF subgraphs.

To ensure the completeness of indivisible knowledge in graph-based formats, most researchers have decomposed the pattern graph into special-shaped subgraphs. StarMR [7, 8] decomposed query graphs into a set of star-shaped subgraphs and then employed two optimization techniques to filter invalid input data and reduce the data of stars. CFLMatch [9] postponed the aggregate operations on a tree-shaped index constructed from the core-forest-leaf query partition model.

1.1. Contributions

In this paper, we decompose the pattern graph into fish-shaped subgraphs, considering the attributed and topological structures of the RDF graph. Then, the fish-shaped subgraphs are clustered and used to guide the partition of the large RDF graph. Finally, a subgraph matching algorithm is designed to find all isomorphic subgraphs on the partitioned RDF subgraphs. Our contributions are as follows:
(1) We propose a dominant connected pattern graph to extract the dominating relationships of the pattern graph, including the node denotative relationship and the node connotative relationship. These relationships discover the dominant and semidominant nodes in the pattern graph. Then, fish-shaped pattern subgraphs are obtained through dominant node-centered expansion.
(2) We design a dominance-partitioned pattern hypergraph to model the fish-shaped pattern subgraphs. Each hypernode refers to a fish-shaped pattern subgraph, and each hyperedge denotes the common subgraph between fish-shaped pattern subgraphs.
(3) We employ a dominance-driven spectrum clustering strategy to gather the fish-shaped pattern subgraphs into multiple clusters. A dominance-partitioned weighted matrix is first constructed from the dominance-partitioned pattern hypergraph. Then, the spectrum clustering strategy gathers the hypernodes into multiple clusters based on the weighted matrix.
(4) We design a state transition model to describe the transition states of changed candidates, which consists of three states and six transition rules. Based on the state transition model, we analyze the influence of changed candidates on the adjacent region and design our incremental maintenance strategy.
(5) We propose a dominance-partitioned subgraph matching algorithm to find all isomorphic subgraphs on a cluster-partitioned RDF graph.

The rest of this paper is organized as follows: Section 2 introduces the preliminaries about problem definitions and related works. A framework of a dominance-partitioned RDF graph is provided in Section 3, including a dominant-connected pattern graph, dominance-partitioned pattern hypergraph, and dominance-driven spectrum clustering strategy. Section 4 presents a dominance-partitioned subgraph matching algorithm. Experimental results are reported in Section 5. A conclusion is given in Section 6.

2. Preliminaries

In this section, the definitions of the RDF graph and subgraph matching are given first. Then, the related work is introduced.

2.1. Problem Definitions

Resource Description Framework (RDF) [10] is a standard semantic model designed by a W3C group (https://www.w3.org/community/kg-construct/), in which data are represented by a set of triples. Each triple consists of three components: a subject, a predicate, and an object. Further, a triple is an element of I × I × (I ∪ L), where I denotes the set of IRIs (Internationalized Resource Identifiers) and L represents the set of literals.

Definition 1 (RDF graph). An RDF graph is a directed labeled graph, formed as G = (V, E, L, φ). Here, V is a set of vertices, E ⊆ V × V represents a set of directed edges, L denotes a set of vertex and edge labels, and φ : V ∪ E ⟶ L indicates a labeling function that assigns instantiated labels to vertices and edges.
The labels of an RDF graph are classified as instance-labels, relation-labels, attribute-labels, and type-labels according to the resources and interresource relationships of the RDF data. Consider an RDF triple with subject s, predicate p, and object o. The object o is a type-label if and only if both s and o are IRIs and p is a typed predicate, e.g., rdf:type or rdfs:subClassOf. The subject s and object o are instance-labels and p is a relation-label if and only if both s and o are IRIs and p is not a typed predicate; p is an attribute-label if and only if o is a literal.
Considering the RDF graph in Figure 1, each vertex is labeled by an instance-label, a type-label, an attribute-label, or a literal, and each edge is labeled by a relation-label. The set of instance-labeled vertices is {Person_A, Person_B, Publication_A, Course_B, Course_C, Department_A, University_A}. The set of type-labeled vertices is {Publication, GraduateCourse, Course, FullProfessor, ResearchAssistant, Department, University}. The set of literal-labeled vertices is {[email protected]}. All edge labels are mapped to relation-labels.
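The label classification above can be sketched as a small function. This is only an illustration under stated assumptions: the set of typed predicates, the class name "typed-predicate", the flag `o_is_literal`, and the predicates `teacherOf` and `emailAddress` are hypothetical and do not come from Figure 1.

```python
# Predicates treated as "typed"; the exact set is an assumption for illustration.
TYPED_PREDICATES = {"rdf:type", "rdfs:subClassOf"}


def classify_triple(s, p, o, o_is_literal=False):
    """Return (predicate class, object class) for one RDF triple (s, p, o)."""
    if o_is_literal:
        # A literal object makes the predicate an attribute-label.
        return ("attribute-label", "literal")
    if p in TYPED_PREDICATES:
        # Both ends are IRIs and the predicate is typed: o is a type-label.
        return ("typed-predicate", "type-label")
    # Both ends are IRIs and the predicate is not typed.
    return ("relation-label", "instance-label")


typed = classify_triple("Person_A", "rdf:type", "FullProfessor")
related = classify_triple("Person_A", "teacherOf", "Course_B")
attributed = classify_triple("Person_A", "emailAddress", "a@example.org", o_is_literal=True)
```

The literal check comes first because an attribute-label is decided by the object alone, whereas the other two classes also depend on the predicate.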

Definition 2 (pattern graph). A pattern graph is a directed labeled graph, formed as , where is a set of vertices, ℰ ⊆  ×  represents a set of directed edges, denotes a set of labels about vertex and edge, and ψ :  ∪ ℰ ⟶ var indicates a labeling function that assigns vertex and edge with the conceptual labels.
Considering the pattern graph in Figure 2, each vertex is mapped to a type-label or an attribute-label and each edge is mapped to a relation-label. The difference between an RDF graph and a pattern graph is that the pattern graph does not contain instance-labels. Further, the pattern graph is a conceptual network, and each query graph is a subgraph of the pattern graph.

2.1.1. Subgraph Matching

The problem of subgraph matching is to search all possible subgraphs of data graph that are isomorphic to query graph q. The subgraph matching is formally defined as a problem of subgraph isomorphism, described in Definition 3.

Definition 3 (subgraph isomorphism). Given a data graph G = (V, E, L, φ) and a query graph q = (𝒱′, ℰ′, ℒ′, ψ′), q is subgraph isomorphic to G if and only if there exists an injective mapping M from 𝒱′ to V such that (1) ∀u ∈ 𝒱′, M(u) ∈ V and ℒ′[u] ⊆ L[M(u)], and (2) ∀(u, u′) ∈ ℰ′, (M(u), M(u′)) ∈ E and ℒ′[(u, u′)] = L[(M(u), M(u′))].
A query graph q is subgraph isomorphic to a data graph if there exists a subgraph isomorphic mapping (subgraph mapping for short) of q on the data graph. For example, considering the data and query graphs in Figures 2 and 3, respectively, q is subgraph isomorphic to the data graph since there exists a subgraph isomorphic mapping M1 = {<Person_A, FullProfessor>, <Person_B, ResearchAssistant>, <Course_B, GraduateCourse>}.
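The conditions of Definition 3 can be checked with a plain backtracking search. The following is a minimal sketch, not the matching algorithm of this paper: it assumes graphs are given as dictionaries, uses a fixed matching order, and the edge labels `teacherOf` and `takesCourse` as well as the query vertex names `x`, `y`, `z` are hypothetical.

```python
def subgraph_matches(q_labels, q_edges, g_labels, g_edges):
    """Enumerate mappings M (query vertex -> data vertex) satisfying Definition 3.

    q_labels / g_labels: dict vertex -> set of vertex labels
    q_edges  / g_edges : dict (source, target) -> edge label
    """
    order = list(q_labels)  # fixed matching order, no optimisation
    results = []

    def consistent(u, v, mapping):
        # Vertex condition: the labels of u must be contained in those of M(u).
        if not q_labels[u] <= g_labels[v]:
            return False
        trial = {**mapping, u: v}
        # Edge condition: every query edge whose endpoints are both mapped
        # must exist in the data graph with the same label.
        for (a, b), label in q_edges.items():
            if u in (a, b) and a in trial and b in trial:
                if g_edges.get((trial[a], trial[b])) != label:
                    return False
        return True

    def extend(i, mapping):
        if i == len(order):
            results.append(dict(mapping))
            return
        u = order[i]
        for v in g_labels:
            # Distinct query vertices must map to distinct data vertices.
            if v not in mapping.values() and consistent(u, v, mapping):
                mapping[u] = v
                extend(i + 1, mapping)
                del mapping[u]

    extend(0, {})
    return results


g_labels = {"Person_A": {"FullProfessor"}, "Person_B": {"ResearchAssistant"},
            "Course_B": {"GraduateCourse"}, "Course_C": {"Course"}}
g_edges = {("Person_A", "Course_B"): "teacherOf",
           ("Person_B", "Course_B"): "takesCourse"}
q_labels = {"x": {"FullProfessor"}, "y": {"ResearchAssistant"}, "z": {"GraduateCourse"}}
q_edges = {("x", "z"): "teacherOf", ("y", "z"): "takesCourse"}

matches = subgraph_matches(q_labels, q_edges, g_labels, g_edges)
```

On this toy input the only mapping found pairs Person_A, Person_B, and Course_B with the FullProfessor, ResearchAssistant, and GraduateCourse query vertices, which mirrors the mapping M1 above.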
The k-partition problem on the RDF Graph is shown in Definition 4.

Definition 4 (K-partition problem on RDF graph (RG-KP)). Given an RDF graph , the K-partition problem on the RDF graph refers to divide q into k subgraphs, satisfying , such that overlapped cost is minimum and subgraph cost satisfies the condition .
In this paper, our study of the RG-KP problem focuses on a dominance-partitioned strategy that divides the topological and tree-shaped structures of the pattern graph. Then, dominance-driven spectrum clustering is used to gather the dominance-partitioned pattern subgraphs into multiple clusters. Finally, a dominance-partitioned subgraph matching algorithm is designed to find all isomorphic subgraphs on a cluster-partitioned RDF graph.
In this paper, we focus on directed labeled graphs. Both q and the data graph are directed labeled graphs, but edge direction does not affect the execution scheduling of subgraph matching. Thus, the dominant connected pattern subgraph and the dominance-partitioned pattern hypergraph are defined as undirected graphs without graph labels. The detailed notations and meanings are described in Table 1.

2.2. Related Works

In this section, we mainly review the related works on triple-based relational and graph-based traversal strategies in distributed environment.

2.2.1. Triple-Based Relational Strategy

Most RDF systems store and index RDF data as a set of triple tables in a relational database. SW-Store [11] vertically partitioned RDF triples into multiple attribute tables. RDF-3X [2, 12] and Hexastore [3] implemented index-based query schemes by redundantly storing multiple permutations of triples in B+-trees. Peng et al. [13] designed an RDF graph storage scheme to optimize graph division and balance the query load, where query processing was divided into two stages: scanning and joining. During the scanning phase, the query engine decomposed the SPARQL query into a set of triple patterns. In the joining phase, the scanned intermediate results were first bound into a left-joining tree, and then the query results were produced through the left-joining tree.

The distributed systems H-RDF-3X [14] and SHARD [15] horizontally divided RDF data across multiple computing nodes and used Hadoop as the communication layer for cross-node queries. H-RDF-3X divided the RDF graph into a specified number of partitioned data subgraphs using the minimum edge-cut method METIS [16]. Then, 1-hop or 2-hop replication was employed to extend the boundaries of the partitioned data subgraphs, which ensured that small-diameter queries could obtain complete answers within a single partitioned data subgraph. The query processing of both systems used a reduce-side strategy in which RDF triples are scanned in the map phase and the intermediate results are iteratively combined into the final results in the reduce phase. However, the iterative map and reduce operations over RDF triples incur expensive time costs on query graphs with complex topological structures.

To reduce the complexity of topological query graphs, a query decomposition model called TwinTwig [17] was designed for an efficient subgraph enumeration algorithm on distributed undirected graphs. S2RDF [18] converted SPARQL queries into RDD operations on the Spark distributed computing framework. Even though offline indexes were built to speed up online subgraph matching, the expensive time cost of index construction must be paid to match large-scale data graphs. TriAD [4] combined join-ahead pruning via RDF graph summarization with a locality-based horizontal partitioning of RDF triples into a grid-like distributed index structure.

2.2.2. Graph-Based Traversal Strategy

The graph-based traversal strategies were employed to store RDF data in native graph format, which focused on the construction of data indexes and pruning rules of redundant intermediate results.

Indexes constructed over large data graphs were used to shrink the search space of candidate intermediate results. BitMat [19] proposed a compressed bit-matrix structure to store huge RDF graphs, and a variable-binding-matching algorithm was designed to produce the final results directly without indexing the intermediate results. TripleBit [20] presented a fast and compact system for storing and accessing RDF data, which designed two auxiliary index structures to minimize the cost of index selection during query evaluation. gStore [21] proposed a signature technique, which stored RDF data in disk-based adjacency lists and transformed an RDF graph into a data signature graph by encoding each entity and class vertex. Then, a VS*-tree index with light maintenance overhead was built over the data signature graph. To enhance gStore, a redesigned gStore [22] was given a new query plan generation module that generated different query plans according to the structures of query graphs. Furthermore, it redesigned the vertex encoding strategy to achieve more pruning power and introduced a new multijoint algorithm to speed up the subgraph matching process.

Pruning rules were employed to cut redundant intermediate results during subgraph matching, as in Trinity.RDF [23] and WuKong [5]. Trinity.RDF was a distributed memory-based graph engine for web-scale RDF data. Instead of managing the RDF data in triple stores or as bitmap matrices, the engine stored RDF data in its native graph format to support graph-based operations on RDF graphs, e.g., random walks, reachability, and community discovery. However, Trinity.RDF only used one-hop pruning rules to avoid redundant path-based intermediate results, and a master machine was needed to aggregate all positive intermediate results. Studies [4, 5] found that single-machine aggregate operations can easily become the bottleneck for a big query graph because the huge intermediate results can cause memory overflow on the single master machine. Further, the experiment in [5] shows that the aggregate operations consumed more than 90% of the total matching time. Based on this experimental verification, WuKong adopted a full-history pruning strategy to reduce redundant intermediate results in advance. However, its cost model of aggregate operations was designed to guide the matching order as in a relational database: although the expensive full-join Cartesian products restrict the time efficiency of producing the final results, WuKong only uses a cost model based on predicate connections, as in the relational method, to guide query execution.

Most existing studies are devoted to ensuring the completeness of indivisible knowledge. Xu et al. [24] studied the problem of multiobjective spatial keyword queries with semantics and designed the LIR-tree index to integrate the spatial and semantic information of all objects in a balanced way. Wang et al. [25] created a new social network with more complete knowledge and proposed a k-Dcore framework to retrieve effective communities in the directed social network. Chen et al. [26] proposed a pivot-based hierarchical indexing structure -tree to integrate spatial and semantic information in a seamless way, with a carefully designed space mechanism that transforms the high-dimensional semantic vectors into a low-dimensional space so that more effective pruning can be achieved. Cheng et al. [27] studied the automatic repair of graphs with repairing rules and designed a decomposition-and-join strategy to address the time complexity of finding isomorphic subgraphs of graph data under graph-repairing rules. DistR [28] was an efficient distributed strategy for reachability queries over large uncertain graphs, which found all maximal subgraphs of an original graph in the distributed graph reduction step and transformed the problem into a relational join process in the distributed consolidation step. Deep NBCN [29] discovered a homogeneous and multibranch architecture to model the complex internal relationship between the amino acid sequence and the protein secondary structure sequence.

In this paper, we decompose the pattern graph into partial subgraphs to reduce the time cost of aggregate operations. Our research is motivated by previous studies on discovering special structures of the pattern graph. The first was StarMR [7], which decomposed query graphs into a set of attribute stars to filter redundant star-shaped input data. The second line of work [30, 31] demonstrated that a topological structure discovered by analyzing anchored and followed relationships can reduce discontinuous intermediate results. The third [32] is our previous work on subgraph matching over static knowledge graphs, which constructed a flow-based subgraph index to reduce redundant RDF data.

Benefiting from previous research, a dominance-partitioned pattern subgraph is designed to encapsulate the topological and attribute structures of the pattern graph. Then, the large RDF graph can be partitioned with the aid of our dominance-partitioned pattern subgraphs; the framework of the partitioned RDF graph is introduced in detail in Section 3.

3. Framework of Dominance-Partitioned RDF Graph

In this section, a dominance-partitioned subgraph matching framework is proposed to compute the subgraph mappings of query graphs on large data graphs. Firstly, a dominant-connected pattern graph is acquired from a pattern graph. Secondly, the large data graph is partitioned by pattern-driven spectrum clustering. Finally, the subgraph mappings of query graphs on the partitioned data graphs are computed iteratively.

The pseudocode of DP-SM is described in Algorithm 1. A dominant-connected pattern subgraph (DCPG for short) is acquired from the pattern graph. Firstly, a flow graph model is employed to extract the dominated vertices and dominating relationships of the pattern graph. Then, the DCPG is constructed from the dominated vertices and dominating relationships of the pattern graph (Line 1 and Section 3.1). Secondly, spectrum clustering is used to divide the large data graph based on the hypergraph of the DCPG (Line 2 and Section 3.2). Finally, the subgraph mappings of the query graph q on the partitioned data graph are computed iteratively (Line 3 and Section 4).

Input: a pattern graph, a query graph q, and a data graph
Output: the set M of all subgraph mappings of q on the data graph
(1) DCPG extraction from the pattern graph;
(2) k-partition of the data graph on the DCPG;
(3) M ⟵ subMatching(q, partitioned data graph);
(4) Return M;
3.1. Dominant Connected Pattern Graph

The dominant-connected pattern subgraph is a subgraph of the pattern graph and extends the theory of the dominator tree [33]. Considering a pattern graph and two of its vertices u and u′, if there exists an artificially designated root ur such that u is a necessary vertex on every path from ur to u′, then u is a dominant node of u′, denoted u ≺ u′. Similarly, if an edge e of the pattern graph is a necessary edge on every path from ur to u′, then e ≺ u′. On the basis of these dominating relations, the dominant connected pattern graph is described in Definition 5.

Definition 5 (dominant connected pattern subgraph). Given a pattern graph and a root node ur ∈ , a pattern subgraph is a dominant-connected pattern subgraph, formed as , if and only if it satisfies the conditions: (1) for any a node u′ ∈ (), it always finds a node u ∈  such that u ≺ u′ on the paths . (2) For any an edge e′ ∈ () and the end-node u′ of e, satisfying e ≺ u′ on the paths where is the paths from root node ur to query vertex u′. The dominating relationships u ≺ u′ and e ≺ u′ on refer that u and e are the necessary vertex and edge on the path from ur to u′, respectively. We define the collector of dominant nodes in as , and then, it satisfies the condition  ⊆  ⊆ , and the set of nodes dominated by u is denoted as a dominant set of u, formed as dom(u).
Considering the pattern graph and the artificial root node u1 in Figure 4(a), the DCPG of the pattern graph is shown in Figure 4(b). The dominating relationship is used to acquire the topological and attribute structures of the pattern graph; thus, edge direction is not considered in the calculation of the DCPG. Considering the DCPG in Figure 4(b), if deleting any node of the DCPG makes it invalid, then the DCPG is minimal. For instance, if node u3 is deleted, the edges (u6, u3) and (u2, u3) are also deleted, and the DCPG becomes invalid because no edge dominating u3 can be found in it.
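The dominating relationship u ≺ u′ used above is the classical dominator relation on a flow graph. As a minimal sketch, the dominator sets can be computed with the standard iterative data-flow method; this is a textbook technique and not necessarily the procedure used to build the DCPG. The small directed graph and the node names `r`, `a`, `b`, `c`, `d` are hypothetical, and every node is assumed to be reachable from the root.

```python
def dominators(succ, root):
    """Return dict node -> set of nodes lying on every path from root to it."""
    nodes = set(succ) | {v for targets in succ.values() for v in targets}
    pred = {n: set() for n in nodes}
    for u, targets in succ.items():
        for v in targets:
            pred[v].add(u)

    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {n: set(nodes) for n in nodes}
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for n in nodes - {root}:
            incoming = [dom[p] for p in pred[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


# Hypothetical flow graph: two paths from r merge at c, then continue to d.
succ = {"r": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d"], "d": []}
dom = dominators(succ, "r")
```

In this example neither `a` nor `b` dominates `c`, because `c` can be reached through either of them, while `c` dominates `d` since every path from the root to `d` passes through it.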
In this paper, we explore the characteristics of the DCPG to analyze the node denotative and connotative relationships in a pattern graph. The node denotative relationship is discovered from the dominating relationships of vertices in the pattern graph, as described in Theorem 1.

Theorem 1 (node denotative relationship). Given a DCPG, if there exists a dominant node u such that dom(u) is nonempty, then the nodes of dom(u) form a tree rooted at u.

Proof (for Theorem 1). Consider a dominant set dom(u). Since u is a necessary vertex on the paths from the root to every node of dom(u), dom(u) and u can be combined into a flow graph originating from u. Now consider two different edges e and e′ of this flow graph. If they share a common end-node u′, then neither e nor e′ is a dominant edge of u′. This nondominating relationship between vertex and edge contradicts condition (2) in Definition 5.
The node connotative relationship is discovered by the vertex semidominant relation in a pattern graph. A DCPG and a dominated vertex u ∈  are considered, if there exists a dominant vertex u on the path , satisfying u ≺ u′, such d(ur, u′) is a minimum distance of D(ur, dom(ur, u′)), then u is a semidominant node of u′, formed as u ≼ u′. Here, dom(ur, u′) is a set of vertices dominating u′ on the paths , d(ur, u′) denotes the distance from ur to u′, and it is collected into the set D(ur, dom (ur, u′)), satisfying u′ ∈ dom (ur, u′), and d(ur, u′) ∈ D(ur, dom (ur, u′)). The node connotative relationship is defined in Theorem 2.

Theorem 2 (node connotative relationship). Given a minimal DCPG, if there exists a dominant node u semidominated by a node u′, then the paths from u′ to u combine into a single node, a single-circular graph, or a multicircular graph.

Proof (for Theorem 2). A minimal DCPG is the DCPG with the fewest dominant nodes of a pattern graph. Consider a DCPG in which a dominant node u is semidominated by u′. If there is only a single path from u′ to u, then there must exist a smaller DCPG that does not contain u, and thus the DCPG is not minimal. Therefore, if u′ and u are different dominant nodes, there must exist multiple paths from u′ to u, which combine into a single-circular or multicircular graph; otherwise, u and u′ are the same dominant node.

3.2. Dominance-Partitioned Pattern Hypergraph

In this section, the dominance-partitioned pattern hypergraph (DPPG for short) is first introduced to organize the dominant nodes and dominating relationships. Then, dominance-driven spectrum clustering is employed to divide a pattern graph into k subgraphs. The dominance-partitioned pattern hypergraph is defined in Definition 6.

Definition 6 (dominance-partitioned pattern hypergraph). Given a minimal DCPG, a dominance-partitioned pattern hypergraph is a hypergraph of the DCPG whose hypernodes are the dominant nodes of the DCPG and whose hyperedges are pairs of hypernodes.
Both the hypernodes and hyperedges of the DPPG indicate geometries of the pattern graph, which are derived on the basis of the node denotative and connotative relationships in the DCPG.
The geometry of a node is a fish-shaped subgraph of . A dominating set dom(u) is considered, we define the denotative and connotative relationships of u as a tree-pattern subgraph and circular-pattern subgraph , respectively. The combination of a tree-pattern subgraph and circular-pattern subgraph of dominating node u is denoted as (u). Thus, (u) is a fish-shaped graph with a semidominant node of (u) as the fish head and leaves of (u) as fish tail. The geometry of edge indicates the circular-pattern common subgraph, which is composed of multiple paths between any dominant nodes in .
Considering the pattern graph and the dominant-connected pattern subgraph in Figure 4, the dominance-partitioned pattern subgraphs are illustrated in Figure 5, where the circles filled with diagonal lines denote the dominant nodes and the circles filled with vertical lines indicate the semidominant nodes. Each dominance-partitioned pattern subgraph is composed of the multiple paths from a semidominant node to the dominant node and the tree-shaped structure dominated by the dominant node. For example, the dominance-partitioned pattern subgraph of u3 is composed of the multiple paths from u1 to u3 and the dominated tree-shaped structure {u3, u4, u5}.
Thus, the k-partition problem of a pattern graph can be converted into dividing the hypergraph into k subgraphs, as stated in Definition 7.

Definition 7 (dominance-driven k-partition problem). Given a dominance-partitioned pattern hypergraph and  = {u1, … , un}, the dominance-driven k-partition problem refers to divide into k clusters, satisfying C = {C1, … ,Ck}, such that overlapped cost is minimum and subgraph cost satisfies the condition
where denotes the pattern-clustered subgraph cost of and indicates the overlapped cost of pattern-clustered subgraph , …, .
In this paper, we abbreviate () as . Then, the overlapped cost of a pattern-clustered subgraph is redefined as , and subgraph cost is represented as |. Here, corresponds to the i-th of cluster Ci. Actually, the overlapped cost is equivalent to , proven in Lemma 1.

Lemma 1 (node connotative relationship). Given pattern-clustered subgraphs and , the overlapped cost is equivalent to .

Proof. (for Lemma 1).Considering the node denotative relationship in Theorem 1, regarding dominant nodes u ∈  and u′ ∈ , if there exists (u) ∧  ≠ ∅, then u ≺ u′ or u′ ≺ u. Regarding the dominating relationship u′ ≺ u, there must exist a semidominant node u″ ≺ u, such that u″ ≺ u′ or u″ = u′. Consider the node connotative relationship in Theorem 2, u′ must be contained into . Thus, (u) ∧  ≠ ∅.

3.2.1. Dominance-Pattern Weighted Matrix

The dominance-pattern weighted matrix models the similarity of pattern-partitioned subgraphs through its bidirectional adjacency matrix of size n × n, where n =  and denotes a set of dominant nodes. Further, represents a subgraph cost of and represents an overlapped cost between and ,  ≠ 0 if and only if  ∧  ≠ ∅, otherwise  = 0.
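One way to read the weighted matrix is sketched below. It assumes, following the cost definitions of this subsection, that the subgraph cost is the number of data triples mapped to the pattern triples of a subgraph and that the overlapped cost is the same count restricted to the shared pattern triples. The pattern triples, the counts, and the function name `weighted_matrix` are hypothetical.

```python
def weighted_matrix(subgraphs, triple_count):
    """Build an n x n matrix of subgraph costs (diagonal) and overlapped costs.

    subgraphs   : list of sets of pattern triples, one set per dominant node
    triple_count: dict pattern triple -> number of data triples mapped to it
    """
    n = len(subgraphs)
    weights = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # For i == j the intersection is the whole subgraph (subgraph cost);
            # for i != j it is the common part (overlapped cost, 0 if disjoint).
            common = subgraphs[i] & subgraphs[j]
            weights[i][j] = sum(triple_count[t] for t in common)
    return weights


t1 = ("FullProfessor", "teacherOf", "GraduateCourse")
t2 = ("ResearchAssistant", "takesCourse", "GraduateCourse")
t3 = ("FullProfessor", "worksFor", "Department")
triple_count = {t1: 5, t2: 3, t3: 2}
subgraphs = [{t1, t2}, {t2, t3}, {t3}]

W = weighted_matrix(subgraphs, triple_count)
```

The resulting matrix is symmetric, and an off-diagonal entry is nonzero exactly when the two pattern-partitioned subgraphs share at least one pattern triple with mapped data.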

The subgraph cost of pattern-partitioned subgraph is evaluated by the quantity of data triples mapped to pattern triples. A pattern-partitioned subgraph (u) is considered, and the subgraph cost of (u) is the quantity of data triples mapped to all pattern triples in (u), formed as . Pattern-partitioned subgraphs and are considered, and the overlapped cost is the common subgraph cost of pattern subgraphs, formed as . The subgraph and overlapped costs are described in the following formulas:

3.2.2. K-Partition on Dominance-Pattern Weighted Graph

The k-partition is used to cut the graph DCPG into k subgraphs that are not connected to each other. We define the sets of k dominant nodes in as C1, …, Ck, satisfying Ci ∩ Cj = ∅ and Ci … Ck = , 1 ≤ i, j ≤ k.

For the dominant node sets Ci and Cj of any pattern subgraphs and , we define the weight of the cut-graph between Ci and Cj as the following formula:

Thus, for the dominant node sets C1,…, Ck of k pattern subgraphs , the weight of the cut-graph on k pattern graphs is defined as the following formula:where is the complementary of , satisfying .
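The cut weights can be illustrated with the standard graph-cut definitions from spectral clustering: the cut between two node sets sums the matrix entries that cross them, and the k-way cut sums each cluster's cut against its complement, halved so that each crossing pair is counted once. These definitions are assumed here, not taken from the formulas of this section, and the matrix values are hypothetical.

```python
def cut_weight(weights, part_a, part_b):
    """Sum of weights between every node of part_a and every node of part_b."""
    return sum(weights[u][v] for u in part_a for v in part_b)


def k_cut_weight(weights, clusters):
    """Total weight cut by a k-way partition of the node indices."""
    all_nodes = set(range(len(weights)))
    total = sum(cut_weight(weights, c, all_nodes - set(c)) for c in clusters)
    return total / 2  # each crossing pair was counted from both sides


# Hypothetical symmetric weighted matrix over three dominant nodes.
W = [[8, 3, 0],
     [3, 5, 2],
     [0, 2, 2]]

cut_a = k_cut_weight(W, [{0, 1}, {2}])
cut_b = k_cut_weight(W, [{0}, {1, 2}])
```

Here grouping nodes 0 and 1 cuts less overlapped weight than isolating node 0, so the first partition would be preferred when the overlapped cost is to be minimized.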

3.3. Dominance-Partitioned Algorithm on Large RDF Graph

In this section, k-partition algorithms are designed to divide the large data graph into multiple distributed subgraphs. We first give the construction of DPPG in Algorithm 2, and then the dominance-driven k-partition is designed in Algorithm 3.

Input: a pattern graph
Output: a dominance-partitioned pattern hypergraph
(1)i = 0, DFS[ur] = i,  ⟵ ur;
(2)For u ∈ ur.successors do
(3)If u is unvisited then;
(4)  DFS[u] = i++, Continue to ur.successors;
(5)Else if u is visited then
(6)   ⟵ u;
(7)For DFS[u] = I from to 0 do
(8)For u′ ∈ u.precursors do
(9)  If DFS[u] ∈ DFS[u′] then
(10)    ⟵ (u, u′), Continue to u′.precursors;
(11)  Else if sdom[u] > sdom[u′] then
(12)    ⟵ (u, u′), sdom[u] = sdom[u′];
(13) ⟵ |sdom[u], u);
(14)For u ∈  in descending order of DFS do
(15)For u′ ∈  − u do
(16)  If DFS [u] < DFS [u′] < DFS [sdom[u]] then
(17)  Removing u from ;
(18)For u ∈ do
(19) ⟵ (u, u.precursors),  =  ∪ ;
(20)For u ∈  and u′ ∈  − u do
(21)If ∧  ≠ ∅ then
(22)   ⟵ (u, u′);
(23)Return;
Input: a dominant connected pattern graph , an RDF graph
Output: a set of k-partitioned data subgraph
(1)Initializing weighted matrix of size ;
(2)For each t ∈ do
(3);
(4);
(5)Fordo;
(6);
(7), ;
(8)C ← K-Means(F), C = {C1, …, Ck};
(9)For Ci ∈ C and t ∈ do
(10)If t ∈ (Ci), then
(11)  ;
(12)Return;

The first core task of DPPG construction is to select a root node of the pattern graph. Intuitively, we tend to choose the node with the fewest local matching results and the largest degree as the root node. A query vertex with the fewest matching results implies the minimal network transmission cost, and the one with the greatest degree implies the maximal probability of pruning negative node pairs. The evaluation of the root node is described in formula (5), where M(u) denotes the entities typed and attributed by u. The W3C provides a set of vocabularies (as part of the RDF standard) to encode rich semantic information on RDF graphs. For example, type predicates (rdf:type) group the vertices of RDF graphs into different categories. Different from general labeled graphs, the vertices of RDF graphs identify entities or text information, and the semantic characteristics of entities result in vertices of the same type usually having similar predicate combinations, which is convenient for statistics. Thus, the entities of the RDF graph can be easily mapped to pattern nodes of the pattern graph through the typed and attributed predicates on the entities themselves.
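The root selection heuristic can be sketched as follows. Since formula (5) is not reproduced here, the score |M(u)| / degree(u) is only an assumed stand-in that favours few local matches and a large degree; the function name `select_root` and the numbers are hypothetical, and every pattern node is assumed to have degree at least 1.

```python
def select_root(match_count, degree):
    """Pick the pattern node minimising (local matches) / (degree)."""
    return min(match_count, key=lambda u: match_count[u] / degree[u])


# Hypothetical statistics for three pattern nodes.
match_count = {"u1": 10, "u2": 10, "u3": 200}   # |M(u)|: entities mapped to u
degree = {"u1": 4, "u2": 2, "u3": 5}            # degree of u in the pattern graph

root = select_root(match_count, degree)
```

With these numbers u1 and u2 have the same number of local matches, and the higher degree of u1 decides the choice.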

The construction of DPPG is shown in Algorithm 2. The input is a pattern graph, and the output is a dominance-partitioned pattern hypergraph. An artificial root node is first selected by formula (5). The root node ur is encoded with the initial sequential number DFS[ur] = 0 and added to the dominant set (Line 1). Then, four modules are executed sequentially in the construction of DPPG.

The first module assigns a unique sequential number to each vertex of the pattern graph in depth-first search order (Lines 2–6). Each node u is encoded iteratively by searching the successors of the visited node ur (Line 2). If u is unvisited, u is encoded and the successors of u are pushed into the encoder (Lines 3-4). If u has already been visited, u is taken as a dominant node and added to the dominant set (Lines 5-6).
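The numbering step can be sketched as follows. This is a minimal illustration of depth-first numbering that also collects the nodes reached a second time; the adjacency-dictionary layout and the name `dfs_encode` are assumptions, and the handling of the root in the dominant set is left out.

```python
def dfs_encode(successors, root):
    """Number nodes in depth-first order and collect re-visited nodes.

    successors: dict mapping a node to a list of its successor nodes
                (a hypothetical layout chosen for this sketch).
    Returns (dfs_number, revisited).
    """
    dfs_number = {root: 0}
    revisited = []  # nodes reached again after they were numbered
    stack = [iter(successors.get(root, []))]
    while stack:
        try:
            u = next(stack[-1])
        except StopIteration:
            stack.pop()  # all successors of the current node are done
            continue
        if u not in dfs_number:
            dfs_number[u] = len(dfs_number)
            stack.append(iter(successors.get(u, [])))
        elif u not in revisited:
            revisited.append(u)
    return dfs_number, revisited


# A diamond-shaped pattern: d is reached through both b and c.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
numbers, revisited = dfs_encode(graph, "a")
```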

The second module identifies the semidominant nodes of the pattern graph in descending depth-first search order (Lines 7–12). Each node u′ is traced iteratively by searching the precursors of an encoded node u (Line 8). If DFS[u] < DFS[u′], the edge (u, u′) is collected into a circular-pattern subgraph and the precursors of u′ are pushed into the tracer (Lines 9-10). If DFS[u] > DFS[u′] and sdom[u] > sdom[u′], the semidominant node of u is replaced by u′ and the edge (u, u′) is collected into the circular-pattern subgraph (Lines 11-12).

The third module minimizes the dominant set (Lines 13–16). For a dominant node u in the dominant set, if there exists another dominant node u′ satisfying DFS[u] < DFS[u′] < DFS[sdom[u]], then u′ is removed from the dominant set.

The fourth module acquires the DPPG and the dominance-partitioned pattern subgraphs (Lines 17–21). The nodes dominated by a node u are added to the tree-pattern subgraph of u, and the union of the tree-pattern subgraph and the circular-pattern subgraph of u becomes the dominance-partitioned pattern subgraph of u (Lines 17-18). For the dominance-partitioned pattern subgraphs of two nodes u and u′, if they share a common part, a hyperedge (u, u′) is added to the hypergraph.
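For readers unfamiliar with dominating relationships, the following sketch computes dominator sets on a small rooted graph. It deliberately uses the simple iterative data-flow formulation rather than the semidominator-based procedure of Algorithm 2, and the adjacency-dictionary layout and the name `dominators` are assumptions made for illustration.

```python
def dominators(successors, root):
    """Return dom[u]: the set of nodes lying on every path from root to u.

    Simple iterative data-flow formulation (not the semidominator-based
    procedure); nodes unreachable from the root keep the full node set.
    """
    nodes = set(successors) | {v for vs in successors.values() for v in vs}
    preds = {u: set() for u in nodes}
    for u, vs in successors.items():
        for v in vs:
            preds[v].add(u)
    dom = {u: set(nodes) for u in nodes}
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for u in nodes - {root}:
            if not preds[u]:
                continue  # no incoming edge: nothing to refine
            new = {u} | set.intersection(*(dom[p] for p in preds[u]))
            if new != dom[u]:
                dom[u] = new
                changed = True
    return dom


# In a diamond a -> {b, c} -> d, only a (and d itself) dominates d.
dom = dominators({"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}, "a")
```

The iterative form is quadratic in the worst case, which is acceptable for pattern graphs because they are far smaller than the data graph.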

3.3.1. Example for DPPG Construction

An example of DPPG construction is illustrated in Figure 6, where the circles filled with diagonal lines denote the dominant nodes and the circles filled with vertical lines indicate the semidominant nodes.

A root node u1 is first selected by formula (5); it should have the smallest number of local matching results and the largest degree. Then, the four modules are executed sequentially in the construction of DPPG. The first module assigns a unique sequential number to each vertex of the pattern graph in depth-first search order. The sequential numbers are shown as the subscripts of the nodes, and the nodes visited twice are collected into the dominant set {u3, u6, u9, u15}.

The second module identifies the semidominant nodes in descending depth-first search order, and the semidominating relationships are acquired as {u1 ≼ u3, u1 ≼ u6, u7 ≼ u9, u1 ≼ u15}. While the semidominating relationships are acquired, the circular-pattern edges are inserted into circular-pattern subgraphs. For the dominant node u3, its precursor u6 satisfies DFS[u3] < DFS[u6]; thus the precursors of u6 are expanded further and (u6, u3) is inserted into the circular-pattern subgraph. After the semidominant node of u3 is found to be u1, the ascending path is inserted into the circular-pattern subgraph of u3. The dominance-partitioned pattern subgraphs are illustrated in Figure 6.

The third module minimizes the dominant set. For a dominant node u in the dominant set, if there exists another dominant node u′ satisfying DFS[u] < DFS[u′] < DFS[sdom[u]], then u′ is removed from the dominant set. The fourth module acquires the DPPG and the dominance-partitioned pattern subgraphs. The dominated tree-shaped structures are combined with the circular-pattern subgraphs to form the compact dominance-partitioned pattern subgraphs.

The dominance-driven k-partition is described in Algorithm 3. The inputs are a dominance-connected pattern graph and an RDF graph. The output is a set of k-partitioned data subgraphs. The dominance-driven k-partition algorithm consists of three modules.

The first module constructs a similarity matrix of the dominance-partitioned pattern hypergraph (Lines 1–4). The dominance-pattern weighted matrix initializes the similarity of the pattern-partitioned subgraphs through the bidirectional adjacency matrix of the hypergraph (Line 1). The entry for a single pattern subgraph represents its subgraph cost (Line 3), and the entry for a pair of pattern subgraphs represents the overlapped cost between them (Line 4).

The second module calculates the degree matrix by accumulating the elements in each row of the similarity matrix (Lines 5-6). Then, the graph Laplacian is employed to calculate the feature matrix, satisfying L = D − W and fᵀLf = (1/2) Σi,j wij (fi − fj)², where D is the degree matrix and W is the similarity matrix with entries wij. Thus, the similarity matrix can be abstracted as a feature matrix F (Line 7). Further, k-means clustering is applied to the rows of the feature matrix to form k clusters (Line 8).
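The role of the Laplacian can be checked numerically with a small sketch. It assumes the unnormalized graph Laplacian L = D − W of a symmetric similarity matrix and verifies the standard identity fᵀLf = (1/2) Σ wij (fi − fj)²; the matrix values and function names are illustrative only, and the eigen-decomposition and k-means steps are not shown.

```python
def laplacian(W):
    """Unnormalized graph Laplacian L = D - W of a symmetric weight matrix."""
    n = len(W)
    degree = [sum(row) for row in W]  # row sums form the degree matrix D
    return [[(degree[i] if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]


def quadratic_form(L, f):
    """Compute f^T L f directly."""
    n = len(f)
    return sum(f[i] * L[i][j] * f[j] for i in range(n) for j in range(n))


def pairwise_form(W, f):
    """Compute (1/2) * sum_ij w_ij * (f_i - f_j)^2."""
    n = len(f)
    return 0.5 * sum(W[i][j] * (f[i] - f[j]) ** 2
                     for i in range(n) for j in range(n))


# Hypothetical similarity matrix over three pattern subgraphs.
W = [[0.0, 2.0, 1.0],
     [2.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]]
f = [1.0, -1.0, 3.0]
L = laplacian(W)
```

The identity explains why eigenvectors of L with small eigenvalues are useful features: a vector f that keeps strongly connected subgraphs close together yields a small quadratic form.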

Finally, the third module generates the k-partitioned data subgraphs by mapping the data graph onto the k clusters (Lines 9–11). For the pattern-clustered subgraph of a cluster Ci, if a data triple t of the RDF graph matches that pattern-clustered subgraph, then t is a mapped data triple of the corresponding data subgraph (Line 11).
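The mapping of data triples onto clusters can be sketched as below. The sketch assumes that a triple matches a cluster when the types of its subject and object together with its predicate form a pattern edge of that cluster; this matching rule, the function name `partition_triples`, and the predicate `takesCourse` are illustrative assumptions.

```python
def partition_triples(triples, type_of, clusters):
    """Assign each data triple to every cluster whose pattern edges cover it.

    triples:  iterable of (subject, predicate, object) instance triples.
    type_of:  dict mapping an instance to its type label.
    clusters: dict mapping a cluster id to a set of pattern edges
              (subject_type, predicate, object_type).
    """
    parts = {cid: [] for cid in clusters}
    for s, p, o in triples:
        pattern_edge = (type_of.get(s), p, type_of.get(o))
        for cid, edges in clusters.items():
            if pattern_edge in edges:
                parts[cid].append((s, p, o))
    return parts


type_of = {"Person_A": "FullProfessor",
           "Person_B": "ResearchAssistant",
           "Course_A": "GraduateCourse"}
triples = [("Person_B", "advisorBy", "Person_A"),
           ("Person_B", "takesCourse", "Course_A")]
clusters = {0: {("ResearchAssistant", "advisorBy", "FullProfessor")},
            1: {("ResearchAssistant", "takesCourse", "GraduateCourse")}}
parts = partition_triples(triples, type_of, clusters)
```

A triple whose pattern edge belongs to several clusters is replicated in each of them, which mirrors the overlap between pattern subgraphs.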

3.3.2. Example for Dominance-Driven K-Partition

Each dominance-partitioned pattern subgraph is taken as a hypernode of the hypergraph, and each hyperedge indicates the common part between two dominance-partitioned pattern subgraphs. Considering the circular-pattern subgraphs of u3 and u6 in Figure 6, the edge (u1, u6) is their common part. For a dominance-driven 3-partition strategy, the clustered pattern subgraphs are illustrated in Figure 5. For the RDF graph and the pattern graph in Figures 1 and 2, the divided pattern graphs and the partitioned RDF graphs are illustrated in Figure 7. The divided pattern subgraphs of Department and GraduateCourse are shown in Figure 7(a); both are fish-shaped pattern subgraphs semidominated by the common node ResearchAssistant. For a dominance-driven 2-partition strategy, the divided pattern subgraphs do not need to be clustered, because the clustered pattern subgraphs are identical to the divided ones. The large RDF graph is partitioned onto different servers based on the two divided pattern subgraphs, as illustrated in Figures 7(b) and 7(c).

4. Subgraph Matching on k-Partitioned RDF Graph

In this section, we introduce the subgraph matching algorithm on a k-partitioned RDF graph. Given a query graph q and a data graph, the subgraph matching problem is to find all subgraphs of the data graph that are isomorphic to q. Here, the query graph is defined as a subgraph of the pattern graph.

4.1. Feasibility of Dominating Relationship

To match the flow characteristics of the pattern graph in this paper, we attach to the query graph a virtual root that is isotopic to the artificial root ur of the pattern graph. The root-attached query graph is called the refined query graph, and its additional node is the virtual root. Given a pattern graph with an artificial root ur, if the refined query graph contains ur, then the virtual root is a nonempty node identical to ur; otherwise, the virtual root is an empty node. Considering the refined query graphs q1 and q2 in Figure 8(a), node u1 of q1 is a real root node of the pattern graph in Figure 9, and node ur of q2 is a virtual root node that is isotopic to node u1. To ensure the reachability of q2, the virtual edges are constructed from u2 to the nearest nodes u1 and u7 on the reachable paths from u1 in Figure 9.

The dominating relationships in a pattern graph still hold in the refined query graph, as stated in Theorem 3.

Theorem 3 (feasibility of dominating relationship). Given a pattern graph and a query graph, if there exist nodes u and u′ belonging to both graphs and satisfying u ≺ u′ in the pattern graph, then the dominating relationship u ≺ u′ still holds in q.

Proof. (for Theorem 3). We prove the feasibility of the dominating relationship by contradiction. Consider a pattern graph and a query graph with nodes u and u′ belonging to both graphs, such that u ≺ u′ holds in the pattern graph but does not hold in q. Then, there must exist another node u″ that is reachable on a path from the virtual root to u′ that bypasses u. Since the virtual root is isotopic to ur of the pattern graph, u″ must also be reachable on a path from ur to u′ in the pattern graph. However, u then cannot dominate u′ in the pattern graph, because two distinct nodes u and u″ are reachable on paths from the root to u′, which contradicts the assumption.
Regarding the refined query graphs q1 and q2 in Figure 8, the query graphs inherit the dominating relationships of the pattern graph, because for any query graph a query root node (real or virtual) that is isotopic to the root node of the pattern graph can always be found.
Benefiting from the feasibility of the dominating relationship in Theorem 3, combining the computational paradigm of subgraph matching with the concept of DCPG yields several useful characteristics for accelerating subgraph matching on an RDF graph. The smallest computational unit of subgraph matching is a data-query vertex pair (node pair for short), which describes a mapping from a query vertex u to a data vertex. A solution of subgraph matching is defined as a subgraph mapping, that is, a set of node pairs with one pair for each query vertex. A DCPG-based characteristic is given in Lemma 2.

Lemma 2. Given the node pairs of u and u′ in a subgraph mapping M, if there exists a virtual node pair for the virtual root and u ≺ u′ holds on the query graph, then M(u) ≺ M(u′) holds on the data graph.

Proof. (for Lemma 2). A subgraph mapping of q on the data graph must satisfy the edge constraint of subgraph isomorphism in Definition 2: ∀u, u′ ∈ Vq, ∃(u, u′) ∈ Eq: (M(u), M(u′)) ∈ E. Thus, for the node pairs of u and u′ in M, if the dominating relationship u ≺ u′ holds, M(u) must dominate M(u′) according to the transitivity of the edge constraint.
Further, an interesting characteristic can be derived from positive and negative node pairs. A node pair is called a positive node pair if and only if it satisfies the candidate verification given in Definition 8.

Definition 8 (candidate verification). A node pair is positive if and only if it satisfies the following constraints: (1) , (2) ∀u′ ∈ N(u), , and (3) ∀(u, u′) ∈ Eq,
where denote the labeling function of vertices and edges, respectively, represents the vertex-label set coupled with vertex u and LE (u, u′) represents the edge-label set coupled with edge (u, u′).
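A check of this kind can be sketched in Python. The sketch assumes a label-containment reading of the three constraints (vertex labels, neighbour labels, edge labels); the exact constraints of Definition 8, the dictionary layout, and the name `candidate_valid` are assumptions for illustration.

```python
def candidate_valid(u, v, query, data):
    """Return True if data vertex v is an admissible candidate for query vertex u.

    ASSUMED graph layout (hypothetical, chosen for this sketch):
      graph["vlabels"]: dict  vertex -> set of vertex labels
      graph["edges"]:   dict  (src, dst) -> set of edge labels
    """
    q_labels, d_labels = query["vlabels"], data["vlabels"]
    # Vertex constraint: every label of u must also appear on v.
    if not q_labels[u] <= d_labels.get(v, set()):
        return False
    for (a, b), q_edge_labels in query["edges"].items():
        if u not in (a, b):
            continue
        other = b if a == u else a
        # Neighbour and edge constraints: some data edge at v, in the same
        # direction, must carry the query edge labels and reach a vertex
        # that carries the labels of the query neighbour.
        found = False
        for (x, y), d_edge_labels in data["edges"].items():
            end, far = (x, y) if a == u else (y, x)
            if (end == v and q_edge_labels <= d_edge_labels
                    and q_labels[other] <= d_labels.get(far, set())):
                found = True
                break
        if not found:
            return False
    return True


query = {"vlabels": {"s": {"Student"}, "p": {"Professor"}},
         "edges": {("s", "p"): {"advisorBy"}}}
data = {"vlabels": {1: {"Professor", "Person"}, 2: {"Student"}, 3: {"Student"}},
        "edges": {(2, 1): {"advisorBy"}}}
```

In this example vertex 3 carries the right label but has no `advisorBy` edge, so the pair for it is negative and can be pruned before matching starts.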

Lemma 3. If a node pair is negative, the node pairs dominated by it cannot produce any subgraph mapping that contains it.

Proof. (for Lemma 3). Consider a negative node pair. It does not satisfy the constraints of subgraph isomorphism in Definition 2, so no subgraph mapping M can map u to the data vertex of that pair. Consequently, the node pairs dominated by it cannot produce any subgraph mapping containing it.
Therefore, we employ a circular-pattern first matching order to guide the iterative processing of subgraph matching. For the refined query graph q1 containing the real root u1 in Figure 8(a), the nodes of the circular-pattern (u15) are ordered first in the process of subgraph matching. For the refined query graph q2 containing the virtual root ur in Figure 8(b), the nodes of the circular-patterns (u9) and (u3) are ordered first. Since q2 contains two circular-patterns, a priority must be selected between (u9) and (u3). The priority is determined by the density of a circular-pattern, defined as the ratio of the number of edges to the number of vertices in the circular-pattern. The densities of (u9) and (u3) are 1 and 2/3, respectively, and (u3) is executed before (u9). Note that, for a query graph coupled with a virtual root such as q2, the calculation of density does not consider the virtual root and the virtual edges.
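The density of a circular-pattern can be computed with a few lines of Python. The sketch below only computes the edge-to-vertex ratio while ignoring a virtual root and its edges; the function name `density` and the two toy patterns are hypothetical and are not the patterns of Figure 8.

```python
from fractions import Fraction


def density(vertices, edges, virtual=frozenset()):
    """Ratio of real edges to real vertices in a circular-pattern.

    Virtual nodes and every edge touching them are ignored.
    """
    real_vertices = set(vertices) - set(virtual)
    real_edges = [(a, b) for a, b in edges
                  if a in real_vertices and b in real_vertices]
    return Fraction(len(real_edges), len(real_vertices))


# A plain cycle of three real vertices.
cycle = density({"a", "b", "c"}, [("a", "b"), ("b", "c"), ("c", "a")])
# A cycle closed only through a virtual root r: two real edges remain.
with_virtual = density({"r", "x", "y", "z"},
                       [("r", "x"), ("x", "y"), ("y", "z"), ("r", "z")],
                       virtual={"r"})
```

Exact fractions avoid floating-point ties when several circular-patterns have to be ranked.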

4.2. Physical Storage

We design the physical storage of patterned RDF graph to accelerate the acquisition of triple-based RDF data. Before the introduction of our physical storage, a dictionary encoding mapping table is first designed to encode the strings of RDF triples as integers.

The dictionary encoding mapping table consists of a data dictionary and a semantic dictionary. The data dictionary maps the instance-labels of RDF triples to unique integers, and the semantic dictionary maps the predicates, type-labels, and attribute-labels of RDF triples to unique integers.

Considering the RDF graph in Figure 1, the semantic and data dictionaries are illustrated in Table 2. Both dictionaries consist of two parts: the first part records the unique integer codes, and the second part records the instance-label, predicate, type-label, or attribute-label (e.g., Person_A and advisorBy) corresponding to each integer code. The v-id table is the data dictionary, where each line holds a unique integer and an instance-label. The semantic dictionary is shown in the p/type/attribute table, where each line holds a unique integer and a predicate, a type-label, or an attribute-label.

Regarding the divided RDF graph in Figures 7(b) and 7(c), our physical storage is displayed in Figure 10, which employs the structural layout of a hash mapping table. The Key is formed as [v-id|p/type|dir], and the Value is assigned as p/type, where v-id is a unique integer code, p/type represents a predicate, a type-label, or an attribute-label, and dir indicates the direction of the RDF triple on the graph. For example, a key [1|0|0] on server 1 refers to the labels of the in-edges to the vertex numbered 1, that is, the predicate advisorBy marked as 1. A key [1|8|1] indicates the type-label of the vertex numbered 1, which is the label FullProfessor marked as 10. A key [0|8|0] denotes the vertices coupled with an in-edge label rdf:type marked as 8, which are Person_A, Person_B, Course_A, and Course_B, marked as 1, 2, 6, and 7, respectively.
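The key layout can be illustrated with a small in-memory sketch. It simplifies the design by using a single dictionary for all strings instead of separate data and semantic dictionaries, and it covers only per-vertex keys; the class name `TripleStore`, its methods, and the integer codes it produces are assumptions and do not reproduce Table 2 or Figure 10.

```python
from collections import defaultdict

IN, OUT = 0, 1  # direction flags; 0 marks in-edges, as in the key [1|0|0]


class TripleStore:
    """Toy hash table keyed by (v_id, p_id, dir); hypothetical layout."""

    def __init__(self):
        self.ids = {}                   # string -> integer code
        self.table = defaultdict(list)  # (v_id, p_id, dir) -> list of codes

    def encode(self, label):
        # Codes start at 1 so that 0 can act as a wildcard slot in keys.
        return self.ids.setdefault(label, len(self.ids) + 1)

    def add(self, s, p, o):
        s_id, p_id, o_id = self.encode(s), self.encode(p), self.encode(o)
        self.table[(s_id, p_id, OUT)].append(o_id)  # neighbours via out-edge
        self.table[(o_id, p_id, IN)].append(s_id)   # neighbours via in-edge
        # Wildcard keys list the predicates seen on each side of a vertex.
        for key in ((s_id, 0, OUT), (o_id, 0, IN)):
            if p_id not in self.table[key]:
                self.table[key].append(p_id)


store = TripleStore()
store.add("Person_B", "advisorBy", "Person_A")
person_a = store.ids["Person_A"]
advisor_by = store.ids["advisorBy"]
```

Looking up `(person_a, 0, IN)` returns the predicate codes on the in-edges of Person_A, which corresponds to the first key form described above.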

4.3. Subgraph Matching Algorithm

The subgraph matching of a k-partitioned RDF graph is described in Algorithm 4. The input is a query graph q with root ur, and the outputs are the subgraph mappings ℳ of q on the RDF graph.

Input: a query graph
Output: the subgraph mappings ℳ of q on
(1)If i =  then;
(2)Output M ⟶ ℳ;
(3)Else
(4)If i = −1 then;
(5)  Continue to ui.successor;
(6)Else
(7)  Foreach ∈ M(ui, ) do
(8)   If candidateValid(, ui) then
(9)     ⟶ M;
(10)    subMatching(q(ui.successor));
(11)  subMatching(q(ui.precursor));

The subgraph matching algorithm starts from the root of the query graph and runs until all subgraph mappings are produced. The number i counts the positive node pairs, and a subgraph mapping M is output to ℳ once i equals the number of query vertices (Lines 1-2). A circular-pattern first matching order, which benefits from Lemmas 2 and 3, guides the iterative processing of subgraph matching. An isotopic virtual root is attached to the original query graph and assigned the initial number i = −1 (Line 4). Since it is a virtual node, its successors are expanded to explore the real node pairs (Line 5).

An iterative process inserts the positive node pairs into the mapping (Lines 7–11). For a selected query vertex ui, the node pairs of ui are first extracted from its candidates (Line 7). If a node pair is positive and satisfies the partial subgraph isomorphism, it is added to the current mapping M and the successor of ui is expanded to explore further real node pairs. Otherwise, the other node pairs are verified in turn by partial subgraph isomorphism and candidate verification (Lines 8–10). If all node pairs of a query vertex ui are negative or fail the partial subgraph isomorphism, the algorithm backtracks to the precursor of ui and repeats the extension until all subgraph mappings are output to ℳ (Line 11).
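The extend-and-backtrack loop can be sketched generically in Python. This is a plain backtracking matcher over a fixed vertex order with precomputed candidate lists, not a reproduction of Algorithm 4: the virtual root, the circular-pattern first order, and the distributed setting are omitted, and all names are hypothetical.

```python
def consistent(u, v, mapping, q_edges, d_edges):
    """Partial isomorphism test: every query edge between u and an already
    matched vertex must have a corresponding data edge."""
    for a, b in q_edges:
        if a == u and b in mapping and (v, mapping[b]) not in d_edges:
            return False
        if b == u and a in mapping and (mapping[a], v) not in d_edges:
            return False
    return True


def sub_matching(order, candidates, q_edges, d_edges):
    """Enumerate all mappings of the query vertices in `order`."""
    results, mapping = [], {}

    def extend(i):
        if i == len(order):          # every query vertex is matched
            results.append(dict(mapping))
            return
        u = order[i]
        for v in candidates[u]:
            if v in mapping.values():
                continue             # keep the mapping injective
            if consistent(u, v, mapping, q_edges, d_edges):
                mapping[u] = v
                extend(i + 1)        # move on to the next query vertex
                del mapping[u]       # backtrack and try the next candidate

    extend(0)
    return results


q_edges = {("x", "y"), ("y", "z")}
d_edges = {(1, 2), (2, 3), (1, 4)}
candidates = {"x": [1], "y": [2, 4], "z": [3]}
matches = sub_matching(["x", "y", "z"], candidates, q_edges, d_edges)
```

In the example, the branch that maps y to vertex 4 is abandoned when z cannot be attached, which is the backtracking step described above.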

4.3.1. Example for Subgraph Matching Algorithm

In the subgraph matching algorithm, we employ a circular-pattern first matching order to guide the iterative processing of subgraph matching. An ordered query graph is shown in Figure 11, where the circles filled with left-diagonal lines denote the first executed region, the circles filled with right-diagonal lines indicate the second executed region, and the unfilled circles refer to the final executed region. The filled circles are included in circular-pattern subgraphs, and u7 is the regional juncture. Then, the subgraph matching is iteratively conducted by our circular-pattern first matching order.

5. Experimental Evaluation

In this section, we verify the effectiveness and scalability of our algorithms on synthetic and real datasets, and we mainly compare against current memory-based distributed SPARQL query processing strategies.

5.1. Experimental Settings

All experiments are conducted on a distributed cluster of six identical computing nodes. Each computing node uses an Intel(R) Core(TM) [email protected] GHz 8-core processor, and the nodes communicate over 1000 Mbps Ethernet. The physical memory is 16 GB, and the hard disk size is 1 TB.

The experimental evaluation employs four generated scales of the synthetic dataset LUBM (Lehigh University BenchMark) and the real dataset YAGO2 (Yet Another Great Ontology 2). The related information of the datasets is shown in Table 3, where #T, #S, #O, and #P represent the numbers of triples, subjects, objects, and different predicates, respectively.

5.1.1. Datasets

The synthetic dataset LUBM [34], developed by Lehigh University, is a standard and systematic semantic Web repository evaluation benchmark for a university ontology. This benchmark aims to evaluate the extended queries of a single real ontology on a large dataset. The two datasets of different sizes are generated from the data generator UBA 1.7 (http://swat.cse.lehigh.edu/projects/lubm). YAGO2 (http://yago-knowledge.org) is a linked data knowledge base that mainly integrates data from three sources: Wikipedia, WordNet, and GeoNames. It contains 120 million triples and more than 10 million entities (such as individuals, organizations, and cities).

5.2. Analysis of Experimental Results

We compare the query performance of our DP-SM algorithm with TriAD [4] and Wukong [5]. The queries are run on six computing nodes (including a master node) and evaluated on the LUBM-2560 dataset. A benchmark that is used in the research of many distributed RDF systems and is published in [12] generates the different scales of query graphs.

Experiments are evaluated on two groups of query graphs, illustrated in Figure 12. The first group of query graphs L1, L2, and L3 corresponds to Q1, Q3, and Q7, respectively, in [19]. On this group, our DP-SM algorithm is 1.4–2.2 times faster than the Wukong algorithm. The final results of L2 are actually empty. Even though a large number of predicate relationships are mapped to the query graph L2, the verified candidates of its query nodes are empty. The candidate verification of our algorithm finds the candidates of all query nodes before the matching process is executed, while Wukong and TriAD need to find the candidates through a time-consuming traversal of large intermediate results. The final results of L1 and L3 contain 65,000 and 1,000 data subgraphs, respectively. The experiments verify that the graph-based exploration method is nearly one order of magnitude faster than the relationship-based joining model. A circular-pattern first matching order guides the iterative processing of subgraph matching in our algorithm, which prunes the redundant intermediate results in advance. Thus, our algorithm obtains a greater improvement in matching performance.

The second group of query graphs L4, L5, and L6 corresponds to the extended Q2, Q1, and Q7 in [19], which employ more complex and denser topological structures than L1, L2, and L3. The query graph L4 has a noncircular topological structure, and its intermediate results are larger without the verification of partial subgraph isomorphism. Thus, our DP-SM algorithm achieves only a small improvement over TriAD. Compared with Wukong, the improved matching performance of our algorithm benefits from the strategy of postponing the cluster-connected calculations of Cartesian products. The query graph L6 contains denser circular topological structures than L5, and the circular-pattern first matching strategy speeds up the acquisition of subgraph results.

The average matching time on the YAGO2 dataset is evaluated in Figure 13, where the simple and complex query graphs are those defined in [8]. The matching performance of the complex query graphs Y4, Y5, and Y6 is similar to that of the complex query graphs on the LUBM dataset, as illustrated in Figure 13(b). Our DP-SM algorithm is 1.5–2.5 times faster than Wukong and TriAD. The difference is that the simple query graphs Y1, Y2, and Y3 are matched faster than the simple ones on the LUBM datasets, because the nodes of these simple query graphs are restricted by constant values, which leads to a smaller search space of intermediate results, as shown in Figure 13(a). Compared with Wukong and SDSM, our algorithm spends additional time on orchestrating the matching order, but this cost is negligible relative to the overall running time of the matching algorithms.

5.3. Experimental Scalability

The scalability of algorithms is evaluated based on the number of machines and the size of dataset.

The scalability with respect to the number of machines is evaluated in Figure 14, where the number of machines is gradually increased from 2 to 6. The experimental results show that the matching times of query graphs L1, L3, L4, L5, and L6 gradually improve as the number of machines increases. This trend shows that our DP-SM algorithm can effectively produce the subgraph results in distributed environments. Since the candidates of L2 are verified as empty in the preceding candidate verification, its matching time remains constant. For the complex queries L4, L5, and L6, the reductions in matching time are slightly smaller than those of L1 and L3, because query graphs crossing multiple partitioned pattern subgraphs increase the transmission time on the partitioned RDF graphs.

The scalability with respect to the size of the LUBM dataset is evaluated in Figure 15, where the number of machines is fixed at 6. Different scales of the LUBM dataset, ranging from 5.3 M to 346 M, are generated to evaluate the matching times of the algorithms. The matching time of our DP-SM algorithm maintains nearly linear growth for query graphs without complex topological structures. Our algorithm employs a circular-pattern first matching strategy to prune the redundant RDF data in advance and to postpone the subgraph-connected calculation of Cartesian products. The partial intermediate results are then linked without a large matching time on noncircular pattern subgraphs.

6. Conclusions

In this paper, we propose a dominance-partitioned subgraph matching on a large RDF graph. Firstly, a dominance-connected pattern graph is extracted from a pattern graph to construct a dominance-partitioned pattern hypergraph, which divides a pattern graph as multiple fish-shaped pattern subgraphs. Secondly, a dominance-driven spectrum clustering strategy is used to gather the pattern subgraphs into multiple clusters. Thirdly, a dominance-partitioned subgraph matching algorithm is designed to conduct all isomorphic subgraphs on a cluster-partitioned RDF graph. Finally, experimental evaluation verifies that our strategy has higher time-efficiency of complex queries, and it has better scalability on multiple machines and different data scales.

Data Availability

The LUBM data used to support the findings of this study have been deposited in the web repository (http://swat.cse.lehigh.edu/projects/lubm). The YAGO2 data used to support the findings of this study have been deposited in the web repository (http://yago-knowledge.org).

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant no. 61976032).