Abstract

Graph pattern matching is widely used in big data applications. However, real-world graphs are usually huge and dynamic, and a small change in the data graph or pattern graph can incur a high computing cost. Incremental graph matching algorithms avoid recomputing on the whole graph and reduce the computing cost when the data graph or the pattern graph is updated. The existing incremental algorithm PGC_IncGPM can effectively reduce matching time when no more than half of the edges of the pattern graph are updated. However, as the number of changed edges increases, the improvement of PGC_IncGPM gradually decreases. To solve this problem, an improved algorithm, iDeltaP_IncGPM, is developed in this paper. For multiple insertions (resp., deletions) on pattern graphs, iDeltaP_IncGPM determines the nodes’ matching state detection sequence and processes them together. Experimental results show that iDeltaP_IncGPM has higher efficiency and a wider application range than PGC_IncGPM.

1. Introduction

Graph pattern matching finds all the subgraphs of a data graph G that are the same as or similar to a given pattern graph P. It is widely used in a number of applications, for example, web document classification, software plagiarism detection, and protein structure detection [1–3].

With the rapid development of the Internet, huge amounts of graph data emerge every day. For example, the Linked Open Data Project, which aims to connect data across the Web, had published 149 billion triples by 2017 [4]. In addition, real-world graphs are dynamic [5]. It is often cost-prohibitive to recompute matches from scratch when G or P is updated. An incremental matching algorithm is needed, which aims to minimize unnecessary recomputation by analyzing and computing the changes of the matching result in response to updates ΔG (resp., ΔP) to G (resp., P).

For example, Figure 1(a) is a pattern graph P and Figure 1(b) is a data graph G. The subgraph composed of the matched nodes in Figure 1(b) and the edges between them is the only matching subgraph. Assuming that two edges are removed from the pattern graph, the traditional recomputing algorithm will compute the matches for the new pattern graph on the whole data graph, which is time consuming. The incremental algorithm checks only a part of the nodes in G and adds the new matching subgraphs to the original matching result.

At present, the study of incremental graph pattern matching is still in its infancy and existing work [6–12] mainly focuses on the updates of data graphs. In our previous study, we proposed an incremental graph matching algorithm named PGC_IncGPM, which can be used in scenarios where data graphs are constant and pattern graphs are updated [13]. PGC_IncGPM can effectively reduce the runtime of graph matching as long as the number of changed edges is less than the number of unchanged edges in P. However, the improvement effect of PGC_IncGPM gradually decreases as the number of changed edges increases. In this paper, the bottleneck of PGC_IncGPM is further analyzed. An optimization method for the nodes’ matching state detection sequence is proposed, and a more efficient algorithm called iDeltaP_IncGPM is designed and implemented.

Using Figure 1 as an example, suppose (B, E) and (C, D) are deleted from the pattern graph. PGC_IncGPM algorithm will first consider the deletion of (B, E), that is, checking B2, A2, B3, and A3, and then consider the deletion of (C, D), that is, checking C2, B2, A2, C3, B3, and A3. Thus B2, A2, B3, and A3 are all checked twice. iDeltaP_IncGPM considers the two deletions together; C2, B2, A2, C3, B3, and A3 are all checked only once.

The remainder of this paper is organized as follows. In Section 2, related work is reviewed. The model and definitions are described in Section 3. In Section 4, our algorithm is presented. Section 5 reports the experimental results and comparison, and Section 6 presents the conclusion.

2. Related Work

We surveyed related work in two categories: graph pattern matching models and incremental algorithms for graph matching on massive graphs.

Graph pattern matching is typically defined in terms of subgraph isomorphism [14, 15]. However, subgraph isomorphism is an NP-complete problem [16]. In addition, subgraph isomorphism is often too restrictive because it requires that the matching subgraphs have exactly the same topology as the pattern graph. These limitations hinder its applicability in emerging applications such as social networks and crime detection. Thus, graph simulation [17] and its extensions [18–22] are adopted for pattern matching. Graph simulation preserves the labels and the child relationship of a graph pattern in its match. In practical applications, graph simulation is so loose that it may produce a large number of useless matches, which can bury useful information. Dual simulation [18] enhances graph simulation by imposing an additional condition, to preserve both child and parent relationships (downward and upward mappings). Because dual simulation offers a good balance between response time and effectiveness and has high practical value, graph pattern matching is defined as dual simulation in this paper.

At present, the study of incremental graph pattern matching is still in its infancy; existing work [6–12] mainly focuses on the updates of data graphs. Fan et al. proposed the incremental graph simulation algorithm IncMatch [6, 7]. Sun et al. studied the Maximal Clique Enumeration problem on dynamic graphs [8]. Stotz et al. studied the incremental inexact subgraph isomorphism problem [9]. Wang and Chen proposed an incremental approximate graph matching algorithm, which transforms the approximate subgraph search into vector space relation detection [10]. When an edge is inserted into or deleted from the data graph, the vectors of the relevant nodes are modified, and whether the new vectors still contain the vector of the pattern graph is rechecked. Choudhury et al. developed StreamWorks, a fast matching system for dynamic graphs [11]. The system can detect suspicious pattern graphs in real time and give early warning of high-risk data transfer modes on constantly updated network graphs. Semertzidis and Pitoura proposed an approach to find the most durable matches of an input graph pattern on graphs that evolve over time [12]. In [13], an incremental graph matching algorithm was proposed for updates of pattern graphs.

In the big data era [23], graph computing is widely used in different fields such as social networks [24], sensor networks [25, 26], the Internet of Things [27, 28], and cellular networks [29]. Therefore, there is an urgent demand for improving the performance of big graph processing, especially graph pattern matching.

3. Model and Definition

For graph pattern matching, pattern graphs and data graphs are directed graphs with labels. Each node in a graph has a label, which defines the attribute of the node (such as keywords, skills, class, name, and company).

Definition 1 (graph). A node-labeled directed graph (or simply a graph) is defined as G = (V, E, L), where V is a finite set of nodes, E ⊆ V × V is a finite set of edges, and L is a function that maps each node u in V to a label L(u); that is, L(u) is the attribute of u.

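Definition 2 (graph pattern matching). Given a pattern graph P = (V_P, E_P, L_P) and a data graph G = (V, E, L), G matches P if there is a binary relation M ⊆ V_P × V such that (1) if (u, v) ∈ M, then L_P(u) = L(v); (2) for each u ∈ V_P, there exists a node v in G such that (u, v) ∈ M and (a) for each (u, u′) ∈ E_P, there exists an edge (v, v′) ∈ E such that (u′, v′) ∈ M; (b) for each (u″, u) ∈ E_P, there exists an edge (v″, v) ∈ E such that (u″, v″) ∈ M.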

Condition (2)(a) ensures that the matching node v keeps the child relationship of u; condition (2)(b) ensures that v maintains the parent relationship of u.

For any P and G, there exists a unique maximum matching relation M(P, G). Graph pattern matching is to find M(P, G), and the result graph G_r is a subgraph of G that can represent M(P, G).
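The maximum relation of Definition 2 can be computed as a greatest fixpoint: start from all label-compatible pairs and repeatedly discard pairs that violate condition (2)(a) or (2)(b). The following Python function is a minimal sketch of that idea, not the GPMS algorithm of this paper; the function name `dual_simulation` and the dictionary-based graph encoding are assumptions made for illustration.

```python
from collections import defaultdict


def dual_simulation(p_nodes, p_edges, g_nodes, g_edges):
    """Return the maximum dual-simulation relation as {u: set of v}.

    p_nodes / g_nodes map a node id to its label; p_edges / g_edges are
    collections of (source, target) pairs. Returns {} if some pattern
    node has no match. (Hypothetical helper, for illustration only.)
    """
    g_children, g_parents = defaultdict(set), defaultdict(set)
    for a, b in g_edges:
        g_children[a].add(b)
        g_parents[b].add(a)

    # Condition (1): start from all label-compatible pairs.
    match = {u: {v for v, lab in g_nodes.items() if lab == p_nodes[u]}
             for u in p_nodes}

    changed = True
    while changed:
        changed = False
        for u, u_child in p_edges:
            # Condition (2)(a): v needs a child that matches u_child.
            keep = {v for v in match[u] if g_children[v] & match[u_child]}
            if keep != match[u]:
                match[u], changed = keep, True
            # Condition (2)(b): a match of u_child needs a parent matching u.
            keep = {v for v in match[u_child] if g_parents[v] & match[u]}
            if keep != match[u_child]:
                match[u_child], changed = keep, True

    if any(not vs for vs in match.values()):
        return {}
    return match
```

Because pairs are only ever removed, the loop terminates, and the surviving relation is the unique maximum one.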

Consider a real-life example in which a recruiter wants to find a professional software development team from a social network. Figure 2(a) is the basic organization graph of a software development team. The team consists of staff with the following identities: project manager (PM), database engineer (DB), software architect (SA), business process analyst (BA), user interface designer (UD), software developer (SD), and software tester (ST). Each node in the graph represents a person, and the label of a node denotes the identity of the person. The edge from node A to node B means that B works well under the supervision of A. A social network is shown in Figure 2(b). In this example, M(P, G) is {(DB, DB1), (PM, PM1), (SA, SA1), (BA, BA1), (UD, UD1), (SD, SD1), (SD, SD2), (ST, ST1), (ST, ST2)}. Because BA2 does not have a child matching UD and SA2 does not have a parent matching DB, PM2 does not keep the child relationship of PM. For the same reason, SD3 (resp., ST3) does not match SD (resp., ST).

Definition 3 (incremental graph pattern matching for pattern graph changing). Given a data graph G and a pattern graph P, the matching result in G for P is M(P, G). Assuming that P changes by ΔP, the new pattern graph is expressed as P ⊕ ΔP. As opposed to batch algorithms that recompute matches from scratch, an incremental graph matching algorithm aims to find the changes ΔM to M(P, G) in response to ΔP such that M(P ⊕ ΔP, G) = M(P, G) ⊕ ΔM.
When ΔP is small, ΔM is usually small as well, and it is much less costly to compute ΔM than to recompute the entire set of matches. In other words, this suggests that we compute matches once on the entire graph via a batch-matching algorithm and then incrementally identify new matches in response to ΔP without paying the cost of the high complexity of graph pattern matching.
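To make the ⊕ notation of Definition 3 concrete, the sketch below treats a pattern update as edge insertions and deletions and expresses the change of the matching result as a pair of added and removed (pattern node, data node) pairs. It only illustrates the notation: an incremental algorithm derives the change directly instead of diffing two full results. The function names are assumptions, not part of the paper.

```python
def apply_delta(p_edges, inserted=(), deleted=()):
    """Return the edge set of the updated pattern graph (P plus its update)."""
    return (set(p_edges) - set(deleted)) | set(inserted)


def match_delta(old_match, new_match):
    """Return (added, removed) so that new == (old - removed) | added.

    Both arguments map a pattern node to the set of data nodes matching it.
    """
    old_pairs = {(u, v) for u, vs in old_match.items() for v in vs}
    new_pairs = {(u, v) for u, vs in new_match.items() for v in vs}
    return new_pairs - old_pairs, old_pairs - new_pairs
```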
In order to get ΔM quickly, indexes can be prebuilt based on selected data features of graphs to reduce the search space during incremental matching. The more indexes there are, the shorter the time to get ΔM and the larger the space needed to store the indexes. For large-scale data graphs, both response time and storage cost need to be reduced. Considering the balance of storage cost and response time, in this paper, three kinds of sets generated in the process of graph matching are used as the index. (1) The first are the candidate matching sets cand(·); for each node u in P, cand(u) includes all the nodes in G which only have the same label as u. The nodes in cand(·) are called c-nodes. (2) The second are the child matching sets sim(·); for each node u in P, sim(u) includes all the nodes in G which preserve only the child relationship of u. The nodes in sim(·) are called s-nodes. (3) The third are the complete matching sets mat(·); for each node u in P, mat(u) includes all the nodes in G which preserve both the child and parent relationships of u. The nodes in mat(·) are called m-nodes.
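The three index sets can be read as successive refinements of the label-compatible nodes. The sketch below is one possible way to derive them, assuming that "keeps the child relationship" means membership in the maximum (child-only) graph simulation and that mat(·) is the maximum dual simulation; the function name `build_index` and the graph encoding are hypothetical, and the case where some pattern node has no match at all is not treated specially.

```python
from collections import defaultdict


def build_index(p_nodes, p_edges, g_nodes, g_edges):
    """Return (cand, sim, mat): three disjoint dicts {u: set of v}."""
    children, parents = defaultdict(set), defaultdict(set)
    for a, b in g_edges:
        children[a].add(b)
        parents[b].add(a)

    def refine(start, use_parents):
        # Greatest fixpoint: drop nodes until every pattern edge is respected.
        rel = {u: set(vs) for u, vs in start.items()}
        changed = True
        while changed:
            changed = False
            for u, uc in p_edges:
                keep = {v for v in rel[u] if children[v] & rel[uc]}
                if keep != rel[u]:
                    rel[u], changed = keep, True
                if use_parents:
                    keep = {v for v in rel[uc] if parents[v] & rel[u]}
                    if keep != rel[uc]:
                        rel[uc], changed = keep, True
        return rel

    same_label = {u: {v for v, lab in g_nodes.items() if lab == p_nodes[u]}
                  for u in p_nodes}
    child_ok = refine(same_label, use_parents=False)  # keep child relationship
    both_ok = refine(child_ok, use_parents=True)      # keep child and parent

    cand = {u: same_label[u] - child_ok[u] for u in p_nodes}  # c-nodes
    sim = {u: child_ok[u] - both_ok[u] for u in p_nodes}      # s-nodes
    mat = both_ok                                             # m-nodes
    return cand, sim, mat
```

Keeping the three sets disjoint means that a node changes state by moving between sets, which is how the incremental steps in Section 4 are phrased.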

The symbols used in this paper are listed in the Notations section.

4. iDeltaP_IncGPM Algorithm

In this section, we propose the improved incremental graph pattern matching algorithm for pattern graph changing (iDeltaP_IncGPM).

4.1. The Idea of PGC_IncGPM Algorithm

The basic framework of PGC_IncGPM [13] is shown in Figure 3.

The graph pattern matching algorithm (GPMS) is first performed on the entire data graph G for the pattern graph P. It computes the matching result graph G_r and creates the index needed for subsequent incremental matching. ΔP may include edge insertions (ΔP+) and edge deletions (ΔP−). The incremental graph pattern matching algorithm PGC_IncGPM first calls the subalgorithm AddEdges for ΔP+ to get an intermediate result graph and index and then calls the subalgorithm SubEdges for ΔP− to get the final result graph and index. The final result graph is the new matching result, and the final index is the new index that can be used for subsequent incremental matching if the pattern graph changes again.

Edge insertions (resp., edge deletions) in ΔP are processed one by one by AddEdges (resp., SubEdges). For example, when deleting multiple edges from P, the processing of PGC_IncGPM is as follows.

In the first step, the following operations are performed for each deleted edge (u, u′): for each v ∈ cand(u), whether v keeps the child relationship of u in P ⊕ ΔP is checked. If v keeps the child relationship of u, then v is moved from cand(u) to sim(u) and the parents of v in cand(·) are also processed.

In the second step, each node in sim(·) is repeatedly filtered according to its parents and children; the newly generated m-nodes are added to mat(·).

In the first step, when deleting (u, u′) from P, some nodes in cand(u) and cand(u_a) (u_a is an ancestor of u) may change from c-nodes to s-nodes. So when a c-node becomes an s-node, a bottom-up approach is used to find its parents and ancestors from cand(·). If (u1, u1′) and (u2, u2′) are deleted, and u1 and u2 have a common ancestor u_a, then cand(u_a) will be visited twice. In summary, PGC_IncGPM has a bottleneck for multiple deleted edges. The same problem exists for multiple inserted edges.

4.2. Optimization for Matching State Detection Sequence

Since PGC_IncGPM deals with edge insertions (resp., deletions) one by one, its efficiency gradually decreases as the number of changed edges increases. To overcome the bottleneck of PGC_IncGPM, multiple edge insertions (resp., deletions) should be considered together. In this paper, an optimization method for the nodes’ matching state detection sequence is proposed. The optimization can be applied to both insertions and deletions on P.

Taking SubEdges as an example, the optimization method is as follows.

First, analyze all edges deleted from P to determine which nodes’ candidate matching sets may change. If cand(u) may change, then u is added to the set filtorder−.

Secondly, filtorder− is sorted by the inverse topological sequence of P. There may be some strongly connected components in P. In this case, we first find all the strongly connected components in P and then contract each strongly connected component into a single node to get a directed acyclic graph and find the inverse topological sequence of that graph; finally, we replace each contracted node with the original node set of its strongly connected component. Thus, the approximate inverse topological sequence of P is obtained.
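The contract-sort-expand procedure described above can be sketched compactly with Tarjan's strongly connected components algorithm, which is a choice made here for illustration (the text does not name a specific algorithm). Tarjan's algorithm emits components in reverse topological order of the contracted graph, so concatenating the components as they are found already yields an approximate inverse topological sequence. The function name is hypothetical, and the recursive form assumes a small pattern graph.

```python
def approx_inverse_topological_order(nodes, edges):
    """Order nodes so that, apart from cycles, children precede parents."""
    children = {u: [] for u in nodes}
    for a, b in edges:
        children[a].append(b)

    index, low, on_stack = {}, {}, set()
    stack, order = [], []

    def visit(u):
        index[u] = low[u] = len(index)
        stack.append(u)
        on_stack.add(u)
        for w in children[u]:
            if w not in index:
                visit(w)
                low[u] = min(low[u], low[w])
            elif w in on_stack:
                low[u] = min(low[u], index[w])
        if low[u] == index[u]:       # u is the root of a component
            while True:
                w = stack.pop()
                on_stack.discard(w)
                order.append(w)      # members of one component stay adjacent
                if w == u:
                    break

    for u in nodes:
        if u not in index:
            visit(u)
    return order
```

Within a strongly connected component the relative order is arbitrary, which is why the resulting sequence is only approximate.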

Finally, for each u in filtorder−, cand(u) is processed in turn. Depending on whether there is a deleted edge from u, two different filtering methods are used: (1) if u has at least one out-edge to be deleted, then each node in cand(u) may now keep the child relationship of u, so whether each of them keeps the child relationship of u should be checked; (2) if u does not have an out-edge to be deleted, then only part of the nodes in cand(u) need to be checked. That is, a node in cand(u) will be checked only if it has at least one child which changes from a c-node to an s-node.

The above optimization reduces the number of times some candidate matching sets are visited.

4.3. iDeltaP_IncGPM Algorithm

Based on the optimization method proposed in Section 4.2, iDeltaP_IncGPM is proposed. It uses the optimized method for both multiple inserted edges and multiple deleted edges. The optimization algorithm for edge deletions is shown in Algorithm 1. In Algorithm 1, nodes− contains all the nodes that have an out-edge deleted. For a node u in P, if the changes of P may result in some nodes in cand(u) becoming s-nodes, then u ∈ filtorder−. filtorder− is sorted by the inverse topological sequence of P (lines (1)–(5)). If u has an out-edge removed, that is, u ∈ nodes−, then all the nodes in cand(u) need to be checked for whether they keep the child relationship of u (lines (7)–(12)). If u ∈ filtorder− and u is not in nodes−, then only part of the nodes in cand(u) are checked. That is, if v ∈ cand(u) has a child v′ and v′ has been moved from cand(u′) to sim(u′), where u′ is a child of u, then whether v becomes an s-node is checked (lines (14)–(20)).

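(1) nodes− := ∅; filtorder− := ∅;
(2) for each deleted edge (u, u′) ∈ ΔP− do
(3)  nodes− := nodes− ∪ {u};
(4)  filtorder− := filtorder− ∪ {u and all ancestor nodes of u};
(5)  sort filtorder− according to the inverse topological sequence of P;
(6) for each node u in filtorder− do
(7)  if u ∈ nodes− then
(8)   for each v ∈ cand(u) do
(9)    check if v keeps the child relationship of u;
(10)    if v keeps the child relationship of u then
(11)     cand(u) := cand(u) \ {v}; sim(u) := sim(u) ∪ {v};
(12)     record (u, v) as a newly added s-node;
(13)  else
(14)   for each v ∈ cand(u) do
(15)    if there exists v′ which is a child of v such that (u′, v′) is a newly added s-node, then
(16)    check if v keeps the child relationship of u;
(17)    if v keeps the child relationship of u then
(18)     cand(u) := cand(u) \ {v};
(19)     sim(u) := sim(u) ∪ {v};
(20)     record (u, v) as a newly added s-node;
(21) repeatedly filter sim(·) according to the parent and child relationships of nodes in the
   subgraph constructed by the s-nodes and m-nodes to get the added m-nodes and the updated result graph and index;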
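Lines (1)–(20) of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation: the function name, the argument layout, and in particular the test used for "keeps the child relationship" (every remaining out-edge of u must be covered by a child of v that is an s-node or m-node of the corresponding pattern child) are assumptions. Line (21), the final filtering into mat(·), is omitted.

```python
def sub_edges_filter(p_edges, deleted, order, g_children, cand, sim, mat):
    """Move c-nodes that now keep the child relationship from cand to sim.

    p_edges: edges of P before the update; deleted: edges removed from P.
    order: pattern nodes in (approximate) inverse topological order of P.
    g_children: {v: set of children of v in G}; cand/sim/mat: {u: set of v}.
    Returns the set of (u, v) pairs that became s-nodes; cand and sim are
    updated in place.
    """
    new_edges = set(p_edges) - set(deleted)
    p_children, p_parents = {}, {}
    for a, b in new_edges:
        p_children.setdefault(a, set()).add(b)
    for a, b in p_edges:
        p_parents.setdefault(b, set()).add(a)

    # Lines (1)-(5): sources of deleted edges plus all of their ancestors.
    sources = {a for a, _ in deleted}
    affected, todo = set(sources), list(sources)
    while todo:
        for parent in p_parents.get(todo.pop(), ()):
            if parent not in affected:
                affected.add(parent)
                todo.append(parent)
    filt_order = [u for u in order if u in affected]

    def keeps_child(v, u):
        # Assumed test: each remaining out-edge (u, uc) is covered by a
        # child of v that is an s-node or m-node of uc.
        kids = g_children.get(v, set())
        return all(kids & (sim[uc] | mat[uc]) for uc in p_children.get(u, ()))

    moved = set()
    for u in filt_order:                                  # line (6)
        for v in list(cand[u]):
            if u not in sources:                          # lines (13)-(15)
                kids = g_children.get(v, set())
                has_new_s_child = any(
                    (uc, vc) in moved
                    for uc in p_children.get(u, ()) for vc in kids)
                if not has_new_s_child:
                    continue
            if keeps_child(v, u):                         # lines (9)-(12), (16)-(20)
                cand[u].discard(v)
                sim[u].add(v)
                moved.add((u, v))
    return moved
```

Because children are processed before their parents, each affected candidate set is scanned once, which is the point of the ordering.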

Here we use an example to illustrate the implementation process of PGC_IncGPM and iDeltaP_IncGPM. The pattern graph is shown in Figure 4, assuming that (E, H), (G, I), and (C, G) are deleted from P.

The process of PGC_IncGPM is as follows. (1) The deletion of (E, H) is processed: each v in cand(E) is checked for whether it keeps the child relationship of E in P ⊕ ΔP. If v keeps the child relationship of E, then its parents found in cand(B) (resp., cand(C)) are checked. If these nodes keep the child relationship of B (resp., C), then they are moved to sim(B) (resp., sim(C)). After that, their parents found in cand(A) are checked. (2) The deletion of (G, I) is processed, and the nodes in cand(G), cand(C), cand(D), and cand(A) are checked in turn. (3) The deletion of (C, G) is processed, and the nodes in cand(C) and cand(A) are checked in turn. From the above steps, it can be seen that cand(C) and cand(A) are visited three times, while cand(G), cand(D), cand(E), and cand(B) are visited once.

The process of iDeltaP_IncGPM is as follows: because of the deletion of (E, H), (G, I), and (C, G), some nodes in cand(E), cand(G), cand(C), cand(B), cand(D), and cand(A) may become s-nodes. The nodes in cand(·) are checked in the order {G, E, D, C, B, A}. That is, the nodes in cand(G) are checked first, and the nodes in cand(A) are checked last. E, G, and C all have out-edges deleted, so all the nodes in their candidate matching sets are checked. The nodes in cand(B), cand(D), and cand(A) are checked only if they have a child changing from a c-node to an s-node. Therefore, cand(C), cand(D), cand(A), cand(G), cand(E), and cand(B) are each visited only once. In other words, the optimized scheme reduces the number of visits to cand(·).
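The visit counts in this example can be reproduced with a few lines of Python. The edge list below is hypothetical: Figure 4 is not reproduced here, so the edges were chosen only to agree with the parent and ancestor relations stated in the text. The counts are upper bounds, since a parent set is only visited in practice when one of its children actually changes state.

```python
from collections import Counter


def ancestors(node, edges):
    """Return all ancestors of `node` in the directed graph given by `edges`."""
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)
    seen, todo = set(), [node]
    while todo:
        for p in parents.get(todo.pop(), ()):
            if p not in seen:
                seen.add(p)
                todo.append(p)
    return seen


def visits_one_by_one(edges, deleted):
    """Count cand(.) visits when each deleted edge is handled separately."""
    counts = Counter()
    for source, _ in deleted:
        counts.update({source} | ancestors(source, edges))
    return counts


def visits_batched(edges, deleted):
    """Count cand(.) visits when all deleted edges are handled together."""
    affected = set()
    for source, _ in deleted:
        affected |= {source} | ancestors(source, edges)
    return Counter(affected)


# Hypothetical edge list, chosen only to agree with the description of Figure 4.
P_EDGES = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "E"), ("C", "E"),
           ("C", "G"), ("D", "G"), ("E", "H"), ("G", "I")]
DELETED = [("E", "H"), ("G", "I"), ("C", "G")]
```

With these inputs, `visits_one_by_one(P_EDGES, DELETED)` counts three visits for A and C and one for each of B, D, E, and G, while `visits_batched(P_EDGES, DELETED)` counts one visit for each of the six sets.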

For multiple edges inserted into the pattern graph, a similar optimization method is adopted. nodes+ contains all the source nodes of inserted edges. If some nodes in sim(u) may become c-nodes because of edge insertions, then u is in filtorder+. filtorder+ is ordered by the reverse topological sequence of the pattern graph. nodes+ and filtorder+ are used to reduce the number of visits to sim(·) and mat(·).

5. Experiments and Results Analysis

The following experiments evaluate our proposed algorithm. Runtime is used as the key measure for assessing the algorithms. In addition, in order to show the effectiveness of incremental algorithms visually, the improvement ratio (IR) is introduced, which is the ratio of the runtime saved by an incremental matching algorithm to the runtime of the ReComputing algorithm. Two real data sets (Epinions and Slashdot [30]) are used for the experiments. The former is a trust network with 75879 nodes and 508837 edges. The latter is a social network with 82168 nodes and 948464 edges. In previous work, we experimented with normal-size and large-size pattern graphs, respectively, and the results show that the complexity and effectiveness of the incremental matching algorithm are not affected by the size of the pattern graph. Therefore, in this paper, by default, the number of nodes in P (|V_P|) is 9, and the original number of edges in P (|E_P|) is 8 (resp., 16) for insertions (resp., deletions) and 9 for both insertions and deletions.
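Read literally, the improvement ratio is (T_ReComputing − T_incremental) / T_ReComputing. A minimal sketch, with a hypothetical function name and made-up runtimes in the usage note:

```python
def improvement_ratio(t_recompute, t_incremental):
    """Fraction of the ReComputing runtime saved by an incremental algorithm."""
    if t_recompute <= 0:
        raise ValueError("t_recompute must be positive")
    return (t_recompute - t_incremental) / t_recompute
```

For instance, an incremental run of 6 s against a 10 s recomputation gives an IR of 0.4; the ratio becomes negative when the incremental algorithm is slower than recomputing.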

In order to evaluate the improvement of our proposed algorithm, iDeltaP_IncGPM, PGC_IncGPM, and ReComputing are all run on Epinions and Slashdot under different settings. Each experiment was performed 5 times with different pattern graphs, and the average results are reported here. The experimental results are shown in Figure 5. The histogram represents the runtime of each algorithm, and the line chart represents the improvement ratio of iDeltaP_IncGPM and PGC_IncGPM relative to ReComputing.

Figure 5(a) (resp., Figure 5(b)) shows the runtime of the three algorithms over Epinions (resp., Slashdot) for insertions on pattern graphs. The x-axis represents the number of insertions on P: “+2” means that two edges are inserted into P, “+4” means that four edges are inserted into P, and so on. The figure tells us the following: (a) when there are no more than 10 insertions, the runtime of PGC_IncGPM and iDeltaP_IncGPM is significantly shorter than that of ReComputing, and iDeltaP_IncGPM has the shortest runtime; (b) when there are 12 insertions (the newly inserted edges account for 60% of the edges in P ⊕ ΔP), the runtime of PGC_IncGPM is longer than that of ReComputing, while iDeltaP_IncGPM still has the shortest runtime; (c) the improvement ratio of iDeltaP_IncGPM and PGC_IncGPM decreases as the number of inserted edges increases, but the decrease of iDeltaP_IncGPM is smaller. The more edges are inserted, the larger the advantage of iDeltaP_IncGPM over PGC_IncGPM. When 12 edges are inserted into P, the IR of iDeltaP_IncGPM is 40% on average, and the IR of PGC_IncGPM is 33% on average. Therefore, iDeltaP_IncGPM is better than PGC_IncGPM. The reason is that PGC_IncGPM processes the inserted edges one by one, so its runtime grows almost linearly as the number of insertions increases. In contrast, iDeltaP_IncGPM integrates all the inserted edges, analyzes which matching sets are affected, and processes them in the appropriate order. This prevents some matching sets from being processed repeatedly, which shortens the running time.

Figure 5(c) (resp., Figure 5(d)) shows the runtime of the three algorithms over Epinions (resp., Slashdot) for deletions on the pattern graph. The x-axis represents the number of deletions on P: “−2” means that two edges are deleted from P, “−4” means that four edges are deleted from P, and so on. It can be seen that (a) when the number of deletions changes from 2 to 12, the runtime of all three algorithms increases, and iDeltaP_IncGPM always has the shortest runtime; (b) as the number of deletions increases, the IR of PGC_IncGPM decreases and the IR of iDeltaP_IncGPM slowly increases. For 12 deletions, the IR of PGC_IncGPM decreases to 7% on average, while the IR of iDeltaP_IncGPM increases to 78% on average. The reason is that as the number of deletions increases, the runtime of ReComputing increases dramatically, while the runtime of iDeltaP_IncGPM increases only a little. iDeltaP_IncGPM is better than PGC_IncGPM because it processes the deleted edges together, and its runtime does not increase linearly as the number of deleted edges increases.

Figure 5(e) (resp., Figure 5(f)) shows the runtime of the three algorithms over Epinions (resp., Slashdot) for both insertions and deletions on the pattern graph. The x-axis represents the number of insertions and deletions on P: “+2−2” means that two edges are inserted into P and two other edges are removed from P, and so on. As shown in the figure, iDeltaP_IncGPM always has a shorter runtime than the others.

In conclusion, iDeltaP_IncGPM effectively improves the efficiency of PGC_IncGPM through the optimization strategy. For the same ΔP, the runtime of iDeltaP_IncGPM is shorter, and as ΔP increases, the runtime increases less; the decrease of IR is also more moderate. Therefore, iDeltaP_IncGPM can be applied to larger changes of the pattern graph, and it has a wider range of applications.

6. Conclusion

In this paper, we analyze PGC_IncGPM to find its efficiency bottleneck and propose a more efficient incremental matching algorithm iDeltaP_IncGPM. Multiple insertions (resp., deletions) are considered together and the optimization method for nodes’ matching state detection sequence is used. Experimental results on real data sets show that iDeltaP_IncGPM has higher efficiency and wider application range than PGC_IncGPM.

Next, we will study distributed incremental graph matching algorithms. Real-life graphs grow rapidly in size, and hyper-massive data graphs cannot be stored centrally in one data center; they need to be distributed across multiple data centers. How to perform efficient incremental matching on distributed large graphs is well worth studying.

Notations

P/G: Pattern graph/data graph
u, u′: Nodes in P
v, v′: Nodes in G
ΔP: Changes of P
ΔG: Changes of G
P ⊕ ΔP: New pattern graph
cand(u): Nodes in G that have the same label as u but do not keep the child relationship of u
sim(u): Nodes in G that keep only the child relationship of u
mat(u): Nodes in G that keep the child and parent relationships of u
index: The sets including cand(·), sim(·), and mat(·)
: in such that
M(P, G): The maximum match in G for P
G_r: The result graph, a subgraph of G that represents M(P, G).

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.