Research Article  Open Access
Yifan Chen, Xiang Zhao, Chuan Xiao, Weiming Zhang, Jiuyang Tang, "Efficient and Scalable Graph Similarity Joins in MapReduce", The Scientific World Journal, vol. 2014, Article ID 749028, 11 pages, 2014. https://doi.org/10.1155/2014/749028
Efficient and Scalable Graph Similarity Joins in MapReduce
Abstract
Along with the emergence of massive graph-modeled data, it is of great importance to investigate graph similarity joins due to their wide applications for multiple purposes, including data cleaning and near-duplicate detection. This paper considers graph similarity joins with edit distance constraints, which return pairs of graphs such that their edit distances are no larger than a given threshold. Leveraging the MapReduce programming model, we propose MGSJoin, a scalable algorithm following the filtering-verification framework for efficient graph similarity joins. It relies on counting overlapping graph signatures for filtering out non-promising candidates. To cope with the potentially large number of key-value pairs in the filtering phase, spectral Bloom filters are introduced to reduce the number of key-value pairs. Furthermore, we integrate the multiway join strategy to boost the verification, where a MapReduce-based method is proposed for GED calculation. The superior efficiency and scalability of the proposed algorithms are demonstrated by extensive experimental results.
1. Introduction
As the most commonly used abstract data structure, graphs have been widely used for modeling data in the fields of bioinformatics, multimedia, social networking, and the like. As a consequence, much effort has been dedicated to various problems in managing and analyzing graph data, for example, frequent subgraph mining [1], structure search and indexing [2, 3], similarity search [4, 5], and so forth.
This paper focuses on graph similarity join, a basic operation for processing graph data. Given two graph object sets R and S and a distance threshold, a graph similarity join returns all the pairs of graph objects, respectively from R and S, whose distances are no larger than the threshold. Graph similarity join has a wide spectrum of applications, especially in the preprocessing of graph mining, for example, structural data cleaning and near-replicate structure detection.
The most widely applied measure for determining graph similarity is graph edit distance (GED) [6, 7]. Compared with alternative measures, GED has at least three advantages: (1) it allows changes in both vertices and edges; (2) it reflects the topological information of graphs; (3) it is a metric that can be applied to any type of graph. Consequently, we employ GED to quantify graph similarity in this paper. It has been shown that the exact computation of GED is NP-hard [8].
The state-of-the-art algorithm for graph similarity join is GSimJoin [9], which adopts the filtering-verification framework. In particular, signatures are generated for every graph with a path-based q-gram approach for count filtering (cf. Section 2.2). In the phase of verification, GED computation is invoked for candidate pairs via an A*-based algorithm. GSimJoin is an in-memory algorithm, whose performance and scalability are restricted by the available memory of a machine. The dataset size presented in its experimental study is limited to the thousands, since a graph similarity join in the worst case needs pairwise count filtering condition checks over all graph pairs and similarity computations thereafter. The era of big data calls for scalable algorithms to support large-scale data processing. This paper attempts to address these challenges on massive graph data.
MapReduce is a well-known programming framework that facilitates processing large-scale data in parallel [10]. MassJoin [11] is a MapReduce-based algorithm for similarity joins on strings. Nonetheless, there has been no existing distributed algorithm for graph similarity joins. Inspired by [11], this paper investigates graph similarity joins based on MapReduce.
We first propose MGSJoin, a MapReduce-based algorithm following the filtering-verification framework. It employs signatures of path-based q-grams as keys and the corresponding graphs as values, forming key-value pairs. Through filtering, the graph pairs whose common signatures are fewer than the threshold given by the count filtering condition are filtered out. The remaining pairs constitute the candidate set, which is sent to verification thereafter. Because the potentially large number of key-value pairs may incur a large communication cost, we incorporate the Bloom filter technique by adding the generated signatures to spectral Bloom filters. This effectively reduces the number of intermediate key-value pairs and, thus, the cost of network transmission, while the filtering capacity is mostly preserved. Furthermore, we employ multiway join to improve the verification phase by condensing two MapReduce rounds into one, while devising a MapReduce-based method to calculate GED.
To the best of our knowledge, this is among the first attempts to present a MapReduce-based graph similarity join algorithm. Our contributions can be summarized as follows.
(i) We redesign the current in-memory graph similarity join algorithm and adapt it to the MapReduce framework. The resulting baseline algorithm is capable of processing large-scale graph datasets.
(ii) We propose to use Bloom filters to reduce intermediate key-value pairs in the filtering phase while sacrificing little filtering capacity. Besides, we present a multiway-join-optimized verification strategy such that the number of required MapReduce rounds is reduced as well. Moreover, a MapReduce-based method is designed for GED calculation, which can handle the calculation for large graphs.
(iii) We implement the proposed algorithm MGSJoin and conduct a wide range of experiments on a real dataset. The results show that both the efficiency and the scalability of MGSJoin are superior to those of the current solutions.
This paper is organized as follows. In Section 2, the problem definition and background are provided. We propose the basic algorithm in Section 3, integrate Bloom filters in Section 4, and optimize the verification in Section 5. Section 6 presents the experimental results and analyses. Related work is described in Section 7, followed by the conclusion in Section 8.
2. Preliminaries
2.1. Problem Definition
In this paper, we focus on simple graphs, namely, undirected graphs without self-loops or multiple edges. A labeled graph can be represented as a quadruple g = (V, E, l_V, l_E), where V is a set of vertices and E ⊆ V × V is a set of edges. l_V and l_E are label functions that assign labels to vertices and edges, respectively. V(g) denotes the vertex set of g, and E(g) denotes its edge set. |V(g)| and |E(g)| represent the numbers of vertices and edges in g, respectively. l_V(u) denotes the label of vertex u, and l_E(e(u, v)) denotes the label of the edge between u and v, u, v ∈ V(g). Additionally, we use |g| to denote the size of graph g.
Definition 1 (graph pair). A graph pair is a tuple denoted by ⟨r, s⟩, where r ∈ R, s ∈ S, and R and S are two sets of graph objects.
Definition 2 (graph isomorphism). A graph g is isomorphic to another graph g′, denoted by g ≅ g′, if there exists a bijection f : V(g) → V(g′) such that (1) for every vertex u ∈ V(g), l_V(u) = l_V(f(u)); (2) for every edge e(u, v) ∈ E(g), e(f(u), f(v)) ∈ E(g′) and l_E(e(u, v)) = l_E(e(f(u), f(v))).
Definition 3 (graph edit operation). A graph edit operation is an edit operation to transform one graph to another. It can be one of the following six operations: (i) insert an isolated vertex into the graph; (ii) delete an isolated vertex from the graph; (iii) change the label of a vertex; (iv) insert an edge between two disconnected vertices; (v) delete an edge from the graph; (vi) change the label of an edge.
Definition 4 (graph edit distance). The graph edit distance (GED) between graphs r and s, denoted by GED(r, s), is the minimum number of edit operations that transform r to a graph isomorphic to s.
Example 5. Figure 1(a) illustrates the molecule named cyclopropanone, while Figure 1(b) shows another molecule which does not exist. When recording cyclopropanone into a database, errors may be made, and the molecule can become the one shown in Figure 1(b). Manual checking would be required to find such errors, which is very difficult. Seeing that the two molecules are very similar, we can adopt GED to measure the similarity, so that graph similarity join can be applied to resolve the problem. Take the two molecules as an example. First change the double bond to a single bond, and then relabel one of the adjacent atoms, through which a molecule isomorphic to the one in Figure 1(b) is obtained. So at least two edit operations are required; that is, the graph edit distance is 2. Given a threshold no smaller than 2, the two graphs are regarded as similar.
(a) Cyclopropanone
(b) “Cyclopropanone”
Problem 6 (graph similarity join). Given two sets of graph objects R and S and a distance threshold τ as input, a graph similarity join returns the result set { ⟨r, s⟩ | GED(r, s) ≤ τ, r ∈ R, s ∈ S }.
2.2. Count Filtering
Definition 7 (path-based q-gram [9]). A path-based q-gram in a graph is a simple path of length q. “Simple” means that there is no repeated vertex in the path.
The path-based q-grams of a graph g constitute the graph's signatures. Let Q(g) be graph g's signature set. We say a signature is common to g and g′ if it appears in both Q(g) and Q(g′). Note that there can be multiple q-grams that correspond to one particular signature.
Lemma 8 (count filtering [9]). Graphs r and s can satisfy the distance constraint GED(r, s) ≤ τ only if the number of common signatures for ⟨r, s⟩ is no less than max(|Q(r)| − τ · D_r, |Q(s)| − τ · D_s), where D_r (resp., D_s) is the maximum number of affected signatures in Q(r) (resp., Q(s)) when one edit operation is invoked on r (resp., s).
The count filtering condition check for a single pair takes time proportional to the numbers of signatures of the two graphs, which grow with their sizes and average degrees. A graph similarity join requires pairwise count filtering condition checks over all |R| · |S| pairs, thus resulting in a quadratic number of checks. This can be intolerable on large-scale datasets. Next, we present a scalable solution leveraging the MapReduce paradigm.
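To make the signatures and the count filtering condition concrete, the following minimal Python sketch enumerates path-based q-gram signatures of a small labeled graph and applies a Lemma 8 style lower bound. The function names and the canonicalization of path direction are our own illustrative choices, not the paper's implementation.

```python
from collections import Counter

def path_qgrams(adj, labels, q):
    """Enumerate path-based q-grams: simple paths with q edges.

    A signature is the label sequence of a path, taken in the
    lexicographically smaller direction so both ends agree.
    """
    sigs = []
    def dfs(path):
        if len(path) == q + 1:
            if path[0] < path[-1]:          # count each undirected path once
                lab = tuple(labels[v] for v in path)
                sigs.append(min(lab, lab[::-1]))
            return
        for w in adj[path[-1]]:
            if w not in path:               # "simple" path: no repeated vertex
                dfs(path + [w])
    for v in adj:
        dfs([v])
    return sigs

def common_signatures(q_r, q_s):
    """Number of common signatures, respecting multiplicities."""
    c_r, c_s = Counter(q_r), Counter(q_s)
    return sum(min(c_r[k], c_s[k]) for k in c_r.keys() & c_s.keys())

def passes_count_filter(q_r, q_s, tau, d_r, d_s):
    """Lemma 8: a similar pair must share at least this many signatures."""
    bound = max(len(q_r) - tau * d_r, len(q_s) - tau * d_s)
    return common_signatures(q_r, q_s) >= bound
```

For a labeled triangle and q = 1, the signatures are simply its three labeled edges; a pair of graphs then passes the filter only if enough of these label sequences coincide.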
3. Framework
This section presents a MapReduce-based graph similarity join algorithm, following the filtering-verification framework. The outline of the algorithm is listed in Algorithm 1.
3.1. Filtering
We allocate two MapReduce jobs to the filtering phase.
Job 1. Job 1 counts the common signatures of each type for graph pairs. We use graphs as values and their corresponding IDs as keys to compose the input key-value pairs. q-gram signatures (denoted by sig) are generated in the Map task. We use the generated signatures as keys and the IDs of graphs as values to form the output key-value pairs. As a consequence, in the Reduce task, we can gather the graphs sharing a common signature. For the same signature, a graph may appear more than once, since there may exist several q-grams in a graph corresponding to an identical signature. The function for the Map task is as follows. (1) Input ⟨r.id, r⟩ or ⟨s.id, s⟩. (2) Generate the q-gram signatures for each input graph. (3) For each signature, emit ⟨sig, r.id⟩ or ⟨sig, s.id⟩ with the ID of its corresponding graph.
In the Reduce task, the Reducer receives a signature together with the list of graph IDs emitted for it. We define c_r (resp., c_s) to denote the number of occurrences of r.id (resp., s.id) in the value list; for a pair ⟨r.id, s.id⟩ sharing the input key sig, the number of common signatures of this type, min(c_r, c_s), is calculated and output with the corresponding graph pair. The function of the Reduce task is as follows. (1) Get the list of values consisting of r.ids and s.ids for the specific key sig. (2) Split the list of values into a list of r.ids and a list of s.ids. (3) For each pair ⟨r.id, s.id⟩, where r.id is from the former list and s.id is from the latter, calculate min(c_r, c_s) and output it with the pair.
Job 2. Job 2 counts the total number of common signatures for graph pairs and checks the count filtering condition, after which the set of candidate pairs is obtained. The Map function is as follows. (1) Read a ⟨(r.id, s.id), count⟩ pair into the Map task. (2) Emit the key-value pair exactly as it reads.
We input the pairs to the Map task and output exactly the same key-value pairs. Therefore, the Reduce task receives all the partial counts generated in job 1 for each specific graph pair. As a graph pair may have more than one type of common signature, the counts are summed over all signature types for the same graph pair. Subsequently, the graph pairs with fewer common signatures than the lower bound of Lemma 8 are discarded. The remaining graph pairs are candidate pairs for verification. The function for Reduce is as follows. (1) Receive the output from the Mappers: a specific pair ⟨r.id, s.id⟩ and a list of partial counts. (2) Sum the partial counts to obtain the number of common signatures for the pair. (3) Conduct count filtering for the pair and output to DFS the pairs whose common signatures are no fewer than the bound.
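The two filtering jobs can be simulated in memory with ordinary dictionaries standing in for the MapReduce groupings. This is a hedged sketch of the dataflow only; `signatures_of` and `lower_bound` are placeholder callables for q-gram generation and the Lemma 8 bound, not the paper's actual implementation.

```python
from collections import Counter, defaultdict

def filtering_phase(R, S, signatures_of, lower_bound):
    """Simulate the two filtering jobs in memory.

    R, S: dicts mapping graph id -> graph.
    signatures_of: graph -> list of q-gram signatures.
    lower_bound: (r_id, s_id) -> count filtering threshold.
    """
    # Job 1, Map: emit (signature, tagged graph id)
    by_sig = defaultdict(list)
    for rid, g in R.items():
        for sig in signatures_of(g):
            by_sig[sig].append(('R', rid))
    for sid, g in S.items():
        for sig in signatures_of(g):
            by_sig[sig].append(('S', sid))
    # Job 1, Reduce: per signature, each pair shares min(c_r, c_s) signatures
    partial = defaultdict(int)
    for sig, tagged in by_sig.items():
        c_r = Counter(i for t, i in tagged if t == 'R')
        c_s = Counter(i for t, i in tagged if t == 'S')
        for rid, nr in c_r.items():
            for sid, ns in c_s.items():
                partial[(rid, sid)] += min(nr, ns)
    # Job 2, Reduce: totals are accumulated above; apply count filtering
    return {p for p, c in partial.items() if c >= lower_bound(*p)}
```

For instance, treating each character of a string as a signature, graphs sharing at least one signature survive a lower bound of 1, and pairs with no common signature are never even formed, which is the point of grouping by signature first.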
3.2. Verification
In the verification phase, candidate pairs are to be verified, for which the graphs r and s are required; that is, join operations are necessary to retrieve graphs r and s by their IDs. Hence, we allocate two MapReduce jobs to join the candidate pair set C, R, and S.
Job 1. Job 1 replaces r.id with graph r. The Map function takes graph set R and the candidate pairs as input and emits key-value pairs grouped by r.id, as follows. (1) Input candidate pairs ⟨r.id, s.id⟩ and graph set R. (2) For each graph r ∈ R, emit ⟨r.id, r⟩ and, for each candidate pair, emit it exactly, both of which take r.id as the key.
The Reduce task gathers the list of s.ids and graph r for the key r.id and then outputs the key-value pairs ⟨s.id, (r, s.id)⟩. The function for the Reducer is as follows. (1) Receive a list of s.ids and graph r with the specific key r.id. (2) For each s.id in the list, replace r.id with graph r and output the pair ⟨r, s.id⟩.
Job 2. Job 2 replaces s.id with graph s, invokes the label filtering conditions, and calculates GED to find the similar graph pairs.
The function for the Map task is as follows. (1) Input the candidate pairs ⟨r, s.id⟩ and graph set S. (2) Emit each key-value pair exactly as it reads, where the group key is s.id in both cases.
The function for the Reduce task is as follows. (1) Receive a list of values consisting of graphs r and graph s corresponding to the key s.id. (2) Replace s.id with graph s. (3) Calculate GED for the pairs and output the similar pairs.
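The two verification jobs amount to two equi-joins followed by an exact distance check, which can be sketched in memory as follows. The `ged` argument is a placeholder for the exact GED routine; the grouping by r.id mirrors job 1 and the lookup of s mirrors job 2.

```python
def verification_phase(candidates, R, S, ged, tau):
    """Simulate the two verification join jobs.

    candidates: iterable of (r_id, s_id) pairs from the filtering phase.
    R, S: dicts mapping graph id -> graph.
    ged: stand-in for the exact GED computation.
    """
    # Job 1: group candidate pairs by r.id and attach graph r
    by_rid = {}
    for rid, sid in candidates:
        by_rid.setdefault(rid, []).append(sid)
    half_joined = [(R[rid], sid) for rid, sids in by_rid.items() for sid in sids]
    # Job 2: group by s.id, attach graph s, and verify with exact GED
    return [(r, S[sid]) for r, sid in half_joined if ged(r, S[sid]) <= tau]
```

With a toy "distance" such as the difference in string lengths, only pairs within the threshold survive, exactly as the exact GED check would behave on real graphs.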
3.3. Correctness and Complexity Analysis
Since count filtering never prunes a pair satisfying the edit distance constraint (Lemma 8), and every candidate pair is verified by exact GED computation, all graph pairs that satisfy the edit distance constraints are returned error-free, which justifies the correctness of the algorithm.
For the algorithm complexity, we take all three phases (Map, Reduce, and Shuffle) into consideration. I/O reading overhead from the distributed file system (DFS) is considered for the Map task, whereas I/O writing overhead into DFS is analyzed for the Reduce task. Time complexity is considered for both tasks. For Shuffle, we consider the network communication cost.
Some parameters are defined preceding the analysis. Let the average size of a graph be fixed; the candidate ratio is the percentage of candidate pairs among all graph pairs, and the similar ratio is the fraction of pairs passing count filtering that are truly similar. We assume that the size of a key-value pair containing only graph IDs is 1.
In job 1 of the filtering phase, Map reads the data of graph sets R and S from DFS, the cost of which is linear in the total size of the two sets. Consider the worst case in the Map task, where the q-grams of each graph are generated by enumerating simple paths, so the time complexity of the Map task is proportional to the total number of generated q-grams. As each generated q-gram signature forms an output key-value pair (the size of which is 1), the communication cost for Shuffle is likewise proportional to the number of generated signatures. In the Reduce task, all graphs, represented by their IDs, containing the same signature are acquired and paired, so the I/O cost is proportional to the number of emitted partial counts. Then, the occurrences of r.ids and s.ids are counted; in other words, all the key-value pairs generated from the Map task are traversed, so the time complexity of Reduce is linear in their number.
Consider job 2 of the filtering phase, where all the output key-value pairs from job 1 are read into the Map task of job 2, so the I/O cost is linear in their number. Map outputs the key-value pairs it reads, so the time complexity of the Map task and the communication cost for Shuffle are of the same order. In the Reduce task, all pairs are traversed, so the time complexity is again linear in their number, and the I/O cost of the Reduce task is proportional to the number of candidate key-value pairs.
In job 1 of the verification phase, Map reads the candidate pairs, represented by IDs, and graph set R from DFS. The time complexity of the Map task and the communication cost of Shuffle are of the same order as this input, because the Map function emits exactly what has been input. Then, in the Reduce task, we replace r.id with graph r; the time complexity is linear in the number of candidate pairs, and the I/O overhead is proportional to the output size.
In job 2 of the verification phase, Map reads the candidate pairs and graph set S from DFS, and the time complexity of the Map task and the communication cost of Shuffle are of the same order for the same reason as above. Then, in the Reduce task, we first replace s.id with graph s and then calculate GED for the graph pairs. The GED calculation by the A*-based algorithm takes time exponential in the graph size; in the worst case, this dominates the time complexity of Reduce. Finally, the similar graph pairs are emitted into DFS, with I/O cost proportional to their number.
4. Incorporating Bloom Filters
In the filtering phase of Algorithm 1, two MapReduce jobs are required, with many intermediate key-value pairs generated and transmitted. These increase the I/O and communication costs, which can be fairly time-consuming. This section introduces the Bloom filter technique to reduce such costs. We first recall the concept of spectral Bloom filters.

4.1. Spectral Bloom Filter
Bloom filters [12] are space-efficient data structures that allow fast membership queries over a given set. A Bloom filter uses k hash functions to hash elements into a bit array of size m. For an element x in the set, the bits at positions h_1(x), …, h_k(x) of the array are set to 1. Given a query item y, we check its membership in the set by examining the bits at positions h_1(y), …, h_k(y) of the array. The item is reported to be contained in the set if (and only if) all the aforementioned bits are 1. This method brings a small probability of false positives; that is, it may return a positive result for an item which actually is not contained in the set, but no false negatives, while gaining substantial space savings [13].
The spectral Bloom filter (SBF) [14] generalizes the basic Bloom filter to record element frequencies and is thus adopted in this paper. An SBF represents a set S together with a map f from S to the natural numbers, replacing the bit vector with a vector of counters. For an item x, the counters at positions h_1(x), …, h_k(x) are increased by 1 upon insertion and decreased by 1 upon deletion. Let f(x) denote the frequency of x. A basic query for SBF on an item x returns an estimate of f(x), namely, the minimum of the counters at x's positions. Note that, similar to Bloom filters, the SBF never underestimates f(x).
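The counter-vector mechanics just described can be sketched in a few lines of Python. The class below is a minimal illustration, not the paper's implementation; the choice of SHA-256-derived hash positions and the default sizes are our own assumptions.

```python
import hashlib

class SpectralBloomFilter:
    """Minimal spectral Bloom filter: a counter vector instead of a bit vector."""

    def __init__(self, m=256, k=3):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, item):
        # k pseudo-independent hash positions derived from one hash family
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def insert(self, item):
        for p in self._positions(item):
            self.counters[p] += 1

    def delete(self, item):
        for p in self._positions(item):
            self.counters[p] -= 1

    def query(self, item):
        # minimum counter: never underestimates the true frequency
        return min(self.counters[p] for p in self._positions(item))
```

Because every occurrence of an item increments all of its k counters, the minimum over those counters can only meet or exceed the true frequency, which is exactly the no-false-negative property the filtering phase relies on.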
4.2. Algorithm
Incorporating SBF not only reduces the number of key-value pairs but also contracts the two MapReduce jobs of the filtering phase into one. In particular, the Map task takes graph sets R and S as input. Then, we create an SBF for each graph by adding its q-gram signatures. A Cartesian product is conducted over the output key-value pairs carrying the SBFs of graphs from R and those from S. In the Reduce task, for each graph pair, the two SBFs are intersected to estimate the number of common signatures. Intersecting two SBFs with counter vectors C and C′ returns another SBF whose ith counter is min(C_i, C′_i), 1 ≤ i ≤ m. Hence, the number of common signatures can be estimated from the intersected SBF. Subsequently, the graph pairs that have fewer common signatures than the bound given by the count filtering condition are discarded, and the remaining pairs form the candidate set.
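The intersection step above can be illustrated with plain counter lists. Since every inserted signature increments k counters, one plausible estimator (our own illustrative choice, not necessarily the paper's exact formula) divides the mass of the position-wise minimum by k; hash collisions can only inflate the estimate, never deflate it.

```python
import hashlib

def positions(item, m=64, k=3):
    """k hash positions for an item in a counter vector of size m."""
    for i in range(k):
        h = hashlib.sha256(f"{i}:{item}".encode()).digest()
        yield int.from_bytes(h[:4], "big") % m

def build_sbf(items, m=64, k=3):
    """Counter vector of a spectral Bloom filter over a multiset of items."""
    c = [0] * m
    for it in items:
        for p in positions(it, m, k):
            c[p] += 1
    return c

def common_estimate(c1, c2, k=3):
    """Estimate the size of the common multiset from two SBF counter vectors.

    Intersect by taking position-wise minima; each true common item
    contributes k counts, so divide by k. Collisions only overestimate.
    """
    return sum(min(a, b) for a, b in zip(c1, c2)) / k
```

Intersecting a filter with itself recovers the full multiset size exactly, while intersecting two different filters yields at least the true number of shared signatures.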
We provide the pseudocode of the aforementioned process in Algorithm 2.

4.3. Correctness and Complexity Analysis
There is a small probability that false positives happen with an SBF. Specifically, a query for an item x may return a value larger than f(x). Therefore, the number of common signatures estimated this way may be larger than the actual value. Nonetheless, false negatives never happen, which ensures the correctness of the algorithm; in such cases, only the pruning power of count filtering is impaired. Besides, false positives become less likely if one carefully chooses the hash functions and configures the size of the counter vector.
Then, we analyze the complexity. Let m be the size of an SBF. The Map task reads the entire sets R and S, so its I/O cost is linear in their total size. Then, signatures are generated and added to the SBFs, and k hash values are calculated for each signature; thus, the time complexity for Map is proportional to the number of signatures times k. Map emits the SBF for each input graph, and then the Cartesian product is conducted, so the communication cost for Shuffle is proportional to the number of graph pairs times m. In the Reduce task, for each graph pair, the number of common signatures is estimated and the count filtering condition is checked; thus, the time complexity is proportional to the number of pairs times m. Disregarding false-positive cases, the I/O cost is the same as before. Denoting the improved algorithm by “+SBF,” we summarize the complexity results in Table 1.

5. Optimizing Verification Phase
In the verification phase, we need to calculate the GED of candidate pairs. Nevertheless, the in-memory A*-based algorithm is not capable of finishing the calculation for large graphs and thresholds. Therefore, we devise a MapReduce-based method for GED calculation, which is able to handle large-scale graphs. Besides, join operations are required preceding the GED calculation to obtain the entire graphs. Thus, three relations, the candidate set C, R, and S, are joined to obtain the input of the GED algorithm, where C is the output of the filtering phase. Inspired by the idea of multiway join, we can reduce the number of required MapReduce jobs from two to one.
5.1. MapReduce for GED Calculation
The GED calculation is based on the A* algorithm. A* constructs a search tree, each node of which represents a mapping status. A mapping status is stored in an array, the indices of which stand for the vertices of graph r and the corresponding values stand for the vertices of graph s. A* explores the space of all possible vertex mappings between the two graphs in a best-first search fashion, with a function f established to determine the order in which the search visits vertex mappings. f is the sum of two functions: the distance from the initial state to the current state, denoted by g, and a heuristic estimate of the distance from the current state to the goal, denoted by h. g is computed on the portions of the graphs consisting of the vertices that have been mapped and the edges connecting them, while h is a lower bound derived from the vertices not yet mapped as well as their resident edges; in particular, h accounts for the label differences of the two unmapped portions.
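A classic instance of such a heuristic h is the label multiset difference: counting only the relabelings, insertions, and deletions needed to reconcile the vertex and edge label multisets, while ignoring topology, yields a valid lower bound on the remaining edit distance. The sketch below is our own illustration of this standard bound, not the paper's exact formula.

```python
from collections import Counter

def label_lower_bound(vlabels_r, vlabels_s, elabels_r, elabels_s):
    """Lower bound on the remaining GED from label multisets alone.

    For each label multiset pair: matched labels cost nothing; the rest
    must be relabeled, inserted, or deleted, costing one operation each.
    """
    def diff(a, b):
        ca, cb = Counter(a), Counter(b)
        matched = sum((ca & cb).values())      # multiset intersection size
        return max(len(a), len(b)) - matched
    return diff(vlabels_r, vlabels_s) + diff(elabels_r, elabels_s)
```

On the cyclopropanone-style example, one vertex label mismatch plus one edge label mismatch gives a bound of 2, matching the two edit operations found there; on identical graphs the bound is 0, as any admissible h must be at the goal.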
The search space of the A*-based approach is very large, in the worst case exponential in the number of vertices. Parallelization is a natural way to speed up the search. The naive way is to allocate different branches of the search tree to different workers so that the search can proceed in parallel. However, the load is not balanced this way, so the final runtime is determined by the worker with the heaviest load. As a consequence, we devise a MapReduce-based method to calculate GED, denoted by MRGED, which rebalances the workload after each MapReduce round.
The key-value pairs manipulated by MapReduce encode mapping statuses together with their f values. The search proceeds through iterations, each of which walks down one layer of the search tree.
5.2. Multiway Join
In relational databases, a multiway join can process several joins together in one round (Algorithm 3). Following the same idea, we can consolidate the two MapReduce jobs required for verification in Algorithm 1. Specifically, let h be a hash function with range [1, m], such that m² is the number of Reducers. We associate each Reduce task with a pair (i, j), where i, j ∈ [1, m]. Each candidate tuple ⟨r.id, s.id⟩ is sent to the Reducer numbered (h(r.id), h(s.id)), while each tuple in R (resp., S) is sent to the Reducers numbered (h(r.id), j) (resp., (i, h(s.id))), for any j (resp., i). Each Reduce task computes the join of the tuples it receives. It has been shown that a multiway join is more efficient in practice than two simple joins [15].
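The routing rule above can be written down directly: reducers form an m-by-m grid, a candidate pair targets exactly one cell, and each graph tuple is replicated across one row or one column so that every pair meets both of its graphs somewhere. This is a hedged sketch of the routing only; names and the use of Python's built-in `hash` are our own choices.

```python
def reducer_targets(kind, m, h=hash, r_id=None, s_id=None):
    """Reducers are addressed by grid cells (i, j), 0 <= i, j < m.

    'pair'  -> the single reducer (h(r.id) % m, h(s.id) % m)
    'R'     -> the whole row    (h(r.id) % m, *)
    'S'     -> the whole column (*, h(s.id) % m)
    """
    if kind == 'pair':
        return [(h(r_id) % m, h(s_id) % m)]
    if kind == 'R':
        return [(h(r_id) % m, j) for j in range(m)]
    if kind == 'S':
        return [(i, h(s_id) % m) for i in range(m)]
    raise ValueError(kind)
```

The correctness invariant is that the pair's single target cell always lies in the row of its R-tuple and the column of its S-tuple, so all three meet at one reducer; the price is that each graph is replicated to m reducers, which is the replication factor that appears in the Shuffle cost.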

We encapsulate the improved verification procedure in Algorithm 4.

5.3. Correctness and Complexity Analysis
One may immediately verify that Algorithm 4 correctly conducts the verification.
In the multiway join based verification phase, the Map task takes graph sets R and S and the candidate pairs, represented by their IDs, as input; the input I/O cost is linear in their total size. A key-value pair from R (resp., S) is sent to the Reduce tasks numbered (h(r.id), j) (resp., (i, h(s.id))) for any j (resp., i), whereas a candidate key-value pair is sent to the single Reduce task numbered (h(r.id), h(s.id)). As a result, the communication cost for Shuffle grows with m, the square root of the number of Reducers, since each graph is replicated to m Reduce tasks. In the Reduce task, all candidate graph pairs go through edit distance computation. Pairs of larger size go through MRGED, while pairs of smaller size go through A*. For simplicity, the complexity of the GED calculation is regarded as the same as in the baseline algorithm. Labelling the resulting algorithm with “+MJ,” we summarize the complexity results in Table 2.

6. Experiments
6.1. Experiment Setup
We conducted experiments on several publicly available real datasets but only present the results on PubChem (http://pubchem.ncbi.nlm.nih.gov) in the interest of space. The dataset is constructed by sampling 1,000,000 graphs from PubChem.
Amazon cloud services were used as our experiment platform. Specifically, we used Elastic Compute Cloud (EC2), in which the computing nodes are called instances. In the experiments, 31 instances were used by default, one set as the master node and the others as worker nodes. The standard configuration of all EC2 instances is m1.small, one single-core CPU with 1.7 GB of memory, running Hadoop 1.1.2 (Table 3).

6.2. Evaluating Filters
To evaluate the effectiveness of our filtering techniques, we use the term “Basic” for the baseline MapReduce-based graph similarity join algorithm and “+SBF” for the algorithm that improves the filtering of Basic by incorporating SBF.
The algorithm efficiency is shown in Figure 2(a). As the pruning power is somewhat impaired, we also conducted an experiment to record the increase of candidate pairs under “+SBF” (cf. Figure 2(b)). It can be seen that “+SBF” outperforms “Basic” in efficiency while sacrificing little pruning power. At the largest threshold tested, fewer than 300 extra candidate pairs are generated while about 5,000 seconds are saved.
6.3. Evaluating Verification
The verification was evaluated with the candidate pairs generated by Basic. The term “+MJ” denotes applying multiway join in verification, while “+MRGED” denotes adopting the alternative MapReduce-based GED calculation. We use the term “MGSJoin” to indicate the basic algorithm incorporating both techniques. Figure 3(a) shows the runtime comparison between +MRGED and MGSJoin. The result illustrates the superiority of applying multiway join. At the largest threshold, the algorithm with multiway join is about 6,000 seconds faster than with ordinary joins. Figure 3(b) presents the result of evaluating MRGED, where Basic and +MRGED are compared. It can be observed that the runtime of both algorithms grows exponentially. When the threshold equals 1, Basic finishes quicker than +MRGED (162 s and 204 s, resp.), while when it equals 2, +MRGED outperforms Basic (580 s against 712 s). This is because the calculation required at threshold 1 is small, where MRGED is clumsy compared with A*, whereas larger thresholds require more calculation, so the advantage of MRGED shows. When the threshold is within the range of 3–5, Basic is unable to finish because the large calculation drives it out of memory.
6.4. Comparing with StateoftheArt Method
We compared our algorithm with the state-of-the-art method, GSimJoin. In Figure 4(a), we chose 10,000 graphs in order to compare with GSimJoin, and the result witnesses the obvious superiority of MGSJoin over GSimJoin. Figure 4(b), drawn in log scale, varies the number of graphs and records the elapsed time. Both curves grow linearly in the figure, which reflects their exponential growth. MGSJoin rises much more slowly than GSimJoin. The runtime of GSimJoin is about 10 times longer than that of MGSJoin when joining 100 graphs and 100 times longer when joining 10,000 graphs. Moreover, when joining 100,000 graphs, GSimJoin runs out of memory, so no result is recorded.
6.5. Speedup
We evaluated the speedup of our algorithm by varying the number of instances from 10 to 50. The experimental results are shown in Figure 5(a). We can see that, with the increase of instances in the cluster, the performance of MGSJoin improves significantly, most notably when the threshold equals 4. With more instances running, count filtering gets faster by counting the common signatures simultaneously, and verification gets quicker by joining the relations and calculating GED in parallel.
6.6. ScaleUp
We evaluated the scale-up of our algorithm by increasing both the dataset size and the number of nodes in the cluster. The result is shown in Figure 5(b). It is worth noting that, as the dataset grows, the results for different threshold values exhibit similar trends of increase. All lines rise smoothly, which reveals the good scalability of MGSJoin.
7. Related Work
Graph Similarity Queries. Similarity joins retrieve similar data object pairs, which can be strings, sets, trees, or graphs [16]. As to GED-based graph similarity search, [17] proposed κ-AT, a tree-based q-gram approach. However, it is associated with the drawback of a usually loose lower bound for count filtering. Seeing this drawback, [9] presented a path-based q-gram approach. In comparison with κ-AT, GSimJoin is more efficient by leveraging more advanced filtering techniques. Thus, we adopt the path-based q-gram approach in this paper.
MapReduce-Based Graph Algorithms. MapReduce is a distributed programming framework [10] which has been applied to processing large graphs. Many graph algorithms using MapReduce were discussed in [18], including triangle/rectangle enumeration and clique computation. In [19], several techniques were proposed to reduce the input size of MapReduce, and the techniques were applied to minimum spanning trees, approximate maximal matchings, approximate node/edge covers, and minimum cuts. Personalized PageRank computation in MapReduce was discussed in [20]. Matrix-multiplication-based graph mining algorithms in MapReduce were investigated in [21]. More recently, densest subgraph computation [22], subgraph instance enumeration [23], and connected components computation in logarithmic rounds [24] have been studied in MapReduce.
Graph Processing Systems in the Cloud. Many systems have been developed to deal with big graphs. One representative system is Pregel [25], which takes a vertex-centric approach and implements the bulk synchronous parallel (BSP) computation model. HipG [26] improves on BSP by using asynchronous messages to avoid synchronization. PowerGraph [27] is a distributed graph processing system optimized to process power-law graphs. Giraph++ was proposed in [28] to take graph partitioning into consideration when processing graphs. Workload balancing for graph processing in the cloud was discussed in [29].
8. Conclusion
In this paper, we have investigated the problem of scalable graph similarity joins. We first presented a MapReduce-based graph similarity join algorithm, MGSJoin, following the filtering-verification framework. To reduce the communication cost in the filtering phase, it incorporates the Bloom filter technique to reduce the number of intermediate key-value pairs. In addition, we devised a multiway join optimized verification procedure for further speedup, together with a MapReduce-based GED calculation that enables tests on larger and denser graphs. Extensive experiments conducted on real datasets confirm the efficiency and scalability of the proposed solution.
As a future direction, we plan to explore the possibility of optimizing the verification with a multithreaded programming paradigm. Additionally, it is also of interest to test the efficiency and scalability of the proposed algorithms on even larger and/or denser graphs.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The research is supported by the Doctoral Program of Higher Education of China (No. 2011437110008) and the National Natural Science Foundation of China (No. 61303062).
References
[1] X. Yan and J. Han, “gSpan: graph-based substructure pattern mining,” in Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM '02), pp. 721–724, December 2002.
[2] X. Yan, P. S. Yu, and J. Han, “Graph indexing: a frequent structure-based approach,” in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '04), pp. 335–346, June 2004.
[3] H. He and A. K. Singh, “Closure-tree: an index structure for graph queries,” in Proceedings of the 22nd International Conference on Data Engineering (ICDE '06), p. 38, April 2006.
[4] X. Yan, P. S. Yu, and J. Han, “Substructure similarity search in graph databases,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 766–777, June 2005.
[5] Y. Tian and J. M. Patel, “TALE: a tool for approximate large graph matching,” in Proceedings of the 24th IEEE International Conference on Data Engineering (ICDE '08), pp. 963–972, April 2008.
[6] A. Sanfeliu and K.-S. Fu, “A distance measure between attributed relational graphs for pattern recognition,” IEEE Transactions on Systems, Man and Cybernetics, vol. 13, no. 3, pp. 353–362, 1983.
[7] H. Bunke and G. Allermann, “Inexact graph matching for structural pattern recognition,” Pattern Recognition Letters, vol. 1, no. 4, pp. 245–253, 1983.
[8] M. R. Garey and D. S. Johnson, Computers and Intractability, vol. 174, Freeman, San Francisco, Calif, USA, 1979.
[9] X. Zhao, C. Xiao, X. Lin, and W. Wang, “Efficient graph similarity joins with edit distance constraints,” in Proceedings of the 28th IEEE International Conference on Data Engineering (ICDE '12), pp. 834–845, April 2012.
[10] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[11] D. Deng, G. Li, S. Hao, J. Wang, J. Feng, and W. S. Li, “MassJoin: a MapReduce-based method for scalable string similarity joins,” in Proceedings of the 30th IEEE International Conference on Data Engineering (ICDE '14), pp. 340–351, Chicago, Ill, USA, 2014.
[12] B. H. Bloom, “Space/time trade-offs in hash coding with allowable errors,” Communications of the ACM, vol. 13, no. 7, pp. 422–426, 1970.
[13] A. Broder and M. Mitzenmacher, “Network applications of Bloom filters: a survey,” Internet Mathematics, vol. 1, no. 4, pp. 485–509, 2004.
[14] S. Cohen and Y. Matias, “Spectral Bloom filters,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 241–252, June 2003.
[15] F. N. Afrati and J. D. Ullman, “Optimizing joins in a MapReduce environment,” in Proceedings of the 13th International Conference on Extending Database Technology (EDBT '10), pp. 99–110, March 2010.
[16] Y. N. Silva and J. M. Reed, “Exploiting MapReduce-based similarity joins,” in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '12), pp. 693–696, May 2012.
[17] G. Wang, B. Wang, X. Yang, and G. Yu, “Efficiently indexing large sparse graphs for similarity search,” IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 3, pp. 440–451, 2012.
[18] J. Cohen, “Graph twiddling in a MapReduce world,” Computing in Science and Engineering, vol. 11, no. 4, pp. 29–41, 2009.
[19] S. Lattanzi, B. Moseley, S. Suri, and S. Vassilvitskii, “Filtering: a method for solving graph problems in MapReduce,” in Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '11), pp. 85–94, June 2011.
[20] B. Bahmani, K. Chakrabarti, and D. Xin, “Fast personalized PageRank on MapReduce,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 973–984, June 2011.
[21] U. Kang, H. Tong, J. Sun, C.-Y. Lin, and C. Faloutsos, “GBASE: a scalable and general graph management system,” in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '11), pp. 1091–1099, August 2011.
[22] B. Bahmani, R. Kumar, and S. Vassilvitskii, “Densest subgraph in streaming and MapReduce,” Proceedings of the VLDB Endowment, vol. 5, no. 5, pp. 454–465, 2012.
[23] F. N. Afrati, D. Fotakis, and J. D. Ullman, “Enumerating subgraph instances using MapReduce,” in Proceedings of the 29th IEEE International Conference on Data Engineering (ICDE '13), pp. 62–73, April 2013.
[24] V. Rastogi, A. Machanavajjhala, L. Chitnis, and A. Das Sarma, “Finding connected components in MapReduce in logarithmic rounds,” in Proceedings of the 29th IEEE International Conference on Data Engineering (ICDE '13), pp. 50–61, April 2013.
[25] G. Malewicz, M. H. Austern, A. J. C. Bik et al., “Pregel: a system for large-scale graph processing,” in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '10), pp. 135–145, June 2010.
[26] E. Krepska, T. Kielmann, W. Fokkink, and H. Bal, “HipG: parallel processing of large-scale graphs,” ACM SIGOPS Operating Systems Review, vol. 45, no. 2, pp. 3–13, 2011.
[27] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “PowerGraph: distributed graph-parallel computation on natural graphs,” in Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI '12), pp. 17–30, 2012.
[28] Y. Tian, A. Balmin, S. A. Corsten, S. Tatikonda, and J. McPherson, “From ‘think like a vertex’ to ‘think like a graph’,” Proceedings of the VLDB Endowment, vol. 7, no. 3, 2013.
[29] Z. Shang and J. X. Yu, “Catch the wind: graph workload balancing on cloud,” in Proceedings of the 29th IEEE International Conference on Data Engineering (ICDE '13), pp. 553–564, April 2013.
Copyright
Copyright © 2014 Yifan Chen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.