Abstract
The multistage graph problem is a special kind of single-source single-sink shortest path problem. It is difficult, or even impossible, to solve large-scale multistage graphs on a single machine with sequential algorithms. Many distributed graph computing systems can solve this problem, but they are usually designed for general large-scale graphs and do not consider the special characteristics of multistage graphs. This paper proposes DMGA (Distributed Multistage Graph Algorithm) to solve the shortest path problem according to the structural characteristics of multistage graphs. The algorithm first distributes the graph over a set of computing nodes so that the vertices of the same stage are stored on the same computing node. Next, DMGA calculates the shortest paths between every pair of starting and ending vertices within a partition by the classical dynamic programming algorithm. Finally, the global shortest path is calculated by iteratively exchanging subresults between computing nodes. Our experiments show that the proposed algorithm can effectively reduce the time needed to solve the shortest path problem of multistage graphs.
1. Introduction
With the continuous development of big data and information technology, graphs have been widely applied in many applications, and various graph structures and algorithms have been proposed. Among them, the multistage graph is a special kind of weighted directed graph, which is widely used in engineering technology, concurrency control, transportation, task scheduling in high-performance computing, and other fields. Many coordination or dynamic scheduling problems can be transformed into multistage graph problems [1, 2].
Recently, the scale of graph data has grown tremendously, so it is difficult, or even impossible, to store and process such large-scale graphs with a single computer or a sequential processing method [3]. Distributed computing has therefore become a must, and many dedicated graph-processing systems have appeared [4–6], such as Pregel [7], PowerGraph [8], GraphX [9, 10], GraphLab [11], and PowerLyra [12]. These graph processing systems scale the computation by dividing the graph into multiple partitions and processing them on multiple computing nodes in parallel. A high-quality partition can reduce the communication cost and achieve load balance [13–15], and thus reduce the processing time. The current distributed graph processing systems and algorithms are usually designed for general graphs and do not consider the special structural properties of multistage graphs, so applying them to multistage graphs has disadvantages such as high communication cost and long solution time. The purpose of this paper is to present a distributed algorithm, DMGA (Distributed Multistage Graph Algorithm), for the shortest path problem of multistage graphs that makes full use of their characteristics. The main contributions are as follows:
(1) It presents a partitioning method for multistage graphs on distributed computing systems, which exploits the characteristics of multistage graphs to balance the load and reduce the communication cost.
(2) It designs a distributed algorithm for the shortest path problem of multistage graphs based on the idea of dynamic programming.
(3) It performs extensive experiments to verify the performance of the proposed algorithm, compared to the classical parallel Dijkstra algorithm and the SSSP (single-source shortest path) algorithm on Pregel.
Table 1 gives an overview of the notations used in this paper. The rest of the paper is organized as follows. Section 2 introduces the related works, and Section 3 states the shortest path problem of multistage graphs. Section 4 describes the proposed DMGA algorithm, and Section 5 presents the experiments and analysis. Section 6 concludes the paper.
2. Related Works
2.1. The Shortest Path Algorithms
Finding the shortest path is a classical problem of graph theory, and the well-known sequential algorithms are the Dijkstra, Floyd, and Bellman–Ford algorithms [16], which perform well in centralized computing. However, large-scale graphs need distributed computing algorithms to obtain the shortest paths quickly.
The single-source shortest path (SSSP) problem is one of the most important shortest path problems. Peng et al. [1] defined a new graph model named the single-source weighted multilevel graph and presented a parallel SSSP algorithm by constructing a vector-matrix multiplication model, dividing the work into parallel tasks, and defining the data communication method. Davidson et al. [17] developed three parallel SSSP algorithms for GPUs (Graphics Processing Units): Workfront Sweep, Near-Far, and Bucketing. These algorithms use different approaches to balance the trade-off between saving work and organizational overhead. Maleki et al. [18] introduced a partially asynchronous parallel DSMR (Dijkstra Strip Mined Relaxation) algorithm for SSSP on shared and distributed memory systems. Busato and Bombieri [19] proposed a parallel Bellman–Ford algorithm based on frontier and active vertices that exploits the architectural characteristics of GPUs. Huang [20] gave a distributed Las Vegas algorithm based on the classic scaling technique for all-pairs shortest paths on distributed networks. For the dynamic and stochastic graph models of the transportation network, Liu et al. [21] proposed an improved adaptive genetic algorithm that adjusts the encoding parameters to get the dynamic random shortest path. Ghaffari and Li [22] provided a distributed SSSP algorithm with lower complexity, which constitutes the first sublinear-time algorithm for directed graphs. For the SSPP-MPN (Shortest Simple Path Problem with Must-Pass Nodes), Su et al. [23] proposed a multistage metaheuristic algorithm based on k-opt moves, candidate path search, conflicting node promotion, and connectivity relaxation.
The above algorithms do not consider the structural characteristics of multistage graphs, so they produce a large amount of communication overhead, resulting in long execution times.
2.2. Graph Partitioning Algorithms
The basis of a distributed graph processing system is to partition the entire graph across a set of computing nodes. Graph partitioning algorithms are classified into vertex-cut and edge-cut. Edge-cut partitioning assigns each vertex to a unique partition, and an edge spanning two partitions is called a cut edge. As shown in Figure 1(a), the edge between b and d is cut, and its two endpoints are assigned to different partitions. Vertex-cut partitioning assigns each edge to a unique partition, which results in vertices being cut across multiple partitions [24]. As shown in Figure 1(b), vertex d is cut: both partitions hold a copy of vertex d, and these copies are also called mirrors [25].
Figure 1: (a) edge-cut partitioning; (b) vertex-cut partitioning.
Distributed graph processing systems often use the vertex-centric programming model [26, 27], where each computing node repeatedly operates on its active vertices according to a user-defined graph function. Each vertex reads the statuses of its adjacent vertices or edges and updates its own status accordingly. In the iterative calculation of a graph, the partitions exchange intermediate results along edges. To some extent, the number of cut edges or mirror vertices reflects the network communication overhead, which in turn affects the calculation efficiency. On the other hand, the load among computing nodes should be balanced so that the computing nodes finish at about the same time. Hence, both edge-cut and vertex-cut approaches aim to minimize cross-partition dependencies and achieve load balance [25].
The existing heuristic solutions for graph partitioning are broadly divided into offline and online partitioning strategies. The offline partitioning strategy divides the graph into several subgraphs before it is loaded by the distributed system. Rahimian et al. [28] introduced the JA-BE-JA algorithm, which uses local search and simulated annealing techniques for graph partitioning. The algorithm only needs part of the information to process the graph. Akhremtsev et al. [29] presented a multilevel shared-memory parallel graph partitioning algorithm that uses parallel label propagation for both coarsening and refinement, and it can balance the speed and quality of parallel graph partitioning.
The online partitioning strategy partitions the graph during the data loading process, where the input data is usually a vertex stream or an edge stream. Tsourakakis et al. [30] proposed the FENNEL algorithm based on locality-centric measures and balancing goals. Its core idea is to interpolate between maximizing the co-location of neighbouring vertices and minimizing that of non-neighbours. Petroni et al. [15] proposed the high-degree replicated first (HDRF) algorithm according to the characteristics of power-law graphs, which cuts the vertices with high degrees first. Zhang et al. [31] proposed the AKIN algorithm based on the vertex similarity index, which exploits a similarity measure of the vertex degree to collect structure-related vertices in the same partition and further reduce the edge-cut rate. Wang et al. [32] analysed the locality of the graph and proposed the target-vertex sensitive Hash algorithm. The algorithm logically predivides the target vertices of the edges and then partitions the graph in parallel according to the target vertices. Ji et al. [33] proposed a two-stage local partitioning algorithm that introduces the concept of local partitions, emphasizing the impact of changes in the graph structure on the quality of partitions. Slota et al. [34] introduced XtraPuLP based on a scalable label propagation community detection technique. It can solve the multiple-constraint and multiple-objective graph partition problem on tera-scale graphs. Zhou et al. [35] proposed GeoCut, which uses a cost-aware streaming heuristic and two partition refinement heuristics to reduce the cost and data transfer time of geo-distributed data centres.
The above graph partitioning algorithms are all designed for general graphs and do not consider the special characteristics of multistage graphs, so it is necessary to design a graph partitioning algorithm for multistage graphs to accelerate the distributed processing.
3. Problem Statements
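A multistage graph $G = (V, E, W)$ is a directed, single-source, single-sink, weighted connected graph, where V and E are, respectively, the sets of vertices and edges and W is the set of edge weights. The vertices are divided into disjoint stages, and each edge can only point from a vertex of one stage to a vertex of the succeeding stage. Formally, a multistage graph should satisfy
(1) $V = V_1 \cup V_2 \cup \cdots \cup V_m$ and $V_i \cap V_j = \emptyset$ for $i \neq j$, where m is the number of stages.
(2) $|V_i| = n_i$, where $n_i$ is the number of vertices of the ith stage.
(3) $E \subseteq \bigcup_{i=1}^{m-1} V_i \times V_{i+1}$.
(4) $W = \{ w(u, v) \}$, where $w(u, v)$ is the weight of edge $\langle u, v \rangle$. If $\langle u, v \rangle \notin E$, $w(u, v) = \infty$.
(5) $n_1 = 1$, $n_m = 1$, and $s \in V_1$ and $t \in V_m$ are, respectively, the source vertex and sink vertex.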
Figure 2 is an example of a multistage graph, where the blue numbers above the graph are the numbers of edges. This paper supposes that the multistage graphs are dense, as expressed by condition (1), where $E_i$ is the set of edges between $V_i$ and $V_{i+1}$ and $|E_i|$ is the number of edges of $E_i$.
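The shortest path problem of a multistage graph is to find the minimum cost path from the source vertex to the sink vertex. Let $c(u, v)$ be the cost of the shortest path from vertex u to vertex v. Obviously, $c(s, t)$ is the cost of the shortest path from the source vertex s to the sink vertex t, and

$$c(u, v) = \min_{x \in V_{i-1}} \{ c(u, x) + w(x, v) \}, \quad v \in V_i, \qquad (2)$$

where $w(x, v)$ is the weight of the edge from x to v and $V_i$ is the set of vertices of the ith stage.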
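As a minimal sketch of the sequential dynamic programming idea behind this problem, the following Python function sweeps the stages from the source to the sink and keeps, for every vertex, the cheapest cost found so far and its predecessor. The input representation (a list of stages and a dictionary of edge weights) and the names `multistage_shortest_path`, `stages`, and `weights` are assumptions made for illustration; they are not the paper's notation or implementation.

```python
import math


def multistage_shortest_path(stages, weights):
    """Return (cost, path) from the single source to the single sink.

    stages  -- list of lists of vertex ids; the first and last stage
               each contain exactly one vertex (hypothetical layout)
    weights -- dict mapping (u, v) to the weight of the edge u -> v;
               a missing key is treated as an infinite weight
    """
    source = stages[0][0]
    cost = {source: 0}
    prev = {source: None}
    # Forward sweep: a vertex of stage i is only reachable from stage i-1.
    for i in range(1, len(stages)):
        for v in stages[i]:
            best, best_u = math.inf, None
            for u in stages[i - 1]:
                candidate = cost[u] + weights.get((u, v), math.inf)
                if candidate < best:
                    best, best_u = candidate, u
            cost[v], prev[v] = best, best_u
    # Backtrack from the sink through the recorded predecessors.
    sink = stages[-1][0]
    path, v = [], sink
    while v is not None:
        path.append(v)
        v = prev[v]
    return cost[sink], path[::-1]
```

Each edge is examined once, so the running time of this sketch is proportional to the number of stage-to-stage vertex pairs, which is why a dense, large-scale graph quickly exceeds what one machine can hold.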
Given a large-scale multistage graph G, we need to partition it across a cluster of computing nodes. Each computing node stores a part of G, and each part is called a partition. Let p be the number of partitions, so G is divided into p partitions, and the ith partition is located on the ith computing node.
4. DMGA: The Proposed Algorithm
DMGA runs on a homogeneous cluster, which means that all computing nodes have the same performance in terms of CPU, memory, and bandwidth. The algorithm first partitions the entire graph across the given cluster, and then each computing node computes the shortest paths of the partition stored on it. Finally, the computing nodes communicate with each other to obtain the shortest path of the whole graph.
Algorithm 1 gives the framework of DMGA. The details of each step of Algorithm 1 are given in the following sections.

4.1. Multistage Graph Partition
In order to determine the graph partition strategy, we should analyse the impact of each strategy on the communication overhead after partitioning. Given the structure of multistage graphs, it is better to place the vertices of the same stage in the same partition because this makes load balancing and the parallel shortest path solution easy to implement. Suppose V_{i} and V_{i+1} are divided into two different partitions. If we use the vertex-cut strategy, the number of mirror vertices is either n_{i} or n_{i+1}. If we use the edge-cut strategy, the number of cut edges is |E_{i}|. According to (1), |E_{i}| is at least n_{i} and at least n_{i+1}, which indicates that the vertex-cut strategy has less communication overhead than the edge-cut strategy, so we adopt the vertex-cut strategy to partition the graph. Figure 3 is an example. The edge-cut strategy produces 9 cut edges (Figure 3(a)), while the vertex-cut strategy only produces 3 mirror vertices (Figure 3(b)).
Figure 3: (a) edge-cut strategy; (b) vertex-cut strategy.
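The comparison between the two strategies at a single stage boundary can be sketched as follows. The function counts the cut edges that an edge-cut would produce and the mirror vertices that a vertex-cut would produce when one side of the boundary is replicated. The function name `boundary_overhead` and the `replicate` parameter are hypothetical, introduced only for this illustration.

```python
def boundary_overhead(edges, replicate="right"):
    """Compare edge-cut and vertex-cut at one boundary between two stages.

    edges     -- iterable of (u, v) pairs with u in the earlier stage and
                 v in the later stage
    replicate -- which side's vertices are mirrored under vertex-cut
                 ("left" or "right"); an assumption of this sketch
    """
    edges = list(edges)
    side = 1 if replicate == "right" else 0
    return {
        # Edge-cut: every edge crossing the boundary becomes a cut edge.
        "edge_cut": len(edges),
        # Vertex-cut: only the boundary vertices on one side are mirrored.
        "vertex_cut": len({edge[side] for edge in edges}),
    }
```

For two fully connected stages of three vertices each, this gives 9 cut edges against 3 mirror vertices, the same counts as reported for Figure 3.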
Since the multistage graphs studied in this paper are dense, we use the number of edges to represent the load of a partition. Let Cap be the capacity of each computing node, which is also the maximum number of edges that can be stored by a partition; then the number of partitions of a given G can be estimated as

$$p = \left\lceil \frac{|E|}{Cap} \right\rceil. \qquad (3)$$
The above equation gives the lower limit of the number of computing nodes. It may leave the load of the last computing node far lower than those of the other computing nodes. For example, if 11 computing nodes are required and the first 10 are fully loaded, the last computing node may hold only 100 edges, so the load is imbalanced. Hence, the maximum load ML of each partition is redefined by (4), which contains a predefined parameter to keep the load balanced for different multistage graphs.
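The imbalance described above can be illustrated with a short sketch. It assumes that the lower limit on the number of partitions is the ceiling of the total number of edges divided by Cap. The figures of 10,100 edges and a capacity of 1,000 are illustrative values chosen to match the scenario of ten full nodes and a last node with 100 edges; they are not taken from the paper, and the function names are hypothetical.

```python
def min_partitions(num_edges, cap):
    """Smallest number of partitions of capacity cap that can hold num_edges."""
    return -(-num_edges // cap)  # integer ceiling division


def last_partition_load(num_edges, cap):
    """Edges left for the last partition when all earlier ones are full."""
    p = min_partitions(num_edges, cap)
    return num_edges - (p - 1) * cap


# Illustrative values only (assumed, not from the paper).
print(min_partitions(10_100, 1_000))       # 11 partitions
print(last_partition_load(10_100, 1_000))  # 100 edges on the last node
```

A relaxed per-partition maximum load, as the text goes on to define, is one way to avoid leaving such a nearly empty last partition.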
The idea of multistage graph partition is to assign the vertices of the same stage to the same partition. Figure 4 presents the flow diagram, and Algorithm 2 presents the pseudocode. In this algorithm, a counter records the number of edges stored in the current partition. Lines 1 and 2 initialize the variables. Lines 3–16 divide G among the cluster of computing nodes. If adding the edges of the current stage keeps the counter within ML (line 5), lines 6–8 assign these edges to the current computing node, and lines 9 and 10 update the related variables. Otherwise (line 11), the current partition would be overloaded if the edges of this stage were assigned to it, so lines 12–14 update the variables to prepare for the succeeding partition.
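The stage-by-stage assignment described above can be sketched as a greedy loop. This is a simplified illustration under stated assumptions, not a transcription of Algorithm 2: the input is reduced to the number of edges between each pair of consecutive stages, and the names `partition_by_stage`, `edge_counts`, and `max_load` are hypothetical.

```python
def partition_by_stage(edge_counts, max_load):
    """Group consecutive stage gaps into partitions without exceeding max_load.

    edge_counts -- edge_counts[i] is the number of edges between stage i
                   and stage i + 1 (0-based), assumed to be known in advance
    max_load    -- maximum number of edges a partition should hold
    Returns a list of partitions, each a list of stage-gap indices.
    """
    partitions = [[]]
    count = 0  # edges stored in the current partition
    for i, num_edges in enumerate(edge_counts):
        # Open a new partition if this stage gap would overload the current
        # one; a gap larger than max_load still gets a partition of its own.
        if partitions[-1] and count + num_edges > max_load:
            partitions.append([])
            count = 0
        partitions[-1].append(i)
        count += num_edges
    return partitions
```

Because whole stage gaps are kept together, two neighbouring partitions share only the vertices of the boundary stage, which are the mirror vertices of the vertex-cut strategy.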

4.2. Local Shortest Path Calculation
After partitioning the graph, each computing node calculates the shortest path of the subgraph stored on it. The shortest path of each partition is referred to as the local shortest path, and the shortest path of the whole graph is referred to as the global shortest path.
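Theorem 1. If $\langle v_1, v_2, \ldots, v_k \rangle$ is one of the shortest paths from $v_1$ to $v_k$, then for any $1 \le i < j \le k$, the subpath $\langle v_i, \ldots, v_j \rangle$ is one of the shortest paths from $v_i$ to $v_j$.
Proof. We prove it by contradiction. Suppose $\langle v_i, \ldots, v_j \rangle$ is not one of the shortest paths from $v_i$ to $v_j$. Then there must exist a cheaper path from $v_i$ to $v_j$; let $P'$ be one of the shortest paths from $v_i$ to $v_j$. If we use $P'$ to replace $\langle v_i, \ldots, v_j \rangle$ in the path $\langle v_1, v_2, \ldots, v_k \rangle$, then the cost of the new path is less than that of $\langle v_1, v_2, \ldots, v_k \rangle$, so $\langle v_1, v_2, \ldots, v_k \rangle$ is not one of the shortest paths from $v_1$ to $v_k$. This is a contradiction.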
The above theorem shows that any part of a shortest path is also a shortest path, so the global shortest path is composed of local shortest paths of the partitions, as expressed by (5). Applying this decomposition across all partitions subsequently gives (6).
Based on the above equation, it is necessary to calculate the shortest paths between any pair of vertices of the first and last stages for each partition. Specifically, each computing node uses the idea of dynamic programming shown as (2) to solve the local shortest path.
Figure 5 presents the flow diagram, and Algorithm 3 presents the pseudocode. Note that this algorithm is run by each computing node in parallel. On each partition, it produces one of the shortest paths between every vertex of the partition's first stage and every vertex of its last stage. In this algorithm, c records the cost of the shortest path found so far, and f records the previous vertex on that path. The algorithm calculates the shortest paths in a forward way. Lines 1 to 4 initialize c and f for the vertices in the first stage. Lines 5 to 17 calculate the shortest paths by a triple loop: the outermost loop runs over all stages (line 5), and the middle and innermost loops (lines 6 and 7) run over the vertices involved at each stage. For each pair of vertices selected by the two inner loops, lines 8 to 14 calculate the shortest path between them, where lines 11 and 12 update c and f, respectively. Lines 18 to 29 generate the shortest paths from f by backtracking: for each pair of first-stage and last-stage vertices, the algorithm recovers the vertex sequence by following f back from the end vertex.
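The local computation on one partition can be sketched as follows: for every vertex of the partition's first stage, a forward dynamic programming sweep yields the cheapest path to every vertex of its last stage. This is a hedged illustration rather than Algorithm 3 itself; the data layout and the names `local_shortest_paths`, `stages`, and `weights` are assumptions of this sketch.

```python
import math


def local_shortest_paths(stages, weights):
    """Shortest paths from every first-stage vertex to every last-stage vertex.

    stages  -- list of lists of vertex ids for the stages of one partition
    weights -- dict mapping (u, v) to the weight of the edge u -> v
    Returns a dict mapping (start, end) to (cost, path).
    """
    results = {}
    for start in stages[0]:
        cost = {start: 0}     # cheapest known cost from start
        prev = {start: None}  # predecessor on that cheapest path
        for i in range(1, len(stages)):
            for v in stages[i]:
                best, best_u = math.inf, None
                for u in stages[i - 1]:
                    candidate = (cost.get(u, math.inf)
                                 + weights.get((u, v), math.inf))
                    if candidate < best:
                        best, best_u = candidate, u
                cost[v], prev[v] = best, best_u
        for end in stages[-1]:
            if cost.get(end, math.inf) < math.inf:
                # Backtrack through the predecessors to rebuild the path.
                path, v = [], end
                while v is not None:
                    path.append(v)
                    v = prev[v]
                results[(start, end)] = (cost[end], path[::-1])
    return results
```

Keeping one result per pair of boundary vertices is what later allows neighbouring partitions to be combined without revisiting their interior stages.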

4.3. Global Shortest Path Calculation
After the shortest paths of each partition are found, the global shortest path is calculated by exchanging messages among computing nodes. Figure 6 depicts a sketch of the merging procedure of local shortest paths. This is an iterative procedure. In each iteration, computing nodes communicate in pairs: the “left” computing node sends its local shortest paths to the “right” computing node, and the “right” computing node merges the two sets of local shortest paths. Finally, the last computing node gets the global shortest path.
Figure 7 presents the flow diagram, and Algorithm 4 presents the pseudocode. A set records the indices of the computing nodes participating in the combination of subresults in each iteration. Initially, it contains all computing nodes (line 1). Lines 2 to 5 initialize two sets for each computing node. Lines 6 to 30 calculate the global shortest path by exchanging messages among computing nodes, following the basic idea of Figure 6. Firstly, each “left” computing node sends its local shortest paths to its “right” neighbour (lines 7 to 9); this can be run in parallel for each pair of computing nodes. Secondly, each “right” computing node merges the two subresults to get a longer subresult (lines 10 to 28). Given a local shortest path from the “left” node (line 11) and one from the “right” node (line 12), if they can be merged, that is, the last vertex of the first path is the same as the starting vertex of the second path (line 13), the algorithm tries to merge them into a longer path from the starting vertex of the first path to the last vertex of the second. If no shortest path between these two endpoints has been recorded yet, the algorithm generates one and appends it to the result set (lines 14 to 17). If a recorded path exists but is longer than the newly merged one, the algorithm replaces it with the merged path (lines 18 to 21). After the two innermost loops (lines 11 to 25), the algorithm replaces the node's old result set with the merged one (line 26) and empties the temporary set (line 27) to prepare for the next iteration. Line 29 removes the “left” computing nodes from the set of participating nodes. Finally, the algorithm returns the remaining result as the global shortest path (line 31).
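The merging step can be sketched as a pairwise combination of neighbouring results, repeated until one result remains. The sketch below runs in a single process and only mimics the message exchange; the functions `merge` and `global_shortest_path` and the result layout (a dictionary from endpoint pairs to a cost and a path) are assumptions made for illustration, not the paper's Algorithm 4.

```python
def merge(left, right):
    """Join two result sets whose paths meet at a shared boundary vertex.

    left, right -- dicts mapping (start, end) to (cost, path)
    """
    merged = {}
    for (start, mid_left), (cost_l, path_l) in left.items():
        for (mid_right, end), (cost_r, path_r) in right.items():
            if mid_left != mid_right:
                continue  # the two paths do not connect
            total = cost_l + cost_r
            # Keep the cheapest path found so far for each (start, end) pair.
            if (start, end) not in merged or total < merged[(start, end)][0]:
                merged[(start, end)] = (total, path_l + path_r[1:])
    return merged


def global_shortest_path(local_results):
    """Combine per-partition results pairwise, from left to right."""
    results = list(local_results)
    while len(results) > 1:
        combined = [merge(results[k], results[k + 1])
                    for k in range(0, len(results) - 1, 2)]
        if len(results) % 2 == 1:
            combined.append(results[-1])  # unpaired node waits a round
        results = combined
    return results[0]
```

Since each round halves the number of participating results, the number of rounds grows logarithmically with the number of partitions, and only boundary-to-boundary paths are exchanged.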

4.4. An Example
Let us take Figure 2 as an example to demonstrate the process of the above algorithm. Given the capacity of the computing nodes, the number of partitions follows from (3), and ML is set according to (4). Based on Algorithm 2, the graph is divided into 3 partitions, as shown in Figure 8.
Figure 8: The three partitions of the graph in Figure 2, panels (a) to (c).
Next, each computing node calculates the local shortest path of its partition. Let us take as an example to show the process.(1)Initially, and of vertices of stage 4 are set to 0.(2)The vertices of calculate and . For example, and .(3)The vertices of calculate and . For example, and .(4)The vertices of calculate and . For example, and .(5)Finally, the vertices of backtrack to get the shortest paths. For example, to get the shortest path from to , it backtracks to in first because . Secondly, it backtracks to and then finally. Therefore, the shortest path from to is , and the cost is 14.
Similar to the above process, each computing node calculates the local shortest paths of its partition.
The last step is to merge the subresults of all partitions. sends to to get the shortest paths from to vertices of . Obviously, , and . Similarly, , , , , , and .
stores the shortest paths after merging, and it sends the new to . calculates and .
5. Experiments and Analysis
5.1. Experimental Setup
Because there are no public multistage graph datasets, we synthesize 5 datasets using Java on IntelliJ IDEA, where the number of vertices of each stage and the weights of edges are random values satisfying (1). Table 2 presents the basic data of these 5 datasets.
The shortest path algorithms are run on Hadoop [36] in conjunction with the Spark [37] computing engine. Spark is a fast and general-purpose computing engine designed for large-scale data processing [38]. The cluster consists of 8 computers, and each computer has a 4-core Intel processor, 8 GB memory, and 1 TB storage. The operating system is CentOS 7, and the distributed environment is built using Hadoop 2.7 and Spark 2.1.
In order to compare the performance of DMGA with existing algorithms, all graphs are divided into 8 partitions, which differs slightly from Algorithm 2, where the number of partitions depends on the scale of the graph. Hence, in the experiments, the ML of each partition is given by (7), and line 2 of Algorithm 2 uses (7) instead of (4).
5.2. Experimental Results
5.2.1. Partitioning Quality
At present, there is no partition algorithm for multistage graphs, so we compare the partition quality of the DMGA algorithm with that of the Hash partitioning algorithm. Hash is the default partitioning algorithm in many distributed graph processing systems, and it is the basis of most existing distributed algorithms for solving the shortest path problem.
For the vertex-cut partitioning method, the number of mirror vertices reflects the communication overhead. The fewer the mirror vertices, the lower the communication overhead and the shorter the corresponding calculation time. Figure 9 shows the number of mirror vertices generated by the two graph partitioning algorithms on the above 5 datasets. The results show that the numbers of mirror vertices of the two algorithms are almost the same for small datasets. As the scale of the dataset and the number of stages increase, the DMGA algorithm produces significantly fewer mirror vertices than the Hash algorithm. Specifically, the number of mirror vertices produced by the Hash algorithm for Graph_3, Graph_4, and Graph_5 is, respectively, 37.25, 24.97, and 69.9 times that of the DMGA algorithm.
This shows that the DMGA algorithm partitions multistage graphs better than the Hash algorithm in terms of mirror vertices. It can be deduced that the number of mirror vertices will be further reduced if (4) instead of (7) is used to guide the partition in Algorithm 2.
For a homogeneous cluster, which consists of computing nodes with the same configuration, the load of each computing node should be balanced as much as possible in order to reduce the calculation time. Table 3 shows the number of edges on each computing node for each dataset under the two graph partitioning algorithms. The Hash algorithm partitions the graph according to a Hash function that maps each vertex to the partition it is assigned to. Thus, the load of each partition is almost the same. The DMGA algorithm assigns all vertices in the same stage to the same partition, and the numbers of vertices in the stages differ, so the edge distribution is not balanced.
The average deviation and standard deviation are further analysed to check the loads among partitions, and they are shown in Figure 10. In this figure, AVEDEV and STDEV represent the average deviation and the standard deviation, respectively. We can see that the average and standard deviations of DMGA are 2.3 to 7.5 times those of Hash. Graph_3 has the maximum difference among all graphs: the average deviation of DMGA is 6.65 times that of Hash, and the standard deviation of DMGA is 7.51 times that of Hash. Graph_1 has the minimal difference among all graphs: the average deviation of DMGA is 2.79 times that of Hash, and the standard deviation of DMGA is 2.36 times that of Hash. These results also show that Hash produces a more balanced partition.
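The two balance measures can be computed from the per-partition edge counts as in the sketch below. It assumes that AVEDEV is the mean absolute deviation from the mean and that STDEV is the sample standard deviation (with n - 1 in the denominator), which is how spreadsheet functions of the same names are defined; the paper does not state the exact definitions, and the load values in the usage example are made up.

```python
import statistics


def avedev(loads):
    """Average absolute deviation of the loads from their mean."""
    mean = statistics.fmean(loads)
    return sum(abs(x - mean) for x in loads) / len(loads)


def stdev(loads):
    """Sample standard deviation of the loads (n - 1 in the denominator)."""
    return statistics.stdev(loads)


# Made-up edge counts for four partitions, for illustration only.
loads = [10, 12, 8, 10]
print(avedev(loads))  # 1.0
print(stdev(loads))
```

Lower values of both measures indicate that the edges are spread more evenly over the computing nodes.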
5.2.2. Comparison of Running Time
We compare the execution time of four algorithms: the single-machine algorithm, DMGA, the SSSP [7] algorithm embedded in Pregel on Spark, and parallel Dijkstra [39–41]. The single-machine algorithm uses the sequential dynamic programming algorithm. The experiments are run 10 times, and the results are the averages over these runs.
Table 4 presents the results, and Figure 11 shows the graphical comparison. It can be seen that the speedup of DMGA over the sequential algorithm grows as the scale of the dataset increases.
(i) Graph_1: the sequential algorithm is faster than DMGA because the communication among the computing nodes of DMGA takes up a lot of time.
(ii) Graph_2: the running times of these two algorithms are almost the same.
(iii) Graph_3 and Graph_4: the running time of the sequential algorithm is, respectively, 1.72 and 3.13 times that of DMGA.
(iv) Graph_5: the sequential algorithm encounters a “memory overflow” error, so this graph cannot be solved by a single machine. In contrast, DMGA can obtain the shortest path.
Furthermore, parallel Dijkstra has a longer running time than DMGA and SSSP, due to its high time and space complexity. SSSP uses Pregel’s default graph partition method, which does not take the structural features of multistage graphs into account, so it incurs a large amount of communication. SSSP needs a longer time for larger graphs, which implies that the communication overhead increases significantly with the scale of the graph. Specifically:
(i) Graph_2: SSSP has the shortest running time, and DMGA needs only 4 seconds more than SSSP, but parallel Dijkstra needs more than 30 times the running time of DMGA.
(ii) Other graphs: DMGA needs the shortest running time. The larger the scale of the graph, the more obvious the advantage of DMGA. For example, for Graph_5, the running time of DMGA is, respectively, 25.8% and 8.8% of that of SSSP and parallel Dijkstra.
The above results show that DMGA makes full use of the special structural characteristic of multistage graphs, and it has extremely low communication overhead.
6. Conclusion and Future Work
Nowadays, graph models are widely applied in many fields, and the scale of graphs is increasing significantly. The existing distributed graph computing systems cannot make full use of the special characteristics of multistage graphs. To this end, this paper proposes DMGA, which solves the shortest path problem of large-scale multistage graphs on a distributed computing system. DMGA consists of three phases: graph partition, local shortest path calculation, and global shortest path calculation. The experimental results demonstrate its high performance. However, the load of DMGA is not balanced, and it only considers the vertex-cut partitioning method. In the future, we will focus on reducing the number of mirror vertices and the solution time as much as possible under the premise of load balance, and we will propose a dedicated algorithm based on the edge-cut partitioning idea.
Data Availability
The experimental data used to support the findings of this study are available from the corresponding author upon request and also provided in the supplementary information files.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was supported in part by the National Key R&D Plan of China under Grant 2018YFC1406203, Shandong Postgraduate Tutor Guidance Ability Improvement Project under Grant SDYY17040, and Science and Technology Support Plan of Youth Innovation Team of Shandong Higher School under Grant 2019KJN024.