Abstract

Connecting things to the Internet makes it possible for smart things to access all kinds of Web services. However, smart things are energy-limited, and a suitable selection of Web services consumes fewer resources. In this paper, we study the problem of selecting Web services from a candidate set. We formulate this selection of Web services for smart things as a single-source many-target shortest path problem. We design algorithms based on Dijkstra's algorithm and breadth-first search, propose an efficient pruning algorithm for breadth-first search, and analyze their performance in terms of the number of iterations and cost. Our empirical evaluation on real-life graphs shows that the pruning algorithm is more efficient than the plain breadth-first search algorithm.

1. Introduction

Recent developments in RFID, sensor, and networking technologies make it possible to connect various physical-world things to the Internet, a setting usually called the Internet of Things (IoT). In the IoT, more and more devices are getting connected to the Internet, and the next step is to use the World Wide Web and its associated technologies, such as Web services, as a platform for smart things.

The Internet hosts a large number of Web services. These Web services connect with each other and form a graph topology. In the IoT, all smart things are connected to the Internet, so they can access all kinds of Web services. However, smart things are usually energy-limited, so selecting suitable Web services is extremely important. Guinard et al. [1] give an overview of Web service access in the Internet of Things. They propose a process and a suitable system architecture that enable developers and business process designers to dynamically query, select, and use running instances of real-world services.

In this paper, we study the problem of Web service selection in the IoT setting; that is, given a smart thing and a graph of Web services, how to select several suitable Web services from the candidate set while providing satisfactory quality of service. If the selected Web services are the nearest ones, the smart thing spends less time accessing and waiting and therefore consumes less power. We formulate the selection of Web services as a single-source many-target shortest path problem.

The shortest path problem can be classified as SSSTSP (single-source single-target shortest path), SSMTSP (single-source many-target shortest paths), APSP (all-pairs shortest paths), and SP (shortest paths). The MSSTSP (many-source single-target shortest paths) problem is equivalent to SSMTSP if we reverse all edges in the graph.
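
The edge-reversal argument can be illustrated with a small sketch. The adjacency-list representation and the helper names `reverse_edges` and `hop_distances` are assumptions made for this example only; they do not come from the paper. Running a single-source search from the target on the reversed graph yields the distance from every node to that target in the original graph.

```python
from collections import deque


def reverse_edges(graph):
    """Return a copy of the graph with every edge (u, v) turned into (v, u)."""
    reversed_graph = {u: [] for u in graph}
    for u, children in graph.items():
        for v in children:
            reversed_graph.setdefault(v, []).append(u)
    return reversed_graph


def hop_distances(graph, source):
    """Unweighted single-source distances (number of hops) via BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


# Hypothetical graph: many sources (a, b, c) and a single target t.
graph = {"a": ["t"], "b": ["a"], "c": ["b", "t"], "t": []}
to_target = hop_distances(reverse_edges(graph), "t")
print(to_target)  # {'t': 0, 'a': 1, 'c': 1, 'b': 2}
```

Each value is the number of hops from that node to `t` in the original graph, obtained with one single-source search.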

1.1. Our Contribution

In this paper, we study the SSMTSP (single-source many-target shortest paths) problem on MapReduce, that is, finding the shortest paths from one source node to the nodes of a candidate set. In our work, we do not care about the exact paths, but rather the nearest neighbors of the source node, so the SSMTSP problem can also be considered a single-source many-target nearest neighbors problem. SSMTSP can be used for recommending friends or ads in social networks and for searching for malls or hotels in road networks.

We design algorithms based on Dijkstra's algorithm and breadth-first search, propose an efficient pruning algorithm for breadth-first search, and analyze their performance in terms of the number of iterations and cost. Our empirical evaluation on real-life graphs on the Hadoop platform shows that the pruning algorithm is more efficient than the breadth-first search shortest path algorithm.

The rest of the paper is organized as follows. Section 2 gives the background on shortest paths in graphs, MapReduce, the Dijkstra algorithm, and breadth-first search. The SSMTSP problem and its corresponding algorithms are presented in Section 3. We analyze the performance of our algorithms in Section 4, show experimental results in Section 5, and review the related work in Section 6. Finally, the conclusion and future work are given in Section 7.

2. Background

In this section, we provide some background on shortest paths in graphs, MapReduce, the Dijkstra algorithm, and breadth-first search.

2.1. Shortest Path for Graphs

We consider a weighted directed graph $G = (V, E)$, where $V$ is the set of nodes, $E$ is the set of edges, and the numbers of nodes and edges are $n$ and $m$, respectively. We use $d$ to denote the diameter of $G$. For an edge $(u, v) \in E$, the weight is denoted by $w(u, v)$, and $u$ is $v$'s parent node and $v$ is $u$'s child node. A path $p$ in a graph is a sequence of nodes $(v_0, v_1, \ldots, v_l)$, where $v_i \in V$ and $(v_{i-1}, v_i) \in E$. The nodes $v_0$ and $v_l$ are called the start and target of $p$ and are linked by $p$. The length of $p$ between source and target is the sum of the weights of its edges, that is, $w(p) = \sum_{i=1}^{l} w(v_{i-1}, v_i)$, and the number of hops from source to target in $p$ is the number of edges, that is, $l$.

Definition 1. A path $p$ in graph $G$ from start $u$ to target $v$ is a shortest path if there does not exist any path $p'$ between $u$ and $v$ such that $w(p') < w(p)$.
To simplify the presentation, we assume for the rest of the paper that the weight on each edge of $G$ equals 1; that is, $w(u, v) = 1$ for every $(u, v) \in E$. This reduces $G$ to an unweighted graph but does not affect the result of our algorithms.

2.2. MapReduce

MapReduce [2], proposed by Google, is a programming model for processing huge amounts of data in parallel on a large number of commodity machines; Hadoop (http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/Federation.html) is its open-source implementation. By automatically handling lower-level issues such as job distribution, data storage, and fault tolerance, it allows programmers without any experience with parallel and distributed systems to easily use the resources of a large distributed system.

In MapReduce-like systems, programs run iteratively in three phases: Map, Shuffle, and Reduce. In the Map phase, the system reads a collection of values or key/value pairs from input sources in parallel and emits zero or more key/value pairs for each input element by invoking a user-defined Mapper function. In the Shuffle phase, it groups together all the Mapper-emitted key/value pairs sharing the same key and passes all groups to the next phase. In the Reduce phase, it invokes a user-defined Reducer function for each distinct group independently and in parallel and emits zero or more key/value pairs, which are written to disk or become the input of the Map phase in the following iteration. In our work, we use two basic primitives: Map and Reduce.
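
The three phases can be mimicked in a single process to make the data flow concrete. The following is only an illustrative sketch, not Hadoop code: the function `run_mapreduce_round` and the in-degree example are assumptions introduced here, and a real system would distribute each phase across machines.

```python
from collections import defaultdict


def run_mapreduce_round(records, mapper, reducer):
    """Run one Map -> Shuffle -> Reduce round sequentially."""
    # Map phase: each input record may emit zero or more key/value pairs.
    emitted = []
    for key, value in records:
        emitted.extend(mapper(key, value))

    # Shuffle phase: group all emitted values that share the same key.
    groups = defaultdict(list)
    for key, value in emitted:
        groups[key].append(value)

    # Reduce phase: each group is processed independently of the others.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def in_degree_mapper(node, children):
    # Emit one count for every edge that points at a child.
    for child in children:
        yield child, 1


def sum_reducer(node, counts):
    yield node, sum(counts)


# Hypothetical adjacency lists: a -> b, a -> c, b -> c.
adjacency = [("a", ["b", "c"]), ("b", ["c"])]
print(run_mapreduce_round(adjacency, in_degree_mapper, sum_reducer))
# [('b', 1), ('c', 2)]
```

The output of one round has the same key/value shape as its input, which is what allows iterative algorithms to chain rounds.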

2.3. Dijkstra Algorithm

The Dijkstra algorithm [3] is a classical algorithm for the single-source shortest path problem on graphs with positive weights. It computes the shortest paths between the source $s$ and every other node in a graph. The main idea is to start from $s$ and expand the node that has the shortest path to $s$, repeating the expanding process until all nodes are reached from $s$. In each expanding process, the algorithm expands one nearest neighbor, and it requires that the weight of each edge is positive.
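
For reference, a minimal sequential sketch of this expanding process is given below. It uses a binary heap, which is one common way to find the next nearest node; the graph literal and the function name `dijkstra` are illustrative assumptions and are unrelated to Figure 1.

```python
import heapq


def dijkstra(graph, source):
    """Shortest distances from source; graph maps u -> [(v, weight), ...]."""
    dist = {source: 0}
    settled = set()
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue  # stale heap entry for an already expanded node
        settled.add(u)  # one expanding process: u is the nearest unsettled node
        for v, w in graph.get(u, []):
            candidate = d + w
            if v not in dist or candidate < dist[v]:
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist


# Hypothetical weighted graph with positive weights.
graph = {
    "s": [("a", 1), ("b", 4)],
    "a": [("b", 2), ("c", 5)],
    "b": [("c", 1)],
    "c": [],
}
print(dijkstra(graph, "s"))  # {'s': 0, 'a': 1, 'b': 3, 'c': 4}
```

Each node is settled exactly once, so reaching four nodes takes four expanding processes here.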

Example 2. Given the graph in Figure 1, if we choose as the source, the distances between and other nodes will be , which can be done in 6 expanding processes.

2.4. Breadth-First Search

To find the shortest paths between the source $s$ and the other nodes in graph $G$, we can conduct a breadth-first search (BFS for short) starting from $s$ and obtain a BFS-tree $T$ [4]. To construct the BFS-tree $T$ of $s$, we first initialize $T$ as a tree having only the root node $s$ and then add the children of $s$ in $G$ to $T$. That is, if $(s, v) \in E$, we add $v$ to $T$ as a child of $s$. Then, iteratively, for each leaf node $u$ in the current $T$, we search all its children $v$ such that $(u, v) \in E$ and $v$ has not been added to $T$ yet, and add $v$ to $T$ as a child of $u$.
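
The construction can be sketched sequentially as follows. The tree is stored as a parent map together with each node's depth; the function name `bfs_tree` and the example graph are assumptions for illustration and do not correspond to Figures 1 and 2.

```python
from collections import deque


def bfs_tree(graph, source):
    """Build a BFS-tree as (parent, depth) maps; graph maps u -> [children]."""
    parent = {source: None}
    depth = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, []):
            if v not in parent:  # v has not been added to the tree yet
                parent[v] = u
                depth[v] = depth[u] + 1
                queue.append(v)
    return parent, depth


# Hypothetical unweighted directed graph containing a cycle (c -> s).
graph = {"s": ["a", "b"], "a": ["c"], "b": ["c", "d"], "c": ["s"], "d": []}
parent, depth = bfs_tree(graph, "s")
print(parent)  # {'s': None, 'a': 's', 'b': 's', 'c': 'a', 'd': 'b'}
print(depth)   # {'s': 0, 'a': 1, 'b': 1, 'c': 2, 'd': 2}
```

With unit weights, the depth of a node in the tree is its shortest distance from the source, and all nodes of the same depth can be expanded in the same round.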

Example 3. Also considering the graph in Figure 1, the BFS-tree of can be seen in Figure 2. If we choose as the source, the distances between and other nodes are also , and , but the BFS can finish in only 3 expanding processes.

3. SSMTSP

3.1. Problem Statement

Definition 4. Given a (weighted) directed graph $G = (V, E)$ and a candidate node set $C$ chosen from $V$, the SSMTSP problem is, for a source node $s$, to find $k$ nearest neighbors with shortest paths such that those nearest neighbors are in $C$.

In our work, we do not care about the exact paths, but rather the target nodes, also called nearest neighbors, so the SSMTSP problem can also be considered a single-source many-target nearest neighbors problem. The candidate set $C$ could be chosen from $V$ according to the nodes' importance in the graph, such as their PageRank scores [5].

3.2. DijkstraKNN Algorithm on MapReduce

On MapReduce, we keep a root set $R$ for source $s$. At the beginning, $R$ contains only $(s, 0)$, and at each iteration, we add a nearest neighbor $v$ and the shortest distance between $s$ and $v$ to $R$. As soon as $R$ contains $k$ nodes coming from $C$, or there is no out-link from $R$, the expanding process terminates. Then we can get the $k$ nearest neighbors through another Map phase. Details of the DijkstraKNN algorithm can be seen in Algorithms 1 and 2.

Algorithm 1: DijkstraKNN.
Input: A (weighted) graph $G$, candidate set $C$, the number $k$ of nearest neighbors, and source node $s$;
Output: A list of from , where ;
(1)let ;
(2)let ;
(3)/*******************************************
(4)Here we redefine “ ” as follows:
(5) and
(6)*******************************************/
(7)while do
(8) DijkstraMapper ;
(9) //find a nearest neighbour
(10)      DijkstraReducer();
(11)  end while
(12)   ;
(13)  return  

Algorithm 2: Mapper and Reducer of DijkstraKNN.
DijkstraMapper:
(1)for all   in   do
(2)if     then
(3)  emit ;
(4)end if
(5)end for;
DijkstraReducer  (output of DijkstraMapper):
(6)let ;
(7)for all     do
(8) //find a nearest neighbour not in ;
(9)if     then
(10)    ;
(11)   end if
(12)  end for;
(13)  return  
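
Because several symbols in Algorithms 1 and 2 are not legible, the following is only a hedged sketch of how the DijkstraKNN loop might look when each iteration is one simulated Map/Reduce round. The edge-list format `(u, v, w)`, the function names, the tie-breaking rule, and the choice not to count the source itself as a neighbor are all assumptions made for this example.

```python
def dijkstra_mapper(edges, root):
    """Emit a tentative distance for every edge that leaves the root set."""
    for u, v, w in edges:
        if u in root and v not in root:
            yield v, root[u] + w


def dijkstra_reducer(pairs):
    """Return the single nearest node outside the root set, or None."""
    best = None
    for node, dist in pairs:
        # Ties are broken by node name so the result is deterministic.
        if best is None or (dist, node) < (best[1], best[0]):
            best = (node, dist)
    return best


def dijkstra_knn(edges, candidates, k, source):
    root = {source: 0}  # root set: node -> shortest distance from source
    iterations = 0

    def found():
        return [v for v in root if v in candidates and v != source]

    while len(found()) < k:
        best = dijkstra_reducer(dijkstra_mapper(edges, root))
        if best is None:  # no out-link leaves the root set
            break
        root[best[0]] = best[1]
        iterations += 1

    nearest = sorted((root[v], v) for v in found())[:k]
    return [(v, d) for d, v in nearest], iterations


# Hypothetical unit-weight graph.
edges = [("s", "a", 1), ("s", "b", 1), ("a", "c", 1),
         ("b", "d", 1), ("c", "e", 1), ("d", "e", 1)]
print(dijkstra_knn(edges, {"c", "d", "e"}, 2, "s"))
# ([('c', 2), ('d', 2)], 4)
```

In this toy run, four rounds are needed because the two non-candidate nodes `a` and `b` must be added to the root set before any candidate is reached.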

3.3. BFSKNN and PruningBFSKNN Algorithms on MapReduce

The DijkstraKNN algorithm expands one nearest neighbor of the source $s$ in each iteration. However, we can expand all neighbors with the same number of hops at the same time, which gives the BFSKNN algorithm. In the BFSKNN algorithm, we iteratively expand all neighbors with the same number of hops from $s$ and update the shortest distances between $s$ and these nodes. Once the distances no longer change, we have obtained the shortest distances between $s$ and the other nodes in the graph. Details of the BFSKNN algorithm can be seen in Algorithm 3.

Algorithm 3: BFSKNN.
Input: A (weighted) graph $G$, candidate set $C$, the number $k$ of nearest neighbors, and source node $s$;
Output: A list of from , where ;
(1)let ;
(2)let ;
(3)while     do
(4) ;
(5) BFSMapper ;
(6) = BFSReducer();
(7)end while;
(8)//“ ” means the same as Algorithm 1
(9) ;
(10) return  

Unlike the Dijkstra algorithm, which expands the nearest neighbor in each iteration, the BFSKNN algorithm expands all nodes that have the same number of hops from $s$ (see Algorithm 4 for details). Their termination conditions are also different. The Dijkstra algorithm terminates once we have found $k$ nearest neighbors or $R$ has no out-link. The BFSKNN algorithm, in contrast, terminates when the distances between $s$ and the other nodes in $G$ no longer change.

Algorithm 4: Mapper and Reducer of BFSKNN.
BFSMapper:
(1)for all   in   do
(2) emit ;
(3)end for;
(4)for all   in   do
(5) emit ;
(6)end for;
BFSReducer  (output of BFSMapper):
(7)for all     do
(8) let ;
(9)for all     do
(10)   if     then
(11)     ;
(12)   end if
(13)  end for;
(14)  emit( );
(15) end for
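
The following sketch shows one plausible reading of Algorithms 3 and 4 as simulated Map/Reduce rounds: the mapper re-emits every node's current distance and proposes a distance along every edge, and the reducer keeps the minimum per node. Names, the `(u, v, w)` edge format, and the exclusion of the source from the result are assumptions of this example, not details confirmed by the source.

```python
import math


def bfs_mapper(dist, edges):
    # Re-emit current distances so that every node survives the round.
    for node, d in dist.items():
        yield node, d
    # Propose a distance along every edge whose tail has been reached.
    for u, v, w in edges:
        if dist[u] < math.inf:
            yield v, dist[u] + w


def bfs_reducer(pairs):
    best = {}
    for node, d in pairs:
        best[node] = min(best.get(node, math.inf), d)
    return best


def bfs_knn(nodes, edges, candidates, k, source):
    dist = {v: math.inf for v in nodes}
    dist[source] = 0
    iterations = 0
    while True:
        new_dist = bfs_reducer(bfs_mapper(dist, edges))
        iterations += 1
        if new_dist == dist:  # no distance changed: terminate
            break
        dist = new_dist
    reached = sorted((d, v) for v, d in dist.items()
                     if v in candidates and v != source and d < math.inf)
    return [(v, d) for d, v in reached[:k]], iterations


nodes = ["s", "a", "b", "c", "d", "e"]
edges = [("s", "a", 1), ("s", "b", 1), ("a", "c", 1),
         ("b", "d", 1), ("c", "e", 1), ("d", "e", 1)]
print(bfs_knn(nodes, edges, {"c", "d", "e"}, 2, "s"))
# ([('c', 2), ('d', 2)], 4)
```

The iteration count includes the final round that only detects that nothing changed; the distances themselves are complete after three rounds in this example.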

Theorem 5. The BFSKNN algorithm terminates in a finite number of iterations, and the output is a valid solution to the SSMTSP problem.

Proof. Given the graph $G$, construct a BFS-tree $T$ rooted at source $s$ in the following steps: (1) let $s$ be the root of $T$; (2) find all out-links for each leaf node of $T$ in $G$ and add them to $T$ as children of that leaf node; (3) repeat Step 2 until the path from the root to a leaf node forms a loop in $G$.
So there is a simple path from $s$ to $v$ in $G$ if and only if there is a path from the root $s$ to a leaf node $v$ in $T$. The above steps finish in a finite number ($d$) of iterations, so the BFSKNN algorithm terminates in a finite number of iterations. After $d$ iterations, we have found all the shortest paths between $s$ and the other nodes in $G$, so from the $(d+1)$-th iteration on, the shortest paths between $s$ and the other nodes no longer change, and the $k$ nearest neighbors from $C$ form a valid solution to the SSMTSP problem.

However, in each expanding process of BFSKNN, all nodes expand along their out-links, which causes a path expansion problem because all edges have to be accessed. To prune edges that do not need to be accessed, we keep a list cList for each node, which records the nodes from the candidate set on the current path. If the cList of a path contains $k$ nodes from $C$, we prune the paths derived from that path, because they do not contain any useful information about the $k$ nearest neighbors. Details can be seen in Algorithms 5 and 6.

Algorithm 5: PruningBFSKNN.
Input: A (weighted) graph $G$, candidate set $C$, the number $k$ of nearest neighbors, and source node $s$;
Output: A list of from , where ;
(1)let ;
(2)let ;
(3)while     do
(4) ;
(5) PruningBFSKNNMapper ;
(6) = PruningBFSKNNReducer();
(7)end while;
(8)/*******************************************
(9)Here we redefine “ ” as follows:
(10) and
(11) *******************************************/
(12) ;
(13) return  

Algorithm 6: Mapper and Reducer of PruningBFSKNN.
PruningBFSKNNMapper:
(1)for all   in   do
(2) emit ;
(3)end for;
(4)for all   in   do
(5)if     then
(6)  return;
(7)else if     then
(8)  
(9)end if;
(10)  emit ;
(11)  end for;
PruningBFSKNNReducer (output of PruningBFSKNNMapper):
(12) for all     do
(13)  Let ;
(14)  for all     do
(15)   if     then
(16)     ;
(17)   else if     then
(18)     ;
(19)   end if;
(20)  end for;
(21)  emit( );
(22) end for
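
The pruning idea can be sketched in the same simulated Map/Reduce style. Each node carries a pair (distance, cList), where cList holds the candidate nodes on the path that produced the distance, and a node whose cList already holds k candidates is not expanded further. This is an interpretation under stated assumptions: the names, the `(u, v, w)` edge format, the tie-breaking between equally short paths, and the exclusion of the source from the candidates are not taken from Algorithms 5 and 6.

```python
import math


def pruning_bfs_mapper(state, edges, targets, k):
    # Re-emit every node's current (distance, cList) pair.
    for node, entry in state.items():
        yield node, entry
    for u, v, w in edges:
        dist_u, clist_u = state[u]
        if dist_u == math.inf:
            continue  # u has not been reached yet
        if len(clist_u) >= k:
            continue  # prune: the path to u already holds k candidates
        clist_v = clist_u + (v,) if v in targets else clist_u
        yield v, (dist_u + w, clist_v)


def pruning_bfs_reducer(pairs):
    best = {}
    for node, entry in pairs:
        # Smaller distance wins; equal distances are ordered by cList.
        if node not in best or entry < best[node]:
            best[node] = entry
    return best


def pruning_bfs_knn(nodes, edges, candidates, k, source):
    targets = set(candidates) - {source}
    state = {v: (math.inf, ()) for v in nodes}
    state[source] = (0, ())
    while True:
        new_state = pruning_bfs_reducer(
            pruning_bfs_mapper(state, edges, targets, k))
        if new_state == state:
            break
        state = new_state
    reached = sorted((d, v) for v, (d, _) in state.items()
                     if v in targets and d < math.inf)
    return [(v, d) for d, v in reached[:k]], state


nodes = ["s", "a", "b", "c", "d", "e"]
edges = [("s", "a", 1), ("a", "b", 1), ("b", "c", 1),
         ("c", "d", 1), ("s", "e", 1)]
nearest, state = pruning_bfs_knn(nodes, edges, {"a", "b", "d"}, 2, "s")
print(nearest)     # [('a', 1), ('b', 2)]
print(state["d"])  # (inf, ())
```

In this example the path through `a` and `b` already contains two candidates, so `c` and `d` are never expanded, while the answer is the same as an unpruned search would give.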

Example 6. Given the graph in Figure 1, candidate set and , we can prune the BFStree of in Figure 2, and the results are in Figure 3.

Theorem 7. The PruningBFSKNN algorithm terminates in a finite number of iterations, and the output is a valid solution to the SSMTSP problem.

Proof. Constructing a BFS-tree $T$ as in the proof of Theorem 5, we have that there is a simple path from source $s$ to target $v$ in $G$ if and only if there is a path from the root $s$ to a leaf node $v$ in $T$. We prune every path that contains $k$ candidate nodes from $C$ after its $k$th candidate node. There are two kinds of pruned nodes: normal nodes (not belonging to $C$) and candidate nodes. If a pruned node is a candidate node, it cannot be one of the $k$ nearest neighbors, because we have already found $k$ nearest neighbors on that path. Otherwise, if it is a normal node, whether it appears in the pruned paths or not affects neither the termination nor the correctness of the algorithm. So the PruningBFSKNN algorithm terminates in a finite number of iterations, and the output is a valid solution to the SSMTSP problem.

4. Performance Analysis

In this section, we analyze and compare the number of iterations and cost of the DijkstraKNN, BFSKNN, and PruningBFSKNN algorithms. Realistic analysis of the efficiency of MapReduce algorithms is not straightforward, because the algorithms’ efficiency in practice depends on many other factors, such as distribution of data, scheduling of jobs, and proximity of the communicating machines. However, these factors are all controlled by the system, and only the number of iterations and cost can be controlled by users’ MapReduce programs.

4.1. The Number of Iterations

Theorem 8. The Dijkstra algorithm with parameters $C$ and $k$ finishes in $kn/|C|$ MapReduce iterations.

Proof. Since $C$ is chosen from $V$, when we expand a node, the probability that it belongs to $C$ is $|C|/n$. To obtain $k$ nearest neighbors, we therefore need $kn/|C|$ expanding processes in expectation, so the Dijkstra algorithm finishes in $kn/|C|$ MapReduce iterations.
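
As a worked instance of this counting argument, take hypothetical values that are not from the paper's datasets: a graph with $n = 100000$ nodes, a candidate set of size $|C| = 5000$, and $k = 10$ requested neighbors. Under the assumption that each expanded node is equally likely to be any node of the graph:

```latex
\Pr[\text{expanded node} \in C] = \frac{|C|}{n} = \frac{5000}{100000} = 0.05,
\qquad
\text{expected expansions} = \frac{k\,n}{|C|} = \frac{10 \cdot 100000}{5000} = 200 .
```

Shrinking the candidate set to 500 nodes with the same $n$ and $k$ raises the estimate to 2000 expansions, which matches the intuition that a smaller candidate set makes the Dijkstra-style search run longer.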

Theorem 9. The BFSKNN and PruningBFSKNN algorithms with parameters $C$ and $k$ finish in $O(d)$ MapReduce iterations.

Proof. Construct a BFS-tree $T$ as in the proof of Theorem 5. The depth of $T$ equals the diameter $d$ of $G$, so the BFSKNN and PruningBFSKNN algorithms terminate in at most $d$ expanding processes and hence finish in at most $d$ MapReduce iterations.

4.2. Cost Analysis

We start by analyzing the cost of the DijkstraKNN algorithm. The DijkstraKNN algorithm terminates in $kn/|C|$ expanding processes. In each expanding process, the whole graph is read as input, so the total cost is the per-process input size multiplied by the number of expanding processes.

Next, we analyze the cost of the BFSKNN algorithm. The BFSKNN algorithm terminates in at most $d$ expanding processes. In each expanding process, the whole graph is again read as input, so the total cost is the per-process input size multiplied by at most $d$.

Finally, we analyze the cost of the PruningBFSKNN algorithm. The PruningBFSKNN algorithm also terminates in at most $d$ expanding processes. In the early expanding processes, the input is the same as for BFSKNN, but in the later expanding processes we prune unnecessary paths, so the total cost that we save is the cost of the pruned paths accumulated over those later processes.
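
To make the comparison tangible, the sketch below turns the analysis into a rough cost model. It rests on a simplifying assumption that is ours, not the paper's exact formula: each expanding process costs about $n + m$ (one pass over the nodes and edges). The function names and the numbers are purely illustrative.

```python
def dijkstra_knn_cost(n, m, num_candidates, k):
    """Estimated cost: (k * n / |C|) expanding processes, each reading n + m."""
    iterations = k * n / num_candidates
    return iterations * (n + m)


def bfs_knn_cost(n, m, diameter):
    """Estimated upper bound: at most `diameter` processes, each reading n + m."""
    return diameter * (n + m)


# Hypothetical graph: 1000 nodes, 5000 edges, diameter 8, |C| = 100, k = 5.
print(dijkstra_knn_cost(1000, 5000, 100, 5))  # 300000.0
print(bfs_knn_cost(1000, 5000, 8))            # 48000
```

Under this model the Dijkstra-style algorithm is cheaper only when $kn/|C|$ is smaller than the diameter, that is, for small $k$ and large candidate sets.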

5. Experiments

In this section, we present the results of the experiments that we did to test the performance of our algorithms.

5.1. Experimental Setup

Dataset. In order to demonstrate the robustness of our methods and to show their performance on realistic data, we present experiments with two real-world datasets, Epinions social network [6] and LiveJournal social network [7, 8]. Summary statistics about these datasets are presented in Table 1.

Experimental Platform. We implement the DijkstraKNN, BFSKNN, and PruningBFSKNN algorithms in Java on top of the Hadoop platform. Our experiments are executed on a cluster of 20 nodes, where each node is a commodity machine with a 2.16 GHz Intel Core 2 Duo CPU and 1 GB of RAM, running CentOS v6.0.

5.2. Experimental Results

To choose the candidate sets, we compute the PageRank scores [5] of each graph using Pegasus [9] and select the top 5,000 nodes as the candidate set. As can be seen from Figure 4, the PageRank scores of both graphs conform to a power law, which means that only a small number of nodes are important. In advertising, the candidate set usually contains the entities from which we gain business profit.
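
Selecting the candidate set from precomputed scores is a simple top-N selection. The sketch below assumes the PageRank scores are already available as a dictionary; the helper name `top_n_candidates` and the score values are hypothetical, and computing the scores themselves (done with Pegasus in our experiments) is outside its scope.

```python
import heapq


def top_n_candidates(scores, n):
    """Return the n nodes with the highest scores as a candidate set."""
    return set(heapq.nlargest(n, scores, key=scores.get))


# Hypothetical PageRank scores for a four-node graph.
scores = {"a": 0.4, "b": 0.1, "c": 0.3, "d": 0.2}
print(sorted(top_n_candidates(scores, 2)))  # ['a', 'c']
```

Using `heapq.nlargest` avoids sorting all nodes when only a small fraction of the graph is kept.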

We choose the top 50, 500, and 5000 nodes by PageRank score from the Epinions social network and the top 5000 nodes by PageRank score from the LiveJournal social network and compare the performance of the above algorithms. We randomly select 20 source nodes from each graph and compute the average execution time.

As can be seen from Figure 5, the execution time of the DijkstraKNN algorithm changes greatly with the size of the candidate set, and a bigger candidate set leads to a smaller execution time. By comparison, the BFSKNN and PruningBFSKNN algorithms have more stable execution times. We also plot all results for candidate sets of size 500 and 5000 in Figures 6(a) and 6(b), which show that the execution time grows linearly with the number of iterations. This is because the inputs are the same in each iteration, and the processing time per iteration is about 37 seconds. However, the execution time also grows nearly linearly with $k$, which is against our intuition that it should grow exponentially because the number of nodes grows exponentially. The reason is that when we search in a graph, we usually find nodes with high PageRank scores rather than fringe nodes.

We also compare the stability of the above algorithms; details are in Figure 7. The execution time of the DijkstraKNN algorithm changes greatly with the choice of source, because nodes with more influential neighbors need fewer iterations. The BFSKNN and PruningBFSKNN algorithms have more stable execution times because their execution time depends on the radius of the graph for the source, and the radius of a graph is more stable in practice.

For the LiveJournal social network, we choose the top 5000 nodes by PageRank score as the candidate set and compare the performance of the above algorithms. The result is in Figure 8. The DijkstraKNN algorithm is efficient for small $k$, but if the candidate set is smaller and $k$ is bigger, the PruningBFSKNN algorithm is a better choice.

6. Related Work

In this section, we review some related work on MapReduce, shortest paths, and entity recommendation for graphs.

MapReduce. MapReduce algorithms have been designed or proposed in the literature for a wide range of applications, such as machine learning [10], text processing [11], and bioinformatics [12, 13]. MapReduce provides an excellent tool for large graph processing as well; a number of graph processing systems have been designed on it in the literature, such as Pegasus [9] and Giraph (http://incubator.apache.org/giraph/). In this paper, we study one of the most well-known graph computation problems, that is, computing the shortest paths or the nearest neighbors.

The Shortest Paths. Computing the shortest paths is a basic and important primitive that lies at the core of graph-related problems. The shortest path problem can be classified as SSSTSP, SSMTSP, APSP, and SP, and classical algorithms such as the Dijkstra algorithm [3] and the Floyd-Warshall algorithm [14] can only handle small graphs in a serial computation model. As graphs become larger and larger, researchers deal with this problem by approximation [15-17], preprocessing [18, 19], or parallelization in order to answer queries in a timely manner. Current parallel algorithms for the shortest path problem are mainly based on searching the BFS-tree of the graph [18, 20, 21], but they either study the problem of SP or focus on APSP. In this paper, we study the problem of SSMTSP, which is a composition of the above two, design a BFS-based parallel algorithm on MapReduce, and propose a corresponding pruning strategy.

Entity Recommendation. As complex networks, such as social networks and web-page graphs, become more and more popular, entity recommendation on graphs has become a hot research topic. Current entity recommendation methods are mainly content-based or link-based. Content-based entity recommendation algorithms consider only the content of entities and recommend the most similar ones for some entity (refer to [22, 23] for details). Link-based entity recommendation algorithms consider the structure of the graphs and recommend the most influential ones for some entity. The influence of entities can be defined by PageRank [5], HITS [24], SimRank [25], or methods derived from them. In this paper, from the perspective of influence, we consider the nearest neighbors as the most influential entities and recommend the nearest neighbors from the selected candidate set for some entity.

7. Conclusion and Future Work

In this paper, we study the problem of single-source many-target shortest paths for graphs, that is, finding nearest neighbors from a candidate set for one source. To the best of our knowledge, this is the first study of recommending nearest neighbors from a selected candidate set. As the MapReduce programming model becomes more and more popular in data-intensive computing, we evaluate our algorithms on its open-source implementation, Hadoop. We design algorithms based on Dijkstra's algorithm and breadth-first search, propose an efficient pruning algorithm for breadth-first search, and evaluate their performance.

As graphs become bigger and bigger, exact algorithms need much time to compute several nearest neighbors, and some fringe nodes may waste even more computing resources. In the future, we plan to seek approximation algorithms that can handle this problem.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.