Abstract

Community structure plays a key role in analyzing network features and helps people dig out valuable hidden information. However, discovering hidden community structures is one of the biggest challenges in social network analysis, especially when the network size swells to a high level. Infomap is a top-class algorithm for nonoverlapping community structure detection, but it is designed for a single processor: when tackling large networks, its limited scalability keeps it from fully utilizing server resources. In this paper, based on infomap, we develop a scalable parallel nonoverlapping community detection method, Pinfomr (parallel Infomap with MapReduce), which utilizes the MapReduce framework to solve both problems. Experiments on artificial networks and real data sets show that our parallel method has satisfactory performance and scalability.

1. Introduction

A few common properties have been discovered in many complex networks: the small-world property, the scale-free feature, and the community structure pattern [1–4]. Community structure plays a key role in the formation and function of these networks [5]. However, uncovering it remains a grave challenge in the study of complex systems [6].

Current social networks have grown to millions or even billions of nodes [7]. Take Facebook for example: its monthly active users have reached the billion level [8]. However, due to computational costs, traditional community detection algorithms are unable to tackle such huge complex networks. It is therefore necessary to implement a fast and scalable approach to detect communities in big social networks.

Network partitioning is NP-complete [9]. Partitioning a network into approximately equal sized components while minimizing the number of edges between different components is extremely important in parallel computing [10]. For example, parallelizing many applications involves assigning data or processes evenly to processors while minimizing communication traffic. However, when the network size reaches a certain level, direct segmentation of the original network is not realistic, and traditional algorithms also suffer from slow convergence.

Nowadays, mainstream servers are configured with high-performance hardware. Empirical studies [11] have shown that infomap [12] is a top-class standalone algorithm for nonoverlapping community detection. However, the processing capability of a single core has hit a technological bottleneck, and the scalability of infomap suffers as a consequence, because it utilizes only one core or processor of the server. Running infomap on a multiprocessor server therefore also wastes computing resources. How to improve the scalability of infomap and make full use of servers is thus a pressing problem.

Information science is shifting from computing-intensive to data-intensive [13] with the advent of the era of big data. Among the novel parallel computing frameworks, MapReduce [14] is one of the best known. In this paper, based on our previous work [15], we present a new scalable parallel community detection method coalescing several existing techniques: infomap, k-shell decomposition, multilevel network partitioning, and MapReduce. A high-level description of our approach is as follows. First, we divide the whole network into a number of partitions, where the number of partitions is far less than that of community structures; to speed up this process, we develop an enhanced multilevel partitioning method. Next, with MapReduce, we mine the community structures simultaneously within the partitions. Finally, we collect the community structures together to form the final result.

The main contributions of this paper are as follows: we propose a new model to mine community structure in big social networks; we integrate k-shell decomposition theory with the multilevel k-way partitioning algorithm to deal with peripheral nodes; and we implement a scalable, parallel infomap to uncover community structures and to improve the resource utilization rate.

The rest of this paper is organized as follows. Section 2 briefly reviews some concepts and background information. Section 3 provides the problem statement and a detailed description of the parallel community detection method. In Section 4, we conduct a series of experiments to evaluate the performance of the proposed method. Finally, Section 5 provides some concluding remarks and outlines future research directions.

2. Preliminary Knowledge

2.1. Relevant Concepts

In this paper, we only study undirected networks, which can be mathematically described as $G = (V, E)$, consisting of node set $V$ and edge set $E$; $n = |V|$ represents the number of nodes, $v_i$ represents a node, and $d(v_i)$ means its degree; $m = |E|$ represents the number of edges and $e_{ij}$ is the edge between $v_i$ and $v_j$, where $e_{ij} \in E$.

Infomap is based on information theory, so some information-theoretic concepts are briefly reviewed here. In information theory, the information contained in a distribution is called entropy. For a discrete random variable $X$ with a probability distribution $p(x)$, its entropy is
$$H(X) = -\sum_{x} p(x) \log p(x).$$

Mutual information measures the shared information between two distributions, $X$ and $Y$. We define $p(x, y)$ as the joint probability of $X$ and $Y$; $p(x)$ and $p(y)$ are defined as the marginal probability distributions of $X$ and $Y$, respectively. Then, the mutual information of $X$ and $Y$ is
$$I(X; Y) = \sum_{x}\sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}.$$

Normalized mutual information (NMI) is often used for evaluating clustering results, information retrieval, feature selection, and so forth. The value range of NMI is $[0, 1]$, and when $X$ and $Y$ are identical, NMI equals 1.0. Consider
$$\mathrm{NMI}(X, Y) = \frac{2\,I(X; Y)}{H(X) + H(Y)}.$$
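For concreteness, the following is a minimal Python sketch of this NMI formula applied to two flat label assignments; the function name nmi and the toy label lists are our own illustrations, not part of any benchmark code.

import math
from collections import Counter

def nmi(labels_a, labels_b):
    """NMI(A, B) = 2 I(A;B) / (H(A) + H(B)) for two flat clusterings."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    cab = Counter(zip(labels_a, labels_b))
    def H(counter):
        # entropy of a label distribution given raw counts
        return -sum(c / n * math.log(c / n) for c in counter.values())
    # I(A;B) = sum p(a,b) log( p(a,b) / (p(a) p(b)) )
    I = sum(c / n * math.log(c * n / (ca[a] * cb[b]))
            for (a, b), c in cab.items())
    denom = H(ca) + H(cb)
    return 2 * I / denom if denom > 0 else 1.0

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))   # identical partitions -> 1.0
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))   # independent partitions -> 0.0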

2.2. k-Shell Decomposition Theory

k-shell decomposition is a well-established method for analyzing the structure of large-scale networks [16–18]. In particular, it provides a method for identifying hierarchies in a network. The underlying assumption is that the importance of a node is determined not by its degree but by its location. The process assigns an integer index, $k_s$, to each node, representing its location within the successive layers (k-shells) of the network. The index is a robust measure, and the node ranking is not significantly influenced by incomplete information. The $k$-core of a network $G$ is the maximal subnetwork of $G$ in which every node has degree no less than $k$. The $k$-shell of $G$ is the set of all nodes belonging to the $k$-core of $G$ but not to the $(k+1)$-core.

Nodes are assigned to k-shells based on their remaining degree, which is obtained by successive pruning of nodes with degree smaller than the value of the current layer. The decomposition process starts by removing all nodes with degree $d = 1$. After that, some nodes may be left with one link, so we prune the system iteratively until there is no node left with $d = 1$ in the network. The removed nodes, along with the corresponding links, form a shell with index $k_s = 1$. In a similar fashion, we iteratively remove the next k-shell ($k_s = 2$) and continue removing higher shells until all nodes are removed. As a result, each node is associated with one $k_s$ index, and the network can be viewed as the collection of all shells. The $k_s$ value of a node can be very different from its degree: in Figure 2, a node may have many neighbors and still receive a low $k_s$ value. Figure 5 shows the result of processing the peripheral nodes of Figure 2.
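The pruning procedure can be summarized by the following Python sketch. Adjacency is kept as plain dictionaries, and the helper name k_shell is our own choice (networkx's core_number computes the same index).

def k_shell(adj):
    """Iterative pruning k-shell decomposition.
    adj: dict node -> set of neighbors (undirected); returns node -> k_s."""
    deg = {v: len(n) for v, n in adj.items()}
    ks = {}
    remaining = set(adj)
    k = 0
    while remaining:
        peel = [v for v in remaining if deg[v] <= k]
        if not peel:
            k += 1                      # current shell exhausted, move up
            continue
        for v in peel:                  # prune; may expose new shell members
            ks[v] = k
            remaining.discard(v)
            for u in adj[v]:
                if u in remaining:
                    deg[u] -= 1
    return ks

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(k_shell(adj))   # {4: 1, 1: 2, 2: 2, 3: 2}: pendant node 4 is in shell 1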

2.3. Multilevel k-Way Partitioning Method

Partitioning the node set $V$ of a network $G$ into $k$ disjoint subsets is called a $k$-way partitioning of $G$. Each subset and the edges within it constitute a partition of $G$. Figure 1 shows a simple network whose communities are surrounded by dotted circles, together with its partitions.

Definition 1 (effective edge lost ratio). An edge whose endpoints are in the same community, that is, an intracommunity edge, is called an effective edge. If the endpoints of an effective edge are divided into different partitions, then we call it an effective edge lost. The effective edge lost ratio is the number of effective edges lost divided by the total number of edges in the network.

In Figure 1, an intracommunity edge cut by a partition boundary is an effective edge lost, whereas an intercommunity edge cut by a boundary is not. It is apparent that effective edges lost play a more important role in community detection than edges that connect nodes in different communities and are cut off by partition boundaries.
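Definition 1 translates directly into code. The sketch below, with names of our choosing, counts the effective edges lost over all edges, given ground-truth communities and a partition assignment.

def effective_edge_loss_ratio(edges, community, partition):
    """Fraction of all edges that are intracommunity (effective) edges
    whose endpoints fall into different partitions (Definition 1).
    edges: iterable of (u, v); community, partition: dict node -> id."""
    edges = list(edges)
    lost = sum(1 for u, v in edges
               if community[u] == community[v]        # effective edge
               and partition[u] != partition[v])      # cut by partitioning
    return lost / len(edges)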

A number of high-quality and computationally efficient graph partitioning methods have been proposed; multilevel graph partitioning algorithms [9, 19, 20] are currently considered state-of-the-art and are extensively used. Here, we optimize the multilevel k-way partitioning method proposed by Abou-Rjeili and Karypis [21] to partition power law networks.

From Figure 3, we can see that the multilevel k-way partitioning method consists of a coarsening phase, an initial partitioning phase, and an uncoarsening and refinement phase. Instead of trying to partition the original graph directly, multilevel partitioning algorithms first obtain a sequence of successive approximations of it in the coarsening phase, each smaller than the previous one. This process continues until a level of approximation is reached where the graph contains only a few hundred nodes. At this point, partitioning algorithms compute partitions of that small graph, corresponding to the small partitions produced in the initial partitioning phase; since the graph is quite small, even simple algorithms, such as Kernighan-Lin (K-L) [22], can handle it and obtain reasonably good results. A parameter is used to control the balance of the partitions. The final step, the uncoarsening and refinement phase, maps the partitions of the smallest graph back onto the original graph and derives the final partitions.
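As an illustration of a single coarsening level, the following Python sketch collapses a random matching into supernodes. It is a simplification of our own for exposition: real multilevel tools such as METIS use weighted heavy-edge matching and track node and edge weights across levels.

import random

def coarsen_once(adj, seed=0):
    """One coarsening level via random matching: collapse each matched
    node pair into a supernode. adj: dict node -> set of neighbors."""
    rng = random.Random(seed)
    nodes = list(adj)
    rng.shuffle(nodes)
    match = {}
    for v in nodes:
        if v in match:
            continue
        free = [u for u in adj[v] if u not in match and u != v]
        match[v] = rng.choice(free) if free else v    # unmatched stays alone
        match[match[v]] = v
    # map every matched pair onto one supernode id
    super_of = {v: min(v, match[v]) for v in adj}
    coarse = {}
    for v, nbrs in adj.items():
        s = super_of[v]
        coarse.setdefault(s, set()).update(
            super_of[u] for u in nbrs if super_of[u] != s)
    return coarse, super_of

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
coarse, mapping = coarsen_once(adj)
print(coarse, mapping)   # roughly half as many supernodes per level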

2.4. Infomap

In this paper, we continue our work on the information-theoretic community detection model, infomap. First, we briefly review the model. It exploits the duality between compressing a data set and detecting and extracting significant patterns or structures within it, a statistical concept known as minimum description length [23]. A random walk, represented as a Markov process, is used as a proxy for information flow. In a community-structured network, when a random walker enters a community, it tends to stay there for a long time, and the probability of moving out into another community is low.

In an undirected network, the random walk has a state $X_t$ at time $t$, indicating where it is. In the next step, $t + 1$, the walker moves to a node $X_{t+1}$ chosen at random from the neighbors of $X_t$. To describe the state of the random walker, a two-level description model with Huffman coding is proposed: the first level encodes the communities and the second level encodes the nodes within the communities. We can then use "community ID + node ID" to uniquely describe a particular node in the network. Huffman codes are a prefix-free coding scheme and are optimally efficient for symbol-by-symbol encoding. They save space by assigning short codewords to common events or objects and long codewords to rare ones, just as Morse code uses short codes for common letters and longer codes for rare ones. The path of the random walker can thus be described as a coding sequence.
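The following self-contained Python sketch builds such a prefix-free Huffman code with a heap; the symbols and weights are illustrative only, standing in for node visit frequencies.

import heapq
from itertools import count

def huffman_code(freq):
    """Standard Huffman coding: frequent symbols get short prefix-free
    codewords. freq: dict symbol -> probability or weight."""
    tiebreak = count()                   # avoids comparing dicts on ties
    heap = [(w, next(tiebreak), {s: ""}) for s, w in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)  # merge the two lightest subtrees
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

# nodes visited often by the random walk receive the shortest codewords
print(huffman_code({"a": 0.45, "b": 0.25, "c": 0.2, "d": 0.1}))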

Figure 4 is an example illustrating the two-level description method. Assuming the communities are divided as in Figure 4(b), the code sequence for the random walk in Figure 4(a) is "0 111 00 10 111 010 10 011 110 00 10 1101011 1 00 01 10 00 11 10 01 1." The underlined word "0" in bold means the random walk starts from C1. The underlined word "1011 1" in bold means the random walk leaves C1 and enters C2. The division in Figure 4(c) requires more bits per step than the one in Figure 4(b): the community division in Figure 4(b) is obviously more reasonable, and the average description length under it is shorter. From the perspective of information theory, we know that smaller entropy corresponds to smaller uncertainty; corresponding to community detection, smaller entropy means less indistinctness, that is, a clearer community structure.

3. Parallel Community Detection Method

3.1. Problem Statement

Assume there is an optimal community division, $M^{\ast}$, of a community-structured network $G$. With $M^{\ast}$, the network is divided into $c$ communities, and the lower limit of the average description code length is $L(M^{\ast})$. According to the Shannon source coding theorem [24] and Kraft's inequality [25], we know that, for any division pattern $M$, the average codeword length per source symbol, $L(M)$, of an optimal prefix-free code satisfies
$$L(M) \ge L(M^{\ast}).$$

Obviously, simulating an endless random walk on a network to obtain $L(M)$ is not realistic. Fortunately, an endless random walk on a network yields a steady visit frequency for each node, which can be calculated easily with many methods, such as PageRank. With the steady visit frequency distribution, we can describe the state of the random walker directly. For a division $M$, the probability that $X_t$ and $X_{t+1}$ stay in the same community $i$ is $p_{\circlearrowright}^{i}$, and the probability of moving between different communities is $q_{\curvearrowright}$, where $X_t$ belongs to community $i$. Then $L(M)$ can be described as
$$L(M) = q_{\curvearrowright} H(\mathcal{Q}) + \sum_{i=1}^{c} p_{\circlearrowright}^{i} H(\mathcal{P}^{i}),$$

where $q_{\curvearrowright} = \sum_{i=1}^{c} q_{i\curvearrowright}$ means the probability of moving out of the current community and $H(\mathcal{Q})$ is the entropy of the community exit codewords. $\sum_{i=1}^{c} p_{\circlearrowright}^{i} H(\mathcal{P}^{i})$ is the average description length of nodes in all communities.

With the probability $p_{\alpha}$ of visiting node $\alpha$, $p_{\circlearrowright}^{i} = q_{i\curvearrowright} + \sum_{\alpha \in i} p_{\alpha}$ represents the probability of staying in (or exiting) the current community $i$ during the next step. $H(\mathcal{P}^{i})$ expresses the information entropy of the visiting probabilities of the nodes in community $i$, which can be written as
$$H(\mathcal{P}^{i}) = -\frac{q_{i\curvearrowright}}{p_{\circlearrowright}^{i}} \log \frac{q_{i\curvearrowright}}{p_{\circlearrowright}^{i}} - \sum_{\alpha \in i} \frac{p_{\alpha}}{p_{\circlearrowright}^{i}} \log \frac{p_{\alpha}}{p_{\circlearrowright}^{i}}.$$

Because of the NP-complete challenge, we cannot achieve the global optimal division pattern $M^{\ast}$ by direct computation on a big social network. However, we can achieve a set of local optima to approximate $M^{\ast}$ by partitioning the network into small subnetworks (partitions) and tackling them independently with MapReduce. The issue then becomes how to discover the optimal division pattern $M_{j}^{\ast}$ within each partition $P_j$ and assemble the final division $M'$. For $P_j$, it suffices to calculate $L(M_j)$ for different divisions $M_j$ and pick the one with the shortest description length as $M_{j}^{\ast}$. Finally, we get a community set $C' = \bigcup_{j} C_j$, where $C_j$ corresponds to $M_{j}^{\ast}$ and the $C_j$ are pairwise disjoint.
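To make the objective concrete, here is a Python sketch of the two-level map equation above for an unweighted undirected network. The function and variable names are ours, and exit probabilities are derived from the uniform random walk. On a toy network of two triangles joined by one edge, the two-community division yields a shorter description length than a single community.

import math
from collections import defaultdict

def plogp(x):
    return x * math.log2(x) if x > 0 else 0.0

def map_equation(adj, p, part):
    """adj: node -> set of neighbors; p: node -> steady visit probability
    (summing to 1); part: node -> community id. Returns L(M) in bits."""
    q, p_in, members = defaultdict(float), defaultdict(float), defaultdict(list)
    for v, nbrs in adj.items():
        c = part[v]
        p_in[c] += p[v]
        members[c].append(v)
        if nbrs:  # probability of stepping out of c on the next move
            out = sum(1 for u in nbrs if part[u] != c)
            q[c] += p[v] * out / len(nbrs)
    q_tot = sum(q.values())
    # exit codebook term: q_tot * H(Q)
    L = -sum(plogp(q[c] / q_tot) for c in q) * q_tot if q_tot > 0 else 0.0
    for c, vs in members.items():
        p_circ = p_in[c] + q[c]          # stay-in-module (plus exit) rate
        H = -plogp(q[c] / p_circ) - sum(plogp(p[v] / p_circ) for v in vs)
        L += p_circ * H
    return L

# toy network: two triangles {0,1,2} and {3,4,5} joined by edge (2,3)
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
two_m = sum(len(n) for n in adj.values())
p = {v: len(n) / two_m for v, n in adj.items()}   # stationary visit rates
print(map_equation(adj, p, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}))  # ~2.32 bits
print(map_equation(adj, p, {v: 0 for v in adj}))                    # ~2.56 bits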

3.2. Procedure of the Parallel Community Detection

For the convenience of illustration, we use Figure 1 throughout this section and assume that the number of partitions is far less than that of communities in big social networks. There are three stages in the parallel community detection process.

In the first stage, we calculate the steady visiting probability of all nodes (shown in Algorithm 1). Here, we modify the traditional PageRank, which is designed for directed networks, and run an iterative MapReduce-based version to obtain the global steady visit probability vector. Since there is no teleport and no link sink in a connected undirected network, we set the damping factor to 1, so in each iteration the visit probability of $v_i$ is
$$p(v_i) = \sum_{v_j \in N(v_i)} \frac{p(v_j)}{d(v_j)}. \quad (8)$$

  (1) method Map(nid n, node N)
  (2)   p ← N.PageRank/|N.AdjacencyList|   //(8)
  (3)   emit(nid n, node N)   //pass along network structure
  (4)   for all nodeid m ∈ N.AdjacencyList do
  (5)     emit(nid m, p)   //pass pagerank contribution to neighbors
  (6)   endfor
  (7) method Reduce(nid m, [p1, p2, ...])
  (8)   M ← ∅; s ← 0
  (9)   for all p ∈ [p1, p2, ...] do
 (10)     if IsNode(p) then
 (11)       M ← p   //recover local network structure
 (12)     else
 (13)       s ← s + p   //sum pagerank contributions
 (14)     endif
 (15)   endfor
 (16)   M.PageRank ← s
 (17)   emit(nid m, node M)
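A single-process Python simulation of one such iteration may clarify the data flow; the dictionaries stand in for the map, shuffle, and reduce stages, and the toy graph is ours. The structure messages are emitted only to mirror Algorithm 1; in a real job they carry the adjacency lists forward to the next iteration.

from collections import defaultdict

def pagerank_iteration(graph, rank):
    """Simulate one MapReduce iteration of the teleport-free PageRank above.
    graph: node -> list of neighbors; rank: node -> current p(v)."""
    emitted = defaultdict(list)                       # the "shuffle" buffer
    for n, nbrs in graph.items():                     # map phase
        emitted[n].append(('STRUCT', nbrs))           # pass along structure
        share = rank[n] / len(nbrs)
        for m in nbrs:
            emitted[m].append(('RANK', share))        # contribution to neighbor
    new_rank = {}
    for m, values in emitted.items():                 # reduce phase
        new_rank[m] = sum(v for tag, v in values if tag == 'RANK')
    return new_rank

# toy undirected network as symmetric adjacency lists (hypothetical data)
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
rank = {v: 1 / len(graph) for v in graph}
for _ in range(50):                                   # iterate to convergence
    rank = pagerank_iteration(graph, rank)
print(rank)   # approaches d(v)/2m: {0: 0.25, 1: 0.25, 2: 0.375, 3: 0.125}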

In the second stage, we use the multilevel k-way partitioning method enhanced by k-shell decomposition to divide a big social network into approximately equal sized disjoint partitions ($P_1$, $P_2$, and $P_3$ in Figure 1). Edges cut off by partition boundaries are discarded. As the networks studied here are sparse and community structured, the edge cut (loss) ratio will be low. However, the partitioning method has a decisive influence on the final community detection effectiveness, which will be demonstrated with experiments in Section 4. A matching of a network is a set of edges in which no two edges share a node. To coarsen a network, a commonly used method is to collapse the node pairs forming a matching, obtained by, for example, random matching, heavy-edge matching, or maximum weighted matching. However, this shrinks the network at a slow rate and does not consider the relative importance of nodes at different locations. There is a large number of low-degree, low-value nodes in power law networks, and we can turn this characteristic to our advantage. Here, we use k-shell decomposition to merge the peripheral nodes with higher speed and more accurate performance during the coarsening phase (shown in Algorithm 2).

  (1) set int k ← 1
  (2) while true, do
  (3)   list ← ∅; for all node v ∈ V, do
  (4)     if d(v) ≤ k, then
  (5)       list.add(v)   //a list to store nodes with d(v) ≤ k
  (6)     endif
  (7)   endfor
  (8)   if list = ∅, then
  (9)     break   //process finished
 (10)   endif
 (11)   for all node v ∈ list, do
 (12)     if v.annexed = false, then   //v hasn't been annexed
 (13)       u ← a neighbor in v.adj \ {v}   //get rid of v itself
 (14)       v.id ← u.id   //assign new id: v is annexed by u
 (15)       u.adj ← u.adj \ {v}
 (16)       for all w ∈ v.adj \ {u}, replace v with u in w.adj   //replace v with u in w's neighbors
 (17)       u.adj ← u.adj ∪ (v.adj \ {u})
 (18)       d(u) ← |u.adj|
 (19)       v.annexed ← true
 (20)     endif
 (21)   endfor
 (22) endwhile
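In plain Python, the peripheral merging of Algorithm 2 can be sketched as follows. We only annex shells up to k_max, process each shell once, and ignore chained representatives (a node annexed into a node that is itself annexed later), all of which a production version would handle more carefully.

def annex_peripheral(adj, k_max=1):
    """Merge low-degree peripheral nodes into a neighbor.
    adj: dict node -> set of neighbors (modified in place).
    Returns node -> representative it was annexed into."""
    rep = {v: v for v in adj}
    for k in range(1, k_max + 1):
        shell = [v for v in list(adj) if len(adj[v]) <= k]
        for v in shell:
            if v not in adj or not adj[v]:
                continue                    # already removed or isolated
            u = next(iter(adj[v]))          # pick one neighbor to annex into
            rep[v] = u
            for w in adj.pop(v):            # rewire v's edges onto u
                adj[w].discard(v)
                if w != u:
                    adj[w].add(u)
                    adj[u].add(w)
    return rep

adj = {0: {1}, 1: {0, 2, 3}, 2: {1, 3}, 3: {1, 2}}
rep = annex_peripheral(adj)
print(adj)   # pendant node 0 merged away: {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
print(rep)   # {0: 1, 1: 1, 2: 2, 3: 3}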

In the last stage, the parallel community detection method is carried out on all partitions (such as the three partitions in Figure 1), and all partitions are tackled independently. When the parallel process finishes, each partition generates a community set (such as the two communities within one partition in Figure 1). Combining all the community sets together yields the final result. Figure 6 gives a straightforward statement of this process, and more detail is shown in Algorithm 3. At the beginning of this stage, we treat each node as a community and then use a bottom-up greedy scheme to find community pairs whose merging minimizes $L(M)$ and merge them to form new communities.

   (1) Initialization-global variables
   (2)   c ← n   //number of communities
   (3)   compute initial L(M)
   (4)  for i from 1 to n, do
   (5)    cid(v_i) ← i   //community ID of v_i
   (6)    mergeable(v_i) ← true
   (7)  endfor
   (8) method Map(node v, partition P_j)
   (9) while any community was merged in the last pass, do
  (10)  for all v ∈ P_j, do
  (11)    mergeable(v) ← true   //able to merge in current iteration
  (12)  endfor
  (13)  for all mergeable v ∈ P_j, do
  (14)    v ← randomly selected node   //randomly select a node (community)
  (15)    δ_max ← 0
  (16)    define δ(a, b) as the decrease of L(M) when merging communities a and b   //decrease of L(M)
  (17)    best ← −1   //community id which leads to minimum L(M)
  (18)   for u in v.adj, do
  (19)   if δ(cid(v), cid(u)) > δ_max, then   //if merging cid(v) and cid(u) helps
  (20)        δ_max ← δ(cid(v), cid(u))
  (21)        best ← cid(u)
 (22)   endif
 (23)   endfor
 (24)   if best ≥ 0 & δ_max > 0, then
 (25)   cid(v) ← best   //merge the two communities
 (26)   c ← c − 1
 (27)   L(M) ← L(M) − δ_max
 (28)   mergeable(v) ← false
 (29)   endif
 (30)  endfor
 (31) endwhile
 (32) for all v ∈ P_j, do
 (33)  emit(v, cid(v))
 (34) endfor
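A compact Python rendering of this greedy scheme, reusing the map_equation and plogp helpers sketched in Section 3.1, is shown below. For clarity it recomputes L(M) from scratch at every trial move, whereas Algorithm 3 (like infomap itself) uses incremental updates of the description length.

def greedy_merge(adj, p):
    """Greedy bottom-up merging for one partition. Each node starts as its
    own community; repeatedly move a node into the neighboring community
    that most decreases L(M). Assumes map_equation() from Section 3.1."""
    part = {v: v for v in adj}
    improved = True
    while improved:
        improved = False
        for v in adj:
            base = map_equation(adj, p, part)
            best_gain, best_c = 0.0, part[v]
            for c in {part[u] for u in adj[v]} - {part[v]}:
                trial = dict(part)
                trial[v] = c                 # tentative move of v into c
                gain = base - map_equation(adj, p, trial)
                if gain > best_gain:
                    best_gain, best_c = gain, c
            if best_c != part[v]:            # keep only strict improvements
                part[v] = best_c
                improved = True
    return part

part = greedy_merge(adj, p)   # adj, p from the map_equation example
print(part)   # e.g., the two triangles end up in two communities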

4. Experiments and Analysis

In this section, we conduct several experiments and analyze the results. All experiments are run on the Hadoop-1.1.1 cluster of Antivision Software Ltd. The cluster consists of 20 PowerEdge R320 servers (Intel Xeon CPU E5-1410 @2.80 GHz, 8 GB memory) with 64-bit NeoKylin Linux OS, and the servers are connected by a Cisco 3750G-48TS-S switch. The data sets are shown in Table 1, including artificial networks and real networks.

All artificial networks used here are generated by the LFR benchmark. In LFR, several parameters give direct control over network properties: network size ($n$), degree distribution (average degree, maximum degree, and exponent $\gamma$), and community structure (community size exponent $\beta$ and the mix parameter) [26]. $\gamma$ and $\beta$ are the exponents of the degree and community size distributions, which typically range between 2 and 3 and between 1 and 2, respectively. Mix is the ratio of edges connecting nodes from different communities to the collective edges of all communities. For the sake of averaging and balance, we fix the degree and community size parameters across all artificial networks.
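Comparable networks can be generated with networkx's LFR implementation; the parameter values below are illustrative stand-ins, not the exact settings of Table 1, and generation can fail for infeasible parameter combinations.

import networkx as nx

# hypothetical parameter values for illustration only
G = nx.LFR_benchmark_graph(
    n=1000, tau1=2.5, tau2=1.5, mu=0.3,     # gamma, beta, mix
    average_degree=10, max_degree=50,
    min_community=20, max_community=100, seed=42)

# planted ground-truth communities are attached as sets on the nodes
communities = {frozenset(G.nodes[v]["community"]) for v in G}
print(len(G), "nodes,", G.number_of_edges(), "edges,",
      len(communities), "planted communities")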

4.1. Accuracy Experiments

In the accuracy experiments, we compare our method, Pinfomr, with two top-class methods, the Louvain algorithm [27] and the OSLOM algorithm [28], on different data sets and with different partition numbers. The data sets used are three artificial networks from Table 1, and the results are shown in Figure 7, where $k$ denotes the partition number. We define the situations where all nodes fall into one community or every node forms its own community as having no community structure, and NMI in these cases is set to 0; cases where the community count is close to the node count are discarded. Taking Louvain in Figure 7(a) for instance, at high mix values the number of detected communities implies that a community contains only a few nodes on average. From the design of LFR, we know that when the mix value is too high there are no obvious community structures, and the network is no longer a power law network but more like a random network, which is not the focus here. Figure 7 indicates that Pinfomr is more stable and accurate than the others in uncovering community structures in power law networks. For running time, we can see that, on the same data set, Louvain consumes the longest time and Pinfomr the shortest; OSLOM requires a little more time than Pinfomr when the mix parameter is not too big.

From Figure 7(c), we see that NMI decreases as the partition number increases, but the performance is excellent and stable while the mix value stays below the point where community structure fades, with NMI maintained at a high level. Our results show that Pinfomr achieves good results in a shorter period of time, although accompanied by some loss of accuracy.

4.2. Partitioning Experiment

In the previous section, we mentioned that the quality of partitioning plays a vital role in the final performance of parallel community detection. Therefore, we conduct experiments in this section to test the impact and effectiveness of different partitioning methods on Pinfomr.

We use two simple partitioning methods to compare against the improved multilevel k-way partitioning method. One is a sequential partitioning method that divides the network according to the storage order of the nodes and edges on HDFS. The other is a random matching partitioning method that randomly chooses nodes to generate a matching. For example, assume that we bisect a network $G$ of 20,000 nodes into $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$. With the sequential partitioning method, the first 10,000 nodes are collected to form $V_1$, and the links within $V_1$ form $E_1$; the remaining nodes are left for $V_2$, and the links within them form $E_2$. With the random matching method, we randomly select 5,000 node pairs into $V_1$, all links within $V_1$ form $E_1$, and the process for $G_2$ is similar. Dividing a connected network into subnetworks or partitions causes edge loss; excellent partitioning methods always try to walk through the slits between communities and avoid cutting off effective edges. Here, we use an artificial data set to test the performance of the different partitioning methods. In Figures 8(a) and 8(b), we can see that the multilevel k-way partitioning method is stable, and the results of Pinfomr on it are very close to those of infomap and to the ground truth. From Figures 8(c) and 8(d), we learn that, for the multilevel k-way partitioning method, the total edge loss ratio increases linearly with the mix parameter, which is easy to understand from the meaning of the mix parameter, while the effective edge loss ratio remains at a low level until mix rises to a high value. The manifestations of the sequential and random partitioning methods are also easy to explain: the distribution of edges in LFR-generated networks is random and uniform regardless of storage order, so their total edge loss ratio remains roughly constant, and their effective edge loss decreases linearly with increasing mix for the same reason that the total edge loss of the multilevel method increases linearly with mix.

In addition, we conduct a degree distribution test on a real network, LiveJournal, to verify the performance of the improved multilevel partitioning method. The network is divided into several partitions by the improved multilevel k-way partitioning method, and the degree distributions of the original network and the subnetworks are shown in Figure 9. Comparative observation indicates that the subnetworks obtained from the improved multilevel k-way partitioning method maintain the degree distribution characteristics of the original network.

4.3. Scalability and Performance Experiment

Our study aims to uncover community structures in big social networks while improving resource utilization as much as possible. Here, we unify the two goals by means of MapReduce, achieving them at a small cost in accuracy. In this section, we test the scalability and performance of the parallel community detection method; the data sets used are D4, D5, LiveJournal (http://snap.stanford.edu/data/com-LiveJournal.html), Youtube (http://snap.stanford.edu/data/as-skitter.html), and Orkut (http://snap.stanford.edu/data/com-Orkut.html).

For a given network in Figures 10(a) and 10(b), as the partition number increases, the speedup ratio increases but the acceleration slows down, since MapReduce needs some time to initialize before the map tasks start to run and to transmit data from the map phase to the reduce phase. For a given partition number, as the network size grows, the speedup ratio becomes higher. For Figure 10(c), we present only the running time of the parallel community detection method on MapReduce, because the capacity of the servers cannot handle such large networks on a single server or with a very small partition number.

Finally, we apply the same process to the real networks. The experiments on real networks shown in Figure 11 also confirm that our parallel community detection method has excellent scalability. From the results, we can conclude that, as the partition number increases, the subnetwork size assigned to each map task becomes smaller, while the total edge loss ratio increases, which further reduces the subnetwork size. From Figures 10 and 11, we observe the following: for a constant data size, the running time and the partition number are approximately linearly related when the partition number is small; that is, the running time is affected significantly by the number of partitions. When the partition number reaches a "critical point" (Orkut in Figure 11(c) and Youtube in Figure 11(b)), the running time is less affected by changes in the partition number and shows a "long tail effect" to some extent. The reason is that the overhead of MapReduce is basically fixed: for a larger social network with the same number of map tasks, MapReduce's initialization time accounts for a smaller proportion of the total running time, but when the partition number increases and the total running time decreases, the proportion of the initialization time is no longer negligible. This makes our method exhibit the "long tail effect" on different data sets.

5. Conclusion

Community detection has become an important research topic in social networks, but traditional community mining algorithms cannot effectively adapt to current big social network scenarios [29, 30]. Infomap is an excellent standalone community detection method, and, by means of the multilevel k-way partitioning method enhanced by k-shell decomposition, we developed a parallel community discovery method on the MapReduce framework. The experiments verify the validity of the proposed work, which may serve as a reference for social network analysis and community mining with big data techniques. Next, we will try some overlapping partitioning methods to further improve the community detection accuracy.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to express their sincere gratitude to Zhang Yuchao from Beijing Institute of System Engineering for providing great assistance through the entire research process, Lancichinetti A. from Amaral Lab of Northwestern University for supporting their work unselfishly with implementation of some algorithms, and Chen Siming from University of Illinois at Chicago for his careful review, comments, and feedback on this paper. In addition, this research is supported by the National High-Tech R&D Program of China (nos. 2012AA012600, 2012AA01A401, and 2012AA01A402), the National Natural Science Foundation of China (no. 61202362), the State Key Development Program of Basic Research of China (no. 2013CB329601), and Project funded by the China Postdoctoral Science Foundation (no. 2013M542560).