Mathematical Problems in Engineering

Volume 2015 (2015), Article ID 934301, 13 pages

http://dx.doi.org/10.1155/2015/934301

## A Parallel Community Structure Mining Method in Big Social Networks

^{1}College of Computer, National University of Defense Technology, Changsha, Hunan 410073, China

^{2}Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA

Received 5 July 2014; Accepted 2 August 2014

Academic Editor: Haipeng Peng

Copyright © 2015 Songchang Jin et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Community structure plays a key role in analyzing network features and in uncovering valuable hidden information. However, discovering hidden community structures is one of the biggest challenges in social network analysis, especially as networks grow very large. Infomap is a top-class algorithm for nonoverlapping community structure detection. However, it is designed for a single processor, and this limited scalability prevents it from fully utilizing server resources when tackling large networks. In this paper, based on Infomap, we develop a scalable parallel nonoverlapping community detection method, Pinfomr (parallel Infomap with MapReduce), which utilizes the MapReduce framework to solve these two problems. Experiments on artificial networks and real datasets show that our parallel method achieves satisfactory performance and scalability.

#### 1. Introduction

Several common properties have been discovered in many complex networks: the small-world property, the scale-free feature, and community structure [1–4]. Community structure plays a key role in the formation and function of these networks [5]. However, detecting it remains a serious challenge in the study of complex systems [6].

Current social networks have grown to millions or even billions of nodes [7]. Facebook, for example, has more than a billion monthly active users [8]. Because of their computational cost, traditional community discovery algorithms cannot handle such huge complex networks. It is therefore necessary to develop a fast and scalable approach for detecting communities in big social networks.

Network partitioning is NP-complete [9]. Partitioning a network into approximately equal sized components while minimizing the number of edges between different components is extremely important in parallel computing [10]. For example, parallelizing many applications involves assigning data or processes evenly to processors while minimizing the communication traffic. However, once a network exceeds a certain size, partitioning the original network directly is no longer practical, and traditional algorithms converge too slowly.

Nowadays, mainstream servers are configured with high-performance hardware. Empirical studies [11] have shown that Infomap [12] is a top-class standalone algorithm for nonoverlapping community detection. However, owing to current technological limits, the processing capability of a single core has hit a bottleneck, and because Infomap uses only one core or processor of the server, its scalability suffers as a consequence. Running Infomap on a multiprocessor server also wastes computing resources. How to improve the scalability of Infomap and make full use of servers remains a difficult problem.

With the advent of the era of big data, information science is shifting from computing-intensive to data-intensive work [13]. Several novel parallel computing frameworks have emerged, among which MapReduce [14] is one of the best. In this paper, based on our previous work [15], we present a new scalable parallel community detection method that combines several existing techniques: Infomap, $k$-shell decomposition, multilevel network partitioning, and MapReduce. A high-level description of our approach is as follows. First, we divide the whole network into a number of partitions, where the number of partitions is far smaller than the number of community structures. To speed up this step, we develop an enhanced multilevel partitioning method. Next, using MapReduce, we run the community detection method in parallel so that the community structures within all partitions are mined simultaneously. Finally, we collect the community structures together to form the final result.
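The three steps above (partition, detect in parallel, collect) can be illustrated with a small data-flow sketch. This is only a sequential Python mock-up under stated assumptions, not the Pinfomr implementation: the function names `map_phase`, `reduce_phase`, and `run_pipeline` are hypothetical, no MapReduce runtime is involved, and the per-partition detector is a connected-components placeholder standing in for Infomap.

```python
from collections import defaultdict


def map_phase(partition_id, nodes, edges, detect):
    """Run a detector inside one partition and emit (key, community) pairs."""
    # Only edges with both endpoints in this partition are visible to the task.
    inner = [(u, v) for u, v in edges if u in nodes and v in nodes]
    for local_id, community in enumerate(detect(nodes, inner)):
        yield (partition_id, local_id), frozenset(community)


def reduce_phase(pairs):
    """Collect the per-partition communities into one global result."""
    result = {}
    for key, community in pairs:
        result[key] = community
    return result


def connected_components(nodes, edges):
    """Placeholder detector (NOT Infomap): connected components of a partition."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            component.add(v)
            stack.extend(adjacency[v] - seen)
        components.append(component)
    return components


def run_pipeline(partitions, edges, detect=connected_components):
    """Sequential stand-in: each loop iteration could be an independent map task."""
    pairs = []
    for pid, nodes in partitions.items():
        pairs.extend(map_phase(pid, nodes, edges, detect))
    return reduce_phase(pairs)


if __name__ == "__main__":
    parts = {0: {1, 2, 3}, 1: {4, 5, 6, 7}}
    links = [(1, 2), (2, 3), (3, 4), (4, 5), (6, 7)]
    for key, community in sorted(run_pipeline(parts, links).items()):
        print(key, sorted(community))
```

Because each map task sees only the edges inside its own partition, the quality of the final result depends on how few edges the partitioning step cuts, which is why the partitioning is treated as a separate, carefully designed stage.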

The main contributions of this paper are as follows. First, we propose a new model to mine community structure in big social networks. Second, we integrate $k$-shell decomposition theory with the multilevel $k$-way partitioning algorithm to deal with peripheral nodes. Third, we implement a scalable, parallel Infomap to uncover community structures and to improve the resource utilization rate.

The rest of this paper is organized as follows. Section 2 briefly reviews some concepts and background information. Section 3 provides problem statement and detailed description of the parallel community detection method. In Section 4, we conduct a couple of experiments to evaluate the performance of the method proposed in this paper. Finally, Section 5 provides some concluding remarks and outlines future research directions.

#### 2. Preliminary Knowledge

##### 2.1. Relevant Concepts

In this paper, we only study undirected networks, which can be mathematically described as $G = (V, E)$, consisting of node set $V$ and edge set $E$; $n = |V|$ represents the number of nodes, $v_i$ represents a node, and $d(v_i)$ means its degree; $m = |E|$ represents the number of edges and $e_{ij}$ is the edge between $v_i$ and $v_j$, where $v_i, v_j \in V$.

Infomap is based on information theory, so some information-theoretic concepts are briefly reviewed here. In information theory, the information contained in a distribution is called entropy. For a discrete random variable $X$ with a probability distribution $p(x)$, its entropy is

$$H(X) = -\sum_{x} p(x) \log p(x).$$

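*Mutual information* measures the information shared between two distributions, $X$ and $Y$. We define $p(x, y)$ as the joint probability of $X$ and $Y$; $p(x)$ and $p(y)$ are the marginal probability distributions of $X$ and $Y$, respectively. Then, the mutual information of $X$ and $Y$ is

$$I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}.$$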

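*Normalized mutual information* (NMI) is often used for evaluating clustering results, information retrieval, feature selection, and so forth. The value range of NMI is $[0, 1]$, and when $X$ and $Y$ are the same, NMI equals 1.0. Consider

$$\mathrm{NMI}(X, Y) = \frac{2\, I(X; Y)}{H(X) + H(Y)},$$

where $I(X; Y)$ is the mutual information and $H(\cdot)$ is the entropy defined above.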
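To make the three quantities concrete, the following minimal Python sketch estimates them from two community labelings of the same node set, using empirical frequencies as the probabilities. The function names are illustrative, logarithms are taken base 2, and the normalization $2I(X;Y)/(H(X)+H(Y))$ is an assumption: it is one common convention that satisfies the stated range and equals 1.0 for identical partitions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def mutual_information(xs, ys):
    """Mutual information (in bits) between two labelings of the same nodes."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))  # empirical joint distribution
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def nmi(xs, ys):
    """NMI normalized by the sum of the two entropies (assumed convention)."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 1.0  # both labelings place every node in a single group
    return 2 * mutual_information(xs, ys) / (hx + hy)


if __name__ == "__main__":
    truth = [0, 0, 0, 1, 1, 1]
    found = ["b", "b", "b", "a", "a", "a"]  # same grouping, different names
    print(nmi(truth, found))
    print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # statistically independent labelings
```

NMI depends only on how nodes are grouped, not on the label names, so a detected partition can be compared with a ground-truth partition directly.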

##### 2.2. $k$-Shell Decomposition Theory

$k$-shell decomposition is a well-established method for analyzing the structure of large-scale networks [16–18]. In particular, it provides a method for identifying hierarchies in a network. It assumes that the importance of a node is determined not by its degree but by its location in the network. The process assigns an integer index, $k_s$, to each node, representing its location within the successive layers ($k$-shells) of the network. The index $k_s$ is a robust measure, and the node ranking is not significantly influenced by incomplete information. The $k$-core of a network $G$ is the maximum subnetwork of $G$ in which every node has degree no less than $k$. The $k$-shell of $G$ is the set of all nodes belonging to the $k$-core of $G$ but not to the $(k+1)$-core.

Nodes are assigned to $k$-shells based on their remaining degree, which is obtained by successively pruning nodes whose degree is smaller than the value of the current layer. The decomposition process starts by removing all nodes with degree 1. After that, some nodes may be left with only one link, so we prune the network iteratively until no node with degree 1 remains. The removed nodes, along with the corresponding links, form a shell with index $k_s = 1$. In a similar fashion, we iteratively remove the next $k$-shell, where $k_s = 2$, and continue to remove higher shells until all nodes are removed. As a result, each node is associated with one $k_s$ index, and the network can be viewed as the collection of all shells. The $k_s$ value of a node can be very different from its degree. In Figure 2, we can see that has neighbors with . Figure 5 shows the network of Figure 2 after its peripheral nodes have been processed.
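The pruning procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `k_shell_decomposition` and the adjacency-dictionary input format are assumptions, and isolated nodes (degree 0), which the text does not discuss, are given index 0 here.

```python
def k_shell_decomposition(adjacency):
    """Return {node: shell index} for an undirected graph {node: set(neighbors)}."""
    # Track remaining degrees separately so the input graph is left untouched.
    degree = {v: len(nbrs) for v, nbrs in adjacency.items()}
    remaining = set(adjacency)
    shell = {}
    k = 0
    while remaining:
        # The next shell index is the smallest remaining degree (never decreasing).
        k = max(k, min(degree[v] for v in remaining))
        # Prune nodes of remaining degree <= k until none are left.
        queue = [v for v in remaining if degree[v] <= k]
        while queue:
            v = queue.pop()
            if v not in remaining:
                continue  # already pruned via a duplicate queue entry
            remaining.remove(v)
            shell[v] = k
            for u in adjacency[v]:
                if u in remaining:
                    degree[u] -= 1
                    if degree[u] <= k:
                        queue.append(u)
    return shell


if __name__ == "__main__":
    # Triangle a-b-c with a two-node tail a-d-e attached to it.
    graph = {
        "a": {"b", "c", "d"},
        "b": {"a", "c"},
        "c": {"a", "b"},
        "d": {"a", "e"},
        "e": {"d"},
    }
    print(k_shell_decomposition(graph))
```

In this toy graph, node `d` has degree 2 but ends up in the shell with index 1, because removing `e` leaves it with a single link. This is the sense in which a node's shell index can differ from its degree.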

##### 2.3. Multilevel $k$-Way Partitioning Method

Partitioning the node set $V$ of a network $G$ into $k$ disjoint subsets is called a $k$-way partitioning of $G$. Each subset and the edges within the subset constitute a *partition* of $G$. Figure 1 shows a simple network with communities surrounded by the dotted circles and partitions.
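As a small illustration of these definitions, the sketch below takes a given assignment of nodes to partitions and reports the partition sizes, the number of edges cut between partitions, and the subnetwork induced by one partition. It only evaluates an assignment and does not compute one; the helper names `partition_summary` and `induced_partition` are hypothetical.

```python
def partition_summary(edges, assignment):
    """Return (sizes, edge_cut) for a node -> partition-id assignment."""
    sizes = {}
    for part in assignment.values():
        sizes[part] = sizes.get(part, 0) + 1
    # An edge is cut when its endpoints lie in different partitions.
    edge_cut = sum(1 for u, v in edges if assignment[u] != assignment[v])
    return sizes, edge_cut


def induced_partition(edges, assignment, part):
    """Nodes of one partition together with the edges that lie inside it."""
    nodes = {v for v, p in assignment.items() if p == part}
    inner = [(u, v) for u, v in edges if u in nodes and v in nodes]
    return nodes, inner


if __name__ == "__main__":
    # Two triangles joined by the single edge (3, 4).
    edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
    assignment = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
    print(partition_summary(edges, assignment))
    print(induced_partition(edges, assignment, 0))
```

A balanced partitioning with a small edge cut, as in this example, is the goal stated earlier for parallel computing: equal-sized components with few edges between them.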