Special Issue: Machine Learning with Applications to Autonomous Systems
Research Article | Open Access
Community Clustering Algorithm in Complex Networks Based on Microcommunity Fusion
As research on the physical meaning and digital features of community structure in complex networks has deepened in recent years, improving the effectiveness and efficiency of community mining algorithms has become an important subject in this area. This paper puts forward the concept of the microcommunity and obtains the final community mining results by fusing different microcommunities. It starts from the basic definition of the network community and applies Expansion to microcommunity clustering, which provides the prerequisites for microcommunity fusion. Test results on network data sets show that the proposed algorithm is more efficient and yields higher solution quality than other similar algorithms.
1. Introduction
Network community structure is one of the most common and important topological properties of complex networks: links within the same community are dense, while links between different communities are sparse. Research on network community mining algorithms has important theoretical value for analyzing the topology of complex networks, understanding their functions, finding their hidden patterns, and predicting their behavior, and such algorithms are widely used in social networks, biological networks, and the World Wide Web. The literature [1–4] summarizes the research background, significance, status at home and abroad, and current main problems of complex network clustering methods.
Network community clustering algorithms can be divided into intelligent optimization algorithms, heuristic algorithms, and mixtures of the two. An intelligent optimization algorithm abstracts the community clustering problem into a mathematical optimization problem and searches for the optimal solution, updating the current solution according to the value of the objective function. A heuristic algorithm determines the community to which each node belongs according to the rules of the algorithm [5–7].
In recent years, with further research on complex network communities, many efficient community clustering algorithms have emerged. The multiobjective discrete particle swarm optimization (MODPSO), an intelligent optimization algorithm, calculates the optimal scheme for community clustering by updating two objective functions, NRA and RC. This algorithm has better nonrandomness and executes efficiently. Research on heuristic algorithms also continues to develop; core node fusion algorithms based on data field and betweenness centrality have received widespread attention.
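Radicchi has given the characteristics of the network community structure: links between nodes in the same community are dense, while links between nodes in different communities are sparse. For a network $G = (V, E)$, $V$ and $E$ represent the sets of nodes and edges, $k_i$ represents the degree of node $i$ (the number of nodes connected with node $i$), $A$ represents the adjacency matrix of $G$, $S$ represents a community of $G$ (i.e., $S \subset G$), $k_i^{\mathrm{in}}(S) = \sum_{j \in S} A_{ij}$, and $k_i^{\mathrm{out}}(S) = \sum_{j \notin S} A_{ij}$. The definition of a strong community is
$$k_i^{\mathrm{in}}(S) > k_i^{\mathrm{out}}(S), \quad \forall i \in S.$$
The definition of a weak community is
$$\sum_{i \in S} k_i^{\mathrm{in}}(S) > \sum_{i \in S} k_i^{\mathrm{out}}(S).$$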
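The two definitions can be checked directly on a small graph. The following Python sketch assumes the usual reading of the strong and weak conditions (inside links compared with outside links, per node or summed over the community); the function names and the toy graph are hypothetical and not taken from the paper.

```python
def in_out_degree(adj, community, node):
    """Split the edges of `node` into those ending inside and outside `community`."""
    k_in = sum(1 for nb in adj[node] if nb in community)
    return k_in, len(adj[node]) - k_in


def is_strong_community(adj, community):
    """Strong sense: every member has more links inside than outside."""
    return all(k_in > k_out
               for k_in, k_out in (in_out_degree(adj, community, v) for v in community))


def is_weak_community(adj, community):
    """Weak sense: summed over all members, inside links outnumber outside links."""
    pairs = [in_out_degree(adj, community, v) for v in community]
    return sum(p[0] for p in pairs) > sum(p[1] for p in pairs)


# Hypothetical toy graph: two triangles joined by the single edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}

print(is_strong_community(adj, {0, 1, 2}))     # True
print(is_strong_community(adj, {0, 1, 2, 3}))  # False: node 3 has 1 link inside, 2 outside
print(is_weak_community(adj, {0, 1, 2, 3}))    # True: 8 inside link ends versus 2 outside
```

The set {0, 1, 2, 3} shows why the weak condition is the looser of the two: one poorly attached node breaks the strong condition but not the summed one.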
According to this characteristic, the edges connected with a node $i$ can be split into two parts with respect to a community $S$: edges that connect $i$ to nodes inside $S$ and edges that connect $i$ to nodes outside $S$, whose numbers are $k_i^{\mathrm{in}}(S)$ and $k_i^{\mathrm{out}}(S)$, respectively. We can determine the community to which a node belongs by comparing the two values: if $k_i^{\mathrm{in}}(S) > k_i^{\mathrm{out}}(S)$, node $i$ is assigned to $S$. Because $k_i^{\mathrm{in}}(S) + k_i^{\mathrm{out}}(S) = k_i$, this is equivalent to $k_i^{\mathrm{in}}(S) > k_i / 2$.
Newman and Girvan put forward the concept of modularity to measure the quality of network community clustering. Many community clustering algorithms have adopted it as an index of clustering quality. The formula of modularity is given as follows:
$$Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(C_i, C_j),$$
where $m$ represents the number of edges in the network, $A$ is the adjacency matrix of the network, $k_i$ is the degree of node $i$, and $\delta(C_i, C_j) = 1$ represents that node $i$ and node $j$ belong to the same community, while $\delta(C_i, C_j) = 0$ represents that node $i$ and node $j$ are not in the same community.
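As a worked illustration of the modularity index, the following Python sketch evaluates the standard Newman-Girvan sum over all node pairs of an undirected, unweighted graph. The function name, the dictionary-based graph representation, and the toy graph are assumptions made for the example.

```python
def modularity(adj, labels):
    """Newman-Girvan modularity for an undirected, unweighted graph.

    adj    -- dict mapping each node to the set of its neighbors
    labels -- dict mapping each node to its community label
    """
    m = sum(len(nbs) for nbs in adj.values()) / 2  # number of edges
    q = 0.0
    for i in adj:
        for j in adj:
            if labels[i] != labels[j]:
                continue  # delta(C_i, C_j) = 0 contributes nothing
            a_ij = 1.0 if j in adj[i] else 0.0
            q += a_ij - len(adj[i]) * len(adj[j]) / (2 * m)
    return q / (2 * m)


# Hypothetical toy graph: two triangles joined by the single edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
two_triangles = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}

print(round(modularity(adj, two_triangles), 4))  # 0.3571, i.e. 5/14
```

The double loop is quadratic in the number of nodes, which is acceptable for networks of the size discussed here; larger graphs would call for a per-community formulation.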
In the multiobjective particle swarm algorithm, the modularity used as the objective function of the single-objective particle swarm algorithm is replaced with the modularity density $D$, which is explored by updating the values of RA (Ratio Association) and RC (Ratio Cut). The formulas of RA, RC, and $D$ are as follows:
$$\mathrm{RA} = \sum_{i=1}^{K} \frac{L(V_i, V_i)}{|V_i|}, \quad \mathrm{RC} = \sum_{i=1}^{K} \frac{L(V_i, \overline{V_i})}{|V_i|}, \quad D = \mathrm{RA} - \mathrm{RC},$$
where $n$ denotes the number of nodes in the network, $K$ represents the number of communities into which the network is divided, $V_i$ is the $i$th community, $|V_i|$ is the number of nodes in the $i$th community, $\overline{V_i}$ represents the set of nodes which are not in the community $V_i$, $L(V_1, V_2) = \sum_{i \in V_1, j \in V_2} A_{ij}$, and $A$ is the adjacency matrix of the network. RA and RC are closely connected with two measurement indexes of network community clustering, Conductance and Expansion. Conductance denotes the fraction of the total edge volume of a community that points outside the community. Expansion represents the number of edges per node that point outside the community. The formula of Conductance is
$$f(S) = \frac{c_S}{2 m_S + c_S}.$$
The formula of Expansion is
$$f(S) = \frac{c_S}{n_S}.$$
In the above formulas, $c_S$ represents the number of links on the boundary of $S$, $m_S$ denotes the number of links within the community $S$, and $n_S$ is the number of nodes in community $S$.
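The indexes above can be computed from link counts alone. The following Python sketch assumes the commonly used forms: RA and RC as per-community link counts divided by community size, modularity density as RA minus RC, Conductance as boundary links over total edge volume, and Expansion as boundary links per node. All function names and the toy graph are hypothetical.

```python
def links(adj, src, dst):
    """L(src, dst): sum of adjacency entries A_ij with i in src and j in dst.

    For src == dst every internal edge is counted twice, once per direction.
    """
    return sum(1 for i in src for j in adj[i] if j in dst)


def modularity_density(adj, communities):
    """Return (RA, RC, D) with D = RA - RC for a list of node sets."""
    nodes = set(adj)
    ra = sum(links(adj, c, c) / len(c) for c in communities)
    rc = sum(links(adj, c, nodes - c) / len(c) for c in communities)
    return ra, rc, ra - rc


def conductance(adj, s):
    """Boundary links divided by the total edge volume of s."""
    c_s = links(adj, s, set(adj) - s)  # links on the boundary of s
    m_s = links(adj, s, s) // 2        # links inside s
    return c_s / (2 * m_s + c_s)


def expansion(adj, s):
    """Boundary links per node of s."""
    return links(adj, s, set(adj) - s) / len(s)


# Hypothetical toy graph: two triangles joined by the single edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
ra, rc, d = modularity_density(adj, [{0, 1, 2}, {3, 4, 5}])

print(round(ra, 4), round(rc, 4), round(d, 4))  # 4.0 0.6667 3.3333
print(round(conductance(adj, {0, 1, 2}), 4))    # 0.1429
print(round(expansion(adj, {0, 1, 2}), 4))      # 0.3333
```

Lower Conductance and Expansion indicate a better separated community, whereas a larger modularity density indicates a better partition.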
The algorithm of this paper adopts the divide-and-conquer strategy. The nodes in the network are divided into microcommunities, each with a single node as its core. The community structure is obtained by fusing the microcommunities in random order. After the core steps of the algorithm are finished, the final community clustering result is selected by means of the modularity density. The clustering of microcommunities and the whole process of the algorithm are described in detail in Section 2. Section 3 presents a simulation analysis of the experimental results of the algorithm.
2. Microcommunity Fusion
The algorithm in this paper constructs microcommunities according to the Expansion index during community clustering. In the microcommunity fusion procedure, it merges microcommunities according to the definition of the strong community.
Unlike other heuristic algorithms, the algorithm first divides the network into several basic “microcommunities” and then merges and fuses these “microcommunities” to obtain the final result of network community clustering.
First, nodes with larger degree are selected as the center nodes of communities in the network. The minimum degree among the selected nodes is called the threshold. Tests with different thresholds show that the experimental result is ideal when the threshold is larger than the average node degree and the center nodes account for about 20% of the total number of nodes (see Section 3.4). The threshold Deg is chosen as
$$\mathrm{Deg} = D\left[\lceil N (1 - p) \rceil\right],$$
where $D$ is an array of the node degrees in the network sorted in ascending order (indexed from 1), $N$ is the number of elements in the array, that is, the number of nodes in the network, and $p$ is the proportion of nodes selected as center nodes; the value of $p$ is 20% in this algorithm.
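The threshold selection can be sketched in Python as follows. The sketch assumes that the threshold is read from the ascending degree array at the 1-based position obtained by rounding N(1 − p) upward, which is one plausible reading of the description; the exact rounding rule, the function names, and the toy edge list are assumptions.

```python
import math


def build_adjacency(edges):
    """Build a dict of neighbor sets from an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_threshold(adj, p=0.2):
    """Degree at 1-based position ceil(N * (1 - p)) of the ascending degree array."""
    degrees = sorted(len(nbs) for nbs in adj.values())
    n = len(degrees)
    # round() guards against floating-point noise such as 7.000000000000001
    position = math.ceil(round(n * (1 - p), 9))
    position = min(n, max(1, position))
    return degrees[position - 1]


def center_nodes(adj, p=0.2):
    """Nodes whose degree reaches the threshold."""
    deg = degree_threshold(adj, p)
    return sorted(v for v in adj if len(adj[v]) >= deg)


# Hypothetical toy network with 10 nodes; degrees sorted: 1,1,1,1,1,1,2,4,4,4.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2),
         (1, 5), (1, 6), (2, 7), (2, 8), (8, 9)]
adj = build_adjacency(edges)

print(degree_threshold(adj, 0.2))  # 4
print(center_nodes(adj, 0.2))      # [0, 1, 2]
print(degree_threshold(adj, 0.3))  # 2
```

Because several nodes can share the threshold degree, the share of center nodes is only approximately p; here three of ten nodes are selected for p = 20%.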
For each chosen center node, eligible nodes among its neighbors are selected to join the microcommunity. This algorithm uses the Expansion index for this choice. Because Expansion is a nonlinear function, its value does not change as expected when the number of nodes is small. Thus, this paper adjusts the computation of Expansion by removing the center node and its connected edges. A calculation example of Expansion is given in Figure 1.
The calculation of Expansion for a given center node proceeds as follows. At the initialization stage, all neighbor nodes of the center node form a microcommunity around it; in the example, the initial value of Expansion is 4/6. The algorithm then traverses each neighbor node and calculates the value of Expansion after removing that node from the microcommunity. If the value decreases, the node is removed; otherwise, the algorithm moves on to the next neighbor node. For the example, the change of the Expansion value for each node in the network diagram is given in Table 1. The first node to be traversed is selected at random. EXP is the value of Expansion before removing the node, and EXP_NEXT is the value of Expansion after removing the node.
According to the table, five nodes keep the chosen node as the center of their microcommunity. Compared with the standard Expansion, the adjusted Expansion used to screen nodes out of the microcommunity imposes a stricter condition. This helps keep the structure of the microcommunity stable and makes the changes of nodes more explicit during microcommunity fusion.
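The screening procedure can be sketched in Python as follows. The sketch rests on several assumptions that the text does not spell out: the adjusted Expansion is taken as the number of links leaving the member set, ignoring links to the center node, divided by the number of members; a neighbor is dropped only if its removal strictly lowers this value; and the last remaining member is never dropped. Function names and the toy graph are hypothetical.

```python
def adjusted_expansion(adj, members, center):
    """Boundary links per member, ignoring the center node and its edges (assumed form)."""
    boundary = sum(1 for v in members for nb in adj[v]
                   if nb != center and nb not in members)
    return boundary / len(members)


def build_microcommunity(adj, center, order=None):
    """Start from all neighbors of `center` and drop those whose removal lowers Expansion."""
    members = set(adj[center])
    for v in (order or sorted(adj[center])):
        if len(members) == 1:
            break  # keep at least one neighbor so the ratio stays defined
        before = adjusted_expansion(adj, members, center)
        after = adjusted_expansion(adj, members - {v}, center)
        if after < before:
            members.remove(v)
    return members | {center}


# Hypothetical toy graph: center 0, a tight triangle 1-2-3, and node 4,
# which mostly links to outside nodes 5, 6, 7.
adj = {0: {1, 2, 3, 4}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2},
       4: {0, 5, 6, 7}, 5: {4}, 6: {4}, 7: {4}}

print(adjusted_expansion(adj, {1, 2, 3, 4}, 0))  # 0.75
print(sorted(build_microcommunity(adj, 0)))      # [0, 1, 2, 3]
```

Removing any triangle node turns its internal links into boundary links and raises the value, so those nodes stay; removing node 4 takes all three outward links with it, so it is screened out.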
2.2. Algorithm Flow
In this algorithm, each selected center node is used as the core node of a microcommunity. The algorithm merges microcommunities by comparing how closely different core nodes are linked. Let $i$ and $j$ be core nodes, $k_i$ and $k_j$ their degrees, $C_i$ and $C_j$ their assigned community numbers, $A_{ij}$ the entry of the adjacency matrix for the two nodes, and $n_{ij}$ the number of neighbor nodes shared by $i$ and $j$. If $A_{ij} = 0$, no action is taken for the two nodes. If $A_{ij} = 1$, $n_{ij}$ is calculated. If $n_{ij} > k_i / 2$, then $C_i = C_j$, and all nodes of the microcommunity of $i$ are updated synchronously. If $n_{ij} > k_j / 2$, then $C_j = C_i$, and all nodes of the microcommunity of $j$ are updated synchronously.
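The fusion step can be illustrated with the following Python sketch. The merge condition used here, that the number of shared neighbors exceeds half the degree of one core node, is an assumption based on the strong-community definition and the overlap ratio of 1/2 mentioned in the text; it is not confirmed as the exact rule of the paper. The function name, the seeded shuffle, and the toy data are hypothetical.

```python
import random


def fuse_microcommunities(adj, micro, seed=0):
    """Fuse microcommunities of adjacent core nodes; returns dict node -> community label.

    micro -- dict mapping each core node to the set of nodes in its microcommunity
    """
    label = {}
    for core, members in micro.items():
        for v in members:
            label.setdefault(v, core)  # first microcommunity wins for shared nodes
    for core in micro:
        label[core] = core
    cores = list(micro)
    pairs = [(i, j) for a, i in enumerate(cores) for j in cores[a + 1:]]
    random.Random(seed).shuffle(pairs)  # random processing order
    for i, j in pairs:
        if j not in adj[i] or label[i] == label[j]:
            continue  # not adjacent, or already fused
        shared = len(adj[i] & adj[j])
        if shared > len(adj[i]) / 2:    # assumed condition: i joins j
            old, new = label[i], label[j]
        elif shared > len(adj[j]) / 2:  # assumed condition: j joins i
            old, new = label[j], label[i]
        else:
            continue
        for v in label:                 # update the whole community at once
            if label[v] == old:
                label[v] = new
    return label


# Hypothetical toy graph: cores 0 and 1 are adjacent and share neighbors 2, 3, 4;
# core 5 is not adjacent to either of them.
adj = {0: {1, 2, 3, 4}, 1: {0, 2, 3, 4}, 2: {0, 1}, 3: {0, 1}, 4: {0, 1, 6},
       5: {6, 7, 8}, 6: {4, 5}, 7: {5}, 8: {5}}
micro = {0: {0, 2, 3, 4}, 1: {1, 2, 3, 4}, 5: {5, 6, 7, 8}}
labels = fuse_microcommunities(adj, micro)

print(len(set(labels.values())))  # 2
```

Cores 0 and 1 share three of their four neighbors, so their microcommunities are fused, while the microcommunity of core 5 remains separate.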
After the fundamental fusion of microcommunities is complete, the nodes that have not been clustered are classified. For each such node, the community numbers of its neighbor nodes are counted, and the node is added to the community that accounts for the largest proportion of its neighbors. Merging is carried out in the order in which nodes are visited, and the network nodes are searched three times. However, the relationships between nodes in complex networks are extremely complicated: a node may share neighbor nodes with more than one other node, with an overlap ratio of more than 1/2 in each case. Therefore, a fixed search order can produce unpredictable extreme situations. For this reason, the search order in this paper is generated randomly, and in the last step of the algorithm the result that best fits the community clustering rules is selected.
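The classification of leftover nodes amounts to a majority vote among labeled neighbors. The following Python sketch is a minimal illustration under the assumption that nodes without any labeled neighbor are retried after other nodes have been assigned; the function name and toy data are hypothetical.

```python
from collections import Counter


def assign_unlabeled(adj, label):
    """Give each unlabeled node the community most common among its labeled neighbors."""
    label = dict(label)  # do not modify the caller's mapping
    pending = [v for v in adj if v not in label]
    progress = True
    while pending and progress:
        progress = False
        for v in list(pending):
            votes = Counter(label[nb] for nb in adj[v] if nb in label)
            if votes:
                label[v] = votes.most_common(1)[0][0]
                pending.remove(v)
                progress = True
    return label


# Hypothetical toy graph: communities A = {0, 1, 2} and B = {3, 4, 5} are known;
# node 6 has one neighbor in A and two in B, and node 7 hangs off node 6.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 6}, 3: {4, 5, 6}, 4: {3, 5},
       5: {3, 4, 6}, 6: {2, 3, 5, 7}, 7: {6}}
known = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
full = assign_unlabeled(adj, known)

print(full[6], full[7])  # B B
```

Node 7 has no labeled neighbor at first and is classified only once node 6 has joined community B, which is why the loop repeats until no further progress is made.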
The following are the specific steps of the algorithm.
Step 3. The community which is not detected is added to its most closely linked community.
Step 4. Classify nodes of communities which have not been detected.
Step 5. Save the classification result, compute its modularity density $D$ after the communities are detected, and save the value.
Because different merging orders of network nodes can lead to different results, the clustering result with the largest modularity density, computed according to formula (6), is used as the final community clustering result.
3. Simulations and Analysis
The algorithm of this paper is written in Java. The hardware environment for running the program is an Intel(R) Core(TM) i5-4200U CPU at 1.60 GHz with 4 GB of RAM. The software environment is the Microsoft Windows 8.1 operating system, JDK 1.7, and the Eclipse development environment.
To make the quality of network community clustering easy to analyze, this paper adopts the Normalized Mutual Information (NMI) index to compare the actual clustering result with the clustering result of this algorithm. NMI is commonly used to estimate the similarity between the true clustering results and the detected ones. Two vectors, $A$ and $B$, are the inputs of the comparison. The $i$th element of each vector represents the class of the $i$th node. NMI($A$, $B$) is then defined as follows:
$$\mathrm{NMI}(A, B) = \frac{-2 \sum_{i=1}^{C_A} \sum_{j=1}^{C_B} C_{ij} \log \left( \frac{C_{ij} N}{C_{i\cdot} C_{\cdot j}} \right)}{\sum_{i=1}^{C_A} C_{i\cdot} \log \left( \frac{C_{i\cdot}}{N} \right) + \sum_{j=1}^{C_B} C_{\cdot j} \log \left( \frac{C_{\cdot j}}{N} \right)},$$
where $C_A$ ($C_B$) is the number of clusters in vector $A$ ($B$), $C$ is the mixing (confusion) matrix formed from vector $A$ and vector $B$, $C_{ij}$ is the number of elements shared in common by the $i$th class of vector $A$ and the $j$th class of vector $B$, $C_{i\cdot}$ ($C_{\cdot j}$) is the sum of the elements of $C$ in row $i$ (column $j$), and $N$ is the number of nodes of the network. The value of NMI($A$, $B$) is in the interval $[0, 1]$. If NMI($A$, $B$) = 1, then $A = B$. If NMI($A$, $B$) = 0, then $A$ and $B$ are totally different.
This paper tests the algorithm on Dolphin Networks, Football Networks, Karate Networks, and other data sets. The experimental results show that the algorithm clusters better than the other algorithms compared and also executes more efficiently.
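The NMI index can be computed from the counts of the mixing matrix. The following Python sketch assumes the widely used normalized form with row and column sums in the denominator; the function name and the example vectors are hypothetical, and empty cells of the matrix are skipped under the convention that 0 log 0 = 0.

```python
import math
from collections import Counter


def nmi(a, b):
    """Normalized mutual information between two label vectors of equal length."""
    n = len(a)
    joint = Counter(zip(a, b))  # C_ij: nodes in class i of a and class j of b
    row = Counter(a)            # C_i. : row sums
    col = Counter(b)            # C_.j : column sums
    num = -2.0 * sum(c * math.log(c * n / (row[i] * col[j]))
                     for (i, j), c in joint.items())
    den = (sum(c * math.log(c / n) for c in row.values())
           + sum(c * math.log(c / n) for c in col.values()))
    if den == 0:
        return 1.0  # both vectors place every node in a single class
    return num / den


truth = [0, 0, 0, 1, 1, 1]
same_up_to_renaming = [1, 1, 1, 0, 0, 0]

print(round(nmi(truth, same_up_to_renaming), 6))  # 1.0
```

The index depends only on how nodes are grouped and not on the label names, so a renamed but otherwise identical partition still scores 1.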
3.1. Experimental Data Analysis of Dolphin Networks
Each node in the data set represents a bottlenose dolphin. By observing the living habits of these dolphins over a long period, researchers found that the dolphins show a specific pattern of contact and constructed a social network containing 62 nodes. If two dolphins frequently associate with each other, there is an edge between the two corresponding nodes in the network.
The algorithm of this paper performs community clustering on Dolphin Networks and selects the result with the maximum modularity density. For this clustering result, the modularity is 0.374 and the NMI is 1.0; that is, the result is the same as the actual community structure.
As stated in Section 2.1, the threshold selected for this data set is Deg = 7; that is, nodes whose degree is equal to or greater than 7 are chosen as core nodes. During the investigation of the data set, we tested multiple values of the center-node proportion $p$. Figure 3 compares the proportion of runs that recover the real clustering result for 10 groups of parameter values, in which $p$ takes values from 0 to 90%. As shown in the diagram, the real clustering result occupies the largest proportion when $p$ = 20% and $p$ = 30%. The calculation shows that $p$ = 30% yields the same threshold as $p$ = 20%.
3.2. Experimental Data Analysis of Football Networks
In the network, each node represents a university team that took part in the 2000 American college football season. An edge between two nodes indicates that the two corresponding teams played each other at least once; it does not represent any other relationship between the teams.
The actual community structure of Football Networks is given in Figure 4. The community clustering result obtained with the algorithm in this paper is shown in Figure 5. The modularity of the actual community clustering of Football Networks is −0.0239, and its modularity density is −100.83. Clearly, the actual clustering of Football Networks does not fully comply with the rules of network community clustering. In Figure 4, none of the nodes of community 6 meet the basic rules of community clustering: they have no connections with each other, but their connections with nodes of other communities are dense. Other communities also contain a few nodes that have fewer connections with their own community, which is inevitable for communities with many nodes.
Figure 6 compares the experimental results for different values of the center-node proportion $p$. Because the real clustering of this data set is irrational, the diagram only shows the proportion of experimental results whose modularity density is larger than 1. The diagram shows that this proportion is large when $p$ is between 20% and 50%, and the resulting degree threshold is the same over this range, so we choose 20% in the experiment.
3.3. Experimental Data Analysis of Karate Networks
This network is a classical data set in the field of social network analysis. In the early 1970s, the sociologist Zachary spent two years observing the social relations among the 34 members of a karate club at an American university. The network consists of 34 nodes. An edge between two nodes indicates that the corresponding members are friends and contact each other frequently. The network attribute profile is shown in Table 4.
We apply the algorithm to Karate Networks and select the result with the maximum modularity density. This result is the same as the actual community clustering. The topology of the Karate Networks community clustering is shown in Figure 7.
Figure 8 compares the experimental results on this data set for different values of the center-node proportion $p$. When $p$ = 20%, the degree threshold is 6 and the real clustering result occupies the largest proportion.
By comprehensively comparing the experimental results for different values of $p$ on different networks, we choose $p$ = 20% as the final parameter, which gives good experimental results for most of the networks. In fact, a wide range of $p$ values is acceptable because one threshold may correspond to multiple values of $p$, and $p$ = 20% falls within the parameter range of most of the better experimental results.
4. Conclusions
There are many kinds of algorithms for community clustering in complex networks, and all of them have both advantages and drawbacks. Based on node degree and Expansion, the algorithm proposed in this paper clusters the network into several microcommunities and obtains the final community structure by merging them. The clustering is carried out by generating 100 random sequences for merging core nodes, and the result with the maximum modularity density is selected as the final clustering result. Experiments show that the screening method used in this paper efficiently finds results very close to the community structure of real networks. Following the basic principle of network community clustering, the algorithm yields a better community structure in an efficient way.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This paper was supported by Hi-tech Research and Development of China, 863 Funds (2009AA01Z212), National Natural Science Foundation of China (61003237 and 61401225), Jiangsu Provincial National Science Foundation (BK20140894), NUPTSF (Grants nos. NY213047 and NY213050), and Higher Education Revitalization Plan Foundation of Anhui (2013SQRL102ZD).
- S. Kang and D. A. Bader, “Large scale complex network analysis using the hybrid combination of a mapreduce cluster and a highly multithreaded system,” in Proceedings of the IEEE International Symposium on Parallel and Distributed Processing, Workshops and Phd Forum (IPDPSW '10), pp. 1–8, Atlanta, Ga, USA, April 2010.
- T. Pei, H. Zhang, Z. Li, and Y. Choi, “Survey of community structure segmentation in complex networks,” Journal of Software, vol. 9, no. 1, pp. 89–93, 2014.
- D. W. Zhang, F. D. Xie, D. P. Wang, Y. Zhang, and Y. Sun, “Cluster analysis based on bipartite network,” Mathematical Problems in Engineering, vol. 2014, Article ID 676427, 9 pages, 2014.
- Z. Li, S. Zhang, R.-S. Wang, X.-S. Zhang, and L. Chen, “Quantitative function for community detection,” Physical Review E—Statistical, Nonlinear, and Soft Matter Physics, vol. 77, no. 3, Article ID 036109, 2008.
- B. Yang, D.-Y. Liu, J. Liu, D. Jin, and H.-B. Ma, “Complex network clustering algorithms,” Ruan Jian Xue Bao/Journal of Software, vol. 20, no. 1, pp. 54–66, 2009.
- X. Liu, D. Li, S. Wang, and Z. Tao, “Effective algorithm for detecting community structure in complex networks based on GA and clustering,” in Computational Science—ICCS 2007: 7th International Conference, Beijing, China, May 27–30, 2007, Proceedings, Part II, vol. 4488 of Lecture Notes in Computer Science, pp. 657–664, Springer, Berlin, Germany, 2007.
- X. Liu and T. Murata, “Advanced modularity-specialized label propagation algorithm for detecting communities in networks,” Physica A: Statistical Mechanics and Its Applications, vol. 389, no. 7, pp. 1493–1500, 2010.
- M. G. Gong, Q. Cai, X. W. Chen, and L. J. Ma, “Complex network clustering by multiobjective discrete particle swarm optimization based on decomposition,” IEEE Transactions on Evolutionary Computation, vol. 18, no. 1, pp. 82–97, 2014.
- Y. H. Liu, J. Z. Jin, Y. Zhang, and C. Xu, “A new clustering algorithm based on data field in complex networks,” Journal of Supercomputing, vol. 67, no. 3, pp. 723–737, 2014.
- C. Tong, J. W. Niu, B. Dai, and Z. Y. Xie, “A novel complex networks clustering algorithm based on the core influence of nodes,” The Scientific World Journal, vol. 2014, Article ID 801854, 7 pages, 2014.
- G. Palla, I. Derényi, I. Farkas, and T. Vicsek, “Uncovering the overlapping community structure of complex networks in nature and society,” Nature, vol. 435, no. 7043, pp. 814–818, 2005.
- M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical Review E—Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 2, Article ID 026113, 15 pages, 2004.
- J. Leskovec, K. J. Lang, and M. Mahoney, “Empirical comparison of algorithms for network community detection,” in Proceedings of the 19th International Conference on World Wide Web (WWW '10), pp. 631–640, ACM, Raleigh, NC, USA, April 2010.
- M. Khalilian, F. Z. Boroujeni, N. Mustapha, and M. N. Sulaiman, “K-means divide and conquer clustering,” in Proceedings of the International Conference on Computer and Automation Engineering (ICCAE '09), pp. 306–309, IEEE Computer Society, Bangkok, Thailand, May 2009.
- R. Green, I. Staffell, and N. Vasilakos, “Divide and Conquer? k-means clustering of demand data allows rapid and accurate simulations of the British electricity system,” IEEE Transactions on Engineering Management, vol. 61, no. 2, pp. 251–260, 2014.
- F. Wu and B. A. Huberman, “Finding communities in linear time: a physics approach,” European Physical Journal B, vol. 38, no. 2, pp. 331–338, 2004.
Copyright © 2015 Jin Qi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.