Research Article | Open Access
Neighbor Similarity Based Agglomerative Method for Community Detection in Networks
Community structures can reveal the organization and functional properties of complex networks; hence, detecting communities in networks is of great importance. With the surge of large networks in recent years, the efficiency of community detection has become critical, and many local methods have emerged as a result. In this paper, we propose a node similarity based community detection method, which is also a local one and consists of two phases. In the first phase, we take the node with the largest degree in the network as the exemplar of the first community and insert its most similar neighbor into that community as well. Then, the node with the largest degree among the remaining nodes is selected; if its most similar neighbor has not been classified into any community yet, we create a new community for the selected node and its most similar neighbor. Otherwise, we insert the selected node into the community to which its most similar neighbor belongs. This procedure is repeated until every node in the network is assigned to a community, at which point we obtain a series of preliminary communities. However, some of them might be too small or too sparse, with more edges connecting to the outside than lying inside. Keeping them as final communities would lead to a low-quality community structure. Therefore, in the second phase we merge some of them in an efficient way to improve the quality of the resulting community structure. To evaluate the performance of the proposed method, extensive experiments are performed on both artificial and real-world networks. The results show that the proposed method detects high-quality community structures from networks steadily and efficiently and outperforms the comparison algorithms significantly.
Many real-world systems can be abstracted as complex networks, in which nodes represent entities in the systems, and edges correspond to interactions between the entities. One of the most significant characteristics observed in these complex networks is the “community structure,” which means that nodes in the network can be divided into groups naturally; nodes in the same group are connected densely, and connections across different groups are relatively sparse; each of the node groups is a so-called “community.”
Communities are often related to functional modules of networks. For instance, communities can be groups of web pages in WWW networks or scientific papers in citation networks sharing the same topics, books with the same political orientations copurchased from the online bookseller Amazon.com, or pathways or complexes in metabolic networks or protein-protein interaction networks [4, 5]. In social networks, communities often correspond to real social groupings having the same interests or professional occupations, e.g., scientist groups classified according to the scientists’ specialties in coauthorship collaboration networks [6, 7], jazz musician groups divided according to location and race, or affiliations of gang members in the policing area of Hollenbeck, Los Angeles. Besides this, some studies have indicated that networks can present quite different properties when considered at the community level, rather than from the perspective of the entire network or the individual node.
Therefore, analyzing the community structures in networks helps to recognize the characteristics of networks and to make further predictions about the functional properties of the corresponding systems. That is to say, community detection provides an effective means of studying the functional properties of networks by examining their structural characteristics, which is valuable in practical applications. Consequently, a multitude of methods [11, 12] have been proposed for detecting communities in complex networks; we will review some related literature in Section 2.
In this paper, we propose a community detection method as well, which is based on node similarity and consists of two phases. The first phase repeatedly selects the node with the largest degree in the remainder of the network and either takes it as the exemplar of a new community or inserts it into the community to which its most similar neighbor belongs, according to its most similar neighbor’s community affiliation. At the end of this phase, we get a series of communities. However, they are only the preliminary communities; some of them might be too small or too sparse; edges connecting to outside of them might go far beyond the ones inside them. Accepting them as the final ones will lead to a low-quality community structure. Therefore, the second phase merges some of the preliminary communities to improve the quality of the resulting community structure.
The main contributions of this work can be summarized as follows.
(i) We propose a node similarity based local algorithm, abbreviated as NSA, for community detection, which is a two-phase method. The first phase obtains the preliminary communities, and the second phase merges some of the preliminary communities to improve the quality of the resulting community structure.
(ii) We propose an index, the community metric, to measure the sparsity or smallness of a community. In the second phase, we use the index as a criterion to determine which preliminary communities need to be merged.
(iii) Extensive experiments on artificial networks and real-world networks are carried out to evaluate the performance of the proposed method. The experimental results show that the proposed method performs steadily well in both result quality and running time and outperforms its competitors.
The remainder of this paper is organized as follows. Section 2 reviews some literature about community detection. The details of the proposed algorithm are elaborated in Section 3. The experimental results and analysis on both artificial networks and real-world networks are presented in Section 4. In Section 5, we discuss how to set the optimal value for a parameter introduced in our proposed method, and the paper ends with a conclusion in Section 6.
2. Related Work
A great many community detection methods have been proposed in the last decade; these methods try to explore communities in networks from various perspectives. The graph theory-based methods treat community detection as the traditional task of graph partitioning and divide the network into subnetworks. Kernighan-Lin is a representative method of this kind, which first partitions the network into two arbitrary subnetworks and then repeatedly swaps nodes between the two subnetworks to maximize a predefined gain function.
The hierarchical clustering methods reveal multilevel community structures in divisive, agglomerative, or hybrid ways. For example, the GN algorithm [6, 7] detects communities by repeatedly removing the edge with the largest betweenness from the network; its output is a dendrogram representing the nested hierarchy of possible community structures of the network, and the level corresponding to the largest value of a measure called modularity is taken as the final result. The FastQ algorithm [23, 24] takes each node in the network as a community first and then repeatedly merges two communities into one. Its output is also a dendrogram depicting the merge procedure of possible community hierarchies. Zarandi et al. randomly removed some edges with low similarity to obtain disconnected components as the primary communities and then merged some of them to get the resulting community structure.
The modularity optimization-based algorithms detect community structures by utilizing the physical meaning of modularity (the higher the value of modularity, the better the community structure) and taking the modularity as the objective to optimize. For instance, in order to maximize the modularity of the community structure, FastQ [23, 24] joins, in each iteration, the pair of communities whose merge leads to the largest modularity increment. The Louvain algorithm uses a node-moving strategy to extract a community structure with optimized modularity from the network. It also begins with an initial partition in which each node is a community; then, for each node, the algorithm evaluates the modularity gain of moving it into the community of each of its neighbors and moves the node into the community with the largest positive modularity gain. The SLM (Smart Local Moving) algorithm searches for possibilities of increasing modularity with respect to both splitting communities and moving sets of nodes from one community to another.
LPA (Label Propagation Algorithm) uses an information propagation mechanism to detect communities in networks. Every node in the network is initialized with a unique label, and all nodes are arranged in a random order; then each node, in that order, updates its label to the one occurring most frequently among its neighbors. The label update procedure ends when every node in the network has a label that is the majority one among its neighbors, and nodes with the same label form a community. Owing to its simplicity and high efficiency, several variants have been derived from LPA. Barber et al. proposed a series of algorithms that propagate labels under some constraints; LPAm is the best known of them, which tries to maximize the modularity during the label propagation procedure. Chin et al. first identified the main communities using the number of mutual neighboring nodes; then they attached some independent constraints to the basic LPA and used the constrained LPA to add the remaining nodes into communities; finally, they used a node-moving strategy like the one employed in Louvain to refine the quality of the resulting community structure. Ding et al. presented a modified version of LPA, which exploits the idea of density peak clustering and the Chebyshev inequality to choose community centers from the network and then propagates the labels of the selected centers to the whole network with the proposed multistrategy of label propagation.
Density-based methods define and utilize the concept of density in networks for nodes or communities to uncover community structures. SCAN  borrows the idea from the classical density-based clustering algorithm, DBSCAN , to reveal communities, hubs, and outliers from networks. SCAN++  is a derivative of SCAN; it reduces time consumption via introducing a new data structure and reducing the number of density evaluations in the detecting procedure. IsoFdp  maps the network nodes as data points into a low-dimensional manifold and then exploits the density peak clustering algorithm  to extract the final community structure. LCCD algorithm  also practices on the way proposed in the density peak clustering algorithm  to locate the structural centers from networks and then expands communities from the identified centers to the borders using a local search procedure.
Network dynamic-based methods explore community structures by simulating the dynamic processes in networks. Random walk is a typical dynamic procedure carried out in networks; random walk-based methods utilize the tendency of the walker being trapped into a community during a short walk, rather than walking across the community border into another community, to detect communities from networks. WalkTrap  makes use of random walk to calculate the probability of going from one node to another during a short-length walk and then calculates the distance to measure nodes’ similarities and community similarities. PPC algorithm  considers the network as a single community initially and recursively partitions each community utilizing node similarities computed using random walks until further partitioning cannot acquire a better value of modularity. RWA  employs random walks to calculate the probability of a node belonging to a community, and each community is expanded by repeatedly attracting the node which is most likely to belong to that community to join. Besides this, Attractor  utilizes distance dynamics to explore communities from networks, node interactions might change the distances among nodes, and the distance change will make an impact on the interaction in reverse. Members of the same community will gradually move together under such interplays, and nodes in different communities will keep far away from each other steadily. BiAttractor  extends the concept of distance dynamics and the idea of Attractor to bipartite networks, which is used to detect two-mode communities of bipartite networks.
Spectral methods engage eigenspectra of various network-associated matrices to extract communities. For example, Amini et al.  found the initial node partitions using the spectral clustering method based on the normalized Laplacian matrix derived from a regularized adjacency matrix; those partitions were used for fitting a stochastic block model by a pseudolikelihood algorithm to detect the resulting community structure. Siemon C. de Lange et al.  identified an integrative community structure in the macroscopic anatomical neural networks of the macaque and cat and the microscopic network of the C. elegans by examining the spectra of their normalized Laplacian matrices. Krzakala et al.  produced a class of spectral algorithms to detect communities based on the nonbacktracking matrix, which depicts a nonbacktracking walk on the directed edges of the network. Shi et al.  proposed a spectral community detection method, LLSA, which employs Lanczos method to obtain the approximated eigenvector of the transition matrix with the largest eigenvalue, and the elements of this eigenvector approximately indicate the affiliation probability of the corresponding nodes to the communities.
Most of the methods mentioned above are global ones; they often depend on global information as prior knowledge, such as the number of communities or information about eigenvalues or eigenvectors, which is hard to acquire as the networks involved grow larger and larger. Moreover, most of them are computationally demanding, leading to high time complexity. These limitations prevent them from being applied to large-scale networks. To overcome these deficiencies of the global algorithms, many local methods have been proposed, including some of the aforementioned ones. For example, LPA and most of its variants determine which label a node should adopt according to its neighborhood only; LCCD takes into account both the local density of nodes and the relative distance between nodes to locate the local structural centers and expands communities from these centers with a local search procedure; LLSA applies fast heat kernel diffusion to sample a small subnetwork including almost all members of a community and obtains the eigenvector whose elements indicate the nodes’ community memberships by performing the Lanczos method on the sampled subnetwork.
Besides this, the ComSim algorithm identifies cores of communities in bipartite networks by seeking cycles, which are node chains formed by following outgoing links until reaching a node already visited, and then allocates each remaining node to the community that maximizes the similarity between the node and the community. In the BLI algorithm, local clustering information and local structural similarity are employed to establish the primary community structure; then small-scale communities whose sizes are smaller than a given threshold are absorbed by larger ones. kSIM is also a local method that works in a bottom-up way. At the beginning, each node is taken as a community; then the preliminary communities are formed by identifying, for each node, the community to which its most similar neighbor with the lowest degree belongs and assigning the node to that community. In this procedure, the common neighbor index is employed as the similarity measure for each pair of nodes.
Compared to the global methods, these local methods show good performance on large-scale networks. Inspired by this, we also propose a local method to extract communities from networks. The proposed method is based on node similarity and is termed NSA (Node Similarity based Algorithm) for short; it comprises two phases: the first phase constructs the preliminary community structure, and the second phase improves the quality of the final result by merging some small or sparse communities. To do so, we also propose a measure, the community metric, to evaluate the sparsity or smallness of communities. The details of the proposed method are elaborated in the next section.
3. The Proposed Method
3.1. The Framework of the Proposed Method
The framework of the proposed method is outlined by the pseudocode listed in Algorithm 1.
|Input: $G = (V, E)$, the network; $\alpha$, the community metric threshold|
|Output: $CS$, the detected community structure|
|/* form the preliminary community structure */|
|1 $CS_p \leftarrow$ FPC($G$)|
|/* merge small or sparse communities in $CS_p$ */|
|2 $CS \leftarrow$ PCM($CS_p$, $\alpha$)|
As mentioned previously, the proposed method consists of two phases. Function calls FPC() and PCM() implement the two phases, respectively. The former establishes the preliminary community structure based on a node selection strategy and the node similarity; the latter merges some small or sparse communities to improve the quality of the resulting community structure. The inputs of this algorithm are the network $G$ and a threshold $\alpha$; the network considered in this paper is an undirected and unweighted graph, represented as $G = (V, E)$ as in Algorithm 1, where $V$ and $E$ are the node set and edge set, respectively, and $n = |V|$ and $m = |E|$ are the numbers of nodes and edges in the network, respectively. The threshold $\alpha$ is used in the second phase of the proposed method to identify communities to be merged: a community whose community metric is smaller than $\alpha$ should be merged into another one. The output of this algorithm is the detected community structure.
The next two subsections describe the two procedures in detail.
3.2. Formation of the Preliminary Community Structure
The function FPC() implements the first phase of the proposed method, whose purpose is to construct the preliminary community structure from the network. We first pick out the node with the largest degree from the network, take it as the exemplar of the first community, and insert its most similar neighbor into the community as well (if more than one node has the largest degree, we arbitrarily select one of them as the exemplar; and if the exemplar has more than one most similar neighbor, the one with the smallest degree is selected). Afterwards, the node with the largest degree in the remainder of the network is selected; if its most similar neighbor has not been classified into any community yet, we create a new community for it and its most similar neighbor. Otherwise, if its most similar neighbor has been assigned to a certain community (e.g., the one denoted as $C_k$), we insert the selected node into that community (i.e., $C_k$) as well. We repeat this process until every node is classified into a community. In this procedure, densely connected nodes can quickly gather around the exemplars to form communities. At the end of this procedure, we get a series of communities, which constitute the preliminary community structure of the network. The pseudocode describing the entire procedure is listed in Algorithm 2.
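|Input: $G = (V, E)$, the network|
|Output: $CS_p$, the identified preliminary community structure|
|1 Initialize variables $U$ and $CS_p$, which are used to record the unclassified nodes and the preliminary community structure: $U \leftarrow V$, $CS_p \leftarrow \emptyset$|
|2 Select the node with the largest degree, denote it as $v$: $v \leftarrow \arg\max_{x \in U} d(x)$|
|3 Get the most similar neighbor of $v$, denote it as $u$: $u \leftarrow \arg\max_{x \in N(v)} s(v, x)$|
|4 if $u$ has not been assigned to any community then|
|5 Create a new community for nodes $v$ and $u$: $C \leftarrow \{v, u\}$|
|6 Insert the created community into the community structure: $CS_p \leftarrow CS_p \cup \{C\}$|
|7 Remove nodes $v$ and $u$ from $U$ as they are classified: $U \leftarrow U \setminus \{v, u\}$|
|8 else|
|9 Find the community to which $u$ belongs, denote it as $C_k$|
|10 Insert node $v$ into $C_k$: $C_k \leftarrow C_k \cup \{v\}$|
|11 Remove node $v$ from $U$ as it is classified: $U \leftarrow U \setminus \{v\}$|
|12 Repeat steps 2 through 11, until $U = \emptyset$|
|13 return $CS_p$|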
In this algorithm, the degree of node $v$ is the number of $v$’s neighbors and is denoted as $d(v)$, i.e.,
$$d(v) = |N(v)|,$$
where $N(v)$ is the set of neighbors of node $v$. $s(v, u)$ stands for the similarity between nodes $v$ and $u$. There are abundant ways to calculate the similarity between nodes in a network, and any of them can be employed in principle. However, for the sake of efficiency, we calculate it here by equation (3), which involves only the nodes themselves and their neighborhoods. The variables $U$ and $CS_p$ are used to record the unclassified nodes and the preliminary community structure; they are naturally initialized to the node set $V$ of the network and the empty set, respectively, in step 1. Steps 2 and 3 select the node with the largest degree from the remainder of the network and its most similar neighbor and denote them as $v$ and $u$, respectively. Step 4 determines whether $u$ has been assigned to a community; if it has not, steps 5 and 6 create a new community for nodes $v$ and $u$ and insert the newly created community into $CS_p$; then step 7 removes nodes $v$ and $u$ from $U$, as they have just been classified into the new community. If node $u$ has already been assigned to a community, step 9 finds the community $C_k$ to which $u$, node $v$’s most similar neighbor, belongs, and step 10 inserts node $v$ into community $C_k$. Since node $v$ has been assigned to community $C_k$, step 11 removes it from $U$. Step 12 repeats the operations in steps 2 through 11 until $U = \emptyset$, meaning that all the nodes in the network have been visited. At that time, the preliminary community structure is held in $CS_p$ and is returned as the output of this algorithm in step 13.
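The first phase can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the adjacency-dict representation, the function names `similarity` and `form_preliminary_communities`, and in particular the Jaccard index over closed neighborhoods used as the node similarity are choices made here, standing in for the similarity of equation (3). Node identifiers are assumed to be integers, the node id is used as a final tie-breaker only to keep the result deterministic, and the handling of isolated nodes is likewise an assumption, since the text does not cover that case.

```python
def similarity(adj, u, v):
    """Assumed stand-in for equation (3): Jaccard index of closed neighborhoods."""
    nu, nv = adj[u] | {u}, adj[v] | {v}
    return len(nu & nv) / len(nu | nv)


def form_preliminary_communities(adj):
    """adj maps each node to the set of its neighbors (undirected, unweighted)."""
    unclassified = set(adj)   # nodes not yet assigned to a community
    label = {}                # node -> index of its community
    communities = []          # the preliminary community structure
    while unclassified:
        # Step 2: unclassified node with the largest degree (smaller id on ties).
        v = max(unclassified, key=lambda x: (len(adj[x]), -x))
        if not adj[v]:
            # Assumption: an isolated node forms a community of its own.
            label[v] = len(communities)
            communities.append({v})
            unclassified.discard(v)
            continue
        # Step 3: most similar neighbor; equally similar ones are ranked
        # by smaller degree, then by smaller id.
        u = max(adj[v], key=lambda w: (similarity(adj, v, w), -len(adj[w]), -w))
        if u not in label:
            # Steps 5-7: open a new community for v and u.
            label[v] = label[u] = len(communities)
            communities.append({v, u})
            unclassified -= {v, u}
        else:
            # Steps 9-11: v joins the community of its most similar neighbor.
            label[v] = label[u]
            communities[label[u]].add(v)
            unclassified.discard(v)
    return communities
```

On a toy graph made of two triangles joined by a single edge, this sketch groups each triangle into one preliminary community.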
To make it clearer, we take Zachary’s karate club network as an example to illustrate the procedure intuitively. This is a network with 34 nodes and 78 edges as shown in Figure 1(a), in which the node with the largest degree is node ‘34’, and its most similar neighbor is node ‘33’. Therefore, node ‘34’ is taken as the exemplar of the first community, and node ‘33’ is also inserted into this community. Then, the node with the largest degree in the remaining nodes is node ‘1’; its most similar neighbor is node ‘2’. Since node ‘2’ has not been assigned to a community yet, we create a new community, take node ‘1’ as its exemplar, and insert node ‘2’ into the new community as well. The same thing happens to node pairs (‘3’, ‘4’), (‘32’, ‘29’), and (‘9’, ‘31’) sequentially. Then the next largest-degree node is ‘14’; its most similar neighbor, node ‘4’, is already in the third community; therefore, we insert node ‘14’ into the third community. All of the other nodes are processed in the same way, and in the subsequent operations, node pairs (‘24’, ‘30’), (‘6’, ‘7’), (‘5’, ‘11’), and (‘25’, ‘26’) form new communities; all of the remaining nodes are inserted into the communities to which their most similar neighbors belong. At the end of the process, we obtain the preliminary community structure shown in Figure 1(b), in which each node connects to its most similar neighbor with a directed edge.
3.3. Merge of Small or Sparse Communities
At the end of the first phase of our proposed method, we obtain the preliminary community structure. However, some communities are either too small or too sparse to make sense, such as the preliminary communities {‘5’, ‘11’}, {‘9’, ‘31’}, {‘32’, ‘29’}, {‘25’, ‘26’, ‘28’}, {‘24’, ‘30’, ‘27’}, and {‘6’, ‘7’, ‘17’} in Figure 1(b): each of them contains only a few nodes, and its inside edges are very sparse, so the number of edges inside each of them is much smaller than the number of edges connecting to the outside. This violates the characteristic that connections inside a community are much denser than those across different communities. Keeping them in the final community structure would lower its quality. Therefore, we merge some of the preliminary communities to acquire the final result in the second phase, which is carried out by the function call PCM() in Algorithm 1.
To this end, two problems need to be solved in PCM(). The first is to identify which communities are small or sparse enough that they need to be merged into other ones; the second is to select the community into which each of the small or sparse communities should be merged.
For the first problem, we propose an index, the community metric, which takes into account two factors, community size and community sparsity, to find the preliminary communities that need to be merged. Here, we formalize the relevant concepts and the index as Definition 1 through Definition 3.
Definition 1 (community sparsity). The sparsity of community $C$ is defined as follows:
$$SP(C) = \frac{|E_C^{in}|}{|E_C^{out}|},$$
where $E_C^{in}$ is the set of edges within community $C$ and $E_C^{out}$ is the set of edges connecting nodes in community $C$ with other communities.
That is to say, the sparsity of community $C$ is defined as the ratio between the number of inner edges of $C$ and the number of outer edges of $C$. Obviously, the more edges exist within community $C$, the larger the value of $SP(C)$ will be, and vice versa.
Definition 2 (community scale). The scale of community $C$ is formalized as follows:
$$SC(C) = \frac{|V_C|}{n},$$
where $V_C$ is the set of nodes in community $C$ and $n$ is the number of nodes in the network.
Obviously, the scale of community $C$ is defined as the ratio of the number of nodes in $C$ to the total number of nodes in the network. The more nodes there are in community $C$, the larger the ratio will be, and vice versa.
Definition 3 (community metric). The community metric is a combination of both the community sparsity and the community scale, which is defined for community as follows:
On the basis of these definitions, the first problem can be solved by setting a community metric threshold, $\alpha$. That is to say, if the community metric of community $C$ is smaller than $\alpha$, community $C$ needs to be merged into another community.
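Definitions 1 and 2 translate directly into code, as the sketch below shows. The way the two factors are combined into the community metric is an assumption made for illustration (a plain product), as are the function names and the convention of returning infinity for a community with no outgoing edges; none of these is taken from the definition of the community metric itself.

```python
def community_sparsity(adj, community):
    """Definition 1: number of inner edges divided by number of outer edges."""
    # Each inner edge is seen from both of its endpoints, hence the halving.
    inner = sum(1 for u in community for v in adj[u] if v in community) // 2
    outer = sum(1 for u in community for v in adj[u] if v not in community)
    # Assumption: a community without outgoing edges counts as infinitely dense.
    return float("inf") if outer == 0 else inner / outer


def community_scale(adj, community):
    """Definition 2: share of the network's nodes that lie in the community."""
    return len(community) / len(adj)


def community_metric(adj, community):
    """Assumed combination for Definition 3: sparsity multiplied by scale."""
    return community_sparsity(adj, community) * community_scale(adj, community)
```

With a threshold chosen by the user, a community would be flagged for merging whenever `community_metric` returns a value below that threshold.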
For the second problem, we adopt a strategy that conforms to the construction of the preliminary communities. The preliminary communities are formed based mainly on node similarity in the first phase; therefore, we also use similarity as the criterion for merging communities, i.e., each of the small or sparse communities is merged into its most similar adjacent community. Here, the similarity between two communities, $C_i$ and $C_j$, is calculated by equation (7), in which $s(u, v)$ is the similarity between nodes $u$ and $v$, calculated using (3). In the function PCM() implementing the merge procedure, $C_i$ is a community that needs to be merged, and $C_j$ is one of its adjacent communities. The numerator of the right-hand side of (7) is the sum of similarities between nodes in communities $C_i$ and $C_j$. Dividing by the denominator constrains the priority given to larger communities, which prevents giant communities from forming.
The logic of the entire procedure of the second phase is listed in Algorithm 3; the operations are almost self-explanatory. The variable $CS$ is used to record the final community structure; it is initialized as the preliminary community structure, $CS_p$, in step 1. Step 2 calculates the community metric for each of the preliminary communities, steps 3 and 4 select the community with the smallest community metric, $C_i$, and its most similar community, $C_j$, step 5 merges them to yield a new community, and step 6 calculates the community metric for that new community. Step 7 replaces the two communities $C_i$ and $C_j$ with the new community in $CS$ to reflect the effect of the merge operation. Step 8 repeats the operations in steps 3 through 7 until the minimal community metric of the selected community is larger than the given threshold $\alpha$, meaning that all the remaining communities are satisfactory; the merge procedure is then terminated, and the resulting community structure in $CS$ is returned in step 9.
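|Input: $CS_p$, the preliminary community structure; $\alpha$, the community-metric threshold|
|Output: $CS$, the final community structure|
|1 Initialize $CS$, which is used to record the community structure: $CS \leftarrow CS_p$|
|2 Calculate the community metric for each of the preliminary communities: $CM(C)$ for each $C \in CS$|
|3 Select the community with the minimal community metric, denote its index as $i$: $i \leftarrow \arg\min_k CM(C_k)$|
|4 Identify the most similar community with $C_i$, denote its index as $j$: $j \leftarrow \arg\max_{k \neq i} s(C_i, C_k)$|
|5 Merge communities $C_i$ and $C_j$ to form a new community: $C_{new} \leftarrow C_i \cup C_j$|
|6 Calculate the community metric for the new community: $CM(C_{new})$|
|7 Replace the two communities $C_i$ and $C_j$ with the new community to reflect the merging effect: $CS \leftarrow (CS \setminus \{C_i, C_j\}) \cup \{C_{new}\}$|
|8 Repeat steps 3 through 7, until $\min_{C \in CS} CM(C) > \alpha$|
|9 return $CS$|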
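The merge loop of the second phase can be sketched as below. Several ingredients are assumptions made only so that the example runs on its own: the node similarity (Jaccard index of closed neighborhoods), the community metric (inner-to-outer edge ratio multiplied by the share of nodes), and the community similarity (similarities summed over linked node pairs, divided by the size of the adjacent community). The function names are hypothetical, and the loop checks the threshold before each merge rather than after it, which is a simplification of the repeat-until form of the pseudocode.

```python
def similarity(adj, u, v):
    """Assumed node similarity: Jaccard index of closed neighborhoods."""
    nu, nv = adj[u] | {u}, adj[v] | {v}
    return len(nu & nv) / len(nu | nv)


def metric(adj, c):
    """Assumed community metric: (inner edges / outer edges) * (|c| / n)."""
    inner = sum(1 for u in c for v in adj[u] if v in c) / 2
    outer = sum(1 for u in c for v in adj[u] if v not in c)
    return float("inf") if outer == 0 else (inner / outer) * (len(c) / len(adj))


def community_similarity(adj, ci, cj):
    """Assumed form: similarity summed over linked node pairs, divided by |cj|."""
    total = sum(similarity(adj, u, v) for u in ci for v in adj[u] if v in cj)
    return total / len(cj)


def merge_preliminary_communities(adj, preliminary, threshold):
    cs = [set(c) for c in preliminary]                      # step 1
    while len(cs) > 1:
        # Steps 2-3: community with the smallest metric.
        i = min(range(len(cs)), key=lambda k: metric(adj, cs[k]))
        if metric(adj, cs[i]) >= threshold:
            break                                           # all communities pass
        # Step 4: most similar community among the adjacent ones.
        adjacent = [k for k in range(len(cs)) if k != i
                    and any(v in cs[k] for u in cs[i] for v in adj[u])]
        if not adjacent:
            break
        j = max(adjacent, key=lambda k: community_similarity(adj, cs[i], cs[k]))
        merged = cs[i] | cs[j]                              # step 5
        # Step 7: replace the two communities with the merged one.
        cs = [c for k, c in enumerate(cs) if k not in (i, j)] + [merged]
    return cs
```

Recomputing every metric in each iteration keeps the sketch short; an implementation concerned with running time would cache the metrics and update only the merged community.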
3.4. Time Complexity
The proposed algorithm is comprised of two phases; the first one is to form the preliminary communities. The main time consumption in this phase is on the selection of the node with the largest degree (step 2 in Algorithm 2) and its most similar neighbor (step 3 in Algorithm 2); the former can be accomplished in in each iteration using a max-heap data structure, the latter can be got down in with the max-heap, where is the average degree of nodes in the network. Since , the time consumption of the first phase is .
The second phase is used to improve the quality of the resulting community structure by merging some of the small or sparse communities. The major time is spent on determining the community needed to be merged and its most similar adjacent community in each iteration. Assuming there are communities in the preliminary community structure, the former operation can be implemented in ; the latter can also be carried out with time consumption in the worst case. Hence, the second phase can be implemented with time consumption.
Since , then . Therefore, the proposed method can detect communities from networks with a relatively high efficiency, time complexity.
4. Experimental Results and Discussion
4.1. Network Datasets and Comparison System
To evaluate the performance of our proposed method, we have conducted extensive experiments on both several groups of artificial networks and some real-world networks. The artificial networks are synthesized using the LFR benchmark network generator, whose parameters control the characteristics of the generated networks. Here, we consider the influences of both the network scale and the community size; therefore, four types of networks are generated, namely, small networks with small communities, small networks with big communities, larger networks with small communities, and larger networks with big communities. The small and larger networks contain 1000 and 5000 nodes, respectively; a small community contains roughly 10 to 50 nodes; the minimum and maximum numbers of nodes in the big communities are 20 and 100, respectively. The generated networks with small communities and big communities are marked with the suffixes ‘s’ and ‘b’, respectively. The exponents of the power-law distributions that node degree and community size follow are left at their default values. The parameters used to synthesize the four groups of artificial networks are listed in Table 1.
We also performed the experiments on 13 real-world networks; the size of these networks spans from tens to hundreds of thousands of nodes; the information about them is listed in Table 2. These real-world networks can be divided into two categories: the first category includes the first four networks whose ground-truth communities are known a priori; the second one contains the other nine networks, which have no publicly acknowledged ground-truth community structures.
On these networks, we ran our proposed method to detect community structures and compared the results with those of 5 popular community detection algorithms, namely, FastQ, WalkTrap, LPA, Attractor, and IsoFdp, which have already been introduced in Section 2. Since LPA is a nondeterministic algorithm, we ran it on each network 10 times and took the average of the evaluation metrics as its result on that network. For our proposed method, NSA, the threshold was set empirically, with one value for the dolphin social network and another for the other networks in the experiments. The details of how to set the optimal value of the threshold will be discussed in Section 5.
4.2. Evaluation Metrics
Two indexes, NMI (Normalized Mutual Information) and modularity, are adopted as the metrics for evaluating the quality of the detected community structure in this paper. The NMI between the ground-truth community structure and the extracted one is the mutual information between the two partitions, normalized by their entropies.
NMI is an information-theoretic metric that measures how closely the detected community structure agrees with the ground truth. Therefore, it can be used only on networks whose ground-truth community structure is already known. Its value lies in the range $[0, 1]$; larger is better.
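As an illustration of how such a score can be computed from two label assignments, the following is a minimal sketch. It assumes one common normalization, $\mathrm{NMI} = 2I(A,B)/(H(A)+H(B))$, which may differ in form from the expression used in this paper; the function names are hypothetical, and returning 1 when both partitions consist of a single community is a convention chosen here.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a partition given as a list of labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(labels_a, labels_b):
    """Mutual information between two partitions of the same node set."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    return sum(
        (n_ab / n) * log(n_ab * n / (count_a[a] * count_b[b]))
        for (a, b), n_ab in joint.items()
    )


def nmi(labels_a, labels_b):
    """NMI = 2 I(A, B) / (H(A) + H(B)); this normalization is an assumption."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same nodes")
    h_sum = entropy(labels_a) + entropy(labels_b)
    if h_sum == 0.0:
        # Both partitions are a single community: treated as identical here.
        return 1.0
    return 2.0 * mutual_information(labels_a, labels_b) / h_sum
```

Under this definition, relabeling the communities does not change the score, and a detected partition that lumps all nodes into one giant community scores 0 against any non-trivial ground truth.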
Another metric widely used to evaluate the performance of community detection methods is modularity, which is defined as follows:

$$Q = \sum_{i=1}^{k} \left( e_{ii} - a_i^{2} \right), \tag{9}$$

where $e_{ii}$ is the diagonal element of a matrix $e$, whose element $e_{ij}$ is the fraction of all edges in the network that connect nodes in community $i$ to nodes in community $j$; $k$ is the number of communities in the community structure; and $a_i$ is the fraction of edges associated with nodes in community $i$.
The first term on the right-hand side of (9) is the fraction of edges within communities; the second term is the expected value of the same fraction in a random graph that has the same nodes and degree distribution as the original network but whose edges are placed at random. The smaller the difference between the two terms, the closer the network is to a random graph and the weaker the community structure. Conversely, the larger the difference, the further the network departs from a random graph and the stronger the community structure. In other words, modularity measures the quality of a community structure by how far the detected result deviates from a random network; its effective value lies between 0 and 1, and higher is better.
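The two terms can be made concrete with a short sketch that evaluates the standard Newman-Girvan form $Q = \sum_i (e_{ii} - a_i^2)$ on an undirected, unweighted edge list. The function name and the input format are assumptions made for illustration; self-loops and edge weights are not handled.

```python
from collections import defaultdict


def modularity(edges, membership):
    """Q = sum_i (e_ii - a_i^2) for an undirected, unweighted edge list.

    `edges` is a list of (u, v) pairs; `membership` maps node -> community.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph without edges")
    internal = defaultdict(int)    # edges with both ends in community i
    degree_sum = defaultdict(int)  # edge ends attached to community i
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    # e_ii = internal / m, a_i = degree_sum / (2 m)
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )
```

For two triangles joined by a single edge and split into the two triangles, each community has $e_{ii} = 3/7$ and $a_i = 1/2$, giving $Q = 5/14$; placing all nodes in one community gives $Q = 0$.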
4.3. Synthetic Networks
We carried out experiments on four groups of artificial networks to test the performance of the proposed method. As mentioned above, all four types of artificial networks are synthesized using the LFR benchmark generator. Besides the parameters listed in Table 1, another critical parameter of this generator is the mixing parameter, $\mu$, which sets, for each node, the fraction of its edges that connect to nodes in other communities. The smaller the value of $\mu$, the clearer the community structure. In particular, $\mu = 0.5$ is a transition point: above it, a node has more edges leaving its community than staying inside it, and the communities tend to become obscure.
In our experiments, we varied the value of $\mu$ from 0.1 to 0.8 in increments of 0.1 for each group of LFR networks. To reduce the effect of chance, we generated 10 networks for each value of $\mu$ while keeping the other parameters fixed. Since the community structures are embedded in these synthetic networks by construction, we use NMI as the metric to evaluate our proposed method and the comparison algorithms. We ran our proposed method and the comparison algorithms on these networks one by one and used the average NMI over the 10 networks of each setting as the resulting metric. The results obtained by our proposal and the comparison algorithms on the small networks with small and big communities are illustrated in Figures 2(a) and 2(b), respectively; the results on the larger networks with small and big communities are presented in Figures 3(a) and 3(b), respectively.
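The protocol above amounts to a grid of 4 groups, 8 values of $\mu$, and 10 replicates. The sketch below only enumerates that grid and averages a score over replicates; it does not call any network generator. The group labels (`"1000s"`, `"5000b"`, and so on), the dictionary keys, and the function names are hypothetical; only the node counts, community-size bounds, $\mu$ range, and replicate count come from the text.

```python
from itertools import product

# Network scales and community-size bounds quoted in Section 4.1.
# The group labels are illustrative, not the authors' names.
GROUPS = {
    "1000s": {"n": 1000, "min_community": 10, "max_community": 50},
    "1000b": {"n": 1000, "min_community": 20, "max_community": 100},
    "5000s": {"n": 5000, "min_community": 10, "max_community": 50},
    "5000b": {"n": 5000, "min_community": 20, "max_community": 100},
}
MU_VALUES = [round(0.1 * i, 1) for i in range(1, 9)]  # 0.1, 0.2, ..., 0.8
REPLICATES = 10


def experiment_grid():
    """Yield one configuration per (group, mu, replicate) combination."""
    for (name, params), mu, rep in product(
        GROUPS.items(), MU_VALUES, range(REPLICATES)
    ):
        yield {"group": name, "mu": mu, "replicate": rep, **params}


def average_by_setting(scores):
    """Average scores over replicates; `scores` holds (group, mu, value) triples."""
    totals = {}
    for group, mu, value in scores:
        total, count = totals.get((group, mu), (0.0, 0))
        totals[(group, mu)] = (total + value, count + 1)
    return {key: total / count for key, (total, count) in totals.items()}
```

Each configuration would be passed to a benchmark generator and then to every algorithm under comparison; `average_by_setting` collapses the per-network NMI values into one point per curve.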
In Figures 2(a) and 2(b), Fast tends to introduce mistakes regardless of whether the communities in the networks are well separated or obscure. As mentioned previously, Fast is a typical modularity-optimization algorithm; it aims only at results with larger modularity rather than high accuracy. In our experiments, none of its results is satisfactory. Even on networks with a low mixing parameter, it failed to identify the exact communities, and at the lower values of $\mu$ its performance is the worst among the comparison algorithms. At the higher values of $\mu$, its results are better only than those of LPA. LPA performed as well as the other comparison algorithms at low $\mu$, but its performance dropped dramatically as $\mu$ increased, and at the highest mixing levels it could not detect effective communities at all. This is probably due to its label-update mechanism: when community boundaries become obscure, nodes tend to adopt incorrect labels, which leads to trivial results in which even all nodes may be labeled as members of one giant community. The proposed method, NSA, attained $\mathrm{NMI} = 1$ on all networks with well-separated communities, meaning that the detected partitions match the ground-truth community structures perfectly. At moderate $\mu$, NSA also obtained results as good as those of WalkTrap, Attractor, and IsoFdp. As $\mu$ grows further, the quality of the detected community structures slips for all three of those algorithms and for the proposed method. In this regime, our proposal outperforms Attractor on networks with larger communities, and at the highest mixing levels the proposed method performs best.
In Figures 3(a) and 3(b), the results are broadly similar to those in Figure 2, but they differ in some respects. In Figure 3(a), our proposed method performed the best on almost all networks. For certain values of $\mu$ in Figure 2, the NMI of the results extracted by our proposed method is lower than those of WalkTrap and IsoFdp; in Figure 3, however, the proposed method performed better than IsoFdp. These results suggest that the performance of the comparison algorithms is not stable across different networks, whereas our proposed method can steadily extract high-quality community structures from networks with different characteristics. This is also evident from the fact that all the curves of the proposed method in these figures decline more slowly than the others. Moreover, comparing the proposal's own curves across these figures shows that our proposed method tends to perform better on larger networks with small communities; therefore, it overcomes the resolution-limit problem to some extent.
4.4. Real-World Networks
We also carried out experiments on 13 real-world networks to further test the effectiveness and efficiency of our proposed method. As mentioned in Section 4.1, these networks fall into two categories: those whose ground-truth community structure is known a priori and those without a publicly acknowledged ground truth.
Networks with Ground-Truth Community Structure. This category includes the first 4 networks listed in Table 2. Since their ground-truth community structure is known, we measure the quality of the community structures identified by the proposed method and the comparison algorithms in terms of both NMI and modularity. The values of the two metrics obtained by the proposed method and the comparison algorithms are recorded in Table 3. These networks are relatively small, which makes it easy to visualize the detected results. Below, we analyze the results extracted by the proposed method from each of these networks.
The Karate Club Network. This network depicts the friendships among members of a karate club; it contains 34 nodes and 78 edges. It was compiled by Wayne W. Zachary, who observed the karate club for 3 years. During Zachary's study, the club split into two factions because of a dispute that arose between the administrator and the instructor. Corresponding to the two factions, the partition into two communities is usually taken as the ground truth, which is shown in Figure 4(a). The result detected by our proposed method is presented in Figure 4(b).
From Figure 4, we can see that our proposed method detected 3 rather than 2 communities in the network. The detected result therefore deviates from the ground truth to some degree, but it is consistent with the conclusion drawn from the experiments on synthetic networks that our proposed method tends to find small communities, which helps it overcome the resolution-limit problem. Moreover, in terms of the evaluation metrics, the modularity of the detected result is the largest among all the algorithms compared. Although our proposed method is not based on modularity optimization, it tends to obtain a community structure with as large a modularity as possible: where its modularity is not the largest, it is the second largest, with only a small gap to the largest. These findings are also borne out on the following networks.
Lusseau’s Dolphin Social Network. This network describes the interactions of a group of dolphins living in Doubtful Sound, New Zealand. It consists of 62 nodes and 159 edges, which represent individual dolphins and the observed co-occurrences of pairs of dolphins, respectively. The network is generally partitioned into 4 groups as the ground-truth community structure, which is shown in Figure 5(a). Figure 5(b) shows the community structure uncovered by our proposed method.
In Figure 5, our proposed method detected communities from this network with a high degree of success: it also identified 4 communities, the vast majority of nodes are assigned to the correct communities, and the result closely approaches the ground-truth community structure. Quantitatively, both the NMI and the modularity of the result detected by the proposed method are the largest among all the algorithms compared, which means that the community structure identified by the proposed method is better than those of the comparison algorithms.
Risk Map Network. This network is the world political map used in the popular game Risk (https://en.wikipedia.org/wiki/Risk_(game)), which involves 42 countries or territories on 6 continents. Accordingly, the network has 42 nodes and 83 edges connecting adjacent countries or territories, and the 6 continents form the 6 ground-truth communities, as illustrated in Figure 6(a). Feeding this network into the proposed method, we obtained the community structure shown in Figure 6(b).
Comparing the detected result with the ground-truth community structure, the community containing nodes ‘18’ and ‘23’ in the ground truth is split into two small communities in Figure 6(b), owing to the tendency of the proposed method to find small communities. In addition, nodes ‘26’, ‘33’, and ‘34’ are assigned to communities different from their ground-truth ones. However, nodes ‘12’, ‘16’, ‘26’, ‘33’, and ‘34’ are special in this network: the edges linking them to other communities are no fewer, and sometimes more, than those within the communities to which they belong. Therefore, if we ignore what these nodes actually represent and judge qualitatively on topology alone, the community structure extracted by our proposed method is more reasonable than the ground truth: more of the edges attached to these three nodes lie within their communities than in the ground truth, so these three nodes are more tightly connected to the other nodes of their communities in Figure 6(b). Quantitatively, the values of both metrics for our proposed method are second only to those of Fast and equal to those of WalkTrap. These results also confirm that our proposed method provides an acceptable solution to the problem of community detection.
Scientists Collaboration Network. This is the largest connected component of a network describing the coauthorship relations among scientists working at the Santa Fe Institute, New Mexico. Nodes in this network represent scientists; an edge connects two scientists who have coauthored at least one paper. The network contains 118 nodes and 197 edges in total. The nodes can be divided into 6 groups as the ground-truth communities according to the specialties of the scientists, as presented in Figure 7(a). Taking this network as the input to the proposed method, we obtained the community structure illustrated in Figure 7(b).
The proposed method revealed 8 communities in this network; that is, two additional communities are detected in Figure 7(b). These two communities are relatively independent components, especially the community containing node ‘1’, which has many more inner edges than outer edges. In other words, nodes in these two communities are connected more tightly to one another than to the remainder of the network. Therefore, separating them from the rest of the network and treating them as independent communities is also reasonable. In terms of the evaluation metrics, the NMI obtained by the proposed method is the largest, which suggests that the result detected by our proposal is the closest to the ground-truth community structure; its modularity is not the largest, but it is second only to that of Fast. These results also show that our proposed method can extract high-quality community structures from networks.
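The comparison of inner and outer edges used in this argument can be checked directly from an edge list. The following is a minimal sketch; the function name and the input format are assumptions for illustration, not part of the proposed method.

```python
def inner_outer_edges(edges, community):
    """Count edges inside a community and edges crossing its boundary.

    `edges` is a list of (u, v) pairs of an undirected graph;
    `community` is any iterable of node ids.
    """
    members = set(community)
    inner = outer = 0
    for u, v in edges:
        in_u, in_v = u in members, v in members
        if in_u and in_v:
            inner += 1  # both endpoints inside the community
        elif in_u or in_v:
            outer += 1  # exactly one endpoint inside
    return inner, outer
```

A community whose inner count clearly exceeds its outer count is more tightly connected internally than to the remainder of the network, which is the property appealed to above.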
Networks without Ground-Truth Community Structure. This category contains the last 9 real-world networks listed in Table 2. Because no ground-truth community structure is available for these networks, we evaluate the quality of the extracted community structures using modularity only. The modularity values obtained by the proposed method and the comparison algorithms are recorded in Table 4. To illustrate them intuitively, we also plot them as a bar chart in Figure 8.
On these networks, our proposed method achieved the largest modularity on 8 of them. On the remaining one, ColiNeta, it obtained the second largest modularity. Although Fast is based on modularity optimization, it obtained the largest modularity only on ColiNeta. WalkTrap is based on random walks, so its time complexity is relatively high; it could not produce effective results on the Amazon and DBLP networks because of their large scale. LPA and Attractor can extract community structures from all of these networks, but the quality of their results is not satisfactory. IsoFdp can be applied only to connected networks, so it cannot run on ColiNeta, NetScience, and YeastL, which are disconnected; it also cannot detect the community structure of Amazon and DBLP effectively because of their large scale. These comparisons show that our proposed method can steadily, effectively, and efficiently provide promising solutions to the problem of community detection in networks from a wide range of applications, and that it outperforms the comparison algorithms significantly.
5. Parameter Setting
In the second phase of the proposed method, we introduce a threshold on the community metric to identify the preliminary communities that need to be merged. As mentioned earlier, we calculate the community metric for every preliminary community in the merge procedure; if its value is below the threshold, the corresponding community is marked for merging.
Therefore, the threshold acts as a parameter of our proposed method, and its setting influences the quality of the resulting community structure. Qualitatively, the larger or the sparser the network is, the smaller the threshold should be, in accordance with the definitions of community sparsity, community scale, and the community metric. To determine the optimal threshold, we conducted a group of experiments exploring the relationship between the threshold and the quality of the resulting community structure on the first four networks listed in Table 2, namely, the karate club network, the dolphin social network, the Risk map network, and the scientists collaboration network. The quality of the resulting community structure is measured in terms of modularity. We vary the threshold from 0 to 1.0 in increments of 0.005; for each value, we run our proposed method on these networks and observe how the modularity changes with the threshold.
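The sweep described above can be sketched as a simple grid search. In the sketch below, `detect` and `score` are hypothetical callables standing in for the proposed method run at a given threshold and for the modularity computation, respectively; neither is the authors' implementation, and ties are resolved in favor of the smallest threshold as an arbitrary choice.

```python
def sweep_threshold(detect, score, step=0.005, upper=1.0):
    """Evaluate `score(detect(t))` for t = 0, step, ..., upper.

    Returns (best_threshold, best_score, curve), where `curve` is the list
    of (threshold, score) pairs in increasing threshold order.
    """
    n_steps = int(round(upper / step))
    curve = []
    for i in range(n_steps + 1):
        # Build each threshold from an integer index to avoid float drift.
        threshold = round(i * step, 6)
        curve.append((threshold, score(detect(threshold))))
    # max() keeps the first maximum, i.e. the smallest best threshold.
    best_threshold, best_score = max(curve, key=lambda pair: pair[1])
    return best_threshold, best_score, curve
```

With the default arguments this evaluates 201 threshold values per network; plotting `curve` for each network would give one panel of a figure like Figure 9.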
The observed results are illustrated in Figure 9, in which we plot only part of the threshold range, because the largest modularities on all four networks are obtained within that part. Our proposed method reaches its largest modularity at one threshold value on the dolphin social network and at another on the other three networks. Therefore, we adopt the corresponding values for those four networks and set the threshold empirically for the other networks in the experiments. In Figure 9, the largest modularity is obtained at a threshold close to 0.1, and an interval around that value covers the optimal threshold on each network. Therefore, we empirically suggest that the threshold be adjusted adaptively around 0.1 according to the size and sparsity of the networks involved in real-world applications.
Figure 9: Modularity of the resulting community structure versus the threshold on (a) the karate club network, (b) the dolphin social network, (c) the Risk map network, and (d) the scientists collaboration network.
6. Conclusion
In this paper, we presented a novel method for detecting communities in networks. It is a local method based on node similarity and avoids the high time consumption of global methods. First, we construct the preliminary community structure by repeatedly selecting the unassigned node with the largest degree and placing it according to the community assignment of its most similar neighbor: if that neighbor has not been assigned to any community yet, we create a new community for the selected node and the neighbor; if the neighbor already belongs to a community, we insert the selected node into that community as well. At the end of this process, we obtain a series of preliminary communities. However, some of them might be too small or too sparse, leading to a low-quality result. Therefore, we merge some of the preliminary communities to obtain the final community structure. To do so, we also proposed indexes that take both the size and the sparsity of communities into account to determine which communities should be merged.
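The first phase summarized above can be sketched in a few lines. The sketch assumes a Jaccard similarity on closed neighborhoods as a stand-in for the node similarity defined earlier in the paper, breaks ties by node order, and gives isolated nodes their own community; all three choices, like the function names, are assumptions made for illustration and not the authors' implementation. The merging phase is not shown.

```python
def jaccard(adj, u, v):
    """Jaccard similarity of closed neighborhoods (a stand-in measure)."""
    nu, nv = adj[u] | {u}, adj[v] | {v}
    return len(nu & nv) / len(nu | nv)


def preliminary_communities(adj, similarity=jaccard):
    """Phase one: assign every node to a preliminary community.

    `adj` maps each node to the set of its neighbors (undirected graph).
    Returns a dict mapping node -> community id.
    """
    community = {}
    next_id = 0
    # Visit nodes by decreasing degree; ties are broken by node order.
    for v in sorted(adj, key=lambda x: (-len(adj[x]), x)):
        if v in community:
            continue  # already placed as someone's most similar neighbor
        if not adj[v]:
            # Isolated node: it has no neighbor to follow.
            community[v] = next_id
            next_id += 1
            continue
        # Most similar neighbor; sorted() makes tie-breaking deterministic.
        best = max(sorted(adj[v]), key=lambda u: similarity(adj, v, u))
        if best not in community:
            # Open a new community for the selected node and its neighbor.
            community[best] = next_id
            next_id += 1
        community[v] = community[best]
    return community
```

On two triangles joined by a single edge, this sketch returns the two triangles as the preliminary communities, since each bridge endpoint is more similar to its triangle partners than to the node across the bridge.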
To test the performance of the proposed method, we performed extensive experiments on four groups of synthetic networks and 13 real-world networks and compared the detected community structures with those extracted by the comparison algorithms in terms of NMI and modularity. The comparison demonstrates that our proposed method can extract high-quality community structures from networks drawn from various applications and that nodes in the extracted communities are more tightly connected. The proposed method overcomes the resolution-limit problem to some extent and outperforms the competitors.
Data Availability
We conducted experiments on artificial networks and real-world datasets. The artificial networks are synthesized using the LFR benchmark network generator, which is freely available at https://sites.google.com/site/santofortunato/. The parameters used to synthesize the artificial networks are listed in Table 1. The real-world data supporting this study are from previously reported studies, which are cited in Table 2. Most of the real-world datasets can also be downloaded from http://www-personal.umich.edu/~mejn/netdata/ and https://snap.stanford.edu/data/index.html. The ColiNeta dataset was provided by Jeong et al. We constructed the Risk Map network manually according to the literature.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was partially supported by the National Natural Science Foundation of China (Grant ID: 61602225).
References
- J. Kleinberg and S. Lawrence, “Network analysis: The structure of the web,” Science, vol. 294, no. 5548, pp. 1849-1850, 2001.
- P. Chen and S. Redner, “Community structure of the physical review citation network,” Journal of Informetrics, vol. 4, no. 3, pp. 278–290, 2010.
- M. E. J. Newman, “Modularity and community structure in networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 103, no. 23, pp. 8577–8582, 2006.
- E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabási, “Hierarchical organization of modularity in metabolic networks,” Science, vol. 297, no. 5586, pp. 1551–1555, 2002.
- R. Guimerà and L. A. N. Amaral, “Functional cartography of complex metabolic networks,” Nature, vol. 433, no. 7028, pp. 895–900, 2005.
- M. Girvan and M. E. J. Newman, “Community structure in social and biological networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 12, pp. 7821–7826, 2002.
- M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 2, Article ID 026113, 2004.
- P. M. Gleiser and L. Danon, “Community structure in jazz,” Advances in Complex Systems (ACS), vol. 6, no. 4, pp. 565–573, 2003.
- Y. van Gennip, B. Hunter, R. Ahn et al., “Community detection using spectral clustering on sparse geosocial data,” SIAM Journal on Applied Mathematics, vol. 73, no. 1, pp. 67–83, 2013.
- M. E. J. Newman, “Finding community structure in networks using the eigenvectors of matrices,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 74, no. 3, Article ID 036104, 19 pages, 2006.
- S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3–5, pp. 75–174, 2010.
- S. Fortunato and D. Hric, “Community detection in networks: a user guide,” Physics Reports, vol. 659, pp. 1–44, 2016.
- B. W. Kernighan and S. Lin, “An efficient heuristic procedure for partitioning graphs,” Bell Labs Technical Journal, vol. 49, no. 1, pp. 291–307, 1970.
- W. W. Zachary, “An information flow model for conflict and fission in small groups,” Journal of Anthropological Research, vol. 33, no. 4, pp. 452–473, 1977.
- D. Lusseau, “The emergent properties of a dolphin social network,” in Proceedings of the Royal Society of London B: Biological Sciences, vol. 270, supplement 2, pp. S186–S188, 2003.
- K. Steinhaeuser and N. V. Chawla, “Identifying and evaluating community structure in complex networks,” Pattern Recognition Letters, vol. 31, no. 5, pp. 413–421, 2010.
- M. E. J. Newman, “The structure and function of complex networks,” SIAM Review, vol. 45, no. 2, pp. 167–256, 2003.
- H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabási, “The large-scale organization of metabolic networks,” Nature, vol. 407, no. 6804, pp. 651–654, 2000.
- R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, and A. Arenas, “Self-similar community structure in a network of human interactions,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 68, no. 6, Article ID 065103, 2003.
- R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, “Network motifs: simple building blocks of complex networks,” Science, vol. 298, no. 5594, pp. 824–827, 2002.
- M. Boguñá, R. Pastor-Satorras, A. Díaz-Guilera, and A. Arenas, “Models of social networks based on social distance attachment,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 5, Article ID 056122, 2004.
- J. Yang and J. Leskovec, “Defining and evaluating network communities based on ground-truth,” Knowledge and Information Systems, vol. 42, no. 1, pp. 181–213, 2015.
- M. E. J. Newman, “Fast algorithm for detecting community structure in networks,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 6, Article ID 066133, 2004.
- A. Clauset, M. E. J. Newman, and C. Moore, “Finding community structure in very large networks,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 6, Article ID 066111, 2004.
- F. Dabaghi Zarandi and M. Kuchaki Rafsanjani, “Community detection in complex networks using structural similarity,” Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 882–891, 2018.
- V. D. Blondel, J. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast unfolding of communities in large networks,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, Article ID P10008, 2008.
- L. Waltman and N. J. Van Eck, “A smart local moving algorithm for large-scale modularity-based community detection,” The European Physical Journal B, vol. 86, no. 11, article 471, pp. 1–14, 2013.
- U. N. Raghavan, R. Albert, and S. Kumara, “Near linear time algorithm to detect community structures in large-scale networks,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 76, no. 3, Article ID 036106, 2007.
- M. J. Barber and J. W. Clark, “Detecting network communities by propagating labels under constraints,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 80, no. 2, Article ID 026129, 2009.
- J. Hou Chin and K. Ratnavelu, “A semi-synchronous label propagation algorithm with constraints for community detection in complex networks,” Scientific Reports, vol. 7, Article ID 45836, 2017.
- J. Ding, X. He, J. Yuan, Y. Chen, and B. Jiang, “Community detection by propagating the label of center,” Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 675–686, 2018.
- A. Rodriguez and A. Laio, “Clustering by fast search and find of density peaks,” Science, vol. 344, no. 6191, pp. 1492–1496, 2014.
- X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, “SCAN: A structural clustering algorithm for networks,” in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’07), pp. 824–833, ACM, New York, NY, USA, August 2007.
- M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD’96), pp. 226–231, AAAI Press, 1996.
- H. Shiokawa, Y. Fujiwara, and M. Onizuka, “Scan++: Efficient algorithm for finding clusters, hubs and outliers on large-scale graphs,” in Proceedings of the 3rd Workshop on Spatio-Temporal Database Management, STDBM 2006, Co-located with the 32nd International Conference on Very Large Data Bases, VLDB 2006, pp. 1178–1189, Republic of Korea, September 2006.
- T. You, H.-M. Cheng, Y.-Z. Ning, B.-C. Shia, and Z.-Y. Zhang, “Community detection in complex networks using density-based clustering algorithm and manifold learning,” Physica A: Statistical Mechanics and its Applications, vol. 464, pp. 221–230, 2016.
- X. Wang, G. Liu, J. Li, and J. P. Nees, “Locating structural centers: A density-based clustering method for community detection,” PLoS ONE, vol. 12, no. 1, Article ID e0169355, 2017.
- P. Pons and M. Latapy, “Computing communities in large networks using random walks,” in International symposium on computer and information sciences, pp. 284–293, 2005.
- S. A. Tabrizi, A. Shakery, M. Asadpour, M. Abbasi, and M. A. Tavallaie, “Personalized PageRank clustering: a graph clustering algorithm based on random walks,” Physica A: Statistical Mechanics and its Applications, vol. 392, no. 22, pp. 5772–5785, 2013.
- Y. Su, B. Wang, and X. Zhang, “A seed-expanding method based on random walks for community detection in networks with ambiguous community structures,” Scientific Reports, vol. 7, Article ID 41830, 2017.
- J. Shao, Z. Han, Q. Yang, and T. Zhou, “Community detection based on distance dynamics,” in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1075–1084, ACM, Australia, August 2015.
- H.-L. Sun, E. Ch'ng, X. Yong, J. M. Garibaldi, S. See, and D.-B. Chen, “A fast community detection method in bipartite networks by distance dynamics,” Physica A: Statistical Mechanics and its Applications, vol. 496, pp. 108–120, 2018.
- A. A. Amini, A. Chen, P. J. Bickel, and E. Levina, “Pseudo-likelihood methods for community detection in large sparse networks,” The Annals of Statistics, vol. 41, no. 4, pp. 2097–2122, 2013.
- S. C. de Lange, M. A. de Reus, and M. P. van den Heuvel, “The laplacian spectrum of neural networks,” Frontiers in Computational Neuroscience, vol. 7, no. 189, 2014.
- F. Krzakala, C. Moore, E. Mossel et al., “Spectral redemption in clustering sparse networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 110, no. 52, pp. 20935–20940, 2013.
- P. Shi, K. He, D. Bindel, and J. E. Hopcroft, “Local Lanczos Spectral Approximation for Community Detection,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, vol. 10534 of Lecture Notes in Computer Science, pp. 651–667, Springer International Publishing, 2017.
- R. Tackx, F. Tarissan, and J. Guillaume, “ComSim: a bipartite community detection algorithm using cycle and node’s similarity,” in International Workshop on Complex Networks and their Applications, vol. 689 of Studies in Computational Intelligence, pp. 278–289, Springer International Publishing, 2017.
- T. Wang, L. Yin, and X. Wang, “A community detection method based on local similarity and degree clustering information,” Physica A: Statistical Mechanics and its Applications, vol. 490, pp. 1344–1354, 2018.
- K. R. Žalik, “Maximal neighbor similarity reveals real communities in networks,” Scientific Reports, vol. 5, Article ID 18374, 2015.
- A. Lancichinetti, S. Fortunato, and F. Radicchi, “Benchmark graphs for testing community detection algorithms,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 78, no. 4, Article ID 046110, 2008.
- A. L. N. Fred and A. K. Jain, “Robust data clustering,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-128–II-133, Madison, WI, USA, 2003.
Copyright © 2019 Jianjun Cheng et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.