Abstract

Community structures in complex networks play an important role in the study of network function. Although there are various detection algorithms based on affinity or similarity, their drawbacks are obvious: they perform well on strong communities but poorly on weak ones. Experiments show that community detection algorithms based on a single affinity sometimes do not work well, especially for weak communities. We therefore design a self-adapting switching (SAS) algorithm, in which weak communities are detected by a combination of two affinities. Compared with some state-of-the-art algorithms, our algorithm has competitive accuracy, and its time complexity is near linear. It also provides a new framework for combination algorithms in community detection. Extensive computational simulations on both artificial and real-world networks confirm the potential capability of our algorithm.

1. Introduction

The continuing advance of network science plays a prominent role in deepening our understanding of complex systems in the real world [1–3]. Among others, one salient property commonly observed in many complex networks is the community structure, i.e., the organization of nodes into different groups, with many edges connecting nodes of the same group and comparatively few connections between nodes of different groups [4–7]. For instance, in a scientific citation network, communities are sets of papers on the same topic or in a similar research field [8], while in protein-protein interaction networks, proteins working in the same biological process (or located in the same cellular component) interact with each other. Moreover, the community structure has been shown to have strong impacts on epidemic dynamics [9, 10] and link prediction. Therefore, with the acquisition of real network data, one should pay careful attention to the community structure, which is of value to further investigations of complex networks.

For a deep understanding of the community structure, it is necessary to define what a community is. In general, there are three types of definitions: a local definition, a global definition, and a definition based on vertex similarity [6], including definitions based on modularity and on the topological structure, such as the self-referring definition and the comparative definition [11]. However, few definitions describe the community structure quantitatively. In 2003, Radicchi et al. provided quantitative community definitions in both the strong and the weak sense [12]: denoting by $k_i^{\mathrm{in}}(C)$ and $k_i^{\mathrm{out}}(C)$ the numbers of neighbors of node $i$ inside and outside the subgraph $C$, the subgraph $C$ is a community in the strong sense if

$$k_i^{\mathrm{in}}(C) > k_i^{\mathrm{out}}(C), \quad \forall i \in C,$$

and in the weak sense if

$$\sum_{i \in C} k_i^{\mathrm{in}}(C) > \sum_{i \in C} k_i^{\mathrm{out}}(C).$$

The above quantitative definitions mean that, for all (or most) nodes, the degree inside is larger than the degree outside, where the degree inside is the number of a node's neighbors in the same community and the degree outside is the number of its neighbors in other communities. Thereafter, another quantitative definition was given by Hu et al. [11] as follows: subnetworks (or subgraphs) $C_1, C_2, \ldots, C_m$ are said to be $m$ communities of a network (or graph) $G$ if and only if they satisfy $\bigcup_{k=1}^{m} C_k = G$ and $C_k \cap C_l = \emptyset$ for $k \neq l$, and for any node $i \in C_k$, one has

$$\sum_{j \in C_k} A_{ij} \geq \sum_{j \in C_l} A_{ij}, \quad \forall l \in \{1, 2, \ldots, m\},$$

where $A$ is the adjacency matrix of the graph $G$. Unlike the consideration by Zhan et al. [13], we regard this definition as the generalized definition, since it allows each node's total degree outside to exceed its degree inside and only requires that each node have the largest number of neighbors within its own community. In this paper, we use this definition as our standard for community detection; note that node overlap is not considered, so each node belongs to exactly one community in the detection result.

In order to accurately describe the quantitative relation between the degrees inside and outside of communities, Lancichinetti et al. introduced a mixing parameter $\mu$ for each node $i$ to denote that node $i$ shares a fraction $\mu$ of its links with external nodes and a fraction $1-\mu$ with internal nodes, i.e., $k_i^{\mathrm{out}} = \mu k_i$ and $k_i^{\mathrm{in}} = (1-\mu) k_i$ [14, 15]. In this paper, we consider that the mixing parameter of each node is less than 0.5 in strong communities and, conversely, more than 0.5 in weak communities; both kinds of communities satisfy the definition of Hu et al.
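As a concrete illustration, the strong, weak, and mixing-parameter criteria above can be checked directly from an adjacency structure. The following sketch is ours, not code from the paper, and assumes the network is stored as a dict mapping each node to its set of neighbors:

```python
def degree_split(adj, community):
    """For each node in `community`, count its neighbors inside (k_in)
    and outside (k_out) the community."""
    comm = set(community)
    return {i: (sum(1 for j in adj[i] if j in comm),
                sum(1 for j in adj[i] if j not in comm)) for i in comm}

def is_strong_community(adj, community):
    # Strong sense (Radicchi et al.): every node has k_in > k_out.
    return all(kin > kout for kin, kout in degree_split(adj, community).values())

def is_weak_community(adj, community):
    # Weak sense: the community as a whole has sum(k_in) > sum(k_out).
    split = list(degree_split(adj, community).values())
    return sum(kin for kin, _ in split) > sum(kout for _, kout in split)

def mixing_parameter(adj, i, community):
    # mu_i: fraction of node i's links that leave its community.
    comm = set(community)
    return sum(1 for j in adj[i] if j not in comm) / len(adj[i])
```

For example, in a triangle {0, 1, 2} with a pendant node 3 attached to node 2, the triangle is a strong community and node 2 has mixing parameter 1/3.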

There have been various kinds of algorithms designed for community detection. For example, the Kernighan–Lin algorithm, the spectral bisection method, the k-means clustering method, and the spectral clustering algorithm are traditional algorithms derived from graph theory or statistics. With the development of computers, large-scale computing has become widely available, so it is feasible to increase the computational complexity and the network scale. These advances have enabled researchers to develop many optimization-based algorithms, including greedy algorithms based on modularity [16] and betweenness [4, 17]. Meanwhile, there are algorithms based on dynamical methods [18–23] and on similarity or affinity [24, 25]. However, ignoring the difference between strong and weak communities is a major drawback of some algorithms based on node affinity or similarity, which makes their detection accuracy low for weak communities. Thus, we design a self-adapting switching (SAS) algorithm that uses a single affinity for strong communities and a combination of two affinities for weak ones.

The performance of community detection can be evaluated by two kinds of criteria. One is to compute topology-based metrics, including coverage, conductance, and modularity. The other is to calculate knowledge-driven measurements, such as precision, the Jaccard index, and the normalized mutual information (NMI) [26]. We adopt the NMI index as the evaluation criterion for the performance of the algorithms on some real-world networks, the Lancichinetti–Fortunato–Radicchi (LFR) benchmark networks (heterogeneous networks) [14], the Girvan–Newman (GN) benchmark networks (homogeneous networks) [4], and the nonuniform popularity similarity optimization (nPSO) benchmark networks (heterogeneous networks) [27]. Based on the results, we find that our algorithm has an advantage over some state-of-the-art algorithms and is more suitable for heterogeneous networks with a larger power-law exponent. This paper is organized as follows. In Section 2, we present the principle of our algorithm and discuss its complexity. Tests and results are presented in Section 3. Conclusions are summarized in Section 4.

2. Structural Analysis and Algorithm

In this section, we present an analysis of the community structure, design the affinity-based SAS algorithm for community detection, and finally discuss its complexity.

2.1. The Analysis of Community Structure

Some studies indicate that node degrees in real-world networks generally obey power-law [28, 29] or log-normal [30] distributions, where nodes with large degree are known as hub nodes and have strong degree centrality, as in the network in Figure 1. Although the number of hub nodes in real-world networks is relatively small, their vital role in communities and networks has been repeatedly noted in the literature [13, 31, 32]. The identification of hub nodes is usually considered the starting point for heuristic algorithms. In these algorithms, a single affinity is often insufficient for community detection, especially for weak communities. Therefore, we design a new algorithm that combines two affinities for the detection of weak communities.

As is well known, the ultimate aim of community detection algorithms based on affinity or modularity is to find the global maximum of such indices and to guarantee the minimum number of connections between different communities; both are NP-hard problems. Putting these problems aside, our algorithm is heuristic, and its detection process is based on the affinity between the node being examined and the set of nodes already detected, rather than between two single nodes. Motivated by different affinity indices, i.e., the common neighbors (CN), hub depressed (HD), and hub promoted (HP) indices summarized by Zhou et al. [33], we provide two definitions of the affinity between a node $j$ and a node set $P$ as follows; some important notations are shown in Table 1.

The first affinity between any node $j$ and node set $P$ is defined as

$$A_1(j, P) = |\Gamma(j) \cap P|,$$

where $\Gamma(j)$ denotes the neighbor set of node $j$. The second affinity between any node $j$ and node set $P$ is given by

$$A_2(j, P) = \frac{|\Gamma(j) \cap P|}{k_j},$$

where $k_j$ is the degree of node $j$.

These two affinities have different emphases: the first focuses on the absolute number of common neighbors, while the second is a relative affinity. Our heuristic algorithm starts from a hub node and then detects the other nodes belonging to the same community based on these affinities.
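In code, the two affinities (the number of a node's neighbors inside the detected set, and that number divided by the node's degree) can be sketched as follows; the function names are ours, and the network is again assumed to be a dict of neighbor sets:

```python
def affinity_1(adj, j, P):
    """First affinity: absolute number of node j's neighbors inside set P."""
    return len(adj[j] & set(P))

def affinity_2(adj, j, P):
    """Second affinity: the fraction of node j's degree that falls inside P."""
    return len(adj[j] & set(P)) / len(adj[j])
```

The second affinity is scale-free in the node degree, which is why a fixed threshold such as 0.5 can be applied to nodes of very different degrees.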

Generally, the affinities between nodes in the same community are larger than those between nodes in different communities, but this is sometimes hard to satisfy, especially for weak communities. To illustrate this point, we calculate the first and second affinities between the hub node and its neighbors in LFR benchmark graphs. First, the second affinity between the hub node's neighbors in the same community and in other communities is shown in Figure 2(a). We find that the second affinities of nodes in strong communities are clearly separated, but they are mixed together in weak communities when $\mu > 0.5$. We conduct a similar experiment on the weak communities with the first affinity to observe its discriminating ability. Since the first affinity is an absolute number of common neighbors, we normalize it and only consider its normalized form in Figure 2(b), where $i$ denotes the hub node, $\Gamma(i)$ its neighbor set, and $j$ a neighbor of the hub node:

$$\hat{A}_1(j, \Gamma(i)) = \frac{|\Gamma(j) \cap \Gamma(i)|}{|\Gamma(i)|}.$$

From the statistical results, we find that the second affinity has effective discriminating ability for strong communities. However, it is not enough to detect weak communities and needs to work together with the first affinity. Moreover, the detection method for strong communities is not suitable for weak communities and may produce many communities composed of only a few nodes or even a single node, which serves as the trigger principle of the switching condition in our SAS algorithm. Our algorithm is therefore divided into two parts, which we call SAS-1 and SAS-2 for short. Next, we describe the algorithm and its principle in detail.

2.2. The Algorithm

Here, we introduce the two parts of our algorithm, including their core principles and pseudocode, and then analyze the complexity. Some important notations are also shown in Table 1.

2.2.1. The Strong Community Method SAS-1

In this method, each community, together with its nodes and their edges, is gradually deleted from the network after its detection finishes. We therefore denote the network as $G_s = (V_s, E_s)$ after the $(s-1)$th community has been detected, where $V_s$ and $E_s$ are the sets of remaining nodes and edges, respectively. To describe the algorithm generally, we use the detection of the $s$th community $C_s$ as an example.

The first step: at step $t = 1$, the method selects the node $i$ whose degree is maximal in $G_s$ as the hub node. At this step, the hub node $i$ and those of its neighbors satisfying $A_2(j, P) \geq 0.5$ are the detected nodes belonging to $C_s$, where the node set $P$ consists of the node $i$ and its neighbors $\Gamma(i)$.

The second step: at step $t = 2$, the method searches the nodes in $C_s$, and a neighbor $j$ of these nodes is added to the community if and only if it satisfies the condition

$$A_2(j, C_s) \geq 0.5, \qquad (7)$$

where the value 0.5 is confirmed by the definition of the strong community; the community $C_s$ is then updated.

The $t$th step: similarly, when $t > 2$, in order to reduce the complexity, the method only searches the nodes added at step $t-1$ and examines their undetected neighbors. A neighbor $j$ is added to $C_s$ if and only if it satisfies the condition

$$A_2(j, C_s) \geq 0.5, \qquad (8)$$

and the community is updated.

The detection process of the $s$th community finishes when there are no more nodes satisfying condition (8).
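The three steps above amount to a breadth-first expansion from the hub node under the 0.5 threshold, followed by removal of the detected community from the network. The following is a minimal sketch of our reading of SAS-1 (hub ties are broken arbitrarily, and nodes isolated by deletions end up as singleton communities):

```python
def sas1(adj):
    """Sketch of the strong-community stage: grow a community from the
    highest-degree hub using the second affinity with threshold 0.5,
    then delete the community and repeat on the remaining network."""
    adj = {i: set(ns) for i, ns in adj.items()}  # work on a copy
    communities = []
    while adj:
        hub = max(adj, key=lambda i: len(adj[i]))
        seed = adj[hub] | {hub}  # hub plus its neighborhood
        comm = {hub} | {j for j in adj[hub]
                        if len(adj[j] & seed) / len(adj[j]) >= 0.5}
        frontier = comm - {hub}
        while frontier:
            # only examine undetected neighbors of the last wave of nodes
            candidates = {j for u in frontier for j in adj[u]} - comm
            added = {j for j in candidates
                     if len(adj[j] & comm) / len(adj[j]) >= 0.5}
            comm |= added
            frontier = added
        communities.append(comm)
        # delete the detected community and its edges from the network
        for i in comm:
            del adj[i]
        for ns in adj.values():
            ns -= comm
    return communities
```

On two triangles joined by a single edge, this sketch recovers the two triangles as communities.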

2.2.2. The Weak Community Method SAS-2

From the results in Figure 2, we can infer that the method SAS-1 may detect many communities composed of only a few nodes or even a single node when the network has weak communities. Hence, the algorithm needs a self-adapting switching condition to capture this phenomenon and make it switch from SAS-1 to SAS-2. Our method is to calculate the average scale of the communities detected so far, and the switching condition between the two methods is given by

$$\frac{1}{c}\sum_{s=1}^{c} |C_s| < \beta, \qquad (9)$$

where $\beta$ is a threshold of the same order as the average degree $\langle k \rangle$ and $c$ is the current number of communities having been detected. Actually, few neighbors of the hub node in a weak community can satisfy condition (7), so the average scale of the communities detected by SAS-1 stays below this threshold, and the parameter $p$, derived from a hypothesis test, is a small incidence rate.

Once SAS-1 triggers the switching condition, the algorithm switches to SAS-2 and re-detects the network. Unlike SAS-1, this method does not delete any nodes or edges from the network, because the recognition of weak communities depends on the whole structure of the network. In the following, we again introduce this method by taking the detection process of the $s$th community $C_s$ as an example.

The first step: at step $t = 1$, the method selects as the starting node the node $i$ with the maximal degree among the nodes not belonging to the previously detected communities $C_1, \ldots, C_{s-1}$. Obviously, we have $C_s = \{i\}$ after confirming the starting node.

The second step: at step $t = 2$, the method chooses a neighbor $j$ of the hub node not belonging to other communities as a member of the community $C_s$ if and only if it satisfies the following condition:

$$A_2(j, C_s) \geq \gamma, \qquad (10)$$

where $\gamma$ is a threshold based on the average value of $A_2$ in Figure 2; $C_s$ is then updated.

The $t$th step: when $t > 2$, similar to the method SAS-1, this method only searches the nodes added at step $t-1$ and examines their undetected neighbors. A neighbor $j$ is added to $C_s$ if and only if it satisfies the double-affinity condition (11), which requires both the first affinity $A_1(j, C_s)$ and the second affinity $A_2(j, C_s)$ to be sufficiently large.

The termination condition of the $s$th community is to separate the undetected nodes, which have lower affinity, from the detected nodes, which have higher affinity with each other. We assume that the detection of the community $C_s$ stops when, at step $t+1$, no undetected neighbor $j$ of the nodes in $C_s$ has an affinity to $C_s$ exceeding the threshold set by the parameter $\rho \in (0, 1)$, which is used to cut the community out of the network.
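The switching test can be expressed compactly. Here is a sketch under our assumption that the threshold $\beta$ defaults to the average degree; the incidence-rate parameter $p$ of condition (9) is omitted for simplicity:

```python
def should_switch(communities, avg_degree, beta=None):
    """Trigger SAS-2 when the average size of the communities detected
    so far by SAS-1 falls below the threshold beta (here defaulting to
    the average degree <k>)."""
    if not communities:
        return False
    if beta is None:
        beta = avg_degree
    return sum(len(c) for c in communities) / len(communities) < beta
```

When SAS-1 keeps emitting singletons or near-singletons in a network whose average degree is, say, 4, the average community size drops below the threshold and the switch fires.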

The pseudocode of the algorithm is shown in Algorithm 1, and its process structure is shown in Figure 3. Last, we analyze the complexity of the algorithm. In the method SAS-1, the detection process is conducted for every community, and we consider the average number of steps per community to be of the same order as the average community size $\bar{n}$. The complexity of searching and filtering each node by condition (8) scales as $O(\langle k \rangle)$. As communities are detected, the number of remaining nodes is reduced, so the extreme complexity is about $O(c \bar{n} \langle k \rangle)$, where $c$ is the number of communities having been detected and $\langle k \rangle$ is the average degree. In the method SAS-2, the detection process differs only in the detection condition (11), so its complexity is also about $O(c \bar{n} \langle k \rangle)$. In summary, the complexity of the SAS algorithm is $O(c \bar{n} \langle k \rangle)$. Evidently, $\bar{n}$ is of the same order as the average community size $n/c$, so $O(c \bar{n} \langle k \rangle) \approx O(n \langle k \rangle)$. Since $\langle k \rangle = 2m/n$, the complexity reduces to $O(m)$, where $m$ is the number of network edges.

Algorithm 1: community detection with self-adapting switching.
(1) input: the adjacency matrix of the network.
(2) while undetected nodes remain:
(3)  Select a node as the hub node such that its degree is maximal in the current network.
(4)  Gradually search the suitable nodes by condition (8).
(5)  Repeat step 4 until no nodes satisfy the condition in step 4, then update the network and start the next detection.
(6) if SAS-1 fails, i.e., the switching condition (9) is triggered:
(7)  Select a node as the hub node such that its degree is maximal in the network.
(8)  Gradually search the suitable nodes by the double-affinity condition (11).
(9)  Repeat step 8 until no nodes satisfy condition (11), then start the next detection.
(10) Adjust the results based on the community definition.

3. Results

In this section, experiments are performed on both real-world networks (the karate club network, the dolphin network, the football team network, and the political books network) and synthetic networks (the LFR, GN, and nPSO benchmark graphs). First, we use the GN benchmark to estimate a suitable parameter range and analyze parameter sensitivity. The SAS algorithm relies on three parameters, $\beta$, $\gamma$, and $\rho$, where the choice of $\beta$ is related to the average degree $\langle k \rangle$, and the parameters $\rho$ and $\gamma$ can be freely selected in the range $(0, 1)$. The GN benchmark has 128 nodes and a relatively concentrated degree distribution, so it is suitable for parameter sensitivity analysis. Because its average degree is 16, we set $\beta$ to the average degree by default and mainly study the sensitivity of the parameters $\rho$ and $\gamma$. Based on the results in Figure 4, we find that the results are insensitive to the parameters $\beta$ and $\gamma$. However, changes of the parameter $\rho$ have an obvious influence on the results when $\rho$ is small. Fortunately, once $\rho$ is large enough, all the detected results are stable and do not fluctuate over a wide range.

In practice, the parameter $\gamma$ should be close to 1 to ensure the accuracy of the initially detected nodes, and the threshold should be appropriately increased as the clustering coefficient increases. Then, we evaluate the advantages and disadvantages of our algorithm against other state-of-the-art algorithms: Infomap, LPA, Louvain, Walktrap, Fast greedy, EM, and Blondel. The performance comparison on real-world networks, shown in Table 2 and Figure 5, confirms its potential capability. It is worth mentioning that some community divisions differ slightly from the ground truth. A possible reason is that a more detailed division of communities leads to an increase in the number of communities, but the results still satisfy the quantitative definition used in this article and have a good accuracy rate.

In order to examine the accuracy of our algorithm more deeply, we use three benchmark networks, LFR, GN, and nPSO, to study how the algorithm performance, measured by NMI [26], changes as the community structure weakens.
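For reference, the NMI between a detected partition and the ground truth can be computed from the standard definition $\mathrm{NMI} = 2I(A;B)/(H(A)+H(B))$; the following is the textbook formula implemented from scratch, not code from the paper:

```python
from collections import Counter
from math import log

def nmi(labels_a, labels_b):
    """Normalized mutual information between two node partitions,
    each given as a list of community labels (one entry per node)."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    cab = Counter(zip(labels_a, labels_b))  # joint label counts

    def entropy(counts):
        return -sum(c / n * log(c / n) for c in counts.values())

    # Mutual information I(A;B) from the joint distribution.
    mi = sum(c / n * log(n * c / (ca[a] * cb[b])) for (a, b), c in cab.items())
    denom = entropy(ca) + entropy(cb)
    return 2 * mi / denom if denom > 0 else 1.0
```

Identical partitions (up to relabeling) score 1, and independent partitions score 0, which is why NMI is a convenient yardstick as the mixing parameter grows.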

3.1. The LFR Benchmark

In this part, the LFR networks have two different scales, 1000 and 5000 nodes, as presented in Figure 6. For each scale, we consider two different community sizes, indicated by the letters S and B, where S stands for "small" communities of about 10 to 50 nodes and B stands for "big" communities of about 20 to 100 nodes [15]. In Figure 6, our algorithm is tested on the four types of networks by NMI as the mixing parameter $\mu$ varies. For both strong and weak communities, the performance of our algorithm is better than that of some of the algorithms in Table 2.

3.2. The GN Benchmark

Beyond that, we test the SAS algorithm on the GN benchmark network, with the results shown in Figure 7, where each point is averaged over 100 networks of the same kind. The performance of the SAS algorithm is as good as that of the other algorithms in Table 2. It is well known that the LFR benchmark is a kind of heterogeneous network whose degree distribution follows a power law. For the GN benchmark, however, the degree distribution is approximately normal and the role of hub nodes is weakened, so the heterogeneity of the network structure may affect the accuracy of our algorithm. Next, we use the nPSO benchmark to analyze the performance of our algorithm further.

3.3. The nPSO Benchmark

Recently, a new network generative model named the nonuniform popularity similarity optimization (nPSO) model has been proposed for the evaluation of community detection and link prediction; it creates synthetic networks with controlled parameters: network scale, average degree, community number, power-law exponent, and temperature. It allows one to tune the mixing property of the networks through the temperature. In particular, this model simulates how random geometric graphs grow in hyperbolic space, generating realistic networks with clustering, small-worldness, scale-freeness, and rich-clubness.

In this part, we generate nPSO hyperbolic networks with communities, controlled by the following parameters: the network size, half of the average degree, the temperature (inversely related to the clustering coefficient), the number of communities, and the power-law degree distribution exponent. We also compare the SAS algorithm with state-of-the-art community detection algorithms. From the results in Figures 8–10, we find that the performance of the SAS algorithm is not sensitive to changes of the network size, the average degree, or the temperature. However, it performs well in heterogeneous networks with power-law exponent 3 and worse with exponent 2. This indicates that our algorithm may be more suitable for heterogeneous networks with a larger power-law exponent. Combining all the detection results, we can see that the SAS algorithm has some advantages over other state-of-the-art algorithms, and its accuracy ranks high among those algorithms on some benchmarks. The near-linear time complexity is another advantage of our algorithm.

4. Conclusions

In this paper, the performance of the SAS algorithm is compared with some state-of-the-art algorithms on real-world networks as well as three benchmark graphs traditionally used in the existing literature. First, the experimental results show that it is feasible to use different affinities for strong and weak communities. Our algorithm improves the accuracy on weak communities compared with some algorithms based on a single affinity and has the same reliability as some state-of-the-art algorithms. Second, some heuristic algorithms based on hub nodes may need to analyze the network degree distribution or clustering coefficient in advance to improve their accuracy. The weakening of the role of hub nodes may be the reason why our algorithm performs poorly on the nPSO benchmark with power-law exponent 2 but performs well on the LFR benchmark and on the nPSO benchmark with power-law exponent 3. This is also an important direction for improving the algorithm in the future. Last, our definitions of affinity are based on the concept of common neighbours. Recently, a new paradigm has emerged that defines affinities using not only the number of common neighbours but also the links that occur between the common neighbours. The union of the common neighbours and their cross-links is named the local community, and redefining common-neighbour affinities in terms of local communities has been demonstrated to significantly boost link prediction in both monopartite and bipartite networks. If the SAS algorithm adopted affinities based on the local-community paradigm instead of the simple common-neighbours paradigm, this innovation might make our algorithm more suitable for heterogeneous networks with a smaller power-law exponent.

Data Availability

Previously reported data were used to support this study and are available from Mark Newman's network data page (http://www-personal.umich.edu/∼mejn/netdata/); the LFR benchmark procedure is available at https://github.com/eXascaleInfolab/LFR-Benchmark_UndirWeightOvp#changelog. The original authors have already made the data freely available. These prior studies (and datasets) are cited at the relevant places within the text as references [4, 15].

Conflicts of Interest

The authors declare that no conflicts of interest exist in the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant nos. 61873154, 11331009, and 11601294 and BNU Interdisciplinary Research Foundation for the First-Year Doctoral Candidates (no. BNUXKJC1806).