Complexity

Volume 2019, Article ID 9764341, 20 pages

https://doi.org/10.1155/2019/9764341

## Effectively Detecting Communities by Adjusting Initial Structure via Cores

^{1}School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China

^{2}College of Educational Science and Technology, Northwest Minzu University, Lanzhou 730030, China

Correspondence should be addressed to Mei Chen; mei.chen.lzjtu@hotmail.com

Received 1 May 2019; Revised 29 September 2019; Accepted 4 October 2019; Published 3 November 2019

Academic Editor: Ludovico Minati

Copyright © 2019 Mei Chen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Community detection helps us understand useful information in real-world networks by uncovering their natural structures. In this paper, we propose a simple but effective community detection algorithm, called ACC, which needs no heuristic search and has near-linear time complexity. ACC defines a novel similarity which differs from most common similarity definitions by considering not only the common neighbors of two adjacent nodes but also their mutual exclusive degree. In the first step, ACC groups nodes together according to this similarity to obtain the initial community structure. In the second step, ACC adjusts the initial community structure according to cores discovered through a new local density, which is defined as the influence of a node on its neighbors. The third step expands communities to yield the final community structure. To comprehensively demonstrate the performance of ACC, we compare it with seven representative state-of-the-art community detection algorithms on small-size networks with ground-truth community structures and on relatively large networks without ground-truth community structures. Experimental results show that ACC outperforms the seven compared algorithms in most cases.

#### 1. Introduction

There are many different kinds of networks in the real world, such as biological networks, ecological networks, and social networks. Many real-world networks have intrinsic community structures. A community in a network is usually described as a group of nodes with dense connections inside the group but sparse connections to other communities. Community detection can help us to discover the nontrivial structures and topological features of complex networks.

So far, various algorithms have been developed to resolve the community detection problem in complex networks, based either on the widely used concept of similarity measures or on certain criteria. The well-known FastQ [1] and spectral clustering [2, 3], as well as two newer algorithms, Attractor [4] and ISCD [5], can be regarded as similarity-based methods [6]. FastQ introduced a fast greedy strategy for modularity maximization. It effectively corresponds to a simple nearest-neighbor agglomerative clustering of the network in which the adhesion coefficient is used as the similarity measure. However, if the links between two communities connect low-degree nodes, this approach fails to detect the communities. Spectral clustering models view community detection as graph partitioning and apply spectral analysis to minimize the cut. However, spectral clustering algorithms are not efficient because their running time is cubic in the size of the input dataset, which in practice often limits them to small networks. Another limitation of spectral clustering algorithms is that they depend on an input parameter *k*, which is hard to determine. Attractor converts the edge between two nodes into a distance according to the Jaccard similarity and calculates the gravitation between them, but the accuracy of this method needs to be improved. ISCD uses a common-neighbor similarity between two nodes to propose a new evaluation criterion of internal-link compactness, but it also depends on an input parameter *k*, which is hard to determine.

The criterion-based community detection methods have proposed several criteria, such as modularity [7], betweenness [8], and minimum-cut [9, 10]. However, evaluating the quality of a partition according to a given criterion usually requires an optimization method to detect communities. For example, the most widely used method, modularity maximization, detects communities by searching over the possible divisions of a network for one or more divisions with particularly high modularity. Since exhaustively searching over all possible divisions is usually intractable, and the optimal version of community detection by modularity maximization or minimum-cut has been proved NP-complete [11], such methods are expensive. Besides the high time cost, maximum modularity often does not yield the optimal partition in real networks [12, 13].

Core-based methods [14–16] also play an important role in community detection. Algorithms for finding the cores are efficient and amenable to parallelization [17]. However, these methods lack the ability to distinguish influential nodes [18].

Different from heuristic methods based on a certain criterion, we propose a simple, yet effective and fast, similarity-based community detection algorithm named ACC (Adjusting initial Community structure via Cores). ACC can overcome limitations of many existing community detection methods. In particular, some of the existing methods are heuristics with heavy computation in practice, whereas ACC is not.

The most remarkable characteristic of ACC is that it combines the advantages of similarity-based methods and core-based methods. Based on the naive fundamental assumption that nodes in the same community are more similar to each other than to those in other communities, ACC first groups nodes together according to a novel similarity to obtain the initial community structure. In particular, the similarity proposed in ACC considers not only the number of common neighbors but also the exclusion degree between two adjacent nodes. Then, based on another expression of the assumption, namely that connections among nodes in the same community are dense while connections between nodes in different communities are sparse, ACC regards the node with the maximum local density in an initial community as the core of that community and adjusts the initial communities according to the cores.

Our key contributions are as follows. (1) We present a novel similarity which considers not only the common neighbors of two adjacent nodes but also their mutual exclusive degree. The threshold of the similarity is easy to set. (2) We define the influence of a node on its neighbors as its local density, which makes discovering the core of a community much easier. (3) We propose a new community detection algorithm, ACC, with near-linear time complexity, which can find high-quality communities in different networks.

The remaining sections are organized as follows. We review related work in Section 2. Then, we give the preliminaries of ACC in Section 3 and elaborate the ACC algorithm in Section 4. After that, in Section 5, we present the performance of ACC on networks both with and without ground-truth community structures, which shows how effective our method is compared with state-of-the-art methods. Finally, Section 6 concludes the work.

#### 2. Related Works

To date, many different methods have been proposed for community identification. We only report some popular methods.

Spectral clustering [3, 19] has become one of the most popular clustering algorithms and is currently used in a wide range of applications. It treats the graph as a similarity matrix and solves a data clustering problem in which each cluster is a community. The algorithm takes the top-*k* eigenvectors of the eigensystem to form an $n \times k$ matrix, where *n* is the number of nodes, so that every row of the matrix holds the attributes of the corresponding node. It then groups these *n* nodes with *k*-means to obtain the final community structure. Unfortunately, the running time of spectral clustering algorithms might be cubic in the size of the input dataset, which makes this approach prohibitive on very large datasets.

The FastQ [1] algorithm is an agglomerative method that hierarchically merges nodes into bigger and bigger communities using the modularity criterion. It starts from a state in which each node is the sole member of its own community. Then, it repeatedly merges the pair of communities whose merge results in the smallest decrease in modularity, and it ends in a state in which all nodes in the network belong to a single community. The result of FastQ can be represented as a dendrogram, each level of which indicates a different community structure. FastQ selects the community structure with the highest modularity value as the final result. However, since FastQ is based on modularity optimization, the communities it detects do not always correspond to the ground-truth communities in real applications.
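As a small illustration of the quantity that FastQ optimizes, the sketch below computes the modularity of a given partition of an undirected, unweighted graph. It uses the standard Newman-Girvan form, Q = Σ_c [l_c / m − (d_c / 2m)²]; the function name `modularity` and the edge-list input format are assumptions made for this example and are not taken from the FastQ implementation.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping every node to its community label.
    """
    edges = list(edges)
    m = len(edges)
    internal = defaultdict(int)    # l_c: edges with both endpoints in c
    degree_sum = defaultdict(int)  # d_c: sum of the degrees of nodes in c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {node: "a" for node in range(6)}
q_split = modularity(edges, split)    # one community per triangle
q_merged = modularity(edges, merged)  # everything in one community
```

On this toy graph, the partition into two triangles has a higher modularity than the single merged community, which is why a modularity-based dendrogram cut would stop before the final merge.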

The Newman2006 [20] algorithm first constructs a modularity matrix for the network. Then, it places the nodes corresponding to the positive elements of the leading eigenvector in one community and the other nodes in the opposite community. Newman2006 ends the process when no positive eigenvalue exists.

The Louvain method [21] initializes each node as a single community and shifts the community label of each node according to the modularity gain until the labels converge. Then, it treats each community as a node and merges some of them again according to the modularity gain.

The Infohiermap algorithm [22] is a flow-based community detection method. It reveals hierarchical organization by multilevel compression of random walks on networks.

The label propagation (LPA) approach [23, 24] is based on the simple idea that a node should be assigned to the community to which most of its neighbors belong. LPA has attracted wide attention. In addition to its linear time complexity, it needs neither an objective function nor the number of communities to be defined in advance. However, since LPA simply updates the label of a node according to the plurality vote of its neighbors, it suffers from randomness caused by the random update order, which affects the accuracy and stability of the detected communities.
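The plurality-vote update described above can be sketched as follows. This is a simplified illustration, not the reference LPA implementation: to keep the example reproducible it visits nodes in sorted order and breaks ties by keeping the current label or else taking the smallest label, whereas LPA as usually described uses a random order and random tie-breaking. The function name and the adjacency-dictionary input are assumptions.

```python
def label_propagation(adj, max_rounds=100):
    """Asynchronous label propagation on a dict {node: set of neighbors}.

    Deterministic simplification: fixed visiting order and fixed
    tie-breaking, so repeated runs give the same labels.
    """
    labels = {node: node for node in adj}  # every node starts alone
    for _ in range(max_rounds):
        changed = False
        for node in sorted(adj):
            if not adj[node]:
                continue  # isolated nodes keep their own label
            counts = {}
            for neighbor in adj[node]:
                label = labels[neighbor]
                counts[label] = counts.get(label, 0) + 1
            best = max(counts.values())
            candidates = [lab for lab, c in counts.items() if c == best]
            # Keep the current label if it is already a plurality label.
            if labels[node] in candidates:
                new_label = labels[node]
            else:
                new_label = min(candidates)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break  # labels have converged
    return labels


# Two disconnected triangles.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1},
       3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
labels = label_propagation(adj)
```

With random order and random tie-breaking, different runs on less clear-cut graphs can converge to different label assignments, which is the instability noted above.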

In recent years, considerable effort has been put into improving the effectiveness and efficiency of community detection [25, 26]. PPC (Personalized PageRank Clustering) [27] employs the inherent cluster-exploratory property of random walks, combined with modularity, to reveal the clusters of a given graph. PPC has linear time and space complexity. Attractor [4] is a community detection algorithm based on distance dynamics. Attractor converts the edge between two nodes into a distance according to the Jaccard similarity and calculates the gravitation between them. The gravitation draws the nodes within one community close to each other and pushes the nodes from different communities far apart. MEA_{s}-SN [28] is a multiobjective evolutionary algorithm based on similarity that finds communities in signed networks. The two objectives to be optimized are based on the concepts of positive and negative cluster similarity between two neighboring nodes. MHGNMF [29] takes higher-order information among the nodes into consideration to enhance the clustering performance. SUM [30] is a similarity-based method which detects communities by suspecting the maximum degree nodes.

#### 3. Preliminary

In this section, we first prepare the necessary notions about community detection. Then, we define a novel similarity and a new local density. Lastly, we introduce the community detecting strategy used in ACC algorithm.

##### 3.1. Related Notions

Let $G = (V, E)$ be an undirected graph, where *V* is the set of nodes and *E* is the set of edges. An edge $(u, v) \in E$ indicates a connection between the nodes *u* and *v*. The number of nodes in a graph can be represented as $n = |V|$, and the number of edges can be represented as $m = |E|$.

For a node $u \in V$, the neighbors of node *u* are the members of the set $N(u)$ containing its adjacent nodes, which share a common incident edge with *u*: $N(u) = \{v \in V \mid (u, v) \in E\}$. The degree of *u* is denoted as $d(u)$, $d(u) = |N(u)|$.
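A minimal sketch of these notions in Python, assuming the graph is given as a list of undirected edges and stored as a dictionary that maps each node to the set of its neighbors. The helper name `build_adjacency` is illustrative only.

```python
def build_adjacency(edges):
    """Return {node: set of adjacent nodes} for an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


# Two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = build_adjacency(edges)

neighbors_of_2 = adj[2]        # the neighbor set of node 2
degree_of_2 = len(adj[2])      # the degree of node 2
num_nodes = len(adj)           # number of nodes
num_edges = sum(len(nbrs) for nbrs in adj.values()) // 2  # number of edges
```

Storing neighbors as sets keeps the common-neighbor computations used later down to a set intersection.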

In this paper, nodes with links to two or more communities are defined as borders.

##### 3.2. A New Similarity

In general, if two nodes have a number of common neighbors, we believe that the two nodes are similar. For two adjacent nodes *u* and *v*, their common neighbors are the nodes adjacent to both *u* and *v*, and the number of common neighbors can therefore be used to measure the degree of similarity. Conversely, if *u* and *v* have too many neighbors that they do not share, they are dissimilar. We call this dissimilarity the mutual exclusive degree. Suppose that the degree of node *u* is smaller than that of *v*; the mutual exclusive degree of the two adjacent nodes then counts the neighbors of *u* which are not neighbors of *v*, and it is computed from the smaller of the two degrees and the number of common neighbors.

Given a graph $G = (V, E)$, the novel structural *similarity* between two adjacent nodes *u* and *v* is defined from the number of their common neighbors and their mutual exclusive degree, where the neighbors with degree 1 of the smaller-degree node are also taken into account.
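To make the idea concrete, the sketch below combines the two ingredients in one plausible way. The specific form used here (common-neighbor count minus the mutual exclusive degree, with the other endpoint and the degree-1 neighbors of the smaller-degree node left out of the exclusive count) is an assumption made for illustration and should not be read as ACC's exact definition. All function names are hypothetical.

```python
def common_neighbors(adj, u, v):
    """Nodes adjacent to both u and v."""
    return adj[u] & adj[v]


def mutual_exclusive_degree(adj, u, v):
    """ASSUMED form of the mutual exclusive degree of adjacent u and v.

    Counts neighbors of the smaller-degree endpoint that are not shared
    with the other endpoint, ignoring the other endpoint itself and
    ignoring degree-1 neighbors (both choices are assumptions).
    """
    small, large = (u, v) if len(adj[u]) <= len(adj[v]) else (v, u)
    exclusive = adj[small] - adj[large] - {large}
    leaves = {w for w in exclusive if len(adj[w]) == 1}
    return len(exclusive) - len(leaves)


def similarity(adj, u, v):
    """ASSUMED combination: common neighbors minus exclusive degree."""
    return len(common_neighbors(adj, u, v)) - mutual_exclusive_degree(adj, u, v)


# Two triangles joined by the edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
inside = similarity(adj, 0, 1)  # edge inside a triangle
bridge = similarity(adj, 2, 3)  # edge between the triangles
```

Under this assumed form, an edge inside a triangle scores higher than the bridge between the triangles, so a threshold placed between the two values would keep the triangles apart.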

##### 3.3. Local Density and Core

We define the influence of a node on its neighbors as its local density and regard the node with the greatest density in a community as a core. For a given network $G = (V, E)$, let $N(u)$ denote the set of neighbors of a node $u \in V$, and let $|N(u)| = p + q$, where *p* is the number of *u*’s neighbors which have no common neighbors with *u* and *q* is the number of *u*’s neighbors which have common neighbors with *u*. We assume that the influence of *u* on each of its neighbors is 1. Thus, the total influence of *u* on the *p* neighbors is $p$. The influence of *u* on each of the *q* neighbors is weakened by the common neighbors, and the real remaining influence is 1 minus the weakened influence. If we define the weakened influence as the Jaccard similarity, $J(u, v) = |N(u) \cap N(v)| / |N(u) \cup N(v)|$ [31], then the influence of *u* on each of the *q* neighbors *v* is $1 - J(u, v)$. Thus, the total influence of *u* on the *q* neighbors $v_1, \dots, v_q$ is $\sum_{i=1}^{q} (1 - J(u, v_i))$. The influence of *u* on all its neighbors is $p + \sum_{i=1}^{q} (1 - J(u, v_i))$. Therefore, the *local density* of node *u* can be expressed as

$$\rho(u) = p + \sum_{i=1}^{q} \bigl(1 - J(u, v_i)\bigr) = \sum_{v \in N(u)} \bigl(1 - J(u, v)\bigr),$$

where the second equality holds because $J(u, v) = 0$ for each of the *p* neighbors that share no common neighbor with *u*.
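The local density described above, that is, the sum over a node's neighbors of 1 minus the Jaccard similarity, can be sketched as follows. The sketch assumes that neighbor sets exclude the node itself and that the graph is stored as a dictionary of neighbor sets; the function names are illustrative.

```python
def jaccard(adj, u, v):
    """Jaccard similarity of the neighbor sets of u and v."""
    union = adj[u] | adj[v]
    if not union:
        return 0.0
    return len(adj[u] & adj[v]) / len(union)


def local_density(adj, u):
    """Total influence of u on its neighbors.

    Each neighbor contributes 1 minus the Jaccard similarity, so a
    neighbor with no common neighbors contributes exactly 1.
    """
    return sum(1.0 - jaccard(adj, u, v) for v in adj[u])


# Two triangles joined by the edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
density_0 = local_density(adj, 0)  # (1 - 1/3) + (1 - 1/4)
density_2 = local_density(adj, 2)  # (1 - 1/4) + (1 - 1/4) + 1
core = max(adj, key=lambda node: (local_density(adj, node), -node))
```

Because neighbors without common neighbors need no special case (their Jaccard similarity is 0), a single sum over all neighbors covers both the *p* and the *q* neighbors.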

#### 4. The ACC Algorithm

We introduce the ACC algorithm in this section. First, we describe the process of community detection of ACC in detail. Then, the time complexity of ACC is analyzed.

##### 4.1. The ACC Algorithm

We present our ACC algorithm as follows:

*Step 1.* Obtain the initial community structure: given a network $G = (V, E)$, for a node *u* and each of its neighbors *v*, if their similarity satisfies the threshold *ζ*, we put the two nodes into one community. The number of the initial communities discovered in this step is denoted as *k*.

*Step 2.* Adjust the initial community structure according to cores: we select the top-*k* nodes with the highest local density as the cores of communities. ACC assumes that there is at most one core in one community. Thus, if there is more than one core in an initial community, the initial community is broken up into as many communities as it has cores, and each core represents a community. Then, we rebuild new communities by assigning the unlabelled neighbors of each core to that core.

*Step 3.* Expand communities: we assign each remaining unlabelled node to the community to which its highest-density neighbor belongs. If an unlabelled node does not have a highest-density neighbor, it is regarded as an initial community.

*Step 4.* Merge small communities and reassign borders: each small community whose size is smaller than 3 is merged into the community which has the most links to it. Then, we regard nodes having links to two or more different communities as borders and reassign each border to the community which has the most links to the border. Finally, since reassigning the borders could again produce communities whose size is smaller than 3, we merge each such small community into the community which has the most links to it.
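The grouping in Step 1 and the core selection in Step 2 can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: the similarity and density measures are passed in as functions, the threshold test is assumed to be "greater than or equal to *ζ*", and the toy example uses simple stand-ins (shared-neighbor count and degree) rather than ACC's own measures. Splitting communities with several cores, rebuilding them, and Steps 3 and 4 are omitted.

```python
def initial_communities(adj, similarity, zeta):
    """Step 1 sketch: join adjacent nodes whose similarity is >= zeta.

    Uses union-find, so groups are the connected components of the
    graph restricted to edges that pass the threshold (an assumption
    about how pairwise grouping is made transitive).
    """
    parent = {node: node for node in adj}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u in adj:
        for v in adj[u]:
            if u < v and similarity(adj, u, v) >= zeta:
                parent[find(u)] = find(v)

    groups = {}
    for node in adj:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


def select_cores(adj, density, k):
    """Step 2 sketch (selection only): the k highest-density nodes."""
    ranked = sorted(adj, key=lambda node: (-density(adj, node), node))
    return ranked[:k]


# Stand-in measures for the toy example (NOT ACC's definitions).
def shared_neighbors(adj, u, v):
    return len(adj[u] & adj[v])


def degree(adj, u):
    return len(adj[u])


# Two triangles joined by the edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
communities = initial_communities(adj, shared_neighbors, zeta=1)
cores = select_cores(adj, degree, k=len(communities))
```

Passing the measures as parameters keeps the control flow of the two steps separate from the particular similarity and density definitions.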

##### 4.2. Time Complexity Analysis

To obtain the initial communities, the similarity of any two linked nodes in a network is required, and thus the time complexity of Step 1 of ACC is . In Step 2, since ACC needs to get the Jaccard distance and local density, the time complexity is , where *d* is the average degree of nodes in the network. During the process of getting the top-*k* nodes with the highest local densities, the time complexity is . Moreover, to assign the neighbors of cores to the corresponding core, the time complexity is . Step 3 of ACC takes time to assign the unlabelled nodes, where *q* is the number of unlabelled nodes, . To merge the small-size communities into the communities to which they have the most links, the time complexity is . The time complexity of reassigning the borders is . In total, the time complexity is . Since , , , the time complexity is approximately .

#### 5. Experiments and Analysis

In this section, we evaluate our proposed algorithm ACC on real-world networks to demonstrate its benefits.

##### 5.1. Baselines and Benchmarks

###### 5.1.1. Baselines

To evaluate the performance of ACC, we compare it with several representative state-of-the-art community detection algorithms.

*Spectral clustering (SC)* [2] is one of the most popular community detection algorithms and is currently being used in a wide range of applications. It is based on the graph p-Laplacian.

*Newman2006* [20] is a splitting algorithm, which uses the maximum eigenvalue of a matrix and recursively splits a network until the final results are obtained.

*Infohiermap* [22] is a well-known approach, which requires a strong information-theoretic background and reveals multilevel structures in networks.

*FastQ* [1] is a bottom-up algorithm based on modularity optimization, which selects from the dendrogram the result whose modularity *Q* is the maximum.

*LPA* [23] is a fast community detection algorithm with a linear time complexity, which is based on label propagation.

*PPC* [27] is an efficient graph clustering algorithm, which employs random walks to detect communities.

*Attractor* [4] is one of the currently most popular community detection algorithms, which is based on distance dynamics.

ACC is implemented in a Python 2.7 environment. FastQ and LPA are obtained from igraph, which is a Python module, and their running environment is also Python 2.7. The source codes of Spectral clustering, Newman2006, Infohiermap, PPC, and Attractor are provided by their authors.

###### 5.1.2. Benchmarks

We evaluate the performance of the different community detection algorithms on real-world networks, including small-size networks with class information and relatively large networks without class information. The basic statistical information of the networks is listed in Table 1. For all the networks used in this paper, *ζ* of the ACC algorithm is 0, except for the Polbooks network, for which *ζ* is −1.