Complexity

Volume 2019, Article ID 8292485, 16 pages

https://doi.org/10.1155/2019/8292485

## Neighbor Similarity Based Agglomerative Method for Community Detection in Networks

^{1}School of Information Science and Engineering, Lanzhou University, China

^{2}Department of Electronic Information Engineering, Lanzhou Vocational Technical College, China

Correspondence should be addressed to Jianjun Cheng; chengjianjun@lzu.edu.cn

Received 27 December 2018; Revised 15 March 2019; Accepted 11 April 2019; Published 2 May 2019

Academic Editor: Guang Li

Copyright © 2019 Jianjun Cheng et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Community structures can reveal the organization and functional properties of complex networks; hence, detecting communities in networks is of great importance. With the surge of large networks in recent years, the efficiency of community detection has become critical, and many local methods have emerged. In this paper, we propose a node similarity based community detection method, which is also a local one and consists of two phases. In the first phase, we take the node with the largest degree out of the network as the exemplar of the first community and insert its most similar neighbor into that community as well. Then, the node with the largest degree among the remaining nodes is selected; if its most similar neighbor has not been assigned to any community yet, we create a new community for the selected node and its most similar neighbor. Otherwise, we insert the selected node into the community to which its most similar neighbor belongs. This procedure is repeated until every node in the network is assigned to a community, which yields a series of preliminary communities. However, some of them might be too small or too sparse; the edges connecting them to the outside might outnumber the edges inside them. Keeping them as the final communities would lead to a low-quality community structure. Therefore, in the second phase, we merge some of them efficiently to improve the quality of the resulting community structure. To test the performance of the proposed method, extensive experiments are performed on both artificial and real-world networks. The results show that the proposed method detects high-quality community structures steadily and efficiently and significantly outperforms the comparison algorithms.

#### 1. Introduction

Many real-world systems can be abstracted as complex networks, in which nodes represent entities in the systems, and edges correspond to interactions between the entities. One of the most significant characteristics observed in these complex networks is the “community structure,” which means that nodes in the network can be divided into groups naturally; nodes in the same group are connected densely, and connections across different groups are relatively sparse; each of the node groups is a so-called “community.”

Communities are usually related to functional modules of networks. For instance, communities can be groups of web pages in WWW networks [1] or scientific papers in citation networks [2] sharing the same topics, books with the same political orientation copurchased from the online bookseller Amazon.com [3], or pathways and complexes in metabolic networks or protein-protein interaction networks [4, 5]. In social networks, communities often correspond to real social groupings with the same interests or professional occupations, e.g., groups of scientists classified by specialty in coauthorship collaboration networks [6, 7], groups of jazz musicians divided by location and race [8], or affiliations of gang members in the policing area of Hollenbeck, Los Angeles [9]. In addition, some studies have indicated that networks can present quite different properties when considered at the community level rather than from the perspective of the entire network or the individual node [10].

Therefore, analyzing the community structures in networks can facilitate the recognition of network characteristics and support further predictions about the functional properties of the corresponding systems. That is to say, community detection provides an effective means of studying the functional properties of networks through their structural characteristics, which is valuable in practical applications. Consequently, a multitude of methods [11, 12] have been proposed for detecting communities in complex networks; we review some related literature in Section 2.

In this paper, we also propose a community detection method, which is based on node similarity and consists of two phases. The first phase repeatedly selects the node with the largest degree among the nodes not yet assigned and, depending on the community affiliation of its most similar neighbor, either takes it as the exemplar of a new community or inserts it into the community to which that neighbor belongs. At the end of this phase, we get a series of communities. However, these are only preliminary communities; some of them might be too small or too sparse, and the edges connecting them to the outside might far outnumber the edges inside them. Accepting them as the final communities would lead to a low-quality community structure. Therefore, the second phase merges some of the preliminary communities to improve the quality of the resulting community structure.
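The first phase described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the pseudocode of Algorithm 1: the function names `jaccard` and `preliminary_communities` are hypothetical, the graph is assumed to be an undirected adjacency dictionary, and the Jaccard index over closed neighborhoods is an assumed stand-in for the node similarity, which this part of the paper has not yet defined. Ties are broken by node representation purely to make the sketch deterministic.

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # undirected adjacency sets


def jaccard(graph: Graph, u: Hashable, v: Hashable) -> float:
    """Assumed similarity: Jaccard index of the closed neighborhoods of u and v."""
    nu = graph[u] | {u}
    nv = graph[v] | {v}
    return len(nu & nv) / len(nu | nv)


def preliminary_communities(graph: Graph) -> Dict[Hashable, int]:
    """Assign every node to a preliminary community (first phase only)."""
    community: Dict[Hashable, int] = {}
    next_id = 0
    # Degrees are fixed, so scanning nodes by decreasing degree and skipping
    # assigned ones equals "pick the largest-degree unassigned node" each time.
    for u in sorted(graph, key=lambda n: (-len(graph[n]), repr(n))):
        if u in community:
            continue
        if not graph[u]:
            # An isolated node has no neighbor; it forms its own community.
            community[u] = next_id
            next_id += 1
            continue
        # Most similar neighbor (first one in repr order on ties).
        best = max(sorted(graph[u], key=repr), key=lambda v: jaccard(graph, u, v))
        if best in community:
            # Join the community of the most similar neighbor.
            community[u] = community[best]
        else:
            # Open a new community for the node and its most similar neighbor.
            community[u] = community[best] = next_id
            next_id += 1
    return community


# Two triangles joined by the bridge 2-3.
example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
           3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(preliminary_communities(example))
```

On this toy graph each triangle ends up in its own preliminary community; the second (merging) phase is not part of the sketch.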

The main contributions of this work can be summarized as follows.

(i) We propose a node similarity based local algorithm, abbreviated as *NSA*, for community detection, which is a two-phase method. The first phase obtains the preliminary communities, and the second phase merges some of the preliminary communities to improve the quality of the resulting community structure.

(ii) We propose an index, *community metric*, to measure the sparsity or smallness of a community. In the second phase, we use the index as a criterion to determine which preliminary communities need to be merged.

(iii) Extensive experiments on artificial networks and real-world networks are carried out to test the performance of the proposed method. The experimental results show that the proposed method is consistently promising in both detection quality and running time and outperforms its competitors.

The remainder of this paper is organized as follows. Section 2 reviews some literature about community detection. The details of the proposed algorithm are elaborated in Section 3. The experimental results and analysis on both artificial networks and real-world networks are presented in Section 4. In Section 5, we discuss how to set the optimal value for a parameter introduced in our proposed method, and the paper ends with a conclusion in Section 6.

#### 2. Related Work

A great many community detection methods have been proposed in the last decade; these methods try to explore communities in networks from various perspectives. The graph theory-based methods treat community detection as the traditional task of graph partitioning and divide the network into subnetworks. Kernighan-Lin [13] is a representative method of this kind, which first partitions the network into two arbitrary subnetworks and then repeatedly swaps nodes between the two subnetworks to maximize a predefined gain function.

The hierarchical clustering methods reveal multilevel community structures in divisive, agglomerative, or hybrid ways. For example, the GN algorithm [6, 7] detects communities by repeatedly removing the edge with the largest betweenness from the network; its output is a dendrogram representing the nested hierarchy of possible community structures of the network, and the level corresponding to the largest value of a measure, modularity [7], is taken as the final result. The Fast*Q* algorithm [23, 24] first takes each node in the network as a community and then repeatedly merges two communities into one. Its output is also a dendrogram depicting the merge procedure of possible community hierarchies. Zarandi et al. [25] randomly removed some edges with low similarity to obtain disconnected components as the primary communities and then merged some of them to get the resulting community structure.

The modularity optimization-based algorithms detect community structures by exploiting the physical meaning of modularity (the higher the value of modularity, the better the community structure) and taking modularity as the objective to optimize. For instance, in order to maximize the modularity of the community structure, Fast*Q* [23, 24] joins, in each iteration, the pair of communities whose merge leads to the largest modularity increment. The Louvain algorithm [26] uses a node-moving strategy to extract a community structure with optimized modularity; it also begins with an initial partition in which each node is a community, and then, for each node, it evaluates the modularity gain of moving the node into the community of each of its neighbors and moves the node into the community with the largest positive modularity gain. The SLM (Smart Local Moving) algorithm [27] searches for possibilities of increasing modularity both by splitting communities and by moving sets of nodes from one community to another.
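As a point of reference for the objective these algorithms optimize, the following sketch computes modularity for an undirected, unweighted graph using the standard Newman-Girvan definition, $Q = \sum_{c} \left[ \frac{L_c}{m} - \left( \frac{d_c}{2m} \right)^2 \right]$, where $m$ is the number of edges, $L_c$ the number of edges inside community $c$, and $d_c$ the total degree of its nodes. The formula is not spelled out in this section, so its use here, the function name `modularity`, and the adjacency-dictionary representation are assumptions made for illustration.

```python
from typing import Dict, Hashable, Set


def modularity(graph: Dict[Hashable, Set[Hashable]],
               community: Dict[Hashable, Hashable]) -> float:
    """Newman-Girvan modularity of a partition of an undirected, unweighted graph."""
    two_m = sum(len(nbrs) for nbrs in graph.values())  # every edge is counted twice
    if two_m == 0:
        return 0.0
    internal: Dict[Hashable, int] = {}  # internal edge endpoints per community (= 2 * L_c)
    degree: Dict[Hashable, int] = {}    # total degree per community (= d_c)
    for u, nbrs in graph.items():
        c = community[u]
        degree[c] = degree.get(c, 0) + len(nbrs)
        internal[c] = internal.get(c, 0) + sum(1 for v in nbrs if community[v] == c)
    return sum(internal[c] / two_m - (degree[c] / two_m) ** 2 for c in degree)


# Two triangles joined by the bridge 2-3, split into the two triangles.
example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
           3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(modularity(example, split))  # 5/14, roughly 0.357
```

Placing all six nodes in a single community gives a modularity of 0, which is why a merge is accepted by greedy schemes only when it increases the value.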

LPA (Label Propagation Algorithm) [28] uses an information propagation mechanism to detect communities in networks. Every node in the network is initialized with a unique label, and all nodes are arranged in a random order; then each node, in that order, updates its label to the one occurring most frequently among its neighbors. The label update procedure ends when every node has a label that is the majority label among its neighbors, and nodes with the same label form a community. Owing to its simplicity and high efficiency, several variants have been derived from LPA. Barber et al. [29] proposed a series of algorithms that propagate labels under some constraints; LPAm is the best known, which tries to maximize the modularity during the label propagation procedure. Chin et al. [30] first identified the main communities using the number of mutual neighboring nodes; then they attached some independent constraints to the basic LPA and used the constrained LPA to add the remaining nodes to communities; finally, they used a node-moving strategy like the one employed in Louvain to refine the quality of the resulting community structure. Ding et al. [31] produced a modified version of LPA, which exploits the idea of density peak clustering [32] and the Chebyshev inequality to choose community centers from the network and then propagates the labels of the selected centers to the whole network with a proposed multistrategy label propagation scheme.
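The basic label update rule described above can be sketched as follows. This is a minimal illustration rather than the reference implementation of [28]: the function name `label_propagation`, the `seed` and `max_sweeps` parameters, reshuffling the node order on every sweep, random tie-breaking, and keeping a node's current label when it is already among the most frequent ones are all assumptions chosen to make the sketch short and guaranteed to stop.

```python
import random
from collections import Counter
from typing import Dict, Hashable, Optional, Set


def label_propagation(graph: Dict[Hashable, Set[Hashable]],
                      seed: Optional[int] = None,
                      max_sweeps: int = 100) -> Dict[Hashable, Hashable]:
    """Asynchronous label propagation on an undirected adjacency dictionary."""
    rng = random.Random(seed)
    labels = {u: u for u in graph}  # every node starts with a unique label
    nodes = list(graph)
    for _ in range(max_sweeps):
        rng.shuffle(nodes)  # assumed: a fresh random order per sweep
        changed = False
        for u in nodes:
            if not graph[u]:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[v] for v in graph[u])
            top = max(counts.values())
            candidates = [lab for lab, c in counts.items() if c == top]
            if labels[u] in candidates:
                continue  # already holds a most frequent neighbor label
            labels[u] = rng.choice(candidates)  # random tie-breaking
            changed = True
        if not changed:
            break  # every node holds a most frequent label among its neighbors
    return labels


# Two disconnected triangles: labels cannot cross between components.
example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1},
           3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
print(label_propagation(example, seed=0))
```

Because the outcome depends on the random order and tie-breaking, repeated runs on less clear-cut graphs can yield different partitions, which is one motivation for the constrained variants mentioned above.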

Density-based methods define and utilize the concept of *density* for nodes or communities in networks to uncover community structures. SCAN [33] borrows the idea of the classical density-based clustering algorithm DBSCAN [34] to reveal communities, hubs, and outliers in networks. SCAN++ [35] is a derivative of SCAN; it reduces time consumption by introducing a new data structure and reducing the number of density evaluations in the detection procedure. IsoFdp [36] maps the network nodes as data points into a low-dimensional manifold and then exploits the density peak clustering algorithm [32] to extract the final community structure. The LCCD algorithm [37] also follows the approach of the density peak clustering algorithm [32] to locate the structural centers of networks and then expands communities from the identified centers to the borders using a local search procedure.

Network dynamic-based methods explore community structures by simulating dynamic processes in networks. Random walk is a typical dynamic process carried out in networks; random walk-based methods exploit the tendency of a walker to be trapped inside a community during a short walk, rather than crossing the community border into another community, to detect communities. WalkTrap [38] uses random walks to calculate the probability of going from one node to another during a short walk and then calculates a distance to measure the similarities between nodes and between communities. The PPC algorithm [39] initially considers the network as a single community and recursively partitions each community using node similarities computed from random walks until further partitioning cannot yield a better value of modularity. RWA [40] employs random walks to calculate the probability of a node belonging to a community, and each community is expanded by repeatedly attracting the node most likely to belong to it. In addition, Attractor [41] utilizes distance dynamics to explore communities in networks: node interactions change the distances among nodes, and the distance changes in turn affect the interactions. Under such interplay, members of the same community gradually move together, while nodes in different communities stay far away from each other. BiAttractor [42] extends the concept of distance dynamics and the idea of Attractor to bipartite networks and is used to detect two-mode communities of bipartite networks.

Spectral methods use the eigenspectra of various network-associated matrices to extract communities. For example, Amini et al. [43] found the initial node partitions using a spectral clustering method based on the normalized Laplacian matrix derived from a regularized adjacency matrix; those partitions were then used for fitting a stochastic block model by a pseudolikelihood algorithm to detect the resulting community structure. de Lange et al. [44] identified an integrative community structure in the macroscopic anatomical neural networks of the macaque and cat and in the microscopic network of *C. elegans* by examining the spectra of their normalized Laplacian matrices. Krzakala et al. [45] produced a class of spectral algorithms that detect communities based on the nonbacktracking matrix, which describes a nonbacktracking walk on the directed edges of the network. Shi et al. [46] proposed a spectral community detection method, LLSA, which employs the Lanczos method to obtain an approximation of the eigenvector of the transition matrix with the largest eigenvalue; the elements of this eigenvector approximately indicate the affiliation probability of the corresponding nodes to the communities.

Most of the methods mentioned above are global ones; they often depend on some global information as prior knowledge, such as the number of communities or information about eigenvalues or eigenvectors, which is hard to acquire as the networks involved grow larger and larger. Moreover, most of them are computationally demanding, leading to high time complexity. These limitations prevent them from being applied to large-scale applications. To overcome the deficiencies of the global algorithms, many local methods have been proposed, including some of the aforementioned methods. For example, LPA and most of its variants determine which label a node should adopt according to its neighborhood only; LCCD takes into account both the local density of nodes and the relative distance between nodes to locate the local structural centers and expands communities from the structural centers with a local search procedure; LLSA applies fast heat kernel diffusion to sample a small subnetwork including almost all members of a community, and the eigenvector whose elements indicate the community memberships of nodes is obtained by performing the Lanczos method on the sampled subnetwork.

In addition, the ComSim algorithm [47] identifies cores of communities in bipartite networks by searching for cycles, which are node chains formed by following outgoing links until a node already visited is reached, and then allocates each remaining node to the community that maximizes the similarity between the node and the community. In the BLI algorithm [48], local clustering information and local structural similarity are employed to establish the primary community structure; then small-scale communities whose sizes are smaller than a given threshold are absorbed by larger ones. *k*SIM [49] is also a local method that works in a bottom-up way. At the beginning, each node is taken as a community; then the preliminary communities are formed by identifying, for each node, the neighbor community to which one of its most similar neighbors with the lowest degree belongs and assigning the node to that community. In this procedure, the common neighbor index is employed as the similarity measure for each pair of nodes.

Compared with the global methods, these local methods show good performance in large-scale networks. Inspired by this, we also propose a local method to extract communities from networks. The proposed method is based on node similarity and is termed *NSA* (Node Similarity based Algorithm) for short; it comprises two phases: the first phase aims at constructing the preliminary community structure, and the second phase tries to improve the quality of the final result by merging some small or sparse communities. To do so, we also propose a measure, *community metric*, to evaluate the sparsity or smallness of communities. The details of the proposed method are elaborated in the next section.

#### 3. The Proposed Method

##### 3.1. The Framework of the Proposed Method

The framework of the proposed method is outlined by the pseudocode listed in Algorithm 1.