Complexity

Volume 2017, Article ID 3719428, 10 pages

https://doi.org/10.1155/2017/3719428

## Social Network Community Detection Using Agglomerative Spectral Clustering

Department of Computer Engineering, Inha University, Incheon, Republic of Korea

Correspondence should be addressed to Sanggil Kang; sgkang@inha.ac.kr

Received 18 April 2017; Revised 24 July 2017; Accepted 23 August 2017; Published 7 November 2017

Academic Editor: Katarzyna Musial

Copyright © 2017 Ulzii-Utas Narantsatsralt and Sanggil Kang. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Community detection has become an increasingly popular tool for analyzing and researching complex networks. Many methods have been proposed for accurate community detection, and one of them is spectral clustering. Most spectral clustering algorithms have been evaluated on artificial networks, and the accuracy of community detection remains unsatisfactory. Therefore, this paper proposes an agglomerative spectral clustering method with conductance and edge weights. In this method, the most similar nodes are agglomerated based on the eigenvector space and edge weights. In addition, the conductance is used to identify densely connected clusters during agglomeration. Experiments show that the proposed method improves on the performance of related works and is efficient for real-life complex networks.

#### 1. Introduction

In recent years, community detection in networks has become one of the main topics in fields such as biology, computer science, physics, and applied mathematics [1–3]. In a network $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges (relations) between the nodes, a community is a group of nodes whose edges are tightly connected with each other and whose nodes show similar characteristics. For example, in a social network, people in a community show similar interest in a trend, for example, buying the same products in online marketing. In a biological network, proteins in a community perform similar specific functions, and, on the World Wide Web, sites clustered together address the same topic in their web pages. Scientists in many fields have made significant contributions to detecting communities with a number of different methods, such as graph partitioning [4, 5], hierarchical clustering [6, 7], and spectral clustering [8, 9].

In graph partitioning, a network is divided into clusters in such a way that the number of edges connecting the clusters is minimized; that is, the edges of a cluster are denser inside than outside (a property measured by conductance [10]). In addition, the number of clusters and their minimum size need to be specified in advance. Girvan and Newman [3] introduced a popular graph partitioning algorithm that uses modularity to cluster communities, but the method is slower than other community detection algorithms [11, 12]. Later, Djidjev [13] proposed a computationally faster version of the algorithm. However, the definition of a good partition is not always definite, and it can fail in some cases [5]. Therefore, graph partitioning still needs further refinement. A number of methods have been proposed to solve this problem. One of the most famous was introduced by Newman [5], who used a spectral clustering algorithm with modularity maximization, in which the modularity function is applied only to possible clusters of the network; the results proved that spectral clustering with such a quality measure can efficiently cluster communities.
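To make the notion of conductance mentioned above concrete, the following sketch (the toy graph and helper function are illustrative, not from the paper) computes the conductance of a node set as the number of boundary edges divided by the smaller of the two side volumes:

```python
# Illustrative sketch: conductance of a node set S,
# phi(S) = cut(S) / min(vol(S), vol(V \ S)), on a toy undirected graph
# made of two triangles joined by a single bridge edge.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]

def conductance(edges, S):
    S = set(S)
    cut = sum(1 for u, v in edges if (u in S) != (v in S))   # boundary edges
    vol_S = sum((u in S) + (v in S) for u, v in edges)       # degree sum in S
    vol_rest = 2 * len(edges) - vol_S
    return cut / min(vol_S, vol_rest)

print(conductance(edges, {0, 1, 2}))  # one crossing edge -> low conductance
```

A low value signals a well-separated cluster, which is why the proposed method later uses conductance as a termination criterion.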

Hierarchical clustering is used for complex networks because they often have a hierarchical structure [14]. Hierarchical clustering consists of a divisive stage [15] and an agglomerative stage [16, 17]. In the divisive stage, the network is deemed to be one cluster at the beginning and is then split in each iteration, where the most dissimilar nodes are separated. In the agglomerative stage, similar nodes are agglomerated together until the termination criteria are met or the clusters merge into a single community. However, hierarchical clustering needs a well-defined similarity function, and the clustering can be inaccurate if all nodes are similar to each other.
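The agglomerative stage described above can be sketched as a simple merge loop. The shared-edge similarity and the stopping condition used here are placeholder assumptions for illustration only:

```python
# Minimal agglomeration sketch on a path graph: repeatedly merge the two most
# similar clusters until a (assumed) target number of clusters is reached.
edges = {(0, 1), (1, 2), (2, 3), (3, 4)}

def sim(a, b):
    # placeholder similarity: number of edges between cluster a and cluster b
    return sum(1 for u, v in edges if (u in a and v in b) or (u in b and v in a))

clusters = [{0}, {1}, {2}, {3}, {4}]
while len(clusters) > 2:                       # assumed termination criterion
    pairs = [(sim(a, b), i, j) for i, a in enumerate(clusters)
             for j, b in enumerate(clusters) if i < j]
    s, i, j = max(pairs)                       # most similar pair
    clusters[i] |= clusters[j]                 # merge j into i
    del clusters[j]

print(clusters)
```

In the proposed method, this placeholder similarity is replaced by the eigenvector-space similarity of Section 3 and the stopping condition by a conductance criterion.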

The problem of similar nodes and the similarity function can be solved by projecting the nodes into a high-dimensional feature space using spectral clustering, because the projected eigenvectors separate similar nodes into more distant positions in feature space. The reason for using the eigenvector space instead of the original points is that the properties of the original clustering are made more distinct by the eigenvector space. In spectral clustering, the original points are transformed into a set of points in eigenvector space, and clustering is done by analyzing that space. One technique for clustering the eigenvector space is the $k$-means algorithm [18], in which similar nodes are clustered together. However, traditional spectral clustering has a model selection problem that depends on heuristics. The problem can be solved using weighted kernel spectral clustering (KSC) with primal and dual representations [19–21]. KSC [21] is based on the principle that the projections of similar nodes cluster together in eigenvector space. In later work, an agglomeration technique was added to KSC, called agglomerative hierarchical kernel spectral clustering (AH-KSC) [22]. AH-KSC uses the eigenvector space to find the distance between nodes and agglomerates closely spaced nodes. The main purpose of AH-KSC is to obtain a hierarchical clustering, but its accuracy does not improve significantly over KSC, because AH-KSC allows indirectly connected nodes to be agglomerated together and has no termination criteria for a satisfactory community. The open problems of KSC and AH-KSC are choosing the eigenvectors, improving the accuracy of the detected communities, and relying on artificially generated data, which usually do not show the same characteristics as real-life networks.
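A generic spectral clustering sketch (not the KSC formulation of [21]) shows the idea: nodes are projected into the eigenvector space of a graph Laplacian, where the two communities of a toy graph become separable. Here the sign of the second eigenvector stands in for a full $k$-means step:

```python
import numpy as np

# Two triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
d = A.sum(axis=1)
L = np.diag(d) - A                 # unnormalized graph Laplacian
vals, vecs = np.linalg.eigh(L)     # eigenvalues in ascending order
fiedler = vecs[:, 1]               # eigenvector of 2nd-smallest eigenvalue
labels = (fiedler > 0).astype(int) # sign split instead of k-means
print(labels)                      # the two triangles fall into two groups
```

The projection makes the two tightly knit groups linearly separable even though the raw adjacency rows of bridge nodes look similar.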

The above-mentioned methods focus on decreasing the computation time or improving the accuracy of community detection. Methods for improving the computation time have been well studied, and the problem can be addressed by advances in technology and techniques [23–25], such as parallel computing and GPU programming. Improving the accuracy of community detection has been a challenging task because networks are usually structured with great complexity, with millions of nodes and edges. Hence, this paper proposes an agglomerative spectral clustering method with conductance and edge weights to improve the accuracy of community detection. The characteristics of the proposed method are well suited for accurate community detection in complex networks because the eigenvector space from spectral clustering provides well-separated points that are used in the similarity function for agglomeration. The conductance is used as a sensitive termination criterion for agglomeration, and the edge weight is a major factor in evaluating a more accurate similarity. In addition, the performance of the proposed method was compared with that of AH-KSC and KSC using real-life social network data with ground truth, namely, the LiveJournal and Orkut networks. The method improves community detection performance over previous works [21, 22].

The remainder of this paper is organized as follows: Section 2 introduces the problem statement and background needed to understand the proposed method. The core algorithm of the proposed method is explained in Section 3. The experiments are reported in Section 4, and Section 5 presents the conclusions.

#### 2. Fundamental Concepts

##### 2.1. Problem Statement

In KSC, the data are divided into training, validation, and test sets. In the training stage, the eigenvector space of the training data is sign-encoded, and the encoding is used for clustering the nodes of the network: the signs of the eigenvector points in the same cluster are identical. In the validation stage, model selection is performed to identify the clustering parameters. The eigenvector space of the test data is then used to evaluate the clusters obtained from the training data using the Hamming distance function. The problem with KSC is that clustering depends on encoding/decoding the eigenvector space. The encoded values are all signs in KSC [21], and the distinction between two elements of an eigenvector is only “1” or “0,” so similar eigenvector points become noisy. For example, if two projections take nearly equal values of opposite sign in eigenvector space, they are binarized as “1” and “0,” respectively; although the two points lie in a similar region of feature space, the encodings differ. This problem can be addressed by agglomerative hierarchical KSC (AH-KSC). In AH-KSC [22], instead of sign-encoding the eigenvector space, the space itself is used as data points to obtain the distance between nodes in the network, and closely spaced nodes are agglomerated together until only $k$ clusters or fewer remain.
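The encoding problem can be demonstrated numerically; the eigenvector values below are hypothetical:

```python
import numpy as np

# Two projections that are nearly identical in eigenvector space but whose
# first coordinates straddle zero: sign-based encoding puts them in different
# code words even though their actual distance is tiny.
e_i = np.array([0.01, 0.73, -0.20])
e_j = np.array([-0.01, 0.71, -0.22])

code_i = (e_i > 0).astype(int)    # sign-based encoding as in KSC
code_j = (e_j > 0).astype(int)
print(code_i, code_j)             # first bits disagree despite similarity
print(np.linalg.norm(e_i - e_j))  # Euclidean distance is small
```

AH-KSC avoids this by working with the raw eigenvector coordinates instead of their signs.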

KSC and AH-KSC still have certain disadvantages. Both methods calculate the kernel matrix by counting the number of edges connecting the common neighbors of two nodes $i$ and $j$: $K_{ij} = \sum_{l, m \in C_{ij}} A_{lm}$, where $C_{ij}$ is the set of common neighbors of $i$ and $j$, $l$ and $m$ are common neighbors, and $A$ is the adjacency matrix of the graph. However, the common-neighbor kernel can cause indirectly connected nodes to be clustered together, so nodes belonging to different clusters can be merged. To solve this problem, the adjacency matrix is used as the kernel matrix, so agglomerated nodes must be connected directly. In addition, KSC and AH-KSC use only the first $k-1$ eigenvectors for encoding/decoding, but the remaining eigenvectors can still provide correlated information for clustering. To take this into consideration, in this study the whole eigenvector space was used to evaluate the similarity between nodes. Furthermore, AH-KSC has no termination criteria for agglomerating clusters; therefore, the conductance was used as a termination criterion during the agglomeration of satisfactory clusters.
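The difference between the two kernel choices can be seen on a toy graph in which two nodes are not directly connected yet share connected common neighbors (the graph and helper function are illustrative assumptions):

```python
import numpy as np

# Nodes 0 and 3 are NOT adjacent, but they share the common neighbors 1 and 2,
# which are themselves connected. The common-neighbor kernel therefore gives
# the pair (0, 3) a positive score; the adjacency kernel does not.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

def common_neighbor_kernel(A, i, j):
    C = np.flatnonzero(A[i] * A[j])                      # common neighbors
    return sum(A[l, m] for l in C for m in C if l < m)   # edges among them

print(common_neighbor_kernel(A, 0, 3))  # positive although A[0, 3] == 0
print(A[0, 3])                          # adjacency kernel keeps them apart
```

Using the adjacency matrix as the kernel thus guarantees that only directly connected nodes can be agglomerated.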

##### 2.2. Background

In general, KSC is described by a primal-dual formulation. Given a network $G = (V, E)$, where $V$ denotes the vertices and $E$ the edges, and the training data $X_{tr} = \{x_i\}_{i=1}^{N}$, the primal problem [21] is

$$\min_{w^{(l)}, e^{(l)}, b_l}\; \frac{1}{2}\sum_{l=1}^{k-1} w^{(l)\top} w^{(l)} - \frac{1}{2N}\sum_{l=1}^{k-1} \gamma_l\, e^{(l)\top} D^{-1} e^{(l)} \quad \text{subject to}\quad e^{(l)} = \Phi w^{(l)} + b_l 1_N,$$

where $e^{(l)}$ is the projection, that is, the mapped points of the training data in feature space with respect to the direction $w^{(l)}$; $k-1$ indicates the number of score variables needed to encode the $k$ clusters; $D^{-1}$ is the inverse of the degree matrix of the kernel matrix $\Omega$; $\Phi$ is the feature matrix with $\Phi = [\phi(x_1)^\top; \ldots; \phi(x_N)^\top]$; $\gamma_l$ is the regularization constant; and $1_N$ is the vector of ones. The primal form of the data point $x_i$ is expressed as

$$e_i^{(l)} = w^{(l)\top}\phi(x_i) + b_l,$$

where $\phi$ is the map to a high-dimensional feature space, $N$ is the number of nodes in the graph, $l = 1, \ldots, k-1$ indexes the eigenvectors, and $b_l$ is the bias term. The dual problem related to the primal problem is

$$D^{-1} M_D \Omega\, \alpha^{(l)} = \lambda_l\, \alpha^{(l)},$$

where $\Omega$ is the kernel matrix with $ij$th entry $\Omega_{ij} = K(x_i, x_j)$; $D$ is the diagonal degree matrix with elements $D_{ii} = \sum_j \Omega_{ij}$; $M_D$ is a centering matrix defined as $M_D = I_N - \frac{1}{1_N^\top D^{-1} 1_N} 1_N 1_N^\top D^{-1}$, where $I_N$ is the identity matrix; $\alpha^{(l)}$ is the dual variable; and the kernel function $K$ is the similarity function of the graph. The parameters, such as the number of communities $k$, are estimated using the training data $X_{tr}$ and validation data $X_{val}$. In addition, all the nodes are clustered in the training and validation stages. The eigenvector space is used to find a unique codeword for each of the $k$ clusters; the codebook can be obtained from the rows of the binarized eigenvector matrix. Finally, the eigenvector space of the test data $X_{test}$ is decoded using the Hamming distance [21], and the clustered result is evaluated. Therefore, the eigenvector space is used to derive the similarity among nodes, which will be explained in detail in the following section.
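The dual eigenvalue problem can be solved numerically on a small example; using the adjacency matrix of a toy graph as the kernel matrix $\Omega$ is an assumption made here for illustration only:

```python
import numpy as np

# Numeric sketch of the dual problem D^{-1} M_D Omega alpha = lambda alpha.
Omega = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)   # assumed kernel = adjacency
d = Omega.sum(axis=1)                           # D_ii = sum_j Omega_ij
D_inv = np.diag(1.0 / d)
one = np.ones((4, 1))
c = (one.T @ D_inv @ one).item()                # 1^T D^{-1} 1
M_D = np.eye(4) - (one @ one.T @ D_inv) / c     # centering matrix
lam, alpha = np.linalg.eig(D_inv @ M_D @ Omega) # dual variables alpha^{(l)}
print(np.round(lam.real, 3))                    # spectrum of the dual problem
```

The columns of `alpha` corresponding to the leading eigenvalues play the role of the dual variables $\alpha^{(l)}$ whose sign patterns encode the clusters.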

#### 3. Proposed Community Detection Algorithm

This section presents the details of agglomerative spectral clustering with conductance. The eigenvector space is used to find the similarity among nodes and to agglomerate the most similar nodes into a new combined node in the network graph. The new combined node is added to the graph after agglomeration, and the process iterates on the changed graph until the termination criteria are satisfied.

To agglomerate two nodes, a similarity function is modified from the correlation distance function among the nodes as follows:

$$S_{ij} = 1 - \frac{\sum_{l}(e_{il} - \bar{e}_i)(e_{jl} - \bar{e}_j)}{\sqrt{\sum_{l}(e_{il} - \bar{e}_i)^2}\,\sqrt{\sum_{l}(e_{jl} - \bar{e}_j)^2}}, \quad (4)$$

where $S_{ij}$ is the similarity between nodes $i$ and $j$ in the range $[0, 2]$, with 0 being perfect similarity and 2 being perfect dissimilarity, $e_{il}$ is the value of the eigenvector space in the $i$th row and $l$th column, and $\bar{e}_i$ is the mean of the $i$th row.
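A small sketch of this correlation-based similarity, with hypothetical eigenvector-space rows:

```python
import numpy as np

# Similarity as in (4): 1 - Pearson correlation between rows of the
# eigenvector space, giving values in [0, 2] (0 = perfect similarity).
E = np.array([[0.9, 0.1, -0.3],    # hypothetical eigenvector-space rows
              [0.8, 0.2, -0.4],
              [-0.9, -0.1, 0.3]])  # exact negation of row 0

def similarity(E, i, j):
    return 1.0 - np.corrcoef(E[i], E[j])[0, 1]

print(round(similarity(E, 0, 1), 3))  # near 0: very similar nodes
print(round(similarity(E, 0, 2), 3))  # near 2: very dissimilar nodes
```

Nodes whose eigenvector-space rows point in the same direction receive a similarity near 0 and become candidates for agglomeration.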

The eigenvector space alone is not enough to fully express the similarity among agglomerated nodes, because nodes connected to each other are projected into similar places in feature space, and it is very difficult to distinguish such similar projections. On the other hand, these similar projections can be distinguished using the disparity of the edge connections between the nodes. Agglomerated nodes can have more than one connecting edge with each other; therefore, more tightly connected nodes are more similar. For example, in Figure 1, similar nodes are combined in the 1st iteration: node $n_6$ has two connections to the agglomerated node of $n_4$ and $n_5$ and one connection to the other agglomerated node, so $n_6$ is more likely to agglomerate with $n_4$ and $n_5$. In the 2nd iteration, the new agglomerated nodes are used to find a new eigenvector space and to agglomerate similar nodes. In doing so, the number of edges in the graph is unchanged, and some nodes have more than one edge between them. As in the example of node $n_6$, the edges between two nodes are used as a means to score the similarity between nodes and improve the accuracy of the algorithm. On the other hand, the number of edges between nodes can vary widely, making the value of the similarity function in (4) differ too greatly; the similarity function would then overemphasize the number of edges and disregard the eigenvector-space score. In the present study, a sigmoid function is used to normalize the edge values and solve this problem.