Abstract

Community detection has become an increasingly popular tool for analyzing complex networks. Many methods have been proposed for accurate community detection, and one of them is spectral clustering. However, most spectral clustering algorithms have been evaluated only on artificial networks, and the accuracy of the detected communities is still unsatisfactory. Therefore, this paper proposes an agglomerative spectral clustering method with conductance and edge weights. In this method, the most similar nodes are agglomerated based on the eigenvector space and edge weights. In addition, conductance is used to identify densely connected clusters while agglomerating. In experiments, the proposed method outperforms related works and proves to be effective on real-life complex networks.

1. Introduction

In recent years, community detection in networks has become one of the main topics in fields such as biology, computer science, physics, and applied mathematics [1–3]. In a network $G = (V, E)$, where $V$ is a set of nodes and $E$ is the set of edges (relations) between the nodes, a community is a group of nodes that are tightly connected to each other and show similar characteristics. For example, in a social network, people in a community show similar interest in a trend, such as buying the same products in online marketing. In a biological network, proteins in a community have similar specific functions, and, in the World Wide Web, sites clustered together cover the same topic in their web pages. Scientists in many fields have made significant contributions to detecting communities using a number of different methods, such as graph partitioning [4, 5], hierarchical clustering [6, 7], and spectral clustering [8, 9].

In graph partitioning, a network is divided into clusters in such a way that the number of edges connecting the clusters is minimal; that is, the edges of a cluster are denser inside than outside (a property also referred to as conductance [10]). In addition, the number of lowest-sized clusters needs to be specified. Girvan and Newman [3] introduced a popular graph partitioning algorithm that uses modularity (a quality measure closely related to conductance) to cluster communities, but the method is slower than other community detection algorithms [11, 12]. Later, Djidjev [13] proposed a computationally faster version of the algorithm. However, the definition of conductance is not always well defined, and it can fail in some cases [5]. Therefore, graph partitioning still needs further investigation. A number of methods have been proposed to solve this problem. One well-known method was introduced by Newman [5]. It uses a spectral clustering algorithm with modularity maximization, in which the modularity function is evaluated only for the possible clusters of the network, and the results showed that spectral clustering with conductance can cluster communities efficiently.
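The notion of conductance used above can be made concrete with a small sketch. The function below assumes one common definition, in which the conductance of a node set is the number of edges leaving the set divided by the smaller of the total degrees inside and outside the set. The function name, the edge-list representation, and the toy graph are illustrative assumptions and are not taken from the cited works.

```python
def conductance(edges, community):
    """Cut edges of `community` divided by the smaller side's volume."""
    community = set(community)
    cut = 0      # edges with exactly one endpoint in the community
    vol_in = 0   # sum of degrees of the nodes inside the community
    vol_out = 0  # sum of degrees of the nodes outside the community
    for u, v in edges:
        u_in, v_in = u in community, v in community
        if u_in != v_in:
            cut += 1
        vol_in += u_in + v_in
        vol_out += (not u_in) + (not v_in)
    denom = min(vol_in, vol_out)
    return cut / denom if denom else 0.0


# Toy graph (assumption): two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
print(conductance(edges, {0, 1, 2}))  # one cut edge over a volume of 7
print(conductance(edges, {0, 1}))     # two cut edges over a volume of 4
```

A lower value indicates a cluster whose edges are denser inside than outside, which is the property the text relies on.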

Hierarchical clustering is used for complex networks because they often have a hierarchical structure [14]. Hierarchical clustering consists of a division [15] and agglomeration stage [16, 17]. In the division stage, a network is deemed to be one cluster in the beginning and the network is then divided into clusters in each iteration, where the most dissimilar nodes are separated. In the agglomeration stage, similar nodes are agglomerated together until the termination criteria are met or the clusters agglomerate into a single community. However, hierarchical clustering needs a well-defined similarity function and the clustering can be inaccurate if all nodes are similar to each other.

The problem of similar nodes and of choosing a similarity function can be addressed by projecting the nodes into a high-dimensional feature space using spectral clustering, because the projected eigenvectors move similar nodes to more clearly separated positions in the feature space. The eigenvector space is used instead of the original points because it makes the properties of the original clustering more distinct. In spectral clustering, the original points are transformed into a set of points in the eigenvector space, and clustering is done by analyzing that space. One technique for clustering the eigenvector space is the $k$-means algorithm [18], in which similar nodes are clustered together. However, traditional spectral clustering has a problem with model selection, which depends on heuristics. This problem can be solved using weighted kernel spectral clustering (KSC) with primal and dual representations [19–21]. KSC [21] builds on the principle that the projections of similar nodes are clustered together in the eigenvector space. In a later work, an agglomeration technique was added to KSC, which is called agglomerative hierarchical kernel spectral clustering (AH-KSC) [22]. AH-KSC uses the eigenvector space to find the distance between nodes and agglomerates nodes that are close to each other. The main purpose of AH-KSC is to obtain a hierarchical clustering, but its accuracy does not improve significantly over KSC because AH-KSC allows indirectly connected nodes to be agglomerated together and has no termination criteria for a satisfied community. The remaining problems of KSC and AH-KSC are the choice of eigenvectors, the accuracy of the detected communities, and the use of only synthetic data, which usually do not show the same characteristics as real-life networks.

The above-mentioned methods focus on decreasing the computation time or improving the accuracy of community detection. Methods for improving the computation time have been well studied, and the problem can be addressed by advances in technology and techniques [23–25], such as parallel computing and GPU programming. Improving the accuracy of community detection has been a challenging task because networks are usually highly complex, with millions of nodes and edges. Hence, this paper proposes an agglomerative spectral clustering method with conductance and edge weights to improve the accuracy of community detection. The proposed method is well suited for accurate community detection in complex networks because the eigenvector space from spectral clustering provides well-separated points that are used for the similarity function in agglomeration. The conductance serves as a sensitive termination criterion for agglomeration, and the edge weight is a major factor in evaluating a more accurate similarity. In addition, the performance of the proposed method was compared with that of AH-KSC and KSC using real-life social network data with ground truth, namely, the LiveJournal and Orkut networks. The proposed method improves the community detection performance over previous works [21, 22].

The remainder of this paper is organized as follows. Section 2 introduces the problem statement and the background needed to understand the proposed method. The core algorithm of the proposed method is explained in Section 3. The experiments are described in Section 4, and Section 5 reports the conclusions.

2. Fundamental Concepts

2.1. Problem Statement

In KSC, the data are divided into training, validation, and test sets. In the training stage, the eigenvector space of the training data is binarized by sign, and the result is used for clustering the nodes in a network. The signs of the eigenvector points in the same cluster are identical. In the validation stage, model selection is performed to identify the clustering parameters. The eigenvector space of the test data is used to evaluate the clusters obtained from the training data using the Hamming distance function. The problem with KSC is that clustering depends on encoding/decoding the eigenvector space. The encoded values are all signed in KSC [21], and the distinction between two elements of the eigenvector is only “1” or “0,” so that similar eigenvector points become noisy. For example, two values in the eigenvector space that lie close to each other but on opposite sides of zero are binarized as “1” and “0,” respectively. Although the two values are projected to similar positions in the feature space, the results of encoding differ. This problem can be solved by agglomerative hierarchical KSC (AH-KSC). In AH-KSC [22], instead of signing the eigenvector space, the space is used as the data points to obtain the distance between nodes in a network, and nodes that are close to each other are agglomerated until only $k$ clusters or fewer remain.
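The encoding problem described above can be illustrated with a few lines of Python. This is a minimal sketch that assumes sign-based binarization at zero and uses made-up projection values; the function names are hypothetical, and the code is not the KSC implementation of [21].

```python
def binarize(row):
    """Sign-based code word: 1 for positive entries, 0 otherwise."""
    return tuple(1 if value > 0 else 0 for value in row)


def hamming(a, b):
    """Number of positions in which two code words differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical projections of three nodes onto two score variables.
node_a = (0.01, 0.50)
node_b = (-0.01, 0.50)  # almost identical to node_a
node_c = (0.90, 0.50)   # far from node_a in the first coordinate

print(hamming(binarize(node_a), binarize(node_b)))  # 1: close values, different codes
print(hamming(binarize(node_a), binarize(node_c)))  # 0: distant values, same code
```

The two nearly identical nodes receive different code words, while the distant pair shares one, which is the loss of information that motivates using the eigenvector values directly.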

KSC and AH-KSC still have certain disadvantages. Both methods calculate the kernel matrix by counting the number of edges connecting the common neighbors of two nodes $i$ and $j$: $\Omega_{ij} = \sum_{u, v \in N_{ij}} A_{uv}$, where $N_{ij}$ is the set of common neighbors of $i$ and $j$, $u$ and $v$ are common neighbors, and $A$ is the adjacency matrix of the graph. However, the use of common neighbors can cause indirectly connected nodes to be clustered together, so that nodes belonging to different clusters can end up in the same cluster. To solve this problem, the adjacency matrix is used as the kernel matrix so that only directly connected nodes are agglomerated. In addition, KSC and AH-KSC use only the first $k - 1$ eigenvectors for encoding/decoding, but the remaining eigenvectors can still provide correlated information for clustering. To take this into consideration, the entire eigenvector space is used in this study to evaluate the similarity between nodes. Furthermore, AH-KSC has no termination criteria for agglomerating clusters. Therefore, the conductance is used as the termination criterion during the agglomeration of satisfied clusters.
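The difference between the two kernel choices can be seen on a toy graph. The sketch below assumes that the common-neighbor kernel sums the adjacency entries over ordered pairs of common neighbors, so each connecting edge is counted twice; the function names and the graph are illustrative assumptions.

```python
def common_neighbor_kernel(adj, i, j):
    """Sum of adjacency entries over ordered pairs of common neighbors of i and j."""
    shared = adj[i] & adj[j]
    return sum(1 for u in shared for v in shared if v in adj[u])


def adjacency_kernel(adj, i, j):
    """1 if i and j are directly connected, 0 otherwise."""
    return 1 if j in adj[i] else 0


# Toy graph (assumption): nodes 0 and 1 are NOT adjacent, but both are
# connected to nodes 2 and 3, which are adjacent to each other.
adj = {
    0: {2, 3},
    1: {2, 3},
    2: {0, 1, 3},
    3: {0, 1, 2},
}

print(common_neighbor_kernel(adj, 0, 1))  # 2: positive similarity without a direct edge
print(adjacency_kernel(adj, 0, 1))        # 0: no direct connection
```

With the common-neighbor kernel, nodes 0 and 1 look similar although they share no edge, whereas the adjacency kernel assigns them zero similarity.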

2.2. Background

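In general, KSC is described by a primal-dual formulation. Given a network $G = (V, E)$, where $V$ denotes the vertices and $E$ the edges, and the training data $\{x_i\}_{i=1}^{N}$, the primal problem [21] is

$$\min_{w^{(l)},\, e^{(l)},\, b_l} \; \frac{1}{2} \sum_{l=1}^{k-1} w^{(l)\top} w^{(l)} - \frac{1}{2N} \sum_{l=1}^{k-1} \gamma_l \, e^{(l)\top} D_{\Omega}^{-1} e^{(l)} \quad \text{such that} \quad e^{(l)} = \Phi w^{(l)} + b_l 1_N, \quad l = 1, \ldots, k-1, \tag{1}$$

where $e^{(l)} = [e_1^{(l)}, \ldots, e_N^{(l)}]^{\top}$ is the projection, that is, the mapped points of the training data in the feature space with respect to the direction $w^{(l)}$; $l = 1, \ldots, k-1$ indexes the score variables needed to encode the $k$ clusters; $D_{\Omega}^{-1}$ is the inverse of the degree matrix of the kernel matrix $\Omega$; $\Phi$ is the $N \times d_h$ feature matrix, where $\Phi = [\varphi(x_1)^{\top}; \ldots; \varphi(x_N)^{\top}]$; $\gamma_l$ is the regularization constant; and $1_N$ is the vector of ones. The primal form of the data point $x_i$ is expressed as

$$e_i^{(l)} = w^{(l)\top} \varphi(x_i) + b_l, \tag{2}$$

where $\varphi: \mathbb{R}^{N} \rightarrow \mathbb{R}^{d_h}$ is the map to the high-dimensional feature space, $N$ is the number of nodes in graph $G$, $l = 1, \ldots, k-1$ indexes the eigenvectors, and $b_l$ is the bias term. The dual problem related to the primal problem is

$$D_{\Omega}^{-1} M_D \Omega \alpha^{(l)} = \lambda_l \alpha^{(l)}, \tag{3}$$

where $\Omega$ is the kernel matrix with $ij$th entry $\Omega_{ij} = K(x_i, x_j) = \varphi(x_i)^{\top} \varphi(x_j)$; $D_{\Omega}$ is the diagonal degree matrix of $\Omega$ with elements $d_i = \sum_j \Omega_{ij}$; $M_D$ is a centering matrix defined as $M_D = I_N - \frac{1}{1_N^{\top} D_{\Omega}^{-1} 1_N} 1_N 1_N^{\top} D_{\Omega}^{-1}$, where $I_N$ is the identity matrix; $\alpha^{(l)}$ is the dual variable; and the kernel function $K$ is the similarity function of the graph. The parameters, such as the number of communities $k$, are estimated using the training data and the validation data. In addition, all the nodes are clustered in the training and validation stages. The eigenvector space is used to find a unique code-word for each cluster, and the codebook can be obtained from the rows of the binarized eigenvector matrix. Finally, the eigenvector space of the test data is decoded using the Hamming distance [21], and the clustered result is evaluated. Therefore, the eigenvector space is used to derive the similarity among nodes, which is explained in detail in the following section.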

3. Proposed Community Detection Algorithm

This section presents the details of the agglomerative spectral clustering method with conductance. The eigenvector space is used to find the similarity among nodes and to agglomerate the most similar nodes into a new combined node in the network graph. The new combined node is added to the graph after agglomeration, and the process is repeated on the changed graph until the termination criteria are satisfied.

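To agglomerate two nodes, a similarity function is derived from the correlation distance function among the nodes as follows:

$$s_{ij} = 1 - \frac{\sum_{k} (e_{ik} - \bar{e}_i)(e_{jk} - \bar{e}_j)}{\sqrt{\sum_{k} (e_{ik} - \bar{e}_i)^2} \, \sqrt{\sum_{k} (e_{jk} - \bar{e}_j)^2}}, \tag{4}$$

where $s_{ij}$ is the similarity between nodes $i$ and $j$ in the range $[0, 2]$, with 0 being perfect similarity and 2 being perfect dissimilarity; $e_{ik}$ is the value of the eigenvector space in the $i$th row and $k$th column; and $\bar{e}_i$ is the mean of the $i$th row.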
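As a minimal sketch of the correlation distance on which the similarity function is based, the following Python function computes one minus the Pearson correlation of two rows of the eigenvector space. It assumes the unmodified correlation distance and rows that are not constant (otherwise the denominator is zero); the function name and the sample rows are illustrative.

```python
import math


def correlation_distance(x, y):
    """1 - Pearson correlation of two equally long rows; lies in [0, 2]."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx)) * math.sqrt(sum(b * b for b in dy))
    return 1.0 - num / den


# Hypothetical rows of an eigenvector matrix (one row per node).
row_i = [1.0, 2.0, 3.0]
row_j = [2.0, 4.0, 6.0]  # same direction as row_i: distance close to 0
row_k = [3.0, 2.0, 1.0]  # opposite direction: distance close to 2

print(correlation_distance(row_i, row_j))
print(correlation_distance(row_i, row_k))
```

Values near 0 correspond to nodes with almost identical projections, and values near 2 to nodes whose projections are perfectly anti-correlated.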

The eigenvector space alone is not enough to fully express the similarity among agglomerated nodes, because nodes connected to each other are projected to similar positions in the feature space and it is very difficult to distinguish similar projections. These similar projections can, however, be distinguished using the disparity of the edge connections between the nodes. Agglomerated nodes can have more than one connecting edge with each other, and more tightly connected nodes are considered more similar. For example, in Figure 1, similar nodes are combined in the 1st iteration; node n6 then has two connections to the agglomerated node of n4 and n5 and one connection to another agglomerated node, so that n6 is more likely to agglomerate with n4 and n5. In the 2nd iteration, the new agglomerated nodes are used to find a new eigenvector space and to agglomerate similar nodes. In doing so, the number of edges in the graph is unchanged, and some nodes have more than one edge between them. As in the example of node n6, the edges between two nodes are used as a means of scoring the similarity between nodes to improve the accuracy of the algorithm. However, the number of edges between nodes can vary widely, which changes the value of the similarity function in (4) too strongly. The similarity function would then overemphasize the number of edges and disregard the eigenvector space score. In the present study, a sigmoid function is used to normalize the edge values to solve this problem.

Equation (4) is modified as expressed in (5), which takes into account the maximum number of edges and the number of edges between nodes $i$ and $j$. The vertical value of the sigmoid graph is taken as the edge similarity score and ranges from 0.5 to 1, while the horizontal value, which is the number of edges, ranges from 0 to the maximum number of edges. Equation (5) is used to find the most similar node of node $i$ among the other nodes in graph $G$. At the first iteration, the first node becomes a candidate, and if a more similar node than the candidate is found, the candidate is replaced with it. The process continues until the similarity of all nodes has been evaluated. Thus, the most similar node of node $i$ is determined by (6), taken over all the other nodes in graph $G$.
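The two ingredients described above, a sigmoid edge score and the candidate scan, can be sketched as follows. The sketch assumes the standard logistic function, which yields 0.5 for zero edges and approaches 1 as the number of edges grows, and it assumes a distance in which smaller values mean greater similarity. How the edge score is combined with the eigenvector similarity in (5) is not reproduced here; all names and values are hypothetical.

```python
import math


def edge_score(n_edges):
    """Logistic score: 0.5 for no edges, approaching 1 as edges increase."""
    return 1.0 / (1.0 + math.exp(-n_edges))


def most_similar(i, nodes, distance):
    """Scan all other nodes and keep the candidate with the smallest distance."""
    candidate = None
    for j in nodes:
        if j == i:
            continue
        if candidate is None or distance(i, j) < distance(i, candidate):
            candidate = j
    return candidate


# Hypothetical pairwise distances (smaller means more similar).
table = {(1, 2): 0.2, (1, 3): 0.9, (2, 3): 0.4}


def distance(i, j):
    return table[(min(i, j), max(i, j))]


print(edge_score(0))                        # 0.5
print(most_similar(1, [1, 2, 3], distance)) # 2
print(most_similar(3, [1, 2, 3], distance)) # 2
```

Because the logistic score saturates, a pair joined by many edges is not scored dramatically higher than a pair joined by a few, which keeps the edge count from dominating the eigenvector term.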

Furthermore, to obtain a more accurate clustering result, this study considers the definition of a good community, namely, that the “density of the edge connection should be higher inside than outside” [10]. Similar nodes are agglomerated together in every iteration, and the agglomerated nodes become a clustered community after a few iterations. If a cluster is tightly connected inside and sparsely connected to the outside, no further agglomeration is needed because the cluster already satisfies the definition of a good community, and agglomeration for this community is terminated. In addition, two communities are agglomerated when they are tightly connected to each other. For example, in Figure 1(c), where the inside edges are drawn as solid lines and the outside edges as dotted lines, the graph is clustered into three agglomerated communities. One of these communities has three inside edges and two outside edges, which connect it to the other two communities, so that it has a denser connection inside than outside. Consequently, no further agglomeration is needed. To incorporate the ratio of inside and outside edges into (5), two cases are distinguished when a node is a candidate as the most similar node for another node, taking into account the number of nodes inside each of the two nodes.

In the first case, the number of inside edges of the node is at most equal to the number of outside edges connecting it to the candidate. In the second case, the number of inside edges of the node is greater than the number of outside edges connecting it to the candidate. Therefore, to agglomerate only tightly connected nodes together, (5) is modified into (7) using the inside edges of each of the two nodes, the edges connecting them, and the community density parameter.
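The comparison of inside and outside edges can be sketched with a short Python example. It assumes a plain edge list and the simple rule that a cluster is satisfied when it has more inside edges than outside edges; the community density parameter of (7) is deliberately left out because its exact form is not reproduced here, and all names are hypothetical.

```python
def inside_outside_edges(edges, cluster):
    """Count edges with both endpoints inside and with exactly one endpoint inside."""
    cluster = set(cluster)
    inside = sum(1 for u, v in edges if u in cluster and v in cluster)
    outside = sum(1 for u, v in edges if (u in cluster) != (v in cluster))
    return inside, outside


def is_satisfied(edges, cluster):
    """A cluster needs no further agglomeration if it is denser inside than outside."""
    inside, outside = inside_outside_edges(edges, cluster)
    return inside > outside


# Toy graph (assumption): a triangle {a, b, c} with two edges to other nodes.
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),  # inside edges of the triangle
    ("a", "d"), ("c", "e"),              # outside edges of the triangle
    ("d", "f"),
]

print(inside_outside_edges(edges, {"a", "b", "c"}))  # (3, 2)
print(is_satisfied(edges, {"a", "b", "c"}))          # True
print(is_satisfied(edges, {"d"}))                    # False
```

The triangle mirrors the example in the text, with three inside edges and two outside edges, so it would stop agglomerating under this rule.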

After finding the similarity between nodes using (7), the agglomeration of the most similar nodes starts. In every iteration, the most similar node of each node is found; however, if the most similar node of $i$ is $j$, the opposite does not necessarily hold for $j$. Therefore, two nodes are agglomerated only when both choose each other as the most similar node. Thus, the agglomerated node is given by (8), where $i, j \in V$ and $V$ is all the nodes in graph $G$.
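The mutual-choice rule described above can be sketched in a few lines of Python. The sketch assumes that the most similar node of every node has already been computed and stored in a dictionary; the function name and the sample choices are hypothetical.

```python
def mutual_pairs(best):
    """Return the pairs (i, j), i < j, in which both nodes chose each other."""
    return sorted(
        (i, j) for i, j in best.items() if i < j and best.get(j) == i
    )


# Hypothetical choices: node -> its most similar node.
best = {1: 2, 2: 1, 3: 2, 4: 5, 5: 4, 6: 5}

print(mutual_pairs(best))  # [(1, 2), (4, 5)]
```

Nodes 3 and 6 are left unmerged in this iteration because their choices are not reciprocated.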

The termination condition is met when all agglomerated nodes are connected more tightly inside than outside as seen in Pseudocode 1.

Input: Graph G, nodes V, edges E, density parameter
Output: Hierarchically clustered communities
Find the eigenvectors of G
Find the similarities of each pair of nodes i and j using the eigenvectors with Eq. (5)
Compute the most similar node of each node i using Eq. (7)
Agglomerate nodes i and j if the two nodes chose each other as the most similar node
Re-initialize the graph with the agglomerated nodes and start the next iteration
Agglomerate the nodes into hierarchical clusters when the iteration is finished

4. Experiment

This section presents the results of the proposed method and compares them with those of conventional community detection works [21, 22] while varying the parameter values. LiveJournal and Orkut are used for evaluation as ground-truth social networks. LiveJournal is a blogging and social networking site that has been around since 1999. The LiveJournal data have 4 million nodes and 35 million edges, and the ground-truth data contain 287,512 communities. To show how the detected communities change when the density parameter is varied on different networks, the Orkut network is also used, because the difference in density between the two networks emphasizes the importance of choosing an optimal density parameter. Orkut is a free online social network where users form friendships with each other. The Orkut data have 3 million nodes and 117 million edges, and the ground-truth data contain 6,288,363 communities. The network is massive and complex, which makes the clustering task more difficult. The datasets are available at https://snap.stanford.edu/data/.

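The evaluation is done using the following metrics: precision ($P$), recall ($R$), and $F$-score. Precision is defined as

$$P = \frac{TP}{TP + FP}, \tag{9}$$

where $TP$ is the number of nodes that are correctly clustered and $FP$ is the number of nodes that are falsely clustered. Recall is defined as

$$R = \frac{TP}{TP + FN}, \tag{10}$$

where $FN$ is the number of nodes that are supposed to be clustered but were not. The $F$-score is defined as

$$F = \frac{2 P R}{P + R}, \tag{11}$$

where the $F$-score is the harmonic mean of precision and recall.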
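The three metrics can be computed for a single detected community as in the following sketch. It assumes that the detected and ground-truth communities are given as sets of node identifiers and that the usual definitions of precision, recall, and their harmonic mean apply; the function name and the sample sets are hypothetical.

```python
def precision_recall_f(detected, truth):
    """Precision, recall, and F-score of a detected community against ground truth."""
    detected, truth = set(detected), set(truth)
    tp = len(detected & truth)  # correctly clustered nodes
    fp = len(detected - truth)  # falsely clustered nodes
    fn = len(truth - detected)  # nodes that should have been clustered but were not
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical example: 3 of 4 detected nodes are correct, 2 true nodes are missed.
p, r, f = precision_recall_f({1, 2, 3, 4}, {2, 3, 4, 5, 6})
print(p, r, round(f, 4))
```

A detected community that keeps growing past the ground truth raises recall but lowers precision, which is why both are reported together with the $F$-score.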

The results of clustering obtained by varying the community density parameter are shown in Figure 2, where there are 3 communities. The right side community is the smallest with 61 nodes, the middle community has 109 nodes, and the left side community is the largest with 331 nodes. An optimal value of the density parameter was determined from the training data by trial and error. With the initial value, as shown in Figure 2(a), each community is split internally into small clusters. The clustering performance improves as the density parameter is increased. At this stage, the middle and right side communities are clustered successfully, but the largest community on the left side is not, because it is very complex with many edges. In this case, the clustering performance can be improved by relaxing the density parameter further. The three communities are successfully clustered at the setting shown in Figure 2(c). If the value continues to increase, the clustering performance becomes worse than at that setting because the clustering criteria become too relaxed. With a larger value, the left side community is clustered into 4 communities, as shown in Figure 2(d), and when the value reaches 7, the left side community is separated into smaller communities, as shown in Figure 2(e).

Figure 3 compares the results obtained with different values of the density parameter on the Orkut network. Unlike the LiveJournal network, Orkut is a more densely connected network: the ratio of nodes to edges is 1 : 8.6 for LiveJournal, whereas it is 1 : 38.1 for Orkut. Therefore, the density parameter for Orkut needs to be stricter than for LiveJournal because the clusters are all densely connected with each other. If the density parameter is not strict, densely connected nodes from different clusters are allowed to agglomerate together. The first column of Figure 3 shows the ground-truth community, which is colored in yellow, and the following columns show the networks detected with density parameters ranging from 0.1 to 4. As shown in the first row of Figure 3, the community detected with a density parameter of 0.1 matches the desired result well, but the accuracy drops sharply as the density parameter increases because the size of the detected community grows continuously. The second row of Figure 3 shows a different behavior from the first row: the seed node is agglomerated into a different cluster because of the relaxed density parameter. The third row behaves similarly to the second row, which shows that a relaxed density parameter can lead to a less densely clustered community. The fourth row gives results similar to the first row, which shows that the detected network continues to expand if a relaxed density parameter is allowed. The optimal community detection result was obtained with a value of 0.1 for the Orkut network, while the optimal value is 4 for the LiveJournal network. Therefore, the experimental results show that the density parameter is closely related to the density of the network: a denser network requires a stricter density parameter. The density of the network should thus be considered when determining the optimal value of the density parameter.

Figure 4 shows an analysis of the agglomeration process of the proposed method and AH-KSC as the number of iterations increases. Figure 4(a) shows the early-stage agglomeration result of AH-KSC for the middle community (at the 17th iteration), which is clustered successfully and colored in red. However, in the halfway stage (at the 26th iteration), the middle community was agglomerated together with some nodes that belong to the left side community, even though there were no direct connections to those nodes. In the late stage of the agglomeration process (at the 30th iteration), the middle community was clustered with the right side community even though the right side community was already a satisfied community, as shown in Figure 4(c). Figure 4(d) shows the early stage of the proposed method. As with AH-KSC, the middle community is well clustered (at the 23rd iteration). In the halfway stage of agglomeration (at the 26th iteration), the middle community was clustered successfully because only directly connected nodes are agglomerated according to (5), where the number of edges between the nodes is added to the similarity function. At the late stage (at the 39th iteration), the right side community is clustered accurately because the ratio of the inner and outer edge connections is applied in (7), so that the smallest community on the right side has stopped agglomerating. Figure 4(f) shows the final result of the algorithm.

This study compared the detection accuracy of the proposed method with that of AH-KSC and KSC using the ground-truth LiveJournal network. For ease of comparison, only four parts of the network are used because the full network, with about 4 million nodes, is too large. As shown in Figure 5, there are four subnetworks with different structures. The 1st network has 292 nodes and 1858 edges, and the ground-truth community to which the seed node belongs has 24 nodes. The 2nd network has 356 nodes and 33616 edges with a ground-truth community of 52 nodes. The 3rd network has 652 nodes and 63044 edges, and the ground-truth community has 22 nodes. The 4th network has 119 nodes and 866 edges with a ground-truth community of 15 nodes. The 2nd and 3rd networks are so complex that it is difficult to detect communities in them, while the 1st and 4th networks are well structured, that is, of average difficulty. In Figure 5, the light yellow node groups in the first column are the ground-truth communities, the green node groups in the second column are the communities detected by the proposed method, and the red node groups in the remaining columns are the communities detected by AH-KSC and KSC, respectively. In the experiment, AH-KSC agglomerates the neighboring nodes successfully in the early stages of agglomeration, as mentioned above, but it fails to terminate the agglomeration, as shown in the first and last networks in Figure 5, owing to the lack of termination criteria. In addition, when the networks are too complex, such as the 2nd and 3rd networks, clustering is not done efficiently. KSC produces results similar to those of AH-KSC, but it clusters better than AH-KSC when the network is well organized, as with the 4th network. AH-KSC provides a better result than KSC when the network is very complex, as with the 2nd and 3rd networks.

Table 1 shows the overall accuracy of community detection on the LiveJournal ground-truth network in terms of precision, recall, and $F$-score for the proposed method, AH-KSC, and KSC. For AH-KSC, the average precision, recall, and $F$-score were 0.57, 0.70, and 0.61, respectively. For KSC, the average precision, recall, and $F$-score were 0.55, 0.82, and 0.62, respectively. For the proposed method, the average precision, recall, and $F$-score were 0.64, 0.95, and 0.75, respectively. The overall precision of AH-KSC and KSC was similar, with a difference of 2 percentage points, but KSC achieved a higher overall recall by about 12 percentage points, which means that KSC detected more true positive nodes than AH-KSC. The average $F$-scores of AH-KSC and KSC are similar, with a difference of only 1 percentage point. The proposed method outperformed AH-KSC and KSC in all evaluation metrics. In average precision, the proposed method improved by 7 and 9 percentage points over AH-KSC and KSC, respectively. The largest improvement was in average recall, with 25 and 13 percentage points, respectively. The average $F$-score of the proposed method improved by 14 and 13 percentage points, respectively.
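The improvements quoted above follow directly from the reported averages. The short check below recomputes them as differences in percentage points; the dictionary layout and the function name are illustrative assumptions, while the numbers are the averages reported in the text.

```python
# Average scores as reported in the text for the LiveJournal network.
scores = {
    "AH-KSC":   {"precision": 0.57, "recall": 0.70, "f_score": 0.61},
    "KSC":      {"precision": 0.55, "recall": 0.82, "f_score": 0.62},
    "Proposed": {"precision": 0.64, "recall": 0.95, "f_score": 0.75},
}


def gains(scores, method="Proposed"):
    """Difference of `method` to every other method, in percentage points."""
    result = {}
    for metric, value in scores[method].items():
        result[metric] = {
            other: round((value - values[metric]) * 100)
            for other, values in scores.items()
            if other != method
        }
    return result


for metric, diff in gains(scores).items():
    print(metric, diff)
```

The differences are absolute gaps between averages, so they are expressed in percentage points rather than as relative improvements.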

5. Conclusion

This paper introduced an agglomerative spectral clustering method with conductance and edge weights for detecting communities. In the first stage, the proposed method projects the original points into the eigenvector feature space. In the second stage, the eigenvector space and the number of edges between nodes are used to evaluate the similarity between nodes, and each node finds a candidate for its most similar node. The third stage evaluates the conductance between the node and its candidate, and the nodes are agglomerated only if the conductance improves. The three-stage process is iterated until the network requires no further agglomeration. The time complexity of the proposed method is higher than that of AH-KSC because the conductance of each agglomerated node is checked, but this check is necessary for more accurate detection. In the experiments, the proposed method outperformed AH-KSC and KSC on a real-life network, LiveJournal.

The two contributions of this method can be summarized as follows. The first is the improved accuracy compared to related works. The second is that the proposed method is feasible in practical situations because it performs well on real-life social networks. On the other hand, the eigenvector space is recalculated in every iteration, so the method is slower than KSC. Our future work will focus on reducing the computation time with techniques such as parallel computing.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2016R1D1A1B03932447).