Abstract

Most existing clustering algorithms for networks are unsupervised and cannot exploit even a small amount of prior knowledge to improve the clustering quality. We propose a semisupervised clustering algorithm for networks based on fast affinity propagation (SCAN-FAP), which is essentially a kind of similarity metric learning method. First, we define a new constraint similarity measure integrating the structural information and the pairwise constraints, which reflects the effective similarities between nodes in networks. Then, taking the constraint similarities as input, we propose a fast affinity propagation algorithm that keeps the advantages of the original affinity propagation algorithm while improving the time efficiency by passing messages only between certain nodes. Finally, through extensive experimental studies, we demonstrate that the proposed algorithm can take full advantage of the prior knowledge and improve the clustering quality significantly. Furthermore, our algorithm outperforms some state-of-the-art approaches.

1. Introduction

Networks are an expressive data structure widely used to model structural relationships between objects in many real-world applications. Examples include social networks, the Web, and citation networks. In these networks, individuals represented by nodes are linked by particular relationships, such as friendships between people, hyperlinks between web pages, and references among papers.

Network clustering, a key technology for network analysis, can discover hidden structures and functions in networks [1] and is attracting considerable attention from researchers in various domains. Many related approaches, including modularity optimization [2-5], spectral clustering [6-8], and similarity-based algorithms [9-11], have been developed to partition networks into clusters (or communities, groups) such that there is a dense set of edges within each cluster and few edges between clusters. However, most of the existing algorithms are unsupervised and cannot utilize any prior knowledge to improve the clustering quality. In many real applications, prior knowledge related to the data, such as label information or pairwise constraints, can readily be obtained.

Semisupervised clustering is a very useful strategy that can guide the clustering process toward high clustering quality when prior knowledge is available. There are many semisupervised clustering methods in the data mining and machine learning domains, such as constraint-based methods [12-14], distance (or similarity) metric learning methods [15-18], and methods integrating the two [19-21]. However, most of them are designed for traditional vector data and are ill-suited to network data. For example, some distance-based algorithms use the Euclidean distance, which is invalid for measuring distances between nodes in networks. Only a few graph-based semisupervised methods, such as semisupervised spectral clustering algorithms [22-24] and semisupervised kernel k-means algorithms [25, 26], can be applied to network clustering, but they do not always produce clustering results of high and stable quality.

In this paper, we propose a novel semisupervised clustering algorithm for networks based on fast affinity propagation (SCAN-FAP), which can significantly improve the clustering quality by making full use of a small amount of prior knowledge. Our approach consists of two key components: the first is an effective similarity measure that takes into account both the structural information and the pairwise constraints, while the second is a practical clustering algorithm that takes these similarities as input. Our contributions are threefold.

(i) We propose a constraint similarity measure that extends the idea of the basic SimRank [27]. This measure adopts the most commonly used must-link and cannot-link pairwise constraints as the prior knowledge and embeds them into the SimRank equation, neatly integrating the structural information and the prior knowledge.

(ii) Taking the constraint similarities as input, we propose a fast affinity propagation algorithm for network clustering. By analyzing the factor graph, the theoretical basis of affinity propagation, we modify the binary model to reduce the number of variables to be optimized. The algorithm derived from the new model generates clusters of high quality while running faster than the original affinity propagation algorithm [28].

(iii) We demonstrate, using various real networks, that the proposed algorithm improves the clustering quality both effectively and efficiently by making use of a small number of pairwise constraints.

The paper is organized as follows. Section 2 reviews related work on network clustering and semisupervised clustering. Section 3 describes the proposed constraint similarities and their calculation. Section 4 proposes the fast affinity propagation algorithm for networks. Section 5 analyzes the time complexity of our algorithm. We present experimental results to show the advantages of our algorithm in Section 6. Finally, in Section 7, we present the conclusions of our work.

2. Related Work

2.1. Network Clustering

Recently, there has been a growing body of literature on network clustering methods, such as modularity optimization, spectral clustering, and similarity-based algorithms.

Modularity [2] is a well-known objective function for measuring the strength of a division of a network into clusters. The value of modularity is the fraction of edges that fall within the given clusters minus the expected fraction if the edges were distributed at random. Networks with high modularity have dense links between nodes within clusters but sparse connections between nodes in different clusters. Therefore, many optimization methods, such as Mod-CSA [3], LPAm [4], and MIGA [5], are designed with maximizing modularity as their objective. It is worth mentioning that these methods have a resolution limit and may fail to identify some smaller clusters [29].

Spectral clustering algorithms are also popular for networks. They cut a network by transforming the initial set of nodes into a set of points in space whose coordinates are elements of eigenvectors. The set of points is then clustered via standard techniques, such as k-means. There are various kinds of spectral clustering, distinguished from each other by their "cut functions," which are then solved as quadratic optimization problems. Traditional objective functions include the famous "Ratio Cut" [7] and "Normalized Cut" [8]. Other recent "cut functions" proposed in [30, 31] are shown to be equivalent to the modularity for real networks. In addition, Dhillon et al. [32] proved that, under certain conditions, spectral clustering algorithms can be viewed as corresponding kernel k-means formulations.

A natural property of networks is the similarity between nodes. Therefore, some similarity-based algorithms have been proposed that use these similarities to partition networks. SCAN [9], a density-based algorithm, clusters networks by exploiting the transitivity of the cosine similarities between nodes. It not only finds clusters with connectivity and maximality but also detects hubs and outliers. Liu [10] made use of the famous affinity propagation algorithm to partition networks with various similarity measures. Cheng et al. [11] measured the similarities between nodes by considering both content and link information and then proposed an efficient incremental computing approach for network clustering.

However, most of the algorithms introduced above are unsupervised and cannot utilize a small amount of prior knowledge to improve the clustering quality.

2.2. Semisupervised Clustering

Semisupervised clustering is a class of learning methods that guide the clustering process using a small amount of prior knowledge. It has become increasingly popular in machine learning and data mining owing to its ability to improve the clustering quality. Generally, there are two kinds of prior knowledge: label information and pairwise constraints. Pairwise constraints are more popular than label information since they are faster or cheaper to obtain [19]. Must-link and cannot-link are two commonly used pairwise constraints: the former indicates that two individuals must be in the same cluster, while the latter indicates that they must be in different clusters.

There are two main strategies for semisupervised clustering: constraint-based optimization and distance (or similarity) metric learning. Constraint-based optimization methods, such as [12-14], modify the objective function using the obtained label information or pairwise constraints and then solve the new optimization problem. Metric learning methods typically use the prior information to adjust the original distances (or similarities) between objects and then cluster the dataset with an existing method. Relevant examples are demonstrated in [15-18]. Some other methods, such as [19-21], combine the two strategies to make use of the prior knowledge.

However, most existing semisupervised clustering methods are inapplicable to network clustering, since they are designed only for traditional vector data. A few graph-based methods in their semisupervised formulations can be used to partition networks. For example, semisupervised spectral clustering [22] modifies the adjacency matrix of the network using the pairwise constraints at the beginning of the original algorithm; the semisupervised kernel-based k-means method proposed by Kulis et al. [25] defines a penalized kernel matrix subject to pairwise constraints, which guides the cluster assignments; the method in [26] defines a new penalized kernel matrix taking the density modularity [33] as the objective function.

3. Constraint Similarity Measure

One of the most important motivations of our method is to adjust the original similarities between nodes when some prior knowledge about the network is available. For the prior knowledge, we adopt pairwise constraints, since they are more common in real applications. For the similarity adjustment, we define a novel measure integrating the pairwise constraints and the basic SimRank similarities.

3.1. Basic SimRank

SimRank is a general, global similarity measure that takes full account of the structural context of all the nodes in a network. The basic idea of SimRank is that "two objects are similar if they are related to similar objects."

Let $G = (V, E)$ represent a network, where $V$ is the set of nodes and $E$ is the set of edges. For a node $v$, $N(v)$ represents the set of neighbors of $v$, $|N(v)|$ is the number of elements in $N(v)$, and $N_i(v)$ is the $i$th element in $N(v)$. Let us denote the similarity between nodes $u$ and $v$ by $s(u, v)$. Following the earlier definitions, a recursive equation is defined for $s(u, v)$. If $u = v$, then $s(u, v)$ is set to be 1.

Otherwise,

$$s(u, v) = \frac{C}{|N(u)||N(v)|} \sum_{i=1}^{|N(u)|} \sum_{j=1}^{|N(v)|} s\bigl(N_i(u), N_j(v)\bigr), \tag{1}$$

where $C$ is a decay factor between 0 and 1. Note that either $u$ or $v$ may have no neighbors. In this situation, $s(u, v)$ is set to be 0, since there is no way to infer any similarity between them. Formally, when $N(u) = \emptyset$ or $N(v) = \emptyset$, $s(u, v) = 0$.
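To make the fixed-point computation concrete, the following minimal Python sketch iterates (1) directly over an adjacency-list representation. It is illustrative only, not the authors' implementation; the names simrank, adj, C, and iters are our own.

```python
import numpy as np

def simrank(adj, C=0.8, iters=10):
    """Basic SimRank by fixed-point iteration (illustrative sketch)."""
    n = len(adj)
    s = np.eye(n)                                   # s(u, u) = 1
    for _ in range(iters):
        s_new = np.eye(n)
        for u in range(n):
            for v in range(u + 1, n):
                if not adj[u] or not adj[v]:
                    continue                        # no neighbors -> score stays 0
                total = sum(s[a, b] for a in adj[u] for b in adj[v])
                s_new[u, v] = s_new[v, u] = C * total / (len(adj[u]) * len(adj[v]))
        s = s_new
    return s

# Example: a path graph 0 - 1 - 2; nodes 0 and 2 share the neighbor 1.
print(simrank({0: [1], 1: [0, 2], 2: [1]})[0, 2])
```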

3.2. Definition of the Constraint Similarity Measure

The basic SimRank is an unsupervised measure. However, in semisupervised network clustering, the prior knowledge should be taken fully into account by the similarity measure. Pairwise constraints are adopted to express the prior knowledge in our method. The pairwise constraints take the form of a constraint matrix $M$, whose element $M(u, v)$ represents the relationship between two nodes:

$$M(u, v) = \begin{cases} 1, & (u, v) \text{ is a must-link pair}, \\ -1, & (u, v) \text{ is a cannot-link pair}, \\ 0, & \text{otherwise}, \end{cases} \tag{2}$$

where must-link means the two nodes must be in the same cluster and cannot-link means the two nodes must be in different clusters.

It deserves particular mention that the must-link relationship satisfies three properties: reflexivity, symmetry, and transitivity. Formally,

(1) reflexivity: $M(u, u) = 1$ for any node $u$;
(2) symmetry: for any pair of nodes $u$ and $v$, if $M(u, v) = 1$, then $M(v, u) = 1$;
(3) transitivity: for any three nodes $u$, $v$, and $w$, if $M(u, v) = 1$ and $M(v, w) = 1$, then $M(u, w) = 1$.
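Because of these three properties, must-link is an equivalence relation, so a given set of must-link pairs is usually closed transitively before use. The following union-find sketch illustrates one way to do this; the function name close_must_links and the 0-based node indexing are our assumptions, not part of the paper.

```python
from collections import defaultdict

def close_must_links(n, must_pairs):
    """Transitive closure of must-link pairs via union-find (illustrative)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in must_pairs:
        parent[find(u)] = find(v)           # symmetry: merge the two groups
    groups = defaultdict(list)
    for v in range(n):
        groups[find(v)].append(v)           # reflexivity: every node is grouped
    return list(groups.values())            # one group = pairwise must-linked nodes
```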

Extending the basic SimRank and combining it with the known pairwise constraints, we define a novel constraint similarity measure composed of three components as follows:

$$cs(u, v) = \begin{cases} 1, & M(u, v) = 1, \\ 0, & M(u, v) = -1, \\ \dfrac{C}{|N(u)||N(v)|} \displaystyle\sum_{i=1}^{|N(u)|} \sum_{j=1}^{|N(v)|} cs\bigl(N_i(u), N_j(v)\bigr), & \text{otherwise}. \end{cases} \tag{3}$$

In (3), the similarity between two nodes is set to 1, the maximum value, when the must-link relationship is satisfied, and to 0, the minimum value, when the cannot-link relationship is satisfied. Note that any node has the must-link relationship with itself and that any isolated node with no neighbors has the cannot-link relationship with all the remaining nodes. The similarity between two nonconstraint nodes, as in the basic SimRank equation, is calculated from the similarities among their neighbors. The score increases or decreases as the two nonconstraint nodes are linked with more must-link or cannot-link relationships. As a result, the constraint similarity measure, by considering the prior constraint information, can help improve the clustering result effectively.

3.3. Calculation of the Constraint Similarity Measure

A solution to the SimRank equations for a network can be reached by iterating to a fixed point. Let $G$ be a given network with $n$ nodes. For each iteration $k$, we store the $n^2$ entries $R_k(u, v)$, where $R_k(u, v)$ gives the score of the similarity between $u$ and $v$ in the current iteration. We successively calculate $R_{k+1}(u, v)$ from $R_k(u, v)$. Initially, each $R_0(u, v)$ is defined to be

$$R_0(u, v) = \begin{cases} 1, & M(u, v) = 1, \\ 0, & \text{otherwise}. \end{cases} \tag{4}$$

To calculate $R_{k+1}(u, v)$ from $R_k(u, v)$, we use (5), which has the same form as the basic SimRank equation, when there is no constraint between $u$ and $v$, namely, $M(u, v) = 0$. On each iteration $k$, $R_k(u, v)$ keeps the value 1 when $M(u, v) = 1$ and 0 when $M(u, v) = -1$:

$$R_{k+1}(u, v) = \frac{C}{|N(u)||N(v)|} \sum_{i=1}^{|N(u)|} \sum_{j=1}^{|N(v)|} R_k\bigl(N_i(u), N_j(v)\bigr). \tag{5}$$

As shown above, the current similarity score between $u$ and $v$ is updated from the score of the previous iteration. We argue that this calculation method has the property of convergence.

Proposition 1. The similarity scores calculated by the method introduced above converge to the true values as the iteration count increases; namely, $\lim_{k \to \infty} R_k(u, v) = cs(u, v)$.

Proof. Clearly, $R_k(u, v)$ always keeps the value 1 when $M(u, v) = 1$ and 0 when $M(u, v) = -1$. So, we only need to prove the convergence of $R_k(u, v)$ when $M(u, v) = 0$. We show by induction on the iteration $k$ that, for any pair of nodes $u$ and $v$,

$$\bigl| cs(u, v) - R_k(u, v) \bigr| \le C^{k+1}. \tag{6}$$

When $k = 0$, $R_0(u, v) = 0$, and by (3),

$$\bigl| cs(u, v) - R_0(u, v) \bigr| = cs(u, v) \le C. \tag{7}$$

For iteration $k + 1$, $cs(u, v)$ and $R_{k+1}(u, v)$ satisfy (3) and (5), respectively; therefore,

$$\bigl| cs(u, v) - R_{k+1}(u, v) \bigr| = \frac{C}{|N(u)||N(v)|} \left| \sum_{i=1}^{|N(u)|} \sum_{j=1}^{|N(v)|} \Bigl( cs\bigl(N_i(u), N_j(v)\bigr) - R_k\bigl(N_i(u), N_j(v)\bigr) \Bigr) \right| \le C \cdot C^{k+1} = C^{k+2}. \tag{8}$$

Since $0 < C < 1$, $\lim_{k \to \infty} C^{k+1} = 0$. So, it follows from (6) that $R_k(u, v)$ converges to $cs(u, v)$.
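The iterative calculation of (4)-(5) can be sketched directly in Python. The snippet below clamps must-link and cannot-link pairs to 1 and 0 and updates all remaining pairs with the SimRank-style recurrence; the dense-matrix layout and the 1/-1/0 encoding of $M$ follow (2), while the function and variable names are our own.

```python
import numpy as np

def constraint_simrank(adj, M, C=0.8, iters=10):
    """Constraint similarities per (4)-(5) (illustrative sketch).

    adj: dict node -> list of neighbor indices.
    M:   constraint matrix, M[u][v] in {1, -1, 0} as in (2).
    """
    n = len(adj)
    R = (np.asarray(M) == 1).astype(float)    # R_0: 1 on must-link pairs
    np.fill_diagonal(R, 1.0)                  # reflexivity: M(u, u) = 1
    for _ in range(iters):
        R_next = R.copy()
        for u in range(n):
            for v in range(n):
                if u == v or M[u][v] != 0:
                    continue                  # constrained scores stay 1 or 0
                if not adj[u] or not adj[v]:
                    R_next[u, v] = 0.0        # isolated node: cannot-link to all
                    continue
                total = sum(R[a, b] for a in adj[u] for b in adj[v])
                R_next[u, v] = C * total / (len(adj[u]) * len(adj[v]))
        R = R_next
    return R
```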

4. Fast Affinity Propagation Algorithm for Network Clustering

After the similarities are calculated, the next task is to design an efficient and effective clustering method that makes use of them. Although there are a number of candidate algorithms, we adopt the famous affinity propagation algorithm (AP) [28], which can produce a stable and reliable clustering result. Furthermore, we observe that the efficiency of AP can be improved by careful design, owing to the sparsity of the similarities in networks. So, in this section, we propose a fast affinity propagation algorithm for network clustering.

4.1. Factor Graph

We start from the optimization problem on the factor graph, which is the theoretical basis of the affinity propagation algorithm. Figure 1 shows the binary model for affinity propagation demonstrated in [34]. There are two types of nodes in the binary model, representing variables and functions, respectively. Let $\{c_{ij}\}$, with $i, j \in \{1, \ldots, N\}$, be the set of binary variables in the graph, such that $c_{ij} = 1$ if the exemplar for node $i$ is node $j$. In this notation, $c_{ii} = 1$ indicates that $i$ is an exemplar itself. For each variable $c_{ij}$, there are a similarity function node $S_{ij}$ and two constraint function nodes, $I_i$ and $E_j$. The similarity function is written in (9), where the value is equal to the input similarity between nodes $i$ and $j$ if $c_{ij} = 1$. The 1-of-N constraint function $I_i$ states that each node in affinity propagation clustering must be assigned to exactly one exemplar. The consistency constraint function $E_j$ states that if a node itself is not an exemplar, it cannot be the exemplar for any other node. The goal of affinity propagation is to maximize the objective function by finding the optimal assignment of $\{c_{ij}\}$. Obviously, the number of variables to be solved is $N^2$:

$$S_{ij}(c_{ij}) = \begin{cases} s(i, j), & c_{ij} = 1, \\ 0, & c_{ij} = 0. \end{cases} \tag{9}$$
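To illustrate how the binary model scores an assignment, the following Python sketch evaluates a candidate matrix $\{c_{ij}\}$ against the two constraints and sums the selected similarities. It is a toy evaluator for intuition only; the function name and the dense-matrix layout are our assumptions.

```python
import numpy as np

def binary_model_score(c, s):
    """Score an exemplar assignment under the binary AP model (sketch).

    c: N x N 0/1 matrix with c[i, j] = 1 iff node j is the exemplar of node i.
    s: N x N similarity matrix; s[i, i] holds the preference of node i.
    """
    c = np.asarray(c)
    s = np.asarray(s)
    if not np.all(c.sum(axis=1) == 1):            # 1-of-N constraint I_i
        return float("-inf")
    for j in range(c.shape[0]):                   # consistency constraint E_j
        if c[j, j] == 0 and c[:, j].any():        # non-exemplar chosen by someone
            return float("-inf")
    return float((c * s).sum())                   # sum of selected similarities
```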

For semisupervised network clustering, the similarity between two nodes is set to 0 when they satisfy a cannot-link relationship. Conversely, there are many pairs of nodes with similarity score 0, owing both to the cannot-link relationships and to the sparse structure of real networks. Therefore, we treat all pairs of nodes with similarity score 0 as if they were constrained by cannot-link relationships; formally, $cs(i, j) = 0$ for such nodes $i$ and $j$. Since these pairs of nodes must be assigned to different clusters, if the similarity score between $i$ and $j$ is 0, then $i$ and $j$ cannot serve as each other's exemplar, so both $c_{ij}$ and $c_{ji}$ are set to 0. In this way, part of the binary variables are fixed to 0 in the optimal assignment of $\{c_{ij}\}$, which allows the original factor graph in Figure 1 to be adjusted.

Firstly, to facilitate the description, we define the similarity neighborhood. For a node $i$, $\Gamma(i)$ is defined as the similarity neighborhood of $i$ such that $\Gamma(i) = \{ j \mid cs(i, j) > 0 \}$; namely, the similarity score between $i$ and any element in $\Gamma(i)$ is greater than 0. In addition, $\Gamma_k(i)$ denotes the $k$th item in $\Gamma(i)$, and $|\Gamma(i)|$ denotes the number of items in $\Gamma(i)$. Then, using the similarity neighborhoods, we give the local factor graphs for the function nodes $I_i$ and $E_j$ in Figures 2 and 3.

Figure 2 shows the relationships between the function node $I_i$ and the related binary variable nodes. If node $j \notin \Gamma(i)$, which means $cs(i, j) = 0$, the corresponding $c_{ij}$ is set to 0, so $i$ cannot choose $j$ as its exemplar. Otherwise, if $j \in \Gamma(i)$, $c_{ij}$ is subject to the 1-of-N constraint in (10), which means that $i$ can only choose its exemplar from its similarity neighborhood:

$$I_i\bigl( c_{i\Gamma_1(i)}, \ldots, c_{i\Gamma_{|\Gamma(i)|}(i)} \bigr) = \begin{cases} -\infty, & \sum_{j \in \Gamma(i)} c_{ij} \ne 1, \\ 0, & \text{otherwise}. \end{cases} \tag{10}$$

Figure 3 shows the relationships between the function node $E_j$ and the related binary variable nodes. If node $i \notin \Gamma(j)$, which means $cs(i, j) = 0$, the corresponding $c_{ij}$ is set to 0. Otherwise, if $i \in \Gamma(j)$, $c_{ij}$ is subject to the consistency constraint in (11), which means that none of the nodes in $\Gamma(j)$ can choose $j$ as their exemplar if $j$ is not an exemplar itself ($c_{jj} = 0$):

$$E_j\bigl( c_{\Gamma_1(j)j}, \ldots, c_{\Gamma_{|\Gamma(j)|}(j)j} \bigr) = \begin{cases} -\infty, & c_{jj} = 0 \text{ and } \exists\, i \ne j : c_{ij} = 1, \\ 0, & \text{otherwise}. \end{cases} \tag{11}$$

The whole factor graph integrates all the function nodes $I_i$ and $E_j$ like the ones represented in Figures 2 and 3. Each variable node $c_{ij}$ with $cs(i, j) = 0$ becomes an isolated node without any constraint, and its value is set to 0. As a result, the objective function of this new factor graph is defined to be

$$S(c) = \sum_{i} \sum_{j \in \Gamma(i)} S_{ij}(c_{ij}) + \sum_{i} I_i(\cdot) + \sum_{j} E_j(\cdot). \tag{12}$$

The problem of maximizing (12) can be solved by the max-sum algorithm [35], which passes messages between function nodes and variable nodes in the factor graph. As depicted in Figure 4, there are five message types passed between nonisolated variable nodes and function nodes. When executing the max-sum algorithm on the new factor graph, the corresponding message update rules are defined as (13) to (17), for each $i$ and each $j \in \Gamma(i)$:

$$m_{S_{ij} \to c_{ij}} = s(i, j), \tag{13}$$

$$\beta_{ij} = s(i, j) + \alpha_{ij}, \tag{14}$$

$$\eta_{ij} = -\max_{k \in \Gamma(i),\, k \ne j} \beta_{ik}, \tag{15}$$

$$\rho_{ij} = s(i, j) + \eta_{ij} = s(i, j) - \max_{k \in \Gamma(i),\, k \ne j} \bigl( s(i, k) + \alpha_{ik} \bigr), \tag{16}$$

$$\alpha_{ij} = \begin{cases} \displaystyle\sum_{k \in \Gamma(j),\, k \ne j} \max(0, \rho_{kj}), & i = j, \\ \min\Bigl[ 0,\ \rho_{jj} + \displaystyle\sum_{k \in \Gamma(j),\, k \notin \{i, j\}} \max(0, \rho_{kj}) \Bigr], & i \ne j. \end{cases} \tag{17}$$

We eliminate $\beta$ and $\eta$, keeping only $\rho$ and $\alpha$. As a result, $\rho$ keeps the form of (16), depending only on $\alpha$, while $\alpha$ is calculated by (17), depending only on $\rho$.

Then, the optimal solution of the objective function can be obtained by updating $\rho$ and $\alpha$ alternately until convergence. We can see that the number of messages passed on the factor graph is reduced, owing to the decrease in the number of variable nodes to be assigned. Therefore, the time efficiency of the optimization is improved.

4.2. Clustering Algorithm

Following the analysis of the factor graph above, we extend the original affinity propagation, an exemplar-based algorithm, into a fast version for network clustering. There are some basic parameters.

Responsibility $r(i, j)$, equivalent to $\rho_{ij}$ in the factor graph, represents the evidence for how well-suited node $j$ is to serve as the exemplar for node $i$.

Availability $a(i, j)$, equivalent to $\alpha_{ij}$ in the factor graph, represents the evidence for how appropriate it would be for node $i$ to choose node $j$ as its exemplar.

Preference $p$, the initialized value $s(i, i)$ for every node $i$, represents the tendency of node $i$ to be chosen as an exemplar. At the beginning of the algorithm, all nodes share the same preference, which controls the number of final clusters; a larger preference results in a larger number of clusters.

Damping factor $\lambda$ is used to smooth the responsibilities and availabilities so as to avoid numerical oscillations. In our algorithm, $\lambda$ is set to 0.8. The smoothed responsibility and availability are calculated by (18) and (19):

$$r_{\text{new}}(i, j) = \lambda \cdot r_{\text{old}}(i, j) + (1 - \lambda) \cdot \rho_{ij}, \tag{18}$$

$$a_{\text{new}}(i, j) = \lambda \cdot a_{\text{old}}(i, j) + (1 - \lambda) \cdot \alpha_{ij}. \tag{19}$$

According to the parameters defined above, for a given network $G$ and constraint matrix $M$, the basic process of our algorithm is as follows.

Step 1. Combining with the constraint matrix $M$, calculate the constraint similarities for all pairs of nodes in $G$.

Step 2. Construct the similarity neighborhoods for each node.

Step 3. Initialize the preference and the maximum iteration number; for each node $i$ and each node $j$ in the similarity neighborhood of $i$, set $r(i, j) = 0$ and $a(i, j) = 0$.

Step 4. For each node $i$ and each node $j$ in the similarity neighborhood of $i$, update $r(i, j)$ by (16) and (18); update $a(i, j)$ by (17) and (19).

Step 5. For each node $i$, calculate its exemplar $e_i$ by (20):

$$e_i = \arg\max_{j \in \Gamma(i)} \bigl[ r(i, j) + a(i, j) \bigr]. \tag{20}$$

Step 6. When the exemplars of all nodes no longer change, or the iteration reaches the maximum value, the algorithm terminates, and the nodes with the same exemplar are partitioned into the same cluster. Otherwise, go to Step 4. A minimal sketch of Steps 3-6 is given below.
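The following Python sketch puts Steps 3-6 together, passing damped messages only within the similarity neighborhoods. It is an illustration under our own assumptions, not the paper's C++ implementation: the sparse dictionary layout keyed by node pairs and all identifiers (fast_ap, s_pairs, lam, and so on) are ours.

```python
def fast_ap(s_pairs, n, preference, lam=0.8, max_iter=200):
    """Fast AP over similarity neighborhoods (illustrative sketch)."""
    s = dict(s_pairs)                     # {(i, j): cs(i, j)} for cs > 0 only
    for i in range(n):
        s[(i, i)] = preference            # Step 3: shared preference s(i, i)
    nbrs = {i: [] for i in range(n)}      # Gamma(i), including i itself
    rnbrs = {j: [] for j in range(n)}     # nodes that may choose j, minus j
    for (i, j) in s:
        nbrs[i].append(j)
        if i != j:
            rnbrs[j].append(i)
    rho = {k: 0.0 for k in s}             # responsibilities r(i, j)
    alpha = {k: 0.0 for k in s}           # availabilities a(i, j)
    exemplar = [-1] * n
    for _ in range(max_iter):
        for (i, j) in s:                  # Step 4: update r by (16) and (18)
            m = max((s[(i, k)] + alpha[(i, k)] for k in nbrs[i] if k != j),
                    default=-1e18)        # sentinel for one-element neighborhoods
            rho[(i, j)] = lam * rho[(i, j)] + (1 - lam) * (s[(i, j)] - m)
        for (i, j) in s:                  # Step 4: update a by (17) and (19)
            pos = sum(max(0.0, rho[(k, j)]) for k in rnbrs[j] if k != i)
            a = pos if i == j else min(0.0, rho[(j, j)] + pos)
            alpha[(i, j)] = lam * alpha[(i, j)] + (1 - lam) * a
        new_ex = [max(nbrs[i], key=lambda j: rho[(i, j)] + alpha[(i, j)])
                  for i in range(n)]      # Step 5: exemplars by (20)
        if new_ex == exemplar:            # Step 6: exemplars stable -> stop
            break
        exemplar = new_ex
    return exemplar                       # equal exemplars -> same cluster
```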

5. Complexity Analysis

In this section, we present an analysis of the computational complexity of the proposed algorithm. Given a network with $n$ nodes, the algorithm is composed of two parts: the constraint similarity calculation and the fast affinity propagation clustering. We analyze the computational complexity of the two parts in turn.

For the constraint similarity calculation, if there is no constraint condition, the running cost is $O(K n^2 \bar{d}^2)$, where $K$ is the number of iterations of the calculation and $\bar{d}$ is the average degree of the nodes in the network ($\bar{d} \ll n$ in many real networks). If some constraint information is available, the similarities between pairs of nodes satisfying constraint conditions need not be calculated, because their values stay fixed at 1 and 0 for must-link and cannot-link pairs, respectively. Assuming the number of pairwise constraints is $m$, the total cost of the constraint similarity calculation is $O(K (n^2 - m) \bar{d}^2)$.

For the fast affinity propagation algorithm, the worst case is that the similarities between all pairs of nodes are greater than 0. In this situation, the time cost is $O(T n^2)$, where $T$ is the number of iterations of the algorithm. However, the edges are sparse in many real networks, so the similarity scores between many pairs of nodes are 0. The fast affinity propagation algorithm does not pass messages between pairs of nodes with similarity score 0. As a result, let $p$ denote the number of pairs of nodes with similarity scores greater than 0; then the total cost of the fast affinity propagation algorithm is $O(T p)$, with $p \ll n^2$ in sparse networks.

6. Evaluation

In this section, we evaluate the performance of the proposed semisupervised network clustering algorithm based on fast affinity propagation (SCAN-FAP). We conduct all the experiments on a Pentium Core2 Duo 2.8 GHz PC with 2 GBytes of main memory, running on Windows 7. We implement our algorithm in C++, using Microsoft Visual Studio 2008.

6.1. Dataset

We use six real network datasets to evaluate the performance of SCAN-FAP. Table 1 lists the information about these networks.

6.2. Effectiveness
6.2.1. Evaluation Criteria

Since the underlying class labels of all the datasets are already known, we adopt the Normalized Mutual Information (NMI) [25, 32] and the F-Measure Score [17] in our experiments to evaluate the quality of the clusters generated by various methods.

NMI is currently widely used for measuring the performance of network clustering algorithms. Formally, NMI can be defined as follows:

$$\mathrm{NMI} = \frac{-2 \displaystyle\sum_{i=1}^{c_A} \sum_{j=1}^{c_B} N_{ij} \log \dfrac{N_{ij}\, n}{N_{i\cdot}\, N_{\cdot j}}}{\displaystyle\sum_{i=1}^{c_A} N_{i\cdot} \log \dfrac{N_{i\cdot}}{n} + \sum_{j=1}^{c_B} N_{\cdot j} \log \dfrac{N_{\cdot j}}{n}}, \tag{21}$$

where $N$ is the confusion matrix, $N_{ij}$ is the number of nodes in both the $i$th class and the $j$th cluster, $c_A$ and $c_B$ are the numbers of classes and clusters, respectively, $N_{i\cdot}$ and $N_{\cdot j}$ are the numbers of nodes in the $i$th class and the $j$th cluster, respectively, and $n$ is the total number of nodes.
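As a sanity check on (21), the following Python sketch computes NMI from two label vectors; all names are illustrative.

```python
import numpy as np

def nmi(labels_true, labels_pred):
    """NMI per (21), computed from the confusion matrix (sketch)."""
    a, b = np.asarray(labels_true), np.asarray(labels_pred)
    n = len(a)
    classes, clusters = np.unique(a), np.unique(b)
    N = np.array([[np.sum((a == ci) & (b == cj)) for cj in clusters]
                  for ci in classes], dtype=float)   # confusion matrix N_ij
    Ni, Nj = N.sum(axis=1), N.sum(axis=0)            # row and column sums
    num = sum(N[i, j] * np.log(N[i, j] * n / (Ni[i] * Nj[j]))
              for i in range(len(classes)) for j in range(len(clusters))
              if N[i, j] > 0)
    den = (Ni * np.log(Ni / n)).sum() + (Nj * np.log(Nj / n)).sum()
    return -2.0 * num / den if den != 0 else 0.0     # den = 0: single class/cluster
```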

F-Measure Score is another commonly used measure for evaluating clustering algorithms. Assume $P_l$ is the set of node pairs $(u, v)$ such that $u$ and $v$ belong to the same class in the ground truth, and $P_c$ is the set of node pairs that belong to the same cluster generated by an algorithm. Then the F-Measure Score is computed from both the precision and the recall:

$$F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \tag{22}$$

where the precision and recall are written as follows (23):

$$\mathrm{precision} = \frac{|P_l \cap P_c|}{|P_c|}, \qquad \mathrm{recall} = \frac{|P_l \cap P_c|}{|P_l|}. \tag{23}$$
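A direct pairwise implementation of (22)-(23) follows; it enumerates all node pairs, so it is meant for small examples, and the names are our own.

```python
def pairwise_f_measure(labels_true, labels_pred):
    """Pairwise F-Measure per (22)-(23) (illustrative sketch)."""
    n = len(labels_true)
    truth = {(i, j) for i in range(n) for j in range(i + 1, n)
             if labels_true[i] == labels_true[j]}    # P_l: same-class pairs
    pred = {(i, j) for i in range(n) for j in range(i + 1, n)
            if labels_pred[i] == labels_pred[j]}     # P_c: same-cluster pairs
    inter = len(truth & pred)
    if not truth or not pred or inter == 0:
        return 0.0
    precision = inter / len(pred)
    recall = inter / len(truth)
    return 2 * precision * recall / (precision + recall)
```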

6.2.2. Experimental Methods

We compare SCAN-FAP with four other representative semisupervised clustering algorithms, which are briefly described in the following.

SS-SP, a semisupervised spectral clustering algorithm [22], uses the known constraint relationships to modify the adjacency matrix $A$ of the original network. If nodes $i$ and $j$ satisfy the must-link constraint, $A_{ij}$ is set to 1; if $i$ and $j$ satisfy the cannot-link constraint, $A_{ij}$ is set to 0. SS-SP then executes the conventional spectral clustering algorithm on the modified adjacency matrix.

SS-NCut-KK [25] is a semisupervised kernel k-means algorithm based on Normalized Cut. In this algorithm, the original kernel matrix and a penalty term encoding the constraint relationships are combined to construct a new kernel matrix, such that wrong cluster assignments are penalized in the objective function. The kernel matrix is defined as $K = \sigma I - L + W$, where $\sigma$ is a constant large enough to ensure that $K$ is positive definite, $I$ is the identity matrix, $L$ is the corresponding Laplacian matrix, and $W$ is the matrix of the penalty term. SS-NCut-KK executes the kernel k-means algorithm using the matrix $K$.
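A small sketch of this kernel construction, under our reading that $K = \sigma I - L + W$ with the unnormalized Laplacian (the exact variant in [25] may differ), is:

```python
import numpy as np

def penalized_kernel(A, W, sigma=None):
    """Penalized kernel as we read the description: K = sigma*I - L + W."""
    A = np.asarray(A, dtype=float)
    W = np.asarray(W, dtype=float)
    L = np.diag(A.sum(axis=1)) - A                # unnormalized graph Laplacian
    base = W - L
    if sigma is None:
        # shift past the most negative eigenvalue so K is positive definite
        sigma = max(0.0, -np.linalg.eigvalsh(base).min()) + 1e-6
    return sigma * np.eye(A.shape[0]) + base
```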

SS-DM-KK [26] is another semisupervised kernel k-means algorithm, which is similar to SS-NCut-KK. It takes the density modularity [33] integrating the constraint information as its objective function, which is the only difference from SS-NCut-KK.

SCAN-AP is the semisupervised network clustering algorithm most closely related to our work: the original affinity propagation algorithm is applied, taking as input the constraint similarities between all the nodes in the network.

We select the pairwise constraints randomly, varying the proportion of pairwise constraints to all pairs in the network. Owing to the randomness of the pairwise constraints, each reported clustering result is the average over 20 runs of the corresponding algorithm. The number of clusters is set to the number of real class labels in the networks. Since the number of clusters generated by SCAN-AP and SCAN-FAP depends on the value of the preference parameter $p$, we choose different values of $p$ to adjust the number of clusters to the ground truth.

6.2.3. Experimental Results

Figures 5, 6, 7, 8, 9, and 10 show the curves of the clustering results on the six real datasets generated by the different algorithms, where the horizontal coordinates represent the proportion of pairwise constraints and the vertical coordinates represent the NMI and F-Measure scores, respectively. We make the following observations.

Firstly, we compare all the algorithms in their unsupervised condition, which corresponds to the points on the curves where the horizontal coordinate is 0. We can observe that SCAN-FAP and SCAN-AP achieve higher-quality results than the other three algorithms on all the networks except Adjnoun. This indicates that SimRank is a good measure of the similarities between nodes in networks. Note that the NMI of the results on Adjnoun obtained by all algorithms is close to 0, which implies that unsupervised structural information alone is insufficient to recover the clusters.

Then, we compare each semisupervised algorithm with its unsupervised version. When pairwise constraints are provided, we can see from both the NMI and F-Measure scores that the clustering results of SCAN-FAP and SCAN-AP are better than those obtained without any pairwise constraints. So, it can be concluded that the proposed constraint similarity measure takes full advantage of the prior knowledge to guide the clustering process and thereby improves the quality of the clustering results. Among the other algorithms, SS-NCut-KK and SS-DM-KK improve the clustering quality in most cases, but SS-SP fails to utilize the pairwise constraints on Adjnoun and PolBlogs.

We also compare the performance of SCAN-FAP with the other semisupervised clustering algorithms except SCAN-AP. The NMI and F-Measure curves of SCAN-FAP show an ascending trend as the pairwise constraints increase. However, SS-SP, SS-NCut-KK, and SS-DM-KK occasionally show fluctuating or nonascending behavior, which reduces their reliability. For example, SS-SP on PolBlogs appears invalid at any percentage of constraints, while SS-NCut-KK and SS-DM-KK on Citeseer show decreased NMI with 25% and 40% constraints provided, respectively. One reason for the instability of these three algorithms is their random initializations: for SS-SP, the cluster centers must be initialized when the k-means algorithm runs on the data composed of the eigenvectors, and for SS-NCut-KK and SS-DM-KK, the cluster assignment of each node must be initialized.

Finally, we compare the performance of SCAN-FAP with its most similar algorithm, SCAN-AP. Notably, the clustering quality of SCAN-FAP is always very close to that of SCAN-AP, which confirms the effectiveness of passing messages only between pairs of nodes with similarity scores greater than 0. This demonstrates that SCAN-FAP can take the place of SCAN-AP while preserving the clustering quality. At the same time, the time efficiency is much improved, as demonstrated next.

6.3. Efficiency

From the effectiveness evaluations, we can see that SCAN-FAP and SCAN-AP are able to produce clusters of high quality. They need to be executed only once in real applications owing to their stability. However, the other three methods have to be run many times and their results averaged, since they are influenced by factors such as the initializations. Therefore, we only compare the time efficiency of SCAN-FAP and SCAN-AP. In addition, because the two algorithms share the same module for calculating the constraint similarities, we only take the running time of the clustering process for comparison. We adopt the three relatively large networks as datasets: PolBlogs, Cora, and Citeseer.

Figure 11 shows the results of the efficiency comparisons, where (a), (b), and (c) give the running times of the two algorithms on the three datasets, respectively, and (d) illustrates the proportion of node pairs with 0 similarity scores among all pairs in each network. We can observe that the running time of SCAN-AP stays essentially constant, since messages are passed between all pairs of nodes. SCAN-FAP clearly improves the time efficiency, and the degree of improvement grows with the proportion of node pairs with 0 similarity scores. This is because each node only chooses its exemplar from its similarity neighborhood when the fast affinity propagation algorithm is used in SCAN-FAP.

7. Conclusions

In this paper, we propose a novel semisupervised clustering algorithm for networks based on fast affinity propagation, which can utilize prior knowledge to guide the unsupervised clustering process. The algorithm first defines a constraint similarity measure, which uses the pairwise constraint information to extend the basic SimRank measure. The similarity scores calculated with the new measure are influenced by the must-link and cannot-link relationships, which is how the semisupervised property is achieved. Then, taking the constraint similarities as input, a fast affinity propagation algorithm is proposed to partition the networks. This algorithm maintains the advantages of the original affinity propagation algorithm. Moreover, it improves the running efficiency by passing messages only between nodes with similarity scores greater than 0. Experimental studies on various kinds of real networks demonstrate that our algorithm makes better use of a small amount of prior knowledge than the other representative methods and improves the clustering quality effectively and efficiently.