Abstract
The performance of graph-based clustering methods depends heavily on the quality of the data affinity graph, since a good affinity graph can closely approximate the pairwise similarity between data samples. To a large extent, existing graph-based clustering methods construct the affinity graph with a fixed distance metric, which is often not an accurate representation of the underlying data structure. Moreover, they require post-processing on the affinity graph to obtain clustering results, so the results are sensitive to the particular graph construction method. To address these two drawbacks, we propose a component graph clustering (GC) approach that learns an intrinsic affinity graph and obtains clustering results simultaneously. Specifically, GC learns the data affinity graph by assigning adaptive and optimal neighbors to each data point based on local distances. Efficient iterative updating algorithms are derived for GC, along with proofs of convergence. Experiments on several benchmark datasets demonstrate the effectiveness of GC.
1. Introduction
Clustering is one of the most fundamental topics in computer vision and pattern recognition. The objective of clustering is to discover the data structure and partition a group of data points into several clusters, such that the similarity between data points within the same cluster is greater than the similarity between data points from different clusters [1–8].
The structure of data is usually characterized by the affinity matrix of a graph whose edges denote the similarities between data points. If the vertices belonging to each cluster are connected into a single component, i.e., no edge connects different clusters, then the cut between clusters is assigned a value of zero in graph theory. Our purpose is to learn a graph with exactly the desired number of connected components, so that the vertices in each connected component of the graph are partitioned into one cluster.
A cluster indicator matrix can be constructed from data labels: its entry in row i and column j is one if the i-th data point is assigned to the j-th cluster, and zero otherwise. Since the affinity matrix is strictly block diagonal in the ideal case, many clustering methods are designed to obtain a block diagonal affinity matrix or use one as an important prior [9–13]. Actually, we find that a good graph structure yields a good cluster indicator matrix even if the affinity matrix is not strictly block diagonal. For example, in Figure 1, we can obtain an ideal cluster indicator matrix from the graph itself because each connected component contains exactly one cluster, yet the affinity matrix (see Figure 1) constructed from the graph is not strictly block diagonal, i.e., some on-block-diagonal elements are zeros. If the graph has the desired number of connected components, we can directly obtain a good cluster indicator matrix from the graph itself, even when the connected edges within each component are relatively sparse. This means that the graph structure, rather than a strictly block diagonal affinity matrix, is the intrinsic property that yields a good clustering result.
We propose a novel graph-based clustering approach that exploits connected components for clustering, called component graph clustering (GC). Figure 1 shows a schematic of component graph learning. GC aims to learn a graph whose connected edges are tuned adaptively until the graph has exactly the desired number of connected components [14]. To evaluate the effectiveness of GC, we have conducted experiments on six benchmark datasets in comparison to state-of-the-art approaches. The experimental results demonstrate that GC consistently performs better than the other approaches. GC makes the following contributions:
(1) GC learns a graph with exactly the desired number of connected components. Since the vertices in each connected component belong to one cluster, labels are obtained directly from the learned graph itself.
(2) The clustering indicators are obtained from the learned graph itself without performing the post-processing k-means clustering algorithm.
(3) GC can be used as an alternative to spectral clustering (SC). Similar to SC, GC only needs an initial affinity graph as input, without involving the raw data.
2. Related Work
SC, which exploits the eigenstructure of a data affinity graph to partition data into different groups, has become one of the most fundamental clustering approaches. Standard SC uses the radial basis function to construct the affinity matrix [15], and its performance relies heavily on the eigenstructure of the affinity matrix [15–20].
In SC, the similarity between each pair of data points is first computed by the radial basis function to construct the affinity matrix; then the eigenvectors of the normalized Laplacian matrix corresponding to the smallest eigenvalues (as many as the number of clusters) are regarded as the low-dimensional embedding of the raw data, and k-means clustering is finally performed on this embedding to obtain the labels.
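The pipeline above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the bandwidth `sigma` of the radial basis function is an assumed free parameter, and the final k-means step is re-implemented here as a small deterministic Lloyd iteration with farthest-point initialization so the sketch stays self-contained.

```python
import numpy as np

def _lloyd(F, k, n_iter=100):
    # Minimal k-means stand-in: deterministic farthest-point init + Lloyd updates.
    C = [F[0]]
    for _ in range(k - 1):
        dist = np.min([((F - c) ** 2).sum(1) for c in C], axis=0)
        C.append(F[dist.argmax()])
    C = np.array(C)
    for _ in range(n_iter):
        labels = ((F[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        newC = np.array([F[labels == j].mean(0) if np.any(labels == j) else C[j]
                         for j in range(k)])
        if np.allclose(newC, C):
            break
        C = newC
    return labels

def spectral_clustering(X, n_clusters, sigma=1.0):
    """Standard SC: RBF affinity -> normalized Laplacian -> embedding -> k-means."""
    n = len(X)
    # Affinity matrix via the radial basis function on pairwise squared distances.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    d = np.maximum(W.sum(axis=1), 1e-12)
    Dm = np.diag(d ** -0.5)
    L = np.eye(n) - Dm @ W @ Dm
    # Eigenvectors of the smallest eigenvalues form the low-dimensional embedding.
    _, vecs = np.linalg.eigh(L)
    F = vecs[:, :n_clusters]
    # Post-processing step that GC avoids: k-means on the embedding rows.
    return _lloyd(F, n_clusters)
```

On two well-separated blobs this recovers the ground-truth partition; the point of GC, developed below, is to remove the final k-means step entirely.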
However, due to the ambiguity and uncertainty inherent in the data structure, the intrinsic affinity matrix cannot be determined by a single unified function. Since most existing SC methods use a predefined affinity matrix of the graph, the cut value is usually minimized but not zeroed, e.g., ratio cut [21], normalized cut [16], and min-max cut [22], which necessitates a post-processing k-means clustering algorithm to obtain the clustering labels [15, 17, 18].
Many approaches have been proposed to improve the performance of SC. Generally, they can be categorized into three paradigms:
(1) How to improve data clustering using a predefined affinity matrix [12, 15, 23]
(2) How to construct a better affinity matrix to obtain a better result than standard SC [17, 24, 25]
(3) How to learn the affinity matrix and the clustering structure simultaneously [26–29]
This paper is related to the third paradigm. Within it, the objective functions of [26, 28, 29] usually involve the raw data, whereas GC does not. All of these methods employ a rank constraint on the Laplacian matrix [26–29]; clustering with adaptive neighbors (CAN) [26] and constrained Laplacian rank (CLR) [27] are the two methods most closely related to GC. CAN learns the graph from the raw data, whereas CLR learns the graph by minimizing the difference between the input initial graph and the learned graph.
In GC, the raw data are not involved in the objective function; GC directly modifies standard SC to tune the graph structure so that an intrinsic graph with exactly the desired number of connected components is obtained.
3. Component Graph Clustering
Suppose that the data matrix is denoted by X = [X_1, X_2, …, X_c] ∈ R^(d×n), where X_i denotes the data matrix belonging to the i-th cluster, c denotes the cluster number, x_j denotes the j-th data point, n is the number of data points, and d denotes the dimension.
3.1. Preprocessing
We propose a simple new neighborhood-preserving method to refine the raw data in this section. Assume that a data point x_i and its nearest data point x_j are both generated from the same underlying point, since both may be disturbed by noise. Then, we replace x_i by the weighted linear combination of x_i and x_j:

x_i ← (1 − w_{ij}) x_i + w_{ij} x_j,   (1)

where the weight w_{ij} is defined by equation (2).
Since a data point is generally closer to its nearest neighbor in the same cluster than to any point in a different cluster, it is straightforward to check from equation (2) that w_{ij} tends to 1 if x_i and x_j belong to the same cluster and tends to 0 if they belong to different clusters. This implies that two data points in the same cluster move closer together during the iteration, while two data points in different clusters have very little or no influence on each other after the preprocessing. By iterating equation (1), data points in the same cluster become closer and closer, so the clustering becomes easier to carry out.
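A sketch of this preprocessing follows. Since the closed form of the weight in equation (2) is not reproduced above, the sketch assumes a Gaussian of the nearest-neighbor distance scaled by the median nearest-neighbor distance; the only property it relies on is the one stated in the text, namely that the weight decays with distance.

```python
import numpy as np

def preprocess(X, n_iters=10, sigma=None):
    # Each point is pulled toward its nearest neighbor (equation (1)); the
    # weight function here is an assumption standing in for equation (2).
    X = X.astype(float).copy()
    for _ in range(n_iters):
        # Pairwise squared distances; mask the diagonal to find neighbors.
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        np.fill_diagonal(d2, np.inf)
        nn = d2.argmin(axis=1)               # index of each point's nearest neighbor
        nn_d2 = d2[np.arange(len(X)), nn]    # squared distance to that neighbor
        s = sigma if sigma is not None else np.median(nn_d2)
        w = np.exp(-nn_d2 / max(s, 1e-12))   # weight near 1 for close pairs
        # Equation (1): convex combination of a point and its nearest neighbor.
        X = (1 - w)[:, None] * X + w[:, None] * X[nn]
    return X
```

On data with two tight, well-separated groups, iterating this contracts each group while leaving the between-group separation essentially intact, which is the behavior the paragraph above describes.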
3.2. Component Graph
An undirected finite graph is denoted by G = (V, E), and its vertex set is denoted by V = {v_1, …, v_n}. We consider the graph to be weighted and assign a nonnegative real weight w_{ij} to each pair of vertices v_i and v_j. The degree of a vertex v_i is defined by d_i = Σ_j w_{ij}. The Laplacian matrix of the graph [30] is defined by

L = D − W,   (3)

where D is the diagonal matrix whose i-th diagonal element is d_i.
For any vector f ∈ R^n, we have f^T L f = (1/2) Σ_{i,j} w_{ij} (f_i − f_j)² ≥ 0, so L is a positive semidefinite matrix. Since L·1 = 0, the smallest eigenvalue of L is equal to zero, and the corresponding eigenvector is the all-ones vector 1.
Theorem 1. The number of connected components of the graph G is equal to the multiplicity of 0 as an eigenvalue of the Laplacian matrix L.
Proof. Suppose there are k connected components in graph G; the i-th subgraph is denoted by G_i, and the corresponding vertex set of G_i is denoted by V_i. Then V = V_1 ∪ ⋯ ∪ V_k and V_i ∩ V_j = ∅ for i ≠ j. For an eigenvector f of L with eigenvalue zero, we have f^T L f = (1/2) Σ_{i,j} w_{ij} (f_i − f_j)² = 0. Within the i-th connected component G_i, each connected edge is weighted by a positive value w_{uv} > 0, so the corresponding term w_{uv} (f_u − f_v)² must be zero, i.e., f_u and f_v have to be equal. Since any two vertices in G_i are joined by a path of positively weighted edges, f is constant on each V_i; hence f is a linear combination of the indicator vectors of the components. These indicator vectors of different components are linearly independent, and each is an eigenvector of L with eigenvalue zero, so the multiplicity of 0 as an eigenvalue equals the number of components of the graph, and vice versa.
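Theorem 1 is easy to check numerically. The following sketch builds a small graph with two components ({0, 1, 2} as a path and {3, 4} as an edge; the vertex layout is ours, not the paper's Figure 1) and counts near-zero eigenvalues of L = D − W.

```python
import numpy as np

def laplacian(W):
    # L = D - W, with D the diagonal degree matrix.
    return np.diag(W.sum(axis=1)) - W

# Affinity with two connected components: {0, 1, 2} and {3, 4}.
W = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (3, 4)]:
    W[i, j] = W[j, i] = 1.0

eigvals = np.linalg.eigvalsh(laplacian(W))
n_zero = int(np.sum(eigvals < 1e-10))
print(n_zero)  # -> 2: multiplicity of eigenvalue 0 equals the component count
```

Adding a bridge edge between the two components merges them into one, and the multiplicity of the zero eigenvalue drops to one accordingly.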
A similar proof of Theorem 1 can be found in previous works [13, 18, 31]. We explore Theorem 1 in the context of a specific example. In Figure 1, we show an intrinsic graph with two connected components; an intrinsic affinity matrix of this graph, consistent with Theorem 1, is given in equation (4), where we constrain the degrees so that the sum of each column of the affinity matrix is equal to one.
The Laplacian matrix of equation (4) has the eigenvalue zero with multiplicity two, and the two corresponding eigenvectors are the indicator vectors of the two components, concatenated in equation (5). If the graph in Figure 1 corresponds to the affinity matrix in equation (4), i.e., the Laplacian matrix of equation (4) satisfies Theorem 1, then the clustering labels are easy to obtain without further performing graph-cut or k-means clustering algorithms. Since the vertices in each connected component of the graph are partitioned into one cluster, the clustering labels can be obtained easily from the graph with a strongly connected-component algorithm [32].
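Reading labels off such a graph is a one-liner with a standard component search; the sketch below uses SciPy's `connected_components` (the paper cites Tarjan's algorithm [32]; any connected-component routine serves the same purpose). The toy affinity here is our own stand-in for equation (4).

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def labels_from_graph(S):
    # Symmetrize in case the learned affinity is not symmetric, then label
    # each vertex by the connected component it falls in.
    A = csr_matrix((S + S.T) / 2)
    n_comp, labels = connected_components(A, directed=False)
    return n_comp, labels

# Toy affinity with two components, {0, 1, 2} and {3, 4}.
S = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (3, 4)]:
    S[i, j] = S[j, i] = 0.5
n_comp, labels = labels_from_graph(S)
print(n_comp)  # -> 2
```

Each component index doubles as a cluster label, which is exactly how GC avoids the k-means post-processing step.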
Since L is a positive semidefinite matrix, it has n nonnegative eigenvalues 0 = σ_1 ≤ σ_2 ≤ ⋯ ≤ σ_n. Theorem 1 indicates that if σ_1 = ⋯ = σ_c = 0, then the graph has c connected components and the vertices are already partitioned into c clusters [18, 30, 31, 33]. Then, according to Fan's theorem [14, 34], we have

Σ_{i=1}^{c} σ_i(L_S) = min_{F ∈ R^(n×c), F^T F = I} Tr(F^T L_S F),   (6)

where Tr(·) denotes the trace operator, I is an identity matrix, L_S = D_S − (S + S^T)/2 is the Laplacian matrix of the affinity matrix S, and D_S is a diagonal matrix whose diagonal elements are the column sums of (S + S^T)/2.
The proof of Fan’s theorem can be seen in [13, 35, 36].
It is straightforward to check that the objective function value of equation (6) does not in general tend to zero, because the structure of the graph varies with the graph construction method. In the following section, we propose a method to tune the graph structure adaptively so that the objective function value of equation (6) tends to zero.
3.3. GC Algorithm
In this section, we explore how to learn a graph with exactly the desired number of connected components [14].
The right-hand side of equation (6) can be solved with respect to F via the eigenvectors of the Laplacian matrix, but it has a trivial solution with respect to the affinity matrix S: in each column of S, only one entry is assigned a value of one and the others are zeroed. We therefore add an ℓ2-norm regularization to smooth the weights in S and optimize S and F simultaneously:

min_{S, F} Tr(F^T L_S F) + γ Σ_j ||s_j||²,  s.t. ∀j, s_j^T 1 = 1, s_j ≥ 0, F ∈ R^(n×c), F^T F = I,   (7)

where γ is the tradeoff parameter and s_j denotes the j-th column of S.
If the affinity matrix is consistent with Theorem 1, the procedure of obtaining the clustering labels is called GC. In other words, we guarantee that the objective function of standard SC is zeroed in GC by tuning the graph structure. We divide problem (7) into two subproblems and solve them alternately.
The first subproblem fixes S and updates F. Then, equation (7) becomes

min_{F ∈ R^(n×c), F^T F = I} Tr(F^T L_S F).   (8)

Equation (8) is solved by taking the eigenvectors of L_S corresponding to its smallest eigenvalues.
The second subproblem fixes F and updates S. Then, equation (7) becomes

min_{S} Σ_{i,j} ||f_i − f_j||² s_{ij} + γ Σ_j ||s_j||²,  s.t. ∀j, s_j^T 1 = 1, s_j ≥ 0,   (9)

where s_j denotes the j-th column of S and f_i denotes the i-th row of F.
Each column of S is independent in equation (9), so solving it is equivalent to optimizing the following problem for each j:

min_{s_j} Σ_i ||f_i − f_j||² s_{ij} + γ ||s_j||²,  s.t. s_j^T 1 = 1, s_j ≥ 0.   (10)
Let d_j denote the vector whose i-th entry is ||f_i − f_j||²; it is a fixed vector when we solve the j-th column s_j. Solving equation (10) is then equivalent to optimizing

min_{s_j^T 1 = 1, s_j ≥ 0} ||s_j + d_j/(2γ)||²,   (11)

where d_j/(2γ) is a constant vector.
Equation (11) is a Euclidean projection problem onto the simplex, and there are several algorithms for solving it [37–39]. According to the Karush–Kuhn–Tucker conditions [37], it can be verified that the optimal solution is

s_j = (−d_j/(2γ) + η 1)_+,   (12)

where η is chosen so that the entries of s_j sum to one and (·)_+ truncates negative values to zero.
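The simplex projection behind this solution has a well-known O(n log n) sorting-based form; the sketch below implements it. The variable names (`eta` for the KKT multiplier) follow the notation above, and the connection to the update is that projecting −d/(2γ) onto the simplex yields the thresholded solution.

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto {s : s >= 0, sum(s) = 1}
    # via the standard sorting-based algorithm.
    u = np.sort(v)[::-1]                 # sort entries in descending order
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    # Largest k such that u_k + (1 - sum of the top-k entries) / k > 0.
    k = ks[u + (1.0 - css) / ks > 0][-1]
    eta = (1.0 - css[k - 1]) / k         # the KKT multiplier
    return np.maximum(v + eta, 0.0)
```

The result always lies on the simplex, and a vector dominated by one large entry projects to the corresponding vertex.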
Equation (10) will be analysed in detail in Section 3.4, where we show how to obtain exactly the desired number of connected components.
We alternately optimize equations (9) and (8) until the sum of the smallest eigenvalues of L_S (as many as the number of clusters) becomes zero. The algorithm for solving equation (7) is summarized in Algorithm 1.
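The alternating scheme can be sketched compactly. This is our reconstruction, not the paper's Algorithm 1 verbatim: F is taken as the bottom eigenvectors of the Laplacian, each column of S is updated by projecting −d_j/(2γ) onto the simplex (self-loops excluded, an implementation choice of ours), and the loop stops once the smallest eigenvalues sum to (numerically) zero.

```python
import numpy as np

def project_simplex(v):
    # Sorting-based Euclidean projection onto the probability simplex.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    k = ks[u + (1.0 - css) / ks > 0][-1]
    return np.maximum(v + (1.0 - css[k - 1]) / k, 0.0)

def gc(S0, c, gamma=0.1, max_iters=30, tol=1e-10):
    S = S0.copy()
    n = S.shape[0]
    for _ in range(max_iters):
        L = np.diag(S.sum(axis=0)) - (S + S.T) / 2        # Laplacian of S
        sigma, F = np.linalg.eigh(L)                      # ascending eigenvalues
        if sigma[:c].sum() < tol:                         # c components reached
            break
        F = F[:, :c]                                      # bottom-c eigenvectors
        D = ((F[:, None, :] - F[None, :, :]) ** 2).sum(-1)  # ||f_i - f_j||^2
        for j in range(n):                                # column-wise S update
            idx = np.arange(n) != j                       # exclude self-loops
            s = np.zeros(n)
            s[idx] = project_simplex(-D[idx, j] / (2 * gamma))
            S[:, j] = s
    return S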

3.4. Component Graph Learning
We seek a tradeoff between two graph structures: in the first case, each vertex is connected with only one other vertex in the vertex set, and in the second case, all vertices are connected with each other by the same weight 1/n. The tradeoff renders the objective function of equation (6) close to zero.
We optimize each column of S independently; one column of S is denoted by s, with entries s_j that are zero or positive: s_j = 0 means the vertex has no edge with vertex j, and s_j > 0 means the two are connected by an edge.
The first case is to optimize

min_{s^T 1 = 1, s ≥ 0} Σ_j d_j s_j.   (13)

It returns the minimum value min_j d_j, i.e., the vertex is connected only to its nearest neighbor.
The second case is

min_{s^T 1 = 1, s ≥ 0} Σ_j s_j².   (14)

It returns the uniform weights s_j = 1/n.
Thus, an adaptive graph learning objective function is

min_{s^T 1 = 1, s ≥ 0} Σ_j d_j s_j + γ Σ_j s_j²,   (15)

where γ is the tradeoff parameter.
If γ = 0, then optimizing equation (15) is equivalent to solving equation (13); if γ tends to positive infinity, then equation (15) reduces to equation (14). If we want to learn a sparse graph, we can tune γ to a small value; if we want to learn a graph with more edges, we can tune γ to a relatively large value.
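This sparsity-versus-smoothness behavior is easy to observe directly: the minimizer of the regularized problem is the simplex projection of −d/(2γ), so varying γ moves the solution between the one-neighbor and near-uniform extremes. The distance vector `d` below is an arbitrary illustrative choice.

```python
import numpy as np

def project_simplex(v):
    # Sorting-based Euclidean projection onto the probability simplex.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    k = ks[u + (1.0 - css) / ks > 0][-1]
    return np.maximum(v + (1.0 - css[k - 1]) / k, 0.0)

d = np.array([0.1, 0.4, 0.9, 1.6])          # distances to candidate neighbors
sparse = project_simplex(-d / (2 * 0.01))   # small gamma: only one neighbor kept
dense = project_simplex(-d / (2 * 100.0))   # large gamma: near-uniform weights
print((sparse > 0).sum(), (dense > 0).sum())  # -> 1 4
```

With γ = 0.01 all weight collapses onto the nearest neighbor; with γ = 100 every candidate keeps a weight close to 1/4.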
Solving equation (15) is equivalent to optimizing the problem

min_{s^T 1 = 1, s ≥ 0} ||s + d/(2γ)||².   (16)
According to [40], with the distances sorted in ascending order d_1 ≤ d_2 ≤ ⋯ ≤ d_n, the optimal affinities are given by

s_j = (d_{k+1} − d_j) / (k d_{k+1} − Σ_{h=1}^{k} d_h) for j ≤ k, and s_j = 0 for j > k,   (17)

where k is the number of neighbors kept.
Actually, it is straightforward to check from equation (17) that the number of neighbors is also determined by the setting of γ. In practice, the structure of the graph can be tuned coarsely by the neighbor number k and finely by the tradeoff parameter γ: k has an explicit meaning, while γ has an implicit relation with the structure of the graph. We tune both of them to obtain an intrinsic affinity matrix.
As with standard SC, any affinity matrix can be used as input to GC. There is one tradeoff parameter γ in equation (7). In Algorithm 1, we use equation (9) to obtain the initial graph by replacing the embedding distances ||f_i − f_j||² with the raw pairwise distances. Each neighbor number k induces a different γ, so before the iteration we set γ accordingly to preserve the initial structure. The initial graph structure is mainly determined by k.
3.5. Convergence Analysis
Since the second-order derivative of equation (16) with respect to s is a positive constant, equation (16) is a convex problem. Because the Laplacian matrix is positive semidefinite, equation (8) is also a convex optimization problem. Optimizing S and F alternately, both subproblem objectives decrease monotonically. As a result, the overall objective function value of equation (7) decreases monotonically in each iteration until Algorithm 1 converges.
3.6. Computational Complexity Analysis
The first step of solving equation (7) is to solve equation (10). Computing the pairwise distances among the rows of the embedding takes O(n²c) time, and solving equation (15) for one column takes O(n log n) time with the sorting-based projection, so the cost of the first step over all n columns is O(n²c + n² log n). The second step is an eigendecomposition procedure: solving equation (8) requires the bottom eigenvectors of L_S, whose cost is O(n³) in the general dense case. Thus, the total complexity of solving equation (7) is

O(t (n²c + n² log n + n³)),   (18)

where t is the number of iterations of the two steps.
4. Experimental Results
In this section, we conduct experiments on six datasets to demonstrate the effectiveness of GC in terms of clustering accuracy (ACC) and normalized mutual information (NMI).
4.1. Dataset Description
The six datasets are as follows:
(1) Two-moon dataset is a randomly generated synthetic dataset with two clusters of data distributed in a moon shape. Each cluster has 100 samples, and the noise percentage is set to 0.20.
(2) Path-based dataset [41] is made up of 300 samples belonging to three clusters.
(3) Yale dataset (http://vision.ucsd.edu/content/yalefacedatabase) contains 165 grayscale images of 15 individuals. Each individual has 11 images taken under different facial expressions or configurations.
(4) ORL dataset (http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html) contains ten different images of each of 40 distinct subjects. For some subjects, the images were taken at different times, with varying lighting, facial expressions, and facial details. All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position.
(5) COIL20 dataset [42] contains 1440 images of 20 object categories. Each category includes 72 images, and all images are normalized to 32 × 32 pixel arrays.
(6) NottingHill dataset is derived from the movie “Notting Hill” and includes 4660 faces in 76 tracks belonging to five main cast members.
The first two datasets are two-dimensional synthetic datasets. For the Yale, ORL, COIL20, and NottingHill datasets, we employ the intensity features of the images [43, 44]. A small fraction of the samples in these image datasets is shown in Figure 2. The dataset description is summarized in Table 1.
4.2. Experimental Setup
We compare GC with six methods, and the results are summarized in Table 2. We select three classic methods: k-means clustering [45], ratio cut clustering (Rcut) [21], and self-tuning SC (STSC) [17]; in addition, we select three related state-of-the-art methods: robust graph for SC (RGSC) [24], clustering with adaptive neighbors (CAN) [26], and constrained Laplacian rank (CLR) [27].
The default parameters given by the respective authors are adopted.
(1) k-means [45] clusters samples based on the similarity between samples and cluster centroids. There is no parameter to tune in k-means. We run k-means clustering 10 times to evaluate the performance. To reduce the effect of random initialization, each time we run it 30 times and report the result with the minimum value of the k-means objective function among these runs.
(2) Rcut [21] finds the first eigenvectors of the Laplacian matrix of the graph so as to minimize the similarity between the two parts of the graph. Specifically, Rcut uses the second eigenvector of the Laplacian matrix, which gives the optimal relaxed solution.
(3) STSC [17] constructs a graph at a local scale, and each data point chooses different neighbors. It has a parameter that determines the number of nearest neighbors for graph construction, and we tune the neighbor number to report the best results in terms of the k-means objective value.
(4) RGSC [24] generates robust affinity graphs for spectral clustering by identifying and exploiting discriminative features based on an unsupervised clustering random forest. The default parameters are used in our experiments.
(5) CAN [26] learns the data similarity matrix and the clustering structure simultaneously, using a rank constraint to shape the clustering structure in the similarity matrix as several disconnected components. The number of nearest neighbors is searched from 2 to 50 in increments of 2.
(6) CLR [27] imposes a Laplacian rank constraint on the learned graph, which best approximates the input initial affinity graph. In our experiments, the number of nearest neighbors is searched from 2 to 50 with interval 2 to obtain the best parameter.
For GC, we tune the neighbor-number parameter over its range with interval two, and the tradeoff parameter is fixed accordingly. For preprocessing, we iterate equation (1) ten times. We select the optimal setting in terms of the objective value of k-means clustering performed on the refined data. GC obtains the clustering results from the learned graph itself: because the data points in each connected component belong to one cluster, the clustering indicators are directly obtained with Tarjan's strongly connected-component algorithm [32]. In practice, we can determine the tradeoff parameter in a heuristic way to accelerate the procedure [26]. Since our iteration stopping condition, namely that the sum of the smallest eigenvalues of the Laplacian is zeroed, i.e., the first term of equation (7) is zeroed, is stronger than the conventional stopping condition [26], we adopt the former. Accordingly, we set the tradeoff parameter to a constant, then increase it if the number of connected components of the learned graph is larger than the cluster number and decrease it if smaller during the iteration.
For all these methods, we run each method 10 times and report the mean of performance as well as the standard deviation in Table 2.
4.3. Results and Analysis
We choose six datasets to demonstrate the effectiveness of GC; the first two are synthetic, while the rest are real-world datasets. Quantitative results of the different methods are shown in Table 2, from which we can see that GC improves the performance to a large extent in comparison to the state-of-the-art methods. The clustering results of the compared methods on the two synthetic datasets are visualized in Figure 3, where we can observe intuitively that GC approximates the ground truth most closely. Notably, on the two synthetic datasets, two-moon and path-based, GC achieves results approaching 100%, i.e., almost entirely accurate clustering. This underlines the importance of the intrinsic graph structure, through a comprehensive analysis of which GC obtains a significant advantage over the other methods.
Figure 4 shows that the objective value of GC, equation (7), is nonincreasing during the iterations. On all datasets, it converges to a fixed value very quickly, within 10 iterations. GC needs only a few alternating iterations to obtain a cut-zeroed graph, whereas conventional SC-based methods need to perform post-processing with the k-means clustering algorithm. Therefore, the GC algorithm is efficient overall.
5. Conclusion
In this paper, we have proposed a novel component graph clustering method, which learns a graph with exactly the desired number of connected components. Since the vertices in each connected component of the intrinsic graph belong to one cluster, labels are obtained directly from the learned graph itself without performing further graph-cut or k-means clustering algorithms. GC learns the affinity matrix and the clustering structure simultaneously. Moreover, GC can serve as an alternative to SC because of its simplicity and its effectiveness over standard SC. This paper focuses on the spectral analysis of the graph Laplacian, in which the sum of the smallest eigenvalues is driven to zero; this forces the graph structure to update until it has exactly the desired number of components. An efficient optimization algorithm is presented along with detailed analysis. Experiments on six benchmarks have demonstrated the superiority of GC.
Data Availability
All the datasets are available at https://github.com/lzucvpr/data_for_clustering.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This study was supported by the 13th FiveYear Informatization Plan of the Chinese Academy of Sciences (grant nos. XXH13506 and XXH13505220) and Data Sharing Fundamental Program for Construction of the National Science and Technology Infrastructure Platform (grant no. Y719H71006).