Exact k-Component Graph Learning for Image Clustering
The performance of graph-based clustering methods depends highly on the quality of the data affinity graph, since a good affinity graph closely approximates the pairwise similarity between data samples. Existing graph-based clustering methods largely construct the affinity graph with a fixed distance metric, which is often not an accurate representation of the underlying data structure, and they require postprocessing on the affinity graph to obtain clustering results, so the results are sensitive to the particular graph construction method. To address these two drawbacks, we propose a k-component graph clustering (k-GC) approach that learns an intrinsic affinity graph and obtains the clustering result simultaneously. Specifically, k-GC learns the data affinity graph by assigning adaptive and optimal neighbors to each data point based on local distances. Efficient iterative updating algorithms are derived for k-GC, along with proofs of convergence. Experiments on several benchmark datasets demonstrate the effectiveness of k-GC.
1. Introduction
Clustering is one of the most fundamental topics in computer vision and pattern recognition. The objective of clustering is to discover the data structure and partition a group of data points into several clusters, such that the similarity between data points within the same cluster is greater than the similarity between points from different clusters [1–8].
The structure of data is usually characterized by the affinity matrix of a graph whose edges denote the similarities between data points. If the vertices belonging to each cluster are connected into one component, i.e., no edge connects different clusters, the cut value of the graph is zero in graph theory. Our purpose is to learn a graph with an exact number of components so that the vertices in each connected component of the graph are partitioned into one cluster.
Given a cluster indicator matrix Y, it can be constructed from data labels: y_ij = 1 if data point x_i is assigned to the j-th cluster, and y_ij = 0 otherwise. Since the affinity matrix is strictly block diagonal in the ideal case, many clustering methods are designed to obtain a block diagonal affinity matrix or use one as an important prior [9–13]. Actually, we find that a good graph structure yields a good cluster indicator matrix even if the affinity matrix is not strictly block diagonal. For example, in Figure 1, we can obtain an ideal cluster indicator matrix from the graph itself because each connected component contains exactly one cluster, but the affinity matrix constructed from the graph (see Figure 1) is not strictly block diagonal, i.e., some elements within the diagonal blocks are zero. If the graph has k connected components, we can directly obtain a good cluster indicator matrix from the graph itself, even when the connected edges within each component are relatively sparse. This means that the graph structure, rather than a strictly block diagonal affinity matrix, is the intrinsic quality for obtaining a good clustering result.
We propose a novel graph-based clustering approach, called k-component graph clustering (k-GC), to exploit k connected components for clustering. Figure 1 shows a schematic of k-component graph learning. k-GC aims to learn a graph whose connected edges are tuned adaptively until the graph has exactly k connected components. To evaluate the effectiveness of k-GC, we have conducted experiments on six benchmark datasets in comparison with state-of-the-art approaches. The experimental results demonstrate that k-GC consistently outperforms the other approaches. k-GC makes the following contributions:
(1) k-GC learns a graph with exactly k connected components. Since the vertices in each connected component belong to one cluster, labels are obtained directly from the learned graph itself.
(2) The clustering indicators are obtained from the learned graph itself without a postprocessing k-means clustering step.
(3) k-GC can be used as an alternative to spectral clustering (SC). Like SC, k-GC only needs an initial affinity graph as input and does not involve the raw data.
2. Related Work
SC, which exploits the eigenstructure of a data affinity graph to partition data into different groups, has become one of the most fundamental clustering approaches. Standard SC uses the radial basis function to construct the affinity matrix W, and its performance relies heavily on the eigenstructure of W [15–20].
In SC, the similarity between each pair of data points x_i and x_j is first computed by the radial basis function to construct the affinity matrix W. Then the k eigenvectors of the normalized Laplacian matrix corresponding to the k smallest eigenvalues, F = [f_1, …, f_k], are taken as the low-dimensional embedding of the raw data X, and k-means clustering is finally performed on F to obtain the labels.
However, due to the ambiguity and uncertainty inherent in the data structure, the intrinsic affinity matrix cannot be determined by a single unified function. Since most existing SC methods use a predefined affinity matrix, the cut value is usually minimized but not zeroed, e.g., in ratio cut, normalized cut, and min-max cut, which necessitates a postprocessing k-means clustering step to obtain the clustering labels [15, 17, 18].
Many approaches have been proposed to improve the performance of SC. Generally, they can be categorized into three paradigms:
(1) How to improve data clustering using a predefined affinity matrix [12, 15, 23]
(2) How to construct a better affinity matrix so as to obtain a better result than standard SC [17, 24, 25]
(3) How to learn the affinity matrix and the clustering structure simultaneously [26–29]
This paper is related to the third paradigm, in which the objective functions of [26, 28, 29] usually involve the raw data, whereas k-GC does not. All of these methods employ a rank constraint on the Laplacian matrix [26–29]; clustering with adaptive neighbors (CAN) and constrained Laplacian rank (CLR) are the two methods most related to k-GC. CAN learns the graph from the raw data, whereas CLR learns the graph by minimizing the difference between the input initial graph and the learned graph.
In k-GC, the raw data are not involved in the objective function; k-GC directly modifies standard SC to tune the graph structure so that an intrinsic graph with exactly k connected components is obtained.
3. k-Component Graph Clustering
Suppose the data matrix is denoted by X = [X_1, …, X_k] ∈ R^{d×n}, where X_j denotes the data belonging to the j-th cluster, k denotes the number of clusters, x_i denotes the i-th data point, n is the number of data points, and d denotes the dimension.
3.1. Preprocessing
We propose a new, simple neighborhood-preserving method to refine the raw data in this section. Assume that a data point x_i and its nearest data point x_j are both generated from the same point and merely disturbed by noise. We then replace x_i by a weighted linear combination of x_i and x_j,

x_i ← (1 − w) x_i + w x_j, (1)

where the weight w ∈ [0, 1] is defined by equation (2) as a decreasing function of the distance between x_i and x_j.
Since a data point is generally closer to its nearest neighbor within the same cluster than to any point in a different cluster, it is straightforward to check from equation (2) that w tends to 1 if x_i and x_j belong to the same cluster and tends to 0 if they belong to different clusters. This implies that two data points in the same cluster become closer during the iteration, while two data points in different clusters have very little or even no influence on each other after the preprocessing. By iterating equation (1), data points in the same cluster become closer and closer, so that the clustering is easier to carry out.
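As an illustration, the refinement loop can be sketched as follows. Since the paper's exact weight definition (equation (2)) is not reproduced here, the Gaussian weight and the names `preprocess` and `sigma` are our assumptions, chosen only to satisfy the stated property that w tends to 1 for nearby (same-cluster) pairs and to 0 for distant ones:

```python
import numpy as np

def preprocess(X, sigma=1.0, n_iter=10):
    """Neighborhood-preserving refinement: pull each point toward its
    nearest neighbor via equation (1).  The weight below is a
    hypothetical stand-in for the paper's equation (2)."""
    X = X.astype(float).copy()
    for _ in range(n_iter):
        # pairwise squared Euclidean distances
        D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        np.fill_diagonal(D, np.inf)
        nn = D.argmin(axis=1)                              # nearest neighbor of each point
        w = np.exp(-D[np.arange(len(X)), nn] / sigma)      # hypothetical equation (2)
        X = (1 - w)[:, None] * X + w[:, None] * X[nn]      # equation (1)
    return X
```

On two well-separated clusters, within-cluster gaps shrink with each iteration while the clusters themselves stay apart.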
3.2. k-Component Graph
An undirected finite graph is denoted by G = (V, E), and its vertex set is denoted by V = {v_1, …, v_n}. We consider a weighted graph and assign a non-negative real weight w_ij to each pair of vertices v_i and v_j. The degree of a vertex is defined by d_i = Σ_j w_ij. The Laplacian matrix of the graph is defined by

L = D − W, (3)

where D is the diagonal matrix whose i-th diagonal element is d_i.
For any vector f ∈ R^n, we have f^T L f = (1/2) Σ_{i,j} w_ij (f_i − f_j)^2 ≥ 0, so L is a positive semidefinite matrix. Since L·1 = 0, the smallest eigenvalue of L is zero, and the corresponding eigenvector is the all-ones vector 1.
Theorem 1. The number of connected components of the graph is equal to the multiplicity of 0 as an eigenvalue of L.
Proof. Suppose there are k connected components in the graph G, the i-th subgraph is denoted by G_i, and the corresponding vertex set of G_i is denoted by V_i. Then V = V_1 ∪ … ∪ V_k and V_i ∩ V_j = ∅ for i ≠ j. For an eigenvector f of L with eigenvalue zero, we have f^T L f = (1/2) Σ_{i,j} w_ij (f_i − f_j)^2 = 0. In the i-th connected component G_i, each connected edge is weighted by a positive value w_jl > 0, so the corresponding term w_jl (f_j − f_l)^2 must be zero, i.e., f_j and f_l have to be equal. Since any two vertices of a connected component are joined by a path of such edges, f is constant on each component, and the eigenspace of eigenvalue 0 is spanned by the k indicator vectors of the components. These vectors are linearly independent, so the multiplicity of 0 as an eigenvalue is the number of components of the graph, and vice versa.
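Theorem 1 is easy to verify numerically. A small sketch (our own construction, not an example from the paper): build a graph with two components and count the near-zero eigenvalues of its Laplacian.

```python
import numpy as np

# Block-structured affinity: two components, {0, 1, 2} and {3, 4}.
W = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (3, 4)]:
    W[i, j] = W[j, i] = 1.0

L = np.diag(W.sum(axis=1)) - W          # unnormalized Laplacian L = D - W
eigvals = np.linalg.eigvalsh(L)         # eigenvalues in ascending order
n_zero = int(np.sum(np.abs(eigvals) < 1e-9))
print(n_zero)                           # 2, the number of connected components
```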
A similar proof of Theorem 1 can be found in previous works [13, 18, 31]. We illustrate Theorem 1 with a specific example. Figure 1 shows an intrinsic graph with two connected components; an intrinsic affinity matrix W of the graph, consistent with Theorem 1, is given in equation (4), where we constrain the degree d_i = 1, i.e., the sum of each column of W is equal to one.
The Laplacian matrix of equation (4) has the eigenvalue zero with multiplicity 2, and we have the two corresponding eigenvectors, each constant on one component and zero elsewhere; their concatenation and its transposition are given in equation (5). If the graph in Figure 1 corresponds to the affinity matrix in equation (4), i.e., the Laplacian matrix of equation (4) satisfies Theorem 1, then the clustering labels are easy to obtain without further performing graph-cut or k-means clustering algorithms. Since the vertices in each connected component of the graph are partitioned into one cluster, the clustering labels can be obtained easily from the graph with a strongly connected-component algorithm.
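Extracting labels from a k-component graph is exactly a connected-components pass. A minimal sketch using SciPy (our choice of tooling, not the paper's implementation):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Same two-component graph as above: {0, 1, 2} and {3, 4}.
W = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (3, 4)]:
    W[i, j] = W[j, i] = 1.0

# Each component becomes one cluster; labels need no k-means step.
n_comp, labels = connected_components(csr_matrix(W), directed=False)
print(n_comp)    # 2
print(labels)    # one label per vertex, constant on each component
```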
Since L is a positive semidefinite matrix, it has non-negative eigenvalues 0 = σ_1 ≤ σ_2 ≤ … ≤ σ_n. Theorem 1 indicates that if Σ_{i=1}^{k} σ_i = 0, then the graph has k connected components and the vertices are already partitioned into k clusters [18, 30, 31, 33]. Then, according to Fan's theorem [14, 34], we have

Σ_{i=1}^{k} σ_i = min_{F ∈ R^{n×k}, F^T F = I} tr(F^T L F), (6)

where tr(·) denotes the trace operator, I is an identity matrix, L = D − W is the Laplacian matrix, and D is a diagonal matrix whose diagonal elements are the column sums of W.
The proof of Fan’s theorem can be seen in [13, 35, 36].
It is straightforward to check that the objective function value of equation (6) does not generally equal zero, because the structure of the graph varies with the graph construction method. In the following section, we propose a method to tune the structure of the graph adaptively so that the objective function value of equation (6) tends to zero.
3.3. k-GC Algorithm
In this section, we explore how to learn a graph with exactly k connected components.
The right-hand side of equation (6) can be minimized with respect to F via the eigenvectors of L, but the problem has a trivial solution with respect to W: in each column of W, only one entry is assigned a value and the others are zeroed. We therefore add ℓ2-norm regularization to smooth the weights in W and optimize F and W simultaneously:

min_{W, F} tr(F^T L F) + γ ||W||_F^2, s.t. F ∈ R^{n×k}, F^T F = I, Σ_i w_ij = 1, w_ij ≥ 0, (7)

where γ is the trade-off parameter.
If the learned affinity matrix W is consistent with Theorem 1, the procedure for obtaining the clustering labels is named k-GC. In other words, k-GC guarantees that the objective function of standard SC is zeroed by tuning the graph structure. We divide problem (7) into two subproblems and solve them alternately.
The first subproblem fixes W and updates F. Then equation (7) becomes

min_{F ∈ R^{n×k}, F^T F = I} tr(F^T L F). (8)
Equation (8) is solved by calculating the k eigenvectors of L corresponding to its k smallest eigenvalues.
The second subproblem fixes F and updates W. Then equation (7) becomes

min_{W} Σ_{i,j} ||f_i − f_j||^2 w_ij + γ ||W||_F^2, s.t. Σ_i w_ij = 1, w_ij ≥ 0, (9)

where w_i denotes the i-th column of W.
Each column of W is independent, so solving equation (9) is equivalent to optimizing the following problem for each column:

min_{w_i^T 1 = 1, w_i ≥ 0} Σ_j d_ij w_ij + γ Σ_j w_ij^2, (10)

where d_ij = ||f_i − f_j||^2.
Let d_i denote the vector whose j-th entry is d_ij; it is fixed when we solve for the i-th column w_i. Solving equation (10) is equivalent to optimizing the following problem:

min_{w_i^T 1 = 1, w_i ≥ 0} ||w_i + d_i/(2γ)||^2, (11)

where d_i/(2γ) is a constant vector.
Equation (11) is a Euclidean projection problem onto the simplex, and there are several algorithms for solving it [37–39]. According to the Karush–Kuhn–Tucker conditions, it can be verified that the optimal solution is

w_ij = (−d_ij/(2γ) + η)_+, (12)

where η is the Lagrange multiplier for the constraint w_i^T 1 = 1 and (x)_+ = max(x, 0).
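A sketch of this closed form, using the sort-based simplex projection of Duchi et al. [38] to find the multiplier η; the function names are ours:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {s : s >= 0, sum(s) = 1}
    (sort-based algorithm)."""
    u = np.sort(v)[::-1]                       # sort descending
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    eta = -css[rho] / (rho + 1.0)              # the KKT multiplier in eq. (12)
    return np.maximum(v + eta, 0.0)

def solve_column(d, gamma):
    """Optimal column of W by eq. (12): w_j = (-d_j/(2*gamma) + eta)_+."""
    return project_simplex(-d / (2.0 * gamma))
```

With all distances equal the solution is uniform; smaller distances receive larger affinities, and sufficiently distant points get exactly zero.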
3.4. k-Component Graph Learning
We seek a trade-off between two extreme graph structures: in the first case, each vertex is connected to only one other vertex in the vertex set, and in the second case, all vertices are connected to each other with the same weight. The trade-off renders the objective function value of equation (6) close to zero.
We optimize each column of W independently; one column of W is denoted by s. The values in the vector s are zero or positive: s_j = 0 means v_j has no edge with v_i, and s_j > 0 means v_i and v_j are connected by an edge.
The first case is to optimize the following:

min_{s^T 1 = 1, s ≥ 0} Σ_j d_j s_j. (13)
It returns the minimum value min_j d_j, i.e., v_i is connected only to its nearest other vertex.
The second case is the following:

min_{s^T 1 = 1, s ≥ 0} Σ_j s_j^2. (14)
It returns the uniform solution s_j = 1/n for all j.
Thus, an adaptive graph learning objective function is

min_{s^T 1 = 1, s ≥ 0} Σ_j d_j s_j + γ Σ_j s_j^2, (15)

where γ is the trade-off parameter.
If γ = 0, then optimizing equation (15) is equivalent to solving equation (13); if γ tends to positive infinity, then equation (15) becomes equation (14). If we want to learn a sparse graph, we can tune γ to a small value; if we want to learn a graph with more edges, we can tune γ to a relatively large value.
Solving equation (15) is equivalent to optimizing the problem:

min_{s^T 1 = 1, s ≥ 0} ||s + d/(2γ)||^2. (16)
According to the Karush–Kuhn–Tucker conditions, the optimal affinities are given by

s_j = (−d_j/(2γ) + η)_+. (17)
Actually, it is straightforward to check from equation (17) that the number of neighbors m with nonzero affinity is also determined by the setting of γ. In practice, the structure of the graph can be tuned coarsely by m and finely by the trade-off parameter γ: m has an explicit meaning, while γ has an implicit relation to the structure of the graph. We tune both of them to obtain an intrinsic affinity matrix.
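The coarse tuning by a neighbor number m can be made concrete with the CAN-style rule of Nie et al., which picks γ so that exactly m affinities stay nonzero in equation (17); applying that rule here is our assumption, not a formula stated in this section:

```python
import numpy as np

def neighbors_from_gamma(d, m):
    """Hypothetical CAN-style rule: choose gamma so that exactly m
    affinities in this column are nonzero.  d holds the distances to
    the other vertices; the solution follows eq. (17)."""
    d = np.sort(d)                                     # ascending distances
    gamma = 0.5 * (m * d[m] - d[:m].sum())             # d[m] is the (m+1)-th distance
    s = (d[m] - d) / (m * d[m] - d[:m].sum() + 1e-12)  # closed form of eq. (17)
    return gamma, np.maximum(s, 0.0)
```

The resulting affinities sum to one, and only the m nearest vertices receive positive weight (assuming the m-th and (m+1)-th distances differ).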
As in standard SC, any affinity matrix can be used as input for k-GC. There is one parameter, γ, in equation (7). In Algorithm 1, we use equation (9) to obtain the initial graph by replacing the distances between embeddings, ||f_i − f_j||^2, with the distances between raw data points, ||x_i − x_j||^2. Each γ yields a different W, so before the iteration we fix γ so as to preserve the initial structure. The initial graph structure is mainly determined by m.
3.5. Convergence Analysis
Since the second-order derivative of equation (16) with respect to s is a positive constant, equation (16) is a convex problem. Because the Laplacian matrix L is positive semidefinite, equation (8) is also a convex optimization problem. By optimizing F and W alternately, both subproblem objectives decrease monotonically. As a result, the overall objective function value of equation (7) decreases monotonically in each iteration until Algorithm 1 converges.
3.6. Computational Complexity Analysis
The first step of solving equation (7) is to solve equation (10). We need O(kn^2) time to compute all the distances d_ij = ||f_i − f_j||^2 and O(n) time per column to solve equation (15), so the complexity of the first step is O(kn^2). The second step is an eigendecomposition procedure: computing the k eigenvectors of L costs O(n^3) in general. Thus, the total complexity of solving equation (7) is O(T(kn^2 + n^3)), where T is the number of iterations of the two steps.
4. Experimental Results
In this section, we conduct experiments on six datasets to demonstrate the effectiveness of k-GC in terms of clustering accuracy (ACC) and normalized mutual information (NMI).
4.1. Dataset Description
The six datasets are as follows:
(1) Two-moon dataset is a randomly generated synthetic dataset with two clusters of data distributed in a moon shape. Each cluster has 100 samples, and the noise percentage is set to 0.20.
(2) Path-based dataset is made up of 300 samples belonging to three clusters.
(3) Yale dataset (http://vision.ucsd.edu/content/yale-face-database) contains 165 grayscale images of 15 individuals. Each individual has 11 images taken under different facial expressions or configurations.
(4) ORL dataset (http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html) contains ten different images of each of 40 distinct subjects. For some subjects, the images were taken at different times, varying the lighting, facial expressions, and facial details. All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position.
(5) COIL-20 dataset contains 1440 images of 20 object categories. Each category includes 72 images, and all the images are normalized to 32 × 32 pixel arrays.
(6) Notting-Hill dataset is derived from the movie "Notting Hill" and includes 4660 faces in 76 tracks belonging to five main casts.
The first two datasets are two-dimensional synthetic datasets. For the Yale, ORL, COIL-20, and Notting-Hill datasets, we employ the intensity features of the images [43, 44]. A small fraction of the samples in these image datasets is shown in Figure 2. The dataset descriptions are summarized in Table 1.
4.2. Experimental Setup
We compare k-GC with six methods; the results are summarized in Table 2. We select three classic methods: k-means clustering, ratio cut clustering (R-cut), and self-tuning SC (ST-SC). We also select three related state-of-the-art methods: robust graph for SC (RG-SC), clustering with adaptive neighbors (CAN), and constrained Laplacian rank (CLR).
The default parameters given by the respective authors are adopted.
(1) k-means clusters samples based on the similarity between samples and cluster centroids. There is no parameter to tune in k-means. We run k-means clustering 10 times to evaluate the performance. To reduce the effect of random initialization, we run it 30 times within each trial and report the result with the minimum k-means objective value.
(2) R-cut finds the first k eigenvectors of the Laplacian matrix of the graph so as to minimize the similarity between the parts of the graph. Specifically, R-cut advocates the second eigenvalue of the Laplacian matrix, which gives the optimal bipartition.
(3) ST-SC constructs a graph at a local scale, where each data point chooses different neighbors. It has a parameter determining the number of nearest neighbors for graph construction, and we tune this neighbor number and report the best results in terms of the k-means objective value.
(4) RG-SC generates robust affinity graphs for spectral clustering by identifying and exploiting discriminative features with an unsupervised clustering random forest. The default parameters are used in our experiments.
(5) CAN learns the data similarity matrix and the clustering structure simultaneously and uses a rank constraint to create the clustering structure in the similarity matrix as several disconnected components. The number of nearest neighbors is searched from 2 to 50 in increments of 2.
(6) CLR imposes a Laplacian rank constraint on the learned graph, which best approximates the input initial affinity graph. In our experiments, the number of nearest neighbors is searched from 2 to 50 in increments of 2 to obtain the best parameter.
For k-GC, we tune the neighbor number m over a range with interval two, and γ is fixed accordingly. For preprocessing, we iterate equation (1) ten times. We select the optimal result in terms of the k-means objective value computed on the refined data. k-GC obtains the clustering results from the learned graph itself: because the data points in each connected component belong to one cluster, the clustering indicators are directly obtained by Tarjan's strongly connected-component algorithm. In practice, we can determine γ in a heuristic way to accelerate the procedure. Our iteration stopping condition, that the sum of the k smallest eigenvalues of L is zeroed (i.e., the first term of equation (7) is zeroed), is stronger than the conventional stopping condition, so we adopt the former. Accordingly, we set γ to a constant, then increase γ if the number of connected components of W is larger than k, and decrease γ if it is smaller than k during the iteration.
For all these methods, we run each method 10 times and report the mean of performance as well as the standard deviation in Table 2.
4.3. Results and Analysis
We choose six datasets to demonstrate the effectiveness of k-GC; the first two are synthetic, and the rest are real-world datasets. Quantitative results of the different methods are shown in Table 2, from which we can see that k-GC improves the performance to a large extent compared with the state-of-the-art methods. The clustering results of the compared methods on the two synthetic datasets are visualized in Figure 3, which shows intuitively that k-GC approximates the ground truth most closely. Notably, on the two synthetic datasets, two-moon and path-based, k-GC achieves accuracy approaching 100%, i.e., almost entirely correct clustering. This underlines the importance of the intrinsic graph structure, through whose analysis we obtain a significant improvement over the other methods.
Figure 4 shows that the objective value of k-GC, equation (7), is nonincreasing during the iterations; in all cases, it converges to a fixed value within about 10 iterations. k-GC only needs several alternating iterations to obtain a cut-zeroed graph, whereas conventional SC-based methods need to perform postprocessing with the k-means clustering algorithm. Therefore, the k-GC algorithm is effective overall.
5. Conclusions
In this paper, we have proposed a novel k-component graph clustering method, which learns a graph with exactly k connected components. Since the vertices in each connected component of the intrinsic graph belong to one cluster, labels are obtained directly from the learned graph itself without performing further graph-cut or k-means clustering algorithms. k-GC learns the affinity matrix and the clustering structure simultaneously. Moreover, k-GC can serve as an alternative to SC due to its simplicity and its effectiveness over standard SC. This paper focuses on the spectral analysis of the graph Laplacian, in which the sum of the k smallest eigenvalues is driven to zero; this drives the graph structure to update until it has exactly k components. An efficient optimization algorithm is presented along with a thorough analysis. Experiments on six benchmarks have demonstrated the superiority of k-GC.
Data Availability
All the datasets are available at https://github.com/lzu-cvpr/data_for_clustering.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This study was supported by the 13th Five-Year Informatization Plan of the Chinese Academy of Sciences (grant nos. XXH13506 and XXH13505-220) and the Data Sharing Fundamental Program for Construction of the National Science and Technology Infrastructure Platform (grant no. Y719H71006).
References
H. Parvin and B. Minaei-Bidgoli, "A clustering ensemble framework based on selection of fuzzy weighted clusters in a locally adaptive clustering algorithm," Pattern Analysis and Applications, vol. 18, no. 1, pp. 87–112, 2015.
A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: analysis and an algorithm," in Proceedings of the NIPS, vol. 14, pp. 849–856, Vancouver, Canada, December 2002.
L. Zelnik-Manor and P. Perona, "Self-tuning spectral clustering," in Proceedings of the NIPS, pp. 1601–1608, Vancouver, Canada, December 2005.
J. Huang, F. Nie, and H. Huang, "A new simplex sparse learning model to measure data similarity for clustering," in Proceedings of the IJCAI, pp. 3569–3575, Buenos Aires, Argentina, July 2015.
F. Nie, X. Wang, and H. Huang, "Clustering and projected clustering with adaptive neighbors," in Proceedings of the KDD, vol. 20, pp. 977–986, ACM, August 2014.
F. Nie, X. Wang, M. I. Jordan, and H. Huang, "The constrained Laplacian rank algorithm for graph-based clustering," in Proceedings of the AAAI, pp. 1969–1976, AAAI, Phoenix, AZ, USA, February 2016.
X. Zhu, W. He, Y. Li et al., "One-step spectral clustering via dynamically learning affinity matrix and subspace," in Proceedings of the AAAI, pp. 2963–2969, San Francisco, CA, USA, February 2017.
Z. Kang, C. Peng, and Q. Cheng, "Twin learning for similarity and clustering: a unified kernel approach," in Proceedings of the AAAI, pp. 2080–2086, San Francisco, CA, USA, February 2017.
F. R. Chung, Spectral Graph Theory, American Mathematical Society, Providence, RI, USA, 1997.
B. Mohar, Y. Alavi, G. Chartrand, and O. Oellermann, "The Laplacian spectrum of graphs," Graph Theory, Combinatorics, and Applications, vol. 2, no. 12, pp. 871–898, 1991.
B. Mohar, "Some applications of Laplace eigenvalues of graphs," in Graph Symmetry, G. Hahn and G. Sabidussi, Eds., pp. 225–275, Kluwer Academic Publishers, Berlin, Germany, 1997.
S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, UK, 2004.
J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra, "Efficient projections onto the ℓ1-ball for learning in high dimensions," in Proceedings of the ICML, vol. 25, pp. 272–279, ACM, Helsinki, Finland, July 2008.
S. A. Nene, S. K. Nayar, and H. Murase, Columbia Object Image Library (COIL-20), Columbia University, New York, NY, USA, 1996.
C. Zhang, H. Fu, S. Liu, G. Liu, and X. Cao, "Low-rank tensor constrained multiview subspace clustering," in Proceedings of the ICCV, pp. 1582–1590, Santiago, Chile, December 2015.
X. Cao, C. Zhang, H. Fu, S. Liu, and H. Zhang, "Diversity-induced multi-view subspace clustering," in Proceedings of the CVPR, pp. 586–594, Boston, MA, USA, June 2015.
J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, no. 14, pp. 281–297, Oakland, CA, USA, January 1967.