Research Article  Open Access
Large-Scale Spectral Clustering Based on Representative Points
Abstract
Spectral clustering (SC) has attracted increasing attention due to its effectiveness in machine learning. However, most traditional spectral clustering methods are difficult to apply to large-scale problems, mainly because of their high computational complexity $O(n^3)$, where $n$ is the number of samples. To achieve fast spectral clustering, we propose a novel approach, called representative point-based spectral clustering (RPSC), to efficiently deal with the large-scale spectral clustering problem. The proposed method first generates two-layer representative points successively by BKHK (balanced k-means-based hierarchical k-means). Then it constructs the hierarchical bipartite graph and performs spectral analysis on the graph. Specifically, we construct the similarity matrix using the parameter-free neighbor assignment method, which avoids the need to tune extra parameters. Furthermore, we perform co-clustering on the final similarity matrix. The co-clustering mechanism takes advantage of the co-occurring cluster structure among the representative points and the original data to strengthen the clustering performance. As a result, the computational complexity can be significantly reduced and the clustering accuracy can be improved. Extensive experiments on several large-scale data sets show the effectiveness, efficiency, and stability of the proposed method.
1. Introduction
Clustering is one of the fundamental topics in unsupervised learning. It has been widely and successfully applied in data mining, pattern recognition, and many other fields. Spectral clustering is one of the most popular methods used in unsupervised clustering tasks [1–4]. In particular, it performs well on nonconvex patterns and linearly nonseparable clusters and converges to the global optimal solution [5]. However, spectral clustering is limited in its applicability to large-scale problems; its bottleneck is the high computational complexity [6–8]. Many approaches have been proposed to speed up spectral clustering [9–11]. Unfortunately, these methods usually sacrifice much of the information in the raw data, resulting in performance degradation.
Traditional spectral clustering needs two independent steps: constructing the similarity graph and performing spectral analysis [12]. Both steps are computationally expensive for large-scale data, with computational complexity $O(n^2 d)$ and $O(n^3)$, respectively. Conventional spectral clustering has three methods to construct the similarity graph, which is built from pairwise similarities or pairwise distances. The goal is to model the local neighborhood relationships between the data points [13]. The first method is to construct the $\varepsilon$-neighborhood graph, in which $\varepsilon$ is a threshold on the pairwise distance: all points whose pairwise distances are smaller than $\varepsilon$ are connected. In this method, a large amount of information between sample points is discarded because of this single, rough criterion. The second method is to generate the similarity graph by the k-nearest neighbor method, in which two vertices $p$ and $q$ are connected if the distance between $p$ and $q$ is among the $k$ smallest distances from $p$ to other objects. The last similarity graph is the fully connected graph, in which all points with positive similarity are connected with each other. Constructing any of these graphs is computationally expensive for large data sets.
In the past years, many extensions of spectral clustering have been proposed. Wang and Wu proposed to simultaneously learn a clustered orthogonal projection and an optimized local graph structure for each view, while enjoying the same magnitude over them both for all views, leading to a superior multi-view spectral clustering consensus [14]. Houthuys et al. proposed a new model, multi-view kernel spectral clustering (MVKSC). This model is formulated as a weighted kernel canonical correlation analysis in a primal-dual optimization setting typical of least-squares support-vector machines (LS-SVM). A coupling term enforces the clustering scores corresponding to the different views to align [15]. Tao et al. proposed a novel robust spectral ensemble clustering algorithm. It learns a low-rank representation for the co-association matrix to uncover the cluster structure and handle the noise, and meanwhile it performs spectral clustering with the learned representation to seek a consensus partition [16]. Huang et al. proposed ultra-scalable spectral clustering (USPEC) and ultra-scalable ensemble clustering (USENC). In USPEC, a hybrid representative selection strategy and a fast approximation method for K-nearest representatives are proposed for the construction of a sparse affinity submatrix. In USENC, multiple USPEC clusterers are further integrated into an ensemble clustering framework to enhance the robustness of USPEC [17]. Yang et al. designed a constraint projection for a semisupervised spectral clustering ensemble (CPSSSCE) model. In this method, the original data are transformed to lower-dimensional representations by constraint projection before base clustering. Then, a similarity matrix is constructed using the base clustering results and modified using pairwise constraints. Finally, the spectral clustering algorithm is applied to the similarity matrix to obtain a consensus clustering result [18].
In recent years, many spectral clustering methods for large-scale data have been proposed. Zhao et al. proposed spectral clustering based on iterative optimization (SCIO), which solves the spectral decomposition problem for large-scale and high-dimensional data sets; the method is also applicable to multitask clustering [19]. Nonnegative matrix factorization (NMF) has been proposed as a relaxation technique for clustering with excellent performance [20, 21]. Yang et al. proposed a multitask spectral clustering model (MTSC) by exploring two types of correlations: intertask clustering correlation and intratask learning correlation [22]. He et al. reduced the complexity of spectral clustering by employing random Fourier features to explicitly represent data in kernel space [23]. Semertzidis et al. proposed an efficient spectral clustering method for large-scale data sets in which a set of pairwise constraints is given to increase clustering accuracy and reduce clustering complexity [24]. These methods greatly extend spectral clustering to large-scale data.
Recently, the representative point-based graph has been widely adopted in spectral-based methods to speed up the procedure. Shinnou and Sasaki [25] constructed the similarity matrix on “committees” of raw data points, and large-scale spectral clustering was performed using the reduced similarity matrix. Yan et al. [26] introduced a fast spectral clustering method which used the k-means method to produce a set of reduced representative points for approximate spectral clustering (KASP: k-means-based approximate spectral clustering). Liu et al. [27] addressed the scalability issue plaguing graph-based semisupervised learning via anchor points. Liu et al. [28] proposed an efficient clustering algorithm for large-scale graph data using spectral methods. In order to compress the original graph into a sparse bipartite graph, they repeatedly generate a small number of “supernodes” connected to the regular nodes. Cai and Chen [29] and Cai [30] proposed landmark-based spectral clustering and spectral dimensionality reduction. Li et al. [31] adopted the salient point-based sub-bipartite graph to achieve large-scale multi-view spectral clustering. Wang et al. [32] explored the spectral and spatial properties of the hyperspectral image and proposed a novel method, called fast spectral clustering with anchor graph (FSCAG), which solved the large-scale hyperspectral image clustering problem. All of the methods mentioned above adopt a representative point-based strategy to construct the similarity graph and thereby accelerate spectral clustering.
Representative point generation is one of the most important steps in representative point-based spectral clustering methods, because the generated points directly affect the final performance. There are two main strategies to generate representative points: random selection and the k-means method. The random strategy is efficient, but its performance is not stable because of the randomness of the representative points. Generally speaking, the k-means strategy can achieve good performance, but its computational cost is high. Much effort has been devoted to accelerating the k-means procedure, e.g., early stopping of the iteration [33] or downsampling the data, but these techniques can also sacrifice some performance.
To tackle the problem, a novel and efficient representative point-based spectral clustering method is proposed to deal with large-scale data sets. The main contributions of this paper are listed as follows:
(1) The two-layer bipartite graph is constructed using the representative points generated by BKHK. BKHK has low computational complexity and high performance compared with k-means.
(2) We construct the similarity matrix between adjacent layers using the parameter-free neighbor assignment method, which avoids extra parameters. Furthermore, the final similarity matrix is easily obtained by multiplying the similarity matrices between adjacent layers.
(3) We perform co-clustering on the final similarity matrix. The co-clustering mechanism takes advantage of the co-occurring cluster structure among the representative points and the original data to strengthen the clustering performance.
(4) Extensive experiments on several large-scale data sets demonstrate the effectiveness, efficiency, and stability of the proposed method.
2. Representative Points Generation
The most important step of representative point-based spectral clustering is the generation of representative points. In this paper, we design two-layer representative points by BKHK to gradually reduce the data size. BKHK adopts a balanced binary tree structure; in other words, it iteratively segments the data into two clusters with the same number of samples [34].
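Given a data matrix $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d \times n}$, where $x_i$ denotes the $i$th sample, $n$ is the number of samples, and $d$ is the dimensionality of the data, the two-class k-means can be formulated as follows:
$$\min_{C,\,G}\ \left\| X - C G^{T} \right\|_F^2 \quad \text{s.t.}\ G \in \{0,1\}^{n \times 2},\ G^{T}\mathbf{1} = [k,\ l]^{T}, \tag{1}$$
where $G$ is the index matrix whose element $g_{ij}$ equals 1 or 0: $g_{i1} = 1$ and $g_{i2} = 0$ if the $i$th sample belongs to the first cluster, and $g_{i1} = 0$ and $g_{i2} = 1$ otherwise; $C \in \mathbb{R}^{d \times 2}$ holds the centers of the clusters; $\mathbf{1}$ is the column vector of all ones; $k$ and $l$ are the numbers of samples in the two clusters; and $k$ is the integer portion of $n/2$. Therefore, we have $l = n - k$. Furthermore, problem (1) can be rewritten as
$$\min_{C,\,G}\ \sum_{i=1}^{n} \sum_{j=1}^{2} \left\| x_i - c_j \right\|_2^2\, g_{ij} \quad \text{s.t.}\ G \in \{0,1\}^{n \times 2},\ G^{T}\mathbf{1} = [k,\ l]^{T}, \tag{2}$$
where $c_j$ is the $j$th column of $C$.
We define the matrix $E \in \mathbb{R}^{n \times 2}$ whose $j$th element of the $i$th row is $e_{ij} = \| x_i - c_j \|_2^2$, so for fixed centers problem (2) can be rewritten as
$$\min_{G}\ \operatorname{Tr}\!\left(E^{T} G\right) \quad \text{s.t.}\ G \in \{0,1\}^{n \times 2},\ G^{T}\mathbf{1} = [k,\ l]^{T}. \tag{3}$$
For convenience, let $g$ denote the first column of $G$; obviously, the second column is $\mathbf{1} - g$. Let $e_1$ and $e_2$ denote the two columns of $E$. Substituting $g$ and $\mathbf{1} - g$ into problem (3) and dropping the constant term $\mathbf{1}^{T} e_2$, we have
$$\min_{g}\ g^{T}\left(e_1 - e_2\right) \quad \text{s.t.}\ g \in \{0,1\}^{n},\ g^{T}\mathbf{1} = k. \tag{4}$$
Let $g_i = 1$ when the $i$th element of $e_1 - e_2$ is among the $k$ smallest of all its elements, and $g_i = 0$ otherwise. Obviously, this gives the solution to problem (4).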
We can obtain the first-layer representative points by performing the above process iteratively. The procedure is then repeated on the first-layer representative points to generate the second-layer representative points.
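The balanced split and its recursive use can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: distances are squared Euclidean, assignments and centers are updated alternately until the split stops changing, and the mean of each leaf cluster is taken as a representative point. The names `balanced_two_means` and `bkhk` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def balanced_two_means(points, n_iter=20, seed=0):
    """Split points into two clusters of sizes floor(n/2) and n - floor(n/2)."""
    rng = random.Random(seed)
    n = len(points)
    k = n // 2
    c1, c2 = rng.sample(points, 2)  # assumption: random initial centers
    first = set()
    for _ in range(n_iter):
        # Entries of e_1 - e_2: the k smallest go to the first cluster.
        diff = [sq_dist(p, c1) - sq_dist(p, c2) for p in points]
        order = sorted(range(n), key=lambda i: diff[i])
        new_first = set(order[:k])
        if new_first == first:
            break
        first = new_first
        c1 = mean([points[i] for i in first])
        c2 = mean([points[i] for i in range(n) if i not in first])
    left = [points[i] for i in range(n) if i in first]
    right = [points[i] for i in range(n) if i not in first]
    return left, right


def bkhk(points, depth):
    """Return up to 2**depth representative points (means of the leaf clusters)."""
    if depth == 0 or len(points) < 2:
        return [mean(points)]
    left, right = balanced_two_means(points)
    return bkhk(left, depth - 1) + bkhk(right, depth - 1)
```

Under these assumptions, calling `bkhk(data, depth)` on the raw data would give the first-layer points, and calling it again on that output with a smaller depth would give the second-layer points.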
3. Similarity Matrix
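Similar to conventional similarity graph construction, constructing the similarity graph between the obtained representative points and the raw points also requires choosing a neighbor assignment strategy. The kernel-based neighbor assignment strategy is usually used in conventional methods, but it always brings extra parameters [13]. A parameter-free method is adopted in this paper [35]. Let $U = [u_1, u_2, \ldots, u_m]$ denote the generated representative points, and let $\Phi_i$ be the set of the $k$ nearest representative points of the $i$th sample (here $k$ denotes the number of neighbors). $d(i,j)$ is the (squared Euclidean) distance between the $i$th sample and its $j$th nearest representative point. The neighbor assignment strategy can be formulated as follows:
$$\min_{b_i^{T}\mathbf{1} = 1,\ b_{ij} \ge 0}\ \sum_{j=1}^{m} \left( \left\| x_i - u_j \right\|_2^2\, b_{ij} + \gamma\, b_{ij}^2 \right), \tag{5}$$
where $B \in \mathbb{R}^{n \times m}$ is the similarity matrix between the raw data and the representative points, $b_i^{T}$ is the $i$th row of $B$, and $b_{ij}$ is the $(i,j)$ element of $B$. $b_{ij}$ denotes the similarity between the $i$th data point and the $j$th representative point. Following Nie et al. [35], $\gamma$ is set equal to $\frac{k}{2} d(i,k+1) - \frac{1}{2} \sum_{j=1}^{k} d(i,j)$. The solution to problem (5) is as follows:
$$b_{ij} = \begin{cases} \dfrac{d(i,k+1) - \left\| x_i - u_j \right\|_2^2}{k\, d(i,k+1) - \sum_{j'=1}^{k} d(i,j')}, & u_j \in \Phi_i, \\ 0, & \text{otherwise}. \end{cases} \tag{6}$$
The two-layer representative points are successively generated by performing BKHK. $B_1$ denotes the similarity matrix between the original data points and the first-layer representative points, which can be obtained from equation (6), and $B_1 \in \mathbb{R}^{n \times m_1}$, where $m_1$ is the number of first-layer representative points. $B_2$ denotes the similarity matrix between the first-layer representative points and the second-layer representative points, which can be obtained by the same process, and $B_2 \in \mathbb{R}^{m_1 \times m_2}$, where $m_2$ is the number of second-layer representative points. The final similarity matrix between the raw data points and the second-layer representative points can be obtained as follows:
$$Z = B_1 B_2, \tag{7}$$
where $Z \in \mathbb{R}^{n \times m_2}$ is the final similarity matrix.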
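As an illustration of the parameter-free assignment and of multiplying the two adjacent-layer matrices, the sketch below follows the closed-form rule of Nie et al. [35] as it is commonly stated. It assumes squared Euclidean distances and fewer neighbors than representative points; the helper names `neighbor_assignment` and `matmul`, and the uniform fallback for ties, are hypothetical choices for this example.

```python
def neighbor_assignment(points, reps, k):
    """Similarity matrix (list of rows) between points and representative points.

    Row i has k nonzero entries, one for each of the k nearest representatives
    of sample i, and sums to one. Requires k < len(reps).
    """
    B = []
    for p in points:
        # Assumption: squared Euclidean distance to every representative.
        d = [sum((a - b) ** 2 for a, b in zip(p, u)) for u in reps]
        order = sorted(range(len(reps)), key=lambda j: d[j])
        nearest, d_next = order[:k], d[order[k]]  # d_next is d(i, k+1)
        denom = k * d_next - sum(d[j] for j in nearest)
        row = [0.0] * len(reps)
        for j in nearest:
            # Hypothetical fallback: uniform weights if the k+1 nearest
            # representatives are all equidistant (denominator is zero).
            row[j] = (d_next - d[j]) / denom if denom > 0 else 1.0 / k
        B.append(row)
    return B


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]
```

With `B1 = neighbor_assignment(data, layer1, k)` and `B2 = neighbor_assignment(layer1, layer2, k)`, the product `matmul(B1, B2)` plays the role of the final similarity matrix. Because every row of `B1` and `B2` sums to one, so does every row of the product.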
4. Coclustering on Similarity Matrix
As with document data, a duality exists between the raw data points and the representative points. The representative points can be clustered based on their relations with the corresponding raw data clusters, while the raw data clusters are obtained according to their associations with distinct representative point clusters. In order to make full use of the duality information and strengthen the clustering performance, the co-clustering method is applied to the similarity matrix between the raw data points and the second-layer representative points.
4.1. Graph Partitioning
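An association between an original point and a representative point signifies an edge in the bipartite graph. It is easy to verify that the adjacency matrix of the bipartite graph can be written as follows:
$$W = \begin{bmatrix} 0 & Z \\ Z^{T} & 0 \end{bmatrix}, \tag{8}$$
where $W$ denotes the adjacency matrix and $Z$ is the similarity matrix between the raw data points and the second-layer representative points.
Therefore, the degree matrix $D$ of $W$ can be obtained as follows:
$$D = \begin{bmatrix} D_1 & 0 \\ 0 & D_2 \end{bmatrix}, \tag{9}$$
where $D_1$ and $D_2$ are the diagonal matrices whose entries are $D_1(i,i) = \sum_{j} Z_{ij}$ and $D_2(j,j) = \sum_{i} Z_{ij}$, respectively. Obviously, the Laplacian matrix is
$$L = D - W = \begin{bmatrix} D_1 & -Z \\ -Z^{T} & D_2 \end{bmatrix}. \tag{10}$$
To partition the bipartite graph, the optimization problem of finding the minimum normalized cut can, with suitable relaxation, be formalized as finding the second eigenvector of the generalized eigenvalue problem [1]:
$$L f = \lambda D f. \tag{11}$$
Substituting equations (9) and (10) into equation (11) and writing $f = [x^{T}, y^{T}]^{T}$, we get
$$\begin{bmatrix} D_1 & -Z \\ -Z^{T} & D_2 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \lambda \begin{bmatrix} D_1 & 0 \\ 0 & D_2 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}. \tag{12}$$
Furthermore, equation (12) can be written as follows:
$$D_1^{1/2} x - D_1^{-1/2} Z y = \lambda D_1^{1/2} x, \qquad -D_2^{-1/2} Z^{T} x + D_2^{1/2} y = \lambda D_2^{1/2} y. \tag{13}$$
Letting $u = D_1^{1/2} x$ and $v = D_2^{1/2} y$, and after a little algebraic manipulation, equation (13) can be written as follows:
$$D_1^{-1/2} Z D_2^{-1/2} v = (1 - \lambda)\, u, \qquad D_2^{-1/2} Z^{T} D_1^{-1/2} u = (1 - \lambda)\, v. \tag{14}$$
Obviously, equation (14) is the singular value decomposition of the normalized matrix $\hat{Z} = D_1^{-1/2} Z D_2^{-1/2}$: $u$ is the left singular vector, $v$ is the right singular vector, and $1 - \lambda$ is the corresponding singular value. Thus, instead of directly computing the eigenvectors of (12), we prefer to compute the left and right singular vectors of $\hat{Z}$ to speed up the algorithm.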
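The link between the singular value decomposition and the generalized eigenvalue problem can be checked numerically on a small example. The sketch below uses NumPy and a made-up 4-by-3 similarity matrix; the variable names are illustrative assumptions. It builds the bipartite Laplacian explicitly, takes the second singular triplet of the normalized matrix, and measures how well the rescaled vectors satisfy the generalized eigenvalue equation with eigenvalue one minus the singular value.

```python
import numpy as np

# Made-up similarity matrix between 4 data points and 3 representative points.
Z = np.array([[0.9, 0.1, 0.0],
              [0.6, 0.4, 0.0],
              [0.0, 0.3, 0.7],
              [0.0, 0.1, 0.9]])

d1 = Z.sum(axis=1)  # row sums: diagonal of D_1
d2 = Z.sum(axis=0)  # column sums: diagonal of D_2

# Normalized matrix D_1^{-1/2} Z D_2^{-1/2}.
Z_hat = Z / np.sqrt(np.outer(d1, d2))
U, s, Vt = np.linalg.svd(Z_hat, full_matrices=False)

# Second singular triplet, mapped back to x = D_1^{-1/2} u and y = D_2^{-1/2} v.
u, v, sigma = U[:, 1], Vt[1], s[1]
f = np.concatenate([u / np.sqrt(d1), v / np.sqrt(d2)])

# Explicit bipartite adjacency, degree, and Laplacian matrices.
n, m = Z.shape
W = np.block([[np.zeros((n, n)), Z], [Z.T, np.zeros((m, m))]])
D = np.diag(np.concatenate([d1, d2]))
L = D - W

# Residual of L f = lambda D f with lambda = 1 - sigma; should be near zero.
residual = np.linalg.norm(L @ f - (1.0 - sigma) * (D @ f))
print(residual)
```

The SVD here works on an n-by-m matrix rather than on the (n+m)-by-(n+m) Laplacian, which is the source of the speed-up when the number of representative points is much smaller than the number of samples.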
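In order to achieve multipartitioning, the data set can be formed as follows:
$$F = \begin{bmatrix} D_1^{-1/2} U \\ D_2^{-1/2} V \end{bmatrix}, \tag{15}$$
where the columns of $U$ and $V$ are the left and right singular vectors of the normalized matrix $\hat{Z} = D_1^{-1/2} Z D_2^{-1/2}$ that correspond to the smallest eigenvalues $\lambda$ of the generalized eigenvalue problem, that is, to the largest singular values $1 - \lambda$.
To approximate the optimal multipartitioning, we can look for points $m_1, \ldots, m_c$ such that the sum of squares is minimized [27]:
$$\min_{m_1, \ldots, m_c}\ \sum_{j=1}^{c} \sum_{F_i \in \mathcal{C}_j} \left\| F_i - m_j \right\|_2^2, \tag{16}$$
where $F_i$ is the $i$th row of $F$, $\mathcal{C}_j$ is the $j$th cluster, and $c$ is the number of clusters.
A (local) minimum of (16) can be obtained by the classical k-means algorithm.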
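The co-clustering step can be sketched as follows, using NumPy. This is a simplified illustration rather than the authors' code: it assumes the leading `n_clusters` singular vectors are used for the embedding, and it uses a small Lloyd-style k-means with a deterministic farthest-point initialization. The function names `kmeans` and `cocluster` are hypothetical.

```python
import numpy as np


def kmeans(points, n_clusters, n_iter=50):
    """Basic Lloyd iterations with farthest-point initialization."""
    centers = [points[0]]
    for _ in range(n_clusters - 1):
        dist = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def cocluster(Z, n_clusters):
    """Jointly cluster rows (data points) and columns (representative points) of Z."""
    d1, d2 = Z.sum(axis=1), Z.sum(axis=0)
    Z_hat = Z / np.sqrt(np.outer(d1, d2))
    U, s, Vt = np.linalg.svd(Z_hat, full_matrices=False)
    # Assumption: keep the singular vectors with the largest singular values.
    U_c, V_c = U[:, :n_clusters], Vt[:n_clusters].T
    # Stack the rescaled left and right singular vectors into one data set.
    embedding = np.vstack([U_c / np.sqrt(d1)[:, None], V_c / np.sqrt(d2)[:, None]])
    labels = kmeans(embedding, n_clusters)
    n = Z.shape[0]
    return labels[:n], labels[n:]
```

The first returned array labels the raw data points and the second labels the representative points; both come from a single k-means run on the stacked embedding, which is what lets the two sides inform each other.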
4.2. Representative PointBased Spectral Clustering
As mentioned, two-layer representative points can be successively generated by BKHK for large-scale spectral clustering, and the similarity graph can then be constructed. Co-clustering can then be performed on the final similarity matrix between the original data points and the representative points. We therefore propose a novel spectral clustering approach for large-scale data, called representative point-based spectral clustering (RPSC). The detailed algorithm is summarized in Algorithm 1.

5. Experiments
In this section, several experiments are conducted to demonstrate the effectiveness and efficiency of the proposed RPSC.
5.1. Data Sets Description
In the experiments, four large-scale data sets are collected to illustrate the performance of different spectral clustering methods. These data sets include two handwritten digit data sets (USPS and MNIST), one Connect-4 game data set (CONNECT4), and one data set of capital letters in the English alphabet (LETTER). The data sets were downloaded from the UCI machine learning repository and the LibSVM data sets page. A brief description is given below:
USPS. A data set of handwritten digits (0–9). It contains 9298 samples from ten classes. Each sample has 256 features.
MNIST. A data set of handwritten digits (0–9). It has 70000 samples from ten classes. Each sample has 784 features.
CONNECT4. A data set of the Connect-4 game. It consists of 67557 samples from three classes. Each sample has 126 features.
LETTER. A data set of the 26 capital letters in the English alphabet. It is composed of 20000 samples from 26 classes. Each sample has 16 features.
5.2. Evaluation Metric
All code in the experiments was implemented in MATLAB R2016a and run on a Windows 8.1 machine with 32 GB of main memory. Every method was run 10 times, and the mean results were recorded. The average clustering accuracy and the average clustering time are displayed in Tables 1 and 2, respectively. The clustering accuracy and the clustering time for different numbers of two-layer representative points on two data sets are recorded in Figures 1–4.
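The text does not spell out how clustering accuracy is computed. A common convention, assumed here, is to map each predicted cluster to a distinct ground-truth class so that the number of matching samples is maximized, and to report the matched fraction. The brute-force sketch below is only practical for a handful of classes (larger problems would normally use an assignment solver), and the function name `clustering_accuracy` is hypothetical.

```python
from itertools import permutations


def clustering_accuracy(true_labels, pred_labels):
    """Best-match accuracy over one-to-one mappings from clusters to classes."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(pred_labels))
    if len(clusters) > len(classes):
        raise ValueError("this sketch assumes no more clusters than classes")
    best = 0
    # Try every way of assigning a distinct class to each cluster.
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        hits = sum(mapping[p] == t for t, p in zip(true_labels, pred_labels))
        best = max(best, hits)
    return best / len(true_labels)


def mean_accuracy(true_labels, runs):
    """Average accuracy over several runs, e.g. 10 repetitions of one method."""
    return sum(clustering_accuracy(true_labels, r) for r in runs) / len(runs)
```

Averaging over repeated runs, as in `mean_accuracy`, mirrors the protocol of running every method 10 times and recording the mean.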


5.3. Experimental Results
The clustering accuracy of the five methods is reported in Table 1. The conventional algorithm SC can only run on the smaller data sets USPS and LETTER, and its clustering accuracy is low. LSCK achieves high clustering accuracy on USPS and MNIST, but it does not perform well on LETTER and CONNECT4. LSCR has low clustering accuracy on almost all data sets. KASP performs well on only one of the four data sets. The proposed RPSC achieves high accuracy on all data sets. Table 2 summarizes the clustering time comparison on the four data sets. SC and LSCK have high computational cost. KASP has low computational cost on the smaller data sets, but its time cost is very high on the large data sets. LSCR is efficient, but its accuracy is poor. The proposed RPSC is efficient on all data sets, especially the large ones. Overall, RPSC is the best choice for large data sets among the compared approaches.
Several experiments were conducted on CONNECT4 and LETTER to demonstrate the influence of the parameter (the number of two-layer representative points) on clustering time and performance. Figures 1–4 show that more representative points increase the clustering accuracy at first, but an overly large number of representative points brings no further benefit. Moreover, increasing the number of representative points increases the time cost.
6. Conclusions
In this paper, we proposed a novel representative point-based spectral clustering approach, named RPSC, based on the two-layer bipartite graph. First, two-layer representative points are generated successively by BKHK. Then, the similarity matrices between adjacent layers are constructed. Different from the conventional kernel-based neighbor assignment strategy, we adopt a parameter-free yet effective neighbor assignment method, which avoids the need to tune the heat-kernel parameter. Furthermore, the final similarity matrix can be easily obtained by multiplying the similarity matrices between adjacent layers. Finally, co-clustering is performed on the final similarity matrix. The co-clustering mechanism takes advantage of the co-occurring cluster structure among the last-layer representative points and the original data to strengthen the clustering performance. As a result, the computational complexity can be greatly reduced and the clustering accuracy can be improved. Extensive experiments conducted on four large data sets demonstrate the efficiency and effectiveness of the proposed RPSC in terms of computational speed and clustering accuracy. In future work, we will consider designing a more efficient representative point generation algorithm. Furthermore, a clustering ensemble based on the proposed method is an interesting direction for further research.
Data Availability
The data used to support the findings of this study were downloaded from the UCI machine learning repository and the LibSVM data sets page.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This research was supported by the National Key Research and Development Project (grant nos. 2017YFC0403600 and 2017YFC0403604), National for Young Scientists of China (no. 61702185), Innovation Scientists and Technicians Troop Construction Projects of Henan Province (2014), Key Scientific and Research Project in University of Henan Province (nos. 15A520021, 15A510003, and 18A520034), Henan Province Science and Technology Research Program (no. 172102210050), Open Research Foundation of Key Laboratory of Sediments in Chinese Ministry of Water Resources (no. 2017001), and Innovation Fund for Ph.D. candidate of North China University of Water Resources and Electric Power (2015).
References
[1] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000.
[2] L. Zelnik-Manor and P. Perona, “Self-tuning spectral clustering,” in Proceedings of the Advances in Neural Information Processing Systems, pp. 1601–1608, Vancouver, Canada, December 2005.
[3] F. Nie, Z. Zeng, I. W. Tsang, D. Xu, and C. Zhang, “Spectral embedded clustering: a framework for in-sample and out-of-sample spectral clustering,” IEEE Transactions on Neural Networks, vol. 22, no. 11, pp. 1796–1808, 2011.
[4] U. Shaham, K. Stanton, H. Li, B. Nadler, R. Basri, and Y. Kluger, “Spectral clustering using deep neural networks,” 2018, https://arxiv.org/abs/1801.01587.
[5] M. Filippone, F. Camastra, F. Masulli, and S. Rovetta, “A survey of kernel and spectral methods for clustering,” Pattern Recognition, vol. 41, no. 1, pp. 176–190, 2008.
[6] A. Ng, M. Jordan, and Y. Weiss, “On spectral clustering: analysis and an algorithm,” in Proceedings of the Advances in Neural Information Processing Systems, pp. 849–856, Vancouver, Canada, December 2002.
[7] X. Chen and D. Cai, “Large scale spectral clustering with landmark-based representation,” in Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, pp. 313–318, San Francisco, CA, USA, August 2011.
[8] N. Tremblay, G. Puy, R. Gribonval, and P. Vandergheynst, “Compressive spectral clustering,” in Proceedings of the International Conference on Machine Learning, pp. 1002–1011, New York, NY, USA, June 2016.
[9] C. Fowlkes, S. Belongie, F. Fan Chung, and J. Malik, “Spectral grouping using the Nyström method,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 2, pp. 214–225, 2004.
[10] B. Chen, B. Gao, T. Liu, Y. Chen, and W. Ma, “Fast spectral clustering of data using sequential matrix compression,” in Proceedings of the European Conference on Machine Learning, pp. 590–597, Berlin, Germany, September 2006.
[11] M. Li, J. Kwok, and B. Lu, “Making large-scale Nyström approximation possible,” in Proceedings of the 27th International Conference on Machine Learning, pp. 631–638, Haifa, Israel, June 2010.
[12] F. Chung and F. Graham, Spectral Graph Theory, American Mathematical Society, Providence, RI, USA, 1997.
[13] U. von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
[14] Y. Wang and L. Wu, “Beyond low-rank representations: orthogonal clustering basis reconstruction with optimized graph structure for multi-view spectral clustering,” Neural Networks, vol. 103, pp. 1–8, 2018.
[15] L. Houthuys, R. Langone, and J. A. K. Suykens, “Multi-view kernel spectral clustering,” Information Fusion, vol. 44, pp. 46–56, 2018.
[16] Z. Tao, H. Liu, S. Li, Z. Ding, and Y. Fu, “Robust spectral ensemble clustering via rank minimization,” ACM Transactions on Knowledge Discovery from Data, vol. 13, no. 1, pp. 1–25, 2019.
[17] D. Huang, C.-D. Wang, J. Wu, J.-H. Lai, and C. Kwoh, “Ultra-scalable spectral clustering and ensemble clustering,” IEEE Transactions on Knowledge and Data Engineering, 2019.
[18] J. Yang, L. Sun, and Q. Wu, “Constraint projections for semisupervised spectral clustering ensemble,” Concurrency and Computation: Practice and Experience, vol. 31, no. 20, 2019.
[19] Y. Zhao, Y. Yuan, F. Nie, and Q. Wang, “Spectral clustering based on iterative optimization for large-scale and high-dimensional data,” Neurocomputing, vol. 318, pp. 227–235, 2018.
[20] A. C. Türkmen, “A review of nonnegative matrix factorization methods for clustering,” 2015, https://arxiv.org/abs/1507.03194.
[21] T. Liu, M. Gong, and D. Tao, “Large-cone nonnegative matrix factorization,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 9, pp. 2129–2142, 2017.
[22] Y. Yang, Z. Ma, Y. Yang, F. Nie, and H. Shen, “Multitask spectral clustering by exploring intertask correlation,” IEEE Transactions on Cybernetics, vol. 45, no. 5, pp. 1069–1080, 2015.
[23] L. He, N. Ray, Y. Guan, and H. Zhang, “Fast large-scale spectral clustering via explicit feature mapping,” IEEE Transactions on Cybernetics, vol. 49, no. 3, pp. 1058–1071, 2019.
[24] T. Semertzidis, D. Rafailidis, M. G. Strintzis, and P. Daras, “Large-scale spectral clustering based on pairwise constraints,” Information Processing & Management, vol. 51, no. 5, pp. 616–624, 2015.
[25] H. Shinnou and M. Sasaki, “Spectral clustering for a large data set by reducing the similarity matrix size,” in Proceedings of the International Conference on Language Resources and Evaluation (LREC), pp. 201–204, Marrakesh, Morocco, May 2008.
[26] D. Yan, L. Huang, and M. I. Jordan, “Fast approximate spectral clustering,” in Proceedings of the 15th ACM International Conference on Knowledge Discovery and Data Mining, pp. 907–916, Paris, France, June 2009.
[27] W. Liu, J. He, and S. F. Chang, “Large graph construction for scalable semisupervised learning,” in Proceedings of the 27th International Conference on Machine Learning (ICML), pp. 679–686, Haifa, Israel, June 2010.
[28] J. Liu, C. Wang, M. Danilevsky, and J. Han, “Large-scale spectral clustering on graphs,” in Proceedings of the International Joint Conference on Artificial Intelligence, AAAI Press, pp. 1486–1492, Beijing, China, August 2013.
[29] D. Cai and X. Chen, “Large scale spectral clustering via landmark-based sparse representation,” IEEE Transactions on Cybernetics, vol. 45, no. 8, pp. 1669–1680, 2015.
[30] D. Cai, “Compressed spectral regression for efficient nonlinear dimensionality reduction,” in Proceedings of the 24th International Joint Conference on Artificial Intelligence, pp. 3359–3365, Buenos Aires, Argentina, July 2015.
[31] Y. Li, F. Nie, H. Huang, and J. Huang, “Large-scale multi-view spectral clustering via bipartite graph,” in Proceedings of the 29th AAAI Conference on Artificial Intelligence, pp. 2750–2756, Austin, TX, USA, January 2015.
[32] R. Wang, F. Nie, and W. Yu, “Fast spectral clustering with anchor graph for large hyperspectral images,” IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 11, pp. 2003–2007, 2017.
[33] T. Y. Liu, H. Y. Yang, X. Zheng, T. Qin, and W. Y. Ma, “Fast large-scale spectral clustering by sequential shrinkage optimization,” in Proceedings of the European Conference on Information Retrieval, pp. 319–330, Rome, Italy, April 2007.
[34] W. Zhu, F. Nie, and X. Li, “Fast spectral clustering with efficient large graph construction,” in Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2492–2496, New Orleans, LA, USA, March 2017.
[35] F. Nie, X. Wang, M. I. Jordan, and H. Huang, “The constrained Laplacian rank algorithm for graph-based clustering,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, February 2016.
Copyright
Copyright © 2019 Libo Yang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.