Spectral Nonlinearly Embedded Clustering Algorithm
As is well known, traditional spectral clustering (SC) methods are developed based on the manifold assumption, namely, that two nearby data points in the high-density region of a low-dimensional data manifold have the same cluster label. However, for some high-dimensional and sparse data, such an assumption might be invalid, and the clustering performance of SC degrades sharply in this case. To solve this problem, we propose a general spectral embedded framework, which embeds the true cluster assignment matrix for high-dimensional data into a nonlinear space by a predefined embedding function. Based on this framework, several algorithms are presented by using different embedding functions, which aim at simultaneously learning the final cluster assignment matrix and a transformation into a low-dimensional space. More importantly, the proposed method can naturally handle the out-of-sample extension problem. The experimental results on benchmark datasets demonstrate that the proposed method significantly outperforms existing clustering methods.
As one of the fundamental topics in data mining and machine learning, clustering has been successfully applied in various fields. Generally speaking, the target of clustering is to group the examples into a number of classes, or clusters. Over the past decades, a large family of clustering algorithms has been studied extensively; they fall mainly into two categories: generative clustering approaches and discriminative clustering models. Generative clustering approaches, for example, mixture models [1, 2], generally integrate Bayesian approaches into their models. However, generative models impose restrictive assumptions on the class-conditional densities, which might lead to unconvincing clustering results when these assumptions do not hold. Discriminative methods, such as spectral clustering (SC) and K-means clustering, learn discriminative models based on loss functions from unlabeled data through the low-density separation assumption.
Recently, discriminative clustering methods, such as the variants of kernel-based clustering and spectral clustering, have attracted renewed attention because they can readily capture nonlinear cluster structures. Motivated by the outstanding performance of the support vector machine (SVM) in supervised learning, maximum margin clustering (MMC) [5–7] methods have been developed to obtain a decision boundary that separates data points into different clusters to the utmost extent. Although these clustering methods can exploit nonlinear data structures, they are still sensitive to high-dimensional data points. For example, K-means clustering iteratively computes the distance between each data point and the center of each cluster; hence, its clustering performance depends heavily on the distance measure. For high-dimensional data, such as some image data, similarity computed with the Euclidean distance becomes unreliable, and the performance of K-means clustering degrades dramatically. SC performs clustering by utilizing the spectrum of the similarity matrix to discover the nonlinear, low-dimensional manifold structure of data points. In other words, it relies heavily on the manifold assumption [8, 9], namely, that two nearby data points of a low-dimensional manifold have the same class label. However, for high-dimensional and sparse data, the manifold assumption may not hold due to the bias caused by the curse of dimensionality. Nie et al. have validated that graph-based spectral clustering methods cannot always exploit the low-dimensional manifold structure, which results in the performance degradation of SC. Another challenge for traditional SC methods is that they do not solve the out-of-sample extension problem; that is, the discrete cluster assignment vectors for new unseen samples cannot be automatically obtained.
One existing algorithm takes advantage of the Nyström method to approximate the eigenfunction for the unseen data points, and another uses heuristics to evaluate the implicit eigenfunction for the new data points. However, the performance of these methods relies heavily on the estimated affinity matrix defined between training and new data points.
To further improve the clustering performance of SC for high-dimensional data, in this paper, we first propose a general spectral embedded clustering framework, which incorporates dimensionality reduction methods into the model of SC. Second, by using different low-dimensional embedding functions, we derive the corresponding optimization models and develop the spectral nonlinearly embedded algorithms based on extreme learning machine (ELM) and kernel functions, respectively. Our main contributions include the following:
(1) A general spectral embedded clustering framework is presented by imposing a linearity regularization on the objective function of SC. The proposed framework introduces dimensionality reduction of the training data by controlling the error between the cluster assignment matrix and the low-dimensional embedding of the data.
(2) Based on the proposed general framework, several models can be derived by using different embedding functions, which include the linear embedding functions and the nonlinear functions in Reproducing Kernel Hilbert Space (RKHS) as well as in ELM feature space. The spectral embedded clustering model (SEC) can be considered as a special case of the general framework.
(3) We prove that the spectral nonlinearly embedded clustering model based on ELM (ESEC) is an approximation of the kernel-based spectral nonlinearly embedded clustering (KSEC) method under some conditions. The fast spectral nonlinearly embedded clustering algorithm is proposed based on ESEC by utilizing the efficient learning ability of ELM.
(4) The out-of-sample extension problem can be naturally solved for the clustering methods under our proposed SEC framework.
(5) Experimental results on benchmark datasets demonstrate that the proposed ESEC outperforms the existing SC methods, K-means clustering, SEC, and KSEC for in-sample clustering. For out-of-sample clustering, ESEC also generalizes better than the Nyström method and achieves performance superior to K-means clustering, SEC, and KSEC.
The rest of this paper is organized as follows. Related works are introduced in Section 2. In Section 3, we present the general spectral embedded clustering framework and derive several different models by using different embedding functions. The relationship between ESEC and KSEC is demonstrated and the ESEC clustering algorithm is described in detail. In addition, clustering for out-of-sample data is also discussed. To validate our model, experimental results are reported in Section 4. Finally, we give conclusions and discuss future work in Section 5. To avoid confusion, a list of the main notations used in this paper is given in the Notations section.
2. Related Works
2.1. Spectral Clustering
Given a dataset $X = \{x_1, \ldots, x_n\}$, the main task of clustering is to partition $X$ into $c$ clusters. SC aims at finding a cluster assignment matrix of the training data by a weighted graph whose vertices are the points of $X$. Several SC algorithms have been proposed in [3, 13, 14]. In this paper, we mainly discuss the SC algorithm with k-way normalized cuts.
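Specifically, denote an undirected weighted graph by $G = (V, A)$, where $V$ is the vertex set and $A \in \mathbb{R}^{n \times n}$ represents an affinity matrix. Each entry $A_{ij}$ of the symmetric matrix $A$ records the edge weight that characterizes the similarity between the pair of vertices $x_i$ and $x_j$. $A$ is commonly defined by
$$A_{ij} = \begin{cases} \exp\left(-\|x_i - x_j\|^2 / \sigma^2\right), & \text{if } x_i \text{ and } x_j \text{ are neighbors}, \\ 0, & \text{otherwise}. \end{cases} \tag{1}$$
The graph Laplacian is defined by $D - A$, where $D$ is a diagonal matrix with the diagonal elements $D_{ii} = \sum_j A_{ij}$. Based on the normalized cut criterion, where the size of a subset of a graph is measured by the weights of its edges and the normalized Laplacian matrix is used, the optimization problem can be transformed into the following trace maximization problem:
$$\max_{Z^T Z = I_c} \operatorname{tr}\left(Z^T D^{-1/2} A D^{-1/2} Z\right), \tag{2}$$
where $I_c$ denotes the identity matrix of size $c$ by $c$ and $Z \in \mathbb{R}^{n \times c}$ represents the cluster assignment matrix with continuous values by relaxation. The optimal solution of (2) can then be obtained by eigenvalue decomposition of the matrix $D^{-1/2} A D^{-1/2}$.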
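The relaxed solution described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `relaxed_spectral_embedding`, the variable names, and the toy graph are assumptions introduced here, and the sketch assumes every vertex has positive degree.

```python
import numpy as np

def relaxed_spectral_embedding(A, c):
    """Return an n-by-c relaxed assignment matrix for a symmetric affinity A.

    Assumes every vertex has positive degree (no isolated vertices).
    """
    d = A.sum(axis=1)                                   # vertex degrees
    s = 1.0 / np.sqrt(d)
    S = s[:, None] * A * s[None, :]                     # D^{-1/2} A D^{-1/2}
    _, eigvecs = np.linalg.eigh(S)                      # eigenvalues in ascending order
    return eigvecs[:, -c:]                              # eigenvectors of the c largest

# Toy graph (hypothetical): two disconnected triangles.
A = np.zeros((6, 6))
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
np.fill_diagonal(A, 0.0)
Z = relaxed_spectral_embedding(A, 2)
```

On this toy graph the rows of `Z` that belong to the same triangle coincide, so any row-wise clustering step separates the two components.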
2.2. Extreme Learning Machine
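The output function of ELM for generalized single-hidden-layer feedforward neural networks (SLFNs) in the case of one output node is
$$f(x) = \sum_{i=1}^{L} \beta_i h_i(x) = h(x)\beta, \tag{3}$$
where $\beta = [\beta_1, \ldots, \beta_L]^T$ is the vector of the output weights between the hidden layer of $L$ nodes and the output node and $h(x) = [h_1(x), \ldots, h_L(x)]$ is the output (row) vector of the hidden layer with respect to the input $x$. In fact, $h(x)$ maps the data from the $d$-dimensional input space to the $L$-dimensional hidden-layer feature space (ELM feature space). ELM minimizes the training error as well as the norm of the output weights:
$$\min_{\beta} \; \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\|H\beta - T\|^2, \tag{4}$$
where $C$ is a tradeoff parameter between the complexity and fitness of the decision function, $T$ is the matrix of training targets, and $H$ is the hidden-layer output matrix denoted by
$$H = \begin{bmatrix} h(x_1) \\ \vdots \\ h(x_n) \end{bmatrix}. \tag{5}$$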
Similar to the support vector machine (SVM), minimizing the norm of the output weights $\beta$ is equivalent to maximizing the distance between the separating margins of the two classes in the ELM feature space, $2/\|\beta\|$, and thereby controls the complexity of the function in the ELM feature space.
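As an illustration of the tradeoff between training error and output-weight norm, the following sketch fits a regularized ELM in closed form. It is a sketch under stated assumptions rather than the paper's code: the tanh activation, the Gaussian random weights, the objective scaling $\frac{1}{2}\|\beta\|^2 + \frac{C}{2}\|H\beta - T\|^2$, and the names `elm_fit` and `elm_predict` are all choices made here for illustration.

```python
import numpy as np

def elm_fit(X, T, n_hidden=50, C=100.0, seed=0):
    """Fit a single-hidden-layer ELM with random, untrained input weights."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))   # random input weights
    b = rng.normal(size=n_hidden)                 # random hidden biases
    H = np.tanh(X @ W + b)                        # hidden-layer output matrix (n x L)
    # Closed-form minimizer of 0.5*||beta||^2 + 0.5*C*||H beta - T||^2
    beta = np.linalg.solve(H.T @ H + np.eye(n_hidden) / C, H.T @ T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Evaluate f(x) = h(x) beta for each row of X."""
    return np.tanh(X @ W + b) @ beta

# Toy regression problem (hypothetical data).
X = np.linspace(-3.0, 3.0, 200).reshape(-1, 1)
T = np.sin(X)
W, b, beta = elm_fit(X, T)
pred = elm_predict(X, W, b, beta)
```

Only `beta` is learned; the hidden layer stays fixed after random initialization, which is what makes the training step a single linear solve.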
3. General Spectral Embedded Clustering Framework
As mentioned above, SC methods greatly depend on the construction of the affinity matrix. Some high-dimensional data might not exhibit an evident low-dimensional manifold structure. In this case, the clustering performance of SC may be inferior to that of K-means clustering.
In the following subsections, we will firstly propose a general spectral embedded clustering framework, which incorporates a linearity regularization into the traditional normalized SC model. By using different embedding functions, this framework can generate a family of spectral embedded clustering algorithms, such as SEC, KSEC, and ESEC. Secondly, we demonstrate the relationship between ESEC and KSEC. The ESEC algorithm is then proposed for high-dimensional data clustering. Finally, the out-of-sample extension problem is discussed for our proposed ESEC method.
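Generally, clustering models of traditional SC methods can be transformed into the following minimization problem:
$$\min_{Z^T Z = I_c} \operatorname{tr}\left(Z^T \tilde{L} Z\right), \tag{6}$$
where $\tilde{L} = I_n - D^{-1/2} A D^{-1/2}$ is the normalized Laplacian matrix, with $A$ the affinity matrix and $D$ its diagonal degree matrix, and $Z$ is the relaxed cluster assignment matrix.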
To make use of the underlying dense grouping structure of data in a low-dimensional subspace, the proposed general framework introduces a regularization term into the optimization problem (6), which controls the error between the cluster assignment matrix and the low-dimensional embedding of the data. Specifically, we minimize the following objective function:where and are two regularization parameters and is the low-dimensional embedding of training data. The second term represents the error between the relaxed cluster assignment matrix and the low-dimensional embedding of the data. The third term is the norm penalty of and represents the complexity of functions in a high-dimensional feature space.
In dimensionality reduction, linear embedding functions and nonlinear embedding functions are commonly used to address out-of-sample problems. This is due to the fact that they contain few parameters, which are not expensive in computational time and memory. In this paper, we mainly discuss kernel-based and ELM-based nonlinear embedding functions.
If we use a nonlinear embedding function in RKHS, that is, , then , where is a symmetric kernel matrix and ; problem (7) can be rewritten aswhich is referred to as KSEC.
Alternatively, if we consider an embedding function in ELM feature space, that is, , then , where represents the hidden-layer output matrix of ELM. Problem (7) can be reformulated as which is referred to as ESEC.
To solve the optimization problem (9), we first transform it into a simpler form and obtain the following theorem.
Theorem 1. The optimization problems (9) can be transformed into the following minimization problem:where and denotes the identity matrix of size n by n.
Proof. Problem (9) is firstly transformed into the following form:where .
By setting the derivative of the objective function (12) with respect to the coefficient matrix of the embedding function to zero, we obtain (13). By substituting (13) into (12), the optimization problem (12) becomes (14), which can be denoted as (15). This completes the proof of Theorem 1.
Based on Theorem 1, the relaxed cluster assignment matrix of KSEC can be obtained by computing the eigenvectors of the matrix in (11) corresponding to its smallest eigenvalues; these eigenvectors form the columns of the relaxed assignment matrix. Finally, the discrete-valued cluster assignment matrix can be obtained by clustering the rows of the relaxed assignment matrix.
To inherit the fast learning speed of ELM, we mainly discuss ESEC based on ELM with multiple outputs, since ELM with a single output can be regarded as a special case of it. We have the following theorem on ESEC, which is the foundation of the proposed ESEC algorithm.
Theorem 2. The optimization problem (10) can be transformed into the following minimization problem:where or . denotes the identity matrix of size by and is the number of hidden layer nodes in ELM.
Proof. By setting the derivative of the objective function (10) with respect to the output weight matrix to zero, we obtain (17). By substituting (17) into (10), the optimization problem (10) becomes (18). Problem (18) can be further transformed into the objective function (19), which can be denoted as (20). The matrix in (20) can also be transformed into another form, given in (21). This completes the proof of Theorem 2.
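The elimination step used in this proof can be written out explicitly under an assumed form of the objective. The sketch below assumes the ESEC problem reads $\min \operatorname{tr}(Z^T \tilde{L} Z) + \mu\left(\|H\beta - Z\|_F^2 + \gamma\|\beta\|_F^2\right)$ subject to $Z^T Z = I_c$. The symbols $Z$, $\beta$, $H$, $\tilde{L}$, $\mu$, $\gamma$ and the placement of the two regularization parameters are assumptions made for illustration, so constants may differ from those in (16) to (21).

```latex
\begin{aligned}
\frac{\partial}{\partial \beta}\Big(\|H\beta - Z\|_F^2 + \gamma\|\beta\|_F^2\Big) = 0
  &\;\Longrightarrow\; \beta = (H^{T}H + \gamma I_L)^{-1} H^{T} Z, \\
\|H\beta - Z\|_F^2 + \gamma\|\beta\|_F^2
  &= \operatorname{tr}\!\Big(Z^{T}\big(I_n - H(H^{T}H + \gamma I_L)^{-1}H^{T}\big)Z\Big), \\
\min_{Z^{T}Z = I_c}\ \operatorname{tr}\big(Z^{T} M Z\big), \qquad
M &= \tilde{L} + \mu\Big(I_n - H(H^{T}H + \gamma I_L)^{-1}H^{T}\Big), \\
H(H^{T}H + \gamma I_L)^{-1}H^{T} &= HH^{T}(HH^{T} + \gamma I_n)^{-1}.
\end{aligned}
```

The last identity gives two equivalent expressions for $M$: the first inverts an $L \times L$ matrix and the second an $n \times n$ matrix, so the cheaper one can be chosen according to whether $L$ or $n$ is smaller.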
ESEC makes good use of an embedding function in ELM feature space instead of RKHS. Thus, the form of ESEC is similar to that of KSEC. It can be proved that there is a link between ESEC and KSEC. We have the following theorem.
Theorem 3. If the mapping in ELM is , where denotes any kernel function and L is the number of hidden nodes in ELM and ( is the parameter of kernel function ) are random sampling points from any continuous probability distribution, then ESEC is an approximation of KSEC by discretizing the embedding function .
Proof. Since in RKHS, can be denoted aswhere . Let and ; thenThus, we approximately derive the embedding function of ESEC from KSEC. This completes the proof of Theorem 3.
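The idea behind Theorem 3, building ELM hidden nodes from kernel evaluations at randomly sampled points, can be illustrated as follows. This is only a sketch: the Gaussian (RBF) kernel, the `width` parameter, the absence of any scaling factor, and the name `rbf_hidden_layer` are assumptions introduced here, not details taken from the theorem.

```python
import numpy as np

def rbf_hidden_layer(X, centers, width=1.0):
    """Hidden-layer outputs h(x) = [k(x, a_1), ..., k(x, a_L)] with an RBF kernel k."""
    sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * width ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # hypothetical training data
centers = rng.normal(size=(30, 4))     # L = 30 sampling points from a continuous distribution

H = rbf_hidden_layer(X, centers)       # ELM-style feature matrix, shape (n, L)
K = rbf_hidden_layer(X, X)             # sampling points = training points gives the kernel matrix
```

With the sampling points set to the training points themselves, the hidden-layer output matrix coincides with the kernel matrix, which is the sense in which an ELM embedding with fewer random points acts as a coarser version of the kernel expansion.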
The proposed ESEC algorithm (Algorithm 1) is described as follows.
Input. The input is the training dataset and the number of clusters .
Output. The output is the class assignment matrix of cluster .
Step 1. Construct the graph Laplacian from .
Step 2. Randomly generate input weights and initiate an ELM network of hidden neurons; calculate the output matrix of the hidden layer.
Step 3. If ; let .
Step 4. Compute the matrix .
Step 5. Find the eigenvectors of corresponding to the smallest eigenvalues, which form the optimal .
Step 6. Treat each row of as a new training sample, and use the K-means algorithm to cluster the training samples into clusters. Let be the final discrete class assignment matrix of cluster for training data.
Return the class assignment matrix of cluster .
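Steps 1 to 5 above can be sketched in NumPy as follows. The code assumes an objective of the form $\operatorname{tr}(Z^T \tilde{L} Z) + \mu(\|H\beta - Z\|_F^2 + \gamma\|\beta\|_F^2)$; that form, the parameter names `mu` and `gamma`, the tanh hidden layer, and the function names are assumptions made for illustration rather than the authors' exact formulation. Step 6 (K-means on the rows of `Z`) is left to any standard K-means implementation.

```python
import numpy as np

def normalized_laplacian(A):
    """I - D^{-1/2} A D^{-1/2} for a symmetric affinity A with positive degrees."""
    s = 1.0 / np.sqrt(A.sum(axis=1))
    return np.eye(A.shape[0]) - s[:, None] * A * s[None, :]

def esec_relaxed(A, H, c, mu=1.0, gamma=1.0):
    """Return the relaxed assignment matrix Z (n x c) and output weights beta (L x c)."""
    n, L = H.shape
    if L <= n:
        # Invert an L x L matrix when there are fewer hidden nodes than samples.
        P = H @ np.linalg.solve(H.T @ H + gamma * np.eye(L), H.T)
    else:
        # Otherwise use the equivalent n x n form.
        G = H @ H.T
        P = np.linalg.solve(G + gamma * np.eye(n), G)
    M = normalized_laplacian(A) + mu * (np.eye(n) - P)
    M = (M + M.T) / 2.0                      # remove round-off asymmetry
    _, vecs = np.linalg.eigh(M)              # ascending eigenvalues
    Z = vecs[:, :c]                          # eigenvectors of the c smallest eigenvalues
    beta = np.linalg.solve(H.T @ H + gamma * np.eye(L), H.T @ Z)
    return Z, beta

# Toy data (hypothetical): two Gaussian blobs in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
A = np.exp(-sq)
np.fill_diagonal(A, 0.0)
H = np.tanh(X @ rng.normal(size=(2, 10)) + rng.normal(size=10))   # random hidden layer
Z, beta = esec_relaxed(A, H, c=2)
```

Returning `beta` alongside `Z` keeps the learned embedding available for later use on unseen points.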
3.3. Computational Complexity
From Algorithm 1, we can see that the most costly computation is computing the matrix and carrying out the eigen-decomposition of . If , computing needs to obtain the inversion of , whose computational complexity is . In addition, the computational complexity of eigenvalue decomposition of is . Thus, the total computational complexity of ESEC is , where . Correspondingly, for KSEC, computational complexity of calculating is and its total computational complexity is . Consequently, ESEC has lower computational complexity than KSEC.
3.4. Clustering for Out-of-Sample Data
By performing Algorithm 1, we can obtain the cluster assignment matrix for the training data. Thus, can be easily computed by using formula (17). Then, for any new data point , we can obtain the prediction result In this paper, we use the spectral rotation method to calculate the discrete cluster assignment vector for . Firstly, an orthogonal matrix is computed by the following spectral rotation method:where and denote the and vectors of all 1s, respectively. is an orthogonal matrix and is defined bywhere represents a diagonal matrix with the same diagonal elements as the square matrix . Secondly, the discrete cluster assignment vector for is calculated as follows:Finally, the class of the data point iswhere is the ith element in the vector .
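The out-of-sample step can be illustrated with a simplified sketch. Note the substitution: instead of the spectral rotation procedure described above, the code below discretizes the predicted embedding by assigning each new point to the nearest cluster centroid in the embedded space. The function name, arguments, and toy numbers are hypothetical and chosen only to show how learned output weights are reused on unseen data.

```python
import numpy as np

def out_of_sample_labels(H_new, beta, Z_train, labels_train):
    """Embed new points with the learned weights and assign the nearest centroid.

    H_new        : hidden-layer outputs of the new points, shape (m, L)
    beta         : learned output weights, shape (L, c)
    Z_train      : relaxed assignment matrix of the training data, shape (n, c)
    labels_train : discrete training labels, shape (n,)
    """
    Z_new = H_new @ beta                                  # predicted relaxed assignments
    classes = np.unique(labels_train)
    centroids = np.stack([Z_train[labels_train == k].mean(axis=0) for k in classes])
    d2 = ((Z_new[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d2.argmin(axis=1)]

# Tiny hand-made example (hypothetical numbers).
Z_train = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels_train = np.array([0, 0, 1, 1])
beta = np.eye(2)
H_new = np.array([[0.8, 0.2], [0.2, 0.7]])
new_labels = out_of_sample_labels(H_new, beta, Z_train, labels_train)
```

No affinity between new and training points is needed here, which is the practical difference from Nyström-style extensions.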
To evaluate the in-sample and out-of-sample clustering performance of different clustering methods, we test all algorithms on UCI datasets (Iris, Glass, Wine, WPBC, SpectfHeart, and Isolet (http://archive.ics.uci.edu/ml/datasets.html)), face recognition datasets (Yale (http://vision.ucsd.edu/~leekc/ExtYaleDatabase/Yale%20Face%20Database.htm), ORL (http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html)), digits recognition datasets (USPS (http://www-i6.informatik.rwth-aachen.de/~keysers/usps.html)), and object recognition datasets (COIL-20 (http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php)). Some datasets are resized, and the basic information of the datasets is listed in Table 1. All the experiments have been performed in MATLAB R2013a running on a 3.10 GHz Intel Core i5-2400 with 4 GB RAM.
In in-sample clustering, we assign a cluster label to each unlabeled in-sample data point. The proposed ESEC algorithm is compared with K-means (KM) clustering, SC, SEC, and KSEC. For KM, an EM-like algorithm is used to assign cluster labels. In out-of-sample clustering, for KM, each unseen data point is assigned to the closest cluster center learned from the in-sample data points. We use the proposed out-of-sample approach to cope with unseen data for ESEC. A similar out-of-sample method is also used for KSEC and SEC by using different embedding functions. Since the Nyström method can be used to deal with unseen data, we also compare ESEC with the Nyström method for out-of-sample SC.
4.1. Experimental Setup
Each dataset is randomly divided into seen and unseen samples, and we use the seen data to obtain the optimal parameters of different clustering methods by cross-validation. Then, we use the unseen data to test the performance of all algorithms using the obtained optimal parameters. In the experiments, 80% of the data are randomly selected as seen data and the remaining data are used as unseen data.
The self-tuning SC method  is used to determine the parameter in (1) for SC, SEC, KSEC, and ESEC. The parameter K of K-nearest-neighbor graph is set to 5 empirically. For fair comparisons, we set the parameters in SEC, KSEC, and ESEC as 1 and select the parameter in these methods from .
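A K-nearest-neighbor affinity with self-tuned local scales can be sketched as below. This follows the general local-scaling idea of self-tuning spectral clustering, where each point gets its own scale from the distance to its k-th neighbor; using the same `k` for both the graph and the local scale, the exact exponent, and the function name are assumptions made here, not details confirmed by the text. The sketch also assumes there are no duplicate points.

```python
import numpy as np

def self_tuning_knn_affinity(X, k=5):
    """Symmetric k-NN affinity with local scales sigma_i (distance to the k-th neighbor)."""
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # squared distances
    dist = np.sqrt(sq)
    order = np.argsort(dist, axis=1)
    knn = order[:, 1:k + 1]                   # column 0 is the point itself
    sigma = dist[np.arange(n), knn[:, -1]]    # local scale of each point
    A = np.exp(-sq / (sigma[:, None] * sigma[None, :]))
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), k), knn.ravel()] = True
    mask |= mask.T                            # keep an edge if either point lists the other
    A = np.where(mask, A, 0.0)
    np.fill_diagonal(A, 0.0)
    return A

rng = np.random.default_rng(0)
X = rng.uniform(size=(30, 2))                 # hypothetical data
A = self_tuning_knn_affinity(X, k=5)
```

Symmetrizing the neighbor mask with a logical OR guarantees that every vertex keeps at least `k` edges, so no row of the affinity matrix is empty.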
For the Nyström SC method, we set the same , where , with being the mean value of the squared distance between the in-sample data as suggested in  and . For the in-sample clustering, the best clustering results obtained with the best parameters for SEC, KSEC, and ESEC are reported in Table 2. For ESEC, we use the RBF kernel as the hidden node function, and a grid search over the number of hidden nodes is conducted to seek the optimal result by using fivefold cross-validation. Using the optimal parameters from the in-sample setting, the results for the out-of-sample clustering are obtained and reported in Table 4.
It should be noted that the results of all clustering methods rely on the initialization. To obtain statistical results over different parameters and random partitions, all clustering algorithms are independently repeated 50 times, and we report the mean clustering result and standard deviation using the best parameters on the seen and unseen data. In the experiments, we set the number of clusters to the number of classes in each dataset. The clustering accuracy (ACC) and time cost are used to evaluate the clustering performance.
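Clustering accuracy is commonly computed by finding the best one-to-one matching between predicted clusters and true classes and counting the matched points. The sketch below uses that common definition, which is assumed here to match the one the paper refers to; the function name `clustering_accuracy` is hypothetical, and the matching is solved with SciPy's `linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Fraction of points correctly labeled under the best cluster-to-class matching."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes, t = np.unique(y_true, return_inverse=True)
    clusters, p = np.unique(y_pred, return_inverse=True)
    counts = np.zeros((len(clusters), len(classes)), dtype=int)
    np.add.at(counts, (p, t), 1)                      # contingency table
    rows, cols = linear_sum_assignment(-counts)       # maximize the matched counts
    return counts[rows, cols].sum() / len(y_true)

acc_perfect = clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
acc_partial = clustering_accuracy([0, 0, 1, 1], [0, 0, 0, 1])
```

Because cluster indices are arbitrary, a permuted but otherwise perfect labeling scores 1.0, while one misplaced point out of four scores 0.75.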
4.2. In-Sample Clustering Experiments
To compare the clustering performance of the various algorithms, we report the in-sample clustering results on all the datasets in Table 2. As can be seen from Table 2, SC outperforms KM on most of the low-dimensional datasets, such as Iris, Glass, Wine, WPBC, and SpectfHeart, but it can become worse on the high-dimensional datasets, such as Yale and Isolet. This is because SC prefers datasets that have a clear manifold structure in a low-dimensional space; if this assumption does not hold, it can even perform worse than the KM algorithm. SEC performs better than KM and SC on Glass, WPBC, USPS, and Isolet, but it does not achieve a clear advantage for in-sample clustering on all datasets. One possible explanation is that SEC improves SC by introducing linear embedding functions, which are only applicable to data with linear or approximately linear structures. KSEC and ESEC significantly outperform KM, SC, and SEC in most cases, and each achieves 5 best in-sample clustering results among all datasets. It should be noted that KSEC and ESEC also have superior clustering performance on low-dimensional datasets, since they both introduce regularization terms into SC and can be considered as regularized SC. Compared with KSEC, ESEC achieves better or at least comparable results, which demonstrates that the proposed ESEC method is effective on all the datasets and can handle datasets that do not have a clear manifold structure in a low-dimensional space. The running time of all algorithms is listed in Table 3. It shows that ESEC runs much faster than KSEC, which is consistent with the theoretical analysis, and the running time of ESEC and KSEC is lower than that of KM, SC, and SEC for most of the datasets. Overall, compared with other methods, the proposed ESEC method has better or comparable in-sample performance at much faster training speed.
In Figure 1, we further analyze the sensitivity of the in-sample clustering performance of SEC, KSEC, and ESEC with respect to the regularization parameter. We can see from Figure 1 that ESEC prefers a large value of this parameter on Yale and ORL, and its performance on these datasets is relatively stable once the parameter is large, whereas ESEC and KSEC favor a small value on COIL-20 and Isolet. We can also observe that ESEC outperforms KSEC and SEC over a wide range of parameter values; that is, the clustering accuracy of ESEC is less sensitive to the parameter for most of the datasets when compared with SEC and KSEC.
4.3. Out-of-Sample Clustering Experiments
We also study the performance of KM, Nyström SC, SEC, KSEC, and ESEC for the out-of-sample extension. Table 4 shows the clustering accuracies of these methods for out-of-sample clustering on all the datasets. The optimal parameters of SEC, KSEC, and ESEC are determined by cross-validation from the in-sample clustering. From Table 4, it can be seen that SEC, KSEC, and ESEC significantly outperform the Nyström method for out-of-sample clustering. The reason is that the Nyström method utilizes the Nyström extension to evaluate the similarity matrix for the unseen data, which might be inaccurate or even deviate seriously. In contrast, our proposed framework aims at minimizing the error between the cluster assignment matrix and the low-dimensional embedding of the data, which is feasible for handling real-world data. Thus, ESEC has the natural ability to solve out-of-sample extension problems. In addition, KM is sharply degraded on Yale and ORL compared with the corresponding results in Table 2. This is because the unseen face data vary considerably from the seen data. On the other hand, the clustering accuracies of ESEC are comparable to the in-sample results, which validates that ESEC has better generalization performance. ESEC achieves 6 best clustering results among all ten testing results and has comparable results on the rest of the datasets when compared to KSEC. Consequently, the proposed ESEC algorithm provides a new way to cope with out-of-sample data in clustering tasks.
In this paper, we propose a general spectral embedded clustering framework based on the objective function of SC, from which SEC, KSEC, and ESEC can all be derived by using different embedding functions. By virtue of ELM, the fast spectral nonlinearly embedded clustering algorithm (ESEC) is proposed, which can naturally solve the out-of-sample extension problem for clustering tasks. Experimental results on benchmark datasets validate the effectiveness and efficiency of the proposed ESEC method for both in-sample and out-of-sample clustering. In the future, we intend to develop a new semisupervised clustering framework by incorporating pairwise constraints into the present framework and to propose semisupervised clustering algorithms based on spectral nonlinearly embedded clustering models.
Notations
$\mathbb{R}^d$: The input $d$-dimensional Euclidean space
$\{0, 1\}^c$: The output 0-1 binary space
$n$: The number of total training data points
$c$: The number of classes that the samples belong to
$X$: $X = [x_1, \ldots, x_n]$ is the training data matrix
$Y$: $Y = [y_1, \ldots, y_n]^T$ is the 0-1 class assignment matrix; $y_i$ is the label vector of $x_i$, and all components of $y_i$ are 0s except one being 1
$f$: $f$ is the embedding vector function
$k(x_i, x_j)$: Kernel function of variables $x_i$ and $x_j$
$\alpha$: $\alpha \in \mathbb{R}^{n \times c}$; its columns are the coefficients of kernel functions to represent the embedding function
$\operatorname{tr}(M)$: The trace of the matrix $M$, that is, the sum of the diagonal elements of the matrix $M$.
The authors declare that they have no competing interests.
This work was supported by the National Natural Science Foundation of China (no. 61403394) and the Fundamental Research Funds for the Central Universities (no. 2014QNA46).
L. Xu, J. Neufeld, B. Larson, and D. Schuurmans, “Maximum margin clustering,” in Proceedings of the Advances in Neural Information Processing Systems (NIPS '05), pp. 1537–1544, Vancouver, Canada, 2005.
Y. Li, I. W. Tsang, J. T. Kwok, and Z. Zhou, “Tighter and convex maximum margin clustering,” in Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 344–351, Clearwater Beach, Fla, USA, 2009.
M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: a geometric framework for learning from labeled and unlabeled examples,” Journal of Machine Learning Research, vol. 7, pp. 2399–2434, 2006.
Y. Bengio, J.-F. Paiement, P. Vincent, O. Delalleau, N. L. Roux, and M. Ouimet, “Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering,” in Proceedings of the 17th Annual Conference on Neural Information Processing Systems (NIPS '03), pp. 126–133, Whistler, Canada, December 2003.
A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: analysis and an algorithm,” in Proceedings of the Advances in Neural Information Processing Systems (NIPS '01), pp. 849–856, Vancouver, Canada, 2001.
J. Ye, Z. Zhao, and M. Wu, “Discriminative K-means for clustering,” in Proceedings of the Neural Information Processing Systems, pp. 1649–1656, Vancouver, Canada, 2007.
L. Zelnik-Manor and P. Perona, “Self-tuning spectral clustering,” in Proceedings of the 18th Annual Conference on Neural Information Processing Systems (NIPS '04), pp. 1601–1608, Vancouver, Canada, December 2004.
L. Duan, D. Xu, I. W. Tsang, and J. Luo, “Visual event recognition in videos by learning from web data,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '10), pp. 1959–1966, San Francisco, Calif, USA, June 2010.