Mathematical Problems in Engineering

Volume 2016 (2016), Article ID 9264561, 9 pages

http://dx.doi.org/10.1155/2016/9264561

## Spectral Nonlinearly Embedded Clustering Algorithm

^{1}School of Information and Electrical Engineering, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China

^{2}School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China

Received 22 November 2015; Revised 16 May 2016; Accepted 1 June 2016

Academic Editor: Babak Shotorban

Copyright © 2016 Mingming Liu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

As is well known, traditional spectral clustering (SC) methods are developed based on the *manifold assumption*, namely, that two nearby data points in the high-density region of a low-dimensional data manifold have the same cluster label. For some high-dimensional and sparse data, however, such an assumption might be invalid, and the clustering performance of SC then degrades sharply. To solve this problem, we propose a general spectral embedded framework, which embeds the true cluster assignment matrix for high-dimensional data into a nonlinear space by a predefined embedding function. Based on this framework, several algorithms are presented by using different embedding functions; they aim at simultaneously learning the final cluster assignment matrix and a transformation into a low-dimensional space. More importantly, the proposed method can naturally handle the out-of-sample extension problem. The experimental results on benchmark datasets demonstrate that the proposed method significantly outperforms existing clustering methods.

#### 1. Introduction

As one of the fundamental topics in data mining and machine learning, clustering has been successfully applied in various fields. Generally speaking, the goal of clustering is to group examples into a number of classes, or clusters. Over the past decades, a large family of clustering algorithms has been studied extensively; these fall mainly into two categories: generative clustering approaches and discriminative clustering models. Generative clustering approaches, for example, mixture models [1, 2], generally integrate Bayesian approaches into their models. However, generative models impose restrictive assumptions on the class-conditional densities, which might lead to unconvincing clustering results when these assumptions do not hold. Discriminative methods, such as spectral clustering (SC) [3] and *K*-means clustering [4], learn discriminative models based on loss functions from unlabeled data through the low-density separation assumption.

Recently, discriminative clustering methods, such as the variants of kernel-based clustering and spectral clustering, have attracted renewed attention because they readily capture nonlinear cluster structures. Motivated by the outstanding performance of support vector machine (SVM) in supervised learning, maximum margin clustering (MMC) [5–7] methods have been developed to obtain a decision boundary that separates data points into different clusters with maximum margin. Although these clustering methods can exploit nonlinear data structures, they are still sensitive to high-dimensional data. For example, *K*-means clustering iteratively computes the distance between each data point and the center of each cluster. Hence, its clustering performance depends heavily on the distance measure. For high-dimensional data, such as some image data, similarity computed with the Euclidean distance becomes unreliable, and the performance of *K*-means clustering degrades dramatically. SC performs clustering by utilizing the spectrum of the similarity matrix to discover the nonlinear, low-dimensional manifold structure of the data points. In other words, it relies heavily on the manifold assumption [8, 9], namely, that two nearby data points on a low-dimensional manifold have the same class label. However, for high-dimensional and sparse data, the manifold assumption may not hold due to the bias caused by the curse of dimensionality. Nie et al. [10] have shown that graph-based spectral clustering methods cannot always exploit the low-dimensional manifold structure, which degrades the performance of SC. Another challenge for traditional SC methods is that they do not solve the out-of-sample extension problem; that is, the discrete cluster assignment vectors for new, unseen samples cannot be obtained automatically. The algorithm proposed in [11] uses the Nyström method to approximate the eigenfunction for unseen data points. The method described in [12] uses heuristics to evaluate the implicit eigenfunction for new data points. However, the performance of these methods relies heavily on the estimated affinity matrix defined between training and new data points.

To further improve the clustering performance of SC for high-dimensional data, in this paper we first propose a general spectral embedded clustering framework, which incorporates dimensionality reduction methods into the model of SC. Second, by using different low-dimensional embedding functions, we derive the corresponding optimization models and develop the spectral nonlinearly embedded algorithms based on extreme learning machine (ELM) and kernel functions, respectively. Our main contributions include the following:

(1) A general spectral embedded clustering framework is presented by imposing a linearity regularization on the objective function of SC. The proposed framework introduces dimensionality reduction of the training data by controlling the error between the cluster assignment matrix and the low-dimensional embedding of the data.

(2) Based on the proposed general framework, several models can be derived by using different embedding functions, which include linear embedding functions and nonlinear functions in Reproducing Kernel Hilbert Space (RKHS) as well as in ELM feature space. The spectral embedded clustering model (SEC) proposed in [12] can be considered a special case of the general framework.

(3) We prove that the spectral nonlinearly embedded clustering model based on ELM (ESEC) is an approximation of the kernel-based spectral nonlinearly embedded clustering (KSEC) method under some conditions. A fast spectral nonlinearly embedded clustering algorithm is proposed based on ESEC by utilizing the efficient learning ability of ELM.

(4) The out-of-sample extension problem can be naturally solved for the clustering methods under our proposed SEC framework.

(5) Experimental results on benchmark datasets demonstrate that the proposed ESEC outperforms the existing SC methods, *K*-means clustering, SEC, and KSEC for in-sample clustering. For out-of-sample clustering, ESEC also generalizes better than the Nyström method and outperforms *K*-means clustering, SEC, and KSEC.

The rest of this paper is organized as follows. Related works are introduced in Section 2. In Section 3, we present the general spectral embedded clustering framework and derive several models by using different embedding functions. The relationship between ESEC and KSEC is demonstrated, and the ESEC clustering algorithm is described in detail. In addition, clustering for out-of-sample data is discussed. To validate our model, experimental results are reported in Section 4. Finally, we give conclusions and a discussion of future work in Section 5. To avoid confusion, the main notations used in this paper are listed in the Notations section.

#### 2. Related Works

##### 2.1. Spectral Clustering

Given a dataset $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^d$, the main task of clustering is to partition $X$ into $c$ clusters. SC aims at finding a cluster assignment matrix $Y$ of the training data by a weighted graph whose vertices are over $X$. Several SC algorithms have been proposed in [3, 13, 14]. In this paper, we mainly discuss the SC algorithm with *k*-way normalized cuts [3].

Specifically, denote an undirected weighted graph by $G = (V, A)$, where $V$ is the vertex set and $A \in \mathbb{R}^{n \times n}$ represents an affinity matrix. Each entry $A_{ij}$ of the symmetric matrix $A$ records the edge weight that characterizes the similarity between a pair of vertices of $G$. $A$ is commonly defined by

$$A_{ij} = \begin{cases} \exp\left(-\|x_i - x_j\|^2 / \sigma^2\right), & \text{if } x_i \text{ and } x_j \text{ are neighbors}, \\ 0, & \text{otherwise}. \end{cases} \tag{1}$$

The graph Laplacian is defined by $L = D - A$, where $D$ is a diagonal matrix with diagonal elements $D_{ii} = \sum_j A_{ij}$. Based on the normalized cut criterion, where the size of a subset of a graph is measured by the weights of its edges and the normalized Laplacian matrix is used, the optimization problem can be transformed into the following trace maximization problem [3]:

$$\max_{F^T F = I} \operatorname{tr}\left(F^T D^{-1/2} A D^{-1/2} F\right), \tag{2}$$

where $I$ denotes the identity matrix of size $c$ by $c$ and $F \in \mathbb{R}^{n \times c}$ represents the cluster assignment matrix with continuous values by relaxation. The optimal solution of (2) can then be obtained by eigenvalue decomposition of the matrix $D^{-1/2} A D^{-1/2}$.
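The relaxed normalized-cut problem above can be illustrated with a short NumPy sketch. It is only a sketch under stated assumptions: a dense Gaussian affinity with bandwidth `sigma` (no neighborhood sparsification), and the function name `normalized_spectral_embedding` is hypothetical rather than taken from the paper.

```python
import numpy as np

def normalized_spectral_embedding(X, c, sigma=1.0):
    """Relaxed n x c assignment matrix F for data X (n x d); also returns S."""
    # Pairwise squared Euclidean distances, clipped at zero for safety.
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    # Assumption: dense Gaussian affinity, zero self-similarity.
    A = np.exp(-d2 / sigma**2)
    np.fill_diagonal(A, 0.0)
    # Normalized affinity S = D^{-1/2} A D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    S = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; keep the c largest.
    _, vecs = np.linalg.eigh(S)
    F = vecs[:, -c:]
    return F, S
```

The columns of `F` are orthonormal, so `F` is feasible for the trace maximization; a discrete assignment is usually obtained afterwards by clustering its rows.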

##### 2.2. Extreme Learning Machine

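The output function of ELM for generalized single-hidden-layer feedforward neural networks (SLFNs) in the case of one output node is

$$f(x) = \sum_{i=1}^{L} \beta_i h_i(x) = h(x)\beta, \tag{3}$$

where $\beta = [\beta_1, \ldots, \beta_L]^T$ is the vector of the output weights between the hidden layer of *L* nodes and the output node and $h(x) = [h_1(x), \ldots, h_L(x)]$ is the output (row) vector of the hidden layer with respect to the input $x$. In fact, $h(x)$ maps the data from the *d*-dimensional input space to the *L*-dimensional hidden-layer feature space (ELM feature space) $\mathcal{H}$. ELM is to minimize the training error as well as the norm of the output weights [15]:

$$\min_{\beta} \; \|\beta\|^2 + C \|H\beta - T\|^2, \tag{4}$$

where $C$ is a tradeoff parameter between the complexity and fitness of the decision function, $T$ denotes the training targets, and $H$ is the hidden-layer output matrix denoted by

$$H = \begin{bmatrix} h(x_1) \\ \vdots \\ h(x_n) \end{bmatrix} = \begin{bmatrix} h_1(x_1) & \cdots & h_L(x_1) \\ \vdots & \ddots & \vdots \\ h_1(x_n) & \cdots & h_L(x_n) \end{bmatrix}. \tag{5}$$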

Similar to support vector machine (SVM), minimizing the norm of the output weights $\|\beta\|$ amounts to maximizing the distance between the separating margins of the two classes in the ELM feature space, $2/\|\beta\|$, which controls the complexity of the function in the ELM feature space.
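As a rough illustration of the ELM training described above, the sketch below draws the hidden layer at random and solves only for the output weights in closed form. The sigmoid activation, the Gaussian random weights, the objective $\|\beta\|^2 + C\|H\beta - T\|^2$, and the names `elm_hidden` and `elm_fit` are assumptions made for this example.

```python
import numpy as np

def elm_hidden(X, W, b):
    """Sigmoid hidden-layer output H (n x L) for inputs X (n x d)."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def elm_fit(X, T, L=50, C=1.0, seed=0):
    """Random hidden layer plus regularized least-squares output weights."""
    rng = np.random.default_rng(seed)
    # Input weights and biases are drawn once and never trained.
    W = rng.normal(size=(X.shape[1], L))
    b = rng.normal(size=L)
    H = elm_hidden(X, W, b)
    # Minimizing ||beta||^2 + C * ||H beta - T||^2 gives
    # (I / C + H^T H) beta = H^T T.
    beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)
    return W, b, beta
```

Because only a linear system in `beta` is solved, training cost is dominated by forming and factorizing an `L` by `L` matrix, which is the source of the fast learning speed usually attributed to ELM.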

#### 3. General Spectral Embedded Clustering Framework

As mentioned above, SC methods greatly depend on the construction of the affinity matrix $A$. Some high-dimensional data might not exhibit an evident low-dimensional manifold structure. In this case, the clustering performance of SC may be inferior to that of *K*-means clustering.

In the following subsections, we will firstly propose a general spectral embedded clustering framework, which incorporates a linearity regularization into the traditional normalized SC model. By using different embedding functions, this framework can generate a family of spectral embedded clustering algorithms, such as SEC, KSEC, and ESEC. Secondly, we demonstrate the relationship between ESEC and KSEC. The ESEC algorithm is then proposed for high-dimensional data clustering. Finally, the out-of-sample extension problem is discussed for our proposed ESEC method.

##### 3.1. Formulation

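Generally, clustering models of traditional SC methods can be transformed into the following minimization problem:

$$\min_{F^T F = I} \operatorname{tr}\left(F^T \tilde{L} F\right), \tag{6}$$

where $\tilde{L}$ is the normalized Laplacian matrix and $F$ is the relaxed cluster assignment matrix.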

To make use of the underlying dense grouping structure of data in a low-dimensional subspace, the proposed general framework introduces a regularization term into the optimization problem (6), which controls the error between the cluster assignment matrix and the low-dimensional embedding of the data. Specifically, we minimize the following objective function:where and are two regularization parameters and is the low-dimensional embedding of training data. The second term represents the error between the relaxed cluster assignment matrix and the low-dimensional embedding of the data. The third term is the norm penalty of and represents the complexity of functions in a high-dimensional feature space.

In dimensionality reduction, linear and nonlinear embedding functions are commonly used to address out-of-sample problems, because they have few parameters and are therefore inexpensive in computation time and memory. In this paper, we mainly discuss kernel-based and ELM-based nonlinear embedding functions.

If we choose a linear embedding function , , where and , the optimization problem (7) becomeswhich is equivalent to the SEC method proposed in [12].

If we use a nonlinear embedding function in RKHS, that is, , then , where is a symmetric kernel matrix and ; problem (7) can be rewritten aswhich is referred to as KSEC.

Alternatively, if we consider an embedding function in ELM feature space, that is, , then , where represents the hidden-layer output matrix of ELM. Problem (7) can be reformulated as which is referred to as ESEC.

##### 3.2. Method

To solve optimization problem (9), we first transform it into a simpler form, which gives the following theorem.

Theorem 1. *The optimization problems (9) can be transformed into the following minimization problem:where and denotes the identity matrix of size n by n.*

*Proof. *Problem (9) is firstly transformed into the following form:where .

By setting the derivatives of the objective function (16) with respect to to zero, we haveBy substituting in (12) by (13), the optimization problem (12) becomeswhich can be denoted as follows:where . This completes the proof of Theorem 1.

Based on Theorem 1, the relaxed cluster assignment matrix of KSEC can be achieved by computing the eigenvectors of corresponding to the smallest eigenvalues. The columns of are corresponding to the top eigenvectors. Finally, the discrete-valued cluster assignment matrix can be obtained by clustering each row of .

To inherit the fast learning speed of ELM, we mainly discuss ESEC based on ELM with multiple outputs, since ELM with a single output can be regarded as a special case of it. The following theorem on ESEC is the foundation of the proposed ESEC algorithm.

Theorem 2. *The optimization problem (10) can be transformed into the following minimization problem:where or . denotes the identity matrix of size by and is the number of hidden layer nodes in ELM.*

*Proof. *By setting the derivatives of the objective function (10) with respect to to zero, we haveBy substituting in (10) by (17), the optimization problem (10) becomesProblem (18) can be further transformed into the following objective function:which can be denoted as follows:where . can be transformed into another form as follows: This completes the proof of Theorem 2.

ESEC uses an embedding function in ELM feature space instead of RKHS, so the form of ESEC is similar to that of KSEC. In fact, the two methods are linked, as the following theorem shows.

Theorem 3. *If the mapping in ELM is , where denotes any kernel function and L is the number of hidden nodes in ELM and ( is the parameter of kernel function ) are random sampling points from any continuous probability distribution, then ESEC is an approximation of KSEC by discretizing the embedding function .*

*Proof. *Since in RKHS, can be denoted aswhere . Let and ; thenThus, we approximately derive the embedding function of ESEC from KSEC. This completes the proof of Theorem 3.

The proposed ESEC algorithm is described as follows.

*Algorithm 1.*

*Input*. The input is the training dataset and the number of clusters .

*Output*. The output is the class assignment matrix of cluster .

*Step 1*. Construct the graph Laplacian from .

*Step 2*. Randomly generate input weights and initiate an ELM network of hidden neurons; calculate the output matrix of the hidden layer.

*Step 3*. *If* ; let . *Else* .

*Step 4*. Compute the matrix .

*Step 5*. Find the eigenvectors of corresponding to the smallest eigenvalues, which form the optimal .

*Step 6*. Treat each row of as a new training sample, and use the *K*-means algorithm to cluster the training samples into clusters. Let be the final discrete class assignment matrix of cluster for training data.

*Return* the class assignment matrix of cluster .
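The steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the dense Gaussian affinity, the sigmoid hidden layer, a bias-free embedding $G = H\beta$ with penalty $\gamma\|\beta\|_F^2$, the matrix `M` that results from eliminating $\beta$ under those assumptions, and the helper names (`gaussian_affinity`, `lloyd_kmeans`, `esec_sketch`) are all choices made here for illustration. The exact matrix of Steps 3 and 4 should be taken from Theorem 2.

```python
import numpy as np

def gaussian_affinity(X, sigma):
    """Dense Gaussian affinity with zero diagonal (an assumed graph choice)."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    A = np.exp(-d2 / sigma**2)
    np.fill_diagonal(A, 0.0)
    return A

def lloyd_kmeans(Z, c, iters=50, seed=0):
    """Plain Lloyd iterations; returns one label per row of Z."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=c, replace=False)]
    for _ in range(iters):
        d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in range(c):
            if np.any(labels == k):
                centers[k] = Z[labels == k].mean(axis=0)
    return labels

def esec_sketch(X, c, L=20, mu=1.0, gamma=1.0, sigma=1.0, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Step 1: normalized graph Laplacian I - D^{-1/2} A D^{-1/2}.
    A = gaussian_affinity(X, sigma)
    s = 1.0 / np.sqrt(A.sum(axis=1))
    L_norm = np.eye(n) - s[:, None] * A * s[None, :]
    # Step 2: random ELM hidden layer with L sigmoid nodes.
    W = rng.normal(size=(d, L))
    b = rng.normal(size=L)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    # Steps 3-4 (assumed form): eliminating beta from
    # mu * (||H beta - F||^2 + gamma * ||beta||^2) leaves
    # mu * F^T (I - H (H^T H + gamma I)^{-1} H^T) F.
    P = H @ np.linalg.solve(H.T @ H + gamma * np.eye(L), H.T)
    M = L_norm + mu * (np.eye(n) - P)
    # Step 5: eigenvectors of the c smallest eigenvalues form F.
    _, vecs = np.linalg.eigh((M + M.T) / 2.0)
    F = vecs[:, :c]
    # Step 6: cluster the rows of F.
    return lloyd_kmeans(F, c, seed=seed), F
```

A call such as `labels, F = esec_sketch(X, c=2, L=10, mu=0.1)` returns one label per training sample together with the relaxed assignment matrix.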

##### 3.3. Computational Complexity

From Algorithm 1, we can see that the most costly computation is computing the matrix and carrying out the eigen-decomposition of . If , computing needs to obtain the inversion of , whose computational complexity is . In addition, the computational complexity of eigenvalue decomposition of is . Thus, the total computational complexity of ESEC is , where . Correspondingly, for KSEC, computational complexity of calculating is and its total computational complexity is . Consequently, ESEC has lower computational complexity than KSEC.
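The saving discussed above comes from inverting a matrix whose size is the smaller of the number of hidden nodes and the number of samples. The sketch below illustrates this with the standard identity $H(H^T H + \gamma I)^{-1} H^T = (H H^T + \gamma I)^{-1} H H^T$. It assumes the matrix being inverted is the regularized Gram matrix of the hidden-layer output; the function name `ridge_projector` is hypothetical.

```python
import numpy as np

def ridge_projector(H, gamma):
    """H (H^T H + gamma I)^{-1} H^T, computed via the smaller linear system."""
    n, L = H.shape
    if L <= n:
        # Solve an L x L system (cheap when there are few hidden nodes).
        return H @ np.linalg.solve(H.T @ H + gamma * np.eye(L), H.T)
    # Otherwise solve an n x n system using the equivalent form.
    G = H @ H.T
    return np.linalg.solve(G + gamma * np.eye(n), G)
```

Both branches return the same matrix up to rounding, so the choice between them affects only cost.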

##### 3.4. Clustering for Out-of-Sample Data

By performing Algorithm 1, we can obtain the cluster assignment matrix for the training data. Thus, can be easily computed by using formula (17). Then, for any new data point , we can obtain the prediction result In this paper, we use the spectral rotation method to calculate the discrete cluster assignment vector for . Firstly, an orthogonal matrix is computed by the following spectral rotation method:where and denote the and vectors of all 1s, respectively. is an orthogonal matrix and is defined bywhere represents a diagonal matrix with the same diagonal elements as the square matrix . Secondly, the discrete cluster assignment vector for is calculated as follows:Finally, the class of the data point iswhere is the* i*th element in the vector .
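The out-of-sample step can be sketched as follows. Note the substitution: the spectral rotation used in the paper is replaced here by a simpler nearest-centroid rule in the embedded space. The bias-free ridge solution for the output weights and the names `fit_output_weights` and `assign_new_points` are likewise assumptions for this example.

```python
import numpy as np

def fit_output_weights(H, F, gamma):
    """Ridge solution beta = (H^T H + gamma I)^{-1} H^T F (assumed form)."""
    L = H.shape[1]
    return np.linalg.solve(H.T @ H + gamma * np.eye(L), H.T @ F)

def assign_new_points(H_new, beta, centroids):
    """Embed new hidden-layer outputs and pick the nearest centroid.

    Nearest-centroid assignment is a stand-in for spectral rotation.
    """
    G_new = H_new @ beta
    d2 = ((G_new[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)
```

Here `H_new` would be produced by passing unseen points through the same random hidden layer used for training, and `centroids` would be the cluster centers found on the rows of the training embedding.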

#### 4. Experiments

To evaluate the in-sample and out-of-sample clustering performance of different clustering methods, we test all algorithms on UCI datasets (Iris, Glass, Wine, WPBC, SpectHeart, and Isolet (http://archive.ics.uci.edu/ml/datasets.html)), face recognition datasets (Yale (http://vision.ucsd.edu/~leekc/ExtYaleDatabase/Yale%20Face%20Database.htm), ORL (http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html)), digits recognition datasets (USPS (http://www-i6.informatik.rwth-aachen.de/~keysers/usps.html)), and object recognition datasets (COIL-20 (http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php)). Some datasets are resized, and the basic information of the datasets is listed in Table 1. All the experiments have been performed in MATLAB R2013a running on a 3.10 GHz Intel Core™ i5-2400 with 4 GB RAM.