Abstract

Effective feature representation is the key to the success of machine learning applications. Recently, many feature learning models have been proposed. Among these models, the Gaussian process latent variable model (GPLVM) for nonlinear feature learning has received much attention because of its superior performance. However, most existing GPLVMs are designed mainly for classification and regression tasks and thus cannot be used for data clustering. To address this issue and extend the application scope, this paper proposes a novel GPLVM for clustering (C-GPLVM). Specifically, by combining the GPLVM with a subspace clustering method, our C-GPLVM can obtain more representative latent variables for clustering. Moreover, by introducing a back constraint into the model, it can directly map new samples to the latent space, making it more suitable for big data learning tasks such as the analysis of chaotic time series. In the experiments, we compare it with related GPLVMs and clustering algorithms. The experimental results show that the proposed model not only inherits the feature learning ability of the GPLVM but also achieves superior clustering accuracy.

1. Introduction

In machine learning tasks, data are often distributed in a high-dimensional space and have many redundant features. Training machine learning models on such high-dimensional data may result not only in higher computational and storage complexities but also in overfitting [1]. Existing studies have shown that high-dimensional data are often embedded in a low-dimensional manifold. We can therefore utilize dimension reduction and feature learning methods to learn the low-dimensional manifold and obtain more representative features, improving both the accuracy and the efficiency of machine learning models. Thus, effective feature representation is the key to the success of machine learning applications.

In the past decade, many related methods have been proposed, such as dictionary learning [2], the autoencoder [3], the Gaussian process latent variable model (GPLVM) [4], Isomap [5], and locally linear embedding [6]. Among these models, the GPLVM for nonlinear feature learning has received extensive attention because of its superior feature learning ability and has been used in many applications such as dynamical systems [7] and the modelling and control of nonlinear systems [8]. Given only a few training samples, it can effectively learn the low-dimensional manifold embedded in the high-dimensional space, and thus it has been widely used in dimension reduction and data visualization tasks [9, 10].

Although it has the abovementioned advantages, the conventional GPLVM is a fully unsupervised feature learning model and thus cannot meet the demands of real-world applications when dealing with specific machine learning tasks, such as the analysis of chaotic time series, dynamical systems [7], and the modelling and control of nonlinear systems, in which we also observe the response values of the inputs. How to modify the GPLVM and improve its performance is therefore the key question in the related research. To date, extensions of this model have mainly focused on supervised and semi-supervised learning settings [9, 11, 12]. These methods assume that, apart from the input features, we also observe the labels of the samples. With these extensions, the GPLVM can effectively utilize the supervised information to improve the classification accuracy of the learned latent variables. However, in real-world applications, we may also face unsupervised clustering tasks in which neither label information nor any other auxiliary information is available, which makes applying the GPLVM to clustering tasks more challenging.

In order to address the abovementioned issues, this paper proposes a fusion model that combines the GPLVM with the subspace clustering model [13] to simultaneously obtain more representative features and accurate clustering results. Moreover, we also use the back constraint trick [14] in the model, which allows the model to predict new samples directly and makes it more suitable for big data learning tasks such as the analysis of chaotic time series. In the experiments, we verify the performance of the proposed model on multiple datasets. The experimental results show that our model achieves superior clustering performance compared with the other related models.

2.1. Gaussian Process Latent Variable Model

The GPLVM is a fully unsupervised and nonlinear latent variable model. In this model, given $N$ observed samples $Y = [y_1, \ldots, y_N]^\top \in \mathbb{R}^{N \times D}$ (where $y_n$ denotes the $n$th training sample), our objective is to learn the corresponding low-dimensional latent variables. In this paper, we use $x_n \in \mathbb{R}^{Q}$ ($Q < D$) to denote the latent variable of $y_n$ and write $X = [x_1, \ldots, x_N]^\top$. Obviously, the GPLVM realizes dimension reduction by learning these latent variables. Specifically, the GPLVM assumes that $Y$ is generated as follows:

$$y_{nd} = f_d(x_n) + \varepsilon_{nd}, \qquad (1)$$

where $y_{nd}$ is the $d$th feature of the $n$th training sample, $\varepsilon_{nd}$ is a noise term that follows a Gaussian distribution $\mathcal{N}(0, \sigma^2)$, and $f_d$ is a function that follows a Gaussian process prior. We use $F$ to denote the outputs of these functions with inputs $X$. Thus, we have $p(F \mid X) = \prod_{d=1}^{D}\mathcal{N}(\mathbf{f}_d \mid \mathbf{0}, K)$, where $K$ is the kernel matrix computed by applying the kernel function $k(\cdot,\cdot)$ to the latent variables in $X$; the element in the $i$th row and $j$th column of $K$ is $K_{ij} = k(x_i, x_j)$. By integrating out the intermediate variable $F$, we can obtain the following marginal likelihood function:

$$p(Y \mid X, \boldsymbol{\theta}) = \prod_{d=1}^{D} \mathcal{N}\left(\mathbf{y}_d \mid \mathbf{0},\, K + \sigma^{2} I\right), \qquad (2)$$

where $\mathbf{y}_d$ denotes the $d$th column of $Y$ and $\boldsymbol{\theta}$ denotes the hyperparameters involved in the kernel function and the noise distribution. In the model optimization process, the GPLVM learns the latent variables $X$ and the hyperparameters $\boldsymbol{\theta}$ jointly by maximizing the above likelihood function and finally obtains the low-dimensional representation.
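For concreteness, the following NumPy sketch evaluates the marginal likelihood in equation (2) under an RBF kernel. It is only an illustration: the function and variable names (rbf_kernel, Y, X, s, l, noise) are assumptions made here, not part of the original formulation.

```python
# Minimal sketch of the GPLVM negative log marginal likelihood in equation (2),
# assuming an RBF kernel k(x_i, x_j) = s * exp(-||x_i - x_j||^2 / (2 * l^2)).
import numpy as np

def rbf_kernel(X, s=1.0, l=1.0):
    """Kernel matrix K with K[i, j] = k(x_i, x_j)."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return s * np.exp(-0.5 * sq / l**2)

def gplvm_neg_log_likelihood(Y, X, s=1.0, l=1.0, noise=0.1):
    """-log p(Y | X, theta): D independent GPs sharing the covariance K + noise * I."""
    N, D = Y.shape
    K_sigma = rbf_kernel(X, s, l) + noise * np.eye(N)
    _, logdet = np.linalg.slogdet(K_sigma)
    quad = np.trace(np.linalg.solve(K_sigma, Y @ Y.T))
    return 0.5 * (D * logdet + quad + N * D * np.log(2 * np.pi))

# Toy usage: 20 samples in a 5-D observed space mapped to a 2-D latent space.
rng = np.random.default_rng(0)
Y = rng.normal(size=(20, 5))
X = rng.normal(size=(20, 2))
print(gplvm_neg_log_likelihood(Y, X))
```

Maximizing (2) then amounts to minimizing this quantity with respect to both the latent variables and the kernel hyperparameters.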

As the abovementioned generation process shows, the GPLVM is a fully unsupervised dimension reduction model: it cannot embed auxiliary information when dealing with specific machine learning tasks and thus cannot meet the demands of real-world applications. For example, in the analysis of chaotic time series, data observed at similar times have similar features. If the GPLVM could utilize this knowledge, it would learn more representative features for the task and significantly improve the prediction accuracy. Existing extensions of the GPLVM mainly focus on embedding supervised information to improve its classification and regression accuracy, for example, the discriminative GPLVM (D-GPLVM) and the supervised GPLVM (S-GPLVM). For the extension to clustering tasks, there is much less related work. Existing unsupervised GPLVM variants focus only on preserving local distances and learning better latent variables or features. For example, the locality preserving projection GPLVM (LPP-GPLVM) combines the objective of locality preserving projection with that of the GPLVM, thus simultaneously learning the low-dimensional representation and preserving the local structure [15]. The GPLVM with back constraints (B-GPLVM) introduces a back constraint (a mapping from the observed space to the latent space) into the GPLVM; in this way, it can also preserve local distances.

2.2. Subspace Clustering

The goal of subspace clustering is to segment a set of data samples into different subspaces so that similar samples lie in the same subspace, while dissimilar samples lie in different subspaces. Over the past decade, subspace clustering has been used in various clustering tasks, and many well-designed algorithms have been proposed, such as Gaussian mixture model- (GMM-) based methods [16, 17], matrix factorization- (MF-) based methods [18, 19], algebra-based methods [20], and spectral clustering methods [13, 21, 22]. Among these models, the subspace clustering method based on spectral clustering has been widely applied because of its concise implementation and reliable performance. It uses a low-rank representation to construct the affinity matrix for spectral clustering. Its objective is to find the low-rank representation of the input data $Y$ by optimizing the following function:

$$\min_{Z} \ \operatorname{rank}(Z), \quad \text{s.t.} \ Y = ZY, \qquad (3)$$

where we assume that each sample can be expressed as a linear combination of the other samples, i.e., $y_i^\top = \sum_j Z_{ij}\, y_j^\top$. The above low-rank penalty can be considered a global constraint on the subspace structure of the samples and makes similar samples have similar weights. Since rank minimization is intractable in general, we can use the following nuclear norm to replace the penalty term:

$$\min_{Z} \ \left\|Z\right\|_{*}, \quad \text{s.t.} \ Y = ZY, \qquad (4)$$

where the nuclear norm $\|Z\|_{*}$ (the sum of the singular values of $Z$) approximates the rank of $Z$. Considering that the data often contain noise, we use the following formulation to learn the self-representation matrix $Z$:

$$\min_{Z, E} \ \left\|Z\right\|_{*} + \lambda \left\|E\right\|_{2,1}, \quad \text{s.t.} \ Y = ZY + E. \qquad (5)$$
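Typical solvers for problem (5) (e.g., ALM/ADMM schemes for low-rank representation) repeatedly apply singular value thresholding, the proximal operator of the nuclear norm. The short sketch below shows only this building block; the function name and the parameter tau are illustrative assumptions rather than the exact solver used in the cited works.

```python
# Singular value thresholding: the proximal operator of the nuclear norm,
# i.e., argmin_Z  tau * ||Z||_* + 0.5 * ||Z - M||_F^2.
import numpy as np

def singular_value_threshold(M, tau):
    """Shrink the singular values of M by tau (set negative results to zero)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (U * s_shrunk) @ Vt
```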

In low-rank subspace clustering, we can first construct the affinity matrix $W$ and the Laplacian matrix $L$ and then use spectral clustering to cluster the data. $W$ and $L$ can be constructed as follows:

$$W = \frac{1}{2}\left(|Z| + |Z^\top|\right), \qquad L = D_W - W, \qquad (6)$$

where $D_W$ denotes a diagonal matrix with $(D_W)_{ii} = \sum_{j} W_{ij}$. After obtaining the Laplacian matrix, we can optimize the following objective function to obtain the latent variable $X$:

$$\min_{X} \ \operatorname{tr}\left(X^\top L X\right), \quad \text{s.t.} \ X^\top X = I. \qquad (7)$$

Obviously, the optimal $X$ is composed of the eigenvectors of $L$ corresponding to its smallest eigenvalues. At last, we can run the k-means algorithm on the learned $X$ to obtain the clustering result.
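The pipeline of equations (6)-(7) can be sketched in a few lines, assuming a self-representation matrix Z has already been obtained; the variable names and the number of clusters are illustrative.

```python
# Low-rank spectral clustering pipeline: affinity -> Laplacian -> embedding -> k-means.
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering_from_Z(Z, n_clusters):
    W = 0.5 * (np.abs(Z) + np.abs(Z.T))      # affinity matrix, equation (6)
    Deg = np.diag(W.sum(axis=1))             # degree matrix, Deg_ii = sum_j W_ij
    L = Deg - W                              # graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    X = eigvecs[:, :n_clusters]              # eigenvectors of the smallest eigenvalues, equation (7)
    return KMeans(n_clusters=n_clusters, n_init=20).fit_predict(X)
```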

3. Model Construction and Optimization

3.1. Design of the Gaussian Process Latent Variable Clustering Model

Assuming that there are $N$ observed samples denoted as $Y = [y_1, \ldots, y_N]^\top$, our goal is to learn the low-dimensional latent variables $X = [x_1, \ldots, x_N]^\top$ corresponding to these observed variables and to make the latent variables have superior clustering performance (i.e., make common clustering algorithms obtain accurate clustering results on the learned $X$).

In order to achieve the above goal, we assume that the latent variable $X$ has the following prior distribution:

$$p(X) = \frac{1}{Z_{0}} \exp\left(-\frac{1}{2} S(X)\right), \qquad (8)$$

where $Z_{0}$ is a constant that makes $p(X)$ a valid probability distribution and $S(X)$ has the following form:

$$S(X) = \sum_{i,j} A_{ij} \left\|x_i - x_j\right\|^{2}, \qquad (9)$$

where $A_{ij}$ is the element in the $i$th row and $j$th column of the affinity matrix $A$. Equation (9) can often be written as follows:

$$S(X) = 2\operatorname{tr}\left(X^\top L X\right), \qquad L = D_A - A, \qquad (10)$$

where $D_A$ is the diagonal degree matrix with $(D_A)_{ii} = \sum_{j} A_{ij}$.
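The step from equation (9) to equation (10) relies on a standard identity for graph Laplacians. The snippet below checks it numerically for a symmetric affinity matrix; all names here are illustrative.

```python
# Numerical check: for symmetric A, 0.5 * sum_ij A_ij ||x_i - x_j||^2 = tr(X^T L X),
# with L = D_A - A and D_A the diagonal matrix of row sums of A.
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 6)); A = 0.5 * (A + A.T)      # symmetric affinity matrix
X = rng.normal(size=(6, 2))
L = np.diag(A.sum(axis=1)) - A

lhs = 0.5 * sum(A[i, j] * np.sum((X[i] - X[j])**2) for i in range(6) for j in range(6))
rhs = np.trace(X.T @ L @ X)
print(np.isclose(lhs, rhs))                      # True
```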

In this paper, we assume that the generation process of the observed variables $Y$ from the latent variables $X$ is described by the conditional distribution $p(Y \mid X)$. Thus, from the Bayes formula, we can obtain the posterior distribution of the latent variables as

$$p(X \mid Y) = \frac{p(Y \mid X)\, p(X)}{p(Y)}. \qquad (11)$$

Since $p(Y)$ is a constant with respect to $X$, we can therefore obtain the optimal latent variables by maximizing the following joint marginal distribution:

$$p(Y, X) = p(Y \mid X)\, p(X). \qquad (12)$$

To introduce the GPLVM into this model, we assume that $Y$ is generated from $X$ by latent functions that follow a Gaussian process prior, as in Section 2.1. Thus, equation (12) can be written as

$$p(Y, X) = \prod_{d=1}^{D} \mathcal{N}\left(\mathbf{y}_d \mid \mathbf{0},\, K + \sigma^{2} I\right) \cdot \frac{1}{Z_{0}} \exp\left(-\operatorname{tr}\left(X^\top L X\right)\right), \qquad (13)$$

where $\boldsymbol{\theta}$ denotes the hyperparameters involved in the kernel function used to compute $K$ and $\sigma^{2}$ denotes the variance of the Gaussian noise distribution.

Through the above modelling process, the GPLVM can effectively embed the sample similarity information when learning the latent variables, thus improving the clustering ability of the learned latent variables. However, how to learn the affinity matrix $A$ is still a key problem, both for this paper and for related algorithms such as self-representation learning and subspace clustering. In this paper, we borrow the idea of low-rank self-representation learning and introduce the following low-rank subspace constraint into the model:

$$\min_{A} \ \left\|A\right\|_{*} + \lambda \left\|Y - AY\right\|_F^{2}, \qquad (14)$$

where, as in Section 2.2, each sample is represented as a linear combination of the other samples.

It is worth noting that, in this paper, we assume that $W = A$, i.e., we directly use the matrix $A$ as the affinity matrix. This setting is the same as that of [23], and its role is similar to that of the affinity matrix in the original subspace clustering. The resulting C-GPLVM is very similar to the LPP-GPLVM. However, in the LPP-GPLVM, the Laplacian matrix is fixed, whereas in our C-GPLVM it is learned during training. Thus, our C-GPLVM achieves better performance than the LPP-GPLVM.

One important limitation of the GPLVM and of self-representation learning is that they cannot effectively predict new samples. To mitigate this problem, we introduce a back constraint into the proposed model. Thus, given a new sample, the model can effectively predict the corresponding low-dimensional latent variable using the constraint function. Specifically, given an observed sample $y_i$, we assume that we can use a function $g$ to obtain the latent variable $x_i$:

$$x_i = g\left(y_i; W_g\right), \qquad (15)$$

where $g$ is a neural network function with learnable parameters $W_g$. At last, we obtain the objective of the proposed model as follows:

$$\min_{\boldsymbol{\theta},\, \sigma,\, W_g,\, A} \ -\log p\left(Y \mid X, \boldsymbol{\theta}, \sigma^{2}\right) + \operatorname{tr}\left(X^\top L X\right) + \left\|A\right\|_{*} + \lambda \left\|Y - AY\right\|_F^{2}, \quad \text{s.t.} \ X = g\left(Y; W_g\right), \ L = D_A - A. \qquad (16)$$

The whole model structure is shown in Figure 1.
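To illustrate the back constraint in equation (15), the sketch below implements g as a small feed-forward network mapping each observed sample to its latent variable. The architecture (one tanh hidden layer) and all sizes are illustrative assumptions; the paper only specifies that g is a neural network with learnable parameters.

```python
# Minimal sketch of the back constraint x_i = g(y_i; W_g): a deterministic mapping
# from the observed space to the latent space, shared by training and new samples.
import numpy as np

class BackConstraint:
    def __init__(self, d_in, d_hidden, d_latent, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(d_in, d_hidden))
        self.b1 = np.zeros(d_hidden)
        self.W2 = rng.normal(scale=0.1, size=(d_hidden, d_latent))
        self.b2 = np.zeros(d_latent)

    def __call__(self, Y):
        """X = g(Y; W_g): latent variables as a function of the observations."""
        return np.tanh(Y @ self.W1 + self.b1) @ self.W2 + self.b2

# Usage: a new sample y* is mapped to the latent space directly, with no
# per-sample optimization at test time.
g = BackConstraint(d_in=5, d_hidden=16, d_latent=2)
Y = np.random.default_rng(1).normal(size=(20, 5))
X = g(Y)
```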

4. Model Optimization

In order to optimize (16), we transform it into the following optimization problem:

$$\min_{\boldsymbol{\theta},\, \sigma,\, W_g,\, A} \ -\log p\left(Y \mid X, \boldsymbol{\theta}, \sigma^{2}\right) + \lambda_{1}\operatorname{tr}\left(X^\top L X\right) + \lambda_{2}\left\|Y - AY\right\|_F^{2} + \lambda_{3}\left\|A\right\|_{*}, \quad \text{s.t.} \ X = g\left(Y; W_g\right), \ L = D_A - A, \qquad (17)$$

where $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are regularization weights. With this formulation, we can use an alternating iterative optimization method to learn all the parameters. First, we fix $A$ and write (17) as

$$\min_{\boldsymbol{\theta},\, \sigma,\, W_g} \ -\log p\left(Y \mid X, \boldsymbol{\theta}, \sigma^{2}\right) + \lambda_{1}\operatorname{tr}\left(X^\top L X\right), \quad \text{s.t.} \ X = g\left(Y; W_g\right). \qquad (18)$$

This problem can be solved effectively by using gradient-based methods. Writing $K_{\sigma} = K + \sigma^{2} I$, the gradient of the negative log-likelihood term with respect to a kernel hyperparameter $\theta_{k}$ can be computed as

$$\frac{\partial\left(-\log p\left(Y \mid X, \boldsymbol{\theta}, \sigma^{2}\right)\right)}{\partial \theta_{k}} = \frac{1}{2}\operatorname{tr}\left(\left(D K_{\sigma}^{-1} - K_{\sigma}^{-1} Y Y^\top K_{\sigma}^{-1}\right)\frac{\partial K_{\sigma}}{\partial \theta_{k}}\right). \qquad (19)$$
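Equation (19) is the standard Gaussian process gradient, so it can be verified against a finite difference. The self-contained sketch below does this for the RBF lengthscale; the hyperparameter values and variable names are illustrative assumptions.

```python
# Check the analytic gradient of the GP negative log-likelihood w.r.t. the RBF
# lengthscale against a central finite difference.
import numpy as np

def sq_dists(X):
    return np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T

def nll(Y, X, l, noise=0.1):
    N, D = Y.shape
    Ks = np.exp(-0.5 * sq_dists(X) / l**2) + noise * np.eye(N)
    return 0.5 * (D * np.linalg.slogdet(Ks)[1] + np.trace(np.linalg.solve(Ks, Y @ Y.T)))

rng = np.random.default_rng(0)
Y, X, l = rng.normal(size=(15, 4)), rng.normal(size=(15, 2)), 1.3
R = sq_dists(X)
Ks = np.exp(-0.5 * R / l**2) + 0.1 * np.eye(15)
Ks_inv = np.linalg.inv(Ks)
dKs_dl = np.exp(-0.5 * R / l**2) * R / l**3            # noise term does not depend on l
grad = 0.5 * np.trace((4 * Ks_inv - Ks_inv @ Y @ Y.T @ Ks_inv) @ dKs_dl)  # D = 4 here
num = (nll(Y, X, l + 1e-5) - nll(Y, X, l - 1e-5)) / 2e-5
print(np.isclose(grad, num))                           # True
```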

The gradients with respect to $\sigma$ and $W_g$ can be derived in a similar way (for $W_g$, the chain rule is applied through $X = g(Y; W_g)$). For the sake of brevity, we omit their derivations. We can then fix $\boldsymbol{\theta}$, $\sigma$, and $W_g$ and write (17) as

$$\min_{A} \ \lambda_{1}\operatorname{tr}\left(X^\top L X\right) + \lambda_{2}\left\|Y - AY\right\|_F^{2} + \lambda_{3}\left\|A\right\|_{*}. \qquad (20)$$

The gradient of the first term with respect to the $i$th row of $A$ can be computed as

$$\frac{\partial \operatorname{tr}\left(X^\top L X\right)}{\partial \mathbf{a}_{i}} = \left\|x_{i}\right\|^{2}\mathbf{1}^\top - x_{i}^\top X^\top, \qquad (21)$$

where $\mathbf{a}_{i}$ denotes the $i$th row of the matrix $A$ and $\mathbf{1}$ denotes the vector whose every element is 1. The gradient of the second term can be computed as

$$\frac{\partial \left\|Y - AY\right\|_F^{2}}{\partial A} = 2\left(AY - Y\right)Y^\top. \qquad (22)$$

The subgradient of the third term is

$$\frac{\partial \left\|A\right\|_{*}}{\partial A} = U V^\top + \Omega, \qquad (23)$$

where $A = U \Sigma V^\top$ is the skinny singular value decomposition of $A$ and $\Omega$ is any matrix satisfying $U^\top \Omega = 0$, $\Omega V = 0$, and $\left\|\Omega\right\|_{2} \leq 1$.
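As an illustration of how (21)-(23) can be used, the sketch below performs one proximal-gradient update for the A-subproblem (20): a gradient step on the two smooth terms followed by singular value shrinkage for the nuclear norm. The choice of proximal gradient (rather than, say, ADMM) and the step size are assumptions for illustration, not the exact procedure of the paper.

```python
# One proximal-gradient update for subproblem (20).
import numpy as np

def update_A(A, X, Y, lam1, lam2, lam3, step=1e-2):
    grad_trace = np.sum(X**2, axis=1)[:, None] - X @ X.T   # equation (21), all rows at once
    grad_recon = 2 * (A @ Y - Y) @ Y.T                      # equation (22)
    M = A - step * (lam1 * grad_trace + lam2 * grad_recon)  # gradient step on smooth terms
    U, s, Vt = np.linalg.svd(M, full_matrices=False)        # proximal step for lam3 * ||A||_*
    return (U * np.maximum(s - step * lam3, 0.0)) @ Vt
```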

Based on the above derivations, we can learn the whole model, as described in Algorithm 1; a skeleton of this alternating loop is also sketched after the algorithm. The main computational cost of C-GPLVM is the inversion of the kernel matrix, which has a complexity of $O(N^{3})$, where $N$ is the number of training samples. The main storage cost is the kernel matrix itself, which requires $O(N^{2})$ memory. Thus, both the computational and storage complexities are the same as those of the conventional GPLVM.

Input: training set $Y$, dimension $Q$ of the latent variables, hyperparameters $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$.
Output: the learned parameters ($\boldsymbol{\theta}$, $\sigma$, $W_g$, $A$) and the clustering result.
(1) Pretrain the model without the clustering terms, i.e., minimize the negative log-likelihood term of (17) subject to $X = g(Y; W_g)$, to initialize the latent variables $X$.
(2) while (the objective function has not converged)
(3) With $A$ fixed, minimize (18) to obtain $\boldsymbol{\theta}$, $\sigma$, and $W_g$.
(4) With $\boldsymbol{\theta}$, $\sigma$, and $W_g$ fixed, minimize (20) to obtain $A$.
(5) Calculate the difference of the objective function between the last two iterations
(6)end while
(7) Compute the latent variables $X = g(Y; W_g)$ based on the learned parameters.
(8) Run the k-means algorithm on $X$ to obtain the final clustering result.
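The harness below mirrors the outer loop of Algorithm 1. The two inner solvers are passed in as callables (e.g., a gradient-based optimizer for step 3 and a proximal solver for step 4), so this sketch fixes only the alternating structure and the stopping rule; both are assumptions about implementation details the algorithm leaves unspecified.

```python
# Outer loop of the alternating optimization in Algorithm 1.
def alternating_optimize(objective, solve_gplvm_params, solve_A,
                         params, A, tol=1e-4, max_iter=50):
    prev = objective(params, A)
    for _ in range(max_iter):
        params = solve_gplvm_params(params, A)   # step 3: minimize (18) with A fixed
        A = solve_A(params, A)                   # step 4: minimize (20) with params fixed
        curr = objective(params, A)
        if abs(prev - curr) < tol:               # step 5: change in the objective value
            break
        prev = curr
    return params, A
```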

5. Experiments and Analysis

5.1. Experimental Setup

To verify the effectiveness of C-GPLVM, we use 8 datasets in the experiments. The detailed information of these datasets is given in Table 1.

YEAST is a dataset for the prediction of protein localization sites. USPS is a handwritten digits dataset that was gathered at the Center of Excellence in Document Analysis and Recognition at SUNY Buffalo as part of a project sponsored by the US Postal Service. YALE, JAFFE, and ORL are three face recognition datasets, as shown in Figure 2. TR11, TR41, and TR45 are three text datasets.

In order to fully verify the advantages of C-GPLVM, we compare it with related Gaussian process latent variable models (i.e., GPLVM, B-GPLVM, and LPP-GPLVM) and clustering methods, namely the spectral clustering method (SC) [24], kernel spectral clustering (KSC) [25], and simplex sparse representation learning (SSR) [21]. All the kernel-based models (GPLVM, LPP-GPLVM, KSC, and C-GPLVM) use the radial basis function (RBF) kernel. It is worth noting that other kernel functions, such as the linear kernel, the Laplacian kernel, and the circular kernel, can also be used in the proposed model, and the hyperparameters involved in these kernels can be learned in the same way as described in this paper. In the experiments, the hyperparameters $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are chosen from a candidate set by grid search. The hyperparameters involved in the other models are set to be the same as those in the original papers. We use the Gaussian process toolkit GPflow to implement the GP-based models; the other related models are implemented in Python. All the algorithms are tested on a Windows computer with an Intel Core i7-9700 CPU and 16 GB RAM.

5.2. Clustering Results and Analysis

In the experiments, we use clustering accuracy, purity, and normalized mutual information (NMI) as the clustering metrics. At the clustering stage, the latent variables learned by the different methods are used as inputs, and the k-means algorithm is used to obtain the final clustering results. The latent dimension and the number of clusters are both set to the number of classes. At the same time, in order to mitigate the sensitivity of k-means to its initial values, we randomly initialize and run the k-means method 20 times. Finally, we report the mean and standard deviation over these 20 runs. The experimental results are shown in Tables 2-4, where the best results are given in bold.
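The evaluation protocol described above can be sketched as follows; `X_latent` and `labels` stand for the learned latent variables and the integer-coded ground-truth classes, and purity is computed analogously. The variable and function names are illustrative.

```python
# 20 k-means runs with random initializations; report mean/std of accuracy and NMI.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(labels, preds):
    """Best accuracy over all one-to-one matchings of predicted and true clusters."""
    k = max(labels.max(), preds.max()) + 1
    counts = np.zeros((k, k), dtype=int)
    for t, p in zip(labels, preds):
        counts[t, p] += 1
    row, col = linear_sum_assignment(-counts)      # Hungarian matching on the confusion counts
    return counts[row, col].sum() / len(labels)

def evaluate(X_latent, labels, n_clusters, runs=20):
    accs, nmis = [], []
    for seed in range(runs):
        preds = KMeans(n_clusters=n_clusters, n_init=1, random_state=seed).fit_predict(X_latent)
        accs.append(clustering_accuracy(labels, preds))
        nmis.append(normalized_mutual_info_score(labels, preds))
    return (np.mean(accs), np.std(accs)), (np.mean(nmis), np.std(nmis))
```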

From Tables 2-4, we can observe that the GPLVM, as an unsupervised dimension reduction model, usually obtains latent variables with poor clustering performance. The B-GPLVM and LPP-GPLVM can preserve the local distances between samples during feature learning and thus obtain more representative latent variables. Meanwhile, the LPP-GPLVM obtains much better results than the B-GPLVM, which indicates that graph Laplacian regularization is more suitable for clustering than back constraints alone. In general, the spectral clustering and subspace clustering methods perform better than the GPLVM: SC, KSC, and SSR outperform GPLVM, B-GPLVM, and LPP-GPLVM. The proposed C-GPLVM combines subspace clustering with the GPLVM and thus effectively improves the clustering performance of the GPLVM. As the experimental results show, the C-GPLVM achieves better clustering results than the other related models in most cases.

6. Conclusion and Future Work

This paper proposes a joint model that combines low-rank subspace clustering with a back-constrained GPLVM to address the poor clustering performance of the conventional GPLVM. The proposed C-GPLVM can not only obtain low-dimensional latent variables but also directly predict new samples, thus effectively extending the application scope of the GPLVM to tasks such as the analysis of chaotic time series. The experimental results show that the C-GPLVM has better latent variable learning ability and superior clustering performance. In future work, we will further extend the C-GPLVM to make it suitable for much larger datasets and for supervised tasks such as classification and regression, improving its efficiency and broadening its application scope.

Data Availability

The experimental data used to support the findings of this study have been deposited in the UCI repository (https://archive.ics.uci.edu/ml/index.php).

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the Project of Yulin Normal University Research under Grant 2015YJYB03.