Mathematical Problems in Engineering
Volume 2015 (2015), Article ID 706180, 10 pages
Research Article

Semisupervised Tangent Space Discriminant Analysis

Shanghai Key Laboratory of Multidimensional Information Processing, Department of Computer Science and Technology, East China Normal University, 500 Dongchuan Road, Shanghai 200241, China

Received 8 July 2014; Revised 5 November 2014; Accepted 14 November 2014

Academic Editor: Xin Xu

Copyright © 2015 Yang Zhou and Shiliang Sun. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


A novel semisupervised dimensionality reduction method named Semisupervised Tangent Space Discriminant Analysis (STSD) is presented, where we assume that data can be well characterized by a linear function on the underlying manifold. For this purpose, a new regularizer using tangent spaces is developed, which not only captures the local manifold structure from both labeled and unlabeled data but is also complementary to the Laplacian regularizer. Furthermore, STSD has an analytic form of the globally optimal solution, which can be computed by solving a generalized eigenvalue problem. To perform nonlinear dimensionality reduction and process structured data, a kernel extension of our method is also presented. Experimental results on multiple real-world data sets demonstrate the effectiveness of the proposed method.

1. Introduction

Dimensionality reduction aims to find a low-dimensional representation of high-dimensional data while preserving as much of the information in the data as possible. Processing data in the low-dimensional space can reduce computational cost and suppress noise. Provided that dimensionality reduction is performed appropriately, the discovered low-dimensional representation of data will benefit subsequent tasks, for example, classification, clustering, and data visualization. Classical dimensionality reduction methods include supervised approaches like linear discriminant analysis (LDA) [1] and unsupervised ones such as principal component analysis (PCA) [2].

LDA is a supervised dimensionality reduction method. It finds a subspace in which the data points from different classes are projected far away from each other, while the data points belonging to the same class are projected as close as possible. One merit of LDA is that LDA can extract the discriminative information of data, which is crucial for classification. Due to its effectiveness, LDA is widely used in many applications, for example, bankruptcy prediction, face recognition, and data mining. However, LDA may get undesirable results when the labeled examples used for learning are not sufficient, because the between-class scatter and the within-class scatter of data could be estimated inaccurately.

PCA is a representative of unsupervised dimensionality reduction methods. It seeks a set of orthogonal projection directions along which the sum of the variances of data is maximized. PCA is a common data preprocessing technique to find a low-dimensional representation of high-dimensional data. In order to meet the requirements of different applications, many unsupervised dimensionality reduction methods have been proposed, such as Laplacian Eigenmaps [3], Hessian Eigenmaps [4], Locally Linear Embedding [5], Locality Preserving Projections [6], and Local Tangent Space Alignment [7]. Although it is shown that unsupervised approaches work well in many applications, they may not be the best choices for some learning scenarios because they may fail to capture the discriminative structure from data.

In many real-world applications, only limited labeled data can be accessed while a large number of unlabeled data are available. In this case, it is reasonable to perform semisupervised learning which can utilize both labeled and unlabeled data. Recently, several semisupervised dimensionality reduction methods have been proposed, for example, Semisupervised Discriminant Analysis (SDA) [8], Semisupervised Discriminant Analysis (SSDA) with path-based similarity [9], and Semisupervised Local Fisher Discriminant Analysis (SELF) [10]. SDA aims to find a transformation matrix following the criterion of LDA while imposing a smoothness penalty on a graph which is built to exploit the local geometry of the underlying manifold. Similarly, SSDA also builds a graph for semisupervised learning. However, the graph is constructed using a path-based similarity measure to capture the global structure of data. SELF combines the ideas of local LDA [11] and PCA so that it can integrate the information brought by both labeled and unlabeled data.

Although all of these methods have their own advantages in semisupervised learning, the essential strategy many of them use to exploit unlabeled data relies on Laplacian regularization. In this paper, we present a novel method named Semisupervised Tangent Space Discriminant Analysis (STSD) for semisupervised dimensionality reduction, which can reflect the discriminant information and a specific manifold structure from both labeled and unlabeled data. Rather than adopting a Laplacian-based regularizer, we develop a new regularization term which can discover the linearity of the local manifold structure of data. Specifically, by introducing tangent spaces we represent the local geometry at each data point as a linear function and make the change of such functions as smooth as possible. This means that STSD favors a linear function on the manifold. In addition, the objective function of STSD can be optimized analytically by solving a generalized eigenvalue problem.

2. Preliminaries

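Consider a data set consisting of $n$ examples and labels, $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ denotes a $d$-dimensional example, $y_i \in \{1, 2, \ldots, C\}$ denotes the class label corresponding to $x_i$, and $C$ is the total number of classes. LDA seeks a transformation $t \in \mathbb{R}^d$ such that the between-class scatter is maximized and the within-class scatter is minimized [1]. The objective function of LDA can be written as
$$\max_{t} \frac{t^{\top} S_b t}{t^{\top} S_w t}, \tag{1}$$
where $\top$ denotes the transpose of a matrix or a vector, $S_b$ is the between-class scatter matrix, and $S_w$ is the within-class scatter matrix. The definitions of $S_b$ and $S_w$ are
$$S_b = \sum_{c=1}^{C} n_c (\mu_c - \mu)(\mu_c - \mu)^{\top}, \tag{2}$$
$$S_w = \sum_{c=1}^{C} \sum_{i \mid y_i = c} (x_i - \mu_c)(x_i - \mu_c)^{\top}, \tag{3}$$
where $n_c$ is the number of examples from the $c$th class, $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the mean of all the examples, and $\mu_c = \frac{1}{n_c}\sum_{i \mid y_i = c} x_i$ is the mean of the examples from class $c$.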

Define the total scatter matrix as
$$S_t = \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^{\top}. \tag{4}$$
It is well known that $S_t = S_b + S_w$ [1] and (1) is equivalent to
$$\max_{t} \frac{t^{\top} S_b t}{t^{\top} S_t t}. \tag{5}$$
The solution of (5) can be readily obtained by solving a generalized eigenvalue problem: $S_b t = \lambda S_t t$. It should be noted that the rank of the between-class scatter matrix $S_b$ is at most $C - 1$, and thus we can obtain at most $C - 1$ meaningful eigenvectors with respect to nonzero eigenvalues. This implies that LDA can project data into a space whose dimensionality is at most $C - 1$.

In practice, we usually impose a regularizer on (5) to obtain a more stable solution. Then the optimization problem becomes
$$\max_{t} \frac{t^{\top} S_b t}{t^{\top} S_t t + \alpha R(t)}, \tag{6}$$
where $R(t)$ denotes the imposed regularizer and $\alpha$ is a trade-off parameter. When we use the Tikhonov regularizer, that is, $R(t) = \|t\|_2^2$, the optimization problem is usually referred to as Regularized Discriminant Analysis (RDA) [12].
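To make the regularized problem concrete, the following NumPy sketch builds the between-class and total scatter matrices and solves the generalized eigenvalue problem with a Tikhonov term added to the total scatter. The function name `regularized_lda`, the row-wise data layout, and the Cholesky-based reduction to a symmetric eigenproblem are choices made for this illustration only; they are assumptions, not part of the method description.

```python
import numpy as np

def regularized_lda(X, y, alpha=1e-3, n_components=None):
    """Sketch of Tikhonov-regularized LDA.

    X: (n, d) array, one example per row (layout chosen for this sketch).
    y: (n,) array of class labels.
    Returns (T, eigenvalues) where T is (d, k) with projection directions as columns.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    d = X.shape[1]
    mu = X.mean(axis=0)

    # Between-class scatter: sum_c n_c (mu_c - mu)(mu_c - mu)^T
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        diff = (Xc.mean(axis=0) - mu)[:, None]
        Sb += Xc.shape[0] * (diff @ diff.T)

    # Total scatter: sum_i (x_i - mu)(x_i - mu)^T
    Xm = X - mu
    St = Xm.T @ Xm

    # Tikhonov regularizer ||t||^2 contributes alpha * I to the denominator matrix.
    M = St + alpha * np.eye(d)

    # Reduce Sb t = lambda M t to a symmetric problem using M = L L^T.
    L = np.linalg.cholesky(M)
    Linv = np.linalg.inv(L)
    evals, V = np.linalg.eigh(Linv @ Sb @ Linv.T)

    k = n_components or (len(classes) - 1)
    order = np.argsort(evals)[::-1][:k]      # largest eigenvalues first
    T = Linv.T @ V[:, order]                 # map back: t = L^{-T} v
    return T, evals[order]
```

A low-dimensional embedding of the rows of `X` is then `X @ T`. Because the between-class scatter is dominated by the total scatter, the returned eigenvalues lie between 0 and 1.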

3. Semisupervised Tangent Space Discriminant Analysis

As a supervised method, LDA has no ability to extract information from unlabeled data. Motivated by Tangent Space Intrinsic Manifold Regularization (TSIMR) [13], we develop a novel regularizer to capture the manifold structure of both labeled and unlabeled data. Utilizing this regularizer, the LDA model can be extended to a semisupervised one following the regularization framework. In this section, we first derive our novel regularizer for semisupervised learning and then present our Semisupervised Tangent Space Discriminant Analysis (STSD) algorithm as well as its kernel extension.

3.1. The Regularizer for Semisupervised Dimensionality Reduction

TSIMR [13] is a regularization method for unsupervised dimensionality reduction, which is intrinsic to the data manifold and favors a linear function on the manifold. Inspired by TSIMR, we employ tangent spaces to represent the local geometry of data. Suppose that the data are sampled from an $m$-dimensional smooth manifold $\mathcal{M}$ in a $d$-dimensional space. Let $\mathcal{T}_z\mathcal{M}$ denote the tangent space attached to $z$, where $z \in \mathcal{M}$ is a fixed data point on $\mathcal{M}$. Using the first-order Taylor expansion at $z$, any function $f$ defined on the manifold $\mathcal{M}$ can be expressed as
$$f(x) = f(z) + w_z^{\top} u_z(x) + O(\|x - z\|^2), \tag{7}$$
where $x \in \mathbb{R}^d$ is a $d$-dimensional data point and $u_z(x) = T_z^{\top}(x - z)$ is an $m$-dimensional tangent vector which gives the $m$-dimensional representation of $x$ in $\mathcal{T}_z\mathcal{M}$. $T_z$ is a $d \times m$ matrix formed by the orthonormal bases of $\mathcal{T}_z\mathcal{M}$, which can be estimated through local PCA, that is, performing standard PCA on the neighborhood of $z$. $w_z$ is an $m$-dimensional vector representing the directional derivative of $f$ at $z$ with respect to $u_z(x)$ on the manifold $\mathcal{M}$.
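The local PCA step can be illustrated with a short NumPy sketch: take a point together with its nearest neighbors, center them, and keep the leading principal directions as an orthonormal basis of the estimated tangent space. The function name `tangent_basis`, the row-wise data layout, and the use of an SVD to carry out PCA are assumptions made for this illustration.

```python
import numpy as np

def tangent_basis(X, i, k, m):
    """Estimate an orthonormal tangent-space basis at X[i] by local PCA.

    X: (n, d) array, one example per row.
    i: index of the anchor point.
    k: number of nearest neighbours used (Euclidean distance).
    m: assumed manifold dimension, with m <= min(k, d).
    Returns a (d, m) matrix with orthonormal columns.
    """
    dist = np.linalg.norm(X - X[i], axis=1)
    idx = np.argsort(dist)[:k + 1]              # the anchor itself plus k neighbours
    local = X[idx] - X[idx].mean(axis=0)        # centre the neighbourhood
    # Rows of Vt are the principal directions of the neighbourhood in R^d.
    _, _, Vt = np.linalg.svd(local, full_matrices=False)
    return Vt[:m].T
```

For points sampled densely from a circle, for instance, the single returned column with `m=1` is close to the unit tangent of the circle at the anchor point, up to sign.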

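Consider a transformation $t \in \mathbb{R}^d$ which can map the $d$-dimensional data to a one-dimensional embedding. Then the embedding of $x$ can be expressed as $f(x) = t^{\top} x$. If there are two data points $z$ and $z'$ that have a small Euclidean distance, by using the first-order Taylor expansion at $z'$ and $z$, the embeddings $f(z)$ and $f(z')$ can be represented as
$$f(z) = f(z') + w_{z'}^{\top} u_{z'}(z) + O(\|z - z'\|^2), \tag{8}$$
$$f(z') = f(z) + w_{z}^{\top} u_{z}(z') + O(\|z' - z\|^2). \tag{9}$$
Suppose that the data can be well characterized by a linear function on the underlying manifold $\mathcal{M}$. Then the remainders in (8) and (9) can be omitted.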

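Substituting $f(x) = t^{\top} x$ into (8), we have
$$t^{\top} z = t^{\top} z' + w_{z'}^{\top} T_{z'}^{\top} (z - z'). \tag{10}$$
Furthermore, by substituting (9) into (8), we obtain
$$\left(T_{z'} w_{z'} - T_{z} w_{z}\right)^{\top} (z - z') = 0, \tag{11}$$
which naturally leads to
$$T_{z} w_{z} \approx T_{z'} w_{z'}. \tag{12}$$
Since $T_z$ is formed by the orthonormal bases of $\mathcal{T}_z\mathcal{M}$, it satisfies $T_z^{\top} T_z = I_{m \times m}$ for all $z$, where $I_{m \times m}$ is an $m$-dimensional identity matrix. We can multiply both sides of (12) by $T_z^{\top}$; then (12) becomes
$$w_{z} \approx T_{z}^{\top} T_{z'} w_{z'}. \tag{13}$$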

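Armed with the above results, we can formulate our regularizer for semisupervised dimensionality reduction. Consider data $x_i \in \mathbb{R}^d$ ($i = 1, \ldots, n$) sampled from a function $f$ along the manifold $\mathcal{M}$. Since every example $x_i$ and its neighbors should satisfy (10) and (13), it is reasonable to formulate a regularizer as follows:
$$R(t, w) = \sum_{i=1}^{n} \sum_{j \in N(x_i)} \left[ \left( t^{\top}(x_i - x_j) - w_{x_j}^{\top} T_{x_j}^{\top} (x_i - x_j) \right)^2 + \gamma \left\| w_{x_i} - T_{x_i}^{\top} T_{x_j} w_{x_j} \right\|_2^2 \right], \tag{14}$$
where $w = (w_{x_1}^{\top}, \ldots, w_{x_n}^{\top})^{\top}$, $N(x_i)$ denotes the set of nearest neighbors of $x_i$, and $\gamma$ is a trade-off parameter to control the influences of (10) and (13).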

Relating data to a discrete weighted graph is a popular choice, and there is indeed a large family of graph-based statistical and machine learning methods. It also makes sense for us to generalize the regularizer in (14) using a symmetric weight matrix $W$ constructed from the above data collection $\{x_i\}_{i=1}^{n}$. There are several ways to construct $W$. One typical way is to build an adjacency graph by connecting each data point to its $k$ nearest neighbors with an edge and then weight every edge of the graph by a certain measure. Generally, if two data points $x_i$ and $x_j$ are “close,” the corresponding weight $W_{ij}$ is large, whereas if they are “far away,” then $W_{ij}$ is small. For example, the heat kernel function is widely used to construct a weight matrix. The weight $W_{ij}$ is computed by
$$W_{ij} = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right) \tag{15}$$
if there is an edge connecting $x_i$ with $x_j$, and $W_{ij} = 0$ otherwise.
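The graph construction described above can be sketched in a few lines of NumPy. Two details are assumptions of this sketch rather than statements from the text: the heat kernel is parameterized as exp(-dist² / (2·sigma²)), and the graph is symmetrized by keeping an edge whenever either endpoint lists the other among its `k` nearest neighbors. The function name `heat_kernel_weights` is likewise illustrative.

```python
import numpy as np

def heat_kernel_weights(X, k, sigma):
    """Symmetric k-nearest-neighbour weight matrix with heat-kernel weights.

    X: (n, d) array, one example per row (assumed to contain no duplicate rows).
    k: number of nearest neighbours per point.
    sigma: heat-kernel width (parameterization assumed for this sketch).
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(sq[i])[1:k + 1]       # position 0 is the point itself
        W[i, nbrs] = np.exp(-sq[i, nbrs] / (2.0 * sigma ** 2))
    # Keep an edge if either endpoint selected the other; this makes W symmetric.
    return np.maximum(W, W.T)
```

Because the weight depends only on the distance between the two endpoints, taking the elementwise maximum of `W` and its transpose simply fills in the missing direction of each edge.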

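Therefore, the generalization of the proposed regularizer turns out to be
$$R(t, w) = \sum_{i=1}^{n} \sum_{j=1}^{n} W_{ij} \left[ \left( t^{\top}(x_i - x_j) - w_{x_j}^{\top} T_{x_j}^{\top} (x_i - x_j) \right)^2 + \gamma \left\| w_{x_i} - T_{x_i}^{\top} T_{x_j} w_{x_j} \right\|_2^2 \right], \tag{16}$$
where $W$ is an $n \times n$ symmetric weight matrix reflecting the similarity of the data points. It is clear that when the variation of the first-order Taylor expansion at every data point is smooth, the value of $R(t, w)$, which measures the linearity of the function $f$ along the manifold $\mathcal{M}$, will be small.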

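The regularizer (16) can be reformulated as a canonical matrix quadratic form as follows:
$$R(t, w) = \begin{pmatrix} t \\ w \end{pmatrix}^{\top} S \begin{pmatrix} t \\ w \end{pmatrix} = \begin{pmatrix} t \\ w \end{pmatrix}^{\top} \begin{pmatrix} X S_1 X^{\top} & X S_2 \\ S_2^{\top} X^{\top} & S_3 \end{pmatrix} \begin{pmatrix} t \\ w \end{pmatrix}, \tag{17}$$
where $X = (x_1, \ldots, x_n)$ is the data matrix and $S$ is a positive semidefinite matrix constructed by four blocks, that is, $X S_1 X^{\top}$, $X S_2$, $S_2^{\top} X^{\top}$, and $S_3$. This formulation will be very useful in developing our algorithm. Recall that the dimensionality of the directional derivative $w_{x_i}$ ($i = 1, \ldots, n$) is $m$. Thereby the size of $S$ is $(d + mn) \times (d + mn)$. For simplicity, we omit the detailed derivation of $S$.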
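Since the closed-form blocks of the quadratic form are not derived here, one way to see that such a matrix exists is to assemble it numerically. The sketch below assumes the regularizer is a weighted sum, over all pairs (i, j), of a squared fitting residual t·(x_i − x_j) − w_j·T_jᵀ(x_i − x_j) plus gamma times the squared norm of w_i − T_iᵀ T_j w_j, where T_i is the tangent basis at x_i and w_i the directional derivative there. Every term is linear in the stacked vector (t, w_1, ..., w_n), so each contributes a rank-limited positive semidefinite update. The function names and the row-wise data layout are assumptions of this illustration.

```python
import numpy as np

def assemble_S(X, T, W, gamma):
    """Assemble S so that f @ S @ f equals the tangent-space regularizer.

    X: (n, d) data, one example per row.  T: list of n arrays of shape (d, m).
    W: (n, n) symmetric weights.  f stacks t (d entries), then w_1, ..., w_n.
    """
    n, d = X.shape
    m = T[0].shape[1]
    size = d + m * n
    S = np.zeros((size, size))
    for i in range(n):
        for j in range(n):
            if W[i, j] == 0.0:
                continue
            diff = X[i] - X[j]
            # Row vector a with a @ f = t.diff - w_j.(T_j^T diff)
            a = np.zeros(size)
            a[:d] = diff
            a[d + j * m: d + (j + 1) * m] = -T[j].T @ diff
            # Matrix B with B @ f = w_i - T_i^T T_j w_j
            B = np.zeros((m, size))
            B[:, d + i * m: d + (i + 1) * m] += np.eye(m)
            B[:, d + j * m: d + (j + 1) * m] -= T[i].T @ T[j]
            S += W[i, j] * (np.outer(a, a) + gamma * B.T @ B)
    return S

def regularizer_direct(X, T, W, gamma, t, w):
    """Evaluate the same regularizer term by term; w has shape (n, m)."""
    n = X.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(n):
            diff = X[i] - X[j]
            fit = t @ diff - w[j] @ (T[j].T @ diff)
            smooth = w[i] - T[i].T @ T[j] @ w[j]
            total += W[i, j] * (fit ** 2 + gamma * smooth @ smooth)
    return total
```

With `f = np.concatenate([t, w.ravel()])`, the value `f @ S @ f` agrees with `regularizer_direct(...)` up to rounding, and `S` is symmetric positive semidefinite by construction. The double loop is written for clarity rather than speed.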

It should be noted that, besides following the same principle as TSIMR, the regularizer (16) can be explained from another perspective. Recently, Lin et al. [14] proposed a regularization method called Parallel Field Regularization (PFR) for semisupervised regression. Despite the different learning scenarios, PFR shares the same spirit as TSIMR in essence. Moreover, when the bases of the tangent space at any data point are orthonormal, PFR can be converted to TSIMR. PFR also provides a more theoretical but more complex explanation for our regularizer from the vector field perspective.

3.2. An Algorithm

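With the regularizer developed in Section 3.1, we can present our STSD algorithm. Suppose the training data include $\ell$ labeled examples $\{(x_i, y_i)\}_{i=1}^{\ell}$ belonging to $C$ classes and $u$ unlabeled examples $\{x_i\}_{i=\ell+1}^{n}$, where $x_i \in \mathbb{R}^d$ is a $d$-dimensional example, $y_i \in \{1, 2, \ldots, C\}$ is the class label associated with the example $x_i$, and $n = \ell + u$. Define $f = (t^{\top}, w^{\top})^{\top}$, and let
$$\tilde{S}_b = \begin{pmatrix} S_b & 0 \\ 0 & 0 \end{pmatrix}, \qquad \tilde{S}_t = \begin{pmatrix} S_t & 0 \\ 0 & 0 \end{pmatrix}$$
be two $(d + mn) \times (d + mn)$ augmented matrices extended from the between-class scatter matrix $S_b$ and the total scatter matrix $S_t$. Note that in the semisupervised learning scenario discussed in this section, the mean of all the samples in (2) and (4) should be the center of both the labeled and unlabeled examples; that is, $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$. The objective function of STSD can be written as follows:
$$\max_{f} \frac{f^{\top} \tilde{S}_b f}{f^{\top} (\tilde{S}_t + \alpha S) f}, \tag{18}$$
where $\alpha$ is a trade-off parameter. It is clear that $f^{\top} \tilde{S}_b f = t^{\top} S_b t$ and $f^{\top} \tilde{S}_t f = t^{\top} S_t t$. Therefore, STSD seeks an optimal $f$ such that the between-class scatter is maximized, and the total scatter as well as the regularizer defined in (17) is minimized at the same time.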

The optimization of the objective function (18) can be achieved by solving a generalized eigenvalue problem:
$$\tilde{S}_b f = \lambda (\tilde{S}_t + \alpha S) f, \tag{19}$$
whose solution is given by the eigenvector with respect to the maximal eigenvalue. Note that since the mean $\mu$ is the center of both labeled and unlabeled examples, the rank of $\tilde{S}_b$ is $C$. It implies that there are at most $C$ eigenvectors with respect to the nonzero eigenvalues. Therefore, given the optimal eigenvectors $f_1, \ldots, f_C$, we can form a transformation matrix sized $d \times C$ as $T = (t_1, \ldots, t_C)$, and then the $C$-dimensional embedding $b$ of an example $x$ can be computed through $b = T^{\top} x$.

In many applications, especially when the dimensionality of data is high while the data size is small, the matrix $\tilde{S}_t + \alpha S$ in (19) may be singular. This singularity problem may lead to an unstable solution and deteriorate the performance of STSD. Fortunately, there are many approaches to deal with the singularity problem. In this paper, we use the Tikhonov regularization because of its simplicity and wide applicability. Finally, the generalized eigenvalue problem (19) turns out to be
$$\tilde{S}_b f = \lambda (\tilde{S}_t + \alpha S + \beta I) f, \tag{20}$$
where $I$ is the identity matrix and $\beta$ is the Tikhonov regularization parameter. Algorithm 1 gives the pseudocode for STSD.

Algorithm 1: STSD.
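The final step of the procedure, solving the Tikhonov-regularized generalized eigenvalue problem and reading off the transformation, can be sketched in NumPy as follows. The sketch assumes the scatter matrices and the regularizer matrix have already been computed, that the regularizer matrix is symmetric positive semidefinite, and that the Tikhonov parameter is strictly positive so that a Cholesky factorization exists. The function name `solve_stsd` and its argument names are hypothetical.

```python
import numpy as np

def solve_stsd(Sb, St, S, alpha, beta, n_components):
    """Solve Sb_aug f = lambda (St_aug + alpha * S + beta * I) f.

    Sb, St: (d, d) between-class and total scatter matrices.
    S: (D, D) symmetric PSD regularizer matrix with D >= d.
    alpha: trade-off parameter; beta > 0: Tikhonov parameter.
    Returns (F, eigenvalues); columns of F are the stacked solutions f,
    ordered by decreasing eigenvalue.
    """
    d = Sb.shape[0]
    D = S.shape[0]
    # Pad the scatter matrices with zeros to match the size of S.
    Sb_aug = np.zeros((D, D))
    Sb_aug[:d, :d] = Sb
    St_aug = np.zeros((D, D))
    St_aug[:d, :d] = St

    M = St_aug + alpha * S + beta * np.eye(D)   # positive definite since beta > 0
    L = np.linalg.cholesky(M)                   # M = L L^T
    Linv = np.linalg.inv(L)
    evals, V = np.linalg.eigh(Linv @ Sb_aug @ Linv.T)

    order = np.argsort(evals)[::-1][:n_components]
    F = Linv.T @ V[:, order]                    # f = L^{-T} v
    return F, evals[order]
```

The first `d` rows of `F` form the transformation matrix, so an example `x` of length `d` would be embedded as `F[:d].T @ x`. For large problems a sparse eigensolver would be preferable to the dense factorization used here.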

The main computational cost of STSD lies in building tangent spaces for data points and solving the generalized eigenvalue problem (20). The naive implementation for our algorithm has a runtime of for the construction of tangent spaces and for the generalized eigenvalue decomposition. This suggests that STSD might be a time-consuming method.

However, given a neighborhood size , there are only examples as the inputs of local PCA. Then we can obtain at most meaningful orthonormal bases to construct each tangent space, which implies that the dimensionality of the directional derivative () is always less than . In practice, is usually small to ensure the locality. This makes sure that is actually a small constant. Furthermore, recall that the number of eigenvectors with respect to nonzero eigenvalues is equal to the number of classes . Using the technique of sparse generalized eigenvalue decomposition, the corresponding computational cost is reduced to .

In summary, the overall runtime of STSD is . Since and are always small, STSD actually has an acceptable computational cost.

3.3. Kernel STSD

Essentially, STSD is a linear dimensionality reduction method, which cannot be used for nonlinear dimensionality reduction or for processing structured data such as graphs, trees, or other types of structured inputs. To handle this problem, we extend STSD to a Reproducing Kernel Hilbert Space (RKHS).

Suppose examples (), where is an input domain. Consider a feature space induced by a nonlinear mapping . We can construct an RKHS by defining a kernel function using the inner product operation , such that . Let , be the labeled and unlabeled data matrix in the feature space , respectively. Then the total data matrix can be written as .

Let be the mean of all the examples in and define which is constituted by the mean vectors of each class in . Suppose that (it can be easily achieved by centering the data in the feature space) and the labeled examples in are ordered according to their labels. Then the between-class scatter matrix and the total scatter matrix in can be written as , where is a diagonal matrix whose th element is the number of the examples belonging to class and is a matrix where is the identity matrix sized .
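Centering the data in the feature space does not require the explicit mapping: it can be carried out directly on the kernel matrix by multiplying with the usual centering matrix on both sides. The NumPy sketch below illustrates this standard identity; the function name `center_kernel` is an assumption of this illustration.

```python
import numpy as np

def center_kernel(K):
    """Return the kernel matrix of the feature-space-centred data.

    K: (n, n) kernel matrix with K[i, j] = k(x_i, x_j).
    Uses H K H with H = I - (1/n) * ones, which subtracts the feature-space mean.
    """
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H
```

For a linear kernel, the result coincides with the Gram matrix of the mean-subtracted data, which gives a quick sanity check of the identity.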

Recall that STSD aims to find a set of transformations to map data into a low-dimensional space. Given examples , one can use the orthogonal projection to decompose any transformation into a sum of two functions: one lying in the and the other one lying in the orthogonal complementary space. Therefore, there exist a set of coefficients () satisfyingwhere and for all . Note that although we set and optimize and together, there is no need to reparametrize like . What we need is to estimate tangent spaces in through local Kernel PCA [15].

Let be the matrix formed by the orthonormal bases of the tangent space attached to . Substitute (21) into (17) and replace with (). We can reformulate the regularizer (17) as follows: where is a kernel matrix with . With this formulation, Kernel STSD can be converted to a generalized eigenvalue problem as follows:where we have defined . The definitions of , , and are given as follows: It should be noted that every term of vanishes from the formulation of Kernel STSD because for all . Since can be computed through the kernel matrix , the solution of Kernel STSD can be obtained without knowing the explicit form of the mapping .

Given the eigenvectors with respect to the nonzero eigenvalues of (23), the resulting transformation matrix can be written as . Then, the embedding of an original example can be computed as

4. Experiments

4.1. Toy Data

To illustrate the behavior of STSD, we first apply STSD to a toy data set (two moons) and compare it with PCA and LDA. The toy data set contains 100 data points and is used under different label configurations. Specifically, 6, 10, 50, and 80 data points are randomly labeled, respectively, and the rest are unlabeled. PCA is trained on all the data points without labels, LDA is trained on the labeled data only, and STSD is trained on both the labeled and unlabeled data. In Figure 1, we show the one-dimensional embedding spaces found by the different methods (onto which data points will be projected). As can be seen in Figure 1(a), although LDA is in principle able to find a projection where the within-class scatter is minimized while the between-class separability is maximized, it can hardly find a good projection when the labeled data are scarce. In addition, PCA also finds a bad solution, since it has no ability to utilize the discriminant information from class labels. In contrast, STSD, which can utilize both the labeled and unlabeled data, finds a desirable projection on which data from different classes overlap minimally. As the number of labeled data points increases, the solutions of PCA and STSD do not change, while the projections found by LDA gradually approach those of STSD. In Figure 1(d), the solutions of LDA and STSD are almost identical, which means that by utilizing both labeled and unlabeled data, STSD can obtain the optimum solution even when only a few data points are labeled. This demonstrates the usefulness and advantage of STSD in the semisupervised scenario.

Figure 1: Illustrative examples of STSD, LDA, and PCA on the two-moon data set under different label configurations. The circles and squares denote the data points in positive and negative classes, and the filled or unfilled symbols denote the labeled or unlabeled data, respectively.

4.2. Real-World Data

In this section, we evaluate STSD with real-world data sets. Specifically, we first perform dimensionality reduction to map all examples into a subspace and then carry out classification using the nearest neighbor classifier (1-NN) in the subspace. This way of evaluating semisupervised dimensionality reduction methods is widely used in the literature, such as [8–10, 16]. For each data set, we randomly split out 80% of the data as the training set and the rest as the test set. In the training set, a certain number of data points are randomly labeled while the rest are unlabeled. Moreover, every experimental result is obtained as the average over 20 splits.
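The evaluation protocol, projecting the data with a learned transformation and then classifying with 1-NN against the labeled examples, can be sketched as follows. The function name `one_nn_error`, the row-wise data layout, and the use of squared Euclidean distance in the subspace are assumptions of this illustration.

```python
import numpy as np

def one_nn_error(T, X_lab, y_lab, X_eval, y_eval):
    """Error rate of a 1-NN classifier in the subspace defined by T.

    T: (d, k) transformation matrix (columns are projection directions).
    X_lab, y_lab: labeled examples (rows) and their labels, used as the reference set.
    X_eval, y_eval: examples to classify (unlabeled training data or test data)
                    and their ground-truth labels.
    """
    Z_lab = X_lab @ T
    Z_eval = X_eval @ T
    # Squared Euclidean distances between every evaluated and every labeled point.
    d2 = ((Z_eval[:, None, :] - Z_lab[None, :, :]) ** 2).sum(axis=2)
    pred = y_lab[np.argmin(d2, axis=1)]
    return float(np.mean(pred != y_eval))
```

Calling the function once with the unlabeled training examples and once with the held-out test examples yields the two kinds of error rate, and averaging over random splits gives the mean and standard deviation.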

In our experiments, we compare STSD with multiple dimensionality reduction methods including PCA, LDA, SELF, and SDA, where LDA is performed only on the labeled data, while PCA, SELF, SDA, and STSD are performed on both the labeled and unlabeled data. In addition, we also compare our method with the baseline method which just employs the 1-NN classifier with the labeled data in the original space. Since the performances of PCA and SELF depend on the dimensionality of the embedding subspace discovered by each method, we show the best results for them.

For the graph based methods, including SELF, SDA, and STSD, the number of nearest neighbors for constructing adjacency graphs is determined by fourfold cross-validation. The parameters and for STSD are selected through fourfold cross-validation, while the Tikhonov regularization parameter is fixed to . In addition, the parameters involved in SELF and SDA are also selected through fourfold cross-validation. We use the heat kernel function (15) to construct the weight matrix, and the kernel parameter is fixed as unless otherwise specified where is the average of the squared distances between all data points and their nearest neighbors.

Two types of data sets under different label configurations are used to conduct our experiments. One type of data sets is the face images which consist of high-dimensional images, and the other one is the UCI data sets constituted by low-dimensional data. For the convenience of description, we name each configuration of experiments as “Data Set” + “Labeled Data Size.” For example, for the experiments with the face images, “Yale 3” means the experiment is performed on the Yale data set with 3 labeled data per class. Analogously, for the experiments with the UCI data sets, “BCWD 20” means the experiment is performed on the Breast Cancer Wisconsin (Diagnostic) data set with a total of 20 labeled examples from all classes.

4.2.1. Face Images

It is well known that high-dimensional data such as images and texts are supposed to live on or near a low-dimensional manifold. In this section, we test our algorithm with the Yale and ORL face data sets which are deemed to satisfy this manifold assumption. The Yale data set contains 165 images of 15 individuals and there are 11 images per subject. The images have different facial expressions, illuminations, and facial details (with or without glass). The ORL data set contains 400 images of 40 distinct subjects under varying expressions and illuminations. In our experiments, every face image is cropped to consist of pixels with 256 grey levels per pixel. Furthermore, for the Yale data set, we set the parameter of the heat kernel to . We report the error rates on both the unlabeled training data and test data. Tables 1 and 2 show that STSD is always better than or at least comparable with other counterparts in all the cases, which demonstrates that STSD can well exploit the manifold structure for dimensionality reduction. Notice that SELF gets inferior results. We conjecture that this is because it has no ability to capture the underlying manifold structures of the data.

Table 1: Mean values and standard deviations of the unlabeled error rates (%) with different label configurations on the face data sets.
Table 2: Mean values and standard deviations of the test error rates (%) with different label configurations on the face data sets.

4.2.2. UCI Data Sets

In this set of experiments, we use three UCI data sets [17] including Breast Cancer Wisconsin (Diagnostic), Climate Model Simulation Crashes, and Cardiotocography which may not well satisfy the manifold assumption. For simplicity, we abbreviate these data sets as BCWD, CMSC, and CTG, respectively. BCWD consists of 569 data points from two classes in . CMSC consists of 540 data points from two classes in . CTG consists of 2126 data points from ten classes in .

From the results reported in Tables 3 and 4, it can be seen that when the labeled data are scarce, the performance of LDA is even worse than the baseline method due to the inaccurate estimation of the scatter matrices. However, STSD achieves the best or comparable results among all methods in all configurations, except for the test error rate in BCWD 10. Although STSD adopts a relatively strong manifold assumption, it still has sufficient flexibility to handle general data which may not live on a low-dimensional manifold.

Table 3: Mean values and standard deviations of the unlabeled error rates (%) with different label configurations on the UCI data sets.
Table 4: Mean values and standard deviations of the test error rates (%) with different label configurations on the UCI data sets.

Notice that the error rates of several dimensionality reduction methods over the CMSC data set do not improve with the increasing size of labeled data. The reason may be that the data in the CMSC data set contain some irrelevant features as reflected by the original data description [18], which leads to the unexpected results. Nevertheless, SDA and STSD achieve more reasonable results due to their capabilities to extract information from both labeled and unlabeled data.

It should be noted that the experiments are conducted on 5 data sets in total. Across all of these data sets, STSD tends to outperform the other methods, with a sign test giving a P value of 0.031, which is statistically significant. This supports the conclusion that STSD performs better than the related methods.
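The reported value is consistent with a one-sided sign test in which one method is preferred on all 5 data sets, since the probability of 5 wins out of 5 under a fair coin is 1/32. That reading of how the test was applied is an assumption, as the number of wins is not stated explicitly. A minimal sketch of the computation, with the hypothetical helper name `sign_test_p_value`:

```python
from math import comb

def sign_test_p_value(wins, trials):
    """One-sided sign-test p-value.

    Probability of observing at least `wins` successes in `trials`
    independent fair coin flips (ties are assumed to be excluded).
    """
    return sum(comb(trials, k) for k in range(wins, trials + 1)) / 2 ** trials

# Assumed scenario: the method is better on all 5 of 5 data sets.
print(round(sign_test_p_value(5, 5), 3))  # 0.031
```

With only 5 data sets, a single loss would already raise the one-sided value to 6/32, so the test has very little room below the conventional 0.05 threshold.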

4.3. Connection with the Laplacian Regularization

Essentially, both STSD and SDA are regularized LDA methods with specific regularizers. STSD imposes the regularizer (16) which prefers a linear function along the manifold, while SDA employs the Laplacian regularizer to penalize the function differences among “similar” examples. Now consider a regularized LDA method using both of these regularizers named STSLap, whose objective function can be written as follows: where is the Laplacian regularizer used in SDA with being the Laplacian matrix [19] and is the regularizer used in STSD, which is defined as (16). The parameters and are used to control the trade-off between the influences of and . Similar to STSD, STSLap can also be converted to a generalized eigenvalue problem, which can be easily solved through eigenvalue decomposition.

Although the previous experiments have shown that STSD gets better results than SDA in most situations, SDA can achieve similar results with STSD in some configurations. However, this does not mean that STSD and SDA are similar or, in other words, and have similar behavior. In fact, the two regularizers seem to complement each other. To demonstrate this complementarity, we compare STSLap with SDA and STSD under a medium-sized label configuration over all the data sets used in the previous experiments. Specifically, the experiments are performed on BCWD 30, CMSC 30, CTG 160, Yale 3, and ORL 2. For each data set, the neighborhood size used to construct the adjacency graph is set to be the one supported by the experimental results with both SDA and STSD in Sections 4.2.1 and 4.2.2. This means that all the methods compared in this section utilize the same graph to regularize the LDA model for each data set. The parameters , in (26) and in are selected through fourfold cross-validation.

Note that given a graph, the performance of STSLap can be at least, ideally, identical to SDA or STSD, because STSLap degenerates to SDA or STSD when the parameter or is set to zero. However, if STSLap achieves better results than both SDA and STSD, we can deem that and are complementary.

Tables 5 and 6 show that the performance of STSLap is better than both SDA and STSD in most of the cases. Moreover, although it is not shown in the tables, the trade-off parameters and are scarcely set to be zero by cross-validation. This means that STSLap always utilizes the information discovered from both and . In conclusion, the proposed regularizer can capture the manifold structure of data which can not be discovered by Laplacian regularizer. This implies that these two regularizers are complementary to each other, and we could use them together to yield probably better results in practice. It should be noted that our aim is not to compare STSD with SDA in this set of experiments, and we can not make any conclusion about whether or not STSD is better than SDA from Tables 5 and 6 because the neighbourhood size for each data set is fixed.

Table 5: Mean values and standard deviations of the unlabeled error rates (%) with medium-sized labeled data on different data sets.
Table 6: Mean values and standard deviations of the test error rates (%) with medium-sized labeled data on different data sets.

5. Discussion

5.1. Related Work

STSD is a semisupervised dimensionality reduction method under a certain manifold assumption. More specifically, we assume that the distribution of data can be well approximated by a linear function on the underlying manifold. One related method named SDA [8] adopts another manifold assumption. It simply assumes that the mapping function should be as smooth as possible on a given graph. This strategy is well known as the Laplacian regularization, which is widely employed in the semisupervised learning scenario. However, STSD follows a different principle to regularize the mapping function, which not only provides an alternative strategy for semisupervised dimensionality reduction but is also complementary to the classic Laplacian regularization. SELF [10] is another related approach, which is a hybrid method of local LDA [11] and PCA. Despite its simplicity, SELF can only discover the linear structure of data, whereas our method is able to capture the nonlinear intrinsic manifold structure.

Rather than constructing an appropriate regularizer on a given graph, SSDA [9] and semisupervised dimensionality reduction (SSDR) [16] focus on building a good graph and then perform the Laplacian-style regularization on this graph. SSDA regularizes LDA on a graph constructed by a path-based similarity measure. The advantage of SSDA is its robustness against outliers, because SSDA aims to preserve the global manifold information. SSDR constructs a graph according to the so-called must-link and cannot-link pairwise constraints, which gives a natural way to incorporate prior knowledge into the semisupervised dimensionality reduction. However, this prior knowledge is not always available in practice. In contrast to SSDA and SSDR, our method is flexible enough to perform regularization on any graph and free from the necessity of extra prior knowledge. In fact, the advantage of SSDA or SSDR can be easily inherited through performing STSD with the graph constructed by corresponding method (SSDA or SSDR), which is another important merit of STSD.

5.2. Further Improvements

For the manifold related learning problem considered in STSD, the estimation of bases for tangent spaces is an important step. In this paper, we use local PCA with fixed neighborhood size to calculate the tangent spaces, and the neighborhood size is set to be same as the one used to construct the adjacency graph. This is certainly not the optimal choice, since manifolds can have varying curvatures and data could be nonuniformly sampled. Note that the neighborhood size can determine the evolution of calculated tangent spaces along the manifold. When a small neighborhood size is used, there are at most examples for the inputs of local PCA. However, when we need to estimate a set of tangent spaces which have relative high dimensionality (), it is almost impossible to get accurate estimates of the tangent spaces, because there are at most meaningful orthonormal bases obtained from local PCA. Moreover, noises can damage the manifold assumption as well to a certain extent. All these factors explain the necessity for using different neighborhood sizes and more robust subspace estimation methods.

In our method, each example in the data matrix can be treated as an anchor point, where local PCA is used to calculate the tangent space. The number of parameters to be estimated grows basically linearly with the number of anchor points. Therefore, one possible approach to reducing the number of parameters is to reduce the anchor points so that only “key” examples are kept. This amounts to a form of data set sparsification, and different criteria can be devised to decide whether an example should be regarded as a “key” one.
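As one illustration of such a criterion, the sketch below keeps examples that are well spread over the data using greedy farthest-point sampling. This is only a hypothetical choice on our part, not a criterion evaluated in this paper; the function name `select_anchor_points` and its parameters are likewise assumptions, and the examples are assumed to be distinct.

```python
import numpy as np

def select_anchor_points(X, num_anchors, first=0):
    """Return indices of well-spread examples via farthest-point sampling."""
    num_anchors = min(num_anchors, len(X))
    chosen = [first]
    # Distance from every example to its nearest chosen anchor so far.
    min_dist = np.linalg.norm(X - X[first], axis=1)
    while len(chosen) < num_anchors:
        # Pick the example that is farthest from all current anchors.
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return chosen

# Hypothetical one-dimensional data set with one distant example.
X_line = np.array([[0.0], [1.0], [2.0], [10.0]])
anchors = select_anchor_points(X_line, num_anchors=3)
print(anchors)
```

Tangent spaces would then be estimated only at the returned indices. Other criteria, for example ones based on local density or label information, could be substituted without changing the rest of the pipeline.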

Anchor point reduction is especially useful when the training data are large-scale, where it is a promising way to speed up the training process. In addition, data can exhibit different manifold dimensions in different regions, especially for complex data. Therefore, adaptively determining the dimensionality at different anchor points is also an important refinement of the current approach.
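One simple way to choose the dimensionality per anchor point, given here only as a sketch, is to keep the smallest number of local principal directions that explains a fixed fraction of the neighborhood's variance. The threshold `variance_kept` and the function name `local_dimension` are assumptions made for illustration; this paper does not prescribe a particular rule.

```python
import numpy as np

def local_dimension(neighborhood, variance_kept=0.95):
    """Smallest number of principal directions explaining variance_kept."""
    centered = neighborhood - neighborhood.mean(axis=0)
    # Squared singular values are proportional to the variance
    # captured along each local principal direction.
    s = np.linalg.svd(centered, compute_uv=False)
    variances = s ** 2
    total = variances.sum()
    if total == 0:
        return 0  # all examples coincide, so no direction carries variance
    ratios = np.cumsum(variances) / total
    # First position where the cumulative ratio reaches the threshold.
    dim = int(np.searchsorted(ratios, variance_kept)) + 1
    return min(dim, len(s))

# Hypothetical neighborhoods embedded in three dimensions.
on_a_line = np.array([[0.0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0]])
on_a_plane = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]])
print(local_dimension(on_a_line), local_dimension(on_a_plane))
```

A fixed variance threshold is sensitive to noise, so in practice it would likely need to be tuned or replaced by a more robust estimator of intrinsic dimension.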

6. Conclusion

In this paper, we have proposed a novel semisupervised dimensionality reduction method named Semisupervised Tangent Space Discriminant Analysis (STSD). By exploiting a linear function assumption on the manifold, STSD extracts both the discriminant information and the manifold structure from labeled and unlabeled data. Local PCA is used as an important step to estimate tangent spaces, and certain relationships between adjacent tangent spaces are derived to reflect the adopted model assumption. The optimization of STSD is readily achieved by solving a generalized eigenvalue problem.
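For readers who wish to experiment, a generalized eigenvalue problem of the form A v = λ B v can be solved with NumPy alone by reducing it to a standard symmetric problem through a Cholesky factorization. The sketch below assumes A is symmetric and B is symmetric positive definite; the names `A`, `B`, and `generalized_eigh` are generic placeholders rather than the specific matrices of STSD, and whether the largest or smallest eigenvalues are wanted depends on how the objective is written.

```python
import numpy as np

def generalized_eigh(A, B, num_components, largest=True):
    """Solve A v = lambda B v for symmetric A and positive definite B."""
    L = np.linalg.cholesky(B)  # B = L L^T
    # Reduce to the standard problem C y = lambda y, C = L^{-1} A L^{-T}.
    Linv_A = np.linalg.solve(L, A)
    C = np.linalg.solve(L, Linv_A.T).T
    C = (C + C.T) / 2  # remove round-off asymmetry
    eigvals, Y = np.linalg.eigh(C)
    # Map back to the original coordinates: v = L^{-T} y.
    V = np.linalg.solve(L.T, Y)
    order = np.argsort(eigvals)
    if largest:
        order = order[::-1]
    order = order[:num_components]
    return eigvals[order], V[:, order]

# Hypothetical 2 x 2 example.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
B = np.array([[2.0, 0.5], [0.5, 1.0]])
vals, V = generalized_eigh(A, B, num_components=2)
print(np.allclose(A @ V, (B @ V) * vals))
```

The columns of `V` would serve as projection directions. If B is only positive semidefinite, a small ridge term is commonly added before factorization, which is a further assumption outside this sketch.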

Experimental results on multiple real-world data sets, including comparisons with related work, have shown the effectiveness of the proposed method. Furthermore, the complementarity between our method and the Laplacian regularization has also been verified. Future work directions include finding more accurate methods for tangent space estimation and extending our method to different learning scenarios such as multiview learning and transfer learning.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Acknowledgments

This work is supported by the National Natural Science Foundation of China under Project 61370175 and Shanghai Knowledge Service Platform Project (no. ZF1213).


References

  1. K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, 2nd edition, 1990.
  2. I. T. Jolliffe, Principal Component Analysis, Springer, New York, NY, USA, 1986.
  3. M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural Computation, vol. 15, no. 6, pp. 1373–1396, 2003.
  4. D. L. Donoho and C. Grimes, “Hessian eigenmaps: locally linear embedding techniques for high-dimensional data,” Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 10, pp. 5591–5596, 2003.
  5. S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
  6. X. He and P. Niyogi, “Locality preserving projections,” in Advances in Neural Information Processing Systems, S. Thrun, L. Saul, and B. Schölkopf, Eds., vol. 18, pp. 1–8, MIT Press, Cambridge, Mass, USA, 2004.
  7. Z. Zhang and H. Zha, “Principal manifolds and nonlinear dimensionality reduction via tangent space alignment,” SIAM Journal on Scientific Computing, vol. 26, no. 1, pp. 313–338, 2004.
  8. D. Cai, X. He, and J. Han, “Semi-supervised discriminant analysis,” in Proceedings of the 11th IEEE International Conference on Computer Vision (ICCV '07), pp. 1–7, Rio de Janeiro, Brazil, October 2007.
  9. Y. Zhang and D.-Y. Yeung, “Semi-supervised discriminant analysis using robust path-based similarity,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '08), pp. 1–8, Anchorage, Alaska, USA, June 2008.
  10. M. Sugiyama, T. Ide, S. Nakajima, and J. Sese, “Semi-supervised local Fisher discriminant analysis for dimensionality reduction,” Machine Learning, vol. 78, no. 1-2, pp. 35–61, 2010.
  11. M. Sugiyama, “Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis,” The Journal of Machine Learning Research, vol. 8, pp. 1027–1061, 2007.
  12. J. H. Friedman, “Regularized discriminant analysis,” Journal of the American Statistical Association, vol. 84, no. 405, pp. 165–175, 1989.
  13. S. Sun, “Tangent space intrinsic manifold regularization for data representation,” in Proceedings of the IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP '13), pp. 179–183, Beijing, China, July 2013.
  14. B. Lin, C. Zhang, and X. He, “Semi-supervised regression via parallel field regularization,” in Advances in Neural Information Processing Systems, J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. C. N. Pereira, and K. Q. Weinberger, Eds., vol. 24, pp. 433–441, The MIT Press, Cambridge, Mass, USA, 2011.
  15. B. Schölkopf, A. Smola, and K.-R. Müller, “Nonlinear component analysis as a kernel eigenvalue problem,” Neural Computation, vol. 10, no. 5, pp. 1299–1319, 1998.
  16. D. Zhang, Z. Zhou, and S. Chen, “Semi-supervised dimensionality reduction,” in Proceedings of the 7th SIAM International Conference on Data Mining, pp. 629–634, April 2007.
  17. K. Bache and M. Lichman, UCI machine learning repository, 2013,
  18. D. D. Lucas, R. Klein, J. Tannahill et al., “Failure analysis of parameter-induced simulation crashes in climate models,” Geoscientific Model Development Discussions, vol. 6, no. 1, pp. 585–623, 2013.
  19. F. R. K. Chung, Spectral Graph Theory, American Mathematical Society, Providence, RI, USA, 1997.