Abstract

In recent years, manifold learning methods have been widely used in data classification to tackle the curse of dimensionality, since they can discover the potential intrinsic low-dimensional structures of high-dimensional data. Given partially labeled data, semi-supervised manifold learning algorithms have been proposed to predict the labels of the unlabeled points, taking label information into account. However, these semi-supervised manifold learning algorithms are not robust against noisy points, especially when the labeled data contain noise. In this paper, we propose a framework for robust semi-supervised manifold learning (RSSML) to address this problem. The noise levels of the labeled points are first estimated, and a regularization term is then constructed to reduce the impact of labeled points containing noise. A new robust semi-supervised optimization model is obtained by adding this regularization term to the traditional semi-supervised optimization model. Numerical experiments show the improvement and efficiency of RSSML on noisy data sets.

1. Introduction

The problem of dimensionality reduction, that is, the transformation of high-dimensional data into meaningful low-dimensional features, has attracted much interest from researchers. Recently, there has been much research effort on developing effective and efficient manifold learning algorithms that can discover the potential intrinsic low-dimensional structures of high-dimensional data. These algorithms include Isometric Mapping (ISOMAP) [1], Locally Linear Embedding (LLE) [2], Laplacian Eigenmaps (LE) [3], and Local Tangent Space Alignment (LTSA) [4].

The above classical manifold learning methods are all unsupervised learning algorithms; that is, they do not consider prior information. In many applications, some prior information about the input data is available. For example, in a classification problem, the class labels of part of the data can be obtained. By considering prior information in the form of low-dimensional coordinates of certain sample points, the classical manifold learning methods can be extended to semi-supervised manifold learning methods [5]. The semi-supervised manifold learning algorithms then yield low-dimensional coordinates that bear the same meaning as the prior information.

However, these unsupervised and semi-supervised methods may have limited effectiveness on real-world data because of large noise or distortion in the data. In practice, each dimensionality reduction method requires certain assumptions on the data manifold to guarantee its expected performance. For example, ISOMAP needs a convex embedding domain of the manifold or a relatively uniform data distribution to estimate geodesic distances. LTSA requires neighbor sets that can approximately recover the local tangent spaces. In LLE, the local geometric structure of the manifold should be well determined via linear combinations of data neighborhoods.

There have been some efforts to improve the original algorithms. One line of work preprocesses the data set before applying the methods, without modifying the algorithms themselves. Smoothing the data set by weighted SVD, or equivalently weighted PCA, to reduce data noise before performing LTSA is suggested in [6]. In [7], outliers are first detected by histogram analysis of the neighborhood distances of each point, and locally smoothed values of the data are then computed using the linear error-in-variables (EIV) model. A fast outlier detection method for high-dimensional data sets is proposed in [8]; it also employs a local smoothing method and introduces a weighted global function to further reduce the undesirable effect of outliers and noise on the embedding results. These algorithms can also be improved by adaptively selected neighborhoods [9]. The other line of work adjusts details of the algorithms. For example, multiple local weight vectors are used to solidify the structures determined by neighborhoods in [10]. In [11], the influence of noisy points on the reconstruction is greatly reduced by solving a new local optimization model. In [12], a robust DLPP version based on L1-norm maximization is proposed. In [13], short-circuit errors are reduced by automatically selecting the right number and positions of landmarks. A robust version of LTSA is proposed in [14] to further reduce the influence of noise on embedding results by endowing clean data points and noisy data points with different weights in the local alignment errors. In [15], an out-of-sample extension framework for a global manifold learning algorithm (ISOMAP) is proposed that uses temporal information in out-of-sample points to make the embedding more robust to noise and artifacts.

Although the improved manifold learning algorithms are more robust against noise than the original ones, little work has been done on the semi-supervised algorithms [16]. In fact, the undesirable effect caused by noise is more complicated in the semi-supervised setting. Firstly, it is difficult to accurately explore the local geometric structures when the local neighborhoods contain noisy points. Secondly, the provided prior information may be inexact for noisy points, and the low-dimensional coordinates constructed from the inexact prior information may be far from the true on-manifold coordinates of the sample points. The first issue can be addressed by constructing noise-free neighbor sets [7–9] or by constructing robust local geometric structures of the noisy neighbor sets [10, 11, 17]; we do not pursue it further in this paper.

We focus on the second issue in this paper. We estimate the noise levels of the sample points, which reflect the confidence levels in the prior information. Then we construct a new semi-supervised optimization model to reduce the undesirable effect of inexact prior information with low confidence levels. A framework for robust semi-supervised manifold learning (RSSML) is proposed by solving the new semi-supervised optimization model.

The rest of this paper is organized as follows. In Section 2, we give a brief review of semi-supervised manifold learning. In Section 3, we show how to extend semi-supervised manifold learning algorithms so that they can handle inexact prior information for noisy points; the framework for robust semi-supervised manifold learning (RSSML) is presented in that section. Finally, numerical experiments in Section 4 show the effectiveness of RSSML.

2. A Brief Review of Semi-Supervised Manifold Learning

Our work is an extension of semi-supervised manifold learning (SSML). In this section, we give a brief review of SSML [5]. Assume that we are given a data set $X = \{x_1, \ldots, x_N\} \subset \mathbb{R}^D$ (possibly with noise) sampled from a $d$-dimensional manifold. Without loss of generality, suppose that the prior information of the first $m$ points is known, and denote by $t_1, \ldots, t_m \in \mathbb{R}^d$ the low-dimensional coordinates constructed from the prior information. The goal of SSML is to calculate the low-dimensional coordinates $y_1, \ldots, y_N$, in particular those of the unlabeled points $x_{m+1}, \ldots, x_N$. SSML proceeds in the following steps.

Step 1 (finding local neighborhoods). Determine the neighbor set $X_i = \{x_j : j \in N_i\}$ for each $x_i$, $i = 1, \ldots, N$, where $N_i$ is the index set of the neighbors of $x_i$.

Step 2 (extracting local geometry). The local geometry of the determined neighbor set can be extracted by solving the classical local optimization models [1–4]. Take LLE as an example; the local geometry is characterized by the linear combination coefficients $w_i = (w_{ij})_{j \in N_i}$, which can be computed by minimizing the least squares optimization model
$$\min_{w_i} \Big\| x_i - \sum_{j \in N_i} w_{ij} x_j \Big\|_2^2, \quad \text{s.t.} \ \sum_{j \in N_i} w_{ij} = 1. \tag{1}$$
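
A minimal numpy sketch of the local step (1) for a single point follows; it is an illustration of this standard LLE computation, and the `reg` safeguard against a nearly singular local Gram matrix is an illustrative choice.

```python
import numpy as np

def lle_weights(X, i, neighbor_idx, reg=1e-3):
    """Solve problem (1): reconstruct x_i from its neighbors with
    weights that sum to one (the standard LLE local step)."""
    Z = X[neighbor_idx] - X[i]                  # neighbors centered at x_i
    G = Z @ Z.T                                 # local Gram matrix
    G += reg * np.trace(G) * np.eye(len(neighbor_idx))  # illustrative safeguard
    w = np.linalg.solve(G, np.ones(len(neighbor_idx)))
    return w / w.sum()                          # enforce the sum-to-one constraint
```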

Step 3 (constructing semi-supervised optimization model). In the unsupervised manifold learning algorithms, the global low-dimensional coordinates are calculated by minimizing embedding cost functions that preserve the extracted local geometries. For example, in LLE, the low-dimensional coordinates can be computed by minimizing the embedding cost function
$$\min_{Y} \sum_{i=1}^{N} \Big\| y_i - \sum_{j \in N_i} w_{ij} y_j \Big\|_2^2. \tag{2}$$
Different from the embedding cost functions in the unsupervised manifold learning algorithms, a regularization term concerning the prior information is added so that the low-dimensional coordinates obey the prior information. In semi-supervised LLE, the low-dimensional coordinates are obtained by minimizing the semi-supervised optimization model
$$\min_{Y} \sum_{i=1}^{N} \Big\| y_i - \sum_{j \in N_i} w_{ij} y_j \Big\|_2^2 + \lambda \sum_{i=1}^{m} \| y_i - t_i \|_2^2, \tag{3}$$
where $\lambda$ is the regularization parameter that reflects the confidence level in the prior information.

Semi-supervised manifold learning has been widely used in many real-life applications such as face recognition [18], remote sensing image classification [19], object tracking [5], and data visualization [18]. Figure 1 illustrates some of these applications. The available prior information and the computed embedding coordinates differ across applications. For example, in face recognition and remote sensing image classification, the known low-dimensional coordinates are constructed according to the labels of the training data points, and SSML aims to predict the labels of the remaining data points using the known low-dimensional coordinates (see Remark 2 in the next section for the way of constructing low-dimensional coordinates and predicting the labels). In object tracking, SSML aims to recover the true locations of the object, given its locations in certain frames. In data visualization, SSML projects the data points to a 2-dimensional or 3-dimensional embedding space to discover hidden relations among the data points, given the 2-dimensional or 3-dimensional coordinates of the training points.

Generally, semi-supervised manifold learning algorithms work well if the data sets are well sampled from a manifold. In that situation, $\lambda \to \infty$, and the minimization of (3) is equivalent to
$$\min_{Y} \sum_{i=1}^{N} \Big\| y_i - \sum_{j \in N_i} w_{ij} y_j \Big\|_2^2, \quad \text{s.t.} \ y_i = t_i, \ i = 1, \ldots, m. \tag{4}$$
When the data sets contain noise, the effectiveness of SSML is significantly decreased. This is because the noise level of each sample point may be different. For sample points containing large noise, the prior information may not be trustworthy, and the regularization parameter $\lambda$ should be small. For points containing only small noise, we are confident in the provided prior information, and $\lambda$ should be large. A fixed $\lambda$ cannot reflect the different confidence levels in the prior information of each point. It is therefore desirable to construct a robust semi-supervised optimization model against noise.

3. Robust Semi-Supervised Manifold Learning Algorithm

In this section, a framework for a robust semi-supervised manifold learning algorithm (RSSML) is proposed. Since the prior information may be inexact for noisy points, the major problem is how to handle inexact prior information according to the different noise levels of the sample points. Note that the noise levels of the sample points are generally unknown, so it is necessary to measure them before proposing the robust semi-supervised optimization model.

3.1. Measure the Noise Level

Recently, some work has been done on measuring the noise levels of sample points [20–23]. In this paper, we measure the noise levels by the outlier detection algorithm based on reconstruction weights (ODBRW), owing to its low computational cost, few parameters, and high effectiveness [20]. The ODBRW method is applied only to the training points $x_1, \ldots, x_m$ and consists of the following steps.

Step 1 (constructing the edge point sets). Search the $k$-nearest neighbors (KNN) of each $x_i$, $i = 1, \ldots, m$, and determine the neighbor set $X_i$ first. Then select the edge points from $X_i$ by
$$E_i = \big\{ x_j \in X_i : \langle x_i - x_k,\, x_j - x_k \rangle \ge 0 \ \text{for all } x_k \in X_i,\ x_k \ne x_j \big\}, \tag{5}$$
which requires that the angle between any adjacent edges $(x_k, x_i)$ and $(x_k, x_j)$ be acute or right. If the angle between the adjacent edges $(x_k, x_i)$ and $(x_k, x_j)$ is obtuse, it means that $x_k$ separates $x_i$ and $x_j$. A point $x_j$ is said to be an edge point of $x_i$ if there is no other neighbor that separates $x_i$ and $x_j$. The determined edge point sets are very robust to the neighborhood size $k$. See an illustrative example of the edge point set in Figure 2. More explanations about the edge point set can be found in [24].
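
The selection rule (5) can be sketched directly as follows; this is a minimal illustration of the rule as stated above, not the reference implementation of [20, 24].

```python
import numpy as np

def edge_points(X, i, neighbor_idx):
    """Select the edge point set E_i of (5): keep x_j unless some other
    neighbor x_k separates x_i and x_j, i.e. the angle at x_k between
    (x_i - x_k) and (x_j - x_k) is obtuse."""
    edges = []
    for j in neighbor_idx:
        separated = False
        for k in neighbor_idx:
            if k == j:
                continue
            if np.dot(X[i] - X[k], X[j] - X[k]) < 0:  # obtuse angle at x_k
                separated = True
                break
        if not separated:
            edges.append(j)
    return edges
```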

Step 2 (calculating reconstruction weights). The reconstruction weights $c_i$ of the edge point set $E_i$ for $x_i$ can be obtained by solving the least squares problem
$$\min_{c_i} \Big\| x_i - \sum_{x_j \in E_i} c_{ij} x_j \Big\|_2^2, \quad \text{s.t.} \ \sum_{x_j \in E_i} c_{ij} = 1. \tag{6}$$
Denote $Z_i = [x_j - x_i]_{x_j \in E_i}$ and $G_i = Z_i^T Z_i$; the least squares problem can be solved by
$$c_i = \frac{G_i^{-1} \mathbf{1}}{\mathbf{1}^T G_i^{-1} \mathbf{1}}, \tag{7}$$
where $\mathbf{1}$ is a vector with all ones.

Step 3 (measuring the noise levels). Form an $m \times m$ matrix $C = [c_{ij}]$ from the reconstruction weights; that is,
$$c_{ij} = \begin{cases} (c_i)_j, & j \in J_i, \\ 0, & j \notin J_i, \end{cases} \tag{8}$$
where $J_i$ is the index set of $E_i$. The noise level of $x_i$ can then be measured by
$$s_i = \sum_{j=1}^{m} c_{ji}. \tag{9}$$
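
A minimal sketch of Steps 2 and 3 together; the `reg` safeguard against singular Gram matrices and the min-max normalization of the scores to $[0, 1]$ are illustrative choices.

```python
import numpy as np

def odbrw_scores(X, edge_sets, reg=1e-3):
    """Steps 2-3 of ODBRW: reconstruction weights over the edge point
    sets ((6)-(7)), the m x m weight matrix C of (8), and the column-sum
    noise levels of (9). edge_sets[i] lists the edge-point indices E_i,
    all of which must index training points (indices below m)."""
    m = len(edge_sets)
    C = np.zeros((m, m))
    for i, E in enumerate(edge_sets):
        Z = X[E] - X[i]                          # edge points centered at x_i
        G = Z @ Z.T
        G += reg * np.trace(G) * np.eye(len(E))  # illustrative safeguard
        c = np.linalg.solve(G, np.ones(len(E)))
        C[i, E] = c / c.sum()                    # row i: weights of E_i
    s = C.sum(axis=0)                            # (9): small s_i -> likely outlier
    # min-max normalization to [0, 1] (an illustrative choice)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)
```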

3.2. Robust Semi-Supervised Optimization Model

It is shown in theory that the smaller $s_i$ is, the more likely $x_i$ is to be an outlier [20]. A small $s_i$ means that the prior information on $x_i$ is not trustworthy; hence, we hope to reduce the effect of its prior information on computing the embedding coordinates. For a large $s_i$, the sample point $x_i$ tends to be a clean point, and we are more confident in its prior information; the computed low-dimensional coordinates should bear a similar meaning to the prior information. Based on the above analysis, we construct a new robust semi-supervised optimization model:
$$\min_{Y} \Phi(Y) + \lambda \sum_{i=1}^{m} s_i \| y_i - t_i \|_2^2. \tag{10}$$
Here $y_i$ are the low-dimensional coordinates of $x_i$, and $t_1, \ldots, t_m$ are the low-dimensional coordinates constructed from the prior information of the training points $x_1, \ldots, x_m$. $\Phi(Y)$ is the embedding cost function on the manifold, and $s_i$ is the measured noise level, normalized to a real number in $[0, 1]$. Take LLE as an example; the robust semi-supervised optimization model is
$$\min_{Y} \sum_{i=1}^{N} \Big\| y_i - \sum_{j \in N_i} w_{ij} y_j \Big\|_2^2 + \lambda \sum_{i=1}^{m} s_i \| y_i - t_i \|_2^2. \tag{11}$$

Denote $Y = [y_1, \ldots, y_N]$ and $T = [t_1, \ldots, t_m]$, and let $S$ be the diagonal matrix whose elements are $s_1, \ldots, s_m$. The optimization model (11) can be expressed as
$$\min_{Y} \operatorname{tr}(Y M Y^T) + \lambda \operatorname{tr}\big( (Y_1 - T) S (Y_1 - T)^T \big), \tag{12}$$
where $M = (I - W)^T (I - W)$, $Y_1 = [y_1, \ldots, y_m]$, and $W = [w_{ij}]$ with $w_{ij} = (w_i)_j$ for $j \in N_i$ and $w_{ij} = 0$ for $j \notin N_i$.

We can solve the optimization model by setting the gradient of (12) to zero. Partition $M$ as $M = \begin{pmatrix} M_{11} & M_{12} \\ M_{21} & M_{22} \end{pmatrix}$ with $M_{11} \in \mathbb{R}^{m \times m}$; it is easy to get that
$$Y_1 (M_{11} + \lambda S) + Y_2 M_{21} = \lambda T S, \tag{13}$$
$$Y_1 M_{12} + Y_2 M_{22} = 0. \tag{14}$$
Partition $Y$ as $Y = [Y_1, Y_2]$ with $Y_2 = [y_{m+1}, \ldots, y_N]$; the $d$-dimensional embedding can be computed by solving the following linear system of equations:
$$[Y_1, Y_2] \begin{pmatrix} M_{11} + \lambda S & M_{12} \\ M_{21} & M_{22} \end{pmatrix} = [\lambda T S,\ 0]. \tag{15}$$
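
A minimal sketch of a solver for (15), given the alignment matrix $M$, the prior coordinates $T$, and the noise levels $s$; setting all $s_i = 1$ recovers the plain semi-supervised system corresponding to (3).

```python
import numpy as np

def rssml_embedding(M, T, s, lam):
    """Solve the linear system (15): Y (M + lam * S_tilde) = [lam*T*S, 0],
    where S_tilde pads S = diag(s) with zeros for the unlabeled points.
    M: (N, N); T: (d, m); s: (m,); lam: regularization parameter."""
    N = M.shape[0]
    d, m = T.shape
    A = M.astype(float).copy()
    A[:m, :m] += lam * np.diag(s)        # the M11 + lam*S block
    B = np.zeros((d, N))
    B[:, :m] = lam * T * s               # columns of T scaled by the s_i
    return np.linalg.solve(A.T, B.T).T   # Y A = B  <=>  A^T Y^T = B^T
```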

Based on the above analysis, we propose a new algorithm called robust semi-supervised manifold learning (RSSML) which is summarized as follows.

RSSML Algorithm

Input. Data set $X = \{x_1, \ldots, x_N\}$, low-dimensional coordinates of the labeled points $t_1, \ldots, t_m$, regularization parameter $\lambda$, neighborhood size $k$.

Output. The low-dimensional coordinates $Y = [y_1, \ldots, y_N]$.

Step 1 (selecting local neighborhoods). Determine the neighborhood set $X_i$ of each point $x_i$, where $N_i$ is the index set of the neighbors: $X_i = \{x_j : j \in N_i\}$.

Step 2 (extracting local geometry). Extract the local geometry by one of the local optimization models of manifold learning. Matrix $M = (I - W)^T (I - W)$ is given by $W = [w_{ij}]$, with $w_{ij} = (w_i)_j$ if $j \in N_i$ or $w_{ij} = 0$ otherwise.

Step 3 (calculating noise levels). Obtain the noise levels $s_1, \ldots, s_m$ for the first $m$ data points (training points) by the ODBRW method and construct the diagonal matrix $S = \operatorname{diag}(s_1, \ldots, s_m)$.

Step 4 (embedding global coordinates). Compute the low-dimensional embedding coordinates by solving the linear system of equations (15).
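
Putting the steps together, a minimal end-to-end sketch of RSSML-LLE built on the helper functions sketched in the previous sections; the KNN search via scipy's cKDTree is an illustrative choice.

```python
import numpy as np
from scipy.spatial import cKDTree

def rssml_lle(X, T, lam=1.0, k=12):
    """End-to-end sketch of RSSML-LLE using lle_weights, edge_points,
    odbrw_scores, and rssml_embedding from the sketches above.
    X: (N, D) data with the first m = T.shape[1] points labeled."""
    N = X.shape[0]
    m = T.shape[1]
    # Step 1: k nearest neighbors of every point (self excluded)
    nbrs = cKDTree(X).query(X, k=k + 1)[1][:, 1:]
    # Step 2: local geometry -> alignment matrix M = (I - W)^T (I - W)
    W = np.zeros((N, N))
    for i in range(N):
        W[i, nbrs[i]] = lle_weights(X, i, nbrs[i])
    I = np.eye(N)
    M = (I - W).T @ (I - W)
    # Step 3: ODBRW noise levels, computed among the training points only
    nbrs_m = cKDTree(X[:m]).query(X[:m], k=k + 1)[1][:, 1:]
    E = [edge_points(X, i, nbrs_m[i]) for i in range(m)]
    s = odbrw_scores(X, E)
    # Step 4: global embedding from the linear system (15)
    return rssml_embedding(M, T, s, lam)
```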

Remark 1. Notice that many unsupervised learning methods can be extended to their semi-supervised versions by the proposed robust semi-supervised manifold learning (RSSML) framework. In this paper, we explore the local geometry by solving the least squares (LS) problem of LLE; the local geometry can also be explored by other local optimization methods such as RLLPE and LTSA. We call the resulting algorithms RSSML-LLE, RSSML-RLLPE, and RSSML-LTSA.

Remark 2. In a classification problem, we are given the label information of $m$ training points from $c$ classes. Without loss of generality, assume that the first $m$ points are labeled. The low-dimensional coordinates of the labeled points can be constructed with $(t_i)_j = 1$ if $x_i$ has label $j$ and $(t_i)_j = 0$ otherwise. Then the label of an unlabeled point $x_i$ can be estimated as $\ell_i = \arg\max_{j} (y_i)_j$.
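
A minimal sketch of the label encoding and decoding of Remark 2; labels are assumed to be 0-based integers.

```python
import numpy as np

def encode_labels(labels, c):
    """Build T = [t_1, ..., t_m]: (t_i)_j = 1 if x_i has label j, else 0.
    labels: (m,) 0-based integer labels; c: number of classes."""
    m = len(labels)
    T = np.zeros((c, m))
    T[labels, np.arange(m)] = 1.0
    return T

def predict_labels(Y, m):
    """Estimate the label of each unlabeled point as the index of the
    largest entry of its embedded coordinate vector y_i. Y: (c, N)."""
    return np.argmax(Y[:, m:], axis=0)
```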

4. Experiment Results

To verify the effectiveness of the proposed RSSML algorithm on real-world data, we perform experiments on the CMU PIE data set [25], the Handwritten-Alpha data set [26], and the HAND_SHAPE data set [27]. For comparison, we apply the unsupervised methods LLE, RLLPE, and LTSA (see [2, 4, 11]), their semi-supervised versions, and the proposed robust semi-supervised versions to the above data sets. In the three real-world examples, noisy points are also added to test the robustness of RSSML on noisy data sets.

CMU PIE [25]. The original data set contains 11560 samples of 68 individuals as 32 × 32 gray-scale images. In the experiment, we randomly selected 160 samples from each of 10 individuals (1,600 samples in total). Some samples are plotted in Figure 3.

Handwritten-Alpha [26]. The data set (HW-Alpha) is extracted from the “binaryalphadigs” data set. It consists of 936 images of 26 handwritten alphabets; each class has 36 images of size 20 × 16.

HAND_SHAPE [27]. The original data set (Cambridge Hand Gesture Data) contains 9 gesture classes in 320 × 240 gray-scale images, defined by 3 primitive hand shapes and 3 primitive motions. In this experiment, the target task is to classify the different hand shapes; therefore, the final data set is divided into five groups (see Figure 4). We randomly selected 350 points from each group to form the experimental set.

In the experiments, the parameters of the methods are set as follows. All of the manifold learning algorithms involve the neighborhood size parameter, which is selected from 8 to 36. For the unsupervised methods LLE, RLLPE, and LTSA, different values of the intrinsic dimension are tried. Different regularization parameters $\lambda$ are tried in our robust semi-supervised versions of LLE, LTSA, and RLLPE. We report only the best results.

To perform classification, the data sets are first projected onto a low-dimensional space by the unsupervised methods, and the Nearest Feature Line (NFL) classifier (see [28, 29]) is then applied to the low-dimensional embedding results for recognition. For the semi-supervised methods, the labels of the unlabeled points are predicted directly.
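
A minimal sketch of the NFL rule as described in [28, 29]: the query is assigned to the class whose feature line, through a pair of same-class feature points, is closest.

```python
import numpy as np
from itertools import combinations

def nfl_classify(query, feats, labels):
    """Nearest Feature Line rule: assign the query to the class whose
    feature line (through a pair of same-class points) is closest.
    feats: (n, d) embedded training features; labels: (n,) class labels."""
    best_dist, best_label = np.inf, None
    for c in np.unique(labels):
        pts = feats[labels == c]
        for a, b in combinations(range(len(pts)), 2):
            u = pts[b] - pts[a]
            nn = u @ u
            if nn == 0:
                continue                       # coincident points: skip
            t = (query - pts[a]) @ u / nn      # projection parameter
            foot = pts[a] + t * u              # foot point on the line
            dist = np.linalg.norm(query - foot)
            if dist < best_dist:
                best_dist, best_label = dist, c
    return best_label
```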

In the first experiment, some noisy points are added to the original three data sets. We randomly select 10% of the sample points from each data set to generate the noisy points. For each selected image, we randomly chose some pixels of the original image and inverted their values. Some of the noisy images are shown in Figure 3. Half of the sample points are randomly selected as training samples and the remaining ones are used for testing.

Table 1 lists the classification rates of the manifold learning methods on the three data sets. It is clear that the unsupervised methods are sensitive to noise, especially LTSA: when the data sets contain noisy points, LTSA may fail to find a reasonable local tangent space at each data point. For SS-LLE, SS-RLLPE, and SS-LTSA, the classification accuracies are clearly improved. RSSML-LLE, RSSML-RLLPE, and RSSML-LTSA outperform the semi-supervised manifold learning approaches in the experiments. This shows that the robustness to noisy points can be improved by the proposed robust semi-supervised model.

To better compare the effectiveness of the above manifold learning algorithms on noisy data sets, we generate noisy data with different densities of Gaussian and reverse noise. We randomly select 10% of the samples from the CMU PIE data set to generate noisy images in two ways. One way is to randomly choose 1/6, 1/8, or 1/12 of the pixels of each selected image and invert their values. The other way is to add Gaussian noise to the selected images with different variances: 0.02, 0.05, or 0.1. The experimental results of the unsupervised, semi-supervised, and robust semi-supervised versions of the LLE, RLLPE, and LTSA methods on the six experimental sets are shown in Table 2.
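
For concreteness, a minimal sketch of the two noise models; the assumption that images are scaled to $[0, 1]$ and the clipping after adding Gaussian noise are illustrative choices.

```python
import numpy as np

def add_reverse_noise(img, frac, rng):
    """Invert a random fraction of the pixel values (1/6, 1/8, or 1/12
    in the experiment); images are assumed scaled to [0, 1]."""
    noisy = img.copy()
    idx = rng.choice(img.size, size=int(frac * img.size), replace=False)
    noisy.flat[idx] = 1.0 - noisy.flat[idx]
    return noisy

def add_gaussian_noise(img, var, rng):
    """Add zero-mean Gaussian noise with variance 0.02, 0.05, or 0.1,
    clipped back to [0, 1] (the clipping is an illustrative choice)."""
    return np.clip(img + rng.normal(0.0, np.sqrt(var), img.shape), 0.0, 1.0)
```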

As can be seen in Table 2, SS-LLE, SS-RLLPE, and SS-LTSA are more sensitive to noisy points than RSSML (-LLE, -RLLPE, and -LTSA). Under different noise levels, the RSSML methods achieve higher classification accuracies than the other algorithms, which further shows that RSSML can better handle the noisy points in the experiments. We notice that the classification rates of RSSML are also higher than those of the other methods on the original data set. This is because the original data set may already be contaminated by noise from the sampling process, and the proposed RSSML methods can also reduce the impact of such real-world noisy points. Interestingly, in some cases the results on the noisy data sets may be better than those on the original data set. One explanation is that the linear combination coefficients of points with large noise, calculated by the local least squares optimization model, tend to be small. When we add large artificial noise to the original points, the effect of these points on the reconstruction is greatly reduced; hence, these outliers do not destroy the low-dimensional structure of the manifold.

To further quantitatively compare the performance of the above algorithms, the noisy HAND_SHAPE data (1/6 reverse noise) are divided into five different proportions of training and testing data (2 : 5, 2 : 3, 1 : 1, 3 : 2, and 5 : 2). As can be seen in Table 3, our robust versions (of LLE, RLLPE, and LTSA) have the best performance for all proportions of training and testing data. It is evident that our method can greatly reduce the classification error on the noisy data sets.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by NSFC (61370006), NSF of Fujian Province (2014J01237 and 2015J01256), Program for New Century Excellent Talents in Fujian University (2012FJ-NCET-ZR01), and Program for Young and Middle-Aged Teacher in Science and Technology Research of Huaqiao University (ZQN-PY116).