Mathematical Problems in Engineering

Volume 2018, Article ID 2382803, 8 pages

https://doi.org/10.1155/2018/2382803

## Robust Semi-Supervised Manifold Learning Algorithm for Classification

The School of Computer Science and Technology, Huaqiao University, Xiamen 361021, China

Correspondence should be addressed to Jing Wang; wroaring@hqu.edu.cn

Received 17 June 2017; Accepted 4 January 2018; Published 1 February 2018

Academic Editor: Nazrul Islam

Copyright © 2018 Mingxia Chen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

In recent years, manifold learning methods have been widely used in data classification to tackle the curse of dimensionality, since they can discover the intrinsic low-dimensional structure of high-dimensional data. Given partially labeled data, semi-supervised manifold learning algorithms predict the labels of the unlabeled points by taking the available label information into account. However, these algorithms are not robust against noisy points, especially when the labeled data contain noise. In this paper, we propose a framework for robust semi-supervised manifold learning (RSSML) to address this problem. The noise levels of the labeled points are first predicted, and a regularization term is then constructed to reduce the impact of labeled points containing noise. A new robust semi-supervised optimization model is obtained by adding this regularization term to the traditional semi-supervised optimization model. Numerical experiments demonstrate the improvement and efficiency of RSSML on noisy data sets.

#### 1. Introduction

The problem of dimensionality reduction, that is, the transformation of high-dimensional data into meaningful low-dimensional features, has attracted considerable interest from researchers. Recently, much research effort has been devoted to developing effective and efficient manifold learning algorithms that can discover the intrinsic low-dimensional structure of high-dimensional data. These algorithms include Isometric Mapping (ISOMAP) [1], Locally Linear Embedding (LLE) [2], Laplacian Eigenmaps (LE) [3], and Local Tangent Space Alignment (LTSA) [4].

The above classical manifold learning methods are all unsupervised learning algorithms; that is, they do not consider prior information. In many applications, however, some prior information about the input data is available. For example, in a classification problem, the class labels of part of the data can be obtained. By treating the prior information as low-dimensional coordinates of certain sample points, the classical manifold learning methods can be extended to semi-supervised manifold learning methods [5]. The resulting semi-supervised algorithms yield low-dimensional coordinates that bear the same meaning as the prior information.

However, these unsupervised and semi-supervised methods may have limited effectiveness on real-world data because of large noise or distortion in the data. In practice, each dimensionality reduction method requires certain assumptions on the data manifold to guarantee its expected performance. For example, ISOMAP needs a convex embedding domain or a relatively uniform data distribution for estimating geodesic distances. LTSA requires neighbor sets that can approximately recover the local tangent spaces. In LLE, the local geometric structure of the manifold should be well determined by local linear combinations within data neighborhoods.

Several efforts have been made to improve the original algorithms. One line of work preprocesses the data set before applying the methods, without modifying the algorithms themselves. Smoothing the data set by weighted SVD, or equivalently weighted PCA, to reduce data noise before performing LTSA is suggested in [6]. In [7], the outliers are first detected by histogram analysis of the neighborhood distances of each point, and the locally smoothed values of the data are then computed using the linear error-in-variables (EIV) model. A fast outlier detection method for high-dimensional data sets is proposed in [8]. It also employs a local smoothing method and introduces a weighted global function to further reduce the undesirable effect of outliers and noise on the embedding results. These algorithms can also be improved by adaptively selecting neighborhoods [9]. The other line of work adjusts some details of the algorithms. For example, multiple local weight vectors are used in [10] to solidify the structures determined by neighborhoods. In [11], the influence of noisy points on the reconstruction is greatly reduced by solving a new local optimization model. In [12], a robust version of DLPP based on L1-norm maximization is proposed. In [13], short-circuit errors are reduced by automatically selecting the right number and position of landmarks. A robust version of LTSA is proposed in [14] to further reduce the influence of noise on the embedding results by assigning different weights to clean and noisy data points in the local alignment errors. In [15], an out-of-sample extension framework is proposed for a global manifold learning algorithm (ISOMAP); it uses temporal information in out-of-sample points to make the embedding more robust to noise and artifacts.

Although the improved manifold learning algorithms are more robust against noise than the original algorithms, little work has been done on the semi-supervised algorithms [16]. In fact, the undesirable effect caused by noise is more complicated in the semi-supervised problem. First, it is difficult to accurately explore the local geometric structures when the local neighborhoods contain noisy points. Second, the provided prior information may be inexact for noisy points, and the low-dimensional coordinates constructed from the inexact prior information may be far from the real on-manifold coordinates of the sample points. The first issue can be addressed by constructing noise-free neighbor sets [7–9] or by constructing robust local geometric structures of the noisy neighbor sets [10, 11, 17], so we do not discuss it further in this paper.

We focus on the second issue in the paper. We estimate the noise levels of the sample points which reflect the confidence levels in the prior information. Then we construct a new semi-supervised optimization model to reduce the undesirable effect of the inexact prior information with low confidence levels. A framework for robust semi-supervised manifold learning (RSSML) is proposed by solving the new semi-supervised optimization model.

The rest of this paper is organized as follows. In Section 2, we give a brief review of semi-supervised manifold learning. In Section 3, we show how to extend the semi-supervised manifold learning algorithms so that they can handle inexact information for noisy points, and we present the framework for robust semi-supervised manifold learning (RSSML). In Section 4, we give numerical experiments to show the effectiveness of RSSML.

#### 2. A Brief Review of Semi-Supervised Manifold Learning

Our work is an extension of semi-supervised manifold learning (SSML). In this section, we give a brief review of SSML [5]. Assume that we are given a data set $X = \{x_1, \ldots, x_N\} \subset \mathbb{R}^m$ (possibly with noise) sampled from a $d$-dimensional manifold. Without loss of generality, suppose that the prior information of the first $l$ points is known, and denote by $\hat{y}_1, \ldots, \hat{y}_l$ the low-dimensional coordinates constructed from the prior information. The goal of SSML is to calculate the unknown low-dimensional coordinates $y_{l+1}, \ldots, y_N$. SSML proceeds in the following steps.

*Step 1 (finding local neighborhoods).* Determine the neighbor set $N_i$ for each $x_i$, $i = 1, \ldots, N$.

*Step 2 (extracting local geometry).* The local geometry of each neighbor set can be extracted by solving one of the classical local optimization models [1–4]. Taking LLE as an example, the local geometry is characterized by the linear combination coefficients $w_{ij}$, which are computed by minimizing the least squares model

$$\min_{w_{ij}} \Big\| x_i - \sum_{j \in N_i} w_{ij} x_j \Big\|^2 \quad \text{subject to} \quad \sum_{j \in N_i} w_{ij} = 1.$$

*Step 3 (constructing semi-supervised optimization model).* In the unsupervised manifold learning algorithms, the global low-dimensional coordinates are calculated by minimizing embedding cost functions that preserve the extracted local geometries. For example, in LLE, the low-dimensional coordinates $Y = [y_1, \ldots, y_N]$ are computed by minimizing the embedding cost function

$$\min_{Y} \sum_{i=1}^{N} \Big\| y_i - \sum_{j \in N_i} w_{ij} y_j \Big\|^2,$$

subject to a normalization constraint such as $Y Y^T = I$ that excludes the trivial solution. In contrast to the unsupervised case, the semi-supervised model adds a regularization term on the prior information so that the low-dimensional coordinates obey it. In semi-supervised LLE, the low-dimensional coordinates are obtained by minimizing the semi-supervised optimization model

$$\min_{Y} \sum_{i=1}^{N} \Big\| y_i - \sum_{j \in N_i} w_{ij} y_j \Big\|^2 + \lambda \sum_{i=1}^{l} \| y_i - \hat{y}_i \|^2,$$

where $\lambda$ is the regularization parameter that reflects the confidence level in the prior information.
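The three steps above can be sketched in Python with NumPy for the semi-supervised LLE case. This is a minimal illustration under stated assumptions, not the authors' implementation: neighborhoods are taken as the `k` nearest neighbors, the local Gram matrix is stabilized with a small ridge term `reg` (a common practical choice that the text does not specify), the labeled points are assumed to be the first rows of `X`, and the prior information enters as a quadratic penalty weighted by `lam`. The function names `lle_weights` and `semi_supervised_lle` are hypothetical.

```python
import numpy as np


def lle_weights(X, k=8, reg=1e-3):
    """Steps 1-2: k-NN neighborhoods and LLE reconstruction weights.

    X has shape (N, m), one sample per row. Returns an (N, N) matrix W
    whose i-th row holds the weights of the neighbors of x_i (rows sum to 1).
    """
    N = X.shape[0]
    # Pairwise squared Euclidean distances.
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(D, np.inf)  # a point is not its own neighbor

    W = np.zeros((N, N))
    for i in range(N):
        idx = np.argsort(D[i])[:k]      # Step 1: k nearest neighbors
        Z = X[idx] - X[i]               # neighbors centered at x_i
        G = Z @ Z.T                     # local Gram matrix, shape (k, k)
        # Ridge term (assumption): keeps G invertible when k > m.
        G = G + reg * (np.trace(G) + 1e-12) * np.eye(k)
        w = np.linalg.solve(G, np.ones(k))
        W[i, idx] = w / w.sum()         # Step 2: enforce sum-to-one
    return W


def semi_supervised_lle(X, Y_prior, lam=100.0, k=8):
    """Step 3: minimize ||(I - W) Y||_F^2 + lam * ||Y[:l] - Y_prior||_F^2.

    Y_prior has shape (l, d) and gives the prior coordinates of the
    first l rows of X. Returns Y with shape (N, d).
    """
    N = X.shape[0]
    l, d = Y_prior.shape
    W = lle_weights(X, k=k)
    IW = np.eye(N) - W
    M = IW.T @ IW                       # LLE embedding cost matrix

    # J selects the labeled points (diagonal 0/1 matrix).
    J = np.zeros((N, N))
    J[np.arange(l), np.arange(l)] = 1.0

    # Setting the gradient to zero gives (M + lam * J) Y = lam * J * Y_hat,
    # where Y_hat is Y_prior padded with zeros for the unlabeled points.
    rhs = np.zeros((N, d))
    rhs[:l] = lam * Y_prior
    return np.linalg.solve(M + lam * J, rhs)
```

Because the penalty term already fixes the scale and offset of the embedding, this sketch solves a single linear system instead of the eigenvalue problem used by unsupervised LLE. A larger `lam` pulls the labeled coordinates closer to `Y_prior`, which corresponds to a higher confidence in the prior information.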

Semi-supervised manifold learning has been widely used in many real-life applications such as face recognition [18], remote sensing image classification [19], object tracking [5], and data visualization [18]. Figure 1 illustrates some of the real-life applications. The available prior information and the computed embedding coordinates are different in different applications. For example, in the applications of face recognition and remote sensing image classification, the known low-dimensional coordinates are constructed according to the labels of the training data points, and SSML aims to predict the labels of the remaining data points using the known low-dimensional coordinates (see Remark 2 in the next section for the way of constructing low-dimensional coordinates and predicting the labels). In object tracking, SSML aims to recover the real locations of the object, using the given locations of the object in certain frames. In data visualization, SSML projects the data points to 2-dimensional or 3-dimensional embedding space to discover the hidden relations of the data points, using the given 2-dimensional or 3-dimensional coordinates of the training points.