Scientific Programming

Volume 2018, Article ID 4563040, 10 pages

https://doi.org/10.1155/2018/4563040

## Robust Matching Pursuit Extreme Learning Machines

Correspondence should be addressed to Badong Chen; chenbd@mail.xjtu.edu.cn

Received 25 August 2017; Revised 23 November 2017; Accepted 7 December 2017; Published 1 February 2018

Academic Editor: Wenbing Zhao

Copyright © 2018 Zejian Yuan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Extreme learning machine (ELM) is a popular learning algorithm for single hidden layer feedforward networks (SLFNs). It was originally inspired by biological learning and has attracted considerable attention because it adapts to a wide range of tasks while offering fast learning and low computational cost. As an effective sparse representation method, orthogonal matching pursuit (OMP) can be embedded into ELM to overcome the singularity problem and improve stability. OMP usually recovers a sparse vector by minimizing a least squares (LS) loss, which is efficient for Gaussian distributed data but may suffer performance deterioration in the presence of non-Gaussian data. To address this problem, a robust matching pursuit method based on a novel kernel risk-sensitive loss (KRSLMP for short) is first proposed in this paper. The KRSLMP is then applied to ELM to solve for the sparse output weight vector, and the resulting method, named KRSLMP-ELM, is developed for SLFN learning. Experimental results on synthetic and real-world data sets confirm the effectiveness and superiority of the proposed method.

#### 1. Introduction

Extreme learning machine (ELM) [1] is a kind of single hidden layer feedforward network (SLFN) [2]. In the past decade, ELM has become popular in the machine learning and pattern recognition communities for its fast adaptability and good generalization performance [3]. In general, ELM has the following advantages: (i) it can estimate the unknown mathematical model embedded in a large number of training samples, and it lends itself to efficient parallel implementation for both training and testing; (ii) it uses randomly generated input weights and hidden biases that are not tuned during the training phase, so the output weights can be obtained analytically by solving a standard least squares (LS) problem. Thus, extremely fast learning and low computational cost can be achieved, especially for big data applications. In view of these advantages, ELM has been widely applied in areas such as face recognition [4], series compensated transmission line protection [5], time series analysis [6], and nonlinear model identification [7].

However, ELM still has several drawbacks. First, ELM encounters the problem of irrelevant variables when handling real-world data sets [8]. Second, choosing a proper number of hidden nodes is an open problem for all ELM algorithms. An ELM network with too few hidden nodes may not model the input data accurately, whereas a network with too many hidden nodes tends to overfit [9]. Moreover, when the number of hidden nodes exceeds the number of training samples, ELM may suffer from the singularity problem [4]. Third, the original ELM learns the model with an $\ell_2$-norm based loss function, which is very vulnerable to noise. It is well known that the $\ell_2$-norm magnifies the effect of outliers associated with large deviations [10]. The presence of non-Gaussian noise or outliers in the training data may thus lead to an unreliable model with degraded performance.

To overcome the first and second limitations, several methods have been proposed in the regularization framework [9, 11–13]. Furthermore, orthogonal matching pursuit (OMP) is a simple and efficient iterative algorithm which, at each iteration, chooses the atom in the dictionary that is best correlated with the current residual [14]. OMP has therefore been embedded into ELM (OMP-ELM) to overcome the singularity problem, leading to a more stable solution than the original ELM [15]. Most of the existing methods learn the model with an $\ell_2$-norm based loss function, which may perform poorly in the presence of outliers or non-Gaussian noise (which exists in many real-world situations) [16–18]. To combat non-Gaussian noise or outliers and improve the generalization ability, the regularized correntropy criterion was used to replace the $\ell_2$-norm based loss function of the original ELM model, yielding the ELM-RCC [16]. In [19], ELM with an $\ell_1$-norm based loss function (ORELM) was proposed to achieve robust performance.

The kernel risk-sensitive loss (KRSL) is a nonlinear similarity measure first proposed in [20], which can achieve more satisfactory robust performance. The KRSL is based on the original structure of the risk-sensitive loss and is defined in the reproducing kernel Hilbert space (RKHS) [21, 22]:

$$L_\lambda(X,Y)=\frac{1}{\lambda}E\left[\exp\left(\lambda\left(1-\kappa_\sigma(X-Y)\right)\right)\right],$$

where $E[\cdot]$ denotes the mathematical expectation, $\kappa_\sigma(\cdot)$ is the Gaussian kernel with bandwidth $\sigma$, and $\lambda>0$ is the risk-sensitive parameter. In this paper, we propose a KRSL based matching pursuit (KRSLMP) method. The KRSLMP is then embedded into ELM to construct a robust and sparse ELM model.

The rest of the paper is structured as follows. In Section 2, we sketch the related work, including similarity measures in kernel space, the kernel risk-sensitive loss, the ELM model, and the orthogonal matching pursuit algorithm. In Section 3, we develop the KRSLMP-ELM. In Section 4, experiments on regression problems with synthetic and real-world data sets are conducted to verify the effectiveness of the proposed algorithm. The sensitivity of the KRSLMP-ELM to its free parameters is also analyzed. Finally, the conclusion is given in Section 5.

#### 2. Preliminaries and Related Works

For convenience of presentation, the notation used in this paper is introduced first. Vectors and matrices are represented with boldface lowercase letters and boldface capital letters, respectively. For any vector $\mathbf{x}$, we use $x_i$ to denote its $i$th entry. The notation $\mathbf{x}_\Omega$ denotes the subvector of $\mathbf{x}$ with entries indexed by the set $\Omega$. The complementary set of $\Omega$ is denoted as $\Omega^c$.

##### 2.1. Similarity Measures in Kernel Space

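Let $X$ and $Y$ be two random variables; the correntropy between $X$ and $Y$ is defined by [17, 23]

$$V(X,Y)=E\left[\kappa_\sigma(X-Y)\right]=\int\kappa_\sigma(x-y)\,dF_{XY}(x,y),$$

where $F_{XY}(x,y)$ is the joint distribution function of $(X,Y)$. The Gaussian kernel with bandwidth $\sigma$ is given by

$$\kappa_\sigma(x-y)=\exp\left(-\frac{(x-y)^2}{2\sigma^2}\right).$$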

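Correntropy is a local correlation measure in the kernel space $\mathcal{H}$. According to Mercer's theorem [24], it can be expressed in terms of the inner product as

$$V(X,Y)=E\left[\langle\varphi(X),\varphi(Y)\rangle_{\mathcal{H}}\right],$$

where $\varphi(\cdot)$ is the nonlinear mapping induced by the kernel.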

It applies a kernel trick that nonlinearly maps the original space to a higher dimensional feature space. It can be shown that correntropy is directly related to the probability of how similar two random variables are in a neighborhood of the joint space controlled by the kernel bandwidth [17, 25, 26].
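As a rough illustration of the locality described above, the following sketch estimates correntropy from paired samples by averaging the kernel values. The function names, the toy data, and the use of the unnormalized Gaussian kernel are assumptions made for this example, not part of the original paper.

```python
import math

def gaussian_kernel(u, sigma):
    """Unnormalized Gaussian kernel evaluated at the difference u = x - y."""
    return math.exp(-(u * u) / (2.0 * sigma * sigma))

def sample_correntropy(xs, ys, sigma):
    """Plug-in estimate: average kernel value over the paired samples."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    return sum(gaussian_kernel(x - y, sigma) for x, y in zip(xs, ys)) / len(xs)

xs = [0.0, 1.0, 2.0, 3.0]
ys_close = [0.1, 0.9, 2.1, 2.9]       # every pair differs by about 0.1
ys_outlier = [0.1, 0.9, 2.1, 50.0]    # last pair is a gross outlier

v_close = sample_correntropy(xs, ys_close, sigma=1.0)
v_outlier = sample_correntropy(xs, ys_outlier, sigma=1.0)
print(v_close, v_outlier)
```

A pair that lies far outside the bandwidth contributes essentially zero to the average regardless of how far away it is, which is the sense in which the measure is local.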

##### 2.2. Kernel Risk-Sensitive Loss

Similarity measures in kernel space can extract higher-order statistics of the data, which can significantly improve learning performance in non-Gaussian environments [21]. The optimization problem can be posed as maximizing correntropy (the maximum correntropy criterion, MCC) or, equivalently, minimizing the correntropic loss (C-Loss) [27, 28] between the output estimate and the target response. However, the C-Loss performance surface can be highly nonconvex: it has steep slopes around the optimal solution but is extremely flat far from it, which may lead to slow convergence and poor performance. Choosing a large kernel bandwidth may alleviate this problem, but robustness to outliers decreases significantly as the kernel bandwidth increases [29]. To achieve a more satisfactory performance surface, the KRSL was proposed in [20].

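The KRSL is defined by

$$L_\lambda(X,Y)=\frac{1}{\lambda}E\left[\exp\left(\lambda\left(1-\kappa_\sigma(X-Y)\right)\right)\right],$$

which can also be expressed in a traditional risk-sensitive loss form as [30]

$$L_\lambda(X,Y)=\frac{1}{\lambda}E\left[\exp\left(\lambda\,\frac{1}{2}\left\|\varphi(X)-\varphi(Y)\right\|_{\mathcal{H}}^{2}\right)\right],$$

where $\varphi(\cdot)$ is the feature mapping induced by the Gaussian kernel and $\lambda>0$ is the risk-sensitive parameter that controls the shape of the performance surface.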

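In practice, the joint distribution function of $X$ and $Y$ is usually unknown and only a finite number of samples $\{(x_i,y_i)\}_{i=1}^{N}$ are available. The KRSL can thus be estimated by

$$\hat{L}_\lambda(X,Y)=\frac{1}{N\lambda}\sum_{i=1}^{N}\exp\left(\lambda\left(1-\kappa_\sigma(x_i-y_i)\right)\right). \tag{6}$$

As one can see, (6) defines a distance between the vectors $\mathbf{x}=[x_1,\ldots,x_N]^T$ and $\mathbf{y}=[y_1,\ldots,y_N]^T$.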
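To make the sample estimate concrete, the sketch below computes the empirical KRSL of two sequences and compares its reaction to an outlier with that of the mean squared error. The function names, default parameter values, and toy data are illustrative assumptions.

```python
import math

def krsl(xs, ys, sigma=1.0, lam=1.0):
    """Empirical kernel risk-sensitive loss between two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    total = 0.0
    for x, y in zip(xs, ys):
        k = math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))  # Gaussian kernel
        total += math.exp(lam * (1.0 - k))
    return total / (lam * len(xs))

def mse(xs, ys):
    """Mean squared error, shown only for comparison."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

target = [1.0, 2.0, 3.0, 4.0]
one_outlier = [1.1, 1.9, 3.1, 104.0]      # last sample off by 100
worse_outlier = [1.1, 1.9, 3.1, 1004.0]   # last sample off by 1000

print(krsl(target, one_outlier), krsl(target, worse_outlier))
print(mse(target, one_outlier), mse(target, worse_outlier))
```

Each sample contributes at most $e^{\lambda}/\lambda$ to the sum, so making an outlier ten times larger leaves the KRSL practically unchanged while the MSE grows by roughly a factor of one hundred. The smallest attainable value is $1/\lambda$, reached when the two sequences coincide.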

##### 2.3. Extreme Learning Machine

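Extreme learning machine (ELM) was proposed by Huang et al. for training single hidden layer feedforward neural networks (SLFNs) [2, 31]. In ELM, the input weights and biases are initialized randomly and remain unchanged during training. Network learning thus reduces to optimizing the output weights, which can be formulated as solving a linear system. Let $\{(\mathbf{x}_i,\mathbf{t}_i)\}_{i=1}^{N}$ be $N$ given training samples, where $\mathbf{x}_i\in\mathbb{R}^{d}$ is the input and $\mathbf{t}_i\in\mathbb{R}^{m}$ is the corresponding desired output. The ELM network can be expressed as

$$\sum_{j=1}^{L}\boldsymbol{\beta}_j\,f(\mathbf{w}_j\cdot\mathbf{x}_i+b_j)=\mathbf{t}_i,\quad i=1,\ldots,N, \tag{7}$$

where $L$ is the number of hidden nodes, $\boldsymbol{\beta}_j$ is the weight vector connecting the $j$th hidden node and the output nodes, $f(\cdot)$ is the activation function (in this work, a sigmoid function unless stated otherwise), $\mathbf{w}_j$ denotes the weight vector that connects the $j$th hidden node and the input nodes, and $b_j$ represents the randomly chosen bias of the $j$th hidden node. Equation (7) can be written compactly in matrix form as

$$\mathbf{H}\boldsymbol{\beta}=\mathbf{T}, \tag{8}$$

where

$$\mathbf{H}=\begin{bmatrix} f(\mathbf{w}_1\cdot\mathbf{x}_1+b_1) & \cdots & f(\mathbf{w}_L\cdot\mathbf{x}_1+b_L)\\ \vdots & \ddots & \vdots\\ f(\mathbf{w}_1\cdot\mathbf{x}_N+b_1) & \cdots & f(\mathbf{w}_L\cdot\mathbf{x}_N+b_L)\end{bmatrix},\quad \boldsymbol{\beta}=\begin{bmatrix}\boldsymbol{\beta}_1^T\\ \vdots\\ \boldsymbol{\beta}_L^T\end{bmatrix},\quad \mathbf{T}=\begin{bmatrix}\mathbf{t}_1^T\\ \vdots\\ \mathbf{t}_N^T\end{bmatrix},$$

and $\hat{\boldsymbol{\beta}}$ is the minimal norm least squares solution of (8). The parameter $\hat{\boldsymbol{\beta}}$ can be obtained by

$$\hat{\boldsymbol{\beta}}=\mathbf{H}^{\dagger}\mathbf{T},$$

where $\mathbf{H}^{\dagger}$ is the Moore-Penrose generalized inverse of the hidden layer output matrix $\mathbf{H}$.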
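The training procedure just described fits in a few lines of NumPy. The sketch below is a minimal illustration for a single-output regression task; the helper names `elm_fit` and `elm_predict`, the uniform range of the random weights, and the toy sine data are assumptions of this example rather than details taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_fit(X, T, n_hidden, rng):
    """Draw random hidden parameters, then solve for the output weights."""
    n_features = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))  # fixed, never trained
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # fixed, never trained
    H = sigmoid(X @ W + b)                # hidden layer output matrix (N x L)
    beta = np.linalg.pinv(H) @ T          # minimal norm least squares solution
    return W, b, beta

def elm_predict(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta

rng = np.random.default_rng(0)
X = np.linspace(-3.0, 3.0, 200).reshape(-1, 1)
T = np.sin(X).ravel()

W, b, beta = elm_fit(X, T, n_hidden=20, rng=rng)
train_rmse = float(np.sqrt(np.mean((elm_predict(X, W, b, beta) - T) ** 2)))
print(train_rmse)
```

Only `beta` is learned; `W` and `b` are drawn once and kept, which is why training reduces to a single pseudoinverse.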

##### 2.4. Orthogonal Matching Pursuit

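The matching pursuit method is one of the effective methods for sparse representation [14, 32, 33]. In general, a sparse representation problem can be formulated as

$$\mathbf{y}=\boldsymbol{\Phi}\mathbf{x}+\mathbf{v}, \tag{11}$$

where $\boldsymbol{\Phi}$ denotes the measurement matrix, $\mathbf{x}$ is the sparse vector, and $\mathbf{v}$ represents the noise vector. The main purpose is to recover the sparse vector $\mathbf{x}$ from the observation $\mathbf{y}$ and the measurement matrix $\boldsymbol{\Phi}$. The OMP uses the $\ell_0$-norm constrained least squares model

$$\min_{\mathbf{x}}\ \|\mathbf{y}-\boldsymbol{\Phi}\mathbf{x}\|_2^2\quad\text{s.t.}\quad\|\mathbf{x}\|_0\le K,$$

where $\|\mathbf{x}\|_0$ counts the number of nonzero coordinates of $\mathbf{x}$ and $K$ is the sparsity level.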

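In the following, we briefly describe the OMP method. First, we initialize the residual $\mathbf{r}_0=\mathbf{y}$, the index set $\Omega_0=\emptyset$, and the iteration counter $k=1$. At each iteration, the OMP algorithm selects the column of the measurement matrix $\boldsymbol{\Phi}$ that is most correlated with the residual:

$$j_k=\arg\max_{j}\ \left|\langle\mathbf{r}_{k-1},\boldsymbol{\phi}_j\rangle\right|,$$

where $\mathbf{r}_{k-1}$ denotes the residual in the $(k-1)$th iteration and $\boldsymbol{\phi}_j$ is the $j$th column of $\boldsymbol{\Phi}$. Then $j_k$ is added to the index set:

$$\Omega_k=\Omega_{k-1}\cup\{j_k\}.$$

We can solve an LS problem to obtain a new estimate supported in $\Omega_k$:

$$\mathbf{x}_k=\arg\min_{\mathbf{x}}\ \|\mathbf{y}-\boldsymbol{\Phi}\mathbf{x}\|_2^2\quad\text{s.t.}\quad\operatorname{supp}(\mathbf{x})\subseteq\Omega_k,$$

where $\operatorname{supp}(\mathbf{x})$ denotes the support set of $\mathbf{x}$. If the stopping criterion is satisfied, we output $\mathbf{x}_k$ as the estimate of $\mathbf{x}$.

Otherwise, one updates the residual

$$\mathbf{r}_k=\mathbf{y}-\boldsymbol{\Phi}\mathbf{x}_k$$

and proceeds to the next iteration.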
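The OMP steps described above can be sketched as follows. This is a minimal illustration, assuming that the loop stops once a prescribed number of atoms has been selected or the residual is numerically zero; the function name `omp` and the toy dictionary are hypothetical. The dictionary is given orthonormal columns so that exact recovery is guaranteed in this noise-free toy case; general dictionaries carry no such guarantee.

```python
import numpy as np

def omp(Phi, y, sparsity):
    """Orthogonal matching pursuit with a fixed sparsity budget."""
    n_atoms = Phi.shape[1]
    x = np.zeros(n_atoms)
    residual = y.copy()
    support = []
    for _ in range(sparsity):
        corr = np.abs(Phi.T @ residual)     # correlation of each atom with the residual
        if support:
            corr[support] = -1.0            # never pick the same atom twice
        support.append(int(np.argmax(corr)))
        # Least squares fit restricted to the selected columns.
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        x = np.zeros(n_atoms)
        x[support] = coef
        residual = y - Phi @ x
        if np.linalg.norm(residual) < 1e-10:
            break
    return x, sorted(support)

rng = np.random.default_rng(0)
Phi, _ = np.linalg.qr(rng.standard_normal((30, 10)))  # orthonormal columns
x_true = np.zeros(10)
x_true[[1, 4, 7]] = [2.0, -3.0, 1.5]
y = Phi @ x_true                                      # noise-free observation

x_hat, support = omp(Phi, y, sparsity=3)
print(support)
```

The least squares step refits all selected coefficients at every iteration, which is what distinguishes OMP from plain matching pursuit.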

From (8) and (11), we can see that the ELM network model has the same form as the sparse representation problem. Thus, one can take advantage of the OMP algorithm to select the best hidden nodes of the ELM network. The OMP estimates the sparse vector by using the $\ell_2$-norm based criterion, which performs well when the errors are Gaussian distributed. However, the presence of non-Gaussian noise may give rise to performance degradation.

#### 3. Kernel Risk-Sensitive Loss Based Matching Pursuit Extreme Learning Machine

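To address the aforementioned issue, we propose in this section a robust kernel risk-sensitive loss based orthogonal matching pursuit extreme learning machine algorithm (KRSLMP-ELM). In the KRSLMP-ELM, we initialize the residual as $\mathbf{r}_0=\mathbf{T}$ and the index set as $\Omega_0=\emptyset$. Then, as in OMP, the column of $\mathbf{H}$ most correlated with the residual is selected and the index set is augmented at each iteration. A new estimate $\boldsymbol{\beta}_k$ is then obtained by solving the following KRSL minimization problem:

$$\boldsymbol{\beta}_k=\arg\min_{\operatorname{supp}(\boldsymbol{\beta})\subseteq\Omega_k}\ \frac{1}{N\lambda}\sum_{i=1}^{N}\exp\left(\lambda\left(1-\kappa_\sigma(t_i-\mathbf{h}_i\boldsymbol{\beta})\right)\right), \tag{18}$$

where $t_i$ is the $i$th target and $\mathbf{h}_i$ is the $i$th row of $\mathbf{H}$.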

We utilize the half-quadratic (HQ) theory [34] to construct the optimization algorithm. Since the measurements may include both large and small noise, HQ optimization can be used to estimate the importance of the individual samples. Severely corrupted samples are assigned small weights during learning, which decreases the impact of large noise and thereby improves the performance of the KRSLMP-ELM.

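According to convex optimization theory [35], let $g(z)=\frac{1}{\lambda}\exp\left(\lambda\left(1-\exp\left(-\frac{z}{2\sigma^2}\right)\right)\right)$ for $z\ge 0$, so that the KRSL of a single error $e$ equals $g(e^2)$; $g$ is concave when $0<\lambda\le 1$. The dual function of $g$ is convex and defined as

$$\psi(p)=\sup_{z\ge 0}\ \{g(z)-pz\},$$

and then

$$g(z)=\inf_{p}\ \{pz+\psi(p)\}, \tag{20}$$

where the infimum is reached at $p=g'(z)=\frac{1}{2\sigma^2}\exp\left(-\frac{z}{2\sigma^2}\right)\exp\left(\lambda\left(1-\exp\left(-\frac{z}{2\sigma^2}\right)\right)\right)$. We point out here that when the parameter $\lambda>1$, the KRSLMP-ELM can also work well in our simulations. Substituting (20) into (18), dropping the constant factor $1/N$, and adding a regularization term, the KRSLMP-ELM objective function can be reformulated as

$$\min_{\boldsymbol{\beta},\,\mathbf{p}}\ (\mathbf{T}-\mathbf{H}\boldsymbol{\beta})^T\operatorname{diag}(\mathbf{p})(\mathbf{T}-\mathbf{H}\boldsymbol{\beta})+\sum_{i=1}^{N}\psi(p_i)+\gamma\|\boldsymbol{\beta}\|_2^2\quad\text{s.t.}\quad\operatorname{supp}(\boldsymbol{\beta})\subseteq\Omega_k, \tag{21}$$

where $\mathbf{p}=[p_1,\ldots,p_N]^T$, $\operatorname{diag}(\mathbf{p})$ represents a diagonal matrix with primary diagonal elements $p_i$, and $\gamma$ is the regularization parameter. Inspired by the HQ theory, (21) can be solved by the following alternating technique:

$$p_i^{k}=\frac{1}{2\sigma^2}\,\kappa_\sigma\left(e_i^{k-1}\right)\exp\left(\lambda\left(1-\kappa_\sigma\left(e_i^{k-1}\right)\right)\right),\quad e_i^{k-1}=t_i-\mathbf{h}_i\boldsymbol{\beta}_{k-1},$$

$$\boldsymbol{\beta}_{k}=\arg\min_{\operatorname{supp}(\boldsymbol{\beta})\subseteq\Omega_k}\ (\mathbf{T}-\mathbf{H}\boldsymbol{\beta})^T\operatorname{diag}(\mathbf{p}^{k})(\mathbf{T}-\mathbf{H}\boldsymbol{\beta})+\gamma\|\boldsymbol{\beta}\|_2^2,$$

where $k$ denotes the iteration number. In the proposed algorithm, the bandwidth $\sigma$ is adaptively chosen during the iteration. In order to make the scheme robust to outliers, we calculate the value of $\sigma$ as follows.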

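Denote the training error as $e_i=t_i-\mathbf{h}_i\boldsymbol{\beta}_{k-1}$, $i=1,\ldots,N$. We then reorder the absolute errors in ascending order to obtain $|e|_{(1)}\le|e|_{(2)}\le\cdots\le|e|_{(N)}$. Let $q=\lfloor\alpha N\rfloor$, where the scalar $\alpha\in(0,1]$ and $\lfloor\cdot\rfloor$ outputs the largest integer not exceeding its argument. We can select $\sigma=|e|_{(q)}$ as the bandwidth in accordance with the proportion of outliers. Detailed experimental results for different bandwidths are discussed in the experiment section. A solution of the optimization problem in (21) can be derived as follows:

$$(\boldsymbol{\beta}_k)_{\Omega_k}=\left(\mathbf{H}_{\Omega_k}^T\boldsymbol{\Lambda}_k\mathbf{H}_{\Omega_k}+\gamma\mathbf{I}\right)^{-1}\mathbf{H}_{\Omega_k}^T\boldsymbol{\Lambda}_k\mathbf{T},\quad(\boldsymbol{\beta}_k)_{\Omega_k^c}=\mathbf{0},$$

where $\mathbf{H}_{\Omega_k}$ consists of the columns of $\mathbf{H}$ indexed by $\Omega_k$, $\boldsymbol{\Lambda}_k=\operatorname{diag}(\mathbf{p}^k)$ is the diagonal matrix of sample weights, and $\mathbf{I}$ denotes the identity matrix.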
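The order-statistic bandwidth rule described above can be sketched as follows. The function name `select_bandwidth`, the use of absolute errors, and the guard that keeps the rank at least 1 are assumptions of this example; the error values are made up for illustration.

```python
import math

def select_bandwidth(errors, alpha):
    """Return the floor(alpha * N)-th smallest absolute error (1-based rank)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    ordered = sorted(abs(e) for e in errors)
    q = max(1, math.floor(alpha * len(ordered)))  # assumed guard against rank 0
    return ordered[q - 1]

# Eight ordinary errors and two gross outliers (12.0 and -9.0).
errors = [0.2, -0.1, 0.05, -0.3, 12.0, 0.15, -9.0, 0.25, 0.1, -0.2]
sigma = select_bandwidth(errors, alpha=0.8)
print(sigma)
```

With `alpha` set to the assumed fraction of clean samples, the selected bandwidth is on the scale of the largest ordinary error, so the outliers fall far outside the kernel width and receive very small weights.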

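Since the importance of each measurement is used to adaptively update the output weight vector in the KRSLMP-ELM, we update the residual with the same sample weights:

$$\mathbf{r}_k=\boldsymbol{\Lambda}_k\left(\mathbf{T}-\mathbf{H}\boldsymbol{\beta}_k\right),$$

where $\boldsymbol{\Lambda}_k$ is the diagonal matrix of sample weights obtained in the HQ step.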

Note that the sparsity level $K$ has to be assigned in advance in the KRSLMP-ELM. Because more hidden nodes than necessary are generated, the sparsity level directly determines the number of active hidden nodes used in ELM. To obtain the best sparsity level $K$, namely, the best number of hidden nodes used in ELM, we utilize the root mean square error (RMSE) as the criterion:

$$\mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(t_i-\hat{t}_i\right)^2},$$

where $t_i$ denotes the target response and $\hat{t}_i$ the corresponding output estimated by the KRSLMP-ELM.

For each candidate sparsity level $K$, the corresponding RMSE is first calculated. Then the coefficients associated with the minimum RMSE value are selected.
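The selection rule can be illustrated with a small sketch. It assumes that the network outputs for each candidate sparsity level have already been computed; the outputs below are made-up numbers, and the text does not specify here which data set the RMSE is evaluated on, so that choice is left to the caller.

```python
import math

def rmse(targets, outputs):
    """Root mean square error between targets and estimated outputs."""
    return math.sqrt(sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets))

def best_sparsity(targets, outputs_by_sparsity):
    """Return (K, RMSE) for the candidate sparsity level with the smallest RMSE."""
    scores = {k: rmse(targets, out) for k, out in outputs_by_sparsity.items()}
    k_best = min(scores, key=scores.get)
    return k_best, scores[k_best]

targets = [1.0, 2.0, 3.0, 4.0]
outputs_by_sparsity = {   # hypothetical outputs for three sparsity levels
    2: [1.5, 2.5, 2.5, 3.5],
    5: [1.1, 1.9, 3.1, 3.9],
    9: [1.0, 2.3, 2.7, 4.2],
}
k_best, score = best_sparsity(targets, outputs_by_sparsity)
print(k_best, score)
```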

The iteration is repeated until the stopping criterion is met. The KRSLMP-ELM is summarized in Algorithm 1.
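The steps of this section can be put together in a compact sketch. It is not a transcription of Algorithm 1: the weight formula (the derivative of the per-sample KRSL with respect to the squared error), the weighted residual, the stopping rule (a fixed number of selected nodes), the default parameter values, and all function names are assumptions made for illustration.

```python
import numpy as np

def krsl_weights(err, sigma, lam):
    """Assumed HQ weights: derivative of the per-sample KRSL w.r.t. the squared error."""
    k = np.exp(-err ** 2 / (2.0 * sigma ** 2))
    return k * np.exp(lam * (1.0 - k)) / (2.0 * sigma ** 2)

def krslmp(H, t, sparsity, lam=1.0, alpha=0.8, gamma=1e-6):
    """Select `sparsity` columns of H and fit them by reweighted regularized LS."""
    n_samples, n_nodes = H.shape
    beta = np.zeros(n_nodes)
    residual = t.copy()
    support = []
    for _ in range(sparsity):
        corr = np.abs(H.T @ residual)
        if support:
            corr[support] = -1.0                     # never reselect a node
        support.append(int(np.argmax(corr)))
        err = t - H @ beta                           # errors of the previous estimate
        q = max(1, int(np.floor(alpha * n_samples)))
        sigma = max(np.sort(np.abs(err))[q - 1], 1e-12)  # order-statistic bandwidth
        p = krsl_weights(err, sigma, lam)            # small weights for large errors
        Hs = H[:, support]
        A = Hs.T @ (p[:, None] * Hs) + gamma * np.eye(len(support))
        beta = np.zeros(n_nodes)
        beta[support] = np.linalg.solve(A, Hs.T @ (p * t))
        residual = p * (t - H @ beta)                # weighted residual
    return beta, support

rng = np.random.default_rng(0)
H = rng.standard_normal((60, 15))        # stand-in for a hidden layer output matrix
beta_true = np.zeros(15)
beta_true[[2, 7, 11]] = [2.0, -3.0, 1.5]
t = H @ beta_true + 0.01 * rng.standard_normal(60)
t[:5] += 50.0                            # five grossly corrupted targets

beta, support = krslmp(H, t, sparsity=4)
print(sorted(support))
```

Because the regularized system matrix is positive definite for any $\gamma>0$, the linear solve is well defined even when many sample weights are close to zero.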