Abstract

Extreme learning machine (ELM) is a popular learning algorithm for single hidden layer feedforward networks (SLFNs). It was originally proposed with inspiration from biological learning and has attracted massive attention due to its adaptability to various tasks, its fast learning ability, and its low computational cost. As an effective sparse representation method, the orthogonal matching pursuit (OMP) method can be embedded into ELM to overcome the singularity problem and improve stability. Usually OMP recovers a sparse vector by minimizing a least squares (LS) loss, which is efficient for Gaussian distributed data but may suffer performance deterioration in the presence of non-Gaussian data. To address this problem, a robust matching pursuit method based on a novel kernel risk-sensitive loss (KRSLMP for short) is first proposed in this paper. The KRSLMP is then applied to ELM to solve for the sparse output weight vector, and the new method, named the KRSLMP-ELM, is developed for SLFN learning. Experimental results on synthetic and real-world data sets confirm the effectiveness and superiority of the proposed method.

1. Introduction

Extreme learning machine [1] is a kind of single hidden layer feedforward network (SLFN) [2]. In the past decade, ELM has become popular in the machine learning and pattern recognition communities for its fast adaptability and good generalization performance [3]. In general, ELM has the following advantages: (i) it can estimate the unknown mapping embedded in a mass of training samples and can be efficiently parallelized for both training and testing; (ii) it uses randomly generated input weights and hidden biases that are not tuned during the training phase, and therefore the output weights can be obtained analytically by solving a standard least squares (LS) problem. Thus, extremely fast learning and a low computational cost can be achieved, especially in big data applications. In view of these remarkable merits, ELM has been widely used in many applications, such as face recognition [4], series compensated transmission line protection [5], time series analysis [6], and nonlinear model identification [7].

However, ELM still has several drawbacks. First, ELM encounters the problem of irrelevant variables when handling real-world data sets [8]. Second, choosing a proper number of hidden nodes is an open problem for all ELM algorithms. An ELM network with too few hidden nodes may not model the input data accurately, whereas a network with too many hidden nodes tends to produce an overfitted model [9]. Moreover, when the number of hidden nodes exceeds the number of training samples, ELM may suffer from the singularity problem [4]. Third, the original ELM learns the model with an $\ell_2$-norm based loss function, which is very vulnerable to noise. It is well known that the $\ell_2$-norm magnifies the bad effects of outliers associated with large deviations [10]. The presence of non-Gaussian noise or outliers in the training data may thus lead to an unreliable model with degraded performance.

To overcome the first and second limitations, several methods have been proposed in the regularization framework [9, 11–13]. Furthermore, orthogonal matching pursuit (OMP) is a simple and efficient iterative algorithm that, at each iteration, chooses the atom in the dictionary most correlated with the current residual [14]. As such, OMP has been embedded into ELM (OMP-ELM) to overcome the singularity problem, leading to more stable solutions than the original ELM [15]. Most of the existing methods learn the model with an $\ell_2$-norm based loss function, which may perform poorly in the presence of non-Gaussian noises (which exist in many real-world situations) or outliers [16–18]. To combat non-Gaussian noises or outliers and improve the generalization ability, the regularized correntropy criterion has been used to replace the $\ell_2$-norm based loss function in the original ELM model, yielding the ELM-RCC [16]. In [19], an ELM with an $\ell_1$-norm based loss function (ORELM) was proposed to achieve robust performance.

The kernel risk-sensitive loss (KRSL) is a nonlinear similarity measure first proposed in [20], which achieves more robust performance. The KRSL is based on the original structure of the risk-sensitive loss and is defined in the reproducing kernel Hilbert space (RKHS) [21, 22]:

$$L_{\lambda}(X,Y)=\frac{1}{\lambda}\,\mathbf{E}\left[\exp\left(\lambda\left(1-\kappa_{\sigma}(X-Y)\right)\right)\right],$$

where $\mathbf{E}[\cdot]$ denotes the mathematical expectation, $\kappa_{\sigma}(\cdot)$ is the Gaussian kernel with bandwidth $\sigma$, and $\lambda>0$ is the risk-sensitive parameter. In this paper, we propose a KRSL based matching pursuit (KRSLMP) method. The KRSLMP is then embedded into ELM to construct a robust and sparse ELM model.

The rest of the paper is structured as follows. In Section 2, we sketch the related work, including similarity measures in kernel space, the kernel risk-sensitive loss, the ELM model, and the orthogonal matching pursuit algorithm. In Section 3, we develop the KRSLMP-ELM. In Section 4, experiments on regression problems with synthetic and real-world data sets are conducted to verify the effectiveness of the proposed algorithm. The sensitivity of the KRSLMP-ELM to its free parameters is also analyzed. Finally, the conclusion is given in Section 5.

For convenience of presentation, the following notations used in this paper are introduced. Vectors and matrices are represented with boldface lowercase letters and boldface capital letters, respectively. For any vector $\mathbf{x}$, we use $x_{i}$ to denote its $i$th entry. The notation $\mathbf{x}_{\Lambda}$ denotes the subvector of $\mathbf{x}$ with entries indexed by the set $\Lambda$. The complementary set of $\Lambda$ is denoted as $\Lambda^{c}$.

2. Related Work

2.1. Similarity Measures in Kernel Space

Let $X$ and $Y$ be two random variables; the correntropy between $X$ and $Y$ is defined by [17, 23]

$$V_{\sigma}(X,Y)=\mathbf{E}\left[\kappa_{\sigma}(X-Y)\right]=\int\kappa_{\sigma}(x-y)\,dF_{XY}(x,y),$$

where $F_{XY}(x,y)$ is the joint distribution function of $(X,Y)$. The Gaussian kernel with bandwidth $\sigma$ is given by

$$\kappa_{\sigma}(x-y)=\exp\left(-\frac{(x-y)^{2}}{2\sigma^{2}}\right).$$

Correntropy is a local correlation measure in the kernel space $\mathcal{H}$. According to Mercer's theorem [24], it can be expressed in terms of the inner product as

$$V_{\sigma}(X,Y)=\mathbf{E}\left[\left\langle\varphi(X),\varphi(Y)\right\rangle_{\mathcal{H}}\right],$$

where $\varphi(\cdot)$ is the nonlinear mapping induced by the Gaussian kernel.

It applies a kernel trick that nonlinearly maps the original space to a higher dimensional feature space. It can be shown that correntropy is directly related to the probability of how similar two random variables are in a neighborhood of the joint space controlled by the kernel bandwidth [17, 25, 26].
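For concreteness, the sample version of this similarity measure can be sketched in a few lines of Python; the function names gaussian_kernel and correntropy below are illustrative choices, not identifiers from the paper.

```python
import numpy as np

def gaussian_kernel(e, sigma):
    """Gaussian kernel kappa_sigma(e) = exp(-e^2 / (2 * sigma^2))."""
    return np.exp(-(e ** 2) / (2.0 * sigma ** 2))

def correntropy(x, y, sigma=1.0):
    """Sample correntropy estimate: average kernel value of the differences."""
    e = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.mean(gaussian_kernel(e, sigma)))
```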

2.2. Kernel Risk-Sensitive Loss

Similarity measures in kernel space have the ability to extract higher-order statistics of the data, which can significantly improve the learning performance in non-Gaussian environments [21]. The optimization problem can be formulated by maximizing the correntropy criterion (MCC) or, equivalently, minimizing the correntropic loss (C-Loss) [27, 28] between the output estimate and the target response. However, the C-Loss performance surface can be highly nonconvex, with steep slopes around the optimal solution and extremely flat regions far from it. This may lead to slow convergence and poor performance. Choosing a large kernel bandwidth can alleviate this problem, but robustness to outliers then decreases significantly as the bandwidth increases [29]. To achieve a more satisfying performance surface, the KRSL was proposed in [20].

The KRSL is defined by

$$L_{\lambda}(X,Y)=\frac{1}{\lambda}\,\mathbf{E}\left[\exp\left(\lambda\left(1-\kappa_{\sigma}(X-Y)\right)\right)\right],$$

which can also be expressed in a traditional risk-sensitive loss form as [30]

$$L_{\lambda}(X,Y)=\frac{1}{\lambda}\,\mathbf{E}\left[\exp\left(\frac{\lambda}{2}\left\|\varphi(X)-\varphi(Y)\right\|_{\mathcal{H}}^{2}\right)\right],$$

where $\lambda>0$ is the risk-sensitive parameter that controls the shape of the performance surface.

In practice, the joint distribution function of $X$ and $Y$ is usually unknown and only a finite number of samples $\{(x_{i},y_{i})\}_{i=1}^{N}$ are available. The KRSL can thus be estimated by

$$\hat{L}_{\lambda}(\mathbf{x},\mathbf{y})=\frac{1}{N\lambda}\sum_{i=1}^{N}\exp\left(\lambda\left(1-\kappa_{\sigma}(x_{i}-y_{i})\right)\right). \tag{6}$$

As one can see, (6) defines a distance between the vectors $\mathbf{x}=[x_{1},\ldots,x_{N}]^{T}$ and $\mathbf{y}=[y_{1},\ldots,y_{N}]^{T}$.
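The empirical KRSL in (6) can likewise be sketched as follows, under the reconstruction above; sigma and lam are placeholder names for the kernel bandwidth and the risk-sensitive parameter.

```python
import numpy as np

def empirical_krsl(x, y, sigma=1.0, lam=2.0):
    """Empirical KRSL: (1 / (N * lam)) * sum_i exp(lam * (1 - kappa_sigma(x_i - y_i)))."""
    e = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    kappa = np.exp(-(e ** 2) / (2.0 * sigma ** 2))   # Gaussian kernel values
    return float(np.mean(np.exp(lam * (1.0 - kappa))) / lam)
```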

2.3. Extreme Learning Machine

Extreme learning machine (ELM) was proposed by Huang et al. for training single hidden layer feedforward neural networks (SLFNs) [2, 31]. The input weights and biases are initialized randomly in ELM and remain unchanged during training. The network learning thus reduces to optimizing the output weights, which can be formulated as solving a linear equation. Let the training set be given by $N$ samples $\{(\mathbf{x}_{i},t_{i})\}_{i=1}^{N}$, with input $\mathbf{x}_{i}\in\mathbb{R}^{d}$ and corresponding desired output $t_{i}$; the relationship between $\mathbf{x}_{i}$ and $t_{i}$ is assumed to follow the model below. The network model of ELM with $L$ hidden neurons can be expressed as

$$\sum_{j=1}^{L}\beta_{j}\,g\left(\mathbf{w}_{j}^{T}\mathbf{x}_{i}+b_{j}\right)=t_{i},\quad i=1,\ldots,N, \tag{7}$$

where $L$ is the number of hidden nodes, $\beta_{j}$ is the weight connecting the $j$th hidden node and the output node, $g(\cdot)$ is the activation function (in this work, $g$ is a sigmoid function unless stated otherwise), $\mathbf{w}_{j}$ denotes the weight vector that connects the $j$th hidden node and the input nodes, and $b_{j}$ represents the randomly chosen bias of the $j$th hidden node. Equation (7) can be compactly written in matrix notation as

$$\mathbf{H}\boldsymbol{\beta}=\mathbf{t}, \tag{8}$$

where

$$\mathbf{H}=\begin{bmatrix}g(\mathbf{w}_{1}^{T}\mathbf{x}_{1}+b_{1}) & \cdots & g(\mathbf{w}_{L}^{T}\mathbf{x}_{1}+b_{L})\\ \vdots & \ddots & \vdots\\ g(\mathbf{w}_{1}^{T}\mathbf{x}_{N}+b_{1}) & \cdots & g(\mathbf{w}_{L}^{T}\mathbf{x}_{N}+b_{L})\end{bmatrix},\qquad \boldsymbol{\beta}=\begin{bmatrix}\beta_{1}\\ \vdots\\ \beta_{L}\end{bmatrix},\qquad \mathbf{t}=\begin{bmatrix}t_{1}\\ \vdots\\ t_{N}\end{bmatrix},$$

and $\boldsymbol{\beta}$ is taken as the minimal norm least squares solution of (8). This solution can be obtained by

$$\boldsymbol{\beta}=\mathbf{H}^{\dagger}\mathbf{t},$$

where $\mathbf{H}^{\dagger}$ is the Moore-Penrose generalized inverse of the hidden layer output matrix $\mathbf{H}$.
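The following minimal sketch illustrates the standard ELM training step just described (random input weights and biases, a sigmoid hidden layer, and the Moore-Penrose solution); it is a generic illustration rather than the authors' implementation, and all names are placeholders.

```python
import numpy as np

def elm_train(X, t, L=50, rng=None):
    """Train a basic ELM: random hidden layer, least-squares output weights."""
    rng = np.random.default_rng(rng)
    W = rng.uniform(-1.0, 1.0, size=(L, X.shape[1]))   # random input weights
    b = rng.uniform(-1.0, 1.0, size=L)                 # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))           # sigmoid hidden-layer outputs
    beta = np.linalg.pinv(H) @ t                       # Moore-Penrose solution of H beta = t
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))
    return H @ beta
```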

2.4. Orthogonal Matching Pursuit

Matching pursuit is one of the effective methods for sparse representation [14, 32, 33]. In general, a sparse representation problem can be formulated as

$$\mathbf{y}=\boldsymbol{\Phi}\mathbf{x}+\mathbf{n}, \tag{11}$$

where $\boldsymbol{\Phi}$ denotes the measurement matrix, $\mathbf{x}$ is the sparse vector, and $\mathbf{n}$ represents the noise vector. The main purpose is to recover the sparse vector $\mathbf{x}$ from the observation $\mathbf{y}$ and the measurement matrix $\boldsymbol{\Phi}$. The OMP uses the $\ell_0$-norm constrained least squares model

$$\min_{\mathbf{x}}\;\left\|\mathbf{y}-\boldsymbol{\Phi}\mathbf{x}\right\|_{2}^{2}\quad\text{subject to}\quad\left\|\mathbf{x}\right\|_{0}\le K,$$

where $\left\|\mathbf{x}\right\|_{0}$ counts the number of nonzero coordinates of $\mathbf{x}$ and $K$ is the sparsity level.

In the following, we briefly describe the OMP method. First, we initialize the residual $\mathbf{r}^{0}=\mathbf{y}$, the index set $\Lambda^{0}=\emptyset$, and the iteration counter $t=1$. At each iteration, the OMP algorithm selects the column of the measurement matrix that is most correlated with the residual,

$$j^{t}=\arg\max_{j}\left|\left\langle\mathbf{r}^{t-1},\boldsymbol{\phi}_{j}\right\rangle\right|,$$

where $\mathbf{r}^{t-1}$ denotes the residual at the $(t-1)$th iteration and $\boldsymbol{\phi}_{j}$ is the $j$th column of $\boldsymbol{\Phi}$. The selected index is then collected into the index set

$$\Lambda^{t}=\Lambda^{t-1}\cup\left\{j^{t}\right\}.$$

We can then solve an LS problem to obtain a new estimate $\mathbf{x}^{t}$ supported in $\Lambda^{t}$:

$$\mathbf{x}^{t}=\arg\min_{\mathbf{x}:\,\operatorname{supp}(\mathbf{x})\subseteq\Lambda^{t}}\left\|\mathbf{y}-\boldsymbol{\Phi}\mathbf{x}\right\|_{2}^{2},$$

where $\operatorname{supp}(\mathbf{x})$ denotes the support set of $\mathbf{x}$. If the stopping criterion is satisfied, we output $\mathbf{x}^{t}$ as the estimate of $\mathbf{x}$.

Otherwise, one updates the residual

$$\mathbf{r}^{t}=\mathbf{y}-\boldsymbol{\Phi}\mathbf{x}^{t}$$

and proceeds to the next iteration.
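A compact sketch of the OMP steps just described is given below (greedy column selection, support-restricted LS fit, residual update); column normalization and more elaborate stopping rules are omitted for brevity.

```python
import numpy as np

def omp(Phi, y, K, tol=1e-8):
    """Orthogonal matching pursuit: recover a K-sparse x with y ~ Phi x."""
    n, m = Phi.shape
    x = np.zeros(m)
    r = y.copy()                               # residual r^0 = y
    support = []                               # index set Lambda^0 = empty
    for _ in range(K):
        j = int(np.argmax(np.abs(Phi.T @ r)))  # column most correlated with residual
        if j not in support:
            support.append(j)
        x_s = np.linalg.lstsq(Phi[:, support], y, rcond=None)[0]
        x = np.zeros(m)
        x[support] = x_s                       # LS estimate restricted to the support
        r = y - Phi @ x                        # residual update
        if np.linalg.norm(r) < tol:            # stopping criterion
            break
    return x, support
```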

From (8) and (11), we can see that the ELM network model has the same form as the sparse representation problem. Thus, one can take advantage of the OMP algorithm to select the best hidden nodes of the ELM network. The OMP estimates the sparse vector by using the $\ell_2$-norm based criterion, which performs well under a Gaussian error distribution. However, the presence of non-Gaussian noise may give rise to performance degradation.

3. Kernel Risk-Sensitive Loss Based Matching Pursuit Extreme Learning Machine

To address the aforementioned issue, we propose a robust kernel risk-sensitive loss based orthogonal matching pursuit extreme learning machine algorithm (KRSLMP-ELM) in this section. In the KRSLMP-ELM, we initialize the residual as $\mathbf{r}^{0}=\mathbf{t}$ and the initial index set as $\Lambda^{0}=\emptyset$. Then, similar to OMP, the column of $\mathbf{H}$ most correlated with the residual is selected and the index set is augmented at each iteration. We then obtain a new estimate by solving the following KRSL minimization problem:

$$\boldsymbol{\beta}^{t}=\arg\min_{\boldsymbol{\beta}:\,\operatorname{supp}(\boldsymbol{\beta})\subseteq\Lambda^{t}}\;\left\{\frac{1}{N\lambda}\sum_{i=1}^{N}\exp\left(\lambda\left(1-\kappa_{\sigma}\left(t_{i}-\mathbf{h}_{i}\boldsymbol{\beta}\right)\right)\right)+\gamma\left\|\boldsymbol{\beta}\right\|_{2}^{2}\right\},$$

where $\mathbf{h}_{i}$ denotes the $i$th row of the hidden layer output matrix $\mathbf{H}$, which now plays the role of the measurement matrix, and $\gamma\ge 0$ is a regularization parameter.

We utilize the half-quadratic (HQ) theory [34] to construct the optimization algorithm. Considering that the measurements may be corrupted by both large and small noise, we use HQ optimization to estimate the importance of the different samples. Severely corrupted samples are assigned small weight values in the learning procedure, which decreases the impact of large noise. Thus, the performance of the KRSLMP-ELM can be significantly improved.

According to convex optimization theory [35], the exponential term in the KRSL admits a half-quadratic representation: through a convex dual function, it can be written as the infimum over an auxiliary variable of a weighted quadratic term in the error plus the dual function, and the infimum is reached at a closed-form value of the auxiliary variable determined by the current error. We point out here that the KRSLMP-ELM also works well in our simulations without a restrictive choice of the parameters involved in this reformulation. Substituting this representation into the KRSL minimization problem above, the KRSLMP-ELM objective function can be reformulated as

$$\min_{\boldsymbol{\beta},\,\mathbf{w}:\,\operatorname{supp}(\boldsymbol{\beta})\subseteq\Lambda^{t}}\;\left(\mathbf{t}-\mathbf{H}\boldsymbol{\beta}\right)^{T}\operatorname{diag}\left(\mathbf{w}\right)\left(\mathbf{t}-\mathbf{H}\boldsymbol{\beta}\right)+\sum_{i=1}^{N}g\left(w_{i}\right)+\gamma\left\|\boldsymbol{\beta}\right\|_{2}^{2}, \tag{21}$$

where $g(\cdot)$ is the dual function introduced by the HQ representation, $\operatorname{diag}(\mathbf{w})$ represents a diagonal matrix with the auxiliary weight vector $\mathbf{w}$ as its primary diagonal, and $\gamma$ is the regularization parameter. Inspired by the HQ theory, (21) can be solved by the following alternating technique: with $\boldsymbol{\beta}$ fixed, the weights $\mathbf{w}^{k+1}$ are updated in closed form from the current errors; with $\mathbf{w}^{k+1}$ fixed, $\boldsymbol{\beta}^{k+1}$ is obtained by solving the resulting weighted regularized LS problem, where $k$ denotes the iteration number. In the proposed algorithm, the bandwidth $\sigma$ is adaptively chosen during the iterations. In order to make the scheme robust to outliers, we calculate the value of $\sigma$ as follows.

Denote the training errors as $e_{i}=t_{i}-\mathbf{h}_{i}\boldsymbol{\beta}$, $i=1,\ldots,N$. We then reorder the absolute errors in ascending order and obtain the reordered sequence $|e|_{(1)}\le|e|_{(2)}\le\cdots\le|e|_{(N)}$. Let $M=\lfloor\rho N\rfloor$, where the scalar $\rho\in(0,1)$ and $\lfloor\cdot\rfloor$ outputs the largest integer not greater than its argument. We select $\sigma=|e|_{(M)}$ as the bandwidth, in accordance with the expected proportion of outliers. Discussions on the detailed experimental results obtained with different bandwidths are given in the experiment section. A solution of the optimization problem in (21) with respect to $\boldsymbol{\beta}$ can be derived as follows:

$$\boldsymbol{\beta}_{\Lambda^{t}}=\left(\mathbf{H}_{\Lambda^{t}}^{T}\operatorname{diag}\left(\mathbf{w}\right)\mathbf{H}_{\Lambda^{t}}+\gamma\mathbf{I}\right)^{-1}\mathbf{H}_{\Lambda^{t}}^{T}\operatorname{diag}\left(\mathbf{w}\right)\mathbf{t},$$

where $\mathbf{H}_{\Lambda^{t}}$ denotes the submatrix of $\mathbf{H}$ formed by the columns indexed by $\Lambda^{t}$ and $\mathbf{I}$ denotes the identity matrix.
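The sketch below shows one plausible instantiation of this update. The bandwidth follows the ordered-error rule just described (rho is a hypothetical name for the proportion parameter), whereas the per-sample weight formula exp(lam * (1 - kappa)) * kappa is an assumption borrowed from common KRSL fixed-point solvers, not necessarily the exact expression produced by the HQ dual here.

```python
import numpy as np

def adaptive_bandwidth(errors, rho=0.8):
    """Pick sigma as the floor(rho * N)-th smallest |e_i| (rho is hypothetical)."""
    e_sorted = np.sort(np.abs(errors))
    idx = min(int(np.floor(rho * len(e_sorted))), len(e_sorted) - 1)
    return max(float(e_sorted[idx]), 1e-6)      # guard against a zero bandwidth

def krsl_weighted_ls(H_s, t, lam=2.0, gamma=1e-3, n_iter=10):
    """HQ-style solve of the support-restricted problem: alternate weight and ridge updates."""
    k = H_s.shape[1]
    beta = np.linalg.lstsq(H_s, t, rcond=None)[0]     # LS initialization
    for _ in range(n_iter):
        e = t - H_s @ beta
        sigma = adaptive_bandwidth(e)
        kappa = np.exp(-(e ** 2) / (2.0 * sigma ** 2))
        w = np.exp(lam * (1.0 - kappa)) * kappa       # assumed KRSL reweighting
        A = H_s.T @ (w[:, None] * H_s) + gamma * np.eye(k)
        beta = np.linalg.solve(A, H_s.T @ (w * t))    # weighted ridge solution
    return beta
```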

Since the importance degree of the measurements is employed to adaptively update the output weight vector in the KRSLMP-ELM, the residual is updated accordingly as

$$\mathbf{r}^{t}=\mathbf{t}-\mathbf{H}\boldsymbol{\beta}^{t}.$$

It is noted that the sparsity level $K$ has to be assigned in advance in the KRSLMP-ELM. The sparsity level directly determines the number of active hidden nodes used in ELM, since more hidden nodes than necessary are generated. To obtain the best sparsity level $K$, namely, the best number of hidden nodes used in ELM, we utilize the root mean square error (RMSE) as the criterion

$$\mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(t_{i}-\hat{t}_{i}\right)^{2}},$$

where $t_{i}$ denotes the target response and $\hat{t}_{i}$ the corresponding output estimated by the KRSLMP-ELM.

For each candidate sparsity level $K$, the corresponding RMSE is first calculated. Then the coefficients associated with the minimum RMSE value are selected.
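A short sketch of this selection procedure follows; train_fn and predict_fn are hypothetical callables wrapping a KRSLMP-ELM fit and its prediction, and the RMSE is computed on the training targets as in the criterion above.

```python
import numpy as np

def rmse(t_true, t_pred):
    return float(np.sqrt(np.mean((np.asarray(t_true) - np.asarray(t_pred)) ** 2)))

def select_sparsity(train_fn, predict_fn, X, t, candidate_K):
    """Return the sparsity level (and model) with the smallest RMSE."""
    best = (None, np.inf, None)
    for K in candidate_K:
        model = train_fn(X, t, K)
        err = rmse(t, predict_fn(model, X))
        if err < best[1]:
            best = (K, err, model)
    return best[0], best[2]
```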

The iteration is repeated until the stopping criterion is met. The KRSLMP-ELM is summarized in Algorithm 1.

Input: training samples $\{(\mathbf{x}_{i},t_{i})\}_{i=1}^{N}$
Output: output weight vector $\boldsymbol{\beta}$
  Parameter setting: number of hidden nodes $L$, regularization parameter $\gamma$, and sparsity level $K$.
  Initialization: randomly initialize the ELM parameters (input weights $\mathbf{w}_{j}$ and biases $b_{j}$) and form the measurement matrix $\mathbf{H}$.
   Set the index set $\Lambda^{0}=\emptyset$, the residual $\mathbf{r}^{0}=\mathbf{t}$, the iteration counter $t=1$, and $\boldsymbol{\beta}^{0}=\mathbf{0}$.
for  $t=1,\ldots,K$  do
   Find the column of $\mathbf{H}$ most correlated with the residual:
     $j^{t}=\arg\max_{j}\left|\mathbf{H}_{j}^{T}\mathbf{r}^{t-1}\right|$, where $\mathbf{H}_{j}$ is the $j$th column of $\mathbf{H}$
   Augment the index set:
     $\Lambda^{t}=\Lambda^{t-1}\cup\{j^{t}\}$
   Solve the KRSLMP minimization problem restricted to $\Lambda^{t}$ by the following HQ iterations:
     update the auxiliary weights $\mathbf{w}^{k+1}$ in closed form from the current errors $\mathbf{t}-\mathbf{H}\boldsymbol{\beta}^{k}$, with the bandwidth $\sigma$ chosen adaptively
     $\boldsymbol{\beta}_{\Lambda^{t}}^{k+1}=\left(\mathbf{H}_{\Lambda^{t}}^{T}\operatorname{diag}(\mathbf{w}^{k+1})\mathbf{H}_{\Lambda^{t}}+\gamma\mathbf{I}\right)^{-1}\mathbf{H}_{\Lambda^{t}}^{T}\operatorname{diag}(\mathbf{w}^{k+1})\,\mathbf{t}$
     The solution upon convergence is denoted as $\boldsymbol{\beta}^{t}$
   Update residual: $\mathbf{r}^{t}=\mathbf{t}-\mathbf{H}\boldsymbol{\beta}^{t}$
end   for
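Putting the pieces together, a condensed, self-contained sketch of the outer loop of Algorithm 1 is shown below; the weight formula, the 0.8 quantile used for the adaptive bandwidth, and the plain residual update are assumptions consistent with the description above rather than the authors' exact implementation.

```python
import numpy as np

def krslmp_elm_fit(X, t, L=100, K=20, lam=2.0, gamma=1e-3, hq_iters=10, rng=None):
    """Illustrative KRSLMP-ELM outer loop: random ELM features + robust greedy selection."""
    rng = np.random.default_rng(rng)
    W = rng.uniform(-1.0, 1.0, size=(L, X.shape[1]))    # random input weights
    b = rng.uniform(-1.0, 1.0, size=L)                  # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))            # measurement matrix
    beta, support, r = np.zeros(L), [], t.copy()
    for _ in range(K):
        j = int(np.argmax(np.abs(H.T @ r)))             # most correlated column
        if j not in support:
            support.append(j)
        Hs = H[:, support]
        bs = np.linalg.lstsq(Hs, t, rcond=None)[0]      # LS initialization
        for _ in range(hq_iters):                       # HQ-style reweighting
            e = t - Hs @ bs
            sigma = max(float(np.sort(np.abs(e))[int(0.8 * (len(e) - 1))]), 1e-6)
            kappa = np.exp(-(e ** 2) / (2.0 * sigma ** 2))
            w = np.exp(lam * (1.0 - kappa)) * kappa     # assumed KRSL weights
            A = Hs.T @ (w[:, None] * Hs) + gamma * np.eye(len(support))
            bs = np.linalg.solve(A, Hs.T @ (w * t))
        beta = np.zeros(L)
        beta[support] = bs
        r = t - H @ beta                                # residual update
    return W, b, beta
```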

4. Experimental Results

To validate the effectiveness of the proposed KRSLMP-ELM algorithm, experiments on two synthetic data sets and seven benchmark data sets are conducted in this section. The performance of the new method is compared to five state-of-the-art algorithms, namely, ELM, RELM, ELM-RCC, OMP-ELM, and ORELM. Sigmoid function is used as the activation function for all methods.

4.1. Synthetic Data Sets

In this subsection, experiments on two synthetic regression data sets for nonlinear function approximation problem are carried out. Descriptions of the two data sets are as follows.

Sinc. This synthetic data set is generated by $y=\operatorname{sinc}(x)+v$, where

$$\operatorname{sinc}(x)=\begin{cases}\dfrac{\sin(x)}{x}, & x\neq 0,\\ 1, & x=0,\end{cases}$$

and $v$ contains two mutually independent noise components, an inner noise $A$ and an outlier noise $B$. Specifically, $v$ is defined as $v=(1-c)A+cB$, where $c$ is binary distributed with the probability masses $P(c=1)=p$ and $P(c=0)=1-p$, and $A$ and $B$ are independent of $c$. In this experiment, $p$ is set at 0.1. The outlier noise $B$ is generated by a zero-mean Gaussian distribution with standard deviation 4.0. For the inner noise $A$, two different noises are tested: (a) a uniform distribution over a symmetric interval around zero and (b) a sine wave noise with uniformly distributed phase. We uniformly generate the input data $x$, where 200 data points are used for training and another 200 clean data points, not contaminated by any noise, are used for testing.
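For reference, one way to generate this Sinc training set is sketched below; the input range, the inner-noise interval, and the exact mixing of the two noise sources are assumptions (flagged in the comments), since those specifics are not recoverable from the description above.

```python
import numpy as np

def make_sinc_data(n=200, p_outlier=0.1, outlier_std=4.0, rng=None):
    """Sinc data: y = sinc(x) + v, with a binary switch between inner noise and outliers."""
    rng = np.random.default_rng(rng)
    x = rng.uniform(-10.0, 10.0, size=n)        # assumed input range
    clean = np.sinc(x / np.pi)                  # sin(x)/x with sinc(0) = 1
    inner = rng.uniform(-0.2, 0.2, size=n)      # assumed inner-noise interval (case (a))
    c = rng.random(n) < p_outlier               # binary switch, P(c = 1) = 0.1
    outlier = rng.normal(0.0, outlier_std, size=n)
    y = clean + np.where(c, outlier, inner)     # assumed mixture v = (1 - c) A + c B
    return x.reshape(-1, 1), y, clean
```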

Func. This synthetic data set is generated by a nonlinear function of two input variables plus an additive noise $v$, where $v$ is a zero-mean Gaussian distributed noise with standard deviation 0.4. The two input variables are uniformly generated. Similar to the previous experiments, 200 data samples are used for training and another 200 noise-free data samples are used for testing.

Parameters used in the six methods for the experiments on the two synthetic data sets are summarized in Table 1, where $L$, $\gamma$, $K$, and $\lambda$ represent the number of hidden layer nodes, the regularization parameter, the sparsity level, and the risk-sensitive parameter in the KRSLMP-ELM; the specific settings used for the Sinc and Func experiments are listed there. To clearly distinguish the proposed method from the other methods in the Sinc function approximation problem, only the estimation results of the original ELM, ORELM, ELM-RCC, and KRSLMP-ELM are illustrated in Figure 1. In Figure 2, we plot the squared training errors obtained by the KRSLMP-ELM, ELM-RCC, ORELM, and the original ELM, respectively. As shown in these figures, the KRSLMP-ELM achieves the best approximation performance. The testing RMSEs of the six algorithms are presented in Table 2, which indicates that the KRSLMP-ELM is more robust than the other five methods.

Further, we perform another experiment to compare the performance of the KRSLMP-ELM with that of the original ELM under different outlier levels. We consider the Sinc function approximation problem, set the inner noise as a zero-mean Gaussian distributed noise with standard deviation 0.1, and let the outlier noise be zero-mean Gaussian with standard deviation ranging from 0.1 to 10. We run 100 trials for each outlier level and show the RMSE results in Figure 3. One can see that the original ELM’s performance degrades severely as the outliers become stronger, while the KRSLMP-ELM’s performance is much less affected by the outliers.

4.2. Benchmark Data Sets

In this subsection, seven benchmark regression data sets from the UCI machine learning repository [36] are tested to support the superiority of the proposed method. Specifications of the data sets are detailed in Table 3. It should be pointed out that the training and testing samples are randomly chosen in each data set and all the features are normalized. The parameters of each method are chosen by fivefold cross-validation and are given in Table 4. For all algorithms, 100 independent trials are conducted and the average results are reported. The training and testing RMSEs and their standard deviations for all algorithms are listed in Table 5. As highlighted in boldface, the KRSLMP-ELM achieves the best performance on most regression data sets.

4.3. Sensitivity of Parameters

We analyze the sensitivity of the parameters $L$, $\gamma$, $K$, and $\lambda$ of the KRSLMP-ELM in this subsection. For illustration, we use the regression results obtained on the Servo data set as an example. For each parameter, its sensitivity is tested by fixing the remaining parameters at the values used in Table 4. The testing RMSEs are recorded as the criterion for performance comparison. The results of the regression performance are shown in Figure 4.

5. Conclusion

In this paper, a robust matching pursuit based ELM algorithm, called the kernel risk-sensitive loss based matching pursuit extreme learning machine (KRSLMP-ELM), has been developed. Kernel risk-sensitive loss (KRSL) is a nonlinear similarity measure defined in kernel space, and it can achieve better performance than the conventional MSE criterion when dealing with non-Gaussian and nonlinear problems. Incorporating the KRSL into the existing orthogonal matching pursuit algorithm, we developed an improved KRSLMP-ELM algorithm, which is more robust than the OMP-ELM method. Comparisons with several existing state-of-the-art algorithms have also been provided to validate the superiority of the proposed KRSLMP-ELM algorithm.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was partially supported by the National Natural Science Foundation-Shenzhen Joint Research Program (no. U1613219) and National Natural Science Foundation of China (no. 91648208 and no. 61372152).