Abstract
Extreme learning machine (ELM) is a new class of single-hidden layer feedforward neural network (SLFN) that is simple in theory and fast in implementation. Zong et al. proposed a weighted extreme learning machine (WELM) for learning from data with imbalanced class distributions, which retains the advantages of the original ELM. However, the reported ELM and its improved versions are based only on the empirical risk minimization principle and may therefore suffer from overfitting. To address this problem, we incorporate the structural risk minimization principle into the (weighted) ELM and propose a modified (weighted) extreme learning machine (M-ELM and M-WELM). Experimental results show that the proposed M-WELM outperforms the reported extreme learning machine algorithms in image quality assessment.
1. Introduction
Extreme learning machine (ELM) was proposed by Huang et al. [1] as a new class of single-hidden layer feedforward neural network. Its basic idea is to set a suitable number of hidden layer nodes before training and to assign the input weights and hidden layer offsets randomly. The algorithm completes training in a single pass and produces a unique optimal solution without iteration, so it has the advantages of easy parameter selection and fast learning speed. Liang et al. [2] proposed an online sequential extreme learning machine algorithm (OS-ELM) that can learn data one-by-one or chunk-by-chunk. Although OS-ELM provides better generalization performance, it depends excessively on the experimental data. Lan et al. [3] presented an ensemble of online sequential extreme learning machines (EOS-ELM), a more stable integrated network structure consisting of multiple OS-ELM networks. Rong et al. [4] developed an OS-Fuzzy-ELM algorithm by combining the TSK fuzzy inference system with the ELM algorithm, which reduces the training time significantly. Feng et al. [5] presented an improved ELM algorithm based on error minimization. In [6], Zong et al. proposed a weighted ELM for data with imbalanced class distributions, which can be generalized to balanced data and retains the advantages of the original ELM. However, all of these algorithms consider only the empirical risk minimization principle, which can easily lead to overfitting [7].
Support vector machine (SVM), proposed by Cortes and Vapnik [8], can also be regarded as a single-hidden layer feedforward network. In [9–11], Suykens et al. proposed the least-squares support vector machine (LS-SVM), which replaces the linear inequality constraints of the support vector machine with linear equality constraints and thus converts the QP problem into a set of linear equations. This reduces the difficulty of training a support vector machine on a large number of samples and improves efficiency. Both the SVM and LS-SVM are general algorithms based on the guaranteed risk bounds of statistical learning theory, that is, the so-called structural risk minimization (SRM) principle, which improves their generalization ability.
In this paper, to reduce overfitting in extreme learning machine algorithms, we follow the LS-SVM algorithm and introduce the structural risk minimization principle into the ELM and WELM algorithms. We call the resulting modified algorithms M-ELM and M-WELM. Our experimental results support the validity of the proposed M-ELM and M-WELM algorithms.
This paper is organized as follows. Section 2 gives a brief introduction to ELM, LS-SVM regression, and weighted ELM. Section 3 describes the principles of the proposed M-ELM and M-WELM. Section 4 presents the experimental results and performance assessment. Section 5 concludes the paper.
2. Brief Introduction to Extreme Learning Machine
2.1. Extreme Learning Machine (ELM)
The extreme learning machine (ELM) is a single-hidden layer feedforward network (SLFN) that randomly selects the input weights and analytically determines the output weights [1, 2, 12]. One key principle of the ELM is that the hidden node parameters may be chosen randomly and then fixed. Once these parameters are chosen, the SLFN becomes a linear system in which the output weights can be determined analytically through a simple generalized inverse operation on the hidden layer output matrix [13].
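For an observation data set $\{(x_j, t_j)\}_{j=1}^{N}$, with $L$ nodes in the hidden layer and the excitation function $g(\cdot)$, the extreme learning machine model can be expressed as
$$\sum_{i=1}^{L} \beta_i \, g(w_i \cdot x_j + b_i) = t_j, \quad j = 1, \dots, N, \tag{1}$$
where $\beta_i$ is the output weight connecting the $i$th hidden layer node and the output neuron, $w_i$ is the input weight connecting the input neurons and the $i$th hidden layer node, and $b_i$ is the offset of the $i$th hidden layer node. Let $H$, with entries $H_{ji} = g(w_i \cdot x_j + b_i)$, denote the output matrix of the hidden layer. The parameters $w_i$ and $b_i$ are randomly selected before training and remain fixed during training. The output weights $\beta = [\beta_1, \dots, \beta_L]^T$ can be obtained as the least-squares solution of the linear equation
$$H\beta = T, \tag{2}$$
where $T = [t_1, \dots, t_N]^T$. The least-squares solution is
$$\hat{\beta} = H^{\dagger} T, \tag{3}$$
where $H^{\dagger}$ is called the Moore-Penrose generalized inverse of the hidden layer output matrix $H$.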
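As an illustration only, the training procedure described above can be sketched in NumPy. The function names, the uniform sampling range for the random hidden parameters, and the synthetic regression data are assumptions made for this sketch and are not taken from [1].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_fit(X, T, n_hidden, rng):
    """Basic ELM: random, fixed hidden layer; least-squares output weights."""
    n_features = X.shape[1]
    # Input weights and offsets are drawn once and never updated.
    # The range [-1, 1] is an assumption of this sketch.
    W = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = sigmoid(X @ W + b)            # hidden layer output matrix (N x L)
    beta = np.linalg.pinv(H) @ T      # Moore-Penrose least-squares solution
    return W, b, beta

def elm_predict(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta

# Synthetic data, used only to exercise the functions.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 3))
T = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2
W, b, beta = elm_fit(X, T, n_hidden=40, rng=rng)
train_mse = float(np.mean((elm_predict(X, W, b, beta) - T) ** 2))
```

Because the hidden layer is fixed, training reduces to a single pseudoinverse computation, which is the source of the speed advantage noted above.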
2.2. LS-SVM Regression
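Assume that an input and output sample data set for regression analysis is $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^n$ and $y_i \in \mathbb{R}$, $i = 1, \dots, N$. The LS-SVM regression algorithm maps the data into a high-dimensional feature space through a nonlinear mapping $\varphi(\cdot)$ and performs linear regression in that space. The regression estimate for the observation data set given above can be formulated as
$$f(x) = w^T \varphi(x) + b, \tag{4}$$
where $w$ and $b$ are the regression factors.
The LS-SVM regression method is used to solve for the weight vector $w$ and the deviation $b$. Based on structural risk minimization, the optimization model of the optimal regression function [9–11] can be established as
$$\min_{w, b, \xi} \; \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^{N} \xi_i^2 \quad \text{s.t.} \quad y_i = w^T \varphi(x_i) + b + \xi_i, \quad i = 1, \dots, N, \tag{5}$$
where $\gamma$ is the penalty constant, which is a compromise between the complexity and the fitting accuracy of the regression model; a higher value means a higher degree of fitting. $\xi_i$ is the slack variable. LS-SVM transforms the inequality constraints into equality constraints by defining a loss function different from that of the standard SVM. It constructs the following Lagrange function:
$$\mathcal{L}(w, b, \xi, \alpha) = \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^{N} \xi_i^2 - \sum_{i=1}^{N} \alpha_i \left( w^T \varphi(x_i) + b + \xi_i - y_i \right), \tag{6}$$
where $\alpha_i$ is the Lagrange multiplier. According to the KKT optimality conditions, the following linear equations are obtained:
$$\begin{bmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix}, \tag{7}$$
where $y = [y_1, \dots, y_N]^T$, $\mathbf{1} = [1, \dots, 1]^T$, $\alpha = [\alpha_1, \dots, \alpha_N]^T$, $I$ is the identity matrix, and $\Omega$ is an $N \times N$ square matrix whose element in the $i$th row and $j$th column is $\Omega_{ij} = \varphi(x_i)^T \varphi(x_j) = K(x_i, x_j)$, where $K(\cdot, \cdot)$ is a kernel function that satisfies the Mercer condition.
Solving the linear equations gives the nonlinear mapping
$$f(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i) + b. \tag{8}$$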
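To make the linear-system view of LS-SVM regression concrete, the following NumPy sketch assembles and solves the standard LS-SVM system for the bias and the Lagrange multipliers. The RBF kernel, the parameter names `gamma` (penalty constant) and `sigma` (kernel width), and the synthetic data are assumptions of this sketch.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    # Squared Euclidean distances between rows of A and rows of B.
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def lssvm_fit(X, y, gamma, sigma):
    """Solve [[0, 1^T], [1, Omega + I/gamma]] [b; alpha] = [0; y]."""
    n = X.shape[0]
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]            # bias b, multipliers alpha

def lssvm_predict(X_new, X_train, b, alpha, sigma):
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b

# Synthetic one-dimensional data, used only to exercise the functions.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(50, 1))
y = np.sinc(X[:, 0])
gamma, sigma = 10.0, 1.0
b, alpha = lssvm_fit(X, y, gamma, sigma)
fitted = lssvm_predict(X, X, b, alpha, sigma)
```

In this formulation each training residual equals the corresponding multiplier divided by the penalty constant, so a larger penalty constant forces a closer fit to the training data.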
2.3. Weighted Extreme Learning Machine (WELM)
2.3.1. Basic Theory
In [6], the authors proposed a weighted extreme learning machine for imbalanced learning, which defines an $N \times N$ diagonal matrix $W$ whose entry $W_{ii}$ is associated with the training sample $x_i$. Usually, if a training sample comes from a minority class (assumed to be the positive class), the associated weight is set relatively larger than the others. To maximize the marginal distance and to minimize the weighted cumulative error with respect to each sample, the optimization problem jointly minimizes the weighted training error and the norm of the output weights. More precisely,
$$\text{minimize} \;\; \frac{1}{2} \|\beta\|^2 + \frac{1}{2} C \sum_{i=1}^{N} W_{ii} \|\xi_i\|^2 \quad \text{subject to} \quad h(x_i)\beta = t_i^T - \xi_i^T, \quad i = 1, \dots, N, \tag{10}$$
where $h(x_i)$ is the feature mapping vector in the hidden layer with respect to $x_i$, $\beta$ represents the output weight vector connecting the hidden layer and the output layer, and $C$ is the regularization parameter that represents the trade-off between the minimization of training errors and the maximization of the marginal distance. $\xi_i$, the training error of sample $x_i$, is the difference between the desired output $t_i$ and the actual output $h(x_i)\beta$.
2.3.2. Weighting Schemes
The key issue of the WELM is to define an appropriate weight matrix $W = \mathrm{diag}(W_{ii})$, $i = 1, \dots, N$, which determines the degree of rebalancing that users seek and how far the boundary is pushed towards the majority class [6]. In [6], two weighting schemes are proposed.
The simpler scheme, W1, generates the weights automatically from the class information and is in fact a special case of cost sensitive learning:
$$\text{W1:} \quad W_{ii} = \frac{1}{\#(t_i)}, \tag{11}$$
where $\#(t_i)$ is the number of samples belonging to class $t_i$, $i = 1, \dots, m$.
The other scheme, W2, adopts the golden ratio, which is said to represent perfection in nature, and reduces the balancing step to a ratio of 0.618 : 1 between the minority classes and the majority classes.
Compared to weighting scheme W1, the boundary obtained with weighting scheme W2 is pushed slightly back towards the minority class, so that the misclassification cases accepted as a compromise on the majority side are alleviated. We therefore adopt weighting scheme W2 in the following experiments.
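The two weighting schemes can be illustrated with a short NumPy sketch. W1 is taken as the reciprocal of the class size. For W2, the sketch assumes that every class larger than the average class size has its W1 weight multiplied by 0.618 while the other classes keep the W1 weight; this is one reading of the rule in [6], and both the direction of the scaling and the function name `welm_weights` are assumptions that should be checked against [6].

```python
import numpy as np

def welm_weights(labels, scheme="W2"):
    """Per-sample weights W_ii computed from class sizes.

    W1: 1 / (class size).
    W2 (assumed form): 0.618 / (class size) for classes larger than the
    average class size, 1 / (class size) otherwise.
    """
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    count_of = dict(zip(classes.tolist(), counts.tolist()))
    avg_count = counts.mean()
    weights = np.empty(len(labels))
    for i, c in enumerate(labels.tolist()):
        n_c = count_of[c]
        if scheme == "W2" and n_c > avg_count:
            weights[i] = 0.618 / n_c
        else:
            weights[i] = 1.0 / n_c
    return weights

# Toy imbalanced labels: 8 majority samples, 2 minority samples.
labels = np.array([0] * 8 + [1] * 2)
w1 = welm_weights(labels, "W1")
w2 = welm_weights(labels, "W2")
```

Under W1 the weights of each class sum to one, so every class contributes equally to the weighted error regardless of its size.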
3. Modified Extreme Learning Machine Algorithm
Traditional extreme learning machines are based on the empirical risk minimization principle, that is, on minimizing the training error. Their drawback is that they are likely to overfit, which reduces their generalization capability.
According to statistical learning theory, the actual risk comprises the empirical risk and the structural risk, and a model with good generalization performance should balance the two to obtain the best compromise. We therefore introduce the structural risk minimization principle into the ELM and WELM algorithms and propose modified models, which we call M-ELM and M-WELM.
Assume that an input and output sample data set for regression analysis is , where and, . We draw into the condition of the structural risk and adjust the proportion of the empirical and structural risks by instead of the in formula (10), and the optimization model of the optimal regression function can be established as follows: where , the sum of the square errors, represents the empirical risk and represents the structural risk, according to the maximal margin principle in statistical theory [2]. According to formula (6), the formula above is the conditional extreme problem and can be transformed into the Lagrange equation as follows: where the Lagrange multiplier is the constant factor of sample in the linear combination to form the final decision function. Further, by making the partial derivatives with respect to variables all equal to zero, the KKT optimality conditions are obtained:
The solution of can be derived from (17) regarding left pseudoinverse. Usually, left pseudoinverse is more suitable since it is much easier to compute matrix inversion of size , when is much smaller than :
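The computational point about the left pseudoinverse can be illustrated as follows. The sketch assumes the regularized weighted least-squares form used for WELM in [6], in which only a matrix of size (hidden nodes) by (hidden nodes) has to be inverted; the exact system solved by M-WELM may differ, and the function name and parameter `C` are assumptions of this sketch.

```python
import numpy as np

def weighted_output_weights(H, T, w, C):
    """Solve (I/C + H^T W H) beta = H^T W T, with W = diag(w).

    H: hidden layer output matrix (N x L), T: targets (N,),
    w: per-sample weights (N,), C: trade-off parameter (assumed name).
    Only an L x L system is solved, which is cheap when L << N.
    """
    n_hidden = H.shape[1]
    HtW = H.T * w                      # H^T W without forming the N x N matrix
    A = np.eye(n_hidden) / C + HtW @ H
    return np.linalg.solve(A, HtW @ T)

# Synthetic data, used only to exercise the function.
rng = np.random.default_rng(0)
H = rng.uniform(0.0, 1.0, size=(30, 5))
T = rng.normal(size=30)
w = rng.uniform(0.1, 1.0, size=30)
beta = weighted_output_weights(H, T, w, C=10.0)
```

Setting all weights to one recovers the unweighted regularized solution, and the algebraically equivalent form that inverts a matrix of size (samples) by (samples) gives the same result at higher cost when there are many more samples than hidden nodes.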
The same as formula (7) in Section 2.2, we can obtain the following linear equations: where ; ; ; is the identity matrix; is a square matrix, and the th row and the th column data element is where is the excitation function.
The sigmoid function is used in this paper as follows:
$$g(z) = \frac{1}{1 + e^{-z}}.$$
Solve the linear equations and then get the following nonlinear mapping equation below that is derived from (8):
The whole steps of the M-ELM or M-WELM algorithm can be summarized as follows.
Given a training set $\{(x_i, t_i)\}_{i=1}^{N}$, activation function $g(\cdot)$, and hidden node number $L$, consider the following.
Step 1. Transform formula (13) of conditional extreme problem into formula (9) of Lagrange equation.
Step 2. Calculate using formulas (16) and (17).
Step 3. Substitute into formula (15) and calculate the output weight .
As can be seen, the M-WELM can be generalized to cost sensitive learning and, like the WELM, can deal with data with imbalanced class distributions. At the same time, its overfitting risk is reduced by considering the empirical and structural risks simultaneously.
4. Experiments and Performance Assessment
In this section, we compare the proposed M-WELM and M-ELM with the reported ELM, OS-ELM [2], EOS-ELM [3], B-ELM [15], and C-ELM [14] algorithms, first on benchmark prediction data and then on the image quality estimation problem. In all of the following experiments, the ELM activation functions are sigmoid functions, and the other parameters are obtained by cross validation.
4.1. Test on the Benchmark Boston Housing Data Set
Boston Housing data, obtained from the UCI database, is a data set commonly used for measuring the performance of regression algorithms. It contains 506 housing samples from the Boston area, each described by 12 continuous features, one discrete feature, and the house price [16]. The purpose of regression estimation is to predict the average house price after training on part of the samples.
In the experiments, the samples are randomly divided into two groups: a random 70% for training and the remaining 30% for testing. We repeat the random train-test procedure 100 times and calculate the mean square training and prediction error of every algorithm; the results are shown in Table 1. The parameters of every algorithm were tuned so that each obtains a good result. The number of hidden neurons of M-WELM, M-ELM, WELM, OS-ELM, and EOS-ELM is set to 180, and the number of hidden neurons of ELM, B-ELM, and C-ELM is set to 65.
As Table 1 shows, for the Boston Housing data set, which comes from a real-world multi-input single-output system, the proposed M-WELM algorithm achieves the best prediction performance among the ELM variants and M-ELM ranks second, which indicates the robustness of the proposed modification of the ELM algorithm.
4.2. Test on the LIVE IQA Database
Algorithms that automatically assess perceptual image quality are critical for numerous image processing applications. Recently, machine learning based blind image quality assessment has made great progress, with examples including BRISQUE [17], LBIQ [18], DIIVINE [19], and BLIINDS [20], which use SVR, and the GRNN-based method [21]. Of these indices, BRISQUE shows the best overall performance, so here we use the image features adopted by the BRISQUE index [17] to test our proposed M-ELM and M-WELM algorithms for image quality assessment (IQA).
In [17], the authors used 36 natural scene statistical features in the spatial domain to predict image quality, as shown in Table 2. At each scale, these are the shape and variance from a GGD fit of the MSCN coefficients, and the shape, mean, left variance, and right variance from an asymmetric GGD fit of each of the H, V, D1, and D2 pairwise products, which gives 18 features per scale. The features are extracted at two scales: the original image scale and a reduced resolution (low pass filtered and downsampled by a factor of 2). Here we also adopt these 36 image statistical features to predict image quality with different ELM algorithms and then compare our proposed modified ELM algorithms with the reported ELM algorithms.
First, we test our proposed algorithms on the LIVE IQA database [22], which consists of 29 reference images and 779 distorted images spanning five distortion categories: JPEG2000 (JP2K) compression, JPEG compression, additive white Gaussian noise (WN), Gaussian blur (Blur), and a Rayleigh fast-fading channel simulation (FF). Each distorted image has an associated difference mean opinion score (DMOS), which represents the subjective quality of the image.
Three performance metrics are used to evaluate the algorithms. The first is the Spearman rank ordered correlation coefficient (SROCC), which measures the prediction monotonicity of the quality index. The second is the root mean square error (RMSE). The third is the running time.
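The first two metrics can be computed as in the following NumPy sketch. SROCC is obtained here as the Pearson correlation of the rank-transformed scores, with tied values given their average rank; the helper names are assumptions of this sketch, and a library routine such as `scipy.stats.spearmanr` could be used instead.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def srocc(predicted, target):
    """Spearman rank ordered correlation coefficient."""
    r_pred = average_ranks(predicted)
    r_true = average_ranks(target)
    return float(np.corrcoef(r_pred, r_true)[0, 1])

def rmse(predicted, target):
    """Root mean square error."""
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((predicted - target) ** 2)))

# Toy scores, used only to exercise the functions.
rho = srocc([1, 2, 3, 4, 5], [1, 3, 2, 5, 4])
err = rmse([1.0, 2.0], [1.0, 4.0])
```

SROCC depends only on the ordering of the predictions, so it is unaffected by any monotonic rescaling of the predicted scores, whereas RMSE measures the absolute deviation from the DMOS values.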
Because a learning based method requires a training stage to construct the relationship between the extracted statistical features and DMOS, we split the LIVE dataset into two nonoverlapping sets: a training set and a testing set. The training set consists of 80%, 50%, or 30% of the 29 reference images and their associated distorted versions, while the testing set consists of the remaining 20%, 50%, or 70% of the reference images and their associated distorted versions. The regression models are trained on the training set and then evaluated on the testing set. To ensure that the proposed method is robust across content and is not governed by the specific train-test split, we repeat each random split (80% train / 20% test, 50% train / 50% test, and 30% train / 70% test) 1000 times on the LIVE dataset and evaluate the performance on each test set. The median Spearman rank ordered correlation coefficient (SROCC), RMSE, and running time across these 1000 train-test trials are reported in Tables 3, 4, and 5. For comparison, the results of SVR and the other reported ELM algorithms, obtained with the same 1000 random train-test trials, are also listed in Tables 3–5. The SVR is implemented with the libSVM package [23]. The kernel used for SVR is the radial basis function (RBF) kernel, whose parameters are estimated by cross validation on the training set. The other ELM algorithms are implemented by us. The number of hidden neurons of M-WELM, M-ELM, WELM, OS-ELM, and EOS-ELM is set to 120, and the number of hidden neurons of ELM, B-ELM, and C-ELM is set to 75.
As Tables 3–5 show, compared with ELM (or WELM), the proposed M-ELM (or M-WELM) agrees better with subjective judgment regardless of the percentage of samples used for training, which supports the validity of introducing the structural risk minimization principle. Our M-WELM algorithm shows the best performance among the reported ELM algorithms and the SVR algorithm, especially when fewer training samples are used, which further demonstrates the effectiveness of integrating the structural risk minimization principle and the weighting method into the ELM model. In addition, the proposed M-WELM is far faster than the SVR, which makes it an effective real-time solution to IQA.
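A split that keeps training and testing content separate can be sketched as follows: reference images, rather than individual distorted images, are assigned to the training or testing side, so that all distorted versions of one reference image fall on the same side. The function name, the rounding rule for the number of training references, and the toy reference labels are assumptions of this sketch.

```python
import numpy as np

def split_by_reference(ref_ids, train_fraction, rng):
    """Return (train_idx, test_idx) with no reference image on both sides.

    ref_ids[k] is the reference image that distorted image k was made from.
    """
    ref_ids = np.asarray(ref_ids)
    refs = np.unique(ref_ids)
    n_train = int(round(train_fraction * len(refs)))
    train_refs = rng.choice(refs, size=n_train, replace=False)
    in_train = np.isin(ref_ids, train_refs)
    return np.flatnonzero(in_train), np.flatnonzero(~in_train)

# Toy labels: 29 reference images with 5 distorted versions each.
rng = np.random.default_rng(0)
ref_ids = np.repeat(np.arange(29), 5)
train_idx, test_idx = split_by_reference(ref_ids, 0.8, rng)
```

Calling the function repeatedly with the same generator yields a different random split each time, and the median of a metric over all trials can then be taken with `np.median`.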
4.3. Test on the TID 2008 Database
To examine how well the proposed M-WELM generalizes, we further test on the same (available) distortions in an alternate database, TID2008 [24]. It consists of 25 reference images and 1700 distorted images covering 17 distortion types. Of the 25 reference images only 24 are natural images, so we test our algorithm only on these 24 images. Here we use all 779 distorted images in the LIVE IQA database as the training set and the images in TID2008 as the testing set. We again repeat the random train-test procedure 1000 times and report the median SROCC, RMSE, and running time in Table 6. The parameter values of every algorithm are the same as those used in Section 4.2.
Table 6 shows that the proposed M-WELM has the highest consistency with the subjective scores among all of the ELM algorithms. It is also competitive with the SVR in performance while being far faster, which makes it a real-time solution to IQA.
5. Conclusion
The reported ELM and weighted ELM algorithms are based on the empirical risk minimization principle, which may easily lead to overfitting during learning. By introducing the structural risk minimization principle into the ELM and weighted ELM algorithms, we propose an improved (weighted) extreme learning machine algorithm (M-WELM and M-ELM) that addresses the overfitting problem by taking both the empirical risk and the structural risk into account and adjusting the proportion of the two. Our experimental results show that the M-WELM outperforms the reported ELM algorithms in IQA and is competitive with the SVR in performance while being far faster, which makes it an effective real-time solution to IQA.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This research is supported in part by the National Natural Science Foundation of China (no. 61170120), Program for New Century Excellent Talents in University (NCET-12-0881), the Natural Science Foundation of Jiangsu Province (no. BK2011147), China Agriculture Research System (CARS-49), and the Fundamental Research Funds for the Central Universities (JUSRP51410B).