Abstract

Extreme learning machine (ELM) is a competitive machine learning technique that is simple in theory and fast in implementation; it can identify faults quickly and precisely compared with traditional identification techniques such as support vector machines (SVMs). As verified by simulation results, ELM tends to have better scalability and can achieve much better generalization performance and much faster learning speed than traditional SVM. In this paper, we introduce a multiclass AdaBoost based ELM ensemble method. In our approach, the ELM algorithm is selected as the basic ensemble predictor because of its fast speed and good performance. Compared with existing boosting ELM algorithms, our algorithm can be used directly on multiclass classification problems. We also carried out comparative experiments on face recognition datasets. The experimental results show that the proposed algorithm not only makes the prediction results more stable but also achieves better generalization performance.

1. Introduction

Much research has been devoted to feedforward neural networks, showing that they are able not only to approximate complex nonlinear mappings but also to provide models for natural and artificial problems that classic parametric techniques are unable to handle.

Recently, Huang et al. [1] proposed a new simple algorithm based on single layer feedforward networks (SLFNs) called extreme learning machine (ELM). Because ELM randomly generates the parameters of the network, its learning speed can be thousands of times faster than that of traditional feedforward network learning algorithms such as the back-propagation (BP) algorithm, which needs many iterations to obtain optimal parameters.

In addition, Huang [2] also shows that, in theory, ELMs (with the same kernels) tend to outperform SVM and its variants in both regression and classification applications with much easier implementation. Based on this conclusion, Wong et al. [3] explored the superiority of ELM in terms of fault identification time.

In view of the advantages of the algorithm, Cao et al. applied it to several areas, such as landmark recognition [4] and protein sequence classification [5]. Besides, Cao et al. [6] proposed an improved learning algorithm that incorporates a voting method into the popular extreme learning machine for classification applications and outperforms the original ELM algorithm as well as several recent classification algorithms.

AdaBoost [7] is one of the most popular classifier ensemble algorithms for improving generalization performance. Wang and Li [8] proposed an algorithm named dynamic AdaBoost ensemble ELM (called DAEELM in this paper), which takes ELM as the basic classifier and applies AdaBoost to solve binary classification problems. Similarly, Tian and Mao [9] combined the modified AdaBoost.RT [10] with ELM to propose a new hybrid artificial intelligence technique called ensemble ELM, which aims to improve ELM's performance on regression problems.

However, until now, not much work has been done on applying AdaBoost to ELM directly for multiclass classification problems. In their work [11], Freund and Schapire give two extensions of their boosting algorithm to multiclass prediction problems in which each example belongs to one of several possible classes (rather than just two). Since ELM can directly handle multiclass classification, this paper proposes an algorithm named multiclass AdaBoost ELM (MAELM). The new algorithm applies multiclass AdaBoost as an ensemble method to a number of ELMs. In addition, this paper proposes a structure for applying ELM and MAELM to local binary patterns (LBP) [12] based face recognition. Experiments on LBP based face recognition show that the proposed algorithm outperforms the original ELM.

This paper is an extension of our previous work [13]. We extend that work by proposing a new way to combine ELM with PCA, replacing the random weights between the input layer and the hidden layer as well as the random bias of the activation function. Experiments on LBP based face recognition show the stable and good performance of the extended approach.

The rest of the paper is organized as follows. Section 2 gives a brief review of ELM, PCA, original and multiclass AdaBoost, and LBP. The proposed MAELM is presented in Section 3. The experimental results are shown in Section 4, and a short discussion of the proposed algorithm is presented in Section 5. Finally, Section 6 concludes the paper.

2. Preliminaries

In this section, a brief review of the original ELM algorithm, PCA, multiclass AdaBoost, and LBP based face recognition is presented.

2.1. ELM

For $N$ arbitrary distinct samples $(\mathbf{x}_i, \mathbf{t}_i)$, where $\mathbf{x}_i \in \mathbb{R}^n$ and $\mathbf{t}_i \in \mathbb{R}^m$, standard SLFNs with $L$ hidden nodes and activation function $g(x)$ are mathematically modeled as follows:
$$\sum_{i=1}^{L} \beta_i\, g(\mathbf{w}_i \cdot \mathbf{x}_j + b_i) = \mathbf{o}_j,$$
where $j = 1, \ldots, N$.

Here, $\mathbf{w}_i$ is the weight vector connecting the $i$th hidden node and the input nodes, $\beta_i$ is the weight vector connecting the $i$th hidden node and the output nodes, and $b_i$ is the threshold of the $i$th hidden node.

The standard SLFNs with $L$ hidden nodes and activation function $g(x)$ can be compactly written as follows:
$$H\beta = T,$$
where
$$H = \begin{bmatrix} g(\mathbf{w}_1 \cdot \mathbf{x}_1 + b_1) & \cdots & g(\mathbf{w}_L \cdot \mathbf{x}_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(\mathbf{w}_1 \cdot \mathbf{x}_N + b_1) & \cdots & g(\mathbf{w}_L \cdot \mathbf{x}_N + b_L) \end{bmatrix}_{N \times L}, \qquad \beta = \begin{bmatrix} \beta_1^{T} \\ \vdots \\ \beta_L^{T} \end{bmatrix}_{L \times m}, \qquad T = \begin{bmatrix} \mathbf{t}_1^{T} \\ \vdots \\ \mathbf{t}_N^{T} \end{bmatrix}_{N \times m}.$$

Different from the conventional gradient-based solution of SLFNs, ELM simply solves the system by
$$\hat{\beta} = H^{\dagger} T.$$

$H^{\dagger}$ is the Moore-Penrose generalized inverse of the matrix $H$. As Huang et al. have pointed out in [14], $H^{\dagger}$ can be represented by
$$H^{\dagger} = H^{T}\left(\frac{I}{C} + HH^{T}\right)^{-1},$$
where $I$ is an identity matrix with the same dimension as $HH^{T}$ and $C$ is a constant that can be set by the user. Adding $I/C$ avoids the situation where $HH^{T}$ is singular. Huang et al. [1] successfully applied ELM to solve binary classification problems and Huang et al. [14] extended ELM to directly solve multiclass classification problems.
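
To make the above concrete, the following is a minimal NumPy sketch of ELM training and prediction following the regularized solution above; the function names (elm_train, elm_predict), the sigmoid activation, and the one-hot target encoding are our own illustrative choices, not part of the original formulation.

```python
import numpy as np

def elm_train(X, T, L, C, rng=np.random.default_rng(0)):
    """Train an ELM: random input weights and biases, least-squares output weights.
    X: (N, n) inputs; T: (N, m) targets (e.g., one-hot labels); L: hidden nodes; C: constant."""
    N, n = X.shape
    W = rng.standard_normal((n, L))              # random input weights
    b = rng.standard_normal(L)                   # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))       # hidden layer output matrix (sigmoid)
    # beta = H^T (I/C + H H^T)^{-1} T
    beta = H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Network output; for classification, take the argmax over the columns."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```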

Since the original ELM randomly generates the weights between the input layer and the hidden layer, as well as the bias of the activation function, its performance may not be very stable. Instead, other approaches such as the PCA algorithm are worth trying.

2.2. PCA

Principal component analysis (PCA) was invented in 1901 by Pearson [15], as an analogue of the principal axes theorem in mechanics, and was later independently developed (and named) by Hotelling in the 1930s [16]. Now, it is mostly used as a tool in exploratory data analysis and for making predictive models. PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or singular value decomposition of a data matrix, usually after mean centering (and normalizing or using $z$-scores) the data matrix for each attribute [17]. The results of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed variable values corresponding to a particular data point), and loadings (the weight by which each standardized original variable should be multiplied to get the component score).

The procedure of PCA on a data matrix $X$ (whose rows are the samples) is as follows.

Step 1. Compute the matrix $\Sigma$, which is the covariance matrix of $X$.

Step 2. Find the eigenvalues of $\Sigma$: $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$.

Step 3. Compute the standardized (orthonormal) eigenvectors of $\Sigma$: $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_n$, so that the matrix $U = [\mathbf{u}_1, \ldots, \mathbf{u}_n]$ satisfies $U^{T}U = I$, where $I$ is an identity matrix with the same dimension as $\Sigma$.

Step 4. Yield the principal components $Y = XU$. The matrix $Y$ consists of row vectors, where each vector is the projection of the corresponding data vector from the matrix $X$.

PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components. Principal components are guaranteed to be independent if the dataset is jointly normally distributed. PCA is sensitive to the relative scaling of the original variables.
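
The covariance-based procedure above can be sketched in a few lines; this is a minimal illustration in which pca_fit and pca_transform are hypothetical names and the rows of X are assumed to be the samples.

```python
import numpy as np

def pca_fit(X, d):
    """Return the sample mean and the top-d eigenvectors of the covariance matrix of X."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X - mu, rowvar=False)     # Step 1: covariance matrix
    vals, vecs = np.linalg.eigh(Sigma)       # Steps 2-3: eigenvalues and orthonormal eigenvectors
    order = np.argsort(vals)[::-1][:d]       # keep the d largest eigenvalues
    return mu, vecs[:, order]

def pca_transform(X, mu, U):
    """Step 4: project each (centered) row of X onto the retained principal components."""
    return (X - mu) @ U
```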

2.3. Original AdaBoost and Multiclass AdaBoost

AdaBoost has been very successfully applied to binary classification problems. The original AdaBoost was proposed in [7]. Before presenting the AdaBoost algorithm, the indicator function $I(\cdot)$ is predefined as
$$I(x) = \begin{cases} 1, & \text{if } x \text{ is true}, \\ 0, & \text{otherwise}. \end{cases}$$

The AdaBoost algorithm is summarized as follows.

Given the training data $\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$, where $\mathbf{x}_i$ denotes the $i$th input feature vector with $d$ dimensions and $y_i$ denotes the label of the $i$th input feature vector, $y_i \in \{-1, 1\}$. Use $G_m(\mathbf{x})$ to denote the $m$th weak classifier and suppose $T$ weak classifiers will be combined.

(1) Initialize the observation weights $w_i = 1/n$, $i = 1, 2, \ldots, n$.

(2) For $m = 1, \ldots, T$:
(a) fit a classifier $G_m(\mathbf{x})$ to the training data using the weights $w_i$;
(b) compute the weighted error
$$\operatorname{err}_m = \frac{\sum_{i=1}^{n} w_i\, I\bigl(y_i \neq G_m(\mathbf{x}_i)\bigr)}{\sum_{i=1}^{n} w_i};$$
(c) compute the weight of the $m$th classifier
$$\alpha_m = \log\frac{1 - \operatorname{err}_m}{\operatorname{err}_m};$$
(d) update the weights of the sample data, for all $i$:
$$w_i \leftarrow w_i \exp\bigl(\alpha_m I\bigl(y_i \neq G_m(\mathbf{x}_i)\bigr)\bigr);$$
(e) renormalize $w_i$, for all $i$.

(3) Output
$$G(\mathbf{x}) = \operatorname{sign}\left(\sum_{m=1}^{T} \alpha_m G_m(\mathbf{x})\right).$$

Here, $y_i$ and $G_m(\mathbf{x})$ take the value $-1$ or $1$. In binary classification, any classifier whose accuracy is better than $1/2$ is a weak classifier. For the original AdaBoost, we have the following.

(1) For the $i$th and the $j$th classifiers, if $\operatorname{err}_i < \operatorname{err}_j$, we have $\alpha_i > \alpha_j$, which means the final ensemble classifier values the $i$th classifier's result more. Specifically, if $\operatorname{err}_i = 1/2$, then $\alpha_i = 0$, which means the final ensemble classifier simply ignores this classifier since its effect is the same as random guessing.

(2) If the $m$th classifier misclassifies the $i$th sample, the $i$th sample will have a larger weight in the next iteration. As a result, the $(m+1)$th classifier will pay more attention to it. On the contrary, if the $m$th classifier classifies the $i$th sample correctly, the $i$th sample will have a smaller weight in the next iteration, which means the $(m+1)$th classifier will pay less attention to it.

However, for a $K$-class classification problem, we have $y_i \in \{1, 2, \ldots, K\}$ and $G_m(\mathbf{x}) \in \{1, 2, \ldots, K\}$. If a classifier's accuracy is better than $1/K$ (which may be much smaller than $1/2$), it can be called a weak classifier. Since the original AdaBoost only takes a classifier whose accuracy is better than $1/2$ as a weak classifier, it obviously cannot be directly applied to multiclass conditions where $K$ is bigger than 2. Freund and Schapire [11] extended the original AdaBoost to the multiclass condition. The weight of the $m$th classifier is modified as
$$\alpha_m = \log\frac{1 - \operatorname{err}_m}{\operatorname{err}_m} + \log(K - 1).$$

Similar to the binary condition, for the $i$th and the $j$th classifiers, if $\operatorname{err}_i < \operatorname{err}_j$, we have $\alpha_i > \alpha_j$, which means the final ensemble classifier values the $i$th classifier's result more. In particular, if $\operatorname{err}_i = (K-1)/K$, then $\alpha_i = 0$.
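
To see the effect of the extra $\log(K-1)$ term, consider a small illustrative check (the numbers are arbitrary and not taken from the paper): a classifier that is only slightly better than random guessing on a $K$-class problem receives a negative weight under the binary formula but a positive one under the multiclass formula.

```python
import numpy as np

K = 15            # an arbitrary number of classes for illustration
err = 0.90        # error rate slightly below random guessing, 1 - 1/K ≈ 0.933

alpha_binary = np.log((1 - err) / err)                  # ≈ -2.20: discarded by original AdaBoost
alpha_multi = np.log((1 - err) / err) + np.log(K - 1)   # ≈ +0.44: still a useful weak classifier
print(alpha_binary, alpha_multi)
```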

2.4. LBP Based Face Recognition

The original LBP operator goes through each $3 \times 3$ neighborhood in a picture. It takes the center pixel as the threshold of the neighborhood, thresholds the eight surrounding pixels against it, and interprets the resulting binary pattern as a decimal number. The LBP operator is shown in Figure 1. Then, the texture of the picture can be represented by the histogram of all the decimal numbers.

To apply the LBP operator to the face recognition problem, Ahonen et al. [12] divided the face image into several windows and calculated the histogram of each window with the LBP operator. The final feature vector is obtained by concatenating the histograms into a spatially enhanced histogram. The spatially enhanced histogram provides three levels of information: the patterns at the pixel level, the patterns at the regional level, and the global patterns of the face image. Experiments in [12] have shown that the LBP description is more robust against variations in pose or illumination than holistic methods. All our experiments in Section 4 are done with the most original LBP operator.
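
The following minimal sketch illustrates the basic $3 \times 3$ LBP operator and the per-window histogram described above; the neighbor ordering and the helper names (lbp_code, lbp_histogram) are our own illustrative choices.

```python
import numpy as np

def lbp_code(patch):
    """8-bit LBP code of a 3x3 patch: threshold the 8 neighbors against the center pixel."""
    center = patch[1, 1]
    neighbors = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                 patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    bits = [1 if p >= center else 0 for p in neighbors]
    return sum(b << i for i, b in enumerate(bits))

def lbp_histogram(window):
    """Histogram of LBP codes over one window (one block of the spatially enhanced histogram)."""
    h, w = window.shape
    codes = [lbp_code(window[i - 1:i + 2, j - 1:j + 2])
             for i in range(1, h - 1) for j in range(1, w - 1)]
    return np.bincount(codes, minlength=256)
```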

3. MAELM and Face Recognition Structure

In this part, the multiclass AdaBoost ELM (MAELM) algorithm is proposed, and a structure for face recognition based on LBP and ELM is also presented.

3.1. Proposed MAELM Algorithm

By applying multiclass AdaBoost to ELM, this paper proposes the multiclass AdaBoost ELM (MAELM) algorithm. The algorithm takes a number of ELM classifiers as the weak classifiers; $\mathrm{ELM}_m(\mathbf{x})$ denotes the $m$th ELM classifier. The proposed algorithm is as follows.

(1) Initialize the observation weights $w_i = 1/n$, $i = 1, 2, \ldots, n$.

(2) For $m = 1, \ldots, T$:
(a) fit a classifier $\mathrm{ELM}_m(\mathbf{x})$ to the training data using the weights $w_i$;
(b) compute the weighted error
$$\operatorname{err}_m = \frac{\sum_{i=1}^{n} w_i\, I\bigl(y_i \neq \mathrm{ELM}_m(\mathbf{x}_i)\bigr)}{\sum_{i=1}^{n} w_i};$$
(c) compute the weight of the $m$th classifier
$$\alpha_m = \log\frac{1 - \operatorname{err}_m}{\operatorname{err}_m} + \log(K - 1);$$
(d) update the weights of the sample data, for all $i$:
$$w_i \leftarrow w_i \exp\bigl(\alpha_m I\bigl(y_i \neq \mathrm{ELM}_m(\mathbf{x}_i)\bigr)\bigr);$$
(e) renormalize $w_i$.

(3) Output
$$G(\mathbf{x}) = \arg\max_{k} \sum_{m=1}^{T} \alpha_m\, I\bigl(\mathrm{ELM}_m(\mathbf{x}) = k\bigr).$$
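
A compact sketch of the boosting loop above is given below. It assumes a helper weighted_elm_fit(X, y, w) that returns a classifier with a predict method trained on weighted samples (the weighted fitting itself is discussed next); all names and the 0..K-1 label encoding are our own illustrative choices.

```python
import numpy as np

def maelm_fit(X, y, K, T, weighted_elm_fit):
    """Boosting loop of MAELM over T weighted ELM classifiers."""
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                              # step (1): uniform sample weights
    classifiers, alphas = [], []
    for _ in range(T):                                   # step (2)
        clf = weighted_elm_fit(X, y, w)                  # (a) fit a weighted ELM
        miss = (clf.predict(X) != y).astype(float)
        err = np.dot(w, miss) / w.sum()                  # (b) weighted error
        alpha = np.log((1 - err) / err) + np.log(K - 1)  # (c) multiclass classifier weight
        w = w * np.exp(alpha * miss)                     # (d) re-weight the samples
        w /= w.sum()                                     # (e) renormalize
        classifiers.append(clf)
        alphas.append(alpha)
    return classifiers, alphas

def maelm_predict(X, classifiers, alphas, K):
    """Step (3): alpha-weighted vote over the K classes (labels assumed to be 0..K-1)."""
    votes = np.zeros((X.shape[0], K))
    for clf, alpha in zip(classifiers, alphas):
        votes[np.arange(X.shape[0]), clf.predict(X)] += alpha
    return votes.argmax(axis=1)
```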

Part (2)(a) of the proposed algorithm deserves particular attention. Neither [8] nor [9] gives any detail of how to fit the basic classifier with weighted samples, although this is a very important part of AdaBoost. Zong et al. [18] proposed an algorithm named weighted ELM by introducing a diagonal matrix $W$ whose $i$th diagonal element denotes the weight of the $i$th training sample. To handle the weighted samples, we adopt the weighted ELM algorithm. Obviously, it reduces to the original ELM when the weight matrix $W$ is the identity matrix.

The proposed method maintains the advantages of the original ELM: (1) it is simple in theory and convenient in implementation; (2) a wide range of feature mapping functions or kernels is available for the proposed framework; (3) the proposed method can be applied directly to multiclass classification tasks. In addition, after integrating the weighting scheme, the weighted ELM is able to deal with data having an imbalanced class distribution while maintaining the good performance of the unweighted ELM on well-balanced data; by assigning different weights to each example according to the user's needs, the weighted ELM can be generalized to cost-sensitive learning.

Under the weighted circumstance, the solution of $\beta$ becomes
$$\hat{\beta} = \left(\frac{I}{C} + H^{T}WH\right)^{-1} H^{T}WT.$$
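
A minimal sketch of this weighted solution follows; here w is the vector of AdaBoost sample weights, so that $W = \operatorname{diag}(w)$, and the helper name weighted_elm_beta is ours.

```python
import numpy as np

def weighted_elm_beta(H, T, w, C):
    """Weighted regularized solution: beta = (I/C + H^T W H)^{-1} H^T W T, with W = diag(w)."""
    L = H.shape[1]
    HtW = H.T * w                                  # equals H^T @ diag(w) without forming W
    return np.linalg.solve(np.eye(L) / C + HtW @ H, HtW @ T)
```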

3.2. Application in LBP Based Face Recognition

This paper combines LBP based feature vectors with ELM to build a face recognition structure. There have been some papers [19, 20] on applying ELM to the face recognition problem. However, the existing ELM based face recognition structures are all based on statistical features, for example, PCA [21] and LDA [22].

In order to get better generalization performance, the proposed face recognition structure uses the LBP based method to obtain the feature vector and ELM as the classifier. It has been shown in [12] that the LBP based method is more robust than PCA and LDA when lighting, facial expression, and pose change. At the same time, ELM is very fast in classification and has very good generalization performance. So, it is reasonable to combine the LBP method and ELM to build the face recognition structure.

There are two steps in the proposed face recognition structure. The first step is to train on the training samples with ELM or MAELM. In this step, the training samples are represented by LBP based feature vectors. Then, the feature vectors are used to train the classifier model by ELM or MAELM; see Figure 2. The second step is to predict the labels of the test samples. The test samples are also represented by LBP based feature vectors. Then, the classifier model trained in the first step is used to predict the labels of the test samples; see Figure 3.
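
Putting the pieces together, the two-step structure can be sketched as follows, reusing the hypothetical helpers from the earlier sketches (lbp_histogram, maelm_fit, maelm_predict, weighted_elm_fit); the 7 × 7 window grid is only an example value, not the setting used in the experiments.

```python
import numpy as np

def lbp_feature(image, grid=(7, 7)):
    """Spatially enhanced histogram: split the image into windows and concatenate their LBP histograms."""
    h, w = image.shape
    hs, ws = h // grid[0], w // grid[1]
    blocks = [image[i * hs:(i + 1) * hs, j * ws:(j + 1) * ws]
              for i in range(grid[0]) for j in range(grid[1])]
    return np.concatenate([lbp_histogram(b) for b in blocks])

# Step 1 (training): extract LBP features from the training images and fit the classifier model.
#   X_train = np.stack([lbp_feature(img) for img in train_images])
#   model = maelm_fit(X_train, train_labels, K, T, weighted_elm_fit)
# Step 2 (prediction): extract LBP features from the test images and predict with the trained model.
#   X_test = np.stack([lbp_feature(img) for img in test_images])
#   y_pred = maelm_predict(X_test, *model, K)
```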

4. Experiments

In this paper, two of the most widely used face recognition datasets, Yale and ORL, are used to demonstrate the efficiency of the proposed algorithm. To make the results valid, except for Section 4.2, the average testing accuracy is obtained over 20 trials with randomly generated training and test sets. This paper chooses the sigmoid function as the activation function, as it is the most commonly used one.

The parameters to set in the experiments and their meanings are listed in Table 1. For example, if an experiment sets Train $= 5$, $T = 10$, and particular values of the window division, $L$, and $C$, it means that 5 images of each person are selected to build the training set and the remaining images build the test set, each image is divided into windows according to the window parameter, and ELM with the chosen $L$ and $C$, as well as MAELM combining 10 such ELMs with the same $L$ and $C$, is evaluated on the resulting sets.

4.1. Performance Changes with $L$ and $C$

Although ELM is comparatively less sensitive to its arguments than SVM, its performance still changes with the number of hidden nodes $L$ and the constant value $C$.

Suppose we have $N$ training samples; Huang et al. [1] rigorously prove that SLFNs with $N$ hidden nodes and random biases and input weights can exactly learn the $N$ distinct observations. If a small training error is allowed, the number of hidden nodes can be much smaller than $N$. At the same time, the constant value $C$ also has some impact on the solution of $H$'s Moore-Penrose generalized inverse.

In this part, the experiment is conducted on the Yale dataset with the Train, Window, and $T$ parameters fixed, while $L$ and $C$ are each varied over a range of values. The performance of ELM and MAELM is shown in Figure 4.

It is obvious that both ELM and MAELM are not sensitive to the change of arguments. The difference between ELM and MAELM appears mainly in the region where $L$ is very small and $C$ is very large. From Figure 4, one can conclude that ELM performs badly in this region, since its accuracy rate is below 0.6. On the contrary, MAELM is still very stable in this region; its accuracy rate is bigger than 0.8.

After seeing PCA's good performance in the field of face recognition, we wonder whether PCA can deliver stable and better performance when it replaces the random generation of the input weights and biases used to construct the matrix $H$.

The experiment is also conducted on the Yale dataset with the same parameters. Besides, the new parameter $d$, which is the dimension after reduction, cannot be set larger than the number of input nodes; in view of the dimension of the dataset and other limitations of the experiment, $d$ is fixed accordingly. Since the results vary unevenly with the parameters and are difficult to visualize in a figure, we present them in a table. The performance of ELM with PCA and MAELM with PCA (the counterparts of Figures 4(a) and 4(b)) is listed in Table 2; the best accuracy rate in the table is shown in bold.

It is clear that both ELM and MAELM with PCA are not very sensitive to the change of arguments either. The difference between them appears mainly in the region where $L$ is very small and $C$ is very large. From Table 2, one can conclude that MAELM with PCA performs better when $L$ is very small, whereas when $L$ is large and $C$ is small, ELM with PCA performs well and stably. Besides, ELM with PCA performs almost as well as MAELM with PCA in the region where $L$ and $C$ are both very large, and its accuracy rate is bigger than 0.85.

4.2. Prediction Stability Analysis

Since the original ELM randomly generates the weights between the input layer and the hidden layer, as well as the bias of the activation function, its performance changes from run to run even for the same training and test set. That is to say, the performance of the original ELM may not be very stable. The proposed algorithm successfully reduces this instability.

From Figure 4, one can conclude that ELM and MAELM each favor a different region of the $(L, C)$ parameter space, so the parameters of each algorithm are fixed at values chosen from its favorable region, together with the Train, Window, and $T$ settings used before. Besides, ELM and MAELM with PCA are also included under the corresponding settings because of their respectable performance above. Experiments are done on the Yale dataset. In order to show that the proposed algorithm is more stable than the original ELM, the experiments are run on the same training set and test set (randomly generated once) 20 times. The result is shown in Figure 5.

In Figure 5, it is obvious that the performance of MAELM is much more stable than that of the original ELM. Although ELM and MAELM with PCA perform far more stably than the original ELM and MAELM (since they use PCA instead of random weights between the input layer and the hidden layer and a random bias of the activation function), their accuracy rates, which always lie in the middle in Figure 5, are still not as good as that of the original MAELM. We summarize the results of Figure 5 in Table 3. Note that although the generalization performance of MAELM seems much better than that of ELM in the table, it would be improper to conclude that MAELM performs better in general, because the training set and test set are fixed; one cannot exclude the possibility that MAELM performs better than ELM only on this split. Further experiments in the following parts will show MAELM's better generalization performance.

4.3. Performance Changes with $T$

In order to evaluate how the performance changes with $T$, the experiment in this part fixes Train, Window, $L$, and $C$ for the original MAELM, together with $d$ for MAELM with PCA, and varies $T$. The average test accuracy is obtained over 20 trials with randomly generated training and test sets. The Yale dataset is used for the experiment. The results are presented in Figures 6 and 7.

From Figure 6, it is obvious that as $T$ increases, the generalization performance also becomes better. However, the improvement slows down as $T$ grows. From Figure 7, one can conclude that as $T$ increases while it is still small, the performance decreases a little; once $T$ becomes larger than 25, the performance improves again, although the trend is not as stable as for the original MAELM. This indicates that, in real-world applications, $T$ does not need to be very big: good generalization performance can be obtained by setting $T$ to less than 30 in the original MAELM algorithm, which performs better than MAELM with PCA under the same settings.

4.4. Better Generalization Performance Than ELM

In this part, experiments are done on both the Yale and ORL datasets. The experiments fix the parameters Train, $L$, $C$, $T$ (for MAELM), and $d$ (for PCA) and consider five different window divisions, that is, five values of the Window parameter. The average testing accuracy is obtained over 20 trials with randomly generated training and test sets.

The experiments indicate that MAELM has better generalization performance on both the Yale and ORL datasets under the different window sizes; see Figure 8 for details. Figure 9 shows that ELM with PCA also achieves much better performance on both Yale and ORL under the different window sizes. In addition, this algorithm remains more stable than any of the other algorithms on both datasets.

4.5. Performance with PCA

From all these experiments, we can conclude that although MAELM with PCA does not perform as well as the original MAELM, ELM with PCA performs much better than the original ELM, especially in the experiment of Section 4.2. The performance of the PCA based variants lies between that of the original ELM and the original MAELM.

What is more, since the original ELM randomly generates the weights between the input layer and the hidden layer, as well as the bias of the activation function, its performance is not very stable. The proposed PCA based variant successfully reduces this instability, which is very important in real-world applications.

Although PCA improves the performance of ELM to a certain degree, it still cannot reach the level of MAELM with random weights and biases. Finally, we conclude that the proposed MAELM algorithm performs much better in solving the multiclass classification problem.

5. Discussion

5.1. Complexity Comparison

Very similar to MAELM, DAEELM [8] also takes ELM as the weak classifier and uses AdaBoost as the ensemble method. The difference is that MAELM employs multiclass AdaBoost, which can be directly used for multiclass classification problems, while DAEELM employs dynamic ensemble AdaBoost [23], which aims to solve binary classification problems.

Many methods have been developed to apply binary classifiers to multiclass problems. One-against-all (OAA) [24] and one-against-one (OAO) [25] are the most widely used. For a $K$-class classification problem, under the OAA scheme, $K$ classifiers have to be trained, each of which separates a single class from all the remaining classes. Under the OAO scheme, $K(K-1)/2$ classifiers have to be trained, each of which separates a pair of classes.

Suppose that both MAELM and DAEELM run for $T$ iterations. For a $K$-class classification problem, MAELM only needs to train $T$ ELMs, while DAEELM needs to train $TK$ and $TK(K-1)/2$ classifiers under the OAA and OAO schemes, respectively. Although DAEELM may stop the iteration earlier, it is obvious that, in theory, MAELM's computational complexity is much lower than that of DAEELM for a $K$-class classification problem, especially when $K$ is very large.
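
As a purely illustrative example with arbitrary numbers (not taken from the experiments), suppose $K = 40$ and $T = 10$; then
$$\text{MAELM: } T = 10 \text{ ELMs}, \qquad \text{OAA: } TK = 400 \text{ classifiers}, \qquad \text{OAO: } T\,\frac{K(K-1)}{2} = 7800 \text{ classifiers},$$
so the gap grows quickly with $K$.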

The authors of DAEELM have not published their code, DAEELM has its own arguments which MAELM does not have, and DAEELM does not provide details of how it trains ELM on weighted data, so a direct performance comparison between MAELM and DAEELM would be unfair. However, the conclusion that MAELM is much faster than DAEELM for multiclass classification problems can be drawn from the complexity analysis above.

5.2. Train ELM with Weighted Data

Section 3.1 has mentioned that training ELM with weighted data is a key problem when applying AdaBoost. However, neither [8] nor [9] mentions this key point at all.

Toh [26] first applied ELM to classify imbalanced data with two classes. The original ELM tries to minimize the training error of the data, while the algorithm proposed in [26] minimizes the total error rate (TER), which takes the weights of the positive and negative data into consideration.

In Section 3.1, the weighted ELM is applied in MAELM. Actually, the weighted ELM is inspired by, and very similar to, the regularized ELM proposed by Deng et al. in [27]. The regularized ELM aims to minimize the weighted training error of the weighted data.

6. Conclusion

This paper proposes a new boosting ELM named MAELM, which applies multiclass AdaBoost to an ELM ensemble to directly solve multiclass classification problems. A face recognition structure combining the LBP based method and ELM is also presented. Moreover, this paper proposes a way to combine ELM with PCA, replacing the random weights between the input layer and the hidden layer as well as the random bias of the activation function.

Experiments on LBP based face recognition show stable and good performance to a certain degree. Although PCA improves the performance of ELM, it still cannot surpass MAELM with random weights and biases. The experiments show that, in the LBP based face recognition problem, the recognition result of MAELM is more stable than that of the original ELM and better than that of any other algorithm considered in the paper.

Finally, we conclude that the proposed MAELM algorithm, which applies multiclass AdaBoost to ELM and combines it with the LBP method, performs much better in solving the multiclass classification problem.

In addition, MAELM is compared with DAEELM on multiclass classification problems in theory, and the comparison indicates that MAELM has much lower computational complexity than DAEELM. Moreover, this paper clarifies how to train ELM on weighted data.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research is based on work supported in part by the National Natural Science Foundation of China (61370173, 61173123) and the Natural Science Foundation Project of Zhejiang Province under Project LR13F030003.