Abstract
We propose a distance based multiple kernel extreme learning machine (DBMKELM), a two-stage multiple kernel learning approach with high efficiency. Specifically, DBMKELM first projects multiple kernels into a new space, in which new instances are reconstructed based on the distance between sample labels. Subsequently, an $\ell_2$-norm regularized least squares problem, in which the normal vector corresponds to the kernel weights of a new kernel, is solved on these new instances. After that, the new kernel is used to train and test an extreme learning machine (ELM). Extensive experimental results demonstrate the superior performance of the proposed DBMKELM in terms of accuracy and computational cost.
1. Introduction
Currently, classification and regression are the two major problems targeted by most machine learning, pattern recognition, and data mining methods. For example, one may use a classifier in a fingerprint identification system [1] or use regression to predict stock prices [2]. As a structure that can handle both classification and regression problems, single hidden layer feedforward networks (SLFNs) have been extensively studied. In previous work, many methods have been proposed to train SLFNs, such as the back propagation algorithm [3] and SVM [4]. However, these training methods may have two main drawbacks, namely, local extrema and long training time.
Recently, Huang et al. proposed the extreme learning machine (ELM) to train SLFNs in an extremely fast fashion [5, 6]. ELM has been shown to overcome the main drawbacks of previous SLFN training methods. Specifically, ELM can learn a globally optimal solution for the SLFN parameters, which contributes to high performance in both classification and regression problems, in an extremely short time, because it only needs to train the output weights between the hidden layer and the output layer via the least squares method [7, 8]. Another attractive feature of ELM is that it establishes a unified model for solving both classification and regression problems [8]. Given these advantages, numerous valuable applications based on ELM have been proposed, such as [9, 10]. Meanwhile, many researchers have extended ELM, proposing online sequential ELM [11], voting based ELM [12], weighted ELM [13], sparse ELM [14], and kernel based ELM [8]. Among these variants, the kernel based ELM removes the dependence on the number of hidden nodes, so it is applicable to a wide range of problems and performs well. So far, most studies of kernel based ELM focus on using a single kernel. However, since the descriptive ability of a single kernel is weaker than that of multiple kernels in most cases, using multiple kernels in kernel based ELM may give better results. Moreover, multiple kernels can handle the multiple source information fusion problem, which can further improve the performance of classification and regression in some cases.
Multiple kernel learning (MKL) is a kind of machine learning method that enables classifiers and regressors to use information from multiple kernels. Given a set of base kernels, the goal of MKL is to construct a new kernel that is more suitable for the problem at hand by learning an appropriate combination of the base kernels. Typically, MKL methods fall into two categories: the one-stage approach and the two-stage approach. The one-stage approach learns the combination coefficients of the base kernels and the parameters of the classifier jointly by solving a joint optimization objective function. After [15] pioneered this kind of approach, which received great attention, much follow-up work has been proposed, including [16–19], to name just a few. In contrast, the two-stage approach, such as [20, 21], first constructs a new kernel by finding a suitable combination of base kernels and then uses this combination in a classifier or regressor.
In previous work, some researchers introduced multiple kernels into ELM, such as [22, 23], and obtained satisfactory results. Although these efforts made pioneering achievements, each of them either fails to achieve an extreme learning speed or cannot be applied to both classification and regression. In this paper, we propose a novel multiple kernel based ELM, named distance based multiple kernel extreme learning machine (DBMKELM), which is a fast two-stage multiple kernel learning approach and can be adapted to both classification and regression. In the first stage, DBMKELM finds the combination coefficients of pregenerated base kernels based on the training samples. It first projects the original base kernels into a new space and reconstructs new instances based on the distance between training sample labels. Then, it transforms the multiple kernel learning problem into a binary classification problem or a regression problem and solves it using the least squares method. Finally, it constructs the new kernel from the base kernels using the learned combination coefficients. In the second stage, DBMKELM adopts the new kernel in kernel based ELM. Experimental results demonstrate the following advantages of our proposed DBMKELM: the training time of DBMKELM is extremely short compared with traditional MKL methods; DBMKELM can fully use multiple source information and outperform previous MKL methods in terms of classification and regression accuracy; DBMKELM can improve the robustness and the accuracy of basic kernel based ELM in both classification and regression cases.
The rest of the paper is organized as follows. Section 2 briefly introduces ELM. Section 3 presents the proposed DBMKELM. Section 4 evaluates the performance of DBMKELM via extensive experiments. Finally, Section 5 concludes the paper.
2. Related Work
Our proposed method is extended from ELM, specifically, kernel based ELM. In this section, we briefly introduce ELM and kernel based ELM.
2.1. Extreme Learning Machine
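The extreme learning machine is an effective training method for single hidden layer feedforward networks (SLFNs). Since it was proposed by Huang et al. [6], ELM has been widely used in numerous areas. Its main advantages include, but are not limited to, an extremely fast training speed and good generalization performance. These advantages are attributed to the fact that ELM randomly generates the weights between the input layer and the hidden layer and uses a least squares method to learn the remaining weights. For instance, consider the SLFN illustrated in Figure 1, which has $L$ hidden layer nodes and one output layer node. Its output function is as follows:

$$f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\boldsymbol{\beta}, \tag{1}$$

where $\boldsymbol{\beta} = [\beta_1, \ldots, \beta_L]^T$ is the vector of weights between the hidden layer and the output layer and $\mathbf{h}(\mathbf{x}) = [h_1(\mathbf{x}), \ldots, h_L(\mathbf{x})]$ is the hidden layer output vector, which maps the original features into a new feature space. Different from other SLFN training methods, ELM can use any $h_i(\mathbf{x}) = G(\mathbf{a}_i, b_i, \mathbf{x})$, in which $G$ is a nonlinear piecewise continuous function, such as the sigmoid function or the Gaussian function. Moreover, $\mathbf{a}_i$, the weights between the input layer and the hidden layer, and $b_i$, the bias of the hidden nodes, can be randomly chosen from any continuous probability distribution in ELM. Thus, the only parameter that ELM must learn is $\boldsymbol{\beta}$. ELM learns $\boldsymbol{\beta}$ by solving the following optimization problem [8]:

$$\min_{\boldsymbol{\beta},\,\boldsymbol{\xi}} \; \frac{1}{2}\|\boldsymbol{\beta}\|^2 + \frac{C}{2}\sum_{i=1}^{N}\xi_i^2 \quad \text{s.t.} \quad \mathbf{h}(\mathbf{x}_i)\boldsymbol{\beta} = t_i - \xi_i, \quad i = 1, \ldots, N, \tag{2}$$

where $\{(\mathbf{x}_i, t_i)\}_{i=1}^{N}$ are the training samples, $\xi_i$ is the training error of the $i$th sample, and $C$ is the regularization parameter. According to the KKT theorem, $\boldsymbol{\beta}$ can be calculated from

$$\boldsymbol{\beta} = \left(\frac{\mathbf{I}}{C} + \mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{T} \tag{3}$$

or

$$\boldsymbol{\beta} = \mathbf{H}^T\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^T\right)^{-1}\mathbf{T}, \tag{4}$$

where $\mathbf{H} = [\mathbf{h}(\mathbf{x}_1)^T, \ldots, \mathbf{h}(\mathbf{x}_N)^T]^T$ and $\mathbf{T} = [t_1, \ldots, t_N]^T$. One can use (3) when the number of training samples is huge, that is, $N \gg L$, and use (4) in the opposite case.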
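The training procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the sigmoid hidden nodes, the toy data, the number of hidden nodes, and the value of the regularization parameter are all assumptions made for the example. It shows the two closed forms for the output weights, one solving a system whose size is the number of hidden nodes and one solving a system whose size is the number of training samples.

```python
import numpy as np

def hidden_output(X, a, b):
    """Sigmoid hidden layer output matrix H (N x L) for inputs X (N x d)."""
    return 1.0 / (1.0 + np.exp(-(X @ a + b)))

def beta_many_samples(H, T, C):
    """Closed form that solves an L x L system; cheaper when N >> L."""
    L = H.shape[1]
    return np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)

def beta_few_samples(H, T, C):
    """Closed form that solves an N x N system; cheaper when N is small."""
    N = H.shape[0]
    return H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)

# Toy regression data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 3))
T = np.sin(X.sum(axis=1))

L = 20                                   # number of hidden nodes (assumed)
a = rng.standard_normal((3, L))          # random input weights, never trained
b = rng.standard_normal(L)               # random hidden biases, never trained
H = hidden_output(X, a, b)

beta = beta_many_samples(H, T, C=100.0)  # only the output weights are learned
train_mse = np.mean((H @ beta - T) ** 2)
```

Both functions return the same output weights up to rounding error, so the choice between them is purely a matter of which linear system is smaller.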
2.2. Kernel Based Extreme Learning Machine
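The kernel method can be easily introduced into ELM. Specifically, the ELM kernel matrix $\boldsymbol{\Omega}_{\text{ELM}}$ can be derived from the ELM feature mapping function $\mathbf{h}(\mathbf{x})$ as follows:

$$\boldsymbol{\Omega}_{\text{ELM}} = \mathbf{H}\mathbf{H}^T, \qquad \left(\boldsymbol{\Omega}_{\text{ELM}}\right)_{i,j} = \mathbf{h}(\mathbf{x}_i)\cdot\mathbf{h}(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_j). \tag{5}$$

Therefore, the ELM output function can be written as follows:

$$f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\mathbf{H}^T\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^T\right)^{-1}\mathbf{T} = \left[K(\mathbf{x}, \mathbf{x}_1), \ldots, K(\mathbf{x}, \mathbf{x}_N)\right]\left(\frac{\mathbf{I}}{C} + \boldsymbol{\Omega}_{\text{ELM}}\right)^{-1}\mathbf{T}, \tag{6}$$

where $\mathbf{H}$ is the hidden layer output matrix on the $N$ training samples, $\mathbf{T}$ is the vector of training targets, and $C$ is the regularization parameter.

In this case, users do not have to know the feature mapping $\mathbf{h}(\mathbf{x})$ or set the number of hidden nodes, that is, the dimension of the ELM feature space.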
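As a rough sketch of kernel based ELM, the example below trains and evaluates it with a single Gaussian kernel. The kernel width, the regularization value, the toy data, and all function names are assumptions chosen for illustration. Note that neither the hidden layer mapping nor the number of hidden nodes appears anywhere in the code.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def kernel_elm_fit(K_train, T, C):
    """Solve (I/C + K_train) alpha = T for the coefficient vector alpha."""
    N = K_train.shape[0]
    return np.linalg.solve(np.eye(N) / C + K_train, T)

def kernel_elm_predict(K_new_train, alpha):
    """Evaluate f(x) = [K(x, x_1), ..., K(x, x_N)] alpha for each new x."""
    return K_new_train @ alpha

# Toy regression data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X_train = rng.uniform(-1.0, 1.0, size=(60, 2))
T_train = X_train[:, 0] * X_train[:, 1]
X_test = rng.uniform(-1.0, 1.0, size=(15, 2))

K_train = gaussian_kernel(X_train, X_train, sigma=0.5)
alpha = kernel_elm_fit(K_train, T_train, C=10.0)
pred = kernel_elm_predict(gaussian_kernel(X_test, X_train, sigma=0.5), alpha)
```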
3. Distance Based Multiple Kernel ELM
Since the goal of multiple kernel learning is to construct a new kernel that is more suitable for the problem at hand, the nature of a “good” kernel must be considered. Generally, a kernel can be seen as a measure of similarity. Each entry in a kernel matrix represents the similarity of the two corresponding samples. From this point of view, a “good” kernel reflects the true similarity of sample pairs. In other words, if two sample pairs have similar degrees of similarity, their corresponding values in the “good” kernel will also be similar. To this end, we propose the distance based multiple kernel ELM. It measures the similarity of sample pairs by their “label distance”, which is defined in the following subsection, and uses this information to construct a “good” kernel. DBMKELM is a two-stage multiple kernel learning method. In the first stage, it learns a new kernel. In the second stage, it uses the new kernel in kernel based ELM.
3.1. Label Distance
We first define the “label distance”. As a significant part of the training samples, the label contains class information in the classification case and dependent variable information in the regression case, and it can be used to measure the true similarity between samples. Because labels mean different things in classification and regression, we discuss the “label distance” in these two cases separately.
In the classification case, the label is the class that a sample belongs to, and samples can be seen as similar when they are in the same class. In other words, if two samples have the same label, they can be seen as similar. On the contrary, if two samples have different labels, they can be seen as different. However, in this case, it is difficult to say how different two samples are, because the difference between classes is not clearly defined. Therefore, the label distance takes only two values: one value if two samples have the same label and another value if they have different labels. Formally, we define the label distance as follows:
In the regression case, the label is the value of the dependent variable. Typically, the similarity of samples is directly reflected in the difference between their dependent variable values in regression cases, for example, pollution prediction, housing number prediction, and stock price prediction, to name just a few. Thus, the label distance can be defined as the distance between two values of the dependent variable. Admittedly, various measures can be used for this distance, but the Euclidean distance is used in this paper. Formally, in the regression case, we define the label distance $d(y_i, y_j)$ between two samples with labels $y_i$ and $y_j$ as follows:

$$d(y_i, y_j) = \|y_i - y_j\|_2. \tag{8}$$
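The two definitions can be sketched as plain Python functions. The function names are hypothetical. Because the two constants used in the classification case are not reproduced in the text above, the sketch takes them as arguments (`same` and `different`) instead of assuming particular values.

```python
import math

def label_distance_regression(y_i, y_j):
    """Euclidean distance between two dependent variable values.

    Accepts scalars or equal-length sequences of numbers.
    """
    if isinstance(y_i, (int, float)):
        return abs(y_i - y_j)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_i, y_j)))

def label_distance_classification(y_i, y_j, same, different):
    """Two-valued label distance; the two constants are supplied by the caller."""
    return same if y_i == y_j else different
```

For example, `label_distance_regression(1.0, 4.0)` evaluates to `3.0`, and `label_distance_regression((0, 0), (3, 4))` evaluates to `5.0`.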
3.2. Multiple Kernel Learning Based on Distance
Since the label distance has been defined above, the distance information can be used to guide the learning of the new kernel. In this subsection, we show how DBMKELM performs multiple kernel learning based on distance. As discussed, the new optimal kernel can be seen as a linear combination of base kernels. Therefore, the goal of distance based multiple kernel learning (DBMK) is to learn combination coefficients of the base kernels such that each entry of the resulting new kernel coincides with the label distance of the corresponding sample pair. Consider the values of the same entry in each base kernel, which all correspond to the same sample pair. If these values are seen as input features and the label distance of that sample pair is seen as the output value, DBMK can be transformed into a regression problem, in which the parameters of the regressor are the combination coefficients to be learned. To this end, DBMK learns the combination coefficients in two steps. Firstly, based on the multiple base kernels and the label distances of sample pairs, it reconstructs a new sample space, named Kspace, in which each sample corresponds to a sample pair in the original sample space. Secondly, it solves a regression problem in this new space to find the combination coefficients of the kernels. After these two steps, DBMKELM can obtain the new optimal kernel by combining the base kernels using the learned combination coefficients.
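For machine learning problems, including classification and regression, suppose that the training samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ are drawn from a distribution over $\mathcal{X} \times \mathcal{Y}$ and that $p$ base kernels, which must satisfy the positive semidefinite condition, are generated by a set of kernel functions $\{k_q\}_{q=1}^{p}$ with $k_q: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and denoted as $\mathbf{K}_1, \ldots, \mathbf{K}_p$. DBMKELM reconstructs the Kspace as

$$\left\{(\mathbf{z}_{ij}, t_{ij})\right\}, \tag{9}$$

where, for each sample pair $(\mathbf{x}_i, \mathbf{x}_j)$, $\mathbf{z}_{ij} = [k_1(\mathbf{x}_i, \mathbf{x}_j), \ldots, k_p(\mathbf{x}_i, \mathbf{x}_j)]$ collects the base kernel values of the pair and $t_{ij} = d(y_i, y_j)$ is its label distance.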
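The Kspace construction can be illustrated with a small NumPy sketch. It assumes that every ordered pair of training samples, including pairs of a sample with itself, becomes one Kspace sample; the text does not state which pairs are used, so this choice, the function name, and the toy kernels are assumptions.

```python
import numpy as np

def build_kspace(base_kernels, label_dist):
    """Turn p base kernel matrices (each N x N) into Kspace samples.

    Row i * N + j of Z holds the p kernel values of the pair (i, j),
    and t[i * N + j] holds the label distance of the same pair.
    """
    Z = np.stack([K.ravel() for K in base_kernels], axis=1)   # shape (N*N, p)
    t = label_dist.ravel()                                     # shape (N*N,)
    return Z, t

# Toy example: 4 one-dimensional samples, 2 base kernels, regression labels.
X = np.array([[0.0], [1.0], [2.0], [4.0]])
y = np.array([0.0, 1.0, 2.0, 4.0])
K_lin = X @ X.T                                # linear kernel
K_rbf = np.exp(-(X - X.T) ** 2)                # Gaussian-type kernel
label_dist = np.abs(y[:, None] - y[None, :])   # Euclidean label distance
Z, t = build_kspace([K_lin, K_rbf], label_dist)
```

With 4 original samples and 2 base kernels, `Z` has 16 rows and 2 columns, which shows how quickly the Kspace grows with the number of training samples.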
In the Kspace, DBMKELM learns the combination coefficients by solving a regression problem. In this problem, the training samples are all the samples in the Kspace, in which the input feature vector $\mathbf{z}$ collects the base kernel values of a sample pair and the output label $t$ is the label distance of that pair. Although a number of methods can be used to solve the regression problem, DBMKELM applies an $\ell_2$-norm regularized least squares linear regression, which is similar to ELM output weight learning, for the sake of fast training. This method aims to minimize the training error and the norm of the combination coefficients at the same time. If the Kspace has $M$ samples, indexed by $m = 1, \ldots, M$, the optimization objective can be formalized as follows:

$$\min_{\boldsymbol{\mu},\,\boldsymbol{\xi}} \; \frac{1}{2}\|\boldsymbol{\mu}\|^2 + \frac{C_K}{2}\sum_{m=1}^{M}\xi_m^2 \quad \text{s.t.} \quad \mathbf{z}_m\boldsymbol{\mu} = t_m - \xi_m, \quad m = 1, \ldots, M, \tag{10}$$

where $\boldsymbol{\mu}$ is the vector of combination coefficients that DBMKELM needs to learn, $(\mathbf{z}_m, t_m)$ corresponds to the $m$th sample in the Kspace, $\xi_m$ represents the training error generated by the $m$th sample, and $C_K$ is a tradeoff parameter.
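Using the KKT theorem, solving (10) is equivalent to solving its dual optimization problem. With $\boldsymbol{\mu}$ the combination coefficients, $(\mathbf{z}_m, t_m)$ the input vector and label of the $m$th of the $M$ Kspace samples, $\xi_m$ its training error, and $C_K$ the tradeoff parameter, the dual problem is

$$L = \frac{1}{2}\|\boldsymbol{\mu}\|^2 + \frac{C_K}{2}\sum_{m=1}^{M}\xi_m^2 - \sum_{m=1}^{M}\alpha_m\left(\mathbf{z}_m\boldsymbol{\mu} - t_m + \xi_m\right), \tag{11}$$

where $\alpha_m$ is the Lagrange multiplier corresponding to the $m$th sample. In this case, the KKT optimality conditions can be written as follows:

$$\frac{\partial L}{\partial \boldsymbol{\mu}} = 0 \Rightarrow \boldsymbol{\mu} = \sum_{m=1}^{M}\alpha_m\mathbf{z}_m^T = \mathbf{Z}^T\boldsymbol{\alpha}, \qquad \frac{\partial L}{\partial \xi_m} = 0 \Rightarrow \alpha_m = C_K\xi_m, \qquad \frac{\partial L}{\partial \alpha_m} = 0 \Rightarrow \mathbf{z}_m\boldsymbol{\mu} - t_m + \xi_m = 0, \tag{12}$$

where $\mathbf{Z} = [\mathbf{z}_1^T, \ldots, \mathbf{z}_M^T]^T$ and $\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_M]^T$. According to (12), we have

$$\boldsymbol{\mu} = \mathbf{Z}^T\left(\frac{\mathbf{I}}{C_K} + \mathbf{Z}\mathbf{Z}^T\right)^{-1}\mathbf{t} \tag{13}$$

and equally

$$\boldsymbol{\mu} = \left(\frac{\mathbf{I}}{C_K} + \mathbf{Z}^T\mathbf{Z}\right)^{-1}\mathbf{Z}^T\mathbf{t}, \tag{14}$$

where $\mathbf{t} = [t_1, \ldots, t_M]^T$ in both equations above. We can use (13) to calculate $\boldsymbol{\mu}$ when the number of training samples in the Kspace is not huge and use (14) in the opposite case to speed up the computation.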
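The claim that (13) and (14) give the same coefficients can be checked in one step. The notation here is assumed for the purpose of this derivation: $\mathbf{Z}$ is the matrix whose rows are the Kspace input vectors and $C_K > 0$ is the tradeoff parameter. Expanding the product on both sides gives

$$\left(\frac{\mathbf{I}}{C_K} + \mathbf{Z}^T\mathbf{Z}\right)\mathbf{Z}^T = \frac{\mathbf{Z}^T}{C_K} + \mathbf{Z}^T\mathbf{Z}\mathbf{Z}^T = \mathbf{Z}^T\left(\frac{\mathbf{I}}{C_K} + \mathbf{Z}\mathbf{Z}^T\right).$$

Both bracketed matrices are positive definite for $C_K > 0$ and hence invertible, so multiplying on the left by the inverse of the first and on the right by the inverse of the second yields

$$\mathbf{Z}^T\left(\frac{\mathbf{I}}{C_K} + \mathbf{Z}\mathbf{Z}^T\right)^{-1} = \left(\frac{\mathbf{I}}{C_K} + \mathbf{Z}^T\mathbf{Z}\right)^{-1}\mathbf{Z}^T.$$

The left-hand form inverts a matrix whose size is the number of Kspace samples, while the right-hand form inverts a matrix whose size is the number of base kernels, which is why the second form is preferable when the Kspace is large.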
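Finally, DBMKELM obtains the new optimal kernel by combining the $p$ base kernels $\mathbf{K}_1, \ldots, \mathbf{K}_p$ according to the learned coefficients $\boldsymbol{\mu} = [\mu_1, \ldots, \mu_p]^T$ as follows:

$$\mathbf{K}_{\text{new}} = \sum_{q=1}^{p}\mu_q\mathbf{K}_q. \tag{15}$$

Similarly, for any pair of samples $\mathbf{x}_i$ and $\mathbf{x}_j$, the new optimal kernel function can be written as follows:

$$k_{\text{new}}(\mathbf{x}_i, \mathbf{x}_j) = \sum_{q=1}^{p}\mu_q k_q(\mathbf{x}_i, \mathbf{x}_j). \tag{16}$$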
3.3. Multiple Kernel Extreme Learning Machine
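In the second stage, DBMKELM uses the learned new kernel, which incorporates the information of multiple kernels, in the kernel based ELM for both training and testing. Therefore, the output function of DBMKELM can be written as follows:

$$f(\mathbf{x}) = \left[k_{\text{new}}(\mathbf{x}, \mathbf{x}_1), \ldots, k_{\text{new}}(\mathbf{x}, \mathbf{x}_N)\right]\left(\frac{\mathbf{I}}{C} + \mathbf{K}_{\text{new}}\right)^{-1}\mathbf{T}, \tag{17}$$

where $\mathbf{K}_{\text{new}}$ is the new kernel matrix evaluated on the $N$ training samples, $\mathbf{T}$ is the vector of training targets, and $C$ is the ELM regularization parameter. At this point, DBMKELM can deal with classification and regression problems at a fast speed while benefiting from multiple kernels.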
For a classification or regression problem, DBMKELM first learns the combination coefficients of the pregenerated base kernels using (13) or (14). Then, it calculates a new optimal kernel by (15). Finally, the result of the problem can be obtained by (17). The DBMKELM algorithm is summarized in Algorithm 1.
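The three steps just described can be put together in a compact sketch for the regression case. This is an illustration under stated assumptions, not the authors' Algorithm 1: the Gaussian base kernels and their widths, the use of all ordered sample pairs in the Kspace, the parameter names `C_mkl` and `C_elm`, and the toy data are all hypothetical. The learned coefficients may be negative, and the sketch does nothing to enforce positive semidefiniteness of the combined kernel.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def learn_kernel_weights(train_kernels, label_dist, C_mkl):
    """Stage 1: regularized least squares in the Kspace (p x p closed form)."""
    Z = np.stack([K.ravel() for K in train_kernels], axis=1)
    p = Z.shape[1]
    return np.linalg.solve(np.eye(p) / C_mkl + Z.T @ Z, Z.T @ label_dist.ravel())

def combine(kernels, mu):
    """Weighted sum of kernel matrices with coefficients mu."""
    return sum(m * K for m, K in zip(mu, kernels))

def dbmkelm_predict(train_kernels, test_kernels, T_train, label_dist, C_mkl, C_elm):
    """Learn kernel weights, build the new kernel, then run kernel based ELM."""
    mu = learn_kernel_weights(train_kernels, label_dist, C_mkl)
    K_train = combine(train_kernels, mu)        # N x N new kernel on training data
    K_test = combine(test_kernels, mu)          # N_test x N new kernel, test vs. train
    N = K_train.shape[0]
    alpha = np.linalg.solve(np.eye(N) / C_elm + K_train, T_train)
    return K_test @ alpha, mu

# Toy regression problem (hypothetical data and kernel widths).
rng = np.random.default_rng(0)
X_train = rng.uniform(-1.0, 1.0, size=(40, 2))
X_test = rng.uniform(-1.0, 1.0, size=(10, 2))
T_train = X_train.sum(axis=1)
sigmas = [0.5, 1.0, 2.0]
train_kernels = [gaussian_kernel(X_train, X_train, s) for s in sigmas]
test_kernels = [gaussian_kernel(X_test, X_train, s) for s in sigmas]
label_dist = np.abs(T_train[:, None] - T_train[None, :])   # Euclidean label distance

pred, mu = dbmkelm_predict(train_kernels, test_kernels, T_train, label_dist,
                           C_mkl=1.0, C_elm=1.0)
```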

3.4. Connection to TSMKL
A previous successful multiple kernel learning approach, TSMKL (two-stage multiple kernel learning) [21], also follows the idea of using label information to learn the kernel. It assigns $+1$ to sample pairs with the same label and $-1$ to sample pairs with different labels to construct its target labels. This is very similar to the DBMKELM label distance in the classification case. The experimental results in [21] show that this method can find a good kernel that achieves state-of-the-art classification performance. However, TSMKL can only solve the classification problem. DBMKELM does not just consider the difference between classes but uses the label distance to measure the similarity of a sample pair. In this way, DBMKELM adapts not only to the classification problem but also to the regression problem.
4. Experiments
In this section, we first compare DBMKELM with several methods on both classification and regression benchmarks using pregenerated kernels. In order to verify the performance of DBMKELM on multiple kernel learning, we compare it with SimpleMKL [16] and the unweighted sum of kernels method (UW). Meanwhile, we also compare DBMKELM with the basic kernel based ELM [8], for which the best of the base kernels is used. Then, we compare the classification accuracy of DBMKELM and ELM on multiple kernel classification benchmarks, in which different kernels are generated from different channels, in order to demonstrate the ability of DBMKELM to perform multisource data fusion. In addition, we compare DBMKELM with the state-of-the-art multiple kernel extreme learning machines, namely, MKELM and the radius-incorporated MKELM (RMKELM) [23], on the classification benchmarks mentioned above, since these two methods are only suited to classification. Finally, we conduct a parameter sensitivity test for DBMKELM.
4.1. Benchmark Data Sets
We choose 12 classification benchmark data sets including ionosphere, sonar, wdbc, wpbc, breast, glass, and wine from UCI Machine Learning Repository [24]. Table 1 shows the number of training samples, testing samples, features, and classes in these data sets.
The regression benchmark data sets used in this experiment are taken from UCI Machine Learning Repository [24] and Statlib [25]. These sets include PM10 [25], bodyfat [25], housing [24], pollution [25], spacega [25], servo [24], and yacht [24]. We display the information, including the number of training samples, testing samples, and features of these sets in Table 2.
Three multiple kernel classification benchmarks from bioinformatics data sets are selected in our experiment. The first is the original plant data set of TargetP [26]. The others are PsortPos and PsortNeg [27], which are both for the bacterial protein location problem. We show the number of training samples, testing samples, kernels, and classes of these data sets in Table 3.
For each data set, we randomly select two-thirds of the data samples as training data and use the rest as testing data. We repeat this procedure 20 times for each data set and obtain 20 partitions of the original data. All algorithms in the experiment are evaluated on each partition, and the averaged results are reported for each benchmark.
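The partitioning protocol can be sketched as follows. The function name, the seed, and the use of floor division when the sample count is not divisible by three are assumptions made for illustration.

```python
import numpy as np

def random_partitions(n_samples, n_repeats=20, seed=0):
    """Return n_repeats (train_idx, test_idx) pairs with a two-thirds training split."""
    rng = np.random.default_rng(seed)
    n_train = (2 * n_samples) // 3
    partitions = []
    for _ in range(n_repeats):
        perm = rng.permutation(n_samples)          # fresh random order each repeat
        partitions.append((perm[:n_train], perm[n_train:]))
    return partitions

partitions = random_partitions(150)
```

Each algorithm would then be trained and tested on every pair in `partitions`, and the mean and standard deviation of the resulting scores reported.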
4.2. Parameters Setting and Evaluation Criteria
For both classification and regression benchmark data sets, we generate 23 kernels on the full feature vector: 20 Gaussian kernels with different width parameters and 3 polynomial kernels of degrees 1, 2, and 3. For the kernel based ELM [8], we test all 23 kernels generated above and report the best result according to the testing accuracy. For all algorithms, the regularization parameter is selected from a set of candidate values via 3-fold cross validation on the training data.
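A set of 23 base kernels of this kind could be generated as sketched below. The 20 Gaussian widths are hypothetical placeholders, since the actual values are not reproduced in the text above, and the polynomial kernel is assumed to have the form $(\mathbf{x}\cdot\mathbf{y} + 1)^d$.

```python
import numpy as np

def generate_base_kernels(A, B, sigmas, degrees=(1, 2, 3)):
    """Gaussian kernels (one per width) followed by polynomial kernels (one per degree)."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    sq = np.maximum(sq, 0.0)                       # guard against tiny negative values
    kernels = [np.exp(-sq / (2.0 * s ** 2)) for s in sigmas]
    inner = A @ B.T
    kernels += [(inner + 1.0) ** d for d in degrees]   # assumed polynomial form
    return kernels

# 20 hypothetical Gaussian widths on a logarithmic grid.
sigmas = 2.0 ** np.arange(-10, 10)

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(30, 5))
kernels = generate_base_kernels(X, X, sigmas)
```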
We select accuracy and computational efficiency as the performance evaluation criteria. Accuracy means the classification accuracy rate on the testing data for classification problems or the mean square error (MSE) on the testing data for regression problems. In addition, for regression problems, the sample labels have been normalized to a fixed range. In all cases, the computational efficiency is evaluated by the training time.
The reported results for each benchmark include the mean value and the standard deviation of each criterion over the 20 partitions. In order to measure the statistical significance of the accuracy improvement, we further use the paired Student’s t-test, in which the p value is the probability that the two compared sets come from distributions with an equal mean. Typically, if the p value is less than 0.05, the compared sets are considered to have a statistically significant difference.
4.3. Classification Performance
The classification accuracy of the different methods is shown in Table 4. Each cell of Table 4 has two parts: the first is the mean ± standard deviation and the second is the p value calculated by the paired Student’s t-test. Bold values in Table 4 mark the highest accuracy and those with no significant difference from the highest one. We also show the classification training time in Table 5, presented as mean ± standard deviation.
As we can see from Table 4, DBMKELM achieves the highest classification accuracy or has no significant difference from the best method. Meanwhile, the results in Table 5 show that the time cost of this approach is significantly lower than that of SimpleMKL, MKELM, and RMKELM.
4.4. Regression Performance
The regression accuracy of the different methods is shown in Table 6 in the same format as Table 4. The regression training time is shown in Table 7.
In this case, DBMKELM achieves significantly better regression accuracy than the other methods. In terms of time cost, the situation is similar to the classification problem; that is, DBMKELM requires dramatically less training time than SimpleMKL.
4.5. Multiple Kernel Classification Benchmark Performance
The classification accuracy of DBMKELM and the other methods on the multiple kernel classification benchmarks is shown in Table 8, and the corresponding training time is shown in Table 9. From the results, we can see that DBMKELM is significantly better than ELM. This means that DBMKELM is able to perform multisource data fusion, thereby improving the performance of ELM. DBMKELM is also better than the state-of-the-art multiple kernel extreme learning machines in this case in terms of both classification accuracy and training time.
4.6. Parameter Sensitivity Test
In our proposed DBMKELM, two regularization parameters need to be set. For clarity, we use $C$ and $C_K$ to denote the regularization parameters in ELM training and in multiple kernel learning, respectively. We choose the classification data set ionosphere and the regression data set yacht to test parameter sensitivity. For each data set, we set a wide range of $C$ and $C_K$. Specifically, we use 10 different values of $C$ and 10 different values of $C_K$ taken from a common set of candidate values. For each ($C$, $C_K$) pair, we repeat the experiment 20 times on each data set to get the average accuracy. The results for the classification case and the regression case are shown in Figures 2 and 3, respectively. As can be seen from the results, the performance of DBMKELM is not sensitive to $C$ and $C_K$ when they vary within a wide range.
4.7. Discussion
The experimental results have illustrated that DBMKELM can achieve high accuracy with a fast learning speed. However, two issues need to be discussed.
Why is the learning speed of DBMKELM much slower than that of basic ELM in some cases? The main reason is that there are many samples to learn from in the Kspace, which is constructed in the multiple kernel learning step. Specifically, each new sample corresponds to a pair of original samples, so the number of new samples grows quadratically with the number of original samples. Therefore, the training time difference between DBMKELM and basic ELM is magnified as the number of training samples increases. It may be possible to reduce the training time gap between DBMKELM and basic ELM by using sampling techniques in the Kspace.
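One simple form such sampling could take is to draw a random subset of sample pairs instead of using all of them. The sketch below is only an illustration of the idea: uniform sampling with replacement, the function name, and the toy kernels are assumptions, and the effect of subsampling on accuracy is not evaluated here.

```python
import numpy as np

def sample_kspace(base_kernels, label_dist, n_pairs, seed=0):
    """Build a Kspace from n_pairs uniformly sampled (i, j) pairs only."""
    rng = np.random.default_rng(seed)
    N = label_dist.shape[0]
    i = rng.integers(0, N, size=n_pairs)           # first index of each sampled pair
    j = rng.integers(0, N, size=n_pairs)           # second index of each sampled pair
    Z = np.stack([K[i, j] for K in base_kernels], axis=1)   # shape (n_pairs, p)
    t = label_dist[i, j]                                     # shape (n_pairs,)
    return Z, t

# Toy example: 500 samples would give 250000 pairs; keep only 2000 of them.
rng = np.random.default_rng(1)
y = rng.uniform(0.0, 1.0, size=500)
X = rng.uniform(-1.0, 1.0, size=(500, 3))
K_lin = X @ X.T
K_quad = (K_lin + 1.0) ** 2
label_dist = np.abs(y[:, None] - y[None, :])
Z, t = sample_kspace([K_lin, K_quad], label_dist, n_pairs=2000)
```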
In which cases should we use DBMKELM? DBMKELM obtains more accurate results and a faster learning speed than the traditional multiple kernel learning method SimpleMKL. Although it compares favorably with other multiple kernel learning methods, DBMKELM has a higher time cost than the basic ELM method. Therefore, a tradeoff between accuracy and time cost is needed. The experimental results show that DBMKELM significantly improves testing accuracy compared with basic ELM in regression and multisource data fusion problems in most cases. However, in the classification case where the kernels are generated from one data source, DBMKELM shows no significant difference from basic ELM in testing accuracy. Therefore, a preferable choice is to apply DBMKELM to regression and multisource data fusion problems and to use basic ELM for classification problems whose kernels are generated from a single data source.
5. Conclusion
In this paper, we have proposed DBMKELM, a new multiple kernel based ELM, to extend the basic kernel based ELM. The proposed multiple kernel learning method unifies classification and regression problems. Moreover, DBMKELM is able to learn from multiple kernels at an extremely fast speed. Experimental results show that DBMKELM achieves a significant performance enhancement in terms of accuracy and time cost in both classification and regression problems. In the future, we will consider how to define a better distance among different classes and how to extend DBMKELM to the semisupervised learning problem.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Project nos. 60970034, 61170287, and 61232016), the National Basic Research Program of China (973) under Grant no. 2014CB340303, and the Hunan Provincial Science and Technology Planning Project of China (Project no. 2012FJ4269).