Special Issue: Extreme Learning Machine on High Dimensional and Large Data Applications
Sample-Based Extreme Learning Machine with Missing Data
Extreme learning machine (ELM) has been extensively studied in the machine learning community during the last few decades due to its high efficiency and its unification of classification, regression, and so forth. Despite these merits, existing ELM algorithms cannot efficiently handle missing data, which is relatively common in practical applications. Missing data is commonly handled by imputation (i.e., replacing missing values with substituted values according to available information). However, imputation methods are not always effective. In this paper, we propose a sample-based learning framework to address this issue. Based on this framework, we develop two sample-based ELM algorithms for classification and regression, respectively. Comprehensive experiments have been conducted on synthetic data sets, UCI benchmark data sets, and a real-world fingerprint image data set. The results indicate that, without introducing extra computational complexity, the proposed algorithms learn more accurately and stably than other state-of-the-art ones, especially at higher missing ratios.
1. Introduction
Extreme learning machine (ELM) was proposed by Huang et al. for generalized single-hidden-layer feedforward networks (SLFNs). Unlike traditional neural network learning, the hidden node parameters of ELM are randomly generated and do not need to be adjusted. ELM therefore achieves fast learning speed as well as excellent generalization ability with the least human intervention. As an open, scalable, and unified learning framework, ELM can be used in various kinds of learning tasks, for example, classification, regression, and representation learning. Its learning theory has become an active research topic in the machine learning domain in recent years. On the one hand, many improved variants have been developed. On the other hand, the ELM learning mechanism has been integrated into some well-known platforms and systems. The following is a brief review. The objective of the variants is twofold: achieving more accurate learning by introducing different regularization methodologies and reducing computational complexity by simplifying the ELM network architecture. Regularized ELM considers heteroscedasticity in real applications and reduces the effect of outliers in the data set. Ridge regression, elastic net, and lasso methods have also been used to prune the ELM network, which leads to a compact architecture. Sparse Bayesian ELM estimates the marginal likelihood of network outputs and automatically prunes most of the redundant hidden neurons. Localized error model based ELM utilizes principal component analysis to reduce the feature dimension and then selects the optimal architecture of the SLFN. OP-ELM uses the LASSO regularization technique to rank the neurons of the hidden layer and obtains a more parsimonious model. TROP-ELM further improves OP-ELM by using a cascade regularization method based on LASSO and Tikhonov criteria. OP-ELM-ER-NCL combines an ensemble of regularization techniques with a negative correlation penalty.
TS-ELM introduces a systematic two-stage mechanism to determine the network architecture. MK-ELM considers multiple heterogeneous data sources and uses a multikernel strategy to optimize ELM. Moreover, ELM has been modified according to the characteristics of different data sets and applied in various applications. OS-ELM learns the data one-by-one or chunk-by-chunk with fixed or varying chunk size. EOS-ELM further improves the learning stability of OS-ELM by adapting the node location, adjustment, and pruning methods. Weighted ELM assigns different weights to different samples according to users' needs, realizing cost-sensitive learning. T2FELA inherits the merits of ELM and randomly generates the parameters of the antecedents, successfully applying ELM in type 2 fuzzy logic systems. ELM has also been integrated into the MapReduce framework for large-scale regression. All in all, these extensions and variants inherit the merits of ELM and are more suitable for specific application scenarios. However, the missing data problem, which is pervasive in real-world tasks, is rarely considered in ELM learning.
Missing data is a common situation in which null values are stored for some samples in a data set. The problem has complex patterns. It occurs for several practical reasons, such as equipment malfunction during data sampling or transmission and noisy values being deleted in data preprocessing. Besides, some values are inherently missing because the corresponding features do not exist. For example, an image sample may not contain all the predefined components. Standard ELM learning requires all samples in the data set to be complete and to have the same dimensions, so it cannot directly handle missing data. Traditional approaches depend on preprocessing the missing data before learning. Inevitably, those approaches introduce extra preprocessing overhead. Worse still, they may mistakenly omit some useful information and introduce a certain amount of erroneous information. Therefore, learning precision can be seriously affected. In this paper, we propose a sample-based learning framework and develop a sample-based ELM classification algorithm and a sample-based ELM ε-insensitive regression algorithm. In the proposed learning framework, missing data can be learned directly without any extra preprocessing.
The contributions of this paper are as follows. First, we analyze the limitations of state-of-the-art approaches for learning with missing data. Second, we propose a sample-based ELM learning framework for learning with missing data. Third, we develop two sample-based ELM algorithms for classification and regression, respectively. Experimental results show that the proposed algorithms achieve more accurate classification and regression than traditional approaches, especially when the missing ratio is relatively high.
2. ELM Learning
In Section 2.1, we review the basic concepts of ELM. Then, starting from the general ELM optimization formula, we discuss several ELM variants and illustrate their characteristics. In Section 2.2, we introduce ELM ε-insensitive regression and explain its advantage over standard ELM regression.
2.1. Standard ELM
From the viewpoint of neural networks, ELM can be seen as a generalized SLFN. Figure 1 gives the network architecture. ELM randomly chooses the hidden nodes and analytically determines the output weights of the SLFN. In theory, ELM can approximate any continuous target function and classify any disjoint regions. Its interpolation capability and universal approximation capability have been investigated by Huang et al. in [18, 19]. The corresponding output function is given as
$$f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\boldsymbol{\beta}, \tag{1}$$
where $\boldsymbol{\beta}$ is the output weight vector and $\mathbf{h}(\mathbf{x})$ is the random mapping function. Unlike traditional learning algorithms, ELM tends to minimize not only the empirical risk but also the structural risk. Its general optimization formula is
$$\min_{\boldsymbol{\beta}} \ \|\boldsymbol{\beta}\|_{p}^{\sigma_1} + C\,\|\mathbf{H}\boldsymbol{\beta} - \mathbf{T}\|_{q}^{\sigma_2}. \tag{2}$$
From the viewpoint of learning theory, ELM considers the empirical risk as well as the structural risk, as can be observed in its general optimization formula (2), in which $\mathbf{H}$ is the random mapping matrix and $\mathbf{T}$ is the matrix of training data targets. Starting from (2), different combinations of parameters and constraints in the general formula yield many ELM variants with different regularization methods. Additionally, kernel methods are used to enhance the learning ability, especially for learning from multiple data sources. With competitive generalization performance, these variants are widely used in classification, regression, representation learning, clustering, and so forth. The following are some representative forms.
Basic ELM. There are two kinds of basic ELM. When the trade-off parameter $C$ on the empirical risk term tends to infinity (i.e., ELM only concerns the empirical risk), the optimization objective reduces to $\min_{\boldsymbol{\beta}} \|\mathbf{H}\boldsymbol{\beta} - \mathbf{T}\|$. At the other extreme, when $C = 0$ (i.e., ELM only concerns the structural risk), the optimization objective reduces to $\min_{\boldsymbol{\beta}} \|\boldsymbol{\beta}\|$. Basic ELM learns very efficiently, but each of the two forms takes only one optimization objective into account, which can cause overfitting or underfitting.
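Inequality Optimization Constraints Based ELM. With the parameter setting $\sigma_1 = 2$, $\sigma_2 = 1$, $p = 2$, $q = 1$ and inequality constraints, the general optimization formula can be written as (4), which is common in binary classification. Since this form is widely applied and has good sparsity, we use it as the base model of our extension for classification:
$$\min_{\boldsymbol{\beta},\,\boldsymbol{\xi}} \ \frac{1}{2}\|\boldsymbol{\beta}\|^{2} + C\sum_{i=1}^{N}\xi_i \quad \text{s.t.} \quad t_i\,\boldsymbol{\beta}\cdot\mathbf{h}(\mathbf{x}_i) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1,\dots,N. \tag{4}$$
Here $N$ is the number of training samples, $t_i \in \{-1, +1\}$ is the label of sample $\mathbf{x}_i$, and $\xi_i$ is its slack variable. Applying the KKT conditions, (4) can be transformed into (5), which can then be solved in the dual space with Lagrange multipliers $\alpha_i$:
$$\min_{\boldsymbol{\alpha}} \ \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} t_i t_j \alpha_i \alpha_j\, \mathbf{h}(\mathbf{x}_i)\cdot\mathbf{h}(\mathbf{x}_j) - \sum_{i=1}^{N}\alpha_i \quad \text{s.t.} \quad 0 \le \alpha_i \le C,\ \ i = 1,\dots,N. \tag{5}$$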
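Equality Optimization Constraints Based ELM. With the parameter setting $\sigma_1 = 2$, $\sigma_2 = 2$, $p = 2$, $q = 2$ and an equality constraint, the general ELM optimization formula is equivalent to (6), which can be used in regression and classification:
$$\min_{\boldsymbol{\beta},\,\boldsymbol{\xi}} \ \frac{1}{2}\|\boldsymbol{\beta}\|^{2} + \frac{C}{2}\sum_{i=1}^{N}\|\boldsymbol{\xi}_i\|^{2} \quad \text{s.t.} \quad \mathbf{h}(\mathbf{x}_i)\boldsymbol{\beta} = \mathbf{t}_i^{T} - \boldsymbol{\xi}_i^{T},\ \ i = 1,\dots,N. \tag{6}$$
The corresponding KKT optimality conditions are
$$\boldsymbol{\beta} = \mathbf{H}^{T}\boldsymbol{\alpha}, \qquad \boldsymbol{\alpha}_i = C\,\boldsymbol{\xi}_i, \qquad \mathbf{h}(\mathbf{x}_i)\boldsymbol{\beta} - \mathbf{t}_i^{T} + \boldsymbol{\xi}_i^{T} = 0, \qquad i = 1,\dots,N. \tag{7}$$
Further, the final output is given by
$$f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\boldsymbol{\beta} = \mathbf{h}(\mathbf{x})\,\mathbf{H}^{T}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{T}. \tag{8}$$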
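As a minimal sketch of the equality-constrained form, the following Python/NumPy code draws random hidden node parameters and then solves for the output weights in closed form, $\boldsymbol{\beta} = \mathbf{H}^{T}(\mathbf{I}/C + \mathbf{H}\mathbf{H}^{T})^{-1}\mathbf{T}$. The sigmoid activation, the uniform sampling range, the toy target function, and all function names are illustrative assumptions; this is not the authors' MATLAB implementation.

```python
import numpy as np

def elm_hidden(X, W, b):
    """Random feature mapping h(x); a sigmoid activation is assumed here."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def elm_fit(X, T, n_hidden=50, C=1.0, seed=0):
    """Fit the output weights via beta = H^T (I/C + H H^T)^{-1} T."""
    rng = np.random.default_rng(seed)
    # Hidden node parameters are drawn once at random and never tuned.
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = elm_hidden(X, W, b)
    alpha = np.linalg.solve(np.eye(H.shape[0]) / C + H @ H.T, T)
    beta = H.T @ alpha
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Evaluate f(x) = h(x) beta for every row of X."""
    return elm_hidden(X, W, b) @ beta

# Toy regression problem (hypothetical data, for illustration only).
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
T = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2
W, b, beta = elm_fit(X, T, n_hidden=60, C=100.0)
train_rmse = float(np.sqrt(np.mean((elm_predict(X, W, b, beta) - T) ** 2)))
```

Only the linear solve depends on the training targets, which is where the efficiency of ELM comes from. When the number of hidden nodes is much smaller than the number of samples, the algebraically equivalent form $(\mathbf{I}/C + \mathbf{H}^{T}\mathbf{H})^{-1}\mathbf{H}^{T}\mathbf{T}$ solves a smaller system.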
2.2. ELM for ε-Insensitive Regression
For regression, ELM provides a general model for the standard setting. It achieves better predictive accuracy than traditional SLFNs. In addition, many variants and extensions of ELM regression algorithms have been proposed. Inspired by Vapnik's ε-insensitive loss function, an ε-insensitive ELM has been proposed. Its optimization formula (9) combines the structural risk term with an ε-insensitive error loss, where ε is the insensitive factor. The error loss function (10) is the ε-insensitive loss, $|u|_{\varepsilon} = \max(0, |u| - \varepsilon)$, applied to each residual $u$ between target and network output. Compared with conventional ELM regression, ELM with the ε-insensitive loss function uses a margin to measure the empirical risk. It controls the sparsity of the solution and is less sensitive to different levels of noise. In this paper, we extend the ELM regression algorithm based on this variant.
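The ε-insensitive loss can be sketched in a few lines of Python/NumPy. The function name and the sample residuals are illustrative; the value ε = 0.01 matches the setting reported in the experiments.

```python
import numpy as np

def eps_insensitive(residual, eps=0.01):
    """Vapnik's loss: zero inside the eps-tube, growing linearly outside it."""
    return np.maximum(0.0, np.abs(residual) - eps)

# Residuals (target minus prediction) for five hypothetical samples.
residuals = np.array([-0.5, -0.005, 0.0, 0.01, 0.2])
losses = eps_insensitive(residuals, eps=0.01)
```

Residuals whose magnitude does not exceed ε contribute nothing to the empirical risk, which is why small noise has less influence than under a squared loss.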
3. Missing Data Problem in ELM Learning
3.1. Missing Data Problem
Nowadays, with ever-increasing data velocity and volume, missing data has become a common phenomenon. Generally, there are two missing patterns: missing features and missing labels. In this paper, we focus on missing features.
Regarding the causes of missing data, there are two circumstances. In the first circumstance, the missing features exist but their values are unobserved because information is lost or some features are too costly to acquire. Examples of this case can be found in many domains. Sensors in a remote sensor network may be damaged and intermittently fail to collect data. Certain regions of a gene microarray may fail to yield measurements of the underlying gene expressions due to scratches, fingerprints, or dust. The second circumstance is inherent missingness, in which different samples inherently contain different features. For instance, in packed malware identification, instances contain some unreasonable values. In the web-page task, one useful feature of a given page may be the most common topic of other sites that point to it. If this particular page has no such parents, however, the feature is null and should be considered structurally missing. Obviously, imputation in this circumstance is meaningless.
3.2. Traditional Approaches for Missing Data Learning
Generally, there are three approaches for dealing with missing features in machine learning. The first approach is omitting, which includes sample deletion and feature filtering. Sample deletion simply omits the samples containing missing features and applies standard learning algorithms to the remaining samples. An example is shown in Figure 2, where a sample with two missing features is deleted. Feature filtering omits the features that are missing in most samples. Figure 3 illustrates this approach. The advantage of omitting-based approaches is that they are simple and computationally inexpensive. The key point is to keep as much useful information as possible while omitting, but this is difficult to do: both variants inevitably omit some useful information. When plenty of information is retained after the omission, this approach can be a good choice. Otherwise, when much useful information is omitted and little is retained, this kind of approach seriously affects learning precision.
The second approach is imputation. In the data preprocessing phase, missing features are filled with their most probable values. Simple imputations fill the missing features with some default value, such as zero or the average value over the other samples. Complex imputations use a probability density function or distribution function to estimate the missing features. The computational complexity of imputation varies with the estimation method. Imputation makes sense when the features are known to exist and are relatively easy to estimate. But in some cases the missing values are difficult to impute. Besides, in the case of inherent missingness, imputation is meaningless.
The last approach, which is promising, has appeared in recent years. Researchers extend standard learning algorithms to deal with missing features. Unlike the above two approaches, it needs no extra preprocessing for missing data. It keeps incomplete samples intact and extends standard learning algorithms according to their learning theory. Consequently, the extended algorithms learn over complete samples as well as incomplete samples. The sample-based learning proposed in this paper belongs to this category.
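The two omitting-based approaches can be sketched as follows in Python/NumPy. Encoding missing values as NaN, the toy matrix, and the 0.5 filtering threshold are illustrative assumptions, since the text only says that features missing in most samples are dropped.

```python
import numpy as np

# Toy data set: rows are samples, columns are features, NaN marks a missing value.
X = np.array([
    [1.0, 2.0, np.nan],
    [4.0, np.nan, np.nan],
    [7.0, 8.0, 9.0],
    [1.5, 2.5, np.nan],
])

# Sample deletion: keep only the samples with no missing feature.
complete_rows = ~np.isnan(X).any(axis=1)
X_deleted = X[complete_rows]

# Feature filtering: drop the features that are missing in more than half
# of the samples (the 0.5 threshold is an assumption for illustration).
missing_fraction = np.isnan(X).mean(axis=0)
X_filtered = X[:, missing_fraction <= 0.5]
```

In this toy case sample deletion keeps a single sample out of four, which shows how quickly omitting can discard useful information when the missing ratio is high.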
4. Sample-Based ELM Framework and Algorithms
In this section, we start by illustrating ELM learning in a max-margin way, as Huang explains, and formalize the learning problem with nonexisting features. Then, we explain why standard max-margin learning must be modified for learning with missing data. Based on this, we propose the S-ELM classification algorithm in Section 4.2 and the S-ELM ε-insensitive regression algorithm in Section 4.3.
4.1. Sample-Based Learning Framework
From the viewpoint of geometric formulation, maximizing the margin is equivalent to maximizing the dot products of sample vectors and output weight vectors. Meanwhile, the smaller the norms of the output weights are, the better generalization the network tends to have. ELM learning is the process of adjusting the output weight vectors according to these two objectives. For samples containing missing features, the margins are incorrectly scaled. For example, in Figure 4, one sample misses feature 1. The distance between this sample and the separating plane should be calculated only over its existing features. However, in standard ELM the distance is calculated in the full space, so it is incorrectly scaled. Based on this observation, we consider each sample in its own relevant subspace. This is the essence of the sample-based learning framework: samples with missing features should be treated as residing in their relevant subspaces rather than in the full space. The optimization formula of S-ELM, (11), is derived naturally from this idea. Compared with the standard general ELM, the essential change is the substitution of the original output weight by a sample-based output weight. Specifically, there are two objectives in the minimization formula (i.e., empirical risk and structural risk). Missing features affect the calculation of the empirical risk (i.e., the dot product of output weight and sample vector). By considering distances in sample-based subspaces, we keep the two optimization objectives consistent by replacing the output weight with its sample-based counterpart, which is the projection of the original output weight onto the relevant subspace of the sample. For samples without missing features, the sample-based output weight is the same as the standard one.
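The scaling problem can be illustrated with a small numeric sketch in Python/NumPy. It uses a plain linear separator in a two-dimensional space, which is a simplifying assumption made only to show the geometry; the vectors `w`, `x`, and `mask` are hypothetical values, not taken from Figure 4.

```python
import numpy as np

w = np.array([3.0, 4.0])     # weight vector of a linear separator (hypothetical)
x = np.array([0.0, 2.0])     # feature 1 is missing; the 0 is only a placeholder
mask = np.array([0.0, 1.0])  # 1 = feature present, 0 = feature missing

# Full-space distance: the norm includes a weight this sample cannot use.
full_space = (w @ x) / np.linalg.norm(w)          # 8 / 5 = 1.6

# Sample-based distance: project w onto the sample's own subspace first.
w_i = w * mask
sample_based = (w_i @ x) / np.linalg.norm(w_i)    # 8 / 4 = 2.0
```

The dot product is identical in both cases, because the missing feature contributes nothing to it. Only the norm changes, so the full-space computation understates the distance of the incomplete sample.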
4.2. S-ELM Classification Algorithm
According to the sample-based learning framework, the optimization formula of sample-based ELM (S-ELM) classification is constructed as (12), which extends (4). Different missingness among samples may induce different sample-based output weights. In order to guarantee better generalization, the maximum of their norms over all samples should be minimized. Consequently, (12) can be transformed into (13). By introducing an indication matrix (each of its components indicates whether the corresponding feature of a sample is missing or not), (13) can be equivalently rewritten as (14). Since (14) is not a convex optimization problem, it is more difficult to solve than standard ELM. Fortunately, we can transform it into a convex one, (15), by using an auxiliary variable. In (15), the objective is a quadratic function, which is convex; the first constraint is linear in the variables, which is also convex; and the second constraint is a quadratic cone, which is again convex. Therefore, (15) is a convex optimization problem that can be efficiently solved by many off-the-shelf convex optimization packages. In this paper, we use the monqp package to solve it. Further, since (15) is a quadratically constrained quadratic programming (QCQP) problem, more advanced optimization techniques such as Nesterov's methods can be applied to improve its computational efficiency.
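The auxiliary-variable step follows the standard epigraph reformulation of a min-max objective. As a generic sketch, with illustrative symbols $\theta$, $g_i$, $r$, and $\tau$ that are not the paper's notation:

```latex
\begin{aligned}
&\min_{\theta}\ \Big[\max_{1\le i\le N} g_i(\theta)\Big] + r(\theta)
\\[2pt]
\Longleftrightarrow\quad
&\min_{\theta,\,\tau}\ \tau + r(\theta)
\quad\text{s.t.}\quad g_i(\theta)\le\tau,\quad i=1,\dots,N .
\end{aligned}
```

At the optimum $\tau$ equals the largest $g_i(\theta)$, so both problems have the same minimizers in $\theta$. If every $g_i$ and $r$ is convex, the second problem is convex, which is the property used above when each $g_i$ is the norm of a sample-based output weight.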
4.3. S-ELM ε-Insensitive Regression Algorithm
In this section, we extend the ε-insensitive ELM based on the sample-based learning framework. Based on (9), we substitute the standard output weight with the sample-based one. The resultant optimization formula is (16).
In (16), the sample-based output weights corresponding to different samples may have different components, so we minimize their maximum value and change the optimization objective to (17). Then, we apply the same transformation technique as in Section 4.2 and derive the final convex optimization problem (18). In (18), the objective is convex; the first and third constraints are linear inequalities, which are convex; and the second constraint is a quadratic inequality, which is again convex. Consequently, (18) is a convex optimization problem that can be efficiently solved by many existing methods.
5. Experiments
In this section, we evaluate the proposed algorithms on synthetic data sets, UCI benchmark data sets, and a real-world fingerprint image data set to validate the effectiveness of the S-ELM learning framework. The missing data used in the experiments is produced artificially. In detail, for each raw data set we generate an indication matrix (described in Section 4.2) with the same numbers of rows and columns as the data set. First, all the components of the indication matrix are initialized to one. Then, according to the specified missing ratio, some randomly chosen components are set to zero. Note that we make sure that no row or column is set entirely to zero. For the same data set, indication matrices are produced for the training and testing data separately.
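The mask generation described above can be sketched as follows in Python/NumPy. The text does not say how all-zero rows and columns are avoided, so redrawing the mask until the condition holds is an assumption, as are the function name and the matrix sizes.

```python
import numpy as np

def make_indicator(n_rows, n_cols, missing_ratio, seed=0, max_tries=1000):
    """Return a 0/1 matrix (1 = observed) with a `missing_ratio` share of zeros.

    The mask is redrawn until no row and no column is entirely zero.
    """
    rng = np.random.default_rng(seed)
    n_missing = int(round(missing_ratio * n_rows * n_cols))
    for _ in range(max_tries):
        S = np.ones(n_rows * n_cols, dtype=int)        # start with all ones
        zero_idx = rng.choice(S.size, size=n_missing, replace=False)
        S[zero_idx] = 0                                # knock out random entries
        S = S.reshape(n_rows, n_cols)
        if S.sum(axis=1).min() > 0 and S.sum(axis=0).min() > 0:
            return S
    raise RuntimeError("could not satisfy the row/column constraint")

# Separate masks for training and testing data, as in the experiments.
S_train = make_indicator(100, 8, missing_ratio=0.3, seed=0)
S_test = make_indicator(50, 8, missing_ratio=0.3, seed=1)
```

Rejection sampling keeps the missing ratio exact. For very high ratios or very narrow data sets a constructive scheme would be needed instead, because most draws would then violate the constraint.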
To validate the effectiveness of the proposed algorithms, we compare their performance with two imputation approaches (i.e., zero-filling and mean-filling). Specifically, the S-ELM learning algorithms run on the data set with missing data, while the standard ELM algorithms run on the imputed data set. For brevity, ZF-ELM denotes the standard ELM algorithm running on the zero-filled data set, and MF-ELM denotes the standard ELM algorithm running on the mean-filled data set.
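The two imputation baselines can be sketched as follows in Python/NumPy, with the indication matrix `S` marking observed entries by 1. Taking the mean over the observed values of the same feature is an assumption about how mean-filling is computed; the function names and toy data are illustrative.

```python
import numpy as np

def zero_fill(X, S):
    """Replace every missing entry (S == 0) with zero."""
    return np.where(S == 1, X, 0.0)

def mean_fill(X, S):
    """Replace missing entries with the mean of the observed values of that feature."""
    observed = S == 1
    col_sum = np.where(observed, X, 0.0).sum(axis=0)
    col_cnt = observed.sum(axis=0)
    col_mean = col_sum / np.maximum(col_cnt, 1)  # guard against empty columns
    return np.where(observed, X, col_mean)

X = np.array([[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]])
S = np.array([[1, 0], [1, 1], [0, 1]])
X_zf = zero_fill(X, S)   # [[1, 0], [3, 20], [0, 30]]
X_mf = mean_fill(X, S)   # [[1, 25], [3, 20], [2, 30]]
```

Both baselines hand a complete matrix to standard ELM, so every sample is then treated as living in the full feature space.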
5.1. Experiment Settings
All our experiments are implemented in the MATLAB 2013b environment running on a Core 3.0 GHz CPU with 8 GB RAM. Prior to each algorithm execution, we preprocess the data. First, we normalize the data sets to a common range. Then, we randomly permute the data sets and divide them into training data and testing data in a fixed proportion. In order to reduce the effect of randomness, the final results are averages over fifty repetitions. We also calculate the standard deviation to show the stability of the algorithms.
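The evaluation protocol can be sketched as follows in Python/NumPy. The normalization range [0, 1], the 50/50 split, the random toy data, and the placeholder learner are all assumptions made for illustration; only the fifty repetitions and the reporting of mean and standard deviation come from the text.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each feature to [0, 1]; constant features are left at 0."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span

def random_split(X, y, train_fraction, rng):
    """Randomly permute the samples and split them into train and test parts."""
    order = rng.permutation(len(X))
    n_train = int(train_fraction * len(X))
    tr, te = order[:n_train], order[n_train:]
    return X[tr], y[tr], X[te], y[te]

def repeated_runs(evaluate, X, y, n_runs=50, train_fraction=0.5, seed=0):
    """Average a score over repeated random splits; also return its std."""
    rng = np.random.default_rng(seed)
    scores = [evaluate(*random_split(X, y, train_fraction, rng))
              for _ in range(n_runs)]
    return float(np.mean(scores)), float(np.std(scores))

def majority_baseline(X_tr, y_tr, X_te, y_te):
    """Placeholder learner: predict the most frequent training label."""
    labels, counts = np.unique(y_tr, return_counts=True)
    return float(np.mean(y_te == labels[np.argmax(counts)]))

rng = np.random.default_rng(42)
X = min_max_normalize(rng.normal(size=(120, 4)))
y = (rng.random(120) < 0.7).astype(int)
mean_acc, std_acc = repeated_runs(majority_baseline, X, y)
```

Any learner with the same four-argument signature can be passed as `evaluate`, so the same loop serves S-ELM and the imputation baselines alike.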
5.2. Results of S-ELM Classification
In this section, we use aggregated classification accuracy to measure the learning precision of the different algorithms. Missing ratios range from 0.1 to 0.6 in steps of 0.1. For each algorithm and each data set, the regularization factor is chosen by fivefold cross validation. The UCI benchmark data sets vary in the number of features and samples; detailed information is shown in Table 1.
The classification accuracy of the three algorithms is plotted in Figure 5. The curves corresponding to S-ELM lie above the other two in most cases, indicating its superior performance. Furthermore, Table 2 lists the average accuracy and standard deviation. In general, for the various UCI benchmark data sets, the S-ELM classification algorithm achieves better accuracy and stability over the different missing ratios. The results are consistent with our analysis in Section 4. The two imputation methods classify all samples in the full feature space, while S-ELM learning assumes that samples with different vector components lie in their own relevant feature spaces. In particular, compared with the original ELM, the S-ELM classification algorithm does not bring extra computational cost.
5.3. Fingerprint Classification
As an application, we evaluate our methods on a real-world problem, namely automatic fingerprint classification, which is an effective indexing scheme for reducing the number of matching candidates in fingerprint identification. Through gross classification, a query finger is compared only with samples of the same class rather than with all samples in the whole database, which greatly reduces computational complexity. In practice, there are some meaningless features in a fingerprint data set due to distortion, wet and dry impressions, and so forth. The meaningless features can be seen as a kind of missing data. Therefore, we apply the S-ELM classification algorithm to this application.
The fingerprint image data set used in our experiment contains 2000 samples. First, we manually categorize all the fingerprints into three basic patterns: loop, whorl, and arch. All fingerprint images are 8-bit gray-level bmp files and the image resolution is 328 × 356. Then, we use the first 1000 pairs of fingerprints for training and the second 1000 pairs of fingerprints for testing. Through fingerprint image preprocessing (e.g., edge detection, segmentation, thinning, and feature extraction), each fingerprint image is expressed as a feature vector of 400 dimensions.
Very low accuracy is of no practical use in fingerprint classification, so we set the missing ratios in a relatively low range, that is, 0.1, 0.2, and 0.3. Besides, zero-filling is meaningless here, so we only take MF-ELM for comparison. The fingerprint data set contains three classes of samples. We use the one-versus-rest method to classify them; that is, we take one of the three classes as the target each time. All the classification accuracies listed in Table 3 are averages over ten executions. It can be observed that the accuracy of fingerprint classification is seriously affected by missing data; that the computational costs at different missing ratios are basically the same; and that, in almost all conditions, the S-ELM classification algorithm performs more accurately and stably than MF-ELM with almost the same computational cost.
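The one-versus-rest setup turns the three-class problem into three binary problems, which can be sketched as follows in Python/NumPy. The ±1 label encoding and the five toy labels are illustrative assumptions.

```python
import numpy as np

classes = ["loop", "whorl", "arch"]
labels = np.array(["loop", "arch", "whorl", "loop", "arch"])  # toy labels

# One binary target vector per class: +1 for the target class, -1 for the rest.
ovr_targets = {c: np.where(labels == c, 1, -1) for c in classes}
```

Each target vector can then be fed to a binary classifier, so that one class is separated from the remaining two in each run.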
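5.4. Results of S-ELM ε-Insensitive Regression
In this section, we evaluate the S-ELM ε-insensitive regression algorithm on synthetic data sets and UCI benchmark data sets. Root mean square error (RMSE) is used to measure the prediction accuracy of the algorithms. It is calculated as
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(t_i - f(\mathbf{x}_i)\bigr)^{2}}, \tag{19}$$
where $N$ is the number of testing samples, $t_i$ is the target of sample $\mathbf{x}_i$, and $f(\mathbf{x}_i)$ is the corresponding prediction.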
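For reference, RMSE over a set of test predictions can be computed as in the following Python/NumPy sketch; the function name and the three sample values are illustrative.

```python
import numpy as np

def rmse(targets, predictions):
    """Root mean square error between targets and predictions."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    return float(np.sqrt(np.mean((targets - predictions) ** 2)))

# One prediction is off by 2, the other two are exact: sqrt(4 / 3).
score = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Because the errors are squared before averaging, a few large errors dominate the score.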
We generate synthetic data sets from combinations of basic algebraic functions. For each function, the domains of all independent variables lie in the same range, and their sampling intervals are randomly different. The details are given in Table 4. For S-ELM ε-insensitive regression, ε is set to 0.01 empirically.
From Figure 6, it can be observed that the S-ELM ε-insensitive regression algorithm achieves better prediction accuracy at the different missing ratios. Specifically, S-ELM performs better than ZF-ELM, which suggests that S-ELM reduces the generalization error by using the norm of the sample-based output weight vector. S-ELM also beats MF-ELM, although with a less obvious advantage. This indicates that the mean value is closer to the missing value than zero is.
To further illustrate the advantage of the S-ELM ε-insensitive regression algorithm, we use UCI benchmark data sets for comparison. Table 5 specifies the UCI benchmark data sets used in our experiments. These four data sets vary in scale.
Mean RMSE and standard deviation are reported in Table 6. The S-ELM regression algorithm performs better than ZF-ELM and MF-ELM over the different missing ratios. As Figure 7 shows, as the missing ratio increases, the prediction accuracies of both ZF-ELM and MF-ELM decrease rapidly, while that of S-ELM decreases mildly. The most obvious advantage of S-ELM appears on the Pyrim data set, especially when the missing ratio is high. This suggests that it is more difficult for imputation approaches to estimate the true values of missing features in small-scale data sets.
6. Conclusions and Future Work
In this paper, we propose a sample-based learning framework for ELM in order to cope with the problem of missing data. Further, we extend ELM classification and ε-insensitive ELM regression within this framework. For the proposed algorithms, we transform their formulations into convex optimization problems. We compare the proposed algorithms with two widely used imputation methods on synthetic data sets, UCI benchmark data sets, and a real-world fingerprint image data set. The results show that S-ELM achieves better classification and regression accuracy and stability without introducing extra computational complexity. In particular, S-ELM shows a remarkable advantage when the missing ratio is relatively high. Several directions for future work are worth exploring. Next, we are going to extend more ELM algorithms based on the sample-based learning framework. Meanwhile, we will handle missing data in feature selection to improve learning accuracy. Further, we will consider missing data in learning from multiple data sources.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work is supported by the Major State Basic Research Development Program of China (973 Program) under Grant no. 2014CB340303, the National High Technology Research and Development Program of China (863 Program) under Grant no. 2013AA01A213, and the National Natural Science Foundation of China under Grant no. 61402490.
References
A. Mozaffari and N. L. Azad, “Optimally pruned extreme learning machine with ensemble of regularization techniques and negative correlation penalty applied to automotive engine coldstart hydrocarbon emission identification,” Neurocomputing, vol. 131, pp. 143–156, 2014.
N.-Y. Liang, G.-B. Huang, H.-J. Rong, P. Saratchandran, and N. Sundararajan, “A fast and accurate on-line sequential learning algorithm for feedforward networks,” IEEE Transactions on Neural Networks, vol. 17, no. 6, pp. 1411–1423, 2006.
B. M. Marlin, Missing data problems in machine learning [Ph.D. thesis], University of Toronto, 2008.
P. Royston, “Multiple imputation of missing values,” Stata Journal, vol. 4, pp. 227–241, 2004.
J. Goeman, R. Meijer, and N. Chaturvedi, “L1 and l2 penalized regression models,” cran.r-project.or, 2012.
R. J. A. Little, “Regression with missing x's: a review,” Journal of the American Statistical Association, vol. 87, no. 420, pp. 1227–1237, 1992.
G. Chechik, G. Heitz, G. Elidan, P. Abbeel, and D. Koller, “Max-margin classification of data with absent features,” The Journal of Machine Learning Research, vol. 9, pp. 1–21, 2008.