Mathematical Problems in Engineering

Volume 2015 (2015), Article ID 126452, 17 pages

http://dx.doi.org/10.1155/2015/126452

## Stream-Based Extreme Learning Machine Approach for Big Data Problems

^{1}Graduate Program in Electrical Engineering, Federal University of Minas Gerais, Avenida Antônio Carlos 6627, 31270-901 Belo Horizonte, MG, Brazil

^{2}Institute of Science and Technology, Federal University of Jequitinhonha and Mucuri Valleys, Rodovia MGT 367, Km 583, 5000 Alto da Jacuba, 39100-000 Diamantina, MG, Brazil

^{3}Department of Electrical Engineering, Federal University of Minas Gerais, Avenida Antônio Carlos 6627, 31270-901 Belo Horizonte, MG, Brazil

^{4}Department of Electronics Engineering, Federal University of Minas Gerais, Avenida Antônio Carlos 6627, 31270-901 Belo Horizonte, MG, Brazil

Received 15 May 2015; Accepted 17 August 2015

Academic Editor: Huaguang Zhang

Copyright © 2015 Euler Guimarães Horta et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Big Data problems demand data models able to handle time-varying, massive, and high dimensional data. In this context, Active Learning emerges as an attractive technique for the development of high performance models using few data. The importance of Active Learning for Big Data becomes more evident when labeling cost is high and data is presented to the learner via data streams. This paper presents a novel Active Learning method based on Extreme Learning Machines (ELMs) and Hebbian Learning. Linearization of input data by a large size ELM hidden layer makes our method largely insensitive to parameter settings. Overfitting is inherently controlled via the Hebbian Learning crosstalk term. We also demonstrate that a simple convergence test can be used as an effective labeling criterion, since it indicates the number of labels necessary for learning. The proposed method has inherent properties that make it highly attractive for handling Big Data: incremental learning via data streams, elimination of redundant patterns, and learning from a reduced informative training set. Experimental results have shown that our method is competitive with some large-margin Active Learning strategies and also with a linear SVM.

#### 1. Introduction

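The induction of Supervised Learning models relies on a large enough set $D_L = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ of pairs obtained by sampling $\mathbf{x}_i$ from the input space $\mathcal{X}$ according to a probability function $P(\mathbf{x})$ and by querying an oracle function $f_g(\mathbf{x}_i)$ for the labels $y_i$. The final goal of learning is to obtain the parameters $\mathbf{Z}$ and $\mathbf{w}$ of the approximation function $f(\mathbf{x}, \mathbf{Z}, \mathbf{w})$ so that $f(\mathbf{x}, \mathbf{Z}, \mathbf{w}) \approx f_g(\mathbf{x})$. Convergence conditions to guarantee that $f(\mathbf{x}, \mathbf{Z}, \mathbf{w}) \to f_g(\mathbf{x})$ in $\mathcal{X}$ depend on the representativeness and size $N$ of the learning set $D_L$. Reliable labeling of the input samples is of paramount importance to guarantee robustness of the approximation function $f(\mathbf{x}, \mathbf{Z}, \mathbf{w})$. In any case, $N$ should be large enough to guarantee convergence conditions.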

In Big Data problems, however, the availability of a large amount of data reveals itself as another important challenge for the induction of supervised models [1]. Learning using the entire dataset can be impracticable for most current supervised classifiers due to their time-consuming training procedures. The problem becomes more evident when labeling is difficult or expensive and data is presented to the learner via data streams [2]. Dealing with Big Data requires some technique to circumvent the need to consider the entire dataset in the learning process. In this context, the sampling probability can be controlled in order to induce good learning models using fewer patterns.

In Supervised Learning it is assumed that the learner has no control of the sampling probability $P(\mathbf{x})$. Nonetheless, the construction of learning machines that may influence $P(\mathbf{x})$ has become a central problem in recent years. This new subfield of Machine Learning, known as *Active Learning* [2], has received special attention due to the new learning settings that have appeared in application areas such as bioinformatics [3], electronic commerce [4], and video classification [5].

In the new setting, the learner is *active* and may actually choose the samples from a stream [6] or pool [7] of data to be labeled. The sample selection strategy embodied into the active learner determines the probability $P(\mathbf{x})$ of an input sample to be selected for labeling and learning. In the end, the goal of Active Learning is similar to that of Supervised Learning: to induce a learning function $f(\mathbf{x}, \mathbf{Z}, \mathbf{w})$ that is valid in the whole input domain by behaving as similarly as possible to the label-generator function $f_g(\mathbf{x})$. The goal is to select more representative samples that will result in $f(\mathbf{x}, \mathbf{Z}, \mathbf{w}) \approx f_g(\mathbf{x})$. For instance, in a classification problem those samples that are near the separation margin between classes may suffice [8, 9] if discriminative models like Support Vector Machine (SVM) [10] and Perceptron-based neural networks [11] are used.

Margin-based Active Learning has usually been accomplished by considering the simplistic linear separability of patterns in the input space [12–14]. Once a linear separator is obtained from the initial samples, further labeling is accomplished according to a preestablished criterion, usually related to sample proximity to the separator, which is simpler to calculate if the separator is linear. In a more realistic and general approach, however, a nonlinear separator should be considered, which requires that linearization be carried out by mapping the input data into a feature space, where sample selection is actually accomplished. The overall function $f(\mathbf{x}, \mathbf{Z}, \mathbf{w})$ is composed of the hidden layer mapping function $\psi(\mathbf{x}, \mathbf{Z})$ and the output function $\phi(\psi(\mathbf{x}, \mathbf{Z}), \mathbf{w})$. Since both functions are single layer, $\phi$ can only perform linear separation and $\psi$ is expected to linearize the problem. Nevertheless, the difficulty with the nonlinear approach is that, in order to obtain $\psi$, some sort of user interaction may be required.

In order to overcome the difficulty of obtaining a user-independent nonlinear feature space mapping, in this paper we present a method that is based on the principles of Extreme Learning Machines (ELMs) [15] to obtain the mapping function $\psi$. The basic principle of ELM is to randomly sample the elements of the hidden layer weight matrix $\mathbf{Z}$ and to expand the input space into a higher dimension in order to obtain $\psi$. This is the most fundamental difference between ELM and both feedforward neural networks and SVM, since in the latter two models the function $\psi$ is obtained by minimizing the output error. In practice, the only parameter required by the ELM projection is the dimension $p$ (number of neurons) of the feature space, to which its final performance is not very sensitive.

Although both ELM and SVM are based on two-layer mapping, SVM’s kernel provides an implicit mapping whereas ELM is based on the explicit mapping by the hidden layer sigmoidal functions [16, 17]. The two models also differ in the way that smoothing of the approximation function is treated, since SVM’s performance relies on support vectors and ELM’s output is computed considering the whole dataset. The Lagrangian solution of SVM’s quadratic programming learning problem yields Lagrange multipliers that, in practice, indicate the patterns (the support vectors) that will be used to compute SVM’s output. In fact, SVM’s output $\hat{y}_j$ for the input pattern $\mathbf{x}_j$ is a linear combination of the labels $y_i$ weighted by the kernel $K(\mathbf{x}_i, \mathbf{x}_j)$ between $\mathbf{x}_j$ and all other learning patterns $\mathbf{x}_i$: $\hat{y}_j = \operatorname{sign}\left(\sum_{i=1}^{N} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}_j) + b\right)$. The linear combination coefficients $\alpha_i$ are the Lagrange multipliers resulting from the solution of the quadratic programming problem, which was formulated with the objective of minimizing the empirical risk and maximizing the separation margin [10]. Since only those patterns with nonzero Lagrange multipliers effectively contribute to the computation of $\hat{y}_j$, SVM’s learning can be seen as a sample selection problem. Given the proper kernel parameters, the selection of margin patterns and the Lagrange multipliers yields error minimization and margin maximization [10]. In such a scenario, “discarded” samples are those assigned to null Lagrange multipliers and the “selected” ones, the support vectors, are those with Lagrange multipliers in the range $0 < \alpha_i \le C$, where $C$ is a regularization parameter.
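The role of the Lagrange multipliers as a sample selector can be illustrated with a minimal sketch. The Gaussian kernel, the function names, and the hand-picked multipliers below are assumptions made for illustration only; they are not taken from the paper and no quadratic programming problem is solved here.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two vectors; the kernel choice is an assumption."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

def svm_output(x, X_train, y_train, alpha, bias=0.0, gamma=1.0):
    """Sign of the kernel-weighted sum of labels over the learning patterns."""
    total = bias
    for x_i, y_i, a_i in zip(X_train, y_train, alpha):
        if a_i > 0.0:  # patterns with null multipliers contribute nothing
            total += a_i * y_i * rbf_kernel(x_i, x, gamma)
    return 1 if total >= 0.0 else -1

# Hypothetical toy values: the multipliers are hand-picked, not the result of training.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array([-1, 1, 1])
alpha = np.array([0.8, 0.8, 0.0])  # the third pattern is "discarded"

label_near_positive = svm_output([0.9, 0.9], X_train, y_train, alpha)
label_near_negative = svm_output([0.1, 0.1], X_train, y_train, alpha)
```

Removing the third pattern from `X_train`, `y_train`, and `alpha` leaves both outputs unchanged, which is the sense in which only the support vectors are "selected".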

SVM’s learning approach, however, cannot be directly applied to the Active Learning problem, since the whole dataset must be available at learning time so that the optimization problem can be solved. Active Learning methods should be capable of dealing with incremental and online learning [2, 14, 18], which is particularly convenient for Big Data problems. Nonetheless, the selection strategy presented in this paper aims at patterns near the class separation boundaries, which is expected to result in large-margin separators and in output function smoothing that may compensate for overparametrization [19] of the projection function $\psi$.

The mapping of the input data $\mathbf{X}$ into the feature space is expected to embody a linearly separable problem given a large enough number of projection neurons [20]. Once the mapping matrix $\mathbf{H}$ is obtained, the learning problem is reduced to selecting patterns and to inducing the parameters of the linear separator.

The pseudoinverse approach adopted by the original formulation of ELM [15] to obtain the linear separator results in overfitting when the number of selected patterns $N$ tends to the number of neurons $p$ ($N \approx p$). In such a situation, the number of equations is the same as the number of unknowns and the pseudoinverse yields a zero-error solution [15]. Consequently, an overfitted model is obtained due to the large number of hidden neurons required to separate the data with a random projection. Since in Active Learning the training set size will most likely reach the number of neurons as more patterns are labeled, the zero-error solution effect of the pseudoinverse is unwanted because it may result in a sudden decline in performance near the limit $N \approx p$. Because of that, an alternative to the pseudoinverse solution should be adopted in Active Learning problems.

Recently, Huang et al. [21] proposed a regularized version of ELM that can avoid the zero-error solution for $N = p$. For this formulation a regularization parameter should be fine-tuned, which can increase the cost of Active Learning, because some labeled patterns must be set aside for parameter tuning. In addition, since Active Learning is incremental, relearning the whole dataset for every new pattern can be prohibitive. At first sight, the Online Sequential Extreme Learning Machine (OS-ELM) [22] could be a good candidate for Active Learning, because it can learn data one by one or chunk by chunk. However, its formulation demands that the initial model be calculated using at least $p$ patterns. In this case $p$ can be large, which implies that the initial learning set should also be large. So, this is not the best option, because the main objective of Active Learning is to minimize the number of labeled patterns necessary to learn [2]. Because of that, in this paper we propose a new incremental learning approach to replace the pseudoinverse-based solutions. The method has an inherent residual term that compensates for the zero-error solution of the pseudoinverse and that can be viewed as implicit regularization.
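To make the contrast between the two batch solutions concrete, the sketch below compares plain pseudoinverse output weights with a ridge-style regularized solution of the form $(\mathbf{I}/C + \mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{y}$, which is the usual way a regularized ELM is written. The function names, the random stand-in for the hidden layer matrix, and the values of `C` are assumptions for illustration, not settings from the paper.

```python
import numpy as np

def pinv_weights(H, y):
    """Minimum-norm least-squares output weights via the pseudoinverse."""
    return np.linalg.pinv(H) @ y

def regularized_weights(H, y, C):
    """Ridge-style output weights: solve (I/C + H^T H) w = H^T y."""
    p = H.shape[1]
    A = np.eye(p) / C + H.T @ H
    return np.linalg.solve(A, H.T @ y)

rng = np.random.default_rng(0)
H = rng.normal(size=(50, 10))     # stand-in hidden layer matrix (N = 50, p = 10)
y = np.sign(rng.normal(size=50))  # random +/-1 labels, for illustration only

w_pinv = pinv_weights(H, y)
w_small_C = regularized_weights(H, y, C=0.1)   # strong regularization
w_large_C = regularized_weights(H, y, C=1e12)  # almost no regularization
```

A small `C` shrinks the weight norm, while a very large `C` approaches the pseudoinverse solution. Choosing `C` normally requires held-out labeled patterns, which is the tuning cost mentioned above.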

The method presented in this paper is a classifier composed of a large size ELM hidden layer and an output layer learned via a Hebbian Learning Perceptron with normalized weights [23]. The Active Learning strategy relies on a convergence test adapted from the Convergence Theorem of Perceptron [11, 24]. The learning process is stream-based and each pattern is analyzed once. It is also incremental and online, which is particularly suitable for Big Data. Experimental results have shown that the proposed Active Learning strategy achieved a performance similar to linear SVM with ELM kernel and to regularized ELM. Our approach, however, achieved this while learning from only a small part of the dataset.

The remainder of this paper is organized as follows: Section 2 describes the foundations of Extreme Learning Machines. Section 3 presents Hebbian Learning. Section 4 discusses how overfitting can be controlled using Hebbian Learning. Section 5 extends the Perceptron Convergence Theorem [11, 24] to Hebbian Learning with normalized weights [23]. Section 6 presents the principles of our Active Learning strategy. Experimental results are shown in Section 7. Finally, discussions and conclusions are provided in Section 8.

#### 2. Extreme Learning Machines

ELM can be seen as a learning approach to train a two-layer feedforward neural network of the Multilayer Perceptron (MLP) type [15]. The method has the following main characteristics: (1) the number of hidden neurons is large, (2) the hidden and output layers are trained separately, (3) hidden node parameters are not learned according to a general objective function but randomly chosen, and (4) output weights are not learned iteratively but obtained directly with the pseudoinverse method.

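The input matrix $\mathbf{X}$ with $N$ rows and $n$ columns contains the input training data, where $N$ is the number of samples and $n$ is the input space dimension. The rows of the $N \times 1$ vector $\mathbf{y}$ contain the corresponding labels of each one of the $N$ input samples of $\mathbf{X}$:

$$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_N^T \end{bmatrix}_{N \times n}, \qquad \mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}_{N \times 1}.$$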

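Function $\psi(\mathbf{x}, \mathbf{Z}, \mathbf{b})$, with argument $\mathbf{x}$, matrix of weights $\mathbf{Z}$ ($n \times p$), and vector of bias $\mathbf{b}$ ($p \times 1$), maps each one of the rows of $\mathbf{X}$ into the rows of the $N \times p$ mapping matrix $\mathbf{H}$, where $p$ is the number of hidden layer neurons ($\psi : \mathbb{R}^n \to \mathbb{R}^p$). Activation functions of all $p$ neurons are regular sum-and-sigmoid functions:

$$\mathbf{H} = \begin{bmatrix} \sigma(\mathbf{x}_1^T \mathbf{z}_1 + b_1) & \cdots & \sigma(\mathbf{x}_1^T \mathbf{z}_p + b_p) \\ \vdots & \ddots & \vdots \\ \sigma(\mathbf{x}_N^T \mathbf{z}_1 + b_1) & \cdots & \sigma(\mathbf{x}_N^T \mathbf{z}_p + b_p) \end{bmatrix}_{N \times p},$$

where $\mathbf{z}_j$ is the $j$th column of $\mathbf{Z}$, $b_j$ is the $j$th element of $\mathbf{b}$, and $\sigma(\cdot)$ is the sigmoid function.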

In the particular case of ELM, since the elements of $\mathbf{Z}$ and $\mathbf{b}$ are randomly sampled, the number of hidden neurons $p$ is expected to be large enough to meet the linear separability conditions of Cover’s theorem [20], so the projected data from $\mathbf{X}$ into $\mathbf{H}$ is assumed to be linearly separable.
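The random projection described above can be sketched in a few lines of NumPy. The sampling range $[-1, 1]$ for weights and biases, the function names, and the toy input are assumptions made for illustration; the text only states that the hidden layer parameters are randomly sampled.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid applied elementwise."""
    return 1.0 / (1.0 + np.exp(-a))

def random_hidden_layer(n_inputs, p, rng):
    """Sample hidden weights Z (n x p) and biases b (p,) once; they are never trained."""
    Z = rng.uniform(-1.0, 1.0, size=(n_inputs, p))  # sampling range is an assumption
    b = rng.uniform(-1.0, 1.0, size=p)
    return Z, b

def project(X, Z, b):
    """Map each row of X (N x n) to a row of H (N x p)."""
    return sigmoid(X @ Z + b)

rng = np.random.default_rng(42)
X = rng.normal(size=(6, 2))                            # six hypothetical 2-D patterns
Z, b = random_hidden_layer(n_inputs=2, p=100, rng=rng)
H = project(X, Z, b)                                   # shape (6, 100)
```

The only user-chosen quantity here is `p`, the number of hidden neurons, in line with the claim that the projection needs no further tuning.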

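Matrix $\mathbf{H}$ is then mapped into the output space by the function $\phi(\mathbf{H}, \mathbf{w})$ in order to approximate the label-vector $\mathbf{y}$. The $p \times 1$ vector $\mathbf{w}$ contains the parameters of the linear separator in the hidden layer and is obtained by solving a linear system of $N$ equations:

$$\mathbf{H}\mathbf{w} = \mathbf{y}.$$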

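The smallest norm least-squares solution of the above linear system is [15]

$$\mathbf{w} = \mathbf{H}^{+}\mathbf{y},$$

where $\mathbf{H}^{+}$ is the Moore-Penrose pseudoinverse of $\mathbf{H}$. The network response to an input pattern $\mathbf{x}$ is obtained by first calculating $\mathbf{h} = \psi(\mathbf{x}, \mathbf{Z}, \mathbf{b})$ and then by estimating the output as $\hat{y} = \operatorname{sign}(\mathbf{h}\mathbf{w})$ [21] for binary classification. For multiclass classification the output is estimated by choosing the output neuron with the highest response. In this paper we focus only on binary classification.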
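Put together, training and prediction with the pseudoinverse amount to one projection, one call to a pseudoinverse routine, and a sign. The sketch below assumes labels in $\{-1, +1\}$, a uniform sampling range for the hidden layer, and a toy dataset of two Gaussian blobs; all names and values are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def elm_fit(X, y, p, rng):
    """Random hidden layer followed by pseudoinverse output weights."""
    Z = rng.uniform(-1.0, 1.0, size=(X.shape[1], p))  # sampling range is an assumption
    b = rng.uniform(-1.0, 1.0, size=p)
    H = sigmoid(X @ Z + b)                            # N x p mapping matrix
    w = np.linalg.pinv(H) @ y                         # minimum-norm least-squares solution
    return Z, b, w

def elm_predict(X, Z, b, w):
    """Binary decision: sign of the linear output on the projected pattern."""
    H = sigmoid(X @ Z + b)
    return np.where(H @ w >= 0.0, 1, -1)

rng = np.random.default_rng(0)
# Hypothetical toy data: two Gaussian blobs labeled -1 and +1.
X = np.vstack([rng.normal(-2.0, 0.5, size=(20, 2)),
               rng.normal(2.0, 0.5, size=(20, 2))])
y = np.concatenate([-np.ones(20), np.ones(20)])

Z, b, w = elm_fit(X, y, p=10, rng=rng)
y_hat = elm_predict(X, Z, b, w)
```

Here `p` is kept well below the number of patterns, so the system is overdetermined and the pseudoinverse gives a least-squares fit rather than an exact one.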

As in any function approximation problem, the resulting general function $f(\mathbf{x}, \mathbf{Z}, \mathbf{b}, \mathbf{w})$ is expected to be robust to patterns $\mathbf{x}$ outside the learning set. However, the pseudoinverse zero-error least-squares solution results in overfitting of the oversized ELM when $N$ is close to $p$. Since the learning set is formed incrementally in Active Learning, $N$ will eventually reach $p$ as learning develops, which makes the use of the original formulation of ELM in this context impracticable, even if the heuristics of Schohn and Cohn [9] and Tong and Koller [8] are applied.

In order to show the performance degradation caused by the pseudoinverse, a sample selection strategy using ELM with 100 hidden neurons was applied to the dataset of Figure 1, which is a nonlinear binary classification problem with 180 samples of each class. The experiment was performed with 10-fold cross-validation and 10 runs. The learning process started using only one randomly chosen pattern, which was then projected into the hidden layer with random bias and weights. Output weights were obtained with the pseudoinverse, followed by the calculation of the Area Under the ROC Curve (AUC) [25] on the test set. Active Learning then continued with the random selection strategy: as new random patterns were added to the learning set, the projection and pseudoinverse calculations were repeated. Figure 2 shows the average AUC over all experiments. As can be observed, AUC performance degrades sharply in the region $N \approx p$, where the number of equations reaches the number of unknowns and the pseudoinverse solution of the linear system is exact.
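The shape of this experiment can be sketched as follows, under several stated simplifications: the dataset is a hypothetical disc-and-ring problem standing in for the one in Figure 1, a single 90/10 split replaces 10-fold cross-validation with 10 runs, and test accuracy replaces AUC. The hidden layer is sampled once and kept fixed, so projecting all patterns up front is equivalent to reprojecting the learning set at each step.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def make_two_class_data(n_per_class, rng):
    """Hypothetical nonlinear problem: inner disc (-1) surrounded by a ring (+1)."""
    r = np.concatenate([rng.uniform(0.0, 1.0, n_per_class),
                        rng.uniform(1.5, 2.5, n_per_class)])
    theta = rng.uniform(0.0, 2.0 * np.pi, 2 * n_per_class)
    X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
    y = np.concatenate([-np.ones(n_per_class), np.ones(n_per_class)])
    return X, y

def random_selection_curve(X_train, y_train, X_test, y_test, p, rng):
    """Test accuracy after each randomly chosen pattern joins the learning set."""
    Z = rng.uniform(-1.0, 1.0, size=(X_train.shape[1], p))  # range is an assumption
    b = rng.uniform(-1.0, 1.0, size=p)
    H_train = sigmoid(X_train @ Z + b)
    H_test = sigmoid(X_test @ Z + b)
    order = rng.permutation(len(y_train))      # random selection strategy
    accuracy = []
    for n in range(1, len(order) + 1):
        idx = order[:n]                        # learning set with n patterns
        w = np.linalg.pinv(H_train[idx]) @ y_train[idx]
        y_hat = np.where(H_test @ w >= 0.0, 1.0, -1.0)
        accuracy.append(np.mean(y_hat == y_test))
    return np.array(accuracy)

rng = np.random.default_rng(1)
X, y = make_two_class_data(180, rng)
perm = rng.permutation(len(y))
test_idx, train_idx = perm[:36], perm[36:]     # one 90/10 split, not full 10-fold CV
curve = random_selection_curve(X[train_idx], y[train_idx],
                               X[test_idx], y[test_idx], p=100, rng=rng)
```

Plotting `curve` against the learning set size gives a single-run counterpart of the averaged curve described above. No particular outcome is claimed for this sketch: how pronounced any drop near 100 patterns is will depend on the data, the weight range, the seed, and the numerical tolerance of the pseudoinverse routine.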