Computational Intelligence and Neuroscience

Volume 2016, Article ID 8091267, 17 pages

http://dx.doi.org/10.1155/2016/8091267

## Adaptive Online Sequential ELM for Concept Drift Tackling

Faculty of Computer Science, University of Indonesia, Depok, West Java 16424, Indonesia

Received 29 January 2016; Accepted 17 May 2016

Academic Editor: Stefan Haufe

Copyright © 2016 Arif Budiman et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

A machine learning method needs to adapt to changes in the environment that occur over time. Such changes are known as concept drift. In this paper, we propose a concept drift tackling method as an enhancement of Online Sequential Extreme Learning Machine (OS-ELM) and Constructive Enhancement OS-ELM (CEOS-ELM) by adding adaptive capability for classification and regression problems. The scheme is named adaptive OS-ELM (AOS-ELM). It is a single-classifier scheme that handles real drift, virtual drift, and hybrid drift well. The AOS-ELM also works well for the sudden drift and recurrent context change types. The scheme is a simple unified method implemented in a few lines of code. We evaluated AOS-ELM on regression and classification problems using public concept drift data sets (SEA and STAGGER) and other public data sets such as MNIST, USPS, and IDS. Experiments show that our method gives higher kappa values than a multiclassifier ELM ensemble. Even though AOS-ELM in practice does not need an increase in hidden nodes, we address some issues related to increasing the hidden nodes, such as error conditions and rank values. We propose taking the rank of the pseudoinverse matrix as an indicator parameter to detect the “underfitting” condition.

#### 1. Introduction

Data stream mining is a data mining technique in which the trained model is updated whenever new data arrive. The trained model must work in dynamic environments, where a vast amount of data is not only continuously generated but also keeps changing. This challenging issue is known as concept drift [1], in which the statistical properties of the input attributes and target classes shift over time. Such shifts can make the trained model less accurate.

Many methods for concept drift handling can be found in the literature [1]; their aim is to boost generalization accuracy. These methods pursue an accurate, simple, fast, and flexible way to retain classification performance when drift occurs. The ensemble classifier is a well-known way to retain classification performance: the combined decision of many single classifiers (mainly obtained through ensemble member diversification) is more accurate than that of a single classifier [2]. However, ensembles have higher complexity when handling multiple (consecutive) concept drifts.

One of the popular machine learning methods is the Extreme Learning Machine (ELM) introduced by Huang et al. [3–7]. The ELM is a Single-hidden-Layer Feedforward Neural Network (SLFN) with fast learning speed and good generalization capability.
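To make the ELM idea concrete, the following is a minimal NumPy sketch (not the authors' code) of the standard batch ELM: input weights and biases are drawn at random and kept fixed, and only the output weights are solved analytically via the pseudoinverse of the hidden layer matrix. The function names and the toy sine-regression task are our own illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_train(X, T, n_hidden=50):
    """Batch ELM: random fixed input weights, analytic output weights."""
    d = X.shape[1]
    A = rng.standard_normal((d, n_hidden))   # random input weights (never trained)
    b = rng.standard_normal(n_hidden)        # random biases (never trained)
    H = np.tanh(X @ A + b)                   # hidden layer output matrix
    beta = np.linalg.pinv(H) @ T             # output weights via pseudoinverse
    return A, b, beta

def elm_predict(X, A, b, beta):
    return np.tanh(X @ A + b) @ beta

# Toy regression: learn y = sin(x) on [-3, 3].
X = np.linspace(-3, 3, 200).reshape(-1, 1)
T = np.sin(X)
A, b, beta = elm_train(X, T)
pred = elm_predict(X, A, b, beta)
print(np.mean((pred - T) ** 2))  # small training MSE
```

Because only `beta` is learned, training reduces to a single least-squares solve, which is the source of ELM's fast learning speed.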

In this paper, we focus on a learning adaptation method as an enhancement to Online Sequential Extreme Learning Machine (OS-ELM) [8] and Constructive Enhancement OS-ELM (CEOS-ELM) [9]. We name it adaptive OS-ELM (AOS-ELM). The AOS-ELM can handle multiple concept drift problems: changes in the number of attributes (virtual drift/VD), changes in the number of target classes (real drift/RD), or both at the same time (hybrid drift/HD), as well as recurrent context (all concepts occur alternately) and sudden drift (a new concept substitutes the previous concepts) [10]. The scope of attribute changes discussed in this paper is feature space concatenation, widely used in data fusion, kernel fusion, and ensemble learning [11], not feature selection (irrelevant feature removal) methods [12]. We compared the performance with nonadaptive sequential ELM (OS-ELM and CEOS-ELM) and with ELM classifier ensembles, the common adaptive approach to concept drift. Although the present study focuses on the adaptation aspect, we also address some change detection mechanisms that suit our method.
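For context, OS-ELM [8] extends batch ELM with a recursive least-squares update, so the model can be refreshed chunk by chunk without retraining from scratch. The following is a minimal NumPy sketch of that sequential update (our own illustration, not the authors' implementation; the class name `OSELM` and the chunk sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

class OSELM:
    """Minimal OS-ELM: initial batch solve, then recursive least-squares updates."""

    def __init__(self, n_inputs, n_hidden, n_outputs):
        self.A = rng.standard_normal((n_inputs, n_hidden))  # fixed random input weights
        self.b = rng.standard_normal(n_hidden)              # fixed random biases

    def _hidden(self, X):
        return np.tanh(X @ self.A + self.b)

    def init_fit(self, X0, T0):
        # Initialization stage; requires at least n_hidden rows in X0.
        H = self._hidden(X0)
        self.P = np.linalg.inv(H.T @ H)
        self.beta = self.P @ H.T @ T0

    def partial_fit(self, X, T):
        # Sequential stage: rank-k recursive least-squares update per chunk.
        H = self._hidden(X)
        K = np.linalg.inv(np.eye(len(X)) + H @ self.P @ H.T)
        self.P = self.P - self.P @ H.T @ K @ H @ self.P
        self.beta = self.beta + self.P @ H.T @ (T - H @ self.beta)

    def predict(self, X):
        return self._hidden(X) @ self.beta

# Stream a toy regression problem in three chunks.
X = rng.standard_normal((120, 3))
T = X @ rng.standard_normal((3, 2))
model = OSELM(3, 20, 2)
model.init_fit(X[:60], T[:60])
model.partial_fit(X[60:90], T[60:90])
model.partial_fit(X[90:], T[90:])
```

When the initial hidden matrix has full column rank, the sequential solution coincides with the batch least-squares solution on all data seen so far, which is what makes the method exact rather than approximate.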

A preliminary version of RD and its early results appeared in conference proceedings [14]. In this paper, we introduce new scenarios in VD, HD, and consecutive drifts, in both recurrent and sudden drift settings, together with the theoretical background. Our main contributions in this research area can be summarized as follows:

(1) We propose a simple adaptive method as an enhancement to OS-ELM and CEOS-ELM for addressing the concept drift issue. Unlike ensemble systems [6, 13], which must manage a complex combination of a vast number of classifiers, we pursue a single classifier for simple implementation while retaining comparable performance for handling multiple (consecutive) drifts.

(2) We introduce a simple unified platform to handle a hybrid drift (HD), in which changes in the number of attributes and the number of target classes occur at the same time.

(3) We elaborate how AOS-ELM performs transfer learning using the hybrid drift strategy. Transfer learning focuses on extracting the knowledge from one or more source task domains and applying it to a different target task domain [15]. Concept drift focuses on a time-varying domain with a small amount of current data available; in contrast, transfer learning is not associated with time and requires the entire training and testing data set [16]. An example of transfer learning by the HD strategy is the transition between different but related data set sources with the same purpose. In this paper, we discuss transfer learning from numeric handwritten MNIST [17] to alphanumeric handwritten USPS [18] recognition.

(4) The AOS-ELM handling strategy is naturally based on recurrent context. We devise an AOS-ELM strategy to handle the sudden drift scenario by introducing an output marginalization method. This method is also applicable to concept drift in a regression problem.

(5) We study the effect of increasing the number of hidden nodes, which is treated as one of the learning parameters, to improve accuracy (other learning parameters are input weight, bias, activation function, and regularization factor). We propose an evaluation parameter that predicts the accuracy before training is completed. We apply this assessment parameter to prevent the “underfitting” or nonconvergence condition (the model does not fit the data well enough, so accuracy drops) when any learning parameter changes, such as an increase in hidden nodes.
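The general idea behind output marginalization in contribution (4) can be illustrated with a toy sketch. We stress this is our own simplified reading, not the paper's exact formulation: a single network keeps output nodes for every concept it has seen, and at prediction time only the output columns belonging to the currently active concept are considered. The variable names and the two-concept layout below are hypothetical.

```python
import numpy as np

# Hypothetical raw outputs of one network whose output layer holds class
# nodes for two concepts: columns 0-2 for Concept 1, columns 3-5 for Concept 2.
concept_slices = {1: slice(0, 3), 2: slice(3, 6)}

def predict_marginalized(raw_output, active_concept):
    """Take the argmax only over the output columns of the active concept."""
    s = concept_slices[active_concept]
    return np.argmax(raw_output[:, s], axis=1)

raw = np.array([[0.1, 0.7, 0.2, 0.9, 0.0, 0.1],
                [0.8, 0.1, 0.1, 0.2, 0.3, 0.5]])
print(predict_marginalized(raw, 1))  # -> [1 0]
print(predict_marginalized(raw, 2))  # -> [0 2]
```

Under a sudden drift, switching `active_concept` effectively substitutes the new concept for the old one without discarding the trained weights of either.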

This paper is organized as follows. Section 2 explains some issues and challenges in concept drift, the background of ELM, and ELM in sequential learning. Section 3 presents the background theory and the algorithm derivation of the proposed method. In Section 4, we focus on empirical experiments that address the methods and research questions in regression and classification problems, using artificial and real data sets. The artificial data sets are the streaming ensemble algorithm (SEA) [19] and STAGGER [20], which are commonly used as benchmarks in sequential learning. The real data sets are handwritten recognition data: MNIST for numeric [17] and USPS for alphanumeric classes [18]. We study the effect of increasing hidden nodes, one of the important learning parameters, in Section 4.5. Section 7 discusses research challenges and future directions. Section 8 presents the conclusion.

#### 2. Related Works

##### 2.1. Notations

We specify the notations used throughout this article for easier understanding as follows:

(i) A matrix is written in uppercase bold (e.g., **X**).

(ii) A vector is written in lowercase bold (e.g., **x**).

(iii) The transpose of a matrix **X** is written as **X**ᵀ. The pseudoinverse of a matrix **H** is written as **H**†.

(iv) g(·) will be used as a nonlinear differentiable function (activation function), for example, the sigmoid or tanh function.

(v) The amount of training data is N. Each input datum contains d attributes. The target has m classes. An input matrix can be denoted as **X** and the target matrix as **T**.

(vi) The hidden layer matrix is **H**. The input weight matrix is **A**. The output weight matrix is **β**. The matrix Δ**H** is the additional block portion of the matrix **H**. The matrix **K** = **H**ᵀ**H** is the autocorrelation matrix of **H**. The inverse of matrix **K** is **P**.

(vii) **H** can be denoted as **H**(L), where L is the number of hidden nodes; likewise **K** can be denoted as **K**(L) and **P** can be denoted as **P**(L). ΔL denotes the number of additional hidden nodes added to L.

(viii) When the number of training data N is large, we employ the online sequential learning method, updating the model every time new training pairs are seen. **X**(0) is the subset of input data at time t(0), used as the initialization stage. **X**(1), **X**(2), ..., **X**(k) are the subsets of input data at the next sequential times. Each subset may have a different number of samples. The corresponding label data are presented as **T**(0), **T**(1), ..., **T**(k). We use subscripts with parentheses to show the sequence number.

(ix) We denote the training data from different concepts (sources or contexts) using the symbol **X**s for training data and **T**s for target data. We use subscripts without parentheses to show the source number.

(x) We denote the drift event using the symbol ≫, where the subscript shows the drift type. For example, Concept 1 undergoing a virtual drift event and being replaced by Concept 2 (sudden drift) is written as C1 ≫VD C2. Concept 1 undergoing a real drift event and being replaced by Concept 1 and Concept 2 recurrently (recurrent context) in shuffled composition is written as C1 ≫RD (C1, C2).

##### 2.2. Concept Drift Strategies

In this section, we briefly explain the various concept drift solution strategies.

Gama et al. [1] explained that many concept drift methods have been developed, but the terminology is not well established. According to Gama et al., the basic concept drift formulation, based on Bayesian decision theory, for the classification problem with class output c and incoming data X is the posterior probability

p(c | X) = p(c) p(X | c) / p(X).

Concept drift occurs when p(X, c) has changed; for example, p(t0)(X, c) ≠ p(t1)(X, c), where p(t0) and p(t1) are, respectively, the joint distributions at times t0 and t1. Gama et al. categorized the concept drift types as follows:

(1) Real drift (RD) refers to changes in p(c | X). The change in p(c | X) may be caused by a change in the class boundary (the number of classes) or in the class conditional probabilities (likelihood) p(X | c). When the number of classes expands and data of different classes come alternately, this is known as recurrent context. A drift in which new conditional probabilities replace the previous ones while the number of classes remains the same is known as sudden drift. Other terms are concept shift or conditional change [21].

(2) Virtual drift (VD) refers to changes in the distribution of the incoming data (e.g., p(X) changes). These changes may be due to an incomplete or partial feature representation of the current data distribution. The trained model is built with additional data from the same environment without overlapping the true class boundaries. Other terms are feature change [21], temporary drift, or sampling shift.
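The RD/VD distinction above can be made concrete with a toy stream (our own illustration; the Gaussian inputs and threshold rule are arbitrary assumptions). Under virtual drift, only the input distribution p(X) shifts while the labeling rule p(c | X) is unchanged; under real drift, the inputs are unchanged but the class boundary itself moves.

```python
import numpy as np

rng = np.random.default_rng(2)

# Base concept: x ~ N(0, 1), label c = 1 iff x > 0.
x0 = rng.standard_normal(1000)
c0 = (x0 > 0).astype(int)

# Virtual drift: p(x) shifts (inputs now centred at +2) but the
# labeling rule c = 1 iff x > 0, i.e., p(c | x), is unchanged.
x_vd = rng.standard_normal(1000) + 2.0
c_vd = (x_vd > 0).astype(int)

# Real drift: p(x) is unchanged but the class boundary moves to x > 1,
# so p(c | x) itself has changed.
x_rd = rng.standard_normal(1000)
c_rd = (x_rd > 1).astype(int)

print(x0.mean(), x_vd.mean())  # input mean shifts under VD
print(c0.mean(), c_rd.mean())  # class prior shifts under RD
```

A model trained on the base concept keeps its decision boundary valid under the VD stream but becomes systematically wrong on part of the RD stream, which is why the two drift types call for different adaptations.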

Kuncheva [10, 22] explained the various configuration patterns of data sources over time: random noise, random trends (gradual changes), random substitutions (abrupt or sudden changes), and systematic trends (recurring contexts). Random noise should simply be filtered out. A gradual drift occurs when several concepts reoccur alternately during a transition stage for a certain period. A consecutive drift takes place when previously active concepts keep changing alternately (recurring context) after some time. A sudden drift (abrupt change or concept substitution) occurs when, at one point in time, one concept is suddenly replaced by another.

Žliobaitė [13] proposed a taxonomy of concept drift tackling methods, shown in Figure 1. It organizes the methods by when the model is updated (the “when” axis) and how the learners adapt, through training set formation or through design and parametrization of the base learner (the “how” axis). The “when” axis spans drift handling from trigger-based to evolving-based methods. The “how” axis spans drift handling from training set formation to model manipulation (or parametrization) methods.