Mathematical Problems in Engineering

Volume 2015, Article ID 412957, 11 pages

http://dx.doi.org/10.1155/2015/412957

## One-Class Classification with Extreme Learning Machine

^{1}School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing 101408, China

^{2}Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China

Received 13 August 2014; Revised 8 November 2014; Accepted 10 November 2014

Academic Editor: Zhan-li Sun

Copyright © 2015 Qian Leng et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The one-class classification problem has been investigated thoroughly over the past decades. As one of the most effective neural network approaches to one-class classification, the autoencoder has been successfully applied in many applications. However, this classifier relies on traditional learning algorithms such as backpropagation to train the network, which is quite time-consuming. To tackle the slow learning speed of the autoencoder neural network, we propose a simple and efficient one-class classifier based on the extreme learning machine (ELM). The essence of ELM is that the hidden layer need not be tuned and the output weights can be determined analytically, which leads to much faster learning. The experimental evaluation conducted on several real-world benchmarks shows that the ELM-based one-class classifier can learn hundreds of times faster than the autoencoder and is competitive with a variety of one-class classification methods.

#### 1. Introduction

One-class classification [1, 2], also known as novelty or outlier detection, has received much interest in recent years. Unlike in ordinary classification, only the data samples from one class, called the target class, are well characterized, while there are no or few samples from the other class (also called the outlier class). To illustrate the need for one-class classification, consider an online shopping service. To recommend goods that users want, it is convenient to track the users' shopping histories (positive training samples), whereas collecting negative training samples is challenging because it is hard to say which items users dislike. Other applications include machine fault detection [3], disease detection [4], and credit scoring [5]. The goal is to "teach" the classifier by observing target samples so that it can select unknown samples similar to the target class and reject samples that deviate significantly from the target class.

Various types of one-class classifiers have been designed and applied in different fields; see [6] for a comprehensive review. An early approach to obtaining a one-class classifier is to estimate the probability density function from the training data. Parzen density estimation [7, 8] superposes kernel functions on individual training samples to estimate the probability density function. Naive Parzen density estimation, similar to the Naive Bayes approach used for classification, fits a Parzen density estimate to each individual feature and multiplies the results to obtain the final density estimate. A test sample is rejected if its estimated probability is below a threshold. However, estimating the true density distribution usually requires a large number of training samples.

A simpler task is to find the domain of the data distribution. Schölkopf et al. [9] constructed a hyperplane which is maximally distant from the origin to separate the regions that contain no data. An alternative approach is to find a hypersphere [10] instead of a hyperplane that includes most of the target data with the minimum radius. Both approaches are cast as quadratic programming problems, while some approaches [11–13] are formulated as linear programs. The one-class LP classifier [11] minimizes the volume of the prism, which is cut by a hyperplane that bounds the data from above with some mild constraints on dissimilarity representations. Lanckriet et al. [13] proposed the one-class minimax probability machine, which minimizes the worst-case probability of misclassification of test data using only the mean and covariance matrix of the target distribution. When kernel methods are used, the aforementioned domain-based classifiers [2] can obtain more flexible descriptions. Recently, a minimum spanning tree based one-class classifier [14] was proposed. It considers graph edges as an additional set of virtual target objects. Once the minimum spanning tree is constructed, a new test sample is recognized according to its shortest distance to the closest edge of that tree.

The autoencoder neural network is one of the reconstruction methods [1] for building a one-class classifier. The simplest architecture of such a model is based on single-hidden-layer feed-forward neural networks (SLFNs). Usually, the hidden layer contains fewer nodes than the input layer, so it acts as an information bottleneck. The classifier reproduces the input patterns at the output layer by minimizing the reconstruction error. However, the standard backpropagation (BP) algorithm is used to train the network, which is quite time-consuming. Extreme learning machine [15, 16] was originally developed to address the slow learning speed of gradient-based learning algorithms, which stems from their iterative tuning of the network parameters. It randomly selects all parameters of the hidden neurons and analytically determines the output weights. It is stated [17, 18] in theory that ELM tends to provide the best generalization performance at extreme learning speed since it is a simple tuning-free algorithm.

In this paper, a one-class classifier based on ELM is constructed for situations where only the target class is well described. The proposed one-class classifier utilizes the unified ELM learning theory [17], which leads to extreme learning speed and superior generalization performance. Moreover, the classifier further lessens human intervention since it is not limited to specific target labels. Both random feature mappings and kernels can be adopted for the classifier, which makes it more flexible in describing different targets. Constructing the proposed classifier for three quite different, specially designed artificial datasets demonstrates the classifier's ability to describe universal target class distributions. When real-world datasets are evaluated, the proposed one-class classifier is competitive with a variety of one-class models and learns hundreds of times faster than the autoencoder neural network for one-class classification.

The rest of the paper is organized as follows. Section 2 briefly reviews extreme learning machine. In Section 3, we first describe the hypersphere perceptron as a one-class classifier and then introduce our proposed ELM based one-class classifier. Section 4 describes the experiments conducted on both artificial and real-world datasets. Finally, Section 5 presents the conclusion of the work.

#### 2. Brief Review of ELM

ELM aims to reach not only the smallest training error but also the smallest norm of the output weights [16] between the hidden layer and the output layer. According to Bartlett's theory [19], the smaller the norm of the weights, the better the generalization performance the network tends to have. Thus, better generalization performance can be expected for ELM networks. In [17], equality constraints are used in ELM, which provides a unified solution for regression, binary, and multiclass classification.

##### 2.1. Equality-Optimization-Constraints-Based ELM

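Given training data $\{(\mathbf{x}_i, t_i)\}_{i=1}^{N}$, where $\mathbf{x}_i \in \mathbb{R}^{d}$ is the individual feature vector with dimension $d$ and $t_i \in \mathbb{R}^{m}$ is the desired target output, a single output node ($m = 1$) is enough in the one-class classification case. The ELM output function can be formulated as

$$f(\mathbf{x}) = \sum_{i=1}^{L} \beta_i G(\mathbf{a}_i, b_i, \mathbf{x}) = \mathbf{h}(\mathbf{x})\boldsymbol{\beta}, \tag{1}$$

where $\boldsymbol{\beta} = [\beta_1, \ldots, \beta_L]^{T}$ is the vector of the output weights between the hidden layer and the output layer, $\mathbf{a}_i$ is the vector of input weights connecting the input nodes with the $i$th hidden node, $b_i$ is the bias of the $i$th hidden node, $\mathbf{h}(\mathbf{x}) = [G(\mathbf{a}_1, b_1, \mathbf{x}), \ldots, G(\mathbf{a}_L, b_L, \mathbf{x})]$ is the output vector of the hidden layer with respect to input $\mathbf{x}$, and $G$ is the activation function (e.g., the sigmoid function) satisfying ELM universal approximation capability theorems [20, 21]. In fact, $\mathbf{h}(\mathbf{x})$ is a known nonlinear feature mapping which maps the training data from the $d$-dimensional input space to the $L$-dimensional ELM feature space [17]. The goal of ELM is to minimize the norm of the output weights as well as the training errors, which is equivalent to

$$\begin{aligned} \text{Minimize: } & L_{P} = \frac{1}{2}\|\boldsymbol{\beta}\|^{2} + \frac{C}{2}\sum_{i=1}^{N}\xi_i^{2} \\ \text{Subject to: } & \mathbf{h}(\mathbf{x}_i)\boldsymbol{\beta} = t_i - \xi_i, \quad i = 1, \ldots, N, \end{aligned} \tag{2}$$

where $\xi_i$ is the slack variable of the training sample $\mathbf{x}_i$ and $C$ controls the tradeoff between the output weights and the errors. Based on the Karush-Kuhn-Tucker (KKT) theorem [22], the corresponding Lagrange function of the primal ELM optimization (2) is

$$L_{D} = \frac{1}{2}\|\boldsymbol{\beta}\|^{2} + \frac{C}{2}\sum_{i=1}^{N}\xi_i^{2} - \sum_{i=1}^{N}\alpha_i\left(\mathbf{h}(\mathbf{x}_i)\boldsymbol{\beta} - t_i + \xi_i\right). \tag{3}$$

The following optimality conditions of (3) should be satisfied:

$$\frac{\partial L_{D}}{\partial \boldsymbol{\beta}} = 0 \;\Longrightarrow\; \boldsymbol{\beta} = \sum_{i=1}^{N}\alpha_i \mathbf{h}(\mathbf{x}_i)^{T} = \mathbf{H}^{T}\boldsymbol{\alpha}, \tag{4a}$$

$$\frac{\partial L_{D}}{\partial \xi_i} = 0 \;\Longrightarrow\; \alpha_i = C\xi_i, \quad i = 1, \ldots, N, \tag{4b}$$

$$\frac{\partial L_{D}}{\partial \alpha_i} = 0 \;\Longrightarrow\; \mathbf{h}(\mathbf{x}_i)\boldsymbol{\beta} - t_i + \xi_i = 0, \quad i = 1, \ldots, N, \tag{4c}$$

where $\mathbf{H} = [\mathbf{h}(\mathbf{x}_1)^{T}, \ldots, \mathbf{h}(\mathbf{x}_N)^{T}]^{T}$ is the hidden layer output matrix and $\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_N]^{T}$ is the vector of Lagrange variables. Substituting (4a) and (4b) into (4c), we have

$$\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)\boldsymbol{\alpha} = \mathbf{T}. \tag{5}$$

Here $\mathbf{I}$ is the identity matrix and $\mathbf{T} = [t_1, \ldots, t_N]^{T}$. Substituting (5) into (4a), we get

$$\boldsymbol{\beta} = \mathbf{H}^{T}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{T}. \tag{6}$$

The ELM output function (1) can be further derived as

$$f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\boldsymbol{\beta} = \mathbf{h}(\mathbf{x})\mathbf{H}^{T}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{T}. \tag{7}$$

If the hidden nodes' feature mapping $\mathbf{h}(\mathbf{x})$ is unknown to users, kernel methods that satisfy Mercer's condition can be adopted: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{h}(\mathbf{x}_i) \cdot \mathbf{h}(\mathbf{x}_j)$. The ELM kernel output function can be written as

$$f(\mathbf{x}) = \left[K(\mathbf{x}, \mathbf{x}_1), \ldots, K(\mathbf{x}, \mathbf{x}_N)\right]\left(\frac{\mathbf{I}}{C} + \boldsymbol{\Omega}\right)^{-1}\mathbf{T}, \tag{8a}$$

and the kernel matrix for ELM is

$$\boldsymbol{\Omega} = \mathbf{H}\mathbf{H}^{T}: \quad \Omega_{i,j} = \mathbf{h}(\mathbf{x}_i) \cdot \mathbf{h}(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_j). \tag{8b}$$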
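The closed-form solution of this subsection can be illustrated with a short NumPy sketch. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the uniform range of the random hidden parameters, the number of hidden nodes, the value of $C$, and the toy regression data are all hypothetical choices made for the example.

```python
import numpy as np

def sigmoid_features(X, A, b):
    """Hidden-layer output matrix H: row i is h(x_i) for sigmoid hidden nodes."""
    return 1.0 / (1.0 + np.exp(-(X @ A + b)))

def elm_fit(X, T, n_hidden=50, C=10.0, seed=0):
    """Draw random hidden parameters and solve for the output weights beta."""
    rng = np.random.default_rng(seed)
    # Input weights a_i and biases b_i are random and never tuned
    # (the uniform range [-1, 1] is an assumption of this sketch).
    A = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = sigmoid_features(X, A, b)                      # shape (N, L)
    # Solve (I/C + H H^T) alpha = T, then beta = H^T alpha.
    alpha = np.linalg.solve(np.eye(X.shape[0]) / C + H @ H.T, T)
    beta = H.T @ alpha
    return A, b, beta

def elm_predict(X, A, b, beta):
    """ELM output f(x) = h(x) beta for every row of X."""
    return sigmoid_features(X, A, b) @ beta

# Toy data, purely illustrative.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(40, 2))
T_train = np.sin(X_train[:, 0])
A, b, beta = elm_fit(X_train, T_train)
residual = T_train - elm_predict(X_train, A, b, beta)  # slack variables xi_i
```

Solving the linear system with `np.linalg.solve` instead of forming the matrix inverse explicitly is a numerical design choice; the result is the same output weight vector.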

##### 2.2. Advances of ELM

Extreme learning machine has gained much popularity since its advent. It avoids the time complexity problem that classic learning techniques face while providing better generalization performance with less human intervention. Because of these attractive features, researchers have extended the basic ELM in several different directions, and many variants of ELM have been developed. For instance, online sequential ELM (OS-ELM) [23, 24] can learn sequentially arriving data (one by one or chunk by chunk) with little effort to update the output weights. The training data are discarded after being learned by the network and the output weights need not be retrained, which is especially efficient for time-series problems. Other typical works include fully complex ELM [25, 26], incremental ELM (I-ELM) [20, 21], sparse ELM [27], ELM with elastic output [28, 29], and ELM ensembles [30–32]. See [33] for further details on the many ELM variants.

When uncertainty is present in the dataset, integrating a fuzzy logic system with extreme learning machine tends to enhance the generalization capability of ELM. In [34], a neurofuzzy Takagi-Sugeno-Kang (TSK) fuzzy inference system is constructed utilizing extreme learning machine. The number of inference rules is determined beforehand by the $k$-means method. One ELM is used to obtain the membership of each fuzzy rule and multiple ELMs are used to obtain the consequent part. Rong et al. [35] show that the type-1 fuzzy inference system (type-1 FLS) is equivalent to a generalized SLFN. Hence, the hidden nodes work as the antecedent part and the output weights as the consequent part. Extreme learning machine is then directly applied to the type-1 FLS, and the corresponding online sequential fuzzy ELM has also been developed. Deng et al. [36] further extend the idea to the type-2 fuzzy inference system (type-2 FLS) because of its superiority in modeling high-level uncertainty. With the most widely used interval type-2 FLS, the parameters of the antecedents are randomly initialized according to the ELM mechanism. The Moore-Penrose generalized inverse is used to initialize the parameters of the consequents, and the parameters are finally refined by the Karnik-Mendel algorithm [37]. Many applications have also been investigated in the literature. For example, the hybrid model of ELM with interval type-2 FLS has been applied to permeability prediction [38].

#### 3. The Proposed One-Class Classifier

##### 3.1. Support Vector Data Description

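For a better understanding of one-class classifiers, support vector data description (SVDD) [10] is discussed here as an example of the one-class classification process. SVDD defines a spherically shaped boundary around the complete target set and is intuitively appealing since it regards the target class as a self-closed system. Let $X = \{\mathbf{x}_i\}_{i=1}^{N}$ be the training set, where each $\mathbf{x}_i$ is drawn from the target distribution. SVDD aims to minimize the volume of the sphere as well as the training errors for objects falling outside the boundary, which is equivalent to

$$\begin{aligned} \text{Minimize: } & F(R, \mathbf{a}, \xi_i) = R^{2} + C\sum_{i=1}^{N}\xi_i \\ \text{Subject to: } & \|\mathbf{x}_i - \mathbf{a}\|^{2} \le R^{2} + \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, N, \end{aligned} \tag{9}$$

where $R$ and $\mathbf{a}$ are the hypersphere's radius and center, respectively. Parameter $C$ controls the tradeoff between the volume and the errors.

The corresponding Lagrange function of the primal SVDD optimization (9) is

$$L(R, \mathbf{a}, \xi_i, \alpha_i, \gamma_i) = R^{2} + C\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\alpha_i\left(R^{2} + \xi_i - \|\mathbf{x}_i - \mathbf{a}\|^{2}\right) - \sum_{i=1}^{N}\gamma_i\xi_i, \tag{10}$$

with the Lagrange variables $\alpha_i \ge 0$ and $\gamma_i \ge 0$. $L$ should be minimized with respect to $R$, $\mathbf{a}$, and $\xi_i$ and maximized with respect to $\alpha_i$ and $\gamma_i$.

Based on the Karush-Kuhn-Tucker (KKT) theorem [22], to get the optimal solutions of (10), we should have

$$\frac{\partial L}{\partial R} = 0 \;\Longrightarrow\; \sum_{i=1}^{N}\alpha_i = 1, \tag{11a}$$

$$\frac{\partial L}{\partial \mathbf{a}} = 0 \;\Longrightarrow\; \mathbf{a} = \sum_{i=1}^{N}\alpha_i\mathbf{x}_i, \tag{11b}$$

$$\frac{\partial L}{\partial \xi_i} = 0 \;\Longrightarrow\; C - \alpha_i - \gamma_i = 0. \tag{11c}$$

From (11c) together with $\alpha_i \ge 0$ and $\gamma_i \ge 0$, $\gamma_i$ can be removed and $\alpha_i$ can be further limited to the interval $[0, C]$:

$$0 \le \alpha_i \le C. \tag{12}$$

Substituting (11a)–(11c) into (10), the dual optimization function can be derived as

$$\text{Maximize: } L = \sum_{i=1}^{N}\alpha_i(\mathbf{x}_i \cdot \mathbf{x}_i) - \sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j(\mathbf{x}_i \cdot \mathbf{x}_j), \tag{13}$$

subject to constraints (11a) and (12). To constitute a flexible data description model, a kernel function $K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$, with an implicit feature mapping $\Phi$ of the data into a higher dimensional feature space, can be adopted to replace the inner product $(\mathbf{x}_i \cdot \mathbf{x}_j)$. In this case, the corresponding dual optimization function is changed to

$$\text{Maximize: } L = \sum_{i=1}^{N}\alpha_i K(\mathbf{x}_i, \mathbf{x}_i) - \sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j K(\mathbf{x}_i, \mathbf{x}_j). \tag{14}$$

The KKT conditions of the target functions are

$$\alpha_i\left(R^{2} + \xi_i - \|\mathbf{x}_i - \mathbf{a}\|^{2}\right) = 0, \qquad \gamma_i\xi_i = 0. \tag{15}$$

The constraints have to be enforced and we have three cases as follows:

(1) $$\|\mathbf{x}_i - \mathbf{a}\|^{2} < R^{2} \;\Longrightarrow\; \alpha_i = 0,\ \gamma_i = C; \tag{16}$$

(2) $$\|\mathbf{x}_i - \mathbf{a}\|^{2} = R^{2} \;\Longrightarrow\; 0 < \alpha_i < C,\ 0 < \gamma_i < C; \tag{17}$$

(3) $$\|\mathbf{x}_i - \mathbf{a}\|^{2} > R^{2} \;\Longrightarrow\; \alpha_i = C,\ \gamma_i = 0. \tag{18}$$

Only the small fraction of objects with $\alpha_i > 0$ are called the support vectors. The dual optimization functions (13) and (14) are standard quadratic programming (QP) problems, and the Lagrange variables can be obtained using optimization methods such as the SMO algorithm [39]. To test a new object $\mathbf{z}$, its distance to the center of the sphere is calculated. The classifier will accept the object if the distance is less than or equal to the radius:

$$\|\mathbf{z} - \mathbf{a}\|^{2} = K(\mathbf{z}, \mathbf{z}) - 2\sum_{i=1}^{N}\alpha_i K(\mathbf{z}, \mathbf{x}_i) + \sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j K(\mathbf{x}_i, \mathbf{x}_j) \le R^{2}. \tag{19}$$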

In addition to the batch learning model of SVDD, incremental learning methods [40] for SVM have been extended to the SVDD algorithm. Yin et al. [28] show an online fault diagnosis process based on a hybrid model of incremental SVDD (ISVDD) and ELM with incremental output structure (IOELM). They use the ISVDD to detect an unknown failure mode, and the output nodes of IOELM are adaptively increased to recognize the new failure mode.
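The SVDD acceptance test of this subsection can be sketched in NumPy once the Lagrange variables are known. The example below is an illustrative sketch under stated assumptions: it does not solve the QP, the function names are hypothetical, a linear kernel is used, and the toy dataset (four corners of a square plus its midpoint) is chosen because one optimal dual solution is known by symmetry, namely a weight of 1/4 on each corner and 0 on the interior point.

```python
import numpy as np

def linear_kernel(A, B):
    """Gram matrix of inner products between the rows of A and the rows of B."""
    return A @ B.T

def svdd_sq_distance(Z, X, alpha, kernel):
    """Squared feature-space distance from each row of Z to the centre
    a = sum_i alpha_i * Phi(x_i), written with kernel evaluations only."""
    k_zz = np.diag(kernel(Z, Z))
    k_zx = kernel(Z, X)
    k_xx = kernel(X, X)
    return k_zz - 2.0 * (k_zx @ alpha) + alpha @ k_xx @ alpha

# Toy training set: four corners of a square and its midpoint.
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1], [0, 0]], dtype=float)
# Assumed dual solution (known by symmetry, not computed by a QP solver):
# the corners are boundary support vectors, the midpoint lies inside.
alpha = np.array([0.25, 0.25, 0.25, 0.25, 0.0])

# The squared radius is the distance of any boundary support vector to the centre.
R2 = svdd_sq_distance(X[:1], X, alpha, linear_kernel)[0]

# Accept a test object when its squared distance does not exceed R^2.
Z = np.array([[0.5, 0.5], [2.0, 0.0]])
accepted = svdd_sq_distance(Z, X, alpha, linear_kernel) <= R2
```

For a general dataset, `alpha` would come from a QP solver such as SMO, and any other Mercer kernel could replace `linear_kernel`.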

##### 3.2. The ELM Based One-Class Classifier

When only data from the target class are available, the one-class classifier is trained to accept target objects and reject objects that deviate significantly from the target class. In the training phase, the one-class classifier, which defines a distance function $d$ between the objects and the target class, takes in the training set $X$ to build the classification model. In general, the classification model contains two important parameters to be determined: threshold $\theta$ and model parameter $\lambda$. A generic test sample $\mathbf{z}$ is accepted by the classifier if $d(\mathbf{z} \mid X, \lambda) \le \theta$.

In the training phase, not all the training samples should be accepted by the one-class classifier, because the training set may contain outliers or noisy data. Otherwise, the trained classification model may generalize poorly to an unknown test set when the training set includes abnormal data samples. Usually, threshold $\theta$ is determined such that a user-specified fraction $\mu$ of the training samples most deviant from the target class is rejected. For instance, if one is told that five percent of the training samples are mislabeled, setting $\mu = 0.05$ makes the classifier more robust. Even when all the samples are correctly labeled, rejecting a small fraction of training samples helps the classifier to learn the most representative model from the training samples.

Any one-class classifier has model parameters which influence the model complexity (flexibility), for example, the number of hidden nodes in autoencoder neural networks or the tradeoff parameter $C$ of SVDD. Minimizing the errors of both the target and outlier classes on a cross-validation set is no longer possible since there are no data from the outlier class. Fortunately, several model selection criteria [2] have been proposed. Assuming a uniform distribution of the outlier class, the consistency-based model selection method [41] is one of the most effective methods for selecting the model parameters. The basic idea is that the complexity of the classifier can be increased as long as it still fits the target data. The more complex the model, the smaller the volume of the classifier in the object space and the lower the probability of outlier objects falling inside the domain of the classifier. In practice, one can order the potential model parameters such that each subsequent parameter yields a more complex classifier and then choose the most complex classifier that does not overfit the target data.
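The selection loop described above can be sketched as follows. This is a hedged illustration rather than the procedure of [41] as published: the function name `select_most_complex` is hypothetical, the test of "still fits the target data" is assumed here to be a two-standard-deviation binomial bound on the held-out target rejection rate, and the rejection rates in the usage example are invented numbers standing in for cross-validation estimates.

```python
import math

def select_most_complex(params, rejection_rate, mu, n_val, n_sigma=2.0):
    """Return the most complex parameter whose classifier still fits the target data.

    params         -- candidates ordered from simplest to most complex
    rejection_rate -- callable giving the held-out target rejection rate for a candidate
    mu             -- fraction of target samples the classifier is allowed to reject
    n_val          -- number of held-out target samples behind each estimate
    """
    # Assumed consistency test: the observed rate may exceed mu by at most
    # n_sigma standard deviations of a binomial proportion.
    bound = mu + n_sigma * math.sqrt(mu * (1.0 - mu) / n_val)
    chosen = None
    for p in params:
        if rejection_rate(p) > bound:
            break          # the classifier no longer fits the target data
        chosen = p
    return chosen

# Hypothetical held-out rejection rates for an increasing number of hidden nodes.
rates = {5: 0.09, 10: 0.11, 20: 0.14, 40: 0.31, 80: 0.45}
best = select_most_complex(sorted(rates), rates.get, mu=0.1, n_val=100)
```

With these invented rates the bound is 0.16, so the search stops at the first candidate whose rejection rate exceeds it and keeps the previous one.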

The compactness hypothesis [42] is the basis for object recognition. It states that similar real-world objects have to be close in the feature space. Therefore, for similar objects from the target class, the target outputs should be the same:

$$t_i = y, \quad i = 1, \ldots, N, \tag{20}$$

where $y$ is a real number. All the training samples' target outputs are set to the same value $y$. Then, the desired target output vector is $\mathbf{T} = [t_1, \ldots, t_N]^{T} = [y, \ldots, y]^{T}$. Training on the samples from the target class can directly use the optimization function (2). For a new test sample $\mathbf{z}$, the distance function between the sample object and the target class is defined as

$$d_{\mathrm{ELM}}(\mathbf{z}) = \left|\mathbf{h}(\mathbf{z})\boldsymbol{\beta} - y\right|. \tag{21}$$

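The decision whether $\mathbf{z}$ belongs to the target class or not is based on threshold $\theta$. Recall that $\theta$ is optimized to reject a small fraction $\mu$ of the training samples to avoid overfitting. The distances of the training samples to the target class can be directly determined using (21) and the constraint of (2):

$$d_{\mathrm{ELM}}(\mathbf{x}_i) = \left|\mathbf{h}(\mathbf{x}_i)\boldsymbol{\beta} - y\right| = |\xi_i|. \tag{22}$$

From (22), we find that the distances are $|\xi_i|$ and that a larger $|\xi_i|$ means that the training sample $\mathbf{x}_i$ deviates more from the target class. Hence, we derive threshold $\theta$ based on a quantile function to reject the most deviant training samples. Denote the sorted sequence of the distances of the training samples by $\mathbf{d} = [d_1, \ldots, d_N]$ such that $d_1 \ge d_2 \ge \cdots \ge d_N$. Here, $d_1$ and $d_N$ represent the most and the least deviant samples. The function determining $\theta$ can be written as

$$\theta = d_{\lfloor \mu N \rfloor}, \tag{23}$$

where $\lfloor a \rfloor$ returns the largest integer not greater than $a$. Then, we can get the decision function for $\mathbf{z}$ to the target class:

$$c_{\mathrm{ELM}}(\mathbf{z}) = \operatorname{sign}\left(\theta - d_{\mathrm{ELM}}(\mathbf{z})\right) = \begin{cases} 1, & \mathbf{z} \text{ is classified as a target}, \\ -1, & \mathbf{z} \text{ is classified as an outlier}. \end{cases} \tag{24}$$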
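The quantile-based thresholding step can be sketched in a few lines of NumPy. The function names and the toy distances are hypothetical. Two choices in the sketch are assumptions rather than statements from the text: a distance exactly equal to the threshold is treated as accepted, and the function raises an error when the floor of the rejected fraction times the sample count is zero, because the sorted sequence has no element with index 0.

```python
import math
import numpy as np

def select_threshold(train_distances, mu):
    """Pick the floor(mu * N)-th largest training distance as the threshold."""
    d = np.sort(np.asarray(train_distances, dtype=float))[::-1]  # d_1 >= ... >= d_N
    k = math.floor(mu * d.size)
    if k < 1:
        raise ValueError("mu * N must be at least 1 for the threshold to be defined")
    return d[k - 1]            # 1-based index k in the sorted sequence

def decide(distances, theta):
    """Return +1 for samples accepted as targets and -1 for outliers."""
    return np.where(np.asarray(distances, dtype=float) <= theta, 1, -1)

# Toy training distances |xi_i|, purely illustrative.
train_d = [0.10, 0.90, 0.30, 0.50, 0.20, 0.05, 0.70, 0.40, 0.15, 0.25]
theta = select_threshold(train_d, mu=0.2)
train_labels = decide(train_d, theta)
test_labels = decide([0.6, 0.8], theta)
```

Note that `mu * N` is computed in floating point, so products that should be whole numbers can occasionally land just below the integer; rounding the product before taking the floor is one way to guard against that.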

*Remark 1.* The target output $y$ can be assigned any real number except 0. When $y = 0$, as seen from (6), the output weights between the hidden layer and the output layer become 0 ($\boldsymbol{\beta} = \mathbf{0}$, where $\mathbf{0}$ is the $L$-dimensional zero vector). Therefore, the decision value of any sample is 0 using the proposed classifier. It is obvious that, in such a case, the one-class classifier cannot distinguish between the target class and the outlier class. When $y \ne 0$, as there are infinitely many possible values of $y$, there seem to exist infinitely many ELM based one-class classifiers. To get a universal ELM based one-class classifier, we normalize the distance function (21) by dividing it by the magnitude of the target output $y$:

$$d_{\mathrm{ELM}}(\mathbf{z}) = \frac{\left|\mathbf{h}(\mathbf{z})\boldsymbol{\beta} - y\right|}{|y|} = \left|\mathbf{h}(\mathbf{z})\mathbf{H}^{T}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{e} - 1\right|, \tag{25}$$

where $\mathbf{e} = [1, \ldots, 1]^{T}$ is the $N$-dimensional vector of ones. The normalization formula (25) eliminates the possible bias introduced by the target output $y$. In practice, one can set the target output $y = 1$ such that (21) is equivalent to (25) and the normalization step is implicitly done.
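Remark 1 can be checked numerically. The sketch below uses hypothetical random data, hidden-layer sizes, and a hypothetical value of the tradeoff parameter. It trains the closed-form ELM with two different nonzero target outputs and compares the normalized distances, then confirms that a zero target output gives zero output weights.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_features, n_hidden, C = 30, 2, 20, 10.0      # illustrative sizes only

X = rng.normal(size=(N, n_features))              # target-class training samples
A = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))  # random input weights
b = rng.uniform(-1.0, 1.0, size=n_hidden)                 # random biases

def hidden(data):
    """Sigmoid hidden-layer outputs, one row per sample."""
    return 1.0 / (1.0 + np.exp(-(data @ A + b)))

H = hidden(X)
G = np.eye(N) / C + H @ H.T                       # I/C + H H^T

def output_weights(y):
    """Closed-form beta when every training target equals y."""
    return H.T @ np.linalg.solve(G, np.full(N, y))

def normalized_distance(data, y):
    """|h(z) beta - y| / |y| for every row of data."""
    return np.abs(hidden(data) @ output_weights(y) - y) / abs(y)

Z = rng.normal(size=(5, n_features))              # new test samples
d_unit = normalized_distance(Z, 1.0)              # target output 1
d_other = normalized_distance(Z, -3.7)            # any other nonzero target output
beta_zero = output_weights(0.0)                   # degenerate case of a zero target
```

Because the output weights are linear in the target output, `d_unit` and `d_other` agree up to rounding error, which is why fixing the target output to 1 loses no generality.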

*Remark 2.* Both random feature mappings and kernels can be used for the proposed one-class classifier. When nonlinear piecewise continuous functions satisfying ELM universal approximation capability theorems [20, 21] are used as the activation function, the ELM network can approximate any target continuous function as long as the number of hidden nodes is large enough. When the feature mapping is unknown, kernel methods can be adopted as shown in (8a) and (8b). Huang et al. [17] have shown that ELM provides a unified solution for regression, binary, and multiclass classification. Since the same optimization formula (2) is used in the proposed one-class classifier, this paper shows that ELM also provides a unified learning model for one-class classification.
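A kernel version of the classifier can be sketched by combining the kernel output function with the quantile threshold. This is a minimal sketch under stated assumptions, not the authors' code: the Gaussian kernel and its width `gamma`, the values of the tradeoff parameter and the rejected fraction, the function names, and the toy data are all hypothetical, the target output is fixed to 1, and the index of the threshold is clamped to at least 1 as a guard added for the example.

```python
import math
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

def fit_kernel_oc_elm(X, C=10.0, mu=0.1, gamma=0.5):
    """Fit a kernel one-class ELM with target output 1; return coefficients and threshold."""
    N = X.shape[0]
    Omega = rbf_kernel(X, X, gamma)                      # kernel matrix
    coef = np.linalg.solve(np.eye(N) / C + Omega, np.ones(N))
    d_train = np.abs(Omega @ coef - 1.0)                 # training distances
    k = max(1, math.floor(mu * N))                       # guard: index must be >= 1
    theta = np.sort(d_train)[::-1][k - 1]                # k-th largest distance
    return coef, theta

def predict(Z, X, coef, theta, gamma=0.5):
    """Return +1 for samples accepted as targets and -1 for outliers."""
    d = np.abs(rbf_kernel(Z, X, gamma) @ coef - 1.0)
    return np.where(d <= theta, 1, -1)

# Toy target class: a Gaussian blob around the origin (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
coef, theta = fit_kernel_oc_elm(X)
train_labels = predict(X, X, coef, theta)
far_label = predict(np.array([[10.0, 10.0]]), X, coef, theta)[0]
```

For a sample far from all training data every kernel value is close to zero, so the output is close to 0 and the distance is close to 1, which is well above the training threshold in this toy setting.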

Figure 1 shows the decision boundaries (black curves) of the classifier for an increasing number of hidden nodes, using the sigmoid function as the activation function. The dataset (blue points) is composed of 100 samples in the plane. The threshold is determined from a fixed rejected fraction of the training samples, and the model parameter is automatically determined by the consistency-based model selection method. When the number of hidden nodes is small, the classifier fails to approximate the target region, and some unexpected "holes" without any targets can be seen in the leftmost picture of Figure 1. This weakness is alleviated as more hidden nodes are added. When the number of hidden nodes gets large enough, the classifier describes the target class well. This is consistent with ELM universal approximation capability theorems [20, 21].