Abstract

The one-class classification problem has been investigated thoroughly over the past decades. As one of the most effective neural network approaches for one-class classification, the autoencoder has been successfully applied in many applications. However, this classifier relies on traditional learning algorithms such as backpropagation to train the network, which is quite time-consuming. To tackle the slow learning speed of the autoencoder neural network, we propose a simple and efficient one-class classifier based on the extreme learning machine (ELM). The essence of ELM is that the hidden layer need not be tuned and the output weights can be determined analytically, which leads to much faster learning. The experimental evaluation conducted on several real-world benchmarks shows that the ELM based one-class classifier can learn hundreds of times faster than the autoencoder and is competitive with a variety of one-class classification methods.

1. Introduction

One-class classification [1, 2], also known as novelty or outlier detection, has received much interest in recent years. Different from ordinary classification, data samples from only one class, called the target class, are well characterized, while there are no or few samples from the other class (also called the outlier class). To illustrate the necessity of one-class classification, take an online shopping service as an example. In order to recommend goods users want, it is convenient to track the users' shopping histories (positive training samples), while collecting negative training samples is challenging because it is hard to say which items users dislike. Other applications include machine fault detection [3], disease detection [4], and credit scoring [5]. The goal is to "teach" the classifier by observing target samples so that it can be applied to accept unknown samples similar to the target class and reject samples that deviate significantly from it.

Various types of one-class classifiers have been designed and applied in different fields; see [6] for a comprehensive review. An early approach to obtaining a one-class classifier is to estimate the probability density function of the training data. Parzen density estimation [7, 8] superposes kernel functions on the individual training samples to estimate the probability density function. Naive Parzen density estimation, similar to the Naive Bayes approach used for classification, fits a Parzen density estimate on each individual feature and multiplies the results for the final density estimate. A test sample is rejected if its estimated probability is below a threshold. However, estimating the true density distribution usually requires a large number of training samples.

A simpler task is to find the domain of the data distribution. Schölkopf et al. [9] constructed a hyperplane that is maximally distant from the origin to separate off the region that contains no data. An alternative approach is to find a hypersphere [10] instead of a hyperplane that includes most of the target data with the minimum radius. Both approaches are cast in the form of quadratic programming, while some approaches [11–13] use linear programming. The one-class LP classifier [11] minimizes the volume of the prism, cut by a hyperplane that bounds the data from above, with some mild constraints on dissimilarity representations. Lanckriet et al. [13] propose the one-class minimax probability machine, which minimizes the worst-case probability of misclassification of test data using only the mean and covariance matrix of the target distribution. When kernel methods are used, the aforementioned domain-based classifiers [2] can obtain more flexible descriptions. Recently, a minimum spanning tree based one-class classifier [14] was proposed, which considers the graph edges as an additional set of virtual target objects. By constructing a minimum spanning tree, recognition of a new test sample is determined by the shortest distance to the closest edge of that tree.

The autoencoder neural network is one of the reconstruction methods [1] used to build a one-class classifier. The simplest architecture of such a model is based on single-hidden-layer feed-forward neural networks (SLFNs). Usually, the hidden layer contains fewer nodes than the input layer and thus works like an information bottleneck. The classifier reproduces the input patterns at the output layer by minimizing the reconstruction error. However, the standard backpropagation (BP) algorithm is used to train the network, which is quite time-consuming. The extreme learning machine [15, 16] was originally developed to address the slow learning speed of gradient-based learning algorithms, which iteratively tune all the network parameters. ELM randomly selects all the parameters of the hidden neurons and analytically determines the output weights. It has been stated in theory [17, 18] that ELM tends to provide the best generalization performance at extreme learning speed since it is a simple tuning-free algorithm.

In this paper, the proposed ELM based one-class classifier is constructed for situations where only the target class is well described. The proposed one-class classifier utilizes the unified ELM learning theory [17], which leads to extreme learning speed and superior generalization performance. Moreover, the classifier further lessens human intervention since it is not limited to specific target labels. Both random feature mappings and kernels can be adopted for the classifier, which makes it more flexible for unique target descriptions. Constructing the proposed classifier for three quite different, specifically designed artificial datasets demonstrates its ability to describe diverse target class distributions. When real-world datasets are evaluated, the proposed one-class classifier is competitive with a variety of one-class models and learns hundreds of times faster than the autoencoder neural network for one-class classification.

The rest of the paper is organized as follows. Section 2 briefly reviews extreme learning machine. In Section 3, we first describe the hypersphere perceptron as a one-class classifier and then introduce our proposed ELM based one-class classifier. Section 4 describes the experiments conducted on both artificial and real-world datasets. Finally, Section 5 presents the conclusion of the work.

2. Brief Review of ELM

ELM aims to reach not only the smallest training error but also the smallest norm of the output weights [16] between the hidden layer and the output layer. According to Bartlett's theory [19], the smaller the norm of the weights is, the better the generalization performance of the network tends to be. Thus, better generalization performance can be expected for ELM networks. In [17], equality constraints are used in ELM, which provides a unified solution for regression, binary, and multiclass classification.

2.1. Equality-Optimization-Constraints-Based ELM

Given training data $\{(\mathbf{x}_i, t_i)\}_{i=1}^{N}$, where $\mathbf{x}_i \in \mathbb{R}^{d}$ is the individual feature vector with dimension $d$ and $t_i$ is the desired target output, a single output node ($m = 1$) is enough in the one-class classification case. The ELM output function can be formulated as

\[ f(\mathbf{x}) = \sum_{j=1}^{L} \beta_j\, g\left(\mathbf{w}_j \cdot \mathbf{x} + b_j\right) = \mathbf{h}(\mathbf{x})\boldsymbol{\beta}, \quad (1) \]

where $\boldsymbol{\beta} = [\beta_1, \ldots, \beta_L]^{T}$ is the vector of the output weights between the hidden layer (with $L$ nodes) and the output layer, $\mathbf{w}_j$ is the vector of input weights connecting the input nodes with the $j$th hidden node, $b_j$ is the bias of the $j$th hidden node, $\mathbf{h}(\mathbf{x}) = [g(\mathbf{w}_1 \cdot \mathbf{x} + b_1), \ldots, g(\mathbf{w}_L \cdot \mathbf{x} + b_L)]$ is the output vector of the hidden layer with respect to input $\mathbf{x}$, and $g(\cdot)$ is the activation function (e.g., the sigmoid function) satisfying the ELM universal approximation capability theorems [20, 21]. In fact, $\mathbf{h}(\mathbf{x})$ is a known nonlinear feature mapping which maps the training data from the $d$-dimensional input space to the $L$-dimensional ELM feature space [17]. The goal of ELM is to minimize the norm of the output weights as well as the training errors, which is equivalent to

\[ \min_{\boldsymbol{\beta},\,\boldsymbol{\xi}}\ \frac{1}{2}\|\boldsymbol{\beta}\|^{2} + \frac{C}{2}\sum_{i=1}^{N}\xi_i^{2} \quad \text{s.t.}\ \mathbf{h}(\mathbf{x}_i)\boldsymbol{\beta} = t_i - \xi_i,\ i = 1, \ldots, N, \quad (2) \]

where $\xi_i$ is the slack variable of the $i$th training sample and $C$ controls the tradeoff between the output weights and the errors. Based on the Karush-Kuhn-Tucker (KKT) theorem [22], the corresponding Lagrange function of the primal ELM optimization (2) is

\[ L = \frac{1}{2}\|\boldsymbol{\beta}\|^{2} + \frac{C}{2}\sum_{i=1}^{N}\xi_i^{2} - \sum_{i=1}^{N}\alpha_i\left(\mathbf{h}(\mathbf{x}_i)\boldsymbol{\beta} - t_i + \xi_i\right); \quad (3) \]

the following optimality conditions of (3) should be satisfied:

\[ \frac{\partial L}{\partial \boldsymbol{\beta}} = 0 \;\Rightarrow\; \boldsymbol{\beta} = \mathbf{H}^{T}\boldsymbol{\alpha}, \quad (4a) \]
\[ \frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; \alpha_i = C\xi_i,\ i = 1, \ldots, N, \quad (4b) \]
\[ \frac{\partial L}{\partial \alpha_i} = 0 \;\Rightarrow\; \mathbf{h}(\mathbf{x}_i)\boldsymbol{\beta} - t_i + \xi_i = 0,\ i = 1, \ldots, N, \quad (4c) \]

where $\mathbf{H} = [\mathbf{h}(\mathbf{x}_1)^{T}, \ldots, \mathbf{h}(\mathbf{x}_N)^{T}]^{T}$ is the hidden layer output matrix and $\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_N]^{T}$ is the vector of Lagrange variables. Substituting (4a) and (4b) into (4c), we have

\[ \boldsymbol{\alpha} = \left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{T}. \quad (5) \]

Here $\mathbf{I}$ is the identity matrix and $\mathbf{T} = [t_1, \ldots, t_N]^{T}$. Substituting (5) into (4a), we get

\[ \boldsymbol{\beta} = \mathbf{H}^{T}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{T}. \quad (6) \]

The ELM output function (1) can be further derived as

\[ f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\mathbf{H}^{T}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{T}. \quad (7) \]

If the hidden nodes' feature mapping $\mathbf{h}(\mathbf{x})$ is unknown to users, kernel methods that satisfy Mercer's condition can be adopted: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{h}(\mathbf{x}_i) \cdot \mathbf{h}(\mathbf{x}_j)$. The ELM kernel output function can be written as

\[ f(\mathbf{x}) = \left[K(\mathbf{x}, \mathbf{x}_1), \ldots, K(\mathbf{x}, \mathbf{x}_N)\right]\left(\frac{\mathbf{I}}{C} + \boldsymbol{\Omega}_{\mathrm{ELM}}\right)^{-1}\mathbf{T}, \quad (8a) \]

and the kernel matrix for ELM is

\[ \boldsymbol{\Omega}_{\mathrm{ELM}} = \mathbf{H}\mathbf{H}^{T}, \qquad \Omega_{\mathrm{ELM}\,i,j} = K(\mathbf{x}_i, \mathbf{x}_j). \quad (8b) \]
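The closed-form solutions (6)-(8b) are straightforward to compute. The following is a minimal numpy sketch of both the random-feature form and the kernel form; the function names, the sigmoid feature construction, and the Gaussian kernel bandwidth are illustrative assumptions rather than the authors' implementation.

import numpy as np

def elm_random_features(X, L=200, rng=np.random.default_rng(0)):
    """Random hidden-layer output H (N x L) with a sigmoid activation."""
    d = X.shape[1]
    W = rng.standard_normal((d, L))   # random input weights, never tuned
    b = rng.standard_normal(L)        # random hidden biases
    return 1.0 / (1.0 + np.exp(-(X @ W + b))), (W, b)

def elm_train_random(X, T, C=1.0, L=200):
    """Output weights beta = H^T (I/C + H H^T)^{-1} T, as in (6)."""
    H, params = elm_random_features(X, L)
    A = np.eye(len(X)) / C + H @ H.T
    return H.T @ np.linalg.solve(A, T), params

def elm_predict_random(Xtest, beta, params):
    W, b = params
    H = 1.0 / (1.0 + np.exp(-(Xtest @ W + b)))
    return H @ beta                   # f(x) = h(x) beta, as in (1)

def gaussian_kernel(X, Y, sigma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def elm_train_kernel(X, T, C=1.0, sigma=1.0):
    """Coefficient vector (I/C + Omega_ELM)^{-1} T for the kernel form (8a)-(8b)."""
    Omega = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(np.eye(len(X)) / C + Omega, T)

def elm_predict_kernel(Xtest, Xtrain, coef, sigma=1.0):
    """f(x) = [K(x, x_1), ..., K(x, x_N)] (I/C + Omega_ELM)^{-1} T."""
    return gaussian_kernel(Xtest, Xtrain, sigma) @ coef

Solving the N x N linear system rather than explicitly inverting the matrix is the usual numerical choice; either way, no iterative tuning of the hidden layer is involved.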

2.2. Advances of ELM

The extreme learning machine has gained considerable popularity since its advent. It avoids the time complexity that classic learning techniques are confronted with while providing better generalization performance with less human intervention. Because of such attractive features, researchers have extended the basic ELM in several different directions and many variants of ELM have been developed. For instance, the online sequential ELM (OS-ELM) [23, 24] can learn sequentially arriving data (one by one or chunk by chunk) with little effort to update the output weights. The training data are discarded after being learned by the network and the output weights need not be retrained, which is especially efficient for time-series problems. Other typical works include the fully complex ELM [25, 26], the incremental ELM (I-ELM) [20, 21], the sparse ELM [27], ELM with elastic output [28, 29], and ELM ensembles [30–32]. See [33] for further details on the many ELM variants.

When uncertainty is present in the dataset, integrating a fuzzy logic system with the extreme learning machine tends to enhance the generalization capability of ELM. In [34], a neurofuzzy Takagi-Sugeno-Kang (TSK) fuzzy inference system is constructed utilizing the extreme learning machine. The number of inference rules is determined beforehand by the k-means method. One ELM is used to obtain the membership of each fuzzy rule and multiple ELMs are used to obtain the consequent part. Rong et al. [35] show that a type-1 fuzzy inference system (type-1 FLS) is equivalent to a generalized SLFN; hence, the hidden nodes work as the antecedent part and the output weights as the consequent part. Extreme learning machine is then directly applied to the type-1 FLS, and the corresponding online sequential fuzzy ELM has also been developed. Deng et al. [36] further extend the idea to the type-2 fuzzy inference system (type-2 FLS) because of its superiority in modeling high-level uncertainty. With the most widely used interval type-2 FLS, the parameters of the antecedents are randomly initialized according to the ELM mechanism. The Moore-Penrose generalized inverse is used to initialize the parameters of the consequents, and the parameters are finally refined by the Karnik-Mendel algorithm [37]. Many applications have also been investigated in the literature; for example, the hybrid model of ELM with interval type-2 FLS has been applied to permeability prediction [38].

3. The Proposed One-Class Classifier

3.1. Support Vector Data Description

For a better understanding of one-class classifiers, support vector data description (SVDD) [10] is discussed here as a representative one-class classification process. SVDD defines a spherically shaped boundary around the complete target set and is intuitively appealing since it regards the target class as a self-closed system. Let $\{\mathbf{x}_i\}_{i=1}^{N}$ be the training set, where each $\mathbf{x}_i$ is drawn from the target distribution. SVDD aims to minimize the volume of the sphere as well as the training errors for objects falling outside the boundary, which is equivalent to

\[ \min_{R,\,\mathbf{a},\,\boldsymbol{\xi}}\ R^{2} + C\sum_{i=1}^{N}\xi_i \quad \text{s.t.}\ \|\mathbf{x}_i - \mathbf{a}\|^{2} \le R^{2} + \xi_i,\ \xi_i \ge 0,\ i = 1, \ldots, N, \quad (9) \]

where $R$ and $\mathbf{a}$ are the hypersphere's radius and center, respectively. Parameter $C$ controls the tradeoff between the volume and the errors.

The corresponding Lagrange function of the primal SVDD optimization (9) is

\[ L(R, \mathbf{a}, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\gamma}) = R^{2} + C\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\alpha_i\left(R^{2} + \xi_i - \|\mathbf{x}_i - \mathbf{a}\|^{2}\right) - \sum_{i=1}^{N}\gamma_i\xi_i, \quad (10) \]

with the Lagrange variables $\alpha_i \ge 0$ and $\gamma_i \ge 0$. $L$ should be minimized with respect to $R$, $\mathbf{a}$, $\boldsymbol{\xi}$ and maximized with respect to $\boldsymbol{\alpha}$, $\boldsymbol{\gamma}$.

Based on the Karush-Kuhn-Tucker (KKT) theorem [22], to get the optimal solutions of (10), we should have

\[ \frac{\partial L}{\partial R} = 0 \;\Rightarrow\; \sum_{i=1}^{N}\alpha_i = 1, \quad (11a) \]
\[ \frac{\partial L}{\partial \mathbf{a}} = 0 \;\Rightarrow\; \mathbf{a} = \sum_{i=1}^{N}\alpha_i\mathbf{x}_i, \quad (11b) \]
\[ \frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; C - \alpha_i - \gamma_i = 0,\ i = 1, \ldots, N. \quad (11c) \]

From (11c) and $\alpha_i \ge 0$ and $\gamma_i \ge 0$, the variables $\gamma_i$ can be removed and $\alpha_i$ can be further limited to the interval $[0, C]$:

\[ 0 \le \alpha_i \le C,\ i = 1, \ldots, N. \quad (12) \]

Substituting (11a)–(11c) into (10), the dual optimization function can be derived as

\[ \max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{N}\alpha_i\left(\mathbf{x}_i \cdot \mathbf{x}_i\right) - \sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j\left(\mathbf{x}_i \cdot \mathbf{x}_j\right), \quad (13) \]

subject to constraints (11a) and (12). To constitute a flexible data description model, a kernel function $K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$, with an implicit feature mapping $\Phi$ of the data into a higher dimensional feature space, can be adopted to replace the inner product $\mathbf{x}_i \cdot \mathbf{x}_j$. In this case, the corresponding dual optimization function is changed to

\[ \max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{N}\alpha_i K(\mathbf{x}_i, \mathbf{x}_i) - \sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j K(\mathbf{x}_i, \mathbf{x}_j). \quad (14) \]

The KKT complementarity conditions of the target functions are

\[ \alpha_i\left(R^{2} + \xi_i - \|\mathbf{x}_i - \mathbf{a}\|^{2}\right) = 0, \qquad \gamma_i\xi_i = 0. \quad (15) \]

The constraints have to be enforced and we have three cases as follows:
(1) $\alpha_i = 0$: the object falls inside the sphere, $\|\mathbf{x}_i - \mathbf{a}\|^{2} < R^{2}$; (16)
(2) $0 < \alpha_i < C$: the object lies exactly on the sphere boundary, $\|\mathbf{x}_i - \mathbf{a}\|^{2} = R^{2}$; (17)
(3) $\alpha_i = C$: the object falls outside the sphere and is penalized, $\|\mathbf{x}_i - \mathbf{a}\|^{2} > R^{2}$. (18)

Only a small fraction of the objects, those with $\alpha_i > 0$, are called the support vectors. The dual optimization functions (13) and (14) are standard Quadratic Programming (QP) problems, and the Lagrange variables can be obtained using optimization methods such as the SMO algorithm [39]. To test a new object $\mathbf{z}$, its distance to the center of the sphere is calculated. The classifier accepts the object if the distance is less than or equal to the radius:

\[ \|\mathbf{z} - \mathbf{a}\|^{2} = K(\mathbf{z}, \mathbf{z}) - 2\sum_{i=1}^{N}\alpha_i K(\mathbf{z}, \mathbf{x}_i) + \sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j K(\mathbf{x}_i, \mathbf{x}_j) \le R^{2}. \quad (19) \]
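For comparison purposes, the SVDD training and test rule above can be prototyped with a general-purpose solver. The sketch below is an assumed illustration, not the authors' implementation (which would typically rely on a dedicated SMO solver [39]); the Gaussian kernel, tolerance values, and parameter defaults are placeholders.

import numpy as np
from scipy.optimize import minimize

def rbf(X, Y, sigma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def svdd_fit(X, C=0.2, sigma=1.0):
    N = len(X)
    K = rbf(X, X, sigma)
    # maximize sum_i a_i K_ii - sum_ij a_i a_j K_ij  <=>  minimize its negative
    obj = lambda a: a @ K @ a - a @ np.diag(K)
    cons = {"type": "eq", "fun": lambda a: a.sum() - 1.0}   # constraint (11a)
    bnds = [(0.0, C)] * N                                   # box constraint (12)
    a0 = np.full(N, 1.0 / N)
    alpha = minimize(obj, a0, method="SLSQP", bounds=bnds, constraints=cons).x
    # radius from a boundary support vector with 0 < alpha_k < C (case (17))
    k = int(np.argmax((alpha > 1e-6) & (alpha < C - 1e-6)))
    R2 = K[k, k] - 2 * alpha @ K[:, k] + alpha @ K @ alpha
    return alpha, R2

def svdd_accept(Z, X, alpha, R2, sigma=1.0):
    Kzz = np.ones(len(Z))            # K(z, z) = 1 for the Gaussian kernel
    Kzx = rbf(Z, X, sigma)
    dist2 = Kzz - 2 * Kzx @ alpha + alpha @ rbf(X, X, sigma) @ alpha
    return dist2 <= R2               # test rule (19): accept if inside the sphere

This generic QP route scales poorly with N, which is why dedicated solvers are used in practice and why the QP-free ELM solution of Section 2.1 is attractive.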

In addition to the batch learning model of SVDD, incremental learning methods [40] for SVM have been extended to the SVDD algorithm. Yin et al. [28] present an online fault diagnosis process based on a hybrid model of incremental SVDD (ISVDD) and ELM with an incremental output structure (IOELM). They use ISVDD to detect the unknown failure mode, and the output nodes of the IOELM are adaptively increased to recognize the new failure mode.

3.2. The ELM Based One-Class Classifier

When data only from the target class are available, the one-class classifier is trained to accept target objects and reject objects that deviate significantly from the target class. In the training phase, the one-class classifier, which defines a distance function $d(\mathbf{x})$ between an object and the target class, takes in the training set to build the classification model. In general, the classification model contains two kinds of parameters to be determined: the threshold $\theta$ and the model parameter(s). A generic test sample $\mathbf{z}$ is accepted by the classifier if $d(\mathbf{z}) \le \theta$.

In the training phase, not all the training samples are to be accepted by the one-class classifier, due to the presence of outliers or noisy data in the training set. Otherwise, the trained classification model may generalize poorly to the unknown test set when the training set includes abnormal data samples. Usually, the threshold $\theta$ is determined such that a user-specified fraction $\mu$ of the training samples most deviant from the target class is rejected. For instance, if one is told that five percent of the training samples are mislabeled, setting $\mu = 0.05$ makes the classifier more robust. Even when all the samples are correctly labeled, rejecting a small fraction of the training samples helps the classifier to learn the most representative model from the training samples.

Any one-class classifier has model parameters which influence the model complexity (flexibility), for example, the number of hidden nodes in autoencoder neural networks or the tradeoff parameter $C$ of SVDD. Minimizing the errors of both the target and outlier classes on a cross-validation set is no longer possible since there are no data from the outlier class. Fortunately, several model selection criteria [2] have been proposed. Assuming a uniform distribution for the outlier class, the consistency-based model selection method [41] is one of the most effective ways to select the model parameters. The basic idea is that the complexity of the classifier can be increased as long as it still fits the target data. The more complex the model, the smaller the volume of the classifier in the object space and the lower the probability of outlier objects falling inside the domain of the classifier. In practice, one can order the candidate model parameters such that each successive setting yields a more complex classifier and then choose the most complex classifier that does not overfit the target data, as sketched below.
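The following sketch illustrates this selection loop under a simplifying assumption: a parameter setting is treated as "consistent" when the fraction of target samples it rejects does not noticeably exceed the prescribed rejection rate. The exact statistical consistency test of [41] is not reproduced here, and the function names are placeholders.

def select_model(param_grid, fit, reject_rate, X_target, mu=0.1, slack=0.05):
    """param_grid is ordered from the simplest to the most complex setting;
    fit(params, X) returns a trained one-class classifier; reject_rate(clf, X)
    returns the fraction of X that the classifier rejects."""
    chosen = None
    for params in param_grid:                 # increasing complexity
        clf = fit(params, X_target)
        if reject_rate(clf, X_target) <= mu + slack:
            chosen = (params, clf)            # still consistent; keep going
        else:
            break                             # first inconsistent setting: stop
    return chosen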

The compactness hypothesis [42] is the basis for object recognition. It states that similar real-world objects have to be close in the feature space. Therefore, for similar objects from the target class, the target outputs should be the same:

\[ t_1 = t_2 = \cdots = t_N = t, \quad (20) \]

where $t$ is a real number. All the training samples' target outputs are set to the same value $t$. Then, the desired target output vector is $\mathbf{T} = [t, t, \ldots, t]^{T}$. Training the samples from the target class can directly use the optimization function (2). For a new test sample $\mathbf{z}$, the distance function between the sample object and the target class is defined as

\[ d(\mathbf{z}) = |f(\mathbf{z}) - t|. \quad (21) \]

The decision whether $\mathbf{z}$ belongs to the target class or not is based on the threshold $\theta$. Recall that $\theta$ is optimized to reject a small fraction of the training samples to avoid overfitting. The distances of the training samples to the target class can be directly determined using (21) and the constraint of (2):

\[ d(\mathbf{x}_i) = |f(\mathbf{x}_i) - t_i| = |\xi_i|,\ i = 1, \ldots, N. \quad (22) \]

From (22), we find that the distances are $|\xi_i|$, and a larger $|\xi_i|$ means the training sample deviates more from the target class. Hence, we derive the threshold $\theta$ based on a quantile function to reject the most deviant training samples. Denote the sorted sequence of the distances of the training samples by $d_{(1)} \ge d_{(2)} \ge \cdots \ge d_{(N)}$. Here, $d_{(1)}$ and $d_{(N)}$ represent the most and the least deviant samples, respectively. The function determining $\theta$ can be written as

\[ \theta = d_{(\lfloor \mu N \rfloor + 1)}, \quad (23) \]

where $\lfloor \cdot \rfloor$ returns the largest integer not greater than its argument and $\mu$ is the user-specified rejection fraction. Then, we can get the decision function for $\mathbf{z}$ with respect to the target class:

\[ f_{\mathrm{dec}}(\mathbf{z}) = \begin{cases} \text{target}, & d(\mathbf{z}) \le \theta, \\ \text{outlier}, & d(\mathbf{z}) > \theta. \end{cases} \quad (24) \]
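Putting (20)-(24) together, the training and decision steps of the proposed classifier amount to a few lines on top of the ELM solution sketched in Section 2.1. The code below reuses the elm_train_kernel and elm_predict_kernel helpers from that sketch; the rejection fraction mu and the choice of target output t = 1 (see Remark 1 below) are assumptions used for illustration.

import numpy as np

def occ_elm_fit(X, C=1.0, sigma=1.0, mu=0.1, t=1.0):
    T = np.full(len(X), t)                         # identical target outputs (20)
    coef = elm_train_kernel(X, T, C, sigma)
    d = np.abs(elm_predict_kernel(X, X, coef, sigma) - t)   # training distances (21), (22)
    d_sorted = np.sort(d)[::-1]                    # most deviant sample first
    theta = d_sorted[int(np.floor(mu * len(X)))]   # quantile threshold (23)
    return coef, theta

def occ_elm_decide(Z, Xtrain, coef, theta, sigma=1.0, t=1.0):
    d = np.abs(elm_predict_kernel(Z, Xtrain, coef, sigma) - t)
    return d <= theta                              # True = accept as target (24)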

Remark 1. The target output $t$ can be assigned any real number except 0. When $t = 0$, as seen from (6), the output weights between the hidden layer and the output layer become 0 ($\boldsymbol{\beta} = \mathbf{0}$, where $\mathbf{0}$ is the $L$-dimensional zero vector). Therefore, the decision value of any sample is 0 using the proposed classifier. It is obvious that, in such a case, the one-class classifier cannot distinguish between the target class and the outlier class. When $t \ne 0$, as there are infinitely many possible values of $t$, there seem to exist infinitely many ELM based one-class classifiers. To get a universal ELM based one-class classifier, we normalize the distance function (21) by dividing by the target output $t$ (with $\mathbf{T} = t\mathbf{1}$, where $\mathbf{1}$ is the $N$-dimensional unit vector):

\[ d(\mathbf{z}) = \left|\frac{f(\mathbf{z}) - t}{t}\right| = \left|\frac{f(\mathbf{z})}{t} - 1\right|. \quad (25) \]

The normalization formula (25) eliminates the possible bias introduced by the target output $t$. In practice, one can set the target output $t = 1$ such that (21) is equivalent to (25) and the normalization step is implicitly done.

Remark 2. Both random feature mappings and kernels can be used for the proposed one-class classifier. When nonlinear piecewise continuous functions satisfying the ELM universal approximation capability theorems [20, 21] are used as the activation function, the ELM network can approximate any continuous target function as long as the number of hidden nodes is large enough. When the feature mapping is unknown, kernel methods can be adopted as shown in (8a) and (8b). Huang et al. [17] have shown that ELM provides a unified solution for regression, binary, and multiclass classification. Since the same optimization formula (2) is used in the proposed one-class classifier, this paper also shows that ELM provides a unified learning mode for one-class classification.

Figure 1 shows the decision boundaries (black curves) of the classifier with an increasing number of hidden nodes, using the sigmoid function as the activation function. The dataset (blue points) is composed of 100 samples in the plane. The threshold $\theta$ is determined by rejecting a fixed fraction of the training samples, and the model parameter is automatically determined by the consistency-based model selection method. When the number of hidden nodes $L$ is small, the classifier fails to approximate the target region and some unexpected "holes" without any targets can be seen in the leftmost picture of Figure 1. This weakness is alleviated as more hidden nodes are added. When the number of hidden nodes gets large enough, the classifier describes the target class well. This is consistent with the ELM universal approximation capability theorems [20, 21].

Remark 3. The autoencoder is one of the most effective neural network approaches for one-class classification and has been applied by Manevitz and Yousef for document retrieval [43]. The number of output nodes is constrained to be equal to the number of input nodes $d$, and the hidden layer in such a network acts as a bottleneck, containing fewer hidden nodes than input nodes. The idea is that while the bottleneck prevents learning the full identity function on the $d$-dimensional space, the identity on the small set of examples is in fact learnable. Traditional learning algorithms like BP are used to train the network. Several challenging issues, such as local minima, tedious human intervention, and a time-consuming learning stage, discourage people who are not familiar with the field from using it, while the ELM based one-class classifier can approximate the target class well as long as the dimensionality of the feature mapping is large enough (cf. Figure 2).

4. Experiments

4.1. Artificial Datasets

First, we illustrate the proposed method with both random feature mappings and kernels on three specifically designed artificial datasets, each containing 100 samples created in a 2D feature space. The first dataset contains four Gaussian distributions (25 samples each) with the same unit covariance matrix but different mean vectors; it tests the classifier's sensitivity to multimodality. The second dataset contains one Gaussian distribution whose first feature has a variance of 1 and whose second feature has a variance of 40; the two features are then rotated over 45 degrees to construct a strong correlation. The third, banana-shaped dataset, which has been shown in Section 3, contains one uniform distribution along an arc with small position offsets; it tests the influence of convexity. In Figure 3, the datasets (blue points) together with the decision boundaries (black curves) in the feature space are illustrated. The sigmoid function acts as the activation function for the method with random feature mappings (with $L$ large enough), and the Gaussian kernel is used for the method with kernels. All thresholds are determined by rejecting a fixed fraction of the training samples. The pictures show that the methods using both random feature mappings and kernels give reasonable results. However, the method with kernels tends to be superior to the method with random feature mappings since its boundary captures the distribution more precisely, while in Figure 3(a) some small "holes" still exist in the upper left and lower right regions for the method with random feature mappings.
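Since the exact generating parameters of these toy datasets are not given, the following numpy sketch only illustrates one plausible way to produce data of the three described shapes; the cluster means, arc radius, and noise levels are assumptions.

import numpy as np
rng = np.random.default_rng(0)

# 1) four unit-covariance Gaussian clusters, 25 samples each (multimodality)
means = np.array([[0, 0], [8, 0], [0, 8], [8, 8]])          # assumed means
multimodal = np.vstack([rng.normal(m, 1.0, size=(25, 2)) for m in means])

# 2) one Gaussian with variances 1 and 40, rotated by 45 degrees (correlation)
raw = rng.normal(0.0, [1.0, np.sqrt(40.0)], size=(100, 2))
c, s = np.cos(np.pi / 4), np.sin(np.pi / 4)
correlated = raw @ np.array([[c, -s], [s, c]]).T

# 3) banana shape: uniform samples along an arc with small offsets (convexity)
angles = rng.uniform(0.0, np.pi, 100)
banana = np.c_[np.cos(angles), np.sin(angles)] * 5 + rng.normal(0, 0.4, (100, 2))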

4.2. UCI Datasets

This section compares the performance of the proposed method with a variety of one-class classification algorithms. The popular one-class classifiers to be compared include Parzen [7], Naive Parzen, k-means [44], k-centers [45], 1-NN [46], k-NN [47], autoencoder, PCA [48], MST [14], MPM [13], SVDD [10], LPDD [11], and one-class SVM [9]. The implementation of one-class SVM uses the compiled C-coded SVM package LIBSVM [49]. All the other algorithms are run with the Matlab toolbox DD_TOOLS [50]. Binary and multiclass classification datasets taken from the UCI Machine Learning Repository [51] are used. The specifications of the datasets are shown in Table 1. The datasets are transformed for one-class classification by setting a chosen class as the target class and all the other classes as the outlier class.

In our experiments, all the input features have been normalized to a common range. The samples from the target class are equally partitioned into two sets, for training and testing, respectively. All one-class classifiers are trained on target data only and tested on both the remaining target data and all other nontarget data. To assess the performance, we use the F1 measure [52], which is defined as a combination of recall ($r$) and precision ($p$) with equal weights in the following form:

\[ F_1 = \frac{2pr}{p + r}. \quad (26) \]
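A minimal sketch of the F1 computation in (26), treating the target class as the positive class; the helper name is illustrative.

import numpy as np

def f1_score(y_true, y_pred):
    """y_true, y_pred: boolean arrays, True = target class."""
    tp = np.sum(y_pred & y_true)                    # accepted true targets
    precision = tp / max(np.sum(y_pred), 1)
    recall = tp / max(np.sum(y_true), 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)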

All thresholds are determined by rejecting a fixed fraction of the training samples. The Gaussian kernel is used in Parzen, Naive Parzen, MPM, SVDD, SVM, and ELM. The consistency-based model selection method is employed to select the model parameters. For Parzen, MPM, SVDD, SVM, and ELM, the kernel parameter is chosen from 20 equally spaced values between the minimum and maximum pairwise object distances, as is the smoothing parameter of the sigmoid transform function used in LPDD. For k-means and k-centers, the parameter k is selected from a predefined range. For ELM, the tradeoff parameter C is chosen from a predefined range; the kernel parameter is given a higher priority than C, that is, when two parameter combinations differing only in C both obtain consistent boundaries, we always choose the smaller C rather than the larger one. We try every possible parameter setting and keep the most complex classifier that remains consistent. For Naive Parzen and k-NN, leave-one-out maximum likelihood estimation is used. One-class PCA retains 0.95 of the variance of the training set. For MST, the complete minimum spanning tree is used. The number of hidden nodes in the autoencoder neural network is carefully chosen from a large range and the optimal number is selected.

All the experiments are carried out in the Matlab R2013a environment running on an E5504 2 GHz CPU with 4 GB RAM. Twenty trials have been conducted for each dataset, and the average F1 values and corresponding standard deviations are shown in Tables 2 and 3. The best results are shown in boldface. As an example, we give a detailed description of the diabetes experiment. First, all the samples from both the target class and the outlier class are normalized to a common range. Then, the 500 training target samples are randomly divided into two equal sets (250 samples each). One of the sets is used for training the one-class classifier and the other set, together with all the samples from the outlier class, is used for testing only. After that, the consistency-based model selection method is employed to select the model parameters for each classifier using only the training set. Finally, the other target set together with the outlier set is judged by the trained classifier, with precision and recall recorded. The F1 value is then derived as in (26). The same procedure is repeated twenty times and the corresponding mean and standard deviation values are calculated. It can be seen that the generalization performance of ELM is the best in five of the eight experiments, while in the other experiments, except for the sonar dataset, the performance is comparable to that of the best classifier. Table 4 presents a detailed performance comparison on two datasets, including precision and recall.

Table 5 reports the execution time comparisons in seconds between the ELM, autoencoder, and SVDD classifiers for all eight experiments. As observed from Table 5, the advantage of ELM in training time is quite obvious. ELM can generally learn hundreds of times faster than the autoencoder neural network due to its tuning-free mechanism. Besides, ELM also learns much faster than SVDD since no QP problem has to be solved. For testing time, since the autoencoder may obtain a more compact network whose parameters have already been tuned in the training phase, the computational cost depends on the specific task. The testing complexity of ELM mostly depends on the number of training samples, while that of the autoencoder depends on both the number of samples and the number of dimensions. Thus, for datasets with relatively small size and high dimensionality, such as the arrhythmia dataset, ELM obtains a smaller testing time, while for datasets with relatively large size and low dimensionality, such as the abalone dataset, the autoencoder reacts faster to the testing samples. However, ELM still tends to outperform the autoencoder with respect to both training time and accuracy. ELM and SVDD obtain similar testing times since both of them utilize a kernel function.

5. Conclusion

This paper presents a simple and efficient one-class classifier utilizing the extreme learning machine, which also shows that ELM provides a unified learning mode for one-class classification. Both random feature mappings and kernels can be used for the proposed classifier, while the method with kernels tends to be superior to the method with random feature mappings. Moreover, the proposed classifier with kernels achieves the best results on five of the eight UCI datasets, which suggests that ELM is effective for the one-class classification problem. We have also discussed the relationships and differences between the autoencoder neural network and the ELM network for one-class classification. Although the autoencoder neural network has been successfully applied in many applications, a slow gradient-based method is still used to tune all its parameters, which is far slower than required. In contrast, the ELM based one-class classifier has an analytical solution and can obtain superior generalization performance at a much faster learning speed. Possible future directions include the fusion of fuzzy logic and ELM for one-class classification, one-class classifier ensembles with ELM, and substituting the autoencoder with the ELM based one-class classifier for deep learning.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is partially sponsored by the Natural Science Foundation of China (nos. 61175115, 61272320, 61379100, and 61472388). The authors would like to thank Mr. Fan Wang and Dr. Laiyun Qing for helpful discussions.