Abstract

A large vector-angular region and margin (LARM) approach is presented for novelty detection on imbalanced data. The key idea is to construct the largest vector-angular region in the feature space to separate the normal training patterns and, meanwhile, to maximize the vector-angular margin between the surface of this optimal vector-angular region and the abnormal training patterns. To improve the generalization performance of LARM, the vector-angular distribution is optimized by maximizing the vector-angular mean and minimizing the vector-angular variance, which separates the normal and abnormal examples well. However, the inherent quadratic programming (QP) solver takes $O(m^{3})$ training time and at least $O(m^{2})$ space for $m$ training patterns, which might be computationally prohibitive for large scale problems. By means of the core set and the $(1+\varepsilon)$-approximation algorithm, a core set based LARM algorithm is proposed for fast training of the LARM problem. Experimental results on imbalanced datasets have validated the favorable efficiency of the proposed approach in novelty detection.

1. Introduction

The task of novelty detection is to learn a model from the normal examples in the training patterns and hence classify the test patterns. In real-world novelty detection applications, it is usually assumed that normal training patterns can be well sampled, while abnormal training patterns are severely undersampled owing to expensive measurement costs or the infrequency of abnormal events. Therefore, only normal training patterns are used to build the detection model in most novelty detection algorithms. Generally, novelty detection may be seen as a one-class classification problem. Recently, novelty detection has gained much research attention in real-world applications such as network intrusion detection [1], jet engine health monitoring [2], medical data analysis [3], and aviation safety [4, 5].

In this paper, the kernel-based novelty detection algorithm is studied in depth; this family of methods is very popular and has recently proved successful. Various kernel-based novelty detection approaches have been proposed, such as the one-class support vector machine (OCSVM) [6] and support vector data description (SVDD) [7]. OCSVM was proposed by Schölkopf et al. [6]; to improve generalization ability, its novelty detection boundary is constructed to separate the origin from the input samples with the maximal margin. The performance of OCSVM is very sensitive to its parameters, which makes it difficult to generalize to other applications [8].

SVDD was proposed by Tax and Duin [7], in which the minimal ball is constructed to enclose most of the training samples. Novelty is assessed by determining whether a test point lies within the minimal ball or not. The margin between the closed boundary surrounding the positive data and that surrounding the negative data is zero, which gives the method poor generalization ability. A small sphere and large margin (SSLM) approach was proposed by Wu and Ye [9], in which the smallest hypersphere is constructed to surround the normal data while the margin from any outlier to this hypersphere is made as large as possible. An incremental weighted one-class support vector machine for mining streaming data was proposed by Krawczyk and Woźniak [10, 11], in which the weight of each object is modified according to its level of significance, and the shape of the decision boundary is influenced only by new objects that carry new and useful knowledge extending the competence of the classifier.
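For concreteness, a minimal OCSVM-style detector of the kind reviewed above can be run with scikit-learn's OneClassSVM; the data and all parameter values below are illustrative, not those used in this paper.

```python
# Minimal OCSVM-style novelty-detection baseline (illustrative only).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_normal = rng.randn(200, 2)                   # normal training patterns only
X_test = np.vstack([rng.randn(10, 2),          # likely normal
                    rng.randn(10, 2) + 5.0])   # likely novel

clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.1)  # nu bounds the outlier fraction
clf.fit(X_normal)
print(clf.predict(X_test))   # +1 = accepted as normal, -1 = flagged as novelty
```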

A support vector machine (SVM) can be solved through a quadratic programming (QP) problem, which has the important computational advantage of avoiding local minima. However, solving the corresponding SVM problem with a naive implementation of a QP solver takes $O(m^{3})$ time complexity and at least $O(m^{2})$ space complexity if the number of training patterns is $m$. Obviously, the naive implementation of a QP solver can hardly meet the practical requirements of novelty detection on large scale datasets. Tsang et al. proposed the core vector machine (CVM) [12, 13] as an approximation algorithm based on the minimum enclosing ball (MEB) for large scale problems. The key idea is that the QP formulations of the corresponding SVM problems can be equivalently viewed as MEB problems. By utilizing a $(1+\varepsilon)$-approximation algorithm for the MEB problem from computational geometry, the time complexity of the CVM algorithm is linear in the number of training patterns, while the space complexity is independent of the number of training patterns.
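The core-set idea behind CVM can be illustrated with the classical Bădoiu-Clarkson update for the $(1+\varepsilon)$-approximate MEB; this is a generic sketch of the underlying geometry, not the CVM solver itself.

```python
# Badoiu-Clarkson sketch of the (1+eps)-approximate minimum enclosing ball.
# The core set is the set of farthest points touched by the updates; its size
# depends only on eps, not on the number of input points.
import numpy as np

def approx_meb(X, eps=0.1):
    c = X[0].copy()                      # start from an arbitrary point
    core_set = {0}
    for t in range(1, int(np.ceil(1.0 / eps ** 2)) + 1):
        d = np.linalg.norm(X - c, axis=1)
        far = int(np.argmax(d))          # farthest point from the current center
        core_set.add(far)
        c += (X[far] - c) / (t + 1)      # move the center toward that point
    radius = np.linalg.norm(X - c, axis=1).max()
    return c, radius, sorted(core_set)

X = np.random.RandomState(1).randn(5000, 3)
c, r, cs = approx_meb(X, eps=0.05)
print(r, len(cs))                        # core-set size is independent of the 5000
```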

As mentioned above, only normal training patterns are used to build the detection model in most novelty detection algorithms. In practical applications of novelty detection, it is difficult, but not impossible, to obtain a few abnormal training patterns. For instance, in machine fault detection, in addition to extensive measurements under normal working conditions, there may also be some measurements of faulty situations [14]. Recently, extensive and comprehensive research has been carried out in both academia and industry to solve the imbalanced novelty detection problem.

Kernel-based novelty detection on imbalanced data is researched in this paper. Suppose $T=\{(x_{i},y_{i})\mid i=1,\ldots,m\}$ is a given training dataset with $m$ examples, where $x_{i}\in\mathbb{R}^{d}$ is the $i$th input instance and $y_{i}\in\{+1,-1\}$ is the class identity label associated with instance $x_{i}$. Let $T^{+}=\{x_{i}\mid y_{i}=+1\}$ be the set of majority (normal) training patterns with $|T^{+}|=m^{+}$, and $T^{-}=\{x_{i}\mid y_{i}=-1\}$ the set of minority (abnormal) training patterns with $|T^{-}|=m^{-}$, so that $m=m^{+}+m^{-}$. $\varphi(\cdot)$ is the feature mapping function defined by a given kernel function $k(x_{i},x_{j})=\varphi(x_{i})^{\mathrm{T}}\varphi(x_{j})$. The length of the perpendicular projection of the training pattern $\varphi(x_{i})$ onto a vector $c$ is expressed as $c^{\mathrm{T}}\varphi(x_{i})/\|c\|$, which actually reflects information about both the angular and the Euclidean distances between $\varphi(x_{i})$ and $c$ in the Euclidean vector space. According to the definition in [15], $c^{\mathrm{T}}\varphi(x_{i})/\|c\|$ is called the vector-angular.
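In the linear case ($\varphi(x)=x$) the vector-angular is simply the signed length of the projection of $x$ onto $c$, as the following two-line check shows (the vectors are arbitrary examples).

```python
# Vector-angular of a pattern x with respect to a direction c: the length of
# the perpendicular projection c^T x / ||c||, which mixes angular and
# Euclidean-distance information.
import numpy as np

c = np.array([3.0, 4.0])             # ||c|| = 5
x = np.array([2.0, 1.0])
vector_angular = c @ x / np.linalg.norm(c)
print(vector_angular)                # (3*2 + 4*1) / 5 = 2.0
```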

In this paper, a large vector-angular region and margin (LARM) algorithm and its fast training method based on the core set are proposed for novelty detection where the training patterns are imbalanced. The main contributions of this paper lie in three aspects. Firstly, the boundary of an SVM is determined only by the support vectors, and the distribution of the data in the training set is not considered [16]. However, recent theoretical results have proved that data distribution information is crucial to the generalization performance [17, 18]. The proposed algorithm aims to find an optimal vector in the feature space for which the mean and the variance of the vector-angular are maximized and minimized, respectively. Therefore, normal and abnormal examples are well separated when projected onto the optimal vector, owing to their large mean and small variance. Secondly, the proposed LARM integrates one-class and binary classification algorithms to tackle the novelty detection problem on imbalanced data: it constructs the largest vector-angular region in the feature space to separate the normal training patterns and maximizes the vector-angular margin between the optimal vector-angular region and the abnormal data. Since the number of normal training patterns is sufficient, the largest vector-angular region is constructed accurately, which minimizes the chance of rejecting normal examples. To achieve better generalization performance, the vector-angular margin between the surface of this optimal vector-angular region and the abnormal data is maximized. Thirdly, the core set based LARM algorithm is proposed for fast training of the LARM problem. The time and space complexity of core set based LARM are, respectively, linear in and independent of the number of training patterns.

The structure of this paper is organized as follows. Section 1 introduces the novelty detection technique and analyzes the existing problems. Section 2 reviews the $\nu$-support vector machine ($\nu$-SVM), two-class SVDD, and the maximum vector-angular margin classifier (MAMC). Section 3 presents the proposed LARM for novelty detection and its fast training method based on the core set. Experimental results are shown in Section 4 and conclusions are given in Section 5.

2. $\nu$-SVM, SVDD, and MAMC

2.1. $\nu$-SVM

$\nu$-SVM was proposed by Schölkopf et al. [19] to solve the binary classification problem; it uses the parameter $\nu\in(0,1]$ to control the number of support vectors and the bound on the classification errors. $\nu$-SVM can be modeled as follows:
$$\begin{aligned}\min_{w,b,\rho,\xi}\;&\frac{1}{2}\|w\|^{2}-\nu\rho+\frac{1}{m}\sum_{i=1}^{m}\xi_{i}\\ \text{s.t.}\;&y_{i}\bigl(w^{\mathrm{T}}\varphi(x_{i})+b\bigr)\geq\rho-\xi_{i},\quad \xi_{i}\geq0,\;\rho\geq0,\;i=1,\ldots,m,\end{aligned}\tag{1}$$
where $w$ is the normal vector of the decision hyperplane, $b$ is the bias of the classifier, $\rho$ is the margin variable, $\xi=(\xi_{1},\ldots,\xi_{m})^{\mathrm{T}}$ is the vector of slack variables, and $\nu$ is a positive constant. $\nu$-SVM obtains the optimal hyperplane separating the two classes with a maximal margin $2\rho/\|w\|$. To classify a testing instance $x$, the decision function takes the sign of the optimal hyperplane: $f(x)=\operatorname{sgn}\bigl(w^{\mathrm{T}}\varphi(x)+b\bigr)$.
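For reference, the formulation above is available off the shelf as NuSVC in scikit-learn; the synthetic data and the value of nu below are purely illustrative.

```python
# nu-SVM binary classifier: nu in (0, 1] lower-bounds the fraction of support
# vectors and upper-bounds the fraction of margin errors.
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2) - 2.0, rng.randn(100, 2) + 2.0])
y = np.array([+1] * 100 + [-1] * 100)

clf = NuSVC(nu=0.1, kernel="rbf", gamma=0.5).fit(X, y)
print(clf.predict([[2.0, 2.0], [-2.0, -2.0]]))  # sign of the optimal hyperplane
```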

2.2. SVDD

One-class SVDD and two-class SVDD were proposed by Tax and Duin in 2004 [7], in which the minimal ball is constructed to enclose most of the training patterns. Here, we only review two-class SVDD, which can utilize the abnormal data. Two-class SVDD can be modeled as follows:
$$\begin{aligned}\min_{R,a,\xi}\;&R^{2}+C_{1}\sum_{i:y_{i}=+1}\xi_{i}+C_{2}\sum_{j:y_{j}=-1}\xi_{j}\\ \text{s.t.}\;&\|\varphi(x_{i})-a\|^{2}\leq R^{2}+\xi_{i},\quad y_{i}=+1,\\ &\|\varphi(x_{j})-a\|^{2}\geq R^{2}-\xi_{j},\quad y_{j}=-1,\\ &\xi_{i}\geq0,\;\xi_{j}\geq0,\end{aligned}\tag{2}$$
where $R$ and $a$ are the radius and the center of the hypersphere, $C_{1}$ and $C_{2}$ are two trade-off parameters which can treat imbalanced datasets, and $\xi$ is the vector of slack variables. A testing instance $x$ is classified according to whether it lies inside the optimal hypersphere or not. Hence, the decision function of two-class SVDD is $f(x)=\operatorname{sgn}\bigl(R^{2}-\|\varphi(x)-a\|^{2}\bigr)$.
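Note that the decision only requires kernel evaluations, since with $a=\sum_{i}y_{i}\alpha_{i}\varphi(x_{i})$ the squared distance expands as $\|\varphi(x)-a\|^{2}=k(x,x)-2\sum_{i}y_{i}\alpha_{i}k(x_{i},x)+a^{\mathrm{T}}a$. Below is a minimal sketch of this expansion, assuming the multipliers alpha and the radius R have already been obtained from some QP solver (hypothetical inputs).

```python
# Two-class SVDD decision via the kernel trick, for a center expanded as
# a = sum_i y_i * alpha_i * phi(x_i).
import numpy as np

def rbf(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def svdd_decision(x, X_train, y, alpha, R, gamma=0.5):
    k_xx = 1.0                                    # rbf(x, x) = exp(0)
    k_xi = rbf(x[None, :], X_train, gamma)[0]     # k(x, x_i) for all i
    K = rbf(X_train, X_train, gamma)
    coef = y * alpha
    dist2 = k_xx - 2.0 * coef @ k_xi + coef @ K @ coef
    return np.sign(R ** 2 - dist2)                # +1: inside the ball (normal)
```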

2.3. MAMC

MAMC was proposed by Hu et al. in 2012 [15]; it attempts to find an optimal vector $c$ in the feature space based on the maximum vector-angular margin. MAMC can be modeled as follows:
$$\begin{aligned}\min_{c,\rho,\xi}\;&\frac{1}{2}\|c\|^{2}-\nu\rho+C\sum_{i=1}^{m}\xi_{i}\\ \text{s.t.}\;&y_{i}\,c^{\mathrm{T}}\varphi(x_{i})\geq\rho-\xi_{i},\quad \xi_{i}\geq0,\;i=1,\ldots,m,\end{aligned}\tag{3}$$
where $c$ is the optimized vector, $\rho$ is the vector-angular margin, $\xi$ is the vector of slack variables, and $\nu$ and $C$ are two positive constants. To classify a testing instance $x$, the decision function is defined as $f(x)=\operatorname{sgn}\bigl(c^{\mathrm{T}}\varphi(x)\bigr)$.

3. Core Set Based Large Vector-Angular Region and Margin

In this section, the LARM algorithm and its fast training method based on the core set are proposed for novelty detection with imbalanced data.

3.1. LARM

To tackle the novelty detection problem on imbalanced data, both the vector-angular distribution and the maximization of the vector-angular margin are considered in this paper. Figure 1 illustrates the principle of LARM.

Firstly, LARM is adopted to find an optimal vector $c$ in the feature space, attempting to maximize the vector-angular mean and minimize the vector-angular variance simultaneously. Here, the vector-angular expresses the length of the projection of a training pattern $\varphi(x_{i})$ onto the optimal vector $c$. Therefore, the normal and abnormal examples are well separated when projected onto the optimal vector, owing to their large mean and small variance.

Secondly, for the learning problem on imbalanced data, the largest vector-angular region in the feature space is constructed to separate the normal data. Since the number of normal training patterns is sufficient, the largest vector-angular region is constructed accurately, which minimizes the chance of rejecting normal examples. Meanwhile, to achieve a favorable generalization performance, the vector-angular margin between the surface of this optimal vector-angular region and the abnormal data is maximized.

3.1.1. Primal Formulation of LARM

Formally, define the training pattern matrix $X=[\varphi(x_{1}),\ldots,\varphi(x_{m})]$, the label column vector $y=(y_{1},\ldots,y_{m})^{\mathrm{T}}$, and the label diagonal matrix $Y=\operatorname{diag}(y_{1},\ldots,y_{m})$. According to the definition in [18], the vector-angular mean $\bar{\gamma}$ and vector-angular variance $\hat{\gamma}$ between the training patterns $\varphi(x_{i})$, $i=1,\ldots,m$, and the vector $c$ can be expressed as
$$\bar{\gamma}=\frac{1}{m}\sum_{i=1}^{m}y_{i}\,c^{\mathrm{T}}\varphi(x_{i}),\qquad \hat{\gamma}=\frac{1}{m}\sum_{i=1}^{m}\bigl(y_{i}\,c^{\mathrm{T}}\varphi(x_{i})-\bar{\gamma}\bigr)^{2}.\tag{4}$$
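Under this reading, both statistics reduce to ordinary moments of the projections $y_{i}c^{\mathrm{T}}\varphi(x_{i})$ and are computable from the kernel matrix alone once $c=XY\beta$ as in (6) below; a small numpy sketch (beta is an arbitrary coefficient vector here):

```python
# Vector-angular mean and variance of the projections y_i * c^T phi(x_i),
# with c = X Y beta, so that the projections equal y * (K @ (y * beta)).
import numpy as np

def vector_angular_stats(K, y, beta):
    proj = y * (K @ (y * beta))          # y_i * c^T phi(x_i) for all i
    mean = proj.mean()                   # vector-angular mean (to be maximized)
    var = ((proj - mean) ** 2).mean()    # vector-angular variance (to be minimized)
    return mean, var
```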

Then, the primal LARM can be formulated as the following optimization problem:
$$\begin{aligned}\min_{c,\rho_{1},\rho_{2},\xi}\;&\frac{1}{2}\|c\|^{2}+\lambda_{1}\hat{\gamma}-\lambda_{2}\bar{\gamma}-\nu_{1}\rho_{1}-\nu_{2}\rho_{2}+\frac{1}{m^{+}}\sum_{i:y_{i}=+1}\xi_{i}+\frac{1}{m^{-}}\sum_{j:y_{j}=-1}\xi_{j}\\ \text{s.t.}\;&c^{\mathrm{T}}\varphi(x_{i})\geq\rho_{1}-\xi_{i},\quad y_{i}=+1,\\ &c^{\mathrm{T}}\varphi(x_{j})\leq\rho_{1}-\rho_{2}+\xi_{j},\quad y_{j}=-1,\\ &\xi\geq0,\;\rho_{2}\geq0,\end{aligned}\tag{5}$$
where $c$ is the optimal vector, $\rho_{1}$ is the width of the vector-angular region, $\rho_{2}$ is the vector-angular margin, $\xi$ is the vector of slack variables, and $\nu_{1}$, $\nu_{2}$, $\lambda_{1}$, and $\lambda_{2}$ are four positive constants.

According to [18], $c$ for problem (5) is expressed as follows:
$$c=\sum_{i=1}^{m}y_{i}\beta_{i}\,\varphi(x_{i})=XY\beta.\tag{6}$$

Hence, $c^{\mathrm{T}}\varphi(x_{i})=K_{i}^{\mathrm{T}}Y\beta$ can be obtained, where $K=X^{\mathrm{T}}X$ is the kernel matrix. Problem (5) can be formulated as follows:
$$\begin{aligned}\min_{\beta,\rho_{1},\rho_{2},\xi}\;&\frac{1}{2}\beta^{\mathrm{T}}YKY\beta+\lambda_{1}\hat{\gamma}-\lambda_{2}\bar{\gamma}-\nu_{1}\rho_{1}-\nu_{2}\rho_{2}+\frac{1}{m^{+}}\sum_{i:y_{i}=+1}\xi_{i}+\frac{1}{m^{-}}\sum_{j:y_{j}=-1}\xi_{j}\\ \text{s.t.}\;&K_{i}^{\mathrm{T}}Y\beta\geq\rho_{1}-\xi_{i},\quad y_{i}=+1,\\ &K_{j}^{\mathrm{T}}Y\beta\leq\rho_{1}-\rho_{2}+\xi_{j},\quad y_{j}=-1,\\ &\xi\geq0,\;\rho_{2}\geq0,\end{aligned}\tag{7}$$
where $\bar{\gamma}$ and $\hat{\gamma}$ are evaluated through (4) with $c=XY\beta$, and $K_{i}$ is the $i$th column of $K$.

3.1.2. Dual Problem

To investigate the problem with the constraints described in (7), the Lagrangian function (8) is constructed by introducing nonnegative Lagrange multipliers for the inequality constraints. The equations (9)–(13) can be obtained by setting the partial derivatives of the Lagrangian with respect to the primal variables to zero.

Substituting (9)–(13) into (8) and omitting the constant terms that have no influence on the optimization, the dual form (14) can be obtained; it is a QP problem in the dual multipliers whose coefficient matrix is built from the kernel matrix $K$.

The dual problem (14) is a QP problem with the same form as the dual of the $\nu$-SVM [19, 20]. Therefore, the QP problem (14) can be readily solved by the SMO algorithm in LIBSVM [21].

Suppose $\alpha^{*}$ is the optimal solution of the dual problem (14). According to (13), the optimal vector $c^{*}$ can be expressed in terms of $\alpha^{*}$ as a kernel expansion over the training patterns, as given in (15).

To compute $\rho_{1}$ and $\rho_{2}$, two index sets are considered, as defined in (16): the normal and the abnormal support vectors whose multipliers lie strictly between their lower and upper bounds.

According to the Karush-Kuhn-Tucker (KKT) conditions (17), together with (11) and (12), the complementary slackness relations on these two sets can be obtained. Hence, by averaging over the two sets, $\rho_{1}$ and $\rho_{2}$ can be expressed as in (18).

3.1.3. Decision Function

It can be seen that minimizing the cost function (5) makes the width of the vector-angular region and the vector-angular margin as large as possible. Meanwhile, the optimal vector $c$ in the feature space is found, which makes the normal and abnormal examples well separated when projected onto it, owing to their large mean and small variance. Therefore, the testing patterns can be classified in terms of the vector-angular between the vector $c$ and the testing pattern. The optimal separating hyperplane of an SVM lies at the middle of the margin; similarly, the separating hyperplane of LARM is defined at the center of the margin, that is, at $\rho_{1}-\rho_{2}/2$. Hence, for a testing instance $x$, the decision function is expressed as follows:
$$f(x)=\operatorname{sgn}\Bigl(c^{\mathrm{T}}\varphi(x)-\bigl(\rho_{1}-\tfrac{\rho_{2}}{2}\bigr)\Bigr).\tag{19}$$
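With $c$ in kernel form, (19) again needs only kernel evaluations against the patterns carrying nonzero coefficients; a sketch, where beta, rho1, and rho2 stand for quantities recovered from (15) and (18) (hypothetical names):

```python
# LARM decision at the center of the margin: accept x as normal when the
# vector-angular c^T phi(x) exceeds the threshold rho1 - rho2 / 2.
import numpy as np

def larm_decision(k_x, y, beta, rho1, rho2):
    # k_x[i] = k(x_i, x) for the training (or core-set) patterns
    score = (y * beta) @ k_x                    # c^T phi(x) in kernel form
    return np.sign(score - (rho1 - rho2 / 2.0))
```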

3.1.4. $\nu$-Property

Let $m_{e}^{+}$ and $m_{e}^{-}$ represent the numbers of margin errors among the normal and abnormal training patterns, and let $m_{s}^{+}$ and $m_{s}^{-}$ denote the numbers of support vectors among the normal and abnormal training patterns, respectively. According to (9) and (10), the relations (20) between the sums of the dual multipliers over each class and the constants $\nu_{1}$ and $\nu_{2}$ can be obtained.

By a proof similar to that of the $\nu$-property in [19], and by making use of (20), the following inequalities can be obtained:
$$\frac{m_{e}^{+}}{m^{+}}\leq\nu_{1}\leq\frac{m_{s}^{+}}{m^{+}},\qquad\frac{m_{e}^{-}}{m^{-}}\leq\nu_{2}\leq\frac{m_{s}^{-}}{m^{-}}.\tag{21}$$

The inequalities (21) indicate that $\nu_{1}$ (or $\nu_{2}$) is a lower bound on the fraction of support vectors in the normal (or abnormal) dataset and an upper bound on the fraction of misclassified patterns in the normal (or abnormal) dataset. The $\nu$-property of LARM can be used for parameter selection in the following experiments.
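The property is easy to check empirically after training: count, per class, the fractions of support vectors and of margin errors and verify that each brackets the corresponding $\nu$; a sketch with hypothetical solver outputs alpha (multipliers) and xi (slacks):

```python
# Empirical check of the nu-property (21): for each class,
# (#margin errors / class size) <= nu <= (#support vectors / class size).
import numpy as np

def check_nu_property(alpha, xi, y, nu1, nu2, tol=1e-8):
    for label, nu in ((+1, nu1), (-1, nu2)):
        mask = (y == label)
        frac_sv = float((alpha[mask] > tol).mean())   # fraction of support vectors
        frac_err = float((xi[mask] > tol).mean())     # fraction of margin errors
        print(f"class {label:+d}: {frac_err:.3f} <= {nu} <= {frac_sv:.3f}")
```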

3.2. Core Set Based LARM

As mentioned above, the dual problem of LARM can be formulated as a QP problem. Solving the corresponding QP problem of LARM thus takes $O(m^{3})$ time complexity and $O(m^{2})$ space complexity, which is computationally infeasible when the number of training patterns is large. Inspired by the core set based approximate MEB algorithms, the $(1+\varepsilon)$-approximation algorithm is utilized for fast training of the LARM problem; the result is called core set based LARM. Firstly, a core set of training patterns is obtained by the $(1+\varepsilon)$-approximation algorithm so as to capture the vector-angular distribution of the normal and abnormal examples. The core set is a subset of the original training patterns, and the optimization problem can be approximately solved on the core set. Secondly, the LARM problem is solved by the SMO algorithm [22] on the obtained core set. According to [12, 13], the size of the core set is independent of both the number and the dimension of the training patterns, so the time complexity is linear in the number of training patterns while the space complexity is independent of it. The schematic illustration of core set based LARM is shown in Figure 2.

Suppose $S_{t}$ is the core set at the $t$th iteration, $c_{t}$ is the optimal vector in the feature space at the $t$th iteration, $r_{t}$ is the minimum distance between the center of the vector-angular margin and any point in the core set at the $t$th iteration, and $R_{t}$ is the maximum distance between the center of the vector-angular margin and any point in the core set at the $t$th iteration. Given $\varepsilon>0$, according to [12, 13], the core set based LARM is trained as follows.

(i) Initialize $S_{0}$, $c_{0}$, and $t=0$.

(ii) If no training point falls outside the $(1+\varepsilon)$ vector-angular region, terminate and go to step (vi).

(iii) Find $x_{a}$ and $x_{b}$, where $x_{a}$ is the point furthest from the center of the vector-angular margin and $x_{b}$ is the point nearest to it. Set $S_{t+1}=S_{t}\cup\{x_{a},x_{b}\}$.

The distance between the center of the vector-angular margin and any point $\varphi(x_{l})$ is expressed as follows:
$$d_{t}(x_{l})=\Bigl|c_{t}^{\mathrm{T}}\varphi(x_{l})-\bigl(\rho_{1,t}-\tfrac{\rho_{2,t}}{2}\bigr)\Bigr|,\tag{22}$$
where $\rho_{1,t}$ is the width of the vector-angular region at the $t$th iteration, $\rho_{2,t}$ is the vector-angular margin at the $t$th iteration, and the candidate set at the $t$th iteration consists of all training patterns that fall outside the $(1+\varepsilon)$ vector-angular region.

Computing (22) for all training patterns takes $O(|S_{t}|\,m)$ kernel evaluations at the $t$th iteration. When $m$ is large, this cost becomes enormous. In order to reduce the computation cost, the probabilistic speedup method [23] is used to accelerate the vector-angular computations in steps (ii) and (iii). The details of the time and space complexities can be found in [12, 13].

(iv) Find the new vector-angular region.

(v) Increase $t$ by 1 and go back to step (ii).

(vi) Solve the LARM problem (14) on the final core set.

(vii) Classify the test patterns by the decision function (19).

A compact sketch of this training loop is given below.
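In the sketch, solve_larm_on and distance_to_center are hypothetical placeholders for the SMO solve of (14) on the current core set and for the distance (22); the model object and its region_radius attribute are likewise assumed. For brevity only the farthest violator is added per iteration, whereas step (iii) above also adds the nearest point. The 59-point random probe is a common instantiation of the probabilistic speedup of [23].

```python
# Sketch of the core set based LARM outer loop (steps (i)-(vii)).
import numpy as np

def train_core_set_larm(X, y, eps, solve_larm_on, distance_to_center,
                        n_probe=59, seed=0):
    rng = np.random.RandomState(seed)
    core = [int(np.argmax(y)), int(np.argmin(y))]   # (i) one normal, one abnormal
    model = solve_larm_on(X[core], y[core])
    while True:
        # probabilistic speedup [23]: probe a small random sample instead of
        # scanning all m patterns for the farthest violator
        probe = rng.choice(len(X), size=min(n_probe, len(X)), replace=False)
        d = np.array([distance_to_center(model, X[i]) for i in probe])
        if d.max() <= (1.0 + eps) * model.region_radius:
            break                                   # (ii) nothing falls outside
        core.append(int(probe[np.argmax(d)]))       # (iii) grow the core set
        model = solve_larm_on(X[core], y[core])     # (iv)-(vi) re-solve on core set
    return model, core                              # (vii): classify via (19)
```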

4. Experimental Results

The proposed core set based LARM is evaluated on twenty datasets, including both LIBSVM datasets [24] and UCI datasets [25]. Details of the datasets are listed in Table 1, where $d$ is the data dimension, #pos is the total number of normal patterns, #neg is the total number of abnormal patterns, and the remaining columns give the numbers of normal and abnormal training patterns. The dataset sizes range from 178 to 495,141, and the proportion of majority to minority data ranges from 10 : 1 to 1000 : 1. Experiments are repeated 10 times with random data partitions, and the geometric mean accuracy and the standard deviation are recorded.

4.1. Performance Measurement and Parameter Selection

The performance of core set based LARM is compared with three kernel-based algorithms: $\nu$-SVM, SVDD, and MAMC. The geometric mean accuracy $g=\sqrt{\mathrm{acc}^{+}\cdot\mathrm{acc}^{-}}$ [26] is used for both parameter selection and algorithm evaluation, where $\mathrm{acc}^{+}$ is the classification accuracy on the positive class and $\mathrm{acc}^{-}$ is the classification accuracy on the negative class. This measurement is widely applied to imbalanced data [14, 26, 27], since it accounts for the classification results on both the positive and the negative classes. To make the experimental results persuasive enough, all the parameters of $\nu$-SVM, SVDD, MAMC, and core set based LARM are selected by fivefold cross validation.
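The measure is a one-liner over the per-class accuracies:

```python
# Geometric mean accuracy g = sqrt(acc+ * acc-): a balanced score that is not
# dominated by the majority class.
import numpy as np

def g_mean(y_true, y_pred):
    pos = (y_true == +1)
    acc_pos = float((y_pred[pos] == +1).mean())   # accuracy on the positive class
    acc_neg = float((y_pred[~pos] == -1).mean())  # accuracy on the negative class
    return np.sqrt(acc_pos * acc_neg)
```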

In all experiments, the Radial Basis Function (RBF) is taken as the kernel function, $k(x_{i},x_{j})=\exp\bigl(-\|x_{i}-x_{j}\|^{2}/(2\sigma^{2})\bigr)$, where $\sigma$ is the kernel parameter of the RBF. For all the algorithms, the RBF parameter $\sigma$ is calculated from the training patterns following the rule in [12, 13].
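A common data-driven default of this kind, used for example in the CVM experiments [12, 13], ties the squared kernel width to the average squared pairwise distance between training patterns; the sketch below follows that convention, which is an assumption about the exact rule intended here.

```python
# RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), with sigma^2 set to
# the mean squared pairwise distance (a CVM-style default; assumed here).
import numpy as np

def rbf_kernel_matrix(X):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    sigma2 = d2.mean()                                   # data-driven width
    return np.exp(-d2 / (2.0 * sigma2)), sigma2
```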

For $\nu$-SVM, the parameter $\nu$ is searched over a grid of candidate values in $(0,1]$.

For SVDD, the parameter $C_{1}$ is searched over a grid of candidate values, and the parameter $C_{2}$ is determined by searching the ratio $C_{2}/C_{1}$ over a set of candidate values.

For MAMC, the parameters $\nu$ and $C$ are each searched over a grid of candidate values.

For core set based LARM, the parameter $\varepsilon$ is searched over a grid of candidate values, and the parameters $\nu_{1}$ and $\nu_{2}$ are searched over grids of candidate values. From (21), suitable $\nu_{1}$ and $\nu_{2}$ can be found, since they are most closely associated with the percentages of support vectors and margin errors. From Section 4.2, we can see that the parameters $\lambda_{1}$ and $\lambda_{2}$ have only a faint effect on the accuracy rate. Therefore, $\lambda_{1}$ and $\lambda_{2}$ are set to 1 and a fixed constant, respectively.

4.2. Parameters Influence

There are five parameters in core set based LARM, namely, $\nu_{1}$, $\nu_{2}$, $\lambda_{1}$, $\lambda_{2}$, and $\varepsilon$. To verify the influence of these parameters on the performance of core set based LARM, experiments on some representative datasets are performed. By fixing the other parameters, the influence of each parameter is studied in turn, as shown in Figures 3–7.

Figures 3–7 show the influence of each of the five parameters on the geometric mean accuracy and the size of the core set. In each figure, one parameter is varied over its search range (from 10 to 100 in Figure 3, and from 0.001 to 0.01 in Figures 4 and 5, for example) while the remaining four parameters are fixed at the suggested values obtained by the cross validation described in Section 4.1.

From Figures 3–7, it can be seen that the parameters $\nu_{1}$, $\nu_{2}$, $\lambda_{1}$, $\lambda_{2}$, and $\varepsilon$ have only a faint effect on the geometric mean accuracy and the size of the core set, which makes core set based LARM even more attractive in practice. Therefore, the parameter values obtained by the cross validation described in Section 4.1 are acceptable for all experiments.

4.3. Numerical Results
4.3.1. Detection Performance

For each dataset, samples are randomly split into training patterns and testing patterns with the proportions described in Table 1. Parameters of $\nu$-SVM, SVDD, MAMC, and core set based LARM are selected by fivefold cross validation to make the experimental results persuasive enough.

The geometric mean accuracy is used for the performance evaluation. Experiments are repeated 10 times with random data partitions. The average accuracy and the standard deviation are listed in Table 2. NULL indicates that no result was returned within 10 hours. Furthermore, for every dataset, the difference between the results in bold and the best geometric mean accuracy is not significant, as determined by the Wilcoxon rank-sum test at the confidence level of 0.05.

From Table 2, it can be concluded that the performance of core set based LARM is comparable to the best of $\nu$-SVM, SVDD, and MAMC on all datasets. The core set based LARM performs significantly better than $\nu$-SVM, SVDD, and MAMC on 12, 9, and 13 of the 20 datasets, respectively. This illustrates that, by using the core set and the $(1+\varepsilon)$-approximation algorithm for training LARM, the generalization performance of core set based LARM is comparable to, or even better than, the best of $\nu$-SVM, SVDD, and MAMC.

4.3.2. Time Cost

The time cost of $\nu$-SVM, SVDD, MAMC, and core set based LARM on different datasets is shown in Tables 3 and 4. The average and standard deviation of training time (including parameter selection and model training time) are shown in Table 3. The average and standard deviation of testing time are shown in Table 4. All the experiments are conducted on a computer with an Intel CPU and 8 GB SDRAM. NULL indicates that no result was returned within 10 hours. Furthermore, for every dataset, the difference between the results in bold and the best time cost is not significant, as determined by the Wilcoxon rank-sum test at the confidence level of 0.05.

From Table 3, it can be clearly seen that the training time of core set based LARM is longer than the best of $\nu$-SVM, SVDD, and MAMC when the number of training patterns is less than 2,143. However, when the number of training patterns is larger than 2,686, as on SDD, MC, Shuttle, Cod-rna, S. segmentation, and Covtype, the training time of core set based LARM is shorter than the best of $\nu$-SVM, SVDD, and MAMC. When the number of training patterns increases to 141,792, the average training time of core set based LARM does not exceed 65 seconds. Therefore, the training time of core set based LARM does not increase quickly with the number of training patterns.

As can be seen from Table 4, the best of $\nu$-SVM, SVDD, and MAMC has slightly shorter testing time than core set based LARM on 11 of the 20 datasets, with a gap of at most 0.002 seconds. Moreover, the testing time of core set based LARM is never the worst. When the number of testing patterns is 353,349, as on Covtype, the average testing time of core set based LARM is about 1.5 seconds. This shows that core set based LARM can detect testing examples quickly.

5. Conclusion

In this paper, a novel LARM algorithm and its fast training method based on the core set are proposed for novelty detection on imbalanced data. The proposed LARM algorithm combines the ideas of one-class and binary classification algorithms: it constructs the largest vector-angular region in the feature space to separate the normal training patterns and maximizes the vector-angular margin between this optimal vector-angular region and the abnormal data. To improve the generalization performance of LARM, the vector-angular distribution is optimized by maximizing the vector-angular mean and minimizing the vector-angular variance. To improve the computational efficiency, the core set and the $(1+\varepsilon)$-approximation algorithm are employed for fast training of LARM. The time and space complexity of core set based LARM are, respectively, linear in and independent of the number of training patterns. Comprehensive experiments have validated the effectiveness of the proposed approach. In the future, it will be interesting to extend the idea of LARM to handle the pure one-class learning problem.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant nos. U1433103 and U1333116; the Science and Technology Foundation of Civil Aviation Administration of China under Grant no. 20150227; and the Fundamental Research Foundation for the Central Universities of CAUC under Grant no. 3122014D022 and no. 3122014B002.