Special Issue: Artificial Intelligence and Data Mining 2014
Cost-Sensitive Support Vector Machine Using Randomized Dual Coordinate Descent Method for Big Class-Imbalanced Data Classification
The cost-sensitive support vector machine is one of the most popular tools for dealing with class-imbalanced problems such as fault diagnosis. However, such data often come with a huge number of examples as well as features. Aiming at class-imbalanced problems on big data, a cost-sensitive support vector machine using the randomized dual coordinate descent method (CSVM-RDCD) is proposed in this paper. The solution of the subproblem at each iteration is derived in closed form, and the computational cost is decreased through an accelerating strategy and cheap computation. The four constrained conditions of CSVM-RDCD are derived. Experimental results illustrate that the proposed method increases the recognition rate of the positive class and reduces the average misclassification cost on real big class-imbalanced data.
The most popular strategy for the design of classification algorithms is to minimize the probability of error, assuming that all misclassifications have the same cost and that the classes of the dataset are balanced [1–6]. The resulting decision rules are usually called cost insensitive. However, in many important applications of machine learning, such as fault diagnosis and fraud detection, certain types of error are much more costly than others. Other applications involve significantly class-imbalanced datasets, where examples from different classes appear with substantially different probabilities. The cost-sensitive support vector machine (CSVM) is one of the most popular tools for dealing with class imbalance and unequal misclassification costs. However, in many applications, such data come with a huge number of examples as well as features.
In this work we consider the cost-sensitive support vector machine architecture. Although CSVMs are based on a very solid learning-theoretic foundation and have been successfully applied to many classification problems, it is not well understood how to design big data learning for the CSVM algorithm. CSVM usually maps training vectors into a high dimensional space via a nonlinear function. Due to the high dimensionality of the weight vector, one solves the dual problem of CSVM with the kernel trick. In some applications, the data already lie in a rich, high dimensional feature space, and performance is similar with or without the nonlinear mapping. If the data are not mapped, we can often train on much larger data sets.
Recently, many methods have been proposed for linear SVM in large-scale scenarios. Among them, dual coordinate descent methods are some of the most popular for large-scale convex optimization problems. However, they do not focus on big data learning of CSVM. We focus on big data class-imbalanced learning by CSVM.
This paper is organized as follows. In Section 2 the basic theory of the cost-sensitive support vector machine is described. In Section 3 we derive our proposed algorithm. Section 4 discusses both speed-ups and the four constrained conditions of the cost-sensitive support vector machine. Implementation issues are investigated in Section 5. Experiments demonstrate the efficiency of the proposed method.
2. Basic Theory of Cost-Sensitive Support Vector Machine
Cost-sensitive support vector machine such as 2C-SVM is proposed in . Consider examples set , , , , . The 2C-SVM has primal optimization problem: Lagrange multipliers method can be used to solve the constrained optimization problem. In this method, the Lagrange equation is defined as follows: where the , ’s are called the Lagrange multipliers. ’s partial derivatives are set to be zero as follows:
And solve . The formula (3) is extended to be
Equation (4) can be reformatted as
Lagrange equation (8) is simplified as
Recall that the equation above is obtained by minimizing with respect to and . Putting this together with the constraints and the constraints (5) to (7), the following dual optimization problem of 2C-SVM is obtained as where , are Lagrange multipliers and , , , are misclassification cost parameters. The matrices and are defined, respectively, as follows:
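For orientation, the primal and dual problems of a linear 2C-SVM can be written in the standard form found in the cost-sensitive SVM literature. This is a sketch in assumed notation, which may differ from the symbols used in this paper: $n$ examples $(x_i, y_i)$ with $y_i \in \{+1,-1\}$, index sets $I_{+}$ and $I_{-}$ for the positive and negative classes, and class-dependent costs $C_{+}$ and $C_{-}$ (in the parametrization of Davenport's report, $C_{+} = C\gamma$ and $C_{-} = C(1-\gamma)$).

```latex
% Primal: soft-margin SVM with class-dependent slack costs
\min_{w,\,b,\,\xi}\quad \frac{1}{2}\|w\|^{2}
  + C_{+}\sum_{i \in I_{+}} \xi_i + C_{-}\sum_{i \in I_{-}} \xi_i
\quad \text{s.t.}\quad y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,n.

% Dual: one sum constraint plus a class-dependent box constraint per multiplier
\max_{\alpha}\quad \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, x_i^{\top}x_j
\quad \text{s.t.}\quad \sum_{i=1}^{n} \alpha_i y_i = 0,\quad
  0 \le \alpha_i \le C_{+}\ (i \in I_{+}),\quad
  0 \le \alpha_i \le C_{-}\ (i \in I_{-}).

% Stationarity in w links the two problems
w = \sum_{i=1}^{n} \alpha_i y_i x_i .
```

In the kernelized version the inner product $x_i^{\top}x_j$ is replaced by a kernel value; the sum constraint is what forces at least two multipliers to change together in Section 3.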
3. Cost-Sensitive Support Vector Machine Using Randomized Dual Coordinate Descent Method
In this section, the randomized dual coordinate descent method is used to solve the 2C-SVM, which is one version of the cost-sensitive support vector machine. The optimization process starts from an initial point and generates a sequence of vectors , . The process from to is called an outer iteration. Each outer iteration consists of inner iterations, in which the coordinates are updated sequentially. Each outer iteration thus generates vectors , , such that
For updating to , the following one-variable subproblem is solved as where . This is the general process of the one-variable coordinate descent method. However, the sum constraint of the optimization problem (13) is denoted as follows: where each multiplier is exactly determined by the others, so if all the others are held fixed, no change can be made to it without violating the constraint condition (14) in the optimization problem.
Thus, if we want to update some subset of the Lagrange multipliers, we must update at least two of them simultaneously in order to keep satisfying the constraints. Suppose we currently have a setting of the multipliers that satisfies the constraint condition (14), and we decide to hold all but two of them fixed and reoptimize the dual problem of CSVM (10) with respect to the remaining pair , (subject to the constraints). Equation (6) is reformatted as
Since the right hand side is fixed, we can just let it be denoted by some constant :
We can form the updated coordinates before and after with respect to , : , are updated coordinates before and after with respect to . Consider the following dual two-variable subproblem of 2C-SVM: where is constant, , and
Treating , , as constants, we should be able to verify that this is just some quadratic function in . We can easily maximize this quadratic function by setting its derivative to zero and solving the optimization problem. The following two-variable optimization problem is obtained as
The closed form is derived as follows:
We now consider the box constraints (10) of the two-variable optimization problem. The box constraints , , , fall into four cases according to the labels of the two selected examples.
First, suppose that the labels of the examples are
The sum constraint of the Lagrange multipliers for these labels is
Another expression of the sum constraint of the Lagrange multipliers is
The box constraints of the Lagrange multipliers are defined as
We obtain a new expression of the box constraints of the Lagrange multipliers as follows:
Thus we obtain stricter box constraints of the Lagrange multipliers , according to , as follows:
Secondly, suppose that the labels of the examples are
Similarly, stricter box constraints of the Lagrange multipliers , are obtained as follows:
Thirdly, suppose that the labels of the examples are
Similarly, stricter box constraints of the Lagrange multipliers are obtained as follows:
Finally, suppose that the labels of the examples are
Similarly, stricter box constraints of the Lagrange multipliers are obtained as follows:
For simplicity, a temporary solution is defined, which is then clipped to satisfy
From the linear constraint with respect to in (17), the value of is obtained as
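The two-variable update with its four label cases can be sketched as follows. This is a minimal illustration in the style of a standard SMO pair update, not the authors' code: the function name, the per-example bounds `c_i` and `c_j` (the cost of each example's class), the errors `err_k = w.x_k - y_k`, and the curvature `eta = x_i.x_i + x_j.x_j - 2 x_i.x_j` are all assumed notation. The two branches on the labels cover the four cases above, because the bound of each coordinate is passed in explicitly.

```python
def clipped_pair_update(a_i, a_j, y_i, y_j, c_i, c_j, err_i, err_j, eta):
    """Return updated (a_i, a_j) that keep y_i*a_i + y_j*a_j unchanged.

    a_i, a_j     : current multipliers of the two selected examples
    y_i, y_j     : labels in {+1, -1}
    c_i, c_j     : upper box bounds (class-dependent costs, assumed names)
    err_i, err_j : w.x_k - y_k for k = i, j (the bias cancels in the difference)
    eta          : x_i.x_i + x_j.x_j - 2 * x_i.x_j, curvature along the pair
    """
    # Stricter box for a_j implied by both boxes and the sum constraint.
    if y_i == y_j:
        lo = max(0.0, a_i + a_j - c_i)
        hi = min(c_j, a_i + a_j)
    else:
        lo = max(0.0, a_j - a_i)
        hi = min(c_j, c_i + a_j - a_i)
    if eta <= 0.0 or lo >= hi:
        return a_i, a_j  # degenerate pair: leave both coordinates unchanged

    # Unconstrained (temporary) solution of the one-dimensional quadratic.
    a_j_new = a_j + y_j * (err_i - err_j) / eta
    # Clip the temporary solution to the stricter box.
    a_j_new = min(hi, max(lo, a_j_new))
    # Recover a_i from the sum constraint.
    a_i_new = a_i + y_i * y_j * (a_j - a_j_new)
    return a_i_new, a_j_new
```

For example, with two opposite-label points starting from zero multipliers, `clipped_pair_update(0.0, 0.0, 1, -1, 10.0, 1.0, -1.0, 1.0, 4.0)` moves both multipliers to 0.5, and the sum constraint stays satisfied.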
4. The Modified Proposed Method
4.1. Two Speeding-Up Strategies
To avoid redundant and ineffective iterations, updates are skipped under the two conditions below. If either condition is met, the algorithm skips the current iteration, which can significantly reduce the amount of computation and accelerate convergence.
Condition 1. If or , then the coordinates are not updated.
Due to the box constraints of (10), many coordinates sit at boundary points ( or ) during the computation. If the two selected coordinates ( and ) both take the value 0 or C in an iteration, their analytical update need not be calculated. The reason is that formula (17) guarantees or , and under the doubly restricted box constraints a result of 0 or is clipped back to 0 or . The coordinates are therefore left unchanged.
Condition 2. If the projected gradient is 0, the coordinates are not updated.
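Condition 2 can be illustrated with the projected gradient of a single box-constrained coordinate, in the form commonly used for dual coordinate descent on linear SVMs. This is a hedged sketch under assumptions: the function names and the tolerance `tol` are hypothetical, the dual is taken in minimization form, and only Condition 2 is sketched.

```python
def projected_gradient(grad, alpha, upper, tol=1e-12):
    """Projected gradient of one coordinate with box 0 <= alpha <= upper.

    grad is the partial derivative of the dual objective (minimization
    form) with respect to this coordinate.
    """
    if alpha <= tol:            # at the lower bound, alpha can only increase
        return min(grad, 0.0)
    if alpha >= upper - tol:    # at the upper bound, alpha can only decrease
        return max(grad, 0.0)
    return grad                 # interior point: plain gradient


def skip_by_projected_gradient(grad, alpha, upper, tol=1e-12):
    """Condition 2 (as sketched here): skip when the projected gradient vanishes."""
    return abs(projected_gradient(grad, alpha, upper, tol)) <= tol
```

A coordinate at 0 with a positive gradient, or at its upper bound with a negative gradient, cannot improve the objective by moving inside the box, so the corresponding update can be skipped.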
4.2. Dual Problem of Cost-Sensitive Support Vector Machine Using Randomized Dual Coordinate Descent Method and Its Complexity Analysis
The derivation above yields a solution of the CSVM dual, but the computational complexity of the solving process is still high. Assume that the average number of nonzero features per sample is . Firstly, the computational complexity of the inner product matrix , is , but it can be computed in advance and stored in memory. Secondly, the calculation of has computational complexity . The amount of calculation is very large when the data size is large. However, the linear CSVM model satisfies the following relationship:
Thus, is further simplified as follows:
Solving the corresponding formula (23) can be simplified as follows:
As can be seen, a computation with complexity becomes one with complexity , so the computation time is reduced by . However, recomputing after each update still has computational complexity . The amount of calculation can be reduced significantly when is updated incrementally from the change in . Let , be the current values of the selected coordinates and , their updated values; then can be updated in a simple way:
Its computational complexity is only . So, whether calculating or updating , the complexity of the coordinate gradient computation is , which is one of the reasons for the rapid convergence of the coordinate gradient method.
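The saving described above can be sketched for the linear case. The sketch assumes the usual relationship w = sum_i alpha_i y_i x_i and a dual in minimization form, so that the coordinate gradient is y_i (w . x_i) - 1; the function names are hypothetical and dense NumPy vectors stand in for sparse feature vectors.

```python
import numpy as np


def coordinate_gradient(w, x_i, y_i):
    """Gradient of the dual (minimization form) in coordinate i.

    Costs one inner product with x_i instead of a pass over all examples.
    """
    return y_i * float(w @ x_i) - 1.0


def update_weights(w, x_i, y_i, delta_i, x_j, y_j, delta_j):
    """Update w in place after alpha_i and alpha_j change by delta_i, delta_j."""
    w += delta_i * y_i * x_i + delta_j * y_j * x_j
    return w
```

With sparse feature vectors, both functions touch only the nonzero features of the selected examples, which is where the per-iteration cost proportional to the number of nonzero features comes from.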
The initial value is assigned subject to the constraints, and the initial point is set as follows, where , are the total number of samples and the number of positive samples, respectively. The weight vector of the original problem is then obtained by optimizing the Lagrange multipliers of the dual problem.
4.3. Description of Cost-Sensitive Support Vector Machine Using Randomized Dual Coordinate Descent Method
The acceleration conditions are checked as described in Section 4.1. One chooses the number of coordinates to optimize, . We use formula (43) to calculate , formulas (39) and (40) to update , , and formula (45) to update , respectively. We can see that each inner iteration takes effort. Computer memory is mainly used to store the sample information , each sample point, and their inner products . The cost-sensitive support vector machine using the randomized dual coordinate descent algorithm is described as follows.
Algorithm 3. Cost-sensitive support vector machine using randomized dual coordinate descent algorithm (CSVM-RDCD). Input: sample information , . Output: . Initialize and the corresponding . For do
Step 1. Randomly choose , and the corresponding , .
Step 6. Until a stopping condition is satisfied. End for.
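The overall loop can be sketched end to end for a linear model. This is an illustrative sketch under stated assumptions, not the authors' LibSVM-based implementation: the function name and parameters are hypothetical, the start point alpha = 0 is an assumption (it satisfies the sum constraint but is not the initial point of Section 4.2), the stopping rule is a fixed iteration budget, the speed-up conditions of Section 4.1 are omitted, and the bias term is not recovered.

```python
import numpy as np


def train_csvm_rdcd(X, y, c_pos, c_neg, n_iter=5000, seed=0):
    """Random-pair dual coordinate descent for a linear cost-sensitive SVM.

    X: (n, d) float array; y: (n,) array of +1.0 / -1.0 labels.
    c_pos, c_neg: upper bounds on the multipliers of each class.
    Returns (alpha, w) with w = sum_i alpha_i * y_i * x_i.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    upper = np.where(y > 0, c_pos, c_neg).astype(float)  # per-example box bound
    alpha = np.zeros(n)   # assumed start: alpha = 0 satisfies sum(alpha * y) = 0
    w = np.zeros(d)       # kept in sync with alpha throughout
    sq_norm = np.einsum("ij,ij->i", X, X)  # x_i . x_i, computed once

    for _ in range(n_iter):
        # Step 1: randomly choose two distinct coordinates.
        i, j = rng.choice(n, size=2, replace=False)
        eta = sq_norm[i] + sq_norm[j] - 2.0 * float(X[i] @ X[j])
        if eta <= 1e-12:
            continue
        # Stricter box for alpha_j implied by the sum constraint.
        if y[i] == y[j]:
            lo = max(0.0, alpha[i] + alpha[j] - upper[i])
            hi = min(upper[j], alpha[i] + alpha[j])
        else:
            lo = max(0.0, alpha[j] - alpha[i])
            hi = min(upper[j], upper[i] + alpha[j] - alpha[i])
        if lo >= hi:
            continue
        # Closed-form solution of the two-variable subproblem, then clipping.
        err_i = float(w @ X[i]) - y[i]
        err_j = float(w @ X[j]) - y[j]
        a_j = min(hi, max(lo, alpha[j] + y[j] * (err_i - err_j) / eta))
        a_i = alpha[i] + y[i] * y[j] * (alpha[j] - a_j)
        # Incremental update of w, then commit the new multipliers.
        w += (a_i - alpha[i]) * y[i] * X[i] + (a_j - alpha[j]) * y[j] * X[j]
        alpha[i], alpha[j] = a_i, a_j
    return alpha, w
```

Every accepted step minimizes the dual exactly along a feasible segment, so the iterates stay feasible and the dual objective (in minimization form) never increases; a practical version would replace the fixed budget with a stopping test on the projected gradient.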
5. Experiments and Analysis
5.1. Experiment on Big Class-Imbalanced Benchmark Datasets Classification Problem
In this section, we analyze the performance of the proposed cost-sensitive support vector machine using randomized dual coordinate descent method (CSVM-RDCD). We compare our implementation with state-of-the-art cost-sensitive support vector machines. Three implementations of cost-sensitive SVMs are compared. We implemented the cost-sensitive SVM using randomized dual coordinate descent method (CSVM-RDCD) by modifying the LibSVM [8, 9] source code. Eitrich and Lang proposed the parallel cost-sensitive support vector machine (PCSVM). The third implementation is the cost-sensitive support vector machine (CSSVM).
Table 1 lists the statistics of the data sets. KDD99, Web Spam, Covertype, MNIST, SIAM1, and SIAM11 are obtained from the following websites. KDD99 is at http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html. Web Spam is at http://www.cc.gatech.edu/projects/doi/WebbSpamCorpus.html. Covertype is at http://archive.ics.uci.edu/ml/datasets/Covertype. MNIST is at http://yann.lecun.com/exdb/mnist/. SIAM is at https://c3.nasa.gov/dashlink/resources/138/.
To evaluate the performance of the CSVM-RDCD method, we use stratified selection to split each dataset into 9/10 for training and 1/10 for testing. We briefly describe each set below. For each dataset we choose the class with the higher cost or fewer data points as the target or positive class. All multiclass datasets were converted to binary data sets. In particular, the binary datasets SIAM1 and SIAM11 have been constructed from the same multiclass dataset but with different target classes and different imbalance ratios.
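A stratified 9/10 and 1/10 split of this kind can be sketched with NumPy alone. The function name, the rounding rule, and the seed are assumptions made for illustration; the paper does not specify how its split was implemented.

```python
import numpy as np


def stratified_split(y, test_fraction=0.1, seed=0):
    """Return (train_idx, test_idx) keeping each class's share in both parts."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    test_mask = np.zeros(y.size, dtype=bool)
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)   # positions of this class
        rng.shuffle(idx)                   # random order within the class
        # At least one test example per class (assumed rule for tiny classes).
        n_test = max(1, int(round(test_fraction * idx.size)))
        test_mask[idx[:n_test]] = True
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)
```

Sampling within each class separately keeps the minority class represented in the test part, which matters when the positive class is a small fraction of the data.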
The three CSVM algorithms are evaluated by the 10-fold cross-validated average misclassification cost, the training time, and the recognition rate of the positive class (i.e., the recognition rate of the minority class).
Average misclassification cost (AMC) represents 10-fold cross-validation average misclassification cost on test datasets for related CSVMs, described as follows: where and represent the number of the positive examples misclassified as the negative examples in the test dataset and the number of the negative examples misclassified as the positive examples in the test data set, respectively. and denote the cost of the positive examples misclassified as the negative examples in the test data set and the cost of the negative examples misclassified as the positive examples in the test data set, respectively. denotes the number of test examples.
The recognition rate of the positive class is the ratio of the number of correctly classified positive examples to the number of positive examples in the test dataset.
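The two quality measures can be sketched as follows. The weighted-sum form of the average misclassification cost is an assumption based on the definitions given above (cost-weighted error counts divided by the number of test examples); the function and parameter names are hypothetical, and labels are assumed to be +1 for the positive class and -1 for the negative class.

```python
import numpy as np


def average_misclassification_cost(y_true, y_pred, cost_fn, cost_fp):
    """(cost_fn * #false negatives + cost_fp * #false positives) / #test examples."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    fn = int(np.sum((y_true == 1) & (y_pred == -1)))   # positives predicted negative
    fp = int(np.sum((y_true == -1) & (y_pred == 1)))   # negatives predicted positive
    return (cost_fn * fn + cost_fp * fp) / y_true.size


def positive_recognition_rate(y_true, y_pred):
    """Correctly classified positives divided by all positives in the test set."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    positives = y_true == 1
    return float(np.sum(y_pred[positives] == 1)) / int(np.sum(positives))
```

For five test examples with one false negative of cost 4 and one false positive of cost 1, the average misclassification cost is (4 + 1) / 5 = 1.0.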
Training time in seconds is used to evaluate the convergence speed of the three CSVM algorithms on the same computer.
The cost-sensitive parameters of the three CSVM algorithms are specified in Table 2. The cost-sensitive parameters and are set according to the class ratio of each dataset, namely, the ratio of the minority class to the majority class.
Three datasets with relative class imbalance are examined, namely, KDD99 (intrusion detection), Web Spam, and MNIST. Three datasets with severe class imbalance are examined, namely, Covertype, SIAM1, and SIAM11. The average misclassification cost of the three CSVM algorithms is compared in Figure 1 for each of the datasets. The CSVM-RDCD algorithm outperforms the PCSVM and CSSVM on all datasets.
The recognition rate of the positive class of the three CSVM algorithms is compared in Figure 2 for each of the datasets. The CSVM-RDCD algorithm outperforms the CSSVM on all datasets, surpasses the PCSVM on four datasets, and ties with the PCSVM on two datasets.
We examine large datasets with relative imbalance ratios and severe imbalance ratios to evaluate the convergence speed of CSVM-RDCD algorithm. The training time comparison of three algorithms using CSVM is shown in Figure 3 for each of the datasets. The CSVM-RDCD algorithm outperforms the PCSVM and CSSVM on all datasets.
5.2. Experiment on Real-World Big Class-Imbalanced Dataset Classification Problems
In order to verify the effectiveness of the proposed algorithm CSVM-RDCD on real-world big class-imbalanced data classification problems, it was evaluated using real vibration data measured on a wind turbine. The experimental data were from the SKF WindCon software and were collected from a wind turbine gearbox of type TF138-A. The vibration signals were continuously acquired by an accelerometer mounted on the outer case of the gearbox.
All parameter settings for the dataset are listed in Table 3. The statistical results of the big class-imbalanced data problems that measure the quality of results (average misclassification cost, recognition rate of positive class, and training time) are listed in Table 4. From Table 4, it can be concluded that CSVM-RDCD is able to consistently achieve superior performance in the big class-imbalanced data classification problems.
Experimental results show that the randomized dual coordinate descent method is applicable to solving the cost-sensitive SVM dual problem on large-scale experimental data sets. The proposed method achieves superior performance in average misclassification cost, recognition rate of the positive class, and training time. The large-scale experiments show that the cost-sensitive support vector machine using the randomized dual coordinate descent method runs more efficiently than both PCSVM and CSSVM; in particular, the randomized dual coordinate descent algorithm has an advantage in training time on large-scale data sets. CSSVM needs to build the whole gradient and the kernel matrix and needs to select a working set at each step of its decomposition algorithm. The decomposition algorithm updates the full gradient information as a whole, with computational complexity for the full gradient update. PCSVM has similar computational complexity. The randomized dual coordinate descent method updates the coordinates linearly, with computational complexity , which considerably increases the convergence speed of the proposed method.
The randomized dual coordinate descent method (RDCD) is an optimization algorithm that updates the global solution by solving each subproblem analytically. The RDCD method converges rapidly, mainly for the following reasons: the subproblem has a closed-form analytical solution, so no complex numerical optimization is needed during solving; each component is updated on the basis of the previously updated components; compared with the CSSVM method, which updates the full gradient information as a whole, the objective function of the RDCD method can decline faster; and the single-coordinate gradient calculation of the RDCD method is simpler and cheaper than the full gradient calculation.
The randomized dual coordinate descent method is applied to the cost-sensitive support vector machine, which expands the scope of application of the randomized dual coordinate descent method. For large-scale class-imbalanced problems, a cost-sensitive SVM using the randomized dual coordinate descent method is proposed. Experimental results and analysis show the effectiveness and feasibility of the proposed method.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
This work is partially supported by the National Science Fund for Distinguished Young Scholars of China (no. 61025015), the National Natural Science Foundation of China (nos. 51305046 and 61304019), the Research Foundation of Education Bureau of Hunan Province, China (no. 12A007), Open Fund of Hunan Province University Key Laboratory of Bridge Engineering (Changsha University of Science and Technology), the Key Laboratory of Renewable Energy Electric-Technology of Hunan Province (Changsha University of Science and Technology), the Key Laboratory of Efficient and Clean Energy Utilization, College of Hunan Province, and the Introduction of Talent Fund of Changsha University of Science and Technology.
M. A. Davenport, “The 2nu-SVM: a cost-sensitive extension of the nu-SVM,” Tech. Rep. TREE 0504, Department of Electrical and Computer Engineering, Rice University, 2005.
C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan, “A dual coordinate descent method for large-scale linear SVM,” in Proceedings of the 25th International Conference on Machine Learning, pp. 408–415, July 2008.
T. Eitrich and B. Lang, “Parallel cost-sensitive support vector machine software for classification,” in Proceedings of the Workshop from Computational Biophysics to Systems Biology, NIC Series, pp. 141–144, John von Neumann Institute for Computing, Jülich, Germany, 2006.