Research Article  Open Access
Cost-Sensitive Support Vector Machine Using Randomized Dual Coordinate Descent Method for Big Class-Imbalanced Data Classification
Abstract
Cost-sensitive support vector machines are among the most popular tools for class-imbalanced problems such as fault diagnosis. However, such data often come with a huge number of examples as well as features. To address class-imbalanced problems on big data, this paper proposes a cost-sensitive support vector machine using a randomized dual coordinate descent method (CSVM-RDCD). The solution of the subproblem at each iteration is derived in closed form, and the computational cost is reduced through an accelerating strategy and cheap computation. The four constrained conditions of CSVM-RDCD are derived. Experimental results show that the proposed method increases the recognition rate of the positive class and reduces the average misclassification cost on real big class-imbalanced data.
1. Introduction
The most popular strategy for the design of classification algorithms is to minimize the probability of error, assuming that all misclassifications have the same cost and that the classes of the dataset are balanced [1–6]. The resulting decision rules are usually called cost insensitive. However, in many important applications of machine learning, such as fault diagnosis [7] and fraud detection, certain types of error are much more costly than others. Other applications involve significantly class-imbalanced datasets, where examples from different classes appear with substantially different probability. The cost-sensitive support vector machine (CSVM) [2] is one of the most popular tools for dealing with class imbalance and unequal misclassification costs. However, in many applications, such data come with a huge number of examples as well as features.
In this work we consider the cost-sensitive support vector machine architecture [1]. Although CSVMs are based on a very solid learning-theoretic foundation and have been successfully applied to many classification problems, it is not well understood how to design big data learning for the CSVM algorithm. CSVM usually maps training vectors into a high-dimensional space via a nonlinear function. Due to the high dimensionality of the weight vector, one solves the dual problem of CSVM by the kernel trick. In some applications, data already lie in a rich, high-dimensional feature space, and the performance is similar with or without the nonlinear mapping. If the data are not mapped, we can often train on much larger datasets.
Recently, many methods have been proposed for linear SVM in large-scale scenarios [4]. Among these, dual coordinate descent methods applied to the dual problem are among the most popular approaches to large-scale convex optimization problems. However, they do not focus on big data learning of CSVM. We focus on big data class-imbalanced learning by CSVM.
This paper is organized as follows. In Section 2 the basic theory of the cost-sensitive support vector machine is described. In Section 3 we derive our proposed algorithm. Section 4 discusses both the speedups and the four constrained conditions of the cost-sensitive support vector machine. Implementation issues are investigated in Section 5. Experiments demonstrate the efficiency of the proposed method.
2. Basic Theory of Cost-Sensitive Support Vector Machine
Cost-sensitive support vector machines such as 2C-SVM were proposed in [1]. Consider examples set , , , , . The 2C-SVM has the following primal optimization problem: The method of Lagrange multipliers can be used to solve this constrained optimization problem. In this method, the Lagrangian is defined as follows: where the , ’s are called the Lagrange multipliers. The partial derivatives of the Lagrangian are set to zero as follows:
And solve . The formula (3) is extended to be
Equation (4) can be reformatted as
By incorporating (5) to (7), Lagrange equation (2) is rewritten as
Lagrange equation (8) is simplified as
Recall that the equation above is obtained by minimizing with respect to and . Putting this together with the constraints and the constraints (5) to (7), the following dual optimization problem of 2C-SVM is obtained: where , are Lagrange multipliers and , , , are misclassification cost parameters. The matrices and are defined, respectively, as follows:
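As an illustration of the kind of dual problem described above, the following minimal sketch evaluates the dual objective and checks feasibility for a linear, class-weighted C-SVM. It assumes the standard form (minimize one half of the squared norm of the weight vector minus the sum of the multipliers, subject to class-dependent box constraints and a zero-sum constraint on the signed multipliers); the exact dual in this paper may differ in notation or scaling. The names `dual_objective`, `is_feasible`, `c_pos`, and `c_neg` are hypothetical.

```python
def dual_objective(X, y, alpha):
    """Assumed linear dual: 0.5 * ||sum_i alpha_i y_i x_i||^2 - sum_i alpha_i."""
    dim = len(X[0])
    w = [0.0] * dim
    for x_i, y_i, a_i in zip(X, y, alpha):
        for k in range(dim):
            w[k] += a_i * y_i * x_i[k]
    return 0.5 * sum(v * v for v in w) - sum(alpha)


def is_feasible(y, alpha, c_pos, c_neg, tol=1e-9):
    """Check 0 <= alpha_i <= C_(y_i) and sum_i y_i alpha_i = 0 (assumed constraints)."""
    in_box = all(
        -tol <= a <= (c_pos if label > 0 else c_neg) + tol
        for a, label in zip(alpha, y)
    )
    balanced = abs(sum(a * label for a, label in zip(alpha, y))) <= tol
    return in_box and balanced
```

The class-dependent upper bounds are what make the formulation cost sensitive: a larger bound for the positive class lets its multipliers grow further, which penalizes errors on that class more heavily.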
3. Cost-Sensitive Support Vector Machine Using Randomized Dual Coordinate Descent Method
In this section, the randomized dual coordinate descent method is used to solve 2C-SVM, which is one version of the cost-sensitive support vector machine. The optimization process starts from an initial point and generates a sequence of vectors , . The process from to is called an outer iteration. Each outer iteration consists of inner iterations, so that are sequentially updated. Each outer iteration thus generates vectors , , such that
For updating to , the following one-variable subproblem is solved: where . This is the general process of the one-variable coordinate descent method. However, the sum constraint of the optimization problem (13) is as follows: where is exactly determined by the other ’s, and if we were to hold those fixed, we could not make any change without violating the constraint (14) of the optimization problem.
Thus, if we want to update some subset of the ’s, we must update at least two of them simultaneously in order to keep satisfying the constraints. Suppose we currently have some setting of the ’s that satisfies the constraint (14), and suppose we have decided to hold fixed and reoptimize the dual problem of CSVM (10) with respect to , (subject to the constraints). Equation (6) is rewritten as
Since the right-hand side is fixed, we can denote it by some constant :
We write the coordinates before and after the update with respect to , : , are the coordinates before and after the update with respect to . Consider the following dual two-variable subproblem of 2C-SVM: where is constant, , and
From (18), we obtain where is the th component of the gradient . By incorporating (21), (18) is rewritten as
From (17), we have . Equation (22) is reformed as follows:
Treating , , as constants, we can verify that this is just a quadratic function in . We can easily maximize this quadratic function by setting its derivative to zero and solving. The following two-variable optimization problem is obtained:
The closed form is derived as follows:
We now consider the box constraints (10) of the two-variable optimization problem. The box constraints , , , fall into four cases according to the labels of the examples at the two selected coordinates.
Firstly, suppose that the labels of the examples are
The sum constraint on the Lagrange multipliers for these labels is
An equivalent expression of the sum constraint on the Lagrange multipliers is
The box constraints on the Lagrange multipliers are defined as
We obtain a new expression of the box constraints on the Lagrange multipliers as follows:
Thus we obtain stricter box constraints on the Lagrange multipliers , according to , as follows:
Secondly, suppose that the labels of the examples are
Similarly, stricter box constraints on the Lagrange multipliers , are obtained as follows:
Thirdly, suppose that the labels of the examples are
Similarly, stricter box constraints on the Lagrange multipliers are obtained as follows:
Finally, suppose that the labels of the examples are
Similarly, stricter box constraints on the Lagrange multipliers are obtained as follows:
For simplicity, is defined as the temporary solution, which is then clipped to satisfy
From the linear constraint with respect to in (17), the value of is obtained as
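The two-variable update of this section (closed-form step, tightened box constraints that depend on the two labels, then recovery of the second coordinate from the linear constraint) can be sketched as follows. This is a minimal illustration that assumes the standard class-weighted dual with gradient components `g_i`, `g_j`, inner products `k_ii`, `k_jj`, `k_ij`, and upper bounds `c_i`, `c_j`; all of these names are hypothetical. The four label cases collapse into two branches through the sign `s = y_i * y_j`, with the class-dependent bounds carried in `c_i` and `c_j`.

```python
def update_pair(alpha_i, alpha_j, y_i, y_j, g_i, g_j,
                k_ii, k_jj, k_ij, c_i, c_j):
    """Return updated (alpha_i, alpha_j) for one two-variable subproblem."""
    s = y_i * y_j                       # +1 for equal labels, -1 otherwise
    gamma = alpha_i + s * alpha_j       # preserved by the sum constraint
    eta = k_ii + k_jj - 2.0 * k_ij      # curvature along the feasible line
    if eta <= 1e-12:
        return alpha_i, alpha_j         # degenerate direction, leave unchanged
    # Tightened box for alpha_j so that alpha_i = gamma - s * alpha_j
    # also stays inside [0, c_i].
    if s > 0:
        low, high = max(0.0, gamma - c_i), min(c_j, gamma)
    else:
        low, high = max(0.0, -gamma), min(c_j, c_i - gamma)
    # Unconstrained minimizer (the temporary solution), then clipping.
    new_j = alpha_j + (s * g_i - g_j) / eta
    new_j = min(max(new_j, low), high)
    new_i = gamma - s * new_j           # from the linear constraint
    return new_i, new_j
```

For example, with two opposite-label points at +1 and -1 on a line, both multipliers at zero, and both gradient components equal to -1, the unclipped step moves both multipliers to 0.5; lowering `c_i` to 0.3 clips both to 0.3.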
4. The Modified Proposed Method
4.1. Two Speeding-Up Strategies
To avoid repeated and ineffective computation, the selected coordinates are not updated when either of the two conditions below holds. If one of the two conditions is met, the algorithm skips the iteration, which can significantly reduce the amount of computation and accelerate convergence.
Condition 1. If or , then the selected coordinates are not updated.
Due to the box constraints of (10), many coordinates lie on the boundary ( or ) during the computation. If the two coordinates ( and ) both take the value 0 or both take the value C in an iteration, the analytical solution of the two-variable subproblem does not need to be calculated. The reason is that formula (17) guarantees or , and after the doubly restricted box constraints are applied the result is 0 or . The coordinates would ultimately be clipped to 0 or , so they are not updated.
Condition 2. If the projected gradient is 0, the selected coordinates are not updated.
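One possible reading of the two skip conditions is sketched below. It is an interpretation, not the paper's exact test: Condition 1 is implemented as "the feasible interval for the pair is degenerate" (which covers equal-label pairs sitting at the same bound), and Condition 2 uses a projected directional derivative along the feasible line, defined here by analogy with the projected gradient of one-variable dual coordinate descent. The function name `should_skip` and its parameters are hypothetical.

```python
def should_skip(alpha_i, alpha_j, y_i, y_j, g_i, g_j, c_i, c_j, tol=1e-12):
    """Return True when updating the pair (i, j) cannot change the solution."""
    s = y_i * y_j
    gamma = alpha_i + s * alpha_j
    if s > 0:
        low, high = max(0.0, gamma - c_i), min(c_j, gamma)
    else:
        low, high = max(0.0, -gamma), min(c_j, c_i - gamma)
    # Condition 1 (as interpreted here): no room to move inside the box.
    if high - low <= tol:
        return True
    # Condition 2 (as interpreted here): projected derivative w.r.t. alpha_j.
    d = g_j - s * g_i
    if alpha_j <= low + tol:
        projected = min(d, 0.0)   # only a decrease of the objective matters
    elif alpha_j >= high - tol:
        projected = max(d, 0.0)
    else:
        projected = d
    return abs(projected) <= tol
```

Both checks only need quantities that the update itself would compute, so skipping costs little and avoids touching the weight vector when nothing would change.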
4.2. Dual Problem of Cost-Sensitive Support Vector Machine Using Randomized Dual Coordinate Descent Method and Its Complexity Analysis
From the derivation above, the dual of CSVM can be solved, but the computational cost of the solving process is high. Assume that the average number of nonzero features per sample is . Firstly, the computational complexity of the inner product matrix , is , but it can be computed in advance and stored in memory. Secondly, the calculation of takes computational complexity . The amount of computation is very large when the data size is large. However, the linear CSVM model satisfies the following relationship:
Thus, is further simplified as follows:
Solving the corresponding formula (23) can be simplified as follows:
As can be seen, a computation with complexity becomes a computation with complexity , so the number of operations is reduced by a factor of . However, recomputing after each update still has computational complexity . The amount of computation can be reduced significantly when is updated incrementally from the change in . Let , be the current values of the selected coordinates and , the updated values; then can be updated in a simple way:
Its computational complexity is only . So, whether calculating or updating , the complexity of the coordinate gradient computation is , which is one of the reasons for the rapid convergence of the coordinate gradient method.
When is assigned an initial value, the initial point is set based on the constraints, where , are the total number of samples and the number of positive samples, respectively. The weight vector of the primal problem is then obtained by optimizing the Lagrange multipliers of the dual problem.
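The cheap gradient computation and the incremental update of the weight vector described here can be sketched as follows. This is a minimal illustration assuming a linear model with weight vector `w` equal to the sum of `alpha_i * y_i * x_i`, a dual gradient component of the form `y_i * (w . x_i) - 1`, and sparse samples stored as `{feature_index: value}` dictionaries; the function names are hypothetical.

```python
def sparse_dot(w, x):
    """Inner product of a dense weight list with a sparse sample dict."""
    return sum(w[k] * v for k, v in x.items())


def coordinate_gradient(w, x, y):
    """One gradient component, touching only the nonzero features of x."""
    return y * sparse_dot(w, x) - 1.0


def apply_pair_update(w, x_i, y_i, delta_i, x_j, y_j, delta_j):
    """Update w in place after alpha_i and alpha_j change by delta_i, delta_j."""
    for k, v in x_i.items():
        w[k] += delta_i * y_i * v
    for k, v in x_j.items():
        w[k] += delta_j * y_j * v
```

Both functions loop only over the nonzero entries of the selected samples, so their cost does not grow with the number of training examples.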
4.3. Description of Cost-Sensitive Support Vector Machine Using Randomized Dual Coordinate Descent Method
The accelerating conditions are checked as described in Section 4.1. One chooses the number of coordinates to optimize . We use formula (43) to calculate , formulas (39) and (40) to update , , and formula (45) to update , respectively. We can see that each inner iteration takes effort. Computer memory is mainly used to store the sample information and each sample point and their inner products . The cost-sensitive support vector machine using the randomized dual coordinate descent algorithm is described as follows.
Algorithm 3. Cost-sensitive support vector machine using randomized dual coordinate descent algorithm (CSVM-RDCD). Input: sample information , . Output: . Initialize and the corresponding . For do
Step 1. Randomly choose , and the corresponding , .
Step 2.
Step 3.
Step 4.
Step 5.
Step 6. Repeat until a stopping condition is satisfied. End for.
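Since the step bodies of Algorithm 3 are not available here, the following self-contained sketch shows one way the overall loop could look for dense inputs: pick a random pair, skip it if it cannot move, solve the two-variable subproblem in closed form, clip, and update the weight vector incrementally. It assumes the standard class-weighted linear dual, starts from the all-zero point (which is feasible), and uses a fixed iteration budget in place of the paper's stopping condition. The function name and parameters are hypothetical, and recovery of the bias term is omitted.

```python
import random


def train_csvm_rdcd(X, y, c_pos, c_neg, epochs=200, seed=0, tol=1e-12):
    """Randomized pairwise dual coordinate descent for a linear weighted SVM."""
    rng = random.Random(seed)
    n, dim = len(X), len(X[0])
    cost = [c_pos if label > 0 else c_neg for label in y]
    sq_norm = [sum(v * v for v in x) for x in X]
    alpha = [0.0] * n          # feasible start: box and sum constraints hold
    w = [0.0] * dim            # w = sum_i alpha_i * y_i * x_i
    for _ in range(epochs * n):
        i, j = rng.sample(range(n), 2)
        s = y[i] * y[j]
        k_ij = sum(a * b for a, b in zip(X[i], X[j]))
        eta = sq_norm[i] + sq_norm[j] - 2.0 * k_ij
        if eta <= tol:
            continue
        gamma = alpha[i] + s * alpha[j]
        if s > 0:
            low, high = max(0.0, gamma - cost[i]), min(cost[j], gamma)
        else:
            low, high = max(0.0, -gamma), min(cost[j], cost[i] - gamma)
        if high - low <= tol:
            continue           # pair is stuck at the bounds, skip it
        g_i = y[i] * sum(a * b for a, b in zip(w, X[i])) - 1.0
        g_j = y[j] * sum(a * b for a, b in zip(w, X[j])) - 1.0
        new_j = min(max(alpha[j] + (s * g_i - g_j) / eta, low), high)
        d_j = new_j - alpha[j]
        if abs(d_j) <= tol:
            continue           # projected step is zero, skip it
        d_i = -s * d_j         # keeps sum_i y_i * alpha_i unchanged
        alpha[i] += d_i
        alpha[j] = new_j
        for k in range(dim):   # incremental update of w
            w[k] += d_i * y[i] * X[i][k] + d_j * y[j] * X[j][k]
    return alpha, w
```

A usage example on a tiny one-dimensional problem: `train_csvm_rdcd([[2.0], [1.5], [-1.0], [-2.0]], [1, 1, -1, -1], c_pos=10.0, c_neg=1.0)` returns multipliers that satisfy the constraints and a weight close to the maximum-margin value for these four points.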
5. Experiments and Analysis
5.1. Experiment on Big Class-Imbalanced Benchmark Datasets Classification Problem
In this section, we analyze the performance of the proposed cost-sensitive support vector machine using randomized dual coordinate descent method (CSVM-RDCD). We compare our implementation with state-of-the-art cost-sensitive support vector machines. Three implementations of related cost-sensitive SVMs are compared. We implemented the cost-sensitive SVM using randomized dual coordinate descent method (CSVM-RDCD) by modifying the LibSVM [8, 9] source code. Eitrich and Lang [10] proposed the parallel cost-sensitive support vector machine (PCSVM). Masnadi-Shirazi et al. [6] proposed cost-sensitive support vector machines (CSSVM).
Table 1 lists the statistics of the datasets. KDD99, Web Spam, Covertype, MNIST, SIAM1, and SIAM11 were obtained from the following websites. KDD99 is at http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html. Web Spam is at http://www.cc.gatech.edu/projects/doi/WebbSpamCorpus.html. Covertype is at http://archive.ics.uci.edu/ml/datasets/Covertype. MNIST is at http://yann.lecun.com/exdb/mnist/. SIAM is at https://c3.nasa.gov/dashlink/resources/138/.

To evaluate the performance of the CSVM-RDCD method, we use stratified selection to split each dataset into 9/10 for training and 1/10 for testing. We briefly describe each set below. For each dataset we choose the class with the higher cost or fewer data points as the target or positive class. All multiclass datasets were converted to binary datasets. In particular, the binary datasets SIAM1 and SIAM11 were constructed from the same multiclass dataset but with different target classes and different imbalance ratios.
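A stratified 9/10 versus 1/10 split of the kind described above can be sketched as follows. This is a minimal illustration using only the standard library; the function name `stratified_split` and its parameters are hypothetical, and the authors' actual splitting code is not given in the paper.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.1, seed=0):
    """Return (train_indices, test_indices) preserving the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one test example per class, even for tiny classes.
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)
```

Splitting each class separately matters for imbalanced data: a purely random split can leave the test part with very few, or even no, minority examples.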
The three algorithms using CSVM were evaluated by 10-fold cross-validation on three criteria: the average misclassification cost, the training time, and the recognition rate of the positive class (i.e., the recognition rate of the minority class).
Average misclassification cost (AMC) is the 10-fold cross-validation average misclassification cost on the test datasets for the related CSVMs, described as follows: $\mathrm{AMC} = (N_{FN} C_{FN} + N_{FP} C_{FP}) / N$, where $N_{FN}$ and $N_{FP}$ represent the number of positive examples misclassified as negative in the test dataset and the number of negative examples misclassified as positive in the test dataset, respectively. $C_{FN}$ and $C_{FP}$ denote the cost of a positive example misclassified as negative and the cost of a negative example misclassified as positive, respectively. $N$ denotes the number of test examples.
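The definition of AMC above translates directly into code. The sketch below assumes labels and predictions are encoded as positive numbers for the positive class and non-positive numbers for the negative class; the function and parameter names are hypothetical.

```python
def average_misclassification_cost(y_true, y_pred, cost_fn, cost_fp):
    """(false negatives * cost_fn + false positives * cost_fp) / test size."""
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t > 0 and p <= 0)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t <= 0 and p > 0)
    return (false_neg * cost_fn + false_pos * cost_fp) / len(y_true)
```

With five test examples, one false negative at cost 5, and one false positive at cost 1, the function returns 6 / 5 = 1.2.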
The recognition rate of the positive class is the ratio of the number of correctly classified positive examples to the number of positive examples in the test dataset.
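The recognition rate of the positive class is what is usually called recall or sensitivity. A minimal sketch, using the same label encoding assumption as before (positive numbers for the positive class) and a hypothetical function name:

```python
def positive_recognition_rate(y_true, y_pred):
    """Fraction of positive test examples that are predicted positive."""
    positives = sum(1 for t in y_true if t > 0)
    if positives == 0:
        return 0.0  # convention chosen for this sketch when no positives exist
    hits = sum(1 for t, p in zip(y_true, y_pred) if t > 0 and p > 0)
    return hits / positives
```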
Training time in seconds is used to evaluate the convergence speed of the three algorithms using CSVM on the same computer.
The cost-sensitive parameters of the three algorithms using CSVM are listed in Table 2. The cost-sensitive parameters and are set according to the class ratio of each dataset, namely, the ratio of the minority class to the majority class.
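One common convention for setting costs from the class ratio is sketched below: the negative (majority) class keeps a base cost and the positive (minority) class cost is scaled by the ratio of negative to positive counts. This is an assumption for illustration only; the values actually used in Table 2 are not reproduced here and may follow a different rule. The function name and `base_cost` parameter are hypothetical.

```python
def class_ratio_costs(labels, base_cost=1.0):
    """Return (cost_positive, cost_negative) from the class counts."""
    n_pos = sum(1 for label in labels if label > 0)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    # Assumed rule: errors on the rarer positive class cost proportionally more.
    return base_cost * n_neg / n_pos, base_cost
```

For a dataset with 10 positive and 90 negative examples, this rule gives a positive-class cost of 9 and a negative-class cost of 1.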

Three datasets with relative class imbalance are examined, namely, KDD99 (intrusion detection), Web Spam, and MNIST. Three datasets with severe class imbalance are examined, namely, Covertype, SIAM1, and SIAM11. The average misclassification costs of the three algorithms using CSVM are compared in Figure 1 for each of the datasets. The CSVM-RDCD algorithm outperforms PCSVM and CSSVM on all datasets.
The recognition rates of the positive class for the three algorithms using CSVM are compared in Figure 2 for each of the datasets. The CSVM-RDCD algorithm outperforms CSSVM on all datasets, surpasses PCSVM on four datasets, and ties with PCSVM on two datasets.
We examine large datasets with relative imbalance ratios and severe imbalance ratios to evaluate the convergence speed of the CSVM-RDCD algorithm. The training times of the three algorithms using CSVM are compared in Figure 3 for each of the datasets. The CSVM-RDCD algorithm outperforms PCSVM and CSSVM on all datasets.
5.2. Experiment on Real-World Big Class-Imbalanced Dataset Classification Problems
In order to verify the effectiveness of the proposed CSVM-RDCD algorithm on real-world big class-imbalanced data classification problems, it was evaluated using real vibration data measured on a wind turbine. The experimental data came from the SKF WindCon software and were collected from a wind turbine gearbox of type TF138A [11]. The vibration signals were continuously acquired by an accelerometer mounted on the outer case of the gearbox.
All parameter settings for the dataset are listed in Table 3. The statistical results for the big class-imbalanced data problems that measure the quality of the results (average misclassification cost, recognition rate of the positive class, and training time) are listed in Table 4. From Table 4, it can be concluded that CSVM-RDCD consistently achieves superior performance on the big class-imbalanced data classification problems.


Experimental results show that the randomized dual coordinate descent method is applicable to the cost-sensitive SVM dual problem on large-scale experimental datasets. The proposed method achieves superior performance in average misclassification cost, recognition rate of the positive class, and training time. The large-scale experiments show that the cost-sensitive support vector machine using the randomized dual coordinate descent method runs more efficiently than both PCSVM and CSSVM; in particular, the randomized dual coordinate descent algorithm has an advantage in training time on large-scale datasets. CSSVM needs to build the whole gradient and the kernel matrix and needs to select a working set during the solving process of its decomposition algorithm. The decomposition algorithm updates the full gradient information as a whole, with computational complexity for the full gradient update. PCSVM has similar computational complexity. The randomized dual coordinate descent method updates the coordinates at linear cost, with computational complexity , which considerably increases the convergence speed of the proposed method.
6. Conclusions
The randomized dual coordinate descent method (RDCD) is an optimization algorithm that updates the global solution by solving each subproblem analytically. The RDCD method converges rapidly, mainly for the following reasons: (1) the subproblem has a closed-form analytical solution, so no complex numerical optimization is needed during the solving process; (2) each component is updated on the basis of the previously updated component; (3) compared with the CSSVM method, which updates the full gradient information as a whole, the objective function of the RDCD method can decrease faster; (4) computing a single coordinate gradient is simpler and cheaper than computing the full gradient.
The randomized dual coordinate descent method is applied to the cost-sensitive support vector machine, which expands the scope of application of the method. For large-scale class-imbalanced problems, a cost-sensitive SVM using the randomized dual coordinate descent method is proposed. Experimental results and analysis show the effectiveness and feasibility of the proposed method.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work is partially supported by the National Science Fund for Distinguished Young Scholars of China (no. 61025015), the National Natural Science Foundation of China (nos. 51305046 and 61304019), the Research Foundation of Education Bureau of Hunan Province, China (no. 12A007), Open Fund of Hunan Province University Key Laboratory of Bridge Engineering (Changsha University of Science and Technology), the Key Laboratory of Renewable Energy ElectricTechnology of Hunan Province (Changsha University of Science and Technology), the Key Laboratory of Efficient and Clean Energy Utilization, College of Hunan Province, and the Introduction of Talent Fund of Changsha University of Science and Technology.
References
[1] M. A. Davenport, “The 2nu-SVM: a cost-sensitive extension of the nu-SVM,” Tech. Rep. TREE 0504, Department of Electrical and Computer Engineering, Rice University, 2005.
[2] M. Kim, “Large margin cost-sensitive learning of conditional random fields,” Pattern Recognition, vol. 43, no. 10, pp. 3683–3692, 2010.
[3] Y.-J. Park, S.-H. Chun, and B.-C. Kim, “Cost-sensitive case-based reasoning using a genetic algorithm: application to medical diagnosis,” Artificial Intelligence in Medicine, vol. 51, no. 2, pp. 133–145, 2011.
[4] J. Kim, K. Choi, G. Kim, and Y. Suh, “Classification cost: an empirical comparison among traditional classifier, Cost-Sensitive Classifier, and MetaCost,” Expert Systems with Applications, vol. 39, no. 4, pp. 4013–4019, 2012.
[5] C.-Y. Yang, J.-S. Yang, and J.-J. Wang, “Margin calibration in SVM class-imbalanced learning,” Neurocomputing, vol. 73, no. 1–3, pp. 397–411, 2009.
[6] H. Masnadi-Shirazi, N. Vasconcelos, and A. Iranmehr, “Cost-sensitive support vector machines,” http://arxiv.org/abs/1212.0975.
[7] Y. Artan, M. A. Haider, D. L. Langer et al., “Prostate cancer localization with multispectral MRI using cost-sensitive support vector machines and conditional random fields,” IEEE Transactions on Image Processing, vol. 19, no. 9, pp. 2444–2455, 2010.
[8] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan, “A dual coordinate descent method for large-scale linear SVM,” in Proceedings of the 25th International Conference on Machine Learning, pp. 408–415, July 2008.
[9] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “LIBLINEAR: a library for large linear classification,” The Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.
[10] T. Eitrich and B. Lang, “Parallel cost-sensitive support vector machine software for classification,” in Proceedings of the Workshop from Computational Biophysics to Systems Biology, NIC Series, pp. 141–144, John von Neumann Institute for Computing, Jülich, Germany, 2006.
[11] B. Tang, W. Liu, and T. Song, “Wind turbine fault diagnosis based on Morlet wavelet transformation and Wigner-Ville distribution,” Renewable Energy, vol. 35, no. 12, pp. 2862–2866, 2010.
Copyright
Copyright © 2014 Mingzhu Tang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.