Mathematical Problems in Engineering

Volume 2015 (2015), Article ID 294985, 7 pages

http://dx.doi.org/10.1155/2015/294985

## Support Vector Machines for Unbalanced Multicategory Classification

Department of Statistics and Computer Science, Kunsan National University, Gunsan 573-701, Republic of Korea

Received 16 December 2014; Revised 6 February 2015; Accepted 7 February 2015

Academic Editor: Yaguo Lei

Copyright © 2015 Kang-Mo Jung. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Classification is an important research topic with various applications, because data are easily obtained these days. Among the many classification techniques, the support vector machine (SVM) is widely applied to bioinformatics and genetic analysis, because it has a sound theoretical foundation and its performance is superior to that of many other methods. The SVM can be written as a combination of the hinge loss function and a penalty function. The smoothly clipped absolute deviation (SCAD) penalty function satisfies desirable statistical properties. Since standard SVM techniques typically treat all classes equally, they are not well suited to data with unbalanced class proportions. We propose a robust method for unbalanced cases based on class weights. Simulation and a numerical example show that the proposed method is effective for analyzing data with unbalanced proportions.

#### 1. Introduction

Classification is a very important research topic and arises in many applications such as health science and bioinformatics. Many classification methods have been proposed in the literature, for example, linear discriminant analysis, logistic regression, the k-nearest neighbors, and the support vector machine (SVM), as in [1]. Among them the SVM is popular in engineering, because it has a sound theoretical foundation. It has been widely applied to bioinformatics and genetic analysis, and its performance is superior to that of many other methods.

Nowadays we can easily obtain high-dimensional data thanks to computer technology. Since the number of predictors can be very large, variable selection is crucial for obtaining a meaningful model. Indeed, a model with one thousand nonzero predictors is not interpretable and gives little information about the data. Furthermore, if the true model is sparse, the fitted model should be sparse. Variable selection is an important research topic in linear regression modeling, especially when there are a tremendous number of predictors (see [2]). Thus methods for simultaneous variable selection and estimation have been suggested; they are called penalized methods. The SVM can be considered a penalized method consisting of the hinge loss function and a penalty function.

Since the ridge regression estimator, many penalty functions have been proposed. Among them the least absolute shrinkage and selection operator (LASSO) proposed by [3], with the L1 penalty function, is very popular because of its sparsity property. However, the LASSO estimate can be biased for coefficients with large absolute values, as pointed out by [4]. They proposed a nonconvex penalty function, the smoothly clipped absolute deviation (SCAD) penalty, satisfying desirable properties: unbiasedness, sparseness, and continuity. Since the SCAD function is not convex, standard optimization algorithms cannot be applied directly. Thus the local quadratic approximation (LQA) algorithm can be adopted to find the optimum of the objective function, as in [5].

The traditional SVM treats the observations of every class equally. When the class sizes are unequal, especially when one class is much smaller than the others, the SVM tends to ignore the classes with the smallest numbers of observations, because ignoring the minority classes can decrease the overall misclassification rate. However, the minority classes often carry important characteristics of the data. For example, in hospital patient data the class of cancer patients is much smaller than the other classes, yet the cancer class holds significant information on the death rate and should not be ignored.

In this paper we develop SVM algorithms for unbalanced cases based on weighting the loss function and using the SCAD penalty function. Since the weighting increases the impact of the minority classes, the resulting classifier does not ignore them, whereas the traditional SVM does. We use the local linear approximation (LLA) of the SCAD function to avoid the deficiencies of the LQA algorithm.

This paper is organized as follows. In Section 2 we review the SVM for unbalanced cases and its statistical properties. We consider two classifiers, based on the overall misclassification rate and on the sum of within-group error rates; the latter does not ignore the minority classes and is more applicable to unbalanced cases. The L1 and SCAD penalty functions are briefly reviewed. Section 3 gives an algorithm to implement the proposed method. Reference [6] proposed an LQA algorithm for the SVM with the SCAD function; we instead use an LLA algorithm, which can be written as a linear programming problem, to minimize the nondifferentiable and nonconvex objective function. Section 4 provides the simulation results and a numerical example, which show that the proposed method is superior to the traditional SVM in its treatment of the minority classes. Section 5 gives some discussion and concluding remarks.

#### 2. SVM for Unbalanced Cases

For a multiclass classification problem, a training sample $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$, is given, where the input vector $\mathbf{x}_i \in \mathbb{R}^p$, the output label $y_i \in \{1, \ldots, K\}$, $n$ is the number of observations, and $p$ is the dimension of the input vector. Suppose that the samples are drawn from an unknown joint probability distribution $P(\mathbf{x}, y)$.

The classifier $\phi$ trained on the training sample can predict the class of a future input vector $\mathbf{x}$. The standard classification criterion assigns the same misclassification cost to all classes. The loss function in this case is the 0-1 loss function $I(y \neq \phi(\mathbf{x}))$, and the risk function corresponding to the 0-1 loss can be written in its empirical version as $\frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \phi(\mathbf{x}_i))$. Here $I(\cdot)$ denotes the indicator function. However, when there are minority classes among the output labels, a classifier based on the overall misclassification rate conveys no information on the minority classes, which often carry very important characteristics of the data. For example, cancer patients form a minority class among the general patients visiting a hospital, yet they may deserve the hospital's special attention. Unbalanced proportion samples are often found in real-world data.

The classical criterion finds a decision rule minimizing the overall misclassification rate
$$\sum_{j=1}^{K} \pi_j P\left(\phi(\mathbf{X}) \neq j \mid Y = j\right), \quad (1)$$
where $\mathbf{X}$ is a $p$-dimensional random vector from the probability distribution $P$ and $\pi_j$ is the proportion of class $j$ in the population. If $\pi_j$ is very small, the overall misclassification rate (1) can be very small, because the misclassification rate for the $j$th class barely influences the quantity. Even if the misclassification rate for the $j$th class is very high, it can be ignored. Thus if the proportion term $\pi_j$ in (1) were deleted, a minority class with very small $\pi_j$ could still influence the classifier. We therefore consider a classifier which minimizes
$$\sum_{j=1}^{K} P\left(\phi(\mathbf{X}) \neq j \mid Y = j\right) \quad (2)$$
by discarding the term $\pi_j$ in (1). Denote $e_j = P(\phi(\mathbf{X}) \neq j \mid Y = j)$, $j = 1, \ldots, K$, as the within-group error for the $j$th class. The empirical version of term (2) can be calculated as the sum over classes of the ratio of the number of misclassifications in class $j$ to the number of observations in class $j$, that is, the sum of the within-group error rates (see [7]). Criterion (2) is called the within-group error rate criterion.
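The two empirical criteria can be computed directly; the following minimal Python sketch (function and variable names are our own, not from the paper) contrasts the overall misclassification rate with the sum of within-group error rates:

```python
import numpy as np

def error_rates(y_true, y_pred, classes):
    """Overall misclassification rate vs. sum of within-group error rates."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    overall = np.mean(y_true != y_pred)
    # within-group error: misclassification rate computed inside each class
    within = [np.mean(y_pred[y_true == c] != c) for c in classes]
    return overall, sum(within)
```

On unbalanced data, predicting everything as the majority class keeps the overall rate low while the within-group criterion exposes the completely misclassified minority class.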

Research on this topic has focused on methods at the data and algorithmic levels, which can be categorized as resampling methods that balance the dataset and modifications of existing learning algorithms. Undersampling and oversampling are resampling techniques. Reference [7] proposed an adaptive weighted learning method using the SVM.
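As an illustration of the data-level (resampling) remedy mentioned above, a naive random oversampler might look as follows. This is a generic sketch of the technique, not the method of [7] or of this paper, and the function name is our own:

```python
import numpy as np

def oversample_minority(X, y, rng=None):
    """Naive random oversampling: replicate minority-class rows until
    every class matches the majority-class count (a data-level remedy)."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c, n_c in zip(classes, counts):
        rows = np.where(y == c)[0]
        extra = rng.choice(rows, size=n_max - n_c, replace=True)
        idx.extend(rows)
        idx.extend(extra)
    idx = np.array(idx)
    return X[idx], y[idx]
```

The paper instead pursues the algorithm-level route, reweighting the loss so that no data are duplicated or discarded.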

The SVM is a classification method based on a large margin. In the separable case it finds the hyperplane maximizing the separation distance between the two classes. In the nonseparable case the SVM uses slack variables, the so-called soft margin method, to allow mislabeled samples while maximizing the margin. There are several extensions from the binary SVM to the multiclass SVM, such as one-versus-one and one-versus-rest (see [8]). However, these methods perform poorly when the data are dominated by only one class. Reference [9] proposed a simultaneous multiclass SVM. Reference [5] suggested a simultaneous SVM algorithm with the SCAD penalty function.

The multiclass SVM minimizes the objective function (see [9])
$$\frac{1}{n}\sum_{i=1}^{n}\sum_{j \neq y_i}\left[f_j(\mathbf{x}_i) + 1\right]_+ + \lambda\sum_{j=1}^{K}\|\boldsymbol{\beta}_j\|^2, \quad (3)$$
where $[u]_+ = \max(u, 0)$, the function $f_j$ is the $j$th element of the $K$-dimensional decision function $\mathbf{f} = (f_1, \ldots, f_K)$ with the sum-to-zero constraint $\sum_{j=1}^{K} f_j = 0$, and the parameter $\lambda$ controls the trade-off between the training error and the model complexity. Equation (3) is limited to the linear SVM, where $f_j(\mathbf{x}) = \mathbf{x}^T\boldsymbol{\beta}_j + \beta_{0j}$. The SVM objective function consists of the loss function and the penalty function, which has the same form as the objective function in penalized linear regression (see [10]).
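A sketch of evaluating this multiclass hinge loss, assuming the sum-to-zero form usually attributed to [9]; the function name is our own:

```python
import numpy as np

def multiclass_hinge(F, y):
    """Sum-to-zero multiclass hinge loss: for each sample, sum [f_j + 1]_+
    over the wrong classes j != y_i.  F is an (n, K) array of decision
    values whose rows sum to zero; y holds integer labels in 0..K-1."""
    n, K = F.shape
    mask = np.ones_like(F, dtype=bool)
    mask[np.arange(n), y] = False          # drop the true-class column
    wrong = F[mask].reshape(n, K - 1)      # decision values of wrong classes
    return np.maximum(wrong + 1.0, 0.0).sum() / n
```

A confident correct prediction (true-class value large, others below $-1$) contributes zero loss, while the all-zero decision function pays $K-1$ per sample.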

For a sparse model, the SVM as in [11] provides information on valuable variables by discarding redundant noise input variables, like the LASSO of [3], which is popular in penalized linear regression models. Reference [12] proposed a multiclass SVM which performs classification and variable selection simultaneously through an L1-norm penalized sparse representation. However, the solution is biased for large coefficients. Reference [4] proposed a nonconvex penalty function, the SCAD penalty function, together with desirable properties for penalty functions: unbiasedness of the estimator, sparseness of the model, and continuity of the estimator in the tuning parameter $\lambda$. Unfortunately, the L2 penalty satisfies neither the unbiasedness nor the sparseness, and the L1 penalty does not satisfy the unbiasedness.

The SCAD function can be written as
$$p_\lambda(|\beta|) = \begin{cases} \lambda|\beta|, & |\beta| \le \lambda, \\ -\dfrac{|\beta|^2 - 2a\lambda|\beta| + \lambda^2}{2(a-1)}, & \lambda < |\beta| \le a\lambda, \\ \dfrac{(a+1)\lambda^2}{2}, & |\beta| > a\lambda, \end{cases} \quad (4)$$
where $a > 2$ and $\lambda > 0$. Reference [4] recommended the parameter $a = 3.7$ from simulation results. Since the derivative of the SCAD function is zero outside the range $[-a\lambda, a\lambda]$, the SCAD SVM estimates of large coefficients are unbiased. Because the SCAD function is singular at zero, the SCAD SVM provides a sparse model. Like the standard SVM (3), the SCAD SVM minimizes the objective function
$$\frac{1}{n}\sum_{i=1}^{n}\sum_{j \neq y_i}\left[f_j(\mathbf{x}_i) + 1\right]_+ + \sum_{j=1}^{K}\sum_{k=1}^{p} p_\lambda(|\beta_{jk}|). \quad (5)$$
In the linear SVM case, $f_j(\mathbf{x}) = \mathbf{x}^T\boldsymbol{\beta}_j + \beta_{0j}$ gives the objective function of the SCAD SVM:
$$\frac{1}{n}\sum_{i=1}^{n}\sum_{j \neq y_i}\left[\mathbf{x}_i^T\boldsymbol{\beta}_j + \beta_{0j} + 1\right]_+ + \sum_{j=1}^{K}\sum_{k=1}^{p} p_\lambda(|\beta_{jk}|). \quad (6)$$
The classical classification rule (1) naturally becomes $\hat{y} = \arg\max_j \hat{f}_j(\mathbf{x})$, and the classifier based on the within-group error rate (2) takes the same form with the decision functions fitted under criterion (2).
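The SCAD penalty and its derivative are straightforward to code; the sketch below follows the standard Fan-Li piecewise form with the recommended $a = 3.7$ (function names are our own):

```python
import numpy as np

A = 3.7  # value recommended by Fan and Li from simulation

def scad(t, lam, a=A):
    """SCAD penalty p_lambda(|t|) in the standard piecewise form."""
    t = np.abs(t)
    return np.where(
        t <= lam, lam * t,                                   # L1-like zone
        np.where(t <= a * lam,
                 -(t**2 - 2 * a * lam * t + lam**2) / (2 * (a - 1)),
                 (a + 1) * lam**2 / 2))                      # constant zone

def scad_deriv(t, lam, a=A):
    """First derivative p'_lambda(t) for t >= 0; zero beyond a*lambda,
    which is the source of the unbiasedness for large coefficients."""
    t = np.abs(t)
    return np.where(t <= lam, lam,
                    np.maximum(a * lam - t, 0.0) / (a - 1))
```

The penalty is continuous at both knots ($t = \lambda$ and $t = a\lambda$), and the vanishing derivative past $a\lambda$ means large coefficients incur no extra shrinkage.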

If one class dominates the population, then from the viewpoint of the misclassification rate every input vector can be classified into the dominating class. However, this does not yield a meaningful classifier. For unbalanced proportion data, (6) cannot detect the minority class. The equal hinge loss in (6) can ignore the minority class, so an unequal hinge loss with class-dependent weights can handle the unbalanced case as in [7]. Thus, instead of the unweighted SVM objective function (6), we consider a weighted SVM objective function with the SCAD penalty function for unbalanced cases:
$$\frac{1}{n}\sum_{i=1}^{n} w_{y_i}\sum_{j \neq y_i}\left[\mathbf{x}_i^T\boldsymbol{\beta}_j + \beta_{0j} + 1\right]_+ + \sum_{j=1}^{K}\sum_{k=1}^{p} p_\lambda(|\beta_{jk}|), \quad (10)$$
where $w_{y_i}$ is the weight for the $i$th observation belonging to the $y_i$th class. Weighted SVMs have also been proposed to make the SVM robust, that is, insensitive to outliers or leverage points (see [13]).

We consider a weight for each class based on $\hat{\pi}_j$, the estimate of the population proportion of the $j$th class, as in [7]. It is called a proportional weight, being determined by the number of observations in each class. Weight (12) accounts only for the unbalanced proportions. When the dataset has outlying observations, the SVM based on weight (12) may not recover the true underlying decision function. Thus we propose a weight that also involves $e_j$, the within-group error of the $j$th class on the training dataset fitted with equal weights. We call this weight an adaptive weight. The within-group error $e_j$ is calculated as the misclassification rate for the $j$th class. Weight (13) puts much more weight on the minority class, while a well-classified group receives less weight: a small within-group error indicates well-classified observations and yields a small weight, whereas a large within-group error indicates misclassified observations, so the corresponding weight becomes larger and the learning machine retains the observations carrying important information. The proposed weight (13) accounts for both robustness and the unbalanced proportions of the data, because it contains one term reflecting the unbalanced proportions and another providing resistance to outlying observations, whereas weight (12) considers only the unbalanced class proportions.
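Since equations (12) and (13) are not fully reproduced here, the following Python sketch shows one consistent reading of the two schemes: an inverse-proportion weight, optionally rescaled by the within-group errors of an equal-weight fit. The functional forms and names are illustrative assumptions, not the paper's exact formulas:

```python
import numpy as np

def class_weights(y, within_group_err=None):
    """Illustrative class weights (assumed forms, not the paper's (12)-(13)).
    Proportional weight: w_j = 1 / pi_hat_j (inverse class frequency).
    Adaptive weight: scale that by the within-group error e_j of an
    equal-weight fit, so poorly classified classes gain extra weight."""
    classes, counts = np.unique(y, return_counts=True)
    pi_hat = counts / counts.sum()
    w = 1.0 / pi_hat
    if within_group_err is not None:
        w = w * np.asarray(within_group_err)
    return dict(zip(classes, w))
```

Either way, the minority class receives the largest weight, and the adaptive variant further boosts classes that the equal-weight fit handled poorly.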

#### 3. Algorithm

Since the SCAD function is not convex, the objective function in (10) is not convex, so standard optimization techniques cannot be adopted directly. Nonconvex penalties usually have good statistical properties, but their implementation is not easy. Reference [4] used the LQA algorithm, which has drawbacks similar to backward elimination: for numerical stability, a variable whose coefficient is close to zero is deleted and is not included in the final model. Furthermore, the solution of the LQA algorithm has a ridge-type form, so it does not guarantee sparseness, just as the ridge regression estimator does not. Reference [14] proposed a perturbation of the LQA algorithm that renders the objective function differentiable and then optimizes this differentiable function using a minorize-maximize algorithm, but it is not easy to select the size of the perturbation. Reference [15] proposed a one-step sparse estimation procedure for nonconcave penalized likelihood models, called the LLA algorithm. It has neither the drawbacks of the LQA algorithm nor numerical instability at zero.

By a Taylor expansion of the SCAD function we obtain the following approximation:
$$p_\lambda(|\beta|) \approx p_\lambda\left(|\beta^{(0)}|\right) + p'_\lambda\left(|\beta^{(0)}|\right)\left(|\beta| - |\beta^{(0)}|\right), \quad (14)$$
where the first derivative of the SCAD function is
$$p'_\lambda(t) = \lambda\left\{ I(t \le \lambda) + \frac{(a\lambda - t)_+}{(a-1)\lambda} I(t > \lambda) \right\}, \quad t \ge 0. \quad (15)$$
By putting (14) into (10), we obtain the objective function up to constants
$$\frac{1}{n}\sum_{i=1}^{n} w_{y_i}\sum_{j \neq y_i}\left[\mathbf{x}_i^T\boldsymbol{\beta}_j + \beta_{0j} + 1\right]_+ + \sum_{j=1}^{K}\sum_{k=1}^{p} p'_\lambda\left(|\beta_{jk}^{(0)}|\right)|\beta_{jk}|, \quad (16)$$
where $\beta_{jk}^{(0)}$ is an initial solution near the true value and the sum-to-zero restriction of (10) is still effective. We introduce slack variables $\xi_{ij}$ and use the decompositions $\beta_{jk} = \beta_{jk}^+ - \beta_{jk}^-$ and $|\beta_{jk}| = \beta_{jk}^+ + \beta_{jk}^-$ with $\beta_{jk}^+, \beta_{jk}^- \ge 0$, where $\beta_{0j}^{\pm}$ is defined similarly. Then the weighted SVM objective function (10) becomes
$$\min \ \frac{1}{n}\sum_{i=1}^{n} w_{y_i}\sum_{j \neq y_i}\xi_{ij} + \sum_{j=1}^{K}\sum_{k=1}^{p} p'_\lambda\left(|\beta_{jk}^{(0)}|\right)\left(\beta_{jk}^+ + \beta_{jk}^-\right) \quad (17)$$
subject to $\xi_{ij} \ge \mathbf{x}_i^T(\boldsymbol{\beta}_j^+ - \boldsymbol{\beta}_j^-) + \beta_{0j}^+ - \beta_{0j}^- + 1$ and $\xi_{ij} \ge 0$ for $j \neq y_i$. The first derivative of the SCAD function (4) is evaluated at the initial value $\beta_{jk}^{(0)}$. Equation (17) can be minimized by standard optimization packages: from the optimum solution the parameters are recovered by $\hat{\beta}_{jk} = \beta_{jk}^+ - \beta_{jk}^-$ and $\hat{\beta}_{0j} = \beta_{0j}^+ - \beta_{0j}^-$. Equation (17) can be solved by the lpSolve package in the R program.
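For intuition, one LLA step can be cast as a linear program even in the simpler binary, linear-kernel case. The sketch below is our own simplification (scipy's `linprog` in place of lpSolve, hypothetical function names), but it mirrors the same $\beta = \beta^+ - \beta^-$ splitting and slack-variable construction:

```python
import numpy as np
from scipy.optimize import linprog

def lla_svm_lp(X, y, lam, beta0=None, a=3.7):
    """One LLA step for a binary linear SVM with SCAD penalty, solved as a
    linear program (a simplified sketch of the paper's multiclass scheme).
    y takes values in {-1, +1}; beta0 is the initial solution at which the
    SCAD derivative is evaluated, turning the penalty into a weighted L1."""
    n, p = X.shape
    beta0 = np.zeros(p) if beta0 is None else beta0
    # SCAD derivative at the initial value -> per-coefficient L1 weights
    t = np.abs(beta0)
    w = np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1))
    # variables z = [u(p), v(p), b+, b-, xi(n)]; beta = u - v, b = b+ - b-
    c = np.concatenate([w, w, [0.0, 0.0], np.ones(n) / n])
    # hinge constraints: y_i (x_i . beta + b) >= 1 - xi_i, rewritten as
    # -y_i x_i . u + y_i x_i . v - y_i b+ + y_i b- - xi_i <= -1
    Yx = y[:, None] * X
    A_ub = np.hstack([-Yx, Yx, -y[:, None], y[:, None], -np.eye(n)])
    b_ub = -np.ones(n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * p + 2 + n))
    z = res.x
    beta = z[:p] - z[p:2 * p]
    b = z[2 * p] - z[2 * p + 1]
    return beta, b
```

Iterating this step (feeding the solution back in as `beta0`) gives the full LLA scheme; the paper's multiclass version adds the sum-to-zero constraint and class weights $w_{y_i}$.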

The linear programming problem can be formulated as $\min_{\mathbf{z}} \mathbf{c}^T\mathbf{z}$ subject to $A\mathbf{z} \ge \mathbf{b}$ and $\mathbf{z} \ge \mathbf{0}$, where $\mathbf{z}$ is the vector of variables composed of $\beta_{jk}^+$, $\beta_{jk}^-$, $\beta_{0j}^+$, $\beta_{0j}^-$, and $\xi_{ij}$. The constraint matrix $A$, the vector $\mathbf{b}$, and the coefficient vector $\mathbf{c}$ of the objective function are read off from the linear programming problem in (17).

Since the SCAD function is approximated by a weighted absolute value function, the method is similar to the L1 penalty SVM. From the variable selection point of view, it is well known that the L1 SVM gives very useful information for classification when the number of predictors greatly exceeds the number of observations (see [12]). Thus the proposed method is very effective in variable selection, especially when the number of variables is large.

We summarize the proposed algorithm for the weighted multiclass SVM with the SCAD penalty function by the following steps.

(1) Tuning process is as follows.
  (1.1) For each candidate value of the tuning parameter $\lambda$ on a grid, find the estimates minimizing (17) for the tuning data, starting from initial estimates of zero.
  (1.2) Obtain the value $\hat{\lambda}$ minimizing the misclassification rate, namely, the overall misclassification rate (CLSC) or the within-group error rate (MWGE).

(2) Training process is as follows.
  (2.1) Set the initial solutions $\beta_{0j}^{(0)} = 0$ and $\boldsymbol{\beta}_j^{(0)} = \mathbf{0}$ for $j = 1, \ldots, K$.
  (2.2) Solve the linear programming problem (17) for the training data with equal weights for the given tuning parameter $\hat{\lambda}$.
  (2.3) Set the weights by (12) or (13) using the above solution.
  (2.4) Solve (17) again and set the parameters $\hat{\beta}_{0j}$ and $\hat{\boldsymbol{\beta}}_j$ for $j = 1, \ldots, K$.

(3) Test process is as follows.
  (3.1) Calculate the misclassification rates for the test data based on the tuning parameter $\hat{\lambda}$ and the parameters $\hat{\beta}_{0j}$ and $\hat{\boldsymbol{\beta}}_j$.
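The tuning step distinguishes two selection criteria; the sketch below illustrates MWGE-based selection with a deliberately simple stand-in classifier (nearest centroid, NOT the paper's SCAD SVM) and our own function names, since any fit/predict pair slots into the same loop:

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Stand-in classifier used only to illustrate the tuning criterion."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(model, X):
    classes, centroids = model
    d = ((X[:, None, :] - centroids[None]) ** 2).sum(-1)
    return classes[d.argmin(1)]

def select_by_mwge(fits, X_tune, y_tune):
    """Among candidate fitted models (e.g., one per lambda), pick the one
    minimizing the sum of within-group error rates (MWGE) on tuning data."""
    classes = np.unique(y_tune)

    def mwge(model):
        pred = predict(model, X_tune)
        return sum(np.mean(pred[y_tune == c] != c) for c in classes)

    return min(fits, key=mwge)
```

Swapping `mwge` for the plain overall error rate gives the CLSC criterion of step (1.2).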

#### 4. Numerical Experiment

In this section we compare the performance of the proposed algorithm with that of the classical multiclass SVM for unbalanced proportion data. We conducted experiments on simulated data and a real dataset.

##### 4.1. Simulation

We consider simple three-class two-dimensional data. The class label of each observation is randomly allocated according to the proportions 1 : 2 : 2 and 1 : 4 : 4, respectively, so the first class is the minority class in both settings. The training data are generated as follows. First, $t_{i1}$ and $t_{i2}$, $i = 1, \ldots, n$, are generated from the $t$ distribution. Second, the responses are allocated randomly and the predictors are obtained by adding class-dependent location shifts to $t_{i1}$ and $t_{i2}$ for classes 1–3, respectively (see [16]). The number of observations per class is 20, 30, and 30 in the data with the proportion 1 : 2 : 2 and 20, 80, and 80 for the other data.
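A hypothetical reconstruction of this simulation design can be sketched as follows; the exact location shifts and degrees of freedom were lost from the text, so the values below are illustrative assumptions only:

```python
import numpy as np

def make_unbalanced_t_data(n_per_class=(20, 30, 30), df=4, rng=None):
    """Generate three-class two-dimensional data: two t-distributed
    predictors with class-dependent mean shifts.  The shift matrix and
    the degrees of freedom are assumed, not taken from the paper."""
    rng = np.random.default_rng(rng)
    shifts = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])  # assumed
    X, y = [], []
    for c, n_c in enumerate(n_per_class):
        X.append(rng.standard_t(df, size=(n_c, 2)) + shifts[c])
        y.append(np.full(n_c, c))
    return np.vstack(X), np.concatenate(y)
```

Passing `n_per_class=(20, 80, 80)` reproduces the more severely unbalanced 1 : 4 : 4 setting; the heavy tails of the $t$ distribution supply the outlying observations that motivate the adaptive weights.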

Our simulation data consist of training data, tuning data, and independent test data. The coefficients $\boldsymbol{\beta}_j$ and $\beta_{0j}$ are estimated from the training data, and the performance is evaluated on test data generated from the same distribution described above. Tuning data are used to determine the tuning parameter $\lambda$. We conducted repeated simulation iterations to evaluate the within-group error for each class. All simulations were carried out using the R program.

Tables 1 and 2 summarize the within-group error rates for equal weights, proportional weights (12), and the adaptive weight (13), which combines the class proportions, the misclassification rates, and robustness to outliers. The values in the tables are evaluated on the test data. We consider two classifiers, the classical classifier (CLSC) of (1) and the minimum within-group error classifier (MWGE) of (2). We compared the proposed algorithm with the equal-weight SCAD SVM, that is, the classical SCAD SVM. Tables 1 and 2 show that the equal-weight scheme ignores the minority class 1 in order to minimize the overall misclassification rate. In contrast, the proposed algorithms (proportional weights and adaptive weights) did not ignore the minority class 1, which may be the class of main interest. For the equal-weight scheme the misclassification rate for class 1 is close to 1, yet the overall misclassification rate is not high, because class 1 has a small proportion. In particular, for the proportion 1 : 4 : 4 the equal-weight scheme did not discriminate class 1 at all. This is undesirable when class 1 itself is of main interest. As the degree of unbalancedness increases, the effectiveness of the proposed weights becomes more pronounced.