An Interior Point Method for $L_{1/2}$-SVM and Application to Feature Selection in Classification
This paper studies feature selection for the support vector machine (SVM). By the use of the $L_{1/2}$ regularization technique, we propose a new model, the $L_{1/2}$-SVM. To solve this nonconvex and non-Lipschitz optimization problem, we first transform it into an equivalent quadratically constrained optimization model with a linear objective function and then develop an interior point algorithm. We establish the convergence of the proposed algorithm. Our experiments with artificial data and real data demonstrate that the $L_{1/2}$-SVM model works well and that the proposed algorithm is more effective than some popular methods in selecting relevant features and improving classification performance.
1. Introduction
Feature selection plays an important role in solving classification problems with high-dimensional features, such as text categorization [1, 2], gene expression array analysis [3–5], and combinatorial chemistry [6, 7]. The advantages of feature selection are threefold: (i) ignoring noisy or irrelevant features prevents overfitting and improves generalization performance; (ii) a sparse classifier reduces the computational cost; (iii) a small set of important features is desirable for interpretability.
We address embedded feature selection methods in the context of linear support vector machines (SVMs). Existing feature selection methods embedded in SVMs fall into three approaches. In the first approach, greedy search strategies are applied to iteratively add or remove features from the data. Guyon et al. developed a recursive feature elimination (RFE) algorithm, which has shown good performance on gene selection for microarray data. Beginning with the full feature subset, SVM-RFE trains an SVM at each iteration and then eliminates the feature that decreases the margin the least. Rakotomamonjy et al. extended this method by using other ranking criteria, including the radius margin bound and the span estimate.
The second approach is to optimize a scaling parameter vector that indicates the importance of each feature. Weston et al.  proposed an iterative method to optimize the scaling parameters by minimizing the bounds on leave-one-out error. Peleg and Meir  learned the scaling factors based on the global minimization of a data-dependent generalization error bound.
The third category of approaches is to minimize the number of features by adding a sparsity term to the SVM formulation. Though the standard SVM based on the $L_2$ norm can be solved easily by convex quadratic programming, its solution may not be a desirable sparse solution. A popular way to deal with this problem is the use of the $L_p$ regularization technique, which results in an $L_p$-SVM. It is to minimize $\|w\|_p^p$, subject to some linear constraints, where $p \in [0, 1]$. When $p \in (0, 1]$,
$$\|w\|_p^p = \sum_{j=1}^{m} |w_j|^p. \quad (1)$$
When $p = 0$, $\|w\|_0 = |\{ j : w_j \neq 0 \}|$. The $L_0$-SVM can find the sparsest classifier by minimizing $\|w\|_0$, the number of nonzero elements in $w$. However, it is discrete and NP-hard. From a computational point of view, it is very difficult to develop efficient numerical methods to solve the problem. A widely used technique in dealing with the $L_0$-SVM is to use a smoothing technique so that the discrete model is approximated by a smooth problem [4, 12, 13]. However, as the function $\|w\|_0$ is not even continuous, a method based on smoothing cannot be expected to work well. Chan et al. explored a convex relaxation of the cardinality constraint and obtained a relaxed convex problem that is close to but different from the previous $L_0$-SVM. An alternative method is to minimize the convex envelope of $\|w\|_0$, as in the $L_1$-SVM. The $L_1$-SVM is a convex problem and can yield sparse solutions. It can be reformulated as a linear program and hence can be solved efficiently. Indeed, $L_1$ regularization has become quite popular in SVM [12, 15, 16] and is well known as the LASSO in the statistics literature. However, the $L_1$ regularization problem often leads to suboptimal sparsity in practice. In many cases, the solutions yielded by the $L_1$-SVM are less sparse than those of the $L_0$-SVM. The $L_p$ problem with $0 < p < 1$ can find sparser solutions than the $L_1$ problem, as evidenced in extensive computational studies [19–21]. It has become a popular strategy in sparse SVM [22–27].
In this paper, we focus on $L_{1/2}$ regularization and propose a novel $L_{1/2}$-SVM. Recently, Xu et al. justified that the sparsity-promotion ability of the $L_{1/2}$ problem is the strongest among the $L_p$ minimization problems with all $p \in [1/2, 1)$ and is similar for $p \in (0, 1/2]$. So the $L_{1/2}$ problem can be taken as a representative of the $L_p$ ($0 < p < 1$) problems. However, as proved by Ge et al., finding the global minimal value of the $L_{1/2}$ problem is still strongly NP-hard, whereas computing a local minimizer of the problem can be done in polynomial time. The contributions of this paper are twofold. The first is to derive a smooth constrained optimization reformulation of the $L_{1/2}$-SVM. The objective function of the reformulated problem is linear, and the constraints are quadratic and linear. We will establish the equivalence between the constrained problem and the $L_{1/2}$-SVM. We will also show the existence of a KKT point of the constrained problem. Our second contribution is to develop an interior point method to solve the constrained optimization reformulation and establish its global convergence. We will also test and verify the effectiveness of the proposed method using artificial data and real data.
The rest of this paper is organized as follows. In Section 2, we first briefly introduce the model of the standard SVM ($L_2$-SVM) and the sparse regularization SVMs. We then reformulate the $L_{1/2}$-SVM into a smooth constrained optimization problem. We propose an interior point method to solve the constrained optimization reformulation and establish its global convergence in Section 3. In Section 4, we report numerical experiments that test the proposed method. Section 5 gives concluding remarks.
2. A Smooth Constrained Optimization Reformulation to the $L_{1/2}$-SVM
In this section, after briefly reviewing the model of the standard SVM ($L_2$-SVM) and the sparse regularization SVMs, we derive an equivalent smooth optimization problem to the $L_{1/2}$-SVM model. The smooth optimization problem is to minimize a linear function subject to some simple quadratic constraints and linear constraints.
2.1. Standard SVM
In a two-class classification problem, we are given a training data set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^m$ is the feature vector and $y_i \in \{-1, +1\}$ is the class label. The linear classifier is to construct the following decision function:
$$f(x) = w^T x + b, \quad (2)$$
where $w = (w_1, \dots, w_m)^T$ is the weight vector and $b$ is the bias. The prediction label is $+1$ if $f(x) > 0$ and $-1$ otherwise. The standard SVM ($L_2$-SVM) aims to find the separating hyperplane between two classes with maximal margin and minimal training errors, which leads to the following convex optimization problem:
$$\min_{w, b, \xi} \ \|w\|_2^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \ i = 1, \dots, n, \quad (3)$$
where $\|w\|_2 = (\sum_{j=1}^{m} w_j^2)^{1/2}$ is the $L_2$ norm of $w$, $\xi_i$ is the loss that allows training errors for data that may not be linearly separable, and $C$ is a user-specified parameter that balances the margin and the losses. As problem (3) is a convex quadratic program, it can be solved efficiently by existing methods, such as the interior point method and the active set method.
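The quantities in this subsection can be illustrated with a minimal sketch. It assumes the objective takes the form $\|w\|_2^2 + C \sum_i \xi_i$ and that each slack is set to its smallest feasible value $\max(0, 1 - y_i(w^T x_i + b))$; the function names are illustrative and are not taken from the paper.

```python
def decision_value(w, b, x):
    """Linear decision function f(x) = w^T x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def predict(w, b, x):
    """Predicted label: +1 if f(x) > 0, otherwise -1."""
    return 1 if decision_value(w, b, x) > 0 else -1


def l2_svm_objective(w, b, X, y, C):
    """Assumed objective ||w||_2^2 + C * sum(xi), with each slack xi
    at its smallest feasible value max(0, 1 - y_i * f(x_i))."""
    slacks = [max(0.0, 1.0 - yi * decision_value(w, b, xi))
              for xi, yi in zip(X, y)]
    return sum(wj * wj for wj in w) + C * sum(slacks)
```

For two points placed symmetrically on the first axis, a weight vector that separates them with margin at least 1 incurs no slack, so the objective reduces to the squared norm of the weights.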
2.2. Sparse SVM
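The $L_2$-SVM uses a nonsparse regularizer in the sense that the learned decision hyperplane often utilizes all the features. In practice, people prefer a sparse SVM so that only a few features are used to make a decision. For this purpose, the following $L_p$-SVM has become very popular:
$$\min_{w, b, \xi} \ \|w\|_p^p + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \ i = 1, \dots, n, \quad (4)$$
where $p \in [0, 1]$. Here $\|w\|_0$ stands for the number of nonzero elements of $w$, and for $p \in (0, 1]$, $\|w\|_p^p$ is defined by (1). Problem (4) is obtained by replacing the $L_2$ penalty ($\|w\|_2^2$) by the $L_p$ penalty ($\|w\|_p^p$) in (3). The standard SVM (3) corresponds to the model (4) with $p = 2$.
Figure 1 plots the $L_p$ penalty in one dimension. We can see from the figure that the smaller $p$ is, the larger the penalties imposed on small coefficients ($|w| < 1$). Therefore, the $L_p$ penalties with $p < 1$ may achieve sparser solutions than the $L_1$ penalty. In addition, the $L_1$ penalty imposes large penalties on large coefficients, which may lead to biased estimation of large coefficients. Consequently, the $L_p$ ($0 < p < 1$) penalties have become attractive due to their good properties in sparsity, unbiasedness, and the oracle property. We are particularly interested in the $L_{1/2}$ penalty. Recently, Xu et al. revealed the representative role of the $L_{1/2}$ penalty in $L_p$ regularization with $p \in (0, 1)$. We will apply the $L_{1/2}$ penalty to SVM to perform feature selection and classification jointly.
2.3. $L_{1/2}$-SVM Model
We pay particular attention to the $L_{1/2}$-SVM, namely, problem (4) with $p = 1/2$. We will derive a smooth constrained optimization reformulation of the $L_{1/2}$-SVM so that it is relatively easy to design numerical methods. We first specify the $L_{1/2}$-SVM:
$$\min_{w, b, \xi} \ \sum_{j=1}^{m} |w_j|^{1/2} + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \ i = 1, \dots, n. \quad (5)$$
Denote $u = (w, \xi, b)$ and let $\mathcal{D}$ be the feasible region of the problem; that is,
$$\mathcal{D} = \{ u \mid y_i (w^T x_i + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \ i = 1, \dots, n \}. \quad (6)$$
Then the $L_{1/2}$-SVM can be written in the compact form
$$\min \ \phi(u) = \sum_{j=1}^{m} |w_j|^{1/2} + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad u \in \mathcal{D}. \quad (7)$$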
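The behaviour of the penalty for different exponents can be checked numerically. The following minimal sketch assumes the penalty is $\sum_j |w_j|^p$ for $p > 0$ and the count of nonzero entries for $p = 0$; the function name is illustrative.

```python
def lp_penalty(w, p):
    """Penalty sum_j |w_j|^p for p > 0; number of nonzero entries for p == 0."""
    if p == 0:
        return float(sum(1 for wj in w if wj != 0))
    return sum(abs(wj) ** p for wj in w)


# On a small coefficient, a smaller exponent gives a larger penalty;
# on a large coefficient the ordering is reversed.
small, large = [0.01], [4.0]
for p in (0.5, 1, 2):
    print(p, lp_penalty(small, p), lp_penalty(large, p))
```

The reversal of the ordering at $|w| = 1$ is what makes exponents below 1 push small coefficients toward zero while penalizing large coefficients less heavily.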
It is a nonconvex and non-Lipschitz problem. Due to the term $|w_j|^{1/2}$, the objective function is not even directionally differentiable at a point with some $w_j = 0$, which makes the problem very difficult to solve. Existing numerical methods that are very efficient for solving smooth problems cannot be used directly. One possible way to develop numerical methods for solving (7) is to smooth the term $|w_j|^{1/2}$ using some smoothing function such as $\theta_\epsilon(w_j) = (w_j^2 + \epsilon^2)^{1/4}$ with some $\epsilon > 0$. However, it is easy to see that the derivative of $\theta_\epsilon$ will be unbounded as $w_j \to 0$ and $\epsilon \to 0$. Consequently, numerical methods based on such a smoothing function cannot be expected to work well.
Recently, Tian and Yang proposed an interior point $L_{1/2}$-penalty function method to solve general nonlinear programming problems by using a quadratic relaxation scheme for their $L_{1/2}$-lower order penalty problems. We will follow their idea to develop an interior point method for solving the $L_{1/2}$-SVM. To this end, in the next subsection, we reformulate problem (7) into a smooth constrained optimization problem.
2.4. A Reformulation to the $L_{1/2}$-SVM Model
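The unboundedness of the derivative can be seen in a small numerical sketch. It assumes the smoothing function $(w^2 + \epsilon^2)^{1/4}$, which is one common choice and may differ from the one intended in the text; the function names are illustrative.

```python
def smoothed_root(w, eps):
    """Assumed smoothing of |w|^(1/2): (w^2 + eps^2)^(1/4)."""
    return (w * w + eps * eps) ** 0.25


def smoothed_root_derivative(w, eps):
    """Derivative of the assumed smoothing with respect to w."""
    return 0.5 * w * (w * w + eps * eps) ** -0.75


# Along w = eps the derivative scales like eps^(-1/2), so it grows
# without bound as both w and eps tend to zero.
for eps in (1e-2, 1e-4, 1e-6):
    print(eps, smoothed_root_derivative(eps, eps))
```

Under this assumption, shrinking $\epsilon$ by a factor of 100 along $w = \epsilon$ multiplies the derivative by 10, which illustrates why a gradient-based method applied to the smoothed problem becomes ill-conditioned near sparse solutions.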
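Consider the following constrained optimization problem:
$$\min_{w, t, \xi, b} \ \sum_{j=1}^{m} t_j + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad t_j^2 - w_j \ge 0, \ \ t_j^2 + w_j \ge 0, \ \ t_j \ge 0, \ \ j = 1, \dots, m; \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \ i = 1, \dots, n. \quad (8)$$
It is obtained by letting $t_j = |w_j|^{1/2}$ in the objective function and adding the constraints $|w_j| \le t_j^2$ and $t_j \ge 0$, $j = 1, \dots, m$, in (7). Denote by $\mathcal{F}$ the feasible region of the problem; that is,
$$\mathcal{F} = \{ (w, t, \xi, b) \mid t_j^2 - w_j \ge 0, \ t_j^2 + w_j \ge 0, \ t_j \ge 0, \ j = 1, \dots, m; \ y_i (w^T x_i + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \ i = 1, \dots, n \}. \quad (9)$$
Let $z = (w, t, \xi, b)$. Then the above problem can be written as
$$\min \ \varphi(z) = \sum_{j=1}^{m} t_j + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad z \in \mathcal{F}. \quad (10)$$
The following theorem establishes the equivalence between the $L_{1/2}$-SVM and (10).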
Proof. Let be a solution of the -SVM (7) and let be a solution of the constrained optimization problem (10). It is clear that . Moreover, we have , , and hence Since , we have This together with (11) implies that . The proof is complete.
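The idea behind the equivalence can be checked on a small example. The sketch below assumes that the reformulation introduces auxiliary variables $t_j$ with $t_j \ge 0$ and $t_j^2 \ge |w_j|$ (written as the two quadratic constraints $t_j^2 - w_j \ge 0$ and $t_j^2 + w_j \ge 0$) and replaces each $|w_j|^{1/2}$ in the objective by $t_j$; the function names are illustrative.

```python
import math


def lift(w):
    """Smallest feasible auxiliary variables: t_j = |w_j|^(1/2)."""
    return [math.sqrt(abs(wj)) for wj in w]


def is_feasible_t(w, t, tol=1e-12):
    """Check t_j >= 0, t_j^2 - w_j >= 0 and t_j^2 + w_j >= 0 for all j."""
    return all(tj >= -tol and tj * tj - wj >= -tol and tj * tj + wj >= -tol
               for wj, tj in zip(w, t))


def half_norm_penalty(w):
    """Original nonsmooth penalty sum_j |w_j|^(1/2)."""
    return sum(math.sqrt(abs(wj)) for wj in w)


w = [4.0, -9.0, 0.0]
t = lift(w)
print(is_feasible_t(w, t), sum(t), half_norm_penalty(w))
```

Any feasible $t$ satisfies $t_j \ge |w_j|^{1/2}$, so the linear term $\sum_j t_j$ is never smaller than the original penalty, and the two coincide at the lifted point.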
It is clear that the constraint functions of (10) are convex. Consequently, at any feasible point, the set of all feasible directions is the same as the set of all linearized feasible directions.
As a result, the KKT point exists. The KKT system of the problem (10) can be written as the following system of nonlinear equations: where are the Lagrangian multipliers, , is diagonal matrix and , .
For the sake of simplicity, the properties of the reformulation to the $L_{1/2}$-SVM are shown in Appendix A.
3. An Interior Point Method
3.1. Auxiliary Function
Following the idea of the interior point method, the constrained problem (10) can be solved by minimizing a sequence of logarithmic barrier functions as follows: where is the barrier parameter, converging to zero from above.
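A logarithmic barrier function of this kind can be sketched as follows. The sketch assumes that problem (10) has the constraints $t_j^2 - w_j \ge 0$, $t_j^2 + w_j \ge 0$, $t_j \ge 0$, $y_i(w^T x_i + b) - 1 + \xi_i \ge 0$, and $\xi_i \ge 0$, and that the barrier subtracts $\mu$ times the sum of the logarithms of these quantities; the exact form of (14) may differ, and the function names are illustrative.

```python
import math


def constraint_values(w, t, xi, b, X, y):
    """Values of the assumed inequality constraints; all must be > 0 inside."""
    g = []
    for wj, tj in zip(w, t):
        g += [tj * tj - wj, tj * tj + wj, tj]
    for row, yi, si in zip(X, y, xi):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, row)) + b)
        g += [margin - 1.0 + si, si]
    return g


def barrier_objective(w, t, xi, b, X, y, C, mu):
    """sum(t) + C * sum(xi) - mu * sum(log g); +inf outside the interior."""
    g = constraint_values(w, t, xi, b, X, y)
    if min(g) <= 0.0:
        return math.inf
    return sum(t) + C * sum(xi) - mu * sum(math.log(v) for v in g)
```

At a fixed strictly feasible point, the barrier value approaches the linear objective as $\mu$ decreases to zero, while any point on or outside the boundary receives an infinite value.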
The KKT system of problem (14) is the following system of nonlinear equations: where are the Lagrangian multipliers, , , , and are diagonal matrices, and and stand for the vectors whose elements are all ones.
3.2. Newton’s Method
We apply Newton’s method to solve the nonlinear system (15) in variables , , , , and . The subproblem of the method is the following system of linear equations: where is the Jacobian of the left function in (15) and takes the formwhere , , , , and .
We can rewrite (16) as It follows from the last five equations that vector can be expressed as Substituting (19) into the first four equations of (18), we obtain where matrix takes the form with blocks and and .
3.3. The Interior Point Algorithm
Let and ; we first present the interior point algorithm to solve the barrier problem (14), and then discuss the details of the algorithm.
Algorithm 2. The interior point algorithm (IPA) is as follows.
Step 0. Given tolerance , set , , , . Let .
Step 1. Stop if KKT condition (15) holds.
Step 2. Compute from (20) and from (19). Compute . Update the Lagrangian multipliers to obtain .
Step 3. Let . Go to Step 1.
In Step 2, a step length is used to calculate . We estimate by Armijo line search , in which for some and satisfies the following inequalities: where .
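A backtracking line search of this type can be sketched as follows. This is a generic Armijo rule combined with a strict feasibility check, not the exact set of inequalities used in the algorithm; the parameter names `beta` and `sigma`, their default values, and the function name are assumptions made for illustration.

```python
def armijo_step(phi, z, d, grad_dot_d, is_strictly_feasible,
                beta=0.5, sigma=1e-4, max_backtracks=60):
    """Return the largest alpha in {1, beta, beta^2, ...} such that
    z + alpha * d is strictly feasible and
    phi(z + alpha * d) <= phi(z) + sigma * alpha * grad_dot_d,
    where grad_dot_d is the directional derivative of phi at z along d."""
    phi0 = phi(z)
    alpha = 1.0
    for _ in range(max_backtracks):
        trial = [zi + alpha * di for zi, di in zip(z, d)]
        # Feasibility is tested first so that phi is never evaluated
        # at a point where the logarithmic barrier is undefined.
        if is_strictly_feasible(trial) and phi(trial) <= phi0 + sigma * alpha * grad_dot_d:
            return alpha
        alpha *= beta
    raise RuntimeError("no acceptable step length found")
```

Testing feasibility before evaluating the merit function keeps every iterate in the interior of the feasible region, which is what the barrier approach requires.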
To avoid ill-conditioned growth of and guarantee the strict dual feasibility, the Lagrangian multipliers should be sufficiently positive and bounded from above. Following a similar idea of , we first update the dual multipliers by where ; where the parameters and satisfy , .
Since positive definiteness of the matrix is demanded in this method, the Lagrangian multipliers should satisfy the following condition: For the sake of simplicity, the proof is given in Appendix B.
Therefore, if satisfies (26), we let . Otherwise, we would further update it by the following setting: where constants and satisfy with . It is not difficult to see that the vector determined by (27) satisfies (26).
In practice, the KKT conditions (15) are allowed to be satisfied within a tolerance . That is, the iterative process stops when the following inequalities are met: where is related to the current barrier parameter , and satisfies as .
The proposed interior point method successively solves the barrier subproblem (14) with a decreasing sequence . We simply reduce both and by a constant factor . Finally, we test optimality for problem (10) by means of the residual norm .
Here, we present the whole algorithm to solve the $L_{1/2}$-SVM problem (10).
Algorithm 4. The algorithm for solving the $L_{1/2}$-SVM problem is as follows.
Step 0. Set , , , . Given constants , , and , let .
Step 1. Stop if and .
Step 2. Starting from , apply Algorithm 2 to solve (14) with barrier parameter and stopping tolerance . Set , , and .
Step 3. Set and . Let and go to Step 1.
The convergence of the proposed interior point method can be proved. We list the theorem here and give the proof in Appendix C.
Theorem 6. Let be generated by Algorithm 4 by ignoring its termination condition. Then the following statements are true. (i) The limit point of satisfies the first order optimality condition (13). (ii) The limit point of the convergent subsequence with unbounded multipliers is a Fritz-John point of problem (10).
4. Numerical Experiments
In this section, we test the constrained optimization reformulation of the $L_{1/2}$-SVM and the proposed interior point method. We compared the performance of the $L_{1/2}$-SVM with the $L_2$-SVM, $L_1$-SVM, and $L_0$-SVM on artificial data and ten UCI data sets (http://archive.ics.uci.edu/ml/). These four problems were solved in the primal, with reference to the machine learning toolbox Spider (http://people.kyb.tuebingen.mpg.de/spider/). The $L_2$-SVM and $L_1$-SVM were solved directly by quadratic programming and linear programming, respectively. For the NP-hard $L_0$-SVM, a commonly cited approximation method, Feature Selection Concave (FSV), was applied, and the FSV problem was then solved by a Successive Linear Approximation (SLA) algorithm. All the experiments were run on a personal computer (1.6 GHz CPU, 4 GB of RAM) with MATLAB R2010b on 64-bit Windows 7.
In the proposed interior point method (Algorithms 2 and 4), we set the parameters as , , , , and . The balance parameter $C$ was selected by 5-fold cross-validation on the training set over the range , . After training, the weights that did not satisfy the criteria were set to zero. Then the cardinality of the hyperplane was computed as the number of nonzero weights.
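The post-processing step can be sketched as follows. The thresholding criterion itself is not reproduced above, so the sketch assumes a hypothetical relative rule that zeroes every weight whose magnitude falls below `rel_tol` times the largest magnitude; both the rule and the default value of `rel_tol` are assumptions.

```python
def sparsify(w, rel_tol=1e-4):
    """Zero out weights that are negligible relative to the largest one.
    The relative rule and rel_tol are hypothetical placeholders."""
    scale = max((abs(wj) for wj in w), default=0.0)
    if scale == 0.0:
        return [0.0] * len(w)
    return [wj if abs(wj) >= rel_tol * scale else 0.0 for wj in w]


def cardinality(w):
    """Number of nonzero weights."""
    return sum(1 for wj in w if wj != 0.0)


w = sparsify([2.0, -1e-7, 0.5, 1e-9])
print(w, cardinality(w))
```

A relative threshold is used here only because it is insensitive to the overall scale of the weight vector; an absolute threshold would be an equally plausible reading.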
4.1. Artificial Data
First, we took an artificial binary linear classification problem as an example. The problem is similar to that in . The probability of $y = 1$ or $y = -1$ is equal. The first 6 features are relevant but redundant. In 70% of the samples, the first three features were drawn as and the second three features as . Otherwise, the first three were drawn as and the second three as . The remaining features are noise , . Here, $m$ is the dimension of the input features. The inputs were scaled to zero mean and unit standard deviation. In each trial, 500 points were generated for testing, and the average results were estimated over 30 trials.
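A data generator with this structure can be sketched as follows. The sampling distributions are not reproduced above, so every distribution in the sketch is a placeholder assumption: the informative block is drawn as $y \cdot N(i, 1)$ for $i = 1, 2, 3$, the uninformative block as $N(0, 1)$, and the noise features as $N(0, \texttt{noise\_std}^2)$ with a hypothetical `noise_std`. Only the 70%/30% split, the six relevant features, and the standardization step are taken from the description.

```python
import math
import random


def make_sample(m, rng, p_first=0.7, noise_std=1.0):
    """One labeled sample with m features (m >= 6); distributions are assumed."""
    y = rng.choice([-1, 1])
    informative = [y * rng.gauss(i, 1.0) for i in (1, 2, 3)]
    uninformative = [rng.gauss(0.0, 1.0) for _ in range(3)]
    if rng.random() < p_first:
        first, second = informative, uninformative
    else:
        first, second = uninformative, informative
    noise = [rng.gauss(0.0, noise_std) for _ in range(m - 6)]
    return first + second + noise, y


def make_dataset(n, m, seed=0):
    """Generate n samples with a reproducible random stream."""
    rng = random.Random(seed)
    samples = [make_sample(m, rng) for _ in range(n)]
    return [x for x, _ in samples], [y for _, y in samples]


def standardize(X):
    """Scale every feature to zero mean and unit standard deviation."""
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n) or 1.0
            for j in range(m)]
    return [[(row[j] - means[j]) / stds[j] for j in range(m)] for row in X]
```

In practice the scaling statistics would be computed on the training set and reused for the test points; the sketch standardizes a single matrix for brevity.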
In the first experiment, we consider the cases with the fixed feature size and different training sample sizes . The average results over the 30 trials are shown in Table 1 and Figure 2. Figure 2 (left) plots the average cardinality of each classifier. Since the artificial data sets have 2 relevant and nonredundant features, the ideal average cardinality is 2. Figure 2 (left) shows that the three sparse SVMs, $L_1$-SVM, $L_{1/2}$-SVM, and $L_0$-SVM, can achieve sparse solutions, while the $L_2$-SVM uses almost all features in each data set. Furthermore, the solutions of $L_{1/2}$-SVM and $L_0$-SVM are much sparser than those of $L_1$-SVM. As shown in Table 1, the $L_1$-SVM selects more than 6 features in all cases, which implies that some redundant or irrelevant features are selected. The average cardinalities of $L_{1/2}$-SVM and $L_0$-SVM are similar and close to 2. However, when $n = 10$ and 20, the $L_0$-SVM has average cardinalities of 1.42 and 1.87, respectively. This means that the $L_0$-SVM sometimes selects only one feature on small-sample data sets and may ignore a truly relevant feature. Consequently, with cardinalities between 2.05 and 2.9, the $L_{1/2}$-SVM gives a more reliable solution than the $L_0$-SVM. In short, as far as the number of selected features is concerned, the $L_{1/2}$-SVM behaves better than the other three methods.
Figure 2 (right) plots the trend of the prediction accuracy versus the size of the training sample. The classification performance of all methods generally improves as the training sample size increases. $L_1$-SVM has the best prediction performance in all cases and is slightly better than $L_{1/2}$-SVM. $L_{1/2}$-SVM is more accurate in classification than $L_2$-SVM and $L_0$-SVM, especially in the case of . As shown in Table 1, when there are only 10 training samples, the average accuracy of $L_{1/2}$-SVM is , while the results of $L_2$-SVM and $L_0$-SVM are and , respectively. Compared with $L_2$-SVM and $L_0$-SVM, $L_{1/2}$-SVM has the average accuracy increased by and , respectively, which can be explained as follows. In the $L_2$-SVM, all features are selected without discrimination, and the prediction is misled by the irrelevant features. In the $L_0$-SVM, few features are selected, and some relevant features are not included, which has a negative impact on the prediction result. As a tradeoff between $L_2$-SVM and $L_0$-SVM, $L_{1/2}$-SVM performs better than both.
The average results over the ten artificial data sets in the first experiment are shown at the bottom of Table 1. On average, the accuracy of $L_{1/2}$-SVM is lower than that of the $L_1$-SVM, while the features selected by $L_{1/2}$-SVM are fewer than those of $L_1$-SVM. This indicates that the $L_{1/2}$-SVM can achieve a much sparser solution than the $L_1$-SVM at little cost in accuracy. Moreover, the average accuracy of $L_{1/2}$-SVM over the 10 data sets is higher than that of $L_0$-SVM with similar cardinality. To sum up, the $L_{1/2}$-SVM provides the best balance between accuracy and sparsity among the three sparse SVMs.
To further evaluate the feature selection performance of $L_{1/2}$-SVM, we investigate whether the features are correctly selected. Since the $L_2$-SVM is not designed for feature selection, it is not included in this comparison. Since our artificial data sets have 2 best features (), the best result should have the two features () ranked at the top according to the absolute values of their weights . In the experiment, we select the top 2 features with the maximal for each method and calculate the frequency with which the top 2 features are and in 30 runs. The results are listed in Table 2. When the training sample size is too small, it is difficult for all sparse SVMs to identify the two most important features. For example, when , the selected frequencies of $L_1$-SVM, $L_0$-SVM, and $L_{1/2}$-SVM are 7, 3, and 9, respectively. When increases, all methods tend to make more correct selections. Moreover, Table 2 shows that the $L_{1/2}$-SVM outperforms the other two methods in all cases. For example, when , the selected frequencies of $L_1$-SVM and $L_0$-SVM are and , respectively, and the result of $L_{1/2}$-SVM is 27. The $L_1$-SVM selects too many redundant or irrelevant features, which may influence the ranking to some extent. Therefore, $L_1$-SVM is not as good as $L_{1/2}$-SVM at distinguishing the critical features. The $L_0$-SVM has a lower hit frequency than $L_{1/2}$-SVM, which is probably due to the excessively small feature subset it obtains. Overall, Tables 1 and 2 and Figure 2 clearly show that the $L_{1/2}$-SVM is a promising sparsity-driven classification method.
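The hit frequency described above can be computed with a short sketch. The indices of the two target features are not reproduced in the text, so the `target` argument below is a hypothetical placeholder (0-based indices); the function names are illustrative.

```python
def top_k_features(w, k=2):
    """Indices of the k weights with the largest absolute values."""
    order = sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)
    return set(order[:k])


def hit_frequency(weight_vectors, target, k=2):
    """Number of runs whose top-k features are exactly the target set."""
    return sum(1 for w in weight_vectors if top_k_features(w, k) == set(target))


# Three hypothetical runs; the target indices (2, 5) are placeholders.
runs = [
    [0.1, 0.0, 2.0, 0.0, 0.0, -1.5],
    [3.0, 0.0, 0.2, 0.0, 0.0, 0.1],
    [0.0, 0.0, -0.7, 0.0, 0.0, 0.9],
]
print(hit_frequency(runs, target=(2, 5)))
```

A run counts as a hit only when both target features occupy the top two positions, regardless of their order or the signs of their weights.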
In the second simulation, we consider the cases with various dimensions of feature space and the fixed training sample size . The average results over 30 trials are shown in Figure 3 and Table 3. Since there are still only 6 relevant features, a larger $m$ means more noisy features. Figure 3 (left) shows that as the dimension increases from 20 to 200, the number of features selected by $L_1$-SVM increases from 8.26 to 23.1. However, the cardinalities of $L_{1/2}$-SVM and $L_0$-SVM remain stable (from 2.2 to 2.95). This indicates that the $L_{1/2}$-SVM and $L_0$-SVM are more suitable for feature selection than the $L_1$-SVM.
Figure 3 (right) shows that as the number of noise features increases, the accuracy of $L_2$-SVM drops significantly (from to ). In contrast, for the three sparse SVMs, there is little change in accuracy. This reveals that SVMs can benefit from feature reduction.
Table 3 shows the average results over all data sets in the second experiment. On average, the $L_{1/2}$-SVM yields a much sparser solution than the $L_1$-SVM and a slightly better accuracy than the $L_0$-SVM.
4.2. UCI Data Sets
We further tested the reformulation and the proposed interior point method for the $L_{1/2}$-SVM on 10 UCI data sets . There are 8 binary classification problems and 2 multiclass problems (wine, image). Each feature of the input data was normalized to zero mean and unit variance, and the instances with missing values were deleted. Then, the data were randomly split into a training set () and a testing set (). For the two multiclass problems, a one-against-rest method was applied to construct a binary classifier for each class. We repeated the training and testing procedure 10 times, and the average results are shown in Tables 4 and 5 and Figure 4.
Tables 4 and 5 summarize the feature selection and classification performance of the numerical experiments on the UCI data sets, respectively. Here, $n$ is the number of samples and $m$ is the number of input features. For the two multiclass data sets, the numbers of classes are marked after their names. Sparsity is defined as , and a small value of sparsity is preferred. The data sets are arranged in descending order according to the dimension. The lowest cardinality and the best accuracy rate for each problem are shown in bold.
As shown in Tables 4 and 5, the three sparse SVMs encourage sparsity in all data sets while retaining roughly the same accuracy as the $L_2$-SVM. Among the three sparse methods, the $L_{1/2}$-SVM has the lowest cardinality (14.18) and the highest classification accuracy (86.52%) on average. The $L_1$-SVM has the worst feature selection performance, with the highest average cardinality (24.21), and the $L_0$-SVM has the lowest average classification accuracy (83.51%).
Figure 4 plots the sparsity (left) and classification accuracy (right) of each classifier on the UCI data sets. In three data sets (6, 9, 10), the $L_{1/2}$-SVM has the best performance in both feature selection and classification among the three sparse SVMs. Compared with $L_1$-SVM, $L_{1/2}$-SVM achieves a sparser solution in nine data sets. For example, in the data set “8 SPECTF,” the features selected by $L_{1/2}$-SVM are fewer than those of $L_1$-SVM, and at the same time the accuracy of $L_{1/2}$-SVM is higher than that of $L_1$-SVM. In most data sets (4, 5, 6, 9, 10), the cardinality of $L_{1/2}$-SVM drops significantly (at least ) with equal or slightly better accuracy. Only in the data set “3 Wine” is the accuracy of $L_{1/2}$-SVM decreased by , but the sparsity provided by $L_{1/2}$-SVM leads to improvement over $L_1$-SVM. In the remaining three data sets (1, 2, 7), the two methods have similar results in feature selection and classification. As seen above, the $L_{1/2}$-SVM can provide a lower-dimensional representation than the $L_1$-SVM with competitive prediction performance.
Figure 4 (right) shows that, compared with $L_0$-SVM, $L_{1/2}$-SVM has improved classification accuracy in all data sets. For instance, in four data sets (3, 4, 8, 10), the classification accuracy of $L_{1/2}$-SVM is at least higher than that of $L_0$-SVM. In particular, in the data set “10 Musk,” $L_{1/2}$-SVM gives a rise in accuracy over $L_0$-SVM. Meanwhile, it can be observed from Figure 4 (left) that $L_{1/2}$-SVM selects fewer features than $L_0$-SVM in five data sets (5, 6, 7, 9, 10). For example, in the data sets 6 and 9, the cardinalities of $L_{1/2}$-SVM are and less than those of $L_0$-SVM, respectively. In summary, $L_{1/2}$-SVM presents better classification performance than $L_0$-SVM, while it is effective in choosing relevant features.
5. Conclusions
In this paper, we proposed an $L_{1/2}$ regularization technique for simultaneous feature selection and classification in the SVM. We have reformulated the $L_{1/2}$-SVM into an equivalent smooth constrained optimization problem. The problem possesses a very simple structure, for which numerical methods are relatively easy to develop. By the use of this reformulation, we proposed an interior point method and established its convergence. Our numerical results supported the reformulation and the proposed method. The $L_{1/2}$-SVM can obtain a sparser solution than the $L_1$-SVM with comparable classification accuracy. Furthermore, the $L_{1/2}$-SVM can achieve more accurate classification results than the $L_0$-SVM (FSV).
Inspired by the good performance of the smooth optimization reformulation of the $L_{1/2}$-SVM, there are some interesting topics deserving further research. For example, developing more efficient algorithms for solving the reformulation, studying the nonlinear $L_{1/2}$-SVM, and exploring various applications of the $L_{1/2}$-SVM to further validate its effectiveness are interesting research topics. Some of them are under our current investigation.
A. Properties of the Reformulation to $L_{1/2}$-SVM
Let be the set of all feasible points of the problem (10). For any , we let and be the set of all feasible directions and linearized feasible directions of at . Since the constraint functions of (10) are all convex, we immediately have the following lemma.
Lemma A.1. For any feasible point of (10), one has
Based on the above Lemma, we can easily derive a first order necessary condition for (10). The Lagrangian function of (10) is where are the Lagrangian multipliers. By the use of Lemma A.1, we immediately have the following theorem about the first order necessary condition.
Theorem A.2. Let be a local solution of (10). Then there are Lagrangian multipliers such that the following KKT conditions hold:
The following theorem shows that the level set will be bounded.
Theorem A.3. For any given constant , the level set is bounded.
Proof. For any , we have Combining with , , and , , we have Moreover, for any , Consequently, are bounded. What is more, from the condition we have Thus if the feasible region is not empty, then we have Hence is also bounded. The proof is complete.
B. Proof of Lemma 3
Lemma 3 (see Section 3) shows that Algorithm 2 is well defined. We first introduce the following proposition to show that the matrix defined by (21) is always positive definite which ensures Algorithm 2 to be well defined.
Proof. By an elementary deduction, we have for any , , , with where , , , and .
B.1. Proof of Lemma 3
Proof. It is easy to see that if , which implies that (23) holds for all .
Suppose . We have Since matrix is positive definite and , the last equation implies Consequently, there exists a such that the fourth inequality in (23) is satisfied for all .
On the other hand, since is strictly feasible, the point will be feasible for all sufficiently small. The proof is complete.
C. Convergence Analysis
Proof. For the sake of convenience, we use , , to denote the constraint functions of the constrained problem (10).
We first show that are strictly feasible. Suppose on the contrary that there exists an infinite index subset and an index such that . By the definition of and being bounded from below in the feasible set, it must hold that .
However, the line search rule implies that the sequence is decreasing. So, we get a contradiction. Consequently, for any , is bounded away from zero. The boundedness of then follows from (24)–(27).
Lemma C.2. Let and be generated by Algorithm 2. If