Smooth Diagonal Weighted Newton Support Vector Machine
Based on the diagonal weighted support vector machine, a smooth model solved by the Newton algorithm is proposed, called SDWNSVM for short. SDWNSVM introduces the entropy function to approximate the plus function of the slack in the diagonal weighted SVM and thus differs from the traditional SSVM, which treats a reformulated problem. SDWNSVM uses the dual technique to rewrite the objective function through the connotative relation between the primal and dual programs, which yields an exact smooth program, whereas the traditional SSVM roughly substitutes the Lagrangian multipliers for the hyperplane weight. The equivalence between the obtained model and the original one is proved, and the Newton algorithm is used to compute the optimal solution. Numerical experiments on UCI data demonstrate that SDWNSVM achieves higher accuracies and requires fewer iterations than existing methods.
Based on statistical learning theory and dual programming, the support vector machine (SVM) was developed by Vapnik as an efficient method for small-sample data [1–3] and has been widely applied in classification and regression. When the sample points are completely separable, hard margin classifiers are used, which separate the positive class from the negative class with the maximal margin. When the sample points are nonseparable, soft margin classifiers are used, which tolerate misclassified points. Both 1-norm and 2-norm soft margin SVMs have been proposed; the former is called the box constrained SVM and the latter the diagonal weighted SVM.
By expressing the nonnegativity constraints on the SVM error in plus function form, Lee and Mangasarian proposed the smooth support vector machine (SSVM) [4, 5], which has strong mathematical properties, such as strong convexity and infinite differentiability. SSVM applies smoothing techniques to a reformulation of the 2-norm soft margin SVM that appends an additional bias term to the objective. This approach was originally proposed to induce strong convexity and was later used in [6–8]; it is equivalent to adding a constant feature to the training data and finding a separating hyperplane passing through the origin, which has been shown to have little or no effect on the solutions of the original program. Building on SSVM, many scholars have studied various aspects and presented diverse smooth models [9–20]. Some proposed a polynomial function based smooth approach [12–15], a quarter penalty function based smooth approach, or a rational function based smooth approach, and some generalized the method to regression or forecasting [11, 17]. Among these methods, the most widely used is the fourth polynomial function based smooth approach, called FPSSVM; however, its accuracy is limited. Wu and Wang proposed a piecewise-smooth support vector machine (PWSSVM) for classification, which uses a twice continuously differentiable piecewise polynomial function to approximate the plus function and obtains satisfactory results. A detailed study shows that the above smooth models have two drawbacks. First, the 2-norm of the weight of the separating hyperplane is replaced by that of the Lagrangian multiplier vector. Second, none of these methods proves the equivalence between the obtained program and the original program.
A natural question is whether an exact smooth model can be constructed for SVM and whether the equivalence between the obtained program and the original one can be proved. This paper investigates a smooth diagonal weighted Newton support vector machine (SDWNSVM), which applies the entropy function to the diagonal weighted support vector machine and proves the equivalence between the transformed program and the original one by making use of the connotative relation between the primal and dual programs. We begin in the linear space and obtain the unconstrained smooth model by directly adopting the entropy penalty function. In the kernel space, we use the dual technique to rewrite the objective. Once the smooth models are obtained, the Newton algorithm is used to compute the solution to SDWNSVM. Numerical experiments are carried out to demonstrate the effectiveness and superiority of the method.
This paper is organized as follows. Section 2 derives the exact smooth model in the linear space. Section 3 proposes two smooth models in the kernel space and illustrates the key transformations. Section 4 presents the Newton algorithm for solving SDWNSVM. Numerical experiments are given in Section 5 to demonstrate the performance. Conclusions are drawn in the final section.
2. Linear Smooth Model
Given is the training set , in which and . The training of standard diagonal weighted SVM equals the following program:
Here is the normal to the separating hyperplane, is the bias, and is the misclassified error. The linear separating hyperplane is , with as the input point.
Denote the training set in the matrix form by , and denote the labels by the diagonal matrix with ones or minus ones along its diagonal. Having these notations, we can rewrite (1) as
For , we define the plus function as and denote the entropy function with smooth parameter
Lemma 1. is a strictly convex function.
Proof. Computing the first order derivative of with respect to the variable gives the following equation, which guarantees that Lemma 1 holds:
Lemma 2. For any and holds, where is defined as in (3) with smooth parameter .
Proof. For , we have
For , is a monotonically increasing function, and
Thus holds, which completes the proof.
Corollary 3. The entropy penalty function converges to the plus function; that is, .
By figuring out the limit of as approaches infinity, it is easy to show that Corollary 3 holds.
Corollary 3 guarantees that the entropy function can be used to approximate the plus function. It should be pointed out that, if we generalize to , the element will still satisfy the corresponding results of Lemmas 1 and 2 and Corollary 3, and Lemma 2 becomes Lemma 4.
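To make the approximation in Corollary 3 concrete, the following minimal sketch compares a smoothing function with the plus function numerically. It assumes that the entropy function in (3) has the form used in SSVM by Lee and Mangasarian, p(x, α) = x + (1/α) log(1 + exp(−αx)); the source equation is not reproduced here, so this form and the names `plus` and `entropy_smooth` are assumptions made for illustration.

```python
import numpy as np

def plus(x):
    """Plus function: componentwise max(x, 0)."""
    return np.maximum(x, 0.0)

def entropy_smooth(x, alpha):
    """Assumed smoothing p(x, alpha) = x + log(1 + exp(-alpha*x)) / alpha.

    np.logaddexp(0, -alpha*x) evaluates log(1 + exp(-alpha*x)) without overflow.
    """
    x = np.asarray(x, dtype=float)
    return x + np.logaddexp(0.0, -alpha * x) / alpha

if __name__ == "__main__":
    xs = np.linspace(-3.0, 3.0, 601)
    for alpha in (1.0, 10.0, 100.0):
        gap = np.max(entropy_smooth(xs, alpha) - plus(xs))
        # Under the assumed form, the largest gap sits at x = 0 and equals log(2)/alpha.
        print(f"alpha={alpha:6.1f}  max gap={gap:.6f}  log(2)/alpha={np.log(2.0) / alpha:.6f}")
```

Under this assumed form the smoothing function lies above the plus function everywhere, and the gap shrinks in proportion to 1/α, which is the behaviour Corollary 3 describes.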
Lemma 4. For any vector with , the inequality holds, where is defined as in (3) with parameter .
Write the slack vector in the plus function form:
Then one converts the diagonal weighted SVM into the equivalent unconstrained formulation:
For any vector , the diagonal matrix is defined as follows with the elements of along the diagonal:
Now one figures out the gradient vector and the Hessian matrix . For the sake of simplicity, five new vectors are introduced; namely, , and :
Following from (10) and (11), one concludes that is convex and strictly convex. Theorem 5 guarantees the existence of global minima for convex functions.
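The objective, gradient, and Hessian of a linear smooth model can be sketched in a few lines. This is only an illustration under stated assumptions: the objective is taken to be 0.5·||w||² + 0.5·C·||p(e − D(Aw + eb), α)||² with the smoothing p(x, α) = x + (1/α) log(1 + exp(−αx)), and the names `smooth_parts`, `objective`, and `gradient_and_hessian` are hypothetical. The exact expressions (10) and (11) of the source are not reproduced here and may differ in notation or sign convention.

```python
import numpy as np

def smooth_parts(r, alpha):
    """Return p(r), p'(r), p''(r) for the assumed p(r) = r + log(1 + exp(-alpha*r)) / alpha."""
    p = r + np.logaddexp(0.0, -alpha * r) / alpha
    s = 0.5 * (1.0 + np.tanh(0.5 * alpha * r))  # sigmoid(alpha*r), overflow-safe
    return p, s, alpha * s * (1.0 - s)

def objective(z, A, y, C, alpha):
    """Assumed objective with z = (w, b): 0.5*||w||^2 + 0.5*C*||p(e - D(Aw + e*b))||^2."""
    w, b = z[:-1], z[-1]
    r = 1.0 - y * (A @ w + b)
    p, _, _ = smooth_parts(r, alpha)
    return 0.5 * w @ w + 0.5 * C * p @ p

def gradient_and_hessian(z, A, y, C, alpha):
    """Gradient and Hessian of `objective` with respect to z = (w, b)."""
    H = y[:, None] * np.hstack([A, np.ones((A.shape[0], 1))])  # rows of D [A e]
    r = 1.0 - H @ z
    p, s, ds = smooth_parts(r, alpha)
    reg = np.ones_like(z)
    reg[-1] = 0.0  # the bias b is not penalized in this assumed formulation
    grad = reg * z - C * H.T @ (p * s)
    hess = np.diag(reg) + C * H.T @ ((s**2 + p * ds)[:, None] * H)
    return grad, hess
```

Because p, p', and p'' are all nonnegative, every term added to the Hessian is positive semidefinite, which is consistent with the convexity stated above.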
Theorem 5. Suppose is a nonempty convex set and is a convex function; then the local minima of on are also global minima.
Proof. Denote by and the local and global minimum points of the function , respectively; then and the following inequality holds for any real number :
It is noted that ; for arbitrary there exists sufficiently small such that
Here is a small punctured neighborhood with as the center and as the radius.
Taking inequalities (12) and (13) into account, we directly derive the following inequality, which implies that is not a local minimum and thus contradicts the hypothesis: This completes the proof of Theorem 5.
Theorem 6. Let and . Define the two programs for the real valued functions and in as follows:
where is the smooth parameter; then
(1) there exist global minima of and , respectively;
(2) letting and be the global minima of and , the following upper bound inequality holds with :
Thus, converges to as approaches infinity.
Proof. Firstly, we prove the existence of local minima.
Define the level set of as for any real number .
Since dominates the plus function , the two level sets satisfy inequality , and the two level sets satisfy the following for : Hence, both and are compact subsets in .
It is easy to prove the convexity of and the strong convexity of . Theorem 5 guarantees that there, respectively, exist global minima.
Secondly, we establish the convergence.
Utilizing the first order optimality conditions of the convex function and the strictly convex function ,
Adding up the above two inequalities, we derive the following inequality: Define ; applying Lemma 2 to the right-hand side of , we obtain the following inequality, which establishes the convergence:
3. Kernel Smooth Model
In the kernel space, the following program is minimized, where is the nonlinear map:
3.1. Rough Kernel Smooth Model
Similar to SSVM, the 2-norm of the Lagrangian multiplier vector is used to replace that of the hyperplane weight. The matrix formulation of the kernel diagonal weighted SVM is as follows:
The unconstrained programs in the plus function form and in the smooth form for the kernel 2-norm soft margin SVM are, respectively, as follows:
Theorem 7. Let and . For the two programs and in (24) and (25),
(1) there exist global minima of and , respectively;
(2) letting and be the global minima of and , one has the following inequality with :
Thus, converges to as approaches infinity.
3.2. Exact Kernel Smooth Model
Now we will propose the exact smooth model in the kernel space.
Since the nonlinear map is unknown, the connotative conditions between the primal program and the dual one are exploited.
It is known that at the optimal solution
Using this relation, we have the following two equations, in which :
Substituting the above two equations into (22), we get the following program:
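The kind of substitution involved can be illustrated with a short worked derivation. It assumes the standard stationarity condition of the soft margin SVM, under which the weight is a combination of the mapped training points. The symbols used here ($w$, $\alpha$, $\varphi$, $K$, $D$, $m$) follow common SVM notation and are assumptions, since the source equations are not reproduced; the paper's own relations may be written differently.

```latex
\begin{align*}
w &= \sum_{i=1}^{m} \alpha_i y_i \varphi(x_i)
   && \text{(assumed stationarity condition at the optimum)},\\
\|w\|^2 &= \sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
         = \alpha^{T} D K D \alpha
   && \text{with } K_{ij} = \varphi(x_i)^{T}\varphi(x_j),\\
w^{T}\varphi(x_j) &= \sum_{i=1}^{m} \alpha_i y_i K(x_i, x_j)
         = (K D \alpha)_j .
\end{align*}
```

Both right-hand sides depend on the data only through the kernel matrix $K$, so the unknown map $\varphi$ no longer appears explicitly once these expressions replace $\|w\|^2$ and $w^{T}\varphi(x_j)$.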
Proof. Evidently, if and are the primal and dual optimal solutions, then is feasible for (31).
For feasible and , inequality (31) can be deduced from (29):
Thus, is the optimal solution to (31).
From the constraint in (31), we write the slack in the form of plus function as
Then we obtain the model in the plus function form and the model in the entropy function form in the kernel space:
Corollary 10. As approaches infinity, the unique solution to converges to the unique solution to ; that is, .
In this section, we first prove the existence of unique solutions to the smooth models (16), (25), and (35) and then propose the Newton algorithm to solve the smooth models of the diagonal weighted SVM.
Proof. We define in the linear space and in the kernel space.
Let ; then for any column vector , the inequalities and hold.
Using the results for in (11), (27), and (37), we know that holds, which guarantees that programs (16), (25), and (35) are convex and that each has a unique solution. This completes the proof.
Step 0 (initialization). Input training set , the initial point , and . Calculate the gradient vector and the Hessian matrix . Set .
Step 1 (termination). Having , stop if the gradient vector is zero, that is, ; else, go to step 2.
Step 2 (Newton direction). Determine the direction by setting the linearization of around equal to zero, which gives linear equations in variables:
Step 3 (Armijo stepsize). Find the stepsize along the search direction using the Armijo rule; that is, choose , such that for ,
Step 4 (update). Update the point according to
Compute the gradient vector . Let . Go to step 1.
Here we are solving a linear system of equations in SDWNSVM instead of solving a quadratic program, which is the case for classical SVM.
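Steps 0 to 4 can be sketched as a generic Newton iteration with Armijo backtracking. The sketch below is not the authors' MATLAB code: the function name `newton_armijo`, the tolerance `tol`, the Armijo constant `delta`, and the halving rule for the stepsize are assumptions, and the exact zero-gradient test of Step 1 is replaced by a small numerical tolerance.

```python
import numpy as np

def newton_armijo(f, grad_hess, z0, tol=1e-8, delta=0.25, max_iter=50):
    """Minimize a smooth convex function f by Newton steps with Armijo backtracking.

    f(z) returns the objective value; grad_hess(z) returns (gradient, Hessian).
    Returns the final point and the number of Newton iterations performed.
    """
    z = np.asarray(z0, dtype=float).copy()           # Step 0: initial point
    for k in range(max_iter):
        g, H = grad_hess(z)
        if np.linalg.norm(g) <= tol:                  # Step 1: termination test
            return z, k
        d = np.linalg.solve(H, -g)                    # Step 2: Newton direction from H d = -g
        lam, fz, slope = 1.0, f(z), g @ d
        # Step 3: halve the stepsize until the Armijo sufficient-decrease test holds.
        while f(z + lam * d) > fz + delta * lam * slope and lam > 1e-12:
            lam *= 0.5
        z = z + lam * d                               # Step 4: update the point
    return z, max_iter

if __name__ == "__main__":
    # Toy strictly convex objective sum(exp(z) - z), whose minimizer is z = 0.
    f = lambda z: np.sum(np.exp(z) - z)
    gh = lambda z: (np.exp(z) - 1.0, np.diag(np.exp(z)))
    z_star, iters = newton_armijo(f, gh, [2.0, -1.0])
    print(z_star, iters)
```

For SDWNSVM, `f` and `grad_hess` would be the smooth objective and its gradient and Hessian; each iteration then costs one linear solve, which is the point made in the text.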
Comments. Although the presented SDWNSVM looks similar to SSVM, they are different in nature. Their main differences can be summarized in two points.
(1) The unconstrained programs have different convexities in SDWNSVM and SSVM. The program is convex in SDWNSVM, while it is strictly convex in SSVM.
(2) The distances between and have different upper bound inequalities in SDWNSVM and SSVM. By the first order optimality condition of the strictly convex function, SSVM has the upper bound inequality
Obviously, with regard to the distance between and , the upper bound of the rough kernel smooth model in SDWNSVM is twice that in SSVM, while the upper bound of the exact kernel smooth model in SDWNSVM is entirely different from that in SSVM, as illustrated in (38).
5. Experiments and Comparisons
We now demonstrate the effectiveness and speed of SDWNSVM on several UCI datasets. All experiments are carried out on a PC with a 3.06 GHz P4 CPU and 1 GB of memory. The programs are written in MATLAB 7.01. The parameters are chosen for optimal performance on a tuning set by ten-fold cross validation.
5.1. Parameter Setting
In the linear space, only the penalty needs to be specified. In the kernel space, the penalty and the kernel width need to be specified with respect to the radial basis kernel function .
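A radial basis kernel matrix can be computed as in the following sketch. The parameterization exp(−||x − z||² / (2σ²)) with `sigma` as the kernel width is an assumption, since the source formula is not reproduced here; other conventions scale the squared distance differently. The function name `rbf_kernel` is hypothetical.

```python
import numpy as np

def rbf_kernel(X, Z, sigma):
    """K[i, j] = exp(-||X[i] - Z[j]||^2 / (2 * sigma^2)) (assumed parameterization)."""
    # Squared distances via ||x||^2 + ||z||^2 - 2 x.z, clipped at 0 against round-off.
    sq = (X**2).sum(axis=1)[:, None] + (Z**2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

if __name__ == "__main__":
    X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
    print(rbf_kernel(X, X, sigma=1.0))
```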
The detailed procedure of ten-fold cross validation consists of the following three steps.
(1) Setting the Parameter Ranges. Specify the penalty , and specify as the kernel width.
(2) Estimating the Performance. Select or from the parameter ranges, run ten-fold cross validation to train a classifier, and test its performance. First, divide the training set into ten subsets; second, select nine subsets as the training set to construct a classifier; finally, take the remaining subset as the tuning set to test the generalization performance of the obtained classifier.
(3) Selecting the Optimal Parameters. Select as optimal the parameters that correspond to the highest ten-fold testing accuracy.
Obviously, the training and testing procedures need to be repeated ten times. The ten-fold training accuracy is defined as the averaged accuracy obtained on the nine training subsets, while the ten-fold testing accuracy is defined as the averaged accuracy obtained on the tuning set.
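The fold rotation described above can be sketched as follows. The helper `k_fold_accuracy` and the nearest-centroid classifier are hypothetical stand-ins used only to keep the example self-contained; in the experiments the classifier would be SSVM or SDWNSVM, and the random shuffling before splitting is an assumption.

```python
import numpy as np

def k_fold_accuracy(X, y, fit, predict, k=10, seed=0):
    """Return the averaged training and testing accuracies over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)   # k disjoint index subsets
    train_acc, test_acc = [], []
    for i in range(k):
        test_idx = folds[i]                               # the held-out tuning subset
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        train_acc.append(np.mean(predict(model, X[train_idx]) == y[train_idx]))
        test_acc.append(np.mean(predict(model, X[test_idx]) == y[test_idx]))
    return float(np.mean(train_acc)), float(np.mean(test_acc))

def fit_centroids(X, y):
    """Placeholder classifier: store the mean of each class."""
    return X[y == 1].mean(axis=0), X[y == -1].mean(axis=0)

def predict_centroids(model, X):
    """Assign each point to the class with the nearer centroid."""
    pos, neg = model
    closer_to_pos = np.linalg.norm(X - pos, axis=1) <= np.linalg.norm(X - neg, axis=1)
    return np.where(closer_to_pos, 1, -1)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(2.0, 1.0, size=(50, 2)), rng.normal(-2.0, 1.0, size=(50, 2))])
    y = np.concatenate([np.ones(50, dtype=int), -np.ones(50, dtype=int)])
    print(k_fold_accuracy(X, y, fit_centroids, predict_centroids))
```

Parameter selection then amounts to calling this routine once per candidate parameter value and keeping the value with the highest averaged testing accuracy.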
We carry out the experiments on the ionosphere data with in the linear space and illustrate performances of the ten-fold cross validation for SSVM and SDWNSVM.
Data in Table 1 lead to the following conclusions.
(1) SSVM and SDWNSVM both have satisfactory training and testing accuracies. The former has training and testing accuracies of about 93.86% and 89.48%, while the latter has training and testing accuracies of about 94.21% and 89.19%.
(2) SSVM and SDWNSVM both have low iteration numbers and training times. The former has an iteration number of about 7.6 and a training time of about 0.14, while the latter has an iteration number of about 5 and a training time of about 0.12.
(3) SSVM and SDWNSVM both have small variances. SSVM has training and testing accuracy variances of about 0.0062 and 0.2844, while SDWNSVM has training and testing accuracy variances of about 0.0044 and 0.2803. SSVM has an iteration number variance of about 0.2667 and a training time variance of about 0.0032, while SDWNSVM has an iteration number variance of 0 and a training time variance of about 0.0031.
(4) The differences are not statistically significant in the ten-fold cross validation.
The differences among the ten folds are listed in Table 2.
Data in Table 2 bring about the following observations, which demonstrate that the differences among the ten folds are trivial.
(1) The differences of the training and testing accuracies are mostly not larger than 0.03.
(2) The differences of the iterations are mostly about 1 in SSVM and 0 in SDWNSVM.
(3) The differences of the training time are mostly not larger than 0.03.
Tables 1 and 2 demonstrate that the ten-fold cross validation yields satisfactory training and testing accuracies; moreover, the differences among the ten folds are trivial and not statistically significant.
5.2. Performances in the Linear and Kernel Space
The experiment is carried out in the linear space on several datasets. SDWNSVM is compared with RLP, SVM||1||, SVM||2||, and FSV in terms of ten-fold training and testing accuracies. The results are illustrated in Table 3 with the penalty parameter specified by .
Obviously, SDWNSVM has the highest ten-fold training and testing correctness.
SDWNSVM is also compared with SSVM, SOR (successive over relaxation), and on the large-scale adult data in Figure 1, whose training sizes vary from 1605 to 32562. The linear kernel is used and the penalty parameter is selected from by the ten-fold cross validation.
Evidently, SDWNSVM has the highest ten-fold training accuracies.
SDWNSVM is compared with SSVM on the Bupa Liver and Pima Indians data. The former data set is composed of 241 “train” examples and 104 “test” examples, with five attributes for each sample. The latter, the Pima Indians diabetes data, is composed of 500 positive examples and 268 negative examples, with eight attributes for each example. The results are illustrated in Table 4 with parameters selected by ten-fold cross validation with and .
The following conclusions are directly drawn from Table 4.
(1) The iterations of SDWNSVM are identical to those of SSVM.
(2) The training time of SDWNSVM is the same as that of SSVM in the linear space but a little higher than that of SSVM in the kernel space.
(3) The training and testing accuracies of SDWNSVM are higher than those of SSVM. In the linear space, SDWNSVM improves the training and testing accuracies by about 1.42% and 2.11%. In the kernel space, SDWNSVM improves the testing accuracies by about 0.96%.
To further demonstrate the superiority of SDWNSVM, we compare it with the earlier SSVM algorithm as well as with two frequently used algorithms that perform well, namely, PWSSVM and FPSSVM. The training time and the training and testing accuracies are selected as the assessment indexes.
The “tried and true” checkerboard data set is used, which is generated by uniformly discretizing the regions to points and labeling two classes “White” and “Black.” For the sake of comparison, the same parameters are set for all algorithms. The Gaussian kernel is used; the kernel width is set to and the penalty is set to . The training set is randomly sampled with increasing sizes, while the remaining points are used as the testing set. The results averaged over ten random samplings are presented in Table 5.
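A checkerboard data set of this kind can be generated as in the sketch below. The region, grid resolution, and number of cells are not recoverable from the text, so the unit square, a 200 by 200 grid, and a 4 by 4 board are assumptions chosen purely for illustration, as are the function names `checkerboard` and `random_split`.

```python
import numpy as np

def checkerboard(n_per_side=200, cells=4):
    """Uniform grid on the unit square, labeled +1 / -1 in a checkerboard pattern."""
    t = (np.arange(n_per_side) + 0.5) / n_per_side        # cell-centered grid coordinates
    gx, gy = np.meshgrid(t, t)
    X = np.column_stack([gx.ravel(), gy.ravel()])
    col = np.floor(X[:, 0] * cells).astype(int)
    row = np.floor(X[:, 1] * cells).astype(int)
    y = np.where((row + col) % 2 == 0, 1, -1)              # alternate the two classes
    return X, y

def random_split(X, y, n_train, seed=0):
    """Sample n_train points for training; the remaining points form the testing set."""
    idx = np.random.default_rng(seed).permutation(len(y))
    tr, te = idx[:n_train], idx[n_train:]
    return X[tr], y[tr], X[te], y[te]

if __name__ == "__main__":
    X, y = checkerboard()
    Xtr, ytr, Xte, yte = random_split(X, y, n_train=1000)
    print(X.shape, Xtr.shape, Xte.shape)
```

Repeating `random_split` with ten different seeds and averaging the resulting accuracies mirrors the ten random samplings described in the text.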
It is obvious that SDWNSVM has the highest training and testing accuracies with the lowest training time among the four different algorithms.
The final experiment is designed to show the accuracies and computational advantages on datasets with more features.
The NDC data generator is used to produce data of various dimensions and scales. We randomly choose 20% of the data as the training set and use the rest as the testing set. The Gaussian kernel is selected; the performances averaged over ten random samplings are illustrated in Table 6 for the four algorithms under the same parameters with the kernel width and the penalty .
From the data in Table 6, we draw the following conclusions.
(1) The training and testing accuracies increase with the training scale but decrease to a certain degree with the dimension for the four algorithms.
(2) For data of various scales and dimensions, SDWNSVM has the highest training and testing accuracies and the lowest training time among the four algorithms. Take the 200-dimensional data of size 10000, for example: SDWNSVM has a training accuracy about 1.14%, 1.18%, and 0.48% higher than SSVM, FPSSVM, and PWSSVM, respectively, and a testing accuracy about 0.4%, 0.54%, and 0.17% higher than SSVM, FPSSVM, and PWSSVM, respectively. Its training time is about 20.44, 58.24, and 8.05 seconds lower than that of SSVM, FPSSVM, and PWSSVM, respectively.
This paper deals with the diagonal weighted SVM and introduces a smoothing technique to obtain an efficient training algorithm, SDWNSVM. By using the connotative relation found through the dual technique, SDWNSVM can be easily extended to the kernel space and achieves higher training accuracies; moreover, the equivalence between the new smooth model and the original one is proved. By using the Newton algorithm to compute the optimal solution, SDWNSVM has a lower training time than existing methods. Numerical experiments on several UCI datasets demonstrate its effectiveness and superiority. Future work includes finding other smooth penalty functions and developing new efficient algorithms to solve the unconstrained smooth model.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
The work is supported by the Natural Science Foundation of Shaanxi Educational Commission (2010JK773), the Research Fund of the Doctoral Program of Xi’an Shiyou University (YS29030903), and the Natural Science Foundation of China (no. 60974082).
V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 2000.
C. J. C. Burges, “A tutorial on support vector machines for pattern recognition,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, 1998.
N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, 2000.
O. L. Mangasarian and D. R. Musicant, “Lagrangian support vector machines,” Journal of Machine Learning Research, vol. 1, no. 3, pp. 161–177, 2001.
T. T. Frieß, N. Cristianini, and C. Campbell, “The kernel adatron algorithm: a fast and simple learning procedure for support vector machines,” in Proceedings of the 15th International Conference on Machine Learning, pp. 188–196, Madison, Wis, USA, 1998.
J. D. Shen, “A new smooth support vector machine based on a rational function,” in Proceedings of the International Conference on Information Technology and Management Innovation, pp. 2199–2202, 2012.
Y. Yuan, “Forecasting the movement direction of exchange rate with polynomial smooth support vector machine,” Mathematical and Computer Modelling, vol. 57, no. 3-4, pp. 932–944, 2013.
Y. Yuan, W. Fan, and D. Pu, “Spline function smooth support vector machine for classification,” Journal of Industrial and Management Optimization, vol. 3, no. 3, pp. 529–542, 2007.
Y.-B. Yuan, J. Yan, and C.-X. Xu, “Polynomial smooth support vector machine (PSSVM),” Chinese Journal of Computers, vol. 28, no. 1, pp. 9–17, 2005.
Y. Yuan and T. Huang, “A polynomial smooth support vector machine for classification,” in Proceedings of the 1st International Conference on Advanced Data Mining and Applications, pp. 157–164, Wuhan, China, 2005.
Liang and D. Wu, “A new smooth support vector machine,” in Proceedings of Artificial Intelligence and Computational Intelligence, pp. 266–272, 2010.
C. Qin and S. Liu, “Fuzzy smooth support vector machine with different smooth functions,” Journal of Systems Engineering and Electronics, vol. 23, no. 3, pp. 460–466, 2012.