Research Article  Open Access
Single Directional SMO Algorithm for Least Squares Support Vector Machines
Abstract
Working set selection is a major step in decomposition methods for training least squares support vector machines (LSSVMs). In this paper, a new technique for working set selection in sequential minimal optimization (SMO) type decomposition methods is proposed. With the new method, a single direction is selected to drive the optimality condition to convergence. A simple asymptotic convergence proof for the new algorithm is given. Experimental comparisons demonstrate that the classification accuracy of the new method differs little from that of existing methods, while its training speed is faster.
1. Introduction
In a classification problem, we consider a set of training samples, that is, the input vectors along with corresponding class labels . Our task is to find a deterministic function that best represents the relation between input vectors and class labels. For classification or forecasting problems in machine learning, the support vector machine (SVM) has been adopted in many applications because of its high precision [1–4]. SVMs require the solution of a quadratic programming problem. Another successful method for machine learning is the least squares support vector machine (LSSVM) [5]. Instead of solving a quadratic programming problem as in SVMs, LSSVMs obtain the solution from a set of linear equations. Many algorithms have been proposed for training LSSVMs: Suykens et al. proposed an iterative algorithm based on conjugate gradient (CG) algorithms [6]; Ferreira et al. presented a gradient system that can train the LSSVM model effectively [7]; Chua introduced efficient computations for large least square support vector machine classifiers [8]; Chu et al. improved the efficiency of the CG algorithm by using one reduced system of linear equations [9]; Keerthi and Shevade extended the sequential minimal optimization (SMO) algorithms to solve the linear equations in LSSVMs, where the maximum violating pair (MVP) was selected as the working set [10]; based on the idea of the SMO algorithm, Bo et al. presented an improved method for working set selection by using functional gain (FG) [11]; Jian et al. designed a multiple kernel learning algorithm for LSSVMs by convex programming [12]; and so on. These numerical algorithms are computationally attractive. Empirical comparisons show that the SMO algorithm is more efficient than the CG algorithm on large-scale datasets.
Fast SVM training with the SMO algorithm is an important goal for practitioners, and many proposals toward it have been made in the literature. Initially, Platt presented two heuristics that resulted in a somewhat cumbersome selection [13]. Later, Keerthi et al. introduced the concept of a violating pair to denote two coefficients that cause a violation of the KKT optimality conditions of the dual, and suggested always selecting the pair that violates them the most, that is, the maximum violating pair (MVP) [14]. Finally, Fan et al. proposed a second order selection that usually results in faster training than the MVP rule [15]. These improvements decrease the computational expense of the SMO algorithm, but sequential minimal optimization still repeatedly selects certain concrete updating patterns, which are called training cycles. Barbero et al. studied their presence from a geometrical point of view [16]. They pointed out that training cycles can be partially collapsed into a single updating vector that gives better directions. Exploiting training cycles in this way can reduce the number of iterations and kernel operations of the SMO algorithm.
Inspired by Barbero et al. [16], we present a single directional SMO algorithm for LSSVMs, abbreviated as the SDSMO algorithm. In the optimization procedure, an adaptive objective function is selected and single directional steps are taken for the Lagrangian multipliers, which lessens the number of training cycles and further reduces the iterations and kernel operations of the SMO algorithm. Experiments show that the SDSMO algorithm significantly reduces the training time for LSSVMs, while its testing accuracy differs little from that of the traditional SMO algorithm.
The rest of this paper has the following structure. In the next section, LSSVMs are briefly reviewed. In Section 3, SDSMO algorithm for LSSVMs is provided and the convergence of the improved algorithm is proved theoretically. Based on standard datasets, computational experiments describing the effectiveness of the improved algorithm are presented in Section 4. Finally, Section 5 is devoted to concluding remarks.
2. LSSVM
In this section, we concisely review the basic principles of LSSVMs. Given a training dataset of points with input data
In primal weight space, a linear classifier in the new space takes the following form:
The weight vector may be infinite dimensional; hence, using (1) to find the solutions is impossible in general. To solve this problem, we compute the model in the dual space instead of the primal space. Let , and the simple problem without a bias term is considered in this paper as in the paper by Keerthi and Shevade [10]. The Lagrangian for the simple problem is where are Lagrangian multipliers and are called support values. The Karush-Kuhn-Tucker (KKT) conditions for optimality are
After elimination of and , we could obtain the following linear system: where , , and is the kernel matrix. By solving the linear system (6), are obtained; hence, LSSVM greatly simplifies the problem. The resulting LSSVM model for function estimation is
For the choice of the kernel function , there are several possibilities: (linear LSSVM); (polynomial LSSVM of degree ); (RBF LSSVM); (MLP LSSVM). In the sequel, we focus on the RBF LSSVM. Large linear systems such as (6) should be solved by iterative methods, as introduced by Jiao et al. [17]. The speed of convergence depends on the condition number of the matrix in (6), which is influenced by the choice of in the case of the RBF LSSVM. In the following section, we discuss the SMO variants and give the convergence proof for the SDSMO algorithm.
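The four kernel families listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the default parameter values, and the exact parameterizations (in particular the RBF form exp(-||x - z||^2 / sigma^2), which some authors write with 2*sigma^2 instead) are choices made here, not formulas recovered from this paper.

```python
import math


def dot(x, z):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x, z):
    """Linear kernel: x^T z."""
    return dot(x, z)


def polynomial_kernel(x, z, degree=3, offset=1.0):
    """Polynomial kernel of the given degree: (x^T z + offset)^degree (assumed form)."""
    return (dot(x, z) + offset) ** degree


def rbf_kernel(x, z, sigma=1.0):
    """RBF (Gaussian) kernel: exp(-||x - z||^2 / sigma^2) (assumed parameterization)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / sigma ** 2)


def mlp_kernel(x, z, kappa=1.0, theta=0.0):
    """MLP (sigmoid) kernel: tanh(kappa * x^T z + theta) (assumed form)."""
    return math.tanh(kappa * dot(x, z) + theta)


def kernel_matrix(points, kernel):
    """Dense kernel matrix K[i][j] = kernel(points[i], points[j])."""
    return [[kernel(p, q) for q in points] for p in points]
```

For the RBF kernel, a small width makes the kernel matrix close to the identity and a large width makes it close to a matrix of ones, which is one way to see why the width affects the conditioning of the linear system.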
3. SMO and SDSMO Algorithms for LSSVM
For solving the LSSVM problem, the matrix in (6) is usually fully dense and may be too large to be stored. Decomposition methods are designed to handle these difficulties; see Jiao et al. [17]. Unlike other optimization algorithms, which update the whole Lagrangian multipliers vector in each iteration, a decomposition algorithm modifies only a subset of per iteration. We denote the subset as the working set . The SMO algorithm was developed in [10] as a decomposition method to solve the dual problems arising in LSSVM formulations. In each iteration, the SMO algorithm restricts the working set to only two elements. Because problem (4) has no bias term , SMO can be simplified to optimize only one element per iteration. By substituting the KKT conditions (5) into the Lagrangian (4), the dual problem is to maximize the following objective function: where , and if and otherwise.
The SMO algorithm for (8) is sketched in the following.
Algorithm 1. SMO algorithm for (8) is as follows.
(1) Set and find as the initial feasible solution.
(2) If the stop criterion is satisfied, stop. If not, find a one-element working set . Define and and to be subvectors of corresponding to and , respectively.
(3) Solve the following subproblem with the variable : where is a permutation of the matrix .
(4) Set to be the optimal solution of (9) and . Set and go back to step .
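With a one-element working set, the subproblem in step (3) is a one-dimensional concave quadratic and has a closed-form solution. The derivation below is a sketch under assumed notation: $W$ denotes a bias-free dual objective of the standard form, $\tilde{Q}$ a symmetric positive-definite matrix built from the labels and the regularized kernel, $F_i$ the partial derivative of $W$ with respect to $\alpha_i$, and $e_i$ the $i$-th unit vector. These symbols are chosen here for illustration and need not match the paper's equations (8) and (9).

```latex
\begin{aligned}
W(\alpha) &= \sum_{i} \alpha_i - \tfrac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j \tilde{Q}_{ij},
\qquad
F_i = \frac{\partial W}{\partial \alpha_i} = 1 - \sum_{j} \tilde{Q}_{ij} \alpha_j, \\
W(\alpha + t e_i) &= W(\alpha) + t F_i - \tfrac{1}{2} t^{2} \tilde{Q}_{ii}, \\
t^{\ast} &= \frac{F_i}{\tilde{Q}_{ii}},
\qquad
W(\alpha + t^{\ast} e_i) - W(\alpha) = \frac{F_i^{2}}{2 \tilde{Q}_{ii}} \ge 0 .
\end{aligned}
```

Under these assumptions every update increases the objective unless $F_i = 0$, and after the update the $i$-th partial derivative is exactly zero while all other $F_j$ change by $-t^{\ast} \tilde{Q}_{ji}$.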
In order to find the working set , we usually check whether the KKT conditions are violated. The KKT conditions for the dual problem (8) are , which lead to , . If we define then the KKT optimality condition is violated if there exists any index such that . The SMO algorithm for (8) converges when , for all .
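The one-element scheme and its KKT-based stopping rule can be sketched as greedy coordinate ascent. This is a minimal illustration, not the authors' MATLAB implementation: it assumes a bias-free dual of the form W(alpha) = sum_i alpha_i - 1/2 * alpha^T Q alpha with Q[i][j] = y_i * y_j * K(x_i, x_j) + delta_ij / gamma, uses the gradient F = 1 - Q alpha as the violation measure, and picks the index with the largest absolute violation. The names `train_one_element_smo`, `predict`, `gamma`, `sigma`, and `tol` are hypothetical.

```python
import math


def rbf(x, z, sigma=1.0):
    """Gaussian kernel exp(-||x - z||^2 / sigma^2) (assumed parameterization)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / sigma ** 2)


def train_one_element_smo(X, y, gamma=10.0, sigma=1.0, tol=1e-6, max_iter=100000):
    """Greedy one-coordinate ascent on the assumed bias-free LSSVM dual."""
    n = len(X)
    # Assumed matrix: label-signed kernel plus 1/gamma on the diagonal.
    Q = [[y[i] * y[j] * rbf(X[i], X[j], sigma) + (1.0 / gamma if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    F = [1.0] * n  # gradient 1 - Q alpha evaluated at alpha = 0
    iterations = 0
    while iterations < max_iter:
        # Working set: the single index with the largest KKT violation.
        i = max(range(n), key=lambda k: abs(F[k]))
        if abs(F[i]) <= tol:
            break  # all violations are within the tolerance
        t = F[i] / Q[i][i]  # exact maximizer along coordinate i
        alpha[i] += t
        for j in range(n):  # incremental gradient update
            F[j] -= t * Q[j][i]
        iterations += 1
    return alpha, F, iterations


def predict(x, X, y, alpha, sigma=1.0):
    """Sign of the bias-free decision function sum_i alpha_i * y_i * K(x, x_i)."""
    score = sum(a * yi * rbf(x, xi, sigma) for a, yi, xi in zip(alpha, y, X))
    return 1 if score >= 0 else -1
```

A usage example on a four-point XOR-like set: `alpha, F, its = train_one_element_smo([[0, 0], [1, 1], [0, 1], [1, 0]], [1, 1, -1, -1])` followed by `predict([0, 0], X, y, alpha)`. Only one row of Q is needed per iteration, which is what makes decomposition attractive when the full matrix cannot be stored; the dense Q above is built only to keep the sketch short.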
A simple illustration of this is shown in Figure 1.
Since only one component is updated per iteration, the decomposition method can be quite costly and suffers from slow convergence. For this reason, many researchers have improved the SMO algorithm. For example, Chen et al. improved it by using shrinking and caching techniques [18]; Barbero et al. presented a cycle-breaking acceleration of SVM training [16]; and Lin et al. provided three-parameter sequential minimal optimization for support vector machines [19].
As mentioned by Barbero et al. in [16], the SMO algorithm is not free of cycle-related problems. For all in working set , if is optimized with step ( or ) in a single direction per iteration, the number of cycles in the SDSMO algorithm will be reduced. We now detail the SDSMO formulation in the LSSVM training process.
Define
Then, the KKT optimality condition is violated if there exists any index point such that .
The SDSMO algorithm works by optimizing only one at each iteration and keeping the others fixed, that is, is adjusted by a sign-invariant step ( or ) per iteration as follows:
The update of causes the change of all the as and; therefore, the function value of will change. At each iteration we must ensure that the sign of does not change, that is, if (or ) 0, then ( or ) 0. As increases, (or ) while the sign remains unchanged.
A simple illustration of this is shown in Figure 2.
To derive the optimal step and the termination conditions of iteration, we define as
Because as , . Therefore, let and it can be written as
The optimal step is obtained by maximizing as and the optimal step can induce the change of as
Hence we can choose an index which has the maximum value of and update by (12) and (16). Suppose and , then is a decreasing sequence. In fact, as , . Therefore can be used as a termination criterion for the iterative algorithm as where is a positive constant. The SDSMO algorithm is summarized in Algorithm 2.
Algorithm 2. SDSMO algorithm for (8) is as follows.
(1) Set and choose such that (or ) for all .
(2) If satisfies (18), stop. If not, select
(3) Update using and (12).
(4) While (), , go back to step .
One theoretical property of SDSMO algorithm is presented in the following.
Theorem 3. The sequence generated by SDSMO algorithm converges to the global optimal solution of (8).
Proof. According to the definition of and combining (16) and (17), the following equation holds:
The positive-definite kernel function implies , furthermore , and the following equation is obtained:
Equality (20) yields that is a decreasing sequence. Together with , we have that converges. Applying (20) again, we get that converges to as .
Since () is a positive-definite quadratic form, is a positive-definite quadratic form too. Therefore, the set is a compact set. lies in this set, so it is a bounded sequence. Let be the limit point of any convergent subsequence , . For all , . According to the definition of , . Inequality (18) yields ; furthermore, for all , . While , so , . From the KKT conditions, is the global optimal solution of (8). Since is strictly convex, (8) has a unique global solution and we denote it as . Assume that does not converge to . Then, for all , there exists an infinite subset such that for all , . Because , for all is a compact set, there is a convergent subsequence. Without loss of generality, we assume its limit to be . Thus, . Since is the global optimal solution of (8), this contradicts that is the unique global optimal solution. This completes the proof of Theorem 3.
4. Numerical Experiments
In this section, we conduct experiments under the framework of Algorithm 2 to check whether SDSMO is really faster than SMO. There are two techniques for working set selection in SMO-type decomposition methods for LSSVM classifiers [20]: the first order SMO (FOSMO) algorithm, which uses first order information to achieve fast convergence, and the second order SMO (SOSMO) algorithm, which uses second order information. Two groups of experiments were carried out to compare SDSMO with these two algorithms. All methods are implemented in MATLAB and executed on a personal computer with an Intel(R) Core(TM) i3 2.53 GHz processor, 2.00 GB of memory, and the Windows 7 operating system. For all algorithms, the optimization process is terminated when the maximal violation of the KKT conditions is within . For simplicity, we consider only the Gaussian kernel to construct the LSSVM.
4.1. The Comparison of SDSMO with First Order SMO
In this section, we compare SDSMO with first order SMO on four benchmark datasets to evaluate the performance of the proposed method. We compare the two methods in terms of computational cost, which is measured by the number of iterations. The examples introduced by Keerthi and Shevade [10] are used. The datasets used for this purpose are Banana, Image, Waveform, and Splice. For each dataset, the value of is determined by fivefold cross validation on a small random subset.
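Fivefold cross validation for choosing a hyperparameter can be sketched as follows. This is a generic illustration, not the authors' procedure: the helper names `k_fold_indices` and `select_by_cross_validation` are hypothetical, and `score_fn` stands for any user-supplied routine that trains on the training indices with the candidate value and returns the accuracy on the held-out indices.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle 0..n_samples-1 and split the indices into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def select_by_cross_validation(n_samples, candidates, score_fn, k=5, seed=0):
    """Return the candidate value with the best mean held-out score.

    score_fn(value, train_idx, test_idx) is a hypothetical callback that
    trains with `value` on train_idx and returns a score on test_idx.
    """
    folds = k_fold_indices(n_samples, k, seed)
    best_value, best_score = None, float("-inf")
    for value in candidates:
        scores = []
        for held_out in folds:
            held = set(held_out)
            train_idx = [i for i in range(n_samples) if i not in held]
            scores.append(score_fn(value, train_idx, held_out))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_value, best_score = value, mean_score
    return best_value, best_score
```

Running the selection on a small random subset, as described above, keeps the cost of the k training runs per candidate low; the chosen value is then reused when training on the full dataset.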
In the first experiment, we vary over a small range because the extremely small and large values are usually of little interest. We try the following nine values: , . In Table 1, the computational costs associated with the four datasets as functions of are given when the optimization process is terminated.
 
Note: each unit corresponds to iterations. 
As a basis for the comparisons, Table 1 shows the computational costs of the first order SMO and SDSMO algorithms at different values of parameter . For the first order SMO algorithm, the computational cost increases with increasing , whereas for the SDSMO algorithm it does not; see, for instance, the computational cost of SDSMO for the Banana and Waveform datasets. From Table 1, we can see that the number of iterations of the SDSMO algorithm is much smaller than that of first order SMO, especially for the Image dataset.
In order to further show the performance of SDSMO algorithm, Tables 2 and 3 are given. The tables report the training time and the generalization performance of first order SMO and SDSMO algorithms for four benchmark datasets. The generalization performance is illustrated by the classification accuracy of an independent test set for each dataset.


From Tables 2 and 3, we can see that the generalization capabilities of both methods are comparable, but the training time of the SDSMO algorithm is shorter than that of the first order SMO algorithm. For instance, for the Image dataset, the training time of the first order SMO algorithm with the best generalization performance is 41.6108 s, which is equivalent to ten times the cost of the SDSMO algorithm. The classification accuracy for the Image dataset with the SDSMO algorithm is 0.963, almost equal to that of the first order SMO algorithm. Consequently, the proposed SDSMO algorithm is superior to first order SMO in efficacy and feasibility for LSSVMs.
4.2. The Comparison of SDSMO with Second Order SMO
To further explore the performance of the proposed method, we compare SDSMO with second order SMO in a second set of experiments on the datasets Titanic, Heart, Breast Cancer, Thyroid, and Pima (available in [21]). We use the datasets provided in [21] to verify the good generalization properties of the proposed method. Table 4 reports the number of iterations and the execution time per experiment, together with the misclassification rates.

It can be seen that, among these datasets, SDSMO is the better choice for Cancer, Pima, and Titanic. The results in Table 4 show that the biggest improvement with SDSMO happens for Titanic. This is further evidence for the previous observation that SDSMO outperforms second order SMO for large-scale problems.
The final set of experiments aims to ascertain how well the SMO algorithm scales to large-scale datasets under the different working set selections. To test this, we use the datasets a8a and covtype.binary, available with several increasing numbers of patterns in [22].
In Figure 3, we plot the results for a8a with , and covtype.binary with , , respectively. As can be seen, the number of iterations scales linearly with the training set size. Note that SDSMO needs fewer iterations to converge, as expected. The reduction is greater for covtype.binary because of its larger value of . The scaling is linear in both cases.
Figure 3: Number of iterations versus training set size for (a) a8a and (b) covtype.binary.
5. Conclusion
In this paper, a new algorithm, SDSMO, is proposed. It can be used to select the working set for LSSVM classifier training, and its asymptotic convergence is proved theoretically. Based on the SMO formulation, the path of one-sided convergence is used effectively in our method. The SDSMO algorithm requires fewer iterations and kernel operations than the traditional SMO algorithm, so it converges faster. Simulation experiments have been carried out on four benchmark datasets. The empirical comparisons demonstrate that the SDSMO algorithm is much more efficient in terms of computational time than first order and second order SMO, while showing no large differences in accuracy.
Acknowledgments
The authors would like to thank the Handling Editor and the anonymous reviewers for their constructive comments, which led to significant improvement of the paper. This work was partially supported by the National Natural Science Foundation of China under Grant no. 51174236.
References
[1] C. H. Song, S. J. Yoo, C. S. Won, and H. G. Kim, “SVM based indoor/mixed/outdoor classification for digital photo annotation in a ubiquitous computing environment,” Computing and Informatics, vol. 27, no. 5, pp. 757–767, 2008.
[2] T. Van Gestel, J. A. K. Suykens, B. Baesens et al., “Benchmarking least squares support vector machine classifiers,” Machine Learning, vol. 54, no. 1, pp. 5–32, 2004.
[3] X. Zeng and X. W. Chen, “SMO-based pruning methods for sparse least squares support vector machines,” IEEE Transactions on Neural Networks, vol. 16, no. 6, pp. 1541–1546, 2005.
[4] H. Esen, F. Ozgen, M. Esen, and A. Sengur, “Modelling of a new solar air heater through least-squares support vector machines,” Expert Systems with Applications, vol. 36, no. 7, pp. 10673–10682, 2009.
[5] J. A. K. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural Processing Letters, vol. 9, no. 3, pp. 293–300, 1999.
[6] J. A. K. Suykens, L. Lukas, P. Van Dooren, B. De Moor, and J. Vandewalle, “Least squares support vector machine classifiers: a large scale algorithm,” in Proceedings of the European Conference on Circuit Theory and Design (ECCTD '99), pp. 839–842, Stresa, Italy, 1999.
[7] L. V. Ferreira, E. Kaszkurewicz, and A. Bhaya, “Solving systems of linear equations via gradient systems with discontinuous right-hand sides: application to LS-SVM,” IEEE Transactions on Neural Networks, vol. 16, no. 2, pp. 501–505, 2005.
[8] K. S. Chua, “Efficient computations for large least square support vector machine classifiers,” Pattern Recognition Letters, vol. 24, no. 1–3, pp. 75–80, 2003.
[9] W. Chu, C. J. Ong, and S. S. Keerthi, “An improved conjugate gradient scheme to the solution of least squares SVM,” IEEE Transactions on Neural Networks, vol. 16, no. 2, pp. 498–501, 2005.
[10] S. S. Keerthi and S. K. Shevade, “SMO algorithm for least-squares SVM formulations,” Neural Computation, vol. 15, no. 2, pp. 487–507, 2003.
[11] L. Bo, L. Jiao, and L. Wang, “Working set selection using functional gain for LS-SVM,” IEEE Transactions on Neural Networks, vol. 18, no. 5, pp. 1541–1544, 2007.
[12] L. Jian, Z. Xia, X. Liang, and C. Gao, “Design of a multiple kernel learning algorithm for LS-SVM by convex programming,” Neural Networks, vol. 24, no. 5, pp. 476–483, 2011.
[13] J. C. Platt, “Training of support vector machines using sequential minimal optimization,” in Advances in Kernel Methods: Support Vector Learning, pp. 185–208, MIT Press, Cambridge, Mass, USA, 1999.
[14] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy, “Improvements to Platt's SMO algorithm for SVM classifier design,” Neural Computation, vol. 13, no. 3, pp. 637–649, 2001.
[15] R. E. Fan, P. H. Chen, and C. J. Lin, “Working set selection using second order information for training support vector machines,” Journal of Machine Learning Research, vol. 6, pp. 1889–1918, 2005.
[16] Á. Barbero, J. López, and J. R. Dorronsoro, “Cycle-breaking acceleration of SVM training,” Neurocomputing, vol. 72, no. 7–9, pp. 1398–1406, 2009.
[17] L. Jiao, L. Bo, and L. Wang, “Fast sparse approximation for least squares support vector machine,” IEEE Transactions on Neural Networks, vol. 18, no. 3, pp. 685–697, 2007.
[18] P. H. Chen, R. E. Fan, and C. J. Lin, “A study on SMO-type decomposition methods for support vector machines,” IEEE Transactions on Neural Networks, vol. 17, no. 4, pp. 893–908, 2006.
[19] Y.-L. Lin, J.-G. Hsieh, H.-K. Wu, and J.-H. Jeng, “Three-parameter sequential minimal optimization for support vector machines,” Neurocomputing, vol. 74, no. 17, pp. 3467–3475, 2011.
[20] J. López and J. A. K. Suykens, “First and second order SMO algorithms for LS-SVM classifiers,” Neural Processing Letters, vol. 33, no. 1, pp. 31–44, 2011.
[21] G. R. Rätsch, “Benchmark Repository,” Intelligent Data Analysis Group, Fraunhofer-FIRST, Tech. Rep., 2005.
[22] C. C. Chang and C. J. Lin, “LIBSVM: a library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, article 27, 2011.
Copyright
Copyright © 2013 Xigao Shao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.