Special Issue: Data-Driven Fault Supervisory Control Theory and Applications
Research Article | Open Access
A Novel Sparse Least Squares Support Vector Machines
The solution of a Least Squares Support Vector Machine (LS-SVM) suffers from the problem of nonsparseness. The Forward Least Squares Approximation (FLSA) is a greedy approximation algorithm with a least-squares loss function. This paper proposes a new Support Vector Machine for which the FLSA is the training algorithm—the Forward Least Squares Approximation SVM (FLSA-SVM). A major novelty of this new FLSA-SVM is that the number of support vectors is the regularization parameter for tuning the tradeoff between the generalization ability and the training cost. The FLSA-SVMs can also detect the linear dependencies in vectors of the input Gramian matrix. These attributes together contribute to its extreme sparseness. Experiments on benchmark datasets are presented which show that, compared to various SVM algorithms, the FLSA-SVM is extremely compact, while maintaining a competitive generalization ability.
1. Introduction
The last decade has seen widespread applications of Least Squares Support Vector Machines (LS-SVM) [1, 2] to a variety of classification problems. The LS-SVM involves finding a separating hyperplane of maximal margin and minimizing the empirical risk via a least-squares loss function. Here the optimization is subject to equality constraints, as opposed to the inequality constraints used with a standard SVM [3, 4]. Thus the LS-SVM sidesteps the quadratic programming (QP) required for the training of the standard SVM. As a result, training an LS-SVM classifier is equivalent to solving a set of linear algebraic equations, which is usually done by the conjugate gradient (CG) method. Empirical studies suggest that an LS-SVM possesses very competitive generalization abilities compared to a regular SVM. Chu et al. improved on Suykens's algorithm with a proposal that is also based on the CG method but has a reduced time complexity. Meanwhile, Keerthi et al. advocated training an LS-SVM using the sequential minimal optimization (SMO) algorithm.
While these algorithms have indeed made an LS-SVM more computationally attractive, the nonsparseness of its solution remains a major, and as yet unsolved, bottleneck. Sparseness in solutions is essential for reducing the time required to predict the class membership of unlabelled data. The general approach to addressing this issue for an LS-SVM is iterative shrinking of the training set. Here the importance of training samples is evaluated from the weights assigned to them after training. Samples of less importance are removed, and the remainder forms a reduced training set to be learnt again. In view of the linear relationship between the approximation error and the support value for a training sample, a straightforward pruning strategy is to remove samples whose absolute support values are trivial. A later paper recommended pruning the sample which introduces the smallest error if omitted. However, this necessitates the inversion of the kernel matrix, which implies a high computational complexity. Zeng and Chen suggested a pruning method based on the SMO formulation of an LS-SVM, which leads to faster retraining. Other authors proposed solving the dual form of an LS-SVM by iterative addition of basis functions from the kernel dictionary, either until all the training samples are traversed or until the approximation error is lower than a preset threshold. Although the backfitting scheme adopted lowers the computation cost, Lagrangian multipliers associated with previously selected training samples are inherited into the next pass, which could compromise the generalization abilities of the resultant LS-SVM.
This paper presents a new LS-SVM formulation called the Forward Least Squares Approximation SVM (FLSA-SVM). It is trained by Forward Least Squares Approximation (FLSA), a function approximation method using a least-squares loss function. The FLSA-SVM loss function decreases monotonically with an increasing number of support vectors, allowing for the removal of slack variables from the equality constraints. Another novelty of the FLSA-SVM is that it makes the number of support vectors the regularization parameter, which indicates the tradeoff between generalization abilities and empirical risk. The FLSA-SVM builds a classifier iteratively: at each step it adds to the solution the single basis function which contributes the largest reduction in the quadratic cost function.
The paper is organized as follows. Section 2 briefly reviews Least Squares SVM principles. The new sparse LS-SVM—Forward Least Squares Approximation SVM (FLSA-SVM)—is introduced in Section 3. Experimental results are given in Section 4 and concluding remarks in Section 5.
2. Least Squares Support Vector Machines
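For a classification problem of $N$ pairs of training samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $\mathbf{x}_i \in \mathbb{R}^n$ and $y_i \in \{-1, +1\}$, LS-SVM algorithms seek the optimal separating hyperplane with the orientation vector of the least norm.
To ensure the presence of the optimal hyperplane, the input data are translated into a reproducing kernel Hilbert space (RKHS) by a mapping function denoted by $\varphi(\cdot)$. To avoid the curse of dimensionality, the mapping is implemented implicitly by the introduction of the “kernel trick.” Its idea is that dot products in the RKHS can be represented by a Mercer kernel function $K(\cdot, \cdot)$ in the input space:
$$K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i)^T \varphi(\mathbf{x}_j). \tag{1}$$
Thus the linear discriminant function in the feature space can be formulated as
$$f(\mathbf{x}) = \mathbf{w}^T \varphi(\mathbf{x}) + b, \tag{2}$$
where $\mathbf{w}$ is the orientation vector and $b$ is the bias term.
An LS-SVM finds the hyperplane parameterized by $(\mathbf{w}, b)$ by solving the following optimization problem:
$$\min_{\mathbf{w}, b, \mathbf{e}} \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + \frac{C}{2}\sum_{i=1}^{N} e_i^2 \quad \text{s.t.} \quad y_i = \mathbf{w}^T\varphi(\mathbf{x}_i) + b + e_i, \; i = 1, \ldots, N, \tag{3}$$
where the slack variable $e_i$ denotes the deviation between the actual output of the LS-SVM on sample $\mathbf{x}_i$ and its target value $y_i$. $C$ is a parameter which imposes a penalty on deviations.
Introducing the Lagrange multipliers $\alpha_i$ for each of the equality constraints gives
$$L(\mathbf{w}, b, \mathbf{e}; \boldsymbol{\alpha}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + \frac{C}{2}\sum_{i=1}^{N} e_i^2 - \sum_{i=1}^{N} \alpha_i \left( \mathbf{w}^T\varphi(\mathbf{x}_i) + b + e_i - y_i \right). \tag{4}$$
Due to the equality constraints, $\alpha_i$ can be either positive or negative according to the Karush-Kuhn-Tucker (KKT) conditions:
$$\frac{\partial L}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{N} \alpha_i \varphi(\mathbf{x}_i), \tag{5}$$
$$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i = 0, \tag{6}$$
$$\frac{\partial L}{\partial e_i} = 0 \;\Rightarrow\; \alpha_i = C e_i, \qquad \frac{\partial L}{\partial \alpha_i} = 0 \;\Rightarrow\; \mathbf{w}^T\varphi(\mathbf{x}_i) + b + e_i - y_i = 0. \tag{7}$$
The linear equations can be further simplified to
$$\begin{bmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & \mathbf{K} + C^{-1}\mathbf{I} \end{bmatrix} \begin{bmatrix} b \\ \boldsymbol{\alpha} \end{bmatrix} = \begin{bmatrix} 0 \\ \mathbf{y} \end{bmatrix}, \tag{8}$$
where $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$ and $\mathbf{y} = [y_1, \ldots, y_N]^T$, $\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_N]^T$, $\mathbf{I}$ is the unity matrix of rank $N$, and $\mathbf{1}$ is a column vector of 1s of length $N$. The solution of an LS-SVM is a linear combination of the basis functions whose associated $\alpha_i$ is nonzero:
$$f(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i K(\mathbf{x}, \mathbf{x}_i) + b. \tag{9}$$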
It has been noted that the LS-SVM is almost equivalent to ridge regression, since the two methods correspond to an identical optimization problem. Meanwhile, the equivalence between the LS-SVM and the Kernel Fisher Discriminant method has also been established.
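The reduction of LS-SVM training to a single linear system can be illustrated with a short NumPy sketch. It assumes the regression-style constraint (target equals output plus slack), a penalty parameter named `C`, and a Gaussian kernel of the form exp(-||a - b||^2 / (2 sigma^2)); these names and this kernel parameterisation are illustrative assumptions. A dense direct solver stands in for the CG method mentioned in the text.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Gaussian kernel; the 1 / (2 sigma^2) scaling is an assumed parameterisation.
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def lssvm_train(X, y, C=10.0, sigma=1.0):
    """Solve [[0, 1^T], [1, K + I/C]] [b; alpha] = [0; y] with a direct solver."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / C
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]            # bias b, multipliers alpha

def lssvm_decision(X_train, b, alpha, X_new, sigma=1.0):
    # f(x) = sum_i alpha_i K(x, x_i) + b
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b

# Toy data: two Gaussian blobs (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
y = np.concatenate([-np.ones(20), np.ones(20)])
C = 10.0
b, alpha = lssvm_train(X, y, C=C)
train_out = lssvm_decision(X, b, alpha, X)
print("multipliers with |alpha_i| > 1e-8:", int(np.sum(np.abs(alpha) > 1e-8)), "of", len(y))
```

Because each multiplier is proportional to the corresponding slack, almost every training sample typically ends up with a nonzero multiplier, which is the nonsparseness discussed in the introduction.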
3. Least Squares Approximation Sparse SVM
Equation (3) shows that an LS-SVM can also be viewed as a ridge regression model. The optimality conditions (7) indicate that the introduction of the slack variables $e_i$ is the root of the nonsparseness problem since, in regression applications, most slack variables end up nonzero. Yet the slack variables seem unavoidable if the training cost is to be represented, and with them the penalty parameter that indicates the tradeoff between training cost and generalization abilities.
This section introduces a new formulation of the Least Squares SVM, namely, the Forward Least Squares Approximation SVM (FLSA-SVM). The proposed FLSA-SVM algorithm is motivated by the Least Squares “Forward Approximation & Backward Refinement” method for function approximation, which was first developed in the dynamic system identification community by Li et al.
The FLSA-SVM dispenses with slack variables by adopting the Forward Least Squares Approximation (FLSA) algorithm, a function approximation method with a least-squares loss function. FLSA iteratively selects the basis function which causes the steepest reduction in the loss function. These features allow the number of support vectors (SVs), which equals the number of basis functions, to act as the tradeoff between training cost and generalization abilities. The sparseness of the FLSA-SVM solution is thereby ensured, and it is confirmed by the experiments in Section 4, in which FLSA-SVMs are sparser than standard SVMs.
In this section, a description of “Forward Approximation” is given from a machine learning perspective, as a preface to the introduction of the FLSA-SVM algorithm. The strategy for reducing the computational complexity of FLSA-SVMs is then presented.
3.1. Forward Least Squares Approximation 
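Suppose that the values $\mathbf{y} = [y_1, \ldots, y_N]^T$ of an unknown target function $f$ in a Hilbert space $\mathcal{H}$ are given at the input data $\mathbf{x}_1, \ldots, \mathbf{x}_N$. A “dictionary” of functions $\mathcal{D} = \{p_1, \ldots, p_M\}$ in the Hilbert space is also given, in which each $p_j$ is termed a “basis function.” Forward Least Squares Approximation (FLSA) addresses the estimation of $f$ with a linear combination of basis functions chosen from the dictionary:
$$\hat{f}(\mathbf{x}) = \sum_{i=1}^{k} \theta_i \, p_{s_i}(\mathbf{x}), \tag{10}$$
where $k$ is the number of basis functions which expand $\hat{f}$, $\{p_{s_1}, \ldots, p_{s_k}\} \subset \mathcal{D}$ is the set of selected basis functions, and $\boldsymbol{\theta}_k$ is the associated weight vector,
$$\boldsymbol{\theta}_k = [\theta_1, \ldots, \theta_k]^T, \tag{11}$$
so that the squared norm of the residue vector, denoted by $\mathbf{e}$, is minimized:
$$E = \|\mathbf{e}\|^2 = \|\mathbf{y} - \hat{\mathbf{y}}\|^2, \tag{12}$$
where $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ and $\hat{\mathbf{y}}$ is the output vector of $\hat{f}$ on the input data:
$$\hat{\mathbf{y}} = \mathbf{P}_k \boldsymbol{\theta}_k, \tag{13}$$
where $\mathbf{p}_{s_i} = [p_{s_i}(\mathbf{x}_1), \ldots, p_{s_i}(\mathbf{x}_N)]^T$ is the vector of decision values of basis function $p_{s_i}$ on the input data and $\mathbf{P}_k = [\mathbf{p}_{s_1}, \ldots, \mathbf{p}_{s_k}]$. Equation (13) suggests that FLSA can be understood as working entirely in the space $\mathbb{R}^N$.
The weight vector that minimizes the loss function (12) is given as
$$\hat{\boldsymbol{\theta}}_k = (\mathbf{P}_k^T \mathbf{P}_k)^{-1} \mathbf{P}_k^T \mathbf{y} \tag{14}$$
for a matrix $\mathbf{P}_k$ which is of full column rank. The resultant minimal loss function is
$$E(\mathbf{P}_k) = \mathbf{y}^T\mathbf{y} - \mathbf{y}^T \mathbf{P}_k (\mathbf{P}_k^T \mathbf{P}_k)^{-1} \mathbf{P}_k^T \mathbf{y}. \tag{15}$$
Starting from (15), FLSA defines a set of residue matrices for the measurement of the contribution of each basis function to the reduction of the loss function:
$$\mathbf{R}_k = \mathbf{I} - \mathbf{P}_k (\mathbf{P}_k^T \mathbf{P}_k)^{-1} \mathbf{P}_k^T, \quad k = 1, 2, \ldots, \tag{16}$$
where the matrix $\mathbf{P}_k$ is of full column rank and is composed of the output vectors of the $k$ selected basis functions, $\mathbf{I}$ is the unity matrix, and $\mathbf{R}_0 = \mathbf{I}$ is set. Then, the following equation holds:
$$\mathbf{R}_{k+1} = \mathbf{R}_k - \frac{\mathbf{R}_k \mathbf{p}_{s_{k+1}} \mathbf{p}_{s_{k+1}}^T \mathbf{R}_k^T}{\mathbf{p}_{s_{k+1}}^T \mathbf{R}_k \mathbf{p}_{s_{k+1}}}. \tag{17}$$
For a $\mathbf{P}_k$ of full column rank, it is proved that $\mathbf{R}_k$ has the following properties:
$$\mathbf{R}_k^T = \mathbf{R}_k, \tag{18}$$
$$\mathbf{R}_k \mathbf{R}_k = \mathbf{R}_k, \tag{19}$$
$$\mathbf{R}_k \mathbf{p}_{s_i} = \mathbf{0}, \quad i = 1, \ldots, k. \tag{20}$$
Meanwhile, the introduction of $\mathbf{R}_k$ simplifies the formulation of $E(\mathbf{P}_k)$, which is the evaluation of the loss function at $\hat{\boldsymbol{\theta}}_k$:
$$E(\mathbf{P}_k) = \mathbf{y}^T \mathbf{R}_k \mathbf{y}. \tag{21}$$
And thus
$$E(\mathbf{P}_{k+1}) = E(\mathbf{P}_k) - \frac{(\mathbf{y}^T \mathbf{R}_k \mathbf{p}_{s_{k+1}})^2}{\mathbf{p}_{s_{k+1}}^T \mathbf{R}_k \mathbf{p}_{s_{k+1}}}. \tag{22}$$
The contribution, denoted by $\Delta E_{k+1}(\mathbf{p})$, that a column vector $\mathbf{p}$ makes to the reduction of the loss function can thus be explicitly expressed:
$$\Delta E_{k+1}(\mathbf{p}) = \frac{(\mathbf{y}^T \mathbf{R}_k \mathbf{p})^2}{\mathbf{p}^T \mathbf{R}_k \mathbf{p}} = \frac{\left( (\mathbf{y}^{(k)})^T \mathbf{p}^{(k)} \right)^2}{(\mathbf{p}^{(k)})^T \mathbf{p}^{(k)}}. \tag{23}$$
The FLSA algorithm proceeds in a greedy manner, selecting one basis function per iteration. The $(k+1)$th iteration identifies the index $s_{k+1}$ of the $(k+1)$th basis function by solving the optimization problem:
$$s_{k+1} = \arg\max_{j} \; \Delta E_{k+1}(\mathbf{p}_j), \tag{24}$$
where $\mathbf{p}^{(k)} = \mathbf{R}_k \mathbf{p}$ and $\mathbf{y}^{(k)} = \mathbf{R}_k \mathbf{y}$.
The $j$th iteration of the FLSA algorithm, in fact, establishes the following linear equation, whose unknowns are the last $k - j + 1$ elements of $\boldsymbol{\theta}_k$ in (13):
$$a_{j,j}\theta_j + \sum_{i=j+1}^{k} a_{j,i}\theta_i = a_{j,y}, \qquad a_{j,i} = \mathbf{p}_{s_j}^T \mathbf{R}_{j-1} \mathbf{p}_{s_i}, \quad a_{j,y} = \mathbf{p}_{s_j}^T \mathbf{R}_{j-1} \mathbf{y}. \tag{25}$$
Thus the row $[a_{j,j}, \ldots, a_{j,k}, a_{j,y}]$, which represents a linear equation, is stored for the $j$th iteration. Eventually, the $k$ iterations build up an upper triangular linear system. The solution can be computed by the back substitution procedure used in a typical Gaussian elimination process:
$$\theta_k = \frac{a_{k,y}}{a_{k,k}}, \qquad \theta_j = \frac{a_{j,y} - \sum_{i=j+1}^{k} a_{j,i}\theta_i}{a_{j,j}}, \quad j = k-1, \ldots, 1. \tag{26}$$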
The FLSA algorithm, in fact, is closely related to the Orthogonal Least Squares (OLS) method [22–24], which also allows for the explicit formulation of the contribution of a basis function made to the reduction of the squared error loss.
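The greedy selection and the back substitution described in this subsection can be sketched in NumPy as follows. This is a minimal illustration rather than the fast implementation of Li et al.: it stores the full residue matrix explicitly (here called `R`), which costs quadratic memory in the number of samples, and the function name `flsa`, the tolerance `tol`, and the random test data are assumptions made for the example.

```python
import numpy as np

def flsa(P, y, k, tol=1e-10):
    """Greedily pick up to k columns of P that best fit y in the least-squares sense."""
    n = P.shape[0]
    R = np.eye(n)                       # residue matrix, initially the identity
    selected, deflated = [], []         # chosen indices and the vectors R p at selection time
    losses = [float(y @ y)]             # loss before any basis function is selected
    for _ in range(k):
        RP = R @ P                      # R p for every candidate column
        Ry = R @ y
        denom = np.sum(RP * RP, axis=0)             # p^T R p, since R is symmetric idempotent
        gain = np.full(P.shape[1], -np.inf)
        ok = denom > tol                            # skip (nearly) dependent columns
        gain[ok] = (Ry @ RP[:, ok]) ** 2 / denom[ok]
        j = int(np.argmax(gain))
        if not np.isfinite(gain[j]):
            break                                   # nothing useful left in the dictionary
        r = RP[:, j]
        R = R - np.outer(r, r) / denom[j]           # rank-one update of the residue matrix
        selected.append(j)
        deflated.append(r)
        losses.append(losses[-1] - gain[j])
    # Upper-triangular system a[j, i] = (R p_j)^T p_i, solved by back substitution.
    m = len(selected)
    theta = np.zeros(m)
    for j in range(m - 1, -1, -1):
        a_row = deflated[j] @ P[:, selected]
        rhs = deflated[j] @ y
        theta[j] = (rhs - a_row[j + 1:] @ theta[j + 1:]) / a_row[j]
    return selected, theta, losses

rng = np.random.default_rng(0)
P = rng.normal(size=(30, 10))           # hypothetical dictionary evaluated on 30 inputs
y = rng.normal(size=30)
selected, theta, losses = flsa(P, y, k=4)
print(selected, losses)
```

The list `losses` records the training cost after each selection, so its monotone decrease can be inspected directly; the weights obtained by back substitution coincide with an ordinary least-squares fit on the selected columns.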
3.2. Forward Least Squares Approximation SVMs
As with standard SVMs, the formulation of LS-SVMs embodies the Structural Risk Minimization (SRM) principle, which is illustrated in Figure 1: the dotted line represents the upper bound on the complexity term of the function set from which the solution is chosen, and the dashed line represents the empirical risk. SRM minimizes the upper bound on the expected risk (generalization error), for which the best tradeoff between the complexity term and the empirical risk must be found.
To this end, a regularization parameter $C$ is introduced to indicate the tradeoff between the complexity term and the empirical risk, both of which LS-SVMs explicitly formulate. Model selection is performed over the domain of $C$ in search of its optimal value.
In the FLSA algorithm, by contrast, it can be concluded from (22) that the training cost decreases monotonically as the number of selected basis functions $k$ increases. The training cost can then be plotted against $k$ as a curve similar to that of the empirical risk in Figure 1.
This fact motivates the following two innovations to traditional formulation of LS-SVMs: the employment of the parameter as the regularization term; the avoidance of the term to represent empirical risk by using the FLSA as the training algorithm, which minimizes the summed squared residues for any value of . As a result, a new formulation of LS-SVMs—namely, Forward Least Squares Approximation Sparse SVM (FLSA-SVM), which is restricted to be trained by FLSA algorithm, is proposed: where is composed of the indices of the support vectors and is the cardinality of the set. The addition of term to (27) is first introduced by .
In an FLSA-SVM, the parameter is interpreted as the number of nonzeros of Lagrangian multipliers, that is, the number of support vectors (SVs), which is seen more obviously in its Lagrangian by introducing the Lagrange multipliers for each of the equality constraints giving The optimal point requires that
The linear equations can be further simplified to where and , , , is a -by- matrix of 1s, and , where .
It is worth attention that, in this paper, in (32) is also referred to as as it has support vectors. Thus, if all data samples are used as support vectors, then becomes .
FLSA can be applied to train an FLSA-SVM using a kernel-based dictionary of candidate basis functions. Since $k$ is defined to be the number of SVs, the FLSA-SVM achieves a more direct control of the sparseness of its solution.
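A rough sketch of how a kernel-based dictionary could be combined with greedy least-squares selection is given below. Several details are assumptions rather than the paper's exact formulation: the dictionary columns are taken to be the kernel columns plus a constant offset of 1 (one way of absorbing the bias), the Gaussian kernel uses a 1 / (2 sigma^2) scaling, and the final weights are obtained with `np.linalg.lstsq` instead of back substitution. The names `flsa_svm_fit` and `flsa_svm_decision` are hypothetical.

```python
import numpy as np

def rbf(A, B, sigma):
    # Gaussian kernel matrix; the scaling convention is an assumption.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def flsa_svm_fit(X, y, n_sv, sigma=1.0, tol=1e-10):
    H = rbf(X, X, sigma) + 1.0          # assumed dictionary: kernel column plus constant offset
    R = np.eye(len(y))                  # residue matrix
    sv = []
    for _ in range(n_sv):
        RH = R @ H
        denom = np.sum(RH * RH, axis=0)
        numer = ((R @ y) @ RH) ** 2
        gain = np.where(denom > tol, numer / np.maximum(denom, tol), -np.inf)
        j = int(np.argmax(gain))
        if not np.isfinite(gain[j]):
            break                       # remaining columns are linearly dependent
        R -= np.outer(RH[:, j], RH[:, j]) / denom[j]
        sv.append(j)
    theta = np.linalg.lstsq(H[:, sv], y, rcond=None)[0]
    return np.array(sv), theta

def flsa_svm_decision(X_train, sv, theta, X_new, sigma=1.0):
    # Only the selected training samples (support vectors) are needed at prediction time.
    return (rbf(X_new, X_train[sv], sigma) + 1.0) @ theta

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
y = np.concatenate([-np.ones(20), np.ones(20)])
sv, theta = flsa_svm_fit(X, y, n_sv=3)
dec = flsa_svm_decision(X, sv, theta, X)
print("support vectors:", sv, "training accuracy:", np.mean(np.sign(dec) == y))
```

The argument `n_sv` plays the role of the regularization parameter: it fixes the number of support vectors in advance, so the size of the classifier is controlled directly rather than through a penalty constant.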
3.3. Automatic Detection of Linear Dependencies
Since the kernel matrix in (32) is positive semidefinite, it is easy to prove that the matrix derived from it is positive semidefinite as well. Linear dependencies are therefore likely to occur among its column vectors, each being the evaluation of a basis function on the training data. Assume that $k$ columns $\mathbf{p}_{s_1}, \ldots, \mathbf{p}_{s_k}$ have been selected iteratively and are linearly independent, where $s_1, \ldots, s_k$ are their column indices in chronological sequence. Let $\mathbf{p}$ denote any column which remains an available candidate and which can be expressed as a linear combination of $\mathbf{p}_{s_1}, \ldots, \mathbf{p}_{s_k}$. It is thus naturally desirable to remove $\mathbf{p}$ from the candidates in order to (i) ensure the sparseness of the solution and (ii) avoid any undue computation concerning $\mathbf{p}$, since
$$\mathbf{p} = \sum_{i=1}^{k} c_i \, \mathbf{p}_{s_i}.$$
With the residue matrix at iteration $k$ denoted by $\mathbf{R}_k$, the following property holds according to (18) and (20):
$$\mathbf{R}_k \mathbf{p} = \sum_{i=1}^{k} c_i \, \mathbf{R}_k \mathbf{p}_{s_i} = \mathbf{0}.$$
Thus, when the dictionary is updated by $\mathbf{p} \leftarrow \mathbf{R}_k \mathbf{p}$, the column vector $\mathbf{p}$ becomes $\mathbf{0}$, that is, it is automatically pruned. Hence, at each iteration of the FLSA-SVM, any column vector which can be represented by a linear combination of previously selected columns is automatically pruned. This merit of the FLSA-SVM is one contributor to the sparseness of the resultant solution.
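The pruning effect can be checked numerically with a small NumPy sketch. The residue matrix is formed here as the projector onto the orthogonal complement of the selected columns (identity minus the selected columns times their pseudo-inverse); the variable names and the random data are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
P_sel = rng.normal(size=(8, 3))                  # three selected, linearly independent columns
p_dep = P_sel @ np.array([0.5, -1.0, 2.0])       # candidate that is a combination of them
p_new = rng.normal(size=8)                       # candidate with a genuinely new direction

# Residue matrix: I - P (P^T P)^{-1} P^T, written with the pseudo-inverse.
R = np.eye(8) - P_sel @ np.linalg.pinv(P_sel)

dep_norm = np.linalg.norm(R @ p_dep)             # numerically zero: the column is pruned
new_norm = np.linalg.norm(R @ p_new)             # clearly nonzero: the column stays a candidate
print(dep_norm, new_norm)
```

In an implementation, a candidate whose deflated norm falls below a small tolerance can simply be dropped from the dictionary, which also avoids a division by a near-zero denominator when its contribution is evaluated.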
Algorithm 1 gives the pseudocode of the FLSA-SVM algorithm in detail.
3.4. Computation Complexity
As discussed in , for a single round of training, the computational complexity of FLSA-SVM is and space complexity for FLSA-SVM is .
4. Experimental Results
A set of experiments was performed to evaluate the performance of the proposed FLSA-SVM algorithm. The FLSA-SVMs were first applied to the two-spiral benchmark for an illustrative view of their generalization abilities. The following Gaussian kernel function was used for the experiments in this paper: The standard SVMs were implemented by LIBSVM. The LS-SVM trained by the CG method was implemented by the LS-SVMlab toolbox, and all experiments were run on a Pentium 4 3.2 GHz processor under Windows XP with 2 GB of RAM.
4.1. Experiments on Two-Spiral Dataset
The 2D “two-spiral” benchmark is known to be difficult for pattern recognition algorithms and poses great challenges to neural networks. The training set consists of 194 points of the $x$-$y$ plane, half of which have a target output of $+1$ and half a target output of −1. These training points describe two intertwining spirals that go around the origin three times, as shown in Figure 2, where the two categories are marked, respectively, by “+” and “o.”
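For readers who wish to reproduce the dataset, the sketch below generates 194 points using the construction commonly associated with this benchmark. The constants (radius 6.5, step of pi/16, divisor 104) follow that widely used convention and are not stated in this paper, so they should be treated as assumptions; the function name `two_spirals` is hypothetical.

```python
import numpy as np

def two_spirals(n_per_class=97):
    """Return (X, y): two intertwined spirals with labels +1 and -1."""
    i = np.arange(n_per_class)
    angle = i * np.pi / 16.0                      # 96 steps of pi/16 give three full turns
    radius = 6.5 * (104 - i) / 104.0              # radius shrinks towards the origin
    spiral = np.column_stack((radius * np.sin(angle), radius * np.cos(angle)))
    X = np.vstack((spiral, -spiral))              # second spiral is the point reflection
    y = np.concatenate((np.ones(n_per_class), -np.ones(n_per_class)))
    return X, y

X, y = two_spirals()
print(X.shape, int((y == 1).sum()), int((y == -1).sum()))
```

The second class is obtained by reflecting the first through the origin, which is what makes the two classes interlock and the problem hard for linear or weakly nonlinear classifiers.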
For the FLSA-SVM, the parameter setting of Gaussian kernel was as in (36). With a feasible range of , the optimal regularization parameter was found to be which gave a leave-one-out cross-validation (LOOCV) accuracy of . With standard SVMs, the parameter setting was and , whose LOOCV accuracy is . The SVM classifier was required in support vectors (SVs). The graphical outputs of the FLSA-SVM and the SVM were given by Figure 3. It showed that generally both SVM and FLSA-SVM algorithms recognized the pattern successfully, outputting “two-spiral” in a very smooth and regular manner. But at the area around the coordinate of , the FLSA-SVM showed better performance than the conventional SVM whose decision hyperplane was biased towards the “o” class.
Besides the reduction in the number of SVs, which is very notable for such a “hard” classification problem, the FLSA-SVMs are comparatively superior in the following two aspects. In SVMs, it often occurs that, given a fixed kernel parameter, the optimal CV accuracy can be obtained from multiple settings of the regularization parameter. The “two-spiral” problem is a case in point. For the best LOOCV accuracy, with a fixed , the value of can also opt for beyond . There is no specific rule as to which option is the best, and normally the smallest is chosen for a scaled-down feasible region. In FLSA-SVMs, by contrast, the setting of the regularization term is more tractable, since different options correspond to different learning errors, which can be easily tracked with (23). For multiple values of which produce the same optimal CV accuracy, the largest is chosen for a smaller learning error.
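The tie-breaking rule described above, choosing the largest number of SVs among the settings that reach the best CV accuracy, can be written in a few lines. The accuracy values below are hypothetical placeholders, not results from the paper, and the function name `pick_num_svs` is an assumption.

```python
def pick_num_svs(cv_accuracy):
    """cv_accuracy maps a candidate number of SVs to its cross-validation accuracy."""
    best = max(cv_accuracy.values())
    # Among the settings that tie for the best accuracy, prefer the largest one,
    # because the training error can only decrease as more SVs are added.
    return max(k for k, acc in cv_accuracy.items() if acc == best)

scores = {20: 0.97, 40: 1.0, 60: 1.0, 80: 0.99}   # hypothetical CV accuracies
chosen = pick_num_svs(scores)
print(chosen)
```

With the placeholder scores above, the settings 40 and 60 tie for the best accuracy and the rule returns 60.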
In fact, for , the LOOCV accuracy remained stable at . Thus an FLSA-SVM was also trained with the parameter settings of which was depicted in Figure 4(a). For comparisons, an LS-SVM was trained whose parameter settings are and for Gaussian RBF. The solution was parameterized by the entire 194 points and illustrated in Figure 4(b). The decision boundaries of both of the LS-SVM and the FLSA-SVM were generally smooth and followed the pattern satisfactorily, despite the slightly biased segments around the coordinates of and . But still the FLSA-SVM performed much better at the origin area than the LS-SVM.
Table 1 also compares the time cost of 10-fold cross-validation (CV) on the regularization parameter by FLSA-SVMs, SVMs and LS-SVMs. For FLSA-SVMs, the regularization term was assigned integers evenly with an interval of within the range of . For SVMs and LS-SVMs, the parameter was sequentially increased from to in multiple of . Each fold was split into a training set and a validation set, with a division pattern of training versus validation or training versus validation. The total computation time of the classifiers on various folds for each algorithm was reported in Table 1. The time cost of the -fold altogether was given in the last row entry of Table 1. It was clearly shown that the 10-fold cross-validation (CV) of FLSA-SVMs is over 4 times faster than SVMs. The time complexity of FLSA-SVMs remained competitive to that of LS-SVMs using the SMO algorithm and much more reduced than LS-SVMs using the CG method.
These comparisons indicate that the FLSA-SVM is very promising in easing the nonsparseness problem of an LS-SVM, in addition to its outstanding generalization performance. It can also obtain a solution whose sparseness is competitive with, or even superior to, that of a standard SVM.
4.2. More Benchmark Problems
The FLSA-SVM algorithm was applied to 3 small-scale binary problems, Banana, Image, and Splice, which are accessible at http://theoval.cmp.uea.ac.uk/matlab/#benchmarks/. Among all the realizations for each benchmark, the first one was used. Experiments were also performed on the Ringnorm dataset, which is a medium-scale binary learning problem. Detailed information on the datasets is given in Table 2.
FLSA-SVMs were compared with SVMs, LS-SVMs, the fast sparse approximation scheme for LS-SVM (FSALS-SVM), and its variant, called PFSALS-SVM, both of which were proposed by Jiao et al. The parameter of FSALS-SVMs and PFSALS-SVMs was uniformly set to 0.5, which was empirically shown to work well with most datasets. Comparisons were also made against D-optimality orthogonal forward regression (D-OFR), a technique for nonlinear function estimation which is intended to yield sparse solutions. The parameters, namely the penalty constant and the kernel parameter in (36), were tuned by tenfold cross-validation (CV).
Table 3 presents the classification accuracy of the SVM algorithms. The best results among the four SVM algorithms are highlighted. It can be seen that the FLSA-SVM achieved classification accuracy comparable to that of the standard SVM, the conventional LS-SVM, FSALS-SVM, PFSALS-SVM, and D-optimality OFR.
The numbers of SVs are compared in Table 4. For all the learning problems, the FLSA-SVM required far fewer SVs than the SVM and the conventional LS-SVM. In particular, the reduction in SVs reached over 98% and 80%, respectively, on the Ringnorm and Banana datasets compared with the SVM. The FLSA-SVM also maintained its edge over FSALS-SVM and PFSALS-SVM, particularly on the Ringnorm and Banana datasets. Although the D-OFR method falls into the category of unsupervised learning algorithms, the FLSA-SVM has, mathematically, the closest link with the D-OFR method. The results in Table 3 demonstrate that the D-OFR method remained competitive with the FLSA-SVM on the Ringnorm and Banana datasets. However, on the Splice and Image datasets, the D-OFR method failed to achieve any sparse solutions.
5. Conclusion
This paper proposed a new LS-SVM algorithm, the FLSA-SVM, which is trained specifically by the FLSA method with a minimized squared error loss. The FLSA-SVM iteratively selects into the solution an optimal basis function which is associated with a specific training example. The algorithm adopts the number of SVs as the regularization term that sets the tradeoff between generalization abilities and training cost. As a result, the solution of the FLSA-SVM is extremely sparse compared to that of LS-SVMs. Experiments showed that the FLSA-SVM algorithm maintained an accuracy comparable to that of the standard SVM, the LS-SVM, and a number of recently developed sparse learning algorithms. Yet the FLSA-SVM showed definite advantages over its counterparts regarding the sparseness of the solution. On small datasets like the two-spiral benchmark, the FLSA-SVM training algorithm also proved to be more efficient than the CG method.
- J. A. K. Suykens, T. V. Gestel, J. Vandewalle, and B. D. Moor, “A support vector machine formulation to PCA analysis and its kernel version,” IEEE Transactions on Neural Networks, vol. 14, no. 2, pp. 447–450, 2003.
- J. A. K. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural Processing Letters, vol. 9, no. 3, pp. 293–300, 1999.
- V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
- V. Vapnik, Statistical Learning Theory, John Wiley & Sons, New York, NY, USA, 1998.
- J. A. K. Suykens, L. Lukas, P. V. Dooren, B. D. Moor, and J. Vandewalle, “Least squares support vector machine classifiers: a large scale algorithm,” in Proceedings of the European Conference on Circuit Theory and Design (ECCTD '99), pp. 839–842, Stresa, Italy, September 1999.
- T. V. Gestel, J. A. K. Suykens, B. Baesens et al., “Benchmarking least squares support vector machine classifiers,” Machine Learning, vol. 54, no. 1, pp. 5–32, 2004.
- W. Chu, C. J. Ong, and S. S. Keerthi, “An improved conjugate gradient scheme to the solution of least squares SVM,” IEEE Transactions on Neural Networks, vol. 16, no. 2, pp. 498–501, 2005.
- S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy, “Improvements to Platt's SMO algorithm for SVM classifier design,” Neural Computation, vol. 13, no. 3, pp. 637–649, 2001.
- J. A. K. Suykens, J. de Brabanter, L. Lukas, and J. Vandewalle, “Weighted least squares support vector machines: robustness and sparse approximation,” Neurocomputing, vol. 48, no. 1, pp. 85–105, 2002.
- B. J. de Kruif and T. J. A. de Vries, “Pruning error minimization in least squares support vector machines,” IEEE Transactions on Neural Networks, vol. 14, no. 3, pp. 696–702, 2003.
- X. Zeng and X. W. Chen, “SMO-based pruning methods for sparse least squares support vector machines,” IEEE Transactions on Neural Networks, vol. 16, no. 6, pp. 1541–1546, 2005.
- L. Jiao, L. Bo, and L. Wang, “Fast sparse approximation for least squares support vector machine,” IEEE Transactions on Neural Networks, vol. 18, no. 3, pp. 685–697, 2007.
- K. Li, J. X. Peng, and G. W. Irwin, “A fast nonlinear model identification method,” IEEE Transactions on Automatic Control, vol. 50, no. 8, pp. 1211–1216, 2005.
- C. J. C. Burges, “A tutorial on support vector machines for pattern recognition,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, 1998.
- N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, New York, NY, USA, 2000.
- R. Fletcher, Practical Methods of Optimization, John Wiley & Sons, New York, NY, USA, 1987.
- C. Saunders, A. Gammerman, and V. Vovk, “Ridge regression learning algorithm in dual variables,” in Proceedings of the 15th International Conference on Machine Learning (ICML '98), pp. 515–521, Morgan Kaufmann, 1998.
- S. Mika, G. Ratsch, and K. Muller, “A mathematical programming approach to the kernel fisher algorithm,” in Advances in Neural Information Processing Systems, pp. 591–597, 2001.
- T. Gestel, J. A. K. Suykens, G. Lanckriet, A. Lambrechts, B. Moor, and J. Vandewalle, “Bayesian framework for least-squares support vector machine classifiers, gaussian processes, and kernel fisher discriminant analysis,” Neural Computation, vol. 14, no. 5, pp. 1115–1147, 2002.
- X. Xia, K. Li, and G. Irwin, “Improved training of an optimal sparse least squares support vector machine,” in Proceedings of the 17th World Congress The International Federation of Automatic Control (IFAC '08), Seoul, Korea, July 2008.
- C. Lawson and R. Hanson, “Solving least squares problems,” in Prentice-Hall Series in Automatic Computation, Prentice Hall, Englewood Cliffs, NJ, USA, 1974.
- S. Chen, S. Billings, and W. Luo, “Orthogonal least squares methods and their application to non-linear system identification,” International Journal of Control, vol. 50, no. 5, pp. 1873–1896, 1989.
- S. Chen, C. F. N. Cowan, and P. M. Grant, “Orthogonal least squares learning algorithm for radial basis function networks,” IEEE Transactions on Neural Networks, vol. 2, no. 2, pp. 302–309, 1991.
- S. Chen and J. Wigger, “Fast orthogonal least squares algorithm for efficient subset model selection,” IEEE Transactions on Signal Processing, vol. 43, no. 7, pp. 1713–1715, 1995.
- O. L. Mangasarian and D. R. Musicant, “Successive overrelaxation for support vector machines,” IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1032–1037, 1999.
- K. Li, J. X. Peng, and E. Bai, “A two-stage algorithm for identification of nonlinear dynamic systems,” Automatica, vol. 42, no. 7, pp. 1189–1197, 2006.
- S. Fahlman and C. Lebiere, “The cascade-correlation learning architecture,” in Advances in Neural Information Processing Systems 2, D. S. Touretzky, Ed., 1990.
- C. Chang and C. Lin, “LIBSVM: a library for support vector machines,” Software, vol. 80, pp. 604–611, 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
- K. Pelckmans, J. Suykens, T. van Gestel et al., “LSSVMlab: a matlab/C toolbox for least squares support vector machines,” in Tutorial, KULeuven-ESAT, Leuven, Belgium, 2002.
- J. Garcke, M. Griebel, and M. Thess, “Data mining with sparse grids,” Computing, vol. 67, no. 3, pp. 225–253, 2001.
- L. Breiman, “Arcing classifier (with discussion and a rejoinder by the author),” The Annals of Statistics, vol. 26, no. 3, pp. 801–849, 1998.
- S. Chen, X. Hong, and C. J. Harris, “Regression based D-optimality experimental design for sparse kernel density estimation,” Neurocomputing, vol. 73, no. 4–6, pp. 727–739, 2010.
Copyright © 2013 Xiao-Lei Xia et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.