Abstract
The solution of a Least Squares Support Vector Machine (LSSVM) suffers from the problem of nonsparseness. The Forward Least Squares Approximation (FLSA) is a greedy approximation algorithm with a least-squares loss function. This paper proposes a new Support Vector Machine for which the FLSA is the training algorithm: the Forward Least Squares Approximation SVM (FLSASVM). A major novelty of the FLSASVM is that the number of support vectors is the regularization parameter for tuning the tradeoff between the generalization ability and the training cost. The FLSASVM can also detect linear dependencies among the column vectors of the input Gramian matrix. Together, these attributes contribute to its extreme sparseness. Experiments on benchmark datasets show that, compared to various SVM algorithms, the FLSASVM is extremely compact while maintaining a competitive generalization ability.
1. Introduction
The last decade has seen widespread applications of Least Squares Support Vector Machines (LSSVM) [1, 2] to a variety of classification problems. The LSSVM involves finding a separating hyperplane of maximal margin and minimizing the empirical risk via a least-squares loss function. Here the optimization is subject to equality constraints, as opposed to the inequality constraints used with a standard SVM [3, 4]. Thus the LSSVM sidesteps the quadratic programming (QP) required for the training of the standard SVM. As a result, an LSSVM classifier is equivalent to a set of linear algebraic equations, which are usually solved by the conjugate gradient (CG) method [5]. Empirical studies suggest that an LSSVM possesses very competitive generalization abilities compared to a regular SVM [6]. Suykens' algorithm was improved by Chu et al. [7], whose proposal is also based on the CG method but has a reduced time complexity. Meanwhile, Keerthi et al. advocated training an LSSVM using the sequential minimal optimization (SMO) algorithm [8].
While these algorithms have indeed made an LSSVM more computationally attractive, the nonsparseness of its solution remains a major, and as yet unsolved, bottleneck. Sparseness in solutions is essential for reducing the time required to predict the class membership of unlabelled data. The general approach to addressing this issue for an LSSVM is iterative shrinking of the training set. Here the importance of training samples is evaluated from the weights assigned to them after training. Samples of less importance are removed, and the remainder forms a reduced training set to be learnt again. In view of the linear relationship between the approximation error and the support value for a training sample, a straightforward pruning strategy is to remove samples whose absolute support values are trivial [9]. A later paper recommended pruning the sample which introduces the smallest error if omitted [10]. However, this necessitates the inversion of the kernel matrix, which implies high computational complexity. Zeng and Chen [11] suggested a pruning method based on the SMO formulation of an LSSVM, which leads to faster retraining. Other authors [12] proposed solving the dual form of an LSSVM by iterative addition of basis functions from the kernel dictionary, either until all the training samples are traversed or until the approximation error is lower than a preset threshold. Despite the lower computational cost of the backfitting scheme adopted, Lagrange multipliers associated with previously selected training samples were inherited by the next pass, which could compromise the generalization abilities of the resultant LSSVM.
This paper presents a new LSSVM formulation called the Forward Least Squares Approximation SVM (FLSASVM). It is trained by Forward Least Squares Approximation (FLSA) [13], a function approximation method using a least-squares loss function. The FLSASVM loss function decreases monotonically with an increasing number of support vectors, allowing for the removal of slack variables from the equality constraints. Another novelty of the FLSASVM is that it turns the number of support vectors into the regularization parameter, which indicates the tradeoff between generalization abilities and empirical risk. The FLSASVM builds a classifier iteratively: at each step it adds to the solution the single basis function which contributes the largest reduction to the quadratic cost function.
The paper is organized as follows. Section 2 briefly reviews Least Squares SVM principles. The new sparse LSSVM—Forward Least Squares Approximation SVM (FLSASVM)—is introduced in Section 3. Experimental results are given in Section 4 and concluding remarks in Section 5.
2. Least Squares Support Vector Machines
For a classification problem of pairs of training samples , where and , LSSVM algorithms seek the optimal separating hyperplane with the orientation vector of the least norm.
To ensure the presence of the optimal hyperplane, the input data are translated into a reproducing kernel Hilbert space (RKHS) by a mapping function denoted by . To avoid the curse of dimensionality, the mapping is implemented implicitly by means of the “kernel trick.” The idea is that dot products in the RKHS can be represented by a Mercer kernel function in the input space [14]: Thus the linear discriminant function in the feature space can be formulated as where is the orientation vector and is the bias term.
An LSSVM finds the hyperplane parameterized by by solving the following optimization problem [15]: where the slack variable denotes the deviation between the actual output of the LSSVM on sample and its target value . is a parameter which imposes penalty on deviations.
Introducing the Lagrange multipliers for each of the equality constraints gives
Due to the equality constraints, can be either positive or negative according to the Karush-Kuhn-Tucker (KKT) conditions [16]:
The linear equations can be further simplified to where and , , , is the identity matrix of rank , and is a column vector of 1s of length . The solution of an LSSVM is a linear combination of the basis functions whose associated is nonzero:
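To make the linear-system view concrete, the following minimal sketch trains an LSSVM classifier in the widely used formulation of Suykens and Vandewalle, in which the bias and the multipliers solve one dense system with a right-hand side of ones. The function names, the penalty name `gamma`, the kernel width `sigma`, and the kernel form exp(-||x - z||^2 / (2 sigma^2)) are assumptions of this sketch, and a direct solver stands in for the CG method mentioned above.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix; the 2*sigma^2 scaling is an assumption of this sketch."""
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def lssvm_train(X, y, gamma=10.0, sigma=1.0):
    """Solve [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1] for labels y in {-1, +1}."""
    n = len(y)
    omega = np.outer(y, y) * gaussian_kernel(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)      # dense direct solve instead of CG
    return sol[0], sol[1:]             # bias b, multipliers alpha

def lssvm_decision(X_train, y, b, alpha, X_new, sigma=1.0):
    """Decision values f(x) = sum_i alpha_i y_i k(x, x_i) + b."""
    return gaussian_kernel(X_new, X_train, sigma) @ (alpha * y) + b
```

Because each multiplier is proportional to the corresponding slack variable in this formulation, almost every training sample typically ends up with a nonzero multiplier, which is the nonsparseness discussed in the text.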
It has been noted that the LSSVM is almost equivalent to ridge regression [17] since the two methods correspond to an identical optimization problem. Meanwhile, the equivalence between the LSSVM and the Kernel Fisher Discriminant method [18] has also been established [19].
3. Least Squares Approximation Sparse SVM
Equation (3) shows that an LSSVM can also be viewed as a ridge regression model. The optimality conditions (7) indicate that the introduction of the slack variable is the root of the nonsparseness problem since, in regression applications, most slack variables end up nonzero [20]. Yet the introduction of seems unavoidable for representing the training cost, and with it the penalty parameter that indicates the tradeoff between training cost and generalization ability.
This section introduces a new formulation of the Least Squares SVM, namely, the Forward Least Squares Approximation SVM (FLSASVM). The proposed FLSASVM algorithm is motivated by the Least Squares “Forward Approximation & Backward Refinement” method for function approximation, which was first developed in the dynamic system identification community by Li et al.
The FLSASVM dispenses with slack variables by employing the Forward Least Squares Approximation (FLSA) algorithm, a function approximation method with a least-squares loss function. At each iteration, FLSA selects the basis function which causes the steepest reduction in the loss function. These features enable the number of support vectors (SVs), which equals the number of basis functions, to act as the tradeoff between training cost and generalization ability. The sparseness of the FLSASVM solution is thus ensured by construction and is confirmed in the experiments section, where FLSASVMs are sparser than standard SVMs.
In this section, a description of “Forward Approximation” is given, from a machine learning perspective, as a preface to the introduction of the FLSASVM algorithm. The strategy to reduce the computational complexity of FLSASVMs is then presented.
3.1. Forward Least Squares Approximation [13]
Suppose that values of an unknown target function in a Hilbert space are given at input data . A “dictionary” of functions in the Hilbert space is also given, in which is termed a “basis function.” Forward Least Squares Approximation (FLSA) addresses the estimation of with a linear combination of basis functions chosen from the dictionary: where is the number of basis functions which expand , is the set of selected basis functions, and is the associated weight vector,
so that the squared norm of the residue vector, denoted by , is minimized: where and is the output vector of on the input data: where is the vector of decision values of basis function on the input data and . Equation (13) suggests that FLSA can be understood as working entirely in the space.
The weight vector that minimizes the loss function (12) is given as for a matrix which is of full column rank [21]. The resultant minimal loss function is Starting from (15), FLSA defines a set of residue matrices for measuring the contribution of each basis function to the reduction of the loss function: where matrix is of full column rank and is composed of the output vectors of the basis functions. is the identity matrix and set . Then, the following equation holds:
For a which produces a full column rank , it is proved that has the following properties:
Meanwhile, the introduction of simplifies the formulation of which is the evaluation of loss function at : And thus
The contribution, denoted by , of a column vector to the reduction of the loss function can then be expressed explicitly:
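For orientation, the standard least-squares identities behind such residue matrices can be written out as follows. The notation is an assumption of this sketch and is not taken from the paper: $P_k$ collects the $k$ selected output vectors and is of full column rank, $y$ is the target vector, $\theta_k$ the weight vector, $J_k$ the minimal loss, $R_k$ the residue matrix, and $p$ a candidate column with $p^{T} R_k p > 0$.

```latex
\[
\begin{aligned}
\theta_k &= (P_k^{T} P_k)^{-1} P_k^{T} y, \\
R_k &= I - P_k (P_k^{T} P_k)^{-1} P_k^{T}, \qquad R_0 = I, \\
J_k &= y^{T} R_k\, y, \\
R_k^{T} &= R_k, \qquad R_k R_k = R_k, \qquad R_k\, p_j = 0 \ \text{ for every selected column } p_j, \\
R_{k+1} &= R_k - \frac{R_k\, p\, p^{T} R_k}{p^{T} R_k\, p}, \\
J_k - J_{k+1} &= \frac{(y^{T} R_k\, p)^{2}}{p^{T} R_k\, p}.
\end{aligned}
\]
```

The last line is what makes a greedy search cheap: the gain from adding any candidate column can be evaluated without re-solving the full least-squares problem.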
The FLSA algorithm proceeds in a greedy manner, selecting one basis function per iteration. The th iteration identifies the index of the th basis function by solving the optimization problem: where and .
The th iteration of the FLSA algorithm in fact establishes the following linear system, whose solutions are the last elements of in (13):
Thus , which represents a linear equation, is stored for the th iteration. Eventually, iterations build up an upper triangular matrix which represents the linear system of (26). The solution can be computed by the back substitution procedure used in a typical Gaussian elimination process:
The FLSA algorithm, in fact, is closely related to the Orthogonal Least Squares (OLS) method [22–24], which also allows for the explicit formulation of the contribution of a basis function made to the reduction of the squared error loss.
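The greedy selection described in this subsection can be sketched as below. This is a deflation-based variant in the spirit of OLS, not the authors' exact recursion: each candidate column is kept orthogonal to the columns already selected, so its contribution to the loss reduction has a closed form. The function name `forward_least_squares`, the tolerance `tol`, and the final call to `np.linalg.lstsq` (used here in place of back substitution) are assumptions of this sketch.

```python
import numpy as np

def forward_least_squares(P, y, k, tol=1e-10):
    """Greedily select up to k columns of the dictionary matrix P to approximate y."""
    P = np.asarray(P, dtype=float)
    y = np.asarray(y, dtype=float)
    P_res = P.copy()                        # candidates deflated against the selection
    r = y.copy()                            # current residual of y
    base = np.einsum("ij,ij->j", P, P)      # squared norms of the original columns
    selected, losses = [], []
    for _ in range(k):
        norms = np.einsum("ij,ij->j", P_res, P_res)
        usable = norms > tol * np.maximum(base, 1e-300)   # skip dependent columns
        if selected:
            usable[selected] = False
        gain = np.zeros_like(norms)
        gain[usable] = (P_res[:, usable].T @ r) ** 2 / norms[usable]
        j = int(np.argmax(gain))
        if gain[j] <= tol * (y @ y):        # nothing left to explain
            break
        q = P_res[:, j] / np.sqrt(norms[j])
        r = r - (q @ r) * q                 # loss drops by exactly gain[j]
        P_res = P_res - np.outer(q, q @ P_res)
        selected.append(j)
        losses.append(float(r @ r))
    if not selected:
        return selected, np.zeros(0), losses
    weights, *_ = np.linalg.lstsq(P[:, selected], y, rcond=None)
    return selected, weights, losses
```

The returned `losses` list is non-increasing by construction, which mirrors the monotone decrease of the training cost with the number of selected basis functions.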
3.2. Forward Least Squares Approximation SVMs
As with standard SVMs, the formulation of LSSVMs embodies the Structural Risk Minimization (SRM) principle, which is illustrated in Figure 1. There the dotted line represents the upper bound on the complexity term of the function set from which the solution is chosen, and the dashed line represents the empirical risk. SRM minimizes the upper bound on the expected risk (generalization error), for which the best tradeoff between the complexity term and the empirical risk must be found.
To this end, a regularization term is introduced to indicate the tradeoff between the complexity term and the empirical risk, both of which LSSVMs explicitly formulate. Model selection is performed in the domain of in search of its optimal value.
In the FLSA algorithm, by contrast, it can be concluded from (22) that the training cost decreases monotonically as the number of selected basis functions increases. The training cost can then be plotted against as a curve similar to that of the empirical risk in Figure 1.
This fact motivates the following two innovations to the traditional formulation of LSSVMs: the employment of the parameter as the regularization term, and the avoidance of the term to represent empirical risk by using the FLSA as the training algorithm, which minimizes the summed squared residues for any value of . As a result, a new formulation of LSSVMs, namely, the Forward Least Squares Approximation Sparse SVM (FLSASVM), which is restricted to be trained by the FLSA algorithm, is proposed: where is composed of the indices of the support vectors and is the cardinality of the set. The addition of the term to (27) was first introduced in [25].
In an FLSASVM, the parameter is interpreted as the number of nonzero Lagrange multipliers, that is, the number of support vectors (SVs). This is seen more clearly in its Lagrangian, obtained by introducing the Lagrange multipliers for each of the equality constraints: The optimal point requires that
The linear equations can be further simplified to where and , , , is a by matrix of 1s, and , where .
It is worth noting that, in this paper, in (32) is also referred to as as it has support vectors. Thus, if all data samples are used as support vectors, then becomes .
FLSA can be applied to train an FLSASVM using a kernel-based dictionary of candidate basis functions. Since is defined to be the number of SVs, the FLSASVM achieves a more direct control of the sparseness of its solution.
3.3. Automatic Detection of Linear Dependencies
Since the kernel matrix in (32) is positive semidefinite, it is easy to prove the positive semidefiniteness of the matrix . Linear dependencies are therefore likely to occur among its column vectors, each of which is the evaluation of a basis function on the training data. Assume that columns have been selected iteratively and are linearly independent, where are their column indices in chronological order. Let denote any column which remains an available candidate in and which can be expressed as a linear combination of . It is then desirable to remove from the candidates in order to (i) ensure the sparseness of the solution and (ii) avoid any undue computation concerning , since
where .
With the introduction of the residue matrix at each iteration as , the following property holds according to (18) and (20): Thus updating the dictionary by , the column vector(s) becomes , that is, automatically pruned. Hence, at each iteration of the FLSASVM, any column vector(s) which can be represented by a linear combination of previously selected columns can be automatically pruned. This merit of the FLSASVM is one contributor to the sparseness of the resultant solution.
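The pruning property can be checked numerically with a small sketch. It assumes the usual projection form of the residue matrix, I - P_s (P_s^T P_s)^{-1} P_s^T, for the selected columns P_s; the function names and the relative tolerance `tol` are hypothetical. A candidate column that is a linear combination of the selected columns is mapped to (numerically) zero and can be discarded.

```python
import numpy as np

def residue_matrix(P_sel):
    """Projector onto the orthogonal complement of the selected columns (full column rank assumed)."""
    n = P_sel.shape[0]
    return np.eye(n) - P_sel @ np.linalg.solve(P_sel.T @ P_sel, P_sel.T)

def dependent_columns(P, selected, tol=1e-8):
    """Indices of unselected, nonzero columns of P that lie in the span of the selected ones."""
    R = residue_matrix(P[:, selected])
    rel = np.linalg.norm(R @ P, axis=0) / np.linalg.norm(P, axis=0)
    return [j for j in range(P.shape[1]) if j not in selected and rel[j] < tol]

# Illustrative data: the last column is built as a combination of columns 0 and 1.
rng = np.random.default_rng(0)
P_demo = rng.normal(size=(10, 4))
P_demo = np.column_stack([P_demo, P_demo[:, 0] + 2.0 * P_demo[:, 1]])
pruned = dependent_columns(P_demo, selected=[0, 1])
```

In a forward selection loop the same test comes for free, since the deflated candidate columns are already available at every iteration.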
Algorithm 1 gives the pseudocode of the FLSASVM algorithm in detail.

3.4. Computation Complexity
As discussed in [26], for a single round of training, the computational complexity of FLSASVM is and space complexity for FLSASVM is .
4. Experimental Results
A set of experiments was performed to evaluate the performance of the proposed FLSASVM algorithm. The FLSASVMs were first applied to the two-spiral benchmark [27] for an illustrative view of their generalization abilities. The following Gaussian kernel function was used for the experiments in this paper: The standard SVMs were implemented with LIBSVM [28]. The LSSVM trained by the CG method was implemented with the LSSVMlab toolbox [29], and all experiments were run on a Pentium 4 3.2 GHz processor under Windows XP with 2 GB of RAM.
4.1. Experiments on Two-Spiral Dataset
The 2D “two-spiral” benchmark is known to be difficult for pattern recognition algorithms and poses great challenges to neural networks [30]. The training set consists of 194 points of the  plane, half of which have a target value of output and half a target value of −1. These training points describe two intertwining spirals that go around the origin three times, as shown in Figure 2, where the two categories are marked by “+” and “o,” respectively.
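For readers who want to reproduce the data, the sketch below generates 194 points with the generator commonly attributed to Lang and Witbrock for this benchmark. The constants (97 points per spiral, an angular step of pi/16, a maximum radius of 6.5) are assumptions of this sketch rather than values stated in the paper, and the labels are set to +1 and −1.

```python
import numpy as np

def two_spirals(points_per_class=97):
    """Return (X, labels) for two interlocking spirals, each making three turns."""
    i = np.arange(points_per_class)
    angle = i * np.pi / 16.0                 # 96 * pi / 16 = 6 * pi, i.e. three revolutions
    radius = 6.5 * (104 - i) / 104.0         # shrinks from 6.5 towards the origin
    x = radius * np.sin(angle)
    y = radius * np.cos(angle)
    first = np.column_stack([x, y])          # class +1
    second = -first                          # class -1 is the point reflection of class +1
    X = np.vstack([first, second])
    labels = np.concatenate([np.ones(points_per_class), -np.ones(points_per_class)])
    return X, labels

X_spiral, y_spiral = two_spirals()
```

Since the second spiral is the point reflection of the first, the two classes interleave at every radius, which is what makes the problem hard for linear or weakly nonlinear classifiers.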
For the FLSASVM, the parameter setting of the Gaussian kernel was as in (36). With a feasible range of , the optimal regularization parameter was found to be , which gave a leave-one-out cross-validation (LOOCV) accuracy of . With standard SVMs, the parameter setting was and , whose LOOCV accuracy was . The SVM classifier required support vectors (SVs). The graphical outputs of the FLSASVM and the SVM are given in Figure 3. They show that both the SVM and the FLSASVM generally recognized the pattern successfully, reproducing the two spirals in a smooth and regular manner. However, in the area around the coordinate of , the FLSASVM performed better than the conventional SVM, whose decision boundary was biased towards the “o” class.
Figure 3: Graphical outputs of the FLSASVM and the SVM on the two-spiral benchmark, panels (a) and (b).
Besides the reduction in the number of SVs, which is very notable for such a “hard” classification problem, the FLSASVMs are comparatively superior in the following two aspects. In SVMs, it often occurs that, given a fixed kernel parameter, the optimal CV accuracy can be obtained from multiple settings of the regularization parameter. The “two-spiral” problem is a case in point. For the best LOOCV accuracy, with a fixed , the value of can also take values beyond . There is no specific rule as to which option is the best, and normally the smallest is chosen for a scaled-down feasible region. In FLSASVMs, by contrast, the setting of the regularization term is more tractable, since different options correspond to different learning errors which can be easily tracked with (23). For multiple values of which produce the same optimal CV accuracy, the largest is chosen for a smaller learning error.
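The tie-breaking rule in the last sentence can be written as a small helper. The function name, the tolerance argument, and the accuracy values in the usage line are hypothetical and serve only to illustrate the rule; they are not results from the paper.

```python
def pick_num_support_vectors(cv_accuracy, tol=1e-12):
    """Among candidate SV counts with the best CV accuracy, return the largest.

    cv_accuracy maps a candidate number of support vectors to its CV accuracy.
    """
    best = max(cv_accuracy.values())
    tied = [k for k, acc in cv_accuracy.items() if best - acc <= tol]
    return max(tied)

# Hypothetical accuracies, for illustration only.
chosen = pick_num_support_vectors({10: 0.90, 20: 0.95, 30: 0.95, 40: 0.93})
```

Choosing the largest tied value trades a few extra support vectors for a smaller training error, whereas the convention described for the SVM penalty parameter picks the smallest tied value.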
In fact, for , the LOOCV accuracy remained stable at . Thus an FLSASVM was also trained with the parameter settings of , which is depicted in Figure 4(a). For comparison, an LSSVM was trained with the parameter settings and for the Gaussian RBF. Its solution was parameterized by the entire set of 194 points and is illustrated in Figure 4(b). The decision boundaries of both the LSSVM and the FLSASVM were generally smooth and followed the pattern satisfactorily, despite the slightly biased segments around the coordinates of and . Still, the FLSASVM performed much better than the LSSVM in the area around the origin.
Figure 4: Outputs on the two-spiral benchmark of (a) the FLSASVM and (b) the LSSVM.
Table 1 also compares the time cost of 10-fold cross-validation (CV) on the regularization parameter for FLSASVMs, SVMs, and LSSVMs. For FLSASVMs, the regularization term was assigned evenly spaced integers with an interval of within the range of . For SVMs and LSSVMs, the parameter was sequentially increased from to in multiples of . Each fold was split into a training set and a validation set, with a division pattern of training versus validation or training versus validation. The total computation time of the classifiers on the various folds for each algorithm is reported in Table 1. The time cost of the folds altogether is given in the last row of Table 1. The 10-fold CV of FLSASVMs is over 4 times faster than that of SVMs. The time cost of FLSASVMs remained competitive with that of LSSVMs using the SMO algorithm and was much lower than that of LSSVMs using the CG method.
These comparisons indicate that the FLSASVM is very promising in easing the nonsparseness problem of an LSSVM, in addition to its strong generalization performance. It can also obtain a solution whose sparseness is competitive with, or even superior to, that of a standard SVM.
4.2. More Benchmark Problems
The FLSASVM algorithm was applied to three small-scale binary problems: Banana, Image, and Splice, which are accessible at http://theoval.cmp.uea.ac.uk/matlab/#benchmarks/. Among all the realizations of each benchmark, the first was used. Experiments were also performed on Banana dataset [31], which is a medium-scale binary learning problem. Detailed information on the datasets is given in Table 2.
FLSASVMs were compared with SVMs, LSSVMs, the fast sparse approximation scheme for LSSVM (FSALSSVM), and its variant, called PFSALSSVM, both of which were proposed by Jiao et al. [12]. The parameter of FSALSSVMs and PFSALSSVMs was uniformly set to 0.5, which was empirically shown to work well with most datasets [12]. Comparisons were also made against D-optimality orthogonal forward regression (DOFR) [32], a technique for nonlinear function estimation that is designed to yield sparse solutions. The parameters, which were the penalty constant and in (36), were tuned by tenfold cross-validation (CV).
Table 3 presents the classification accuracy of the SVM algorithms. The best results among the four SVM algorithms are highlighted. It can be seen that the FLSASVM achieved classification accuracy comparable to that of the standard SVM, the conventional LSSVM, FSALSSVM, PFSALSSVM, and D-optimality OFR.
The numbers of SVs are compared in Table 4. For all the learning problems, the FLSASVM required far fewer SVs than the SVM and the conventional LSSVM. In particular, the reduction in SVs compared with the SVM reached over 98% and 80% on the Ringnorm and Banana datasets, respectively. The FLSASVM maintained its edge over FSALSSVM and PFSALSSVM, particularly on the Ringnorm and Banana datasets. Although the DOFR method falls into the category of unsupervised learning algorithms, FLSASVM, mathematically, has the closest link with the DOFR method. The results in Table 3 demonstrate that the DOFR method remained competitive with the FLSASVM on the Ringnorm and Banana datasets. However, on the Splice and Image datasets, the DOFR method failed to achieve any sparse solutions.
5. Conclusions
This paper proposed a new LSSVM algorithm, the FLSASVM, which is trained specifically by the FLSA method with a squared error loss. The FLSASVM iteratively adds to the solution an optimal basis function, each of which is associated with a specific training example. The algorithm turns the number of SVs into the regularization term that sets the tradeoff between generalization abilities and training cost. As a result, the solution of the FLSASVM is extremely sparse compared to that of LSSVMs. Experiments showed that the FLSASVM algorithm maintained an accuracy comparable to that of the standard SVM, the LSSVM, and a number of recently developed sparse learning algorithms. Yet the FLSASVM showed definite advantages over its counterparts regarding the sparseness of the solution. On small datasets like the two-spiral benchmark, the FLSASVM training algorithm also proved to be more efficient than the CG method.