The solution of a Least Squares Support Vector Machine (LS-SVM) suffers from the problem of nonsparseness. The Forward Least Squares Approximation (FLSA) is a greedy approximation algorithm with a least-squares loss function. This paper proposes a new Support Vector Machine for which the FLSA is the training algorithm—the Forward Least Squares Approximation SVM (FLSA-SVM). A major novelty of this new FLSA-SVM is that the number of support vectors is the regularization parameter for tuning the tradeoff between the generalization ability and the training cost. The FLSA-SVMs can also detect the linear dependencies in vectors of the input Gramian matrix. These attributes together contribute to its extreme sparseness. Experiments on benchmark datasets are presented which show that, compared to various SVM algorithms, the FLSA-SVM is extremely compact, while maintaining a competitive generalization ability.

1. Introduction

The last decade has seen widespread applications of Least Squares Support Vector Machines (LS-SVM) [1, 2] to a variety of classification problems. The LS-SVM involves finding a separating hyperplane of maximal margin and minimizing the empirical risk via a Least Squares loss function. Here the optimization is subject to equality constraints, as opposed to the inequality ones used with a standard SVM [3, 4]. Thus the LS-SVM successfully sidesteps the quadratic programming (QP) required for the training of the standard SVM. As a result, an LS-SVM classifier is equivalent to a set of linear algebraic equations, which are usually solved by the conjugate gradient (CG) method [5]. Empirical studies suggest that an LS-SVM possesses very competitive generalization abilities compared to a regular SVM [6]; improvement has been made to Suyken’s algorithm by Chu et al. [7] whose proposal is also based on CG method but with a reduced time complexity. Meanwhile Keerthi et al. advocated training an LS-SVM using the sequential minimal optimization (SMO) algorithm [8].

While these algorithms have indeed made an LS-SVM more computationally attractive, the nonsparseness of its solution still remains a major, and as yet unsolved, bottleneck. Sparseness in solutions is essential for reducing the time required in predicting the class membership of unlabelled data. The general approach to addressing this issue for an LS-SVM is iterative shrinking of the training set. Here the importance of training samples is evaluated from the weights assigned to them after training. Samples of less importance are removed, and then the remainder forms a reduced training set to be learnt again. In view of the linear relationship between the approximation error and the support value for a training sample, a straightforward pruning strategy is to remove samples whose absolute support values are trivial [9]. A later paper recommended pruning a sample which introduces the smallest error if omitted [10]. However, this necessitates the inversion of the kernel matrix, which is indicative of high computation complexity. Zeng and Chen [11] suggested a pruning method based on the SMO formulation of an LS-SVM, which leads to faster retraining. Other authors [12] proposed solving the dual form of an LS-SVM by iterative addition of basis functions from the kernel dictionary either until all the training samples are traversed or until the approximation error is lower than a preset threshold. Despite the lower computation cost with the backfitting scheme adopted, Lagrangian multipliers associated with previously selected training samples were inherited into the next pass, which could compromise the generalization abilities of the resultant LS-SVM.

This paper presents a new LS-SVM formulation called the Forward Least Squares Approximation SVMs (FLSA-SVM). It is trained by Forward Least Squares Approximation(FLSA) [13], a function approximation method using a Least Squares loss function. The FLSA-SVM loss function decreases monotonically with an increasing number of support vectors, allowing for the removal of slack variables from the equality constraints. Another novelty with the FLSA-SVM lies in the fact that it cleverly transforms the number of support vectors to be the regularization parameter, which indicates the tradeoff between generalization abilities and empirical risk. The FLSA-SVM builds a classifier by iteratively selecting a single basis function for the solution, which contributes the largest reduction to the quadratic cost function.

The paper is organized as follows. Section 2 briefly reviews Least Squares SVM principles. The new sparse LS-SVM—Forward Least Squares Approximation SVM (FLSA-SVM)—is introduced in Section 3. Experimental results are given in Section 4 and concluding remarks in Section 5.

2. Least Squares Support Vector Machines

For a classification problem of pairs of training samples , where and , LS-SVM algorithms seek the optimal separating hyperplane with the orientation vector of the least norm.

To ensure the presence of the optimal hyperplane, the input data are translated into a reproducing kernel hilbert space (RKHS) by a mapping function denoted by . To avoid the curse of dimensionality, the mapping is implemented implicitly by the introduction of “kernel trick.” Its idea is that dot products in the RKHS space can be represented by a Mercer kernel function in input space [14]: Thus the linear discriminant function in the feature space can be formulated as where is the orientation vector and is the bias term.

An LS-SVM finds the hyperplane parameterized by by solving the following optimization problem [15]: where the slack variable denotes the deviation between the actual output of the LS-SVM on sample and its target value . is a parameter which imposes penalty on deviations.

Introducing the Lagrange multipliers    for each of the equality constraints gives

Due to the equality constraints, can either be positive or negative according to the Karush-Kuhn-Tucker (KKT) conditions [16]:

The linear equations can be further simplified to where and ,  ,  , is unity matrix of rank , and is a column vector of 1s of length. The solution of an LS-SVM is a linear combination of basis function whose associated is nonzero:

It has been noted that the LS-SVM is almost equivalent to ridge regression [17] since the two methods corresponds to an identical optimization problem. Meanwhile, the equivalence between the LS-SVM and the Kernel Fisher Discriminant method [18] has also been established [19].

3. Least Squares Approximation Sparse SVM

Equation (3) shows that an LS-SVM can also be viewed as a ridge regression model. And the optimality conditions (7) indicate that the introduction of slack variable is the root of the nonsparseness problem, since in regression applications, most slack variables end as nonzero [20]. But the introduction of seems inevitable for the representation of training cost and thus the penalty parameter to indicate the tradeoff between training cost and generalization abilities.

The section introduces a new formulation of Least Squares SVM, namely, Forward Least Squares Approximation SVM (FLSA-SVM). The proposed FLSA-SVM algorithm is motivated by the Least Squares “Forward Approximation & Backward Refinement” method for function approximation, which was first developed in the dynamic system identification community by Li et al.

FLSA-SVM is rid of slack variables by employing an approximation function of Least Squares loss function which is Forward Least Squares Approximation (FLSA) algorithm. FLSA approximation function iteratively selects a basis function which causes the steepest reduction in the loss function. The features enable the number of support vectors (SVs), which equals to the number of basis functions, to be the tradeoff between training cost and generalization abilities. In FLSA-SVMs, the sparseness of its solution is ensured and confirmed by the experiments section in which FLSA-SVMs are more sparse than standard SVMs.

In this section, a description of  “Forward Approximation” is given, from a machine learning perspective, as a preface to the introduction of LSA-SVM algorithm. The strategy to reduce the computation complexity of FLSA-SVMs is then presented.

3.1. Forward Least Squares Approximation [13]

Given values of an unknown target function in a Hilbert space at input data . A “dictionary” of functions in the Hilbert space is also given in which    is termed as a “basis function.” Forward Least Squares Approximation (FLSA) addresses the estimation of with a linear combinations of basis functions chosen from the dictionary: whereis the number of basis functions which expand , is the set of selected basis functions, and is the associated weight vector,

so that the squared norm of the residue vector, denoted by , is minimized: where and is the output vector of on the input data: where is the vector of decision values of basis function on the input data and . Equation (13) suggests that FLSA can be understood as working entirely in the space.

The weight vector that minimizes the loss function (12) is given as for a matrix which is of full column rank [21]. The resultant minimal loss function is Starting from (15), FLSA defines a set of residue matrices    for the measurement of the contribution of each basis function to the reduction of loss function: where matrix is full column rank and is composed of output vectors of basis function. is the unity matrix and set . Then, the following equation holds:

For a    which produces a full column rank , it is proved that has the following properties:

Meanwhile, the introduction of simplifies the formulation of which is the evaluation of loss function at : And thus

The contribution, denoted by , of a column vector makes the loss function can able to be explicitly expressed:

FLSA algorithm proceeds in a greedy manner which selects one basis function per iteration. The th iteration identifies the index of the th basis function by solving the optimization problem: where and .

The th iteration of FLSA algorithm, in actual fact, establishes the following linear system whose solutions are the last elements of in (13):

Thus    which represents a linear equation is stored for the th iteration. Eventually, iterations build up an upper triangle which represents a linear system of (26). The solution can be computed by performing a back substitution procedure that is used by a typical Gaussian elimination process:

The FLSA algorithm, in fact, is closely related to the Orthogonal Least Squares (OLS) method [2224], which also allows for the explicit formulation of the contribution of a basis function made to the reduction of the squared error loss.

3.2. Forward Least Squares Approximation SVMs

As with standard SVMs, the formulation of LS-SVMs embodies the Structural Risk Minimization (SRM) principle which can be illustrated by Figure 1, in which the dotted line represents the upper bound on the complexity term of function set from which solution is chosen and the dash line the empirical risk. SRM minimizes the upper bound on the expected risk (generalization error), for which the best tradeoff between the complexity term and the empirical risk is required to be found.

To this end, a regularization term is introduced to indicate the tradeoff between the complexity term and the empirical risk, both of which LS-SVMs explicitly formulate. Model selection is performed in the domain of in search for its optimal value.

While in FLSA algorithm, it can be concluded from (22) that the training cost monotonously decreases to the increase in the number of selected basis functions . The training cost can be then plotted against into a curve similar to that of the empirical risk in Figure 1.

This fact motivates the following two innovations to traditional formulation of LS-SVMs: the employment of the parameter as the regularization term; the avoidance of the term to represent empirical risk by using the FLSA as the training algorithm, which minimizes the summed squared residues for any value of . As a result, a new formulation of LS-SVMs—namely, Forward Least Squares Approximation Sparse SVM (FLSA-SVM), which is restricted to be trained by FLSA algorithm, is proposed: where is composed of the indices of the support vectors and is the cardinality of the set. The addition of term to (27) is first introduced by [25].

In an FLSA-SVM, the parameter is interpreted as the number of nonzeros of Lagrangian multipliers, that is, the number of support vectors (SVs), which is seen more obviously in its Lagrangian by introducing the Lagrange multipliers    for each of the equality constraints giving The optimal point requires that

The linear equations can be further simplified to where and ,  ,  , is a -by- matrix of 1s, and , where .

It is worth attention that, in this paper, in (32) is also referred to as as it has support vectors. Thus, if all data samples are used as support vectors, then becomes .

FLSA can be applied to train an FLSA-SVM using a kernel-based dictionary of candidate basis functions. Since is defined to be the number of SVs, it thus achieved a more direct control of the sparseness of its solution.

3.3. Automatic Detection of Linear Dependencies

Since the kernel matrix in (32) is a semipositive definite, it is easy to prove the semipositive definiteness of the matrix . Then it is likely the occurrence of linear dependencies among column vectors, each being evaluation of a basis function on the training data. Assume that columns are selected iteratively and linearly independent, where are their column indices in chronological sequence. Denote to be any column which remains as available candidates in and meanwhile can be expressed as a linear combination of . It is thus naturally desired to remove as candidates in order to ensure the sparseness of the solution; avoid any undue computation concerning since

where .

With the introduction of the residue matrix at each iteration as , the following property holds according to (18) and (20): Thus updating the dictionary by , the column vector(s) becomes , that is, automatically pruned. Hence, at each iteration of the FLSA-SVM, any column vector(s) which can be represented by a linear combination of previously selected columns can be automatically pruned. This merit of the FLSA-SVM is one contributor to the sparseness of the resultant solution.

Algorithm 1 gives the pseudocode of the FLSA-SVM algorithm in detail.

(i) The data set , ),…, ( ,
(ii) which is the number of support vectors desired in the expansion of the solution and
(iii) A dictionary of basis functions
(i) Current residue vector y, current dictionary which is initially a matrix of evaluations
 of candidate basis functions on training data:
(ii) The matrix and the vector both starts as empty is appended a row and grows
  by one extra element at each iteration, which in the end forms a linear system.
(iii) A variable which is the count of candidate basis functions and a vector
  which contains the indices of basis functions. At the start, for and .
(iv) is made a pointer to the current selected basis functions:
(v) The residue vector is reduced by as the target values for the next linear system of size :
(vi) Update the dictionary matrix and prune the candidate basis functions which can be
 represented as a linear combinations of the previously selected ones:
(vii) If equations and hold where is the number of selected basis functions and
the count of available candidates, it suggests that the initial value setting on has
  exceeded the rank of . Terminate the loop and reset :
(i) basis functions are chosen whose indices are the first elements of .
  columns of matrix with the indices and forms a linear system, on which
  the process of back substitution is performed for the solution:
 (i) The solution is defined by

3.4. Computation Complexity

As discussed in [26], for a single round of training, the computational complexity of FLSA-SVM is and space complexity for FLSA-SVM is .

4. Experimental Results

A set of experiments were performed to evaluate the performance of the proposed FLSA-SVM algorithm. The FLSA-SVMs were first applied to the two-spiral benchmark [27] for an illustrative view of their generalization abilities. The following Gaussian kernel function was used for the experiments in this paper: The standard SVMs were implemented by LIBSVM [28]. The LS-SVM trained by CG method was implemented by the toolbox of LS-SVMlab [29] and all experiments were run on a Pentium 4 3.2 GHz processor under Windows XP with 2 GB of RAM.

4.1. Experiments on Two-Spiral Dataset

The 2D “two-spiral” benchmark is known to be difficult for pattern recognition algorithms and poses great challenges to neural networks [30]. The training set consists of 194 points of the - plane, half of which has a target value of output and half a target value of −1. These training points describe two intertwining spirals that go around the origin three times, as shown in Figure 2, where the two categories are marked, respectively, by “+” and “o.”

For the FLSA-SVM, the parameter setting of Gaussian kernel was as in (36). With a feasible range of , the optimal regularization parameter was found to be which gave a leave-one-out cross-validation (LOOCV) accuracy of . With standard SVMs, the parameter setting was and , whose LOOCV accuracy is . The SVM classifier was required in support vectors (SVs). The graphical outputs of the FLSA-SVM and the SVM were given by Figure 3. It showed that generally both SVM and FLSA-SVM algorithms recognized the pattern successfully, outputting “two-spiral” in a very smooth and regular manner. But at the area around the coordinate of , the FLSA-SVM showed better performance than the conventional SVM whose decision hyperplane was biased towards the “o” class.

Despite the reduction in the number of SVs which is very notable for such a “hard” classification problem, the FLSA-SVMs are comparatively superior to the following two aspects. In SVMs, it often occurred that given a fixed kernel parameter, the optimal CV accuracy can be obtained from multiple value settings on the regularization parameter. The “two-spiral” problem is a case in point. For the best LOOCV accuracy, with a fixed , the value of can also opt for beyond . There is no specific rule as to which option is the best, and normally the smallest is chosen for a scaled-down feasible region. While in FLSA-SVMs, the setting for the value of the regularization term is more tractable since different options correspond to different learning errors which can be easily tracked with (23). For multiple values of which produce the same optimal CV accuracy, the largest is chosen for a smaller learning error.

In fact, for , the LOOCV accuracy remained stable at . Thus an FLSA-SVM was also trained with the parameter settings of which was depicted in Figure 4(a). For comparisons, an LS-SVM was trained whose parameter settings are and for Gaussian RBF. The solution was parameterized by the entire 194 points and illustrated in Figure 4(b). The decision boundaries of both of the LS-SVM and the FLSA-SVM were generally smooth and followed the pattern satisfactorily, despite the slightly biased segments around the coordinates of and . But still the FLSA-SVM performed much better at the origin area than the LS-SVM.

Table 1 also compares the time cost of 10-fold cross-validation (CV) on the regularization parameter by FLSA-SVMs, SVMs and LS-SVMs. For FLSA-SVMs, the regularization term was assigned integers evenly with an interval of within the range of . For SVMs and LS-SVMs, the parameter was sequentially increased from to in multiple of . Each fold was split into a training set and a validation set, with a division pattern of training versus validation or training versus validation. The total computation time of the classifiers on various folds for each algorithm was reported in Table 1. The time cost of the -fold altogether was given in the last row entry of Table 1. It was clearly shown that the 10-fold cross-validation (CV) of FLSA-SVMs is over 4 times faster than SVMs. The time complexity of FLSA-SVMs remained competitive to that of LS-SVMs using the SMO algorithm and much more reduced than LS-SVMs using the CG method.

These comparisons prove that the FLSA-SVM is very promising in easing the nonsparseness problem of an LS-SVM, in addition to its outstanding generalization performance. And it can obtain a solution whose sparseness is competitive, or even superior, to that of a standard SVM.

4.2. More Benchmark Problems

The FLSA-SVM algorithm was applied to 3 small-scale binary problems: Banana, Image, and Splice which are accessible at http://theoval.cmp.uea.ac.uk/matlab/#benchmarks/. Among all the realizations for each benchmark, the first one of them was used. Experiments were also performed on Banana dataset [31], which is a medium-scale binary learning problem. The detailed information of the datasets was given in Table 2.

FLSA-SVMs were compared with SVMs, LS-SVMs, the fast sparse approximation scheme for LS-SVM (FSALS-SVM), and its variant, called PFSALS-SVM, both of which were proposed by Jiao et al. [12]. The parameter of FSALS-SVMs and PFSALS-SVMs was uniformly set to be 0.5 which was empirically proved to work well with most datasets [12]. Comparisons were also made against D-optimality orthogonal forward regression (D-OFR) [32] which is a technique for nonlinear function estimation, promised to yield sparse solutions. The parameters, which were the penalty constant and in (36), were tuned by tenfold cross-validation (CV).

Table 3 presented the classification accuracy of the SVM algorithms. The best results among the four SVM algorithms were highlighted. It can be seen that the FLSA-SVM achieved comparable classification accuracy to the standard SVM, the conventional SVM, FSAL-SVM, PFSALS-SVM, and D-optimality OFR.

The numbers of SVs were compared in Table 4. For all the learning problems, the FLSA-SVM required much less SVs than the SVM and the conventional SVM. In particular, the reduction in SVs reached over 98% and 80%, respectively, on Ringnorm and Banana datasets compared with the SVM. FLSA-SVM has maintained its edges over FSAL-SVM and PFSALS-SVM, particularly with the Ringnorm and Banana datasets. Although the D-OFR method falls into the category of unsupervised learning algorithms, FLSA-SVM, mathematically, has the closest link with the D-OFR method. The results in Table 3 demonstrated that the D-OFR method remained competitive compared with FLSA-SVM on Ringnorm and Banana datasets. However, on the Splice and Image datasets, the D-OFR method failed to achieve any sparse solutions.

5. Conclusions

The paper proposed a new LS-SVM algorithm—the FLSA-SVM which is trained specifically by the FLSA method of minimized squared error loss. The FLSA-SVM iteratively selects an optimal basis function which is associated with a specific training example into the solution. The algorithm cleverly adapts the number of SVs into the regularization term as the tradeoff between generalization abilities and training cost. As a result, the solution of the FLSA-SVMs is extremely sparse compared to LS-SVMs. Experiments showed that the FLSA-SVM algorithm maintained a comparable accuracy compared to the standard SVM, the LS-SVM and a number of recently developed sparse learning algorithms. Yet the FLSA-SVM showed definite advantages to its counterparts regarding the sparseness of the solution. On small datasets like the two-spiral benchmark, the FLSA-SVM training algorithm also proved to be more efficient than the CG method.