Research Article  Open Access
Least Absolute Deviation Support Vector Regression
Abstract
Least squares support vector machine (LSSVM) is a powerful tool for pattern classification and regression estimation. However, LSSVM is sensitive to large noise and outliers since it employs the squared loss function. To address this problem, we propose in this paper an absolute deviation loss function to reduce the effect of outliers and derive a robust regression model termed least absolute deviation support vector regression (LADSVR). The proposed loss function is not differentiable, so we approximate it by constructing a smooth function and develop a Newton algorithm to solve the resulting model. Numerical experiments on both artificial and benchmark datasets demonstrate the robustness and effectiveness of the proposed method.
1. Introduction
Support vector machine (SVM), introduced by Vapnik [1] and Cristianini and Taylor [2], has gained increasing popularity over the past decades as a modern machine learning approach with a strong theoretical foundation and success in many real-world applications. However, its training cost is high, namely O(m^3), where m is the number of training samples. Many accelerating algorithms have been proposed to reduce this computational effort. Traditionally, SVM is trained by means of decomposition techniques such as SMO [3, 4], chunking [5, 6], and LIBSVM [7], which solve the dual problem by optimizing a small subset of the variables at each iteration. Another accelerating approach is the least squares SVM introduced by Suykens and Vandewalle [8], which replaces the inequality constraints with equality ones; training then reduces to solving a linear system of equations and is therefore extremely fast.
LSSVM achieves good performance on various classification and regression estimation problems. LSSVR is optimal when the error variables follow a Gaussian distribution, because it minimizes the sum of squared errors (SSE) over the training samples [9]. However, datasets subject to heavy-tailed errors or outliers are commonly encountered in applications, and the LSSVR solution may then lack robustness. In recent years, much effort has been devoted to increasing the robustness of LSSVR. The most common approach adopts weighting strategies to reduce the influence of outliers [9–13]: different weight factors are assigned to the error variables so that less important samples or outliers receive smaller weights. Another approach improves LSSVR's performance by means of outlier elimination [14–17]. Essentially, LSSVR is sensitive to outliers because the squared loss function overemphasizes their impact.
In this paper, we focus on the situation in which heavy-tailed errors or outliers occur in the targets. In such a situation, it is well known that traditional least squares (LS) may fail to produce a reliable regressor, whereas least absolute deviation (LAD) can be very useful [18–20]. We therefore exploit the absolute deviation loss function to reduce the effect of outliers and derive a robust regression model termed least absolute deviation SVR (LADSVR). Because the absolute deviation loss function is not differentiable, classical optimization methods cannot be applied directly to LADSVR. Recently, several algorithms that train SVMs in the primal space have been proposed because of their computational efficiency; moreover, it has been pointed out that primal methods are superior to dual methods when the goal is to find an approximate solution [21, 22]. We therefore smooth the loss by constructing a differentiable approximation and develop a Newton algorithm to solve the robust model in the primal space. Numerical experiments on both artificial and benchmark datasets reveal the efficiency of the proposed method.
The paper is organized as follows. In Section 2, we briefly introduce classical LSSVR and LSSVR in the primal space. In Section 3, we propose an absolute deviation loss function and derive LADSVR. A Newton algorithm for LADSVR is given in Section 4. Section 5 reports experiments on artificial and benchmark datasets to investigate the effectiveness of LADSVR. In Section 6, conclusions are drawn.
2. Least Squares Support Vector Regression
2.1. Classical LSSVR
In this section, we concisely present the basic principles of LSSVR; for more details, the reader can refer to [8, 9]. Consider a regression problem with a training dataset {(x_i, y_i)}_{i=1}^m, where x_i ∈ R^n is the input variable and y_i ∈ R is the corresponding target. To derive a nonlinear regressor, LSSVR solves the following optimization problem:

min_{w,b,ξ} (1/2)‖w‖^2 + (C/2) Σ_{i=1}^m ξ_i^2, s.t. y_i = w^T φ(x_i) + b + ξ_i, i = 1, …, m, (1)

where the ξ_i represent the error variables, ‖w‖^2 represents the model complexity, φ(·) is a nonlinear mapping that maps the input data into a high-dimensional feature space, and C > 0 is the regularization parameter that balances the model complexity and the empirical risk. To solve (1), we introduce Lagrange multipliers α_i and construct a Lagrangian function. Utilizing the Karush-Kuhn-Tucker (KKT) conditions, we obtain the dual problem as the linear system

[0, 1_m^T; 1_m, K + I/C][b; α] = [0; y], (2)

where y = (y_1, …, y_m)^T, 1_m = (1, …, 1)^T, I denotes the identity matrix, K is the kernel matrix with K_ij = k(x_i, x_j) = φ(x_i)^T φ(x_j), and k(·,·) is the kernel function. By solving (2), the regressor is obtained as

f(x) = Σ_{i=1}^m α_i k(x_i, x) + b. (3)
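As a concrete illustration (not the authors' own code), the dual linear system of classical LSSVR with a Gaussian kernel can be sketched in Python/NumPy as follows; all function and parameter names are our own:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian (RBF) kernel matrix between row-sample arrays A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def lssvr_fit(X, y, C=10.0, sigma=1.0):
    """Solve the LSSVR dual linear system for (b, alpha):
    [[0, 1^T], [1, K + I/C]] [b; alpha] = [0; y]."""
    m = len(y)
    K = rbf_kernel(X, X, sigma)
    A = np.zeros((m + 1, m + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(m) / C
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[1:], sol[0]  # alpha, b

def lssvr_predict(X_train, alpha, b, X_new, sigma=1.0):
    """Regressor: f(x) = sum_i alpha_i k(x_i, x) + b."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b
```

The single linear solve is what makes LSSVR training fast compared with a quadratic program.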
2.2. LSSVR in the Primal Space
In this section, we describe LSSVR solved in the primal space, following the growing interest in primal training of SVMs in recent years [21, 22]. Primal optimization of an SVM has strong similarities with the dual strategy [21] and can be implemented with widely used optimization techniques. The optimization problem of LSSVR (1) can be written as

min_{w,b} (1/2)‖w‖^2 + (C/2) Σ_{i=1}^m L(ξ_i), (4)

where ξ_i = y_i − (w^T φ(x_i) + b) and L(ξ) = ξ^2 is the squared loss function, as shown in Figure 1. In the reproducing kernel Hilbert space H, we rewrite the optimization problem (4) as

min_{f∈H} (1/2)‖f‖_H^2 + (C/2) Σ_{i=1}^m L(y_i − f(x_i)). (5)

For the sake of simplicity, we drop the bias b without loss of generalization performance of SVR [21]. According to [21], the optimal function for (5) can be expressed as a linear combination of kernel functions centered at the training samples:

f(x) = Σ_{i=1}^m β_i k(x_i, x). (6)

Substituting (6) into (5), we have

min_β (1/2) β^T K β + (C/2) Σ_{i=1}^m (y_i − K_i β)^2, (7)

where β = (β_1, …, β_m)^T and K_i is the i-th row of the kernel matrix K.
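A minimal sketch of solving the primal LSSVR problem in the coefficient vector β: setting the gradient of the objective in (7) to zero gives (I + CK)β = Cy for positive definite K. Names are illustrative:

```python
import numpy as np

def lssvr_primal_fit(K, y, C=10.0):
    """Minimize 0.5*beta^T K beta + 0.5*C*||y - K beta||^2.
    The gradient is K beta - C*K(y - K beta); setting it to zero
    and cancelling K yields (I + C K) beta = C y."""
    m = len(y)
    return np.linalg.solve(np.eye(m) + C * K, C * y)
```

This closed form is the h-free special case against which the Newton iteration of Section 4 can be checked.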
3. Least Absolute Deviation SVR
As mentioned, LSSVR with the squared loss function L(ξ) = ξ^2 is sensitive to outliers and noise. When there exist outliers far away from the rest of the samples, large errors dominate the SSE, and the regression function of LSSVR deviates severely from its proper position, deteriorating performance.
In this section, we propose an absolute deviation loss function to reduce the influence of outliers. This is graphically depicted in Figure 1, which shows the squared loss function L(ξ) = ξ^2 and the absolute deviation loss L(ξ) = |ξ|. From the figure, the exaggerated effect of the squared loss at points with large errors, compared with the absolute deviation loss, is evident.
The robust LADSVR model can be constructed as

min_β (1/2) β^T K β + C Σ_{i=1}^m |y_i − K_i β|. (8)

However, |ξ| is not differentiable, and the associated optimization problem is difficult to solve. Inspired by the Huber loss function [23], we propose the following smoothed loss:

L_h(ξ) = ξ^2/(2h) + h/2, if |ξ| ≤ h; L_h(ξ) = |ξ|, if |ξ| > h, (9)

where h > 0 is the Huber parameter; its shape is shown in Figure 1. It is easily verified that L_h(ξ) is continuously differentiable, and as h → 0, L_h(ξ) approaches |ξ|. Replacing |ξ| with L_h(ξ) in (8), we obtain

min_β F(β) = (1/2) β^T K β + C Σ_{i=1}^m L_h(y_i − K_i β). (10)
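The smoothed loss can be sketched directly; the piecewise form below assumes the quadratic branch ξ^2/(2h) + h/2 on |ξ| ≤ h, which is continuous and has matching derivative at |ξ| = h:

```python
import numpy as np

def smoothed_abs_loss(xi, h=0.1):
    """Smoothed absolute deviation loss:
    L_h(xi) = xi^2/(2h) + h/2  if |xi| <= h   (quadratic part)
            = |xi|             otherwise      (linear part).
    At |xi| = h both branches equal h and both derivatives equal 1,
    so L_h is continuously differentiable; L_h -> |xi| as h -> 0."""
    xi = np.asarray(xi, dtype=float)
    quad = np.abs(xi) <= h
    return np.where(quad, xi ** 2 / (2 * h) + h / 2, np.abs(xi))
```

Note that, unlike the squared loss, the linear branch grows only proportionally to the error, which is what caps the influence of outliers.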
4. Newton Algorithm for LADSVR
Since the objective function F(β) of (10) is continuous and differentiable, (10) can be solved by a Newton algorithm. At the k-th iteration, let ξ^k = y − Kβ^k and divide the training samples into two groups according to |ξ_i^k| ≤ h and |ξ_i^k| > h. Let I_1 denote the index set of samples lying in the quadratic part of L_h and I_2 the index set of samples lying in the linear part of L_h; m_1 and m_2 represent the numbers of samples in I_1 and I_2, that is, m_1 + m_2 = m. For the sake of clarity, we suppose that the samples are arranged in the order of I_1 and I_2. Furthermore, we define diagonal matrices P_1 and P_2, where P_1 has its first m_1 diagonal entries equal to 1 and the others 0, and P_2 has the entries from m_1 + 1 to m equal to 1 and the others 0. Then we develop a Newton algorithm for (10). The gradient is

∇F(β^k) = Kβ^k − C K v^k, (11)

where v^k = (1/h) P_1 ξ^k + P_2 s^k and s^k = sign(ξ^k). The Hessian matrix at the k-th iteration is

H^k = K + (C/h) K P_1 K. (12)

The Newton step at the k-th iteration is

β^{k+1} = β^k − (H^k)^{-1} ∇F(β^k). (13)

Denote A^k = I + (C/h) P_1 K, so that H^k = K A^k. The inverse of H^k can be calculated as

(H^k)^{-1} = (A^k)^{-1} K^{-1}. (14)

Since P_1 has only m_1 nonzero rows, inverting A^k reduces to an m_1 × m_1 linear system, so its computational complexity is O(m_1^3). Substituting (14) into (13) and noting that K^{-1}∇F(β^k) = β^k − C v^k, we obtain

β^{k+1} = β^k − (A^k)^{-1}(β^k − C v^k). (15)

Having updated β, we get the corresponding regressor

f(x) = Σ_{i=1}^m β_i k(x_i, x). (16)

The procedure for implementing LADSVR is summarized as follows.
Algorithm 1. LADSVR (Newton algorithm for LADSVR with absolute deviation loss function).
Input: training set {(x_i, y_i)}_{i=1}^m, kernel matrix K, and a small real number ε > 0.
(1) Choose an initial point β^0 and calculate ξ^0 = y − Kβ^0. Divide the training set into two groups according to |ξ_i^0| ≤ h and |ξ_i^0| > h. Set k = 0.
(2) Rearrange the groups in the order of I_1 and I_2; adjust P_1 and P_2 correspondingly. Compute ∇F(β^k) by (11). If ‖∇F(β^k)‖ ≤ ε, stop; otherwise, go to the next step.
(3) Calculate β^{k+1} by (15) and the regressor by (16).
(4) Divide the training set into two groups according to ξ^{k+1} = y − Kβ^{k+1}. Let k = k + 1 and go to step (2).
5. Experiments
To test the effectiveness of the proposed LADSVR, we conduct experiments on several datasets, including six artificial datasets and nine benchmark datasets, and compare it with LSSVR. The Gaussian kernel is used in all experiments. All experiments are run on an Intel Pentium IV 3.00 GHz PC with 2 GB of RAM using Matlab 7.0 under Microsoft Windows XP. The linear system of equations in LSSVR is solved by the Matlab backslash operator "\". Parameter selection is a crucial issue in kernel methods, because improper parameters, such as the regularization parameter C and the kernel parameter σ, severely affect the generalization performance of SVR. Grid search [2] is a simple and direct method that conducts an exhaustive search over the parameter space to minimize the validation error. In this paper, we employ grid search to find the optimal parameters with which each method achieves its best performance on the test samples.
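Grid search over (C, σ) amounts to a nested loop with a validation score; a minimal sketch, where `fit`, `score`, and the candidate grids are illustrative placeholders rather than a fixed API:

```python
import numpy as np
from itertools import product

def grid_search(fit, score, X_tr, y_tr, X_val, y_val,
                Cs=(0.1, 1, 10, 100), sigmas=(0.25, 0.5, 1, 2)):
    """Exhaustively try every (C, sigma) pair; keep the one whose
    fitted model has the smallest validation error."""
    best_params, best_err = None, np.inf
    for C, sigma in product(Cs, sigmas):
        model = fit(X_tr, y_tr, C, sigma)
        err = score(model, X_val, y_val)
        if err < best_err:
            best_params, best_err = (C, sigma), err
    return best_params, best_err
```

Logarithmically spaced grids for C and σ, as above, are the usual choice since both parameters act multiplicatively.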
To evaluate the performance of the algorithms, we adopt the following four popular regression estimation criteria: root mean square error (RMSE) [24], mean absolute error (MAE), the ratio between the sum of squared errors SSE and the total sum of squared deviations of the testing samples SST (SSE/SST) [25], and the ratio between the interpretable sum of squared deviations SSR and SST (SSR/SST) [25]. These criteria are defined as follows:
(1) RMSE = sqrt((1/l) Σ_{i=1}^l (y_i − ŷ_i)^2).
(2) MAE = (1/l) Σ_{i=1}^l |y_i − ŷ_i|.
(3) SSE/SST = Σ_{i=1}^l (ŷ_i − y_i)^2 / Σ_{i=1}^l (y_i − ȳ)^2.
(4) SSR/SST = Σ_{i=1}^l (ŷ_i − ȳ)^2 / Σ_{i=1}^l (y_i − ȳ)^2,
where l is the number of testing samples, y_i denotes the target, ŷ_i is the corresponding prediction, and ȳ = (1/l) Σ_{i=1}^l y_i. RMSE is commonly used as the deviation measure between real and predicted values and represents the fitting precision: the smaller the RMSE, the better the fit. However, when noisy samples are also used for testing, a very small RMSE probably indicates overfitting of the regressor. MAE is another popular deviation measure between real and predicted values. In most cases, a small SSE/SST indicates good agreement between estimates and real values, and a smaller SSE/SST is usually accompanied by a larger SSR/SST. However, an extremely small SSE/SST is in fact not desirable, as it probably indicates overfitting. Therefore, a good estimator should strike a balance between SSE/SST and SSR/SST.
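The four criteria translate directly into a few NumPy reductions; a small helper (names are our own) computing all of them at once:

```python
import numpy as np

def regression_criteria(y_true, y_pred):
    """Compute RMSE, MAE, SSE/SST, and SSR/SST for a test set."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    l = len(y_true)
    y_bar = y_true.mean()
    sse = ((y_pred - y_true) ** 2).sum()  # sum of squared errors
    sst = ((y_true - y_bar) ** 2).sum()   # total sum of squares
    ssr = ((y_pred - y_bar) ** 2).sum()   # "interpretable" deviations
    return {
        "RMSE": np.sqrt(sse / l),
        "MAE": np.abs(y_pred - y_true).mean(),
        "SSE/SST": sse / sst,
        "SSR/SST": ssr / sst,
    }
```

For a perfect predictor, SSE/SST is 0 and SSR/SST is 1, which matches the balance discussed above.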
5.1. Experiments on Artificial Datasets
In the artificial experiments, we generate the datasets from the following Sinc function, which is widely used in regression estimation [17, 24]:

y = sin(x)/x + ξ,

where the noise ξ is drawn from one of several distributions: a Gaussian random variable with zero mean and variance σ^2, a uniformly distributed random variable on [−δ, δ], or a Student's t random variable with k degrees of freedom.
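A sketch of such a generator is below; the sampling interval, noise parameters, and outlier magnitude are illustrative choices, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_sinc_data(m=350, noise="gaussian", sigma=0.2, delta=0.3, df=3):
    """Sample a noisy Sinc training set y = sin(x)/x + noise, then
    corrupt a random 1/5 of the targets to simulate outliers."""
    x = rng.uniform(-4 * np.pi, 4 * np.pi, m)
    # np.sinc(t) = sin(pi t)/(pi t), so np.sinc(x/pi) = sin(x)/x
    y = np.sinc(x / np.pi)
    if noise == "gaussian":
        y += rng.normal(0.0, sigma, m)
    elif noise == "uniform":
        y += rng.uniform(-delta, delta, m)
    elif noise == "student":
        y += sigma * rng.standard_t(df, m)
    out = rng.choice(m, m // 5, replace=False)
    y[out] += rng.normal(0.0, 1.0, len(out))  # large noise on targets
    return x, y
```

Test samples would be drawn from the clean sin(x)/x curve, with no noise term added.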
To avoid biased comparisons, for each kind of noise we randomly generate ten independent groups of noisy samples, each consisting of 350 training samples and 500 test samples. For each training dataset, we randomly choose 1/5 of the samples and add large noise to their targets to simulate outliers. The testing samples are drawn uniformly from the objective Sinc function without any noise. Table 1 shows the average accuracies of LSSVR and LADSVR over ten independent runs. From Table 1, we can see that LADSVR has advantages over LSSVR for all types of noise in terms of RMSE, MAE, and SSE/SST; hence, LADSVR is robust to noise and outliers. Moreover, LADSVR attains a larger SSR/SST value for three types of noise (types II, IV, and V). From Figure 2, we can see that LADSVR follows the actual data more closely than LSSVR for most of the test samples. The main reason is that LADSVR employs an absolute deviation loss function, which reduces the penalty on outliers during training. The histograms of the error variables of LSSVR and LADSVR for the different types of noise are shown in Figure 3. We notice that, compared with LSSVR, the histograms of LADSVR for all types of noise are closer to a Gaussian distribution. Therefore, the proposed LADSVR yields a better approximation than LSSVR.

5.2. Experiments on Benchmark Datasets
In this section, we test nine benchmark datasets to further illustrate the effectiveness of LADSVR: Pyrimidines (Pyrim), Triazines, AutoMPG, Boston Housing (BH), and Servo from the UCI repository [26]; Bodyfat, Pollution, and Concrete Compressive Strength (Concrete) from the StatLib database (available from http://lib.stat.cmu.edu/datasets/); and Machine CPU (MCPU) from http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html. These datasets are widely used for evaluating regression algorithms. Detailed descriptions are presented in Table 2, where train and test denote the numbers of training and testing samples, respectively. In the experiments, each dataset is randomly split into training and testing samples. For each training dataset, we randomly choose 1/5 of the samples and add large noise to their targets to simulate outliers. As in the artificial experiments, no noise is added to the targets of the testing samples. All regression methods are repeated ten times with different partitions into training and testing sets.

Table 2 displays the testing results of LSSVR and the proposed LADSVR. We observe that the three criteria (RMSE, MAE, and SSE/SST) of LADSVR are clearly better than those of LSSVR on all datasets, which shows that the robust algorithm achieves better generalization performance and good stability. For instance, LADSVR obtains smaller RMSE, MAE, and SSE/SST on the Bodyfat dataset while keeping a larger SSR/SST than LSSVR; similar results are obtained on the MCPU, AutoMPG, and BH datasets.
To obtain the final regressor, LADSVR is solved in the primal space by the classical Newton algorithm iteratively. The number of iterations (Iter) and the running time (Time), including training and testing time, are listed in Table 2; Iter is the average number of iterations over ten independent runs. Compared with LSSVR, LADSVR requires more running time, mainly because its running time depends on the choice of the starting point, the value of h, and the number of iterations. In the experiments, the starting point is obtained by applying LSSVR to a small subset of the training samples. The average number of iterations does not exceed 10, which suggests that LADSVR is suitable for medium and large scale problems, and LADSVR does not increase the running time severely: in the worst case, on the Pyrim dataset, the ratio of running times is no more than 3. These results indicate that the proposed LADSVR is effective for robust regression problems.
6. Conclusion
In this paper, we propose LADSVR, a novel robust least squares support vector regression algorithm for datasets with outliers. Compared with classical LSSVR, which is based on the squared loss function, LADSVR employs an absolute deviation loss function to reduce the influence of outliers. To solve the resulting model, we smooth the proposed loss function with a Huber-type loss and develop a Newton algorithm. Experimental results on both artificial and benchmark datasets confirm that LADSVR is more robust than LSSVR. However, like LSSVR, LADSVR lacks sparseness. In the future, we plan to develop more efficient variants of LADSVR that improve both sparseness and robustness.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The work is supported by the National Natural Science Foundation of China under Grant no. 11171346 and Chinese Universities Scientific Fund no. 2013YJ010.
References
[1] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[2] N. Cristianini and J. S. Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, Cambridge, UK, 2000.
[3] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy, "Improvements to Platt's SMO algorithm for SVM classifier design," Neural Computation, vol. 13, no. 3, pp. 637–649, 2001.
[4] J. Platt, "Fast training of support vector machines using sequential minimal optimization," in Advances in Kernel Methods: Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds., pp. 185–208, MIT Press, Cambridge, Mass, USA, 1999.
[5] E. Osuna, R. Freund, and F. Girosi, "An improved training algorithm for support vector machines," in Proceedings of the 7th IEEE Workshop on Neural Networks for Signal Processing (NNSP '97), J. Principe, L. Gile, N. Morgan, and E. Wilson, Eds., pp. 276–285, September 1997.
[6] T. Joachims, "Making large-scale SVM learning practical," in Advances in Kernel Methods: Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds., pp. 169–184, MIT Press, Cambridge, Mass, USA, 1999.
[7] C. Chang and C. Lin, "LIBSVM: a library for support vector machines," 2001, http://www.csie.ntu.edu.tw/~cjlin/.
[8] J. A. K. Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural Processing Letters, vol. 9, no. 3, pp. 293–300, 1999.
[9] J. A. K. Suykens, J. De Brabanter, L. Lukas, and J. Vandewalle, "Weighted least squares support vector machines: robustness and sparse approximation," Neurocomputing, vol. 48, pp. 85–105, 2002.
[10] W. Wen, Z. Hao, and X. Yang, "A heuristic weight-setting strategy and iteratively updating algorithm for weighted least-squares support vector regression," Neurocomputing, vol. 71, no. 16–18, pp. 3096–3103, 2008.
[11] K. De Brabanter, K. Pelckmans, J. De Brabanter et al., "Robustness of kernel based regression: a comparison of iterative weighting schemes," in Proceedings of the 19th International Conference on Artificial Neural Networks (ICANN '09), 2009.
[12] J. Liu, J. Li, W. Xu, and Y. Shi, "A weighted Lq adaptive least squares support vector machine classifiers: robust and sparse approximation," Expert Systems with Applications, vol. 38, no. 3, pp. 2253–2259, 2011.
[13] X. Chen, J. Yang, J. Liang, and Q. Ye, "Recursive robust least squares support vector regression based on maximum correntropy criterion," Neurocomputing, vol. 97, pp. 63–73, 2012.
[14] L. Xu, K. Crammer, and D. Schuurmans, "Robust support vector machine training via convex outlier ablation," in Proceedings of the 21st National Conference on Artificial Intelligence (AAAI '06), pp. 536–542, July 2006.
[15] P. J. Rousseeuw and K. van Driessen, "Computing LTS regression for large data sets," Data Mining and Knowledge Discovery, vol. 12, no. 1, pp. 29–45, 2006.
[16] W. Wen, Z. Hao, and X. Yang, "Robust least squares support vector machine based on recursive outlier elimination," Soft Computing, vol. 14, no. 11, pp. 1241–1251, 2010.
[17] C. Chuang and Z. Lee, "Hybrid robust support vector machines for regression with outliers," Applied Soft Computing Journal, vol. 11, no. 1, pp. 64–72, 2011.
[18] G. Bassett Jr. and R. Koenker, "Asymptotic theory of least absolute error regression," Journal of the American Statistical Association, vol. 73, no. 363, pp. 618–622, 1978.
[19] P. Bloomfield and W. L. Steiger, Least Absolute Deviation: Theory, Applications and Algorithms, Birkhäuser, Boston, Mass, USA, 1983.
[20] H. Wang, G. Li, and G. Jiang, "Robust regression shrinkage and consistent variable selection through the LAD-Lasso," Journal of Business & Economic Statistics, vol. 25, no. 3, pp. 347–355, 2007.
[21] O. Chapelle, "Training a support vector machine in the primal," Neural Computation, vol. 19, no. 5, pp. 1155–1178, 2007.
[22] L. Bo, L. Wang, and L. Jiao, "Recursive finite Newton algorithm for support vector regression in the primal," Neural Computation, vol. 19, no. 4, pp. 1082–1096, 2007.
[23] P. J. Huber, Robust Statistics, Springer, Berlin, Germany, 2011.
[24] P. Zhong, Y. Xu, and Y. Zhao, "Training twin support vector regression via linear programming," Neural Computing and Applications, vol. 21, no. 2, pp. 399–407, 2012.
[25] X. Peng, "TSVR: an efficient Twin Support Vector Machine for regression," Neural Networks, vol. 23, no. 3, pp. 365–372, 2010.
[26] C. Blake and C. J. Merz, "UCI repository for machine learning databases," 1998, http://www.ics.uci.edu/~mlearn/MLRepository.html.
Copyright
Copyright © 2014 Kuaini Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.