Cross-Validation, Bootstrap, and Support Vector Machines
Abstract
This paper considers the application of resampling methods to support vector machines (SVMs). We use leaving-one-out cross-validation (CV) to determine the optimum tuning parameters and bootstrap the deviance to summarize the goodness-of-fit of SVMs. The leaving-one-out CV is also adapted in order to provide estimates of the bias of the excess error in a prediction rule constructed with training samples. We analyze data from a mackerel-egg survey and a liver-disease study.
1. Introduction
In recent years, support vector machines (SVMs) have been intensively studied and applied to practical problems in many fields of science and engineering [1–3]. SVMs have many merits that distinguish them from many other machine learning algorithms, including the nonexistence of local minima, the speed of calculation, and the use of only two tuning parameters. There are at least two reasons to use leaving-one-out cross-validation (CV) [4]. First, the criterion based on this method is demonstrated to be favorable when determining the tuning parameters. Second, the method can estimate the bias of the excess error in prediction. No standard procedure exists by which to assess the overall goodness-of-fit of an SVM-based model. By introducing the maximum likelihood principle, the deviance allows us to test the goodness-of-fit of the model. Since no adequate distribution theory exists for the deviance, we bootstrap the null distribution of the deviance for the model having the optimum tuning parameters, with a specified significance level [5–8].
The remainder of this paper is organized as follows. In Section 2, using the leaving-one-out CV, we focus on the determination of the tuning parameters and on the evaluation, based on bootstrapping, of the overall goodness-of-fit with the optimum tuning parameters. The leaving-one-out CV is also adapted in order to provide estimates of the bias of the excess error in a prediction rule constructed with training samples [9]. In Section 3, the one-against-one method is used to estimate class probabilities for each pair of classes and then to couple the estimates together into a vector of multiclass probabilities [3, 10]. In Section 4, the methods are illustrated using mackerel-egg survey and liver-disease data. We discuss the relative merits and limitations of the methods in Section 5.
2. Support Vector Machines and Resampling Methods
2.1. Support Vector Machines
Given training pairs $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$, where $\mathbf{x}_i \in \mathbb{R}^d$ is an input vector and $y_i \in \{-1, +1\}$, the SVM solves the following primal problem:

$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \; \frac{1}{2}\mathbf{w}^{T}\mathbf{w} + C\,\mathbf{e}^{T}\boldsymbol{\xi} \quad \text{subject to} \quad y_i\bigl(\mathbf{w}^{T}\phi(\mathbf{x}_i) + b\bigr) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n, \qquad (1)$$

where $\mathbf{e}$ is a unit vector (i.e., $\mathbf{e} = (1, \ldots, 1)^{T}$), $^{T}$ denotes the transposition of the matrix, $K(\mathbf{x}_i, \mathbf{x}_j)$ is the kernel function, $C$ is the tuning parameter denoting the trade-off between the margin width and the training-data error, and $\xi_i$ are slack variables. For an unknown input pattern $\mathbf{x}$, we have the decision function

$$f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b, \qquad (2)$$

where $\alpha_i$ are the Lagrange multipliers. We employ the Gaussian radial basis function as the kernel function [3, 11, 12]

$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\bigl(-\gamma \lVert \mathbf{x}_i - \mathbf{x}_j \rVert^{2}\bigr), \qquad (3)$$

where $\gamma > 0$ is a fixed parameter, and

$$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^{T}\phi(\mathbf{x}_j). \qquad (4)$$
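As a concrete illustration of (1)–(3), the following minimal Python sketch fits an SVM with the Gaussian radial basis kernel and evaluates the decision function (2). It uses scikit-learn, which wraps LIBSVM [12]; the toy data and parameter values are our own assumptions rather than anything taken from the examples in Section 4.

    import numpy as np
    from sklearn.svm import SVC

    # Toy binary data: two Gaussian clouds labeled -1/+1 (assumed for illustration).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
    y = np.array([-1] * 50 + [1] * 50)

    # C is the margin/error trade-off of (1); gamma is the RBF parameter of (3).
    clf = SVC(C=1.0, kernel="rbf", gamma=0.5).fit(X, y)

    f = clf.decision_function(X)   # decision values f(x) of (2)
    y_hat = np.sign(f)             # assign to the positive class when f(x) >= 0
    print("apparent accuracy:", np.mean(y_hat == y))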
Binary classification is performed by using the decision function $f(\mathbf{x})$: the input is assigned to the positive class if $f(\mathbf{x}) \ge 0$, and to the negative class otherwise. Platt [13] proposed one method for producing probabilistic outputs from a decision function by using the logistic link function

$$P(y = 1 \mid f) = \frac{1}{1 + \exp(Af + B)}, \qquad (5)$$

where $f_i = f(\mathbf{x}_i)$ and $t_i$ represent the output of the SVM and the target value for the $i$th sample, respectively [14]. This is equivalent to fitting a logistic regression model to the estimated decision values. The unknown parameters $A$ and $B$ in (5) can be estimated by minimizing the cross-entropy

$$E(A, B) = -\sum_{i=1}^{n} \bigl\{ t_i \ln p_i + (1 - t_i) \ln(1 - p_i) \bigr\}, \qquad (6)$$

where

$$p_i = \frac{1}{1 + \exp(Af_i + B)}, \qquad (7)$$

$$t_i = \frac{y_i + 1}{2}. \qquad (8)$$

Putting $z_i = Af_i + B$ in (6), (7), and (8), we obtain

$$E(A, B) = \sum_{i=1}^{n} \bigl\{ \ln\bigl(1 + e^{z_i}\bigr) - (1 - t_i)\, z_i \bigr\}. \qquad (9)$$

Lin et al. [15] observed that the problem of $\ln(0)$ never occurs for (9).
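The minimization of (9) can be carried out with any generic optimizer. The sketch below is our own illustration (the function name fit_platt_sigmoid and the synthetic decision values are assumptions); it uses np.logaddexp for a numerically stable evaluation of ln(1 + e^z), so that ln(0) cannot arise, in the spirit of Lin et al. [15].

    import numpy as np
    from scipy.optimize import minimize

    def fit_platt_sigmoid(f, y):
        """Estimate (A, B) of (5) by minimizing the cross-entropy in the form (9)."""
        t = (y + 1) / 2.0                                # targets t_i of (8)
        def objective(ab):
            z = ab[0] * f + ab[1]                        # z_i = A f_i + B
            return np.sum(np.logaddexp(0.0, z) - (1.0 - t) * z)
        return minimize(objective, x0=np.zeros(2), method="BFGS").x

    # Synthetic decision values whose sign is informative about the label.
    rng = np.random.default_rng(0)
    y = rng.choice([-1, 1], size=200)
    f = y * rng.gamma(2.0, 1.0, 200) + rng.normal(0.0, 0.5, 200)

    A, B = fit_platt_sigmoid(f, y)
    p = 1.0 / (1.0 + np.exp(A * f + B))                  # probabilities via (5) and (7)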
2.2. Leaving-One-Out Cross-Validation
2.2.1. CV Score
We must determine the optimum values of the tuning parameters $C$ and $\gamma$ in (1) and (3), respectively. This can be done by means of the leaving-one-out CV; as a by-product, the excess error rate of incorrectly predicting the outcome can be estimated.
Let the initial sample $\mathcal{Y} = \{(\mathbf{x}_i, y_i);\ i = 1, \ldots, n\}$ with $y_i \in \{-1, +1\}$ be independently distributed according to an unknown distribution. The leaving-one-out CV algorithm is then given as follows (see, e.g., [5]).
Step 1. From the initial sample $\mathcal{Y}$, $\mathbf{x}_i$ and $y_i$ are deleted in order to form the training sample $\mathcal{Y}_{(i)}$.
Step 2. Using each training sample $\mathcal{Y}_{(i)}$, fit an SVM and predict the decision value $\hat{f}_{(i)}(\mathbf{x}_i)$ for the deleted sample $\mathbf{x}_i$.
Step 3. From the decision value $\hat{f}_{(i)}(\mathbf{x}_i)$, we can predict $\hat{p}_{(i)}$ for the deleted $i$th sample using (7) and calculate the predicted log-likelihood $\ln \hat{L}_{(i)} = t_i \ln \hat{p}_{(i)} + (1 - t_i) \ln(1 - \hat{p}_{(i)})$.
Step 4. Steps 1 to 3 are repeated for $i = 1, \ldots, n$.
Step 5. The CV score based on the predicted log-likelihood is given by

$$\mathrm{CV} = -2 \sum_{i=1}^{n} \ln \hat{L}_{(i)}. \qquad (10)$$
Step 6. Carry out a grid search over the tuning parameters $C$ and $\gamma$, taking the tuning parameters with minimum CV as optimal. It should be noted that the CV score is asymptotically equivalent to AIC (Akaike information criterion) and EIC (extended information criterion) [16–18].
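The following sketch (our own; the toy data and the coarse grid are assumptions, and a real analysis would use a much finer grid) implements Steps 1–6. Note that scikit-learn's probability=True option calibrates the sigmoid by an internal cross-validation, which differs in detail from the single-SVM sigmoid fit described in Section 2.4, but it serves as a simple stand-in for the predicted probabilities.

    import numpy as np
    from sklearn.svm import SVC

    def loo_cv_score(X, y, C, gamma):
        """Steps 1-5: the leaving-one-out CV score (10)."""
        n = len(y)
        log_lik = np.empty(n)
        for i in range(n):                                  # Steps 1 and 4: delete each i
            train = np.arange(n) != i
            clf = SVC(C=C, gamma=gamma, probability=True,
                      random_state=0).fit(X[train], y[train])   # Step 2
            p = clf.predict_proba(X[[i]])[0]                # Step 3: predicted probability
            log_lik[i] = np.log(p[list(clf.classes_).index(y[i])])
        return -2.0 * log_lik.sum()                         # Step 5: CV score (10)

    # Step 6: grid search over (C, gamma); grid values are assumed.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-1.0, 1.0, (30, 2)), rng.normal(1.0, 1.0, (30, 2))])
    y = np.array([-1] * 30 + [1] * 30)
    grid = [(C, g) for C in (0.1, 1.0, 10.0) for g in (0.1, 0.5, 1.0)]
    C_opt, gamma_opt = min(grid, key=lambda cg: loo_cv_score(X, y, *cg))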
2.2.2. Excess Error Estimation
Let the actual error rate be the probability of incorrectly predicting the outcome of a new observation, given a discriminant rule formed on the initial sample $\mathcal{Y}$; this is useful for the performance assessment of a discriminant rule. Given a discriminant rule based on the initial sample, the error rates of discrimination are also of interest. As the same observations are used for forming and assessing the discriminant rule, the resulting proportion of errors, called the apparent error rate, underestimates the actual error rate. The estimate of the error rate is seriously biased when the initial sample is small. This bias for a given discriminant rule is called the excess error of that rule. To correct this bias and estimate the error rates, we provide the bias correction of the apparent error rate associated with a discriminant rule, which is constructed by fitting the SVM to the training sample.
By applying a discriminant rule to the initial sample $\mathcal{Y}$, we can form the realized discriminant rule $\eta_{\mathcal{Y}}$. Let $\eta_{\mathcal{Y}}(\mathbf{x})$ be the discrimination rule based on $\mathcal{Y}$. Given a subject with input $\mathbf{x}$, we predict the response by $\eta_{\mathcal{Y}}(\mathbf{x})$. The algorithm for leaving-one-out CV that estimates the excess error rate when fitting an SVM is given as follows [9].
Step 1. Generate the training sample $\mathcal{Y}_{(i)}$, and construct the realized discrimination rule $\eta_{\mathcal{Y}_{(i)}}$ based on $\mathcal{Y}_{(i)}$. Then, define

$$Q\bigl(y_i, \eta_{\mathcal{Y}_{(i)}}(\mathbf{x}_i)\bigr) = \begin{cases} 0, & y_i = \eta_{\mathcal{Y}_{(i)}}(\mathbf{x}_i), \\ 1, & \text{otherwise.} \end{cases} \qquad (11)$$

The leaving-one-out CV error rate is then given by

$$\widehat{\mathrm{err}}_{\mathrm{CV}} = \frac{1}{n} \sum_{i=1}^{n} Q\bigl(y_i, \eta_{\mathcal{Y}_{(i)}}(\mathbf{x}_i)\bigr). \qquad (12)$$
Step 2. Calculate the apparent error

$$\widehat{\mathrm{err}}_{\mathrm{app}} = \frac{1}{n} \sum_{i=1}^{n} Q\bigl(y_i, \eta_{\mathcal{Y}}(\mathbf{x}_i)\bigr). \qquad (13)$$
Step 3. The cross-validation estimator of the expected excess error is

$$\hat{\omega} = \widehat{\mathrm{err}}_{\mathrm{CV}} - \widehat{\mathrm{err}}_{\mathrm{app}}. \qquad (14)$$
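A direct transcription of Steps 1–3 follows; it is a sketch under the same assumptions as above (toy data, scikit-learn in place of the authors' own implementation).

    import numpy as np
    from sklearn.svm import SVC

    def excess_error(X, y, C=1.0, gamma=0.5):
        """Leaving-one-out estimate (14) of the expected excess error."""
        n = len(y)
        eta = SVC(C=C, gamma=gamma).fit(X, y)             # realized rule on the full sample
        err_app = np.mean(eta.predict(X) != y)            # apparent error rate (13)
        wrong = np.empty(n)
        for i in range(n):                                # Step 1: delete-one rules
            keep = np.arange(n) != i
            rule = SVC(C=C, gamma=gamma).fit(X[keep], y[keep])
            wrong[i] = rule.predict(X[[i]])[0] != y[i]    # Q of (11)
        err_cv = wrong.mean()                             # leaving-one-out rate (12)
        return err_cv - err_app                           # excess error estimate (14)

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-1.0, 1.0, (30, 2)), rng.normal(1.0, 1.0, (30, 2))])
    y = np.array([-1] * 30 + [1] * 30)
    print(excess_error(X, y))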
2.3. Bootstrapping
Introducing the maximum likelihood principle into the SVM, the deviance allows us to test the goodness-of-fit of the model:

$$D = -2 \ln \hat{L} = -2 \sum_{i=1}^{n} \bigl\{ t_i \ln \hat{p}_i + (1 - t_i) \ln(1 - \hat{p}_i) \bigr\}, \qquad (15)$$

where $\ln \hat{L}$ denotes the maximized log-likelihood under the current SVM, and the log-likelihood for the saturated model is zero. The deviance given by (15) does, however, not even approximately follow a $\chi^2$ distribution for the case in which ungrouped binary responses are available [19, 20]. The number of degrees of freedom (d.f.) required for the test for significance using the assumed $\chi^2$ distribution for the deviance is a contentious issue. No adequate distribution theory exists for the deviance. The reason for this is somewhat technical (for details, see Section 3.8.3 in [19]). Consequently, the deviance on fitting a model to binary response data cannot be used as a summary measure of the goodness-of-fit test of the model.
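In code, the deviance (15) is immediate once fitted probabilities are available. The helper below is our own (the clipping guard is an implementation convenience, not part of the paper); it is reused by the sketches that follow.

    import numpy as np

    def deviance(y, p):
        """Deviance (15) for labels y in {-1, +1} and fitted P(y = +1 | x) = p."""
        t = (y + 1) / 2.0                        # targets t_i of (8)
        p = np.clip(p, 1e-12, 1.0 - 1e-12)       # numerical guard against ln(0)
        return -2.0 * np.sum(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))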
Based on the above discussion, the percentile of the deviance for the goodness-of-fit test can in principle be calculated. However, the calculations are usually too complicated to perform analytically, so the Monte Carlo method can be employed [6, 7].
Step 1. Generate $B$ bootstrap samples from the original sample $\mathcal{Y}$. Let $\mathcal{Y}^{*b}$ denote the $b$th bootstrap sample.
Step 2. For the bootstrap sample $\mathcal{Y}^{*b}$, compute the deviance of (15), denoted by $D^{*b}$.
Steps 1 and 2 are repeated independently $B$ times, and the computed values $D^{*1}, \ldots, D^{*B}$ are arranged in ascending order.
Step 3. Take the value of the $[B(1-\alpha)]$th order statistic of the $B$ replications as an estimate of the quantile of order $1 - \alpha$.
Step 4. The estimate of the $100(1-\alpha)$th percentile of $D$ is used to test the goodness-of-fit of a model at a specified significance level $\alpha$. A value of the deviance (15) greater than the estimated percentile indicates that the model fits poorly. Typically, the number of replications $B$ is on the order of several hundred to a few thousand.
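Steps 1–4 translate directly into the following sketch (our own; it assumes the deviance() helper defined above, scikit-learn's internal Platt calibration as a stand-in for the single-SVM sigmoid of Section 2.4, and that every bootstrap sample contains both classes).

    import numpy as np
    from sklearn.svm import SVC

    def bootstrap_deviance_percentile(X, y, C, gamma, B=1000, alpha=0.05, seed=0):
        """Estimate the 100(1 - alpha)th percentile of the null deviance (15)."""
        rng = np.random.default_rng(seed)
        n = len(y)
        d_star = np.empty(B)
        for b in range(B):
            idx = rng.integers(0, n, size=n)     # Step 1: b-th bootstrap sample
            clf = SVC(C=C, gamma=gamma, probability=True,
                      random_state=0).fit(X[idx], y[idx])
            p = clf.predict_proba(X[idx])[:, list(clf.classes_).index(1)]
            d_star[b] = deviance(y[idx], p)      # Step 2: bootstrap deviance D*b
        d_star.sort()                            # Step 3: order statistics
        k = int(round(B * (1 - alpha)))          # Step 4: [B(1-alpha)]th order statistic
        return d_star[k - 1]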
2.4. Influential Analysis
Assessing the discrepancies between $t_i$ and $\hat{p}_i$ at the $i$th observation in (15), the influence measure provides guides and suggestions that may be carefully applied to an SVM [19]. The effect of the $i$th observation on the deviance can be measured by computing

$$\Delta D_i = D - D_{(i)}, \qquad (16)$$

where $D_{(i)}$ is the deviance with the $i$th observation deleted. The distribution of $\Delta D_i$ will be approximated by $\chi^2$ with d.f. = 1 when the fitted model is correct. An index plot is a reasonable rule of thumb for graphically presenting the information contained in the values of $\Delta D_i$. The key idea behind this plot is not to focus on a global measure of goodness-of-fit but rather on local contributions to the fit. An influential observation is one that greatly changes the results of the statistical inference when deleted from the initial sample.
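The influence measure can be computed by brute force, refitting with each observation deleted in turn. The sketch below is our own (it again assumes the deviance() helper above and scikit-learn's calibrated probabilities); observations whose $\Delta D_i$ exceeds the $\chi^2$ cutoff with 1 d.f. would be flagged in the index plot.

    import numpy as np
    from scipy.stats import chi2
    from sklearn.svm import SVC

    def fitted_probs(X, y, C, gamma):
        clf = SVC(C=C, gamma=gamma, probability=True, random_state=0).fit(X, y)
        return clf.predict_proba(X)[:, list(clf.classes_).index(1)]

    def delta_deviances(X, y, C=1.0, gamma=0.5):
        """Delta D_i = D - D_(i): change in deviance (16) when observation i is deleted."""
        D = deviance(y, fitted_probs(X, y, C, gamma))      # full-sample deviance (15)
        n = len(y)
        dD = np.empty(n)
        for i in range(n):
            keep = np.arange(n) != i
            dD[i] = D - deviance(y[keep], fitted_probs(X[keep], y[keep], C, gamma))
        return dD

    cutoff = chi2.ppf(0.95, df=1)   # approximately 3.84; flag Delta D_i above this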
Platt [13] proposed the three-fold CV for estimating the decision values in (9). However, the decision value $\hat{f}(\mathbf{x}_i)$ may be negative even for a positive sample because three SVMs are trained on three disjoint parts of the training pairs $(\mathbf{x}_i, y_i)$. Therefore, in the present paper, we train a single SVM on the training pairs in order to evaluate the decision values and estimate the probabilistic outputs according to [15].
3. Multiclass SVM
We consider the discriminant problem with $K$ classes and training pairs $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$, where $\mathbf{x}_i \in \mathbb{R}^d$ is an input vector and $y_i \in \{1, \ldots, K\}$ [10, 21, 22]. Let $p_k(\mathbf{x}) = P(y = k \mid \mathbf{x})$ denote the response probabilities, with $\sum_{k=1}^{K} p_k(\mathbf{x}) = 1$, for multiclass classification. The log-likelihood is given by

$$\ell = \sum_{i=1}^{n} \sum_{k=1}^{K} I(y_i = k) \ln p_k(\mathbf{x}_i). \qquad (17)$$
For multiclass classification, the one-against-one method (also called pairwise classification) is used to produce a vector of multiclass probabilities for each pair of classes, and then to couple the estimates together [10]. The earliest used implementation for multiclass SVM is probably the one-against-one method of [21]. This method constructs $K(K-1)/2$ classifiers, each trained on the data from the $i$th and $j$th classes of the training set.
For the pair of classes $(i, j)$, the SVM solves the primal formulation [3, 10, 23]

$$\min_{\mathbf{w}^{ij},\, b^{ij},\, \boldsymbol{\xi}^{ij}} \; \frac{1}{2}\bigl(\mathbf{w}^{ij}\bigr)^{T}\mathbf{w}^{ij} + C \sum_{t} \xi_t^{ij} \quad \text{subject to} \quad \begin{cases} \bigl(\mathbf{w}^{ij}\bigr)^{T}\phi(\mathbf{x}_t) + b^{ij} \ge 1 - \xi_t^{ij}, & y_t = i, \\ \bigl(\mathbf{w}^{ij}\bigr)^{T}\phi(\mathbf{x}_t) + b^{ij} \le -1 + \xi_t^{ij}, & y_t = j, \\ \xi_t^{ij} \ge 0. \end{cases} \qquad (18)$$

Given $K$ classes of data, for any $\mathbf{x}$, the goal is to estimate

$$p_i = P(y = i \mid \mathbf{x}), \quad i = 1, \ldots, K. \qquad (19)$$

We first estimate pairwise class probabilities $r_{ij} \approx P(y = i \mid y \in \{i, j\}, \mathbf{x})$ by using

$$r_{ij} = \frac{1}{1 + \exp\bigl(A\hat{f} + B\bigr)}, \qquad (20)$$

where $A$ and $B$ are estimated by minimizing the cross-entropy (6) using the training data of the $i$th and $j$th classes and the corresponding decision values $\hat{f}$.
Hastie and Tibshirani [21] proposed minimizing the Kullback-Leibler (KL) distance between $r_{ij}$ and $\mu_{ij} = p_i/(p_i + p_j)$:

$$\ell(\mathbf{p}) = \sum_{i<j} n_{ij} \left\{ r_{ij} \ln \frac{r_{ij}}{\mu_{ij}} + (1 - r_{ij}) \ln \frac{1 - r_{ij}}{1 - \mu_{ij}} \right\}, \qquad (21)$$

where $n_{ij}$ is the number of training data in the $i$th and $j$th classes.
Wu et al. [10] propose a second approach to obtain $\mathbf{p} = (p_1, \ldots, p_K)^{T}$ from all these $r_{ij}$'s by optimizing

$$\min_{\mathbf{p}} \; \frac{1}{2} \sum_{i=1}^{K} \sum_{j \ne i} \bigl( r_{ji}\, p_i - r_{ij}\, p_j \bigr)^2 \quad \text{subject to} \quad \sum_{i=1}^{K} p_i = 1, \quad p_i \ge 0. \qquad (22)$$
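Wu et al. [10] show that this constrained minimization reduces to a small linear system, which makes the coupling step easy to implement. The following sketch is our own; the example matrix r of pairwise probabilities is an assumption for illustration.

    import numpy as np

    def couple_pairwise(r):
        """Recover class probabilities p from pairwise estimates
        r[i, j] ~ P(y = i | y in {i, j}, x), following Wu et al. [10]."""
        k = r.shape[0]
        Q = np.empty((k, k))
        for i in range(k):
            for j in range(k):
                if i == j:
                    Q[i, i] = sum(r[s, i] ** 2 for s in range(k) if s != i)
                else:
                    Q[i, j] = -r[j, i] * r[i, j]
        # KKT conditions of (22): solve [Q 1; 1^T 0][p; b] = [0; 1].
        A = np.zeros((k + 1, k + 1))
        A[:k, :k] = Q
        A[:k, k] = 1.0
        A[k, :k] = 1.0
        rhs = np.zeros(k + 1)
        rhs[k] = 1.0
        return np.linalg.solve(A, rhs)[:k]

    # Assumed pairwise probabilities for a 3-class problem.
    r = np.array([[0.0, 0.7, 0.8],
                  [0.3, 0.0, 0.6],
                  [0.2, 0.4, 0.0]])
    print(couple_pairwise(r))   # coupled probabilities, summing to one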
Thus, we can adopt a leaving-one-out CV similar to the method in Section 2.2.
Step 1. From the initial sample $\mathcal{Y}$, $\mathbf{x}_i$ and $y_i$ are deleted in order to form the training sample $\mathcal{Y}_{(i)}$.
Step 2. Using each training sample, fit an SVM in order to estimate $\mathbf{p}$ by (22), and predict the class probabilities $\hat{p}_{k(i)}(\mathbf{x}_i)$ for the deleted $i$th sample $(\mathbf{x}_i, y_i)$.
Step 3. Steps 1 and 2 are repeated for $i = 1, \ldots, n$.
Step 4. The CV score is given by

$$\mathrm{CV} = -2 \sum_{i=1}^{n} \sum_{k=1}^{K} I(y_i = k) \ln \hat{p}_{k(i)}(\mathbf{x}_i). \qquad (23)$$
Step 5. The tuning parameters with minimum CV can be determined as optimal by carrying out a grid search over $C$ and $\gamma$.
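For completeness, a sketch of the multiclass CV score follows (our own, with assumed three-class toy data). scikit-learn's SVC with probability=True implements the pipeline of this section internally via LIBSVM [12]: per-pair sigmoids coupled by the method of Wu et al. [10].

    import numpy as np
    from sklearn.svm import SVC

    def multiclass_loo_cv(X, y, C, gamma):
        """Steps 1-4: leaving-one-out CV score (23) for the multiclass SVM."""
        n = len(y)
        log_lik = np.empty(n)
        for i in range(n):
            train = np.arange(n) != i
            clf = SVC(C=C, gamma=gamma, probability=True,
                      random_state=0).fit(X[train], y[train])
            p = clf.predict_proba(X[[i]])[0]         # coupled class probabilities
            log_lik[i] = np.log(p[list(clf.classes_).index(y[i])])
        return -2.0 * log_lik.sum()

    # Three-class toy data; Step 5 would grid-search (C, gamma) as before.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(m, 0.7, (20, 2)) for m in (-2.0, 0.0, 2.0)])
    y = np.repeat([0, 1, 2], 20)
    print(multiclass_loo_cv(X, y, C=1.0, gamma=0.5))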
4. Examples
4.1. Mackerel-Egg Survey Data
We consider data consisting of 634 observations from a 1992 mackerel-egg survey [24]. The following predictors of egg abundance are available: the location (longitude and latitude) at which samples were taken, the depth of the ocean, the distance from the 200 m seabed contour, and, finally, the water temperature at a depth of 20 m. We first fit an SVM. In the same manner as described in [11], we determine the tuning parameters $C$ and $\gamma$ by a grid search on the leaving-one-out CV score. With the optimum tuning parameters, the deviance of (15) lies below the bootstrap estimate of the percentile of its null distribution, which suggests that the SVM fits the data fairly well. For reference purposes, the histogram of the bootstrapped deviances $D^{*b}$, $b = 1, \ldots, B$, is provided in Figure 1.
We can estimate the apparent error rate of incorrectly predicting the outcome and the leaving-one-out CV error rate for several models, as shown in Table 1. The smoothing parameters in the generalized additive model (GAM) [24] and the number of hidden units in the neural network in Table 1 are determined using the leaving-one-out CV. From Table 1, the leaving-one-out CV error rate for the SVM is the smallest among all models, but the apparent error rate is smallest for the neural network. The CV scores are 477.04, 509.44, and 541.61 for the SVM, the logistic discriminant, and the neural network, respectively. This implies that the SVM is the best of these three models from the point of view of CV. Figure 2 shows the index plot of $\Delta D_i$, which indicates that observations no. 399 and no. 601 are influential at the chosen significance level.

4.2. Liver Disease Data
We apply the proposed method to laboratory data collected from 218 patients with liver disorders [25–27]. Four liver diseases were observed: acute viral hepatitis (57 patients), persistent chronic hepatitis (44 patients), aggressive chronic hepatitis (40 patients), and postnecrotic cirrhosis (77 patients). The covariates consist of four liver enzymes: aspartate aminotransferase (AST), alanine aminotransferase (ALT), glutamate dehydrogenase (GlDH), and ornithine carbamyltransferase (OCT). For each pair $(C, \gamma)$, the CV performance is measured by training on one part of the data and testing on the remainder. Then, we train on the whole training set by using the pair $(C, \gamma)$ that achieves the minimum CV score (= 187.93) and predict the test set. The apparent and leaving-one-out CV error rates for the training and test samples for several models are shown in Table 2. As shown, the apparent error rate for the SVM on the training sample and the error rate for the SVM on the test sample are the smallest among all models, but the leaving-one-out CV error rate for the SVM on the training sample is larger than that of the multinomial logistic discriminant model.

5. Concluding Remarks
We considered the application of resampling methods to SVMs. Statistical inference based on the likelihood approach for SVMs was discussed, and the leaving-one-out CV was suggested for determining the tuning parameters and for estimating the bias of the excess error in prediction. Bootstrapping was used to evaluate the overall goodness-of-fit with the optimum tuning parameters. Data from a mackerel-egg survey and a liver-disease study were used to evaluate the resampling methods.
There is one broad limitation to our approach: the SVM assumed the independence of the predictor variables. More generally, it may be preferable to visualize interactions between predictor variables. The smoothing spline ANOVA models [28] can provide an excellent means of handling data of mutually exclusive groups and a set of predictor variables. We expect that flexible methods for a discriminant model using machine learning theory [1], such as penalized smoothing splines, will be very useful in these real-world contexts.
References
[1] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, NY, USA, 2006.
[2] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, Cambridge, UK, 2000.
[3] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, "A practical guide to support vector classification," 2009, http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.
[4] P. Zhang, "Model selection via multifold cross validation," Annals of Statistics, vol. 21, pp. 299–313, 1993.
[5] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap, Chapman & Hall, New York, NY, USA, 1993.
[6] M. Tsujitani and T. Koshimizu, "Neural discriminant analysis," IEEE Transactions on Neural Networks, vol. 11, no. 6, pp. 1394–1401, 2000.
[7] M. Tsujitani and M. Aoki, "Neural regression model, resampling and diagnosis," Systems and Computers in Japan, vol. 37, no. 6, pp. 13–20, 2006.
[8] M. Tsujitani and M. Sakon, "Analysis of survival data having time-dependent covariates," IEEE Transactions on Neural Networks, vol. 20, no. 3, pp. 389–394, 2009.
[9] G. Gong, "Cross-validation, the jackknife, and the bootstrap: excess error estimation in forward logistic regression," Journal of the American Statistical Association, vol. 81, pp. 108–113, 1986.
[10] T.-F. Wu, C.-J. Lin, and R. C. Weng, "Probability estimates for multi-class classification by pairwise coupling," Journal of Machine Learning Research, vol. 5, pp. 975–1005, 2004.
[11] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13, pp. 415–425, 2002.
[12] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[13] J. Platt, "Probabilistic outputs for support vector machines and comparison to regularized likelihood methods," in Advances in Large Margin Classifiers, A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, Eds., MIT Press, Cambridge, Mass, USA, 2000.
[14] A. Karatzoglou, D. Meyer, and K. Hornik, "Support vector machines in R," Journal of Statistical Software, vol. 15, no. 9, pp. 1–28, 2006.
[15] H.-T. Lin, C.-J. Lin, and R. C. Weng, "A note on Platt's probabilistic outputs for support vector machines," Machine Learning, vol. 68, no. 3, pp. 267–276, 2007.
[16] H. Akaike, "Information theory and an extension of the maximum likelihood principle," in Proceedings of the 2nd International Symposium on Information Theory, B. N. Petrov and F. Csaki, Eds., pp. 267–281, Akademiai Kiado, Budapest, Hungary, 1973.
[17] M. Ishiguro, Y. Sakamoto, and G. Kitagawa, "Bootstrapping log likelihood and EIC, an extension of AIC," Annals of the Institute of Statistical Mathematics, vol. 49, no. 3, pp. 411–434, 1997.
[18] R. Shibata, "Bootstrap estimate of Kullback-Leibler information for model selection," Statistica Sinica, vol. 7, no. 2, pp. 375–394, 1997.
[19] D. Collett, Modelling Binary Data, Chapman & Hall, New York, NY, USA, 2nd edition, 2003.
[20] J. M. Landwehr, D. Pregibon, and A. C. Shoemaker, "Graphical methods for assessing logistic regression models," Journal of the American Statistical Association, vol. 79, pp. 61–71, 1984.
[21] T. J. Hastie and R. J. Tibshirani, "Classification by pairwise coupling," Annals of Statistics, vol. 26, no. 2, pp. 451–471, 1998.
[22] E. J. Bredensteiner and K. P. Bennett, "Multicategory classification by support vector machines," Computational Optimization and Applications, vol. 12, pp. 53–79, 1999.
[23] C.-W. Hsu and C.-J. Lin, "A formal analysis of stopping criteria of decomposition methods for support vector machines," IEEE Transactions on Neural Networks, vol. 13, no. 5, pp. 1045–1052, 2002.
[24] S. N. Wood, Generalized Additive Models: An Introduction with R, Chapman & Hall, New York, NY, USA, 2006.
[25] A. Albert, Multivariate Interpretation of Clinical Laboratory Data, Marcel Dekker, New York, NY, USA, 1992.
[26] A. Albert and E. Lesaffre, "Multiple group logistic discrimination," Computers and Mathematics with Applications, vol. 12, no. 2, pp. 209–224, 1986.
[27] E. Lesaffre and A. Albert, "Multiple-group logistic regression diagnosis," Computers and Mathematics with Applications, vol. 38, pp. 425–440, 1989.
[28] Y. Wang, G. Wahba, C. Gu, R. Klein, and B. Klein, "Using smoothing spline ANOVA to examine the relation of risk factors to the incidence and progression of diabetic retinopathy," Statistics in Medicine, vol. 16, no. 12, pp. 1357–1376, 1997.
Copyright
Copyright © 2011 Masaaki Tsujitani and Yusuke Tanaka. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.