Research Article | Open Access
Cross-Validation, Bootstrap, and Support Vector Machines
This paper considers the applications of resampling methods to support vector machines (SVMs). We take into account the leaving-one-out cross-validation (CV) when determining the optimum tuning parameters and bootstrapping the deviance in order to summarize the measure of goodness-of-fit in SVMs. The leaving-one-out CV is also adapted in order to provide estimates of the bias of the excess error in a prediction rule constructed with training samples. We analyze the data from a mackerel-egg survey and a liver-disease study.
In recent years, support vector machines (SVMs) have been intensively studied and applied to practical problems in many fields of science and engineering [1–3]. SVMs have many merits that distinguish them from many other machine learning algorithms, including the nonexistence of local minima, the speed of calculation, and the use of only two tuning parameters. There are at least two reasons to use a leaving-one-out cross-validation (CV). First, the criterion based on the method is demonstrated to be favorable when determining the tuning parameters. Second, the method can estimate the bias of the excess error in prediction. No standard procedures exist by which to assess the overall goodness-of-fit of the model based on SVM. By introducing the maximum likelihood principle, the deviance allows us to test the goodness-of-fit of the model. Since no adequate distribution theory exists for the deviance, we provide bootstrapping on the null distribution of the deviance for the model having optimum tuning parameters for SVM with a specified significance level [5–8].
The remainder of this paper is organized as follows. In Section 2, using the leaving-one-out CV, we focus on the determination of the tuning parameters and the evaluation of the overall goodness-of-fit with the optimum tuning parameters based on bootstrapping. The leaving-one-out CV is also adapted in order to provide estimates of the bias of the excess error in a prediction rule constructed with training samples. In Section 3, the one-against-one method is used to estimate a vector of multiclass probabilities for each pair of classes and then to couple the estimates together [3, 10]. In Section 4, the methods are illustrated using mackerel-egg survey and liver-disease data. We discuss the relative merits and limitations of the methods in Section 5.
2. Support Vector Machines and Resampling Methods
2.1. Support Vector Machines
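Given $n$ training pairs $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i$ is an input vector and $y_i \in \{-1, +1\}$, the SVM solves the following primal problem:

$$\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n, \tag{1}$$

where $w$ is the weight vector normal to the separating hyperplane, $T$ denotes the transposition of the matrix, $\phi$ is the feature map associated with the kernel function $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$, $C > 0$ is the tuning parameter denoting the tradeoff between the margin width and the training data error, and $\xi_i$ are slack variables. For an unknown input pattern $x$, we have the decision function

$$f(x) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b, \tag{2}$$

where $\alpha_i$ are the Lagrange multipliers. We employ the Gaussian radial basis function as the kernel function [3, 11, 12]

$$K(x_i, x_j) = \exp\left( -\gamma \| x_i - x_j \|^2 \right), \tag{3}$$

where $\gamma > 0$ is a fixed parameter, and

$$\| x_i - x_j \|^2 = (x_i - x_j)^T (x_i - x_j). \tag{4}$$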
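As an illustration of how the decision function combines the Lagrange multipliers with the Gaussian kernel, the following minimal sketch evaluates both in plain Python. It assumes the common parameterization exp(-gamma * ||x - z||^2) for the kernel; the function names, the multipliers, and the two "support vectors" are hypothetical toy values, not the output of any solver.

```python
import math

def rbf_kernel(x, z, gamma):
    """Gaussian RBF kernel, assumed here to be exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def decision_value(x, support_vectors, labels, alphas, b, gamma):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b."""
    return b + sum(a * y * rbf_kernel(sv, x, gamma)
                   for sv, y, a in zip(support_vectors, labels, alphas))

# Hypothetical toy values (not fitted by any solver).
svs = [(0.0, 0.0), (1.0, 1.0)]
ys = [-1, 1]
alphas = [1.0, 1.0]

f = decision_value((1.0, 1.0), svs, ys, alphas, b=0.0, gamma=0.5)
label = 1 if f > 0 else -1  # positive class when f(x) > 0
```

In practice the multipliers and the offset would come from a fitted SVM; only the evaluation step is sketched here.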
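Binary classification is performed by using the decision function $f(x)$: the input $x$ is assigned to the positive class if $f(x) > 0$, and to the negative class otherwise. Platt proposed one method for producing probabilistic outputs from a decision function by using the logistic link function

$$\Pr(t_i = 1 \mid f_i) = \frac{1}{1 + \exp(A f_i + B)}, \tag{5}$$

where $f_i = f(x_i)$ and $t_i$ represent the output of the SVM and the target value for the $i$th sample, respectively. This is equivalent to fitting a logistic regression model to the estimated decision values. The unknown parameters $A$ and $B$ in (5) can be estimated by minimizing the cross-entropy

$$E(A, B) = -\sum_{i=1}^{n} \left\{ t_i \ln p_i + (1 - t_i) \ln (1 - p_i) \right\}, \tag{6}$$

where

$$p_i = \frac{1}{1 + \exp(A f_i + B)}. \tag{7}$$

Putting

$$\ln p_i = -\ln\left(1 + \exp(A f_i + B)\right), \qquad \ln (1 - p_i) = (A f_i + B) - \ln\left(1 + \exp(A f_i + B)\right), \tag{8}$$

from (6), (7), and (8), we obtain

$$E(A, B) = \sum_{i=1}^{n} \left\{ (t_i - 1)(A f_i + B) + \ln\left(1 + \exp(A f_i + B)\right) \right\}. \tag{9}$$

Lin et al. observed that the problem of ln(0) never occurs for (9).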
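The remark about ln(0) can be made concrete with a small sketch. It assumes the sigmoid p = 1 / (1 + exp(A f + B)) and writes the cross-entropy as a sum of (t - 1) q + ln(1 + exp(q)) with q = A f + B, switching to the algebraically equivalent form t q + ln(1 + exp(-q)) when q is nonnegative so that exp never overflows. The function names are illustrative, and the values of A and B would normally come from a numerical optimizer that is not shown.

```python
import math

def platt_probability(f, A, B):
    """p = 1 / (1 + exp(A*f + B)), evaluated without overflow."""
    q = A * f + B
    if q >= 0:
        e = math.exp(-q)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(q))

def cross_entropy(fs, ts, A, B):
    """Cross-entropy in the reformulated form; no logarithm of zero can occur."""
    total = 0.0
    for f, t in zip(fs, ts):
        q = A * f + B
        if q >= 0:
            # Equivalent to (t - 1)*q + log(1 + exp(q)), but safe for large q.
            total += t * q + math.log1p(math.exp(-q))
        else:
            total += (t - 1.0) * q + math.log1p(math.exp(q))
    return total
```

For an extreme decision value such as f = 1000 with A = 1 and B = 0, the naive expression t ln p + (1 - t) ln(1 - p) would require ln(0), whereas the reformulated sum stays finite.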
2.2. Leaving-One-Out Cross-Validation
2.2.1. CV Score
We must determine the optimum values of tuning parameters $C$ and $\gamma$ in (1) and (3), respectively. This can be done by means of the leaving-one-out CV; a by-product is that the excess error rate of incorrectly predicting the outcome is estimated.
Let the initial sample $S = \{s_1, \ldots, s_n\}$ with $s_i = (x_i, y_i)$ be independently distributed according to an unknown distribution. The leaving-one-out CV algorithm is then given as follows (see, e.g., ).
Step 1. From the initial sample $S$, $s_i$ is deleted in order to form the training sample $S_{(i)}$.
Step 2. Using each training sample, fit an SVM and predict the decision value $\hat{f}_{(i)}$ for $x_i$.
Step 3. From the decision value $\hat{f}_{(i)}$, we can predict $\hat{p}_{(i)}$ for the deleted $i$th sample using (7) and calculate the predicted log-likelihood $t_i \ln \hat{p}_{(i)} + (1 - t_i) \ln (1 - \hat{p}_{(i)})$.
Step 5. The CV score (i.e., averaged predicted log-likelihood) is given by
Step 6. Carry out a grid search over tuning parameters $C$ and $\gamma$, taking the tuning parameters with minimum CV as optimal. It should be noted that the CV score is asymptotically equivalent to AIC (Akaike information criterion) and EIC (extended information criterion) [16–18].
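The steps above can be sketched as a small grid search. This is a hedged illustration only: a smoothed kernel-weighted vote stands in for the SVM plus logistic link, its two parameters `c` and `gamma` are hypothetical stand-ins for the two tuning parameters, and the CV score is taken here as the negative sum of held-out log-likelihoods so that smaller is better (the sign and scaling convention is an assumption).

```python
import itertools
import math

def kernel_vote(train_x, train_t, x_new, c, gamma):
    """Stand-in probabilistic classifier (NOT an SVM).

    Returns a smoothed, RBF-weighted share of class 1 among the training
    points; the pseudo-count c > 0 keeps the result strictly inside (0, 1).
    """
    weights = [math.exp(-gamma * (x - x_new) ** 2) for x in train_x]
    class1 = sum(w for w, t in zip(weights, train_t) if t == 1)
    return (class1 + 0.5 * c) / (sum(weights) + c)

def loo_cv_score(xs, ts, c, gamma):
    """Negative sum of held-out log-likelihoods (assumed convention)."""
    score = 0.0
    for i in range(len(xs)):
        # Delete the i-th pair, refit on the rest, predict the deleted pair.
        train_x = xs[:i] + xs[i + 1:]
        train_t = ts[:i] + ts[i + 1:]
        p = kernel_vote(train_x, train_t, xs[i], c, gamma)
        score -= math.log(p) if ts[i] == 1 else math.log(1.0 - p)
    return score

def grid_search(xs, ts, c_grid, gamma_grid):
    """Evaluate every (c, gamma) pair and return the one with minimum CV."""
    scores = {(c, g): loo_cv_score(xs, ts, c, g)
              for c, g in itertools.product(c_grid, gamma_grid)}
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical one-dimensional toy data.
xs = [0.0, 0.2, 0.4, 0.6, 1.0, 1.2, 1.4, 1.6]
ts = [0, 0, 0, 0, 1, 1, 1, 1]
best, scores = grid_search(xs, ts, c_grid=[0.1, 1.0], gamma_grid=[0.5, 5.0, 50.0])
```

With a real SVM, `kernel_vote` would be replaced by fitting the machine on the reduced sample and passing its decision value through the logistic link.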
2.2.2. Excess Error Estimation
Let the actual error rate be the probability of incorrectly predicting the outcome of a new observation, given a discriminant rule on the initial sample $S$; this is useful for performance assessment of a discriminant rule. Given a discriminant rule based on the initial sample, the error rates of discrimination are also of interest. As the same observations are used for forming and assessing the discriminant rule, this proportion of errors, called the apparent error rate, underestimates the actual error rate. The estimate of the error rate is seriously biased when the initial sample is small. This bias for a given discriminant rule is called the excess error of that rule. To correct this bias and estimate the error rates, we provide the bias correction of the apparent error rate associated with a discriminant rule, which is constructed by fitting to the training sample in the SVM.
By applying a discriminant rule $\eta$ to the initial sample $S$, we can form the realized discriminant rule $\eta_S$. Let $\eta_{S_{(i)}}$ be the discrimination rule based on $S_{(i)}$, the initial sample with the $i$th pair $(x_i, y_i)$ deleted. Given a subject with $x_0$, we predict the response by $\eta_S(x_0)$. The algorithm for leaving-one-out CV that estimates the excess error rate when fitting an SVM is given as follows.
Step 1. Generate the training sample $S_{(i)}$, and construct the realized discrimination rule $\eta_{S_{(i)}}$ based on $S_{(i)}$. Then, define

$$Q\left[y_i, \eta_{S_{(i)}}(x_i)\right] = \begin{cases} 1 & \text{if } y_i \ne \eta_{S_{(i)}}(x_i), \\ 0 & \text{otherwise}. \end{cases} \tag{11}$$

Then the leaving-one-out CV error rate is given by

$$\hat{e}_{\mathrm{CV}} = \frac{1}{n} \sum_{i=1}^{n} Q\left[y_i, \eta_{S_{(i)}}(x_i)\right]. \tag{12}$$

Step 2. Calculate the apparent error

$$\hat{e}_{\mathrm{app}} = \frac{1}{n} \sum_{i=1}^{n} Q\left[y_i, \eta_{S}(x_i)\right]. \tag{13}$$

Step 3. The cross-validation estimator of expected excess error is

$$\hat{r}_{\mathrm{CV}} = \hat{e}_{\mathrm{CV}} - \hat{e}_{\mathrm{app}}. \tag{14}$$
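A minimal sketch of the three steps follows. It assumes the excess error is estimated as the leaving-one-out CV error rate minus the apparent error rate, and it uses a nearest-class-mean rule on one-dimensional inputs as a stand-in for the SVM; all names and data are illustrative.

```python
def fit_centroid_rule(xs, ys):
    """Stand-in discriminant rule (NOT an SVM): assign to the nearest class mean."""
    means = {}
    for label in set(ys):
        pts = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(pts) / len(pts)
    return lambda x: min(means, key=lambda label: abs(x - means[label]))

def apparent_error(xs, ys, fit):
    """Error rate when the same sample is used to build and assess the rule."""
    rule = fit(xs, ys)
    return sum(rule(x) != y for x, y in zip(xs, ys)) / len(xs)

def loo_error(xs, ys, fit):
    """Leaving-one-out CV error rate: refit without pair i, then predict pair i."""
    errors = 0
    for i in range(len(xs)):
        rule = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        errors += rule(xs[i]) != ys[i]
    return errors / len(xs)

def excess_error_estimate(xs, ys, fit):
    """CV estimate of the excess error (assumed sign: CV error minus apparent error)."""
    return loo_error(xs, ys, fit) - apparent_error(xs, ys, fit)

# Hypothetical toy data: two points sit close to the class boundary.
xs = [0.0, 1.0, 2.4, 2.6, 4.0, 5.0]
ys = [-1, -1, -1, 1, 1, 1]
excess = excess_error_estimate(xs, ys, fit_centroid_rule)
```

On this toy sample the rule classifies every training point correctly, yet both boundary points are misclassified once they are left out, which is exactly the optimism that the excess error measures.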
2.3. Bootstrapping
Introducing the maximum likelihood principle into the SVM, the deviance allows us to test the goodness-of-fit of the model

$$\mathrm{Dev} = 2 \left( \ln L_f - \ln L_c \right) = -2 \ln L_c, \tag{15}$$

where $\ln L_c$ denotes the maximized log likelihood under some current SVM, and the log likelihood for the saturated model, $\ln L_f$, is zero. The deviance given by (15) is, however, not even approximately a $\chi^2$ distribution for the case in which ungrouped binary responses are available [19, 20]. The number of degrees of freedom (d.f.) required for the test for significance using the assumed $\chi^2$ distribution for the deviance is a contentious issue. No adequate distribution theory exists for the deviance. The reason for this is somewhat technical (for details, see Section 3.8.3 in ). Consequently, the deviance on fitting a model to binary response data cannot be used directly as a summary measure of the goodness-of-fit test of the model.
Based on the above discussion, the percentile of deviance for goodness-of-fit test can in principle be calculated. However, the calculations are usually too complicated to perform analytically, so a Monte Carlo method can be employed [6, 7].
Step 1. Generate $B$ bootstrap samples $S^*$ from the original sample $S$. Let $S^*_b$ denote the $b$th bootstrap sample.
Step 3. Take the value of the $j$th order statistic of the $B$ replications as an estimate of the quantile of order $j/(B+1)$.
Step 4. The estimate of the $100(1 - \alpha)$th percentile of $\mathrm{Dev}^*$ is used to test the goodness-of-fit of a model having a specified significance level $\alpha$. The value of the deviance of (15) being greater than the estimate of the percentile indicates that the model fits poorly. Typically, the number of replications $B$ is in the range of .
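The bootstrap procedure can be sketched as follows. This is an illustration under stated assumptions: a constant-probability model stands in for the fitted SVM, the deviance is taken as minus twice the Bernoulli log-likelihood, and the j-th order statistic is treated as the quantile of order j / (B + 1). With a real model, whole (input, response) pairs would be resampled and the SVM refitted on each bootstrap sample.

```python
import math
import random

def deviance(ts, ps):
    """Minus twice the Bernoulli log-likelihood of responses ts under fitted ps."""
    ll = 0.0
    for t, p in zip(ts, ps):
        ll += math.log(p) if t == 1 else math.log(1.0 - p)
    return -2.0 * ll

def fit_constant_model(ts):
    """Stand-in model (NOT an SVM): every fitted probability is the sample mean."""
    p_hat = sum(ts) / len(ts)
    return [p_hat] * len(ts)

def bootstrap_deviance_percentile(ts, B, alpha, seed=0):
    """Estimate the 100(1 - alpha)th percentile of the bootstrapped deviance."""
    rng = random.Random(seed)
    n = len(ts)
    reps = []
    for _ in range(B):
        # Resample with replacement, refit, and record the deviance.
        sample = [ts[rng.randrange(n)] for _ in range(n)]
        reps.append(deviance(sample, fit_constant_model(sample)))
    reps.sort()
    # Order statistic j estimates the quantile of order j / (B + 1).
    j = min(B, max(1, math.ceil((1.0 - alpha) * (B + 1))))
    return reps[j - 1], reps

# Hypothetical binary responses.
ts = [0, 1] * 10
critical_value, reps = bootstrap_deviance_percentile(ts, B=199, alpha=0.05)
```

The observed deviance would then be compared with `critical_value`; an observed value above it suggests a poor fit at level alpha.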
2.4. Influential Analysis
Assessing the discrepancies between $t_i$ and $\hat{p}_i$ at the $i$th observation in (15), the influence measure provides guides and suggestions that may be carefully applied to an SVM. The effect of the $i$th observation on the deviance can be measured by computing

$$\Delta \mathrm{Dev}_{(i)} = \mathrm{Dev} - \mathrm{Dev}_{(i)}, \tag{16}$$

where $\mathrm{Dev}_{(i)}$ is the deviance with the $i$th observation deleted. The distribution of $\Delta \mathrm{Dev}_{(i)}$ will be approximated by $\chi^2$ with d.f. = 1 when the fitted model is correct. An index plot is a reasonable rule of thumb for graphically presenting the information contained in the values of $\Delta \mathrm{Dev}_{(i)}$. The key idea behind this plot is not to focus on a global measure of goodness-of-fit but rather on local contributions to the fit. An influential observation is one that greatly changes the results of the statistical inference when deleted from the initial sample.
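As a hedged illustration of the deletion diagnostic, the sketch below computes the change in deviance when each observation is removed and compares it with 3.841, the upper 5% point of the chi-square distribution with 1 d.f. A constant-probability model again stands in for the SVM, so the numbers only show the mechanics.

```python
import math

CHI2_1DF_95 = 3.841  # upper 5% point of chi-square with 1 d.f.

def constant_model_deviance(ts):
    """Deviance of a stand-in model (NOT an SVM) that fits one common probability."""
    p = sum(ts) / len(ts)
    ll = sum(math.log(p) if t == 1 else math.log(1.0 - p) for t in ts)
    return -2.0 * ll

def influence_measures(ts):
    """Deviance of the full fit minus the deviance with observation i deleted."""
    full = constant_model_deviance(ts)
    return [full - constant_model_deviance(ts[:i] + ts[i + 1:])
            for i in range(len(ts))]

# Hypothetical responses: a single positive case among nineteen negatives.
ts = [0] * 19 + [1]
deltas = influence_measures(ts)
flagged = [i for i, d in enumerate(deltas) if d > CHI2_1DF_95]
```

Plotting `deltas` against the observation index gives the kind of index plot described above; here only the lone positive case exceeds the reference value.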
Platt  proposed the threefold CV for estimating the decision values in (9). However, the value of may be negative because three SVMs are trained on splitted three parts of training pairs . Therefore, in the present paper, we train a single SVM on the training pairs in order to evaluate the decision values and estimate probabilistic outputs according to .
3. Multiclass SVM
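We consider the discriminant problem with $k$ classes and $n$ training pairs $(x_t, y_t)$, $t = 1, \ldots, n$, where $x_t$ is an input vector and $y_t \in \{1, \ldots, k\}$ [10, 21, 22]. Let $p_i(x_t) = \Pr(y_t = i \mid x_t)$, $i = 1, \ldots, k$, denote the response probabilities, with $p_i(x_t) \ge 0$, for multiclass classification with

$$\sum_{i=1}^{k} p_i(x_t) = 1. \tag{17}$$

The log-likelihood is given by

$$\ln L = \sum_{t=1}^{n} \sum_{i=1}^{k} d_{ti} \ln p_i(x_t), \tag{18}$$

where $d_{ti} = 1$ if $y_t = i$ and $d_{ti} = 0$ otherwise.
For multi-class classification, the one-against-one method (also called pairwise classification) is used to produce a vector of multi-class probabilities for each pair of classes, and then to couple the estimates together. The earliest used implementation for multi-class SVM is probably the one-against-one method of . This method constructs $k(k-1)/2$ classifiers, each trained on data from the $i$th and $j$th classes of the training set.
The SVM solves the primal formulation [3, 10, 23]

$$\min_{w^{ij}, b^{ij}, \xi^{ij}} \; \frac{1}{2} (w^{ij})^T w^{ij} + C \sum_{t} \xi^{ij}_t \quad \text{subject to} \quad \begin{cases} (w^{ij})^T \phi(x_t) + b^{ij} \ge 1 - \xi^{ij}_t & \text{if } y_t = i, \\ (w^{ij})^T \phi(x_t) + b^{ij} \le -1 + \xi^{ij}_t & \text{if } y_t = j, \end{cases} \quad \xi^{ij}_t \ge 0. \tag{19}$$

Given $k$ classes of data for any $x$, the goal is to estimate

$$p_i = \Pr(y = i \mid x), \quad i = 1, \ldots, k. \tag{20}$$

We first estimate pairwise class probabilities $r_{ij} \approx \Pr(y = i \mid y = i \text{ or } j, x)$ by using

$$r_{ij} \approx \frac{1}{1 + \exp(A \hat{f} + B)}, \tag{21}$$

where $A$ and $B$ are estimated by minimizing the cross entropy using training data and the corresponding decision values $\hat{f}$.
Hastie and Tibshirani proposed minimizing the Kullback-Leibler (KL) distance between $r_{ij}$ and $\mu_{ij}$,

$$l(p) = \sum_{i \ne j} n_{ij} r_{ij} \ln \frac{r_{ij}}{\mu_{ij}}, \tag{22}$$

where $\mu_{ij} = p_i / (p_i + p_j)$ and $n_{ij}$ is the number of training data in the $i$th and $j$th classes.
Wu et al. propose the second approach to obtain $p_i$ from all these $r_{ij}$'s by optimizing

$$\min_{p} \; \frac{1}{2} \sum_{i=1}^{k} \sum_{j \ne i} \left( r_{ji} p_i - r_{ij} p_j \right)^2 \quad \text{subject to} \quad \sum_{i=1}^{k} p_i = 1, \quad p_i \ge 0. \tag{23}$$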
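To make the coupling step concrete, the following sketch implements the iterative scaling scheme usually associated with the Hastie and Tibshirani approach: each class probability is multiplied by the ratio of observed to fitted pairwise quantities and the vector is renormalized. It is a minimal illustration, not the optimization proposed by Wu et al.; the function name, the stopping rule, and the toy inputs are assumptions.

```python
def couple_pairwise(r, n_pairs, max_iter=1000, tol=1e-10):
    """Couple pairwise estimates into class probabilities by iterative scaling.

    r[i][j] estimates P(class i | class i or j); n_pairs[i][j] is the number
    of training points in classes i and j. Diagonal entries are ignored.
    """
    k = len(r)
    p = [1.0 / k] * k
    for _ in range(max_iter):
        old = p[:]
        for i in range(k):
            num = sum(n_pairs[i][j] * r[i][j] for j in range(k) if j != i)
            den = sum(n_pairs[i][j] * p[i] / (p[i] + p[j])
                      for j in range(k) if j != i)
            p[i] *= num / den
            total = sum(p)
            p = [v / total for v in p]  # keep the probabilities summing to 1
        if max(abs(a - b) for a, b in zip(p, old)) < tol:
            break
    return p

# Hypothetical check: pairwise values generated from known class probabilities.
true_p = [0.5, 0.3, 0.2]
r = [[0.0 if i == j else true_p[i] / (true_p[i] + true_p[j])
      for j in range(3)] for i in range(3)]
n_pairs = [[0 if i == j else 1 for j in range(3)] for i in range(3)]
p_hat = couple_pairwise(r, n_pairs)
```

When the pairwise estimates are mutually consistent, as in this toy check, the iteration should recover the generating probabilities; with noisy estimates it returns a compromise between them.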
Thus, we can adopt the leaving-one-out CV similar to the method in Section 2.2.
Step 1. From the initial sample $S$, the $t$th pair $(x_t, y_t)$ is deleted in order to form the training sample $S_{(t)}$.
Step 2. Using each training sample, fit an SVM in order to estimate $p_i$ by (22), and predict $\hat{p}_i$ for the deleted $t$th sample $x_t$.
Step 4. The CV score is given by
Step 5. Tuning parameters with minimum CV can be determined as optimal by carrying out a grid search over $C$ and $\gamma$.
4. Examples
4.1. Mackerel-Egg Survey Data
We consider data consisting of 634 observations from a 1992 mackerel egg survey . There are the following predictors of egg abundance: the location (longitude and latitude) at which samples were taken, depth of the ocean, distance from the 200 m seabed contour, and, finally, water temperature at a depth of 20 m. We first fit a SVM. In the same manner as described in , we determine tuning parameters and . The optimum values of the tuning parameters are . The bootstrap estimator of the percentile for the deviance is . A comparison with the deviance from (15) suggests that the SVM fits the data fairly well. For reference purposes, the histogram of the bootstrapped for is provided in Figure 1.
We can estimate the apparent error rate of incorrectly predicting the outcome and the leaving-one-out CV error rate for several models, as shown in Table 1. The smoothing parameters in generalized additive models (GAM) and the number of hidden units in a neural network in Table 1 are determined using the leaving-one-out CV. From Table 1, the leaving-one-out CV error rate for the SVM is the smallest among all models, but the apparent error rate is the smallest for the neural network. The CV scores are 477.04, 509.44, and 541.61 for the SVM, logistic discriminant, and neural network, respectively. This implies that the SVM is the best among these three models from the point of view of CV. Figure 2 shows the index plot of $\Delta \mathrm{Dev}_{(i)}$, which indicates that no. 399 and no. 601 are influential observations at the level of significance.
4.2. Liver Disease Data
We apply the proposed method to laboratory data collected from 218 patients with liver disorders [25–27]. Four liver diseases were observed: acute viral hepatitis (57 patients), persistent chronic hepatitis (44 patients), aggressive chronic hepatitis (40 patients), and postnecrotic cirrhosis (77 patients). The covariates consist of four liver enzymes: aspartate aminotransferase (AST), alanine aminotransferase (ALT), glutamate dehydrogenase (GLDH), and ornithine carbamyltransferase (OCT). For each pair, the CV performance is measured by training and testing the other of the data. Then, we train the whole training set by using the pair , which achieves the minimum CV score (=187.93) and predicts the test set. The apparent and leaving-one-out CV error rates for the training and test samples for several models are shown in Table 2. As shown, the apparent error rate for the SVM on the training sample and the error rate for the SVM on the test sample are the smallest among all models, but the leaving-one-out CV error rate for the SVM on the training sample is larger than that of the multinomial logistic discriminant model.
5. Concluding Remarks
We considered the application of resampling methods to SVMs. Statistical inference based on the likelihood approach for SVMs was discussed, and the leaving-one-out CV was suggested for determining the tuning parameters and for estimating the bias of the excess error in prediction. Bootstrapping was used to evaluate the overall goodness-of-fit with the optimum tuning parameters. Data from a mackerel-egg survey and a liver-disease study were used to evaluate the resampling methods.
There is one broad limitation to our approach: the SVM assumed the independence of the predictor variables. More generally, it may be preferable to visualize interactions between predictor variables. The smoothing spline ANOVA models can provide an excellent means for handling data of mutually exclusive groups and a set of predictor variables. We expect that flexible methods for a discriminant model using machine learning theory, such as penalized smoothing splines, will be very useful in these real-world contexts.
- C. M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, NY, USA, 2006.
- N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, Cambridge, UK, 2000.
- C.-W. Hsu, C.-C. Chang, and C.-J. Lin, “A practical guide to support vector classification,” 2009, http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.
- P. Zhang, “Model selection via multifold cross validation,” Annals of Statistics, vol. 21, pp. 299–313, 1993.
- B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap, Chapman & Hall, New York, NY, USA, 1993.
- M. Tsujitani and T. Koshimizu, “Neural discriminant analysis,” IEEE Transactions on Neural Networks, vol. 11, no. 6, pp. 1394–1401, 2000.
- M. Tsujitani and M. Aoki, “Neural regression model, resampling and diagnosis,” Systems and Computers in Japan, vol. 37, no. 6, pp. 13–20, 2006.
- M. Tsujitani and M. Sakon, “Analysis of survival data having time-dependent covariates,” IEEE Transactions on Neural Networks, vol. 20, no. 3, pp. 389–394, 2009.
- G. Gong, “Cross-validation, the jackknife, and the bootstrap: excess error estimation in forward logistic regression,” Journal of the American Statistical Association, vol. 81, pp. 108–113, 1986.
- T.-F. Wu, C.-J. Lin, and R. C. Weng, “Probability estimates for multi-class classification by pairwise coupling,” Journal of Machine Learning Research, vol. 5, pp. 975–1005, 2004.
- C. W. Hsu and C. J. Lin, “A comparison of methods for multi-class support vector machines,” IEEE Transactions on Neural Networks, vol. 13, pp. 415–425, 2002.
- C.-C. Chang and C.-J. Lin, “LIBSVM: a library for support vector machines,” 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm.
- J. Platt, “Probabilistic outputs for support vector machines and comparison to regularized likelihood methods,” in Advances in Large Margin Classifiers, A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, Eds., MIT Press, Cambridge, Mass, USA, 2000.
- A. Karatzoglou, D. Meyer, and K. Hornik, “Support vector machines in R,” Journal of Statistical Software, vol. 15, no. 9, pp. 1–28, 2006.
- H. T. Lin, C. J. Lin, and R. C. Weng, “A note on Platt's probabilistic outputs for support vector machines,” Machine Learning, vol. 68, no. 3, pp. 267–276, 2007.
- H. Akaike, “Information theory and an extension of the maximum likelihood principle,” in Proceedings of the 2nd International Symposium on Information Theory, B. N. Petrov and F. Csaki, Eds., pp. 267–281, Akadémiai Kiadó, Budapest, Hungary, 1973.
- M. Ishiguro, Y. Sakamoto, and G. Kitagawa, “Bootstrapping log likelihood and EIC, an extension of AIC,” Annals of the Institute of Statistical Mathematics, vol. 49, no. 3, pp. 411–434, 1997.
- R. Shibata, “Bootstrap estimate of Kullback-Leibler information for model selection,” Statistica Sinica, vol. 7, no. 2, pp. 375–394, 1997.
- D. Collett, Modelling Binary Data, Chapman & Hall, New York, NY, USA, 2nd edition, 2003.
- J. M. Landwehr, D. Pregibon, and A. C. Shoemaker, “Graphical methods for assessing logistic regression models,” Journal of the American Statistical Association, vol. 79, pp. 61–71, 1984.
- T. J. Hastie and R. J. Tibshirani, “Classification by pairwise coupling,” Annals of Statistics, vol. 26, no. 2, pp. 451–471, 1998.
- E. J. Bredensteiner and K. P. Bennett, “Multicategory classification by support vector machines,” Computational Optimization and Applications, vol. 12, pp. 53–79, 1999.
- C. W. Hsu and C. J. Lin, “A formal analysis of stopping criteria of decomposition methods for support vector machines,” IEEE Transactions on Neural Networks, vol. 13, no. 5, pp. 1045–1052, 2002.
- S. N. Wood, Generalized Additive Models: An Introduction with R, Chapman & Hall, New York, NY, USA, 2006.
- A. Albert, Multivariate Interpretation of Clinical Laboratory Data, Marcel Dekker, New York, NY, USA, 1992.
- A. Albert and E. Lesaffre, “Multiple group logistic discrimination,” Computers and Mathematics with Applications, vol. 12, no. 2, pp. 209–224, 1986.
- E. Lesaffre and A. Albert, “Multiple-group logistic regression diagnosis,” Computers and Mathematics with Applications, vol. 38, pp. 425–440, 1989.
- Y. Wang, G. Wahba, C. Gu, R. Klein, and B. Klein, “Using smoothing spline anova to examine the relation of risk factors to the incidence and progression of diabetic retinopathy,” Statistics in Medicine, vol. 16, no. 12, pp. 1357–1376, 1997.
Copyright © 2011 Masaaki Tsujitani and Yusuke Tanaka. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.