Abstract

This paper considers the application of resampling methods to support vector machines (SVMs). We use leaving-one-out cross-validation (CV) to determine the optimum tuning parameters and bootstrap the deviance in order to summarize the goodness-of-fit of SVMs. Leaving-one-out CV is also adapted to provide estimates of the bias of the excess error in a prediction rule constructed from training samples. We analyze data from a mackerel-egg survey and a liver-disease study.

1. Introduction

In recent years, support vector machines (SVMs) have been intensively studied and applied to practical problems in many fields of science and engineering [1–3]. SVMs have several merits that distinguish them from other machine learning algorithms, including the nonexistence of local minima, the speed of calculation, and the use of only two tuning parameters. There are at least two reasons to use leaving-one-out cross-validation (CV) [4]. First, the criterion based on this method performs well when determining the tuning parameters. Second, the method can estimate the bias of the excess error in prediction. No standard procedures exist by which to assess the overall goodness-of-fit of the model based on an SVM. By introducing the maximum likelihood principle, the deviance allows us to test the goodness-of-fit of the model. Since no adequate distribution theory exists for the deviance, we bootstrap the null distribution of the deviance for the model having optimum tuning parameters for the SVM at a specified significance level [5–8].

The remainder of this paper is organized as follows. In Section 2, using the leaving-one-out CV, we focus on determining the tuning parameters and on evaluating, by bootstrapping, the overall goodness-of-fit of the model with the optimum tuning parameters. The leaving-one-out CV is also adapted to provide estimates of the bias of the excess error in a prediction rule constructed from training samples [9]. In Section 3, the one-against-one method is used to estimate a vector of multiclass probabilities for each pair of classes and then to couple the estimates together [3, 10]. In Section 4, the methods are illustrated using mackerel-egg survey and liver-disease data. We discuss the relative merits and limitations of the methods in Section 5.

2. Support Vector Machines and Resampling Methods

2.1. Support Vector Machines

Given training pairs $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$, where $\mathbf{x}_i \in \mathbb{R}^p$ is an input vector and $y_i \in \{-1, +1\}$, the SVM solves the following primal problem:
\[ \min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i\{\mathbf{w}^T\phi(\mathbf{x}_i) + b\} \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \ldots, n, \qquad (1) \]
where $\mathbf{w}$ is the weight vector, $^T$ denotes the transposition of a matrix, $\phi$ is the feature map implicitly defined by the kernel function $K(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T\phi(\mathbf{x}')$, $C$ is the tuning parameter denoting the tradeoff between the margin width and the training data error, and $\xi_i$ are slack variables. For an unknown input pattern $\mathbf{x}$, we have the decision function
\[ f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b, \qquad (2) \]
where $\alpha_i$ are the Lagrange multipliers. We employ the Gaussian radial basis function as the kernel function [3, 11, 12]:
\[ K(\mathbf{x}, \mathbf{x}') = \exp\left( -\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2} \right), \qquad (3) \]
where $\sigma^2$ is a fixed parameter, and
\[ \|\mathbf{x} - \mathbf{x}'\|^2 = (\mathbf{x} - \mathbf{x}')^T(\mathbf{x} - \mathbf{x}'). \qquad (4) \]
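As a concrete illustration of (1)–(3), the following minimal sketch fits an RBF-kernel SVM and evaluates the decision function (2). scikit-learn and the toy data are assumptions of this sketch, not the software used in this paper; note that scikit-learn parameterizes the kernel as $\exp(-\gamma\|\mathbf{x}-\mathbf{x}'\|^2)$, so $\gamma = 1/(2\sigma^2)$.

```python
# Minimal sketch of (1)-(3): fit an RBF-kernel SVM and compute decision values.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # n = 100 inputs, p = 5 predictors
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)    # labels y_i in {-1, +1}

C, sigma2 = 1.0, 2.0                          # tuning parameters of (1) and (3)
svm = SVC(C=C, kernel="rbf", gamma=1.0 / (2.0 * sigma2))
svm.fit(X, y)

f = svm.decision_function(X)                  # decision values f(x) of (2)
y_hat = np.where(f >= 0, 1, -1)               # assign class by the sign of f(x)
```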

Binary classification is performed by using the decision function $f(\mathbf{x})$: the input $\mathbf{x}$ is assigned to the positive class if $f(\mathbf{x}) \ge 0$, and to the negative class otherwise. Platt [13] proposed a method for producing probabilistic outputs from a decision function by using the logistic link function
\[ p = \Pr(y = 1 \mid f) = \frac{1}{1 + \exp(Af + B)}, \qquad (5) \]
where $f$ and $y$ represent the output of the SVM and the target value for the sample, respectively [14]. This is equivalent to fitting a logistic regression model to the estimated decision values. The unknown parameters $A$ and $B$ in (5) can be estimated by minimizing the cross-entropy
\[ -\sum_{i=1}^{n} \{ t_i \ln p_i + (1 - t_i) \ln(1 - p_i) \}, \qquad (6) \]
where
\[ p_i = \frac{1}{1 + \exp(A f_i + B)}, \qquad (7) \]
\[ t_i = \frac{y_i + 1}{2}, \qquad (8) \]
and $f_i = f(\mathbf{x}_i)$. Substituting (7) and (8) into (6), we obtain
\[ \sum_{i=1}^{n} \left[ \ln\{1 + \exp(A f_i + B)\} - (1 - t_i)(A f_i + B) \right]. \qquad (9) \]
Lin et al. [15] observed that the problem of $\ln(0)$ never occurs for (9).
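To make (5)–(9) concrete, the sketch below estimates $A$ and $B$ by minimizing the cross-entropy in the form (9), in which $\ln(0)$ never occurs. The BFGS optimizer is an implementation choice for this sketch, not Platt's original pseudocode.

```python
# Hedged sketch of (5)-(9): estimate A and B from decision values f and labels y.
import numpy as np
from scipy.optimize import minimize

def fit_platt(f, y):
    """f: SVM decision values; y: labels in {-1, +1}. Returns (A, B)."""
    t = (y + 1) / 2.0                            # targets t_i of (8)

    def objective(ab):
        z = ab[0] * f + ab[1]                    # z_i = A f_i + B
        # ln(1 + e^z) - (1 - t) z, the form (9); evaluated stably via logaddexp
        return np.sum(np.logaddexp(0.0, z) - (1.0 - t) * z)

    return minimize(objective, x0=np.zeros(2), method="BFGS").x

def platt_prob(f, A, B):
    return 1.0 / (1.0 + np.exp(A * f + B))       # probability p_i of (7)
```

Applied to the decision values `f` from the previous sketch, `fit_platt(f, y)` returns the estimates of $(A, B)$, and `platt_prob` converts new decision values into probabilities.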

2.2. Leaving-One-Out Cross-Validation
2.2.1. CV Score

We must determine the optimum values of the tuning parameters $C$ and $\sigma^2$ in (1) and (3), respectively. This can be done by means of the leaving-one-out CV; as a by-product, the excess error rate of incorrectly predicting the outcome can be estimated.

Let the initial sample $\mathcal{X} = \{(\mathbf{x}_i, y_i);\, i = 1, \ldots, n\}$ be independently distributed according to an unknown distribution. The leaving-one-out CV algorithm is then given as follows (see, e.g., [5]).

Step 1. From the initial sample $\mathcal{X}$, the pair $(\mathbf{x}_i, y_i)$ is deleted in order to form the training sample $\mathcal{X}_{(-i)}$.

Step 2. Using each training sample $\mathcal{X}_{(-i)}$, fit an SVM and predict the decision value $f_{(-i)}(\mathbf{x}_i)$ for the deleted pair $(\mathbf{x}_i, y_i)$.

Step 3. From the decision value $f_{(-i)}(\mathbf{x}_i)$, we can predict $\hat{p}_{(-i)}(\mathbf{x}_i)$ for the deleted $i$th sample using (7) and calculate the predicted log-likelihood $\ell_{(-i)} = t_i \ln \hat{p}_{(-i)}(\mathbf{x}_i) + (1 - t_i) \ln\{1 - \hat{p}_{(-i)}(\mathbf{x}_i)\}$.

Step 4. Steps 1 to 3 are repeated for $i = 1, \ldots, n$.

Step 5. The CV score (i.e., the predicted log-likelihood accumulated over the deleted observations, on the deviance scale) is given by
\[ \mathrm{CV} = -2 \sum_{i=1}^{n} \ell_{(-i)}. \qquad (10) \]

Step 6. Carry out a grid search over the tuning parameters $C$ and $\sigma^2$, taking the values with minimum CV as optimal; a sketch of Steps 1–6 in code follows this list. It should be noted that the CV score is asymptotically equivalent to AIC (Akaike information criterion) and EIC (extended information criterion) [16–18].
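The following hedged sketch implements Steps 1–6. As noted above, fitting a logistic regression to the decision values stands in for the link (5); the function names and grids are illustrative, not the authors' code.

```python
# Hedged sketch of Steps 1-6: leave-one-out CV score (10) and a grid search.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

def loo_cv_score(X, y, C, sigma2):
    """CV of (10): minus twice the sum of the predicted log-likelihoods."""
    n = len(y)
    cv = 0.0
    for i in range(n):                                   # Steps 1-4
        keep = np.arange(n) != i                         # delete the ith pair
        svm = SVC(C=C, kernel="rbf", gamma=1.0 / (2.0 * sigma2))
        svm.fit(X[keep], y[keep])
        f_train = svm.decision_function(X[keep]).reshape(-1, 1)
        # weak penalty (large C) approximates the maximum likelihood fit of (5)
        link = LogisticRegression(C=1e6).fit(f_train, y[keep])
        f_i = svm.decision_function(X[[i]]).reshape(-1, 1)
        p1 = link.predict_proba(f_i)[0, list(link.classes_).index(1)]
        p1 = np.clip(p1, 1e-12, 1 - 1e-12)               # guard against ln(0)
        t = (y[i] + 1) / 2.0                             # t_i of (8)
        cv += -2.0 * (t * np.log(p1) + (1 - t) * np.log(1 - p1))
    return cv                                            # Step 5

def grid_search(X, y, Cs=(0.1, 1.0, 10.0), sigma2s=(0.5, 1.0, 2.0)):
    """Step 6: tuning parameters with minimum CV over an illustrative grid."""
    scores = {(C, s2): loo_cv_score(X, y, C, s2) for C in Cs for s2 in sigma2s}
    return min(scores, key=scores.get)
```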

2.2.2. Excess Error Estimation

Let the actual error rate be the probability of incorrectly predicting the outcome of a new observation, given a discriminant rule constructed from the initial sample $\mathcal{X}$; this is useful for the performance assessment of a discriminant rule. Given a discriminant rule based on the initial sample, the error rates of discrimination are also of interest. Because the same observations are used both to form and to assess the discriminant rule, this proportion of errors, called the apparent error rate, underestimates the actual error rate. The estimate of the error rate is seriously biased when the initial sample is small. This bias for a given discriminant rule is called the excess error of that rule. To correct this bias and estimate the error rates, we provide the bias correction of the apparent error rate associated with a discriminant rule, which is constructed by fitting the SVM to the training sample.

By applying a discriminant rule $\eta$ to the initial sample $\mathcal{X}$, we can form the realized discriminant rule $\eta_{\mathcal{X}}$. Given a subject with input $\mathbf{x}$, we predict the response $y$ by $\eta_{\mathcal{X}}(\mathbf{x})$. The algorithm for the leaving-one-out CV that estimates the excess error rate when fitting an SVM is given as follows [9].

Step 1. Generate the training sample $\mathcal{X}_{(-i)}$, and construct the realized discrimination rule $\eta_{(-i)}$ based on $\mathcal{X}_{(-i)}$. Then, define
\[ Q_i = \begin{cases} 1, & \text{if } y_i \neq \eta_{(-i)}(\mathbf{x}_i), \\ 0, & \text{otherwise}. \end{cases} \qquad (11) \]
The leaving-one-out CV error rate is then given by
\[ \hat{e}_{\mathrm{CV}} = \frac{1}{n} \sum_{i=1}^{n} Q_i. \qquad (12) \]

Step 2. Calculate the apparent error rate
\[ \hat{e}_{\mathrm{app}} = \frac{1}{n} \sum_{i=1}^{n} I\{ y_i \neq \eta_{\mathcal{X}}(\mathbf{x}_i) \}, \qquad (13) \]
where $I\{\cdot\}$ is the indicator function.

Step 3. The cross-validation estimator of the expected excess error is
\[ \hat{\omega}_{\mathrm{CV}} = \hat{e}_{\mathrm{CV}} - \hat{e}_{\mathrm{app}}. \qquad (14) \]
A sketch of these steps in code is given below.
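The sketch below implements Steps 1–3, again assuming an RBF-kernel SVM via scikit-learn as the discriminant rule $\eta$; the function name is illustrative.

```python
# Hedged sketch of Steps 1-3: LOO CV error (12), apparent error (13),
# and the excess-error estimate (14).
import numpy as np
from sklearn.svm import SVC

def excess_error(X, y, C=1.0, sigma2=1.0):
    gamma = 1.0 / (2.0 * sigma2)
    n = len(y)

    # Step 1: leave-one-out CV error rate, (11)-(12)
    Q = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        rule = SVC(C=C, kernel="rbf", gamma=gamma).fit(X[keep], y[keep])
        Q[i] = rule.predict(X[[i]])[0] != y[i]
    e_cv = Q.mean()

    # Step 2: apparent error rate (13), from the rule fit to all of the data
    rule = SVC(C=C, kernel="rbf", gamma=gamma).fit(X, y)
    e_app = np.mean(rule.predict(X) != y)

    # Step 3: CV estimator of the expected excess error (14)
    return e_cv - e_app
```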

2.3. Bootstrapping

Introducing the maximum likelihood principle into the SVM, the deviance allows us to test the goodness-of-fit of the model:
\[ D = -2\hat{\ell} = -2 \sum_{i=1}^{n} \{ t_i \ln \hat{p}_i + (1 - t_i) \ln(1 - \hat{p}_i) \}, \qquad (15) \]
where $\hat{\ell}$ denotes the maximized log likelihood under some current SVM, and the log likelihood for the saturated model is zero. The deviance given by (15), however, does not have even an approximate $\chi^2$ distribution for the case in which ungrouped binary responses are available [19, 20]. The number of degrees of freedom (d.f.) required for the test for significance using the assumed $\chi^2$ distribution for the deviance is a contentious issue. No adequate distribution theory exists for the deviance. The reason for this is somewhat technical (for details, see Section 3.8.3 in [19]). Consequently, the deviance on fitting a model to binary response data cannot be used as a summary measure of the goodness-of-fit test of the model.

Based on the above discussion, the percentile of the deviance for the goodness-of-fit test can in principle be calculated. However, the calculations are usually too complicated to perform analytically, so the Monte Carlo method can be employed [6, 7].

Step 1. Generate $B$ bootstrap samples from the original sample $\mathcal{X}$. Let $\mathcal{X}^{*b}$ denote the $b$th bootstrap sample.

Step 2. For the bootstrap sample $\mathcal{X}^{*b}$, compute the deviance of (15), denoted by $D^{*b}$. Steps 1 and 2 are repeated independently $B$ times, and the computed values $D^{*1}, \ldots, D^{*B}$ are arranged in ascending order.

Step 3. Take the value of the $[B(1-\alpha)]$th order statistic of the $B$ replications as an estimate of the quantile of order $1 - \alpha$.

Step 4. The estimate of the $100(1-\alpha)$th percentile of $D^*$ is used to test the goodness-of-fit of a model having a specified significance level $\alpha$: a value of the deviance of (15) greater than the estimated percentile indicates that the model fits poorly. Typically, the number of replications is in the range of . A sketch of the bootstrap test in code is given below.
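The bootstrap test can be sketched as follows. The `deviance` helper mirrors (15) with a logistic link fitted to the decision values (standing in for (5)–(9)); bootstrap samples containing only one class are redrawn so that the SVM can be refit; and `B`, `alpha`, and all names are assumptions of this sketch.

```python
# Hedged sketch of Steps 1-4: bootstrap the null distribution of the deviance (15).
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

def deviance(X, y, C=1.0, sigma2=1.0):
    """D of (15): -2 * maximized log-likelihood under the fitted SVM."""
    svm = SVC(C=C, kernel="rbf", gamma=1.0 / (2.0 * sigma2)).fit(X, y)
    f = svm.decision_function(X).reshape(-1, 1)
    link = LogisticRegression(C=1e6).fit(f, y)     # weak penalty ~ ML fit of (5)
    p1 = np.clip(link.predict_proba(f)[:, 1], 1e-12, 1 - 1e-12)  # P(y = +1)
    t = (y + 1) / 2.0                              # t_i of (8), for y in {-1, +1}
    return -2.0 * np.sum(t * np.log(p1) + (1 - t) * np.log(1 - p1))

def bootstrap_critical_value(X, y, B=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    D_star = []
    while len(D_star) < B:
        idx = rng.integers(0, n, size=n)           # Step 1: bootstrap sample
        if len(np.unique(y[idx])) < 2:             # both classes needed to refit
            continue
        D_star.append(deviance(X[idx], y[idx]))    # Step 2
    D_star = np.sort(D_star)                       # ascending order
    return D_star[int(np.ceil(B * (1 - alpha))) - 1]   # Step 3: order statistic
```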

2.4. Influential Analysis

Assessing the discrepancies between $t_i$ and $\hat{p}_i$ at the $i$th observation in (15), the influence measure provides guides and suggestions that may be carefully applied to an SVM [19]. The effect of the $i$th observation on the deviance can be measured by computing
\[ \Delta D_i = D - D_{(-i)}, \qquad (16) \]
where $D_{(-i)}$ is the deviance with the $i$th observation deleted. The distribution of $\Delta D_i$ will be approximated by $\chi^2$ with d.f. = 1 when the fitted model is correct. An index plot of $\Delta D_i$ is a reasonable rule of thumb for graphically presenting the information contained in its values. The key idea behind this plot is not to focus on a global measure of goodness-of-fit but rather on local contributions to the fit. An influential observation is one that greatly changes the results of the statistical inference when deleted from the initial sample.
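An index plot of $\Delta D_i$ can be produced as in the sketch below, which reuses the `deviance` helper from the bootstrap sketch above; the $\chi^2$ cutoff with d.f. = 1 follows the approximation just described, and matplotlib is an assumed plotting choice.

```python
# Hedged sketch of the index plot of (16); deviance() is defined in the
# bootstrap sketch above.
import numpy as np
from scipy.stats import chi2
import matplotlib.pyplot as plt

def index_plot(X, y, C=1.0, sigma2=1.0, alpha=0.05):
    n = len(y)
    D_full = deviance(X, y, C, sigma2)           # deviance (15) on the full sample
    dD = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                 # delete the ith observation
        dD[i] = D_full - deviance(X[keep], y[keep], C, sigma2)
    cutoff = chi2.ppf(1 - alpha, df=1)           # chi-square cutoff, d.f. = 1
    plt.scatter(np.arange(1, n + 1), dD, s=10)
    plt.axhline(cutoff, linestyle="--")
    plt.xlabel("observation index")
    plt.ylabel("change in deviance")
    plt.show()
    return np.where(dD > cutoff)[0] + 1          # indices of influential points
```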

Platt [13] proposed the threefold CV for estimating the decision values in (9). However, the resulting value may be negative because three SVMs are trained on three split parts of the training pairs $(\mathbf{x}_i, y_i)$. Therefore, in the present paper, we train a single SVM on the training pairs in order to evaluate the decision values and estimate the probabilistic outputs according to [15].

3. Multiclass SVM

We consider the discriminant problem with $K$ classes and training pairs $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$, where $\mathbf{x}_i \in \mathbb{R}^p$ is an input vector and $y_i \in \{1, \ldots, K\}$ [10, 21, 22]. Let $p_k = \Pr(y = k \mid \mathbf{x})$, $k = 1, \ldots, K$, denote the response probabilities, with $\sum_{k=1}^{K} p_k = 1$, for multiclass classification with
\[ t_{ik} = \begin{cases} 1, & \text{if } y_i = k, \\ 0, & \text{otherwise}. \end{cases} \qquad (17) \]
The log-likelihood is given by
\[ \ell = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} \ln p_k(\mathbf{x}_i). \qquad (18) \]

For multiclass classification, the one-against-one method (also called pairwise classification) is used to produce a vector of multiclass probabilities for each pair of classes and then to couple the estimates together [10]. The earliest used implementation for multiclass SVM is probably the one-against-one method of [21]. This method constructs $K(K-1)/2$ classifiers, each based on training data from the $i$th and $j$th classes of the training set.

For the pair of the $i$th and $j$th classes, the SVM solves the primal formulation [3, 10, 23]
\[ \min_{\mathbf{w}^{ij}, b^{ij}, \boldsymbol{\xi}^{ij}} \; \frac{1}{2}(\mathbf{w}^{ij})^T\mathbf{w}^{ij} + C \sum_{t} \xi_t^{ij} \qquad (19) \]
subject to
\[ (\mathbf{w}^{ij})^T\phi(\mathbf{x}_t) + b^{ij} \ge 1 - \xi_t^{ij} \;\text{ if } y_t = i, \quad (\mathbf{w}^{ij})^T\phi(\mathbf{x}_t) + b^{ij} \le -1 + \xi_t^{ij} \;\text{ if } y_t = j, \quad \xi_t^{ij} \ge 0. \qquad (20) \]
Given $K$ classes of data, for any $\mathbf{x}$, the goal is to estimate
\[ p_k = \Pr(y = k \mid \mathbf{x}), \quad k = 1, \ldots, K. \qquad (21) \]
We first estimate the pairwise class probabilities $r_{ij} \approx \Pr(y = i \mid y \in \{i, j\}, \mathbf{x})$ by using
\[ r_{ij} = \frac{1}{1 + \exp(\hat{A}\hat{f} + \hat{B})}, \qquad (22) \]
where $\hat{A}$ and $\hat{B}$ are estimated by minimizing the cross-entropy (6) using the training data from the $i$th and $j$th classes and the corresponding decision values $\hat{f}$.

Hastie and Tibshirani [21] proposed minimizing the Kullback-Leibler (KL) distance between $r_{ij}$ and $\mu_{ij}$,
\[ \ell(\mathbf{p}) = \sum_{i<j} n_{ij} \left\{ r_{ij} \ln \frac{r_{ij}}{\mu_{ij}} + (1 - r_{ij}) \ln \frac{1 - r_{ij}}{1 - \mu_{ij}} \right\}, \qquad (23) \]
where $\mu_{ij} = p_i/(p_i + p_j)$ and $n_{ij}$ is the number of training data in the $i$th and $j$th classes.

Wu et al. [10] propose the second approach to obtain $\mathbf{p} = (p_1, \ldots, p_K)^T$ from all these $r_{ij}$'s by optimizing
\[ \min_{\mathbf{p}} \; \frac{1}{2} \sum_{i=1}^{K} \sum_{j:\, j \neq i} (r_{ji} p_i - r_{ij} p_j)^2 \quad \text{subject to} \quad \sum_{k=1}^{K} p_k = 1, \; p_k \ge 0. \qquad (24) \]
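The optimization (24) reduces, via its KKT conditions, to a small linear system, which is how Wu et al. [10] solve it. The sketch below assumes a $K \times K$ array `r` of pairwise estimates with `r[i, j] + r[j, i] = 1`; all names are illustrative.

```python
# Hedged sketch of the pairwise coupling (24): solve the KKT linear system.
import numpy as np

def couple_probabilities(r):
    """r: K x K array with r[i, j] ~ P(y = i | y in {i, j}, x)."""
    K = r.shape[0]
    Q = np.zeros((K, K))
    for i in range(K):
        for j in range(K):
            if i == j:
                Q[i, i] = sum(r[s, i] ** 2 for s in range(K) if s != i)
            else:
                Q[i, j] = -r[j, i] * r[i, j]
    # KKT system of: minimize p'Qp subject to sum(p) = 1
    A = np.zeros((K + 1, K + 1))
    A[:K, :K] = Q
    A[:K, K] = 1.0
    A[K, :K] = 1.0
    b = np.zeros(K + 1)
    b[K] = 1.0
    return np.linalg.solve(A, b)[:K]     # coupled probabilities p_1, ..., p_K

# Illustrative usage with K = 3 and made-up pairwise estimates:
r = np.array([[0.0, 0.7, 0.6],
              [0.3, 0.0, 0.4],
              [0.4, 0.6, 0.0]])
p = couple_probabilities(r)              # class probabilities summing to 1
```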

Thus, we can apply the leaving-one-out CV in a manner similar to that of Section 2.2.

Step 1. From the initial sample $\mathcal{X}$, the pair $(\mathbf{x}_i, y_i)$ is deleted in order to form the training sample $\mathcal{X}_{(-i)}$.

Step 2. Using each training sample, fit an SVM in order to estimate the pairwise probabilities $\hat{r}_{ij}$ by (22), and predict the coupled probabilities $\hat{\mathbf{p}}_{(-i)}(\mathbf{x}_i)$ for the deleted $i$th sample.

Step 3. Steps 1 and 2 are repeated for $i = 1, \ldots, n$.

Step 4. The CV score is given by
\[ \mathrm{CV} = -2 \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} \ln \hat{p}_{(-i)k}(\mathbf{x}_i). \qquad (25) \]

Step 5. The tuning parameters with minimum CV can be determined as optimal by carrying out a grid search over $C$ and $\sigma^2$; a sketch follows this list.
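A sketch of Steps 1–5 follows. As an assumption of this sketch, scikit-learn's `SVC(probability=True)` couples Platt-type pairwise probabilities internally (via LIBSVM), which broadly matches the construction in this section, though it calibrates the decision values with an internal cross-validation rather than the single-SVM fit advocated in Section 2.4.

```python
# Hedged sketch of Steps 1-5: multiclass leave-one-out CV score (25).
import numpy as np
from sklearn.svm import SVC

def multiclass_loo_cv(X, y, C=1.0, sigma2=1.0):
    n = len(y)
    cv = 0.0
    for i in range(n):                                   # Steps 1-3
        keep = np.arange(n) != i                         # delete the ith pair
        svm = SVC(C=C, kernel="rbf", gamma=1.0 / (2.0 * sigma2),
                  probability=True).fit(X[keep], y[keep])
        p = svm.predict_proba(X[[i]])[0]                 # coupled probabilities
        k = list(svm.classes_).index(y[i])               # class of the deleted pair
        cv += -2.0 * np.log(max(p[k], 1e-12))            # t_ik ln p_k term of (25)
    return cv                                            # Step 4

# Step 5: choose (C, sigma^2) minimizing multiclass_loo_cv over a grid.
```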

4. Examples

4.1. Mackerel-Egg Survey Data

We consider data consisting of 634 observations from a 1992 mackerel egg survey [24]. The predictors of egg abundance are as follows: the location (longitude and latitude) at which samples were taken, the depth of the ocean, the distance from the 200 m seabed contour, and, finally, the water temperature at a depth of 20 m. We first fit an SVM. In the same manner as described in [11], we determine the tuning parameters $C$ and $\sigma^2$. The optimum values of the tuning parameters are . The bootstrap estimator of the percentile for the deviance is . A comparison with the deviance from (15) suggests that the SVM fits the data fairly well. For reference purposes, the histogram of the bootstrapped deviance $D^*$ is provided in Figure 1.

We can estimate the apparent error rate of incorrectly predicting the outcome and the leaving-one-out CV error rates for several models, as shown in Table 1. The smoothing parameters in the generalized additive model (GAM) [24] and the number of hidden units in the neural network in Table 1 are determined using the leaving-one-out CV. From Table 1, the leaving-one-out CV error rate for the SVM is the smallest among all models, but the apparent error rate is the smallest for the neural network. The CV scores are 477.04, 509.44, and 541.61 for the SVM, the logistic discriminant, and the neural network, respectively. This implies that the SVM is the best of these three models from the point of view of CV. Figure 2 shows the index plot of $\Delta D_i$, which indicates that no. 399 and no. 601 are influential observations at the specified level of significance.

4.2. Liver Disease Data

We apply the proposed method to laboratory data collected from 218 patients with liver disorders [25–27]. Four liver diseases were observed: acute viral hepatitis (57 patients), persistent chronic hepatitis (44 patients), aggressive chronic hepatitis (40 patients), and postnecrotic cirrhosis (77 patients). The covariates consist of four liver enzymes: aspartate aminotransferase (AST), alanine aminotransferase (ALT), glutamate dehydrogenase (GLDH), and ornithine carbamyltransferase (OCT). For each pair $(C, \sigma^2)$, the CV performance is measured by training on one part of the data and testing on the other. Then, we train on the whole training set by using the pair $(C, \sigma^2)$ that achieves the minimum CV score (= 187.93) and predict the test set. The apparent and leaving-one-out CV error rates for the training and test samples for several models are shown in Table 2. As shown, the apparent error rate for the SVM on the training sample and the error rate for the SVM on the test sample are the smallest among all models, but the leaving-one-out CV error rate for the SVM on the training sample is larger than that of the multinomial logistic discriminant model.

5. Concluding Remarks

We considered the application of resampling methods to SVMs. Statistical inference based on the likelihood approach for SVMs was discussed, and the leaving-one-out CV was suggested for determining the tuning parameters and for estimating the bias of the excess error in prediction. Bootstrapping was used to evaluate the overall goodness-of-fit with the optimum tuning parameters. Data from a mackerel-egg survey and a liver-disease study were used to evaluate the resampling methods.

There is one broad limitation to our approach: the SVM assumed the independence of the predictor variables. More generally, it may be preferable to visualize interactions between predictor variables. The smoothing spline ANOVA models [28] can provide an excellent means for handling data of mutually exclusive groups and a set of predictor variables. We expect that flexible methods for a discriminant model using machine learning theory [1], such as penalized smoothing splines, will be very useful in these real-world contexts.