Abstract
This paper considers feed-forward neural network models for data consisting of mutually exclusive groups and a set of predictor variables. We use a bootstrap-based information criterion to select the optimum number of hidden units for a neural network model, and the deviance to summarize the goodness-of-fit of the fitted models. The bootstrap is also adapted to provide estimates of the bias of the excess error in a prediction rule constructed from training samples. Simulated data from known (true) models are analyzed in order to interpret the results obtained with the neural network. In addition, a thyroid disease database is examined to compare estimated measures of predictive performance in both a pure training sample study and a test sample study, in which the realized test sample apparent error rates associated with a constructed prediction rule are reported. Apartment house data for metropolitan area stations with a four-class classification are also analyzed in order to assess the bootstrap by comparison with leaving-one-out cross-validation (CV).
1. Introduction
The neural network model is considered for the multiclass classification problem of assigning each observation to one of several classes, which is referred to as a multiple-group neural discriminant model. As two-class problems are much easier to solve, we focus on neural networks for multiclass classification with respect to statistical techniques in order to derive the maximum likelihood estimators [1–7]. Statistical techniques are formulated in terms of the principle of the likelihood of the neural discriminant model, in which the connection weights of the network are treated as unknown parameters.
Besides the theoretical and empirical properties of the bootstrap [8, 9] in the multiple-group neural discriminant model, there are at least two other reasons to use a bootstrap procedure. First, the criterion based on bootstrapping is demonstrated to be favorable when selecting the optimum number of hidden units. A number of model selection procedures (i.e., methods for the selection of the optimum number of hidden units), such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and cross-validation [10–13], have been proposed. The bootstrap method, however, also provides the percentile for the deviance, allowing evaluation of the overall goodness-of-fit and estimation of the bias of the excess error in prediction based on the selected model. Therefore, there is no extra cost for subsequent inference, because the bootstrap samples generated for model selection can be reused. If a model is selected by a cross-validation method and the bootstrap is used for the subsequent inference, extra computational cost is required for the resampling in cross-validation. Second, the bootstrap procedures developed in the multiple-group neural discriminant model can be extended, without any theoretical derivation, to more complicated problems such as generalized additive models (GAM) [14, 15], support vector machines (SVM) [16–19], and vector generalized additive models (VGAM) [20].
The remainder of this paper is organized as follows. In Section 2 we focus on the selection of the optimum number of hidden units and the evaluation of the overall goodness-of-fit with the optimum number of hidden units. A neural network can approximate any reasonable function with arbitrary precision if the number of hidden units tends to infinity [21]. However, if the number of hidden units is increased too far, the output of the network fits the training sample too closely, and the noise is modeled in addition to the desired underlying function. The bootstrap is also adapted in order to provide estimates of the bias of the excess error in a prediction rule constructed with training samples [22, 23]. Simulated data from known (true) models are used to demonstrate the approximate realization of continuous mapping by neural networks in Section 3. In Section 4 the methods are illustrated using a thyroid disease database in order to show that overfitting leads to poor generalization. Apartment house data for metropolitan area stations with a four-class classification are also analyzed in order to assess the bootstrap by comparison with leaving-one-out CV. Finally, in Section 5 we discuss the relative merits and limitations of the methods.
2. Materials and Methods
2.1. Multiple-Group Neural Discriminant Model
2.1.1. Statistical Inference
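The functional representation of the neural network model is considered, as shown in Figure 1. The connection weight between the $i$th unit in the input layer and the $j$th unit in the hidden layer ($i = 1, \ldots, I$; $j = 1, \ldots, H$) is $\alpha_{ij}$. Similarly, the weight between the $j$th unit in the hidden layer ($j = 1, \ldots, H$) and the $k$th unit in the output layer is $\beta_{jk}$. The input to the $j$th hidden unit is a linear projection of the input vector $\mathbf{x} = (x_1, \ldots, x_I)$, that is, $u_j = \sum_{i=1}^{I} \alpha_{ij} x_i + \alpha_{0j}$, where $\alpha_{0j}$ is a bias. This is the same idea as incorporating the constant term in the design matrix of a regression by including a column of 1's [1]. The output of the $j$th hidden unit is $y_j = f(u_j)$, where $f(\cdot)$ is a nonlinear activation function. The most commonly used activation function is the logistic (sigmoid) function $f(u) = 1/\{1 + \exp(-u)\}$. The input to the $k$th output unit is $v_k = \sum_{j=1}^{H} \beta_{jk} y_j + \beta_{0k}$, where $\beta_{0k}$ is a bias. The activation function of network outputs for the $K$ mutually exclusive groups can be achieved using the softmax activation (normalized exponential) function $o_k = \exp(v_k) / \sum_{k'=1}^{K} \exp(v_{k'})$, which can be regarded as a multiclass generalization of the logistic function.
From (1)–(6), $o_k$ can be written in the form $$o_k = \frac{\exp\{\beta_{0k} + \sum_{j=1}^{H} \beta_{jk} f(\alpha_{0j} + \sum_{i=1}^{I} \alpha_{ij} x_i)\}}{\sum_{k'} \exp\{\beta_{0k'} + \sum_{j=1}^{H} \beta_{jk'} f(\alpha_{0j} + \sum_{i=1}^{I} \alpha_{ij} x_i)\}}. \tag{7}$$ The output for the $K$th group can be calculated as $o_K = 1 - \sum_{k=1}^{K-1} o_k$. For example, in the case of $K = 3$, $o_1$ and $o_2$ are given by (7) with $k = 1, 2$. From $o_1 + o_2 + o_3 = 1$, $o_3 = 1 - o_1 - o_2$. Thus the number of units for the output layer is 2 (= $K - 1$).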
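The forward pass described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function and argument names are hypothetical, and it assumes that the last group acts as a reference category whose linear input is fixed at zero, which is one common way to obtain probabilities for all groups from one fewer output unit. The source does not spell out this detail.

```python
import math

def logistic(u):
    """Logistic (sigmoid) activation used by the hidden units."""
    return 1.0 / (1.0 + math.exp(-u))

def forward(x, hidden_w, hidden_b, out_w, out_b):
    """Return the group probabilities for one input vector x.

    hidden_w[j][i]: weight from input i to hidden unit j
    hidden_b[j]:    bias of hidden unit j
    out_w[k][j]:    weight from hidden unit j to output unit k
    out_b[k]:       bias of output unit k
    """
    # Hidden layer: linear projection of the inputs, then the logistic function.
    y = [logistic(sum(w * xi for w, xi in zip(w_j, x)) + b_j)
         for w_j, b_j in zip(hidden_w, hidden_b)]
    # Linear inputs to the output units.
    v = [sum(w * yj for w, yj in zip(w_k, y)) + b_k
         for w_k, b_k in zip(out_w, out_b)]
    # Assumption: the last group is the reference, with linear input 0.
    v_all = v + [0.0]
    # Softmax, shifted by the maximum for numerical stability.
    m = max(v_all)
    e = [math.exp(vk - m) for vk in v_all]
    s = sum(e)
    return [ek / s for ek in e]
```

With two output units the function returns three probabilities that sum to one, matching the three-group example in the text.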
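By setting the teach value $t_{dk} = 1$ if the $d$th observation belongs to the $k$th group and $t_{dk} = 0$ otherwise, the log likelihood function for the total sample size $n$ is $$\ell = \sum_{d=1}^{n} \sum_{k=1}^{K} t_{dk} \ln o_{dk}, \tag{10}$$ where $\mathbf{t}_d = (t_{d1}, \ldots, t_{dK})$ and $\mathbf{o}_d = (o_{d1}, \ldots, o_{dK})$ are the teach and output vectors, respectively, for the $d$th observation. As usual, the negative log likelihood gives the cross-entropy error function. The unknown parameters can be estimated by maximizing the log likelihood ((10) with output (7)) by use of batch backpropagation including momentum, in which the initial values for the unknown parameters are chosen at random. With $I$ inputs, $H$ hidden units, and $K - 1$ output units, the number of parameters (weights and biases) included in the multiple-group neural discriminant model is $(I + 1)H + (H + 1)(K - 1)$.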
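As a small worked illustration, the log likelihood and the parameter count can be computed as follows. The function names are hypothetical, and the parameter count assumes one bias per hidden unit and per output unit, with one fewer output unit than groups, as described for the output layer above.

```python
import math

def log_likelihood(teach, outputs):
    """Sum over observations and groups of (teach value) * ln(output)."""
    total = 0.0
    for t_d, o_d in zip(teach, outputs):
        for t_dk, o_dk in zip(t_d, o_d):
            if t_dk:  # only the observed group contributes (0 * ln 0 is taken as 0)
                total += t_dk * math.log(o_dk)
    return total

def n_parameters(n_inputs, n_hidden, n_groups):
    """Weights plus biases for one hidden layer and n_groups - 1 output units (assumed)."""
    return (n_inputs + 1) * n_hidden + (n_hidden + 1) * (n_groups - 1)
```

For instance, under these assumptions a network with 21 inputs, two hidden units, and three groups has `n_parameters(21, 2, 3)` = 50 parameters. The negative of `log_likelihood` is the cross-entropy error mentioned in the text.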
2.1.2. Determination of the Optimum Number of Hidden Units
The criterion based on bootstrapping is demonstrated to be favorable when selecting the optimum number of hidden units. In conventional statistics, various criteria have been developed for assessing the generalization performance. AIC provides us with a decision as to which of several competing network architectures is best for a given problem. However, the usage of AIC may not be justified theoretically when considering a neural network as an approximation to an underlying model [7, 24]. A bootstrap-type nonparametric resampling estimator of the Kullback-Leibler information, as studied by Ishiguro and Sakamoto [25], Konishi and Kitagawa [26], Ishiguro et al. [27], Kullback and Leibler [28], Shibata [29], and Shao [30], can provide an alternative to AIC computed from a skewed discrete distribution.
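Let the training sample $\mathbf{X} = \{\mathbf{z}_1, \ldots, \mathbf{z}_n\}$, with $\mathbf{z}_d = (\mathbf{x}_d, \mathbf{t}_d)$ for $d = 1, \ldots, n$, be independently distributed according to an unknown distribution $F$. Let $\hat{F}$ be the empirical distribution function that places a mass equal to $1/n$ at each point $\mathbf{z}_d$. We propose the bootstrap sampling algorithm given as follows.
Step 1. Generate $B$ samples, each of size $n$, drawn with replacement from the training sample $\mathbf{X}$. Denote the $b$th sample as $\mathbf{X}^{*b}$ ($b = 1, \ldots, B$).
Step 2. For each bootstrap sample $\mathbf{X}^{*b}$, fit a model to obtain the estimator $\hat{\boldsymbol{\theta}}^{*b}$.
Step 3. The bootstrap estimator of bias is given as
$$C^{*} = \frac{1}{B} \sum_{b=1}^{B} \left\{ \ell(\mathbf{X}^{*b}; \hat{\boldsymbol{\theta}}^{*b}) - \ell(\mathbf{X}; \hat{\boldsymbol{\theta}}^{*b}) \right\}, \tag{11}$$ where $\ell(\mathbf{X}; \boldsymbol{\theta})$ denotes the log likelihood of the sample $\mathbf{X}$ evaluated at $\boldsymbol{\theta}$, so that $C^{*}$ is the average of the differences between the log likelihood on the bootstrap sample $\mathbf{X}^{*b}$ and that on the training sample $\mathbf{X}$, given $\hat{\boldsymbol{\theta}}^{*b}$.
Thus, the Extended Information Criterion (EIC) proposed by Ishiguro et al. [27] is defined as
$$\mathrm{EIC} = -2 \ell(\mathbf{X}; \hat{\boldsymbol{\theta}}) + 2 C^{*}. \tag{12}$$ The EIC approach selects the number of hidden units with the minimum value of (12). Shibata [29] and Shao [30] point out that this method is asymptotically equivalent to leaving-one-out CV and AIC.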
Note that the bootstrap algorithm requires refitting of the model (retraining the network) $B$ times [31]. Here $B = 200$ bootstrap replications are used. The competing networks share the same architecture, the only exception being the number of hidden units.
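The bootstrap bias correction behind EIC can be sketched generically. The code below is an illustration under stated assumptions, not the authors' implementation: `fit` and `loglik` are hypothetical callables supplied by the user, and a single Bernoulli probability stands in for the neural network so that the example stays short and runnable.

```python
import math
import random

def eic(sample, fit, loglik, n_boot=200, seed=0):
    """Return (EIC, bootstrap bias estimate).

    fit(sample) -> parameter estimate
    loglik(sample, theta) -> log likelihood of sample at theta
    """
    rng = random.Random(seed)
    n = len(sample)
    theta_hat = fit(sample)
    diffs = []
    for _ in range(n_boot):
        # Step 1: resample n observations with replacement.
        boot = [sample[rng.randrange(n)] for _ in range(n)]
        # Step 2: refit the model on the bootstrap sample.
        theta_star = fit(boot)
        # Step 3: log likelihood on the bootstrap sample minus that on the
        # training sample, both evaluated at the bootstrap estimate.
        diffs.append(loglik(boot, theta_star) - loglik(sample, theta_star))
    bias = sum(diffs) / n_boot
    return -2.0 * loglik(sample, theta_hat) + 2.0 * bias, bias

# Stand-in model (an assumption for illustration): one Bernoulli probability.
def fit_bernoulli(ys):
    p = sum(ys) / len(ys)
    return min(max(p, 1e-6), 1.0 - 1e-6)  # keep the log likelihood finite

def loglik_bernoulli(ys, p):
    return sum(math.log(p) if y else math.log(1.0 - p) for y in ys)
```

To compare architectures, one would call `eic` once per candidate number of hidden units, with `fit` retraining the corresponding network, and keep the candidate with the smallest value. Each call refits the model `n_boot` times, which is the computational cost noted in the text.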
2.1.3. Bootstrapping the Deviance
No standard procedure by which to assess the overall goodness-of-fit of the multiple-group neural discriminant model has been proposed. By introducing the maximum likelihood principle, the deviance allows us to test the overall goodness-of-fit of the model: $$\mathrm{Dev} = 2 \left\{ \ell_{\mathrm{full}} - \ell(\hat{\boldsymbol{\theta}}) \right\}, \tag{13}$$ where $\ell(\hat{\boldsymbol{\theta}})$ denotes the maximized log likelihood under the current neural discriminant model and $\ell_{\mathrm{full}}$ denotes the log likelihood under the full model. Since the log likelihood for the full model is zero by using the definition $0 \ln 0 = 0$, we have $$\mathrm{Dev} = -2 \ell(\hat{\boldsymbol{\theta}}). \tag{14}$$ Note that the deviance is minus two times the log likelihood (10). The greater the deviance, the poorer the fit of the model. However, the deviance given in (14) is not even approximately distributed as $\chi^2$ for the case in which binary (Bernoulli) responses are available [32–35]. We therefore provide the bootstrap estimator of the $100(1 - \alpha)$th percentile (i.e., the critical point) for the deviance given in (14) according to the following algorithm.
Step 1. Generate $B$ (= 200) bootstrap samples $\mathbf{X}^{*b}$ drawn with replacement from the training sample, using the optimum number of hidden units determined as described in Section 2.1.2.
Step 2. For the bootstrap sample $\mathbf{X}^{*b}$, the deviance given in (14) is computed as $$\mathrm{Dev}^{*}(b) = -2 \ell(\mathbf{X}^{*b}; \hat{\boldsymbol{\theta}}^{*b}).$$ This process is independently repeated $B$ times, and the computed values are arranged in ascending order.
Step 3. The value of the $j$th order statistic of the $B$ replications can be taken as an estimator of the quantile of order $j/(B + 1)$.
Step 4. The estimator of the $100(1 - \alpha)$th percentile (i.e., the $100\alpha$% critical point) of $\mathrm{Dev}^{*}$ is used to test the goodness-of-fit of the model using a specified significance level $\alpha$. If the value of the deviance given in (14) is greater than the estimate of the percentile, then the model fits poorly.
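Steps 2 to 4 reduce to sorting the bootstrapped deviances and reading off an order statistic. The sketch below uses hypothetical function names and makes two assumptions: the order statistic with rank j is treated as the quantile of order j/(B + 1), and a non-integer rank is rounded up. The source does not say how a non-integer rank is handled.

```python
import math

def bootstrap_percentile(boot_values, alpha=0.05):
    """Estimate the 100(1 - alpha)th percentile from B bootstrap replications."""
    ordered = sorted(boot_values)            # ascending order, as in Step 2
    b = len(ordered)
    # Smallest rank j with j / (B + 1) >= 1 - alpha (rounding up is an assumption).
    j = math.ceil((1.0 - alpha) * (b + 1))
    j = min(max(j, 1), b)                    # keep the rank inside 1..B
    return ordered[j - 1]

def fits_poorly(deviance, boot_deviances, alpha=0.05):
    """Step 4: the fit is judged poor when the deviance exceeds the critical point."""
    return deviance > bootstrap_percentile(boot_deviances, alpha)
```

With 200 replications and a significance level of 0.05, the rank is the ceiling of 0.95 × 201 = 190.95, so the 191st smallest bootstrapped deviance serves as the critical point.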
2.1.4. Excess Error Estimation
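Let the error rate $q$ be the probability of incorrectly predicting the outcome of a new observation drawn from an unknown distribution $F$, given a prediction rule constructed on a training sample $\mathbf{X}$. This error rate is defined as the actual error rate, which is of interest in performance assessment of prediction rules. Let $\hat{F}$ be the empirical distribution function that places a mass equal to $1/n$ at each point $\mathbf{z}_d = (\mathbf{x}_d, \mathbf{t}_d)$, $d = 1, \ldots, n$. We apply a prediction rule $\eta$ to this training sample and form the realized prediction rule $\eta_{\hat{F}}(\mathbf{x}_0)$ for a new observation $(\mathbf{x}_0, \mathbf{t}_0)$. Let $Q[\mathbf{t}_0, \eta_{\hat{F}}(\mathbf{x}_0)]$ indicate the discrepancy between an observed value $\mathbf{t}_0$ and its predicted value $\eta_{\hat{F}}(\mathbf{x}_0)$. Let the error rate $\hat{q}_{\mathrm{app}}$, referred to as the apparent error rate, be the probability of incorrectly predicting the outcome for a sample drawn from the empirical distribution of the training sample, $\hat{F}$. Because the training sample is used for both forming and assessing the prediction rule, this proportion (i.e., the apparent error rate) underestimates the actual error rate. The difference $R = \hat{q}_{\mathrm{app}} - q$ is the excess error. The expected excess error (i.e., bias) of a given prediction rule [22, 23, 36, 37] is $$r = E_F(\hat{q}_{\mathrm{app}} - q). \tag{16}$$
When the prediction rule based on the multiple-group neural discriminant model is allowed to be complicated, overfitting becomes a real danger, and excess error estimation becomes important. Thus we consider the bootstrap to estimate the expected excess error when fitting a multiple-group neural discriminant model to the data. The algorithm can be summarized as follows.
Step 1. Generate a bootstrap sample $\mathbf{X}^{*b}$ from $\hat{F}$ as described in Section 2.1.2. Let $\hat{F}^{*b}$ be the empirical distribution of $\mathbf{X}^{*b}$.
Step 2. For each bootstrap sample $\mathbf{X}^{*b}$, fit a model to obtain the estimator $\hat{\boldsymbol{\theta}}^{*b}$ and construct the realized prediction rule $\eta_{\hat{F}^{*b}}$ based on $\hat{\boldsymbol{\theta}}^{*b}$.
Step 3. The bootstrap estimator of the expected excess error in (16) is given by $R^{*}_{b} = \hat{q}^{*b}_{\mathrm{app}} - \hat{q}^{*b}$, where $\hat{q}^{*b}_{\mathrm{app}}$ is the error rate of $\eta_{\hat{F}^{*b}}$ on the bootstrap sample $\mathbf{X}^{*b}$ and $\hat{q}^{*b}$ is the error rate of the same rule on the training sample $\mathbf{X}$.
Step 4. Repeat Steps 1–3 for $B$ = 200 bootstrap samples to get $R^{*}_{1}, \ldots, R^{*}_{B}$. The bootstrap estimator of the expected excess error can be obtained as $$\hat{r}_{\mathrm{boot}} = \frac{1}{B} \sum_{b=1}^{B} R^{*}_{b}.$$
Step 5. The actual error rate with bootstrap bias correction is $$\hat{q}_{\mathrm{bc}} = \hat{q}_{\mathrm{app}} - \hat{r}_{\mathrm{boot}}. \tag{20}$$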
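The five steps can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the excess error is taken as the apparent error rate minus the error rate on the original training sample (a sign convention consistent with the negative mean excess error reported in Section 4), the function names are hypothetical, and a nearest-class-mean rule on a scalar input stands in for the neural discriminant model.

```python
import random

def error_rate(rule, sample):
    """Proportion of (x, y) pairs in sample that the rule misclassifies."""
    return sum(rule(x) != y for x, y in sample) / len(sample)

def bootstrap_excess_error(sample, build_rule, n_boot=200, seed=0):
    """Return (apparent error, estimated expected excess error, corrected error)."""
    rng = random.Random(seed)
    n = len(sample)
    apparent = error_rate(build_rule(sample), sample)
    diffs = []
    for _ in range(n_boot):
        boot = [sample[rng.randrange(n)] for _ in range(n)]   # Step 1
        rule_star = build_rule(boot)                          # Step 2
        # Step 3: error on the bootstrap sample minus the error of the same
        # rule on the original training sample (assumed sign convention).
        diffs.append(error_rate(rule_star, boot) - error_rate(rule_star, sample))
    excess = sum(diffs) / n_boot                              # Step 4
    return apparent, excess, apparent - excess                # Step 5

def nearest_mean_rule(sample):
    """Stand-in classifier (an assumption): assign x to the class with the closest mean."""
    groups = {}
    for x, y in sample:
        groups.setdefault(y, []).append(x)
    means = {y: sum(xs) / len(xs) for y, xs in groups.items()}
    return lambda x: min(means, key=lambda y: abs(x - means[y]))
```

Because the estimated excess error is typically negative under this convention, subtracting it raises the apparent error rate toward the actual error rate.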
3. Simulation Study
Since the model generally does not encompass unknown functions, but rather only approximations thereof, the model is inherently misspecified. Therefore, we present results from Monte Carlo simulations to evaluate the performance. The criterion based on bootstrapping is demonstrated to be favorable when selecting the optimum number of hidden units [38–41]. Vach et al. [41] investigated how well specific regression functions from a given class can be approximated and pointed out that a comparison using members of this class is a little unfair. We thus show the superiority of the neural network model by using a function with several local extrema. The influence can be illustrated through a simple simulation using a neural network model with two inputs $x_1$ and $x_2$, because the contour plot of the unknown population can then be visualized.
3.1. Two-Class Classification
The influence can be illustrated through a simple simulation using a neural network model with two inputs, one output, and a varying number of hidden units. For two independent continuous covariates $x_1$ and $x_2$, we simulated the following known (true) model:
Training and test samples of size 1000 were considered in the present study. Input data are chosen from data that are uniformly distributed over , and the binary response is labeled with 1 if and otherwise with 0. Figure 2 shows the distribution of the covariates and the class membership indicator in the training sample.
Figure 2: (a) Contour plot. (b) Class membership indicator.
EIC values with $B = 200$ replications based on bootstrapping pairs for the training sample are shown in Figure 3 after fitting the neural discriminant models having one to five hidden units. For the purpose of comparison, the values of AIC and BIC are also provided. In the case of the simulation study, the known (true) model given in (22) is included in the population. Thus, the differences between the EIC and AIC values are slight.
Using the simulated training sample, the feed-forward neural networks were fit to the known (true) model given in (22). The tendency of mapping performed by neural networks with hidden units to implausibly fit the function given in (22) can also be illustrated.
The bootstrap estimate of the percentile (i.e., the 5% critical point) for the training sample with four hidden units is . Comparison to the deviance given in (14) suggests that the multiple-group neural discriminant model fits the data fairly well because is far from the 5% critical point .
The actual error rate with the bootstrap bias correction given in (20) for the multiple-group neural discriminant models with four hidden units is calculated as . Figure 4 illustrates the apparent error rates observed in the training sample- and test sample-based error rates. The apparent error rates for both samples decreased with the increase in the number of hidden units from to and then remained constant.
Figures 3 and 4 are based on only one simulated data set. However, the efficacy of the bootstrap procedures would be more convincingly illustrated in a simulation study based on multiple samples. Figure 5 shows the average values of EIC, AIC, and BIC based on multiple samples with 100 replications after fitting the neural discriminant models having one to five hidden units. Figure 6 shows the box-and-whisker plots for EIC in order to evaluate the standard errors and other statistics. Figure 7 illustrates the mean apparent error rates observed in multiple test samples with 100 replicates. Figure 8 also shows the box-and-whisker plots for the mean apparent error rates in multiple test samples with 100 replicates. For the purpose of comparison, the estimates of the actual error rates with bootstrap bias correction for the training sample [42] are also shown in Figure 7. It is concluded that EIC identifies the optimal number of hidden units (i.e., 4) more often than AIC. In addition, the differences between the average values of EIC and AIC are similar to those in Figure 3, and the average values of the bootstrap-corrected estimate of the prediction error rate vary around the average apparent error rates for the multiple test samples.
3.2. Multiclass Classification
Input data are generated from uniformly distributed over . By substituting into can be grouped into four classes in the case of . As nonlinear function , can be used with [41, 43].
By use of definition of in (9), the observations can be divided into four class by multinomial random number with , .
In this paper, training and test samples of size 400 were considered. The apparent error rates for the training and test samples of several models are given in Table 1. From Table 1, it is found that the apparent error rates of the training and test samples for the multiple-group neural discriminant model are the smallest.
4. Results and Discussion
Prediction accuracy (error rate) is the most important consideration in the development of a prediction model. The assessment of goodness-of-fit is a useful exercise. In particular the goodness-of-fit and error rate from the training data are meaningful because of the overfitting issue. The main purpose is to predict future samples accurately. In other words, in real applications, the test sample population may be different from the training samples. A benchmark data set is thus used to illustrate the advantages of the models and methods developed herein. A multiple-group neural discriminant model having a single hidden layer was applied to a data set of 3,772 training instances and 3,428 testing instances of a thyroid disease database. All of these data sets are available on the World Wide Web at http://archive.ics.uci.edu/ml/datasets/Thyroid+Disease. The present study considered three groups: hypothyroid, hyperthyroid, and normal. The laboratory profiles on which the differential diagnosis is made consist of 21 attributes (15 attributes are binary, and six attributes are continuous).
Table 2 lists the first five observations for the 21 attributes and the group with respect to the training sample. The training sample is used to determine the neural network model structure. Table 3 lists the first five observations for the 21 attributes and the group with respect to the test sample. The goal of discrimination is to assign new observations to one of the mutually exclusive groups. The data in Tables 2 and 3 include six continuous and 15 binary attributes. Fisher's discriminant model assumes that the inputs are normally distributed. However, it is worth noting that the posterior class probabilities for the neural discriminant model can be obtained by maximizing the log likelihood (10) without the assumption of normally distributed inputs.
A thyroid disease database has been used as a benchmark test for the neural network model shown in Figure 1 with $I = 21$ and $K = 3$. EIC values are shown in Figure 9 after fitting the multiple-group neural discriminant models having one to four hidden units. In this case, the true model is not included in the population. For the purpose of comparison, AIC and BIC values are also provided.
Figure 9 indicates that the minimum EIC value is obtained for the model having two hidden units, which has an apparent error rate of 0.0090. Figure 10 shows a histogram of the bootstrap replications that are used to estimate the expected excess error. The mean and standard deviation of these replications are −0.0033 and 0.0022, respectively. The actual error rate with the bootstrap bias correction given in (20) for the multiple-group neural discriminant models with two hidden units is calculated as .
The histogram of the bootstrapped for is provided in Figure 11. The bootstrap estimate of the percentile (i.e., the 5% critical point) for the thyroid disease training sample with two hidden units is . Comparison to the deviance given in (14) suggests that the multiple-group neural discriminant model fits the data fairly well. For reference, the - plot of the bootstrapped for is shown in Figure 12.
Alternatively, if the deviance Equation (14) asymptotically follows the distribution with degrees of freedom under the null hypothesis that the model is correct, the probability density function of the distribution with 3772 degrees of freedom is shown in Figure 13. However, because of large sample size , the distribution is extremely skewed. By comparing Figure 13 with Figure 11, it is found that the distribution of deviance Equation (14) can not be approximated by distribution. Furthermore, the mean and deviance of bootstrapped are and , respectively, which are not close to those of the distribution with ., that is, and . It should be noted that the deviance asymptotically follows distribution for grouped binary (i.e., binomial) response and a set of predictor variables, as described in Tsujitani and Aoki [44].
The apparent error rates after fitting the multiple-group neural discriminant models having one to four hidden units are shown in Figure 14. Figure 14 indicates that (i) the multilayer feedforward neural network can approximate virtually any function up to some desired level of approximation with the number of hidden units increased ad libitum for the training sample, (ii) the actual error rate for the test sample is the smallest when the number of hidden units is two, and (iii) a neural network with a large number of hidden units has a higher error rate for the test sample, because the noise is modeled in addition to the underlying function.
Although the model fits the training sample as well as possible by increasing the number of hidden units, the model does not generalize very well to the test sample, which is the goal. The apparent error rates for training and test samples of several models are given in Table 4: (i) the multigroup logistic discriminant model with linear effect [6, 45] by use of library in free software R [15], (ii) multiple-group logistic discrimination models with linear + quadratic effects, (iii) the tree-based model with mincut = 5, minsize = 10, mindev = 0.01 as tuning parameters [46] by use of library in R, (iv) the nearest neighbor smoother using a nonparametric method to derive the classification criterion [6, 47] by use of library (knn) in R, (v) the kernel smoother [47] using normal distribution and a radius to specify a kernel density by use of library in R, (vi) the support vector machine using the "one-against-one" approach [48, 49] by use of library in R, (vii) the proportional odds model [14], and (viii) VGAM based on the proportional odds model with optimum smoothing parameters selected by leaving-one-out cross-validation [20] by use of library in R.
From Table 4, it is found that the multiple-group neural discriminant model ($H$ = 2) has the smallest error rate for the test sample while preserving a relatively small error rate for the training sample. In order to overcome the stringent assumption of additive and purely linear effects of the covariates, multiple-group logistic discrimination models with linear and quadratic effects were included. The improvement obtained by the inclusion of the quadratic effect is slight. It should be noted that the apparent error rates of VGAM for the training sample are the smallest, but those for the test sample are large. This overfitting leads to poor generalization. For example, the estimated smooth function of the covariate "age" for VGAM in Figure 15 shows the overfitting.
Table 5 shows apartment house data for the assessment of land value at metropolitan area stations, with a four-class classification [50]. By using four covariates (average price of house built for sale, average house rent, yield, and assessment of station value by the number of passengers getting on and off), the assessment of land value at the metropolitan area stations may be grouped into four categories: (i) the most comfortable, (ii) very comfortable, (iii) a little comfortable, and (iv) not comfortable.
Figure 16 indicates the values of EIC, AIC, and leaving-one-out CV (see the Appendix). The leaving-one-out CV is included in order to assess the bootstrap. The minimum EIC and leaving-one-out CV values are obtained for the model having two hidden units. However, the number of hidden units with the minimum AIC value is three. The actual error rates obtained using EIC and leaving-one-out CV with two hidden units are 0.276 and 0.273, respectively. The bootstrap is thus assessed from the viewpoint of leaving-one-out CV. The apparent error rates for training samples of several models are given in Table 6. From Table 6, it is found that the multiple-group neural discriminant model ($H$ = 2) has the smallest error rate.
5. Conclusions
We discussed the learning algorithm by maximizing the log likelihood function. Statistical inference based on the likelihood approach for the multiple-group neural discriminant model was discussed, and a method for estimating bias on the expected log likelihood in order to determine the optimum number of hidden units was suggested. The key idea behind bootstrapping is to focus on the optimum tradeoff between the unbiased approximation of the underlying model and the loss in accuracy caused by increasing the number of hidden units. In the context of applying bootstrap methods to a multiple-group neural discriminant model, this paper considered three methods and performed experiments using two data sets to evaluate the methods. The three methods are bootstrap pairs sampling algorithm, goodness-of-fit statistical test, and excess error estimation algorithm.
There are two broad limitations to our approach. First, the batch backpropagation algorithm with momentum is used to prevent the maximum likelihood estimates from getting trapped in a local, rather than the global, minimum. So far, our discussion of neural networks has focused on maximum likelihood to determine the network parameters (weights and biases). However, a Bayesian neural network approach [51] might provide a more formal framework in which to incorporate a prior parameter distribution. Second, our neural network models assumed the independence of the predictor variables. More generally, it may be preferable to visualize interactions between predictor variables. Smoothing spline ANOVA models can provide an excellent means of analyzing data of mutually exclusive groups and a set of predictor variables [43, 52]. We expect that flexible methods for discriminant models using machine learning theory [47, 53–55], such as penalized smoothing splines and support vector machines [17–19], will be very useful in these real-world contexts.
Appendix
Leaving-One-Out CV
An alternative model selection strategy for the bias correction Equation (14) of the log likelihood is leaving-one-out CV for a multiple-group neural discriminant model, which is asymptotically equivalent to TIC [29]. Let the training sample be independently distributed in an unknown distribution. We then obtain the leaving-one-out CV algorithm.
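Step 1. Generate the training samples $\mathbf{X}_{[-1]}, \ldots, \mathbf{X}_{[-n]}$. The subscript $[-d]$ of a quantity indicates the deletion of the $d$th data point from the training sample $\mathbf{X}$.
Step 2. Using each training sample, fit a model. Then, estimate the unknown parameters, denoted by $\hat{\boldsymbol{\theta}}_{[-d]}$, and predict the output $\hat{\mathbf{o}}_{[-d]}$ for the deleted sample point.
Step 3. The average predictive log likelihood of the deleted sample point is $$\frac{1}{n} \sum_{d=1}^{n} \sum_{k=1}^{K} t_{dk} \ln \hat{o}_{[-d]k},$$ where $t_{dk}$ is the teach value of the $d$th observation for the $k$th group and $\hat{o}_{[-d]k}$ is the $k$th component of $\hat{\mathbf{o}}_{[-d]}$.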
As a matter of convention, the cross-validation criterion is often stated as that of minimizing
The leaving-one-out CV criterion finds an appropriate degree of complexity by comparing the predictive probability density for different model specifications. Anders and Korn [24] have shown that the CV criterion does not rely on any probabilistic assumption based on the properties of maximum likelihood estimators for misspecified models and is not affected by identification problems.
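The leaving-one-out procedure can be sketched generically as follows. This is an illustration under stated assumptions: `fit` and `log_density` are hypothetical callables, a single Bernoulli probability stands in for the neural discriminant model, and the criterion is scaled as minus two times the summed predictive log likelihood so that it is comparable with deviance-type criteria. The exact scaling used in the source is not recoverable from the text.

```python
import math

def loo_cv(sample, fit, log_density):
    """Leaving-one-out CV: -2 times the summed predictive log likelihood (assumed scaling).

    fit(sample) -> parameter estimate
    log_density(obs, theta) -> log likelihood contribution of one observation
    """
    total = 0.0
    for d in range(len(sample)):
        reduced = sample[:d] + sample[d + 1:]    # Step 1: delete the d-th point
        theta = fit(reduced)                      # Step 2: refit without it
        total += log_density(sample[d], theta)    # predictive log likelihood of the deleted point
    return -2.0 * total

# Stand-in model (an assumption for illustration): one Bernoulli probability.
def fit_bernoulli(ys):
    p = sum(ys) / len(ys)
    return min(max(p, 1e-6), 1.0 - 1e-6)  # keep the log density finite

def log_density_bernoulli(y, p):
    return math.log(p) if y else math.log(1.0 - p)
```

The loop refits the model once per observation, so the cost grows linearly with the sample size, whereas the bootstrap cost is fixed by the chosen number of replications.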