ISRN Artificial Intelligence
Volume 2012 (2012), Article ID 820364, 12 pages
http://dx.doi.org/10.5402/2012/820364
Research Article

Neural Discriminant Models, Bootstrapping, and Simulation

1Department of Engineering Informatics, Osaka Electro-Communication University, 18-8 Hatsu-chou, Neyagawa, Osaka 572-8530, Japan
2Department of Clinical Research and Development, Otsuka Pharmaceutical Co., Ltd., Osaka, Japan
3Clinical Information Division Data Science Center, EPS Corporation, Japan

Received 8 October 2011; Accepted 30 November 2011

Academic Editor: J. J. Chen

Copyright © 2012 Masaaki Tsujitani et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

This paper considers feed-forward neural network models for data consisting of mutually exclusive groups and a set of predictor variables. We employ bootstrapping based on an information criterion to select the optimum number of hidden units for a neural network model, and the deviance to summarize the goodness-of-fit of fitted neural network models. The bootstrapping is also adapted to provide estimates of the bias of the excess error in a prediction rule constructed from training samples. Simulated data from known (true) models are analyzed in order to interpret the results obtained with the neural network. In addition, a thyroid disease database is used to compare estimated measures of predictive performance in both a pure training sample study and a test sample study, in which the realized test sample apparent error rates associated with a constructed prediction rule are reported. Apartment house data for metropolitan-area stations with a four-class classification are also analyzed in order to assess the bootstrapping by comparison with leaving-one-out cross-validation (CV).

1. Introduction

The neural network model is considered for the multiclass classification problem of assigning each observation to one of several mutually exclusive classes, which is referred to as a multiple-group neural discriminant model. Because two-class problems are much easier to solve, we focus on neural networks for multiclass classification and on the statistical techniques needed to derive the maximum likelihood estimators (MLEs) [1–7]. These statistical techniques are formulated in terms of the likelihood principle for the neural discriminant model, in which the connection weights of the network are treated as unknown parameters.

Besides the theoretical and empirical properties of the bootstrap [8, 9] in the multiple-group neural discriminant model, there are at least two other reasons to use a bootstrap procedure. First, the criterion based on bootstrapping is demonstrated to be favorable when selecting the optimum number of hidden units. A number of model selection procedures (i.e., methods for selecting the optimum number of hidden units), such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and cross-validation [10–13], have been proposed. The bootstrap method, however, provides the percentile for the deviance, allowing evaluation of the overall goodness-of-fit and estimation of the bias of the excess error in prediction based on the selected model. Therefore, there is no extra cost for subsequent inference via the bootstrap samples generated for model selection. If a model is instead selected by a cross-validation method and the bootstrap is used for the subsequent inference, extra computation is required for the cross-validation resampling. Second, the bootstrap procedures developed for the multiple-group neural discriminant model can be extended, without any additional theoretical derivation, to more complicated models such as generalized additive models (GAM) [14, 15], support vector machines (SVM) [16–19], and vector generalized additive models (VGAM) [20].

The remainder of this paper is organized as follows. In Section 2 we focus on the selection of the optimum number of hidden units and on the evaluation of the overall goodness-of-fit of the model with the optimum number of hidden units. A neural network can approximate any reasonable function with arbitrary precision if the number of hidden units tends to infinity [21]; however, if the number of hidden units is increased too far, the output of the network fits the training sample too closely and the noise is modeled in addition to the desired underlying function. The bootstrapping is also adapted in order to provide estimates of the bias of the excess error in a prediction rule constructed with training samples [22, 23]. Simulated data from known (true) models are used to demonstrate the approximate realization of continuous mappings by neural networks in Section 3. In Section 4 the methods are illustrated using a thyroid disease database in order to show that overfitting leads to poor generalization. Apartment house data for metropolitan-area stations with a four-class classification are also analyzed in order to assess the bootstrapping by comparison with leaving-one-out CV. Finally, in Section 5 we discuss the relative merits and limitations of the methods.

2. Materials and Methods

2.1. Multiple-Group Neural Discriminant Model
2.1.1. Statistical Inference

The functional representation of the neural network model is considered, as shown in Figure 1. The connection weight between the \(i\)th unit in the input layer (\(i=0,\ldots,I\)) and the \(j\)th unit in the hidden layer (\(j=1,\ldots,H\)) is \(\alpha_{ij}\). Similarly, the weight between the \(j\)th unit in the hidden layer (\(j=0,\ldots,H\)) and the \(k\)th unit in the output layer (\(k=1,\ldots,K\)) is \(\beta_{jk}\). The input to the \(j\)th hidden unit is a linear projection of the input vector \(\mathbf{x}=(x_1,\ldots,x_I)\), that is,
\[
u_j=\sum_{i=0}^{I}\alpha_{ij}x_i,\qquad x_0\equiv 1,\tag{1}
\]
where \(\alpha_{0j}\) is a bias. This is the same idea as incorporating the constant term in the design matrix of a regression by including a column of 1's [1]. The output of the \(j\)th hidden unit is
\[
y_j=f(u_j)=f\Bigl(\sum_{i=0}^{I}\alpha_{ij}x_i\Bigr),\tag{2}
\]
where \(f(\cdot)\) is a nonlinear activation function. The most commonly used activation function is the logistic (sigmoid) function:
\[
y_j=\frac{1}{1+\exp(-u_j)}.\tag{3}
\]
The input to the \(k\)th output unit is
\[
v_k=\sum_{j=0}^{H}\beta_{jk}y_j,\qquad y_0\equiv 1,\tag{4}
\]
where \(\beta_{0k}\) is a bias. The activation of the network outputs for mutually exclusive groups can be achieved using the softmax (normalized exponential) activation function:
\[
o_k=\frac{\exp(v_k)}{\sum_{k'=1}^{K}\exp(v_{k'})}=\frac{1}{1+\exp(-V_k)},\tag{5}
\]
\[
V_k=v_k-\ln\sum_{k'\neq k}\exp(v_{k'}),\tag{6}
\]
which can be regarded as a multiclass generalization of the logistic function.

Figure 1: Single hidden layer neural network model.

From (1)–(6), \(o_k\) can be written in the form
\[
o_k=\frac{\exp(v_k)}{\sum_{k'=1}^{K}\exp(v_{k'})}
=\frac{\exp\bigl(\sum_{j=0}^{H}\beta_{jk}y_j\bigr)}{\sum_{k'=1}^{K}\exp\bigl(\sum_{j=0}^{H}\beta_{jk'}y_j\bigr)},
\qquad y_j=\frac{1}{1+\exp\bigl(-\sum_{i=0}^{I}\alpha_{ij}x_i\bigr)}.\tag{7}
\]
The output \(o_K\) for the \(K\)th group can be calculated as \(o_K=1-\sum_{k=1}^{K-1}o_k\). For example, in the case of \(K=3\), it follows that
\[
o_1=\frac{e^{v_1}}{e^{v_1}+e^{v_2}+e^{v_3}},\qquad o_2=\frac{e^{v_2}}{e^{v_1}+e^{v_2}+e^{v_3}}.\tag{8}
\]
From \(o_1+o_2+o_3=1\), \(o_3=1-(o_1+o_2)\). Thus the number of units in the output layer is \(2\;(=K-1)\).

By setting the teacher value
\[
t_{kd}=\begin{cases}1,&\text{if the }d\text{th input vector }\mathbf{x}_d=(x_{1d},x_{2d},\ldots,x_{Id})\text{ is from the }k\text{th group},\\0,&\text{otherwise},\end{cases}\tag{9}
\]
the log likelihood function for the total sample size \(D\) is
\[
\ell=\ln L(\boldsymbol{\theta};\mathbf{x},\mathbf{t})=\sum_{d=1}^{D}\sum_{k=1}^{K}t_{kd}\ln o_{kd},\qquad 0\le o_{kd}\le 1,\quad \sum_{k=1}^{K}t_{kd}=1,\tag{10}
\]
where \(\mathbf{t}_d=(t_{1d},t_{2d},\ldots,t_{Kd})\) and \(\mathbf{o}_d=(o_{1d},o_{2d},\ldots,o_{Kd})\) are the teacher and output vectors, respectively, for the \(d\)th observation, \(\mathbf{t}=(\mathbf{t}_1,\mathbf{t}_2,\ldots,\mathbf{t}_D)\), \(\mathbf{x}=(\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_D)\), \(\boldsymbol{\alpha}=\{\alpha_{ij}\}\), and \(\boldsymbol{\beta}=\{\beta_{jk}\}\). As usual, the negative log likelihood gives the cross-entropy error function. The unknown parameters \(\boldsymbol{\theta}=\{\boldsymbol{\alpha},\boldsymbol{\beta}\}\) can be estimated by maximizing the log likelihood ((10) with output (7)) by batch backpropagation including momentum, in which the starting values of the unknown parameters are chosen at random. The number of parameters included in the multiple-group neural discriminant model is \(p=H(I+K-1)+H+K-1\).
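
As an illustration, the following is a minimal NumPy sketch (not the authors' implementation) of the forward pass (1)-(8) and of the log likelihood (10). Treating the Kth group as a reference class whose linear predictor is fixed at zero, so that only K-1 output units carry weights, is an assumption; it is consistent with the statement that the output layer has K-1 units and with (23) below.

import numpy as np

def forward(x, alpha, beta):
    """Forward pass of the single-hidden-layer network. x: (I,) inputs;
    alpha: (I+1, H) weights alpha_ij with bias row alpha_0j;
    beta: (H+1, K-1) weights beta_jk with bias row beta_0k."""
    x_aug = np.concatenate(([1.0], x))          # x_0 = 1, eq. (1)
    u = x_aug @ alpha                           # hidden-unit inputs u_j
    y = 1.0 / (1.0 + np.exp(-u))                # logistic activation, eq. (3)
    y_aug = np.concatenate(([1.0], y))          # y_0 = 1, eq. (4)
    v = y_aug @ beta                            # output-unit inputs v_1, ..., v_{K-1}
    e = np.exp(np.concatenate((v, [0.0])))      # v_K = 0 for the reference group (assumption)
    return e / e.sum()                          # softmax outputs o_1, ..., o_K, eq. (7)

def log_likelihood(X, T, alpha, beta):
    """Cross-entropy log likelihood of eq. (10). X: (D, I); T: (D, K) teacher vectors."""
    O = np.array([forward(x, alpha, beta) for x in X])
    return float(np.sum(T * np.log(np.clip(O, 1e-12, 1.0))))

def n_parameters(I, H, K):
    """Number of weights and biases, p = H(I + K - 1) + H + K - 1."""
    return H * (I + K - 1) + H + K - 1

# Small check with I = 2 inputs, H = 4 hidden units, K = 3 groups.
rng = np.random.default_rng(0)
alpha, beta = rng.normal(size=(3, 4)), rng.normal(size=(5, 2))
X = rng.uniform(size=(10, 2))
T = np.eye(3)[rng.integers(0, 3, size=10)]      # one-of-K teacher coding, eq. (9)
print(log_likelihood(X, T, alpha, beta), n_parameters(2, 4, 3))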

2.1.2. Determination of the Optimum Number of Hidden Units

The criterion based on bootstrapping is demonstrated to be favorable when selecting the optimum number of hidden units. In conventional statistics, various criteria have been developed for assessing generalization performance. AIC provides a decision as to which of several competing network architectures is best for a given problem. However, the use of AIC may not be justified theoretically when considering a neural network as an approximation to an underlying model [7, 24]. A bootstrap-type nonparametric resampling estimator of the Kullback-Leibler information [28], studied by Ishiguro and Sakamoto [25], Konishi and Kitagawa [26], Ishiguro et al. [27], Shibata [29], and Shao [30], can provide an alternative to AIC that is computed from the (skewed, discrete) empirical distribution.

Let the training samples \(\mathbf{X}=\{\mathbf{X}_1,\mathbf{X}_2,\ldots,\mathbf{X}_d,\ldots,\mathbf{X}_D\}\), with \(\mathbf{X}_d=\{\mathbf{x}_d,\mathbf{t}_d\}\) and \(\mathbf{x}_d=\{x_{1d},x_{2d},\ldots,x_{Id}\}\) for \(d=1,2,\ldots,D\), be independently distributed according to an unknown distribution \(F\). Let \(\widehat{F}\) be the empirical distribution function that places a mass equal to \(1/D\) at each point \(\mathbf{X}_1,\mathbf{X}_2,\ldots,\mathbf{X}_d,\ldots,\mathbf{X}_D\). We propose the following bootstrap sampling algorithm.

Step 1. Generate \(B\) bootstrap samples \(\mathbf{X}^*\), each of size \(D\), drawn with replacement from the training sample \(\mathbf{X}=\{\mathbf{X}_1,\mathbf{X}_2,\ldots,\mathbf{X}_d,\ldots,\mathbf{X}_D\}\). Denote the \(b\)th sample as \(\mathbf{X}^*_b\), \(b=1,2,\ldots,B\).

Step 2. For each bootstrap sample \(\mathbf{X}^*_b\), \(b=1,2,\ldots,B\), fit a model to obtain the estimator \(\widehat{\boldsymbol{\theta}}(\mathbf{X}^*_b)\).

Step 3. The bootstrap estimator of the bias \(C^*\) is given by
\[
C^*\approx\frac{1}{B}\sum_{b=1}^{B}\Bigl[\ln L\bigl\{\mathbf{X}^*_b;\widehat{\boldsymbol{\theta}}(\mathbf{X}^*_b)\bigr\}-\ln L\bigl\{\mathbf{X};\widehat{\boldsymbol{\theta}}(\mathbf{X}^*_b)\bigr\}\Bigr],\tag{11}
\]
where \(C^*\) is the average of the differences between the log likelihood on the bootstrap sample, \(\ln L\{\mathbf{X}^*_b;\widehat{\boldsymbol{\theta}}(\mathbf{X}^*_b)\}\), and that on the training sample, \(\ln L\{\mathbf{X};\widehat{\boldsymbol{\theta}}(\mathbf{X}^*_b)\}\), given \(\widehat{\boldsymbol{\theta}}(\mathbf{X}^*_b)\).
Thus, the Extended Information Criterion (EIC) proposed by Ishiguro et al. [27] is defined as
\[
\mathrm{EIC}=-2\ln L\{\mathbf{X};\widehat{\boldsymbol{\theta}}(\mathbf{X})\}+2C^*.\tag{12}
\]
The EIC approach selects the number of hidden units with the minimum value of (12); Shibata [29] and Shao [30] point out that this method is asymptotically equivalent to leaving-one-out CV and AIC.
Note that the bootstrap algorithm requires refitting the model (retraining the network) \(B\) times [31]. The number of replications \(B\) is typically in the range \(20\le B\le 200\), and so \(B=200\) bootstrap replications are used here. The competing networks share the same architecture, the only exception being the number of hidden units.
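
The following is a minimal sketch of Steps 1-3, assuming a generic fit(X, T) routine (hypothetical; it stands for batch backpropagation maximizing (10) and returns the fitted weight arrays) together with the log_likelihood function sketched in Section 2.1.1.

import numpy as np

def eic(X, T, fit, log_likelihood, B=200, seed=0):
    """Bootstrap bias C* of eq. (11) and EIC of eq. (12)."""
    rng = np.random.default_rng(seed)
    D = len(X)
    theta_hat = fit(X, T)                          # fit on the training sample
    bias_terms = []
    for _ in range(B):
        idx = rng.integers(0, D, size=D)           # Step 1: resample with replacement
        Xb, Tb = X[idx], T[idx]
        theta_b = fit(Xb, Tb)                      # Step 2: refit on X*_b
        bias_terms.append(log_likelihood(Xb, Tb, *theta_b)
                          - log_likelihood(X, T, *theta_b))   # summand of eq. (11)
    C_star = float(np.mean(bias_terms))            # Step 3
    return -2.0 * log_likelihood(X, T, *theta_hat) + 2.0 * C_star   # eq. (12)

The candidate architecture (number of hidden units) with the smallest returned value would then be selected.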

2.1.3. Bootstrapping the Deviance

No standard procedure by which to assess the overall goodness-of-fit of the multiple-group neural discriminant model has been proposed. By introducing the maximum likelihood principle, the deviance allows us to test the overall goodness-of-fit of the model:
\[
\mathrm{Dev}=2\bigl[\ln L_f-\ln L(\mathbf{X};\widehat{\boldsymbol{\theta}})\bigr]=2\sum_{d=1}^{D}\sum_{k=1}^{K}t_{kd}\ln\Bigl(\frac{t_{kd}}{\widehat{o}_{kd}}\Bigr),\tag{13}
\]
where \(\ln L(\mathbf{X};\widehat{\boldsymbol{\theta}})\) denotes the maximized log likelihood under the current neural discriminant model. Since the log likelihood for the full model, \(\ln L_f=\sum_{d=1}^{D}\sum_{k=1}^{K}t_{kd}\ln t_{kd}\), is zero by the definition \(0\ln 0=0\), we have
\[
\mathrm{Dev}=-2\sum_{d=1}^{D}\sum_{k=1}^{K}t_{kd}\ln\widehat{o}_{kd}.\tag{14}
\]
Note that the deviance is minus two times the log likelihood (10) evaluated at \(\widehat{\boldsymbol{\theta}}\). The greater the deviance, the poorer the fit of the model. However, the deviance given in (14) is not even approximately distributed as \(\chi^2\) when binary (Bernoulli) responses are available [32–35]. We therefore provide a bootstrap estimator of the percentile (i.e., the critical point) for the deviance given in (14) according to the following algorithm.

Step 1. Generate \(B\) (\(=200\)) bootstrap samples \(\mathbf{X}^*\) drawn with replacement from the training sample \(\mathbf{X}\), using the optimum number of hidden units determined as described in Section 2.1.2.

Step 2. For each bootstrap sample \(\mathbf{X}^*_b\), \(b=1,2,\ldots,B\), the deviance given in (14) is computed as
\[
\mathrm{Dev}^*(b)=-2\ln L\bigl\{\mathbf{X}^*_b;\widehat{\boldsymbol{\theta}}(\mathbf{X}^*_b)\bigr\}.\tag{15}
\]
This process is independently repeated \(B\) times, and the computed values are arranged in ascending order.

Step 3. The value of the \(j\)th order statistic \(\mathrm{Dev}^*\) of the \(B\) replications can be taken as an estimator of the quantile of order \(j/(B+1)\).

Step 4. The estimator of the \(100(1-\alpha)\)th percentile (i.e., the \(100\alpha\%\) critical point) of \(\mathrm{Dev}^*\) is used to test the goodness-of-fit of the model using a specified significance level \(\alpha=1-j/(B+1)\). If the value of the deviance given in (14) is greater than the estimate of the percentile, then the model fits poorly.
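
Under the same assumptions (a generic fit routine and the log_likelihood sketch of Section 2.1.1), Steps 1-4 amount to the following.

import numpy as np

def deviance_critical_point(X, T, fit, log_likelihood, B=200, alpha_level=0.05, seed=0):
    """Bootstrap estimate of the 100(1 - alpha)th percentile of the deviance (14)."""
    rng = np.random.default_rng(seed)
    D = len(X)
    devs = []
    for _ in range(B):
        idx = rng.integers(0, D, size=D)                      # Step 1
        Xb, Tb = X[idx], T[idx]
        theta_b = fit(Xb, Tb)
        devs.append(-2.0 * log_likelihood(Xb, Tb, *theta_b))  # Dev*(b), eq. (15)
    devs = np.sort(np.array(devs))                            # Step 2: ascending order
    j = int(np.ceil((1.0 - alpha_level) * (B + 1)))           # Step 3: quantile of order j/(B+1)
    return devs[min(j, B) - 1]                                # Step 4: estimated critical point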

2.1.4. Excess Error Estimation

Let the error rate \(e(F;\mathbf{X})\) be the probability of incorrectly predicting the outcome of a new observation drawn from an unknown distribution \(F\), given a prediction rule constructed on a training sample \(\mathbf{X}\). This error rate is referred to as the actual error rate, which is of interest when assessing the performance of prediction rules. Let \(\widehat{F}\) be the empirical distribution function that places a mass equal to \(1/D\) at each point \(\mathbf{X}_1,\mathbf{X}_2,\ldots,\mathbf{X}_d,\ldots,\mathbf{X}_D\). We apply a prediction rule \(\eta\) to this training sample \(\mathbf{X}\) and form the realized prediction rule \(\eta_{\widehat{F}}(\mathbf{x}_0)\) for a new observation \(\mathbf{X}_0=\{\mathbf{x}_0,\mathbf{t}_0\}\). Let \(Q(\mathbf{t}_0,\eta_{\widehat{F}}(\mathbf{x}_0))\) indicate the discrepancy between an observed value \(\mathbf{t}_0\) and its predicted value \(\eta_{\widehat{F}}(\mathbf{x}_0)\). Let the error rate \(e(\widehat{F};\mathbf{X})\), referred to as the apparent error rate, be the probability of incorrectly predicting the outcome for a sample drawn from the empirical distribution \(\widehat{F}\) of the training sample. Because the training sample is used both to form and to assess the prediction rule, the apparent error rate underestimates the actual error rate. The difference \(e(\widehat{F};\mathbf{X})-e(F;\mathbf{X})\) is the excess error, that is, the bias of the apparent error rate. The expected excess error (bias) of a given prediction rule [22, 23, 36, 37] is
\[
b(F)=E\bigl[e(\widehat{F};\mathbf{X})-e(F;\mathbf{X})\bigr].\tag{16}
\]

When the prediction rule by multiple-group neural discriminant model is allowed to be complicated, overfitting becomes a real danger, and excess error estimation becomes important. Thus we will consider the bootstrapping to estimate the expected excess error when fitting a multiple-group neural discriminant model to the data. The algorithm can be summarized as follows.

Step 1. Generate bootstrap samples \(\mathbf{X}^*\) from \(\widehat{F}\) as described in Section 2.1.2. Let \(\widehat{F}^*\) be the empirical distribution of \(\mathbf{X}^*_1,\mathbf{X}^*_2,\ldots,\mathbf{X}^*_d,\ldots,\mathbf{X}^*_D\).

Step 2. For each bootstrap sample \(\mathbf{X}^*\), fit a model to obtain the estimator \(\widehat{\boldsymbol{\theta}}(\mathbf{X}^*)\) and construct the realized prediction rule \(\eta_{\widehat{F}^*}\) based on \(\mathbf{X}^*\).

Step 3. The bootstrap estimator of the expected excess error in (16) is given by
\[
R^*=\frac{1}{D}\sum_{d=1}^{D}Q\bigl(\mathbf{t}^*_d,\eta_{\widehat{F}^*}(\mathbf{x}^*_d)\bigr)-\frac{1}{D}\sum_{d=1}^{D}Q\bigl(\mathbf{t}_d,\eta_{\widehat{F}^*}(\mathbf{x}_d)\bigr),\tag{17}
\]
where
\[
Q\bigl(\mathbf{t}_d,\eta_{\widehat{F}^*}(\mathbf{x}_d)\bigr)=\begin{cases}1,&\text{incorrect discrimination},\\0,&\text{otherwise}.\end{cases}\tag{18}
\]

Step 4. Repeat Steps 1–3 for the bootstrap samples \(\mathbf{X}^*_b\), \(b=1,2,\ldots,B\) (\(=200\)), to obtain \(R^*_b\). The bootstrap estimator of the expected excess error can be obtained as
\[
\widehat{b}(\widehat{F})\approx\frac{1}{B}\sum_{b=1}^{B}R^*_b.\tag{19}
\]

Step 5. The actual error rate with bootstrap bias correction is
\[
\widehat{e}_{\mathrm{boot}}(\widehat{F})=e(\widehat{F};\mathbf{X})-\widehat{b}(\widehat{F}).\tag{20}
\]
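
The following is a minimal sketch of Steps 1-5, assuming hypothetical fit(X, T) and predict(X, theta) helpers, the latter returning the index of the group with the largest output o_k; neither helper is specified in the paper.

import numpy as np

def error_rate(T, predicted_labels):
    """Proportion of incorrect discriminations, i.e., the average of Q in eq. (18)."""
    return float(np.mean(np.argmax(T, axis=1) != predicted_labels))

def bootstrap_corrected_error(X, T, fit, predict, B=200, seed=0):
    """Actual error rate with bootstrap bias correction, eqs. (17)-(20)."""
    rng = np.random.default_rng(seed)
    D = len(X)
    apparent = error_rate(T, predict(X, fit(X, T)))           # apparent error e(F_hat; X)
    R = []
    for _ in range(B):
        idx = rng.integers(0, D, size=D)                      # Step 1
        Xb, Tb = X[idx], T[idx]
        theta_b = fit(Xb, Tb)                                 # Step 2: rule built on X*_b
        R.append(error_rate(Tb, predict(Xb, theta_b))         # apparent error on X*_b
                 - error_rate(T, predict(X, theta_b)))        # same rule applied to X, eq. (17)
    b_hat = float(np.mean(R))                                 # eq. (19)
    return apparent - b_hat                                   # eq. (20)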

3. Simulation Study

Since the model generally does not encompass the unknown true function, but rather only approximations thereof, the model is inherently misspecified. We therefore present results from Monte Carlo simulations to evaluate the performance. The criterion based on bootstrapping is demonstrated to be favorable when selecting the optimum number of hidden units [38–41]. Vach et al. [41] investigated how well neural networks can approximate specific regression functions from the class
\[
f(\mathbf{x})=\beta_0+\sum_{i=1}^{I}\beta_i x_i+\sum_{i=1}^{I}\gamma_i x_i^2+\sum_{i<i'}\delta_{ii'}x_i x_{i'},\tag{21}
\]
and pointed out that a comparison using members of this class is somewhat unfair. We therefore illustrate the merit of the neural network model by using a function with several local extrema. The influence can be illustrated through a simple simulation using a neural network model with two inputs, \(x_1\) and \(x_2\), because we can then visualize the contour plot of the unknown population.

3.1. Two-Class Classification

The influence can be illustrated through a simple simulation using a neural network model with two inputs, one output, and a varying number of hidden units. For two independent continuous covariates \(x_1\) and \(x_2\), we simulated the following known (true) model:
\[
f(x_1,x_2)=\frac{1}{1+\exp\{-\sin 2\pi x_1-x_1x_2-\sin 2\pi x_2\}}.\tag{22}
\]

Training and test samples of size 1000 were considered in the present study. The input data \((x_1,x_2)\) are drawn from the uniform distribution over \([0,1]\times[0,1]\), and the binary response \(y\) is labeled 1 if \(f(x_1,x_2)>1/2\) and 0 otherwise. Figure 2 shows the distribution of the covariates \((x_1,x_2)\) and the class membership indicator in the training sample.
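
A minimal sketch of this data-generating design, using the signs of (22) as reconstructed above, is as follows.

import numpy as np

def f(x1, x2):
    """Known (true) model, eq. (22)."""
    return 1.0 / (1.0 + np.exp(-np.sin(2 * np.pi * x1)
                               - x1 * x2
                               - np.sin(2 * np.pi * x2)))

def simulate_two_class(n=1000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=(n, 2))           # (x1, x2) uniform on [0,1] x [0,1]
    y = (f(x[:, 0], x[:, 1]) > 0.5).astype(int)      # label 1 if f(x1, x2) > 1/2, else 0
    return x, y

x_train, y_train = simulate_two_class(1000, seed=1)
x_test, y_test = simulate_two_class(1000, seed=2)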

Figure 2: Contour plots: (a) darker grey scale levels represent lower probabilities of 𝑦=0; (b) and (c) show the class membership indicators 𝑦=0 and 𝑦=1, respectively, for the covariates (𝑥1,𝑥2).

EIC values with 𝐵=200 replications based on bootstrapping pairs for the training sample are shown in Figure 3 after fitting the neural discriminant models having one to five hidden units. For the purpose of comparison, the values of AIC and BIC are also provided. In the case of the simulation study, the known (true) model given in (22) is included in the population. Thus, the differences between EIC and AIC values are slight.

Figure 3: EIC, AIC, and BIC values for the simulation using only one training sample (note that the series of EIC and AIC are indistinguishable).

Using the simulated training sample, the feed-forward neural networks were fitted to the known (true) model given in (22). The tendency of the mapping performed by neural networks with h = 1, 2, 3 hidden units to fit the function given in (22) implausibly can also be illustrated.

The bootstrap estimate of the 95th percentile of Dev* (i.e., the 5% critical point) for the training sample with four hidden units is Dev* = 203.10. Comparison with the deviance given in (14) (Dev = 39.40) suggests that the multiple-group neural discriminant model fits the data fairly well, because Dev = 39.40 is far below the 5% critical point Dev* = 203.10.

The actual error rate with the bootstrap bias correction given in (20) for the multiple-group neural discriminant model with four hidden units is calculated as e_boot = 0.009. Figure 4 illustrates the apparent error rates observed for the training and test samples. The apparent error rates for both samples decreased as the number of hidden units increased from h = 1 to h = 4 and then remained constant.

Figure 4: Apparent error rates for simulated data after fitting neural networks with one to five hidden units.

Figures 3 and 4 are based on only one simulated data set. However, the efficacy of the bootstrap procedures would be more convincingly illustrated in a simulation study based on multiple samples. Figure 5 shows the average values of EIC, AIC, and BIC based on multiple samples with 100 replications after fitting the neural discriminant models having one to five hidden units. Figure 6 shows the box-and-whisker plots for EIC in order to evaluate the standard errors and other statistics. Figure 7 illustrates the mean apparent error rates observed in multiple test samples with 100 replicates. Figure 8 also shows the box-and-whisker plots for the mean apparent error rates in multiple test samples with 100 replicates. For the purpose of comparison, the estimates of the actual error rates with bootstrap bias correction for the training sample [42] are also shown in Figure 7. It is concluded that EIC identifies the optimal number of hidden units (i.e., 4) more often than AIC. In addition, the differences between the average values of EIC and AIC are somewhat similar to Figure 3, and the average values of the bootstrap-corrected estimate of the prediction error rate vary around the average apparent error rates for the multiple test samples.

Figure 5: Average values of EIC, AIC and BIC for the simulated sample with 100 replications.
Figure 6: Box-and-whisker plots for EIC.
Figure 7: Error rates of multiple test samples with 100 replicates and actual error rate with bootstrap bias.
Figure 8: Box-and-whisker plots for the mean apparent error rates in multiple test samples with 100 replicates.
3.2. Multiclass Classification

The input data \((x_1,x_2)\) are generated from the uniform distribution over \(\mathbf{x}=(x_1,x_2)\in[-1,1]\times[-1,1]\). By substituting \(\mathbf{x}=(x_1,x_2)\) into
\[
\mathrm{Pr}_k=\frac{\exp\{f_k(x_1,x_2)\}}{1+\sum_{k'=1}^{K-1}\exp\{f_{k'}(x_1,x_2)\}},\qquad k=1,2,\ldots,K-1,\tag{23}
\]
\(\mathbf{x}\) can be grouped into four classes in the case of \(K=4\). As the nonlinear functions \(f_k(x_1,x_2)\),
\[
\begin{aligned}
f_1(x_1,x_2)&=5\sin 2\pi x_1-3\sin 2\pi x_2,\\
f_2(x_1,x_2)&=\sin 2\pi x_1+2\cos 2\pi x_2,\\
f_3(x_1,x_2)&=2-8(x_1-0.5)^2-8(x_2-0.5)^2-0.5
\end{aligned}\tag{24}
\]
can be used, with \(\sum_{k=1}^{K}\mathrm{Pr}_k=1\) [41, 43].

By use of the definition of \(t_k\) in (9), the observations can be divided into four classes by the multinomial random vector
\[
\mathbf{t}\sim\mathrm{Multinom}(1,\mathbf{Pr}):\quad
\text{Class 1: }\mathbf{t}=(1,0,0,0),\;
\text{Class 2: }\mathbf{t}=(0,1,0,0),\;
\text{Class 3: }\mathbf{t}=(0,0,1,0),\;
\text{Class 4: }\mathbf{t}=(0,0,0,1),\tag{25}
\]
with \(\mathbf{t}=(t_1,t_2,t_3,t_4)\) and \(\mathbf{Pr}=(\mathrm{Pr}_1,\mathrm{Pr}_2,\mathrm{Pr}_3,\mathrm{Pr}_4)\).
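
A minimal sketch of this four-class design, using the forms of f_1, f_2, and f_3 as reconstructed in (24) above (the coefficients of f_3 in particular should be regarded as an assumption), is as follows.

import numpy as np

def class_probs(x1, x2):
    """Class probabilities Pr_1, ..., Pr_4 from eq. (23) with K = 4."""
    f = np.array([5.0 * np.sin(2 * np.pi * x1) - 3.0 * np.sin(2 * np.pi * x2),   # f_1
                  np.sin(2 * np.pi * x1) + 2.0 * np.cos(2 * np.pi * x2),         # f_2
                  2.0 - 8.0 * (x1 - 0.5) ** 2 - 8.0 * (x2 - 0.5) ** 2 - 0.5])    # f_3
    e = np.exp(f)
    denom = 1.0 + e.sum()
    return np.append(e / denom, 1.0 / denom)          # the fourth class is the reference

def simulate_four_class(n=400, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, 2))           # covariates uniform on [-1,1] x [-1,1]
    T = np.array([rng.multinomial(1, class_probs(x1, x2)) for x1, x2 in X])   # eq. (25)
    return X, T

X_train, T_train = simulate_four_class(400, seed=1)
X_test, T_test = simulate_four_class(400, seed=2)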

In this paper, training and test samples of size 400 were considered. The apparent error rates for the training and test samples of several models are given in Table 1. From Table 1, it is found that the apparent error rates of the training and test samples for the multiple-group neural discriminant model are the smallest.

Table 1: Comparison of various discriminant methods for simulated data.

4. Results and Discussion

Prediction accuracy (error rate) is the most important consideration in the development of a prediction model. The assessment of goodness-of-fit is a useful exercise. In particular, the goodness-of-fit and the error rate computed from the training data alone are of limited meaning because of the overfitting issue. The main purpose is to predict future samples accurately; in other words, in real applications, the test sample population may differ from the training samples. A benchmark data set is thus used to illustrate the advantages of the models and methods developed herein. A multiple-group neural discriminant model having a single hidden layer was applied to a data set of 3,772 training instances and 3,428 test instances from a thyroid disease database. All of these data sets are available on the World Wide Web at http://archive.ics.uci.edu/ml/datasets/Thyroid+Disease. The present study considered three groups: hypothyroid, hyperthyroid, and normal. The laboratory profiles on which the differential diagnosis is made consist of 21 attributes (15 attributes are binary, and six attributes are continuous):

𝑥1: age, 𝑥2: sex, 𝑥3: on thyroxine, 𝑥4: query thyroxine, 𝑥5: on antithyroid, 𝑥6: sick, 𝑥7: pregnant, 𝑥8: thyroid surgery, 𝑥9: I131 treatment, 𝑥10: query hypothyroid, 𝑥11: query hyperthyroid, 𝑥12: lithium, 𝑥13: goitre, 𝑥14: tumour, 𝑥15: hypopituitary, 𝑥16: psych, 𝑥17: TSH, 𝑥18: T3, 𝑥19: TT4, 𝑥20: T4U, 𝑥21: FTI.

Table 2 lists the first five observations of the 21 attributes and the group for the training sample. The training sample is used to determine the structure of the neural network model. Table 3 lists the first five observations of the 21 attributes and the group for the test sample. The goal of discrimination is to assign new observations to one of the mutually exclusive groups. The data in Tables 2 and 3 include six continuous and 15 binary attributes. Fisher's discriminant model assumes that the inputs are normally distributed. It is worth noting, however, that the posterior class probabilities for the neural discriminant model can be obtained by maximizing the log likelihood (10) without assuming a normal distribution for the inputs.

Table 2: List of the first five observations of the 21 attributes and the group with respect to the training sample.
Table 3: List of the first five observations of the 21 attributes and the group with respect to the test sample.

A thyroid disease database has been used as a benchmark test for the neural network model shown in Figure 1 with 𝐼=21 and 𝐾=3. EIC values are shown in Figure 9 after fitting the multiple-group neural discriminant models having one to four hidden units. In this case, the true model is not included in the population. For the purpose of comparison, AIC and BIC values are also provided.

Figure 9: EIC, AIC and BIC values for the training sample of a thyroid disease database.

Figure 9 indicates that the minimum EIC value is obtained for the model having two hidden units, which has an apparent error rate e(F̂;X) of 0.0090. Figure 10 shows a histogram of the bootstrap replications R*_b that are used to estimate the expected excess error. The mean and standard deviation of R*_b are −0.0033 and 0.0022, respectively. The actual error rate with the bootstrap bias correction given in (20) for the multiple-group neural discriminant model with two hidden units is calculated as e_boot = 0.012.

Figure 10: Histogram of the bootstrap replications R*_b.

The histogram of the bootstrapped Dev*(b) for B = 200 is provided in Figure 11. The bootstrap estimate of the 95th percentile of Dev* (i.e., the 5% critical point) for the thyroid disease training sample with two hidden units is Dev* = 344.41. Comparison with the deviance given in (14) (Dev = 204.03) suggests that the multiple-group neural discriminant model fits the data fairly well. For reference, the Q-Q plot of the bootstrapped Dev*(b) for B = 200 is shown in Figure 12.

Figure 11: Histogram of the bootstrapped deviance Dev*(b) for B = 200.
Figure 12: Q-Q plot of the bootstrapped deviance Dev*(b) for B = 200.

Alternatively, if the deviance (14) asymptotically followed the χ2 distribution with D − p degrees of freedom under the null hypothesis that the model is correct, it could be compared with the probability density function of the χ2 distribution with 3772 degrees of freedom shown in Figure 13. However, because of the large sample size D = 3772, the distribution is extremely skewed. Comparing Figure 13 with Figure 11, it is found that the distribution of the deviance (14) cannot be approximated by the χ2 distribution. Furthermore, the mean and variance of the bootstrapped Dev*(b) are E[Dev*(b)] = 205.55 and Var[Dev*(b)] = 4436.90, respectively, which are not close to those of the χ2 distribution with 3772 d.f., that is, E[χ2] = d.f. = 3772 and Var[χ2] = 2 × d.f. = 7544. It should be noted that the deviance asymptotically follows a χ2 distribution for grouped binary (i.e., binomial) responses and a set of predictor variables, as described in Tsujitani and Aoki [44].
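
A small sketch of this diagnostic, assuming the B = 200 bootstrapped deviances Dev*(b) of Section 2.1.3 are available in an array dev_boot (a hypothetical variable name) and using SciPy only for the chi-square reference, is as follows.

import numpy as np
from scipy import stats

def compare_to_chisq(dev_boot, df):
    """Compare bootstrapped deviances with the chi-square reference on df degrees of freedom."""
    chi2 = stats.chi2(df)
    return {"boot_mean": float(np.mean(dev_boot)),
            "boot_var": float(np.var(dev_boot, ddof=1)),
            "chi2_mean": chi2.mean(),                 # equals df
            "chi2_var": chi2.var(),                   # equals 2 * df
            "chi2_95th_percentile": chi2.ppf(0.95)}   # upper 5% point of the reference

# With df = 3772, the chi-square mean (3772) and variance (7544) are far from the
# bootstrapped values reported above, as the comparison of Figures 11 and 13 illustrates.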

Figure 13: Probability density function of the χ2 distribution on 3772 d.f.

The apparent error rates after fitting the multiple-group neural discriminant models having one to four hidden units are shown in Figure 14. Figure 14 indicates that (i) the multilayer feedforward neural network can approximate virtually any function up to some desired level of approximation with the number of hidden units increased ad libitum for the training sample, (ii) the actual error rate for the test sample is the smallest when the number of hidden units is two, and (iii) a neural network with a large number of hidden units has a higher error rate for the test sample, because the noise is modeled in addition to the underlying function.

Figure 14: Apparent error rates for the thyroid disease database after fitting neural networks with one to four hidden units.

Although the model can be made to fit the training sample arbitrarily well by increasing the number of hidden units, it does not generalize well to the test sample, which is the goal. The apparent error rates for the training and test samples of several models are given in Table 4: (i) the multiple-group logistic discriminant model with linear effects [6, 45] by use of library{VGAM} in the free software R [15], (ii) the multiple-group logistic discriminant model with linear + quadratic effects, (iii) the tree-based model with mincut = 5, minsize = 10, and mindev = 0.01 as tuning parameters [46] by use of library{rpart} in R, (iv) the nearest-neighbor smoother using a nonparametric method to derive the classification criterion [6, 47] by use of library (knn) in R, (v) the kernel smoother [47] using the normal distribution and a radius r = 1.1 to specify the kernel density by use of library{ks} in R, (vi) the support vector machine using the “one-against-one” approach [48, 49] by use of library{e1071} in R, (vii) the proportional odds model [14], and (viii) VGAM based on the proportional odds model with optimum smoothing parameters selected by leaving-one-out cross-validation [20] by use of library{VGAM} in R.

Table 4: Comparison of various discriminant methods for a thyroid disease database.

From Table 4, it is found that the multiple-group neural discriminant model (h = 2) has the smallest error rate for the test sample while maintaining a relatively small error rate for the training sample. In order to overcome the stringent assumption of additive and purely linear effects of the covariates, multiple-group logistic discrimination models with linear and quadratic effects were included. The improvement obtained by the inclusion of the quadratic effect is slight. It should be noted that the apparent error rate for the training sample of VGAM is the smallest, but that for the test sample is large. This overfitting leads to poor generalization. For example, the estimated smooth function of the covariate “age” for VGAM in Figure 15 shows the overfitting.

Figure 15: The estimated smooth function of the covariate “age” for VGAM.

Table 5 lists apartment house data for the assessment of land value by metropolitan-area stations, with a four-class classification [50]. Using four covariates (average price of houses built for sale, average house rent, yield, and assessment of station value by the number of passengers getting on and off), the assessed land value of a metropolitan-area station may be grouped into one of four categories: (i) the most comfortable, (ii) very comfortable, (iii) a little comfortable, and (iv) not comfortable.

Table 5: Apartment house data for assessment of land value by the metropolitan area stations.

Figure 16 shows the values of EIC, AIC, and leaving-one-out CV (see the Appendix). The leaving-one-out CV is included in order to assess the bootstrapping. The minimum EIC and leaving-one-out CV values are obtained for the model having two hidden units, whereas the minimum AIC value is obtained for three hidden units. The actual error rates in the cases using EIC and leaving-one-out CV with two hidden units are 0.276 and 0.273, respectively. The bootstrapping is thus assessed from the viewpoint of leaving-one-out CV. The apparent error rates for the training samples of several models are given in Table 6. From Table 6, it is found that the multiple-group neural discriminant model (h = 2) has the smallest error rate.

Table 6: Comparison of various discriminant methods for apartment house data.
Figure 16: EIC, AIC and (leaving-one-out) CV values for the training sample of apartment house data.

5. Conclusions

We discussed the learning algorithm based on maximizing the log likelihood function. Statistical inference based on the likelihood approach for the multiple-group neural discriminant model was discussed, and a method for estimating the bias of the expected log likelihood in order to determine the optimum number of hidden units was suggested. The key idea behind the bootstrapping is to focus on the optimum tradeoff between the unbiased approximation of the underlying model and the loss in accuracy caused by increasing the number of hidden units. In the context of applying bootstrap methods to the multiple-group neural discriminant model, this paper considered three methods and performed experiments using two data sets to evaluate them: the bootstrap pairs sampling algorithm, the goodness-of-fit test based on the bootstrapped deviance, and the excess error estimation algorithm.

There are two broad limitations to our approach. First, the batch backpropagation algorithm including momentum cannot prevent the maximum likelihood estimates from being trapped in a local, rather than the global, optimum. So far, our discussion of neural networks has focused on maximum likelihood to determine the network parameters (weights and biases). A Bayesian neural network approach [51], however, might provide a more formal framework in which to incorporate a prior distribution over the parameters. Second, our neural network models assumed the independence of the predictor variables \(\mathbf{x}=(x_1,\ldots,x_I)\). More generally, it may be preferable to visualize interactions between predictor variables. Smoothing spline ANOVA models can provide an excellent means for data of mutually exclusive groups and a set of predictor variables [43, 52]. We expect that flexible methods for discriminant models based on machine learning theory [47, 53–55], such as penalized smoothing splines and support vector machines [17–19], will be very useful in these real-world contexts.

Appendix

Leaving-One-Out CV

An alternative model selection strategy to the bootstrap bias correction of the log likelihood is leaving-one-out CV for a multiple-group neural discriminant model, which is asymptotically equivalent to TIC [29]. Let the training sample \(\mathbf{X}=\{\mathbf{X}_1,\mathbf{X}_2,\ldots,\mathbf{X}_d,\ldots,\mathbf{X}_D\}\) be independently distributed according to an unknown distribution. We then obtain the leaving-one-out CV algorithm.

Step 1. Generate the training samples 𝐗[𝑑]={𝐗1,𝐗2,,𝐗𝑑1,𝐗𝑑+1,,𝐗𝐷},𝑑=1,2,,𝐷. The subscript [𝑑] of a quantity indicates the deletion of the 𝑑th data point 𝑋𝑑 from the training sample 𝐗.

Step 2. Using each training sample \(\mathbf{X}_{[d]}\), fit a model. Then, estimate the unknown parameters, denoted by \(\widehat{\boldsymbol{\theta}}(\mathbf{X}_{[d]})\), and predict the output \(\widehat{o}_{k[d]d}\) for the deleted sample point \(\mathbf{X}_d\).

Step 3. The average predictive log likelihood of the deleted sample points is
\[
\frac{1}{D}\ln\prod_{d=1}^{D}\prod_{k=1}^{K}\bigl\{\widehat{o}_{k[d]d}\bigr\}^{t_{kd}}.\tag{A.1}
\]
As a matter of convention, the cross-validation criterion is often stated as that of minimizing
\[
\mathrm{CV}=-2\ln\prod_{d=1}^{D}\prod_{k=1}^{K}\bigl\{\widehat{o}_{k[d]d}\bigr\}^{t_{kd}}=-2\sum_{d=1}^{D}\sum_{k=1}^{K}t_{kd}\ln\widehat{o}_{k[d]d}.\tag{A.2}
\]
The leaving-one-out CV criterion finds an appropriate degree of complexity by comparing the predictive probability \(\prod_{k=1}^{K}\{\widehat{o}_{k[d]d}\}^{t_{kd}}\) for different model specifications. Anders and Korn [24] have shown that the CV criterion does not rely on any probabilistic assumption based on the properties of maximum likelihood estimators for misspecified models and is not affected by identification problems.
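
A minimal sketch of the algorithm, assuming the same generic fit routine as before and the forward function sketched in Section 2.1.1, is as follows.

import numpy as np

def loo_cv(X, T, fit, forward):
    """Leaving-one-out CV criterion, eq. (A.2)."""
    D = len(X)
    cv = 0.0
    for d in range(D):
        keep = np.arange(D) != d                     # Step 1: delete the d-th observation
        theta_d = fit(X[keep], T[keep])              # Step 2: refit on X_[d]
        o_d = forward(X[d], *theta_d)                # predictive output for the deleted point
        cv += -2.0 * np.sum(T[d] * np.log(np.clip(o_d, 1e-12, 1.0)))   # Step 3, eq. (A.2)
    return cv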

References

  1. C. M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, NY, USA, 2006.
  2. J. S. Bridle, “Probabilistic interpretation of feed-forward classification network outputs, with relationships to statistical pattern recognition,” in Neurocomputing: Algorithms, Architectures and Applications, F. F. Soulie and J. Herault, Eds., pp. 227–236, Springer, New York, NY, USA, 1990.
  3. B. Cheng and D. M. Titterington, “Neural networks: a review from statistical perspective,” Statistical Science, vol. 9, pp. 2–54, 1994.
  4. H. Gish, “Maximum likelihood training of neural networks,” in Artificial Intelligence Frontiers in Statistics, D. J. Hand, Ed., pp. 241–255, Chapman & Hall, New York, NY, USA, 1993.
  5. M. D. Richard and R. P. Lippmann, “Neural network classifiers estimate Bayesian a posteriori probabilities,” Neural Computation, vol. 3, pp. 461–483, 1991.
  6. B. D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, New York, NY, USA, 1996.
  7. H. White, “Some asymptotic results for learning in single hidden-layer feedforward network models,” Journal of the American Statistical Association, vol. 84, pp. 1003–1013, 1989.
  8. M. Aitkin and R. Foxall, “Statistical modelling of artificial neural networks using the multi-layer perceptron,” Statistics and Computing, vol. 13, no. 3, pp. 227–239, 2003.
  9. B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap, Chapman & Hall, New York, NY, USA, 1993.
  10. H. Akaike, “Information theory and an extension of the maximum likelihood principle,” in Proceedings of the 2nd International Symposium on Information Theory, B. N. Petrov and F. Csaki, Eds., pp. 267–281, Akademiai Kiado, Budapest, Hungary, 1973.
  11. G. Schwarz, “Estimating the dimension of a model,” Annals of Statistics, vol. 6, pp. 461–464, 1978.
  12. J. Shao, “Linear model selection by cross-validation,” Journal of the American Statistical Association, vol. 88, pp. 486–494, 1993.
  13. P. Zhang, “Model selection via multifold cross validation,” Annals of Statistics, vol. 21, pp. 299–313, 1993.
  14. T. J. Hastie and R. J. Tibshirani, Generalized Additive Models, Chapman & Hall, New York, NY, USA, 1990.
  15. S. N. Wood, Generalized Additive Models: An Introduction with R, Chapman & Hall, New York, NY, USA, 2006.
  16. N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, Cambridge, UK, 2000.
  17. Y. J. Lee and S. Y. Huang, “Reduced support vector machines: a statistical theory,” IEEE Transactions on Neural Networks, vol. 18, no. 1, pp. 1–13, 2007.
  18. E. Romero and D. Toppo, “Comparing support vector machines and feedforward neural networks with similar hidden-layer weights,” IEEE Transactions on Neural Networks, vol. 18, no. 3, pp. 959–963, 2007.
  19. Q. Tao, D. Chu, and J. Wang, “Recursive support vector machines for dimensionality reduction,” IEEE Transactions on Neural Networks, vol. 19, no. 1, pp. 189–193, 2008.
  20. T. W. Yee and C. J. Wild, “Vector generalized additive models,” Journal of the Royal Statistical Society Series B, vol. 58, pp. 481–493, 1996.
  21. K. I. Funahashi, “On the approximate realization of continuous mappings by neural networks,” Neural Networks, vol. 2, no. 3, pp. 183–192, 1989.
  22. G. Gong, “Cross-validation, the jackknife, and the bootstrap: excess error estimation in forward logistic regression,” Journal of the American Statistical Association, vol. 81, pp. 108–113, 1986.
  23. M. C. Wang, “Re-sampling procedures for reducing bias of error rate estimation in multinomial classification,” Computational Statistics and Data Analysis, vol. 4, no. 1, pp. 15–39, 1986.
  24. U. Anders and O. Korn, “Model selection in neural networks,” Neural Networks, vol. 12, no. 2, pp. 309–323, 1999.
  25. M. Ishiguro and Y. Sakamoto, “WIC: an estimation-free information criterion,” Research Memorandum of the Institute of Statistical Mathematics, Tokyo, Japan, 1991.
  26. S. Konishi and G. Kitagawa, “Generalised information criteria in model selection,” Biometrika, vol. 83, no. 4, pp. 875–890, 1996.
  27. M. Ishiguro, Y. Sakamoto, and G. Kitagawa, “Bootstrapping log likelihood and EIC, an extension of AIC,” Annals of the Institute of Statistical Mathematics, vol. 49, no. 3, pp. 411–434, 1997.
  28. S. Kullback and R. A. Leibler, “On information and sufficiency,” Annals of Mathematical Statistics, vol. 22, pp. 79–86, 1951.
  29. R. Shibata, “Bootstrap estimate of Kullback-Leibler information for model selection,” Statistica Sinica, vol. 7, no. 2, pp. 375–394, 1997.
  30. J. Shao, “Bootstrap model selection,” Journal of the American Statistical Association, vol. 91, no. 434, pp. 655–665, 1996.
  31. R. Tibshirani, “A comparison of some error estimates for neural network models,” Neural Computation, vol. 8, no. 1, pp. 152–163, 1996.
  32. D. Collett, Modelling Binary Data, Chapman & Hall, New York, NY, USA, 2nd edition, 2003.
  33. D. E. Jennings, “Outliers and residual distributions in logistic regression,” Journal of the American Statistical Association, vol. 81, pp. 987–990, 1986.
  34. J. M. Landwehr, D. Pregibon, and A. C. Shoemaker, “Graphical methods for assessing logistic regression models,” Journal of the American Statistical Association, vol. 79, pp. 61–71, 1984.
  35. D. Pregibon, “Logistic regression diagnostics,” Annals of Statistics, vol. 9, pp. 705–724, 1981.
  36. B. Efron, “Estimating the error rate of a prediction rule: improvement on cross-validation,” Journal of the American Statistical Association, vol. 78, pp. 316–331, 1983.
  37. B. Efron, “How biased is the apparent error rate of a prediction rule?” Journal of the American Statistical Association, vol. 81, pp. 461–470, 1986.
  38. S. Eguchi and J. Copas, “A class of logistic-type discriminant functions,” Biometrika, vol. 89, no. 1, pp. 1–22, 2002.
  39. O. Intrator and N. Intrator, “Interpreting neural-network results: a simulation study,” Computational Statistics and Data Analysis, vol. 37, no. 3, pp. 373–393, 2001.
  40. G. Schwarzer, W. Vach, and M. Schumacher, “On the misuses of artificial neural networks for prognostic and diagnostic classification in oncology,” Statistics in Medicine, vol. 19, no. 4, pp. 541–561, 2000.
  41. W. Vach, R. Roßner, and M. Schumacher, “Neural networks and logistic regression: Part II,” Computational Statistics and Data Analysis, vol. 21, no. 6, pp. 683–701, 1996.
  42. M. Tsujitani and T. Koshimizu, “Neural discriminant analysis,” IEEE Transactions on Neural Networks, vol. 11, no. 6, pp. 1394–1401, 2000.
  43. X. Lin, Smoothing spline analysis of variance for polychotomous response data, Ph.D. thesis, University of Wisconsin, Madison, Wis, USA, 1998.
  44. M. Tsujitani and M. Aoki, “Neural regression model, resampling and diagnosis,” Systems and Computers in Japan, vol. 37, no. 6, pp. 13–20, 2006.
  45. E. Lesaffre and A. Albert, “Multiple-group logistic regression diagnosis,” Journal of Applied Statistics, vol. 38, pp. 425–440, 1989.
  46. J. M. Chambers and T. J. Hastie, Statistical Models in S, Chapman & Hall, New York, NY, USA, 1992.
  47. T. J. Hastie, R. J. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, NY, USA, 2001.
  48. V. N. Vapnik, Statistical Learning Theory, Wiley, New York, NY, USA, 1998.
  49. T. S. Lim, W. Y. Loh, and Y. S. Shih, “Comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms,” Machine Learning, vol. 40, no. 3, pp. 203–228, 2000.
  50. Y. Sakurai and Y. Yashiki, “Assessment of land value by the metropolitan area stations,” Weekly Takarajima, no. 572, pp. 24–42, 2002.
  51. R. M. Neal, Bayesian Learning for Neural Networks, Springer, New York, NY, USA, 1996.
  52. C. Gu, Smoothing Spline ANOVA Models, Springer, New York, NY, USA, 2002.
  53. B. Baesens, T. Van Gestel, M. Stepanova, D. Van Den Poel, and J. Vanthienen, “Neural network survival analysis for personal loan data,” Journal of the Operational Research Society, vol. 56, no. 9, pp. 1089–1098, 2005.
  54. D. R. Mani, J. Drew, A. Betz, and P. Datta, “Statistics and data mining techniques for life-time value modeling,” in Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 94–103, San Diego, Calif, USA, 1999.
  55. W. N. Street, “A neural network model for prognostic prediction,” in Proceedings of the 15th International Conference on Machine Learning, pp. 540–546, Wisconsin, Wis, USA, 1998.