Abstract

This paper considers feed-forward neural network models for data consisting of mutually exclusive groups and a set of predictor variables. We use bootstrapping based on an information criterion to select the optimum number of hidden units for a neural network model, and the deviance to summarize the goodness-of-fit of fitted neural network models. Bootstrapping is also adapted to provide estimates of the bias of the excess error in a prediction rule constructed from training samples. Simulated data from known (true) models are analyzed in order to interpret the results obtained with the neural network. In addition, a thyroid disease database is examined in order to compare estimated measures of predictive performance in both a pure training sample study and a test sample study, in which the realized test sample apparent error rates associated with a constructed prediction rule are reported. Apartment house data for metropolitan-area stations with a four-class classification are also analyzed in order to assess the bootstrapping by comparison with leaving-one-out cross-validation (CV).

1. Introduction

The neural network model is considered for the multiclass classification problem of assigning each observation to one of several classes, and is referred to as a multiple-group neural discriminant model. Because two-class problems are much easier to solve, we focus on neural networks for multiclass classification and treat them with statistical techniques in order to derive the maximum likelihood estimators (MLE) [1–7]. The statistical techniques are formulated in terms of the likelihood principle for the neural discriminant model, in which the connection weights of the network are treated as unknown parameters.

Besides the theoretical and empirical properties of bootstrapping [8, 9] in the multiple-group neural discriminant model, there are at least two other reasons to use a bootstrap procedure. First, the criterion based on bootstrapping is demonstrated to be favorable when selecting the optimum number of hidden units. A number of model selection procedures (i.e., methods for selecting the optimum number of hidden units), such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and cross-validation [10–13], have been proposed. The bootstrap method, however, provides the percentile for the deviance, allowing evaluation of the overall goodness-of-fit and estimation of the bias of the excess error in prediction based on the selected model. Therefore, there is no extra cost for subsequent inference via the bootstrap samples generated for model selection. If a model is instead selected by a cross-validation method and the bootstrap is used for subsequent inference, extra computation is required for the cross-validation resampling. Second, the bootstrap procedures developed for the multiple-group neural discriminant model can be extended, without any additional theoretical derivation, to more complicated problems such as generalized additive models (GAM) [14, 15], support vector machines (SVM) [16–19], and vector generalized additive models (VGAM) [20].

The remainder of this paper is organized as follows. In Section 2 we focus on the selection of the optimum number of hidden units and the evaluation of the overall goodness-of-fit for the model with the optimum number of hidden units. A neural network can approximate any reasonable function with arbitrary precision as the number of hidden units tends to infinity [21]. However, if the number of hidden units is increased too far, the output of the network fits the training sample too closely and the noise is modeled in addition to the desired underlying function. Bootstrapping is also adapted to provide estimates of the bias of the excess error in a prediction rule constructed from training samples [22, 23]. Simulated data from known (true) models are used to demonstrate the approximate realization of continuous mappings by neural networks in Section 3. In Section 4 the methods are illustrated using a thyroid disease database in order to show that overfitting leads to poor generalization. Apartment house data for metropolitan-area stations with a four-class classification are also analyzed in order to assess the bootstrapping by comparison with leaving-one-out CV. Finally, in Section 5 we discuss the relative merits and limitations of the methods.

2. Materials and Methods

2.1. Multiple-Group Neural Discriminant Model
2.1.1. Statistical Inference

The functional representation of the neural network model is considered, as shown in Figure 1. The connection weight between the $i$th unit in the input layer ($i=0,\ldots,I$) and the $j$th unit in the hidden layer ($j=1,\ldots,H$) is $\alpha_{ij}$. Similarly, the weight between the $j$th unit in the hidden layer ($j=0,\ldots,H$) and the $k$th unit in the output layer ($k=1,\ldots,K$) is $\beta_{jk}$. The input to the $j$th hidden unit is a linear projection of the input vector $\mathbf{x}=(x_1,\ldots,x_I)$, that is,
$$u_j=\sum_{i=0}^{I}\alpha_{ij}x_i,\qquad x_0\equiv 1,\tag{1}$$
where $\alpha_{0j}$ is a bias. This is the same idea as incorporating the constant term in the design matrix of a regression by including a column of 1's [1]. The output of the $j$th hidden unit is
$$y_j=f(u_j)=f\!\left(\sum_{i=0}^{I}\alpha_{ij}x_i\right),\tag{2}$$
where $f(\cdot)$ is a nonlinear activation function. The most commonly used activation function is the logistic (sigmoid) function:
$$y_j=\frac{1}{1+\exp(-u_j)}.\tag{3}$$
The input to the $k$th output unit is
$$v_k=\sum_{j=0}^{H}\beta_{jk}y_j,\qquad y_0\equiv 1,\tag{4}$$
where $\beta_{0k}$ is a bias. The activation function of the network outputs for the mutually exclusive groups can be achieved using the softmax (normalized exponential) function:
$$o_k=\frac{\exp(v_k)}{\sum_{k'=1}^{K}\exp(v_{k'})}=\frac{1}{1+\exp(-V_k)},\tag{5}$$
$$V_k=v_k-\ln\!\left\{\sum_{k'\neq k}^{K}\exp(v_{k'})\right\},\tag{6}$$
which can be regarded as a multiclass generalization of the logistic function.

From (1)–(6), $o_k$ can be written in the form
$$o_k=\frac{\exp(v_k)}{\sum_{k'=1}^{K}\exp(v_{k'})}
=\frac{\exp\!\left(\sum_{j=0}^{H}\beta_{jk}y_j\right)}{\sum_{k'=1}^{K}\exp\!\left(\sum_{j=0}^{H}\beta_{jk'}y_j\right)},\qquad
y_j=\frac{1}{1+\exp\!\left(-\sum_{i=0}^{I}\alpha_{ij}x_i\right)}\ (j\geq 1),\ y_0\equiv 1.\tag{7}$$
The output $o_K$ for the $K$th group can be calculated as $o_K=1-\sum_{k=1}^{K-1}o_k$. For example, in the case of $K=3$, it follows that
$$o_1=\frac{e^{v_1}}{e^{v_1}+e^{v_2}+e^{v_3}},\qquad o_2=\frac{e^{v_2}}{e^{v_1}+e^{v_2}+e^{v_3}}.\tag{8}$$
From $o_1+o_2+o_3=1$, $o_3=1-(o_1+o_2)$. Thus the number of units in the output layer is $2\,(=K-1)$.

By setting the teacher value
$$t_k^{\langle d\rangle}=\begin{cases}1 & \text{if the }d\text{th input vector }\mathbf{x}^{\langle d\rangle}=\left(x_1^{\langle d\rangle},x_2^{\langle d\rangle},\ldots,x_I^{\langle d\rangle}\right)\text{ is from the }k\text{th group},\\ 0 & \text{otherwise},\end{cases}\tag{9}$$
the log likelihood function for the total sample size $D$ is
$$\ln L(\boldsymbol{\theta};\mathbf{x},\mathbf{t})=\sum_{d=1}^{D}\sum_{k=1}^{K}t_k^{\langle d\rangle}\ln o_k^{\langle d\rangle},\qquad 0\leq o_k^{\langle d\rangle}\leq 1,\quad \sum_{k=1}^{K}t_k^{\langle d\rangle}=1,\tag{10}$$
where $\mathbf{t}^{\langle d\rangle}=(t_1^{\langle d\rangle},t_2^{\langle d\rangle},\ldots,t_K^{\langle d\rangle})$ and $\mathbf{o}^{\langle d\rangle}=(o_1^{\langle d\rangle},o_2^{\langle d\rangle},\ldots,o_K^{\langle d\rangle})$ are the teacher and output vectors, respectively, for the $d$th observation, $\mathbf{t}=(\mathbf{t}^{\langle 1\rangle},\mathbf{t}^{\langle 2\rangle},\ldots,\mathbf{t}^{\langle D\rangle})$, $\mathbf{x}=(\mathbf{x}^{\langle 1\rangle},\mathbf{x}^{\langle 2\rangle},\ldots,\mathbf{x}^{\langle D\rangle})$, $\boldsymbol{\alpha}=\{\alpha_{ij}\}$, and $\boldsymbol{\beta}=\{\beta_{jk}\}$. As usual, the negative log likelihood gives the cross-entropy error function. The unknown parameters $\boldsymbol{\theta}=\{\boldsymbol{\alpha},\boldsymbol{\beta}\}$ can be estimated by maximizing the log likelihood ((10) with output (7)) using batch backpropagation with momentum, in which the initial values of the unknown parameters are chosen at random. The number of parameters included in the multiple-group neural discriminant model is $p=H(I+K-1)+H+K-1$.
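For concreteness, a minimal R sketch of the forward pass (1)–(7) and the log likelihood (10) is given below. The weight matrices `alpha` and `beta`, the data objects, and the dimensions are hypothetical placeholders rather than quantities estimated in this paper; for simplicity the sketch carries a full $K$-column output weight matrix, whereas the model above uses $K-1$ output units, so the parameter count $p$ is printed from the formula given in the text.

```r
## Minimal sketch of the forward pass (1)-(7) and log likelihood (10).
## `alpha` is (I+1) x H (row 1 = biases), `beta` is (H+1) x K (row 1 = biases).
forward <- function(X, alpha, beta) {
  X1 <- cbind(1, X)                      # prepend x_0 = 1, see (1)
  Y  <- 1 / (1 + exp(-(X1 %*% alpha)))   # hidden outputs, logistic activation (3)
  Y1 <- cbind(1, Y)                      # prepend y_0 = 1, see (4)
  V  <- Y1 %*% beta                      # inputs to the output units
  E  <- exp(V - apply(V, 1, max))        # numerically stabilized softmax (5)
  E / rowSums(E)                         # o_k, k = 1, ..., K
}

## Log likelihood (10): Teach is the D x K teacher (0/1) matrix, Out the output matrix.
loglik <- function(Teach, Out) sum(Teach * log(pmax(Out, 1e-12)))

## Hypothetical example with I = 2 inputs, H = 3 hidden units, K = 3 groups.
set.seed(1)
D <- 10; I <- 2; H <- 3; K <- 3
X     <- matrix(runif(D * I), D, I)
Teach <- t(rmultinom(D, 1, rep(1 / K, K)))     # random teacher vectors
alpha <- matrix(rnorm((I + 1) * H), I + 1, H)
beta  <- matrix(rnorm((H + 1) * K), H + 1, K)
Out   <- forward(X, alpha, beta)
loglik(Teach, Out)                             # cross-entropy error = -loglik
H * (I + K - 1) + H + K - 1                    # number of parameters p
```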

2.1.2. Determination of the Optimum Number of Hidden Units

The criterion based on bootstrapping is demonstrated to be favorable when selecting the optimum number of hidden units. In conventional statistics, various criteria have been developed for assessing generalization performance. AIC provides a decision as to which of several competing network architectures is best for a given problem. However, the use of AIC may not be justified theoretically when the neural network is regarded as an approximation to an underlying model [7, 24]. A bootstrap-type nonparametric resampling estimator of the Kullback-Leibler information (Ishiguro and Sakamoto [25]; Konishi and Kitagawa [26]; Ishiguro et al. [27]; Kullback and Leibler [28]; Shibata [29]; Shao [30]) can provide an alternative to AIC computed from a skewed discrete distribution.

Let the training samples $\mathbf{X}=\{\mathbf{X}_1,\mathbf{X}_2,\ldots,\mathbf{X}_d,\ldots,\mathbf{X}_D\}$, with $\mathbf{X}_d=\{\mathbf{x}^{\langle d\rangle},\mathbf{t}^{\langle d\rangle}\}$ and $\mathbf{x}^{\langle d\rangle}=\{x_1^{\langle d\rangle},x_2^{\langle d\rangle},\ldots,x_I^{\langle d\rangle}\}$ for $d=1,2,\ldots,D$, be independently distributed according to an unknown distribution $F$. Let $\hat F$ be the empirical distribution function that places a mass equal to $1/D$ at each point $\mathbf{X}_1,\mathbf{X}_2,\ldots,\mathbf{X}_d,\ldots,\mathbf{X}_D$. We propose the following bootstrap sampling algorithm.

Step 1. Generate $B$ samples $\mathbf{X}^*$, each of size $D$, drawn with replacement from the training sample $\mathbf{X}=\{\mathbf{X}_1,\mathbf{X}_2,\ldots,\mathbf{X}_d,\ldots,\mathbf{X}_D\}$. Denote the $b$th sample as $\mathbf{X}^*_b$, $b=1,2,\ldots,B$.

Step 2. For each bootstrap sample $\mathbf{X}^*_b$, $b=1,2,\ldots,B$, fit a model to obtain the estimator $\hat{\boldsymbol{\theta}}(\mathbf{X}^*_b)$.

Step 3. The bootstrap estimator of the bias $C^*$ is given as
$$C^*\cong\frac{1}{B}\sum_{b=1}^{B}\left[\ln L\left\{\mathbf{X}^*_b;\hat{\boldsymbol{\theta}}(\mathbf{X}^*_b)\right\}-\ln L\left\{\mathbf{X};\hat{\boldsymbol{\theta}}(\mathbf{X}^*_b)\right\}\right],\tag{11}$$
where $C^*$ is the average of the differences between the log likelihood on the bootstrap sample, $\ln L\{\mathbf{X}^*_b;\hat{\boldsymbol{\theta}}(\mathbf{X}^*_b)\}$, and that on the training sample, $\ln L\{\mathbf{X};\hat{\boldsymbol{\theta}}(\mathbf{X}^*_b)\}$, given $\hat{\boldsymbol{\theta}}(\mathbf{X}^*_b)$.
Thus, the Extended Information Criterion (EIC) proposed by Ishiguro et al. [27] is defined as
$$\mathrm{EIC}=-2\ln L\left\{\mathbf{X};\hat{\boldsymbol{\theta}}(\mathbf{X})\right\}+2C^*.\tag{12}$$
The EIC approach selects the number of hidden units with the minimum value of (12); Shibata [29] and Shao [30] point out that this method is asymptotically equivalent to leaving-one-out CV and AIC.
Note that the bootstrap algorithm requires refitting the model (retraining the network) $B$ times [31]. The number of replications $B$ typically lies in the range $20\leq B\leq 200$, and $B=200$ bootstrap replications are used here. The competing networks share the same architecture, the only exception being the number of hidden units.
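A minimal R sketch of Steps 1–3 and the EIC in (12) follows; `nnet::nnet` with `softmax = TRUE` is used here only as a convenient stand-in for the batch backpropagation fit of Section 2.1.1, and the objects `X` ($D\times I$ predictor matrix), `Teach` ($D\times K$ teacher matrix), and the candidate range of hidden units are hypothetical.

```r
## Sketch of the bootstrap bias C* in (11) and EIC in (12), assuming `X` and `Teach`.
library(nnet)
loglik <- function(Teach, Out) sum(Teach * log(pmax(Out, 1e-12)))

eic <- function(X, Teach, H, B = 200) {
  D    <- nrow(X)
  fit0 <- nnet(X, Teach, size = H, softmax = TRUE, maxit = 500, trace = FALSE)
  ll0  <- loglik(Teach, predict(fit0, X))            # ln L{X; theta^(X)}
  Cstar <- mean(replicate(B, {
    idx  <- sample(D, D, replace = TRUE)             # Step 1: bootstrap sample X*_b
    fitb <- nnet(X[idx, ], Teach[idx, ], size = H,   # Step 2: refit on X*_b
                 softmax = TRUE, maxit = 500, trace = FALSE)
    loglik(Teach[idx, ], predict(fitb, X[idx, ])) -  # Step 3: ln L{X*_b; theta^(X*_b)}
      loglik(Teach, predict(fitb, X))                #   minus ln L{X; theta^(X*_b)}
  }))
  -2 * ll0 + 2 * Cstar                               # EIC (12)
}

## Select the number of hidden units minimizing EIC over hypothetical candidates 1..5:
## which.min(sapply(1:5, function(h) eic(X, Teach, H = h)))
```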

2.1.3. Bootstrapping the Deviance

No standard procedure by which to assess the overall goodness-of-fit of the multiple-group neural discriminant model has been proposed. By introducing the maximum likelihood principle, the deviance allows us to test the overall goodness-of-fit of the model:
$$\mathrm{Dev}=2\left[\ln L_f-\ln L\left\{\mathbf{X};\hat{\boldsymbol{\theta}}\right\}\right]=2\sum_{d=1}^{D}\sum_{k=1}^{K}\left\{t_k^{\langle d\rangle}\ln\!\left(\frac{t_k^{\langle d\rangle}}{\hat o_k^{\langle d\rangle}}\right)\right\},\tag{13}$$
where $\ln L\{\mathbf{X};\hat{\boldsymbol{\theta}}\}$ denotes the maximized log likelihood under the current neural discriminant model. Since the log likelihood for the full model, $\ln L_f=\sum_{d=1}^{D}\sum_{k=1}^{K}\left\{t_k^{\langle d\rangle}\ln t_k^{\langle d\rangle}\right\}$, is zero by the convention $0\ln 0=0$, we have
$$\mathrm{Dev}=-2\sum_{d=1}^{D}\sum_{k=1}^{K}\left\{t_k^{\langle d\rangle}\ln \hat o_k^{\langle d\rangle}\right\}.\tag{14}$$
Note that the deviance is minus two times the log likelihood in (10). The greater the deviance, the poorer the fit of the model. However, the deviance given in (14) is not even approximately distributed as $\chi^2$ when binary (Bernoulli) responses are available [32–35]. We therefore provide the bootstrap estimator of the percentile (i.e., the critical point) of the deviance given in (14) according to the following algorithm.

Step 1. Generate $B$ (= 200) bootstrap samples $\mathbf{X}^*$ drawn with replacement from the training sample $\mathbf{X}$, using the optimum number of hidden units determined as in Section 2.1.2.

Step 2. For each bootstrap sample $\mathbf{X}^*_b$, $b=1,2,\ldots,B$, the deviance given in (14) is computed as
$$\mathrm{Dev}(b)=-2\ln L\left\{\mathbf{X}^*_b;\hat{\boldsymbol{\theta}}(\mathbf{X}^*_b)\right\}.\tag{15}$$
This process is independently repeated $B$ times, and the computed values are arranged in ascending order.

Step 3. The value of the $j$th order statistic $\mathrm{Dev}^*$ of the $B$ replications can be taken as an estimator of the quantile of order $j/(B+1)$.

Step 4. The estimator of the $100(1-\alpha)$th percentile (i.e., the $100\alpha\%$ critical point) of $\mathrm{Dev}^*$ is used to test the goodness-of-fit of the model at a specified significance level $\alpha=1-j/(B+1)$. If the value of the deviance given in (14) is greater than the estimate of the percentile, then the model fits poorly.
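The following R sketch implements Steps 1–4 under the same hypothetical setting as above (objects `X`, `Teach`, and a selected number of hidden units `H`, with `nnet::nnet` as a stand-in for the fitting procedure).

```r
## Sketch of the bootstrap percentile of the deviance (14)-(15).
library(nnet)
dev14 <- function(Teach, Out) -2 * sum(Teach * log(pmax(Out, 1e-12)))

D <- nrow(X); B <- 200
dev.boot <- sort(replicate(B, {                         # Steps 1-2
  idx  <- sample(D, D, replace = TRUE)
  fitb <- nnet(X[idx, ], Teach[idx, ], size = H,
               softmax = TRUE, maxit = 500, trace = FALSE)
  dev14(Teach[idx, ], predict(fitb, X[idx, ]))          # Dev(b) in (15)
}))
crit <- quantile(dev.boot, 0.95, type = 1)              # Steps 3-4: 5% critical point

fit0 <- nnet(X, Teach, size = H, softmax = TRUE, maxit = 500, trace = FALSE)
Dev  <- dev14(Teach, predict(fit0, X))                  # deviance (14) on the training sample
Dev > crit                                              # TRUE would indicate a poor fit
```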

2.1.4. Excess Error Estimation

Let the error rate $e(F;\mathbf{X})$ be the probability of incorrectly predicting the outcome of a new observation drawn from an unknown distribution $F$, given a prediction rule constructed on a training sample $\mathbf{X}$. This error rate is defined as the actual error rate, which is of interest in the performance assessment of prediction rules. Let $\hat F$ be the empirical distribution function that places a mass equal to $1/D$ at each point $\mathbf{X}_1,\mathbf{X}_2,\ldots,\mathbf{X}_d,\ldots,\mathbf{X}_D$. We apply a prediction rule $\eta$ to this training sample $\mathbf{X}$ and form the realized prediction rule $\eta_{\hat F}(\mathbf{x}^{\langle 0\rangle})$ for a new observation $\mathbf{X}_0=\{\mathbf{x}^{\langle 0\rangle},\mathbf{t}^{\langle 0\rangle}\}$. Let $Q\left(\mathbf{t}^{\langle 0\rangle},\eta_{\hat F}(\mathbf{x}^{\langle 0\rangle})\right)$ indicate the discrepancy between an observed value $\mathbf{t}^{\langle 0\rangle}$ and its predicted value $\eta_{\hat F}(\mathbf{x}^{\langle 0\rangle})$. Let the error rate $e(\hat F;\mathbf{X})$, referred to as the apparent error rate, be the probability of incorrectly predicting the outcome for a sample drawn from the empirical distribution $\hat F$ of the training sample. Because the training sample is used both for forming and for assessing the prediction rule, this proportion (i.e., the apparent error rate) underestimates the actual error rate. The difference $e(F;\mathbf{X})-e(\hat F;\mathbf{X})$ is the excess error. The expected excess error (i.e., bias) of a given prediction rule [22, 23, 36, 37] is
$$b(F)=E\left[e(F;\mathbf{X})-e(\hat F;\mathbf{X})\right].\tag{16}$$

When the prediction rule given by the multiple-group neural discriminant model is allowed to be complicated, overfitting becomes a real danger, and excess error estimation becomes important. We therefore use bootstrapping to estimate the expected excess error when fitting a multiple-group neural discriminant model to the data. The algorithm can be summarized as follows.

Step 1. Generate bootstrap samples $\mathbf{X}^*$ from $\hat F$ as described in Section 2.1.2. Let $\hat F^*$ be the empirical distribution of $\mathbf{X}^*_1,\mathbf{X}^*_2,\ldots,\mathbf{X}^*_d,\ldots,\mathbf{X}^*_D$.

Step 2. For each bootstrap sample $\mathbf{X}^*$, fit a model to obtain the estimator $\hat{\boldsymbol{\theta}}(\mathbf{X}^*)$ and construct the realized prediction rule $\eta_{\hat F^*}$ based on $\mathbf{X}^*$.

Step 3. The bootstrap estimator of the expected excess error in (16) is given by
$$R^*=\frac{1}{D}\sum_{d=1}^{D}Q\left(\mathbf{t}^{\langle d\rangle *},\eta_{\hat F^*}\left(\mathbf{x}^{\langle d\rangle *}\right)\right)-\frac{1}{D}\sum_{d=1}^{D}Q\left(\mathbf{t}^{\langle d\rangle},\eta_{\hat F^*}\left(\mathbf{x}^{\langle d\rangle}\right)\right),\tag{17}$$
where
$$Q\left(\mathbf{t}^{\langle d\rangle *},\eta_{\hat F^*}\left(\mathbf{x}^{\langle d\rangle *}\right)\right)=\begin{cases}1 & \text{incorrect discrimination},\\ 0 & \text{otherwise}.\end{cases}\tag{18}$$

Step 4. Repeat Steps 1–3 for the bootstrap samples $\mathbf{X}^*_b$, $b=1,2,\ldots,B$ (= 200), to obtain $R^*_b$. The bootstrap estimator of the expected excess error can be obtained as
$$\hat b(\hat F)\cong\frac{1}{B}\sum_{b=1}^{B}R^*_b.\tag{19}$$

Step 5. The actual error rate with bootstrap bias correction is
$$e_{\mathrm{boot}}=e(\hat F;\mathbf{X})-\hat b(\hat F).\tag{20}$$
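A corresponding R sketch of Steps 1–5, again with hypothetical objects `X`, `Teach`, and `H` and with `nnet::nnet` standing in for the fitting procedure, is as follows.

```r
## Sketch of the excess-error bootstrap (17)-(20).
library(nnet)
err <- function(Teach, Out) mean(max.col(Out) != max.col(Teach))  # mean of Q(.) in (18)

D <- nrow(X); B <- 200
fit0    <- nnet(X, Teach, size = H, softmax = TRUE, maxit = 500, trace = FALSE)
app.err <- err(Teach, predict(fit0, X))             # apparent error rate e(F^; X)

Rstar <- replicate(B, {
  idx  <- sample(D, D, replace = TRUE)              # Step 1: X* drawn from F^
  fitb <- nnet(X[idx, ], Teach[idx, ], size = H,    # Step 2: prediction rule on X*
               softmax = TRUE, maxit = 500, trace = FALSE)
  err(Teach[idx, ], predict(fitb, X[idx, ])) -      # Step 3: R* in (17)
    err(Teach, predict(fitb, X))
})
bias   <- mean(Rstar)                               # Step 4: b^(F^) in (19)
e.boot <- app.err - bias                            # Step 5: bias-corrected rate (20)
```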

3. Simulation Study

Since the model generally does not encompass the unknown functions, but only approximations thereof, the model is inherently misspecified. We therefore present results from some Monte Carlo simulations to evaluate the performance. The criterion based on bootstrapping is demonstrated to be favorable when selecting the optimum number of hidden units [38–41]. Vach et al. [41] investigated how well neural networks can approximate specific regression functions from the class
$$f(\mathbf{x})=\beta_0+\sum_{i=1}^{I}\beta_i x_i+\sum_{i=1}^{I}\gamma_i x_i^2+\sum_{i<i'}\delta_{ii'}x_i x_{i'},\tag{21}$$
and pointed out that a comparison restricted to members of this class is somewhat unfair. We therefore demonstrate the superiority of the neural network model by using a function with several local extrema. The influence can be illustrated through a simple simulation using a neural network model with two inputs $x_1$ and $x_2$, because we can then visualize the contour plot of the unknown population.

3.1. Two-Class Classification

The influence can be illustrated through a simple simulation using a neural network model with two inputs, one output, and a varying number of hidden units. For two independent continuous covariates $x_1$ and $x_2$, we simulated the following known (true) model:
$$f(x_1,x_2)=\frac{1}{1+\exp\left[-\sin(2\pi x_1)-x_1 x_2-\sin(2\pi x_2)\right]}.\tag{22}$$

Training and test samples of size 1000 were considered in the present study. The input data $(x_1,x_2)$ are drawn from a uniform distribution over $[0,1]\times[0,1]$, and the binary response $y$ is labeled 1 if $f(x_1,x_2)>1/2$ and 0 otherwise. Figure 2 shows the distribution of the covariates $(x_1,x_2)$ and the class membership indicator in the training sample.
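A short R sketch of this data-generating design, with a hypothetical random seed, is given below.

```r
## Sketch of the two-class simulation design based on (22); sample size 1000.
set.seed(123)                                      # hypothetical seed
n  <- 1000
x1 <- runif(n); x2 <- runif(n)                     # uniform over [0,1] x [0,1]
f  <- 1 / (1 + exp(-sin(2 * pi * x1) - x1 * x2 - sin(2 * pi * x2)))
y  <- as.integer(f > 0.5)                          # label 1 if f(x1, x2) > 1/2, else 0
train <- data.frame(x1, x2, y)
```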

EIC values with 𝐡=200 replications based on bootstrapping pairs for the training sample are shown in Figure 3 after fitting the neural discriminant models having one to five hidden units. For the purpose of comparison, the values of AIC and BIC are also provided. In the case of the simulation study, the known (true) model given in (22) is included in the population. Thus, the differences between EIC and AIC values are slight.

Using the simulated training sample, the feed-forward neural networks were fit to the known (true) model given in (22). The tendency of the mappings produced by neural networks with $h=1,2,3$ hidden units to fit the function given in (22) inadequately can also be illustrated.

The bootstrap estimate of the 95th percentile $\mathrm{Dev}^*$ (i.e., the 5% critical point) for the training sample with four hidden units is $\mathrm{Dev}^*=203.10$. Comparison with the deviance given in (14) ($\mathrm{Dev}=39.40$) suggests that the multiple-group neural discriminant model fits the data fairly well, because $\mathrm{Dev}=39.40$ is far below the 5% critical point $\mathrm{Dev}^*=203.10$.

The actual error rate with the bootstrap bias correction given in (20) for the multiple-group neural discriminant model with four hidden units is calculated as $e_{\mathrm{boot}}=0.009$. Figure 4 illustrates the apparent error rates observed in the training and test samples. The apparent error rates for both samples decreased as the number of hidden units increased from $h=1$ to $h=4$ and then remained constant.

Figures 3 and 4 are based on only one simulated data set. However, the efficacy of the bootstrap procedures would be more convincingly illustrated in a simulation study based on multiple samples. Figure 5 shows the average values of EIC, AIC, and BIC based on multiple samples with 100 replications after fitting the neural discriminant models having one to five hidden units. Figure 6 shows the box-and-whisker plots for EIC in order to evaluate the standard errors and other statistics. Figure 7 illustrates the mean apparent error rates observed in multiple test samples with 100 replicates. Figure 8 also shows the box-and-whisker plots for the mean apparent error rates in multiple test samples with 100 replicates. For the purpose of comparison, the estimates of the actual error rates with bootstrap bias correction for the training sample [42] are also shown in Figure 7. It is concluded that EIC identifies the optimal number of hidden units (i.e., 4) more often than AIC. In addition, the differences between the average values of EIC and AIC are somewhat similar to Figure 3, and the average values of the bootstrap-corrected estimate of the prediction error rate vary around the average apparent error rates for the multiple test samples.

3.2. Multiclass Classification

Input data $(x_1,x_2)$ are generated from a uniform distribution over $\mathbf{x}=(x_1,x_2)\in[-1,1]\times[-1,1]$. By substituting $\mathbf{x}=(x_1,x_2)$ into
$$\Pr\nolimits_k=\frac{\exp\left\{f_k(x_1,x_2)\right\}}{1+\sum_{k=1}^{K-1}\exp\left\{f_k(x_1,x_2)\right\}},\qquad k=1,2,\ldots,K-1,\tag{23}$$
$\mathbf{x}$ can be grouped into four classes in the case of $K=4$. As the nonlinear functions $f_k(x_1,x_2)$,
$$f_1(x_1,x_2)=5\sin(2\pi x_1)-3\sin(2\pi x_2),\quad
f_2(x_1,x_2)=\sin(2\pi x_1)+2\cos(2\pi x_2),\quad
f_3(x_1,x_2)=2-8(x_1-0.5)^2-8(x_2-0.5)^2\tag{24}$$
can be used, with $\sum_{k=1}^{K}\Pr_k=1$ [41, 43].

Using the definition of $t_k$ in (9), the observations can be divided into four classes by a multinomial random number:
$$\mathbf{t}\sim\mathrm{Multinom}(1,\mathbf{Pr}),\qquad
\begin{cases}\text{Class 1}: & \mathbf{t}=(1,0,0,0),\\
\text{Class 2}: & \mathbf{t}=(0,1,0,0),\\
\text{Class 3}: & \mathbf{t}=(0,0,1,0),\\
\text{Class 4}: & \mathbf{t}=(0,0,0,1),\end{cases}\tag{25}$$
with $\mathbf{t}=(t_1,t_2,t_3,t_4)$ and $\mathbf{Pr}=(\Pr_1,\Pr_2,\Pr_3,\Pr_4)$.
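A short R sketch of this four-class design, with a hypothetical seed, is given below.

```r
## Sketch of the four-class simulation design (23)-(25).
set.seed(123)                                      # hypothetical seed
n  <- 400
x1 <- runif(n, -1, 1); x2 <- runif(n, -1, 1)       # uniform over [-1,1] x [-1,1]
f1 <- 5 * sin(2 * pi * x1) - 3 * sin(2 * pi * x2)
f2 <- sin(2 * pi * x1) + 2 * cos(2 * pi * x2)
f3 <- 2 - 8 * (x1 - 0.5)^2 - 8 * (x2 - 0.5)^2
den <- 1 + exp(f1) + exp(f2) + exp(f3)
Pr  <- cbind(exp(f1), exp(f2), exp(f3), 1) / den   # Pr_k in (23), k = 1, ..., 4
Teach <- t(apply(Pr, 1, function(p) rmultinom(1, 1, p)))  # multinomial draw as in (25)
```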

In this paper, training and test samples of size 400 were considered. The apparent error rates for the training and test samples of several models are given in Table 1. From Table 1, it is found that the apparent error rates of the training and test samples for the multiple-group neural discriminant model are the smallest.

4. Results and Discussion

Prediction accuracy (error rate) is the most important consideration in the development of a prediction model. The assessment of goodness-of-fit is a useful exercise. In particular, the goodness-of-fit and the error rate obtained from the training data must be interpreted in light of the overfitting issue. The main purpose is to predict future samples accurately; in other words, in real applications, the test sample population may differ from the training samples. A benchmark data set is therefore used to illustrate the advantages of the models and methods developed herein. A multiple-group neural discriminant model having a single hidden layer was applied to a data set of 3,772 training instances and 3,428 testing instances from a thyroid disease database. All of these data sets are available on the World Wide Web at http://archive.ics.uci.edu/ml/datasets/Thyroid+Disease. The present study considered three groups: hypothyroid, hyperthyroid, and normal. The laboratory profiles on which the differential diagnosis is made consist of 21 attributes (15 attributes are binary, and six attributes are continuous):

π‘₯1: age, π‘₯2: sex, π‘₯3: on thyroxine, π‘₯4: query thyroxine, π‘₯5: on antithyroid, π‘₯6: sick, π‘₯7: pregnant, π‘₯8: thyroid surgery, π‘₯9: I131 treatment, π‘₯10: query hypothyroid, π‘₯11: query hyperthyroid, π‘₯12: lithium, π‘₯13: goitre, π‘₯14: tumour, π‘₯15: hypopituitary, π‘₯16: psych, π‘₯17: TSH, π‘₯18: T3, π‘₯19: TT4, π‘₯20: T4U, π‘₯21: FTI.

Table 2 lists the first five observations of the 21 attributes and the group for the training sample. The training sample is used to determine the structure of the neural network model. Table 3 lists the first five observations of the 21 attributes and the group for the test sample. The goal of discrimination is to assign new observations to one of the mutually exclusive groups. The data in Tables 2 and 3 include six continuous and 15 binary attributes. Fisher's discriminant model assumes that the inputs are normally distributed. It is worth noting, however, that the posterior class probabilities for the neural discriminant model can be obtained by maximizing the log likelihood in (10) without the normality assumption on the inputs.

A thyroid disease database has been used as a benchmark test for the neural network model shown in Figure 1 with $I=21$ and $K=3$. EIC values are shown in Figure 9 after fitting the multiple-group neural discriminant models having one to four hidden units. In this case, the true model is not included in the population. For the purpose of comparison, AIC and BIC values are also provided.

Figure 9 indicates that the minimum EIC value is obtained for the model having two hidden units, which has an apparent error rate $e(\hat F;\mathbf{X})$ of 0.0090. Figure 10 shows a histogram of the bootstrap replications $R^*_b$ used to estimate the expected excess error. The mean and standard deviation of $R^*_b$ are $-0.0033$ and $0.0022$, respectively. The actual error rate with the bootstrap bias correction given in (20) for the multiple-group neural discriminant model with two hidden units is calculated as $e_{\mathrm{boot}}=0.012$.

The histogram of the bootstrapped $\mathrm{Dev}(b)$ for $B=200$ is provided in Figure 11. The bootstrap estimate of the 95th percentile $\mathrm{Dev}^*$ (i.e., the 5% critical point) for the thyroid disease training sample with two hidden units is $\mathrm{Dev}^*=344.41$. Comparison with the deviance given in (14) ($\mathrm{Dev}=204.03$) suggests that the multiple-group neural discriminant model fits the data fairly well. For reference, the Q-Q plot of the bootstrapped $\mathrm{Dev}(b)$ for $B=200$ is shown in Figure 12.

Alternatively, if the deviance in (14) asymptotically followed the $\chi^2$ distribution with $D-p=3772$ degrees of freedom under the null hypothesis that the model is correct, its reference distribution would be the probability density function of the $\chi^2$ distribution with 3772 degrees of freedom shown in Figure 13. However, because of the large sample size $D=3772$, the distribution is extremely skewed. By comparing Figure 13 with Figure 11, it is found that the distribution of the deviance in (14) cannot be approximated by the $\chi^2$ distribution. Furthermore, the mean and variance of the bootstrapped $\mathrm{Dev}(b)$ are $E[\mathrm{Dev}(b)]=205.55$ and $\mathrm{Var}[\mathrm{Dev}(b)]=4436.90$, respectively, which are not close to those of the $\chi^2$ distribution with 3772 d.f., that is, $E[\chi^2]=\mathrm{d.f.}=3772$ and $\mathrm{Var}[\chi^2]=2\times\mathrm{d.f.}=7544$. It should be noted that the deviance does asymptotically follow a $\chi^2$ distribution for grouped binary (i.e., binomial) responses and a set of predictor variables, as described in Tsujitani and Aoki [44].
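For reference, the $\chi^2$ quantities quoted above can be reproduced with a few lines of R; the snippet merely restates the comparison already made in the text.

```r
## Chi-square reference values with 3772 degrees of freedom.
df <- 3772
c(mean = df, variance = 2 * df)    # E[chi^2] = 3772, Var[chi^2] = 7544
qchisq(0.95, df)                   # 95th percentile of chi^2 with 3772 d.f.
## Contrast with the bootstrap summaries E[Dev(b)] = 205.55, Var[Dev(b)] = 4436.90.
```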

The apparent error rates after fitting the multiple-group neural discriminant models having one to four hidden units are shown in Figure 14. Figure 14 indicates that (i) the multilayer feedforward neural network can approximate virtually any function up to some desired level of approximation with the number of hidden units increased ad libitum for the training sample, (ii) the actual error rate for the test sample is the smallest when the number of hidden units is two, and (iii) a neural network with a large number of hidden units has a higher error rate for the test sample, because the noise is modeled in addition to the underlying function.

Although the model can be made to fit the training sample arbitrarily well by increasing the number of hidden units, such a model does not generalize well to the test sample, which is the real goal. The apparent error rates for the training and test samples of several models are given in Table 4: (i) the multigroup logistic discriminant model with linear effects [6, 45] by use of library{VGAM} in the free software R [15], (ii) the multiple-group logistic discriminant model with linear + quadratic effects, (iii) the tree-based model with mincut = 5, minsize = 10, mindev = 0.01 as tuning parameters [46] by use of library{rpart} in R, (iv) the nearest neighbor smoother using a nonparametric method to derive the classification criterion [6, 47] by use of library{knn} in R, (v) the kernel smoother [47] using a normal kernel with radius $r=1.1$ to specify the kernel density by use of library{ks} in R, (vi) the support vector machine using the "one-against-one" approach [48, 49] by use of library{e1071} in R, (vii) the proportional odds model [14], and (viii) VGAM based on the proportional odds model with optimum smoothing parameters selected by leaving-one-out cross-validation [20] by use of library{VGAM} in R.
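To indicate how such comparison models can be fit in R, a hedged sketch for two of them is given below; `nnet::multinom` is used here merely as a convenient stand-in for the multigroup logistic discriminant fit (the paper itself uses library{VGAM}), the data frames `train` and `test` with factor response `group` are hypothetical, and the rpart control settings are examples rather than the tuning parameters listed above.

```r
## Hedged sketch: apparent error rates for two comparison models,
## assuming data frames `train` and `test` with factor response `group`.
library(nnet); library(rpart)

fit.log <- multinom(group ~ ., data = train, trace = FALSE)   # multigroup logistic
mean(predict(fit.log, newdata = train) != train$group)        # apparent error (training)
mean(predict(fit.log, newdata = test)  != test$group)         # apparent error (test)

fit.tree <- rpart(group ~ ., data = train, method = "class",  # classification tree
                  control = rpart.control(minsplit = 10, cp = 0.01))
mean(predict(fit.tree, newdata = test, type = "class") != test$group)
```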

From Table 4, it is found that the multiple-group neural discriminant model ($h=2$) has the smallest error rate for the test sample while preserving a relatively small error rate for the training sample. In order to overcome the stringent assumption of additive and purely linear effects of the covariates, multiple-group logistic discriminant models with linear and quadratic effects were included. The improvement obtained by including the quadratic effects is slight. It should be noted that the apparent error rate of VGAM for the training sample is the smallest, but that for the test sample is large. This overfitting leads to poor generalization. For example, the estimated smooth function of the covariate "age" for VGAM in Figure 15 shows the overfitting.

Table 5 contains apartment house data for the assessment of land value at metropolitan-area stations, with a four-class classification [50]. Using four covariates (average price of houses built for sale, average house rent, yield, and an assessment of station value based on the number of passengers getting on and off), the assessment of land value at the metropolitan-area stations may be grouped into four categories: (i) most comfortable, (ii) very comfortable, (iii) a little comfortable, (iv) not comfortable.

Figure 16 shows the values of EIC, AIC, and leaving-one-out CV (see the Appendix). The leaving-one-out CV is included in order to assess the bootstrapping. The minimum EIC and leaving-one-out CV values are obtained for the model having two hidden units, whereas the number of hidden units with the minimum AIC value is three. The actual error rates obtained using EIC and leaving-one-out CV with two hidden units are 0.276 and 0.273, respectively. The bootstrapping is thus assessed from the viewpoint of leaving-one-out CV. The apparent error rates for the training samples of several models are given in Table 6. From Table 6, it is found that the multiple-group neural discriminant model ($h=2$) has the smallest error rate.

5. Conclusions

We discussed the learning algorithm by maximizing the log likelihood function. Statistical inference based on the likelihood approach for the multiple-group neural discriminant model was discussed, and a method for estimating bias on the expected log likelihood in order to determine the optimum number of hidden units was suggested. The key idea behind bootstrapping is to focus on the optimum tradeoff between the unbiased approximation of the underlying model and the loss in accuracy caused by increasing the number of hidden units. In the context of applying bootstrap methods to a multiple-group neural discriminant model, this paper considered three methods and performed experiments using two data sets to evaluate the methods. The three methods are bootstrap pairs sampling algorithm, goodness-of-fit statistical test, and excess error estimation algorithm.

There are two broad limitations to our approach. First, the batch backpropagation algorithm with momentum cannot guarantee that the maximum likelihood estimates avoid being trapped in a local, rather than global, optimum. So far, our discussion of neural networks has focused on maximum likelihood to determine the network parameters (weights and biases). However, a Bayesian neural network approach [51] might provide a more formal framework in which to incorporate a prior distribution over the parameters. Second, our neural network models assumed the independence of the predictor variables $\mathbf{x}=(x_1,\ldots,x_I)$. More generally, it may be preferable to visualize interactions between predictor variables. Smoothing spline ANOVA models can provide an excellent means of modeling data of mutually exclusive groups and a set of predictor variables [43, 52]. We expect that flexible methods for discriminant models based on machine learning theory [47, 53–55], such as penalized smoothing splines and support vector machines [17–19], will be very useful in these real-world contexts.

Appendix

Leaving-One-Out CV

An alternative model selection strategy to the bootstrap bias correction of the log likelihood in Section 2.1.2 is leaving-one-out CV for the multiple-group neural discriminant model, which is asymptotically equivalent to TIC [29]. Let the training sample $\mathbf{X}=\{\mathbf{X}_1,\mathbf{X}_2,\ldots,\mathbf{X}_d,\ldots,\mathbf{X}_D\}$ be independently distributed according to an unknown distribution. We then obtain the following leaving-one-out CV algorithm.

Step 1. Generate the training samples $\mathbf{X}_{[d]}=\{\mathbf{X}_1,\mathbf{X}_2,\ldots,\mathbf{X}_{d-1},\mathbf{X}_{d+1},\ldots,\mathbf{X}_D\}$, $d=1,2,\ldots,D$. The subscript $[d]$ of a quantity indicates the deletion of the $d$th data point $\mathbf{X}_d$ from the training sample $\mathbf{X}$.

Step 2. Using each training sample $\mathbf{X}_{[d]}$, fit a model. Then, estimate the unknown parameters, denoted by $\hat{\boldsymbol{\theta}}(\mathbf{X}_{[d]})$, and predict the output $o_{k[d]}^{\langle d\rangle}$ for the deleted data point $\mathbf{X}_d$.

Step 3. The average predictive log likelihood of the deleted data points is
$$\frac{1}{D}\ln\left[\prod_{d=1}^{D}\prod_{k=1}^{K}\left\{o_{k[d]}^{\langle d\rangle}\right\}^{t_k^{\langle d\rangle}}\right].\tag{A.1}$$
As a matter of convention, the cross-validation criterion is often stated as that of minimizing
$$\mathrm{CV}=-2\ln\left[\prod_{d=1}^{D}\prod_{k=1}^{K}\left\{o_{k[d]}^{\langle d\rangle}\right\}^{t_k^{\langle d\rangle}}\right].\tag{A.2}$$
The leaving-one-out CV criterion finds an appropriate degree of complexity by comparing the predictive probability density $\prod_{k=1}^{K}\left\{o_{k[d]}^{\langle d\rangle}\right\}^{t_k^{\langle d\rangle}}$ for different model specifications. Anders and Korn [24] have shown that the CV criterion does not rely on any probabilistic assumption based on the properties of maximum likelihood estimators for misspecified models and is not affected by identification problems.
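A minimal R sketch of the leaving-one-out CV criterion (A.2), under the same hypothetical setting as in Section 2 (`X`, `Teach`, `H`, with `nnet::nnet` standing in for the fitting procedure), is given below.

```r
## Sketch of leaving-one-out CV (A.1)-(A.2).
library(nnet)

cv.loo <- function(H, X, Teach) {
  D <- nrow(X)
  ll <- sapply(1:D, function(d) {
    fit.d <- nnet(X[-d, ], Teach[-d, ], size = H,   # Steps 1-2: fit on X_[d]
                  softmax = TRUE, maxit = 500, trace = FALSE)
    out.d <- predict(fit.d, X[d, , drop = FALSE])   # predict the deleted point
    sum(Teach[d, ] * log(pmax(out.d, 1e-12)))       # log predictive likelihood
  })
  -2 * sum(ll)                                      # CV in (A.2)
}

## Compare candidate numbers of hidden units, e.g.
## sapply(1:4, cv.loo, X = X, Teach = Teach)
```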