Robust Wild Bootstrap for Stabilizing the Variance of Parameter Estimates in Heteroscedastic Regression Models in the Presence of Outliers

Rana, Sohel; Midi, Habshah; Imon, A. H. M. R.

doi:https://doi.org/10.1155/2012/730328

Mathematical Problems in Engineering

Research Article | Open Access

Volume 2012 | Article ID 730328 | https://doi.org/10.1155/2012/730328

Sohel Rana,^1,2 Habshah Midi,^1,2 and A. H. M. R. Imon^3

Academic Editor: Ben T. Nohara

Received 31 Jul 2011

Revised 31 Oct 2011

Accepted 02 Nov 2011

Published 26 Feb 2012

Abstract

Bootstrap techniques are now used for data analysis in many fields, such as engineering, physics, meteorology, medicine, biology, and chemistry. In this paper, the robustness of the wild bootstrap techniques of Wu (1986) and Liu (1988) is examined. The empirical evidence indicates that these techniques yield efficient estimates in the presence of heteroscedasticity. However, in the presence of outliers, these estimates are no longer efficient. To remedy this problem, we propose a Robust Wild Bootstrap for stabilizing the variance of the regression estimates when heteroscedasticity and outliers occur at the same time. The proposed method is based on weighted residuals which incorporate the MM estimator, robust location and scale, and the bootstrap sampling scheme of Wu (1986) and Liu (1988). The results of this study show that the proposed method outperforms the existing ones in every respect.

1. Introduction

The bootstrap technique was first proposed by Efron [1]. It is a computer-intensive method that replaces theoretical derivation with extensive computation. The attractive feature of the bootstrap is that it does not rely on normality or any other distributional assumption and can estimate the standard error of any complicated estimator without theoretical calculations. These properties have to be traded off against computational cost and time. A considerable number of papers deal with bootstrap methods (see [2–5]). The classical bootstrap methods are known to be a good general procedure for estimating a sampling distribution under independent and identically distributed (i.i.d.) models. Let us consider a standard linear regression model: $Y = X\beta + \varepsilon$. In this equation $\beta$ is a $p \times 1$ vector of unknown parameters, $Y$ is an $n \times 1$ vector, $X$ is an $n \times p$ data matrix of full rank $p$, and $\varepsilon$ is an $n \times 1$ vector of unobservable random errors with $E(\varepsilon) = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2 I_n$. In practice the i.i.d. set-up is often violated; for example, the homoscedasticity assumption $\operatorname{Var}(\varepsilon_i) = \sigma^2$ often fails. Wu [6] proposed a weighted bootstrap technique which gives better performance under both homoscedastic and heteroscedastic models. A better alternative approximation was developed by Liu [7], following the suggestions of Wu [6] and Beran [8]. This type of weighted bootstrap is called the wild bootstrap in the literature. Several attempts have been made to use the Wu and Liu wild bootstrap techniques to remedy the problem of heteroscedasticity (see [6, 7, 9, 10]).

Salibian-Barrera and Zamar [11] pointed out that the problem with the classical bootstrap is that the proportion of outliers in a bootstrap sample might be greater than that in the original data. Hence, the entire bootstrap inferential procedure would be erroneous in the presence of outliers. As an alternative, robust bootstrap techniques have drawn greater attention from statisticians (see [11–15]). However, not much work has been devoted to bootstrap techniques when both outliers and heteroscedasticity are present in the data. The wild bootstrap techniques can only rectify the problem of heteroscedasticity and are not resistant to outliers. Moreover, these procedures are based on the OLS estimate, which is very sensitive to outliers. We introduce the classical wild bootstrap in Section 2. In Section 3, we discuss the newly proposed robust wild bootstrap methods. A numerical example and a simulation study are presented in Sections 4 and 5, respectively. The conclusion of the study is given in Section 6.

2. Wild Bootstrap Techniques

In regression analysis, the most popular and widely used bootstrap technique is fixed-x resampling, or bootstrapping the residuals [2]. This bootstrapping procedure is based on the ordinary least squares (OLS) residuals and is summarized as follows.

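Step 1. Fit a model by the OLS method to the original sample of observations to get $\hat{\beta}$ and hence the fitted model is $\hat{y}_i = x_i^{T}\hat{\beta}$.

Step 2. Compute the OLS residuals $\hat{\varepsilon}_i = y_i - \hat{y}_i$; each residual has equal probability, $1/n$.

Step 3. Draw a random sample $\varepsilon_1^*, \ldots, \varepsilon_n^*$ from the $\hat{\varepsilon}_i$ by simple random sampling with replacement and attach it to $\hat{y}_i$ to obtain the fixed-$x$ bootstrap values $y_i^*$, where $y_i^* = \hat{y}_i + \varepsilon_i^*$.

Step 4. Fit the OLS to the bootstrapped values $y_i^*$ on the fixed $x$ to obtain $\hat{\beta}^*$.

Step 5. Repeat Steps 3 and 4 $B$ times to get $\hat{\beta}^{*1}, \ldots, \hat{\beta}^{*B}$, where $B$ is the number of bootstrap replications.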

We call this bootstrap scheme Boot_ols since it is based on the OLS method.
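The five steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation (the paper reports using S-plus); it assumes NumPy is available, and the function name `boot_ols` and its arguments are hypothetical.

```python
import numpy as np

def boot_ols(X, y, B=500, seed=0):
    """Fixed-x residual bootstrap; returns a (B, p) array of OLS estimates."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Step 1: OLS fit on the original sample.
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    fitted = X @ beta_hat
    # Step 2: OLS residuals, each carrying probability 1/n.
    resid = y - fitted
    betas = np.empty((B, p))
    for b in range(B):
        # Step 3: resample residuals with replacement, attach to fitted values.
        e_star = rng.choice(resid, size=n, replace=True)
        y_star = fitted + e_star
        # Step 4: refit OLS on the bootstrap response with X held fixed.
        betas[b] = np.linalg.lstsq(X, y_star, rcond=None)[0]
    # Step 5: the loop above repeats Steps 3 and 4 B times.
    return betas
```

The design matrix `X` is expected to contain a column of ones when the model has an intercept; each row of the returned array is one bootstrap replicate of the coefficient vector.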

When heteroscedasticity is present in the data, the error variances differ across observations and such bootstrap schemes cannot yield efficient estimates of the parameters. Wu [6] showed that they are inconsistent and asymptotically biased under heteroscedasticity. Wu [6] proposed a wild bootstrap (weighted bootstrap) that can be used to obtain standard errors which are asymptotically correct under heteroscedasticity of unknown form. Wu slightly modified Step 3 of the OLS bootstrap and kept the other steps unchanged. For each $i$, draw a value $t_i^*$, with replacement, from a distribution with zero mean and unit variance and attach it to $\hat{y}_i$ to obtain the fixed-$x$ bootstrap values $y_i^*$, where $y_i^* = x_i^{T}\hat{\beta} + t_i^*\hat{\varepsilon}_i/\sqrt{1 - h_{ii}}$ and $h_{ii} = x_i^{T}(X^{T}X)^{-1}x_i$ is the $i$th leverage. Note that the variance of $t_i^*\hat{\varepsilon}_i$ is not constant when the original errors are not homoscedastic. Therefore, this bootstrap scheme takes into consideration the nonconstancy of the error variances. As an alternative [6], $t_i^*$ can be chosen, with replacement, from $a_1, \ldots, a_n$, where $a_i = (\hat{\varepsilon}_i - \bar{\varepsilon})/\sqrt{n^{-1}\sum_{j=1}^{n}(\hat{\varepsilon}_j - \bar{\varepsilon})^2}$ with $\bar{\varepsilon} = n^{-1}\sum_{j=1}^{n}\hat{\varepsilon}_j$. For a regression model with an intercept term, $\bar{\varepsilon}$ approximately equals zero. This is a nonparametric implementation of Wu’s bootstrap since the resampling is done from the empirical distribution function of the (normalized) residuals. We call this method Wu’s bootstrap and denote it by Boot_wu.
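A possible sketch of the nonparametric version of Wu's scheme follows. It assumes the bootstrap response has the form fitted value plus a multiplier times the leverage-adjusted OLS residual, with the multipliers resampled from the normalized residuals; the helper names (`leverages`, `normalized_residuals`, `boot_wu`) are illustrative and not taken from the paper.

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix X (X'X)^{-1} X'."""
    return np.einsum("ij,ji->i", X, np.linalg.solve(X.T @ X, X.T))

def normalized_residuals(resid):
    """Centre the residuals and scale them to unit (1/n) variance."""
    centred = resid - resid.mean()
    return centred / np.sqrt(np.mean(centred ** 2))

def boot_wu(X, y, B=500, seed=0):
    """Wild bootstrap with multipliers drawn from the normalized residuals."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    fitted = X @ beta_hat
    resid = y - fitted
    # Leverage-adjusted residuals: each observation keeps its own scale.
    scaled = resid / np.sqrt(1.0 - leverages(X))
    a = normalized_residuals(resid)
    betas = np.empty((B, p))
    for b in range(B):
        t_star = rng.choice(a, size=n, replace=True)   # zero mean, unit variance
        y_star = fitted + t_star * scaled
        betas[b] = np.linalg.lstsq(X, y_star, rcond=None)[0]
    return betas
```

Unlike the residual bootstrap, the residual of observation i is never moved to another observation; only its multiplier is random, which is what preserves the observation-specific error variance.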

Following the idea of Wu [6], another wild bootstrap technique was proposed by Liu [7] in which $t_i^*$ is randomly selected from a population that has third central moment equal to one, with zero mean and unit variance. This selection is used to correct the skewness term in the Edgeworth expansion of the sampling distribution of $I^{T}\hat{\beta}$, where $I$ is a vector of ones. Liu’s bootstrap can be conducted by drawing the random numbers $t_i^*$ in the following two ways.

(1) $t_i^* = D_i - E(D_i)$, where $D_1, \ldots, D_n$ are independently and identically distributed with density $g_D(x) = \alpha^{\rho}x^{\rho - 1}e^{-\alpha x}/\Gamma(\rho)$, $x > 0$, where $\alpha = 2$ and $\rho = 4$.

(2) $t_i^* = H_iU_i - E(H_i)E(U_i)$, where $H_1, \ldots, H_n$ are independently and identically distributed normal variables with mean $(1/2)(\sqrt{17/6} + \sqrt{1/6})$ and variance 1/2, and $U_1, \ldots, U_n$ are also independently and identically distributed normal variables with mean $(1/2)(\sqrt{17/6} - \sqrt{1/6})$ and variance 1/2. The $H_i$’s and $U_i$’s are independent.

It is worth mentioning that selecting random numbers by either procedure 1 or procedure 2 of Liu [7] produces a third central moment equal to one. Following Cribari-Neto and Zarkos [16], we consider the second procedure for drawing the random sample $t_i^*$. We call this bootstrap scheme Boot_liu.
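The moment conditions for Liu's second procedure can be checked numerically. The sketch below assumes the multiplier is the product of two independent normal variables, each with variance 1/2 and with means (1/2)(√(17/6) ± √(1/6)), recentred by the product of the means; the function name `liu_multipliers` is hypothetical.

```python
import numpy as np

# Means of the two independent normal factors (each has variance 1/2).
MU_H = 0.5 * (np.sqrt(17.0 / 6.0) + np.sqrt(1.0 / 6.0))
MU_U = 0.5 * (np.sqrt(17.0 / 6.0) - np.sqrt(1.0 / 6.0))

def liu_multipliers(n, rng):
    """Draw n multipliers with mean 0, variance 1 and third central moment 1."""
    H = rng.normal(MU_H, np.sqrt(0.5), size=n)
    U = rng.normal(MU_U, np.sqrt(0.5), size=n)
    return H * U - MU_H * MU_U

# Analytic moments of the recentred product of independent normals:
#   variance     = (MU_H^2 + MU_U^2) / 2 + 1/4
#   third moment = 6 * MU_H * MU_U * (1/2)^2
variance = 0.5 * (MU_H ** 2 + MU_U ** 2) + 0.25
third_moment = 1.5 * MU_H * MU_U

t = liu_multipliers(1_000_000, np.random.default_rng(0))
sample_moments = (t.mean(), t.var(), np.mean((t - t.mean()) ** 3))
```

Both analytic expressions evaluate to one, and the sample moments of a large draw should be close to (0, 1, 1) up to Monte Carlo error.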

3. Proposed Robust Wild Bootstrap Techniques

We have discussed the classical wild bootstrap procedures, which are based on the OLS residuals. The OLS estimator suffers a huge setback in the presence of outliers since it has a 0% breakdown point [17]. Since the wild bootstrap samples are based on the OLS residuals, they are not resistant to outliers. Hence, in this article we propose to use the high-breakdown and high-efficiency robust MM estimator [18] to obtain robust residuals. It is expected that, for a good data point, the residual of the MM estimator is approximately the same as the OLS residual. On the other hand, the residual of the MM estimator would be larger for an outlying observation. We assign weights to the MM residuals. The standardized residuals $\hat{\varepsilon}_i^{MM}/\hat{\sigma}^{MM}$ are computed, where $\hat{\sigma}^{MM}$ is the square root of the mean squared error of the residuals of the MM estimates (see [19]). Following the idea of Furno [20], weights equal to one and $c/|\hat{\varepsilon}_i^{MM}/\hat{\sigma}^{MM}|$ are assigned to observations with $|\hat{\varepsilon}_i^{MM}/\hat{\sigma}^{MM}| \le c$ and $|\hat{\varepsilon}_i^{MM}/\hat{\sigma}^{MM}| > c$, respectively, where $c$ is an arbitrary constant chosen between 2 and 3. We multiply the weights by the residuals of the MM estimates and denote the results by $\hat{\varepsilon}_i^{WMM}$. It is now expected that not only the residuals corresponding to the good data points but also the weighted MM residuals corresponding to the bad data points will be similar to OLS residuals from data with no outliers. Based on the new weighted residuals $\hat{\varepsilon}_i^{WMM}$, we propose to robustify Boot_ols, Boot_wu, and Boot_liu. We call the resulting robust bootstraps RBoot_ols, RBoot_wu, and RBoot_liu.

We propose to replace the OLS residuals $\hat{\varepsilon}_i$ by the weighted MM residuals $\hat{\varepsilon}_i^{WMM}$ in Step 3 of Boot_ols. That is, the bootstrap sample is drawn from the $\hat{\varepsilon}_i^{WMM}$ by simple random sampling with replacement and the other steps remain unchanged. We call this bootstrap scheme RBoot_ols. We now discuss the formulation of the robust wild bootstrap based on Wu’s procedure. The algorithm is summarized as follows.

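Step 1. Fit a model by the MM estimator to the original sample of observations to get the robust parameters $\hat{\beta}^{MM}$ and hence the fitted model is $\hat{y}_i^{MM} = x_i^{T}\hat{\beta}^{MM}$.

Step 2. Compute the residuals of the MM estimate, $\hat{\varepsilon}_i^{MM} = y_i - \hat{y}_i^{MM}$. Then assign a weight $w_i$ to each residual such that the weight equals 1 if $|\hat{\varepsilon}_i^{MM}/\hat{\sigma}^{MM}| \le c$ and equals $c/|\hat{\varepsilon}_i^{MM}/\hat{\sigma}^{MM}|$ if $|\hat{\varepsilon}_i^{MM}/\hat{\sigma}^{MM}| > c$.

Step 3. The final weighted residuals of the MM estimates, denoted by $\hat{\varepsilon}_i^{WMM}$, are formed by multiplying the weights obtained in Step 2 by the residuals of the MM estimates. That is, $\hat{\varepsilon}_i^{WMM} = \hat{\varepsilon}_i^{MM}$ if the observation corresponds to a good data point (no outlier) and $\hat{\varepsilon}_i^{WMM} = w_i\hat{\varepsilon}_i^{MM}$ if the observation corresponds to an outlier.

Step 4. Construct a bootstrap sample $(y_i^*, x_i)$, where $y_i^* = x_i^{T}\hat{\beta}^{MM} + t_i^*\hat{\varepsilon}_i^{WMM}/\sqrt{1 - h_{ii}}$ and $t_i^*$ is a random sample following the procedure of Wu [6].

Step 5. The OLS procedure is then applied to the bootstrap sample, and the resultant estimate is denoted by $\hat{\beta}^{R*}$. Here, the robust estimates are very reliable since the bootstrap sample is constructed from the robust weighted residuals $\hat{\varepsilon}_i^{WMM}$.

Step 6. Repeat Steps 4 and 5 $B$ times, where $B$ is the number of bootstrap replications.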
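Steps 2 to 6 can be sketched as follows, taking the MM estimate as given. Several choices here are assumptions rather than details confirmed by the text: the MM fit itself is not computed (the argument `beta_mm` is a hypothetical input from any MM-regression routine), the scale is taken as the square root of the residual sum of squares over n − p, the leverage adjustment mirrors the classical Wu scheme, and the multiplier generator `draw_t` is supplied by the caller.

```python
import numpy as np

def weighted_mm_residuals(resid_mm, sigma_mm, c=2.5):
    """Furno-type weights: 1 if |e/sigma| <= c, else c/|e/sigma| (Steps 2-3)."""
    std = np.abs(resid_mm / sigma_mm)
    weights = c / np.maximum(std, c)          # equals min(1, c / |e/sigma|)
    return weights * resid_mm

def rboot_wild(X, y, beta_mm, draw_t, B=500, c=2.5, seed=0):
    """Robust wild bootstrap given a robust fit beta_mm (assumed precomputed)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    fitted = X @ beta_mm
    resid_mm = y - fitted
    # Assumption: scale = sqrt(sum of squared MM residuals / (n - p)).
    sigma_mm = np.sqrt(np.sum(resid_mm ** 2) / (n - p))
    e_w = weighted_mm_residuals(resid_mm, sigma_mm, c)
    h = np.einsum("ij,ji->i", X, np.linalg.solve(X.T @ X, X.T))
    scaled = e_w / np.sqrt(1.0 - h)
    betas = np.empty((B, p))
    for b in range(B):
        t_star = draw_t(n, rng)               # Step 4: multipliers t_i*
        y_star = fitted + t_star * scaled
        betas[b] = np.linalg.lstsq(X, y_star, rcond=None)[0]   # Step 5: OLS
    return betas
```

Passing a different `draw_t` switches between a Wu-type and a Liu-type multiplier without changing the rest of the procedure; for example, `lambda n, rng: rng.standard_normal(n)` gives zero-mean, unit-variance multipliers.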

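As discussed earlier, in the classical scheme of Wu’s bootstrap, the quantity $t_i^*$ is drawn from a population that has mean zero and variance equal to one, or $t_i^*$ can be drawn from the normalized residuals, that is, $a_i = (\hat{\varepsilon}_i - \bar{\varepsilon})/\sqrt{n^{-1}\sum_{j=1}^{n}(\hat{\varepsilon}_j - \bar{\varepsilon})^2}$. However, following Maronna et al. [21], we suggest computing robust normalized residuals based on the median and the normalized median absolute deviation (NMAD) instead of the mean and standard deviation, which are not robust. Thus, for the weighted MM residuals $\hat{\varepsilon}_i^{WMM}$, $a_i^{R} = (\hat{\varepsilon}_i^{WMM} - \operatorname{median}_j(\hat{\varepsilon}_j^{WMM}))/\mathrm{NMAD}$, where $\mathrm{NMAD} = \operatorname{median}_i|\hat{\varepsilon}_i^{WMM} - \operatorname{median}_j(\hat{\varepsilon}_j^{WMM})|/0.6745$. We call this proposed robust nonparametric bootstrap RBoot_wu.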
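The robust normalization can be illustrated with a short helper. It assumes the usual definition of the normalized MAD (median absolute deviation about the median, divided by 0.6745); the function name `robust_normalized` is illustrative.

```python
import numpy as np

def robust_normalized(resid):
    """Centre by the median and scale by the normalized MAD (MAD / 0.6745)."""
    med = np.median(resid)
    nmad = np.median(np.abs(resid - med)) / 0.6745
    return (resid - med) / nmad

# A single gross value barely moves the median and the NMAD,
# whereas it would dominate the mean and the standard deviation.
resid = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 100.0])
a = robust_normalized(resid)
```

By construction the normalized values have median zero and NMAD one, which plays the role of the zero-mean, unit-variance requirement in the classical scheme.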

In this paper we also robustify the wild bootstrap based on the Liu [7] algorithm. It is important to note that the only difference between the Wu and Liu implementations of the wild bootstrap is the choice of the random sample $t_i^*$. In the proposed robust bootstrap based on the Liu wild bootstrap, we choose the random sample $t_i^*$ in exactly the same manner as in the classical Liu bootstrap. We call this bootstrap scheme RBoot_liu.

4. Numerical Example

In this section, a numerical example is presented to assess the performance of the robust wild bootstrap methods. In order to compare the robustness of the classical and robust wild bootstrap in the presence of outliers, the Concrete Compressive Strength data are taken from Yeh [22]. Concrete is the most important material in civil engineering. The concrete compressive strength is a function of eight input variables: cement (kg/m³), blast furnace slag (kg/m³), fly ash (kg/m³), water (kg/m³), superplasticizer (kg/m³), coarse aggregate (kg/m³), fine aggregate (kg/m³), and age of testing (days). The residuals versus fitted values are plotted in Figure 1, which shows a funnel shape suggesting heterogeneous error variances for the data (see [19]).

We checked whether this data set contains any outliers by using Least Trimmed Squares (LTS) residuals. It is found that 61 observations (about 6% of the sample of size 1030) appear to be outliers. The robust and nonrobust (classical) wild bootstrap methods were then applied to the data in two situations, namely, the data with outliers and the data without outliers (with the outlying data points omitted). The results are based on 500 bootstrap replications and are given in Table 1.

The standard errors of the parameter estimates from the robust and nonrobust wild bootstrap methods are exhibited in Table 1. The average standard errors of the parameter estimates are also shown. When there are no outliers, the standard errors of the classical wild bootstrap are reasonably close to the standard errors of the robust wild bootstrap. It is interesting to note that the classical wild bootstrap methods provide larger standard errors than the robust wild bootstrap methods when outliers are present in the data.

We cannot make a final conclusion yet, just by observing the results of the real data, but a reasonable interpretation up to this stage is that the classical wild bootstrap is affected by outliers.

5. Simulation Study

In this section, the performance of the proposed robust wild bootstrap estimators is evaluated in a simulation study. We first generate some artificial data to examine the performance of the proposed bootstrap techniques. The performance of the proposed estimators is then verified by a Monte Carlo simulation on bootstrap samples.

5.1. Artificial Data

We follow the data generation technique of Cribari-Neto and Zarkos [16] and MacKinnon and White [23]. The design of this experiment involves a linear model with two covariates: $y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \sigma_i\varepsilon_i$, $i = 1, \ldots, n$ (5.1). We consider the sample sizes n = 20, 60, 100. For n = 20 the covariate values $x_{1i}$ were obtained from and the covariate values $x_{2i}$ were obtained from N(0,1). These observations were replicated three and five times to create the samples of size n = 60 and n = 100, respectively. The data generation was performed using . For all i under the homoscedasticity, . However, the main interest here is in the heteroscedastic model. In this respect, we create a heteroscedastic generating mechanism following Cribari-Neto [24]’s work, where The degree of heteroscedasticity was measured by $\lambda = \max(\sigma_i^2)/\min(\sigma_i^2)$. The degree of heteroscedasticity remains constant for different sample sizes since the covariate values are replicated for generating different sample sizes. In our study the degree of heterogeneity was approximately . We focus on the situation where the regression design includes outliers. To generate a certain percentage of outliers in Model (5.1), some i.i.d. normal errors $\varepsilon_i$’s were replaced by N(5, 10). Hence the contaminated heteroscedastic model becomes where and is chosen according to the percentage of outliers. In this study we choose 5%, 10%, 15%, and 20% outliers in the model; that is, the proportion of uncontaminated errors is 0.95, 0.90, 0.85, and 0.80, respectively. For each sample size, the OLS, the classical, and the proposed robust wild bootstrap were then applied to the data. The number of bootstrap replications was 500 in each model for the different sample sizes. It is noteworthy that the bootstrap is extremely computer intensive, and the S-plus programming language was used for computing the bootstrap estimates.
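Two mechanics of this design can be sketched without knowing the exact settings: the degree of heteroscedasticity, taken here as the ratio of the largest to the smallest error variance, is unchanged when the design points are replicated, and contamination replaces a fixed fraction of the errors. Everything specific in the sketch is an assumption for illustration only: the uniform covariate, the skedastic function `exp(x1)`, and reading N(5, 10) as mean 5 and variance 10 are not taken from the text.

```python
import numpy as np

def degree_of_heteroscedasticity(sigma2):
    """Ratio of the largest to the smallest error variance."""
    return sigma2.max() / sigma2.min()

def contaminate(eps, frac, rng, loc=5.0, scale=np.sqrt(10.0)):
    """Replace a fraction of the errors by draws from a shifted normal.

    Assumption: N(5, 10) is read as mean 5 and variance 10.
    """
    out = eps.copy()
    k = int(round(frac * eps.size))
    idx = rng.choice(eps.size, size=k, replace=False)
    out[idx] = rng.normal(loc, scale, size=k)
    return out, idx

rng = np.random.default_rng(0)
x1 = rng.uniform(size=20)                 # hypothetical covariate draw
sigma2 = np.exp(x1)                       # placeholder skedastic function
lam_20 = degree_of_heteroscedasticity(sigma2)
lam_100 = degree_of_heteroscedasticity(np.tile(sigma2, 5))   # n = 100 by replication

eps = rng.standard_normal(20)
eps_cont, outlier_idx = contaminate(eps, 0.10, rng)
```

Because replication only repeats the same variances, `lam_20` and `lam_100` coincide, which is the reason the degree of heteroscedasticity stays constant across sample sizes.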

The wild bootstrap standard errors of the estimates are computed for different sample sizes and different percentages of contamination. The bootstrap standard errors of Boot_ols, Boot_wu, and Boot_liu are obtained by taking the square root of the main diagonal of the covariance matrix $\operatorname{cov}(\hat{\beta}^*) = \frac{1}{B-1}\sum_{b=1}^{B}(\hat{\beta}^{*b} - \bar{\beta}^*)(\hat{\beta}^{*b} - \bar{\beta}^*)^{T}$ (5.5), where $\bar{\beta}^* = B^{-1}\sum_{b=1}^{B}\hat{\beta}^{*b}$. The bootstrap standard errors of RBoot_ols, RBoot_wu, and RBoot_liu are likewise obtained by taking the square root of the main diagonal of the covariance matrix given in (5.5); the only essential difference is that the usual bootstrap estimates are replaced by the robust bootstrap estimates.
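The standard-error computation amounts to taking the square roots of the diagonal of the sample covariance matrix of the bootstrap replicates. The sketch below assumes the replicates are stored row-wise in a (B, p) array and uses the divisor B − 1, which is an assumption since the exact divisor is not recoverable from the text; `bootstrap_se` is an illustrative name.

```python
import numpy as np

def bootstrap_se(betas):
    """Square root of the diagonal of the covariance of bootstrap replicates."""
    cov = np.atleast_2d(np.cov(betas, rowvar=False, ddof=1))
    return np.sqrt(np.diag(cov))

# Three replicates of a two-parameter model: the first coefficient varies,
# the second is constant across replicates.
replicates = np.array([[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]])
se = bootstrap_se(replicates)
```

The same function applies unchanged to the classical and the robust schemes; only the array of replicates passed in differs.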

The influence of outliers on the standard errors of the estimates is visible in Figures 2, 3, and 4. In these plots, the average standard errors of the parameter estimates are plotted at different levels of outliers for the different bootstrap methods. The results presented in Figures 2–4 show that the robust wild bootstrap estimates are fairly close to the classical estimates at the 0% level of contamination. Moreover, the average standard errors of RBoot_wu and RBoot_liu, regardless of the percentage of outliers, remain close to the average standard errors of the classical Boot_wu and Boot_liu, respectively, in “clean” data. However, at the 5%, 10%, 15%, and 20% levels of contamination, the classical standard errors of the bootstrap estimates become unduly large. In contrast, not much influence is visible for the robust wild bootstrap techniques RBoot_wu and RBoot_liu at the different percentage levels of outliers. It is also observed that the performance of RBoot_liu is the best overall, followed by RBoot_wu.

5.2. Simulation Approach on Bootstrap Sample

In the previous section, we used artificial data sets for different sample sizes. Now we would like to investigate the performances of different bootstrap estimators where data sets are generated by Monte Carlo simulations. Let us consider a heteroscedastic model which is given by The covariate values of and are generated from for sample sizes 20, 60, and 100. We have also considered as the true parameters in this model and the heteroscedasticity generating function was . In this study the level of heteroscedasticity is set as .

In each simulation run and for each sample size, the $\varepsilon_i$’s were generated from N(0, 1) for the data with no outliers. For generating the 5% and 10% outliers, 95% and 90% of the $\varepsilon_i$’s were generated from N(0, 1) and the remaining 5% and 10% were generated from N(0, 20). Such simulations are extremely computer intensive: the simulation for each sample size entails a total of 250,000 replications (500 Monte Carlo replications with 500 bootstrap samples each). This simulation procedure was performed following the design of Cribari-Neto and Zarkos [16] and Furno [20].

The simulation results for the different bootstrap methods are presented in Tables 2–4. Table 2 shows the bias of the nonrobust and robust wild bootstrap techniques. It is observed that, for the different sample sizes, the bias of Boot_ols, Boot_liu, and Boot_wu increases with the percentage of outliers. On the other hand, RBoot_wu and RBoot_liu become only slightly biased as the percentage of outliers increases. We can draw the same conclusion from the mean bias of the estimates. The standard errors of the nonrobust and robust wild bootstrap are presented in Table 3. It is observed that the standard errors of the classical bootstrap estimates increase with the percentage of outliers for the different sample sizes, whereas the robust bootstrap estimates are only slightly affected by these outliers. The average standard errors of the estimates also show that the robust wild bootstrap techniques provide smaller standard errors in the presence of outliers. Finally, the robustness of the different bootstrapping techniques is evaluated based on the robustness measures defined in (5.5). Here the percentage robustness measure, that is, the ratio of the RMSEs of the estimators to the RMSEs of the OLS estimator for good data, is presented in Table 4. From this table we see that the OLS and the classical bootstrap methods perform poorly. In the presence of outliers, the efficiency of the classical bootstrap estimates is very low. However, the efficiency of the robust bootstrap estimates is fairly close to 100%.

6. Concluding Remarks

This paper examines the performance of the classical wild bootstrap techniques proposed by Wu [6] and Liu [7] in the presence of heteroscedasticity and outliers. Both the artificial example and the simulation study show that the classical bootstrap techniques perform poorly in the presence of outliers in the heteroscedastic model, although they perform very well for “clean” data. We attempt to robustify those classical bootstrap techniques to gain better efficiency in the presence of outliers. The numerical results show that the newly proposed robust wild bootstrap techniques, namely, RBoot_wu and RBoot_liu, outperform the classical wild bootstrap techniques when both outliers and heteroscedasticity are present in the data. RBoot_liu performs slightly better than RBoot_wu. Another advantage of using RBoot_wu and RBoot_liu is that no diagnosis of the data is required before these methods are applied.

Acknowledgment

The authors are grateful to the referees for valuable suggestions and comments that helped improve the paper.

References

[1] B. Efron, “Bootstrap methods: another look at the jackknife,” The Annals of Statistics, vol. 7, no. 1, pp. 1–26, 1979.
[2] B. Efron and R. Tibshirani, “Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy,” Statistical Science, vol. 1, no. 1, pp. 54–77, 1986.
[3] B. Efron, “Better bootstrap confidence intervals,” Journal of the American Statistical Association, vol. 82, no. 397, pp. 171–200, 1987.
[4] B. Efron and R. Tibshirani, An Introduction to the Bootstrap, CRC Press, 6th edition, 1993.
[5] H. Midi, “Bootstrap methods in a class of non-linear regression models,” Pertanika Journal of Science and Technology, vol. 8, pp. 175–189, 2002.
[6] C.-F. J. Wu, “Jackknife, bootstrap and other resampling methods in regression analysis,” The Annals of Statistics, vol. 14, no. 4, pp. 1261–1350, 1986.
[7] R. Y. Liu, “Bootstrap procedures under some non-i.i.d. models,” The Annals of Statistics, vol. 16, no. 4, pp. 1696–1708, 1988.
[8] R. Beran, “Prepivoting test statistics: a bootstrap view of asymptotic refinements,” Journal of the American Statistical Association, vol. 83, no. 403, pp. 687–697, 1988.
[9] E. Mammen, “Bootstrap and wild bootstrap for high-dimensional linear models,” The Annals of Statistics, vol. 21, no. 1, pp. 255–285, 1993.
[10] R. Davidson and E. Flachaire, “The wild bootstrap, tamed at last,” Working Paper IER#1000, Queen’s University, 2001.
[11] M. Salibian-Barrera and R. H. Zamar, “Bootstrapping robust estimates of regression,” The Annals of Statistics, vol. 30, no. 2, pp. 556–582, 2002.
[12] G. Willems and S. Van Aelst, “Fast and robust bootstrap for LTS,” Computational Statistics & Data Analysis, vol. 48, no. 4, pp. 703–715, 2005.
[13] A. H. M. R. Imon and M. M. Ali, “Bootstrapping regression residuals,” Journal of Korean Data and Information Science Society, vol. 16, pp. 665–682, 2005.
[14] M. Salibián-Barrera, S. Van Aelst, and G. Willems, “Fast and robust bootstrap,” Statistical Methods & Applications, vol. 17, no. 1, pp. 41–71, 2008.
[15] M. R. Norazan, H. Midi, and A. H. M. R. Imon, “Estimating regression coefficients using weighted bootstrap with probability,” WSEAS Transactions on Mathematics, vol. 8, no. 7, pp. 362–371, 2009.
[16] F. Cribari-Neto and S. G. Zarkos, “Bootstrap methods for heteroskedastic regression models: evidence on estimation and testing,” Econometric Reviews, vol. 18, pp. 211–228, 1999.
[17] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection, Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics, John Wiley & Sons, New York, NY, USA, 1987.
[18] V. J. Yohai, “High breakdown-point and high efficiency robust estimates for regression,” The Annals of Statistics, vol. 15, no. 2, pp. 642–656, 1987.
[19] A. H. M. R. Imon, “Deletion residuals in the detection of heterogeneity of variances in linear regression,” Journal of Applied Statistics, vol. 36, no. 3-4, pp. 347–358, 2009.
[20] M. Furno, “A robust heteroskedasticity consistent covariance matrix estimator,” Statistics, vol. 30, no. 3, pp. 201–219, 1997.
[21] R. A. Maronna, R. D. Martin, and V. J. Yohai, Robust Statistics: Theory and Methods, Wiley Series in Probability and Statistics, John Wiley & Sons, Chichester, UK, 2006.
[22] I.-C. Yeh, “Modeling of strength of high-performance concrete using artificial neural networks,” Cement and Concrete Research, vol. 28, no. 12, pp. 1797–1808, 1998.
[23] J. G. MacKinnon and H. White, “Some heteroskedasticity-consistent covariance matrix estimators with improved finite sample properties,” Journal of Econometrics, vol. 29, no. 3, pp. 305–325, 1985.
[24] F. Cribari-Neto, “Asymptotic inference under heteroskedasticity of unknown form,” Computational Statistics & Data Analysis, vol. 45, no. 2, pp. 215–233, 2004.

Copyright

Copyright © 2012 Sohel Rana et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
