Abstract

The stability selection (multisplit) approach is a variable selection procedure that relies on repeatedly split data to overcome the shortcomings that may arise from a single data split. Unfortunately, this procedure yields very poor results in the presence of outliers and other contamination in the original data. The problem becomes more complicated when the regression residuals are serially correlated. This paper presents a new robust stability selection procedure to remedy the combined problem of autocorrelation and outliers. We demonstrate the good performance of our proposed robust selection method using real air quality data and a simulation study.

1. Introduction

The approach of splitting data into two parts is not new in statistical inference and data analysis. Wasserman and Roeder [1] suggested combining the single-split approach with a variable selection procedure. The variable selection algorithm is carried out on the first part (a random half of the data), followed by testing the significance of each selected variable, based on the value of its regression coefficient, on the second part (the remaining half). However, this procedure does not guarantee reproducible results because it depends on an arbitrary split [2].

The stability selection, or multisplit, approach was put forward to improve the performance of the single-split variable selection method. Modern stability selection approaches, which rely on a subsampling technique, were proposed by [2, 3] for high-dimensional data. The data are repeatedly split at random into two parts of equal size. Unlike the bootstrap, the stability selection approach repeatedly selects (without replacement) two subsamples of equal size from the original data. There is a possibility that one part of the split data contains more outliers than the other. As a consequence, the existing classical linear regression stability selection procedure is easily affected by outliers, so unreliable variables may be selected into the final model. This problem can be rectified by incorporating a robust estimator in the selection procedure. However, this approach may not be adequate since robust estimation is expected to perform well only up to a certain percentage of outliers (Imon and Ali [4], Norazan et al. [5]). Since the selection procedure of the stability selection method is fairly close to the bootstrap [6], the idea of the robust bootstrap may be used in the stability selection procedure.

Following the idea of [4], in this paper we propose a diagnostic step before subsampling. The proposed diagnostic method is based on the Reweighted Fast Consistent and High (RFCH) breakdown estimator developed by [7] (cited by Alkenani and Yu [8], Özdemir and Wilcox [9], and Zhang et al. [10]). The suspected outliers are identified and deleted, and random subsampling is performed from the remaining (clean) set of observations.

The proposed variable selection procedure also takes the autocorrelation problem into consideration. This problem, if not remedied, may lead to misleading conclusions about the statistical significance of the regression coefficients [11]. Hence, the existing variable selection procedure may select the wrong model. Appropriate remedial measures must be taken after detecting the presence of autocorrelation. The Cochrane-Orcutt or Prais-Winsten methods (Greene [12], Gujarati and Porter [11]) are often used to rectify the autocorrelation problem. Nonetheless, these procedures are based on the OLS estimates, which are not robust and are therefore easily affected by outliers. Ann and Midi [13] proposed the Robust Cochrane-Orcutt Prais-Winsten (RCOPW) iterative method, based on the high breakdown point and high efficiency MM-estimator [14], to overcome the combined problem of outliers and autocorrelated errors.

Hence, the main objective of this paper is to develop a reliable, robust stability all-subset selection procedure in the presence of outliers and the autocorrelation problem. The proposed method is formulated by rectifying the autocorrelation problem at the outset, and subsequently the Reweighted Fast Consistent and High (RFCH) breakdown estimator is incorporated in the algorithm. Upon convergence, the concentrated (clean) dataset is identified, and all-possible-subsets procedures based on the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are applied to the concentrated dataset in the last steps of the RFCH method. This approach is called concentrating all-subset selection and can be considered a trade-off between the quality of the data and the interpretability of the model.

2. The Consistency of Robust Stability Selection

Olive and Hawkins [7] showed that the RFCH estimator is Fast, Consistent, and High breakdown. The RFCH estimator is constructed using a concentration algorithm in which convergence is achieved after ten steps. At convergence, outliers are identified and deleted from the dataset. The remaining data are used in the robust stability selection method, whose consistency rests on the following properties: (1) the all-subset selection on single-split data is consistent, by the consistency theorem in [7]; (2) the multisplit procedure, in which the single-split procedure is repeated a large number of times, is also consistent, by the corollary in [2].

3. Robust Stability All-Subset Selection Method

Let a multivariate location and scatter model be a joint distribution of the i-th case z_i of a random vector that is completely specified by a population location vector μ and a symmetric positive definite population scatter matrix Σ. Assume that the n cases z_1, ..., z_n are collected in a data matrix whose rows are independent. Consider a linear regression model y = Xβ + ε, where y is an n x 1 vector of response variables, β is a p x 1 vector of regression parameters, X is an n x p matrix of independent variables, and ε is an n x 1 vector of random errors. The algorithm of our proposed robust and fast consistent variable selection consists of three main stages, summarized as follows.

Stage 1 (rectifying the autocorrelation problem). We follow the simple Robust Cochrane-Orcutt procedure proposed by Ann and Midi [13] to rectify the autocorrelation problem in the presence of both types of outlying observations, namely, vertical outliers and leverage points. The procedure can be summarized as follows: (1) estimate the robust regression coefficients using the MM-estimator and obtain the residuals e_t; (2) regress e_t on e_{t-1} using the MM-estimator to find the robust autocorrelation parameter ρ̂; (3) use ρ̂ in the transformation below to remedy the autocorrelation problem and obtain a new design matrix X* and response variable y*:

y*_t = y_t − ρ̂ y_{t−1},  x*_t = x_t − ρ̂ x_{t−1},  t = 2, ..., n,

with the first observation rescaled as y*_1 = sqrt(1 − ρ̂²) y_1 and x*_1 = sqrt(1 − ρ̂²) x_1.
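For illustration, the following is a minimal Python sketch of this Cochrane-Orcutt/Prais-Winsten transformation. The function names are ours, and ordinary least squares is used only as a stand-in for the MM-estimator that the RCOPW procedure of Ann and Midi [13] actually requires.

import numpy as np

def fit_stand_in(X, y):
    # Placeholder fit; the RCOPW procedure uses the MM-estimator here instead.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def prais_winsten_transform(X, y):
    # Stage 1 sketch: estimate rho from lag-1 residuals and quasi-difference the data.
    n = len(y)
    Xc = np.column_stack([np.ones(n), X])          # add an intercept column
    beta = fit_stand_in(Xc, y)                     # step (1): fit and get residuals
    e = y - Xc @ beta
    rho = fit_stand_in(e[:-1, None], e[1:])[0]     # step (2): regress e_t on e_{t-1}
    # step (3): quasi-difference; the Prais-Winsten form keeps the first observation
    y_star = np.empty(n)
    X_star = np.empty_like(X, dtype=float)
    y_star[0] = np.sqrt(1.0 - rho ** 2) * y[0]
    X_star[0] = np.sqrt(1.0 - rho ** 2) * X[0]
    y_star[1:] = y[1:] - rho * y[:-1]
    X_star[1:] = X[1:] - rho * X[:-1]
    return X_star, y_star, rho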

Stage 2 (concentrating the data). The concentration algorithm assumes that the normality assumption of the linear regression model is violated due to outliers or other contamination. The RFCH algorithm is employed to clean the data. This procedure uses the Devlin, Gnanadesikan, and Kettenring (DGK) estimator [15] and the Median Ball (MB) estimator [16]. These algorithms are summarized as follows.
Suppose the matrix Z, with rows z_1, ..., z_n, is the combination of the transformed response vector y* and the transformed covariate matrix X* from Stage 1.

(i) The DGK Algorithm

Step 1. Begin by computing the classical estimator of the original dataset, the sample mean and variance-covariance matrix of Z, to give the initial or starting point (T_0, C_0), and find the initial Mahalanobis distances

D_i^2(T_0, C_0) = (z_i − T_0)' C_0^{-1} (z_i − T_0),  i = 1, ..., n.

Step 2. Arrange the initial Mahalanobis distances in increasing order and compute their median. Those observations in the original dataset whose Mahalanobis distances are less than the median of all the Mahalanobis distances form the remaining set (half dataset), denoted by H_1:

H_1 = {z_i : D_i(T_0, C_0) < MED(D_1(T_0, C_0), ..., D_n(T_0, C_0))}.

Step 3. Calculate the mean and the variance-covariance estimators of the observations in H_1 to get the first attractor (T_1, C_1).

Step 4. If the diagonal elements of the new scatter estimate are equal to those of the previous one, then stop the algorithm. Otherwise, repeat Steps 2 and 3 until convergence, to get the final DGK attractor (T_k, C_k), where k is the convergence step. A sketch of these concentration steps is given below.
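As an illustration of the DGK concentration steps above, here is a small numpy sketch, assuming the combined data matrix Z is passed as a two-dimensional array; the ten-iteration cap mirrors the convergence claim in Section 2.

import numpy as np

def dgk_concentration(Z, max_iter=10):
    # Step 1: start from the classical estimator of the data.
    T, C = Z.mean(axis=0), np.cov(Z, rowvar=False)
    for _ in range(max_iter):
        diff = Z - T
        d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(C), diff)   # squared Mahalanobis distances
        half = Z[d2 <= np.median(d2)]                                 # Step 2: keep the closest half
        T_new, C_new = half.mean(axis=0), np.cov(half, rowvar=False)  # Step 3: re-estimate on the half set
        if np.allclose(C_new, C):                                     # Step 4: stop at convergence
            T, C = T_new, C_new
            break
        T, C = T_new, C_new
    return T, C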

(ii) The Median Ball (MB) Algorithm

Step 1. Suppose the initial variance-covariance matrix C_0 is the identity matrix I and suppose that MED(Z) is the coordinatewise median vector of the matrix Z. Then, the Mahalanobis distance based on the median is defined as

D_i(MED(Z), I) = sqrt((z_i − MED(Z))' (z_i − MED(Z))),  i = 1, ..., n,

which is simply the Euclidean distance of z_i from the coordinatewise median.

Step 2. The location criterion cut-off point is the median of the distances D_1(MED(Z), I), ..., D_n(MED(Z), I); that is, the cut-off point is the quantile of the distances whose probability equals 0.5. For the concentration step, find the half dataset containing only the nonoutlying observations whose Mahalanobis distances are less than or equal to this median:

H_1 = {z_i : D_i(MED(Z), I) ≤ MED(D_1(MED(Z), I), ..., D_n(MED(Z), I))}.

Step 3. Compute the mean and the variance-covariance matrix of the observations in H_1 to get the attractor (T_1, C_1).

Step 4. For further concentration, compute the Mahalanobis distances again and repeat Steps 2 and 3 until convergence at the final MB attractor (T_k, C_k), where k is the convergence step. The concentration loop is the same as in the DGK sketch above; only the start differs, as illustrated below.
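A minimal illustration of that start, reusing the notation of the DGK sketch:

import numpy as np

def mb_start(Z):
    # Median Ball start: coordinatewise median location with an identity scatter matrix,
    # so the initial distances are plain Euclidean distances from the median.
    return np.median(Z, axis=0), np.eye(Z.shape[1])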

(iii) The Reweighted Fast Consistent and High (RFCH) Breakdown Algorithm. Olive and Hawkins [7] extended the MB estimator by adding the location criterion, or cut-off point, used to select the attractor and proposed the so-called Fast Consistent and High (FCH) breakdown estimator. Olive and Hawkins [7] noted that the FCH estimator uses the attractor with the smallest determinant.

Step 1. Following the same approach as Olive and Hawkins [7], define the FCH attractors as

T_FCH = T_A,  C_FCH = [MED(D_i^2(T_A, C_A)) / χ²(d, 0.5)] C_A,

where (T_A, C_A) is the attractor selected by the location criterion and the smallest determinant, d is the dimension of z_i, and χ²(d, 0.5) is the 50th percentile of a Chi-square distribution with d degrees of freedom.
According to the consistency theorem in [7], as long as the start is a consistent estimator of (μ, sΣ), the FCH attractor is a consistent estimator of (μ, aΣ), where s and a are positive constants whose values depend on the criterion cut-off point.

Step 2. Obtain the Reweighted FCH attractors by isolating the observations with D_i^2(T_FCH, C_FCH) ≤ χ²(d, 0.975) and using the classical estimator on these observations to obtain (T_1, S_1). The new variance-covariance matrix is rescaled as

C_1 = [MED(D_i^2(T_1, S_1)) / χ²(d, 0.5)] S_1,

and the new cut-off point is then computed from the distances D_i^2(T_1, C_1).

Step 3. Repeat Step 2 with the new cut-off point until convergence, to get the final RFCH attractors (T_RFCH, C_RFCH). A sketch of this rescaling and reweighting step is given below.
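The following numpy/scipy sketch illustrates the FCH rescaling and reweighting, assuming (T, C) is the attractor already chosen by the location criterion and the smallest determinant; the 0.975 and 0.5 Chi-square quantiles follow the description above, while the number of passes and the exact sequencing are simplifications of the published algorithm.

import numpy as np
from scipy.stats import chi2

def rfch_reweight(Z, T, C, n_passes=2):
    d = Z.shape[1]
    keep = np.ones(len(Z), dtype=bool)
    for _ in range(n_passes):
        diff = Z - T
        d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(C), diff)
        keep = d2 <= chi2.ppf(0.975, d)                # trim cases flagged as outlying
        T = Z[keep].mean(axis=0)                       # classical estimate on the kept cases
        S = np.cov(Z[keep], rowvar=False)
        diff = Z - T
        d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff)
        C = np.median(d2) / chi2.ppf(0.5, d) * S       # rescale for consistency at the normal
    return T, C, keep

The cases for which keep is False at the end are the suspected outliers that are removed before the subsampling of Stage 3.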

Stage 3 (robust stability selection based on all-subset selection). The concentrated data consist of the concentrated response vector y_c and the concentrated design matrix X_c. Assume that one subsample of size floor(n_c/2) is drawn at random from the concentrated data and that the remaining observations form the second subsample, where n_c is the number of rows in the concentrated design matrix X_c.
The all-subset regression method guarantees that every combination of the candidate covariates is considered among the submodels, and the classical AIC and BIC criteria are used to determine the best submodel. We propose that the all-subset procedure be applied to the first part of the data. The best model is the one whose coefficients have p-values less than α/p, where p is the number of all candidate covariates. This procedure is repeated over many random splits to obtain the corresponding collection of best subsets, each with its own number of estimated parameters.
Following Meinshausen and Bühlmann [2], the threshold is defined as

π_thr = (1/2)(q² / (E(V) p) + 1),

where E(V) is the expected number of falsely selected variables, q is the number of covariates selected over the whole path of solutions, and p is the number of candidate covariates. In this study, a fixed value of E(V) was used. Let f_j be the number of times covariate j appears among the best subsets; then, the selected variables are those whose selection proportion (f_j divided by the number of splits) is at least π_thr. We multiply π_thr by 100 to express the threshold as a percentage. A sketch of this repeated selection and thresholding is given below.
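To make Stage 3 concrete, here is a minimal Python sketch of the repeated half-sampling, all-subset selection, and frequency thresholding. The exhaustive BIC search, the number of splits, and the fixed threshold value are illustrative placeholders rather than the paper's settings, which use the threshold π_thr defined above.

import itertools
import numpy as np

def bic(X, y):
    # BIC of an ordinary least squares fit, used here as a stand-in criterion.
    n, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + k * np.log(n)

def best_subset(X, y):
    # Exhaustive all-subset search: return the index set with the lowest BIC.
    p = X.shape[1]
    candidates = (s for r in range(1, p + 1) for s in itertools.combinations(range(p), r))
    return min(candidates, key=lambda s: bic(X[:, list(s)], y))

def stability_selection(Xc, yc, n_splits=50, threshold=0.6, seed=None):
    # Repeated half-sampling of the concentrated data and thresholding of selection frequencies.
    rng = np.random.default_rng(seed)
    n, p = Xc.shape
    counts = np.zeros(p)
    for _ in range(n_splits):
        half = rng.choice(n, size=n // 2, replace=False)    # one random half of the clean data
        counts[list(best_subset(Xc[half], yc[half]))] += 1
    freq = counts / n_splits
    return np.flatnonzero(freq >= threshold), freq

Exhaustive search is feasible here because the number of candidate covariates is small (ten in the air quality data).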

4. Simulation Study

Here, we report a simulation study designed to assess the performance of our proposed robust variable selection technique under two different outlier scenarios. In this experiment, we consider a multiple linear regression model in which five of the candidate covariates enter the model with nonzero coefficients and the remaining covariates are noise variables.

The design matrix was generated from a multivariate normal distribution with a prescribed covariance structure among the covariates.

The random errors were drawn from a standard normal distribution. To create the autocorrelation problem, the errors were generated from a first-order autoregressive scheme, ε_t = ρ ε_{t−1} + u_t, where the innovations u_t are standard normal.

As in [17], two outlier scenarios were added to the data. The first scenario contaminated the residuals with symmetric outliers from the slash distribution, that is, errors generated as the ratio of a standard normal variate to an independent uniform (0, 1) variate. The second outlier scenario was generated by replacing 10% of the original values with high leverage points and vertical outliers. The vertical outliers were generated as asymmetric outliers in the errors, and, to create the leverage points, each covariate was contaminated with 10% outlying observations. An illustrative data generator for this design is sketched below.
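For readers who wish to reproduce a design in this spirit, the following generator sketches AR(1) errors plus the leverage/vertical-outlier scenario; the dimensions, the autocorrelation parameter, the nonzero coefficients, and the contamination shifts are placeholder values rather than the paper's settings.

import numpy as np

def simulate_dataset(n=100, p=15, rho=0.5, contamination=0.10, seed=None):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    beta = np.zeros(p)
    beta[:5] = 1.0                                  # five informative covariates (placeholder values)
    u = rng.normal(size=n)                          # standard normal innovations
    eps = np.empty(n)
    eps[0] = u[0]
    for t in range(1, n):                           # AR(1) errors create the autocorrelation problem
        eps[t] = rho * eps[t - 1] + u[t]
    y = X @ beta + eps
    bad = rng.choice(n, size=int(contamination * n), replace=False)
    X[bad] += 10.0                                  # high leverage points (placeholder shift)
    y[bad] += 20.0                                  # vertical outliers (placeholder shift)
    return X, y, beta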

For each case, we generated 500 independent simulated datasets. The problem of autocorrelated errors is first rectified, and each dataset is then randomly split into training (70%) and test (30%) sets. The proposed robust stability selections (R. multisplit-AIC and R. multisplit-BIC), the existing stability selections (multisplit-AIC and multisplit-BIC), and the single-split all-subsets-AIC and all-subsets-BIC methods were then applied to the training datasets. This process was repeated 500 times. The average Root Mean Square Error (RMSE) of the test sets over the 500 simulation runs and the percentage chance of each variable in the training sets being selected in the final model are presented in Tables 1–3. The potential variables being selected are also exhibited in the tables. The best method is the one that has the lowest RMSE and selects the correct variables (the five nonzero variables) in the final model with no noise variable. The results in Table 1 show that, when there is no outlier in the data, all six methods are reasonably close to each other. The results indicate that our proposed method is comparable with the other existing methods.

Nevertheless, the results change dramatically in the presence of the two outlier scenarios. It can be observed from Table 2 that the classical multisplit-AIC and multisplit-BIC methods are much affected by the presence of high leverage points and vertical outliers. Both methods have the highest RMSEs and tend to underfit. In this situation, the single-split-AIC and single-split-BIC variable selection techniques also fail to select the correct variables. Both methods tend to overfit because they also select noise variables into the final model. The presence of symmetric outliers, as can be seen from Table 3, changes the picture considerably. The RMSEs of the single-split-AIC and single-split-BIC are relatively larger than those of the other methods, and both tend to underfit. Surprisingly, the multisplit-AIC and multisplit-BIC methods select the correct variables in this situation. Nonetheless, their RMSEs are slightly larger than those of the R. multisplit-AIC and the R. multisplit-BIC. On the other hand, the RMSEs of the R. multisplit-AIC and the R. multisplit-BIC are consistently the smallest among the six methods. Both methods select the correct variables with no noise variables, both when no contamination occurs in the model and under the two outlier scenarios. Hence, it can be concluded that our proposed R. multisplit-AIC and R. multisplit-BIC methods are the best variable selection methods for the linear regression model with autocorrelated errors because they are stable and consistently select the correct variables without choosing any noise variable.

5. Air Quality Data

In this study, hourly air pollution data taken from the Department of Environment (DoE), Malaysia, are used to further assess the performance of our method. The data consist of the PM10 concentration and ten independent variables, of which six are pollutant variables (sulphur dioxide (SO2), nitrogen dioxide (NO2), nitrogen monoxide (NO), nitrogen oxides (NOx), carbon monoxide (CO), and ozone (O3)) and four are meteorological variables (wind speed (WS), wind direction (WD), temperature (Temp), and relative humidity (Hum)). PM10 is particulate matter of solid or semisolid material found in the air with a diameter of 10 micrometers or less. The value of each variable was recorded at the monitoring station at Seberang Perai, Penang (Figure 1), on an hourly basis every day from January 2005 to December 2013.

For the purpose of the statistical analysis, the hourly data were converted to daily averages, giving 3,287 readings. Missing values and calibration hours of certain variables were replaced by the coordinatewise medians of those variables. Let us first observe the plots in Figure 2. Both the histogram (b) and the quantile-quantile (Q-Q) plot (c) of Figure 2 show that the residuals are contaminated with a heavy-tailed mixture distribution. Since some points in the Q-Q plot do not fall on the straight line and the histogram is skewed to the right, the data are not normal. Thus, we suspect that there are outliers in this dataset. Figure 2(d) also shows that there are some leverage points in each covariate.

Figure 2(a) indicates the existence of autocorrelation, or serial correlation, between the residuals, and it seems that there is a high-order autoregressive structure.

Our proposed robust stability all-subset selection procedures and the existing methods were then applied to the data (3,287 observations, with PM10 as the response variable and the ten independent variables already mentioned) to investigate which variables influence PM10. Since the air quality data are taken in time sequence, the Durbin-Watson (DW) test was applied to the data to check for the existence of an autocorrelation problem. The Durbin-Watson statistic for the original air quality data confirmed the existence of autocorrelation, and no autocorrelation remained after treating the autocorrelation problem.
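For completeness, the Durbin-Watson statistic can be computed directly from the regression residuals; a minimal sketch follows (statsmodels provides an equivalent durbin_watson function). Values near 2 indicate no first-order autocorrelation, while values near 0 indicate strong positive autocorrelation.

import numpy as np

def durbin_watson(residuals):
    # DW = sum of squared first differences of the residuals over their sum of squares.
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)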

After correcting the autocorrelation problem, the data is then randomly divided into training (70%) and test (30%) sets.

This process is repeated 3,000 times. The RFCH estimator is used to concentrate the training and test set data. Following Meinshausen and Bühlmann [2], each training set and each test set are split randomly into two sets of equal size, and this process is repeated 50 times. The six variable selection methods were then applied to the first part of the training dataset, and the variables selected in the final model were determined. For cross validation, the coefficients of each training model are used to predict the response (PM10) on the test set data, and the model residuals and the RMSEs are then computed. Table 4 exhibits the selected variables and the percentage of each variable being selected for the training set data, together with the average RMSE for the test set data over the 3,000 runs. The threshold value in Table 4 is calculated from the threshold formula given in Section 3.

A candidate variable is one whose percentage of being selected in a model exceeds the threshold value. The best method is the one that has the lowest average RMSE; a sketch of this repeated train/test evaluation is given below.
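The evaluation loop described above can be sketched as follows; select_and_fit is a hypothetical callable standing in for any of the six selection methods, and the function returns the average test RMSE over the repeated 70/30 splits.

import numpy as np

def repeated_holdout_rmse(select_and_fit, X, y, n_runs=3000, train_frac=0.7, seed=None):
    rng = np.random.default_rng(seed)
    n = len(y)
    rmses = []
    for _ in range(n_runs):
        idx = rng.permutation(n)
        cut = int(train_frac * n)
        train, test = idx[:cut], idx[cut:]
        predict = select_and_fit(X[train], y[train])   # e.g. the R. multisplit-BIC model
        resid = y[test] - predict(X[test])
        rmses.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(rmses))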

The results in Table 4 show that the RMSEs of our proposed methods, based on both AIC and BIC, are the smallest compared with the existing methods. This suggests that our proposed methods correctly identify the potential variables, namely, WD, Temp, Hum, SO2, NO2, O3, and CO, to be included in the final model. The single-split-AIC method selects eight covariates, while the single-split-BIC method selects only six covariates. The classical multisplit-AIC selects seven covariates, and the multisplit-BIC selects five covariates.

It is interesting to observe that our proposed methods select all the pollutant variables except NOx and NO and all the meteorological variables except WS. From the results in Table 4, we can clearly infer that the R. multisplit-AIC and R. multisplit-BIC methods are more efficient than the classical methods, because the final model selected by these methods includes all the nonzero covariates and has the smallest RMSE. The results of the model validation suggest that WD, Temp, Hum, SO2, NO2, O3, and CO should be included in the final model.

6. Conclusions and Recommendation

The main aim of this study was to develop a reliable alternative approach that is capable of selecting the correct variables in the final model for data having the combined problem of outliers and autocorrelated errors. We considered the well-known all-subsets-AIC, all-subsets-BIC, multisplit-AIC, and multisplit-BIC variable selection methods in this regard. None of these existing methods is effective in choosing the correct variables in the final model. In this study, we proposed a robust stability selection method by incorporating the highly efficient and high breakdown MM-estimator and the RFCH estimator, and by applying the all-subset-BIC and all-subset-AIC procedures to the concentrated data. The real air quality data and the simulation experiments show that our proposed methods successfully and consistently select the correct variables in the final model with the smallest RMSE, whereas the commonly used methods fail to do so. Hence, we consider our proposed methods better variable selection methods and strongly recommend using them, especially when outliers and autocorrelated errors occur in the data.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank the Ministry of Science, Technology & Innovation, Malaysia, which supported this work under eScienceFund Research Grant no. 06-01-04-SF1764. Special thanks also go to the Department of Environment, Ministry of Natural Resources & Environment, Malaysia, which provided the air pollution data used in this research.