Journal of Probability and Statistics

Volume 2015 (2015), Article ID 432986, 8 pages

http://dx.doi.org/10.1155/2015/432986

## Robust Stability Best Subset Selection for Autocorrelated Data Based on Robust Location and Dispersion Estimator

^{1}Laboratory of Computational Statistics and Operations Research, INSPEM, University Putra Malaysia, 43400 Serdang, Malaysia

^{2}Department of Statistics, College of Administration and Economics, University of Al-Qadisiyah, Diwaniyah, Iraq

^{3}Faculty of Science and Institute for Mathematical Research, University Putra Malaysia, 43400 Serdang, Malaysia

Received 23 September 2015; Revised 7 December 2015; Accepted 8 December 2015

Academic Editor: Ramón M. Rodríguez-Dagnino

Copyright © 2015 Hassan S. Uraibi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The stability selection (multisplit) approach is a variable selection procedure that relies on multisplit data to overcome the shortcomings that may arise with single-split data. Unfortunately, this procedure yields very poor results in the presence of outliers and other contamination in the original data. The problem becomes more complicated when the regression residuals are serially correlated. This paper presents a new robust stability selection procedure to remedy the combined problem of autocorrelation and outliers. We demonstrate the good performance of our proposed robust selection method using real air quality data and a simulation study.

#### 1. Introduction

Splitting data into two parts is not new in statistical inference and data analysis. Wasserman and Roeder [1] suggested combining the single-split approach with a variable selection procedure. The variable selection algorithm is carried out on the first part (a random half of the data), and the significance of each selected variable is then tested, based on the value of its regression coefficient, on the second part (the remaining half of the data). However, this procedure does not guarantee reproducible results, because the split is chosen arbitrarily [2].

A stability selection, or multisplit, approach was put forward to improve the performance of the single-split variable selection method. The modern approaches to stability selection, which rely on subsampling, were proposed by [2, 3] for high dimensional data. The data are repeatedly and randomly split into two parts of equal size. Unlike the bootstrap, the stability selection approach repeatedly draws (without replacement) two subsamples of equal size from the original data. Any part of the split data may contain more outliers than the other part. As a consequence, the existing classical linear regression stability selection procedure is easily affected by outliers, so unreliable variables may be selected into the final model. This problem can be rectified by incorporating a robust estimator in the selection procedure. However, this approach may not be adequate, since robust estimation is expected to perform well only up to a certain percentage of outliers (Imon and Ali [4], Norazan et al. [5]). Since the selection procedure of the stability selection method is fairly close to the bootstrap [6], the idea of the robust bootstrap may be used in the stability selection procedure.

Following the idea of [4], in this paper we propose a diagnostic method that is applied before subsampling. The proposed diagnostic method is based on the Reweighted Fast Consistent and High (RFCH) breakdown estimator developed by [7] (cited by Alkenani and Yu [8], Özdemir and Wilcox [9], and Zhang et al. [10]). The suspected outliers are identified and deleted, and random subsampling is then performed on the remaining (clean) set of observations.

The proposed variable selection procedure also takes the autocorrelation problem into consideration. This problem, if not remedied, may lead to misleading conclusions about the statistical significance of the regression coefficients [11]. Hence, the existing variable selection procedure may select the wrong model. Appropriate remedial measures must be taken once autocorrelation is detected. The Cochrane-Orcutt or Prais-Winsten methods (Greene [12], Gujarati and Porter [11]) are often used to rectify the autocorrelation problem. Nonetheless, these procedures are based on the OLS estimates, which are not robust and are therefore easily affected by outliers. Ann and Midi [13] proposed the Robust Cochrane-Orcutt Prais-Winsten (RCOPW) iterative method, based on the high breakdown point and high efficiency MM-estimator [14], to overcome the combined problem of outliers and autocorrelated errors.

Hence, the main objective of this paper is to develop a reliable, robust stability all-subset selection procedure in the presence of outliers and autocorrelation. The proposed method first rectifies the autocorrelation problem and subsequently incorporates the Reweighted Fast Consistent High (RFCH) breakdown estimator in the algorithm. Upon convergence, the concentrated (clean) dataset is identified, and all-possible-subsets procedures, namely, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) methods, are applied to the concentrated dataset in the last steps of the RFCH method. This approach is called concentrating all-subset selection and can be considered a trade-off between the quality of the data and the interpretability of the model.

#### 2. The Consistency of Robust Stability Selection

Olive and Hawkins [7] showed that the RFCH estimator is Fast Consistent and High breakdown. The RFCH estimator is constructed using a concentration algorithm in which convergence is achieved after ten steps. At convergence, outliers are identified and deleted from the dataset. The remaining data are used in the robust stability selection method, whereby the former can be considered a source of consistency having the following properties:

(1) The all-subset selection of single-split data is consistent based on [7, Theorem ].

(2) The multisplit procedure in which single-split data is repeated times is also consistent based on [2, Corollary ].

#### 3. Robust Stability All-Subset Selection Method

Let a multivariate location and scatter model be the joint distribution of the $i$th case, a random vector, that is completely specified by a population location vector $\mu$ and a symmetric positive definite population scatter matrix $\Sigma$. Assume that the $n$ cases are collected in a data matrix with one row per case, such that the cases are independent. Consider a linear regression model $y = X\beta + \varepsilon$, where $y$ is an $n \times 1$ vector of response variables, $\beta$ is a $p \times 1$ vector of regression parameters, $X$ is an $n \times p$ matrix of independent variables, and $\varepsilon$ is an $n \times 1$ vector of random errors. The algorithm of our proposed robust and fast consistent variable selection consists of three main stages that are summarized as follows.

*Stage 1 (rectifying the autocorrelation problem). *
We follow the simple Robust Cochrane-Orcutt procedure proposed by Ann and Midi [13] to rectify the problem of autocorrelation in the presence of both types of outlying observations, namely, vertical outliers and leverage points. The procedure can be summarized as follows:

(1) Estimate the robust regression coefficients using the MM-estimator to get the residuals $e_t$.

(2) Regress $e_t$ on $e_{t-1}$ using the MM-estimator to find the robust autocorrelation parameter $\hat{\rho}$.

(3) Use $\hat{\rho}$ in the equations below to remedy the autocorrelation problem and obtain a new design matrix $X^{*}$ and response variable $y^{*}$:

$$y_t^{*} = y_t - \hat{\rho}\, y_{t-1}, \qquad x_t^{*} = x_t - \hat{\rho}\, x_{t-1},$$

where $t = 2, \dots, n$.
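The transformation in the last step can be illustrated with a short Python sketch. It assumes the standard Cochrane-Orcutt quasi-differencing form $y_t - \rho\, y_{t-1}$ applied to the response and to every column of the design matrix. The function names are hypothetical, and `lag1_slope` is a plain least-squares stand-in: the procedure described above uses an MM-estimator at that point, which is not reproduced here.

```python
from typing import List, Sequence, Tuple


def lag1_slope(residuals: Sequence[float]) -> float:
    """Least-squares slope of e_t on e_{t-1} through the origin.

    Non-robust stand-in (assumption): the text uses an MM-estimator here.
    """
    num = sum(residuals[t] * residuals[t - 1] for t in range(1, len(residuals)))
    den = sum(residuals[t - 1] ** 2 for t in range(1, len(residuals)))
    return num / den


def cochrane_orcutt_transform(
    y: Sequence[float], X: Sequence[Sequence[float]], rho: float
) -> Tuple[List[float], List[List[float]]]:
    """Quasi-difference y and each column of X; the first observation is dropped."""
    y_star = [y[t] - rho * y[t - 1] for t in range(1, len(y))]
    X_star = [
        [X[t][j] - rho * X[t - 1][j] for j in range(len(X[t]))]
        for t in range(1, len(X))
    ]
    return y_star, X_star


# Toy residuals that decay geometrically, so the lag-1 slope is 0.5.
residuals = [1.0, 0.5, 0.25, 0.125]
rho_hat = lag1_slope(residuals)

# Toy data: the first column of X is an intercept column of ones.
y_star, X_star = cochrane_orcutt_transform(
    [1.0, 2.0, 3.0], [[1.0, 4.0], [1.0, 6.0], [1.0, 8.0]], rho_hat
)
```

Note that the intercept column is transformed as well, so it becomes $1 - \rho$ in every remaining row.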

*Stage 2 (concentrating the data). *
The concentrating algorithm assumes that the normality assumption for a linear regression is violated due to outliers or other contamination. The RFCH algorithm is employed to clean the data. This procedure uses the Devlin, Gnanadesikan, and Kettenring (DGK) [15] and Median Ball (MB) [16] algorithms, which are summarized as follows.

Suppose the matrix is a combination of the response vector and the covariates matrix .

*(i) The DGK Algorithm*

*Step 1. *
Begin by computing the classical estimator of the original dataset to give the initial or starting point , and find the initial Mahalanobis distance:

*Step 2. *
Arrange the initial Mahalanobis distances in increasing order to compute their median. Those observations in the original dataset whose Mahalanobis distances are less than the median of all the Mahalanobis distances will be in the remaining set (half dataset) and will be denoted by :

*Step 3. *
Let be equal to , where is the variance-covariance matrix of the original data. Calculate the average and the variance-covariance estimators of to get the first attractor .

*Step 4. *
If the diagonal elements of are equal to , then stop the algorithm. Otherwise, repeat Steps until convergence, to get the final attractor and , where is the convergence step.
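The concentration idea behind Steps 1 to 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names are hypothetical, the half set keeps the observations whose squared Mahalanobis distance does not exceed the median, and the loop stops when the half set no longer changes or after ten steps, which stands in for the stopping rule of Step 4.

```python
from statistics import median
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def mean_and_cov(rows: Sequence[Sequence[float]]) -> Tuple[Vector, Matrix]:
    """Classical sample mean vector and covariance matrix (divisor n - 1)."""
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    cov = [
        [sum((r[j] - mean[j]) * (r[k] - mean[k]) for r in rows) / (n - 1)
         for k in range(p)]
        for j in range(p)
    ]
    return mean, cov


def solve(a: Matrix, b: Vector) -> Vector:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def mahalanobis_sq(row: Sequence[float], center: Vector, cov: Matrix) -> float:
    """Squared Mahalanobis distance of one observation from (center, cov)."""
    diff = [x - c for x, c in zip(row, center)]
    return sum(d * s for d, s in zip(diff, solve(cov, diff)))


def dgk_attractor(rows: Sequence[Sequence[float]], steps: int = 10):
    """Start from the classical estimator and concentrate on the half set."""
    center, cov = mean_and_cov(rows)
    half = list(range(len(rows)))
    for _ in range(steps):
        d2 = [mahalanobis_sq(r, center, cov) for r in rows]
        cutoff = median(d2)
        new_half = [i for i, d in enumerate(d2) if d <= cutoff]
        if new_half == half:  # assumed stopping rule: half set is stable
            break
        half = new_half
        center, cov = mean_and_cov([rows[i] for i in half])
    return center, cov, half


# Toy data: eight clean points near the origin and one gross outlier (index 8).
rows = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0],
        [1.0, 2.0], [0.0, 2.0], [2.0, 0.0], [100.0, 100.0]]
center, cov, half = dgk_attractor(rows)
```

In this toy example the outlier has the largest distance at every step, so it never enters the half set and the final location estimate stays inside the clean cloud.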

*(ii) The Median Ball (MB) Algorithm*

*Step 1. *
Suppose the initial variance-covariance matrix is the identity matrix and suppose that Med is the median vector of the matrix . Then, the Mahalanobis distance based on the median is defined as follows:

*Step 2. *
The location criterion cut-off point is the median of and is denoted by :where . The cut-off point should be the quantile of whose probability equals 0.5. For the concentration of , find the half dataset with only nonoutlying observations whose Mahalanobis distances are less than or equal to the median:

*Step 3. *
Compute the average and the variance-covariance matrix of .

*Step 4. *
For more concentrations, compute the Mahalanobis distances again, and repeat Steps until convergence at the final attractor and , where is the convergence step.
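The start of the MB algorithm (Steps 1 to 3) can be sketched in Python as follows. With the identity matrix as the initial scatter, the Mahalanobis distance from the coordinatewise median reduces to the Euclidean distance, which is what the sketch computes. The function name `median_ball_start` is hypothetical, and the further concentration steps of Step 4 are omitted.

```python
from math import sqrt
from statistics import median
from typing import List, Sequence, Tuple


def median_ball_start(
    rows: Sequence[Sequence[float]],
) -> Tuple[List[float], List[int]]:
    """Return the coordinatewise median and the indices of the half set.

    The half set holds the observations whose Euclidean distance from the
    coordinatewise median is at most the median of all such distances.
    """
    p = len(rows[0])
    med = [median(r[j] for r in rows) for j in range(p)]
    dist = [sqrt(sum((r[j] - med[j]) ** 2 for j in range(p))) for r in rows]
    cutoff = median(dist)
    half = [i for i, d in enumerate(dist) if d <= cutoff]
    return med, half


# Toy data: four clean points and one gross outlier (index 4).
rows = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [50.0, 50.0]]
med, half = median_ball_start(rows)
```

The average and the variance-covariance matrix of the observations indexed by `half` would then give the first MB attractor.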

*(iii) The Reweighted Fast and Consistent High (RFCH) Breakdown Algorithm.* Olive and Hawkins [7] developed the MB estimator by adding the location criterion or cut-off point to select the attractor and proposed the so-called Fast Consistent and High (FCH) breakdown estimator. Olive and Hawkins [7] noted that the FCH estimator uses the attractors with the smallest determinant.

*Step 1. *
Following the same approach as Olive and Hawkins [7], define the final attractors as follows:where is the 50th percentile of a Chi-square distribution with degrees of freedom.

According to [7, Theorem ], as long as the start is a consistent estimator of either or , the FCH attractor is a consistent estimator of , where and are positive constants and or based on the criterion cut-off point.

*Step 2. *
Obtain the Reweighted FCH attractors by isolating the observation with , and using the classical estimator to obtain fromCompute the new cut-off point as . The new variance-covariance matrix is

*Step 3. *
Repeat Steps - with the new cut-off point until convergence, to get the final attractors and .
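For orientation, the scaling and reweighting steps can be written out as follows. This is a sketch based on our reading of the construction in [7], not a transcription of the equations above: the symbols $(T_A, C_A)$ for the chosen attractor, $D_i^2(T, C)$ for the squared Mahalanobis distance of case $i$, $\mathrm{MED}$ for the median over cases, and the reweighting level $\delta$ (often taken as 0.975) are assumptions introduced for illustration.

```latex
% FCH: rescale the chosen attractor (T_A, C_A) with the chi-square median
T_{FCH} = T_A, \qquad
C_{FCH} = \frac{\operatorname{MED}_i \, D_i^2(T_A, C_A)}{\chi^2_{p,\,0.5}} \; C_A .

% Reweighting: apply the classical estimator to the cases that satisfy
%   D_i^2(T_{FCH}, C_{FCH}) \le \chi^2_{p,\,\delta}
% to obtain (\hat{\mu}_1, \tilde{\Sigma}_1), and rescale again
\hat{\Sigma}_1 = \frac{\operatorname{MED}_i \, D_i^2(\hat{\mu}_1, \tilde{\Sigma}_1)}{\chi^2_{p,\,0.5}} \; \tilde{\Sigma}_1 .
```

Here $\chi^2_{p,\,0.5}$ is the 50th percentile of a Chi-square distribution with $p$ degrees of freedom, matching the cut-off described in Step 1.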

*Stage 3 (robust stability selection based on all-subset selection). *
The concentrated data involves the concentrated response vector and the concentrated design matrix . Assume that is a single random subsample that is drawn from , and is the remaining subsample, where such that is the number of rows in the concentrated design matrix .

All-subset regression method guarantees that all possible potential covariates will be included in the submodels. The classical BIC criteria have the ability to determine the best model. We propose that all-subset procedure be applied to the first part of data . The best model is the one that has coefficients with values less than , where is the number of all candidate covariates. Repeat this procedure times until convergence to get best subsets such that , where ; is number of parameters estimation in subset , where .

Following Meinshausen and Bühlmann [2], the threshold is defined as where is the expected number of variables falsely selected, is the number of covariates in the specific subset, and is the highest chosen selection probability with the most selected covariates in the whole path of solution. In this study, we used . Let be the number of ’s repeated in ; then, the selected variables are those that belong to such that . We multiply by to create the threshold measured by percentage; that is, , where is the number of covariates in certain subset.
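The counting logic of the multisplit step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `stability_selection` and `toy_selector` are hypothetical names, the `select` callable stands in for the all-subset search on one random half of the concentrated data, and the threshold value used below is illustrative rather than the one used in this study.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Set


def stability_selection(
    n: int,
    select: Callable[[List[int]], Set[str]],
    n_splits: int = 100,
    threshold: float = 0.6,
    seed: int = 0,
) -> Dict[str, float]:
    """Keep the variables whose selection frequency reaches the threshold."""
    rng = random.Random(seed)
    counts: Counter = Counter()
    for _ in range(n_splits):
        # Draw half of the row indices without replacement.
        subsample = sorted(rng.sample(range(n), n // 2))
        counts.update(select(subsample))
    freq = {name: c / n_splits for name, c in counts.items()}
    return {name: f for name, f in freq.items() if f >= threshold}


def toy_selector(rows: List[int]) -> Set[str]:
    """Hypothetical selector: x1 is always chosen, x3 only if row 0 is drawn."""
    chosen = {"x1"}
    if 0 in rows:
        chosen.add("x3")
    return chosen


stable = stability_selection(n=20, select=toy_selector, n_splits=200, threshold=0.9)
```

In the toy run, `x1` is selected in every split and is retained, while `x3` appears in roughly half of the splits and falls below the threshold.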

#### 4. Simulation Study

Here, we report a simulation study that was designed to assess the performance of our proposed robust variable selection technique under two different outlier scenarios. In this experiment, we consider multiple linear regression model with the following relation: where .

A design matrix was generated from a multivariate normal distribution with covariance structure , where , , and .

The random errors were drawn from a standard normal distribution. To create the autocorrelation problem, we considered the following setting: where .
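One common way to induce serial correlation is a first-order autoregressive recursion on the errors, $\varepsilon_t = \rho\,\varepsilon_{t-1} + u_t$. The Python sketch below assumes this AR(1) form for illustration only. The function name `ar1_errors`, the starting value of zero, and the value $\rho = 0.7$ in the usage lines are hypothetical and are not taken from the setting of this study.

```python
import random
from typing import List, Sequence


def ar1_errors(innovations: Sequence[float], rho: float) -> List[float]:
    """Build AR(1) errors e_t = rho * e_{t-1} + u_t, starting from e_0 = 0."""
    errors: List[float] = []
    previous = 0.0
    for u in innovations:
        previous = rho * previous + u
        errors.append(previous)
    return errors


# Deterministic check of the recursion.
errors = ar1_errors([1.0, 1.0, 1.0], rho=0.5)

# Usage with standard normal innovations (rho = 0.7 is illustrative).
rng = random.Random(1)
simulated = ar1_errors([rng.gauss(0.0, 1.0) for _ in range(100)], rho=0.7)
```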

As in [17], two outlier scenarios were added to the data. The first scenario contaminated the residuals by symmetric outliers with the slash distribution, where , and the random errors were generated as . The second outlier scenario was generated by replacing 10% of the original values with high leverage points and vertical outliers. The vertical outliers were generated as asymmetric outliers, where , and the errors were generated as . To create the leverage points, each covariate was contaminated with 10% outlying observations generated from .

For each case, we generated 500 independent simulated datasets. The problem of autocorrelated errors was rectified first, and each dataset was then randomly split into training (70%) and test (30%) sets. The proposed robust stability selections (R. multisplit-AIC and R. multisplit-BIC), the existing stability selections (multisplit-AIC and multisplit-BIC), and the single-split all-subsets-AIC and single-split all-subsets-BIC methods were then applied to the training datasets. This process was repeated 500 times. The average Root Mean Square Errors (RMSE) of the test sets over the 500 simulation runs and the percentage of runs in which each variable of the training sets was selected in the final model are presented in Tables 1–3. The potential variables being selected are also exhibited in the tables. The best method is the one that has the lowest RMSE and selects the correct variables (variables , , , , ) in the final model with no noise variable. The results in Table 1 show that when there is no outlier in the data, all six methods are reasonably close to each other. The results indicate that our proposed method is comparable with the other existing methods.
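The evaluation step can be sketched in Python as follows: a random 70/30 split of the row indices and the RMSE of predictions on the test part. The function names and the seed are hypothetical, and the model fitting that would produce the predictions is left out.

```python
import random
from math import sqrt
from typing import List, Sequence, Tuple


def train_test_indices(
    n: int, train_fraction: float = 0.7, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Randomly split the indices 0..n-1 into training and test sets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = round(train_fraction * n)
    return sorted(indices[:cut]), sorted(indices[cut:])


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error between observed and predicted responses."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return sqrt(squared / len(observed))


train, test = train_test_indices(100)
example_rmse = rmse([1.0, 2.0], [1.0, 4.0])
```

Averaging such test-set RMSE values over the simulation runs gives the kind of summary reported in the tables.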