Robust Bayesian Regularized Estimation Based on Regression Model
The distribution is a useful extension of the normal distribution, which can be used for statistical modeling of data sets with heavy tails, and provides robust estimation. In this paper, in view of the advantages of Bayesian analysis, we propose a new robust coefficient estimation and variable selection method based on Bayesian adaptive Lasso regression. A Gibbs sampler is developed based on the Bayesian hierarchical model framework, where we treat the distribution as a mixture of normal and gamma distributions and put different penalization parameters for different regression coefficients. We also consider the Bayesian regression with adaptive group Lasso and obtain the Gibbs sampler from the posterior distributions. Both simulation studies and real data example show that our method performs well compared with other existing methods when the error distribution has heavy tails and/or outliers.
Following the pioneer work of Tibshirani , the Lasso (least absolute shrinkage and selection operator) has generated much interest in statistical literatures [2–5]. For the Gaussian linear regression,where ’s are either fixed or random covariates, , . The Lasso estimates are viewed as regularized least squares estimates, which are defined asfor and with vectors , matrix , and vector . The key advantage of Lasso lies in its ability to do simultaneous parameter estimation and variable selection and the entire path of Lasso estimates for all values of can be efficiently computed through a modification of the LARS (least angle regression) algorithm of Efron et al. .
The Gaussian assumption is not crucial in model (1), but it is useful to make connections to the likelihood framework for regularized regression in Section 2. The Lasso estimator in (2) is equivalent to minimizing the penalized negative log-likelihood as a function of the regression coefficients and using the -penalty : equivalence means here that we obtain the same estimator for a potentially different tuning parameter, but the Lasso estimator in (2) does not provide an estimate of the variance parameter . In fact, the variance parameter plays an important role especially when the error in the regression model has high variance.
On the other hand, the Lasso estimate in (2) based on least square loss may suffer from poor performance when the error distribution has heavy tail which is longer than normal distribution or the error has large variance or contains outliers in the linear regression model, which motivate many researches to consider regularized estimation based on robust method. The most existing robust regularized estimation methods mainly replace the least square loss function in (2) by some robust loss functions, such as Huber loss, loss, and quantile loss function. Li and Zhu  considered quantile regression with the Lasso penalty and developed its piecewise linear solution path. Wang et al.  investigated the least absolute deviance (LAD) estimate with adaptive Lasso penalty (LAD-Lasso) and proved its oracle property. Wu and Liu  further discussed the oracle properties of the SCAD (smoothly clipped absolute deviation) and adaptive Lasso regularized quantile regression. Zou and Yuan  proposed a new efficient estimation and variable selection procedure which is called composite quantile regression. Chen et al.  studied the robust Lasso in the presence of noise with large variance. They considered two editions of robust Lasso, which use the convex combined loss and Huber loss function criterion instead of least square criterion.
Besides the above existing robust regularized method, which, in essence, replace the loss by other loss criteria, we can also directly consider robust estimation from the standpoint of error distribution in model (1). It is well known that when the error in the linear regression follows a normal distribution, the estimations will be sensitive to outliers, which motivate us to consider using robust error distribution in the linear regression model in (1). The distribution as an alternative to the normal distribution has frequently been suggested in the literature as a robust extension of the traditional normal models. The distribution provides a convenient description for regression analysis when the residual term has density with heavy tails and excess kurtosis.
Since the Lasso estimate in (2) can be interpreted as a Bayesian posterior mode estimate when the regression parameters have independent Laplace priors, motivated by this connection, Park and Casella  discussed the Bayesian Lasso estimation for linear regression based on Gibbs sampler method. The Bayesian Lasso has several advantages; for example, it can conveniently provide interval estimates, that is, Bayesian credible intervals, that can guide variable selection, the structure of the hierarchical model provides methods for selecting the Lasso parameter, and sometimes its performances outperform non-Bayesian method. Therefore, several authors were subsequently devoted to the variable selection problem based on Bayesian method, such as Park and Casella , Hans , Li et al. , and Kyung et al. .
The main goal of this paper is to extend the Bayesian regularized method for regression model (4), which is a robust edition Bayesian regularized estimation method, and it is expected that the performances will be better than other regularized methods when the error distribution follows a symmetry distribution and there are outliers in regression model.
The rest of the paper is organized as follows. In Section 2, we introduce the regression with Lasso penalty. In Section 3, we build the framework of Bayesian regression model with adaptive Lasso penalty and then show the corresponding Bayesian hierarchical model and obtain the Gibbs sample estimation procedure. In Sections 4 and 5, we carry out some simulation studies and real data analysis to illustrate the performance of our new method. Some conclusion and remarks are contained in Section 6. We also discuss the Bayesian regression with group Lasso penalty and its Gibbs sampler method in the Appendices.
2. Regression Model and Norm Regularization
2.1. Regression Model
The distribution as an alternative to the normal distribution has frequently been suggested in the literatures; for example, Lange et al.  introduced distributed errors regression models as a robust extension of the traditional normal models; Liu and Rubin  discussed the estimations of distribution by EM algorithm and its extensions; and Lin et al.  studied heteroscedasticity diagnostics problem by score test for linear regression models. And it has often been used in many real applications, especially in finance.
The probability density function of a distributed variable with location parameter , dispersion parameter , and degree of freedom , denoted as , is where denotes the gamma function, , and .
The mean and variance of are and , respectively. When and 1, the distribution reduces to the normal and Cauchy distributions, respectively. The univariate linear Student- regression model can be expressed aswhere the ’s are known -vector of covariates and are the unknown -vectors of parameters.
2.2. Regression with Lasso Penalty
As discussed in Section 1, the loss in the Lasso model is not robust to heavy-tailed error distribution and/or outliers. This indicates that the Lasso is not an ideal goodness-of-fit measure criterion in the presence of noise with large variance. We consider following regularized regression, which is defined aswhere and is the log-likelihood function of :
Comparing (5) with (2), we take the parameter into the optimization of the penalized maximum likelihood estimator. Though we are penalizing only the parameter , the variance parameter estimate is influenced indirectly by the amount of shrinkage in (5).
There is main drawback of the estimator in (5) . It is not equivalent under scaling of the response . More precisely, consider the transformationwhich leaves model (2) invariant, but this is not the case for the estimator in (5). We address this drawback by using the penalty term , leading to the following estimator:Noting the form of the regularized estimation in (8), the Lasso estimates can be interpreted as posterior mode estimates when the regression parameters have independent Laplace (i.e., double-exponential) priors:Park and Casella  argued that conditioning on is important because it guarantees a unimodal full posterior.
3. Bayesian Formulation of Robust Adaptive Lasso
3.1. Bayesian Adaptive Lasso Regression
It is well known that the ordinary Lasso has a conflict between the consistency for model selection and optimal prediction. As a solution to achieve both estimation and model selection consistency, the adaptive Lasso  was proposed and its model selection consistency and asymptotic normality under certain rate of shrinkage parameter were proved. To extend the idea of the adaptive Lasso  to our proposed robust Lasso regression, we define the robust adaptive Lasso aswhere ’s allow unequal penalties for the coefficients and we define the precision parameter .
Motivated by the connection with Bayesian estimation and (9), we consider a fully Bayesian analysis using a conditional Laplace prior on :For any , by the following equality :the Laplace prior (11) on can be written as On the other hand, from Lange et al.  and Liu and Rubin , can be treated as a mixture of normal and gamma distributions; that is, if and , then . If we further put gamma priors on the parameter and use the improper prior density on , then we have the following Bayesian hierarchical model:where , , and
Remark 1. The amount of shrinkage parameters ’s depend on the value of hyperparameters and . Because larger and smaller lead to bigger penalization, it is important to treat and as unknown parameters to avoid giving specific values which affect the estimates of the regression coefficients.
Remark 2. Similar to Park and Casella , there is also another existing method to obtain the estimation of adaptive Lasso parameters beside sampling from the posterior distribution of ’s. We can use the empirical Bayes estimation method to update ’s without setting the prior of ’s in the Bayesian hierarchical model. Our little experience shows that the two methods perform well, but the empirical Bayes estimation method can save some computing time; for more details see Park and Casella  and Leng et al. .
3.2. Gibbs Sampler for All Parameters
Let be the parameter vector excluding the component , the vector excluding the component , and the variable excluding the component . Based on the Bayesian hierarchical model (14), the posterior distribution of all parameters can be given byBased on the expression (15), we can obtain the posterior full conditional distribution for all parameters, which can yield a tractable and efficient Gibbs sampler that works as follows.
The full conditional distribution is Thus, the full conditional distribution of is a gamma distribution.
The full conditional distribution is Thus, the full conditional distribution of is a generalized inverse Gaussian distribution. Now we consider the full conditional distribution of which is given by where . Let and . Then the full conditional distribution of is just the normal distribution . The full conditional distribution of is So the full conditional distribution of is a gamma distribution.
The full conditional distribution of is That is, the full conditional distribution of is a gamma distribution. At last, the full conditional distributions of and are Obviously, the full conditional distribution of is just a gamma distribution. However, the full conditional posterior distribution of does not have a closed form. Fortunately, we note that the full conditional posterior distribution of is a log-concave function, so the adaptive rejection sampling algorithm  can be used to sample from this distribution. The adaptive rejection sampling algorithm is an efficient sampling method from any univariate probability density function which is log-concave.
Remark 3. The Lasso was originally designed for variable selection for the posterior mode of ’s can be exactly zeros due to the nature of the laplace prior in (11). However, the posterior draws of ’s cannot be exact zeros since ’s are drawn from the continuous posterior distribution. A post hoc thresholding rule may overcome this difficulty. Alternatively, kyung et al.  recommended using the credible interval on the posterior mean to guide variable selection.
Remark 4. For variable selection, sometimes, several explanatory variables may be represented by a group of derived input variables. In this case the selection of important variables corresponds to the group of variables. The group Lasso penalty [13, 25] takes the group structure into account and can do variable selection at the group level. For the Bayesian regression model with adaptive group Lasso penalty, we give the details in the Appendices.
4. Simulation Studies
The main goal of this section is to assess the performance of our method (BAT.L) through some simulations with comparison to several Bayesian and non-Bayesian methods. The Bayesian methods include the Bayesian Lasso (BLS.L)  and Bayesian adaptive Lasso (BALS.L) . The non-Bayesian methods include Lasso (LASSO)  and adaptive Lasso (ALASSO) . The data in the simulation studies are generated by, where the covariance matrix is an matrix; that is, with correlation , and the error distributions of considered here are the following six situations: The first choice is a standard normal distribution, . The second choice is a normal distribution with variance 9, . The third choice is a distribution with degree of freedom , . The fourth choice is Laplace distribution, . The last two choices are the mixture of two normal distributions: and the mixture of two Laplace distributions: .
For each simulation study and each choice of the error distribution, we run 50 simulations. In each simulation, we generate a training set with 20 observations, a validation set with 20 observations, and a testing set with 200 observations. The validation set is used to choose the penalty parameters in LASSO and ALASSO. After the penalty parameters are selected, we combine the training set and validation set together to estimate . Since BAT.L, BLS.L, and BALS.L do not need the validation set since they estimate the penalty automatically, so we combine the training set and validation set together for estimation. The testing set is used to evaluate the performance for these methods. Within each simulation study, we consider three different settings for . Simulation 1: . Simulation 2: . Simulation 3: if , ; else, , .
Simulations 1, 2, and 3 are corresponding to the sparse case, very sparse case, and sparse recovery problem in the predictors, respectively.
Note that we intentionally choose error distributions that are different from the distribution and Laplace distribution to see how the Bayesian regression adaptive Lasso estimation depends on the error assumption. Our simulation results show that, in terms of parameter estimation accuracy, the Bayesian adaptive Lasso method still performs well even when this error distribution assumption is violated.
Unless otherwise specified, the freedom degree is used in all simulations for BAT.L. Since the true model is known, we can compute the median of mean absolute deviations (MMAD), that is, median where the median of mean absolute deviations is calculated based on 50 simulations. We present the results for simulation in Table 1. The parentheses in these tables are the standard deviations of the MMAD obtained by 500 bootstrap resamplings of the 50 mean absolute deviations.
Simulations show that, in terms of the MMAD, our new regularized method performs better than the other four methods in general, especially for the nonnormal error distributions.
5. Land Rent Data
In this section, we demonstrate the performance of BAT.L together with other methods by a real data set.
Weisberg  reported a data set about land rent data. The data was collected by Douglas Tiffany to study the variation in rent paid in 1977 for agricultural land planted with alfalfa. The variables are average rent per acre planted to alfalfa (), average rent paid for all tillable land (), density of dairy cows (number per square mile) (), and proportion of farmland used as pasture () and if liming is required to grow alfalfa; , otherwise. The unit of analysis is a county in Minnesota; the 67 counties with appreciable rented farmland are included. The response variable alfalfa, which has mean 42.1661 and standard deviation 22.5866, is a high protein crop that is suitable feed for dairy cows. It is thought that rent for land planted with alfalfa relative to rent for other agricultural purposes would be higher in areas with a high density of dairy cows and rents would be lower in counties where liming is required, since that would mean additional expense.
We first center all the variables so that the intercept is not considered and the predictors are standardized and then use above five regularized methods to fit the whole data set. For the three Bayesian methods, we use the Bayesian credible interval to guide variable selection. We found that the five different methods all select and as important variables. Furthermore, in order to evaluate the performance of BAT.L together with other methods, we randomly select 40 samples as train data set and the remaining 27 samples as test data set; the mean square errors for test data set are reported in the Table 2 for five different regularized methods.
As we can see from Table 2, the performance of BAT.L outperforms other variable selection procedures for this data set.
6. Conclusion Remarks
In this paper, we studied the regularized regression model based on Bayesian method. This new method extended the Bayesian Lasso  through replacing the least square loss function by the log-likelihood function of distribution and the Lasso penalty by adaptive Lasso penalty. Bayesian hierarchical models are developed and Gibbs samplers are derived for our new method. Both the simulation studies and the real data analysis show that BAT.L performs better than other existing Bayesian and non-Bayesian methods when the error distribution has heavy tails and/or outliers. We also discussed the Bayesian regression model with adaptive group Lasso penalty.
There is room to improve our methods. In this paper, we treat the dispersion parameter in regression model as constant. In fact, the dispersion parameter may be different for different observations, which need us to consider the problem of regularized regression with varying dispersion model; that is, where ’s are covariates, which constitute, in general, although not necessary, a subset of ’s. For the varying dispersion regression model, it is important to consider how to simultaneously do variable selection for both regression coefficients and based on Bayesian method or non-Bayesian method. Furthermore, it also be interesting to consider the problem of variable selection for varying dispersion regression models with high dimensional covariates. Research in this aspect is ongoing.
A. Bayesian Regression with Adaptive Group Lasso Penalty
In the Appendices, we consider regression problems where some explanatory variables may be represented by a group of derived input variables. In this case, the selection of important variables corresponds to the group of variables. The group Lasso penalty [13, 25] takes the group structure into account and can do variable selection at the group level.
Suppose that the predictors are grouped into groups and is the coefficient vector of the th group predictors . Then and . Let be the dimension of the vector and let be a known positive definite matrix (). Define . We consider the following adaptive group Lasso regression:Pick the prior of as the Laplace prior: where . By the equality (12), we have If we further put gamma priors on the parameter , then we have the following Bayesian hierarchical model:
B. The Gibbs Sampler for Bayesian Adaptive Group Lasso Regression
Based on the Bayesian hierarchical model (A.4), the posterior distribution of all parameters can be given by By the expression (B.1), we can obtain the posterior full conditional distribution for all parameters, which can yield a tractable and efficient Gibbs sampler. It is easy to see that the full conditional distribution of is the same as in the adaptive Lasso regression.
The full conditional distribution is Thus, the full conditional distribution of is a generalized inverse Gaussian distribution.
The full conditional distribution of is given bywhere .
Let and . Then the full conditional distribution of is just the normal distribution .
The full conditional distribution of is That is, the full conditional distribution of is a gamma distribution.
The full conditional distribution of is
At last, the full conditional distributions of and are Again, the full conditional distribution of is gamma distribution and we can use the adaptive rejection sampling algorithm (Gilks, 1992) to obtain the sampler of .
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
The research was supported in part by the Research Project of Social Science and Humanity Fund of the Ministry of Education (14YJC910007).
C. Liu and D. Rubin, “ML estimation of the t distribution using EM and its extensions, ECM and ECME,” Statistica Sinica, vol. 5, no. 1, pp. 19–39, 1995.View at: Google Scholar
E. Lehmann, Theory of Point Estimation, Wadsworth and Brooks/Cole, Pacific Grove, Calif, USA, 1983.
S. Weisberg, Applied Linear Regression, Wiley, New York, NY, USA, 1985.