Research Article | Open Access
Penalized Maximum Likelihood Method to a Class of Skewness Data Analysis
An extension of some standard likelihood and variable selection criteria based on procedures of linear regression models under the skew-normal distribution or the skew- distribution is developed. This novel class of models provides a useful generalization of symmetrical linear regression models, since the random term distributions cover both symmetric as well as asymmetric and heavy-tailed distributions. A generalized expectation-maximization algorithm is developed for computing the penalized estimator. Efficacy of the proposed methodology and algorithm is demonstrated by simulated data.
In recent years, nonnormal distributions have received substantial interest in the statistics literature. The growing need for more flexible tools to analyze data that exhibit nonnormal features, including asymmetry, multimodality, and heavy tails, has led to intense development in nonnormal model-based methods. Many such distributions have been successfully applied to numerous datasets from a wide range of fields, including medical sciences, bioinformatics, environmetrics, engineering, economics, and financial sciences. Some recent applications of skew-normal and skew- models include [1–5].
The rich literature and active discussion of skew distributions as well as regression analysis of skew distributions were initiated by the pioneering work of , in which the univariate skew distribution was introduced. Following its generalization to the multivariate skew-normal distribution in , the number of contributions has grown rapidly. The concept of introducing additional parameters to regulate skewness in a distribution was subsequently extended to other parametric families. For a comprehensive survey of skew distributions, see, for example, the articles [4, 8–10].
Besides the skew-normal distribution, which plays a central role in these developments, the skew- distribution has also received much attention. Being a natural extension of the distribution, the skew- distribution retains reasonable tractability and is more robust against outliers than the skew-normal distribution.
On the other hand, the Lasso is a popular method for regression analysis that uses an penalty to achieve a spares solution. This idea has been broadly applied, for example, to generalized linear models by  and Cox’s proportional hazard models for survival data by . Inspired by an important work of , we will introduce a new approach to regression analysis based on a class skew distribution by taking advantage of sparse penalization methods. We propose to fit an penalized linear model with an penalty imposed on the location parameters of the skew distribution. Due to the sparse shrinkage property of the penalty, some components of regression coefficients are estimated by exact zero when the tuning parameter is properly chosen.
The rest of the paper is organized as follows. In Section 2, we introduce the skew-normal and skew- distribution and present some of their properties. Section 3 outlines the EM-type algorithm and a penalty estimate of the proposed model of the skew distribution. In Section 4, a method of choice of tuning parameter is developed. The methodology proposed is illustrated in Section 5 by analyzing a simulated study.
2. The Skew-Normal Distribution and the Skew- Distribution
2.1. The Skew-Normal Distribution
As developed by , a random variable follows a univariate skew-normal distribution with location parameter , scale parameter , and shape parameter , denoted by , if it has the density where and denote the standard normal density function and cumulative function, respectively. Note that if , the density of reduces to the density.
A stochastic representation of the distribution can be given by where and denotes the univariate truncated normal distribution with mean , variance in interval . , independent of , and ; hence .
There is an alternative definition as discussed by [2, 9], which can be viewed as a new parameterization of above parameters; that is, , ; furthermore, suppose and , independently; then (2) can be represented as which defines a univariate skew-normal distribution , where and , so (3) can be viewed as a reparametrization of (2). So it follows that the skew-normal distribution admits a convenient hierarchical characterization given by(i);(ii).
2.2. The Skew- Distribution
To achieve a higher degree of excess kurtosis and accommodate to data of a heavy tail, skew- distributions have been introduced by . A univariate random variable follows the skew- distribution, ; the density function reads where and and denote, respectively, the density function and distribution function of a standard Student’s distribution with degrees of freedom, and parameters , , and are the same as above. Clearly, if , it has the following stochastic representation: where and , independently and denotes the Gamma distribution with parameters and .
Similarly, the skew- distribution has the convolution-type representation as where , , independently, and and are the same as above. Meanwhile, the skew- distribution also has the hierarchical representation as(i);(ii);(iii).
3. Regression Model of Skew Distribution and EM Algorithm
In this section, we consider the linear regression model where the observation follows skew-normal distributions or skew- distributions. In general, a linear regression model is defined as or where are responses, parameter vector , is a vector of explanatory variable values, and the random error follows or , respectively, which corresponds to the regression model with an error distribution of which the mean is zero.
It follows that the log-likelihood function for original parameter or , given the observed sample from or , respectively, is given by where is the function of or in (1) or (4), according to the or distribution, respectively. Note that the Newton-Raphson method can be directly applied to get the ML estimate of the above log-likelihood, but in order to obtain a more efficient approach of estimation and variable selection of skew models, we discuss a more elaborate technique to find the ML estimate based on an EM algorithm.
Firstly, for skew-normal distributions, let ; then under the hierarchical representation of Section 2.1, it follows that the complete log-likelihood function of modified parameter associated with is where is a constant that is independent of . Then consider the penalized log-likelihood defined by where is a nonnegative penalty function and we use , which is the Lasso penalty, where is the tuning parameter not included in .
The Lasso estimator is then defined as ; that is,
Clearly, can be seen as “missing data”; in the following, we will derive a generalized expectation-maximization algorithm for computing the penalized estimator. It turns out that the penalized estimator can be computed by iterative Lasso-penalized least square.
Set as the -function over at iteration step and where denotes the conditional expectation of given .
At the -step, by using the properties of conditional expectation of the truncated normal distribution, we obtain where and .
Then, combining the -function and the Lasso penalty, we set as the modified -function.
At the -step update by maximizing over in a sequence of conditional maximization steps. Firstly, we get the following expression: then, we can obtain the closed expression of other parameters, So, the above procedures can be summarized as follows.Step 1:set initial values for , , , and tuning parameter .Step 2:compute the value of and at current iteration step as the “working observed value.”Step 3:solve the penalized least squares problem about parameter and obtain its new Lasso estimator.Step 4:compute the new estimator of and .Step 5:repeat steps 2–4 till convergence.
Secondly, for skew- distributions, let , , ; then, based on the hierarchical representation of Section 2.2, we can get the complete log-likelihood function associated with as
Similarly, setting then we can combine the -step, -step, and the Lasso estimator together as and parameter can be obtained via estimating equation as follows: where and , , , , and .
Similarly, we can also summarize the above procedures as follows.Step 1:set initial values for , , , , and tuning parameter .Step 2:compute the value of , and , at current iteration step as the “working observed value.”Step 3:solve the penalized least squares problem about parameter and obtain its new Lasso estimator.Step 4:compute the new estimates of and directive and then get the new estimates of .Step 5:get the tuning parameters by GCV.Step 6:repeat steps 2–5 till convergence.
4. Choice of Tuning Parameters
We now consider the choice of tuning parameters. In application, the cross-validation (CV), or generalized cross-validation (GCV), is often used for choosing tuning parameters. Following the examples of [11, 15], we develop a componentwise deviance-based GCV criterion for the proposed models.
Let be the EM estimates of parameters without penalty functions. For a given value of tuning parameter of , let be the maximum penalized likelihood estimates of the parameters of the model by fixing the rest of components of at . Denote the deviance function as Further, let be the second derivative of the log-likelihood function with respect to evaluated at . We define a GCV criterion of the model as where is the effective number of regression coefficients. It is given by where . The tuning parameter is chosen by minimizing .
5. Simulation Study
In order to investigate the experimental performance of our methodology, we undertake a simulation study to investigate the effect of estimation and variable selection. The dataset is generated as follows: let be covariate, where and are independent and distributed as uniform distribution . Moreover, assume response is formulated as where is distributed as the following five cases.
Case : ; Case : ; Case : ; Case : ; and Case : .
We take the sample size to be 100. The true values for are and the other components of are equal to zero. Further, let , , , , and .
Under above settings, we use three linear regression models based on normal, skew-normal, and skew- distribution to fit each dataset, respectively.
Table 1 shows the results of the simulation study, where No. denotes the number of nonzero components of , which can determine the effect of variable selection, and where AE denotes the average of absolute error; that is, , which can show the accuracy of the estimates, and the is the tuning parameter.
Examination of Table 1 reveals that when the true distribution is normal, the difference of the three models is very small and the coefficient can be found correctly; when the true distribution is nonnormal, the models based on normal fit the data badly; not only does the proportion of nonzero increase but also the average error becomes significant; for the cases of the true distribution being nonnormal, the models based on the skew distribution can improve the fitting effect better than the ones of the normal distribution.
Figures 1(a)–1(c) present the trace plots of coefficients versus tuning parameters: (a) Case fitted by normal distribution (N) model; (b) Case fitted by skew-normal distributions (SN) model; (c) Case fitted by skew- distributions (ST) model. Moreover, (d) shows profiles of the tuning parameter of (c).
From Figure 1(c), it is clear that the ST model is most appropriate for fitting this dataset.
In this paper, we have proposed a class of linear regression models based on the asymmetric distribution. An EM-type algorithm and the Lasso penalty estimates are developed by exploring the statistical properties of the model that can be implemented efficiently in types of software such as SAS, R, and Matlab. Meanwhile, we have shown that the penalized maximum likelihood estimation can be done via a generalized EM algorithm that is equivalent to iterative penalized least square for some parameters. We have observed that if the data are generated from a true model of the skew distribution, the penalized models based on EM algorithm not only identify zero coefficient but also are significantly more accurate than the general MLE models.
On the other hand, due to recent advances in computational technology, it is worthwhile to carry out Bayesian treatments via Markov chain Monte Carlo (MCMC) sampling methods in the context of this paper.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
This work was supported by the National Natural Science Foundation of China under Grants 11171105 and 61273021 and in part by the Natural Science Foundation Project of CQ cstc2013jjB40008.
- T. I. Lin, J. C. Lee, and W. J. Hsieh, “Robust mixture modeling using the skew distribution,” Statistics and Computing, vol. 17, no. 2, pp. 81–92, 2007.
- S. K. Sahu, D. K. Dey, and M. D. Branco, “A new class of multivariate skew distributions with applications to Bayesian regression models,” The Canadian Journal of Statistics, vol. 31, no. 2, pp. 129–150, 2003.
- V. H. Lachos, P. Ghosh, and R. B. Arellano-Valle, “Likelihood based inference for skew-normal independent linear mixed models,” Statistica Sinica, vol. 20, no. 1, pp. 303–322, 2010.
- R. B. Arellano-Valle and A. Azzalini, “On the unification of families of skew-normal distributions,” Scandinavian Journal of Statistics: Theory and Applications, vol. 33, no. 3, pp. 561–574, 2006.
- I. Vrbik and P. D. McNicholas, “Analytic calculations for the EM algorithm for multivariate skew- mixture models,” Statistics & Probability Letters, vol. 82, no. 6, pp. 1169–1174, 2012.
- A. Azzalini, “A class of distributions which includes the normal ones,” Scandinavian Journal of Statistics: Theory and Applications, vol. 12, no. 2, pp. 171–178, 1985.
- A. Azzalini and A. Dalla-Valle, “The multivariate skew-normal distribution,” Biometrika, vol. 83, no. 4, pp. 715–726, 1996.
- A. Azzalini, “The skew-normal distribution and related multivariate families,” Scandinavian Journal of Statistics: Theory and Applications, vol. 32, no. 2, pp. 159–200, 2005.
- A. Gupta, “Multivariate skew t-distribution,” Statistics, vol. 37, no. 4, pp. 359–363, 2003.
- M. Genton, Skew-Elliptical Distributions and Their Applications: A Journey Beyond Normality, Chapman & Hall, Boca Raton, Fla, USA, 2004.
- R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B Methodological, vol. 58, no. 1, pp. 267–288, 1996.
- R. Tibshirani, “The lasso method for variable selection in the Cox’s model,” Statistics in Medicine, vol. 16, no. 2, pp. 385–395, 1997.
- J. Friedman, T. Hastie, and R. Tibshirani, “Sparse inverse covariance estimation with the graphical lasso,” Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.
- A. Azzalini and A. Capitanio, “Distributions generated by perturbation of symmetry with emphasis on a multivariate skew t-distribution,” Journal of the Royal Statistical Society: Series B Methodological, vol. 65, no. 2, pp. 367–389, 2003.
- J. Fan and R. Li, “Variable selection via nonconcave penalized likelihood and its oracle properties,” Journal of the American Statistical Association, vol. 96, no. 456, pp. 1348–1360, 2001.
Copyright © 2014 Xuedong Chen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.