Research Article  Open Access
Penalized Maximum Likelihood Method to a Class of Skewness Data Analysis
Abstract
An extension of standard likelihood-based estimation and variable selection procedures for linear regression models under the skew-normal distribution or the skew-$t$ distribution is developed. This novel class of models provides a useful generalization of symmetrical linear regression models, since the random-error distributions cover both symmetric as well as asymmetric and heavy-tailed distributions. A generalized expectation-maximization (EM) algorithm is developed for computing the penalized estimator. The efficacy of the proposed methodology and algorithm is demonstrated on simulated data.
1. Introduction
In recent years, non-normal distributions have received substantial interest in the statistics literature. The growing need for more flexible tools to analyze data that exhibit non-normal features, including asymmetry, multimodality, and heavy tails, has led to intense development of non-normal model-based methods. Many such distributions have been successfully applied to numerous datasets from a wide range of fields, including the medical sciences, bioinformatics, environmetrics, engineering, economics, and the financial sciences. Some recent applications of skew-normal and skew-$t$ models include [1–5].
The rich literature and active discussion of skew distributions, as well as regression analysis under them, were initiated by the pioneering work of [6], in which the univariate skew-normal distribution was introduced. Following its generalization to the multivariate skew-normal distribution in [7], the number of contributions has grown rapidly. The concept of introducing additional parameters to regulate skewness in a distribution was subsequently extended to other parametric families. For a comprehensive survey of skew distributions, see, for example, the articles [4, 8–10].
Besides the skew-normal distribution, which plays a central role in these developments, the skew-$t$ distribution has also received much attention. Being a natural extension of the $t$ distribution, the skew-$t$ distribution retains reasonable tractability and is more robust against outliers than the skew-normal distribution.
On the other hand, the Lasso is a popular method for regression analysis that uses an $\ell_1$ penalty to achieve a sparse solution. This idea has been broadly applied, for example, to generalized linear models by [11] and to Cox's proportional hazards model for survival data by [12]. Inspired by the important work of [13], we introduce a new approach to regression analysis based on a class of skew-$t$ distributions by taking advantage of sparse penalization methods. We propose to fit an $\ell_1$-penalized linear model, with the penalty imposed on the regression coefficients in the location parameter of the skew-$t$ distribution. Due to the sparse shrinkage property of the $\ell_1$ penalty, some components of the regression coefficients are estimated as exactly zero when the tuning parameter is properly chosen.
The rest of the paper is organized as follows. In Section 2, we introduce the skew-normal and skew-$t$ distributions and present some of their properties. Section 3 outlines the EM-type algorithm and the $\ell_1$-penalized estimator of the proposed model under the skew-$t$ distribution. In Section 4, a method for choosing the tuning parameter is developed. The proposed methodology is illustrated in Section 5 through a simulation study.
2. The Skew-Normal Distribution and the Skew-t Distribution
2.1. The Skew-Normal Distribution
As developed by [6], a random variable $Y$ follows a univariate skew-normal distribution with location parameter $\mu$, scale parameter $\sigma$, and shape parameter $\lambda$, denoted by $Y \sim \mathrm{SN}(\mu, \sigma^2, \lambda)$, if it has the density
$$f(y) = \frac{2}{\sigma}\,\phi\!\left(\frac{y-\mu}{\sigma}\right)\Phi\!\left(\lambda\,\frac{y-\mu}{\sigma}\right), \quad (1)$$
where $\phi(\cdot)$ and $\Phi(\cdot)$ denote the standard normal density function and cumulative distribution function, respectively. Note that if $\lambda = 0$, the density of $Y$ reduces to the $N(\mu, \sigma^2)$ density.
A stochastic representation of the $\mathrm{SN}(\mu, \sigma^2, \lambda)$ distribution can be given by
$$Y = \mu + \sigma\left(\delta T + \sqrt{1-\delta^2}\,U\right), \quad (2)$$
where $\delta = \lambda/\sqrt{1+\lambda^2}$, $T \sim \mathrm{TN}(0, 1; (0,\infty))$ denotes the univariate truncated normal distribution with mean $0$ and variance $1$ on the interval $(0,\infty)$, and $U \sim N(0,1)$, independent of $T$; hence $Y \sim \mathrm{SN}(\mu, \sigma^2, \lambda)$.
There is an alternative definition, as discussed by [2, 9], which can be viewed as a reparametrization of the above parameters; that is, $\psi = \sigma\delta$ and $\sigma_e^2 = \sigma^2(1-\delta^2)$. Furthermore, suppose $T \sim \mathrm{TN}(0, 1; (0,\infty))$ and $e \sim N(0, \sigma_e^2)$, independently; then (2) can be represented as
$$Y = \mu + \psi T + e, \quad (3)$$
which defines a univariate skew-normal distribution in the new parameters, where $\sigma^2 = \sigma_e^2 + \psi^2$ and $\lambda = \psi/\sigma_e$, so (3) can be viewed as a reparametrization of (2). It follows that the skew-normal distribution admits a convenient hierarchical characterization given by (i) $Y \mid T = t \sim N(\mu + \psi t, \sigma_e^2)$; (ii) $T \sim \mathrm{TN}(0, 1; (0,\infty))$.
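As a quick, concrete check of the stochastic representation above, the following Python sketch draws skew-normal variates directly from (2). The function name and defaults are ours, not the paper's:

```python
import numpy as np

def rskewnorm(n, mu=0.0, sigma=1.0, lam=0.0, rng=None):
    """Draw n samples from SN(mu, sigma^2, lam) via the representation
    Y = mu + sigma*(delta*|U0| + sqrt(1-delta^2)*U1), where U0, U1 are
    independent standard normals and delta = lam/sqrt(1+lam^2);
    |U0| plays the role of the truncated-normal component T."""
    rng = np.random.default_rng(rng)
    delta = lam / np.sqrt(1.0 + lam**2)
    u0 = np.abs(rng.standard_normal(n))   # half-normal component, drives skewness
    u1 = rng.standard_normal(n)           # symmetric component
    return mu + sigma * (delta * u0 + np.sqrt(1.0 - delta**2) * u1)
```

For $\lambda = 0$ this reduces to ordinary $N(\mu, \sigma^2)$ sampling, and the sample mean approaches the known value $\mu + \sigma\delta\sqrt{2/\pi}$.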
2.2. The Skew-t Distribution
To achieve a higher degree of excess kurtosis and accommodate heavy-tailed data, the skew-$t$ distribution was introduced by [14]. A univariate random variable $Y$ follows the skew-$t$ distribution, $Y \sim \mathrm{ST}(\mu, \sigma^2, \lambda, \nu)$, if its density reads
$$f(y) = \frac{2}{\sigma}\, t_\nu(z)\, T_{\nu+1}\!\left(\lambda z \sqrt{\frac{\nu+1}{\nu+z^2}}\right), \quad z = \frac{y-\mu}{\sigma}, \quad (4)$$
where $t_\nu(\cdot)$ and $T_\nu(\cdot)$ denote, respectively, the density function and distribution function of a standard Student's $t$ distribution with $\nu$ degrees of freedom, and the parameters $\mu$, $\sigma$, and $\lambda$ are the same as above. Clearly, as $\nu \to \infty$, the skew-$t$ distribution reduces to the skew-normal. If $Y \sim \mathrm{ST}(\mu, \sigma^2, \lambda, \nu)$, it has the following stochastic representation:
$$Y = \mu + Z/\sqrt{W},$$
where $Z \sim \mathrm{SN}(0, \sigma^2, \lambda)$ and $W \sim \Gamma(\nu/2, \nu/2)$, independently, and $\Gamma(\alpha, \beta)$ denotes the Gamma distribution with parameters $\alpha$ and $\beta$.
Similarly, the skew-$t$ distribution has the convolution-type representation
$$Y = \mu + W^{-1/2}\left(\psi T + e\right),$$
where $T \sim \mathrm{TN}(0, 1; (0,\infty))$, $e \sim N(0, \sigma_e^2)$, and $W \sim \Gamma(\nu/2, \nu/2)$, independently, and $\psi$ and $\sigma_e^2$ are the same as above. Meanwhile, the skew-$t$ distribution also has the hierarchical representation (i) $Y \mid T = t, W = w \sim N(\mu + \psi t, \sigma_e^2/w)$; (ii) $T \mid W = w \sim \mathrm{TN}(0, 1/w; (0,\infty))$; (iii) $W \sim \Gamma(\nu/2, \nu/2)$.
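The scale-mixture representation $Y = \mu + Z/\sqrt{W}$ translates directly into a sampler; the sketch below (our naming) builds a skew-$t$ draw from a skew-normal numerator and a Gamma mixing variable:

```python
import numpy as np

def rskewt(n, mu=0.0, sigma=1.0, lam=0.0, nu=5.0, rng=None):
    """Skew-t draws via the scale mixture Y = mu + Z/sqrt(W):
    Z ~ SN(0, sigma^2, lam) and W ~ Gamma(nu/2, rate=nu/2), independently.
    numpy's gamma takes a scale argument, so scale = 2/nu gives rate nu/2."""
    rng = np.random.default_rng(rng)
    delta = lam / np.sqrt(1.0 + lam**2)
    z = sigma * (delta * np.abs(rng.standard_normal(n))
                 + np.sqrt(1.0 - delta**2) * rng.standard_normal(n))
    w = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)
    return mu + z / np.sqrt(w)
```

For $\nu > 1$ the mean is $\mu + \sigma\delta\sqrt{\nu/\pi}\,\Gamma((\nu-1)/2)/\Gamma(\nu/2)$, which provides an easy sanity check on large samples.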
3. Regression Model of the Skew-t Distribution and EM Algorithm
In this section, we consider the linear regression model in which the observations follow skew-normal or skew-$t$ distributions. In general, the linear regression model is defined as
$$y_i = \mathbf{x}_i^{\top}\boldsymbol\beta + \varepsilon_i, \quad i = 1, \ldots, n,$$
where the $y_i$ are responses, $\boldsymbol\beta = (\beta_1, \ldots, \beta_p)^{\top}$ is the parameter vector, $\mathbf{x}_i$ is a vector of explanatory variable values, and the random error $\varepsilon_i$ follows a skew-normal or skew-$t$ distribution, respectively, with the location chosen so that the error distribution has mean zero.
It follows that the log-likelihood function for the original parameter $\boldsymbol\theta = (\boldsymbol\beta^{\top}, \sigma^2, \lambda)^{\top}$ or $\boldsymbol\theta = (\boldsymbol\beta^{\top}, \sigma^2, \lambda, \nu)^{\top}$, given an observed sample $y_1, \ldots, y_n$ from the SN or ST model, respectively, is
$$\ell(\boldsymbol\theta) = \sum_{i=1}^{n} \log f(y_i; \boldsymbol\theta),$$
where $f$ is the density in (1) or (4), according to the SN or ST distribution, respectively. Note that the Newton-Raphson method can be applied directly to maximize this log-likelihood, but in order to obtain a more efficient approach to estimation and variable selection in skew models, we develop a more elaborate technique for finding the ML estimate based on an EM algorithm.
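The observed-data log-likelihood is easy to evaluate directly; the following Python sketch (function name ours) computes it for the skew-normal case, which is useful both inside a Newton-Raphson routine and for monitoring the monotone increase of an EM run:

```python
import numpy as np
from scipy.stats import norm

def sn_loglik(y, X, beta, sigma, lam):
    """Observed-data log-likelihood of the skew-normal regression model:
    l(theta) = sum_i log[(2/sigma) * phi(z_i) * Phi(lam * z_i)],
    with z_i = (y_i - x_i'beta)/sigma. Direct evaluation sketch;
    norm.logcdf keeps the skewing factor numerically stable in the tails."""
    z = (y - X @ beta) / sigma
    return np.sum(np.log(2.0 / sigma) + norm.logpdf(z) + norm.logcdf(lam * z))
```

When $\lambda = 0$ the skewing factor contributes $\log\Phi(0) = -\log 2$, exactly cancelling the leading $\log 2$, so the expression reduces to the ordinary normal log-likelihood.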
Firstly, for skew-normal errors, let $\mathbf{t} = (t_1, \ldots, t_n)^{\top}$ denote the latent variables in (3); then, under the hierarchical representation of Section 2.1, the complete log-likelihood function of the modified parameter $\boldsymbol\theta = (\boldsymbol\beta^{\top}, \psi, \sigma_e^2)^{\top}$ associated with $(\mathbf{y}, \mathbf{t})$ is
$$\ell_c(\boldsymbol\theta) = -\frac{n}{2}\log\sigma_e^2 - \frac{1}{2\sigma_e^2}\sum_{i=1}^{n}\left(y_i - \mathbf{x}_i^{\top}\boldsymbol\beta - \psi t_i\right)^2 - \frac{1}{2}\sum_{i=1}^{n} t_i^2 + c,$$
where $c$ is a constant that is independent of $\boldsymbol\theta$. Then consider the penalized log-likelihood defined by
$$\ell_p(\boldsymbol\theta) = \ell(\boldsymbol\theta) - p_{\gamma}(\boldsymbol\beta),$$
where $p_{\gamma}$ is a nonnegative penalty function; we use $p_{\gamma}(\boldsymbol\beta) = \gamma\sum_{j=1}^{p}|\beta_j|$, the Lasso penalty, where $\gamma$ is the tuning parameter, not included in $\boldsymbol\theta$.
The Lasso estimator is then defined as the maximizer of the penalized log-likelihood; that is, $\hat{\boldsymbol\theta} = \arg\max_{\boldsymbol\theta} \ell_p(\boldsymbol\theta)$.
Clearly, $\mathbf{t}$ can be seen as "missing data"; in the following, we derive a generalized expectation-maximization algorithm for computing the penalized estimator. It turns out that the penalized estimator can be computed by iterative Lasso-penalized least squares.
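To make the claim concrete, each inner Lasso-penalized least-squares subproblem can be solved by coordinate descent with soft-thresholding; the following is a minimal sketch (function name and iteration count are ours, not the paper's):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent solver for
        min_beta 0.5*||y - X beta||^2 + lam*||beta||_1,
    the Lasso-penalized least-squares problem arising in the M-step.
    Soft-thresholding gives each coordinate update in closed form."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)            # per-column sums of squares
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding j
            rho = X[:, j] @ r_j
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
    return beta
```

With `lam = 0` this converges to the ordinary least-squares solution; for a sufficiently large `lam`, coefficients whose gradient contribution stays below the threshold are set exactly to zero, which is the sparse shrinkage behavior exploited in this paper.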
Set $Q(\boldsymbol\theta \mid \boldsymbol\theta^{(k)}) = E\left[\ell_c(\boldsymbol\theta) \mid \mathbf{y}, \boldsymbol\theta^{(k)}\right]$ as the $Q$-function over $\boldsymbol\theta$ at iteration step $k$, where $E[\,\cdot \mid \mathbf{y}, \boldsymbol\theta^{(k)}]$ denotes the conditional expectation given the observed data $\mathbf{y}$ and the current estimate $\boldsymbol\theta^{(k)}$.
At the E-step, by using the properties of the conditional expectation of the truncated normal distribution, we obtain
$$\hat t_{1i} = E\left[t_i \mid y_i, \boldsymbol\theta^{(k)}\right] = m_i + M\,\frac{\phi(m_i/M)}{\Phi(m_i/M)}, \qquad \hat t_{2i} = E\left[t_i^2 \mid y_i, \boldsymbol\theta^{(k)}\right] = m_i^2 + M^2 + M m_i\,\frac{\phi(m_i/M)}{\Phi(m_i/M)},$$
where $M^2 = \sigma_e^2/(\sigma_e^2 + \psi^2)$ and $m_i = \psi\left(y_i - \mathbf{x}_i^{\top}\boldsymbol\beta\right)/(\sigma_e^2 + \psi^2)$, all evaluated at $\boldsymbol\theta^{(k)}$.
Then, combining the $Q$-function and the Lasso penalty with tuning parameter $\gamma$, we set $Q_p(\boldsymbol\theta \mid \boldsymbol\theta^{(k)}) = Q(\boldsymbol\theta \mid \boldsymbol\theta^{(k)}) - \gamma\sum_{j=1}^{p}|\beta_j|$ as the modified $Q$-function.
At the M-step, update $\boldsymbol\theta$ by maximizing $Q_p$ over $\boldsymbol\theta$ in a sequence of conditional maximization steps. Firstly, the update of $\boldsymbol\beta$ solves the Lasso-penalized least squares problem
$$\boldsymbol\beta^{(k+1)} = \arg\min_{\boldsymbol\beta}\ \frac{1}{2\sigma_e^{2(k)}}\sum_{i=1}^{n}\left(y_i - \psi^{(k)}\hat t_{1i} - \mathbf{x}_i^{\top}\boldsymbol\beta\right)^2 + \gamma\sum_{j=1}^{p}|\beta_j|;$$
then we obtain closed-form expressions for the other parameters,
$$\psi^{(k+1)} = \frac{\sum_{i=1}^{n}\hat t_{1i}\left(y_i - \mathbf{x}_i^{\top}\boldsymbol\beta^{(k+1)}\right)}{\sum_{i=1}^{n}\hat t_{2i}}, \qquad \sigma_e^{2(k+1)} = \frac{1}{n}\sum_{i=1}^{n}\left[\left(y_i - \mathbf{x}_i^{\top}\boldsymbol\beta^{(k+1)}\right)^2 - 2\psi^{(k+1)}\hat t_{1i}\left(y_i - \mathbf{x}_i^{\top}\boldsymbol\beta^{(k+1)}\right) + \left(\psi^{(k+1)}\right)^2\hat t_{2i}\right].$$
The above procedure can be summarized as follows.
Step 1: set initial values for $\boldsymbol\beta$, $\psi$, $\sigma_e^2$, and the tuning parameter $\gamma$.
Step 2: compute the values of $\hat t_{1i}$ and $\hat t_{2i}$ at the current iteration step as the "working observed values."
Step 3: solve the penalized least squares problem for the parameter $\boldsymbol\beta$ and obtain its new Lasso estimate.
Step 4: compute the new estimates of $\psi$ and $\sigma_e^2$.
Step 5: repeat Steps 2–4 until convergence.
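The whole generalized-EM loop for the skew-normal case can be sketched in Python as follows, assuming the reparametrized hierarchy $y_i = \mathbf{x}_i^{\top}\boldsymbol\beta + \psi t_i + e_i$ with $t_i \sim \mathrm{TN}(0,1;(0,\infty))$ and $e_i \sim N(0, \sigma_e^2)$. All names are illustrative, and for simplicity the penalty level is held at a single constant rather than rescaled by $\sigma_e^2$ at each pass:

```python
import numpy as np
from scipy.stats import norm

def em_sn_lasso(X, y, lam_pen, n_em=50, n_cd=50):
    """Generalized-EM sketch for Lasso-penalized skew-normal regression.
    E-step: truncated-normal moments of the latent t_i given y_i.
    M-step: coordinate-descent Lasso on the working response, then
    closed-form updates for psi and the error variance s2."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS start
    psi, s2 = 0.1, np.var(y - X @ beta)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_em):
        mu = X @ beta
        # E-step: t_i | y_i is TN(m_i, M^2; (0, inf))
        M2 = s2 / (s2 + psi ** 2)
        m = psi * (y - mu) / (s2 + psi ** 2)
        M = np.sqrt(M2)
        r = norm.pdf(m / M) / norm.cdf(m / M)
        t1 = m + M * r                 # E[t_i | y_i]
        t2 = m ** 2 + M2 + M * m * r   # E[t_i^2 | y_i]
        # M-step part 1: Lasso via soft-thresholding coordinate descent
        z = y - psi * t1               # working response
        for _ in range(n_cd):
            for j in range(p):
                rho = X[:, j] @ (z - X @ beta + X[:, j] * beta[j])
                beta[j] = np.sign(rho) * max(abs(rho) - lam_pen, 0.0) / col_ss[j]
        # M-step part 2: closed-form updates for psi and s2
        resid = y - X @ beta
        psi = (t1 @ resid) / t2.sum()
        s2 = np.mean(resid ** 2 - 2 * psi * t1 * resid + psi ** 2 * t2)
    return beta, psi, s2
```

On data generated from the model, the loop recovers the nonzero coefficients, shrinks the spurious ones toward zero, and estimates the skewness parameter $\psi$ with the correct sign.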
Secondly, for skew-$t$ errors, let $\mathbf{t} = (t_1, \ldots, t_n)^{\top}$, $\mathbf{w} = (w_1, \ldots, w_n)^{\top}$, and $\boldsymbol\theta = (\boldsymbol\beta^{\top}, \psi, \sigma_e^2, \nu)^{\top}$; then, based on the hierarchical representation of Section 2.2, the complete log-likelihood function associated with $(\mathbf{y}, \mathbf{t}, \mathbf{w})$ is
$$\ell_c(\boldsymbol\theta) = \sum_{i=1}^{n}\left[-\frac{1}{2}\log\sigma_e^2 - \frac{w_i\left(y_i - \mathbf{x}_i^{\top}\boldsymbol\beta - \psi t_i\right)^2}{2\sigma_e^2} - \frac{w_i t_i^2}{2} + \frac{\nu}{2}\log\frac{\nu}{2} - \log\Gamma\!\left(\frac{\nu}{2}\right) + \frac{\nu}{2}\log w_i - \frac{\nu}{2}w_i\right] + c,$$
where $c$ is a constant independent of $\boldsymbol\theta$.
Similarly, setting $\hat w_i = E[w_i \mid y_i, \boldsymbol\theta^{(k)}]$, $\hat s_{1i} = E[w_i t_i \mid y_i, \boldsymbol\theta^{(k)}]$, and $\hat s_{2i} = E[w_i t_i^2 \mid y_i, \boldsymbol\theta^{(k)}]$, we can combine the E-step, the M-step, and the Lasso estimator together as before: $\boldsymbol\beta$ is updated by a Lasso-penalized weighted least squares problem with weights $\hat w_i$, and $\psi$ and $\sigma_e^2$ have closed-form updates in terms of $\hat w_i$, $\hat s_{1i}$, and $\hat s_{2i}$. The parameter $\nu$ can then be obtained via the estimating equation
$$\log\frac{\nu}{2} - \mathrm{DG}\!\left(\frac{\nu}{2}\right) + 1 + \frac{1}{n}\sum_{i=1}^{n}\left(\widehat{\log w_i} - \hat w_i\right) = 0,$$
where $\mathrm{DG}(\cdot)$ denotes the digamma function and $\widehat{\log w_i} = E[\log w_i \mid y_i, \boldsymbol\theta^{(k)}]$.
Similarly, we can summarize the above procedure as follows.
Step 1: set initial values for $\boldsymbol\beta$, $\psi$, $\sigma_e^2$, $\nu$, and the tuning parameter $\gamma$.
Step 2: compute the values of $\hat w_i$, $\hat s_{1i}$, and $\hat s_{2i}$ at the current iteration step as the "working observed values."
Step 3: solve the penalized least squares problem for the parameter $\boldsymbol\beta$ and obtain its new Lasso estimate.
Step 4: compute the new estimates of $\psi$ and $\sigma_e^2$, and then obtain the new estimate of $\nu$ from its estimating equation.
Step 5: choose the tuning parameter by GCV.
Step 6: repeat Steps 2–5 until convergence.
4. Choice of Tuning Parameters
We now consider the choice of tuning parameters. In applications, cross-validation (CV) or generalized cross-validation (GCV) is often used for choosing tuning parameters. Following the examples of [11, 15], we develop a component-wise deviance-based GCV criterion for the proposed models.
Let $\tilde{\boldsymbol\theta}$ be the EM estimates of the parameters without penalty functions. For a given value $\gamma$ of the tuning parameter, let $\hat{\boldsymbol\beta}_{\gamma}$ be the maximum penalized likelihood estimate of $\boldsymbol\beta$ obtained by fixing the remaining components of $\boldsymbol\theta$ at $\tilde{\boldsymbol\theta}$. Denote the deviance function by
$$D(\gamma) = 2\left\{\ell(\tilde{\boldsymbol\theta}) - \ell(\hat{\boldsymbol\beta}_{\gamma}, \tilde{\boldsymbol\theta})\right\}.$$
Further, let $\nabla^2\ell(\hat{\boldsymbol\beta}_{\gamma})$ be the second derivative of the log-likelihood function with respect to $\boldsymbol\beta$ evaluated at $\hat{\boldsymbol\beta}_{\gamma}$. We define a GCV criterion of the model as
$$\mathrm{GCV}(\gamma) = \frac{D(\gamma)}{n\left\{1 - e(\gamma)/n\right\}^2},$$
where $e(\gamma)$ is the effective number of regression coefficients. Following [15], it is given by
$$e(\gamma) = \mathrm{tr}\left\{\left[\nabla^2\ell(\hat{\boldsymbol\beta}_{\gamma}) + n\,\Sigma_{\gamma}(\hat{\boldsymbol\beta}_{\gamma})\right]^{-1}\nabla^2\ell(\hat{\boldsymbol\beta}_{\gamma})\right\},$$
where $\Sigma_{\gamma}(\hat{\boldsymbol\beta}) = \mathrm{diag}\left\{p'_{\gamma}(|\hat\beta_1|)/|\hat\beta_1|, \ldots, p'_{\gamma}(|\hat\beta_p|)/|\hat\beta_p|\right\}$. The tuning parameter $\gamma$ is chosen by minimizing $\mathrm{GCV}(\gamma)$.
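The exact displayed formulas in this section did not survive extraction, so the Python sketch below is a generic Fan-and-Li-style stand-in specialized to Gaussian least squares with the Lasso penalty; the function name and thresholds are ours:

```python
import numpy as np

def gcv_lasso(X, y, beta_hat, lam):
    """Sketch of a GCV criterion in the style of [15]: effective df
    e = tr{ Xa (Xa'Xa + lam*D)^(-1) Xa' } over the active set, with
    D = diag(1/|beta_j|); the deviance is taken as the residual sum
    of squares under a Gaussian working model."""
    n = len(y)
    dev = np.sum((y - X @ beta_hat) ** 2)
    active = np.abs(beta_hat) > 1e-8
    if not active.any():                      # null model: no active coefficients
        return dev / n
    Xa = X[:, active]
    D = np.diag(1.0 / np.abs(beta_hat[active]))
    H = Xa @ np.linalg.solve(Xa.T @ Xa + lam * D, Xa.T)
    e = np.trace(H)                           # effective number of coefficients
    return dev / (n * (1.0 - e / n) ** 2)
```

In practice one evaluates this criterion over a grid of `lam` values inside Step 5 of the algorithm and keeps the minimizer.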
5. Simulation Study
In order to investigate the experimental performance of our methodology, we undertake a simulation study to examine the effects of estimation and variable selection. The dataset is generated as follows: let $\mathbf{x}_i$ be the covariate vector, whose components are independent and follow a uniform distribution. Moreover, assume the response is formulated as $y_i = \mathbf{x}_i^{\top}\boldsymbol\beta + \varepsilon_i$, where $\varepsilon_i$ is distributed according to the following five cases.
Case 1: ; Case 2: ; Case 3: ; Case 4: ; and Case 5: .
We take the sample size to be $n = 100$. The true values of several components of $\boldsymbol\beta$ are nonzero, and the other components of $\boldsymbol\beta$ are equal to zero; the remaining parameters of the error distributions are held fixed across cases.
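A data-generating routine in the spirit of this design can be sketched as follows. The exact five case specifications and the true $\boldsymbol\beta$ did not survive extraction, so the error laws, the $U(-1,1)$ covariates, and the coefficient values below are illustrative stand-ins, not the paper's settings:

```python
import math
import numpy as np

def make_dataset(case, n=100, p=8, rng=None):
    """Generate one simulated regression dataset with normal, skew-normal,
    or skew-t errors (shape lambda = 3, nu = 5 for the skew-t case), each
    centred so the error has mean zero. Illustrative stand-in settings."""
    rng = np.random.default_rng(rng)
    X = rng.uniform(-1.0, 1.0, size=(n, p))
    beta = np.zeros(p)
    beta[:3] = [3.0, 1.5, 2.0]                 # a few nonzero, the rest exactly zero
    d = 3.0 / math.sqrt(10.0)                  # delta for shape lambda = 3
    if case == "normal":
        eps = rng.standard_normal(n)
    elif case == "skew-normal":                # SN(0, 1, 3), centred to mean zero
        eps = d * np.abs(rng.standard_normal(n)) + math.sqrt(1 - d * d) * rng.standard_normal(n)
        eps -= d * math.sqrt(2.0 / math.pi)
    else:                                      # "skew-t": SN numerator over sqrt(Gamma) mixing
        nu = 5.0
        z = d * np.abs(rng.standard_normal(n)) + math.sqrt(1 - d * d) * rng.standard_normal(n)
        w = rng.gamma(nu / 2.0, scale=2.0 / nu, size=n)
        eps = z / np.sqrt(w)
        eps -= d * math.sqrt(nu / math.pi) * math.gamma((nu - 1) / 2.0) / math.gamma(nu / 2.0)
    return X, X @ beta + eps, beta
```

Fitting the three candidate models to datasets from each case then reproduces the kind of comparison reported in Table 1.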
Under the above settings, we fit three linear regression models, based on the normal, skew-normal, and skew-$t$ distributions, to each dataset.
Table 1 shows the results of the simulation study, where No. denotes the number of nonzero components of $\hat{\boldsymbol\beta}$, which reflects the effect of variable selection; AE denotes the average absolute error, $\mathrm{AE} = \frac{1}{p}\sum_{j=1}^{p}|\hat\beta_j - \beta_j|$, which measures the accuracy of the estimates; and $\gamma$ is the tuning parameter.

Examination of Table 1 reveals that when the true distribution is normal, the differences among the three models are very small and the nonzero coefficients are identified correctly. When the true distribution is non-normal, the normal-based models fit the data badly: not only does the number of spuriously nonzero coefficients increase, but the average error also becomes substantial. In those cases, the models based on the skew-$t$ distribution improve the fit relative to the normal-based ones.
Figures 1(a)–1(c) present the trace plots of the coefficients versus the tuning parameter: (a) a case fitted by the normal (N) model; (b) a case fitted by the skew-normal (SN) model; (c) a case fitted by the skew-$t$ (ST) model. Moreover, (d) shows the tuning-parameter profile for (c).
From Figure 1(c), it is clear that the ST model is most appropriate for fitting this dataset.
6. Conclusion
In this paper, we have proposed a class of linear regression models based on asymmetric distributions. An EM-type algorithm and Lasso-penalized estimates are developed by exploiting the statistical properties of the model, and they can be implemented efficiently in standard software such as SAS, R, and Matlab. Meanwhile, we have shown that penalized maximum likelihood estimation can be carried out via a generalized EM algorithm that is equivalent to iterative penalized least squares for some parameters. We have observed that if the data are generated from a true model of the skew-$t$ distribution, the penalized models based on the EM algorithm not only identify the zero coefficients but also are significantly more accurate than the general MLE models.
On the other hand, due to recent advances in computational technology, it is worthwhile to carry out Bayesian treatments via Markov chain Monte Carlo (MCMC) sampling methods in the context of this paper.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grants 11171105 and 61273021 and in part by the Natural Science Foundation Project of CQ cstc2013jjB40008.
References
[1] T. I. Lin, J. C. Lee, and W. J. Hsieh, "Robust mixture modeling using the skew $t$ distribution," Statistics and Computing, vol. 17, no. 2, pp. 81–92, 2007.
[2] S. K. Sahu, D. K. Dey, and M. D. Branco, "A new class of multivariate skew distributions with applications to Bayesian regression models," The Canadian Journal of Statistics, vol. 31, no. 2, pp. 129–150, 2003.
[3] V. H. Lachos, P. Ghosh, and R. B. Arellano-Valle, "Likelihood based inference for skew-normal independent linear mixed models," Statistica Sinica, vol. 20, no. 1, pp. 303–322, 2010.
[4] R. B. Arellano-Valle and A. Azzalini, "On the unification of families of skew-normal distributions," Scandinavian Journal of Statistics: Theory and Applications, vol. 33, no. 3, pp. 561–574, 2006.
[5] I. Vrbik and P. D. McNicholas, "Analytic calculations for the EM algorithm for multivariate skew-$t$ mixture models," Statistics & Probability Letters, vol. 82, no. 6, pp. 1169–1174, 2012.
[6] A. Azzalini, "A class of distributions which includes the normal ones," Scandinavian Journal of Statistics: Theory and Applications, vol. 12, no. 2, pp. 171–178, 1985.
[7] A. Azzalini and A. Dalla Valle, "The multivariate skew-normal distribution," Biometrika, vol. 83, no. 4, pp. 715–726, 1996.
[8] A. Azzalini, "The skew-normal distribution and related multivariate families," Scandinavian Journal of Statistics: Theory and Applications, vol. 32, no. 2, pp. 159–200, 2005.
[9] A. Gupta, "Multivariate skew t-distribution," Statistics, vol. 37, no. 4, pp. 359–363, 2003.
[10] M. Genton, Skew-Elliptical Distributions and Their Applications: A Journey Beyond Normality, Chapman & Hall, Boca Raton, Fla, USA, 2004.
[11] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society: Series B, vol. 58, no. 1, pp. 267–288, 1996.
[12] R. Tibshirani, "The lasso method for variable selection in the Cox model," Statistics in Medicine, vol. 16, no. 2, pp. 385–395, 1997.
[13] J. Friedman, T. Hastie, and R. Tibshirani, "Sparse inverse covariance estimation with the graphical lasso," Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.
[14] A. Azzalini and A. Capitanio, "Distributions generated by perturbation of symmetry with emphasis on a multivariate skew t-distribution," Journal of the Royal Statistical Society: Series B, vol. 65, no. 2, pp. 367–389, 2003.
[15] J. Fan and R. Li, "Variable selection via nonconcave penalized likelihood and its oracle properties," Journal of the American Statistical Association, vol. 96, no. 456, pp. 1348–1360, 2001.
Copyright
Copyright © 2014 Xuedong Chen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.