Research Article  Open Access
Prithish Banerjee, Broti Garai, Himel Mallick, Shrabanti Chowdhury, Saptarshi Chatterjee, "A Note on the Adaptive LASSO for ZeroInflated Poisson Regression", Journal of Probability and Statistics, vol. 2018, Article ID 2834183, 9 pages, 2018. https://doi.org/10.1155/2018/2834183
A Note on the Adaptive LASSO for ZeroInflated Poisson Regression
Abstract
We consider the problem of modelling count data with excess zeros using ZeroInflated Poisson (ZIP) regression. Recently, various regularization methods have been developed for variable selection in ZIP models. Among these, EM LASSO is a popular method for simultaneous variable selection and parameter estimation. However, EM LASSO suffers from estimation inefficiency and selection inconsistency. To remedy these problems, we propose a set of EM adaptive LASSO methods using a variety of dataadaptive weights. We show theoretically that the new methods are able to identify the true model consistently, and the resulting estimators can be as efficient as oracle. The methods are further evaluated through extensive synthetic experiments and applied to a German health care demand dataset.
1. Introduction
Modern research studies routinely collect information on a broad array of outcomes including count measurements with excess amount of zeros. Modeling such zeroinflated count outcomes is challenging for several reasons. First, traditional count models such as Poisson and Negative Binomial are suboptimal in accounting for excess variability due to zeroinflation [1, 2]. Second, alternative zeroinflated models such as the ZeroInflated Poisson (ZIP) [2] and ZeroInflated Negative Binomial (ZINB) [1] models are computationally prohibitive in the presence of highdimensional and collinear variables.
Regularization methods have been proposed as a powerful framework to mitigate these problems, which tend to exhibit significant advantages over traditional methods [3, 4]. Essentially all these methods enforce sparsity through a suitable penalty function and identify predictive features by means of a computationally efficient Expectation Maximization (EM) algorithm. Among these, EM LASSO is particularly attractive due to its capability to perform simultaneous model selection and stable effect estimation. However, recent research suggests that EM LASSO may not be fully efficient and its model selection result could be inconsistent [5, 6]. This led to a simple modification of the LASSO penalty, namely, the EM adaptive LASSO (EM AL). EM AL achieves “oracle selection consistency” by allowing different amounts of shrinkage for different regression coefficients.
Previous studies have not, however, investigated the EM AL at sufficient depth to evaluate its properties under diversified and realistic scenarios. It is not yet clear, for example, how reliable the resulting parameter estimates are in the presence of multicollinearity. In particular, the actual variable selection performance of EM AL depends on the proper construction of the dataadaptive weight vector. When the features to be associated possess an inherent collinearity, EM AL is expected to produce suboptimal results, a phenomenon that is especially evident when the sample size is limited [7]. Several remedies have been suggested for linear and generalized linear models (GLMs) such as the standard erroradjusted adaptive LASSO (SEAL) [7, 8]. However, there is a lack of similar published methods for zeroinflated count regression models. In addition, complete software packages of these methods have not been made available to the community.
We address these issues by providing a set of flexible variable selection approaches to efficiently identify correlated features associated with zeroinflated count outcomes in a ZIP regression framework. We have implemented this method as AMAZonn (A Multicollinearityadjusted Adaptive LASSO for Zeroinflated Count Regression). AMAZonn considers two dataadaptive weights: (i) the inverse of the maximum likelihood (ML) estimates (EM AL) and (ii) inverse of the ML estimates divided by their standard errors (EM SEAL). We show theoretically that AMAZonn is able to identify the true model consistently, and the resulting estimator is as efficient as oracle. Numerical studies confirmed our theoretical findings. The rest of the article is organized as follows. The AMAZonn method is proposed in the next section, and its theoretical properties are established in Section 3. Simulation results are reported in Section 4 and one real dataset is analyzed in Section 5. Then, the article concludes with a short discussion in Section 6. All technical details are presented in the Appendix.
2. Methods
2.1. ZeroInflated Poisson (ZIP) Model
Zeroinflated count models assume that the observations originate either from a “susceptible” population that generates zero and positive counts according to a count distribution or from a “nonsusceptible” population, which produces additional zeros [1, 2]. Thus, while a subject with a positive count is considered to belong to the “susceptible” population, individuals with zero counts may belong to either of the two latent populations. We denote the observed values of the response variable as . Following Lambert [2], a ZIP mixture distribution can be written as where is the probability of belonging to the nonsusceptible population and is the Poisson mean corresponding to the susceptible population for the individual (). It can be seen from (1) that ZIP reduces to the standard Poisson model when . Also, , indicating zeroinflation. The probability of belonging to the “nonsusceptible” population, , and the Poisson mean, , are linked to the explanatory variables through the logit and log links as where and are vectors of covariates for the th subject () corresponding to the count and zero models, respectively, and and are the corresponding regression coefficients including the intercepts.
For independent observations, the ZIP loglikelihood function can be written as
2.2. The AMAZonn Method
AMAZonn considers two dataadaptive weights in the EM adaptive LASSO framework: (i) the inverse of the maximum likelihood (ML) estimates (EM AL) and (ii) inverse of the ML estimates divided by their standard errors (EM SEAL). As defined by Tang et al. [6], the EM adaptive LASSO formulation for ZIP regression is given by where is the parameter vector of interest with known weights and . As noted by Qian and Yang [7], the inverse of the maximum likelihood (ML) estimates as weights may not always be stable, especially when the multicollinearity of the design matrix is a concern. In order to adjust for this instability, AMAZonn additionally considers the inverse of the ML estimates divided by their standard errors as weights. We refer to these two methods as AMAZonn  EM AL and AMAZonn  EM SEAL, respectively (Table 1).

2.3. The EM Algorithm
In order to efficiently estimate the parameters in the above optimization problem (5), we resort to the EM algorithm. To this end, we define a set of latent variables as follows: We consider the latent variables ’s as the “missing data" and rewrite the completedata loglikelihood function in (4) as follows: With the above formulation, the objective function in (5) can be rewritten as which can be iteratively solved as follows: (1)At iteration t, the E step computes the expectation of by substituting with its conditional expectation given observed data and current parameter estimates (2)In the M step, the expected penalized completedata loglikelihood (5) can be minimized the with respect to as (3)Continue this process until convergence, .
It is to be noted that (10) can be further decomposed as where is the weighted penalized Poisson loglikelihood defined as and is the penalized logistic loglikelihood defined as both of which can be minimized separately using computationally efficient coordinate descent algorithms developed for GLMs [9].
2.4. Selection of Tuning Parameters
We select the tuning parameters based on the minimum BIC [10] criterion, which is known to provide better variable selection performance as compared to other information criteria [11]. This can be effortlessly incorporated in our formulation by using existing implementations for zeroinflated count models [3, 4, 6].
3. Oracle Properties
Recently, Tang et al. [6] showed that the EM adaptive LASSO (i.e., AMAZonn  EM AL) enjoys the socalled oracle properties, i.e., the estimator is able to identify the true model consistently, and the resulting estimator is as efficient as oracle. Here we extend these results to the AMAZonn  EM SEAL estimator and show that the AMAZonn  EM SEAL estimator also maintains the same theoretical properties. For the sake of completeness, we provide a combined general proof for both AMAZonn estimators.
Without being too rigorous mathematically, recall that the loglikelihood function for the ZIP regression model is given by where ’s are the observed data (i.i.d observations from the ZIP distribution), is the probability mass function of Poisson distribution with parameter and , . The corresponding penalized loglikelihood is given by Let us denote the true coefficient vector as . Decompose and assume that contains all zero coefficients. Let us denote the subset of true nonzero coefficients as and the subset of selected nonzero coefficients as . With this formulation, the Fisher information matrix can be written as where is the Fisher information corresponding the true nonzero submodel. The oracle property of AMAZonn may be developed based on certain mild regularity conditions which are as follows: (A1):The Fisher information matrix is finite and positive definite for all values of .(A2):There exists functions such that where for all .
Theorem 1. Under (A1) and (A2), if , , , , then the AMAZonn estimators obey the following oracle properties: (1)consistency in variable selection: , and(2)asymptotic normality of the nonzero coefficients: .
4. Simulation Studies
In this section, we conduct simulation studies to evaluate the finite sample performance of AMAZonn. For comparison purposes, the performance of both AMAZonn and EM LASSO is evaluated. For each simulated dataset, the associated tuning parameters are selected by the minimum BIC criterion for all the methods under consideration. All the examples reported in this section are obtained from published papers with slight modifications within the scope of the current study [11, 12].
Specially, three scenarios are considered: in the data generating models of Simulations 1 and 2, we consider all continuous predictors, whereas in Simulation 3, both continuous and categorical variables are included. For each experimental instance, we randomly partition the data into training and test sets: models are fitted on the training set and prediction errors based on mean absolute scaled error (MASE) are calculated on the heldout samples in the test set. For an exhaustive comparison, we considered three sets of sample sizes , and , where and represent the size of the training and test data, respectively. The corresponding regression coefficients and intercepts are chosen so that a desired level of sparsity proportion is achieved. In order to remain as modelagnostic as possible, we consider the same set of predictors for both zero and count submodels (i.e., ). Such models are common in many practical applications where no domainspecific prior information about the zeroinflation mechanism is available. Below we provide the detailed data generation steps for both simulation examples:
Simulation 1. (1)Generate predictors from the multivariate normal distribution with mean vector , variance vector , and variancecovariance matrix , where the elements of are . The values of pairwise correlation varies from 0 (uncorrelated) to 0.4 (moderate collinearity) to 0.8 (high collinearity).(2)The count and zero regression parameters are chosen as follows: (3)The zeroinflated count outcome is simulated according to (1) with the above parameters and input data.
Simulation 2. It is similar to Simulation 1 except that the count and zero regression parameters are chosen as follows:
Simulation 3. (1)First simulate independently from the standard normal distribution. Consider the following as the continuous predictors: and .(2)Simulate 5 continuous variables from the multivariate normal distribution with mean , variance , and AR() correlation structure for varying in as before, and quantilediscretize each of them into 5 new variables based on their quantiles: , , , , and , leading to a total of categorical variables.(3)With the above input data and parameters, the zeroinflated count outcome is simulated according to (1), where the two sets of regression parameters are chosen as follows: The resulting performance measures iterated over 200 replications (Table 2) reveal that AMAZonn performs as well as or better than EM LASSO in most of the simulation scenarios. For highly collinear designs, AMAZonn  EM SEAL stands out to be the best estimator for almost every sample size and zeroinflation proportion, highlighting the benefit of incorporating dataadaptive weights based on both ML estimates and their standard errors. This phenomenon is also apparent in the analysis of German health care data in Section 5, where the parameter estimates from the AMAZonn  EM SEAL method appear to be more parsimonious than those from other methods.

5. Application to German Health Care Demand Data
Next, we apply our method to the German health care demand data [3], a subset of the German Socioeconomic Panel (GSOEP) dataset [13], which has also been used for illustration purposes in previous studies [3, 14]. The original data contains number of doctor office visits for West German men aged 25 to 65 years in the last three months of 1994 (response variable of interest), which is supplemented with complementary information on twelve annual waves from 1984 to 1995 including health care utilization, current employment status, and insurance arrangements under which subjects are protected [3]. The goal of the original study was to investigate how the employment characteristics of the German nationals are related to their health care demand. The distribution of the dependent variable (Figure 1) reveals that many doctor visits are zeros (), confirming that classical methods such as Poisson regression are inappropriate for modeling this outcome.
In the model fitting process, along with the original variables, the interactions between age groups and health condition are also considered, resulting in 28 candidate predictors (Table 3). The fitting results from the full models indicate that both EM adaptive LASSO methods provide competitive model selection performance (Table 4), often leading to sparser model selection than EM LASSO (Table 5). In addition, the AMAZonn  EM SEAL method appears to choose even fewer numbers of variables. Such feature of AMAZonn  EM SEAL can be appealing in many practical situations, where data collinearity between variables is a concern and a more aggressive feature selection is desired. While the computational overheads of both EM adaptive LASSO methods are similar, they are an order of magnitude faster than EM LASSO (Table 4), further confirming that AMAZonn offers a viable alternative to existing methods.


(a)  
 
(b)  

6. Discussion
In recent years, there has been a huge influx of zeroinflated count measurements spanning several disciplines including biology, public health, and medicine. This has motivated the widespread use of zeroinflated count models in many practical applications such as metagenomics, singlecell RNA sequencing, and health care research. In this article, we propose the AMAZonn method for adaptive variable selection in ZIP regression models. Both our simulation and real data experience suggest that AMAZonn can outperform EM LASSO under a variety of regression settings while maintaining the desired theoretical properties and computational convenience. Our preliminary results are rather encouraging, and for practical purposes, we provide a publicly available R package implementing this method: https://github.com/himelmallick/AMAZonn.
We envision a number of improvements that may further refine AMAZonn’s performance. While AMAZonn relies on ML estimates to construct the weight vector, these estimates may not be available in ultrahigh dimensions [7]. Alternative initialization schemes could further improve on this such as the ridge estimates [15]. Extension to other zeroinflated models such as marginalized zeroinflated count regression [16, 17], twopart and hurdle models [18], and multipleinflation models [19] can form a useful basis for further investigations. Although we only focused on variable selection for fixed effects models, future work could include an extension to other regularization problems such as grouped variable selection [12, 20] as well as sparse mixed effects models [21].
Appendix
Proof. It is to be noted that both logistic and Poisson distributions belong to the exponential family. Since the objective function in (10) can be decomposed into weighted logistic and Poisson loglikelihoods (each belonging to the GLM family without the penalties), Theorem 1 is the direct application of Theorem 4 in Zou [22]. Therefore, if , , , and , then both the AMAZonn  EM AL and AMAZonn  EM SEAL estimators hold the oracle properties: with probability tending to 1, the estimate of zero coefficients is 0, and the estimate for nonzero coefficients has an asymptotic normal distribution with mean being the true value and variance which approximately equals the submatrix of the Fisher information matrix containing nonzero coefficients. Hence the proof is complete.
Data Availability
The German Healthcare dataset used in the paper is publicly available from others (https://cran.rproject.org/web/packages/HDtweedie/index.html) and the software is publicly available at https://github.com/himelmallick/AMAZonn.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Authors’ Contributions
Prithish Banerjee, Broti Garai, and Himel Mallick contributed equally to this work.
Acknowledgments
The authors would like to thank the anonymous reviewers for their valuable comments and suggestions to improve the manuscript. This work was supported in part by the research computing resources acquired and managed by University of Alabama at Birmingham IT Research Computing. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the University of Alabama at Birmingham.
References
 W. H. Greene, Accounting for excess zeros and sample selection in Poisson and negative binomial regression models, New York University, New York, NY, 1994.
 D. Lambert, “Zeroinflated poisson regression, with an application to defects in manufacturing,” Technometrics, vol. 34, no. 1, pp. 1–14, 1992. View at: Publisher Site  Google Scholar
 Z. Wang, S. Ma, and C.Y. Wang, “Variable selection for zeroinflated and overdispersed data with application to health care demand in Germany,” Biometrical Journal, vol. 57, no. 5, pp. 867–884, 2015. View at: Publisher Site  Google Scholar  MathSciNet
 Z. Wang, S. Ma, C.Y. Wang, M. Zappitelli, P. Devarajan, and C. Parikh, “EM for regularized zeroinflated regression models with applications to postoperative morbidity after cardiac surgery in children,” Statistics in Medicine, vol. 33, no. 29, pp. 5192–5208, 2014. View at: Publisher Site  Google Scholar  MathSciNet
 H. Mallick and H. K. Tiwari, “EM adaptive LASSOa multilocus modeling strategy for detecting SNPs associated with zeroinflated count phenotypes,” Frontiers in Genetics, vol. 7, 2016. View at: Google Scholar
 Y. Tang, L. Xiang, and Z. Zhu, “Risk Factor Selection in Rate Making: EM Adaptive LASSO for ZeroInflated Poisson Regression Models,” Risk Analysis, vol. 34, no. 6, pp. 1112–1127, 2014. View at: Publisher Site  Google Scholar
 W. Qian and Y. Yang, “Model selection via standard error adjusted adaptive lasso,” Annals of the Institute of Statistical Mathematics, vol. 65, no. 2, pp. 295–318, 2013. View at: Publisher Site  Google Scholar  MathSciNet
 Z. Y. Algamal and M. H. Lee, “Adjusted Adaptive LASSO in Highdimensional Poisson Regression Model,” Modern Applied Science (MAS), vol. 9, no. 4, 2014. View at: Publisher Site  Google Scholar
 J. Friedman, T. Hastie, and R. Tibshirani, “Regularization paths for generalized linear models via coordinate descent,” Journal of Statistical Software , vol. 33, no. 1, pp. 1–22, 2010. View at: Google Scholar
 G. Schwarz, “Estimating the dimension of a model,” The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978. View at: Publisher Site  Google Scholar  MathSciNet
 J. Huang, S. Ma, H. Xie, and C.H. Zhang, “A group bridge approach for variable selection,” Biometrika, vol. 96, no. 2, pp. 339–355, 2009. View at: Publisher Site  Google Scholar  MathSciNet
 S. Chatterjee, S. Chowdhury, H. Mallick, P. Banerjee, and B. Garai, “Group regularization for zeroinflated negative binomial regression models with an application to health care demand in Germany,” Statistics in Medicine, vol. 37, no. 20, pp. 3012–3026, 2018. View at: Publisher Site  Google Scholar  MathSciNet
 R. T. Riphahn, A. Wambach, and A. Million, “Incentive effects in the demand for health care: A bivariate panel count data estimation,” Journal of Applied Econometrics, vol. 18, no. 4, pp. 387–405, 2003. View at: Publisher Site  Google Scholar
 M. Jochmann, “What belongs where? Variable selection for zeroinflated count models with an application to the demand for health care,” Computational Statistics, vol. 28, no. 5, pp. 1947–1964, 2013. View at: Publisher Site  Google Scholar  MathSciNet
 A. E. Hoerl and R. W. Kennard, “Ridge regression: biased estimation for nonorthogonal problems,” Technometrics, vol. 12, no. 1, pp. 55–67, 1970. View at: Publisher Site  Google Scholar
 D. L. Long, J. S. Preisser, A. H. Herring, and C. E. Golin, “A marginalized zeroinflated Poisson regression model with overall exposure effects,” Statistics in Medicine, vol. 33, no. 29, pp. 5151–5165, 2014. View at: Publisher Site  Google Scholar  MathSciNet
 V. A. Smith and J. S. Preisser, “Direct and flexible marginal inference for semicontinuous data,” Statistical Methods in Medical Research, vol. 26, no. 6, pp. 2962–2965, 2016. View at: Publisher Site  Google Scholar  MathSciNet
 V. A. Smith, B. Neelon, J. S. Preisser, and M. L. Maciejewski, “A marginalized twopart model for longitudinal semicontinuous data,” Statistical Methods in Medical Research, vol. 26, no. 4, pp. 1949–1968, 2017. View at: Publisher Site  Google Scholar  MathSciNet
 X. Su, J. Fan, R. A. Levine, X. Tan, and A. Tripathi, “Multipleinflation Poisson model with L_{1} regularization,” Statistica Sinica, vol. 23, no. 3, pp. 1071–1090, 2013. View at: Google Scholar  MathSciNet
 S. Chowdhury, S. Chatterjee, H. Mallick, H. Banerjee, and B. Garai, “Group regularization for zeroinflated poisson regression models with an application to insurance ratemaking,” Journal of Applied Statistics, 2018, In Press. View at: Google Scholar
 A. Groll and G. Tutz, “Variable selection for generalized linear mixed models by L_{1}penalized estimation,” Statistics and Computing, vol. 24, no. 2, pp. 137–154, 2014. View at: Publisher Site  Google Scholar  MathSciNet
 H. Zou, “The adaptive lasso and its oracle properties,” Journal of the American Statistical Association, vol. 101, no. 476, pp. 1418–1429, 2006. View at: Publisher Site  Google Scholar  MathSciNet
Copyright
Copyright © 2018 Prithish Banerjee et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.