Journal of Probability and Statistics

Special Issue: New Advances in Biostatistics

Research Article | Open Access

Volume 2018 | Article ID 2834183

Prithish Banerjee, Broti Garai, Himel Mallick, Shrabanti Chowdhury, Saptarshi Chatterjee, "A Note on the Adaptive LASSO for Zero-Inflated Poisson Regression", Journal of Probability and Statistics, vol. 2018, Article ID 2834183, 9 pages, 2018.

A Note on the Adaptive LASSO for Zero-Inflated Poisson Regression

Guest Editor: Ash Abebe
Received 23 Jul 2018
Accepted 21 Nov 2018
Published 30 Dec 2018


Abstract

We consider the problem of modelling count data with excess zeros using Zero-Inflated Poisson (ZIP) regression. Recently, various regularization methods have been developed for variable selection in ZIP models. Among these, EM LASSO is a popular method for simultaneous variable selection and parameter estimation. However, EM LASSO suffers from estimation inefficiency and selection inconsistency. To remedy these problems, we propose a set of EM adaptive LASSO methods using a variety of data-adaptive weights. We show theoretically that the new methods are able to identify the true model consistently, and the resulting estimators can be as efficient as the oracle. The methods are further evaluated through extensive synthetic experiments and applied to a German health care demand dataset.

1. Introduction

Modern research studies routinely collect information on a broad array of outcomes, including count measurements with an excess of zeros. Modeling such zero-inflated count outcomes is challenging for several reasons. First, traditional count models such as Poisson and Negative Binomial are suboptimal in accounting for excess variability due to zero-inflation [1, 2]. Second, alternative zero-inflated models such as the Zero-Inflated Poisson (ZIP) [2] and Zero-Inflated Negative Binomial (ZINB) [1] models are computationally prohibitive in the presence of high-dimensional and collinear variables.

Regularization methods, which tend to exhibit significant advantages over traditional methods, have been proposed as a powerful framework to mitigate these problems [3, 4]. Essentially all these methods enforce sparsity through a suitable penalty function and identify predictive features by means of a computationally efficient Expectation Maximization (EM) algorithm. Among these, EM LASSO is particularly attractive due to its capability to perform simultaneous model selection and stable effect estimation. However, recent research suggests that EM LASSO may not be fully efficient and that its model selection results could be inconsistent [5, 6]. This led to a simple modification of the LASSO penalty, namely, the EM adaptive LASSO (EM AL). EM AL achieves “oracle selection consistency” by allowing different amounts of shrinkage for different regression coefficients.

Previous studies have not, however, investigated the EM AL in sufficient depth to evaluate its properties under diverse and realistic scenarios. It is not yet clear, for example, how reliable the resulting parameter estimates are in the presence of multicollinearity. In particular, the actual variable selection performance of EM AL depends on the proper construction of the data-adaptive weight vector. When the candidate features are inherently collinear, EM AL is expected to produce suboptimal results, a phenomenon that is especially evident when the sample size is limited [7]. Several remedies have been suggested for linear and generalized linear models (GLMs), such as the standard error-adjusted adaptive LASSO (SEAL) [7, 8]. However, there is a lack of similar published methods for zero-inflated count regression models. In addition, complete software packages of these methods have not been made available to the community.

We address these issues by providing a set of flexible variable selection approaches to efficiently identify correlated features associated with zero-inflated count outcomes in a ZIP regression framework. We have implemented this method as AMAZonn (A Multicollinearity-adjusted Adaptive LASSO for Zero-inflated Count Regression). AMAZonn considers two data-adaptive weights: (i) the inverse of the maximum likelihood (ML) estimates (EM AL) and (ii) the inverse of the ML estimates divided by their standard errors (EM SEAL). We show theoretically that AMAZonn is able to identify the true model consistently, and the resulting estimator is as efficient as the oracle. Numerical studies confirm our theoretical findings.

The rest of the article is organized as follows. The AMAZonn method is proposed in the next section, and its theoretical properties are established in Section 3. Simulation results are reported in Section 4 and one real dataset is analyzed in Section 5. The article concludes with a short discussion in Section 6. All technical details are presented in the Appendix.

2. Methods

2.1. Zero-Inflated Poisson (ZIP) Model

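Zero-inflated count models assume that the observations originate either from a “susceptible” population that generates zero and positive counts according to a count distribution or from a “nonsusceptible” population, which produces additional zeros [1, 2]. Thus, while a subject with a positive count is considered to belong to the “susceptible” population, individuals with zero counts may belong to either of the two latent populations. We denote the observed values of the response variable as $y = (y_1, \ldots, y_n)^T$. Following Lambert [2], a ZIP mixture distribution can be written as

$$P(y_i = k) = \begin{cases} p_i + (1 - p_i)\, e^{-\lambda_i}, & k = 0, \\ (1 - p_i)\, \dfrac{e^{-\lambda_i} \lambda_i^{k}}{k!}, & k = 1, 2, \ldots, \end{cases} \tag{1}$$

where $p_i$ is the probability of belonging to the nonsusceptible population and $\lambda_i$ is the Poisson mean corresponding to the susceptible population for the $i$th individual ($i = 1, \ldots, n$). It can be seen from (1) that ZIP reduces to the standard Poisson model when $p_i = 0$. Also, $P(y_i = 0) > e^{-\lambda_i}$ for $p_i > 0$, indicating zero-inflation. The probability of belonging to the “nonsusceptible” population, $p_i$, and the Poisson mean, $\lambda_i$, are linked to the explanatory variables through the logit and log links as

$$\log\left(\frac{p_i}{1 - p_i}\right) = z_i^T \gamma, \qquad \log(\lambda_i) = x_i^T \beta,$$

where $x_i$ and $z_i$ are vectors of covariates for the $i$th subject ($i = 1, \ldots, n$) corresponding to the count and zero models, respectively, and $\beta$ and $\gamma$ are the corresponding regression coefficients including the intercepts.

For independent observations, the ZIP log-likelihood function can be written as

$$\ell(\beta, \gamma) = \sum_{y_i = 0} \log\left[\exp(z_i^T \gamma) + \exp\{-\exp(x_i^T \beta)\}\right] + \sum_{y_i > 0} \left[y_i x_i^T \beta - \exp(x_i^T \beta) - \log(y_i!)\right] - \sum_{i=1}^{n} \log\left[1 + \exp(z_i^T \gamma)\right]. \tag{4}$$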
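As an illustration, a ZIP log-likelihood with a log link for the count part and a logit link for the zero part can be evaluated as in the minimal sketch below. The function name `zip_loglik` and the list-of-rows data layout are assumptions made for this example and are not taken from the AMAZonn software.

```python
import math


def zip_loglik(beta, gamma, X, Z, y):
    """ZIP log-likelihood with log link (count part) and logit link (zero part).

    X and Z are lists of covariate rows; each row is assumed to start with 1.0
    for the intercept. All names here are illustrative.
    """
    total = 0.0
    for x_i, z_i, y_i in zip(X, Z, y):
        eta_count = sum(b * v for b, v in zip(beta, x_i))   # log of the Poisson mean
        eta_zero = sum(g * v for g, v in zip(gamma, z_i))   # logit of the zero-inflation probability
        lam = math.exp(eta_count)
        if y_i == 0:
            # a zero can come from either latent state
            total += math.log(math.exp(eta_zero) + math.exp(-lam))
        else:
            # a positive count can only come from the Poisson state
            total += y_i * eta_count - lam - math.lgamma(y_i + 1)
        # normalising term shared by all observations (from the logit link)
        total -= math.log1p(math.exp(eta_zero))
    return total
```

When the zero-model intercept is very negative, the zero-inflation probability is close to 0 and the value returned should agree with the ordinary Poisson log-likelihood, which is a convenient sanity check.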

2.2. The AMAZonn Method

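AMAZonn considers two data-adaptive weights in the EM adaptive LASSO framework: (i) the inverse of the maximum likelihood (ML) estimates (EM AL) and (ii) the inverse of the ML estimates divided by their standard errors (EM SEAL). As defined by Tang et al. [6], the EM adaptive LASSO formulation for ZIP regression is given by

$$\hat{\theta} = \arg\min_{\theta} \left\{ -\ell(\theta) + \nu_1 \sum_{j} w_{1j} \lvert \beta_j \rvert + \nu_2 \sum_{j} w_{2j} \lvert \gamma_j \rvert \right\}, \tag{5}$$

where $\ell(\theta)$ is the ZIP log-likelihood in (4), $\theta = (\beta^T, \gamma^T)^T$ is the parameter vector of interest, collecting the count coefficients $\beta$ and the zero coefficients $\gamma$, with known weights $w_1 = (w_{1j})$ and $w_2 = (w_{2j})$, and $\nu_1$ and $\nu_2$ are the tuning parameters of the count and zero submodels. As noted by Qian and Yang [7], using the inverse of the ML estimates as weights may not always be stable, especially when the multicollinearity of the design matrix is a concern. In order to adjust for this instability, AMAZonn additionally considers the inverse of the ML estimates divided by their standard errors as weights. We refer to these two methods as AMAZonn - EM AL and AMAZonn - EM SEAL, respectively (Table 1).

Weighting Scheme | Count | Zero
AMAZonn - EM AL | $1 / \lvert \hat{\beta}_j \rvert$ | $1 / \lvert \hat{\gamma}_j \rvert$
AMAZonn - EM SEAL | $\mathrm{SE}(\hat{\beta}_j) / \lvert \hat{\beta}_j \rvert$ | $\mathrm{SE}(\hat{\gamma}_j) / \lvert \hat{\gamma}_j \rvert$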
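The two weighting schemes can be sketched in a few lines, assuming that ML estimates and their standard errors are already available from an unpenalized ZIP fit. The helper name `adaptive_weights` and the small guard `eps` against division by zero are assumptions of this example, not part of the published method.

```python
def adaptive_weights(estimates, std_errors=None, eps=1e-8):
    """Data-adaptive penalty weights for one submodel (count or zero).

    Without std_errors: EM AL weights, 1 / |estimate|.
    With std_errors:    EM SEAL weights, SE / |estimate|.
    eps is a hypothetical safeguard for estimates that are exactly zero.
    """
    weights = []
    for j, est in enumerate(estimates):
        denom = max(abs(est), eps)
        se = 1.0 if std_errors is None else std_errors[j]
        weights.append(se / denom)
    return weights


# Example with made-up ML estimates and standard errors
al_weights = adaptive_weights([2.0, -0.5])
seal_weights = adaptive_weights([2.0, -0.5], std_errors=[0.2, 0.1])
```

A coefficient that is large relative to its standard error receives a small SEAL weight and is therefore shrunk less, whereas a noisy estimate is penalized more heavily.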


2.3. The EM Algorithm

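In order to efficiently estimate the parameters in the above optimization problem (5), we resort to the EM algorithm. To this end, we define a set of latent variables as follows:

$$u_i = \begin{cases} 1, & \text{if } y_i \text{ is from the nonsusceptible (zero) state}, \\ 0, & \text{if } y_i \text{ is from the susceptible (Poisson) state}. \end{cases}$$

We consider the latent variables $u_i$’s as the “missing data” and rewrite the complete-data log-likelihood function in (4) as follows:

$$\ell_c(\theta; y, u) = \sum_{i=1}^{n} \left\{ u_i z_i^T \gamma - \log\left[1 + \exp(z_i^T \gamma)\right] \right\} + \sum_{i=1}^{n} (1 - u_i) \left\{ y_i x_i^T \beta - \exp(x_i^T \beta) - \log(y_i!) \right\},$$

where $x_i^T \beta$ and $z_i^T \gamma$ are the linear predictors of the count and zero submodels and $\theta = (\beta^T, \gamma^T)^T$. With the above formulation, the objective function in (5) can be rewritten as

$$Q_c(\theta) = -\ell_c(\theta; y, u) + \nu_1 \sum_{j} w_{1j} \lvert \beta_j \rvert + \nu_2 \sum_{j} w_{2j} \lvert \gamma_j \rvert,$$

with tuning parameters $\nu_1, \nu_2$ and weights $w_{1j}, w_{2j}$, which can be iteratively solved as follows:

(1) At iteration $t$, the E step computes the expectation of $\ell_c(\theta; y, u)$ by substituting $u_i$ with its conditional expectation given the observed data and the current parameter estimates $\theta^{(t)}$:

$$\hat{u}_i^{(t)} = \begin{cases} \left[1 + \exp\left\{-z_i^T \gamma^{(t)} - \exp(x_i^T \beta^{(t)})\right\}\right]^{-1}, & y_i = 0, \\ 0, & y_i > 0. \end{cases}$$

(2) In the M step, the expected penalized complete-data log-likelihood (5) can be minimized with respect to $\theta$ as

$$\theta^{(t+1)} = \arg\min_{\theta} \left\{ -\ell_c(\theta; y, \hat{u}^{(t)}) + \nu_1 \sum_{j} w_{1j} \lvert \beta_j \rvert + \nu_2 \sum_{j} w_{2j} \lvert \gamma_j \rvert \right\}. \tag{10}$$

(3) Continue this process until convergence, i.e., until the change between $\theta^{(t)}$ and $\theta^{(t+1)}$ is negligible.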
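The E step has a closed form: for a zero count, the posterior probability of the nonsusceptible state is p / (p + (1 - p) exp(-lambda)), and for a positive count it is 0. A minimal sketch follows; the function name `e_step` and the row-wise data layout are illustrative assumptions.

```python
import math


def e_step(beta, gamma, X, Z, y):
    """Posterior probability that each observation is from the nonsusceptible state.

    X and Z are lists of covariate rows (leading 1.0 for the intercept);
    beta and gamma are the current count and zero coefficient estimates.
    """
    u_hat = []
    for x_i, z_i, y_i in zip(X, Z, y):
        if y_i > 0:
            # a positive count can only arise from the Poisson state
            u_hat.append(0.0)
            continue
        lam = math.exp(sum(b * v for b, v in zip(beta, x_i)))
        eta_zero = sum(g * v for g, v in zip(gamma, z_i))
        # p / (p + (1 - p) * exp(-lam)), rewritten in terms of the logit eta_zero
        u_hat.append(1.0 / (1.0 + math.exp(-eta_zero - lam)))
    return u_hat


# Example: p = 0.5 and Poisson mean 1 for both observations
u = e_step([0.0], [0.0], [[1.0], [1.0]], [[1.0], [1.0]], [0, 3])
```

In the example, the first observation (a zero) is assigned a posterior probability of about 0.73 of being a structural zero, while the second (a positive count) is assigned exactly 0.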

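It is to be noted that (10) can be further decomposed, up to an additive constant that does not involve the parameters, as

$$Q(\theta) = Q_1(\beta) + Q_2(\gamma),$$

where $Q_1(\beta)$ is the weighted penalized Poisson log-likelihood defined as

$$Q_1(\beta) = -\sum_{i=1}^{n} \left(1 - \hat{u}_i^{(t)}\right) \left\{ y_i x_i^T \beta - \exp(x_i^T \beta) \right\} + \nu_1 \sum_{j} w_{1j} \lvert \beta_j \rvert$$

and $Q_2(\gamma)$ is the penalized logistic log-likelihood defined as

$$Q_2(\gamma) = -\sum_{i=1}^{n} \left\{ \hat{u}_i^{(t)} z_i^T \gamma - \log\left[1 + \exp(z_i^T \gamma)\right] \right\} + \nu_2 \sum_{j} w_{2j} \lvert \gamma_j \rvert,$$

with $\hat{u}_i^{(t)}$ denoting the E-step estimate of the latent nonsusceptibility indicator of the $i$th observation. Both can be minimized separately using computationally efficient coordinate descent algorithms developed for GLMs [9].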
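To make the count part of the M step concrete, the sketch below minimizes an observation-weighted, weighted-L1-penalized negative Poisson log-likelihood. It deliberately swaps in proximal gradient descent with backtracking for the coordinate descent of [9], because it is shorter to write down; it is not the solver used by AMAZonn. The names `weighted_l1_poisson`, `obs_w` (one minus the E-step probabilities), `pen_w` (adaptive weights), and `nu` (tuning parameter) are assumptions of this example, and the intercept in column 0 is left unpenalized by assumption.

```python
import math


def soft_threshold(a, t):
    """Proximal operator of t * |.|."""
    return math.copysign(max(abs(a) - t, 0.0), a)


def weighted_l1_poisson(X, y, obs_w, pen_w, nu, n_iter=500):
    """Minimise -sum_i obs_w[i] * (y_i * eta_i - exp(eta_i)) + nu * sum_j pen_w[j] * |beta_j|.

    Proximal gradient with backtracking; pen_w[0] is ignored because the
    intercept (column 0 of X) is not penalised. Intended for small problems.
    """
    p = len(X[0])
    beta = [0.0] * p

    def smooth(b):
        val = 0.0
        for x_i, y_i, w_i in zip(X, y, obs_w):
            eta = sum(bj * v for bj, v in zip(b, x_i))
            val -= w_i * (y_i * eta - math.exp(eta))
        return val

    def grad(b):
        g = [0.0] * p
        for x_i, y_i, w_i in zip(X, y, obs_w):
            eta = sum(bj * v for bj, v in zip(b, x_i))
            r = w_i * (math.exp(eta) - y_i)
            for j in range(p):
                g[j] += r * x_i[j]
        return g

    step = 1.0
    for _ in range(n_iter):
        g, f_old = grad(beta), smooth(beta)
        while True:
            new = [beta[0] - step * g[0]] + [
                soft_threshold(beta[j] - step * g[j], step * nu * pen_w[j])
                for j in range(1, p)
            ]
            diff = [a - b for a, b in zip(new, beta)]
            bound = (f_old + sum(gj * d for gj, d in zip(g, diff))
                     + sum(d * d for d in diff) / (2.0 * step))
            if smooth(new) <= bound + 1e-12:
                break  # sufficient decrease reached
            step *= 0.5  # otherwise shrink the step and retry
        beta = new
    return beta


# Toy data: with a very large penalty only the intercept survives
X_toy = [[1.0, -1.0, 0.5], [1.0, 0.0, -0.5], [1.0, 1.0, 0.5], [1.0, 2.0, -0.5]]
y_toy = [1, 0, 2, 5]
beta_hat = weighted_l1_poisson(X_toy, y_toy, [1.0] * 4, [0.0, 1.0, 1.0], nu=100.0)
```

The zero part can be handled in the same way by replacing the Poisson terms with the logistic ones. With a very large tuning parameter, as in the toy call, all penalized coefficients are set exactly to zero and the intercept converges to the log of the weighted mean response.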

2.4. Selection of Tuning Parameters

We select the tuning parameters based on the minimum BIC [10] criterion, which is known to provide better variable selection performance as compared to other information criteria [11]. This can be effortlessly incorporated in our formulation by using existing implementations for zero-inflated count models [3, 4, 6].
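A minimal sketch of BIC-based tuning follows. It assumes that the model degrees of freedom are approximated by the number of nonzero estimated coefficients, a common convention for LASSO-type estimators that may differ from the exact definition used in the implementations cited above. The function names and the candidate tuple layout are hypothetical.

```python
import math


def bic(loglik, coefs, n_obs, tol=1e-10):
    """BIC = -2 * log-likelihood + log(n) * (number of nonzero coefficients)."""
    df = sum(1 for c in coefs if abs(c) > tol)
    return -2.0 * loglik + math.log(n_obs) * df


def select_by_bic(candidates, n_obs):
    """candidates: list of (tuning_value, loglik, coefs); returns the minimum-BIC entry."""
    return min(candidates, key=lambda c: bic(c[1], c[2], n_obs))


# Example with made-up fits at two tuning values
fits = [
    (0.1, -100.0, [1.0, 0.5, -0.3]),   # denser model, slightly better fit
    (1.0, -101.0, [1.0, 0.0, 0.0]),    # sparser model
]
best = select_by_bic(fits, n_obs=100)
```

Here the sparser fit is selected because its small loss in log-likelihood is outweighed by the log(n) penalty saved on two coefficients.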

3. Oracle Properties

Recently, Tang et al. [6] showed that the EM adaptive LASSO (i.e., AMAZonn - EM AL) enjoys the so-called oracle properties, i.e., the estimator is able to identify the true model consistently, and the resulting estimator is as efficient as the oracle. Here we extend these results to the AMAZonn - EM SEAL estimator and show that it maintains the same theoretical properties. For the sake of completeness, we provide a combined general proof for both AMAZonn estimators.

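Without being too rigorous mathematically, recall that the log-likelihood function for the ZIP regression model is given by

$$\ell_n(\theta) = \sum_{i=1}^{n} \log\left[ p_i\, 1(y_i = 0) + (1 - p_i)\, g(y_i; \lambda_i) \right],$$

where $y_i$’s are the observed data (i.i.d. observations from the ZIP distribution), $g(\cdot\,; \lambda_i)$ is the probability mass function of the Poisson distribution with parameter $\lambda_i$, and $\lambda_i = \exp(x_i^T \beta)$, $p_i = \exp(z_i^T \gamma) / \{1 + \exp(z_i^T \gamma)\}$, with $\theta = (\beta^T, \gamma^T)^T$ collecting the count and zero coefficients. The corresponding penalized log-likelihood is given by

$$Q_n(\theta) = -\ell_n(\theta) + \nu_1 \sum_{j} w_{1j} \lvert \beta_j \rvert + \nu_2 \sum_{j} w_{2j} \lvert \gamma_j \rvert,$$

where $\nu_1, \nu_2$ are the tuning parameters and $w_{1j}, w_{2j}$ the data-adaptive weights. Let us denote the true coefficient vector as $\theta_0$. Decompose $\theta_0 = (\theta_{10}^T, \theta_{20}^T)^T$ and assume that $\theta_{20}$ contains all zero coefficients. Let us denote the subset of true nonzero coefficients as $\mathcal{A} = \{j : \theta_{0j} \neq 0\}$ and the subset of selected nonzero coefficients as $\hat{\mathcal{A}}_n = \{j : \hat{\theta}_j \neq 0\}$. With this formulation, the Fisher information matrix can be written as

$$I(\theta_0) = \begin{pmatrix} I_{11} & I_{12} \\ I_{21} & I_{22} \end{pmatrix},$$

where $I_{11}$ is the Fisher information corresponding to the true nonzero submodel. The oracle property of AMAZonn may be developed based on certain mild regularity conditions, which are as follows:

(A1): The Fisher information matrix $I(\theta)$ is finite and positive definite for all values of $\theta$.

(A2): There exist functions $M_{jkl}$ such that $\left\lvert \partial^3 \log f(y; \theta) / \partial \theta_j \partial \theta_k \partial \theta_l \right\rvert \leq M_{jkl}(y)$ for all $\theta$ in a neighborhood of $\theta_0$, where $f$ denotes the ZIP probability mass function and $E_{\theta_0}[M_{jkl}(y)] < \infty$ for all $j, k, l$.

Theorem 1. Under (A1) and (A2), if the tuning parameters of both penalties satisfy the rate conditions of Theorem 4 in Zou [22], then the AMAZonn estimators obey the following oracle properties:

(1) consistency in variable selection: $\lim_{n \to \infty} P(\hat{\mathcal{A}}_n = \mathcal{A}) = 1$, and

(2) asymptotic normality of the nonzero coefficients: $\sqrt{n}\,(\hat{\theta}_{\mathcal{A}} - \theta_{10}) \to_d N(0, I_{11}^{-1})$.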

4. Simulation Studies

In this section, we conduct simulation studies to evaluate the finite sample performance of AMAZonn. For comparison purposes, the performance of both AMAZonn and EM LASSO is evaluated. For each simulated dataset, the associated tuning parameters are selected by the minimum BIC criterion for all the methods under consideration. All the examples reported in this section are obtained from published papers with slight modifications within the scope of the current study [11, 12].

Specifically, three scenarios are considered: in the data generating models of Simulations 1 and 2, we consider all continuous predictors, whereas in Simulation 3, both continuous and categorical variables are included. For each experimental instance, we randomly partition the data into training and test sets: models are fitted on the training set and prediction errors based on mean absolute scaled error (MASE) are calculated on the held-out samples in the test set. For an exhaustive comparison, we considered three combinations of training and test sample sizes. The corresponding regression coefficients and intercepts are chosen so that a desired level of sparsity proportion is achieved. In order to remain as model-agnostic as possible, we consider the same set of predictors for both zero and count submodels. Such models are common in many practical applications where no domain-specific prior information about the zero-inflation mechanism is available. Below we provide the detailed data generation steps for the three simulation examples:

Simulation 1.
(1) Generate the predictors from a multivariate normal distribution with a prespecified mean vector, variance vector, and variance-covariance matrix $\Sigma$, where the elements of $\Sigma$ are $\rho^{\lvert j - k \rvert}$. The pairwise correlation $\rho$ takes the values 0 (uncorrelated), 0.4 (moderate collinearity), and 0.8 (high collinearity).
(2) The count and zero regression parameters are fixed at prespecified values chosen to achieve the desired level of sparsity.
(3) The zero-inflated count outcome is simulated according to (1) with the above parameters and input data.
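The data generation steps of Simulation 1 can be sketched as follows. Standardized (mean 0, variance 1) predictors, the dimensions, and the coefficient values in the usage lines are placeholders chosen for illustration, not the settings of the study. Correlated predictors are drawn through a stationary AR(1) recursion, which yields pairwise correlations of rho to the power |j - k|, and the Poisson draw uses Knuth's multiplication method, which is adequate for small means.

```python
import math
import random


def ar1_predictors(n, p, rho, rng):
    """n rows of p standard-normal predictors with corr(x_j, x_k) = rho ** |j - k|."""
    rows = []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0)]
        for _ in range(1, p):
            # stationary AR(1) recursion keeps the variance at 1
            row.append(rho * row[-1] + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0))
        rows.append(row)
    return rows


def poisson_draw(lam, rng):
    """Knuth's multiplication method for a Poisson(lam) draw."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_zip(X, beta, gamma, rng):
    """ZIP outcomes; beta[0] and gamma[0] are intercepts, same predictors in both parts."""
    y = []
    for row in X:
        x_i = [1.0] + row
        lam = math.exp(sum(b * v for b, v in zip(beta, x_i)))
        p_zero = 1.0 / (1.0 + math.exp(-sum(g * v for g, v in zip(gamma, x_i))))
        # structural zero with probability p_zero, otherwise a Poisson count
        y.append(0 if rng.random() < p_zero else poisson_draw(lam, rng))
    return y


# Placeholder settings (hypothetical, for illustration only)
rng = random.Random(1)
X_sim = ar1_predictors(2000, 3, 0.8, rng)
y_sim = simulate_zip(X_sim, [0.5, 0.3, 0.0, -0.2], [-1.0, 0.5, 0.0, 0.0], rng)
```

Coefficients set to zero in both vectors correspond to noise predictors, so the sparsity proportion is controlled directly through the number of zero entries.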

Simulation 2. This setting is the same as Simulation 1 except for the values chosen for the count and zero regression parameters.

Simulation 3.
(1) First simulate auxiliary variables independently from the standard normal distribution and construct the continuous predictors from them.
(2) Simulate 5 continuous variables from the multivariate normal distribution with an AR(1) correlation structure for $\rho$ varying in $\{0, 0.4, 0.8\}$ as before, and quantile-discretize each of them into 5 new variables based on their quantiles, which yields the categorical variables.
(3) With the above input data and parameters, the zero-inflated count outcome is simulated according to (1), with prespecified values for the two sets of regression parameters.

The resulting performance measures over 200 replications (Table 2) reveal that AMAZonn performs as well as or better than EM LASSO in most of the simulation scenarios. For highly collinear designs, AMAZonn - EM SEAL stands out as the best estimator for almost every sample size and zero-inflation proportion, highlighting the benefit of incorporating data-adaptive weights based on both ML estimates and their standard errors. This phenomenon is also apparent in the analysis of German health care data in Section 5, where the parameter estimates from the AMAZonn - EM SEAL method appear to be more parsimonious than those from other methods.

Table 2: Simulation results, reported separately for Simulation 1, Simulation 2, and Simulation 3.



5. Application to German Health Care Demand Data

Next, we apply our method to the German health care demand data [3], a subset of the German Socioeconomic Panel (GSOEP) dataset [13], which has also been used for illustration purposes in previous studies [3, 14]. The original data contain the number of doctor office visits for West German men aged 25 to 65 years in the last three months of 1994 (the response variable of interest), supplemented with complementary information from twelve annual waves from 1984 to 1995, including health care utilization, current employment status, and insurance arrangements under which subjects are protected [3]. The goal of the original study was to investigate how the employment characteristics of German nationals are related to their health care demand. The distribution of the dependent variable (Figure 1) reveals that a large share of the doctor visit counts are zeros, confirming that classical methods such as Poisson regression are inappropriate for modeling this outcome.

In the model fitting process, along with the original variables, the interactions between age groups and health condition are also considered, resulting in 28 candidate predictors (Table 3). The fitting results from the full models indicate that both EM adaptive LASSO methods provide competitive model selection performance (Table 4), often leading to sparser models than EM LASSO (Table 5). In addition, the AMAZonn - EM SEAL method appears to choose even fewer variables. This feature of AMAZonn - EM SEAL can be appealing in many practical situations where collinearity between variables is a concern and more aggressive feature selection is desired. While the computational overheads of both EM adaptive LASSO methods are similar, they are an order of magnitude faster than EM LASSO (Table 4), further confirming that AMAZonn offers a viable alternative to existing methods.

Variables | Mean (sd) or Frequency | Description
health | 6.84 (2.19) | health satisfaction, from low to high
handicap | 216 / 1596 | indicator: handicap vs. otherwise
hdegree | 6.16 (18.49) | degree of handicap in percentage points
married | 1257 / 555 | indicator: married vs. otherwise
schooling | 11.83 (2.49) | years of schooling
hhincome | 4.52 (2.13) | household income per month in German marks/1000
children | 703 / 1109 | indicator: children under 16 in household vs. otherwise
self | 153 / 1659 | indicator: self-employed vs. otherwise
civil | 198 / 1614 | indicator: civil servant vs. otherwise
bluec | 566 / 1246 | indicator: blue collar employee vs. otherwise
employed | 1506 / 306 | indicator: employed vs. otherwise
public | 1535 / 277 | indicator: public health insurance vs. otherwise
addon | 33 / 1779 | indicator: addon insurance vs. otherwise
age30 | 1480 / 332 | indicator: age ≥ 30
age35 | 1176 / 636 | indicator: age ≥ 35
age40 | 919 / 893 | indicator: age ≥ 40
age45 | 716 / 1096 | indicator: age ≥ 45
age50 | 535 / 1227 | indicator: age ≥ 50
age55 | 351 / 1461 | indicator: age ≥ 55
age60 | 147 / 1665 | indicator: age ≥ 60

Methods | BIC | Time (in seconds)
EM LASSO | 9062.744 | 50.252
AMAZonn - EM AL | 9002.487 |
AMAZonn - EM SEAL | | 26.528


Methods | Count Coefficients
EM LASSO | 2.322, -0.14, 0.207, -0.002, -0.97, 0.0, 0.0, 0.078, -0.178, -0.166, 0.038, -0.106, 0.089, 0.205
AMAZonn - EM AL | 2.305, -0.135, 0.111, 0.0, -0.947, 0.0, 0.0, 0.079, -0.234, -0.245, 0.0, -0.059, 0.043, 0.205
AMAZonn - EM SEAL | 2.378, -0.142, 0.098, 0.0, -0.066, 0.0, 0.0, 0.046, -0.189, -0.222, 0.0, -0.055, 0.0, 0.14

Methods | Count Coefficients (continued)
AMAZonn - EM AL | 0.0, 0.0, -0.047, 0.769, 0.0, -0.402, 0.099, 0.0, 0.0, 0.0, -0.101, 0.0, 0.106, -0.034

Methods | Zero Coefficients
EM LASSO | -2.193, -0.262, -0.098, -0.003, -0.121, 0.0, -0.012, 0.253, 0.112, 0.134, 0.0, 0.0, -0.012, 0.0
AMAZonn - EM AL | -2.226, -0.261, -0.162, 0.
AMAZonn - EM SEAL | -2.403, -0.283, 0.0, 0.0, -0.053, 0.0, 0.0, 0.238, 0.

Methods | Zero Coefficients (continued)
AMAZonn - EM AL | 0.047, 0.0, 0.065, 0.009, -0.527, 0.0, -0.198, 0.

6. Discussion

In recent years, there has been a huge influx of zero-inflated count measurements spanning several disciplines including biology, public health, and medicine. This has motivated the widespread use of zero-inflated count models in many practical applications such as metagenomics, single-cell RNA sequencing, and health care research. In this article, we propose the AMAZonn method for adaptive variable selection in ZIP regression models. Both our simulation studies and our real data analysis suggest that AMAZonn can outperform EM LASSO under a variety of regression settings while maintaining the desired theoretical properties and computational convenience. Our preliminary results are rather encouraging, and for practical purposes, we provide a publicly available R package implementing this method.

We envision a number of improvements that may further refine AMAZonn’s performance. While AMAZonn relies on ML estimates to construct the weight vector, these estimates may not be available in ultrahigh dimensions [7]. Alternative initialization schemes, such as ridge estimates [15], could address this limitation. Extensions to other zero-inflated models such as marginalized zero-inflated count regression [16, 17], two-part and hurdle models [18], and multiple-inflation models [19] can form a useful basis for further investigations. Although we only focused on variable selection for fixed effects models, future work could include an extension to other regularization problems such as grouped variable selection [12, 20] as well as sparse mixed effects models [21].


Appendix

Proof. It is to be noted that both logistic and Poisson distributions belong to the exponential family. Since the objective function in (10) can be decomposed into weighted logistic and Poisson log-likelihoods (each belonging to the GLM family without the penalties), Theorem 1 is a direct application of Theorem 4 in Zou [22]. Therefore, under the conditions on the tuning parameters stated in Theorem 1, both the AMAZonn - EM AL and AMAZonn - EM SEAL estimators possess the oracle properties: with probability tending to 1, the estimate of the zero coefficients is 0, and the estimate of the nonzero coefficients has an asymptotic normal distribution whose mean is the true value and whose variance approximately equals the inverse of the submatrix of the Fisher information matrix corresponding to the nonzero coefficients. Hence the proof is complete.

Data Availability

The German Healthcare dataset used in the paper is publicly available from other sources, and the software is also publicly available.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Prithish Banerjee, Broti Garai, and Himel Mallick contributed equally to this work.


Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions to improve the manuscript. This work was supported in part by the research computing resources acquired and managed by University of Alabama at Birmingham IT Research Computing. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the University of Alabama at Birmingham.


References

  1. W. H. Greene, Accounting for excess zeros and sample selection in Poisson and negative binomial regression models, New York University, New York, NY, 1994.
  2. D. Lambert, “Zero-inflated poisson regression, with an application to defects in manufacturing,” Technometrics, vol. 34, no. 1, pp. 1–14, 1992.
  3. Z. Wang, S. Ma, and C.-Y. Wang, “Variable selection for zero-inflated and overdispersed data with application to health care demand in Germany,” Biometrical Journal, vol. 57, no. 5, pp. 867–884, 2015.
  4. Z. Wang, S. Ma, C.-Y. Wang, M. Zappitelli, P. Devarajan, and C. Parikh, “EM for regularized zero-inflated regression models with applications to postoperative morbidity after cardiac surgery in children,” Statistics in Medicine, vol. 33, no. 29, pp. 5192–5208, 2014.
  5. H. Mallick and H. K. Tiwari, “EM adaptive LASSO-a multilocus modeling strategy for detecting SNPs associated with zero-inflated count phenotypes,” Frontiers in Genetics, vol. 7, 2016.
  6. Y. Tang, L. Xiang, and Z. Zhu, “Risk Factor Selection in Rate Making: EM Adaptive LASSO for Zero-Inflated Poisson Regression Models,” Risk Analysis, vol. 34, no. 6, pp. 1112–1127, 2014.
  7. W. Qian and Y. Yang, “Model selection via standard error adjusted adaptive lasso,” Annals of the Institute of Statistical Mathematics, vol. 65, no. 2, pp. 295–318, 2013.
  8. Z. Y. Algamal and M. H. Lee, “Adjusted Adaptive LASSO in High-dimensional Poisson Regression Model,” Modern Applied Science (MAS), vol. 9, no. 4, 2014.
  9. J. Friedman, T. Hastie, and R. Tibshirani, “Regularization paths for generalized linear models via coordinate descent,” Journal of Statistical Software, vol. 33, no. 1, pp. 1–22, 2010.
  10. G. Schwarz, “Estimating the dimension of a model,” The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.
  11. J. Huang, S. Ma, H. Xie, and C.-H. Zhang, “A group bridge approach for variable selection,” Biometrika, vol. 96, no. 2, pp. 339–355, 2009.
  12. S. Chatterjee, S. Chowdhury, H. Mallick, P. Banerjee, and B. Garai, “Group regularization for zero-inflated negative binomial regression models with an application to health care demand in Germany,” Statistics in Medicine, vol. 37, no. 20, pp. 3012–3026, 2018.
  13. R. T. Riphahn, A. Wambach, and A. Million, “Incentive effects in the demand for health care: A bivariate panel count data estimation,” Journal of Applied Econometrics, vol. 18, no. 4, pp. 387–405, 2003.
  14. M. Jochmann, “What belongs where? Variable selection for zero-inflated count models with an application to the demand for health care,” Computational Statistics, vol. 28, no. 5, pp. 1947–1964, 2013.
  15. A. E. Hoerl and R. W. Kennard, “Ridge regression: biased estimation for nonorthogonal problems,” Technometrics, vol. 12, no. 1, pp. 55–67, 1970.
  16. D. L. Long, J. S. Preisser, A. H. Herring, and C. E. Golin, “A marginalized zero-inflated Poisson regression model with overall exposure effects,” Statistics in Medicine, vol. 33, no. 29, pp. 5151–5165, 2014.
  17. V. A. Smith and J. S. Preisser, “Direct and flexible marginal inference for semicontinuous data,” Statistical Methods in Medical Research, vol. 26, no. 6, pp. 2962–2965, 2016.
  18. V. A. Smith, B. Neelon, J. S. Preisser, and M. L. Maciejewski, “A marginalized two-part model for longitudinal semicontinuous data,” Statistical Methods in Medical Research, vol. 26, no. 4, pp. 1949–1968, 2017.
  19. X. Su, J. Fan, R. A. Levine, X. Tan, and A. Tripathi, “Multiple-inflation Poisson model with L1 regularization,” Statistica Sinica, vol. 23, no. 3, pp. 1071–1090, 2013.
  20. S. Chowdhury, S. Chatterjee, H. Mallick, H. Banerjee, and B. Garai, “Group regularization for zero-inflated poisson regression models with an application to insurance ratemaking,” Journal of Applied Statistics, 2018, In Press.
  21. A. Groll and G. Tutz, “Variable selection for generalized linear mixed models by L1-penalized estimation,” Statistics and Computing, vol. 24, no. 2, pp. 137–154, 2014.
  22. H. Zou, “The adaptive lasso and its oracle properties,” Journal of the American Statistical Association, vol. 101, no. 476, pp. 1418–1429, 2006.

Copyright © 2018 Prithish Banerjee et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
