Journal of Probability and Statistics

Volume 2018, Article ID 2834183, 9 pages

https://doi.org/10.1155/2018/2834183

## A Note on the Adaptive LASSO for Zero-Inflated Poisson Regression

^{1}JP Morgan Chase & Co., USA; ^{2}NBCUniversal, USA; ^{3}Department of Biostatistics, Harvard T.H. Chan School of Public Health, USA; ^{4}Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, USA; ^{5}Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, USA; ^{6}Eli Lilly and Company, USA

Correspondence should be addressed to Himel Mallick; hmallick@hsph.harvard.edu

Prithish Banerjee, Broti Garai, and Himel Mallick contributed equally to this work.

Received 23 July 2018; Accepted 21 November 2018; Published 30 December 2018

Guest Editor: Ash Abebe

Copyright © 2018 Prithish Banerjee et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

We consider the problem of modelling count data with excess zeros using Zero-Inflated Poisson (ZIP) regression. Recently, various regularization methods have been developed for variable selection in ZIP models. Among these, EM LASSO is a popular method for simultaneous variable selection and parameter estimation. However, EM LASSO suffers from estimation inefficiency and selection inconsistency. To remedy these problems, we propose a set of EM adaptive LASSO methods using a variety of data-adaptive weights. We show theoretically that the new methods are able to identify the true model consistently, and the resulting estimators can be as efficient as the oracle. The methods are further evaluated through extensive synthetic experiments and applied to a German health care demand dataset.

#### 1. Introduction

Modern research studies routinely collect information on a broad array of outcomes, including count measurements with an excess of zeros. Modeling such zero-inflated count outcomes is challenging for several reasons. First, traditional count models such as Poisson and Negative Binomial are suboptimal in accounting for excess variability due to zero-inflation [1, 2]. Second, alternative zero-inflated models such as the **Z**ero-**I**nflated **P**oisson (ZIP) [2] and **Z**ero-**I**nflated **N**egative **B**inomial (ZINB) [1] models are computationally prohibitive in the presence of high-dimensional and collinear variables.

Regularization methods, which tend to exhibit significant advantages over traditional methods, have been proposed as a powerful framework to mitigate these problems [3, 4]. Essentially all these methods enforce sparsity through a suitable penalty function and identify predictive features by means of a computationally efficient Expectation Maximization (EM) algorithm. Among these, EM LASSO is particularly attractive due to its capability to perform simultaneous model selection and stable effect estimation. However, recent research suggests that EM LASSO may not be fully efficient and that its model selection results could be inconsistent [5, 6]. This led to a simple modification of the LASSO penalty, namely, the EM adaptive LASSO (EM AL). EM AL achieves “oracle selection consistency” by allowing different amounts of shrinkage for different regression coefficients.

Previous studies have not, however, investigated the EM AL in sufficient depth to evaluate its properties under diverse and realistic scenarios. It is not yet clear, for example, how reliable the resulting parameter estimates are in the presence of multicollinearity. In particular, the actual variable selection performance of EM AL depends on the proper construction of the data-adaptive weight vector. When the candidate features are inherently collinear, EM AL is expected to produce suboptimal results, a phenomenon that is especially evident when the sample size is limited [7]. Several remedies have been suggested for linear and generalized linear models (GLMs), such as the standard error-adjusted adaptive LASSO (SEAL) [7, 8]. However, there is a lack of similar published methods for zero-inflated count regression models. In addition, complete software packages of these methods have not been made available to the community.

We address these issues by providing a set of flexible variable selection approaches to efficiently identify correlated features associated with zero-inflated count outcomes in a ZIP regression framework. We have implemented this method as AMAZonn (**A** **M**ulticollinearity-adjusted **A**daptive LASSO for **Z**ero-inflated C**o**u**n**t Regressio**n**). AMAZonn considers two data-adaptive weights: (i) the inverse of the maximum likelihood (ML) estimates (EM AL) and (ii) the inverse of the ML estimates divided by their standard errors (EM SEAL). We show theoretically that AMAZonn is able to identify the true model consistently, and the resulting estimator is as efficient as the oracle. Numerical studies confirm our theoretical findings. The rest of the article is organized as follows. The AMAZonn method is proposed in the next section, and its theoretical properties are established in Section 3. Simulation results are reported in Section 4 and one real dataset is analyzed in Section 5. The article concludes with a short discussion in Section 6. All technical details are presented in the Appendix.

#### 2. Methods

##### 2.1. Zero-Inflated Poisson (ZIP) Model

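Zero-inflated count models assume that the observations originate either from a “susceptible” population that generates zero and positive counts according to a count distribution or from a “nonsusceptible” population, which produces additional zeros [1, 2]. Thus, while a subject with a positive count is considered to belong to the “susceptible” population, individuals with zero counts may belong to either of the two latent populations. We denote the observed values of the response variable as $y = (y_1, \ldots, y_n)^{T}$. Following Lambert [2], a ZIP mixture distribution can be written as

$$P(Y_i = y_i) = \begin{cases} p_i + (1 - p_i)\, e^{-\lambda_i}, & y_i = 0, \\ (1 - p_i)\, \dfrac{e^{-\lambda_i} \lambda_i^{y_i}}{y_i!}, & y_i = 1, 2, \ldots, \end{cases} \tag{1}$$

where $p_i$ is the probability of belonging to the nonsusceptible population and $\lambda_i$ is the Poisson mean corresponding to the susceptible population for the $i$th individual ($i = 1, \ldots, n$). It can be seen from (1) that ZIP reduces to the standard Poisson model when $p_i = 0$. Also, $P(Y_i = 0) > e^{-\lambda_i}$ whenever $p_i > 0$, indicating zero-inflation. The probability of belonging to the “nonsusceptible” population, $p_i$, and the Poisson mean, $\lambda_i$, are linked to the explanatory variables through the logit and log links as

$$\log(\lambda_i) = x_i^{T}\beta, \qquad \operatorname{logit}(p_i) = \log\left(\frac{p_i}{1 - p_i}\right) = z_i^{T}\gamma, \tag{2}$$

where $x_i$ and $z_i$ are vectors of covariates for the $i$th subject ($i = 1, \ldots, n$) corresponding to the count and zero models, respectively, and $\beta$ and $\gamma$ are the corresponding regression coefficients including the intercepts.

For $n$ independent observations, the ZIP log-likelihood function can be written as

$$\ell(\beta, \gamma) = \sum_{i:\, y_i = 0} \log\left[p_i + (1 - p_i)\, e^{-\lambda_i}\right] + \sum_{i:\, y_i > 0} \left[\log(1 - p_i) - \lambda_i + y_i \log \lambda_i - \log(y_i!)\right], \tag{3}$$

with $p_i$ and $\lambda_i$ expressed in terms of $\gamma$ and $\beta$ through (2).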
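As an illustration of the ZIP log-likelihood described above, the following is a minimal Python sketch, assuming the log link for the Poisson mean and the logit link for the zero-inflation probability. The function names `sigmoid` and `zip_loglik` and the list-of-rows layout for the covariates are hypothetical choices made for this example, not part of the AMAZonn software.

```python
import math


def sigmoid(t):
    """Inverse logit, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def zip_loglik(y, X, Z, beta, gamma):
    """ZIP log-likelihood for counts y (illustrative sketch).

    X, Z  : lists of covariate rows for the count and zero models
            (each row should include a leading 1.0 if an intercept is wanted).
    beta  : coefficients of the count model (log link).
    gamma : coefficients of the zero model (logit link).
    """
    total = 0.0
    for y_i, x_i, z_i in zip(y, X, Z):
        lam = math.exp(sum(a * b for a, b in zip(x_i, beta)))  # Poisson mean
        p = sigmoid(sum(a * b for a, b in zip(z_i, gamma)))    # P(nonsusceptible)
        if y_i == 0:
            # A zero can come from either latent population.
            total += math.log(p + (1.0 - p) * math.exp(-lam))
        else:
            # A positive count can only come from the susceptible population.
            total += (math.log(1.0 - p) - lam + y_i * math.log(lam)
                      - math.lgamma(y_i + 1))
    return total
```

When the zero-model linear predictor is very negative, the zero-inflation probability is effectively zero and the function returns the ordinary Poisson log-likelihood, which mirrors the reduction to the standard Poisson model noted in the text.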

##### 2.2. The AMAZonn Method

AMAZonn considers two data-adaptive weights in the EM adaptive LASSO framework: (i) the inverse of the maximum likelihood (ML) estimates (EM AL) and (ii) the inverse of the ML estimates divided by their standard errors (EM SEAL). As defined by Tang et al. [6], the EM adaptive LASSO formulation for ZIP regression is given by

$$\hat{\theta} = \arg\min_{\theta} \left\{ -\ell(\theta) + \nu_1 \sum_{j} w_{1j} |\beta_j| + \nu_2 \sum_{k} w_{2k} |\gamma_k| \right\},$$

where $\ell(\theta)$ is the ZIP log-likelihood, $\theta = (\beta^{T}, \gamma^{T})^{T}$ is the parameter vector of interest with known weights $w_1 = (w_{1j})$ and $w_2 = (w_{2k})$, and $\nu_1, \nu_2 \geq 0$ are tuning parameters. As noted by Qian and Yang [7], the inverse of the ML estimates as weights may not always be stable, especially when the multicollinearity of the design matrix is a concern. In order to adjust for this instability, AMAZonn additionally considers the inverse of the ML estimates divided by their standard errors as weights. We refer to these two methods as AMAZonn - EM AL and AMAZonn - EM SEAL, respectively (Table 1).
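The two weighting schemes can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the tuning parameter name `nu`, and the small guard `eps` against division by zero are assumptions introduced here. The code also reads "the inverse of the ML estimates divided by their standard errors" as the standard error divided by the absolute ML estimate, and assumes that intercepts are left out of the coefficient lists passed in.

```python
def adaptive_weights(ml_estimates, std_errors=None, eps=1e-8):
    """Data-adaptive weights from ML estimates (illustrative sketch).

    std_errors=None -> EM AL style weights:   1 / |estimate|
    std_errors given -> EM SEAL style weights: se / |estimate|
    eps is a hypothetical guard that keeps the weights finite
    when an ML estimate is numerically zero.
    """
    if std_errors is None:
        return [1.0 / max(abs(b), eps) for b in ml_estimates]
    if len(std_errors) != len(ml_estimates):
        raise ValueError("std_errors and ml_estimates must have equal length")
    return [se / max(abs(b), eps) for b, se in zip(ml_estimates, std_errors)]


def weighted_l1_penalty(coefs, weights, nu):
    """Adaptive LASSO penalty nu * sum_j w_j * |coef_j| for one sub-model."""
    return nu * sum(w * abs(c) for w, c in zip(weights, coefs))


# Example: weights for a count model with two non-intercept coefficients.
al_weights = adaptive_weights([2.0, -0.5])
seal_weights = adaptive_weights([2.0, -0.5], std_errors=[0.4, 0.1])
```

Under the EM AL weights, a coefficient with a small ML estimate receives a large penalty. Under the EM SEAL weights, the penalty is also scaled by the standard error of that estimate.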