Abstract

Mixed Poisson regression models are commonly employed to analyze overdispersed count data. However, multicollinearity is a common issue when the regression coefficients of such models are estimated by the maximum likelihood estimator (MLE). To deal with multicollinearity, the Liu estimator was proposed by Liu (1993). The Poisson-Modification of the Quasi Lindley (PMQL) regression model is a recently introduced mixed Poisson regression model. The primary aim of this paper is to introduce the Liu estimator for the PMQL regression model to mitigate the multicollinearity issue. Some existing methods are used to estimate the Liu parameter, and the conditions under which the new estimator is superior to the MLE and the PMQL ridge regression estimator are obtained based on the mean square error (MSE) criterion. A Monte Carlo simulation study and applications are used to assess the performance of the new estimator in the scalar mean square error (SMSE) sense. The simulation study and the applications show that the PMQL Liu estimator performs better than the MLE and some other existing biased estimators in the presence of multicollinearity.

1. Introduction

The Poisson regression model is a commonly used statistical method for analyzing a count response variable [1]. One disadvantage of this model is that it cannot handle overdispersion, which is common in real-world applications in the actuarial, engineering, biomedical, and economic sciences. Overdispersion occurs when the conditional variance of the count response variable exceeds its conditional mean. In this case, the index of dispersion (variance-to-mean ratio) is greater than one. To tackle this issue, researchers have proposed several mixed Poisson regression models. The standard choice is the negative binomial (NB), or Poisson-gamma, regression model, which is based on the distribution introduced by Greenwood and Yule [2]. However, the NB distribution fails to fit count data with a high index of dispersion and long right-tail behavior, so the NB regression model is not a good choice for such a count response variable. As alternatives to the NB regression model, several mixed Poisson regression models are available in the literature. However, most of the probability mass functions (pmfs) of these mixed Poisson distributions do not have an explicit form. Notable examples of such regression models are the Poisson-Inverse Gaussian regression model [3] and the Poisson-Inverse gamma regression model [4]. This algebraic intractability leads to computational complexity, and these regression models are therefore of limited use in practice.
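The index of dispersion described above can be checked numerically. The following is a minimal sketch, assuming NumPy and using simulated counts invented for illustration (they are not data from this paper); it compares a Poisson sample with a gamma-mixed Poisson sample of the same mean.

```python
import numpy as np

def index_of_dispersion(counts):
    """Sample variance-to-mean ratio of a vector of counts."""
    counts = np.asarray(counts, dtype=float)
    return counts.var(ddof=1) / counts.mean()

rng = np.random.default_rng(0)

# Plain Poisson counts with mean 3: the ratio should be close to one.
poisson_counts = rng.poisson(lam=3.0, size=5000)

# Gamma-mixed Poisson (negative binomial) counts, also with mean 3:
# the random rate adds variance, so the ratio exceeds one.
random_rates = rng.gamma(shape=0.5, scale=6.0, size=5000)
mixed_counts = rng.poisson(lam=random_rates)

iod_poisson = index_of_dispersion(poisson_counts)
iod_mixed = index_of_dispersion(mixed_counts)
print(iod_poisson, iod_mixed)
```

A ratio well above one for the mixed sample is the overdispersion that mixed Poisson regression models are meant to accommodate.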

In the last decade, several researchers have highlighted mixed Poisson distributions obtained by mixing the Poisson distribution with the Lindley family of distributions, because their pmfs have an explicit form and they are efficient to work with. The Lindley family of distributions are two-component mixtures. Notable examples of such mixed Poisson distributions are the Poisson–Lindley distribution, the Generalized Poisson–Lindley distribution, the Poisson-generalized Lindley distribution, the Poisson-Quasi Lindley distribution, and the Poisson-weighted Lindley distribution, proposed by Sankaran [5], Mahmoudi and Zakerzadeh [6], Wongrin and Bodhisuwan [7], Grine and Zeghdoudi [8], and Atikankul et al. [9], respectively. Depending on their mixing distributions, they have the flexibility to capture various ranges of horizontal symmetry, right-tail behavior, and variance-to-mean ratio [10–12]. However, the literature on their regression models is rather limited. The most relevant works are the Generalized Poisson–Lindley (GPL) regression model derived by Wongrin and Bodhisuwan [13] and the Poisson-Quasi Lindley (PQL) regression model obtained by Altun [14].

Tharshan and Wijekoon [15] obtained a new Lindley family of distributions named the Modification of Quasi Lindley (MQL) distribution. Its probability density function (pdf) is given in equation (1); it has two shape parameters and one scale parameter, and the respective random variable has support $(0, \infty)$. Equation (1) represents a two-component mixture of exponential and gamma distributions. By mixing the Poisson and the MQL distributions, Tharshan and Wijekoon [16] derived the Poisson-Modification of the Quasi Lindley (PMQL) distribution. The explicit form of its pmf and some other important statistics are given in Section 2. The authors showed that the PMQL distribution is overdispersed and flexible enough to capture various ranges of horizontal symmetry, right-tail heaviness, and variance-to-mean ratio. Then, by using a reparameterization technique, the same authors [17] derived its regression model to predict overdispersed count responses from a set of linearly independent covariates based on the generalized linear model (GLM) approach. In that paper, the PMQL regression model is shown to perform better than the NB, GPL, and PQL regression models. More details of this regression model are given in Section 2.

The traditional estimator of the unknown regression coefficients of the PMQL regression model is the maximum likelihood estimator (MLE), where the solutions of the nonlinear likelihood equations with respect to the regression coefficients are found by applying an iteratively weighted least squares (IWLS) algorithm. However, because the model is a GLM, the MLE is unstable and its variance is inflated when the covariates are linearly correlated. This makes valid statistical inference difficult. This problem is commonly known as multicollinearity, a term due to Frisch [18]. To overcome the multicollinearity problem in the PMQL regression model, Tharshan and Wijekoon [19] adopted the ridge regression estimator (RE) in the PMQL regression model. The authors showed that the RE performs better than the MLE when multicollinearity exists. Further, they recommended some ridge parameter estimation methods for the RE. The ridge regression estimator was suggested by Hoerl and Kennard [20] for the ordinary linear regression model, and it was extended to GLMs by Segerstedt [21]. The ordinary linear regression model is defined as

$$y = X\beta + \varepsilon, \tag{2}$$

where $y$ is an $n \times 1$ vector of observations on a response variable, $\beta$ is a $p \times 1$ vector of unknown regression coefficients, $X$ is the known design matrix of order $n \times p$ with $p$-dimensional covariates, and $\varepsilon$ is an $n \times 1$ vector of errors with $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^{2} I_n$. Further, its unknown regression coefficients are estimated by the ordinary least squares (OLS) estimator, which is defined as

$$\hat{\beta}_{OLS} = (X'X)^{-1} X'y. \tag{3}$$

Even though the ridge estimator is effective, its drawback is that it is a complicated nonlinear function of the ridge parameter $k$, which takes values in $(0, \infty)$. Therefore, Liu [22] proposed a biased estimator, named the Liu estimator (LE), for the ordinary linear regression model. It is a linear function of the Liu parameter $d$, which takes values in $(0, 1)$, and it is obtained by modifying the ordinary least squares estimator $\hat{\beta}_{OLS}$. The Liu estimator in ordinary linear regression is defined as

$$\hat{\beta}_{LE} = (X'X + I)^{-1}(X'X + dI)\,\hat{\beta}_{OLS}, \tag{4}$$

where $I$ is the identity matrix of order $p$ and $d$ is the Liu parameter.
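To make the linear-regression form of the Liu estimator concrete, the following sketch computes it next to the ordinary least squares estimate. It assumes the standard form $(X'X + I)^{-1}(X'X + dI)\hat{\beta}_{OLS}$; the nearly collinear design, the coefficients, and the function names are invented for illustration.

```python
import numpy as np

def ols_estimator(X, y):
    """Ordinary least squares estimate (X'X)^{-1} X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def liu_estimator(X, y, d):
    """Liu estimate (X'X + I)^{-1} (X'X + d I) beta_OLS."""
    xtx = X.T @ X
    identity = np.eye(X.shape[1])
    return np.linalg.solve(xtx + identity, (xtx + d * identity) @ ols_estimator(X, y))

# Illustrative design: three covariates that share a common component,
# so they are almost collinear (all values here are assumptions).
rng = np.random.default_rng(1)
n = 200
common = rng.normal(size=n)
X = np.column_stack([common + 0.05 * rng.normal(size=n) for _ in range(3)])
y = X @ np.array([1.0, 1.0, 1.0]) + rng.normal(size=n)

beta_ols = ols_estimator(X, y)
beta_liu = liu_estimator(X, y, d=0.5)     # shrunken estimate
beta_liu_d1 = liu_estimator(X, y, d=1.0)  # d = 1 reproduces OLS
print(beta_ols, beta_liu)
```

Because each eigen-direction of the OLS estimate is scaled by $(\lambda_j + d)/(\lambda_j + 1) \le 1$, the Liu estimate is never longer than the OLS estimate, and the estimate is linear in $d$.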

Due to this advantage of the Liu estimator over the ridge estimator (it is a linear function of $d$), the Liu estimator has been considered by several researchers for different GLMs. Månsson et al. [23] discussed some improved Liu estimators for the Poisson regression model; Månsson et al. [24] adopted the Liu estimator in the logit regression model; Månsson [25] developed a Liu estimator for the negative binomial regression model; Siray et al. [26] introduced a restricted Liu estimator for the logistic regression model; Wu [27] derived a modified restricted Liu estimator in the logistic regression model; Kurtoğlu and Özkale [28] proposed the Liu estimator for generalized linear regression models and discussed an application to a gamma-distributed response variable; Türkan and Özel [29] proposed jackknifed estimators for the negative binomial regression model; Wu et al. [30] introduced a restricted almost unbiased Liu estimator for the logistic regression model; Varathan and Wijekoon [31] obtained a logistic Liu estimator under stochastic linear restrictions; Qasim et al. [32] proposed some new Liu parameter estimators for the Poisson regression model; Li et al. [33] obtained a stochastic restricted Liu estimator in the logistic regression model; and Omer et al. [34] developed Liu estimators for the zero-inflated Poisson regression model. We may note that work on the Liu estimator for mixed Poisson regression models is rather limited in the literature.

This paper adopts the Liu estimator in the PMQL regression model to combat multicollinearity. Further, we adopt some possible methods for estimating the Liu parameter of the PMQL Liu regression estimator (LE) based on the works of Hoerl and Kennard [20], Kibria [35], and Khalaf and Shukur [36]. Then, the performance of the MLE, the PMQL ridge regression estimator (RE), and the LE is compared in terms of the scalar mean square error (SMSE) criterion by using an extensive Monte Carlo simulation study. Finally, a simulated data set and a real-world example are considered to illustrate the benefits of the Liu estimator for the PMQL regression model in handling overdispersion and multicollinearity.

The rest of the paper is organized as follows. Section 2 discusses the PMQL regression model and the estimation of its regression coefficients. Section 3 presents the LE, the mean square error (MSE) properties of the LE, the conditions under which the LE is superior to the MLE and the RE, and possible Liu parameter estimators for the LE. Section 4 designs a Monte Carlo simulation study and discusses its results. Section 5 gives simulated and real data applications to illustrate the applicability of the PMQL Liu regression model. Finally, the conclusion of the paper is given in Section 6.

2. PMQL Regression Model

In this section, we present the PMQL regression model and its regression coefficients estimation.

The PMQL distribution [16] is the unconditional distribution that results from assuming that the Poisson parameter follows the MQL distribution. The pdf of the MQL distribution is given in equation (1). The probability mass function of the PMQL distribution is given in equation (5), where the respective random variable represents the total count in an experiment. Its mean and variance are given in equations (6) and (7), respectively. Equation (5) represents a two-component mixture of geometric and negative binomial distributions. Further, the distribution can be unimodal or bimodal, and it is overdispersed. The authors have shown that it can accommodate various degrees of horizontal symmetry, right-tail behavior, and index of dispersion for overdispersed count data.

Let $y_1, \ldots, y_n$ be a random sample of $n$ observations from the PMQL distribution. The link between the $p$-dimensional covariates and the mean responses $\mu_i$ was taken as in equation (8), where $x_i$ is the $i$th row of the known design matrix supplemented with a 1 in front for the intercept, $\beta$ is the vector of unknown regression coefficients of order $(p+1) \times 1$ including the intercept, and the two remaining parameters are overdispersion parameters. To approach the GLM, the PMQL distribution was reparametrized based on the relationship between the mean and the remaining distribution parameter given in equation (6), for given values of the overdispersion parameters, and on the link between the mean and the $p$-dimensional covariates given in equation (8).

That is, by substituting this relationship in equation (5), the pmf of the response for a given set of covariates was obtained as in equation (9). The conditional mean and variance of the regression model are then obtained from this pmf. Figure 1 depicts surface plots of the variance function of the PMQL regression model for different values of the overdispersion parameters. From Figure 1, it can be observed that the variance is not a monotonic function of either overdispersion parameter, and it is high for small values of both.

The unknown regression coefficients are commonly estimated by maximizing the log-likelihood function of the pmf given in equation (9). The score function of the vector of regression coefficients $\beta$, obtained by differentiating this log-likelihood with respect to $\beta$, is given in equation (12).

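Since equation (12) is nonlinear in $\beta$, one can use the iteratively weighted least squares (IWLS) algorithm (Fisher scoring method) [16] to obtain the maximum likelihood (ML) estimates. Let $\hat{\beta}^{(r)}$ be the estimated value of $\beta$ by the ML method after $r$ iterations. Then, the Fisher scoring method can be written as

$$\hat{\beta}^{(r+1)} = \hat{\beta}^{(r)} + I^{-1}\big(\hat{\beta}^{(r)}\big)\, S\big(\hat{\beta}^{(r)}\big),$$

where $I(\cdot)$ is the Fisher information matrix and $S(\cdot)$ is the score function of the regression coefficients, both calculated at $\hat{\beta}^{(r)}$. In the final step of the IWLS algorithm, $\hat{\beta}_{MLE}$ is obtained as

$$\hat{\beta}_{MLE} = (X'\hat{W}X)^{-1} X'\hat{W}\hat{z},$$

where $\hat{W}$ is the estimated weight matrix and $\hat{z}$ is the working-response vector of the IWLS algorithm.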
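The iteration above can be sketched in code. Because the PMQL weights and working response are not reproduced here, the sketch below uses the ordinary Poisson model with a log link as a stand-in (an assumption made purely for illustration); only the structure of the IWLS update carries over. The simulated data and the function name are hypothetical.

```python
import numpy as np

def iwls_poisson_log_link(X, y, max_iter=50, tol=1e-10):
    """Fisher scoring via IWLS for a Poisson GLM with log link (stand-in model)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta
        mu = np.exp(eta)
        weights = mu                      # diagonal of W for this stand-in model
        z = eta + (y - mu) / mu           # working response
        # Weighted least squares step: (X'WX)^{-1} X'Wz
        beta_new = np.linalg.solve(X.T @ (weights[:, None] * X), X.T @ (weights * z))
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Simulated data (illustrative values only).
rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.3, -0.2])
y = rng.poisson(np.exp(X @ beta_true))

beta_hat = iwls_poisson_log_link(X, y)
score_at_solution = X.T @ (y - np.exp(X @ beta_hat))  # should be numerically zero
print(beta_hat)
```

For the PMQL model, the same loop would apply with the model's own weight matrix and working response substituted for the Poisson ones.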

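The asymptotic covariance matrix of this estimator is given as

$$\mathrm{Cov}(\hat{\beta}_{MLE}) = (X'\hat{W}X)^{-1},$$

and the asymptotic MSE and SMSE of this estimator are given as [16]

$$\mathrm{MSE}(\hat{\beta}_{MLE}) = (X'\hat{W}X)^{-1},$$

$$\mathrm{SMSE}(\hat{\beta}_{MLE}) = \mathrm{tr}\big[(X'\hat{W}X)^{-1}\big] = \sum_{j=1}^{p+1} \frac{1}{\lambda_j}, \tag{18}$$

respectively, where $\lambda_j$ is the $j$th eigenvalue of the matrix $X'\hat{W}X$.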

When the covariates are highly correlated, the weighted cross-product matrix $X'\hat{W}X$ is ill-conditioned, and it will have some small eigenvalues. We can observe that the SMSE given in equation (18) is easily inflated by small eigenvalues. In this situation, it is very hard to draw valid inferences about whether the estimated regression coefficients are significant.
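The effect of small eigenvalues can be illustrated numerically. The sketch below assumes that the asymptotic SMSE of the MLE is the sum of the reciprocal eigenvalues of $X'\hat{W}X$, and it uses unit weights as a placeholder for the PMQL weights, which are not reproduced here; the designs and names are invented for illustration.

```python
import numpy as np

def smse_mle(X, weights):
    """Sum of reciprocal eigenvalues of X'WX (assumed asymptotic SMSE of the MLE)."""
    info = X.T @ (weights[:, None] * X)
    eigvals = np.linalg.eigvalsh(info)
    return np.sum(1.0 / eigvals), eigvals

rng = np.random.default_rng(3)
n = 100
z = rng.normal(size=(n, 3))
weights = np.ones(n)  # placeholder; real weights would come from the fitted model

def design(rho):
    # Two covariates that share a common component; larger rho means
    # stronger collinearity between them.
    return np.sqrt(1.0 - rho ** 2) * z[:, :2] + rho * z[:, [2]]

smse_low, eig_low = smse_mle(design(0.5), weights)
smse_high, eig_high = smse_mle(design(0.99), weights)
print(smse_low, smse_high)
```

The strongly collinear design has a much smaller minimum eigenvalue, and its reciprocal dominates the sum.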

3. The PMQL Liu Regression Estimator

Note that the PMQL regression model is a GLM. Then, following the Liu estimator proposed by Kurtoğlu and Özkale [28] for GLMs based on the IWLS algorithm, we define the Liu estimator for the PMQL regression model iteratively to mitigate the multicollinearity issue: at each iteration, the Liu estimate of $\beta$ is computed from the weight matrix and the ML estimate of $\beta$ evaluated at that iteration. In the final step of the IWLS algorithm, the PMQL Liu regression estimator (LE) can be obtained as

$$\hat{\beta}_{LE} = (X'\hat{W}X + I)^{-1}(X'\hat{W}X + dI)\,\hat{\beta}_{MLE},$$

where $d$ is the Liu parameter and $I$ is an identity matrix of the same order as $X'\hat{W}X$. Note that, if $d = 1$, then $\hat{\beta}_{LE} = \hat{\beta}_{MLE}$.
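The final-step formula can be sketched as a small function that takes an already computed ML estimate and weight vector. This is an illustrative sketch assuming the form $(X'\hat{W}X + I)^{-1}(X'\hat{W}X + dI)\hat{\beta}_{MLE}$; the inputs below are random placeholders, not fitted PMQL quantities, and the function name is hypothetical.

```python
import numpy as np

def liu_from_mle(X, weights, beta_mle, d):
    """Shrink an ML estimate with the GLM Liu formula for a given d."""
    info = X.T @ (weights[:, None] * X)   # X'WX with W = diag(weights)
    identity = np.eye(info.shape[0])
    return np.linalg.solve(info + identity, (info + d * identity) @ beta_mle)

# Placeholder inputs (assumptions for illustration only).
rng = np.random.default_rng(5)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
weights = rng.uniform(0.5, 2.0, size=50)
beta_mle = np.array([0.8, -0.5, 0.3])

beta_le = liu_from_mle(X, weights, beta_mle, d=0.4)
beta_le_d1 = liu_from_mle(X, weights, beta_mle, d=1.0)  # equals the ML estimate
print(beta_le)
```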

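Asymptotic properties of the PMQL Liu regression estimator are as follows:

$$E(\hat{\beta}_{LE}) = (X'\hat{W}X + I)^{-1}(X'\hat{W}X + dI)\,\beta,$$

$$\mathrm{Cov}(\hat{\beta}_{LE}) = (X'\hat{W}X + I)^{-1}(X'\hat{W}X + dI)(X'\hat{W}X)^{-1}(X'\hat{W}X + dI)(X'\hat{W}X + I)^{-1},$$

and then the asymptotic bias and MSE are given as

$$\mathrm{Bias}(\hat{\beta}_{LE}) = (d - 1)(X'\hat{W}X + I)^{-1}\beta,$$

$$\mathrm{MSE}(\hat{\beta}_{LE}) = \mathrm{Cov}(\hat{\beta}_{LE}) + \mathrm{Bias}(\hat{\beta}_{LE})\,\mathrm{Bias}(\hat{\beta}_{LE})',$$

respectively. Now, we derive the asymptotic SMSE of the estimator as

$$\mathrm{SMSE}(\hat{\beta}_{LE}) = \mathrm{tr}\big[\mathrm{MSE}(\hat{\beta}_{LE})\big].$$

Let us define an orthogonal matrix $Q$ whose columns are the normalized eigenvectors of the matrix $X'\hat{W}X$, a vector $\xi = Q'\beta$, and a diagonal matrix $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_{p+1}) = Q'X'\hat{W}XQ$. Then, the asymptotic SMSE can be written by using the spectral decomposition as

$$\mathrm{SMSE}(\hat{\beta}_{LE}) = \underbrace{\sum_{j=1}^{p+1} \frac{(\lambda_j + d)^{2}}{\lambda_j(\lambda_j + 1)^{2}}}_{\text{term I}} + \underbrace{(d - 1)^{2}\sum_{j=1}^{p+1} \frac{\xi_j^{2}}{(\lambda_j + 1)^{2}}}_{\text{term II}}, \tag{24}$$

where $\xi_j$ is the $j$th element of $\xi$, and term I and term II are the total variance of the regression coefficient estimates and the squared bias, respectively.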
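The decomposition into a variance term and a squared-bias term can be checked numerically. The sketch below assumes the usual GLM Liu form of the SMSE, with eigenvalues $\lambda_j$ of the weighted cross-product matrix and a rotated coefficient vector obtained from its eigenvectors; a random positive definite matrix stands in for that matrix, and all names and values are illustrative.

```python
import numpy as np

def liu_smse_terms(eigvals, rotated_beta, d):
    """Return (total variance, squared bias) of the Liu estimator for a given d."""
    lam = np.asarray(eigvals, dtype=float)
    xi = np.asarray(rotated_beta, dtype=float)
    variance = np.sum((lam + d) ** 2 / (lam * (lam + 1.0) ** 2))
    squared_bias = (d - 1.0) ** 2 * np.sum(xi ** 2 / (lam + 1.0) ** 2)
    return variance, squared_bias

# Random positive definite matrix standing in for X'WX (an assumption).
rng = np.random.default_rng(4)
B = rng.normal(size=(30, 3))
info = B.T @ B
beta = np.array([0.6, -0.4, 0.7])
d = 0.3

lam, Q = np.linalg.eigh(info)             # eigenvalues and orthonormal eigenvectors
variance, squared_bias = liu_smse_terms(lam, Q.T @ beta, d)

# Same quantity from the matrix form: trace of covariance plus squared bias norm.
identity = np.eye(3)
F = np.linalg.solve(info + identity, info + d * identity)
covariance = F @ np.linalg.inv(info) @ F.T
bias = (F - identity) @ beta
smse_matrix_form = np.trace(covariance) + bias @ bias

variance_d1, squared_bias_d1 = liu_smse_terms(lam, Q.T @ beta, 1.0)
print(variance + squared_bias, smse_matrix_form)
```

At $d = 1$ the squared bias vanishes and the variance term reduces to the sum of reciprocal eigenvalues, which is the value for the unshrunken estimator.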

3.1. MSE Properties of the PMQL Liu Regression Estimator

In this subsection, we discuss the MSE properties of the Liu estimator for the PMQL regression model. Further, we compare the LE with the existing estimators, namely the MLE and the RE, to show the superiority of the LE under different conditions in the MSE sense.

Let us define $\Delta = \mathrm{SMSE}(\hat{\beta}_{MLE}) - \mathrm{SMSE}(\hat{\beta}_{LE})$, the difference between the SMSEs of the MLE and the LE. Substituting equations (18) and (24) into this definition gives equation (25).

It is clear that when $d$ equals one, $\hat{\beta}_{LE} = \hat{\beta}_{MLE}$, and then $\Delta = 0$. Further, the LE is said to be superior to the MLE under the SMSE criterion if and only if $\Delta > 0$. Then, if we can find a $d$ such that $\Delta > 0$, we can say that the LE is superior to the MLE in the PMQL regression model.

Liu [22] showed that there exists a $d$ such that the Liu regression estimator has a lower SMSE than the ordinary least squares estimator. Further, Kurtoğlu and Özkale [28] proved that this property holds for GLMs. The following two propositions show that this property holds in the PMQL Liu regression model.

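Proposition 1. The total variance of the regression coefficient estimates of the LE (term I) and the squared bias of the LE (term II) are continuous, monotonically increasing and decreasing functions of $d$, respectively.

Proof. Let $\lambda_j$ be the $j$th eigenvalue of $X'\hat{W}X$ and $\xi_j$ the $j$th element of $\xi = Q'\beta$, where $Q$ is the orthogonal matrix of normalized eigenvectors. The first derivative of term I with respect to $d$ is

$$\frac{\partial}{\partial d}\sum_{j=1}^{p+1}\frac{(\lambda_j + d)^{2}}{\lambda_j(\lambda_j + 1)^{2}} = 2\sum_{j=1}^{p+1}\frac{\lambda_j + d}{\lambda_j(\lambda_j + 1)^{2}}. \tag{26}$$

Since $\lambda_j > 0$ for all $j$, equation (26) is always positive for all $0 < d < 1$. Further, the derivative of term I in the neighborhood of zero,

$$2\sum_{j=1}^{p+1}\frac{1}{(\lambda_j + 1)^{2}}, \tag{27}$$

and the derivative of term I in the neighborhood of one,

$$2\sum_{j=1}^{p+1}\frac{1}{\lambda_j(\lambda_j + 1)}, \tag{28}$$

are positive.

The first derivative of term II with respect to $d$ is

$$\frac{\partial}{\partial d}\,(d - 1)^{2}\sum_{j=1}^{p+1}\frac{\xi_j^{2}}{(\lambda_j + 1)^{2}} = 2(d - 1)\sum_{j=1}^{p+1}\frac{\xi_j^{2}}{(\lambda_j + 1)^{2}}. \tag{29}$$

Since $d - 1 < 0$ and $\xi_j^{2}/(\lambda_j + 1)^{2} \geq 0$ for all $j$, equation (29) is always negative for all $0 < d < 1$ (provided $\xi \neq 0$). Further, the derivative of term II in the neighborhood of zero,

$$-2\sum_{j=1}^{p+1}\frac{\xi_j^{2}}{(\lambda_j + 1)^{2}}, \tag{30}$$

is negative, and the derivative of term II in the neighborhood of one is

$$0. \tag{31}$$

Then, it is shown that term I and term II are continuous, monotonically increasing and decreasing functions of $d$, respectively.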

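Proposition 2. The SMSE given in equation (24) is a continuous, monotonically increasing function of $d$ when $d > (\xi_{\max}^{2} - 1)/(1/\lambda_{\max} + \xi_{\max}^{2})$, where $\xi_{\max}$ is the maximum element of $\xi = Q'\beta$ and $\lambda_{\max}$ is the maximum eigenvalue of the matrix $X'\hat{W}X$.

Proof. The first derivative of equation (24) with respect to $d$ is

$$\frac{\partial\,\mathrm{SMSE}(\hat{\beta}_{LE})}{\partial d} = 2\sum_{j=1}^{p+1}\frac{\lambda_j + d}{\lambda_j(\lambda_j + 1)^{2}} + 2(d - 1)\sum_{j=1}^{p+1}\frac{\xi_j^{2}}{(\lambda_j + 1)^{2}}. \tag{32}$$

One can note that the $j$th summand of equation (32) is positive if $d$ exceeds the individual Liu parameter $d_j = (\xi_j^{2} - 1)/(1/\lambda_j + \xi_j^{2})$. Then, it is clear that when $d > (\xi_{\max}^{2} - 1)/(1/\lambda_{\max} + \xi_{\max}^{2})$, equation (32) is always positive. Then, it is shown that the SMSE is a continuous, monotonically increasing function of $d$ under this condition.
Now, based on the results of Proposition 1, we can conclude that it is possible to find such a value of $d$. Further, the SMSE difference $\Delta$ given in equation (25) equals zero at $d = 1$, and Proposition 2 reveals that $\Delta > 0$ when $d$ lies between the bound in Proposition 2 and one.
Then, it is shown that there exists a $d$ such that $\mathrm{SMSE}(\hat{\beta}_{LE}) < \mathrm{SMSE}(\hat{\beta}_{MLE})$.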
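A short worked step makes the existence argument concrete. It assumes the usual GLM Liu form of the SMSE, with $\lambda_j$ the eigenvalues of $X'\hat{W}X$ and $\xi_j$ the elements of the rotated coefficient vector (this notation is an assumption made here for illustration). Evaluating the derivative of the SMSE at $d = 1$ gives

```latex
\left.\frac{\partial\,\mathrm{SMSE}(\hat{\beta}_{LE})}{\partial d}\right|_{d=1}
  = 2\sum_{j}\frac{\lambda_j + 1}{\lambda_j(\lambda_j + 1)^{2}}
    + 2(1 - 1)\sum_{j}\frac{\xi_j^{2}}{(\lambda_j + 1)^{2}}
  = 2\sum_{j}\frac{1}{\lambda_j(\lambda_j + 1)} > 0 .
```

Because the two estimators coincide at $d = 1$ and the SMSE is strictly increasing there, moving $d$ slightly below one must lower the SMSE, whatever the values of $\xi_j$.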
The following theorem gives the condition under which the LE is superior to the MLE in the PMQL regression model.

Theorem 1. Let and . Then, iff .

Proof. The difference between the MSE matrices of the MLE and the LE is derived in equation (33). Applying the spectral decomposition to this matrix, the difference can be written as in equation (34). The diagonal matrix in equation (34) is pd if $0 < d < 1$. Then, by Lemma A.1 (Appendix A), the LE is superior to the MLE in the MSE sense whenever the condition of Theorem 1 holds. This completes the proof.
The following theorem gives the condition under which the LE is superior to the RE in the PMQL regression model.

Theorem 2. Let and . Then, iff .

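Proof. The difference between the MSE matrices of the RE and the LE is derived in equation (35). By applying the spectral decomposition to this matrix, the difference can be written as in equation (36). In equation (36), the squared-bias matrix of the RE is positive semidefinite, and the diagonal matrix is pd under the condition on $k$ and $d$ stated in Theorem 2.
Then, by Lemma A.1 (Appendix A), the LE is superior to the RE in the MSE sense whenever the condition of Theorem 2 holds. This completes the proof.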

3.2. Estimation of the Liu Parameter

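Based on the MSE properties of the PMQL Liu regression estimator discussed in Section 3.1, it is clear that the performance of the LE depends on the value chosen for the Liu parameter $d$. The optimal value of any individual Liu parameter $d_j$ can be found by setting the $j$th summand of equation (32) to zero and solving for $d$. Then, it is obtained as

$$d_j = \frac{\xi_j^{2} - 1}{1/\lambda_j + \xi_j^{2}}, \tag{37}$$

where $\lambda_j$ is the $j$th eigenvalue of $X'\hat{W}X$ and $\xi_j$ is the $j$th element of the coefficient vector rotated by the normalized eigenvectors of $X'\hat{W}X$.

From equation (37), we can note that the optimum value is negative when $\xi_j^{2} < 1$ and positive when $\xi_j^{2} > 1$. Since the value of $d$ is limited to between 0 and 1, we should use the “max” operator to ensure that the estimated value of $d$ is nonnegative.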
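The optimum and the role of the “max” operator can be sketched as follows. The formula used is the standard individual Liu parameter for GLMs, $(\xi_j^2 - 1)/(1/\lambda_j + \xi_j^2)$, which is assumed here; the input values and function names are invented for illustration.

```python
import numpy as np

def optimal_liu_parameters(eigvals, rotated_beta):
    """Individual optimal d_j, truncated below at zero with the max operator."""
    lam = np.asarray(eigvals, dtype=float)
    xi_sq = np.asarray(rotated_beta, dtype=float) ** 2
    d_raw = (xi_sq - 1.0) / (1.0 / lam + xi_sq)
    return np.maximum(0.0, d_raw), d_raw

def component_smse(lam, xi, d):
    """Contribution of one eigen-direction to the SMSE of the Liu estimator."""
    return (lam + d) ** 2 / (lam * (lam + 1.0) ** 2) + (d - 1.0) ** 2 * xi ** 2 / (lam + 1.0) ** 2

# Illustrative values: the first component has xi^2 > 1, the second xi^2 < 1.
lam = np.array([4.0, 0.5])
xi = np.array([2.0, 0.5])
d_hat, d_raw = optimal_liu_parameters(lam, xi)

# The untruncated optimum of the first component minimizes its SMSE contribution.
at_optimum = component_smse(lam[0], xi[0], d_raw[0])
nearby = [component_smse(lam[0], xi[0], d_raw[0] + step) for step in (-0.01, 0.01)]
print(d_raw, d_hat)
```

Here the second raw value is negative because its squared rotated coefficient is below one, so the truncated estimate is zero.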

In this subsection, we adopt some notable existing Liu parameter estimators in order to estimate the Liu parameter in the LE. They are summarized in Table 1 (Appendix 2). We define these Liu parameter estimators for the LE based on the theoretical works of Hoerl and Kennard [20], Kibria [35], and Khalaf and Shukur [36].

To estimate the ridge parameter in the PMQL ridge regression estimator, Tharshan and Wijekoon [19] discussed 12 estimation methods. Among them, they recommended three ridge parameter estimators based on the works of Hoerl and Kennard [20], Nomura [37], and Muniz and Kibria [38]. These ridge parameter estimators are summarized in Table 1 (Appendix 2). Therefore, these three estimators will be used to estimate $k$ in the RE in the simulation study.

4. The Monte Carlo Simulation Study

In this section, a simulation study is carried out to evaluate the performance of the MLE, the PMQL ridge regression estimators, and the PMQL Liu regression estimators based on the various ridge and Liu parameter estimators discussed in Section 3.2. We compare the performance of the different estimators in the SMSE sense. A brief description of the simulation study is given below.

4.1. The Design of the Simulation Study

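Since the performance of the various estimators depends greatly on the degree of correlation between the covariates, we generate covariates with several degrees of multicollinearity by following the formula used by McDonald and Galarneau [39]:

$$x_{ij} = (1 - \rho^{2})^{1/2} z_{ij} + \rho\, z_{i,p+1}, \quad i = 1, \ldots, n, \; j = 1, \ldots, p,$$

where the $z_{ij}$'s are independent standard normal pseudorandom numbers and $\rho^{2}$ is the correlation between any two covariates. The response variable of the PMQL regression model is generated from the PMQL distribution by using the inverse transform method, where the mean $\mu_i$ is determined by the covariates through the link given in equation (8). The starting values of the slope parameters are selected such that $\sum_{j=1}^{p}\beta_j^{2} = 1$ and $\beta_1 = \beta_2 = \cdots = \beta_p$.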
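The covariate-generation step can be sketched as below. It assumes the McDonald and Galarneau construction in its usual form, where each covariate mixes its own standard normal draw with one shared draw, so that any two covariates have correlation $\rho^2$; the sample size, number of covariates, and function name are illustrative choices.

```python
import numpy as np

def generate_covariates(n, p, rho, rng):
    """Generate n rows of p covariates with pairwise correlation rho**2."""
    z = rng.standard_normal((n, p + 1))
    # Column p is the shared component; columns 0..p-1 are covariate-specific.
    return np.sqrt(1.0 - rho ** 2) * z[:, :p] + rho * z[:, [p]]

rng = np.random.default_rng(6)
X = generate_covariates(n=20000, p=4, rho=0.9, rng=rng)

sample_corr = np.corrcoef(X, rowvar=False)
off_diagonal = sample_corr[~np.eye(4, dtype=bool)]
print(off_diagonal.round(3))  # values near 0.9**2 = 0.81
```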

Table 2 summarizes the factors and their levels that are considered in this design. Since either a large increase or a large decrease in the variation of the response may have a negative impact on the performance of the estimators [19, 24, 40, 41], we vary the intercept and the two overdispersion parameters. When we decrease the value of the intercept, the average value of the mean response decreases. This leads to more zeros in the response, which leaves very little variation in the sample. Further, from Figure 1, we can observe that changing the value of either overdispersion parameter affects the variation of the response.

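The simulation is repeated 1000 times. To judge the performance of the different estimators, we obtain the SMSE values of the different estimators by using the following equation:

$$\mathrm{SMSE}(\hat{\beta}) = \frac{1}{1000}\sum_{r=1}^{1000} (\hat{\beta}_r - \beta)'(\hat{\beta}_r - \beta),$$

where $\hat{\beta}_r$ is an estimator of $\beta$ at the $r$th replication.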
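The simulated SMSE is the average squared Euclidean distance between the replicated estimates and the true coefficient vector. A minimal sketch, with a hypothetical function name and a two-replication toy input chosen so the result is easy to verify by hand:

```python
import numpy as np

def simulated_smse(estimates, beta_true):
    """Average of (beta_hat_r - beta)'(beta_hat_r - beta) over replications.

    estimates: array of shape (replications, number of coefficients).
    """
    errors = np.asarray(estimates, dtype=float) - np.asarray(beta_true, dtype=float)
    return np.mean(np.sum(errors ** 2, axis=1))

# Toy input: each replication misses the first coefficient by one unit,
# so every squared distance is 1 and the average is 1.
toy_estimates = [[1.0, 2.0], [3.0, 2.0]]
toy_smse = simulated_smse(toy_estimates, beta_true=[2.0, 2.0])
print(toy_smse)
```

In a full study, each row of `estimates` would hold the estimate from one simulated data set, with one such array per estimator being compared.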

4.2. Results and Discussion of the Simulation Study

The estimated SMSE values of the Monte Carlo simulations are summarized in Tables 3–10 (Appendix 2) for the selected situations shown in Table 2. The minimum SMSE in each combination of factor levels is presented in bold. In general, we can note that the LE is more efficient than the MLE and the RE in all cases reviewed in this study. Further, the performance of all regression coefficient estimators is affected by the degree of correlation among the covariates, the sample size, the value of the intercept, the number of covariates, and the values of the overdispersion parameters.

It can be noticed that as the degree of correlation increases, the estimated SMSE of the MLE increases, and the PMQL Liu regression estimators based on several of the Liu parameter estimators are also affected negatively in some cases. However, the PMQL Liu regression estimator based on the best-performing Liu parameter estimator is not affected and consistently yields a smaller SMSE in all cases. Further, its estimated SMSE decreases as the degree of correlation increases.

When the sample size increases, the SMSEs of the MLE and of the PMQL ridge and Liu regression estimators decrease. This reveals that the asymptotic property holds for all the given estimators. Further, for a given sample size, the PMQL Liu regression estimator based on the best-performing Liu parameter estimator performs better than the MLE and the PMQL ridge regression estimators in all given situations.

It is clearly observed that as the intercept decreases from 1 to −1, the SMSEs of the MLE and of the PMQL Liu regression estimators based on several of the Liu parameter estimators increase considerably, and the estimators based on some of the remaining parameter estimators are also affected negatively in some cases. However, this change has essentially no effect on the performance of the PMQL Liu regression estimator based on the best-performing Liu parameter estimator.

Increasing the number of covariates negatively affects the performance of the MLE in all given situations, and the PMQL ridge regression estimators are also negatively affected in some situations. However, the PMQL Liu regression estimator based on the best-performing Liu parameter estimator is not negatively affected by an increase in the number of covariates in any of the given situations. Further, as the number of covariates increases with the other factors held fixed, the SMSEs produced by this estimator decrease.

We can clearly note that increasing either overdispersion parameter (from 0.03 to 0.25 for one and from 0.02 to 0.04 for the other) has a positive impact on the performance of the different estimators. From Figure 1, we can note that, in both situations, the variance of the response decreases. These results are in line with the simulation results of [17, 41].

Then, based on the simulation study results, we may say that the Liu parameter estimator that yields the smallest SMSE throughout the study is the best option for estimating the Liu parameter in the LE.

5. Applications

In this section, we use a simulated data set to show the applicability of the PMQL Liu regression model to count data with a high index of dispersion and a long right tail, in comparison with the existing Liu regression models for count data, namely, the NB Liu regression and Poisson Liu regression models. Further, we illustrate the applicability of the PMQL Liu regression estimator over the MLE by using a real data set. We have observed that the PMQL Liu regression estimator based on the best-performing Liu parameter estimator does well in the simulation study. Further, the same Liu parameter estimator was recommended by Månsson [25] for the NB Liu regression model and by Månsson et al. [23] for the Poisson Liu regression model. Then, we use this estimator to estimate the Liu parameter in all the Liu regression estimators considered.

5.1. Simulated Data Application

A data set was simulated by using the method discussed in Section 4.1 in order to show the applicability of the proposed Liu regression model over the NB Liu and Poisson Liu regression models. The skewness, excess kurtosis, and index of dispersion of the response variable are 4.628, 20.113, and 35.217, respectively. Then, the distribution of the response has high positive skewness, a long right tail, and high overdispersion (index of dispersion greater than one). Figure 2 also illustrates the distribution of the response variable. Table 11 displays the estimated regression coefficients, their standard errors (SEs) (in parentheses), SMSE values, and Akaike information criterion (AIC) values for the given regression models. We can clearly observe that the PMQL Liu regression model produces smaller SEs for the coefficients of the covariates, a smaller SMSE value, and a smaller AIC value than the other regression models. Then, we may say that the PMQL Liu regression model performs better than the other regression models based on the Liu estimator.

5.2. Real Data Application

The applicability of the PMQL Liu estimator over the MLE is illustrated by using the Swedish football data set, which consists of the Swedish football teams’ performance in the top Swedish league (Allsvenskan) for the year 2012. Qasim et al. [32] used a similar data set for the year 2018 to fit a Poisson regression model. The data set contains 242 observations on the number of full-time home team goals (the response variable) and six covariates: the pinnacle home win odds, the pinnacle away win odds, the maximum odds portal home win, the maximum odds portal away win, the average odds portal home win, and the average odds portal away win.

The condition number, which is the ratio of the maximum to the minimum eigenvalue, is 33460.350, which is clearly much larger than 1000. The index of dispersion of the response is 1.201, which is greater than one. These results indicate that there is severe multicollinearity among the covariates and that the response is overdispersed. Further, to examine whether the PMQL distribution is suitable for the response, the chi-square goodness of fit test is employed. The $\chi^{2}$ value is 3.159 with a $p$-value of 0.531. Then, this test confirms that the PMQL distribution fits this response variable well.
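The condition-number diagnostic can be sketched as follows. The text does not state here which matrix the eigenvalues are taken from, so the sketch assumes the cross-product matrix $X'X$ of the covariates; the collinear design is simulated for illustration and is not the football data.

```python
import numpy as np

def condition_number(matrix):
    """Ratio of the maximum to the minimum eigenvalue of a symmetric matrix."""
    eigvals = np.linalg.eigvalsh(matrix)
    return eigvals.max() / eigvals.min()

# Hand-checkable case: eigenvalues 4 and 1 give a ratio of 4.
simple_cn = condition_number(np.diag([4.0, 1.0]))

# Simulated, nearly collinear covariates (illustrative values only).
rng = np.random.default_rng(7)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)   # almost a copy of x1
X = np.column_stack([x1, x2])
collinear_cn = condition_number(X.T @ X)
print(simple_cn, collinear_cn)
```

Values above 1000, as for the simulated design here, are read in the text as a sign of severe multicollinearity.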

Table 12 lists the estimated regression coefficients, their standard errors (SEs) (in parentheses), and SMSE values for the MLE and the PMQL Liu estimator. It can be clearly noted that the PMQL Liu estimator produces smaller SEs and a smaller SMSE than the MLE. Then, we can conclude that the PMQL Liu estimator performs better than the MLE for this real data set, which has severe multicollinearity.

Now, we verify Theorem 1 by using the real-world application. The conditions of Theorem 1 hold for these data, and the minimum eigenvalue of the MSE difference matrix is positive; hence, Theorem 1 is verified.

6. Conclusion

This paper introduced the Liu estimator for the Poisson-Modification of the Quasi Lindley (PMQL) regression model as an alternative to the maximum likelihood estimator (MLE) in order to mitigate multicollinearity in the presence of overdispersion. A comprehensive Monte Carlo simulation study was conducted to compare the performance of the MLE, the PMQL Liu regression estimator, and an existing biased estimator, namely, the PMQL ridge regression estimator. The scalar mean square error (SMSE) was used as the evaluation criterion. The simulation results revealed that the performance of the different estimators is affected by the levels of factors such as the correlation between the covariates, the sample size, the number of covariates, the intercept, and the overdispersion parameters of the PMQL regression model. Further, the PMQL Liu regression estimator based on the recommended Liu parameter estimator performs better than the other estimators in all situations reviewed in the simulation study. The results for a simulated data set show that the PMQL Liu regression model performs better than the existing count Liu regression models, namely, the negative binomial Liu regression and the Poisson Liu regression models, by accommodating high overdispersion and mitigating multicollinearity. Further, a real-world application also shows that the PMQL Liu regression estimator based on the recommended Liu parameter estimator performs better than the MLE. Therefore, based on the simulation study and the applications, the PMQL Liu regression estimator based on this Liu parameter estimator is recommended for analyzing overdispersed count responses with intercorrelated covariates.

Appendix

A. Lemma

Lemma A.1. Let $M$ be a positive definite (pd) matrix and $c$ be a vector of nonzero constants. Then, $M - cc' \geq 0$ iff $c'M^{-1}c \leq 1$ [42].
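The lemma can be checked numerically in a small case. The sketch below assumes the lemma in its standard form (for positive definite $M$, the matrix $M - cc'$ is nonnegative definite exactly when $c'M^{-1}c \le 1$); the matrices, vectors, and function name are invented for illustration.

```python
import numpy as np

def lemma_quantities(M, c):
    """Return the quadratic form c'M^{-1}c and the smallest eigenvalue of M - cc'."""
    quadratic_form = c @ np.linalg.solve(M, c)
    min_eigenvalue = np.linalg.eigvalsh(M - np.outer(c, c)).min()
    return quadratic_form, min_eigenvalue

M = 2.0 * np.eye(2)  # a simple positive definite matrix

# Case 1: quadratic form below one, so M - cc' should have no negative eigenvalue.
quad_small, min_eig_small = lemma_quantities(M, np.array([1.0, 0.5]))

# Case 2: quadratic form above one, so M - cc' should have a negative eigenvalue.
quad_large, min_eig_large = lemma_quantities(M, np.array([2.0, 0.0]))
print(quad_small, min_eig_small, quad_large, min_eig_large)
```

In the theorems, $c$ plays the role of a bias vector and $M$ the difference of two covariance matrices, so the same check applies to estimated quantities.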

Data Availability

The data set is available for the public at https://www.football-data.co.uk/sweden.php. The data are also available from the author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.