Abstract

The linear regression model (LRM) is commonly used to model the QSAR relationship between a response variable (biological activity) and one or more physicochemical or structural properties, which serve as the explanatory variables, mainly when the response variable is normally distributed. The gamma regression model (GRM) is often employed when the dependent variable is skewed. The parameters of both models are estimated by maximum likelihood. However, the maximum likelihood estimator (MLE) becomes unstable in the presence of multicollinearity in both models. In this study, we propose a new estimator and suggest some biasing parameters for estimating the regression parameters of the gamma regression model when there is multicollinearity. A simulation study and a real-life application were carried out to evaluate the estimators' performance under the mean squared error (MSE) criterion. Both revealed that the proposed gamma estimator produced lower MSE values than the other estimators considered.

1. Introduction

The gamma regression model (GRM) is generally adopted to model a skewed response variable that follows a gamma distribution, given one or more independent variables. It is used to model real-life data in several fields, such as the medical sciences, health care economics, and automobile insurance claims [1]. When a positively skewed response variable follows a gamma distribution given a set of independent variables, the gamma regression model is preferred [24]. As in linear regression models, the assumption that the explanatory variables are independent rarely holds in practice, so multicollinearity also arises in gamma regression models. As a result, the maximum likelihood estimator (MLE) is unstable and has high variance [5]. Consequently, constructing confidence intervals for, or testing, the regression parameters of the model becomes difficult [6]. Many authors have proposed estimators for handling multicollinearity. The ridge estimator of Hoerl and Kennard [7] is an alternative to the MLE for overcoming multicollinearity in the linear regression model. The estimator has been extended to generalized linear models (GLM) (see [8, 9]). Månsson and Shukur [10] and Månsson [11] introduced the ridge estimator to the Poisson regression model and the negative binomial regression model, respectively. Kurtoglu and Ozkale [12] extended the Liu estimator of Liu [13] to the gamma regression model. Batah et al. [14] proposed a modified jackknife ridge estimator by combining the ideas of the generalized ridge estimator and the jackknifed ridge estimator. Algamal [3] developed the modified jackknifed ridge gamma regression estimator. Recently, a modified version of the ridge regression estimator with two biasing parameters was proposed for both the LRM and the GRM [15, 16]. Kibria and Lukman [17] proposed a new ridge-type estimator and applied it to the linear regression model.

The main objective of this article is to extend the new ridge-type estimator of Kibria and Lukman [17] to the GRM. The article is organized as follows. In Section 2, we propose the new ridge-type gamma estimator, derive its properties, compare it theoretically with existing estimators, and explain the estimation of the biasing parameter. In Section 3, a simulation study is conducted to investigate and compare the performance of the new gamma estimator and some existing estimators. In Section 4, we analyze a real-life dataset. Finally, Section 5 provides some concluding remarks.

2. The Statistical Methodology

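Consider a response variable $y_i$ that follows a gamma distribution with nonnegative shape parameter $\nu$ and nonnegative scale parameter $\tau$, with probability density function
$$f(y_i) = \frac{1}{\Gamma(\nu)\,\tau^{\nu}}\, y_i^{\nu-1} e^{-y_i/\tau}, \qquad y_i \ge 0, \tag{1}$$
where $E(y_i) = \nu\tau = \theta_i$ and $\operatorname{Var}(y_i) = \nu\tau^{2} = \theta_i^{2}/\nu$, and $\theta_i$ depends on the parameter vector $\beta$ through the linear predictor $x_i'\beta$. The log-likelihood function of (1) is
$$\ell(\beta) = \sum_{i=1}^{n}\left[(\nu-1)\ln y_i - \frac{\nu y_i}{\theta_i} - \nu\ln\frac{\theta_i}{\nu} - \ln\Gamma(\nu)\right]. \tag{2}$$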

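Because the likelihood equations obtained from (2) are nonlinear in $\beta$, they are solved iteratively by the Fisher scoring method:
$$\beta^{(t+1)} = \beta^{(t)} + I^{-1}\big(\beta^{(t)}\big)\,S\big(\beta^{(t)}\big),$$
where $t$ is the iteration index, $S(\beta) = \partial \ell(\beta)/\partial\beta$, and $I(\beta) = -E\big[\partial^{2}\ell(\beta)/\partial\beta\,\partial\beta'\big]$. At convergence, the estimated coefficients are
$$\hat{\beta}_{MLE} = H^{-1}X'\hat{W}\hat{z},$$
where $H = X'\hat{W}X$, $X$ is the $n \times p$ matrix of explanatory variables, $\hat{W}$ is the diagonal weight matrix, and $\hat{z}$ is the adjusted response vector with $i$th element $\hat{z}_i$. Both $\hat{W}$ and $\hat{z}$ are obtained from the Fisher scoring iterations (see [12, 18]). The covariance matrix, the matrix mean squared error (MMSE), and the scalar mean squared error (MSE) of the MLE were obtained by Algamal and Asar [19] and are written, respectively, as
$$\operatorname{Cov}(\hat{\beta}_{MLE}) = \phi H^{-1}, \qquad \operatorname{MMSE}(\hat{\beta}_{MLE}) = \phi H^{-1}, \qquad \operatorname{MSE}(\hat{\beta}_{MLE}) = \phi\sum_{j=1}^{p}\frac{1}{\lambda_j},$$
where $\phi$ is the dispersion parameter, $\lambda_j$ is the $j$th eigenvalue of the matrix $X'\hat{W}X$, and $X'$ denotes the transpose of $X$.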
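The Fisher scoring iteration described above can be sketched in code. This is a minimal, standard-library-only illustration and not the authors' implementation. It assumes a log link, $\theta_i = \exp(x_i'\beta)$, which is not stated in the text; under that assumption the Fisher weights all equal one. The function names `solve` and `fit_gamma_log_link` are hypothetical.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_gamma_log_link(X, y, max_iter=50, tol=1e-10):
    """Fisher scoring (IRLS) for a gamma GLM, ASSUMING a log link."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(max_iter):
        eta = [sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
        mu = [math.exp(e) for e in eta]
        # Fisher weight (d mu / d eta)^2 / V(mu) = mu^2 / mu^2 = 1 under the log link
        w = [1.0] * n
        # adjusted response z = eta + (y - mu) * d eta / d mu
        z = [eta[i] + (y[i] - mu[i]) / mu[i] for i in range(n)]
        H = [[sum(w[i] * X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
             for a in range(p)]
        g = [sum(w[i] * X[i][a] * z[i] for i in range(n)) for a in range(p)]
        new_beta = solve(H, g)  # beta = (X'WX)^{-1} X'Wz
        step = max(abs(nb - b0) for nb, b0 in zip(new_beta, beta))
        beta = new_beta
        if step < tol:
            break
    return beta
```

With a different link, only the lines computing `mu`, `w`, and `z` would change; the weighted least-squares update itself keeps the same form.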

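The gamma ridge estimator (GRE) is defined as
$$\hat{\beta}_{GRE} = (H + kI)^{-1}H\,\hat{\beta}_{MLE},$$
where $H = X'\hat{W}X$ and $k > 0$ is the biasing parameter. The MMSE and MSE of the GRE are given by
$$\operatorname{MMSE}(\hat{\beta}_{GRE}) = \phi\, Q\Lambda_k^{-1}\Lambda\Lambda_k^{-1}Q' + k^{2}\, Q\Lambda_k^{-1}\alpha\alpha'\Lambda_k^{-1}Q',$$
$$\operatorname{MSE}(\hat{\beta}_{GRE}) = \phi\sum_{j=1}^{p}\frac{\lambda_j}{(\lambda_j+k)^{2}} + k^{2}\sum_{j=1}^{p}\frac{\alpha_j^{2}}{(\lambda_j+k)^{2}},$$
where $\Lambda = \operatorname{diag}(\lambda_1,\dots,\lambda_p)$, $\Lambda_k = \Lambda + kI$, and $\alpha = Q'\beta$, such that $Q$ is the matrix of eigenvectors of $H$.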

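The gamma Liu estimator (GLE) is given by
$$\hat{\beta}_{GLE} = (H + I)^{-1}(H + dI)\,\hat{\beta}_{MLE},$$
where $H = X'\hat{W}X$ and $0 < d < 1$ is the biasing parameter.

The MMSE and MSE of the GLE are given by
$$\operatorname{MMSE}(\hat{\beta}_{GLE}) = \phi\, Q(\Lambda+I)^{-1}(\Lambda+dI)\Lambda^{-1}(\Lambda+dI)(\Lambda+I)^{-1}Q' + (d-1)^{2}\, Q(\Lambda+I)^{-1}\alpha\alpha'(\Lambda+I)^{-1}Q',$$
$$\operatorname{MSE}(\hat{\beta}_{GLE}) = \phi\sum_{j=1}^{p}\frac{(\lambda_j+d)^{2}}{\lambda_j(\lambda_j+1)^{2}} + (d-1)^{2}\sum_{j=1}^{p}\frac{\alpha_j^{2}}{(\lambda_j+1)^{2}},$$
where $\Lambda = \operatorname{diag}(\lambda_1,\dots,\lambda_p)$ holds the eigenvalues of $H$, $Q$ is the matrix of its eigenvectors, and $\alpha = Q'\beta$.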
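The scalar MSE expressions for the MLE, GRE, and GLE are easy to evaluate once the eigenvalues are known. The sketch below assumes the standard eigenvalue forms of these MSEs, with `lam` holding the eigenvalues $\lambda_j$ of $X'\hat{W}X$, `alpha` the components $\alpha_j$, and `phi` the dispersion. The function names and the numerical values are illustrative and are not taken from the study.

```python
def mse_mle(lam, phi):
    """Scalar MSE of the MLE: phi * sum(1 / lambda_j)."""
    return phi * sum(1.0 / l for l in lam)


def mse_gre(lam, alpha, phi, k):
    """Scalar MSE of the gamma ridge estimator: variance part plus squared bias."""
    var = phi * sum(l / (l + k) ** 2 for l in lam)
    bias2 = k ** 2 * sum(a ** 2 / (l + k) ** 2 for l, a in zip(lam, alpha))
    return var + bias2


def mse_gle(lam, alpha, phi, d):
    """Scalar MSE of the gamma Liu estimator: variance part plus squared bias."""
    var = phi * sum((l + d) ** 2 / (l * (l + 1.0) ** 2) for l in lam)
    bias2 = (d - 1.0) ** 2 * sum(a ** 2 / (l + 1.0) ** 2 for l, a in zip(lam, alpha))
    return var + bias2


# Hypothetical ill-conditioned spectrum: one eigenvalue close to zero.
lam = [10.0, 1.0, 0.01]
alpha = [3 ** -0.5] * 3  # chosen so that alpha'alpha = 1
phi = 1.0

baseline = mse_mle(lam, phi)
ridge = mse_gre(lam, alpha, phi, k=0.1)
liu = mse_gle(lam, alpha, phi, d=0.5)
```

Setting `k = 0` in `mse_gre` or `d = 1` in `mse_gle` recovers `mse_mle`, which is a quick sanity check on the formulas.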

2.1. The New Gamma Estimator

For the linear regression model, Kibria and Lukman [17] proposed a new ridge-type estimator, called the Kibria–Lukman (KL) estimator, which is defined as
$$\hat{\beta}_{KL} = (S + kI)^{-1}(S - kI)\,\hat{\beta}_{OLS},$$
where $S = X'X$, $\hat{\beta}_{OLS} = S^{-1}X'y$, and $k > 0$.

In this study, we extend the KL estimator to the GRM and refer to the resulting estimator as the gamma KL (GKL) estimator, which is written as
$$\hat{\beta}_{GKL} = (H + kI)^{-1}(H - kI)\,\hat{\beta}_{MLE},$$
where $H = X'\hat{W}X$ and $k > 0$.

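The bias and covariance matrix of the GKL estimator are, respectively,
$$\operatorname{Bias}(\hat{\beta}_{GKL}) = -2k(H + kI)^{-1}\beta,$$
$$\operatorname{Cov}(\hat{\beta}_{GKL}) = \phi\,(H + kI)^{-1}(H - kI)H^{-1}(H - kI)(H + kI)^{-1},$$
where $H = X'\hat{W}X$ and $\phi$ is the dispersion parameter.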

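So, the MMSE and MSE in terms of eigenvalues are, respectively,
$$\operatorname{MMSE}(\hat{\beta}_{GKL}) = \phi\, Q\Lambda_k^{-1}(\Lambda - kI)\Lambda^{-1}(\Lambda - kI)\Lambda_k^{-1}Q' + 4k^{2}\, Q\Lambda_k^{-1}\alpha\alpha'\Lambda_k^{-1}Q',$$
$$\operatorname{MSE}(\hat{\beta}_{GKL}) = \phi\sum_{j=1}^{p}\frac{(\lambda_j-k)^{2}}{\lambda_j(\lambda_j+k)^{2}} + 4k^{2}\sum_{j=1}^{p}\frac{\alpha_j^{2}}{(\lambda_j+k)^{2}},$$
where $\Lambda = \operatorname{diag}(\lambda_1,\dots,\lambda_p)$ holds the eigenvalues of $X'\hat{W}X$, $\Lambda_k = \Lambda + kI$, $Q$ is the matrix of eigenvectors, and $\alpha = Q'\beta$.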
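In canonical coordinates the GKL estimator acts component by component: each coordinate of $Q'\hat{\beta}_{MLE}$ is multiplied by $(\lambda_j - k)/(\lambda_j + k)$. The sketch below illustrates this, together with the scalar MSE, under the assumption that the GKL estimator has the form $(H+kI)^{-1}(H-kI)\hat{\beta}_{MLE}$. The function names and the numbers are hypothetical.

```python
def gkl_canonical(alpha_mle, lam, k):
    """Apply the GKL shrinkage factor (lambda_j - k) / (lambda_j + k) to each
    canonical MLE coordinate alpha_mle[j] = (Q' beta_mle)[j]."""
    return [(l - k) / (l + k) * a for a, l in zip(alpha_mle, lam)]


def mse_gkl(lam, alpha, phi, k):
    """Scalar MSE of the GKL estimator: variance part plus squared bias."""
    var = phi * sum((l - k) ** 2 / (l * (l + k) ** 2) for l in lam)
    bias2 = 4.0 * k ** 2 * sum(a ** 2 / (l + k) ** 2 for l, a in zip(lam, alpha))
    return var + bias2


# Hypothetical ill-conditioned spectrum: one eigenvalue close to zero.
lam = [10.0, 1.0, 0.01]
alpha = [3 ** -0.5] * 3  # chosen so that alpha'alpha = 1
phi = 1.0

mse_at_zero = mse_gkl(lam, alpha, phi, 0.0)  # k = 0 gives the MLE value phi * sum(1/lam)
mse_shrunk = mse_gkl(lam, alpha, phi, 0.1)   # a small positive k trades bias for variance
```

Coordinates with $\lambda_j$ much larger than $k$ are left almost unchanged, whereas coordinates with $\lambda_j$ near $k$ are shrunk strongly, which is where the variance reduction comes from.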

2.2. The Theoretical Comparison for the Estimators

The following lemmas are needed for the theoretical comparison of the estimators.

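Lemma 1. Let $M$ and $N$ be $n \times n$ matrices, where $M$ is positive definite (p.d.) and $N$ is p.d. (or nonnegative definite); then, $M > N$ iff $\lambda_{\max}(NM^{-1}) < 1$, where $\lambda_{\max}(NM^{-1})$ is the largest eigenvalue of the matrix $NM^{-1}$ [20].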

Lemma 2. Let $M$ be an $n \times n$ p.d. matrix and $c$ be a vector; then, $M - cc'$ is p.d. iff $c'M^{-1}c < 1$ [21].

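Lemma 3. Let $\hat{\beta}_1$ and $\hat{\beta}_2$ be two linear estimators of $\beta$. Suppose that $D = \operatorname{Cov}(\hat{\beta}_1) - \operatorname{Cov}(\hat{\beta}_2)$ is p.d., where $\operatorname{Cov}(\hat{\beta}_i)$ is the covariance matrix of $\hat{\beta}_i$ and $b_i = \operatorname{Bias}(\hat{\beta}_i)$, $i = 1, 2$. Consequently,
$$\Delta = \operatorname{MMSE}(\hat{\beta}_1) - \operatorname{MMSE}(\hat{\beta}_2) = D + b_1b_1' - b_2b_2' > 0$$
if $b_2'\,[D + b_1b_1']^{-1}\,b_2 < 1$, where $\operatorname{MMSE}(\hat{\beta}_i) = \operatorname{Cov}(\hat{\beta}_i) + b_ib_i'$ [22].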

2.2.1. Comparison of GKL and MLE

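Theorem 1. $\hat{\beta}_{GKL}$ is better than $\hat{\beta}_{MLE}$ if
$$b_{GKL}'\left[\phi\left(H^{-1} - (H+kI)^{-1}(H-kI)H^{-1}(H-kI)(H+kI)^{-1}\right)\right]^{-1} b_{GKL} < 1,$$
where $b_{GKL} = -2k(H+kI)^{-1}\beta$ and $H = X'\hat{W}X$.

Proof. The difference of the dispersion matrices is
$$D = \operatorname{Cov}(\hat{\beta}_{MLE}) - \operatorname{Cov}(\hat{\beta}_{GKL}) = \phi\, Q\,\operatorname{diag}\left\{\frac{1}{\lambda_j} - \frac{(\lambda_j-k)^{2}}{\lambda_j(\lambda_j+k)^{2}}\right\}_{j=1}^{p} Q'.$$
We observe that $D$ is positive definite (p.d.) since $(\lambda_j+k)^{2} - (\lambda_j-k)^{2} = 4\lambda_jk > 0$ for $k > 0$. By Lemma 3, the proof is done.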
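The positive-definiteness step in this proof can be checked by a short calculation. The worked equation below assumes the eigenvalue forms $\phi/\lambda_j$ for the variance of the MLE and $\phi(\lambda_j-k)^2/\{\lambda_j(\lambda_j+k)^2\}$ for that of the GKL estimator; these forms are an assumption about the displays that are not reproduced here.

```latex
\frac{1}{\lambda_j} - \frac{(\lambda_j-k)^2}{\lambda_j(\lambda_j+k)^2}
  = \frac{(\lambda_j+k)^2-(\lambda_j-k)^2}{\lambda_j(\lambda_j+k)^2}
  = \frac{4\lambda_j k}{\lambda_j(\lambda_j+k)^2}
  = \frac{4k}{(\lambda_j+k)^2} > 0
  \qquad \text{for } k > 0,\ \lambda_j > 0 .
```

Each diagonal entry of the difference is therefore $4\phi k/(\lambda_j+k)^2$, which is strictly positive for every $j$ whenever $\phi > 0$.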

2.2.2. Comparison of GKL and GRE

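Theorem 2. $\hat{\beta}_{GKL}$ is superior to $\hat{\beta}_{GRE}$ if
$$\lambda_{\max}(NM^{-1}) < 1,$$
where $\lambda_{\max}(NM^{-1})$ is the largest eigenvalue of $NM^{-1}$ and $M$ and $N$ are defined in the proof.

Proof. $\operatorname{MMSE}(\hat{\beta}_{GRE}) - \operatorname{MMSE}(\hat{\beta}_{GKL}) = Q(M - N)Q'$, where $M = \phi\Lambda_k^{-1}\Lambda\Lambda_k^{-1} + k^{2}\Lambda_k^{-1}\alpha\alpha'\Lambda_k^{-1}$ and $N = \phi\Lambda_k^{-1}(\Lambda - kI)\Lambda^{-1}(\Lambda - kI)\Lambda_k^{-1} + 4k^{2}\Lambda_k^{-1}\alpha\alpha'\Lambda_k^{-1}$, with $\Lambda_k = \Lambda + kI$.
Clearly, for the biasing parameter $k > 0$, $M$ is p.d. and $N$ is nonnegative definite. Hence $M - N > 0$ if $\lambda_{\max}(NM^{-1}) < 1$, where $\lambda_{\max}(NM^{-1})$ is the largest eigenvalue of the matrix $NM^{-1}$. By Lemma 1, the proof is done.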

2.2.3. Comparison of GKL and GLE

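Theorem 3. $\hat{\beta}_{GKL}$ is superior to $\hat{\beta}_{GLE}$ if
$$b_{GKL}'\left[D + b_{GLE}b_{GLE}'\right]^{-1} b_{GKL} < 1,$$
where $D = \operatorname{Cov}(\hat{\beta}_{GLE}) - \operatorname{Cov}(\hat{\beta}_{GKL})$, $b_{GKL} = -2k(H+kI)^{-1}\beta$, $b_{GLE} = (d-1)(H+I)^{-1}\beta$, and $H = X'\hat{W}X$.

Proof. The difference of the dispersion matrices is
$$D = \phi\, Q\,\operatorname{diag}\left\{\frac{(\lambda_j+d)^{2}}{\lambda_j(\lambda_j+1)^{2}} - \frac{(\lambda_j-k)^{2}}{\lambda_j(\lambda_j+k)^{2}}\right\}_{j=1}^{p} Q'.$$
We observe that $D$ is p.d. provided that $(\lambda_j+d)^{2}(\lambda_j+k)^{2} - (\lambda_j-k)^{2}(\lambda_j+1)^{2} > 0$ for all $j$, with $k > 0$ and $0 < d < 1$. By Lemma 3, the proof is done.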

2.2.4. Estimation of Parameter k

The optimal value of $k$ in $\hat{\beta}_{GKL}$ is adopted from the KL estimator of Kibria and Lukman [17] as follows:
$$k = \frac{\phi}{2\alpha_j^{2} + \phi/\lambda_j}. \tag{24}$$

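The optimal value of $k$ given in (24) depends on the unknown parameters $\phi$ and $\alpha_j^{2}$. Therefore, we replace them with their corresponding unbiased estimators. Consequently,
$$\hat{k}_j = \frac{\hat{\phi}}{2\hat{\alpha}_j^{2} + \hat{\phi}/\lambda_j}.$$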
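As a quick check on the plug-in rule, the sketch below evaluates $k_j = \phi/(2\alpha_j^2 + \phi/\lambda_j)$, the value obtained by setting the derivative of the $j$th term of the GKL scalar MSE to zero, and confirms numerically that it minimizes that term. Collapsing the $p$ values into a single $k$ by taking their minimum (`k_min`) is one common convention and is an assumption here, not something stated in the text. All function names and numbers are hypothetical.

```python
def k_opt(lam_j, alpha_j, phi):
    """Per-component value k_j = phi / (2 * alpha_j^2 + phi / lambda_j)."""
    return phi / (2.0 * alpha_j ** 2 + phi / lam_j)


def component_mse(lam_j, alpha_j, phi, k):
    """j-th term of the GKL scalar MSE (variance part plus squared bias)."""
    var = phi * (lam_j - k) ** 2 / (lam_j * (lam_j + k) ** 2)
    bias2 = 4.0 * k ** 2 * alpha_j ** 2 / (lam_j + k) ** 2
    return var + bias2


def k_min(lam, alpha_hat, phi_hat):
    """ASSUMED summary rule: the smallest of the per-component values."""
    return min(k_opt(l, a, phi_hat) for l, a in zip(lam, alpha_hat))


# Hypothetical inputs for illustration only.
lam = [10.0, 1.0, 0.01]
alpha_hat = [3 ** -0.5] * 3
phi_hat = 1.0

k_values = [k_opt(l, a, phi_hat) for l, a in zip(lam, alpha_hat)]
k_single = k_min(lam, alpha_hat, phi_hat)
```

Because each $k_j$ is bounded above by $\lambda_j$, the minimum rule yields a value no larger than the smallest eigenvalue.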

3. Simulation Design

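The R programming language (version 3.4.1) was used for the simulation study. Following Algamal [19], the response variable $y_i$ is generated from the gamma distribution with mean $\theta_i$ and dispersion parameter $\phi$, where $\theta_i$ is determined by the linear predictor $x_i'\beta$. The parameter vector, $\beta$, is chosen such that $\beta'\beta = 1$ [1, 23, 24]. Following Kibria [25] and Kibria and Banik [26], the explanatory variables are obtained as follows:
$$x_{ij} = (1-\rho^{2})^{1/2} z_{ij} + \rho\, z_{i,p+1}, \qquad i = 1,\dots,n,\quad j = 1,\dots,p,$$
where the $z_{ij}$ are generated from the standard normal distribution and $\rho^{2}$ is the correlation between the explanatory variables. The values of $\rho$ in this study are chosen to be 0.95, 0.99, and 0.999. We obtained the mean function for p = 4 and 7 explanatory variables, respectively, for the following sample sizes: 20, 50, and 200. Over the replicates, we compute the mean square error (MSE) of the estimators by using the following equation:
$$\operatorname{MSE}(\hat{\beta}^{*}) = \frac{1}{R}\sum_{r=1}^{R}(\hat{\beta}^{*}_{r} - \beta)'(\hat{\beta}^{*}_{r} - \beta),$$
where $R$ is the number of replicates and $\hat{\beta}^{*}_{r}$ is the estimate from the $r$th replicate for any of the following estimators: MLE, GRE, GLE, and GKL. The smaller the mean square error, the better the estimator. The biasing parameters for GRE and GLE are obtained as follows: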
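The data-generating scheme can be sketched as follows. The study itself used R; this Python sketch is only an illustration, and several details are assumptions rather than facts from the text: the design formula $x_{ij} = \sqrt{1-\rho^2}\,z_{ij} + \rho\,z_{i,p+1}$, the log link for the mean, the parametrization of the gamma draw (shape $1/\phi$, scale $\phi\theta_i$, so that the mean is $\theta_i$), and the choice of equal coefficients. All function names are hypothetical.

```python
import math
import random


def make_design(n, p, rho, rng):
    """Correlated predictors: x_ij = sqrt(1 - rho^2) * z_ij + rho * z_{i,p+1}."""
    rows = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(p + 1)]
        rows.append([math.sqrt(1.0 - rho ** 2) * z[j] + rho * z[p] for j in range(p)])
    return rows


def make_response(X, beta, phi, rng):
    """Gamma responses with mean theta_i = exp(x_i' beta) (ASSUMED log link)."""
    y = []
    for row in X:
        theta = math.exp(sum(x * b for x, b in zip(row, beta)))
        # random.gammavariate(shape, scale): mean = shape * scale = theta
        y.append(rng.gammavariate(1.0 / phi, phi * theta))
    return y


def sample_corr(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu_u, mu_v = sum(u) / n, sum(v) / n
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    var_u = sum((a - mu_u) ** 2 for a in u)
    var_v = sum((b - mu_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


rng = random.Random(2021)      # fixed seed for reproducibility
p, rho, phi = 4, 0.95, 0.5
beta = [1.0 / math.sqrt(p)] * p  # equal coefficients with beta'beta = 1
X = make_design(200, p, rho, rng)
y = make_response(X, beta, phi, rng)
```

Under this construction any two columns of `X` have population correlation $\rho^2$, so $\rho = 0.95$ corresponds to a pairwise correlation of about 0.90.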

We examined two shrinkage parameters for the proposed estimator. They are defined as follows:

The simulation results for different values of n, φ, and ρ are presented in Tables 1 and 2 for p = 4 and 7, respectively. For a graphical representation, we also plotted MSE vs n, ρ, φ, and p in Figure 1.

Tables 1 and 2 and Figure 1 show that the MSE increases with the level of multicollinearity when the other factors are held constant. For instance, when n = 50, the MSE of the MLE increases from 1.265 to 38.172 as the level of multicollinearity, ρ, rises from 0.95 to 0.999 for a given φ and p = 4. We also observed that the MSE increases as the number of explanatory variables increases from p = 4 to p = 7, provided the other factors are held constant. For instance, when n = 20 and ρ = 0.99, for a given φ, the MSE of the GRE-k rises from 6.753 to 19.071. Also, when the other factors are fixed, increasing the sample size n decreases the MSE of all the estimators; for example, for p = 7 and ρ = 0.95, the MSE of GLE-d decreases from 1.549 to 1.282 as n increases to 200. Furthermore, the MSE increases as the dispersion parameter increases from 0.5 to 1. As expected, the maximum likelihood estimator performs worst because of the effect of multicollinearity on it. The results in Tables 1 and 2 and Figure 1 show that the GKL estimator outperforms the other estimators. Since the performance of the proposed GKL estimator depends on its biasing parameter, we examined two different biasing parameters and observed that one of the two yields the better performance. The simulation results further support the theoretical finding that the GKL estimator performs best. The GRE and GLE both perform better than the MLE. We further explored the performance of the proposed and existing estimators by analyzing a real-life dataset in Section 4.

4. Real-Life Data: Algamal Data

The chemical dataset adopted in this study was employed in the studies of Algamal [3, 19]. He employed the quantitative structure-activity relationship (QSAR) model to study the relationship between the biological activities of 65 imidazo[4,5-b]pyridine derivatives (anticancer compounds) and 15 molecular descriptors. The QSAR model is widely used in the chemical sciences, the biological sciences, and engineering. The linear regression model is commonly used to model the QSAR relationship between the response variable (biological activity) and one or more physicochemical or structural properties, which serve as the explanatory variables, especially when the response variable is normally distributed [27]. However, the gamma regression model is employed when the response variable is skewed [3, 19, 24, 28]. In this study, following Algamal [3, 19], the variables of interest are described in Table 3.

According to Algamal [3, 19], the response variable, y, follows a gamma distribution. Using the chi-square goodness-of-fit test, the author found that the response variable is well fitted by the gamma distribution, with test statistic (p-value) given as 9.3657 (0.07521). Algamal [19] reported that the correlation coefficients between Mor21v and Mor21e, between SpMax3_Bh(s) and ATS8v, between SpMaxA_D and MW, and between MW and ATS8v are all greater than 0.9, which indicates high correlation. The eigenvalues of $X'\hat{W}X$ are 7.6687E + 8, 1.3238E + 6, 85791, 5523.6, 358.71, 250.51, 148.46, 42.731, 27.239, 18.015, 9.1197, 8.6175, 5.7748, 2.4292, 1.6532, and 0.3659, respectively. Thus, the condition number, CN, is computed as follows:

CN = $\sqrt{\lambda_{\max}/\lambda_{\min}}$ = 45777.7, which indicates the presence of severe multicollinearity [19]. The results of the gamma regression model and the mean square error are presented in Table 4.
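The condition number can be reproduced directly from the eigenvalues listed above. The short sketch below assumes the usual definition, the square root of the ratio of the largest to the smallest eigenvalue; the function name is hypothetical.

```python
import math

# Eigenvalues as reported in the text (rounded as printed there).
eigenvalues = [
    7.6687e8, 1.3238e6, 85791, 5523.6, 358.71, 250.51, 148.46, 42.731,
    27.239, 18.015, 9.1197, 8.6175, 5.7748, 2.4292, 1.6532, 0.3659,
]


def condition_number(eigs):
    """Condition number defined as sqrt(lambda_max / lambda_min)."""
    return math.sqrt(max(eigs) / min(eigs))


cn = condition_number(eigenvalues)
```

The result is about 4.578 × 10^4. The small difference from the reported 45777.7 is what one would expect from the eigenvalues being printed to only a few significant digits.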

The results in Table 4 agree with the simulation results. The MLE performs worst, as it has the highest MSE. The proposed estimator attains the smallest mean square error, ahead of the GRE-k and GLE-d estimators. Recall that the GKL estimator also performed best in the simulation study.

5. Some Concluding Remarks

The Kibria–Lukman [17] estimator was developed to circumvent the problem of multicollinearity in the linear regression model. This estimator belongs to the class of ridge and Liu-type regression estimators, and it has a single biasing parameter. In the gamma regression model, multicollinearity also threatens the performance of the maximum likelihood estimator (MLE) of the regression coefficients. The gamma ridge estimator (GRE) and the gamma Liu estimator (GLE) have been introduced in previous studies to mitigate the problem of multicollinearity. Kibria and Lukman [17] claimed that the KL estimator outperforms the ridge and Liu estimators in the linear regression model, which motivated us to develop the gamma KL (GKL) estimator for effective estimation in the GRM. We derived the statistical properties of the GKL estimator and compared it theoretically with the MLE, GRE, and GLE. Furthermore, a simulation study and a chemical data analysis were conducted in support of the theoretical study. The simulation and application results show that the GKL estimator performed best. In conclusion, the GKL estimator is preferred when multicollinearity exists in the gamma regression model.

Data Availability

The data used to support the findings of this study are available upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.