The New Odd Log-Logistic Generalized Inverse Gaussian Regression Model
We define a new four-parameter model called the odd log-logistic generalized inverse Gaussian distribution which extends the generalized inverse Gaussian and inverse Gaussian distributions. We obtain some structural properties of the new distribution. We construct an extended regression model based on this distribution with two systematic structures, which can provide more realistic fits to real data than other special regression models. We adopt the method of maximum likelihood to estimate the model parameters. In addition, various simulations are performed for different parameter settings and sample sizes to check the accuracy of the maximum likelihood estimators. We provide a diagnostic analysis based on case-deletion and quantile residuals. Finally, the potential of the new regression model to predict the price of urban property is illustrated by means of real data.
The inverse Gaussian (IG) distribution is widely used in several research areas, such as life-time analysis, reliability, meteorology and hydrology, engineering, and medicine. Some extensions of the IG distribution have appeared in the literature. For example, the generalized inverse Gaussian (GIG) distribution with positive support is introduced by Good  in a study of population frequencies. Several papers have investigated the structural properties of the GIG distribution. Sichel  used this distribution to construct mixtures of Poisson distributions. Statistical properties and distributional behavior of the GIG distribution were discussed by Jørgensen  and Atkinson . Dagpunar  provided algorithms for simulating this distribution. Nguyen et al.  showed that it has positive skewness. More recently, Madan et al.  proved that the Black-Scholes formula in finance can be expressed in terms of the GIG distribution function. Koudou  presented a survey about its characterizations and Lemonte and Cordeiro  obtained some mathematical properties of the exponentiated generalized inverse Gaussian (EGIG) distribution.
In this paper, we study a new four-parameter model named the odd log-logistic generalized inverse Gaussian (OLLGIG) distribution which contains as special cases the GIG and IG distributions, among others. Its major advantage is the flexibility in accommodating several forms of the density function, for instance, bimodal and unimodal shapes. It is also suitable for testing goodness-of-fit of some submodels.
Our main objective is to study a new regression model with two systematic structures based on the OLLGIG distribution. We obtain some mathematical properties and discuss maximum likelihood estimation of the parameters. For this model, we present some ways to assess global influence (case-deletion) and develop a residual analysis based on the quantile residual. For different parameter settings and sample sizes, we perform various simulation studies and compare the empirical distribution of the quantile residual with the standard normal distribution. These studies suggest that the empirical distribution of the quantile residual for the OLLGIG regression model with two regression structures agrees closely with the standard normal distribution.
This paper is organized as follows. In Section 2, we define the OLLGIG distribution. In Section 3, we obtain some of its structural properties. We define the OLLGIG regression model in Section 4 and evaluate the performance of the maximum likelihood estimators (MLEs) of the model parameters by means of a simulation study. In Section 5, we adopt the case-deletion diagnostic measure and define quantile residuals for the fitted model. Further, we perform various simulations for these residuals. In Section 6, we provide two applications to real data to illustrate the flexibility of the OLLGIG regression model. Finally, some concluding remarks are offered in Section 7.
2. The OLLGIG Distribution
The GIG distribution  has been applied in several areas of statistical research. The cumulative distribution function (cdf) and probability density function (pdf) of the GIG distribution are given by (for )andwhere is the location parameter, is the scale parameter, is the shape parameter, is the modified Bessel function of the third kind and index , , and .
We denote by a random variable having density function (2). The mean and variance of arerespectively.
The moment generating function (mgf) of reduces toWe use the reparameterized GIG distribution according to GAMLSS in software R. For example, we have Other properties of the GIG distribution are investigated by Jørgensen .
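The statistical literature is filled with hundreds of continuous univariate distributions. Recently, several methods of introducing one or more parameters to generate new distributions have been proposed. Based on the odd log-logistic generator (OLL-G), we define the OLLGIG cdf, say $F(y)$, by integrating the log-logistic density function as follows:
$$F(y)=\int_{0}^{G_{\mu,\sigma,\nu}(y)/[1-G_{\mu,\sigma,\nu}(y)]}\frac{\tau x^{\tau-1}}{(1+x^{\tau})^{2}}\,dx=\frac{G_{\mu,\sigma,\nu}(y)^{\tau}}{G_{\mu,\sigma,\nu}(y)^{\tau}+[1-G_{\mu,\sigma,\nu}(y)]^{\tau}}, \tag{5}$$
where $G_{\mu,\sigma,\nu}(y)$ denotes the GIG cdf (1), $y>0$, $\mu>0$ is a position parameter, $\sigma>0$ is a scale parameter, and $\nu$ and $\tau>0$ are shape parameters. Clearly, $G_{\mu,\sigma,\nu}(y)$ is a special case of (5) when $\tau=1$.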
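Henceforth, we write $G(y)=G_{\mu,\sigma,\nu}(y)$ and $g(y)=g_{\mu,\sigma,\nu}(y)$ for the GIG cdf (1) and pdf (2) to simplify the notation. The OLLGIG density function can be expressed as
$$f(y)=\frac{\tau\,g(y)\,\{G(y)[1-G(y)]\}^{\tau-1}}{\{G(y)^{\tau}+[1-G(y)]^{\tau}\}^{2}}, \tag{6}$$
where $\tau>0$ is the additional shape parameter.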
The main motivations for the OLLGIG distribution are to make its skewness and kurtosis more flexible (compared to the GIG model) and also allow bi-modality. We have $\tau=\log\{F(y)/\bar{F}(y)\}/\log\{G(y)/\bar{G}(y)\}$, where $\bar{F}(y)=1-F(y)$ and $\bar{G}(y)=1-G(y)$. Thus, the parameter $\tau$ represents the quotient of the log odds ratio for the new and baseline distributions. Note that the pdf and cdf of the OLLGIG distribution depend on integrals, which are calculated numerically in the same way as those of the Birnbaum-Saunders distribution.
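Hereafter, we assume that the random variable $Y$ follows the OLLGIG cdf (5) with parameters $\mu$, $\sigma$, $\nu$, and $\tau$, say $Y\sim\text{OLLGIG}(\mu,\sigma,\nu,\tau)$. The OLLGIG distribution contains as special cases the GIG distribution when $\tau=1$ and the IG distribution when $\tau=1$ and $\nu=-1/2$.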
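The OLL-G construction in (5) and (6) can be sketched numerically for any baseline cdf. The Python sketch below is only an illustration: the helper names `oll_cdf` and `oll_pdf` are hypothetical, and the baseline is SciPy's `geninvgauss`, whose parameters `(p, b, scale)` follow SciPy's own convention and are not the $(\mu,\sigma,\nu)$ parameterization of GAMLSS used in the paper.

```python
import numpy as np
from scipy import stats

def oll_cdf(y, tau, base):
    """OLL-G cdf: G^tau / (G^tau + (1 - G)^tau) for a baseline cdf G."""
    G = base.cdf(y)
    return G**tau / (G**tau + (1.0 - G)**tau)

def oll_pdf(y, tau, base):
    """OLL-G pdf: tau * g * [G(1 - G)]^(tau - 1) / (G^tau + (1 - G)^tau)^2."""
    G = base.cdf(y)
    g = base.pdf(y)
    return tau * g * (G * (1.0 - G))**(tau - 1.0) / (G**tau + (1.0 - G)**tau)**2

# Assumed baseline: SciPy's GIG in its own (p, b, scale) parameterization.
base = stats.geninvgauss(p=-0.5, b=2.0, scale=1.5)
y = np.linspace(0.2, 5.0, 6)

print(oll_cdf(y, 0.4, base))   # cdf values for an illustrative tau = 0.4
print(oll_pdf(y, 0.4, base))   # matching density values
```

Setting `tau = 1.0` returns the baseline cdf and pdf unchanged, which mirrors the statement that the GIG distribution is the special case $\tau=1$.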
Some plots of the OLLGIG density for selected parameter values are displayed in Figure 1. It is evident that the proposed distribution is much more flexible, especially in relation to bi-modality (for ), than the GIG and IG distributions.
Equation (5) has tractable properties, especially for simulations, since its quantile function (qf) takes the simple form
$$y=Q_{GIG}\!\left(\frac{u^{1/\tau}}{u^{1/\tau}+(1-u)^{1/\tau}}\right), \tag{7}$$
where $Q_{GIG}(\cdot)$ is the qf of the GIG distribution, $u\in(0,1)$, and $\tau$ is the additional shape parameter of the OLLGIG model. This scheme is useful because of the existence of fast generators for GIG random variables in some statistical packages. For example, we can fit the generalized additive models for the location, scale, and shape (GAMLSS) in R.
We use the GAMLSS package to simulate data from this nonlinear equation. The plots comparing the exact OLLGIG densities and the histograms from two simulated data sets with replications for selected parameter values are displayed in Figure 2. These plots (and several others not shown here) indicate that the simulated values are consistent with the OLLGIG distribution.
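The inverse-transform scheme behind the qf can be sketched as follows. This is a minimal Python illustration, not the GAMLSS code used by the authors; `oll_ppf` and `oll_rvs` are hypothetical helper names and the baseline again uses SciPy's `(p, b, scale)` parameterization of the GIG law, which is an assumption of this sketch.

```python
import numpy as np
from scipy import stats

def oll_ppf(u, tau, base):
    """OLL-G quantile function: baseline qf applied to a transformed u."""
    u = np.asarray(u, dtype=float)
    a = u**(1.0 / tau)
    b = (1.0 - u)**(1.0 / tau)
    return base.ppf(a / (a + b))

def oll_rvs(n, tau, base, rng):
    """Draw n values by plugging uniform variates into the quantile function."""
    return oll_ppf(rng.uniform(size=n), tau, base)

# Assumed baseline and illustrative tau; the values are not from the paper.
base = stats.geninvgauss(p=-0.5, b=2.0, scale=1.5)
rng = np.random.default_rng(1)
sample = oll_rvs(20, 0.4, base, rng)
print(sample.round(3))
```

The sample size is kept small because SciPy inverts the GIG cdf numerically, which is slow for large arrays.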
3. Properties of the OLLGIG Model
3.1. Linear Representation
By defining the sets for , and following the results of Lemonte and Cordeiro [9, Section 3], we can expand aswhere , , andTo calculate , the index can stop after a large number of summands.
Further, we can rewrite after some algebra aswhere (for ) and is the descending factorial and .
We obtain an expansion for in (5). First, we use a power series for ( real)whereFor any real , we consider the generalized binomial expansionInserting (11) and (13) in (5) giveswhere (for ). The ratio of the two power series in the last equation can be reduced towhere the coefficients ’s (for ) are determined from the recurrence equationBy differentiating (15), the pdf reduces towhere is the exponentiated generalized inverse Gaussian (EGIG) density function with power parameter (for ).
We can derive a linear representation for in terms of GIG densities based on the previous results and following the expansions of Lemonte and Cordeiro  that lead to their (24). First, we can express asHere, represents the density function and the coefficients are given by , where and the quantities are determined from the recurrence relation (for ) and with ’s given in (10).
Equation (19) reveals that the OLLGIG density function is an infinite linear combination of GIG densities.
3.2. Two Properties
Equation (19) becomes useful in deriving several mathematical properties of the proposed distribution using well-known properties of the GIG distribution. We provide only two examples. The th moment about zero of the random variable defined by (2) is
Then, the ordinary moments of the OLLGIG random variable follow from (19) aswhere .
4. The OLLGIG Regression Model
In many practical applications, the lifetimes are affected by explanatory variables such as sex, smoking, diet, blood pressure, cholesterol level and several others. So, it is important to explore the relationship between the response variable and the explanatory variables. Regression models can be proposed in different forms in statistical analysis. In this section, we define the OLLGIG regression model with two systematic structures based on the new distribution. It is a feasible alternative to the GIG and IG regression models for data analysis.
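Regression analysis involves specifying the distribution of $Y$ given a vector $\mathbf{x}$ of covariates. We relate the parameters $\mu$ and $\sigma$ to the covariates by the logarithm link functions
$$\mu_i=\exp(\mathbf{x}_i^{\top}\boldsymbol{\beta}_1)\quad\text{and}\quad\sigma_i=\exp(\mathbf{x}_i^{\top}\boldsymbol{\beta}_2),\quad i=1,\ldots,n, \tag{23}$$
respectively, where $\boldsymbol{\beta}_1$ and $\boldsymbol{\beta}_2$ denote the vectors of regression coefficients and $\mathbf{x}_i$ is the covariate vector of the $i$th observation. The most important case of the parametric regression model is the one in which the covariates model both $\mu$ and $\sigma$.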
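Consider a sample $(y_1,\mathbf{x}_1),\ldots,(y_n,\mathbf{x}_n)$ of $n$ independent observations. Conventional likelihood estimation techniques can be applied here. The total log-likelihood function for the vector of parameters $\boldsymbol{\theta}=(\boldsymbol{\beta}_1^{\top},\boldsymbol{\beta}_2^{\top},\nu,\tau)^{\top}$ from model (23) is given by
$$l(\boldsymbol{\theta})=n\log\tau+\sum_{i=1}^{n}\log g(y_i)+(\tau-1)\sum_{i=1}^{n}\log\{G(y_i)[1-G(y_i)]\}-2\sum_{i=1}^{n}\log\{G(y_i)^{\tau}+[1-G(y_i)]^{\tau}\}, \tag{24}$$
where $g(\cdot)$ and $G(\cdot)$ are the GIG pdf and cdf defined in Section 2, evaluated at $(\mu_i,\sigma_i,\nu)$. The MLE $\hat{\boldsymbol{\theta}}$ of $\boldsymbol{\theta}$ can be calculated by maximizing the log-likelihood (24) numerically in the GAMLSS package of the R software. The advantage of this package is that we can adopt many maximization methods, which will depend only on the current fitted model. Initial values for $\boldsymbol{\beta}_1$ and $\boldsymbol{\beta}_2$ are taken from the fit of the GIG regression model with $\tau=1$. We did not encounter problems in maximizing this log-likelihood function. This fact is shown in Section 4.1, where some simulations of the proposed regression model are given under different scenarios.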
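A numerical maximization of an OLL-G log-likelihood with two log-link structures can be sketched in Python. This is not the GAMLSS fit used in the paper. To keep the sketch fast, the baseline is swapped for the inverse Gaussian law in mean/dispersion form (mean $\mu$, variance $\sigma^2\mu^3$), whose cdf is closed form in `scipy.stats.invgauss`; a GIG baseline would add the parameter $\nu$. The names `ig_base` and `negloglik`, the simulated data, and the starting values are all assumptions made for illustration.

```python
import numpy as np
from scipy import optimize, stats

def ig_base(mu, sigma):
    """IG with mean mu and variance sigma^2 * mu^3, mapped to scipy's invgauss."""
    return stats.invgauss(mu * sigma**2, scale=1.0 / sigma**2)

def negloglik(theta, y, X):
    """Negative OLL-G log-likelihood with log links for mu and sigma."""
    p = X.shape[1]
    beta1, beta2, tau = theta[:p], theta[p:2 * p], np.exp(theta[-1])
    with np.errstate(all="ignore"):
        base = ig_base(np.exp(X @ beta1), np.exp(X @ beta2))
        G = np.clip(base.cdf(y), 1e-12, 1.0 - 1e-12)  # guard the logarithms
        ll = (np.log(tau) + base.logpdf(y)
              + (tau - 1.0) * (np.log(G) + np.log1p(-G))
              - 2.0 * np.log(G**tau + (1.0 - G)**tau))
    total = -np.sum(ll)
    return total if np.isfinite(total) else 1e10  # penalize invalid regions

# Simulated data with made-up coefficients (not the paper's scenarios).
rng = np.random.default_rng(0)
n = 150
X = np.column_stack([np.ones(n), rng.uniform(size=n)])
true_theta = np.array([1.0, 0.5, -1.0, 0.3, np.log(0.8)])
tau = np.exp(true_theta[-1])
u = rng.uniform(size=n)
a, b = u**(1.0 / tau), (1.0 - u)**(1.0 / tau)
y = ig_base(np.exp(X @ true_theta[:2]), np.exp(X @ true_theta[2:4])).ppf(a / (a + b))

start = np.zeros(5)  # beta1 = beta2 = 0 and log(tau) = 0, i.e. tau = 1
fit = optimize.minimize(negloglik, start, args=(y, X), method="Nelder-Mead",
                        options={"maxiter": 4000, "xatol": 1e-6, "fatol": 1e-6})
print(fit.x, -fit.fun)  # estimates (last entry is log tau) and maximized log-likelihood
```

Optimizing over $\log\tau$ keeps $\tau$ positive without constraints; a derivative-free method is used here only because it tolerates the penalized regions.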
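Under general regularity conditions, the asymptotic distribution of $(\hat{\boldsymbol{\theta}}-\boldsymbol{\theta})$ is multivariate normal $N(\mathbf{0},K(\boldsymbol{\theta})^{-1})$, where $K(\boldsymbol{\theta})$ is the expected information matrix. The asymptotic covariance matrix $K(\boldsymbol{\theta})^{-1}$ of $\hat{\boldsymbol{\theta}}$ can be approximated by the inverse of the observed information matrix $-\ddot{L}(\hat{\boldsymbol{\theta}})$. The elements of this matrix are calculated numerically. The approximate multivariate normal distribution for $\hat{\boldsymbol{\theta}}$ can be used in the classical way to construct approximate confidence intervals for the parameters in $\boldsymbol{\theta}$.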
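We can use the likelihood ratio (LR) statistic for comparing some special submodels with the OLLGIG regression model. We consider the partition $\boldsymbol{\theta}=(\boldsymbol{\theta}_1^{\top},\boldsymbol{\theta}_2^{\top})^{\top}$, where $\boldsymbol{\theta}_1$ is a subset of parameters of interest and $\boldsymbol{\theta}_2$ is a subset of remaining parameters. The LR statistic for testing the null hypothesis $H_0:\boldsymbol{\theta}_1=\boldsymbol{\theta}_1^{(0)}$ versus the alternative hypothesis $H_1:\boldsymbol{\theta}_1\neq\boldsymbol{\theta}_1^{(0)}$ is given by $w=2\{l(\hat{\boldsymbol{\theta}})-l(\tilde{\boldsymbol{\theta}})\}$, where $\tilde{\boldsymbol{\theta}}$ and $\hat{\boldsymbol{\theta}}$ are the estimates under the null and alternative hypotheses, respectively. The statistic $w$ is asymptotically (as $n\to\infty$) distributed as $\chi^2_k$, where $k$ is the dimension of the subset of parameters of interest. For example, the test of $H_0:\tau=1$ versus $H_1:\tau\neq 1$ is equivalent to comparing the OLLGIG regression model with the GIG regression model and the LR statistic reduces to $w=2\{l(\hat{\boldsymbol{\beta}}_1,\hat{\boldsymbol{\beta}}_2,\hat{\nu},\hat{\tau})-l(\tilde{\boldsymbol{\beta}}_1,\tilde{\boldsymbol{\beta}}_2,\tilde{\nu},1)\}$, where $\hat{\boldsymbol{\beta}}_1$, $\hat{\boldsymbol{\beta}}_2$, $\hat{\nu}$, and $\hat{\tau}$ are the MLEs under $H_1$ and $\tilde{\boldsymbol{\beta}}_1$, $\tilde{\boldsymbol{\beta}}_2$, and $\tilde{\nu}$ are the estimates under $H_0$.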
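The LR comparison of nested fits reduces to a few lines once the two maximized log-likelihoods are available. The Python sketch below uses a hypothetical helper `lr_test` and made-up log-likelihood values; it only illustrates the chi-square reference distribution described above.

```python
from scipy import stats

def lr_test(loglik_full, loglik_null, df):
    """LR statistic w = 2 * (l_full - l_null) and its chi-square p-value."""
    w = 2.0 * (loglik_full - loglik_null)
    return w, stats.chi2.sf(w, df)

# Made-up maximized log-likelihoods for a full model and a submodel
# that fixes one parameter (df = 1), e.g. tau = 1.
w, p_value = lr_test(loglik_full=-100.0, loglik_null=-103.5, df=1)
print(w, p_value)
```

A small p-value favors the wider model, which is how the test of $\tau=1$ is read in the applications.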
4.1. Simulation Study
We approach the simulation study in two ways. First, we perform a simulation to study the behavior of the MLEs of the parameters of the OLLGIG distribution without systematic structures. Second, we evaluate the behavior of the parameter estimates considering two systematic structures.
The OLLGIG Distribution. Some properties of the MLEs are evaluated using a classical analysis by means of a simulation study. We simulate the OLLGIG distribution as follows:
(i) Compute the inverse function $Q_{GIG}(\cdot)$ from the cumulative distribution (1).
(ii) Generate $u\sim U(0,1)$.
(iii) Apply $u$ in $Q_{GIG}(\cdot)$ from (7).
(iv) The values $y=Q_{GIG}\big(u^{1/\tau}/[u^{1/\tau}+(1-u)^{1/\tau}]\big)$ are generated from the OLLGIG distribution, where $Q_{GIG}(\cdot)$ is the inverse of (1).
We take  and 350 for each replication and then evaluate the estimates $\hat{\mu}$, $\hat{\sigma}$, $\hat{\nu}$, and $\hat{\tau}$. We repeat this process  times and then calculate the average estimates (AEs), biases, and mean squared errors (MSEs). In the first scenario, we take , , , and . We use the values fitted in the adjustment to the iris data set in Section 6. The estimates of the model parameters are computed using the GAMLSS package of the R software. The results of the Monte Carlo study under maximum likelihood are given in Table 1. They indicate that the MLEs are accurate. Further, the MSEs of the MLEs of the model parameters decay toward zero when the sample size increases, in agreement with first-order asymptotic theory.
The OLLGIG Regression Model. We examine the performance of the MLEs in the OLLGIG regression model by means of some simulations with sample sizes and 500. We simulate samples from two scenarios ( and ) by considering and . For both cases, we take . The explanatory variable is generated by and the response variable is generated by . For each fitted model, we compute the AEs, biases, and MSEs. Based on the results given in Table 2, we note that the MSEs of the MLEs of , , , , and decay toward zero when the sample size increases, as usually expected under first-order asymptotic theory. Further, the AEs of the parameters tend to be closer to the true parameter values when increases. These facts support that the asymptotic normal distribution provides an adequate approximation to the finite sample distribution of the estimates.
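The summaries reported in such Monte Carlo studies (AE, bias, and MSE) can be computed as sketched below. The helper `mc_summary` is a hypothetical name, and the toy estimator (the sample mean of exponential data, which is the MLE of the exponential scale) merely stands in for the OLLGIG fit to keep the example fast.

```python
import numpy as np

def mc_summary(estimates, true_value):
    """Average estimate, bias, and mean squared error over replications."""
    est = np.asarray(estimates, dtype=float)
    ae = est.mean()
    return ae, ae - true_value, np.mean((est - true_value)**2)

# Toy stand-in: MLE of an exponential scale, 500 replications per sample size.
rng = np.random.default_rng(42)
true_scale = 2.0
for n in (20, 200):
    est = [rng.exponential(true_scale, size=n).mean() for _ in range(500)]
    ae, bias, mse = mc_summary(est, true_scale)
    print(n, round(ae, 3), round(bias, 3), round(mse, 4))
```

The MSE shrinking with the sample size in this toy run is the same qualitative pattern the tables are meant to show for the OLLGIG estimators.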
5. Checking Model: Diagnostic and Residual Analysis
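A first tool to perform sensitivity analysis, as stated before, is by means of global influence starting from case-deletion [11, 12]. Case-deletion is a common approach to study the effect of dropping the $i$th observation from the data set. The case-deletion model with systematic structures (23) is given by
$$\mu_l=\exp(\mathbf{x}_l^{\top}\boldsymbol{\beta}_1)\quad\text{and}\quad\sigma_l=\exp(\mathbf{x}_l^{\top}\boldsymbol{\beta}_2),\quad l=1,\ldots,n,\ l\neq i. \tag{25}$$
In the following, a quantity with subscript "(i)" means the original quantity with the $i$th observation deleted. For model (25), the log-likelihood function of $\boldsymbol{\theta}$ is denoted by $l_{(i)}(\boldsymbol{\theta})$. Let $\hat{\boldsymbol{\theta}}_{(i)}$ be the MLE of $\boldsymbol{\theta}$ from $l_{(i)}(\boldsymbol{\theta})$. To assess the influence of the $i$th observation on the MLEs $\hat{\boldsymbol{\theta}}$, we can compare the difference between $\hat{\boldsymbol{\theta}}_{(i)}$ and $\hat{\boldsymbol{\theta}}$. If deletion of an observation seriously influences the estimates, more attention should be paid to that observation. Hence, if $\hat{\boldsymbol{\theta}}_{(i)}$ is far from $\hat{\boldsymbol{\theta}}$, then the $i$th observation can be regarded as influential. A first measure of the global influence is defined as the standardized norm of $\hat{\boldsymbol{\theta}}_{(i)}-\hat{\boldsymbol{\theta}}$ (generalized Cook distance) given by
$$GD_i(\boldsymbol{\theta})=(\hat{\boldsymbol{\theta}}_{(i)}-\hat{\boldsymbol{\theta}})^{\top}[-\ddot{L}(\hat{\boldsymbol{\theta}})](\hat{\boldsymbol{\theta}}_{(i)}-\hat{\boldsymbol{\theta}}),$$
where $-\ddot{L}(\hat{\boldsymbol{\theta}})$ is the observed information matrix.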
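Another alternative is to assess the values of $GD_i(\boldsymbol{\beta}_1)$, $GD_i(\boldsymbol{\beta}_2)$, and $GD_i(\nu,\tau)$ since these values reveal the impact of the $i$th observation on the estimates of $\boldsymbol{\beta}_1$, $\boldsymbol{\beta}_2$, and $(\nu,\tau)$, respectively. Another popular measure of the difference between $\hat{\boldsymbol{\theta}}_{(i)}$ and $\hat{\boldsymbol{\theta}}$ is the likelihood distance given by
$$LD_i(\boldsymbol{\theta})=2\{l(\hat{\boldsymbol{\theta}})-l(\hat{\boldsymbol{\theta}}_{(i)})\}.$$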
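Both case-deletion measures can be computed by refitting the model with each observation removed. The Python sketch below is generic: `case_deletion` and its callback arguments are hypothetical names, and the demonstration uses a deliberately simple stand-in model (normal data with unknown mean and unit variance) rather than the OLLGIG regression, so that the refits are exact and instantaneous.

```python
import numpy as np

def case_deletion(y, fit, loglik, obs_info):
    """Generalized Cook distance and likelihood distance for each case."""
    theta_hat = fit(y)
    info = obs_info(theta_hat, y)       # observed information at the full-data MLE
    l_full = loglik(theta_hat, y)
    gd = np.empty(len(y))
    ld = np.empty(len(y))
    for i in range(len(y)):
        theta_i = fit(np.delete(y, i))  # MLE without the i-th observation
        d = theta_i - theta_hat
        gd[i] = d @ info @ d
        ld[i] = 2.0 * (l_full - loglik(theta_i, y))
    return gd, ld

# Stand-in model: y ~ N(theta, 1), so the MLE is the sample mean.
def fit_mean(y):
    return np.array([y.mean()])

def loglik_normal(theta, y):
    return -0.5 * np.sum((y - theta[0])**2)  # additive constant omitted

def info_normal(theta, y):
    return np.array([[float(len(y))]])

y = np.array([1.0, 1.2, 0.8, 1.1, 0.9, 6.0])
gd, ld = case_deletion(y, fit_mean, loglik_normal, info_normal)
print(int(np.argmax(gd)), int(np.argmax(ld)))  # index of the most influential case
```

For a model fitted numerically, `fit` would rerun the optimizer on the reduced data and `obs_info` would return a numerically evaluated Hessian.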
Once the model is chosen and fitted, the analysis of the residuals is an efficient way to check the model adequacy. The residuals also serve to identify the relevance of an additional factor omitted from the model and verify if there are indications of serious deviance from the distribution considered for the random error. Further, since the residuals are used to identify discrepancies between the fitted model and the data set, it is convenient to define residuals that take into account the contribution of each observation to the goodness-of-fit measure.
In summary, the residuals allow measuring the model fit for each observation and enable studying whether the differences between the observed and fitted values are due to chance or to a systematic behavior that can be modeled. The quantile residuals (qrs) for the OLLGIG regression model with two systematic structures are defined by
$$qr_i=\Phi^{-1}\{F(y_i;\hat{\mu}_i,\hat{\sigma}_i,\hat{\nu},\hat{\tau})\},$$
where $F(\cdot)$ is the OLLGIG cdf (5) and $\Phi^{-1}(\cdot)$ is the inverse cumulative standard normal distribution.
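Quantile residuals follow directly from the fitted cdf. The Python sketch below is illustrative only: `quantile_residuals` is a hypothetical helper, the parameter values are made up rather than fitted, and the baseline is SciPy's `geninvgauss` in its own `(p, b, scale)` parameterization. In a regression setting the baseline parameters would be arrays with one entry per observation.

```python
import numpy as np
from scipy import stats

def quantile_residuals(y, tau, base):
    """Standard normal quantiles of the OLL-G cdf evaluated at the data."""
    G = base.cdf(y)
    F = G**tau / (G**tau + (1.0 - G)**tau)
    return stats.norm.ppf(F)

# Assumed baseline and illustrative tau (not estimates from the paper).
base = stats.geninvgauss(p=-0.5, b=2.0, scale=1.5)
y = np.array([0.5, 1.0, 2.0, 4.0])
print(quantile_residuals(y, 0.4, base))
```

If the model is adequate, these values should behave like a standard normal sample, which is what the normal probability plots check.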
Atkinson suggested the construction of an envelope to have a better interpretation of the normal probability plot of the residuals. The simulated confidence bands of the envelope should contain the residuals. If the model is well-fitted, the majority of points will be within these bands and randomly distributed. The construction of the confidence bands follows the steps:
(i) Fit the proposed model and calculate the residuals $qr_i$'s;
(ii) Simulate $k$ samples of the response variable using the fitted model;
(iii) Fit the model to each sample and calculate the residuals $qr_{ij}$ ($i=1,\ldots,n$ and $j=1,\ldots,k$);
(iv) Arrange each group of $n$ residuals in rising order to obtain $qr_{(i)j}$ for $i=1,\ldots,n$ and $j=1,\ldots,k$;
(v) For each $i$, calculate the mean, minimum, and maximum of $qr_{(i)j}$, namely,
$$qr_{(i)M}=\frac{1}{k}\sum_{j=1}^{k}qr_{(i)j},\quad qr_{(i)B}=\min\{qr_{(i)j}:1\leq j\leq k\},\quad qr_{(i)H}=\max\{qr_{(i)j}:1\leq j\leq k\};$$
(vi) Include the means, minimum, and maximum together with the values of $qr_i$ against the expected percentiles of the standard normal distribution.
The minimum and maximum values of the ordered simulated residuals form the envelope. If the model under study is correct, the observed values should be inside the bands and distributed randomly.
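The envelope construction above can be sketched generically in Python. The function `simulated_envelope` and its callback arguments are hypothetical names; the demonstration uses an exponential model as a fast stand-in for the OLLGIG regression, the number of simulated samples (19) is an illustrative choice, and the plotting positions $(i-3/8)/(n+1/4)$ are one common convention assumed here.

```python
import numpy as np
from scipy import stats

def simulated_envelope(y, fit, simulate, residuals, k, rng):
    """Ordered residuals with min/mean/max bands from k simulated refits."""
    n = len(y)
    theta = fit(y)
    observed = np.sort(residuals(y, theta))
    sims = np.empty((k, n))
    for j in range(k):
        y_sim = simulate(theta, n, rng)                      # sample from fitted model
        sims[j] = np.sort(residuals(y_sim, fit(y_sim)))      # refit, then order
    expected = stats.norm.ppf((np.arange(1, n + 1) - 0.375) / (n + 0.25))
    return expected, observed, sims.min(axis=0), sims.mean(axis=0), sims.max(axis=0)

# Stand-in model: exponential with scale theta (MLE is the sample mean).
def fit_exp(y):
    return y.mean()

def simulate_exp(theta, n, rng):
    return rng.exponential(theta, size=n)

def qres_exp(y, theta):
    return stats.norm.ppf(stats.expon.cdf(y, scale=theta))

rng = np.random.default_rng(7)
y = rng.exponential(2.0, size=40)
expected, observed, lower, middle, upper = simulated_envelope(
    y, fit_exp, simulate_exp, qres_exp, k=19, rng=rng)
inside = np.mean((observed >= lower) & (observed <= upper))
print(f"share of residuals inside the envelope: {inside:.2f}")
```

Plotting `observed`, `lower`, `middle`, and `upper` against `expected` gives the normal plot with envelope described in step (vi).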
Simulation Study. A simulation study is conducted to investigate the behavior of the empirical distribution of the qrs for the OLLGIG regression model. We generate samples based on the algorithm presented in Section 4.1. We also give the normal probability plots to assess the degree of deviation from the normality assumption of the residuals. Based on the plots in Figures 3 and 4 representing the first and second scenarios, respectively, we conclude that the empirical distribution of the qrs agrees with the standard normal distribution in both scenarios. This empirical distribution becomes closer to the standard normal distribution when the sample size increases in both scenarios.
6. Applications
In this section, we provide two applications to real data to demonstrate empirically the flexibility of the OLLGIG model. The calculations are performed with the R software.
6.1. Application 1: Iris Data
In the first application, the OLLGIG distribution is compared with the nested GIG and IG distributions. The data set is iris, which provides measurements in centimeters of the length and width of the sepal and the length and width of the petal for 50 flowers of each of 3 iris species (setosa, versicolor, and virginica). In this application, the variable sepal length (Sepal.Length) is used. This data set has been analyzed by several authors in multivariate analysis, for example, Anderson (1935) and Fisher. We show that the distribution for these data presents bimodality.
Table 3 provides a descriptive summary for these data and indicates positively distorted distributions with varying degrees of variability, skewness, and kurtosis.
A brief descriptive analysis of the data in Table 3 reveals that the average score of the variable sepal length is  and the median value is , thus indicating that the data have a symmetric distribution.
In Table 4, we report the MLEs of the model parameters and their standard errors (SEs) in parentheses. We give in Table 5 the following goodness-of-fit measures: Akaike Information Criterion (AIC), Consistent Akaike Information Criterion (CAIC), Bayesian Information Criterion (BIC), Hannan-Quinn Information Criterion (HQIC), Cramér-von Mises ($W^{*}$), Anderson-Darling ($A^{*}$), and Kolmogorov-Smirnov (KS) test statistics. The smaller the values of these measures, the better the fit. The figures in Table 5 indicate that the OLLGIG distribution has the lowest values of AIC, CAIC, BIC, HQIC, $W^{*}$, $A^{*}$, and KS among those of the fitted models and therefore it could be chosen as the best model.
We consider LR statistics to compare nested models. The OLLGIG distribution includes some submodels as mentioned above, thus allowing their evaluations relative to the others and to a more general model. The values of the LR statistics are listed in Table 6. It is evident from the figures in this table that the OLLGIG distribution outperforms its submodels according to the values of the LR statistics. So, the OLLGIG model provides a better fit to these data than its submodels.
More information is provided by a visual comparison of the histogram of the data and the fitted density functions and cumulative functions. The plots of the fitted OLLGIG, GIG, and IG densities are displayed in Figure 5(a). The estimated OLLGIG density provides the closest fit to the histogram of the data. In order to assess if the model is appropriate, the plots of the fitted OLLGIG, GIG, and IG cumulative distributions and the empirical cdf are displayed in Figure 5(b). They indicate that the OLLGIG distribution provides a good fit to these data.
6.2. Application 2: Price of Urban Property Data
Here, we provide a second application of the OLLGIG regression model to evaluate the price of urban residential properties for sale in the municipality of Paranaíba in the State of Mato Grosso do Sul (MS) in Brazil. These data, collected in 2017, refer to houses for sale in the municipality. In the context of real estate appraisal, it is necessary to develop statistical methodologies (characterized by scientific accuracy) for residential property prices. Besides this aspect, such methodologies are rarely used by the real estate market. We construct an OLLGIG regression model with two systematic components to describe the relationship between real estate prices and other explanatory variables, thus allowing an understanding of the behavior of the price variable [16, 17]. The following variables are considered:
(i) price of the property ; this variable was divided by ;
(ii) area of land in square meters;
(iii) number of parking spaces in the residence (0=no parking space, 1=one parking space, and 2=more than one parking space); in this case, two dummy variables, and , are created;
(iv) number of rooms with suites in the residence (0=no suites, 1=one suite, 2=more than one suite); in this case two dummy variables, and , are created;
(v) if the residence has a swimming pool (0=no, 1=yes);
(vi) if the residence is located in the center of the city (0=no, 1=yes).
In the descriptive analysis of the data from Table 7, the mean score of the variable value is , which is not close to the median value , thus indicating that the data has an asymmetric distribution.
We define the OLLGIG regression model by two systematic structures for and and
We now consider the test of homogeneity of the scale parameter for the price of urban property data. The LR statistic (see Section 4) for testing the null hypothesis is , which indicates that the dispersion is not constant.
In Table 8, we present the MLEs, SEs, and p-values. The covariates , , and are significant at the level in the regression structure for the location parameter , whereas the covariates , , , and are significant (at the same level) for the parameter . The figures in this table reveal that the covariate is not significant with respect to the parameter , but it is significant with respect to the parameter . This is due to a strong dispersion in the response variable. The covariate is also significant for the number of parking spaces in the structure of . The covariate is significant in the location and scale structure; i.e., there is a significant difference between the residences that do not have a suite, have a suite, or more. The covariate is not significant in relation to the location, but it is significant in the structure of . There is a significant difference in the residence with or without swimming pool for the dispersion parameter. This fact can also be noted in Figure 7(a). The covariate is significant in relation to both parameters and ; i.e., there is a significant difference between the residences being in the center of the city and outside the center. This fact can also be noted in Figure 7(b).
The AIC, BIC, and global deviance (GD) statistics are listed in Table 9. We note that the OLLGIG regression model presents the lowest AIC, BIC, and GD values among the other fitted models. So, there are indications that the OLLGIG model provides a better fit to these data.
We adopt again the LR statistics to compare the fitted models in Table 10. We reject the null hypotheses in the two tests in favor of the wider OLLGIG regression model. Rejection is significant at the level and provides clear evidence of the need of the shape parameter when modeling real data.
We use the R software to compute the and measures in the diagnostic analysis presented in Section 5. The results of such influence measures index plots are displayed in Figure 8. These plots indicate that the cases , , and are possible influential observations.
In addition, Figure 9(a) provides plots of the qrs for the fitted model, showing that all observations are in the interval  and that the residuals behave randomly. Hence, there is no evidence against the current assumptions of the fitted model. In order to detect possible departures from the error distribution in the model, as well as outliers, we present the normal plot for the qrs with a generated envelope in Figure 9(b). This plot reveals that the OLLGIG regression model is very suitable for these data, since there are no observations falling outside the envelope. Also, no observation appears as a possible outlier.
7. Concluding Remarks
We present a four-parameter distribution called the odd log-logistic generalized inverse Gaussian (OLLGIG) distribution, which includes as special cases the generalized inverse Gaussian (GIG) and inverse Gaussian (IG) distributions. We provide some of its mathematical properties. Further, we define the OLLGIG regression model with two systematic structures based on this new distribution, which is very suitable for modeling censored and uncensored data. The proposed model serves as an important extension to several existing regression models and could be a valuable addition to the literature. Some simulations are performed for different parameter settings and sample sizes. The maximum likelihood method is described for estimating the model parameters. Diagnostic analysis is presented to assess global influences. We also discuss the sensitivity of the maximum likelihood estimates from the fitted model via quantile residuals. The utility of the proposed OLLGIG regression model is demonstrated by means of a real data set for price data of urban residential properties in the municipality of Paranaíba in the State of Mato Grosso do Sul, Brazil.
The data used to support the findings of this study were supplied by Universidade Federal de Mato Grosso do Sul under license and so cannot be made freely available.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
This work was supported by Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Brazil.
B. Jørgensen, Statistical Properties of the Generalized Inverse Gaussian distribution, Springer, New York, NY, USA, 1982.
R. D. Cook and S. Weisberg, Residuals and influence in regression, Monographs on Statistics and Applied Probability, Chapman & Hall, NY, USA, 1982.
A. C. Atkinson, Plots, transformations and regression; an introduction to graphical methods of diagnostic regression analysis, Clarendon Press Oxford, Oxford, UK, 1985.
J. W. M. Bertrand and J. C. Fransoo, “Modelling and simulation: Operations management research methodologies using quantitative modeling,” International Journal of Operations and Production Management, vol. 22, pp. 241–264, 2002.
E. G. Araújo, J. C. Pereira, F. Ximenes, C. P. Spanhol, S. Garson, and E. G. Araújo, “Proposta de uma metodologia para a avaliação do preço de venda de imóveis residenciais em Bonito/MS baseado em modelos de regressão linear múltipla,” Pesquisa e Desenvolvimento Engenharia de Produção, vol. 10, pp. 195–207, 2012.