Abstract and Applied Analysis


Research Article | Open Access

Volume 2020 |Article ID 8353481 | https://doi.org/10.1155/2020/8353481

M. Fathurahman, Purhadi, Sutikno, Vita Ratnasari, "Geographically Weighted Multivariate Logistic Regression Model and Its Application", Abstract and Applied Analysis, vol. 2020, Article ID 8353481, 10 pages, 2020. https://doi.org/10.1155/2020/8353481

Geographically Weighted Multivariate Logistic Regression Model and Its Application

Academic Editor: Alberto Fiorenza
Received: 30 Mar 2020
Revised: 22 May 2020
Accepted: 15 Jun 2020
Published: 01 Aug 2020

Abstract

This study investigates the geographically weighted multivariate logistic regression (GWMLR) model, parameter estimation, and hypothesis testing procedures. The GWMLR model is an extension to the multivariate logistic regression (MLR) model, which has dependent variables that follow a multinomial distribution along with parameters associated with the spatial weighting at each location in the study area. The parameter estimation was done using the maximum likelihood estimation and Newton-Raphson methods, and the maximum likelihood ratio test was used for hypothesis testing of the parameters. The performance of the GWMLR model was evaluated using a real dataset and it was found to perform better than the MLR model.

1. Introduction

Over the past decade, most research on geographically weighted regression (GWR) models has been focused on applications that contain two or more correlated responses (multivariate). Harini et al. [1, 2] introduced the multivariate GWR (MGWR) model and demonstrated the parameter estimation and hypothesis test procedures using the restricted maximum likelihood estimation (RMLE) and maximum likelihood ratio test (MLRT) methods, respectively. The form and properties of the estimated errors variance-covariance parameters of the MGWR model using the MLE and weighted least squares methods were investigated [3]. Triyanto et al. [4, 5] introduced the geographically weighted multivariate Poisson regression (GWMPR) model. The estimator of the GWMPR model parameters was obtained through the MLE with the Newton-Raphson iterative method, and the test statistic for hypothesis tests was determined by the MLRT method. Suyitno et al. [6] discussed the estimation of the geographically weighted trivariate Weibull regression (GWTWR) model using the MLE and Newton-Raphson methods. The geographically weighted multivariate t regression (GWMtR) model was introduced by Sugiarti et al. [7]. The MLE method and the expectation-maximization algorithm were applied to estimate the GWMtR model parameters. In [8], a new method to determine model conformity between the multivariate nonparametric truncated spline GWR model and the multivariate nonparametric truncated spline (global regression) was employed.

The responses of the multivariate GWR models in previous research were quantitative. However, in many applications across various fields of research, the responses include not only quantitative data but also qualitative (categorical) data. Therefore, in this study, we propose the geographically weighted multivariate logistic regression (GWMLR) model. The GWMLR model is an extension of the geographically weighted bivariate logistic regression (GWBLR) model proposed by Fathurahman et al. [9], and it builds on the geographically weighted logistic regression (GWLR) model proposed by Atkinson et al. [10]. The GWLR model is a combination of the GWR model [11] and the binary logistic regression model. The GWMLR model in this study is used to explain the spatial associations between two correlated categorical dependent variables and one or more independent variables, where each of the dependent variables has two categories. As in the works of Harini et al. [2], Triyanto et al. [4, 5], Suyitno et al. [6], and Sifriyani et al. [8], the MLE and MLRT methods were used in developing and applying the GWMLR model. The MLE method was used to estimate the parameters, and the test statistics for the significance of the parameters were determined by the MLRT method. The GWMLR model performance was evaluated using the factors that influence the public health development index and human development index of districts and cities in Kalimantan Island, Indonesia.

2. Materials and Methods

2.1. Multivariate Logistic Regression Model

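A multivariate logistic regression (MLR) model explains the relationship between two or more correlated categorical dependent variables and one or more independent variables. In this study, the MLR model had two correlated categorical dependent variables, and each dependent variable had two categories. Let $Y_1$ and $Y_2$ be the two dependent variables. $Y_1$ and $Y_2$ each can have one of the two values (0 or 1). Let $\mathbf{Y} = [Y_{11}\; Y_{10}\; Y_{01}\; Y_{00}]^T$ be a vector of dependent variables of the MLR model. The elements of $\mathbf{Y}$ have the probabilities of $\pi_{11}$, $\pi_{10}$, $\pi_{01}$, and $\pi_{00}$, respectively, which are presented in Table 1.

Table 1: Joint and marginal probabilities of the dependent variables.

| $Y_1$ | $Y_2 = 1$ | $Y_2 = 0$ | Total |
|---|---|---|---|
| 1 | $\pi_{11}$ | $\pi_{10}$ | $\pi_1$ |
| 0 | $\pi_{01}$ | $\pi_{00}$ | $1 - \pi_1$ |
| Total | $\pi_2$ | $1 - \pi_2$ | 1 |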

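Following Dale [12] and Palmgren [13], $\mathbf{Y}$ follows a multinomial distribution with the joint probability mass function:

$$P(Y_{11} = y_{11}, Y_{10} = y_{10}, Y_{01} = y_{01}, Y_{00} = y_{00}) = \pi_{11}^{y_{11}}\,\pi_{10}^{y_{10}}\,\pi_{01}^{y_{01}}\,\pi_{00}^{y_{00}},$$

where $\pi_{gh} = P(Y_1 = g, Y_2 = h)$, $g, h = 0, 1$, $\pi_1 = \pi_{11} + \pi_{10}$, $\pi_2 = \pi_{11} + \pi_{01}$, and $\pi_{11} + \pi_{10} + \pi_{01} + \pi_{00} = 1$. $g$ and $h$ are the values of the dependent variables. $y_{gh}$ is the value of $Y_{gh}$, which represents the elements of the vector of dependent variables. $\pi_{gh}$ is the joint probability of the dependent variables. $\pi_1$ and $\pi_2$ are the marginal probabilities of $Y_1$ and $Y_2$, respectively.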

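Suppose $X_1, X_2, \ldots, X_p$ are independent variables; then the MLR model can be expressed as follows:

$$\eta_1 = \ln\frac{\pi_1}{1 - \pi_1} = \boldsymbol{\beta}_1^T\mathbf{x}, \qquad \eta_2 = \ln\frac{\pi_2}{1 - \pi_2} = \boldsymbol{\beta}_2^T\mathbf{x}, \qquad \eta_3 = \ln\psi = \boldsymbol{\beta}_3^T\mathbf{x},$$

where $\mathbf{x} = [1\; X_1\; \cdots\; X_p]^T$ is a vector of independent variables; $\boldsymbol{\beta}_1$, $\boldsymbol{\beta}_2$, and $\boldsymbol{\beta}_3$ are the vectors of regression parameters; $\pi_1$ is the marginal probability of $Y_1$, and $\pi_2$ is the marginal probability of $Y_2$, which are defined as follows [14]:

$$\pi_1 = \frac{\exp(\boldsymbol{\beta}_1^T\mathbf{x})}{1 + \exp(\boldsymbol{\beta}_1^T\mathbf{x})}, \qquad \pi_2 = \frac{\exp(\boldsymbol{\beta}_2^T\mathbf{x})}{1 + \exp(\boldsymbol{\beta}_2^T\mathbf{x})}.$$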

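$\psi = \pi_{11}\pi_{00}/(\pi_{10}\pi_{01})$ is called the odds ratio of $Y_1$ and $Y_2$ and depends on $\mathbf{x}$; it measures the association between $Y_1$ and $Y_2$. The variables $Y_1$ and $Y_2$ are independent if $\psi = 1$, negatively correlated if $\psi < 1$, and positively correlated if $\psi > 1$ [15, 16].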

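According to Dale [12] and Palmgren [13], $\pi_{11}$ is obtained by

$$\pi_{11} = \begin{cases} \dfrac{1}{2}(\psi - 1)^{-1}\left(a - \sqrt{a^2 + b}\right), & \psi \neq 1, \\ \pi_1\pi_2, & \psi = 1, \end{cases}$$

where $a = 1 + (\pi_1 + \pi_2)(\psi - 1)$ and $b = -4\psi(\psi - 1)\pi_1\pi_2$.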
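As an illustration, the following minimal sketch recovers the four joint probabilities from the two marginal probabilities and the odds ratio, assuming the standard closed form of the Dale model. The function name `joint_probabilities` and the argument names `pi1`, `pi2`, and `psi` are illustrative and do not come from the paper.

```python
import math


def joint_probabilities(pi1, pi2, psi):
    """Return (pi11, pi10, pi01, pi00) from marginals pi1, pi2 and odds ratio psi.

    Assumes the standard Dale/Palmgren closed form for pi11.
    """
    if math.isclose(psi, 1.0):
        # Independence: the joint probability factorizes.
        pi11 = pi1 * pi2
    else:
        a = 1.0 + (pi1 + pi2) * (psi - 1.0)
        b = -4.0 * psi * (psi - 1.0) * pi1 * pi2
        pi11 = 0.5 * (a - math.sqrt(a * a + b)) / (psi - 1.0)
    # Remaining cells follow from the marginal totals.
    pi10 = pi1 - pi11
    pi01 = pi2 - pi11
    pi00 = 1.0 - pi11 - pi10 - pi01
    return pi11, pi10, pi01, pi00


# Hypothetical inputs: marginals 0.3 and 0.5 with odds ratio 4.
cells = joint_probabilities(0.3, 0.5, 4.0)
print(cells)
```

The odds ratio computed back from the returned cells, `pi11 * pi00 / (pi10 * pi01)`, should reproduce `psi`, which is a quick way to check the closed form.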

3. Geographically Weighted Multivariate Logistic Regression Model

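The GWMLR model is an extension of the MLR model, used when the regression parameters depend on the spatial weights of all locations in the study area. The spatial weights, commonly computed with kernel functions [11, 17], depend on both the Euclidean distance and an optimal bandwidth. The GWMLR model in this study is expressed as follows:

$$\eta_1(\mathbf{x}_i) = \ln\frac{\pi_1(\mathbf{x}_i)}{1 - \pi_1(\mathbf{x}_i)} = \boldsymbol{\beta}_1^T(u_i)\,\mathbf{x}_i, \tag{6}$$

$$\eta_2(\mathbf{x}_i) = \ln\frac{\pi_2(\mathbf{x}_i)}{1 - \pi_2(\mathbf{x}_i)} = \boldsymbol{\beta}_2^T(u_i)\,\mathbf{x}_i, \tag{7}$$

$$\eta_3(\mathbf{x}_i) = \ln\psi(\mathbf{x}_i) = \boldsymbol{\beta}_3^T(u_i)\,\mathbf{x}_i, \tag{8}$$

where $\mathbf{x}_i$ is a vector of independent variables at location $u_i$ for $i = 1, 2, \ldots, n$ and $\boldsymbol{\beta}_1(u_i)$, $\boldsymbol{\beta}_2(u_i)$, and $\boldsymbol{\beta}_3(u_i)$ are the vectors of the GWMLR model parameters at location $u_i$. The vectors of independent variables and parameters at location $u_i$ are $\mathbf{x}_i = [1\; x_{1i}\; \cdots\; x_{pi}]^T$, $\boldsymbol{\beta}_1(u_i) = [\beta_{10}(u_i)\; \cdots\; \beta_{1p}(u_i)]^T$, $\boldsymbol{\beta}_2(u_i) = [\beta_{20}(u_i)\; \cdots\; \beta_{2p}(u_i)]^T$, and $\boldsymbol{\beta}_3(u_i) = [\beta_{30}(u_i)\; \cdots\; \beta_{3p}(u_i)]^T$, respectively.

$\pi_1(\mathbf{x}_i)$ and $\pi_2(\mathbf{x}_i)$ are the marginal probabilities of dependent variables at location $u_i$ and are formulated as follows:

$$\pi_1(\mathbf{x}_i) = \frac{\exp(\boldsymbol{\beta}_1^T(u_i)\,\mathbf{x}_i)}{1 + \exp(\boldsymbol{\beta}_1^T(u_i)\,\mathbf{x}_i)}, \tag{9}$$

$$\pi_2(\mathbf{x}_i) = \frac{\exp(\boldsymbol{\beta}_2^T(u_i)\,\mathbf{x}_i)}{1 + \exp(\boldsymbol{\beta}_2^T(u_i)\,\mathbf{x}_i)}. \tag{10}$$

$\psi(\mathbf{x}_i)$ is called the odds ratio of dependent variables at location $u_i$ and can be determined by

$$\psi(\mathbf{x}_i) = \frac{\pi_{11}(\mathbf{x}_i)\,\pi_{00}(\mathbf{x}_i)}{\pi_{10}(\mathbf{x}_i)\,\pi_{01}(\mathbf{x}_i)}, \tag{11}$$

where

$$\pi_{11}(\mathbf{x}_i) = \begin{cases} \dfrac{1}{2}(\psi(\mathbf{x}_i) - 1)^{-1}\left(a_i - \sqrt{a_i^2 + b_i}\right), & \psi(\mathbf{x}_i) \neq 1, \\ \pi_1(\mathbf{x}_i)\,\pi_2(\mathbf{x}_i), & \psi(\mathbf{x}_i) = 1, \end{cases} \tag{12}$$

with $a_i = 1 + (\pi_1(\mathbf{x}_i) + \pi_2(\mathbf{x}_i))(\psi(\mathbf{x}_i) - 1)$ and $b_i = -4\psi(\mathbf{x}_i)(\psi(\mathbf{x}_i) - 1)\,\pi_1(\mathbf{x}_i)\,\pi_2(\mathbf{x}_i)$.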

4. Model Selection

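In this study, the best model was selected using the three most common information criteria, which are Akaike’s information criterion (AIC), the corrected AIC (AICC), and the Bayesian information criterion (BIC). The three information criteria are defined as follows:

$$\mathrm{AIC} = -2\ln L(\hat{\boldsymbol{\theta}}) + 2K, \tag{13}$$

$$\mathrm{AICC} = \mathrm{AIC} + \frac{2K(K + 1)}{n - K - 1}, \tag{14}$$

$$\mathrm{BIC} = -2\ln L(\hat{\boldsymbol{\theta}}) + K\ln n, \tag{15}$$

where $\ln L(\hat{\boldsymbol{\theta}})$ is the log-likelihood function of an estimated model, evaluated at the maximum likelihood estimator of the parameters at all locations $u_i$; $n$ is the sample size; and $K$ is the number of effective parameters in the model at all locations, defined as $K = \mathrm{tr}(\mathbf{S})$, with $\mathbf{S}$ the hat matrix formed from $\mathbf{X}$ and $\mathbf{W}$, the matrices of independent variables and spatial weighting, respectively. The best model has the lowest values of AIC, AICC, and BIC.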
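The three criteria can be sketched in a few lines, assuming the usual forms AIC = -2 ln L + 2K, AICC = AIC + 2K(K + 1)/(n - K - 1), and BIC = -2 ln L + K ln n. The function name and the log-likelihood value below are illustrative: the log-likelihood is back-computed from the MLR entry of Table 6 under the assumption K = 9 (three equations with an intercept and two covariates each) and n = 55, and is not a figure reported in the paper.

```python
import math


def information_criteria(log_lik, k, n):
    """Return (AIC, AICC, BIC) for log-likelihood log_lik, k parameters, n observations."""
    aic = -2.0 * log_lik + 2.0 * k
    aicc = aic + 2.0 * k * (k + 1.0) / (n - k - 1.0)
    bic = -2.0 * log_lik + k * math.log(n)
    return aic, aicc, bic


# Assumed values: K = 9, n = 55, log-likelihood back-computed from Table 6.
aic, aicc, bic = information_criteria(log_lik=-253.8129, k=9, n=55)
print(aic, aicc, bic)
```

With K = 9 and n = 55 the small-sample correction 2K(K + 1)/(n - K - 1) equals 4, and K ln n - 2K is about 18.066; both agree with the differences between the MLR entries reported in Table 6.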

5. Results and Discussion

5.1. Estimation of the GWMLR Model Parameters

The parameters of the GWMLR model can be obtained using the maximum likelihood method. The likelihood function is as follows: where is a vector of the GWMLR model parameters. Let for ; then, the likelihood function in Equation (16) is formulated by

The maximum likelihood estimator of the GWMLR model parameters can be determined by maximizing the likelihood function in Equation (17) or by maximizing the natural logarithm of the likelihood function (log-likelihood). The log-likelihood function is given by

Based on the GWR method, the spatial weighting function is presented as a log-likelihood. Let be the spatial weighting function for each location , where and . The log-likelihood function is defined as follows: where is a fixed kernel bi-square [16] and formulated by where is the Euclidean distance from to , and is called an optimal bandwidth for the parameter estimation of the model at location . In this study, the optimal bandwidth is determined by the cross-validation (CV) method [18, 19]. The formula of the CV method in this study is as follows: where is the observation of the dependent variables with category values of and at location , and is the estimated value of the joint probabilities of the dependent variables that have category values of , , and bandwidth with location omitted from the estimation process. The optimal bandwidth has the lowest value of CV.
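A minimal sketch of the fixed bi-square kernel follows, assuming its usual form: the weight is (1 - (d/h)^2)^2 when the distance d does not exceed the bandwidth h, and 0 otherwise. The function names and the coordinates are hypothetical; only the bandwidth value 4.8572 is taken from the application section.

```python
import math


def bisquare_weight(distance, bandwidth):
    """Fixed bi-square kernel weight for one pair of locations."""
    if distance > bandwidth:
        return 0.0  # locations beyond the bandwidth get no weight
    return (1.0 - (distance / bandwidth) ** 2) ** 2


def weights_for_location(coords, i, bandwidth):
    """Weights of every location j relative to the regression point i."""
    ui = coords[i]
    return [bisquare_weight(math.dist(ui, uj), bandwidth) for uj in coords]


# Hypothetical coordinates; the bandwidth is the value reported in the application.
coords = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0), (6.0, 0.0)]
print(weights_for_location(coords, 0, 4.8572))
```

The regression point itself always receives weight 1, and weights decay smoothly to 0 at the bandwidth, so only nearby locations contribute to each local log-likelihood.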

Theorem 1 obtains the maximum likelihood estimator of the GWMLR model parameters.

Theorem 1. The parameter estimator of in the GWMLR model can be obtained by using the maximum likelihood method and iterative procedure with the Newton-Raphson method, where the gradient vector is and the Hessian matrix is .

Proof. Based on the GWMLR model in Equations (6)–(8), let , , and . Then, and are formed. We then determine the derivative of . The vector of has four elements, whereas the vector of only has three elements. To obtain a symmetrical matrix of , let with . Thus, the vector of is . Let ; then, the matrix of and the inverse matrix of are given by where and .

The log-likelihood function in Equation (9) is maximized by determining the first-order partial derivative of the likelihood function, then equating to zero. The first-order partial derivative of the log-likelihood function with respect to the parameters of is as follows: where and for in Equation (23).

The details of the first-order partial derivative of the log-likelihood function with respect to the parameters of in Equations (24)–(26) are presented in the appendix.

The first-order partial derivative of the log-likelihood with respect to the parameters of in Equations (24)–(26) produces an implicit form. This result shows that the estimator of the GWMLR model parameters cannot be obtained analytically and requires a numerical approach. The numerical approach by the Newton-Raphson method was used to obtain the maximum likelihood estimator of the GWMLR model parameters. The Newton-Raphson method requires the gradient vector and the Hessian matrix, which are formulated as follows:

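After obtaining the gradient vector and Hessian matrix, the Newton-Raphson iteration process is carried out with the following formula:

$$\hat{\boldsymbol{\theta}}^{(t+1)}(u_i) = \hat{\boldsymbol{\theta}}^{(t)}(u_i) - \mathbf{H}^{-1}\!\left(\hat{\boldsymbol{\theta}}^{(t)}(u_i)\right)\mathbf{g}\!\left(\hat{\boldsymbol{\theta}}^{(t)}(u_i)\right), \tag{29}$$

where $\hat{\boldsymbol{\theta}}^{(t+1)}(u_i)$ and $\hat{\boldsymbol{\theta}}^{(t)}(u_i)$ are the parameter estimators of $\boldsymbol{\theta}(u_i)$ on the $(t+1)$th and $t$th iterations, respectively. $\mathbf{H}^{-1}(\hat{\boldsymbol{\theta}}^{(t)}(u_i))$ is the inverse of the Hessian matrix on the $t$th iteration, and $\mathbf{g}(\hat{\boldsymbol{\theta}}^{(t)}(u_i))$ is the gradient vector on the $t$th iteration. The iteration process in Equation (29) starts from an initial value $\hat{\boldsymbol{\theta}}^{(0)}(u_i)$ and stops at iteration $t + 1$ when $\lVert\hat{\boldsymbol{\theta}}^{(t+1)}(u_i) - \hat{\boldsymbol{\theta}}^{(t)}(u_i)\rVert \le \varepsilon$, where $t = 0, 1, 2, \ldots$ and $\varepsilon$ is a small positive number.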
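The Newton-Raphson update can be sketched on a deliberately simplified stand-in: a spatially weighted binary logistic regression with an intercept and a single covariate, rather than the full GWMLR log-likelihood. The function name, the data, and the unit weights below are hypothetical; only the update rule (new estimate = old estimate minus inverse Hessian times gradient) and the stopping rule on the step norm follow the description above.

```python
import math


def newton_raphson_weighted_logit(x, y, w, tol=1e-8, max_iter=50):
    """Maximize a weighted binary-logit log-likelihood for (intercept, slope)."""
    b0, b1 = 0.0, 0.0  # initial value
    for _ in range(max_iter):
        g0 = g1 = 0.0            # gradient vector
        h00 = h01 = h11 = 0.0    # symmetric 2x2 Hessian matrix
        for xj, yj, wj in zip(x, y, w):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xj)))
            r = wj * (yj - p)
            g0 += r
            g1 += r * xj
            v = wj * p * (1.0 - p)
            h00 -= v
            h01 -= v * xj
            h11 -= v * xj * xj
        det = h00 * h11 - h01 * h01
        # step = H^{-1} g, using the closed-form inverse of a 2x2 matrix
        s0 = (h11 * g0 - h01 * g1) / det
        s1 = (-h01 * g0 + h00 * g1) / det
        new_b0, new_b1 = b0 - s0, b1 - s1
        converged = math.hypot(new_b0 - b0, new_b1 - b1) <= tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1


# Hypothetical data with unit weights (equivalent to a global model).
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0, 0, 1, 0, 1, 1]
w = [1.0] * len(x)
print(newton_raphson_weighted_logit(x, y, w))
```

In a geographically weighted setting the same routine would be run once per location, each time with that location's kernel weights in `w`, which yields one set of local estimates per location.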

5.2. Hypothesis Test

Hypothesis testing on the GWMLR model parameters included a similarity test, a simultaneous test, and a partial test. The similarity test was used to determine whether the MLR and GWMLR models differ significantly. The simultaneous test was used to determine whether the independent variables jointly have a significant influence on the dependent variables, that is, whether at least one of the independent variables has a significant influence. The partial test was used to determine whether each independent variable individually has a significant influence on the dependent variables.

The similarity test was conducted using the hypotheses:

The statistical test is as follows: where

The test statistic in Equation (32) asymptotically follows a standard normal distribution. Therefore, the null hypothesis in Equation (30) is rejected at the significance level $\alpha$ when the value of the statistic in Equation (32) falls into the rejection region (i.e., its absolute value exceeds $Z_{\alpha/2}$).

The next test presented is the simultaneous test. The hypothesis of this test is formulated as follows:

Theorem 2 is presented next for the simultaneous test.

Theorem 2. The statistical test of the hypothesis in the simultaneous test is as follows:

Proof. The statistic can be obtained by the maximum likelihood ratio test method. The initial step of this method determines the parameters set under the population. Analogously, in Equations (17) and (18), the likelihood function under the population is as follows: However, the maximum likelihood estimator of the GWMLR model parameters was obtained in Theorem 1. Therefore, the maximum log-likelihood function under the population is as follows: The parameters set under the null hypothesis are Analogously, in Equations (38) and (39), the likelihood function under the null hypothesis is The maximum log-likelihood function under the null hypothesis is as follows: where the joint probabilities of , , , and are obtained by with , , , and .

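Based on the maximum likelihood ratio test method, the test statistic for the hypothesis in Equation (34) is formulated as follows:

$$G^2 = -2\ln\frac{L(\hat{\omega})}{L(\hat{\Omega})} = 2\left[\ln L(\hat{\Omega}) - \ln L(\hat{\omega})\right], \tag{44}$$

where $L(\hat{\Omega})$ and $L(\hat{\omega})$ are the maximized likelihood functions under the population and under the null hypothesis, respectively.

The likelihood ratio statistic in Equation (44) has an asymptotic chi-square distribution whose degrees of freedom, $v$, equal the difference between the number of model parameters under the population and the number of model parameters under the null hypothesis. Therefore, at the significance level $\alpha$, we reject the null hypothesis when the value of $G^2$ falls into the rejection region (i.e., $G^2 > \chi^2_{(\alpha, v)}$).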

The last hypothesis test of the GWMLR model parameters is the partial test. The hypothesis is

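The test statistic for the hypothesis in Equation (45) is given by

$$Z = \frac{\hat{\beta}_{sk}(u_i)}{\mathrm{SE}(\hat{\beta}_{sk}(u_i))},$$

where $\hat{\beta}_{sk}(u_i)$ denotes the estimate of the parameter under test and $\mathrm{SE}(\hat{\beta}_{sk}(u_i)) = \sqrt{\widehat{\mathrm{Var}}(\hat{\beta}_{sk}(u_i))}$. $\widehat{\mathrm{Var}}(\hat{\beta}_{sk}(u_i))$ is the corresponding diagonal element of $-\mathbf{H}^{-1}(\hat{\boldsymbol{\theta}}(u_i))$, and $\mathbf{H}(\hat{\boldsymbol{\theta}}(u_i))$ is derived in Equation (28). The statistic in Equation (47) has an asymptotic standard normal distribution. Therefore, the null hypothesis in Equation (45) is rejected when the value of the statistic falls into the rejection region (i.e., $|Z| > Z_{\alpha/2}$).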
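The partial test reduces to dividing an estimate by its standard error and comparing the result with a standard normal critical value. The sketch below assumes a two-sided rejection region, which is consistent with the critical value 1.6449 reported at the 10% level in the application; the function name is illustrative, and the example inputs are the rounded estimates and standard errors from the last two rows of Table 4.

```python
from statistics import NormalDist


def partial_z_test(estimate, std_error, alpha=0.10):
    """Return (z, critical value, reject?) for a two-sided Wald-type test."""
    z = estimate / std_error
    critical = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z, critical, abs(z) > critical


# Rounded values from Table 4, so z differs slightly from the reported statistics.
print(partial_z_test(0.0132, 0.0076))
print(partial_z_test(0.0078, 0.0048))
```

Because the inputs are rounded to four decimals, the recomputed statistics match the published ones only approximately, but the accept/reject decisions are the same.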

5.3. Application

The GWMLR model was applied to real data, which included the public health development index (PHDI) and the human development index (HDI) for the districts/cities in Kalimantan Island, Indonesia, in 2013. The PHDI describes the quality of health and the progress of health development of the districts/cities and provinces in Indonesia. The PHDI is used to prioritize districts/cities that need assistance in health development [20]. The HDI is an index that measures the basic dimensions of human development in the districts/cities [21].

The PHDI data were provided by the Ministry of Health, Indonesia. The National Bureau of Statistics Indonesia provided the HDI data and the independent variables. The variables in this study consist of two dependent variables (the PHDI and HDI) and two independent variables (the poverty rate and economic growth). The PHDI has two categories: 0 if the PHDI value of a district/city is less than the PHDI value of Indonesia, and 1 if it is greater than or equal to the PHDI value of Indonesia. The HDI has two categories: 0 if the HDI value of a district/city is less than the HDI value of Indonesia, and 1 if it is greater than or equal to the HDI value of Indonesia. The unit of observation is the district/city in Kalimantan Island, Indonesia, in 2013. The sample size is 55, consisting of 46 districts and 9 cities. The computation in this study was performed using MATLAB and the econometrics toolbox [22].

The implementation of the GWMLR model for the PHDI and HDI of districts/cities in Kalimantan Island began by creating a contingency table for the observed frequencies of the dependent variables and for determining their proportion and correlation. The observed frequencies of the dependent variables are reported in Table 2.


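Table 2: Observed frequencies (proportions) of the dependent variables.

| PHDI | HDI = 1 | HDI = 0 | Total |
|---|---|---|---|
| 1 | 13 (0.236) | 3 (0.055) | 16 (0.291) |
| 0 | 13 (0.236) | 26 (0.473) | 39 (0.709) |
| Total | 26 (0.472) | 29 (0.528) | 55 (1) |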

Table 2 shows that 13 districts/cities had PHDI and HDI values greater than or equal to the PHDI and HDI values of Indonesia, and 26 districts/cities had PHDI and HDI values less than the PHDI and HDI values of Indonesia. We also see that three districts/cities had a PHDI value greater than or equal to the PHDI value of Indonesia and an HDI value less than the HDI value of Indonesia. Finally, 13 districts/cities had a PHDI value less than the PHDI value of Indonesia and an HDI value greater than or equal to the HDI value of Indonesia. The odds ratio value of the dependent variables was 8.6667, which shows that the dependent variables were positively correlated. Therefore, the dependent variables of PHDI and HDI were appropriate for the MLR and GWMLR models.
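The reported odds ratio can be reproduced directly from the four cell counts described above, using the cross-product ratio of the 2 x 2 table. The function name is illustrative.

```python
def odds_ratio(n11, n10, n01, n00):
    """Cross-product (odds) ratio of a 2 x 2 contingency table."""
    return (n11 * n00) / (n10 * n01)


# Counts from Table 2: (1,1) = 13, (1,0) = 3, (0,1) = 13, (0,0) = 26.
psi = odds_ratio(13, 3, 13, 26)
print(round(psi, 4))  # 8.6667
```

A value above 1 indicates a positive association between the two binary responses.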

The parameter estimation obtained a total of 55 GWMLR models. The optimal bandwidth value of the fixed kernel bi-square weighting function was 4.8572, with a CV value of 90.3673. The descriptive statistics of the maximum likelihood estimator values of the 55 GWMLR models for modeling the PHDI and HDI of districts/cities in Kalimantan Island are given in Table 3.


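Table 3: Descriptive statistics of the maximum likelihood estimates of the 55 GWMLR models.

| Parameter | Minimum | Maximum | Mean | Standard deviation |
|---|---|---|---|---|
| | -0.1016 | 0.1741 | -0.0003 | 0.0441 |
| | 1.1851 | 0.1153 | -0.0958 | 0.1697 |
| | -0.5212 | 0.0000 | -0.1244 | 0.0997 |
| | -0.1413 | 0.0011 | -0.0143 | 0.0322 |
| | -0.7111 | 0.1190 | 0.0056 | 0.1045 |
| | -0.2609 | 0.0034 | -0.0413 | 0.0753 |
| | -0.0018 | 0.3616 | 0.0400 | 0.1004 |
| | -0.0067 | 1.7864 | 0.1959 | 0.4863 |
| | -0.0099 | 1.7867 | 0.1874 | 0.4701 |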

The similarity evaluation between the MLR and GWMLR models was carried out using the statistical test in Equation (32). The hypothesis was formulated as follows:

The test statistic value was 376.0917, and the critical value $Z_{0.05}$ at a 10% significance level was 1.6449. Therefore, the null hypothesis was rejected, and we concluded that the MLR and GWMLR models were significantly different.

The next test was the simultaneous test, and the hypothesis was formulated as follows:

The likelihood ratio statistic value was 401.6335, and the critical chi-square value at a 10% significance level was 363.3222. Because the statistic exceeds the critical value, the likelihood ratio statistic was significant at a 10% significance level. Therefore, the null hypothesis was rejected, and we concluded that the poverty rate and economic growth simultaneously had a significant influence on the PHDI and HDI of the districts/cities in Kalimantan Island.
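As a rough check of this decision, the chi-square critical value can be approximated with the Wilson-Hilferty formula using only the standard library. The degrees of freedom used here, 330, are an assumption: they are inferred as 55 locations times 9 parameters minus 55 times 3 intercepts, and are not stated in the text above. The function name is illustrative.

```python
from statistics import NormalDist


def chi2_critical_wilson_hilferty(alpha, df):
    """Approximate upper-alpha chi-square critical value (Wilson-Hilferty)."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * c ** 0.5) ** 3


# Assumed degrees of freedom: 55 * 9 - 55 * 3 = 330.
critical = chi2_critical_wilson_hilferty(alpha=0.10, df=330)
g2 = 401.6335  # likelihood ratio statistic reported in the text
print(critical, g2 > critical)
```

Under this assumption the approximation lands very close to the reported critical value of 363.3222, and the reported statistic exceeds it, in line with the rejection of the null hypothesis.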

The partial test was used to obtain the independent variables that significantly influence the PHDI and HDI of the districts/cities in Kalimantan Island. We show the results of the parameter estimation, standard errors, and statistical test values of the partial test for the GWMLR model of Lamandau District in Table 4.


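Table 4: Parameter estimates, standard errors, and partial test statistics of the GWMLR model for Lamandau District.

| Parameter | Estimate | Standard error | Test statistic |
|---|---|---|---|
| | -0.0184 | 0.0080 | -2.3017* |
| | -0.1368 | 0.0596 | -2.2963* |
| | -0.1185 | 0.0506 | -2.3424* |
| | -0.0023 | 0.0010 | -2.3167* |
| | 0.0052 | 0.0026 | 2.0172* |
| | -0.0114 | 0.0057 | -2.0089* |
| | 0.0013 | 0.0008 | 1.7245* |
| | 0.0132 | 0.0076 | 1.7281* |
| | 0.0078 | 0.0048 | 1.6245 |

*Indicates significance at a 10% level.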

The hypothesis of the partial test for the Lamandau District GWMLR model parameters was as follows:

The test statistic of the last estimated parameter in Table 4 (1.6245) was not statistically significant at a 10% significance level. Therefore, we concluded that economic growth did not have a partially significant influence on the PHDI and HDI of Lamandau District.

The GWMLR model of Lamandau District is expressed as follows:

The results of the estimation and hypothesis testing of the GWMLR model parameters show that not all of the independent variables had a significant influence on the PHDI and HDI of the districts/cities in Kalimantan Island, and that the set of significant parameters differed from one district/city to another. The groupings of the independent variables that had a significant influence on the PHDI and HDI of districts/cities are presented in Table 5.


Table 5: Groupings of the independent variables with a significant influence on the PHDI and HDI of districts/cities.

| Districts/cities | Total | Variable |
|---|---|---|
| Lamandau, Banjarmasin City, and Banjarbaru City | 3 | Poverty rate |
| Barito Timur, Kotabaru, Tapin, Hulu Sungai Selatan, Hulu Sungai Utara, Tabalong, Balangan, Paser, and Kutai Barat | 9 | Economic growth |
| Sambas, Kapuas Hulu, Sekadau, Melawi, Kayong Utara, Kubu Raya, Pontianak City, Singkawang City, Kotawaringin Barat, Kotawaringin Timur, Kapuas, Barito Selatan, Barito Utara, Sukamara, Tanah Laut, Hulu Sungai Selatan, and Tanah Bumbu | 17 | Poverty rate and economic growth |
| Bengkayang, Landak, Pontianak, Sanggau, Ketapang, Sintang, Seruyan, Katingan, Pulang Pisau, Gunung Mas, Murung Raya, Palangkaraya City, Banjar, Barito Kuala, Kutai Kartanegara, Kutai Timur, Berau, Malinau, Bulungan, Nunukan, Penajam Paser Utara, Tana Tidung, Balikpapan City, Samarinda City, Tarakan City, and Bontang City | 26 | None |

The performance of the GWMLR model was evaluated using the AIC, AICC, and BIC in Equations (13)–(15). The values of the AIC, AICC, and BIC for the MLR and GWMLR models are presented in Table 6.


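Table 6: AIC, AICC, and BIC values of the MLR and GWMLR models.

| Model | AIC | AICC | BIC |
|---|---|---|---|
| MLR | 525.6258 | 529.6258 | 543.6918 |
| GWMLR | 137.8512 | 134.2875 | 143.8732 |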

The AIC, AICC, and BIC values of the GWMLR model in Table 6 are lower than those of the MLR model. By these criteria, the GWMLR model fits the data better than the MLR model. Therefore, the GWMLR model is preferred for modeling the relationships between the dependent variables (the PHDI and HDI) and the independent variables (the poverty rate and economic growth) of districts/cities in Kalimantan Island, Indonesia, in 2013.

6. Conclusions

The GWMLR model can evaluate the relationships between two correlated categorical dependent variables and one or more independent variables, with parameters that depend on the spatial weighting function at each location in the study area. The spatial weighting function is essential for parameter estimation and hypothesis testing in GWMLR modeling. In this study, the fixed bi-square kernel was used, which depends on the Euclidean distance and an optimal bandwidth obtained by the cross-validation method. The GWMLR model parameters were estimated using the maximum likelihood method. The maximum likelihood estimators have an implicit form and were obtained by the Newton-Raphson method. Hypothesis testing of the GWMLR model included a similarity test, a simultaneous test, and a partial test. The similarity test was used to determine whether the MLR and GWMLR models differ significantly; its test statistic has an asymptotic standard normal distribution. The simultaneous test was used to determine whether the independent variables jointly have a significant influence on the dependent variables; its test statistic has an asymptotic chi-square distribution. The partial test was used to determine whether each independent variable individually has a significant influence on the dependent variables; its test statistic has an asymptotic standard normal distribution. The performance of the GWMLR model was evaluated with the factors influencing the public health development index and the human development index of districts/cities in Kalimantan Island, Indonesia, in 2013. By the AIC, AICC, and BIC, the GWMLR model was found to be better than the MLR model in this context.

Some improvements and future work on GWMLR modeling are possible. First, the spatial weighting function used in this study is limited to the bi-square kernel. Other kernel functions, such as the Gaussian, exponential, or tri-cube functions, could be used to improve the GWMLR model performance. Second, this study is limited to two dependent variables, each with only two categories. Models with more dependent variables and with more than two categories, either multinomial or ordinal, should also be considered in future work.

Appendix

The first-order partial derivative of the log-likelihood function with respect to the parameters of is as follows: where in Equation (23).