Bandwidth Selection in Geographically Weighted Regression Models via Information Complexity Criteria
The geographically weighted regression (GWR) model is a local spatial regression technique used to determine and map spatial variations in the relationships between variables. In the GWR model, the bandwidth is very important as it can change the parameter estimates and affect the model performance. In this study, we applied the information complexity (ICOMP) type criteria in the selection of fixed bandwidth for the first time in the literature. The ICOMP-type criteria use a complexity measure that measures how parameters in the model relate to each other. A real dataset example and a simulation study have been conducted. Results of the simulation demonstrate that GWR models created with the bandwidth selection by ICOMP-type criteria show superior performance. In addition, when the bandwidth is selected according to the ICOMP-type criteria and the GWR model is created for the actual total fertility rate data, it is seen that the spatial distribution of the total fertility rate estimates is quite compatible with the distribution of the actual total fertility rate. According to the results, ICOMP-type criteria can be used effectively instead of the classical criteria in the literature in the selection of bandwidth in the GWR model.
1. Introduction
Spatial regression analysis has become an important field of statistics in recent years. In studies where space matters, classical statistical methods are insufficient to explain spatial variation and to support statistical inference. Therefore, spatial statistical methods have come into use in this field. These methods include spatial models that contain spatial information and account for the effects of locations on observations. The geographically weighted regression (GWR) method is a local spatial regression technique used to model relationships that vary over geography. GWR uses reference points, whose locations and attributes are known, to generate regression estimates at other points whose locations are known. Unlike the classical regression model, the coefficients are not fixed in the GWR model, and each spatial point has its own coefficients. To estimate the parameters at a regression point, the GWR approach assigns a neighborhood weight to every reference point around that regression point. This neighborhood weight is usually determined by a kernel function of the Euclidean distance. The bandwidth is the distance or number of neighbors used for each local regression equation, and it is the most important parameter to consider in the GWR approach because changing it can change the coefficient estimates. As the bandwidth increases, the weights decay more slowly with distance, so distant reference points gain influence and the local variation of the parameters decreases; the regression equations then approach a single global equation rather than local ones. As the bandwidth decreases, the weights decay more quickly with distance and the local variation of the parameters increases. In this case, however, the equation may not give correct results because few reference points are effectively taken into account. Studies to improve the accuracy of GWR approaches generally focus on the calibration of bandwidths [5–7].
These approaches mainly focus on finding appropriate bandwidth values for the datasets and do not address other aspects of the spatial datasets, such as altitude and temporal stationarity. The bandwidth can be a fixed value or an adaptive value for the whole dataset, depending on the distribution density of the locations in the data. Cross-validation (CV), generalized cross-validation (GCV), the Akaike information criterion (AIC), and the Bayesian information criterion (BIC) are among the methods used to find the optimal bandwidth value in the GWR method [9, 10]. Guo constructed a forest plot with a clustered spatial model of tree positions to investigate the effects of different kernel functions and different bandwidths on the model performance and coefficient estimates of the GWR method. Cho et al. selected GWR bandwidths using two calibration methods: CV and the smallest spatial error Lagrange multiplier test statistic. Yacim and Boshoff defined five different GWR models by choosing different kernels and bandwidths for a residential data sample and compared the model performances. Koc and Akın selected the bandwidth with the CV criterion and applied GWR models with different kernel functions at a fixed bandwidth. Yuan et al. examined the effects of AIC and of different bandwidths on GWR results. Hu et al. introduced a two-dimensional bandwidth matrix for parameter estimation in the GWR model. Punzo et al. examined local differences in the effects of the main sociodemographic, economic, and institutional determinants of land consumption by using the bandwidth-adjusted AIC in the GWR method. It is clear that bandwidth selection has a strong influence on the descriptive and predictive power of GWR models. In the literature, CV and AIC are generally used for bandwidth selection.
In this study, we propose using the information complexity (ICOMP) criteria proposed by Bozdoğan [16, 17] for bandwidth selection in the GWR model. ICOMP uses an information-theoretic measure of general model complexity based on Van Emden’s generalized covariance complexity and the Kullback–Leibler distance [18, 19]. The purpose of ICOMP-type criteria is to achieve the optimal balance between the complexity and the fit of a model. ICOMP establishes this balance through a complexity measure that captures how the parameters in the model are related to each other. Therefore, although it is based on the Akaike information criterion, ICOMP penalizes the covariance complexity of the model instead of directly penalizing the number of independent parameters, as AIC does. Because of this theoretical basis, using ICOMP-type criteria for bandwidth selection in the GWR model is expected to increase confidence in the selected bandwidth and to bring a new perspective to the literature.
The study is organized as follows. In Section 2, we present the model and methods: we first explain geographically weighted regression and then describe the ICOMP-type criteria for the regression model. In Section 3, we present the simulation results on bandwidth selection using the ICOMP-type criteria. In Section 4, a real data example on the total fertility rate is presented. Section 5 gives the conclusions of the study.
2. Materials and Methods
In this section, we introduce the geographically weighted regression method and the information complexity criteria. First, the geographically weighted regression model is a local spatial regression technique that produces predicted values at other points from reference points whose positions and properties are known. Unlike the linear regression model, the coefficients in the GWR model are not constant; a separate set of coefficients is estimated for each spatial point. The GWR model is given by

$$y_i = \beta_0(u_i, v_i) + \sum_{k=1}^{m} \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i, \quad i = 1, \ldots, n. \tag{1}$$

In equation (1), $(u_i, v_i)$ are the latitude and longitude coordinates of location $i$ in space, $y_i$ is the dependent variable, $x_{ik}$ is the $k$th independent variable, $\beta_k(u_i, v_i)$ is the coefficient of the GWR regression model, and $\varepsilon_i$ is the error at location $i$, which is assumed to be an independent and identically distributed normal random variable with mean zero and constant variance $\sigma^2$.
The weighted least squares method provides the basis for estimating the GWR parameters. The parameter estimates of the GWR model are obtained as

$$\hat{\beta}(u_i, v_i) = \left(X^{T} W(u_i, v_i)\, X\right)^{-1} X^{T} W(u_i, v_i)\, y, \tag{2}$$

where $X$ is the matrix of independent variables and consists of $m + 1$ columns, $y$ is the vector of the dependent variable, and $W(u_i, v_i)$ is a diagonal matrix of weights:

$$W(u_i, v_i) = \operatorname{diag}\left(w_{i1}, w_{i2}, \ldots, w_{in}\right), \tag{3}$$

where $w_{ij}$ is the neighborhood weight between regression point $i$ and reference point $j$. The weight $w_{ij}$ is calculated using a kernel function such as the global model, box-car, exponential, Gaussian, bi-square, or tricube kernel. The Gaussian kernel function is commonly used [21, 22], and $w_{ij}$ is then calculated as

$$w_{ij} = \exp\left(-\frac{1}{2}\left(\frac{d_{ij}}{h}\right)^{2}\right), \tag{4}$$

where $h$ is the bandwidth value and $d_{ij}$ is the distance between regression point $i$ and reference point $j$. This distance is usually the Euclidean distance,

$$d_{ij} = \sqrt{(u_i - u_j)^2 + (v_i - v_j)^2}, \tag{5}$$

where $(u_i, v_i)$ and $(u_j, v_j)$ are the point coordinates.
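As a rough illustration of the weighted least squares estimate with a Gaussian kernel described above, the following Python sketch computes one coefficient vector per location. It is not the implementation used in the study: the function names `gaussian_weights` and `gwr_local_coefficients` are hypothetical, and the kernel is assumed to take the common form exp(-0.5 (d/h)^2) with Euclidean distances.

```python
import numpy as np

def gaussian_weights(coords, point, h):
    """Gaussian kernel weights (assumed form): exp(-0.5 * (d / h)^2)."""
    d = np.sqrt(((coords - point) ** 2).sum(axis=1))  # Euclidean distances
    return np.exp(-0.5 * (d / h) ** 2)

def gwr_local_coefficients(X, y, coords, h):
    """Return an (n, m + 1) array of local coefficients, one row per location."""
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])   # prepend the intercept column
    betas = np.empty((n, Xd.shape[1]))
    for i in range(n):
        w = gaussian_weights(coords, coords[i], h)
        XtW = Xd.T * w                      # X^T W without building diag(w)
        betas[i] = np.linalg.solve(XtW @ Xd, XtW @ y)
    return betas
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse, which is a numerical convenience rather than a change to the estimator.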
The bandwidth can be constant for the whole GWR model, or it can vary according to the point density around each location. The optimal bandwidth value can be determined by the cross-validation (CV), generalized cross-validation (GCV), Akaike information criterion (AIC), and corrected Akaike information criterion (AICc) methods. The cross-validation criterion is given as

$$\mathrm{CV}(h) = \sum_{i=1}^{n} \left[y_i - \hat{y}_{\neq i}(h)\right]^{2}, \tag{6}$$

where $\hat{y}_{\neq i}(h)$ is the fitted value of $y_i$ obtained by omitting point $i$ from the calibration process.
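The leave-one-out idea behind the CV criterion can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the procedure used in the study: the names `cv_score` and `select_bandwidth` are hypothetical, a Gaussian kernel of the form exp(-0.5 (d/h)^2) is assumed, and the bandwidth is chosen by a simple search over a user-supplied candidate list rather than by any particular optimizer.

```python
import numpy as np

def cv_score(X, y, coords, h):
    """Sum of squared leave-one-out prediction errors for a fixed bandwidth h."""
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])   # prepend the intercept column
    sse = 0.0
    for i in range(n):
        d = np.sqrt(((coords - coords[i]) ** 2).sum(axis=1))
        w = np.exp(-0.5 * (d / h) ** 2)     # assumed Gaussian kernel
        w[i] = 0.0                          # omit point i from its own fit
        XtW = Xd.T * w
        beta = np.linalg.solve(XtW @ Xd, XtW @ y)
        sse += (y[i] - Xd[i] @ beta) ** 2
    return sse

def select_bandwidth(X, y, coords, candidates):
    """Return the candidate bandwidth with the smallest CV score, plus all scores."""
    scores = [cv_score(X, y, coords, h) for h in candidates]
    return candidates[int(np.argmin(scores))], scores
```

Setting the weight of point i to zero is what prevents a very small bandwidth from trivially reproducing each observation.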
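The Akaike information criterion and the corrected Akaike information criterion for GWR are defined as

$$\mathrm{AIC} = 2n \ln(\hat{\sigma}) + n \ln(2\pi) + n + \operatorname{tr}(S), \tag{7}$$

$$\mathrm{AIC_c} = 2n \ln(\hat{\sigma}) + n \ln(2\pi) + n \left\{\frac{n + \operatorname{tr}(S)}{n - 2 - \operatorname{tr}(S)}\right\}, \tag{8}$$

where $n$ is the sample size, $\hat{\sigma}$ is the estimated standard deviation of the error term, and $\operatorname{tr}(S)$ denotes the trace of the hat matrix $S$.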
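To make the role of the hat matrix concrete, the sketch below builds it row by row for a fixed bandwidth and evaluates AIC and AICc in the form commonly given for GWR. Several points are assumptions rather than details taken from the study: the function names are hypothetical, the kernel is Gaussian with weights exp(-0.5 (d/h)^2), and the error standard deviation is estimated as the square root of RSS/n.

```python
import numpy as np

def gwr_hat_matrix(X, coords, h):
    """Hat matrix S: row i maps the observed y to the fitted value at location i."""
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])   # prepend the intercept column
    S = np.empty((n, n))
    for i in range(n):
        d = np.sqrt(((coords - coords[i]) ** 2).sum(axis=1))
        w = np.exp(-0.5 * (d / h) ** 2)     # assumed Gaussian kernel
        XtW = Xd.T * w
        S[i] = Xd[i] @ np.linalg.solve(XtW @ Xd, XtW)
    return S

def gwr_aic_aicc(X, y, coords, h):
    """Return (AIC, AICc, tr(S)) for a fixed-bandwidth GWR fit."""
    n = len(y)
    S = gwr_hat_matrix(X, coords, h)
    resid = y - S @ y
    sigma_hat = np.sqrt(resid @ resid / n)  # assumed estimate: sqrt(RSS / n)
    tr_s = np.trace(S)
    base = 2 * n * np.log(sigma_hat) + n * np.log(2 * np.pi)
    aic = base + n + tr_s
    aicc = base + n * (n + tr_s) / (n - 2 - tr_s)
    return aic, aicc, tr_s
```

The trace of S acts as an effective number of parameters: it approaches the number of columns of the design matrix as the bandwidth grows and increases as the bandwidth shrinks.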
Second, the information complexity (ICOMP) criterion is a measure developed by Bozdoğan on the basis of AIC. Unlike AIC-based information criteria, ICOMP approximates the sum of two Kullback–Leibler distances, which measure the model’s lack of fit and its complexity, in a single criterion function by using an entropic measure of the estimated covariance matrix of the model parameters. Thus, the concept of model complexity takes into account not only the number of free parameters in the model but also the interdependence of the parameter estimates. ICOMP therefore provides a general model selection criterion that reflects the relational structure between the parameter estimates in the selected model. ICOMP-type criteria provide the most appropriate balance between the complexity of a model and its goodness of fit. ICOMP criteria can be defined in several formulations. In these formulations, n is the sample size, k is the number of variables, and C is a real-valued complexity measure applied to the estimated covariance matrix of the parameter vector of the model; the lack-of-fit term is based on the maximized likelihood function.
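For orientation, the covariance complexity idea can be sketched in Python using the first-order maximal entropic complexity usually associated with Bozdoğan's work, C1(Σ) = (s/2) log(tr(Σ)/s) − (1/2) log|Σ|, together with the basic form −2 log L + 2 C1. This is an assumption-laden sketch: the function names are hypothetical, and the specific ICOMP-type variants compared in this study (such as ICOMP_PEULN) add further penalty terms that are not reproduced here.

```python
import numpy as np

def c1_complexity(cov):
    """First-order maximal entropic complexity of a covariance matrix (assumed form)."""
    cov = np.asarray(cov, dtype=float)
    s = cov.shape[0]
    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("covariance matrix must be positive definite")
    return 0.5 * s * np.log(np.trace(cov) / s) - 0.5 * logdet

def icomp(loglik, cov):
    """Basic ICOMP form: lack of fit plus twice the covariance complexity."""
    return -2.0 * loglik + 2.0 * c1_complexity(cov)
```

The complexity is zero when the covariance matrix is a multiple of the identity and grows as the parameter estimates become more correlated or their variances more unequal, which is the sense in which the penalty depends on how the parameters relate to each other.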
3. Simulation Study
A simulation study was conducted to assess the performance of the information complexity criteria in bandwidth selection. The simulation design is as follows:where , , , and have been generated. Here, lon and lat are the spatial coordinates of the locations. The sample size is n = 300, and the Gaussian kernel function is used. Table 1 gives the optimal bandwidth values selected by the different methods.
The ICOMP-type information criteria selected a bandwidth value of 2811.487. Among them, ICOMP_PEULN performed best in bandwidth selection, with the lowest information criterion value. GWR models were then fitted with the fixed bandwidth selected by each method and named after the criterion used: GWR-ICOMP-type for the ICOMP-type criteria, GWR-CV for CV, GWR-GCV for GCV, GWR-AIC for AIC, and GWR-AICc for AICc. The performance evaluations of the models are given in Table 2.
In Table 2, the GWR-ICOMP-type model performed best, with the highest Adj. R2 value of 0.9901 and the lowest AIC value of 169.0912.
4. Real Data Application
In this study, total fertility rate data for the 81 provinces of Turkey in 2020 were used. The total fertility rate refers to the average number of children a woman would have over the 15–49 age range. The data obtained are available at . A set of six continuous variables is used in this study: the dependent variable (Y) is the total fertility rate, and the independent variables are gross domestic product (GDP per city) (x1), mean age of mother by province (2009–2020) (x2), number of illiterate people (x3), number of women with higher education (x4), and unemployment rate (% of GDP) (x5). Furthermore, the coordinates (longitude, latitude) of the 81 cities in Turkey (see ) are used to fit the GWR model. The spatial distribution of the total fertility rate in Turkey for 2020 is shown in Figure 1.
The province with the highest total fertility rate in Turkey is Şanlıurfa with 3.71, shown in red on the map. It is followed by Şırnak with 3.22 and by Ağrı and Siirt with 2.88. The province with the lowest total fertility rate is Karabük with 1.29, shown in dark blue on the map. It is followed by Zonguldak and Kütahya with 1.31. First, the factors affecting total fertility rates in Turkey are modeled with multiple regression. Then, the GWR model is used to determine whether location has an effect.
Table 3 provides the coefficients of the multiple regression model. According to this model, gross domestic product, mean age of mother by province, number of women with higher education, and unemployment rate had a significant effect on total fertility rates. Testing the goodness of fit of the GWR model is important for empirical analysis. This test is called the global test of nonstationarity.
According to the global nonstationarity test result in Table 4, the GWR model is appropriate for the total fertility rate data. Therefore, we can apply the GWR method to these data. The optimal bandwidths selected for the GWR model of the total fertility rate data are given in Table 5.
The ICOMP_PEULN criterion, which has the lowest information criterion value among the ICOMP-type criteria, chose the optimal bandwidth. GWR models were fitted with the variables found significant in the multiple regression model, using the selected fixed bandwidth values.
These models are the GWR-CV model with a bandwidth value of 1.6477 selected by the CV score, the GWR-GCV model with a bandwidth value of 1.6526 selected by the GCV score, the GWR-AIC model with a bandwidth value of 16.0042, the GWR-AICc model with a bandwidth value of 14.9964, and the GWR-ICOMP-type model with a bandwidth value of 0.4111. The model performances are given in Table 6.
Table 6 shows that the GWR-ICOMP-type model achieved the best result, with the highest Adj. R2 and the lowest model information criterion values. The spatial distributions of the total fertility rate estimates are shown in Figure 2.
Figure 2 shows the spatial distribution of the total fertility rate estimates from the GWR model after the bandwidth is selected with the ICOMP-type criteria. This distribution is quite compatible with the distribution of the actual total fertility rate shown in Figure 1.
5. Conclusions
The selection of bandwidth in the GWR model is very important for the efficiency and accuracy of the model, and it can be treated as a model selection problem. When the bandwidth is large, more data points are included in each local regression. Thus, the variance will be small, while the bias will be large. If a small bandwidth is used, the regression is confined to a local area and the parameter estimates depend on observations close to the regression point. Therefore, the variance of the parameter estimates will increase, but the bias will be small and more local anomalies can be discovered. In this study, we applied ICOMP-type criteria to the selection of bandwidth for the first time in the literature. In the simulation design, we found that the bandwidths selected with the ICOMP-type criteria increase the model performance and prediction accuracy of the GWR model. When GWR models with different bandwidths were fitted to the actual total fertility rate data, the GWR-ICOMP-type model gave the best performance, with the highest R2 and the lowest information criterion values. In addition, among the ICOMP-type criteria, the ICOMP_PEULN criterion chose the most suitable model, with the smallest information criterion value at fixed bandwidth. The spatial distribution of the total fertility rate estimates in the GWR-ICOMP-type model is quite compatible with the actual total fertility rate distribution. As a result, the use of ICOMP-type criteria in the selection of bandwidth can improve the GWR model in terms of prediction accuracy and model performance.
The data used to support this study are available from https://data.tuik.gov.tr/.
Conflicts of Interest
The author declares that there are no conflicts of interest.
References
A. S. Fotheringham, C. Brunsdon, and M. Charlton, Geographically Weighted Regression: The Analysis of Spatially Varying Relationships, John Wiley & Sons, 2003.
P. E. Bidanset and J. R. Lombard, “The effect of kernel and bandwidth specification in geographically weighted regression models on the accuracy and uniformity of mass real estate appraisal,” Journal of Property Tax Assessment and Administration, vol. 10, pp. 5–14, 2014.
P. E. Bidanset, J. R. Lombard, P. Davis, M. McCord, and W. J. McCluskey, “Further evaluating the impact of kernel and bandwidth specifications of geographically weighted regression on the equity and uniformity of mass appraisal models,” Advances in Automated Valuation Modeling, vol. 865, pp. 191–199, 2017.
Y. Yuan, M. Cave, H. Xu, and C. Zhang, “Exploration of spatially varying relationships between Pb and Al in urban soils of London at the regional scale using geographically weighted regression (GWR),” Journal of Hazardous Materials, vol. 393, 2020.
H. Bozdogan, Intelligent Statistical Data Mining with Information Complexity and Genetic Algorithms, Chapman and Hall/CRC, Boca Raton, FL, USA, 2003.
M. H. van Emden, “An analysis of complexity,” Mathematical Centre Tracts, vol. 35, 1971.
S. Kullback, The Kullback-Leibler Distance, Springer, Berlin/Heidelberg, Germany, 1987.
I. Gollini, B. Lu, M. Charlton, C. Brunsdon, and P. Harris, “GWmodel: an R package for exploring spatial heterogeneity using geographically weighted models,” Journal of Statistical Software, vol. 63, no. 17, pp. 1–50, 2015.
H. Bozdogan, “ICOMP: a new model selection criterion,” Classification and Related Methods of Data Analysis, Elsevier Science Publishers, Amsterdam, Netherlands, pp. 599–608, 1988.
E. Pamukçu, H. Bozdogan, and S. Çalık, “A novel hybrid dimension reduction technique for undersized high dimensional gene expression data sets using information complexity criterion for cancer classification,” Computational and Mathematical Methods in Medicine, vol. 2015, pp. 1–14, 2015.
H. Koç, E. Dünder, S. Gümüştekin, T. Koç, and M. A. Cengiz, “Particle swarm optimization-based variable selection in Poisson regression analysis via information complexity-type criteria,” Communications in Statistics - Theory and Methods, vol. 47, pp. 5298–5306, 2018.