Abstract

We present a method that uses hierarchical cluster-based multispecies particle swarm optimization to generate a fuzzy system of Takagi-Sugeno-Kang type, encapsulated in a geographical information system that serves as environmental decision support for spatial analysis. We consider a spatial area partitioned into subzones: the data measured in each subzone are used to extract a fuzzy rule set of the above-mentioned type. We adopt a similarity index to compare the fuzzy systems generated for adjacent subzones; adjacent subzones whose index exceeds a specified threshold are merged.

1. Introduction

A geographical information system (shortly, GIS) is often used by the analyst as a decision support system for problem solving in spatial analysis (e.g., cf. [19]). In some works, fuzzy inference systems are encapsulated in GIS tools, where the knowledge is represented by if-then rules. In [7], the authors present an inference model, integrated in a GIS, based on a Takagi-Sugeno-Kang (shortly, TSK) fuzzy system [10, 11]. In [4], a first-order Takagi-Sugeno fuzzy inference system is generated for estimating and simulating the discharge and sediment concentration in river basins. Here, too, we integrate into a GIS tool an inference model that generates a TSK-fuzzy system, based on the hierarchical cluster-based multispecies particle swarm optimization (shortly, HCMSPSO) algorithm [12]. The area of study is divided into subzones: for each subzone, the corresponding TSK-fuzzy system is extracted from the subset of patterns georeferenced in that subzone. In our method, we calculate a similarity index between the TSK-fuzzy systems generated for adjacent subzones: if this index is greater than a specified threshold, the corresponding subzones and data subsets are merged, and the TSK-fuzzy system is generated again for the new subzone. The area of study is partitioned into subzones because geographical characteristics vary over space, and with them the parameters forming the rule set of the fuzzy system.

A subzone is a spatial subarea of the area of study with homogeneous characteristics with respect to the phenomenon investigated; examples are areas sharing a specific index (pollution index, risk index, vulnerability index, traffic index, etc.). The division of the area of study into homogeneous subzones was introduced in [13–15], in which the area is partitioned into iso-reliable zones, that is, subzones with homogeneous reliability of the spatial characteristics. More precisely, the expert provides a set of $M$ patterns $(\mathbf{x}_k, y_k)$, $k = 1, \ldots, M$, where $\mathbf{x}_k = (x_{k1}, \ldots, x_{kn})$ is the $n$-dimensional input vector and $y_k$ is the output, measured at georeferenced locations in the area of study. The expert does not know a priori the optimal partition of the area of study into subzones that are homogeneous with respect to the fuzzy system to be generated; he already uses partitions defined by sociological, morphological, or climatic characteristics, but does not know whether the same fuzzy system can be applied to two adjacent subzones. In this work, we propose an algorithm based on particle swarm optimization (PSO) for optimizing the partition of the area of study into homogeneous zones: we generate a TSK-fuzzy system for each subzone by using the HCMSPSO algorithm [12], which is a variation of PSO, and we compare the TSK-fuzzy systems of adjacent subzones. A TSK-fuzzy system [16] is composed of a set of $r$ fuzzy rules of the following form:

$$R_i:\ \text{if } x_1 \text{ is } A_{i1} \text{ and } \ldots \text{ and } x_n \text{ is } A_{in}, \text{ then } y \text{ is } B_i, \qquad i = 1, \ldots, r, \tag{1}$$

where the fuzzy set $A_{ij}$ has the following Gaussian membership function, with center $m_{ij}$ and width $b_{ij}$:

$$A_{ij}(x_j) = \exp\left(-\left(\frac{x_j - m_{ij}}{b_{ij}}\right)^{2}\right). \tag{2}$$

The firing strength of the $i$th rule on $\mathbf{x}$ is given as

$$\phi_i(\mathbf{x}) = \prod_{j=1}^{n} A_{ij}(x_j) = \exp\left(-\sum_{j=1}^{n}\left(\frac{x_j - m_{ij}}{b_{ij}}\right)^{2}\right). \tag{3}$$

In the TSK-fuzzy system of zero order, the consequent $B_i$ is generally represented as a constant $a_{i0}$; in the TSK-fuzzy system of first order, it is given by a linear combination of the input variables:

$$B_i = a_{i0} + \sum_{j=1}^{n} a_{ij} x_j, \tag{4}$$

where the coefficients $a_{ij}$ are real numbers. The output is calculated by the weighted average defuzzification method:

$$y = \frac{\sum_{i=1}^{r} \phi_i(\mathbf{x})\, B_i}{\sum_{i=1}^{r} \phi_i(\mathbf{x})}. \tag{5}$$
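To make the inference chain concrete (Gaussian membership, firing strength, weighted-average defuzzification), the following is a minimal sketch rather than the implementation used in the GIS tool. It assumes a Gaussian of the form exp(-((x - center)/width)^2), without a factor 1/2 in the exponent; the function names and the dictionary layout of a rule are hypothetical.

```python
import math


def gaussian_membership(x, center, width):
    """Membership degree of x in a Gaussian fuzzy set (assumed form, no 1/2 factor)."""
    return math.exp(-((x - center) / width) ** 2)


def firing_strength(x, centers, widths):
    """Product of the per-input membership degrees of one rule."""
    strength = 1.0
    for xj, m, b in zip(x, centers, widths):
        strength *= gaussian_membership(xj, m, b)
    return strength


def tsk_output(x, rules):
    """Weighted-average defuzzification over zero- or first-order rules.

    Each rule is a dict with 'centers', 'widths' and 'coeffs' (hypothetical layout);
    coeffs = [a0] for a zero-order rule, [a0, a1, ..., an] for a first-order rule.
    """
    numerator = 0.0
    denominator = 0.0
    for rule in rules:
        phi = firing_strength(x, rule["centers"], rule["widths"])
        a = rule["coeffs"]
        consequent = a[0] + sum(aj * xj for aj, xj in zip(a[1:], x))
        numerator += phi * consequent
        denominator += phi
    # If no rule fires at all (numerical underflow), fall back to 0.0 by convention.
    return numerator / denominator if denominator > 0.0 else 0.0
```

With two zero-order rules centered at 0 and 2, both of width 1 and with constants 1 and 3, the input 1 lies midway between the centers, so both rules fire equally and the output is the plain average of the two constants.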

The HCMSPSO algorithm generates an optimal TSK-fuzzy system by determining the number of rules and by optimizing the parameter values (the centers and widths of the membership functions and the consequent coefficients) in each rule. In our approach, the geographical area of study is initially partitioned into subzones; for each subzone, we apply the HCMSPSO algorithm to generate the optimal TSK-fuzzy system. We then define an evolution process in which, at each iteration, we merge adjacent subzones whose fuzzy rule-set similarity index is greater than or equal to a predefined threshold. The process stops when no pair of adjacent subzones has a fuzzy rule-set similarity index greater than or equal to the threshold. We propose a similarity index based on the fuzzy inclusion concept given as where and are the lower and upper bounds of the domain of j. If , the two fuzzy sets overlap completely.

In other words, given the th and th TSK-fuzzy systems corresponding to adjacent subzones and with the same number of rules, we define the following similarity index for each pair of rules and , as which is the mean of (6) over the variables ( inputs and the output variable). Then, we order the rule sets of the two fuzzy systems according to the following criteria:

(i) we choose the two rules and with the greatest value . These two rules become the first rules of the respective fuzzy systems, and the index (7) can be written as ;

(ii) among the remaining rules, we choose the two rules with the greatest value of the index (7); these rules become the second rules of the respective fuzzy systems, and the index (7) can be written as ; the process is repeated until all the rules of the two fuzzy systems are ordered.

We calculate the similarity index of the two TSK-fuzzy systems as

The similarity value is the mean of the values obtained for all the rules. We use this index to decide whether the th and th adjacent subzones can be merged. Our evolution process is a hierarchical iterative approach. Initially, the area of study is partitioned into a fine-grained set of subzones: the analyst divides the area of study based on geospatial characteristics that are significant for the problem studied (e.g., subzones corresponding to municipalities in demographic problems, or subzones with different climatic characteristics in meteorological problems). Since each subzone must contain a significant number of data points (otherwise the partition would be too fine with respect to the distribution of the patterns), we impose the constraint that each subzone must contain at least patterns, where stands for a threshold number. At each iteration, the HCMSPSO algorithm is applied to generate an optimal fuzzy system for each subzone. In Section 2, we introduce the HCMSPSO method; in Section 3, we present our method for finding the optimal partition of the area of study; in Section 4, we present some experimental results; Section 5 concludes the paper.
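The rule ordering and the system-level similarity introduced in this section can be sketched as follows. The sketch takes as input a precomputed square matrix `sim`, where `sim[h][k]` is the rule-pair similarity between rule `h` of one system and rule `k` of the other; how each entry is computed from the fuzzy-inclusion index is left outside the sketch. The function names are hypothetical, and resolving ties by taking the first maximum encountered is our assumption.

```python
def pair_rules_greedily(sim):
    """Greedily pair the rules of two fuzzy systems with the same number of rules.

    At each step the still-unpaired couple (h, k) with the greatest similarity
    is selected; the result lists (h, k, similarity) in order of selection.
    """
    n = len(sim)
    free_a = list(range(n))
    free_b = list(range(n))
    pairs = []
    while free_a:
        h, k = max(
            ((h, k) for h in free_a for k in free_b),
            key=lambda hk: sim[hk[0]][hk[1]],
        )
        pairs.append((h, k, sim[h][k]))
        free_a.remove(h)
        free_b.remove(k)
    return pairs


def system_similarity(sim):
    """Mean of the similarities of the greedily paired rules."""
    pairs = pair_rules_greedily(sim)
    return sum(value for _, _, value in pairs) / len(pairs)
```

Note that greedy pairing does not necessarily maximize the mean over all possible pairings; it simply mirrors the ordering criteria (i) and (ii).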

2. The HCMSPSO Algorithm: An Overview

The HCMSPSO algorithm [12] is a PSO-based method for determining the optimal TSK-fuzzy system from a set of patterns. It determines the number of rules and the optimal values of the parameters , , and of the membership functions in each rule. Variations of the PSO algorithm are proposed in [12, 17–27].

The HCMSPSO method originates from the cluster-based particle swarm optimization method (shortly, CPSO) [28], in which each swarm is used independently for optimizing a set of parameters. In HCMSPSO, each swarm forms a species, and the number of species is set to the number of fuzzy rules . Each species is formed from particles, and each particle in a species represents a single fuzzy rule. The th species is used for optimizing the parameters of the th fuzzy rule. The position of the th particle in the th species is given by the ()-dimensional vector: and by the ()-dimensional vector: for a TSK-fuzzy system of first order. A TSK-fuzzy system of zero order is built by choosing one for each species of the particles. To determine the final number of rules and generate the particles in each species, the following iterative process is used:

(1) initially we set ; we consider the first input pattern , which forms the first particle of the first species by setting and , where is a predefined value that determines the initial width of each fuzzy set;

(2) we generate all the particles in the first species, associated with the fuzzy rule ; the th particle is given by the formula (for a TSK-fuzzy system of first order): where , and represent small variations to and , respectively, generated from the interval []; the values are obtained randomly in the output range;

(3) for each successive pattern , we consider the rule with maximum firing strength given by If , where is a predefined threshold, then a new rule is generated by setting where determines the degree of overlapping between two clusters;

(4) we generate all the particles in the ()-species, associated with the fuzzy rule ; the th particle is given by the formula (for a TSK-fuzzy system of first order): where , , and represent small variations to and , respectively, from the interval []; the related values are obtained randomly from an interval identical to the output range;

(5) we set and iterate steps (3) and (4) for all the patterns. At the end of the process, we have generated rules and species with particles.
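The rule-generation part of steps (1), (3), and (5) can be sketched as below; the generation of the perturbed particles in steps (2) and (4) is omitted. Several details are assumptions on our part and are not confirmed by the text: the Gaussian form of the firing strength, the use of "less than or equal to" in the threshold test, and the width of a new rule, taken here as the overlap factor times the distance from the center of the best-matching rule (with a small floor to avoid zero widths). All function and parameter names are hypothetical.

```python
import math


def firing_strength(x, centers, widths):
    """Gaussian firing strength of one rule (assumed form, no 1/2 factor)."""
    return math.exp(-sum(((xj - m) / b) ** 2 for xj, m, b in zip(x, centers, widths)))


def generate_rules(patterns, initial_width, strength_threshold, overlap):
    """Create one rule per cluster of input patterns.

    The first pattern seeds the first rule. A later pattern seeds a new rule
    only when its maximum firing strength over the existing rules does not
    exceed strength_threshold (assumption: the comparison is '<=').
    """
    first = patterns[0]
    rules = [{"centers": list(first), "widths": [initial_width] * len(first)}]
    for x in patterns[1:]:
        strengths = [firing_strength(x, r["centers"], r["widths"]) for r in rules]
        best = max(range(len(rules)), key=lambda i: strengths[i])
        if strengths[best] <= strength_threshold:
            nearest = rules[best]["centers"]
            # Assumption: new widths scale with the distance to the nearest rule center.
            widths = [max(overlap * abs(xj - m), 1e-6) for xj, m in zip(x, nearest)]
            rules.append({"centers": list(x), "widths": widths})
    return rules
```

A pattern close to an existing center fires that rule strongly and creates nothing new, while a distant pattern opens a new rule, so the number of rules follows the spread of the data.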

For each species, a partition of its particles into subspecies is created. A subspecies is a cluster of particles of a species. To create the subspecies partition of a species, we first need to sort its particles. The index used for sorting the particles of a species is the root mean square error (RMSE), defined as where is the output calculated using the defuzzification formula (5), and is the output value of the th pattern. The following steps sort the particles of each species by increasing RMSE, prior to partitioning each species into subspecies:

(1) for each combination of particles of each species, we determine the set of particles with minimum RMSE. For each species, we set ;

(2) to sort the particles in the th species, we calculate the RMSE produced by the combination with ; we then sort the particles by increasing value of the corresponding RMSE. This step is repeated for all species.
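The sorting steps above can be sketched as follows. The sketch assumes that the RMSE is the square root of the mean squared difference between computed and measured outputs, and that a particle is scored by combining it with the current best particle of every other species. The `evaluate` callable, which maps a list of rules to an RMSE, is a hypothetical placeholder for the fuzzy-system evaluation.

```python
import math


def rmse(predicted, observed):
    """Root mean square error between computed and measured outputs."""
    n = len(observed)
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(predicted, observed)) / n)


def sort_species(species, best_of_other_species, evaluate):
    """Sort the particles of one species by increasing RMSE.

    Each particle is scored through the fuzzy system obtained by combining it
    with the best particle of every other species.
    """
    return sorted(
        species,
        key=lambda particle: evaluate([particle] + list(best_of_other_species)),
    )
```

After sorting, the first particle of the list is the one that yields the lowest RMSE in combination with the best particles of the other species.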

After sorting each species, we can partition it into subspecies; the first particle of a subspecies, that is, the particle of the subspecies with the lowest RMSE, is called the leader of the subspecies. The next steps are used for partitioning each ordered species into subspecies:

(1) for the th species, we set the number of subspecies and create the first subspecies by setting the leader of this subspecies as the first particle of the species, ;

(2) then we consider the successive particles of the species with and calculate the distance index between the particle and the leader of the first subspecies: where (resp., ) for a TSK-fuzzy system of zero (resp., first) order. If we have , where is a threshold value, then we assign the particle to this subspecies; otherwise we create a new subspecies by setting ;

(3) we iterate steps (1) and (2) for all the species.
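The partition of a sorted species into subspecies can be sketched with a simple leader-clustering loop. Two points are assumptions on our part: the distance is taken as the plain Euclidean distance between parameter vectors, and a particle is compared with the leaders of all existing subspecies in order of creation, joining the first one that is close enough. The function name is hypothetical.

```python
import math


def partition_into_subspecies(sorted_particles, distance_threshold):
    """Group particles (already sorted by increasing RMSE) into subspecies.

    The first particle of each group is its leader. A particle joins the first
    subspecies whose leader lies within distance_threshold; otherwise it
    becomes the leader of a new subspecies.
    """
    subspecies = []
    for particle in sorted_particles:
        for group in subspecies:
            leader = group[0]
            if math.dist(particle, leader) < distance_threshold:
                group.append(particle)
                break
        else:
            # No existing leader is close enough: open a new subspecies.
            subspecies.append([particle])
    return subspecies
```

Because the particles arrive sorted by RMSE, every leader is automatically the best particle of its subspecies.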

In the last steps, the PSO algorithm is applied; we define the particles in the same subspecies as neighbours of a particle. The best global position of the neighbours of a particle , enclosed in the th subspecies of the th species at the iteration time (), is given by the best position of the leader of the subspecies until the iteration time (). The best local position of the particle enclosed in the th subspecies of the th species at the iteration time () is given by the best position of the particle until the iteration time ().

3. Subzones in the Generation Process

Our method is an iterative process that determines the optimal partition of the area of study into subzones; each subzone represents a zone of the area of study that is homogeneous with respect to a specific TSK-fuzzy system composed of fuzzy rules of the form (1). The expert creates an initial fine partition of the area of study according to specific local features (sociological, climatic, orographic, hydrological, etc.); the pattern dataset is divided into subsets such that each subset is spatially included in the corresponding subzone. The next step verifies that the data distribution is consistent with the partition of the area of study into subzones: we verify that the number of patterns inside each subzone is greater than or equal to a threshold value, which can be set by the user. Clearly, the higher this value, the lower the expected RMSE and therefore the greater the accuracy of the resulting TSK-fuzzy system. If is the size of the subset of patterns inside the th subzone and is the threshold value, we impose the following constraint for each subzone: where is the cardinality of the partition such that . We consider the pattern dataset consistent with the partition of the area of study into subzones if (17) holds; otherwise, the expert creates a more coarse-grained partition, and this check is iterated until each subset of patterns is consistent with the corresponding subzone. For each subzone, the HCMSPSO method is applied to generate an optimal TSK-fuzzy system; we associate with each subzone the TSK-fuzzy system so determined and its RMSE.

We compare the TSK-fuzzy systems of adjacent subzones by calculating the similarity index as in (8). If the index for two adjacent subzones is greater than or equal to a threshold value, the two subzones are merged. When two or more subzones are merged into a new subzone, we group the corresponding pattern subsets into a single subset and restart the HCMSPSO algorithm for the new subzone. This process is iterated until the similarity index is below the threshold for every pair of adjacent subzones. As the final result, a thematic map is produced in which the area of study is divided into the final subzones, classified according to the RMSE of the TSK-fuzzy system generated. To make the errors comparable, we calculate two normalized errors used in the literature:

(i) the normalized root mean square error index (shortly, NRMSE), which is the ratio between the RMSE and the range of the output variable $y$ (the difference, in absolute value, between its maximal and minimal values), expressed as a percentage:

$$\mathrm{NRMSE} = \frac{\mathrm{RMSE}}{\left| y_{\max} - y_{\min} \right|} \cdot 100;$$

(ii) the coefficient of variation of the RMSE (shortly, CVRMSE), which is the ratio between the RMSE and the mean value $\bar{y}$ of the output variable, expressed as a percentage:

$$\mathrm{CVRMSE} = \frac{\mathrm{RMSE}}{\bar{y}} \cdot 100.$$
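The two normalized errors can be computed directly from the RMSE and the measured output values, as in the following minimal sketch. The function names are hypothetical; the sketch does not guard against a zero output range or a zero mean.

```python
def nrmse_percent(rmse_value, outputs):
    """RMSE divided by the range of the output variable, as a percentage."""
    output_range = abs(max(outputs) - min(outputs))
    return 100.0 * rmse_value / output_range


def cvrmse_percent(rmse_value, outputs):
    """RMSE divided by the mean of the output variable, as a percentage."""
    mean_output = sum(outputs) / len(outputs)
    return 100.0 * rmse_value / mean_output
```

For example, with outputs 10, 20, and 30 (range 20, mean 20) and an RMSE of 2, both indices equal 10%.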

The NRMSE and CVRMSE are used in the creation of thematic maps of the TSK-fuzzy systems. The expert can fix a reliability threshold for the TSK-fuzzy system; in subzones with index greater than this threshold, it is necessary to use additional data and/or eliminate data with noise and outliers. The process described above can be schematized in the following steps:

(1) the expert initially divides the area of study into Z subzones, partitioning the area into homogeneous zones; this partition represents the finest partition desired by the expert;

(2) the pattern dataset is partitioned into subsets of data; each subset contains the measured data georeferenced in a specific subzone; if the size of any pattern subset is less than a prefixed threshold , the partition is too fine with respect to the pattern dataset and the process returns to step (1); in this case, the expert must create a more coarse-grained subzone partition of the area of study;

(3) for each subzone, we use the HCMSPSO method to determine the number of rules and generate the corresponding species;

(4) we use the HCMSPSO method to generate the subspecies and optimize the parameters in each rule;

(5) we compare the TSK-fuzzy systems of each pair of adjacent subzones by calculating their similarity index; if the index is greater than or equal to a predefined threshold, the two subzones are merged into one subzone and the number of subzones decreases by one;

(6) if any subzones have been merged, we iterate steps (3), (4), and (5) for each new subzone;

(7) two thematic maps are created, showing for each final subzone the reliability class corresponding to the final NRMSE and CVRMSE.
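The merging loop of steps (3) to (6) can be sketched as follows, as an illustration under stated assumptions rather than the procedure implemented in the GIS tool. The callables `fit` (patterns to fuzzy system, standing in for HCMSPSO) and `similarity` (two fuzzy systems to an index) are hypothetical placeholders, as are all other names. For simplicity, the sketch merges one pair of adjacent subzones at a time and refits the merged subzone before looking for the next pair.

```python
def merge_similar_subzones(patterns_by_zone, adjacency, fit, similarity, threshold):
    """Iteratively merge adjacent subzones whose fuzzy systems are similar.

    patterns_by_zone: dict mapping zone id -> list of patterns
    adjacency: iterable of pairs (a, b) of adjacent zone ids
    Returns the final pattern subsets and the fitted model of each final subzone.
    """
    zones = {z: list(p) for z, p in patterns_by_zone.items()}
    adjacency = {frozenset(pair) for pair in adjacency}
    models = {z: fit(p) for z, p in zones.items()}
    while True:
        candidate = None
        for pair in sorted(adjacency, key=sorted):
            a, b = sorted(pair)
            if similarity(models[a], models[b]) >= threshold:
                candidate = (a, b)
                break
        if candidate is None:
            return zones, models
        a, b = candidate
        # Merge b into a, pool the patterns, and regenerate the fuzzy system.
        zones[a] = zones[a] + zones.pop(b)
        models.pop(b)
        models[a] = fit(zones[a])
        # Neighbours of b become neighbours of a; the pair (a, b) disappears.
        renamed = {frozenset(a if z == b else z for z in pair) for pair in adjacency}
        adjacency = {pair for pair in renamed if len(pair) == 2}
```

The loop terminates because every merge reduces the number of subzones by one and the scan stops as soon as no adjacent pair reaches the threshold.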

Figure 1 schematizes the above process. To analyze whether the distribution of the patterns over the final subzones is approximately uniform, we can calculate the coefficient of variation of this distribution: this index, expressed as a percentage, is the ratio between the standard deviation and the mean of the numbers of patterns contained in the final subzones.

We can consider the pattern data distribution approximately uniform if ; the expert can use this check at the end of the process to verify whether the final pattern dataset is approximately uniform with respect to the final partition of the area of study. This analysis is useful because significant differences in the RMSE over the area of study can be due to substantial differences in the cardinality of the final subsets of patterns inside each final subzone. In Section 4, we present some results of our tests on spatial datasets. The HCMSPSO algorithm has been implemented and encapsulated in the GIS tool ESRI/ArcGIS, release 10. A first experiment is made on test spatial data, first to compare the results obtained with the HCMSPSO method with those obtained using other PSO-based methods, and second to verify the accuracy of our method in the subzone-merging process. Then, we present the results obtained by applying our method to a problem related to the valuation of the costs of building maintenance in the city of Pompeii (Italy).
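The uniformity check on the sizes of the final pattern subsets, described earlier in this section, reduces to a coefficient of variation. The following minimal sketch assumes the population standard deviation (division by the number of subzones); the text does not state whether the population or the sample form is intended, and the function name is hypothetical.

```python
import math


def coefficient_of_variation_percent(subset_sizes):
    """Standard deviation over mean of the subset sizes, as a percentage."""
    count = len(subset_sizes)
    mean = sum(subset_sizes) / count
    # Assumption: population variance (divide by count, not count - 1).
    variance = sum((size - mean) ** 2 for size in subset_sizes) / count
    return 100.0 * math.sqrt(variance) / mean
```

Subzones of identical size give 0%, while the sizes 2, 4, 4, 4, 5, 5, 7, 9 (mean 5, standard deviation 2) give 40%.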

4. Results of Tests

Now, we apply our method to geospatial data. The test concerns the buildings of the municipality of Pompeii, a well-known tourist city owing to the presence of an important Sanctuary and of a large, famous archeological heritage. The municipal area is partitioned into four classes: rural area, urban center, inhabited nucleus, and industrial area. The expert's goal is to plan the cost of maintenance of buildings based on the building maintenance data. The data are extracted from the related dataset, and they are the following: of construction, of last maintenance, of the damage with respect to the volume of the building ( extension and extension), of the damage ( gravity and gravity), of the maintenance (calculated in thousands of Euro). Each pattern corresponds to a building georeferenced to a specific census microzone. The initial subzones are formed by the union of adjacent microzones with the same city planning class, and we obtain seven subzones. The thematic map in Figure 2 shows the seven subzones obtained by the subdivision of the area of Pompeii based on the four classes.

The dataset is formed by buildings (cf. Table 1); we set . Since is greater than for every , we can assume that the set of data is suitable for the partition. We then use this subdivision of Pompeii into 7 subzones for generating the corresponding TSK-fuzzy systems of zero order. In the HCMSPSO method, we set the threshold values and to 0.03 and 2, respectively; the number of iterations is 200,000. The graph in Figure 3 shows the RMSE trend with respect to the iteration number for the seven subzones. For all seven TSK-fuzzy systems generated, the RMSE reaches a plateau after about iterations. Table 2 shows the results: we report the final number of rules, the RMSE, NRMSE, CVRMSE, and the index . The results show that the TSK-fuzzy systems related to subzones 3 and 7 have RMSE and greater values with respect to the ones related to the other subzones. Then, we calculated the similarity indexes between adjacent subzones; we set .

The similarity values between TSK-fuzzy systems with the same number of rules, associated with adjacent subzones, are given in Table 3. Table 4 contains the class breaks of the three thematic classes of NRMSE and CVRMSE: we consider not sufficiently reliable the results obtained in subzones belonging to the class “high,” that is, subzones with NRMSE greater than 30% or CVRMSE greater than 50%. The results in Figures 4 and 5 show that the TSK-fuzzy systems obtained for the rural areas 1 and 7 are not sufficiently reliable. These subzones contain the greatest number of patterns; the results suggest that the costs of maintenance of a building in rural areas are probably not significantly related to parameters such as the year of construction and the year of last maintenance.

5. Conclusions

In this paper, we present a method based on the HCMSPSO algorithm for extracting a TSK-fuzzy system from a dataset of georeferenced measured data for spatial analysis. The area of study is initially partitioned into subzones by the expert; for each subzone, the TSK-fuzzy system is extracted by using the subset of patterns georeferenced in the subzone. We merge subzones with similar TSK-fuzzy systems and regenerate the TSK-fuzzy system for the new subzone. For each TSK-fuzzy system, the reliability is evaluated using indexes based on the final RMSE. The algorithm has been implemented in the tool ESRI/ArcGIS, release 10; the results of our tests show that this method can be used effectively in a GIS platform and encapsulated as a decision support system for optimizing fuzzy systems related to subzones of the area of study.