Abstract

This paper presents a spatiotemporal analysis of hotspot areas based on the Extended Fuzzy C-Means (EFCM) method implemented in a geographic information system. The method has been adapted to detect spatial areas with high concentrations of events and tested to study their temporal evolution. The data consist of georeferenced patterns corresponding to the residences of patients in the district of Naples (Italy) who underwent surgery on the oto-laryngo-pharyngeal apparatus between the years 2008 and 2012.

1. Introduction

In a GIS, the impact of a phenomenon on the area surrounding an event (e.g., the impact area of an earthquake, or the constraint area around a river basin) is studied using buffer area geoprocessing functions. Given a geospatial event topologically represented as a georeferenced point, line, or areal element, an atomic buffer area is constituted by circular areas centered on the element. For example, if the event is the epicenter of an earthquake, georeferenced by a point, a set of buffer areas is formed by concentric circular areas around that point; the radius of each circular buffer area is defined a priori.
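As a rough illustration of concentric buffers with radii fixed a priori, the following sketch assigns a point to the innermost circular buffer that contains it. The function name `buffer_ring`, the planar coordinates, and the radii are hypothetical and chosen only for this example; a GIS would normally do this with its own geoprocessing tools.

```python
import math

def buffer_ring(center, radii, point):
    """Return the index of the innermost buffer containing `point`, or None.

    `radii` are the a priori buffer radii; they are sorted here so that
    index 0 is the smallest circle. Coordinates are assumed to be planar.
    """
    dist = math.dist(center, point)
    for idx, radius in enumerate(sorted(radii)):
        if dist <= radius:
            return idx
    return None

epicenter = (0.0, 0.0)         # hypothetical event location (km)
radii_km = [5.0, 10.0, 20.0]   # hypothetical buffer radii (km)
print(buffer_ring(epicenter, radii_km, (3.0, 3.0)))
print(buffer_ring(epicenter, radii_km, (30.0, 0.0)))
```

The limitation is visible in the signature: the radii must be supplied in advance, which is exactly what is not possible when the impact area is unknown.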

When it is not possible to define an area of impact statically and we need to determine the area affected by a consistent set of events, we are faced with the problem of detecting this area as a cluster in which the georeferenced events are concentrated. These clusters are georeferenced, represented as polygons on the map, and called hotspot areas.

The study of hotspot areas is vital in many disciplines such as crime analysis [1–3], which studies the spread of criminal events over a territory, fire analysis [4], which analyzes the spread of fires over forested areas, and disease analysis [5–7], which studies the localization of disease foci and their temporal evolution. The clustering methods mainly used for detecting hotspot areas are density-based algorithms (see [8, 9]); they can detect the exact geometry of the hotspots, but they are computationally expensive, and in the great majority of cases it is not necessary to determine the exact shape of the clusters. The most widely used clustering algorithm, owing to its linear computational complexity, is the Fuzzy C-Means algorithm (FCM) [10], a partitive fuzzy clustering method that uses the Euclidean distance to determine cluster prototypes as points.

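Let $X = \{x_1, \ldots, x_N\} \subset \mathbb{R}^n$ be a dataset composed of $N$ patterns $x_j = (x_{j1}, \ldots, x_{jn})$, where $x_{jk}$ is the $k$th component (feature) of the pattern $x_j$. The FCM algorithm minimizes the following objective function:

$$J(U, V) = \sum_{i=1}^{C} \sum_{j=1}^{N} u_{ij}^{m} \, d_{ij}^{2}, \tag{1}$$

where $C$ is the number of clusters, fixed a priori, $u_{ij}$ is the membership degree of the pattern $x_j$ to the $i$th cluster $V_i$, $V = \{v_1, \ldots, v_C\}$ is the set of points given by the centers of the clusters (prototypes), $m > 1$ is the fuzzifier parameter, and $d_{ij}$ is the distance between the center $v_i$ of the $i$th cluster and the $j$th vector $x_j$, calculated as the Euclidean norm:

$$d_{ij} = \| x_j - v_i \| = \sqrt{\sum_{k=1}^{n} (x_{jk} - v_{ik})^{2}}. \tag{2}$$

Using the Lagrange multipliers method for minimizing the objective function (1), we obtain the following solution for the center of each cluster prototype:

$$v_i = \frac{\sum_{j=1}^{N} u_{ij}^{m} \, x_j}{\sum_{j=1}^{N} u_{ij}^{m}}, \tag{3}$$

where $i = 1, \ldots, C$, and for membership degrees

$$u_{ij} = \frac{1}{\sum_{h=1}^{C} \left( d_{ij} / d_{hj} \right)^{2/(m-1)}}, \tag{4}$$

subjected to the constraints:

$$\sum_{i=1}^{C} u_{ij} = 1 \quad \forall j \in \{1, \ldots, N\}, \qquad 0 < \sum_{j=1}^{N} u_{ij} < N \quad \forall i \in \{1, \ldots, C\}. \tag{5}$$

Initially, the $u_{ij}$'s and the $v_i$ are assigned randomly and updated in each iteration. If $U^{(s)}$ is the matrix $U = [u_{ij}]$ calculated at the $s$th step, the iterative process stops when

$$\left\| U^{(s)} - U^{(s-1)} \right\| < \varepsilon, \tag{6}$$

where $\varepsilon > 0$ is a prefixed parameter.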
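The iteration just described can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name `fcm`, the default values of `m`, `eps`, `max_iter`, and `seed`, and the use of the largest absolute membership change as the stopping norm are all choices made for this example.

```python
import math
import random

def fcm(points, n_clusters, m=2.0, eps=1e-5, max_iter=200, seed=0):
    """Plain Fuzzy C-Means on a list of equal-length tuples.

    Returns (centers, U), where U[i][j] is the membership of point j
    in cluster i.
    """
    rng = random.Random(seed)
    n_points, dim = len(points), len(points[0])
    # Random memberships, normalised so that each column sums to 1.
    U = [[rng.random() for _ in range(n_points)] for _ in range(n_clusters)]
    for j in range(n_points):
        col = sum(U[i][j] for i in range(n_clusters))
        for i in range(n_clusters):
            U[i][j] /= col
    centers = []
    for _ in range(max_iter):
        # Centers are membership-weighted means of the patterns.
        centers = []
        for i in range(n_clusters):
            w = [U[i][j] ** m for j in range(n_points)]
            total = sum(w)
            centers.append(tuple(
                sum(w[j] * points[j][k] for j in range(n_points)) / total
                for k in range(dim)))
        # Memberships are recomputed from the relative distances.
        new_U = [[0.0] * n_points for _ in range(n_clusters)]
        for j in range(n_points):
            d = [math.dist(points[j], c) for c in centers]
            zeros = [i for i, dij in enumerate(d) if dij == 0.0]
            for i in range(n_clusters):
                if zeros:
                    # The point coincides with one or more centers.
                    new_U[i][j] = 1.0 / len(zeros) if i in zeros else 0.0
                else:
                    new_U[i][j] = 1.0 / sum(
                        (d[i] / dk) ** (2.0 / (m - 1.0)) for dk in d)
        # Stop when the largest membership change falls below eps.
        change = max(abs(new_U[i][j] - U[i][j])
                     for i in range(n_clusters) for j in range(n_points))
        U = new_U
        if change < eps:
            break
    return centers, U

pts = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
centers, U = fcm(pts, n_clusters=2)
print(sorted(centers))
```

Note that the number of clusters is an input of the function: this sketch has no way of discovering it from the data.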

This algorithm has a linear computational complexity; however, it is sensitive to the presence of noise and outliers. Furthermore, the number of clusters is fixed a priori, so a validity index is needed to determine an optimal value for the parameter $C$.

In order to overcome these shortcomings, in [11, 12], the EFCM algorithm is proposed, where the cluster prototypes are hyperspheres in the case of the Euclidean metric. Like FCM, the EFCM algorithm is characterized by a linear computational complexity; furthermore, it is robust with respect to the presence of noise and outliers, and the final number of clusters is determined during the iterative process.

In [13, 14], the authors propose the use of the EFCM algorithm for detecting hotspot areas. The final hotspots are identified as the detected cluster prototypes and shown on the map as circular areas. In [4], the authors analyze the spatio-temporal evolution of the hotspots in fire analysis. The pattern event dataset is partitioned according to the time of the event's detection, so that each subset corresponds to a specific time interval. The authors compare the hotspots obtained in two consecutive years by studying their intersections on the map. In this way, it is possible to follow the evolution of a particular phenomenon.

The cluster prototypes detected by the EFCM method are circular areas on the map that can approximate a hotspot area. Figure 1 shows an example of two circular hotspots obtained as clusters.

Figure 1 shows three different regions.
(i) An area in which the earlier hotspot is not intersected by the later hotspot (the part of the earlier hotspot outside the later one): this region can be considered as a geographical area in which a previously detected event subsequently disappears.
(ii) The region of intersection of the two hotspots: this region can be considered a geographical area in which the event continues to persist.
(iii) An area in which the later hotspot is not intersected by the earlier hotspot (the part of the later hotspot outside the earlier one): this region can be considered as a geographical area into which a previously undetected event subsequently propagates.

We can study the spatio-temporal evolution of the hotspots by analyzing the interactions between the corresponding circular cluster prototypes obtained for consecutive periods, detecting the presence of new hotspots in regions previously not covered by hotspots and the absence of hotspots in regions previously included in hotspot areas.
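The three regions can be made concrete with a small sketch that classifies a location against two circular hotspots from consecutive periods. The function name `classify_location`, the tuple layout `(cx, cy, radius)`, the region labels, and the sample circles are assumptions made for this illustration; planar coordinates are assumed.

```python
import math

def classify_location(point, earlier, later):
    """Classify `point` against two circular hotspots (cx, cy, radius)."""
    def inside(circle):
        cx, cy, radius = circle
        return math.dist(point, (cx, cy)) <= radius

    in_earlier, in_later = inside(earlier), inside(later)
    if in_earlier and in_later:
        return "persisting"    # intersection of the two hotspots
    if in_earlier:
        return "disappeared"   # covered only by the earlier hotspot
    if in_later:
        return "new"           # covered only by the later hotspot
    return "outside"

earlier = (0.0, 0.0, 2.0)   # hypothetical hotspot, first period
later = (3.0, 0.0, 2.0)     # hypothetical hotspot, following period
for p in [(-1.0, 0.0), (1.5, 0.0), (4.0, 0.0)]:
    print(p, classify_location(p, earlier, later))
```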

In this research, we present a method for studying the spatio-temporal evolution of hotspot areas in disease analysis; we apply the EFCM algorithm to compare, in consecutive years, event datasets corresponding to diagnoses of oto-laryngo-pharyngeal diseases detected in the district of Naples (Italy). Each event corresponds to the residence of the patient who contracted the disease.

We study the spatio-temporal evolution of the hotspots by analyzing the intersections of hotspots corresponding to two consecutive years, the displacement of the centroids, the increase or reduction of the hotspot areas, and the emergence of new hotspots.

In Section 2, we give an overview of the EFCM algorithm. In Section 3, we present our method for studying the spatio-temporal evolution of hotspots in disease analysis. In Section 4, we present the results of the spatio-temporal evolution of hotspots for the oto-laryngo-pharyngeal disease diagnosis events detected in the district of Naples (Italy). Our conclusions are in Section 5.

2. The EFCM Algorithm

In the EFCM algorithm, we consider clustering prototypes given by hyperspheres in the $n$-dimensional feature space. The $i$th hypersphere $V_i$ is characterized by a centroid $v_i = (v_{i1}, \ldots, v_{in})$ and a radius $r_i$.

Indeed, if $r_i$ is the radius of $V_i$, we say that $x_j$ belongs to $V_i$ if $d_{ij} \le r_i$.

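The radius $r_i$ is obtained considering the covariance matrix $P_i$ associated with the $i$th cluster $V_i$, defined as

$$P_i = \frac{\sum_{j=1}^{N} u_{ij}^{m} \, (x_j - v_i)(x_j - v_i)^{T}}{\sum_{j=1}^{N} u_{ij}^{m}}, \tag{7}$$

whose determinant gives the volume of the $i$th cluster. Since $P_i$ is symmetric and positive, it can be decomposed in the following form:

$$P_i = Q_i \Lambda_i Q_i^{T}, \tag{8}$$

where $Q_i$ is an orthonormal matrix and $\Lambda_i$ is a diagonal matrix with diagonal entries $\lambda_{i1}, \ldots, \lambda_{in}$. The radius $r_i$ is given by the following formula (see [12]):

$$r_i = \sqrt{|P_i|^{1/n}} = \sqrt{\prod_{k=1}^{n} \lambda_{ik}^{1/n}}. \tag{9}$$

The objective function to be minimized is the following:

$$J(U, V) = \sum_{i=1}^{C} \sum_{j=1}^{N} u_{ij}^{m} \left( d_{ij}^{2} - r_i^{2} \right), \tag{10}$$

where the membership degrees are updated as

$$u_{ij} = \frac{1}{\sum_{h=1}^{C} \left( \dfrac{d_{ij}^{2} - r_i^{2}}{d_{hj}^{2} - r_h^{2}} \right)^{1/(m-1)}}. \tag{11}$$

We replace $d_{ij}^{2} - r_i^{2}$ with $\max(0, d_{ij}^{2} - r_i^{2})$ and define the number $\varphi_j = \mathrm{card}\{ i \in \{1, \ldots, C\} : d_{ij}^{2} - r_i^{2} = 0 \}$ for any $j = 1, \ldots, N$; thus, we obtain

$$u_{ij} = \begin{cases} \dfrac{1}{\sum_{h=1}^{C} \left( \dfrac{d_{ij}^{2} - r_i^{2}}{d_{hj}^{2} - r_h^{2}} \right)^{1/(m-1)}} & \text{if } \varphi_j = 0, \\[2ex] 0 & \text{if } \varphi_j > 0 \text{ and } d_{ij}^{2} - r_i^{2} > 0, \\[1ex] \dfrac{1}{\varphi_j} & \text{if } \varphi_j > 0 \text{ and } d_{ij}^{2} - r_i^{2} = 0. \end{cases} \tag{12}$$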
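To make the radius computation concrete, the sketch below handles the two-feature case used for hotspots on a map. It assumes that the radius is taken as the square root of the $n$th root of the determinant of the fuzzy covariance matrix, which is one common choice for volume prototypes; the function name `cluster_radius_2d` and the sample data are hypothetical.

```python
import math

def cluster_radius_2d(points, memberships, center, m=2.0):
    """Radius of a 2-D cluster prototype from its fuzzy covariance matrix.

    Assumption: radius = sqrt(det(P) ** (1 / n)) with n = 2 features.
    """
    weights = [u ** m for u in memberships]
    total = sum(weights)
    # Entries of the symmetric 2x2 fuzzy covariance matrix P.
    pxx = pxy = pyy = 0.0
    for (x, y), w in zip(points, weights):
        dx, dy = x - center[0], y - center[1]
        pxx += w * dx * dx
        pxy += w * dx * dy
        pyy += w * dy * dy
    pxx, pxy, pyy = pxx / total, pxy / total, pyy / total
    det = pxx * pyy - pxy * pxy   # equals the product of the eigenvalues
    n = 2                         # number of features
    return math.sqrt(max(det, 0.0) ** (1.0 / n))

pts = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
print(cluster_radius_2d(pts, [1.0, 1.0, 1.0, 1.0], (0.0, 0.0)))
```

Because the determinant is the product of the eigenvalues, no explicit eigendecomposition is needed in this two-dimensional case.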

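However, the usage of (12) has the negative effect of diminishing the objective function (10) when a meaningful number of features are placed in a cluster, and this fact can prevent the separation of the clusters. A solution to this problem consists in assuming a small starting value of $r_i$ and then increasing it gradually with the factor $\beta^{(s)} / C^{(s)}$, where $C^{(s)}$ is the number of clusters at the $s$th iteration and $\beta^{(s)}$ is defined recursively as $\beta^{(s)} = \min\left( C^{(s-1)}, \beta^{(s-1)} + 1 \right)$, by setting $\beta^{(0)} = 1$. The symmetric matrix $S^{(s)} = [S^{(s)}_{ik}]$ is defined as well, where

$$S^{(s)}_{ik} = \max\left( \frac{\sum_{j=1}^{N} \min(u_{ij}, u_{kj})}{\sum_{j=1}^{N} u_{ij}}, \; \frac{\sum_{j=1}^{N} \min(u_{ij}, u_{kj})}{\sum_{j=1}^{N} u_{kj}} \right). \tag{13}$$

If $S^{(s)}$ is the matrix at the $s$th iteration and the threshold $\alpha^{(s)} = 1 / (C^{(s)} - 1)$ is introduced as limit, then two indexes $i^{*}$ and $k^{*}$ are determined such that $S^{(s)}_{i^{*}k^{*}} \ge \alpha^{(s)}$, and thus $V_{i^{*}}$ and $V_{k^{*}}$ are merged by setting

$$u_{i^{*}j} = u_{i^{*}j} + u_{k^{*}j}, \quad j = 1, \ldots, N. \tag{14}$$

The $k^{*}$th row can be removed from the matrix $U$.

In other words, the EFCM algorithm can be summarized in the following steps.
(1) The user assigns the initial number of clusters $C^{(0)}$, the fuzzifier $m > 1$, the initial value $\beta^{(0)} = 1$, and the threshold $\varepsilon > 0$.
(2) The membership degrees $u_{ij}$ are assigned randomly.
(3) The centers $v_i$ of the clusters are calculated by using (3).
(4) The radii $r_i$ of the clusters are calculated by using (9).
(5) The membership degrees $u_{ij}$ are calculated by using (12).
(6) The indexes $i^{*}$ and $k^{*}$ are determined in such a way that $S^{(s)}_{i^{*}k^{*}}$ assumes the greatest possible value at the $s$th iteration.
(7) If $S^{(s)}_{i^{*}k^{*}} \ge \alpha^{(s)}$ and $|S^{(s)}_{i^{*}k^{*}} - S^{(s-1)}_{i^{*}k^{*}}| < \varepsilon$, then the $i^{*}$th and $k^{*}$th clusters are merged via (14) and the $k^{*}$th row is deleted from $U$.
(8) If (6) is satisfied, then the process stops; otherwise, go to step (3) for the $(s+1)$th iteration.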
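The merging step is the part that lets the number of clusters shrink during the iterations, and it can be sketched in isolation. The code below is only an illustration: the similarity measure (overlap of two membership rows divided by the mass of each row, made symmetric with `max`) is an assumption of this sketch, the threshold `alpha` is passed in by the caller, and the function names are hypothetical.

```python
def most_similar_pair(U):
    """Return (similarity, i, k) for the most similar pair of rows of U.

    Assumed measure: sum_j min(u_ij, u_kj) / sum_j u_ij, symmetrised
    with max. Rows are assumed to have non-zero total membership.
    """
    best = (-1.0, None, None)
    for i in range(len(U)):
        for k in range(i + 1, len(U)):
            overlap = sum(min(a, b) for a, b in zip(U[i], U[k]))
            sim = max(overlap / sum(U[i]), overlap / sum(U[k]))
            if sim > best[0]:
                best = (sim, i, k)
    return best

def merge_if_similar(U, alpha):
    """Merge the most similar pair of clusters if it reaches `alpha`."""
    sim, i, k = most_similar_pair(U)
    if i is None or sim < alpha:
        return U
    # Summing the two rows keeps every column of U summing to 1.
    merged = [a + b for a, b in zip(U[i], U[k])]
    return [merged if r == i else row for r, row in enumerate(U) if r != k]

U = [[0.45, 0.45, 0.05],
     [0.45, 0.45, 0.05],
     [0.10, 0.10, 0.90]]
print(merge_if_similar(U, alpha=0.5))
```

Adding the two membership rows, rather than averaging them, is what preserves the constraint that the memberships of each pattern sum to one.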

3. Hotspots Detection and Evolution in Disease Analysis

Each pattern is given by the event corresponding to the residence of a patient in whom a specific disease has been diagnosed. The two features of the pattern are the geographic coordinates of the residence.

The first step of our process is a geocoding activity, necessary for obtaining the event dataset starting from the street addresses of the patients.

To ensure an accurate matching for the geopositioning of the event, we need the topologically correct road network and the corresponding complete toponymic data.

The starting data include the name of the street and the house number of the patient's residence. After the matching process, each record is converted into an event point georeferenced on the map.

In Figure 2, the road network of the district of Naples is shown; the name of the street is labeled on the map; the events are georeferenced as points on the map.

Figure 3 shows the data corresponding to an event selected on the map.

After georeferencing the events, the event dataset can be partitioned by time interval. For example, the events can be split by the field “Year” shown in Figure 3.
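A minimal sketch of this partitioning step follows. Only the field name "Year" comes from the text; the record layout with `x` and `y` keys, the function name `split_by_year`, and the sample records are assumptions for illustration.

```python
from collections import defaultdict

def split_by_year(events):
    """Group georeferenced event records by their "Year" field."""
    subsets = defaultdict(list)
    for event in events:
        subsets[event["Year"]].append((event["x"], event["y"]))
    return dict(subsets)

# Hypothetical geocoded records (coordinates in arbitrary planar units).
events = [
    {"Year": 2011, "x": 1.0, "y": 2.0},
    {"Year": 2012, "x": 3.0, "y": 4.0},
    {"Year": 2011, "x": 5.0, "y": 6.0},
]
for year, pts in sorted(split_by_year(events).items()):
    print(year, pts)
```

Each resulting list of coordinate pairs can then be clustered separately.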

For each subset of events, we apply the EFCM algorithm to detect the final cluster prototypes.

In this research, we focus on the analysis of the temporal evolution and spread of oto-laryngo-pharyngeal diseases detected within the district of Naples. The datasets, divided into time sequences corresponding to periods of one year, are made up of georeferenced event patterns corresponding to ailments found in patients who underwent a surgical intervention followed by a histological examination. Each event refers to the geoposition of the patient's residence.

The data have been further divided by type of disease in order to analyze the distribution and evolution of each specific disease in the study area.

The EFCM algorithm has been encapsulated in the GIS platform ESRI ArcGIS. Figure 4 shows the mask created for setting the parameters and running the EFCM algorithm.

Other numerical fields can be selected in order to add further features to the geographic coordinates.

Initially, we set the initial number of clusters, the fuzzifier m, and the error threshold for stopping the iterations. After running EFCM, the number of iterations, the final number of clusters, and the error calculated at the last iteration are reported. The resultant clusters are shown as circular areas on the map and can be saved in a new geographic layer.

The final process concerns the comparative analysis of the hotspots obtained from the clusters corresponding to each subset of events.

Figure 5 shows an example of the display on the map of hotspots obtained as final clusters for two consecutive subsets of events.

In order to assess the expansion and the displacement of a hotspot, we measure the radius of the hotspot and the distance between the centroids of two intersecting hotspots.
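These two measurements are straightforward once each hotspot is stored as a circle. The sketch below assumes projected (planar) coordinates in kilometers and a tuple layout `(cx, cy, radius)`; the function name `compare_hotspots`, the dictionary keys, and the sample values are hypothetical.

```python
import math

def compare_hotspots(earlier, later):
    """Compare two circular hotspots given as (cx, cy, radius) in km."""
    (x1, y1, r1), (x2, y2, r2) = earlier, later
    shift = math.dist((x1, y1), (x2, y2))
    return {
        # Two circles intersect when the centroid distance does not
        # exceed the sum of the radii.
        "intersect": shift <= r1 + r2,
        "centroid_shift_km": shift,
        "radius_change_km": r2 - r1,
    }

print(compare_hotspots((0.0, 0.0, 3.0), (3.0, 4.0, 5.0)))
```

A positive radius change indicates an expanding hotspot, while the centroid shift measures its displacement between the two periods.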

In the next section, we present the results obtained by applying this method to the data corresponding to surgical interventions on the oto-laryngo-pharyngeal apparatus in patients resident in the district of Naples between the years 2008 and 2012.

We divide the dataset per year and analyze various types of diseases.

Among the most frequent types of disease, the following were analyzed:
(i) carcinoma,
(ii) bilateral Reinke's edema,
(iii) hypertrophy of the inferior turbinate,
(iv) nasal polyposis,
(v) bilateral vocal fold prolapse.

In the next section, we show the most significant results obtained by applying this method to each partitioned dataset of events.

4. Test Results

We present the results obtained on the event dataset described above in the period between the years 2008 and 2012.

We first consider the subset of data corresponding to bilateral Reinke's edema.

We fix the fuzzifier parameter to 0.1, the initial number of clusters to 15, and the final iteration error to $1 \times 10^{-2}$.

Table 1 shows the results obtained for each year.

We present the details relating to the comparison of the hotspots obtained by considering the event data for the years 2011 and 2012.

Figures 6 and 7 show, respectively, the hotspots obtained by using the pattern subset of events that occurred in the years 2011 and 2012.

Figure 8 shows the overlap of the hotspots obtained for the two years: in red, the hotspots corresponding to the year 2011; in blue, the ones corresponding to the year 2012.

Table 2 shows, in the first two columns, the labels of the hotspots in 2011 and 2012; in the third (resp., fourth) column, the radius obtained in 2011 (resp., 2012); and in the fifth column, the distance between the centroids.

The results show that only hotspot 3 obtained for the year 2011 remains almost unchanged in the year 2012. Hotspots 1 and 2, instead, seem to merge into a single larger hotspot (hotspot 1 obtained for the year 2012), while hotspot 4 shifts by about 1 km and expands; the radius of this hotspot in 2012 is about 6.5 km (hotspot 3 obtained for the year 2012 in Figure 8).

Now we show the results obtained for the disease nasal polyposis.

Figure 9 shows the overlap of the hotspots obtained for the two years, 2011 and 2012.

Table 3 reports the results of the comparison.

The results in Figure 9 show that in 2011 and 2012 there are two hotspots: one covering an area of the city of Naples and the other covering many Vesuvian towns. The two hotspots, which in 2011 covered circular areas with radii of about 3 and 5 km, respectively, in 2012 cover circular areas with radii of about 5 and 7 km, respectively.

The histogram in Figure 10 shows the trend of the radii of the two hotspots in the course of time.

The spread in recent years of the hotspot that surrounds the Vesuvian towns is noteworthy: the radius of this hotspot grows from about 2 km in the year 2008 to about 7 km in the year 2012.

Another significant trend concerns the hotspots obtained for the carcinoma disease.

In this case too, the two main hotspots cover the city of Naples and many Vesuvian towns. Here we observe a very high spread of the hotspot covering the city of Naples (cf. Figure 11); in recent years, the radius of this hotspot has increased to 9.5 km.

5. Conclusions

The hyperspheres obtained as clusters (circles in the case of two dimensions) by using EFCM can represent hotspots in hotspot analysis; this method has a linear computational complexity and is robust to noise and outliers. In hotspot analysis, the patterns are bidimensional and the features are the geographic coordinates; the cluster prototypes are circles that can represent a good approximation of hotspot areas and can be displayed as circular areas on the map.

In this paper, we present a new method that uses the EFCM algorithm for studying the spatio-temporal evolution of hotspots in disease analysis.

We consider the residence information of patients in the district of Naples (Italy) who underwent surgery on the oto-laryngo-pharyngeal apparatus between the years 2008 and 2012. A geocoding process is used for georeferencing the data; then, the georeferenced dataset is partitioned by year and type of disease. We compare the hotspots obtained for each pair of consecutive years and analyze the trend of each hotspot over time by measuring the variation of the radius and the distance between the centroids of intersecting clusters in two consecutive years.

The results show a consistent spread in recent years of the nasal polyposis hotspot covering some Vesuvian towns and of the carcinoma hotspot covering the city of Naples.