Abstract

The spatial scan statistic is one of the main epidemiological tools to test for the presence of disease clusters in a geographical region. While the statistical significance of the most likely cluster is correctly assessed using the model assumptions, secondary clusters tend to have conservatively high P-values. In this paper, we propose a sequential version of the spatial scan statistic to adjust for the presence of other clusters in the study region. The procedure removes the effect due to the more likely clusters on less significant clusters by sequential deletion of the previously detected clusters. Using the Northeastern United States geography and population in a simulation study, we calculated the type I error probability and the power of this sequential test under different alternative models concerning the locations and sizes of the true clusters. The results show that the type I error probability of our method is close to the nominal level and that for secondary clusters its power is higher than the standard unadjusted scan statistic.

1. Introduction

Spatial and space-time scan statistics [1] have become some of the main tools in geographic disease surveillance to test the null hypothesis that geographical data are randomly distributed against a localized cluster alternative. Examples include its use for breast cancer mortality in Texas [2], giardiasis parasites in Canada [3], pneumonia in Brazil [4], lymphatic filariasis in Haiti [5], and syndromic surveillance in New York City [6].

The standard spatial scan statistic is a maximum likelihood ratio test statistic based on a circular window of variable size scanning the geographical area under surveillance. Under the null hypothesis, there are no disease clusters. Under the alternative hypothesis, there is a single geographical cluster in the region of unknown location and size. A detailed description is presented in Section 2.

When the test significantly detects one cluster, it is of interest to know if there are additional clusters present in the region. The test procedure provides evidence for the presence of these so-called secondary clusters, spatial clusters not overlapping with the most likely cluster but with significantly large likelihood ratio. These secondary clusters have an associated -value but they are calculated ignoring the existence of the most likely cluster that had already been detected. One consequence of this is that these are conservative -values [1] leading to a loss in statistical power. The -values for testing for a second cluster could alternatively be calculated conditionally on the presence of the primary cluster and their -values could then be smaller than those delivered by the standard method.

In this paper, we propose a sequential method to evaluate the statistical significance of secondary clusters by removing the effect caused by the previously detected stronger clusters. In brief, data from the most likely cluster (MLC), possibly together with some neighboring tracts, are deleted from the dataset if the MLC is statistically significant. The standard spatial scan statistic is then applied to the reduced dataset to test for the presence of a second cluster. The procedure is reiterated until no further statistically significant cluster is found. We also evaluate an alternative approach, where we replace the cases in the most likely cluster with the expected number of cases for each location in that cluster. The expected numbers are calculated using data only from the tracts outside the MLC.

Either methods is valid only if their type I error probability is close to the nominal level. Our simulation results show that the type I error probability is under control. We compare the power of testing the significance of the second most likely cluster between the standard method and the sequential method finding that the sequential method has higher power for secondary clusters.

The paper is organized as follows. The spatial scan statistic is briefly reviewed in Section 2. In Section 3, we adjust the spatial scan statistics by either removing the most likely cluster or by replacing the most likely cluster with expected counts. We present the results concerning the type I error probability and statistical power in Section 4. A practical example applying both the standard approach and the new sequential approach to breast cancer mortality in the Northeastern United States is given in Section 5. In Section 6, we discuss the results derived in this paper.

2. The Spatial Scan Statistic

Suppose a geographical region is partitioned into small areas such as, for example, zip code areas or census tracts. The areas are represented by their geometrical centroids. The data input for the spatial scan statistic includes the geometrical centroid location (longitude and latitude coordinates), the population at risk count, and the number of disease cases in each area. We assume that the counts are independent Poisson distributed random variables. The risk population may be the raw population count or a covariate adjusted population at risk estimate. Centered at each centroid, a collection of circles of continuously varying radii defines the potential clusters: each one is composed by its center plus the neighboring centroids contained within the circle.

Let and denote the total covariate adjusted population at risk and the total number of cases of the areas within circle , respectively. In most disease surveillance applications, there is no reliable external estimate concerning the expected number of cases under the null hypothesis. Also, interest is most often in comparing disease risk in different parts of the map rather than a comparison with an external region or time period. Hence, we use the conditional rather than unconditional scan statistic [7]. Conditioning on the observed total number of cases, the spatial scan statistic is the maximum of the likelihood ratio over all possible circles : where is the likelihood that circle has observed cases conditionally on the total number of cases , on the within population , and on the total population . The denominator is the likelihood function under the null hypothesis.

If we know the null hypothesis underlying disease rate, we can approach this problem in a nonconditional way, as in [8]. He shows that better performance can be possible for the spatiotemporal scan method when one does not condition on the total number of cases. However, in most applications one does not have an estimate of the underlying rate with enough precision to assume it known.

Let be the expected number of cases in circle . We have for all with if . Therefore, we can write

The level critical value is defined as , where is the type I error probability, and represents the null hypothesis. A circle with likelihood ratio is significant, and the circle with the maximum likelihood ratio is called the most likely cluster (MLC). Besides the MLC, multiple additional clusters can be derived and evaluated as well. The most interesting additional clusters are the circles that do not overlap with the MLC and that have the likelihood ratio larger than the critical cutoff point .

To evaluate the significance, the scan statistic test uses Monte Carlo hypothesis testing [9]. The -value of the MLC is calculated by repeatedly simulating data under the null hypothesis conditional on the same total number of cases . For each of simulations one calculates the maximum likelihood ratio statistic. The critical value is equal to the highest of these maximum likelihood ratio statistics and where is the number of test statistics from the simulated datasets that are larger than the test statistic from the real data.

If the test is significant, there is interest in testing for the presence of additional secondary clusters that do not overlap with the most likely cluster. The standard way to do this is to compare the likelihood ratio of such secondary clusters with the maximum likelihood ratios from the simulated data. The interpretation of this approach is that we are evaluating whether the secondary clusters are able to reject the null hypothesis on their own strength, whether the most likely cluster is a true cluster or not. In some sense, it is like a regression analysis, where each variable is entered in a separate regression model and evaluated without adjusting for other variables. A drawback of this approach is that the -values are conservative [1] with a corresponding loss of statistical power. This is because the likelihood ratio from the real data is less than the maximum while it is compared to the maximum likelihood ratios from the simulated datasets.

An alternative approach would be to compare the likelihood ratios from secondary clusters in the real data with the likelihood ratios from the corresponding secondary clusters from the simulated data. However, since the simulations are carried out under the null hypothesis, these likelihood ratios from the simulations do not take into account that there is one cluster already present in the map and outside the secondary cluster. A more interesting approach is to try and adjust for the most likely clusters when evaluating secondary clusters.

3. Adjusting the Spatial Scan Statistic for Multiple Clusters

We propose to test the statistical significance of multiple clusters sequentially, so that one would test the second most likely cluster only if the MLC is significant, and the third most likely cluster only if the second most likely cluster is significant and so on. Since the spatial scan statistic calculates the likelihood of cases within the circle to the remaining cases outside the circle, the previously detected most likely clusters have an effect on calculating the test statistic for the less likely clusters. Our objective is to eliminate this effect to be able to test for the presence of additional clusters conditional on the presence of the previously detected clusters.

To reduce the effect due to the MLC, we remove from the original data the areas comprising the MLC. We also experiment with removing not only the MLC but also some of its nearest neighboring areas. The areas that are removed leave an empty region in the map, with no population and no cases. Next, the scan test procedure is carried out on this reduced dataset, treating the removed area as if it was a “lake” with no information. This procedure is iterated until no further statistically significant clusters are found.

By a conditioning argument, we can justify this deletion of the first cluster from the map to run the second stage test. Suppose that we want to test for the presence of a second cluster given that we detected exactly the first cluster . Since we assume independent counts in the areas, the likelihood can be written as the product of three factors. The first one, is associated only with the counts in with probability . The second one is similarly defined for a second cluster with null intersection with and the third is associated with the remaining of the areas (). Then, the distribution of the counts conditioned on those observed in is simply and the maximum likelihood test statistic should be This is the same test statistic one obtains by removing from the map the data from cluster and working in the reduced map with the usual scan statistic test.

An alternative approach is to replace the number of cases of each area within the MLC with its expected number of cases such that the ratio between the number of cases and the population within the MLC is the same as that outside of it. More specifically, let be the updated number of cases of tract within the MLC, then , where is the total number of cases, is the number of cases within the MLC, is the population of tract , is the total population and is the population within the MLC. Conditional on the updated population and on the number of cases, the standard spatial scan statistic is applied to test the significance of the MLC within the updated data. This procedure is iterated until the MLC in the latest updated data is not significant.

Either of these two methods will be valid only if the type I error probability is close to the nominal level. In the next section, that is evaluated.

4. Type I Error and Power

A publicly available simulated benchmark dataset is used to evaluate the adjusted spatial scan statistic. Described in detail by Kulldorff et al. [10], it can be downloaded from “http://www.satscan.org/datasets”.

The datasets use the geographical locations and the 1990 female population of 245 counties in the northeastern United States. The simulated datasets distributed 600 disease cases among the counties according to a multinomial distribution with probabilities proportional to the product of the population times the relative risk. Under the null hypothesis, the relative risk is equal to one. Under the alternative hypothesis, these relative risks are larger than 1 in the areas belonging to true clusters imposed on the map.

There are three types of high relative risk clusters, composed by rural, urban, or by mixed population areas. They are located in different parts of the map. The rural cluster is centered on Grand Isle County in northern Vermont on the Canadian border, which is the county with the smallest population, surrounded by other rural counties. The urban cluster is centered on Manhattan in New York City, surrounded by other urban counties. The mixed cluster is centered on Pittsburgh in western Pennsylvania, a large city surrounded by mostly rural counties. The clusters differ also by their size, being comprised by 1, 4 or 16 counties. There are 9 different types of datasets obtained by the crossing of different cluster sizes and locations.

Depending on the cluster size, the relative risks of the clusters' counties varied from 1.53 to 2.73 in the urban case, and from 2.10 to 2.85 in the mixed case. In the rural case, the relative risks were equal to 3.90 and 7.05 for 16 and 4 counties in the cluster, and equal to 192.89 in the cluster with a single county. For further details, see the paper by Kulldorff et al. [10].

There is one additional dataset classification, those with one true disease cluster and those with two true disease clusters. The two cluster datasets were built with an urban/rural, urban/mixed or mixed/rural combination, with the same number of counties in each cluster. While the number of counties is the same in both clusters, the relative risks and the population sizes are different. To calculate type I error probability of the adjusted spatial scan statistics for testing the significance of the second most likely cluster, benchmark data with one true disease cluster are needed. Each one of the type I error probabilities estimates is based on 10,000 simulated datasets. To calculate its power, benchmark data with two true disease clusters are needed. Each power estimate is based on 1000 simulated datasets.

Table 1 shows the estimated type I error probability of detecting a second cluster when there is in fact only one true cluster in the data. The tests were carried out at the nominal levels and with 0, 2, 5, 10 or 50 neighboring tracts removed, named as buffer size.

Generally, the estimated actual error type I probabilities are very close to the nominal levels. Table 1 shows that changing the buffer size has very little effect on the error type I probability. In the case, the maximum difference between actual and nominal levels is 0.001 for the mixed cluster and 0.003 for the urban cluster. In the case, these values are 0.006 and 0.004, respectively. Furthermore, the rural cluster with 4 and 16 counties do not have substantial differences between actual and nominal levels of significance. The only major discrepancies occur for the rural cluster composed by a single county, Grand Isle. Without a buffer, the estimated type I error probabilities are 0.017 and 0.059 for the nominal levels of 0.01 and 0.05 respectively, and the results are similar with a buffer added, except for the very largest buffer size.

To further investigate these type I error probabilities, we generated random dataset using different relative risks for the Grand Isle cluster, in addition to the original RR = 192.89. Also, in addition to the standard maximum window size of 50 percent of the population at risk size, we also estimated the type I error probabilities for a spatial scan statistic with a maximum window size of 5 percent. As shown in Table 2, the type I error probabilities are good when but slightly on the high side for the larger relative risks. The reason that the numbers for and circle size 50 percent is different from Table 1 is that a different random seed was used.

In Table 3, we show the estimated type I error probabilities using the method that replaces the observed counts within the MLC by the expected number of cases. The results are similar to those in Table 1, but not quite as good. In addition to the single Grand Isle cluster, there are also relatively large differences for all the rural and mixed clusters.

Table 4 shows the estimated power of the most likely cluster removing method, together with the standard scan statistic. The power of the latter is generally lower, with the biggest differences seen for larger cluster sizes. The power of the adjusted scan statistic power is not overly sensitive to the buffer size.

5. Breast Cancer Mortality in Northeastern United States

We illustrate the practical use of the adjusted spatial scan statistic using breast cancer mortality data for Northeastern United States, during 1988–1992, as collected by the National Center for Health Statistics. The study area composed by 245 counties and county equivalents in Northeast states including Maine, New Hampshire, Vermont, Massachusetts, Rhode Island, Connecticut, New York, New Jersey, Pennsylvania, Delaware, Maryland, and the District of Columbia. During 1988–1992, there were 58,943 breast cancer deaths recorded among women in this region, for which we have information about age and county of residence. The population is available for each county by five-year age groups. The total female population in the region was 29,535,210. The geographical location of each county was used as specified by the 1990 Census [11]. The data has been described in detail elsewhere [1]. The analysis is adjusted for age using indirect standardization.

Using the standard spatial scan statistic, it has previously been shown that women who live in the New York City - Philadelphia metropolitan area have an increased risk of dying from breast cancer compared to the rest of the region with 24044 cases when 23040 were expected [1]. This cluster is statistically significant () and it is shown in Figure 1. Using the standard method, the second most likely cluster in the Buffalo area is not significant (), with 1416 observed deaths when 1280 were expected. This cluster is shown in Figure 1, in the northwestern part of the map. Using the sequential method, with the most likely cluster removal method without a buffer, the second most likely cluster is instead in the Boston metropolitan area (), with 5966 observed deaths when 55565 were expected.

A complete list of the detected clusters is presented in Table 5, for both the standard and the adjusted spatial scan statistics. When using the standard method, only the most likely cluster was statistically significant. When adjusting for more likely clusters using the sequential method, six statistically significant clusters were found at the level. This illustrates the behavior of the usual scan statistic which, by not taking into account the previously detected clusters, may miss some less likely clusters, and especially if the most likely cluster is large in size. Compare, for example, the -values for the cluster around Boston, ranked as fourth in the standard scan statistic with and as second in the sequential method with . Note also that two significant clusters found by the sequential method, Pittsburgh and Albany, are not among the five most likely clusters found by the standard spatial scan statistic.

6. Discussion

When using the standard spatial scan statistic, a large cluster in one area of the map may hide the existence of a secondary cluster in another area of the map. By removing the MLC, secondary clusters can be detected using the standard scan statistic on the reduced dataset. We have shown that this procedure has an actual error type probability very close to the nominal level and that it has a higher power than the standard scan statistic for secondary clusters. The sequential approach can be applied recursively to locate other clusters conditionally on the presence of all the previously detected clusters. This is repeated until the most likely cluster in the updated data is nonsignificant. For the breast cancer mortality data from the Northeastern United States, we make different inference about the presence of disease clusters when using the two methods. While we find only one significant most likely cluster using the standard method we find six significant clusters using the sequential approach. This may have important practical implications.

A key finding of this study is that when adjusting the analysis for more likely clusters, it is better to remove these clusters rather than replacing them with expected counts. While the type I error probabilities are close to the nominal levels for both approaches, they are consistently closer when the removing method is used. Another important finding is that there is no need to use a buffer around the clusters to be removed, but it is enough to remove the cluster itself. The type I error probabilities are about equally good in both cases, while the power suffers slightly when a large buffer is used.

If there is a reliable external estimate for the expected counts under the null hypothesis, and we are interested in the absolute differences in risk compared to this external estimate rather than the geographical variation of the relative risks within the study region, we should use the unconditional rather than the conditional scan statistic. Sonesson [8] has shown that this will give higher statistical power. For the unconditional scan statistics there is no need to use a sequential approach though, as the expected number of cases will not change when the most likely cluster is removed. The only effect of the sequential approach would be a slight change in the -value of secondary clusters due to slightly less multiple testing that needs to be adjusted for. Hence, we have only considered the conditional scan statistic in this paper.

While we used a sequential approach for the breast cancer mortality example, by removing more likely clusters when evaluating the statistical significance of clusters with lower likelihood ratios, the procedure could also be used to evaluate the stronger clusters after adjusting for weaker ones. For example, one could check whether the Albany cluster has a much lower -value than its current after removing and therefore adjusting for the Eastern Maine cluster. In this manner, one may evaluate the statistical significance of any cluster adjusting for any collection of other clusters. The -values will of course be different for the same cluster depending on what other clusters were adjusted for, just like in a regression analysis, where the -value of one variable will depend on what other variables are included (adjusted for) in the regression model.

The spatial scan statistic is commonly used to analyze mortality or incidence data using a Poisson model, but there is also for example a Bernoulli model for dichotomous case-control data and an exponential model for survival data with or without censoring [12]. If one is not willing to assume a parametric distribution to the observed counts, an alternative is the semiparametric method of Wen and Kedem [13] which provides inference for the primary and secondary clusters by means of a false discovery rate method.

The sequential scan statistic approach may also be applied for these other probability models to adjust for multiple clusters. The same may or may not be true for other types of scan statistics such as purely temporal and space-time scan statistics, whether performed in a retrospective or prospective manner [14, 15], or scan statistics that are designed to detect areas with lower rather than higher rates of disease, and we do not know how well a sequential procedure would work in such a setting. Moreover, while the scan statistic is most often applied using a circular window as we did in this paper, there have more recently been developments of nonparametric shaped scan statistics as well [1619], in addition to other noncircular shapes [20]. It would be interesting to evaluate how the sequential approach works in these settings, to learn whether the type I error probabilities are still under control and close to their nominal values.

The sequential scan statistic presented here is more computer intensive than the standard scan statistic since a standard analysis must be run for each sequential stage of the procedure. The sequential scan statistic has been incorporated as an option into the freely available SaTScan software (http://www.satscan.org/) that already contain the standard temporal, spatial and space-time scan statistics.

Acknowledgment

This research was funded by the National Cancer Institute, Grant no. RO1CA095979, and by the National Institute for Child Health and Development, Grant no. R01HD048852.