New Advances in BiostatisticsView this Special Issue
Detecting Spatial Clusters via a Mixture of Dirichlet Processes
We proposed an approach that has the ability to detect spatial clusters with skewed or irregular distributions. A mixture of Dirichlet processes (DP) was used to describe spatial distribution patterns. The effects of different batches of data collection efforts were also modeled with a Dirichlet process. To cluster spatial foci, a birth-death process was applied due to its advantage of easier jumping between different numbers of clusters. Inferences of parameters including clustering were drawn under a Bayesian framework. Simulations were used to demonstrate and assess the method. We applied the method to an fMRI meta-analysis dataset to identify clusters of foci corresponding to different emotions.
This work was motivated by a study aiming to detect centers of activated foci from a function magnetic resonance imaging (fMRI) metadataset. In summary, fMRI metadata is a collection of fMRI studies identifying areas of the brain that are significantly activated by stimuli to examine a specific outcome. FMRIs are expensive which leads to small sample sizes and therefore can use metadata to increase sample size and power. To identify spatial clusters, finite mixture models are generally implemented [1–3]. Mixture components, representing a cluster, typically share a common parametric family with each component containing respective parameters [1, 2]. Each component also has a mixing proportion or weight that is respective to the frequency of data in that component . Because of the model’s ease of implementation, this allows various applications such as pattern recognition, computer vision, signal and image analysis, and machine learning to list a few .
One commonly used distribution in the aforementioned finite mixtures is the normal distribution [1, 5], appreciating its established properties and, in the Bayesian context, conjugacy. However, when it comes to clustering, assuming normality for each part of the mixture can potentially lead to oversensibility, e.g., when one cluster is formed by a mixture of two normal distributions but with rather close centers. This type of oversensibility in many research fields should be avoided; one example is emotional foci inferred from brain imaging data, where a certain emotion is covered by a wide region.
To infer the number of clusters, under the Bayesian framework, different methods have been proposed. Reversible jump Markov chain Monte Carlo has been commonly used to infer the number of clusters [6, 7]. At each iteration, a decision is made between splitting one cluster to two, combining two clusters to one, or no movement. One potential difficulty of this approach is the risk of being trapped at a local maximum. Recently, the Dirichlet process (DP) has been used often to estimate the number of clusters [8, 9]. This process has the ability to capture irregular patterns. DP has the ability to detect clusters without being burdened about additional clustering parameters. However, this feature of clustering also has an inherit weakness in that it tends to produce more clusters, making interpretations more difficult.
To overcome the aforementioned gaps, we implemented a mixture of Dirichlet processes (DP). These processes have the ability to describe irregular patterns [8, 9] and by using it as our common parametric family allows the model to identify more complex patterns than the normal distribution would be able to identify. Furthermore, motivated from our previous work with the spatial Cox process application in , we elected to incorporate the birth-death process to statistically determine the number of clusters. Compared to other clustering approaches noted earlier, the birth-death process has the advantage of quick convergence and, by controlling birth rate, embraces a potential of generating redundant clusters.
The remainder of the article is organized as follows. In Section 2, we introduce the model structure, notation, and priors and hyperpriors, simulations are presented in Section 3, the application of the model to an fMRI meta-analysis dataset is in Section 4, and conclusions and discussion are in Section 5.
2.1. The Model
We let denote the coordinate of a spatial point in a three-dimensional space, in particular, point j in group (study) r, and . We have , where is the total number of observed points. It straightforwardly follows that represents all points in the study. We model as where denotes the effect of group , while represents the mean of for the point in group after adjusting for group effects, , and denotes some random error. By modeling the random error as a standard multivariate normal distribution, the distribution of satisfies with where is the covariance matrix.
2.2. Prior and Hyperprior Distributions
We start from the prior distribution of . To detect underlying clusters of due to similarities of , we describe the prior of as a mixture of distribution , . Common choices of are normal distributions. To improve flexibility, we relax such normalizing assumptions in the mixture and assume is generated from a DP, i.e., , where is the precision parameter and is a base distribution and taken as . In particular, , where such that . For the number of clusters K, we assign a truncated Poisson prior distribution to , , . We assign a Dirichlet distribution with parameter 1 to , which implies that is k-dimensional uniformly distributed. The prior distribution of from the base distribution is chosen to be , where , which are known and set as the midpoint of observed interval of variation of the data. Parameter is set as where , and are the range of the data for each dimension. This prior setting is adopted from  and we feel it is reasonable for this setting given the fact that the number and location of clusters are unknown. We let with since the range of the observed data is small. For group effect, , we assume it is small with , where with specified later and , where the lower and upper limits are defined as 10% of the absolute range of the data. We let and , with . The variance component of the random error, , is assumed to follow a relatively noninformative Inverse Gamma (IG) distribution, . The precision parameters and are selected by minimizing the deviance information criterion (DIC) [11, 12].
2.3. Conditional Posterior Distributions and Posterior Computing
Sampling parameter estimates from their posterior distributions can be achieved via Gibbs sampler, in which the statistical inference on the number of clusters is modeled using the birth-death process. The birth-death process is one type of continuous-time Markov chain originally introduced in . This process is often used to simulate realizations of point processes as they can be difficult to directly sample from . These realizations are further used for likelihood inferences for model parameters . The birth-death scheme allows events to randomly occur throughout the chain; these events are either a “birth” or “death.” If a birth occurs, the number of components increases by one, while if a death occurs, the number of components decreases by one.
Recall considering a finite mixture prior for such that all are assumed independently distributed with each generated from one of Dirichlet processes denoted as , i.e., where each represents a DP but is unknown, are the mixing proportions, and are the component specific parameters for each DP. For cluster assignment, we introduce an index variable that indicates the assignment of observation rj and takes the values of 1 to . Denoted by , where represents the realization of independently and identically distributed discrete random variables with probability mass function,
The joint posterior distribution is proportional to where , is a vector of all estimable parameters. From here, the birth-death algorithm and Markov chain can be described for introducing and assigning clusters for :(1)Starting with the initial model , let represent the mixing proportion and cluster specific parameters for the unique clusters. Let the birth rate be .(2)Calculate the death rate for each component: (3)Calculate the total death rate . To quicken convergence, we elected not to model the time to next jump as exponential and allowed an event to occur at each iteration of the Markov chain.(4)Simulate the type of event, birth or death, with the respective probabilities: (5)Adjust the model to reflect the birth or death by the following:(i)Birth: Simulate a new component from each parameter's respective (independent) prior distributions, from and from base distribution, such that the model becomes . It can be mentioned that is the Beta distribution with parameters and can be easily simulated from and such that .(ii)Death: Select a component to die with the probabilities for such that the model becomes (6)Given the current state of the model at time , simulate values for all remaining parameters.(7)Go to step .
Incorporating the birth-death process into our model, we need to further define the likelihood for removing cluster : where . The birth-death process is conditional on the prespecified birth rate, . By setting this birth rate, which controls how often a “birth” of a new component occurs, equal to as suggested and done in , this computationally allows the death rates to be a likelihood ratio absent of . In other words, the likelihood of the data drives the death rates and ultimately the decision of a new cluster. Given that decision is a “birth,” the new cluster’s parameters ,, and are sampled from their prior distributions: The mixing proportions are adjusted by multiplying all current proportions by if a birth occurs or dividing by if a death occurs.
To simulate values for all remaining parameters and hyperparameters, we implement the Gibbs sampler. Conditional posterior distributions are listed below. Note that “” denotes data and other parameters not listed. The conditional posterior of is where is the number of subclusters for cluster , is the number of foci in cluster , and denotes the unit point mass, and which is the distribution of a DP with being the unit point mass and the number of foci in some cluster , where and denotes the average of all in cluster k, where is the number of foci in cluster , . The conditional posterior of is generated from a DP again: where is the unit point mass. The conditional posterior distributions for the related hyperparameters are where and represents the average group effect and
Lastly, the sampling distribution for is
The sampling of unique values for and can be performed using Neal’s algorithm 8 . It works by introducing auxiliary parameters that are independent to other parameters to represent potential values for and . Algorithm 8 for updating clustering assignments, denoted as , is as follows:(i)The state of the Markov chain consists of and with denoting cluster parameters, e.g., in our application. Repeatedly sample as follows:(ii)For , let be the number of distinct for , and . Label these with values in . If for some , draw values independently from base distribution for those for which . If for all , let have the label , and draw values independently from for those for which . Draw a new value for from using the following probabilities: where is the likelihood with and observation , , involved. In our case, it is ,(iii)where is the number of for that are equal to and is the appropriate normalizing constant. Change the state to contain only those that are now associated with one or more observation.(iv)For all , draw new values from such that , or perform some other update to that leaves this distribution invariant .
Thus, the value for for those foci in cluster and subcluster may be sampled from where are the number of foci in some cluster and subcluster and denotes the mean of the observed data after adjusting for group effect. The value for for those groups (studies) in group cluster (different notation than subclusters above) may be sampled from where is the number of groups in cluster and similarly denotes the mean of the observed data after adjusting for individual effect.
2.4. Determining the Clusters
To estimate the number of clusters and the center of each cluster and cluster assignment, we implement the same least-squared Euclidean distance method introduced in  and used previously in our work in  and reiterated below. This method draws inferences on clusters based on a set of converged MCMC iterations and chooses one iteration as the final estimates for the clusters and related parameters. This final MCMC iteration is selected due to its smallest Euclidean distance to the expected cluster assignments estimated based on a set of independent converged MCMC iterations. This approach incorporates all clustering information in the MCMC sampling process :(1)After the prespecified number of MCMC burn-ins, let the MCMC simulations continue for an additional iterations. An averaged clustering matrix is then created, denoted as , and is an matrix with each block or entry denoting how often foci and () are in the same cluster. Specifically, each entry is the proportion of the W iterations that two foci are in the same cluster.(2)Let the MCMC run additional iterations, where, for each iteration,(a)create an matrix using indicators to denote which foci are clustered together; i.e., let the entry denote a 1 if foci and are in one cluster and 0 otherwise.(b)use Euclidean distance to determine the similarity between this indicator matrix and the averaged clustering matrix .(3)Among the iterations, select the iteration and respective clustering pattern, number of clusters, and parameters that produce the smallest Euclidean distance.
3.1. Simulation Settings
Simulations were utilized to illustrate and assess the proposed method. We assumed an fMRI metadata study setting including 50 studies, with each study containing 10 foci. Collectively these foci were simulated from three clusters centered as (x,y,z) talairach coordinates at (, , and containing 150, 150, and 200 foci, respectively. It was also assumed that half the data, 250 foci or half from each individual cluster, came from one study cluster centered at and the remaining data from a second study cluster centered at . These study clusters were linear shifts to cluster centers. For example, 75 foci in cluster one were centered at ( and the other half were centered at (; the study effect was a linear shift to all three dimensions from the cluster center. We made various alterations to this general setting:(1)Normal setting: We simulated data from a multivariate normal for each cluster with respective means described above and a variance of . This creates spheres with little variation and we expect the method to have the ability to correctly identify the clusters.(2)Chi-squared (skewed) setting: The method’s ability to cluster in the presence of abnormal patterns is an important factor in spatial clustering. For this setting, we applied the same scenario as in the normal setting in respect of clusters 1 and 2 but simulated cluster 3 using a chi-squared distribution with 4 degrees of freedom.(3)Large variance setting: The last scenario is designed to assess the robustness of the method with respect to the distance between and among clusters. To this end, we applied the normal setting but considered increasing levels (referred to large1, large2, large3, and large4 settings, resp.) of : , and , representing gradually closer distances among clusters.
For each setting, we implemented a grid search for a single dataset to estimate values of and based on the minimization of DIC. We let precision parameter values be 0.01, 0.05, 0.1, 0.5, 1, 2, and 5. Based on and estimates, 100 MC datasets were generated with 2500 burn-in iterations, 500 working iterations to calculate the probability matrix for determining the clusters, and 100 additional iterations to infer the number of clusters and individual foci cluster centers.
Model assessment consists of three evaluations: sensitivity, specificity, and accuracy. Sensitivity and specificity are defined by their generic definitions, the proportion of foci that are correctly assigned to their simulated cluster, and the proportion of foci that are correctly not assigned their nonsimulated cluster. Accuracy is defined as the percentage of foci that are correctly clustered. Note that the definition of accuracy takes into account both true positive and true negatives. In addition to our methodology, we applied a very common existing clustering approach for continuous data, K-means, to our simulation settings. Although this method cannot adjust for additional covariates, it allows for a comparison to existing methods. Lastly, to highlight the advantage of using a mixture of DPs over existing clustering approaches, we applied our approach, a revised version of our approach using a mixture of multivariate normal distributions rather than DPs, and Kmeans to the normal and chi-squared simulation scenarios. As the emphasis was on clustering performance, the group effect was assumed to be known for these two settings.
3.2. Simulation Results
Table 1 summarizes the findings on the three foci-level cluster identifications and the quality of the identified clusters. The proposed method gives high sensitivities and specificities across all scenarios. The accuracy of cluster assignment overall is higher than 90% when the variation in the data is relatively small and only dropping once clusters were large enough to overlap (scenario Large 4). The proposed methodology was also accurate at identifying the correct number of clusters as indicated by the median number of individual and study clusters. In comparison, results from the Kmeans approach indicate relatively lower statistics in sensitivity and specificity (Table 2). The accuracy is around 70%, when the variations in the data are relatively small. However, compared to the proposed method, the Kmeans approach often inferred a higher number of clusters as indicated by the larger median number of clusters. The computation time of a single dataset for the mixture of DPs took, on average, 7-8 hours on a high performance computer (Dell cluster with 88 compute nodes, 3120 total central processing unit cores, 20664 Giga-bytes of RAM, and 61440 total graphic processing unit cores).
After removing study effects, the comparison between the mixture of DPs, mixture of normal, and Kmeans is as expected (Table 3). Both mixtures performed exceptionally well at identifying the three normally distributed clusters while the Kmeans performance was adequate with an overall accuracy of 80% (compared to 100% for both mixtures). Once the data deviated from normality, the mixture of normals approach was unable to differentiate the clusters resulting in low accuracy (32%). The Kmeans approach performed similarly to the mixture of DPs with both approaches having low sensitivities for the third cluster which was skewed but with the mixture of DPs resulting in superior accuracy. The mixture of DPs was able to differentiate clusters 1 and 2 as indicated by 99% sensitivity and 100% specificity measures but tended to “overcluster” cluster 3 into smaller clusters as indicated by the large median number of clusters, 14% sensitivity, and 100% specificity. In regard to overall accuracy, the mixture of DPs outperformed the mixture of normals and Kmeans when the data are skewed.
4. Real Data Application
For this application, we applied the proposed method to a meta-analysis dataset. Constructed originally in , this data consists of a total of 162 neuroimaging publications, of which 57 were PET and 105 were fMRI. Among these 162 publications, there were 437 contrasts or studies. Only those foci that were deemed significantly activated by their study specific criteria were included for a total of 2,478 foci. Summary statistics for this data can be seen in Tables 4 and 5.
As with the simulation studies, grid search and DIC were used to estimate values for and . Potential precision parameters values were 0.01, 0.05, 0.1, 0.5, 1, 2, 5, and 7.5. Each combination was performed over 2,600 iterations, 2,000 of those for burn-in, 500 for the probability matrix calculation, and final 100 to infer individual clusters and their centers. To assist with the magnitude of the likelihood calculations, the data was scaled down by 10.
It was found that the precision parameter combination of and produced the smallest DIC. Convergence, with the initial 2,000 discarded, was checked based on trace plots. Based on the proposed method, we identified four study clusters and 14 individual foci clusters. A single DIC setting for this data took, on average, 72 hours to run on the HPC. The break down of the 14 individual foci clusters by center location, brain location, foci frequency, and study frequency can be seen in Table 6. The frequency of each foci-associated emotion within each of the 14 clusters can be seen in Table 7. The affective emotion was dominating in all clusters with fear being the second dominating emotion in clusters 1, 2, 3, 11, and 13, disgust in clusters 5 and 14, sadness in clusters 6 and 10, and a mixture of emotions in the remaining clusters. When only focusing on those foci that fell within known brain regions of interest, as seen in Table 8, the dominating emotion, other than affective, was sadness, fear, and disgust, respectively. When compared to the number of clusters identified by the spatial Cox Point process (53) and Kmeans (20) performed in , fewer clusters were identified with our current application. It should be noted that this particular data does not visually indicate distinct clusters and is closer to a more uniform distribution throughout the brain which may lead to an inaccurate number of identifiable clusters. However, given the results from our previous analysis in  and findings in simulation studies, it is possible that the clusters formed were rather subtle and actually might not be distinct enough.
5. Conclusion and Discussion
Modeling the realization of observed foci as a linear association of study effect and individual foci cluster effect with a multivariate normal random error was motivated by the limitation of the spatial Cox process to statistically distinguish between a cluster and a mode or peak of a cluster. The overall aim remained to identify activated regions within the brain using fMRI coordinate-based metadata. By modeling the data in this fashion, it was hopeful that the distribution could statistically differentiate between clusters and modes of clusters while retaining the flexibility and robustness to mimick the behavior of the data.
Simulation studies demonstrated that the method can fit data generated from normal or abnormal distributions. Furthermore, it was able to identify clusters within covariates while retaining the integrity to identify individual clusters. Both the proposed method and Kmeans were unable to correctly identify clusters when they were large and overlapped and both the mixture of normals and mixture of DPs performed poorly at identifying a cluster severely skewed.
When applied to a fMRI metadataset, the method identified a relatively low number of clusters. Given the low sensitivity findings in the simulated study with high noise, it can be concluded that this data had a high likelihood of being too broad. When the same data was analyzed with the spatial Cox process, the difference in the results was extreme. Not only were the number of clusters substantially less, but also none of the cluster centers identified from the proposed method came close to those identified in the first method. It is worth mentioning that the meta-analysis data is not distinctly grouped and is more uniformly distributed throughout the brain and perhaps the model used did not provide the best fit.
The primary advantage to this method, besides its flexibility, is its ability to describe irregular spatial patterns and its sampling design to statistically differentiate clusters. Because of its adaptable nature, this model can also adjust for any covariate(s) of interest. However, based on simulation studies and the fMRI metadata application, the proposed method tends to be too insensitive and has a difficult time identifying clusters when data are not distinctly differentiated. A potential limitation in the approach is that each DP within the mixture was assumed to have the same precision parameter. It was noted that, during the simulation studies when the mixture of DPs was attempting to fit the Chi-squared simulations (without study effect), it was overclustering the skewed cluster. However, when the precision parameter was smaller, the identifiability of cluster 3 became more accurate but became more inaccurate for clusters 1 and 2. Thus, to further improve flexibility and accuracy for this approach when data is skewed, each DP potentially requires its own unique precision parameter. Furthermore, this method’s clustering ability is limited by the identification of study effects which may be improved by implementing stronger restrictions or could be an effect of having multiple DPs. Our future work will focus on these issues, allowing study effect to be random rather than a fixed effect, and identifying if a large number of DPs within the model is indeed a limitation.
Data is not currently publicly available but available upon request from Professor Tor D. Wager at the University of Colorado.
Conflicts of Interest
No conflicts of interest exist for any author.
Dr. Kang’s effort was supported by an NIH Grant 1R01MH105561. Dr. Zhang and Dr. Ray’s efforts were supported by their start funds provided by the University of Memphis.
M. Aitkin and R. Healey, “Estimation and hypothesis testing in finite mixture models. Journal of the Royal Statistical Society,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 67–75, 1985.View at: Google Scholar
G. J. McLachlan and K. E. Basford, Mixture Models: Applications to Clustering, Marcel Dekker, New York, NY, USA, 1988.View at: MathSciNet
P. J. Green and D. I. Hastie, “Reversible jump mcmc,” Genetics, vol. 1550, no. 3, pp. 1391–1403, 2009.View at: Google Scholar