Journal of Probability and Statistics

Volume 2018, Article ID 3506794, 12 pages

https://doi.org/10.1155/2018/3506794

## Detecting Spatial Clusters via a Mixture of Dirichlet Processes

^{1}Division of Epidemiology, Biostatistics, and Environmental Health, University of Memphis, 220 Robison Hall, Memphis, TN 38152, USA

^{2}Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, USA

^{3}Division of Epidemiology, Biostatistics, and Environmental Health, University of Memphis, 224 Robison Hall, Memphis, TN 38152, USA

Correspondence should be addressed to Meredith A. Ray; maray@memphis.edu

Received 14 May 2018; Revised 14 November 2018; Accepted 25 November 2018; Published 18 December 2018

Guest Editor: Yichuan Zhao

Copyright © 2018 Meredith A. Ray et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

We propose an approach that can detect spatial clusters with skewed or irregular distributions. A mixture of Dirichlet processes (DP) was used to describe spatial distribution patterns. The effects of different batches of data collection efforts were also modeled with a Dirichlet process. To cluster spatial foci, a birth-death process was applied because of the relative ease with which it jumps between models with different numbers of clusters. Inferences on parameters, including cluster assignments, were drawn under a Bayesian framework. Simulations were used to demonstrate and assess the method. We applied the method to an fMRI meta-analysis dataset to identify clusters of foci corresponding to different emotions.

#### 1. Introduction

This work was motivated by a study aiming to detect centers of activated foci from a functional magnetic resonance imaging (fMRI) meta-dataset. In summary, fMRI metadata is a collection of fMRI studies identifying areas of the brain that are significantly activated by stimuli related to a specific outcome. fMRI studies are expensive, which leads to small sample sizes; metadata can therefore be used to increase sample size and power. To identify spatial clusters, finite mixture models are generally implemented [1–3]. The mixture components, each representing a cluster, typically share a common parametric family, with each component carrying its own parameters [1, 2]. Each component also has a mixing proportion, or weight, reflecting the frequency of data in that component [1]. The model's ease of implementation allows various applications, such as pattern recognition, computer vision, signal and image analysis, and machine learning, to list a few [4].

One commonly used distribution in the aforementioned finite mixtures is the normal distribution [1, 5], appreciating its established properties and, in the Bayesian context, conjugacy. However, when it comes to clustering, assuming normality for each part of the mixture can potentially lead to oversensitivity, e.g., when one cluster is formed by a mixture of two normal distributions with rather close centers. This type of oversensitivity should be avoided in many research fields; one example is emotional foci inferred from brain imaging data, where a single emotion may be covered by a wide region.

To infer the number of clusters under the Bayesian framework, different methods have been proposed. Reversible jump Markov chain Monte Carlo has been commonly used for this purpose [6, 7]. At each iteration, a decision is made between splitting one cluster into two, combining two clusters into one, or no movement. One potential difficulty of this approach is the risk of being trapped at a local maximum. Recently, the Dirichlet process (DP) has often been used to estimate the number of clusters [8, 9]. This process has the ability to capture irregular patterns and can detect clusters without requiring additional clustering parameters. However, this feature also carries an inherent weakness in that it tends to produce more clusters, making interpretations more difficult.

To overcome the aforementioned gaps, we implemented a mixture of Dirichlet processes (DP). These processes can describe irregular patterns [8, 9], and using them as our common parametric family allows the model to identify more complex patterns than a normal distribution could. Furthermore, motivated by our previous work with the spatial Cox process application in [10], we elected to incorporate the birth-death process to statistically determine the number of clusters. Compared to the other clustering approaches noted earlier, the birth-death process has the advantage of quick convergence and, by controlling the birth rate, limits the potential for generating redundant clusters.

The remainder of the article is organized as follows. In Section 2, we introduce the model structure, notation, priors, and hyperpriors. Simulations are presented in Section 3, the application of the model to an fMRI meta-analysis dataset is in Section 4, and conclusions and discussion are in Section 5.

#### 2. Methods

##### 2.1. The Model

We let $y_{rj}$ denote the coordinate of a spatial point in a three-dimensional space, in particular, point $j$ in group (study) $r$, $j = 1, \dots, n_r$ and $r = 1, \dots, R$. We have $N = \sum_{r=1}^{R} n_r$, where $N$ is the total number of observed points. It straightforwardly follows that $\{y_{rj}\}$ represents all points in the study. We model $y_{rj}$ as $y_{rj} = g_r + \theta_{rj} + \epsilon_{rj}$, where $g_r$ denotes the effect of group $r$, $\theta_{rj}$ represents the mean of $y_{rj}$ for the $j$th point in group $r$ after adjusting for group effects, and $\epsilon_{rj}$ denotes some random error. By modeling the random error as a multivariate normal distribution, the distribution of $y_{rj}$ satisfies $y_{rj} \sim \mathrm{MVN}(g_r + \theta_{rj}, \Sigma)$, where $\Sigma$ is the covariance matrix.
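As an illustration of this data-generating structure, the sketch below simulates foci under the additive model just described. The function name, the specific centers, and the study-effect values are hypothetical, and the error is drawn independently per coordinate; this is a minimal sketch, not the authors' code.

```python
import random

def simulate_foci(cluster_centers, study_effects, foci_per_study, sigma=1.0, seed=0):
    """Simulate 3-D foci from the additive model y_rj = g_r + theta_rj + eps_rj:
    theta_rj is a randomly chosen cluster center, g_r is the study-level shift,
    and eps_rj ~ N(0, sigma^2) independently in each dimension (illustrative)."""
    rng = random.Random(seed)
    data = []
    for r, g in enumerate(study_effects):          # g plays the role of g_r
        for _ in range(foci_per_study):
            theta = rng.choice(cluster_centers)    # cluster mean for this focus
            point = tuple(t + gi + rng.gauss(0.0, sigma)
                          for t, gi in zip(theta, g))
            data.append((r, point))
    return data

# Hypothetical centers and two study-cluster shifts, purely for illustration.
foci = simulate_foci(
    cluster_centers=[(-30, 20, 10), (25, -15, 5), (0, 40, -20)],
    study_effects=[(2, -1, 0), (-2, 1, 1)],
    foci_per_study=10,
)
```

Each element of `foci` pairs a study index with a simulated 3-D coordinate, mirroring the grouped structure of the meta-dataset.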

##### 2.2. Prior and Hyperprior Distributions

We start from the prior distribution of $\theta_{rj}$. To detect underlying clusters of points due to similarities of $\theta_{rj}$, we describe the prior of $\theta_{rj}$ as a mixture of distributions $F_k$, $k = 1, \dots, K$. Common choices of $F_k$ are normal distributions. To improve flexibility, we relax such normality assumptions in the mixture and assume $F_k$ is generated from a DP, i.e., $F_k \sim \mathrm{DP}(\alpha_\theta, G_\theta)$, where $\alpha_\theta$ is the precision parameter and $G_\theta$ is a base distribution taken as multivariate normal. In particular, $F_k = \sum_{i} p_{ki}\,\delta_{\phi_{ki}}$, where $p_{ki} = V_{ki} \prod_{l < i} (1 - V_{kl})$ with $V_{ki} \sim \mathrm{Beta}(1, \alpha_\theta)$ such that $\sum_i p_{ki} = 1$. For the number of clusters $K$, we assign a truncated Poisson prior distribution. We assign a Dirichlet distribution with parameter 1 to the mixing proportions $(\pi_1, \dots, \pi_K)$, which implies that $(\pi_1, \dots, \pi_K)$ is uniformly distributed over the $K$-dimensional simplex. The prior distribution of the atoms from the base distribution is chosen to be multivariate normal, with a known mean set as the midpoint of the observed interval of variation of the data and a variance determined by the ranges of the data in each dimension. This prior setting is adopted from [1], and we feel it is reasonable here given that the number and locations of clusters are unknown. The remaining scale hyperparameters are set to small values since the range of the observed data is small. For the group effect $g_r$, we assume it is small, with $g_r$ also generated from a DP whose base distribution has lower and upper limits defined as 10% of the absolute range of the data. The variance component of the random error, $\sigma^2$, is assumed to follow a relatively noninformative Inverse Gamma (IG) distribution. The precision parameters of the two DPs are selected by minimizing the deviance information criterion (DIC) [11, 12].
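The stick-breaking representation underlying this DP prior can be sketched as follows; this is a minimal illustration, and the truncation at a fixed number of atoms and the parameter values are our choices, not the paper's.

```python
import random

def stick_breaking_weights(alpha, k, rng):
    """First k stick-breaking weights of a DP with precision alpha:
    p_i = V_i * prod_{l<i} (1 - V_l), with V_i ~ Beta(1, alpha)."""
    weights, remaining = [], 1.0
    for _ in range(k):
        v = rng.betavariate(1.0, alpha)   # break off a fraction of the stick
        weights.append(remaining * v)
        remaining *= (1.0 - v)            # what is left for later atoms
    return weights

rng = random.Random(1)
w = stick_breaking_weights(alpha=0.5, k=20, rng=rng)
# A small alpha concentrates mass on the first few atoms, which is why the
# precision parameter governs how many subclusters effectively appear.
```

The weights sum to less than one for any finite truncation, with the remainder shrinking geometrically as more sticks are broken.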

##### 2.3. Conditional Posterior Distributions and Posterior Computing

Sampling parameter estimates from their posterior distributions can be achieved via a Gibbs sampler, in which statistical inference on the number of clusters is handled by the birth-death process. The birth-death process is a type of continuous-time Markov chain originally introduced in [13]. It is often used to simulate realizations of point processes, as these can be difficult to sample from directly [1]; the realizations are then used for likelihood inference on the model parameters [1]. The birth-death scheme allows events to occur randomly throughout the chain; each event is either a “birth” or a “death.” If a birth occurs, the number of components increases by one, while if a death occurs, the number of components decreases by one.

Recall that we consider a finite mixture prior for $\theta_{rj}$ such that all $\theta_{rj}$ are assumed independently distributed, with each generated from one of $K$ Dirichlet processes denoted $F_1, \dots, F_K$, where each $F_k$ represents a DP, $K$ is unknown, $\pi_1, \dots, \pi_K$ are the mixing proportions, and $\phi_1, \dots, \phi_K$ are the component-specific parameters for each DP. For cluster assignment, we introduce an index variable $z_{rj}$ that indicates the assignment of observation $rj$ and takes values 1 to $K$. The $z_{rj}$ are realizations of independently and identically distributed discrete random variables with probability mass function $P(z_{rj} = k) = \pi_k$, $k = 1, \dots, K$.

The joint posterior distribution is proportional to the product of the likelihood and the prior distributions, where $\Theta$ is a vector of all estimable parameters. From here, the birth-death algorithm and Markov chain can be described for introducing and assigning clusters:

1. Starting with the initial model, let $(\pi_k, \phi_k)$, $k = 1, \dots, K$, represent the mixing proportions and cluster-specific parameters for the $K$ unique clusters. Let the birth rate be $\beta$.
2. Calculate the death rate $\delta_k$ for each component.
3. Calculate the total death rate $\delta = \sum_k \delta_k$. To quicken convergence, we elected not to model the time to the next jump as exponential and allowed an event to occur at each iteration of the Markov chain.
4. Simulate the type of event, birth or death, with respective probabilities $\beta / (\beta + \delta)$ and $\delta / (\beta + \delta)$.
5. Adjust the model to reflect the birth or death as follows:
   (i) Birth: simulate a new component from each parameter's respective (independent) prior distribution, the mixing proportion $v$ from a Beta distribution, which is easily simulated from, and the cluster parameters from the base distribution, such that the model gains one component.
   (ii) Death: select a component to die with probabilities $\delta_k / \delta$ for $k = 1, \dots, K$, such that the model loses that component.
6. Given the current state of the model, simulate values for all remaining parameters.
7. Go to step 2.
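A single iteration of the birth-death scheme above can be sketched as follows. This is a simplified sketch: the death-rate and prior-sampling callbacks are placeholders for the model-specific quantities, and the rescaling of mixing proportions follows the standard birth-death construction rather than the authors' exact code.

```python
import random

def birth_death_step(weights, params, birth_rate, death_rate_fn, prior_sampler, rng):
    """One birth-death event on a mixture state (simplified sketch).

    weights       : current mixing proportions (sum to 1)
    params        : cluster-specific parameters, one per component
    death_rate_fn : returns the death rate delta_i for removing component i
    prior_sampler : draws (v, theta) for a new component from the priors
    """
    death_rates = [death_rate_fn(i) for i in range(len(params))]
    total_death = sum(death_rates)
    p_birth = birth_rate / (birth_rate + total_death)
    if rng.random() < p_birth or len(params) == 1:   # always keep >= 1 component
        v, theta = prior_sampler()                   # new proportion and parameters
        weights = [w * (1.0 - v) for w in weights] + [v]
        params = params + [theta]
    else:
        # choose a component to die with probability delta_i / total_death
        i = rng.choices(range(len(params)), weights=death_rates)[0]
        v = weights[i]
        weights = [w / (1.0 - v) for j, w in enumerate(weights) if j != i]
        params = [p for j, p in enumerate(params) if j != i]
    return weights, params
```

Rescaling by $(1 - v)$ on a birth and by $1/(1 - v_k)$ on a death keeps the mixing proportions summing to one after every event.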

Incorporating the birth-death process into our model, we need to further define the likelihood for removing cluster $k$, which determines the death rates. The birth-death process is conditional on the prespecified birth rate. By setting this birth rate, which controls how often a “birth” of a new component occurs, as suggested and done in [1], the death rates computationally reduce to a likelihood ratio free of the birth rate. In other words, the likelihood of the data drives the death rates and ultimately the decision of a new cluster. Given that the decision is a “birth,” the new cluster's parameters are sampled from their prior distributions. The mixing proportions are then adjusted: all current proportions are multiplied by $(1 - v)$ if a birth with proportion $v$ occurs, or divided by $(1 - v_k)$ if the death of component $k$ occurs.

To simulate values for all remaining parameters and hyperparameters, we implement the Gibbs sampler. Conditional posterior distributions are listed below; the conditioning symbol “$\cdot$” denotes the data and the other parameters not listed. The conditional posterior of each $\theta_{rj}$ is a mixture of unit point masses at the existing unique values, weighted by the number of foci in each subcluster within its cluster, and of a draw from the updated base distribution, as is standard for a DP. The posterior of each cluster center depends on the average of all $\theta_{rj}$ in cluster $k$ and on the number of foci in that cluster. The conditional posterior of $g_r$ is again generated from a DP, mixing unit point masses at the current unique group effects with draws from its updated base distribution. The conditional posterior distributions for the related hyperparameters follow from conjugacy, with the group-level hyperparameters depending on the average group effect.

Lastly, the sampling distribution for the error variance $\sigma^2$ is its Inverse Gamma full conditional, which follows from conjugacy.

The sampling of unique values for $\theta$ and $g$ can be performed using Neal's Algorithm 8 [14]. It works by introducing auxiliary parameters, independent of the other parameters, to represent potential values for $\theta$ and $g$ [14]. Algorithm 8 for updating the clustering assignments, denoted $z_1, \dots, z_n$, is as follows:

(i) The state of the Markov chain consists of the assignments $z = (z_1, \dots, z_n)$ and the cluster parameters $\phi$, e.g., $(\pi_k, \phi_k)$ in our application. Repeatedly sample as follows.

(ii) For $i = 1, \dots, n$, let $k^{-}$ be the number of distinct $z_j$ for $j \neq i$, and let $h = k^{-} + m$, where $m$ is the number of auxiliary components. Label these clusters with values in $\{1, \dots, k^{-}\}$. If $z_i = z_j$ for some $j \neq i$, draw values $\phi_c$ independently from the base distribution for those $c$ for which $k^{-} < c \leq h$. If $z_i \neq z_j$ for all $j \neq i$, let $z_i$ have the label $k^{-} + 1$, and draw values $\phi_c$ independently from the base distribution for those $c$ for which $k^{-} + 1 < c \leq h$. Draw a new value for $z_i$ using the following probabilities: $$P(z_i = c \mid \cdot) \propto \begin{cases} n_{-i,c}\, F(y_i, \phi_c)/b, & 1 \leq c \leq k^{-}, \\ (\alpha/m)\, F(y_i, \phi_c)/b, & k^{-} < c \leq h, \end{cases}$$ where $F(y_i, \phi_c)$ is the likelihood with parameter $\phi_c$ and observation $y_i$ involved; in our case, it is the multivariate normal density,

(iii) where $n_{-i,c}$ is the number of $z_j$ for $j \neq i$ that are equal to $c$ and $b$ is the appropriate normalizing constant. Change the state to contain only those $\phi_c$ that are now associated with one or more observations.

(iv) For all $c \in \{1, \dots, k\}$, draw a new value from the conditional distribution of $\phi_c$ given $\{y_i : z_i = c\}$, or perform some other update to $\phi_c$ that leaves this distribution invariant [14].
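One sweep of Algorithm 8 can be sketched as follows. This is a simplified sketch: labels and parameters live in plain Python structures, the likelihood and base-distribution callbacks are placeholders, and, unlike the full algorithm, a cluster emptied by removing observation $i$ contributes a fresh base draw rather than being retained as the first auxiliary component.

```python
import math
import random

def neal8_reassign(data, labels, params, alpha, m, loglik, base_sampler, rng):
    """One sweep of a simplified Neal Algorithm 8: resample each label using
    existing clusters plus m auxiliary components drawn from the base."""
    for i, y in enumerate(data):
        # cluster sizes with observation i removed (n_{-i,c})
        counts = {}
        for j, c in enumerate(labels):
            if j != i:
                counts[c] = counts.get(c, 0) + 1
        candidates = [(c, counts[c], params[c]) for c in sorted(counts)]
        # m auxiliary components, each carrying prior weight alpha / m
        candidates += [(None, alpha / m, base_sampler()) for _ in range(m)]
        # weights prop. to (count or alpha/m) * likelihood; normalization is
        # handled implicitly by rng.choices
        logw = [math.log(w) + loglik(y, th) for _, w, th in candidates]
        mx = max(logw)
        probs = [math.exp(l - mx) for l in logw]
        pick = rng.choices(range(len(candidates)), weights=probs)[0]
        c, _, th = candidates[pick]
        if c is None:                     # an auxiliary component is born
            c = max(params) + 1
            params[c] = th
        labels[i] = c
    # keep only parameters attached to at least one observation
    used = set(labels)
    for c in list(params):
        if c not in used:
            del params[c]
    return labels, params
```

In practice step (iv) would follow, redrawing each surviving $\phi_c$ from its full conditional given the observations currently assigned to cluster $c$.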

Thus, the value of $\phi$ for the foci in cluster $k$ and subcluster $i$ may be sampled from its multivariate normal full conditional, whose mean depends on the number of foci in that cluster and subcluster and on the mean of the observed data after adjusting for the group effect. The value of $g$ for the groups (studies) in group cluster $c$ (a different notation than the subclusters above) may be sampled analogously, where the full conditional depends on the number of groups in cluster $c$ and, similarly, on the mean of the observed data after adjusting for individual effects.

##### 2.4. Determining the Clusters

To estimate the number of clusters, the center of each cluster, and the cluster assignments, we implement the same least-squares Euclidean distance method introduced in [15], used previously in our work [10], and reiterated below. This method draws inferences on clusters from a set of converged MCMC iterations and chooses one iteration to supply the final estimates of the clusters and related parameters. The selected MCMC iteration is the one with the smallest Euclidean distance to the expected cluster assignments, which are estimated from an independent set of converged MCMC iterations. This approach incorporates all clustering information in the MCMC sampling process [15]:

1. After the prespecified number of MCMC burn-ins, let the MCMC simulation continue for an additional $W$ iterations. An averaged clustering matrix is then created: an $N \times N$ matrix whose entries denote how often foci $i$ and $i'$ ($i \neq i'$) are in the same cluster. Specifically, each entry is the proportion of the $W$ iterations in which the two foci share a cluster.
2. Let the MCMC run additional iterations, where, for each iteration,
   (a) create an $N \times N$ indicator matrix denoting which foci are clustered together; i.e., let the entry be 1 if foci $i$ and $i'$ are in one cluster and 0 otherwise;
   (b) use the Euclidean distance to measure the similarity between this indicator matrix and the averaged clustering matrix.
3. Among these additional iterations, select the iteration, and its respective clustering pattern, number of clusters, and parameters, that produces the smallest Euclidean distance.
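The least-squares selection above can be sketched as follows; the helper names are ours, with `working_labelings` playing the role of the $W$ working iterations and `candidate_labelings` the additional iterations.

```python
def coclustering_matrix(labels):
    """n x n indicator matrix: entry (i, j) is 1 if foci i and j share a cluster."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]

def select_iteration(working_labelings, candidate_labelings):
    """Index of the candidate labeling whose indicator matrix is closest,
    in squared Euclidean distance, to the matrix averaged over the
    working iterations."""
    n = len(working_labelings[0])
    mats = [coclustering_matrix(lab) for lab in working_labelings]
    avg = [[sum(m[i][j] for m in mats) / len(mats) for j in range(n)]
           for i in range(n)]
    best_k, best_d = 0, float("inf")
    for k, lab in enumerate(candidate_labelings):
        ind = coclustering_matrix(lab)
        d = sum((ind[i][j] - avg[i][j]) ** 2
                for i in range(n) for j in range(n))
        if d < best_d:
            best_k, best_d = k, d
    return best_k
```

Because it compares co-clustering matrices rather than raw labels, the selection is invariant to how the clusters happen to be numbered in each iteration.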

#### 3. Inference

##### 3.1. Simulation Settings

Simulations were utilized to illustrate and assess the proposed method. We assumed an fMRI metadata study setting including 50 studies, with each study containing 10 foci. Collectively, these foci were simulated from three clusters, with centers specified as (x, y, z) Talairach coordinates and containing 150, 150, and 200 foci, respectively. It was also assumed that half the data (250 foci, or half from each individual cluster) came from one study cluster and the remaining data from a second study cluster. These study clusters were linear shifts of the cluster centers. For example, 75 foci in cluster one were centered at the cluster center shifted by the first study effect and the other 75 at the center shifted by the second; the study effect was a linear shift in all three dimensions from the cluster center. We made various alterations to this general setting:

1. Normal setting: We simulated data from a multivariate normal for each cluster with the respective means described above and a small common variance. This creates spheres with little variation, and we expect the method to correctly identify the clusters.
2. Chi-squared (skewed) setting: The method's ability to cluster in the presence of abnormal patterns is an important factor in spatial clustering. For this setting, we applied the same scenario as in the normal setting for clusters 1 and 2 but simulated cluster 3 using a chi-squared distribution with 4 degrees of freedom.
3. Large-variance setting: The last scenario is designed to assess the robustness of the method with respect to the distance between and among clusters. To this end, we applied the normal setting but considered increasing levels of the variance (referred to as the large1, large2, large3, and large4 settings, respectively), representing gradually closer distances among clusters.
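For instance, the skewed cluster in the chi-squared setting could be generated as follows. This is a sketch: whether the chi-squared deviates were recentered at zero is our assumption, as is the example center.

```python
import random

def chisq(df, rng):
    """Chi-squared draw via its definition: a sum of df squared standard normals."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))

def simulate_skewed_cluster(center, n, df=4, seed=0):
    """Skewed 3-D cluster: each coordinate is the center plus a chi-squared(df)
    deviate recentered at zero (df is the chi-squared mean), giving a
    right-skewed cloud rather than a symmetric sphere."""
    rng = random.Random(seed)
    return [tuple(c + chisq(df, rng) - df for c in center) for _ in range(n)]

# Hypothetical center, matching the 200-focus size of cluster 3 in the text.
cluster3 = simulate_skewed_cluster(center=(0, 40, -20), n=200)
```

With 4 degrees of freedom, the deviates are bounded below at $-4$ but have a long right tail, which is exactly the asymmetry a normal mixture component would struggle to capture.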

For each setting, we implemented a grid search on a single dataset to estimate the values of the two precision parameters based on the minimization of the DIC. We let the candidate precision parameter values be 0.01, 0.05, 0.1, 0.5, 1, 2, and 5. Based on these estimates, 100 Monte Carlo datasets were generated, each analyzed with 2500 burn-in iterations, 500 working iterations to calculate the probability matrix for determining the clusters, and 100 additional iterations to infer the number of clusters and the individual foci cluster centers.

Model assessment consists of three evaluations: sensitivity, specificity, and accuracy. Sensitivity and specificity take their generic definitions: the proportion of foci correctly assigned to their simulated cluster and the proportion of foci correctly not assigned to a nonsimulated cluster, respectively. Accuracy is defined as the percentage of foci that are correctly clustered; note that this definition takes into account both true positives and true negatives. In addition to our methodology, we applied a very common existing clustering approach for continuous data, K-means, to our simulation settings. Although K-means cannot adjust for additional covariates, it allows a comparison to existing methods. Lastly, to highlight the advantage of using a mixture of DPs over existing clustering approaches, we applied our approach, a revised version of our approach using a mixture of multivariate normal distributions rather than DPs, and K-means to the normal and chi-squared simulation scenarios. As the emphasis was on clustering performance, the group effect was assumed to be known for these two settings.
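Since cluster labels are arbitrary, accuracy must be computed over the best relabeling of the estimated clusters. A minimal sketch, assuming the estimated number of clusters does not exceed the true number (the function name is ours, not the paper's):

```python
from itertools import permutations

def clustering_accuracy(true_labels, est_labels):
    """Fraction of correctly clustered foci under the best mapping from
    estimated cluster labels to true cluster labels (labels are arbitrary,
    so every relabeling is tried)."""
    true_ids = sorted(set(true_labels))
    est_ids = sorted(set(est_labels))
    best = 0.0
    # assumes len(est_ids) <= len(true_ids); otherwise no mapping is tried
    for perm in permutations(true_ids, len(est_ids)):
        mapping = dict(zip(est_ids, perm))
        hits = sum(mapping[e] == t for t, e in zip(true_labels, est_labels))
        best = max(best, hits / len(true_labels))
    return best
```

Exhaustive relabeling is fine for the handful of clusters considered here; with many clusters one would switch to a Hungarian-algorithm assignment instead.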

##### 3.2. Simulation Results

Table 1 summarizes the findings on the three foci-level cluster identifications and the quality of the identified clusters. The proposed method gives high sensitivities and specificities across all scenarios. The accuracy of cluster assignment is overall higher than 90% when the variation in the data is relatively small, dropping only once the clusters were large enough to overlap (scenario large4). The proposed methodology was also accurate at identifying the correct number of clusters, as indicated by the median number of individual and study clusters. In comparison, results from the K-means approach show relatively lower sensitivity and specificity (Table 2). Its accuracy is around 70% when the variations in the data are relatively small. Moreover, compared to the proposed method, the K-means approach often inferred a higher number of clusters, as indicated by the larger median number of clusters. The computation for a single dataset under the mixture of DPs took, on average, 7-8 hours on a high-performance computer (Dell cluster with 88 compute nodes, 3120 total central processing unit cores, 20664 gigabytes of RAM, and 61440 total graphics processing unit cores).