Abstract

We expand a framework for Bayesian variable selection for Gaussian process (GP) models by employing spiked Dirichlet process (DP) prior constructions over set partitions containing covariates. Our approach results in a nonparametric treatment of the distribution of the covariance parameters of the GP covariance matrix that in turn induces a clustering of the covariates. We evaluate two prior constructions: the first employs a mixture of a point mass and a continuous distribution as the centering distribution for the DP prior, thereby clustering all covariates; the second employs a mixture of a spike and a DP prior with a continuous distribution as the centering distribution, which induces clustering of the selected covariates only. DP models borrow information across covariates through model-based clustering. Our simulation results, in particular, show a reduction in posterior sampling variability and, in turn, enhanced prediction performances. In our model formulations, we accomplish posterior inference by employing novel combinations and extensions of existing algorithms for inference with DP prior models, and we compare performances under the two prior constructions.

1. Introduction

In this paper, we expand a framework for Bayesian variable selection for Gaussian process (GP) models by employing spiked Dirichlet process (DP) prior constructions over set partitions containing covariates. Savitsky et al. [1] incorporate Gaussian processes in the generalized linear model framework of McCullagh and Nelder [2] by expanding the flexibility for the response surface to lie in the space of continuous functions. Their modeling approach results in a class of nonparametric regression models where the covariance matrix depends on the predictors. GP models, in particular, accommodate high-dimensional heterogeneous covariate spaces where covariates possess different degrees of linear and non-linear association with the response (Rasmussen and Williams [3]).

In this paper, we investigate mixture prior models that induce a nonparametric treatment of the distribution of the covariance parameters of the GP covariance matrix that, in turn, induces a clustering of the covariates. Mixture priors that employ a spike at zero are now routinely used for variable selection (see, for example, George and McCulloch [4] and Brown et al. [5] for univariate and multivariate regression settings, respectively, and Sha et al. [6] for probit models) and have been particularly successful in applications to high-dimensional settings. These approaches employ mixture prior formulations for the regression coefficients that impose an a priori multiplicity penalty, as argued by Scott and Berger [7]. More recently, MacLehose et al. [8] have proposed a Bayesian nonparametric approach to the univariate linear model that incorporates mixture priors containing a Dirac measure component into the DP construction of Ferguson [9] and Antoniak [10]. Dunson et al. [11] use a similar spiked centering distribution in a logistic regression. As noted by these authors, DP models borrow information across covariates through model-based clustering, possibly induced through a spatial or temporal correlation, and can achieve improved variable selection and prediction performances with respect to models that use mixture priors alone. Within the modeling settings of MacLehose et al. [8] and Dunson et al. [11], the clustering induced by the Dirichlet process is on the univariate regression coefficients and strength is borrowed across covariates. Kim et al. [12] have adapted the DP modeling framework to provide meaningful posterior probabilities of sharp hypotheses on the coefficients of a random effects model. Their goal is not necessarily variable selection, but rather the more general problem of testing hypotheses on the model parameters. Their model formulation does not share information across covariates but rather clusters vectors of regression coefficients across observations.

While the prior constructions described above all use a mixture of a point mass and a continuous distribution as the centering distribution of the DP prior, in this paper we also investigate constructions that employ a mixture of a spike and a DP prior with a continuous distribution as the centering distribution. The former approach clusters all covariates, while the latter induces clustering of the selected covariates only. The prior formulations we adopt show a reduction in posterior sampling variability, with enhanced prediction performances, in cases where covariates express nearly the same association with the response.

In our model formulations, we accomplish posterior inference by employing novel combinations and extensions of existing algorithms for inference with DP prior models and variable selection. Unlike prior constructions for linear models, which are able to marginalize over the model space indicators and directly sample the model coefficients a posteriori, our non-linear modeling frameworks employ nonconjugate priors. We achieve robust selection results by using set partitions on which we impose a DP prior to enclose both the model and the associated parameter spaces. We optimize performances of posterior sampling with a modification of the auxiliary Gibbs algorithm of Neal [13] that accounts for a trivial cluster containing nuisance covariates. We investigate performances on simulated data under the two prior constructions. Our DP prior model constructions represent generalized nonconjugate formulations with associated posterior sampling algorithms that, while specific to GP models, may be applied to other nonconjugate settings.

The remainder of the paper is structured as follows: GP models and covariance matrix formulations are reviewed in Section 2. Section 3 introduces our spiked DP prior formulations, including separate models to cluster all covariates and only selected covariates. Sampling schemes for posterior inference are described in Section 4. Simulations are conducted in Section 5, where we compare the clustering construction to the mixture priors model and also compare the two DP prior model formulations. A benchmark dataset is analysed in Section 6. Concluding remarks are offered in Section 7.

2. Generalized Gaussian Process Models

Savitsky et al. [1] incorporate GP models within the generalized linear model framework of McCullagh and Nelder [2] by employing the relation $g(\psi_i) = z(\mathbf{x}_i)$, $i = 1, \ldots, n$ (2.1), for link function $g(\cdot)$, where $\psi_i$ is the (possibly latent) canonical parameter for the $i$th observation. A Gaussian process prior is then specified on the latent vector $\mathbf{z} = (z(\mathbf{x}_1), \ldots, z(\mathbf{x}_n))'$, that is, $\mathbf{z} \sim \mathcal{N}_n(\mathbf{0}, \mathbf{C})$ (2.2), with the covariance matrix $\mathbf{C}$ being an arbitrarily complex function of the predictors. This general construction extends to latent regression models used for continuous, categorical, and count data, obtained by choosing the appropriate link in (2.1), and incorporates in particular the univariate regression and classification contributions of Neal [14] and Linkletter et al. [15]. Continuous data regression models, for example, are obtained by choosing the identity link function in (2.1) to obtain $\mathbf{y} = \mathbf{z} + \boldsymbol{\epsilon}$, with $\boldsymbol{\epsilon} \sim \mathcal{N}_n(\mathbf{0}, r^{-1}\mathbf{I}_n)$ (2.3), with $\mathbf{y}$ being the observed response vector and with $r$ being a precision parameter. For inference, $\mathbf{z}$ can be integrated out to work with the marginalized likelihood $\mathbf{y} \sim \mathcal{N}_n(\mathbf{0}, \mathbf{C} + r^{-1}\mathbf{I}_n)$.
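
For illustration, the following is a minimal Python sketch of evaluating this marginalized likelihood; it assumes the covariance matrix $\mathbf{C}$ has already been built from the predictors, and the function name and interface are our own.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_marginal_likelihood(y, C, r):
    """Log of the marginalized Gaussian likelihood y ~ N(0, C + (1/r) I_n),
    obtained by integrating out the latent GP vector z."""
    n = len(y)
    return multivariate_normal.logpdf(y, mean=np.zeros(n), cov=C + np.eye(n) / r)
```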

Savitsky et al. [1] extend GP models to also include the class of proportional hazards models of Cox [16] by defining the hazard rate as $h(t_i) = h_0(t_i)\exp(z_i)$, where $h_0(\cdot)$ is the baseline hazard function, $t_i$ is the failure time, and $z_i = z(\mathbf{x}_i)$ is defined as in (2.2). Let the triples $(t_i, \delta_i, \mathbf{x}_i)$ indicate the data, with censoring index $\delta_i = 0$ if the observation is right censored and $\delta_i = 1$ if the associated failure time, $t_i$, is observed. Suppose that there are no ties among event/failure times and let $t_{(1)} < \cdots < t_{(D)}$ be the distinct noncensored failure times. In this paper, we use the partial likelihood formulation, defined as $L(\mathbf{z}) = \prod_{j=1}^{D} \exp(z_{(j)}) / \sum_{l \in \mathcal{R}_j} \exp(z_l)$ (2.4), where $\mathcal{R}_j = \{l : t_l \geq t_{(j)}\}$, with $\mathcal{R}_j$ being the set of individuals at risk right before $t_{(j)}$ and $z_{(j)}$ the value of $z(\mathbf{x})$ evaluated at the $j$th failure time. The use of the partial likelihood conveniently avoids prior specification of the baseline hazard.
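
For illustration, the log of this partial likelihood can be evaluated from the latent GP values as follows; the sketch assumes no tied failure times, as above, and the function name is our own.

```python
import numpy as np

def cox_log_partial_likelihood(z, time, event):
    """Log partial likelihood for a proportional hazards model with linear
    predictor z (here, the latent GP values), assuming no tied failure times.
    `event` is 1 for an observed failure and 0 for right censoring."""
    logpl = 0.0
    for j in np.where(event == 1)[0]:
        at_risk = time >= time[j]                 # individuals at risk right before t_(j)
        logpl += z[j] - np.log(np.sum(np.exp(z[at_risk])))
    return logpl
```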

2.1. Covariance Formulation

A common choice for $\mathbf{C}$ in (2.2) is a covariance function that includes a constant term and an exponential term, that is, $\mathbf{C} = \frac{1}{\lambda_a}\mathbf{J}_n + \frac{1}{\lambda_z}\exp(-\mathbf{G})$ (2.5), with $\mathbf{J}_n$ being an $n \times n$ matrix of 1's and $\mathbf{G}$ a matrix with elements $g_{ij} = \sum_{k=1}^{p}(x_{ik} - x_{jk})^2(-\log\rho_k)$, so that $\exp(-g_{ij}) = \prod_{k=1}^{p}\rho_k^{(x_{ik}-x_{jk})^2}$, with $\rho_k \in [0,1]$ associated to variable $x_k$. This single-term exponential covariance provides a parsimonious representation that enables a broad class of linear and non-linear response surfaces. In particular, Rasmussen and Williams [3] show how the exponential form (2.5) can be derived from a linear construction by expanding the inputs, the $x_{ik}$'s, into an infinite basis. The chosen parametrization allows simple prior specifications and it is also used by Linkletter et al. [15] as a transformation of the exponential term used by Neal [14] and Sacks et al. [17] in their covariance matrix formulations. This construction is sensitive to scaling and we find best results by normalizing the predictors to lie in the unit cube $[0,1]^p$. Other choices of $\mathbf{C}$, such as exponential constructions that include a second exponential term and Matern constructions, are reviewed in Savitsky et al. [1].
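
For concreteness, a minimal sketch of this covariance construction follows; the interface is our own, and note that setting $\rho_k = 1$ removes covariate $k$ from the exponential term.

```python
import numpy as np

def gp_covariance(X, rho, lam_a, lam_z):
    """Single-term exponential GP covariance of the form (2.5):
    C = (1/lam_a) J_n + (1/lam_z) exp(-G), with
    g_ij = sum_k (x_ik - x_jk)^2 * (-log rho_k)."""
    n, p = X.shape
    diff2 = (X[:, None, :] - X[None, :, :]) ** 2       # (n, n, p) squared differences
    G = np.sum(-np.log(rho) * diff2, axis=2)
    return np.ones((n, n)) / lam_a + np.exp(-G) / lam_z
```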

3. Spiked Dirichlet Process Prior Models for Variable Selection

Variable selection can be achieved in the GP modeling framework with covariance matrix of type (2.5) by imposing mixture priors on the covariance parameters, that is, $\pi(\rho_k \mid \gamma_k) = \gamma_k\,\pi_c(\rho_k) + (1-\gamma_k)\,\delta_1(\rho_k)$ (3.1), for $k = 1, \ldots, p$, which employs a continuous prior $\pi_c(\rho_k)$ on $[0,1]$ for $\gamma_k = 1$ and a $\delta_1(\rho_k)$, that is, a point mass distribution at one, for $\gamma_k = 0$. This formulation is similar in spirit to the use of mixture priors employed for variable selection in linear regression models, as, for example, in George and McCulloch [4] and Brown et al. [5] for univariate and multivariate regression settings, respectively.
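
For illustration, a draw from this spiked mixture prior may be generated as follows; the Uniform(0,1) slab and the function name are our own choices, made only to keep the sketch self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_spiked_prior(p, w):
    """Draw (gamma, rho) from the spiked mixture prior: gamma_k ~ Bernoulli(w);
    rho_k = 1 (point mass, covariate excluded) when gamma_k = 0, and rho_k drawn
    from a continuous slab (a Uniform(0,1) here, purely for illustration) when
    gamma_k = 1."""
    gamma = rng.binomial(1, w, size=p)
    rho = np.where(gamma == 1, rng.uniform(0.0, 1.0, size=p), 1.0)
    return gamma, rho
```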

In this paper, we embed mixture priors for variable selection into Dirichlet process prior models that cluster covariates to strengthen selection. The Dirichlet process (DP) construction of Ferguson [9] and Antoniak [10] is a typical choice for a prior on an unknown distribution, $F$. In particular, given a set of a priori i.i.d. parameters, $\theta_1, \ldots, \theta_p$, with $\theta_k \mid F \sim F$, we define the DP prior on $F$ as $F \sim \mathrm{DP}(\alpha, F_0)$, where $F_0$ is the parametric base measure defining the prior mean, $E[F] = F_0$. The concentration parameter, $\alpha$, expresses the prior confidence in the base measure. Draws from $F$ are discrete a.s., implying a positive probability of ties that instantiate randomly generated partitions. Indeed, many contributions in nonparametric Bayesian inference are formulated in terms of random partition models, that is, probability models that cluster the set of experimental units. See Quintana [18] for a nice review of nonparametric Bayesian models.

Here we introduce probability distributions on set partitions with a particular focus on clustering the covariates (through the a priori i.i.d. covariance parameters, $\theta_1, \ldots, \theta_p$), rather than the usual choice of i.i.d. observations. Let $\theta_j^{*}$, for $j = 1, \ldots, K$, define the unique values of $(\theta_1, \ldots, \theta_p)$, and let us define the clusters as $S_j = \{k : \theta_k = \theta_j^{*}\}$. Let $\mathcal{P}$ indicate the space of all possible partitions of the $p$ covariates. The partition $\Delta = \{S_1, \ldots, S_K\} \in \mathcal{P}$ captures a particular disjoint clustering of the covariates, with $S_j \cap S_{j'} = \emptyset$ for $j \neq j'$, such that we recover the full set of covariates in the disjoint union, $\bigcup_{j=1}^{K} S_j = \{1, \ldots, p\}$. The DP provides the Pólya urn scheme of Blackwell and MacQueen [19] by marginalizing over $F$ to define a joint prior construction for a particular partition, $\pi(\Delta \mid \alpha) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + p)}\,\alpha^{K}\prod_{j=1}^{K}\Gamma(n_j)$, where $\Gamma(\cdot)$ is the gamma function and $n_j$ the number of covariates in cluster $S_j$. Higher values of $\alpha$ tend to produce a larger number of clusters. This is evident if we factorize the joint prior as $\pi(c_k = j \mid \mathbf{c}_{-k}) = n_{-k,j}/(\alpha + p - 1)$ for existing clusters $j = 1, \ldots, K_{-k}$, and $\pi(c_k = K_{-k}+1 \mid \mathbf{c}_{-k}) = \alpha/(\alpha + p - 1)$ for a new cluster (3.3), where we introduce cluster indicators, $c_k = j$ if $\theta_k = \theta_j^{*}$, and employ exchangeability for $(\theta_1, \ldots, \theta_p)$ to treat covariate $k$ as the last one added. Here $n_{-k,j}$ indicates the number of covariates, excluding covariate $k$, allocated to the nontrivial cluster $S_j$. Similarly, $K_{-k}$ captures the total number of clusters when excluding covariate $k$. In particular, this construction of the conditional prior reveals that the prior probability for covariate $k$ to be clustered with any particular other covariate is uniform, $1/(\alpha + p - 1)$ for all $k' \neq k$, while a new cluster is opened with probability $\alpha/(\alpha + p - 1)$. We complete the prior specification with a Gamma prior, $\alpha \sim \mathrm{Ga}(a_{\alpha}, b_{\alpha})$, to allow the data to update the concentration parameter for a fully Bayesian approach. It is important to note that our prior construction is over set partitions that contain covariates and that all the observations are in every cluster. We next develop two specific and alternative prior formulations: the first permits clustering on all covariates, trivial and selected, and the second one focuses on clustering only the selected covariates.
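
As an illustration, the conditional allocation probabilities in (3.3) can be computed as follows; the function name and interface are our own.

```python
import numpy as np

def polya_urn_probs(c_minus_k, alpha, p):
    """Prior allocation probabilities for covariate k given the cluster labels of
    the remaining p - 1 covariates, following the Polya urn factorization (3.3).
    Returns the existing cluster labels and a probability vector whose last entry
    is the probability of opening a new cluster."""
    labels, counts = np.unique(c_minus_k, return_counts=True)
    probs = np.append(counts, alpha) / (alpha + p - 1)
    return labels, probs
```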

3.1. Clustering All Covariates

The first prior construction we consider employs the mixture prior as the centering distribution for the DP prior model, therefore clustering all covariates. Let us consider the covariance parameters, $\theta_k = (\rho_k, \gamma_k)$ with $\theta_k \mid F \sim F$, for $k = 1, \ldots, p$. We proceed with the usual DP prior construction, which encloses the mixture prior on $\rho_k$ and the Bernoulli prior on $\gamma_k$ in the base distribution, $F \sim \mathrm{DP}(\alpha, F_0)$ with $F_0 = \pi(\rho_k \mid \gamma_k)\,\mathrm{Bernoulli}(\gamma_k \mid w)$ (3.4), where $\pi(\rho_k \mid \gamma_k)$ is the spiked mixture prior (3.1) and $w$ is the prior probability of covariate inclusion. A further Beta prior can be placed on $w$ to introduce additional variation. Under model sparsity, we a priori expect most covariates to be excluded from the model space, which we accomplish by allocating the $\rho_k$ associated with a nuisance covariate to the Dirac measure component of the conditional mixture prior under the setting $\gamma_k = 0$, effectively reducing the dimensionality of the parameter space. Our clustering model (3.4) therefore strengthens exclusion of nuisance covariates with a prior construction that co-clusters nuisance covariates. Let us define the trivial cluster as the cluster with unique value $(\rho^{*}, \gamma^{*}) = (1, 0)$, with the star symbol indicating unique values. The trivial cluster can be extracted into a separate line in the conditional prior formulation over the set partitions from (3.3), yielding $\pi(c_k = 0 \mid \mathbf{c}_{-k}) = n_{-k,0}/(\alpha + p - 1)$ for the trivial cluster, $\pi(c_k = j \mid \mathbf{c}_{-k}) = n_{-k,j}/(\alpha + p - 1)$ for the nontrivial clusters $j = 1, \ldots, K_{-k}$, and $\pi(c_k = K_{-k}+1 \mid \mathbf{c}_{-k}) = \alpha/(\alpha + p - 1)$ for a new cluster (3.5). Notice how prior (3.5) strengthens selection by aggregating all trivial covariates into a single cluster. Later in the paper we will employ a data augmentation approach to draw posterior samples from this nonconjugate model formulation.

3.2. Clustering Selected Covariates

Alternatively, we can use prior models that employ a mixture of a spike and a DP prior with a continuous distribution as the centering distribution, therefore inducing clustering of the selected covariates only. We construct this model as $\rho_k \mid \gamma_k, F \sim \gamma_k\,F + (1 - \gamma_k)\,\delta_1(\rho_k)$, with $F \sim \mathrm{DP}(\alpha, F_0)$ and $F_0$ a continuous distribution on $[0,1]$, which may be written more intuitively as $\rho_k \mid F \sim F$ for $\gamma_k = 1$ and $\rho_k = 1$ for $\gamma_k = 0$. This formulation confines the set partitions to cluster only the selected covariates. While the dimension of the selected covariates, $p_{\gamma} = \sum_{k=1}^{p}\gamma_k$, will change at every iteration of the MCMC algorithm for posterior inference, we may still marginalize over $F$, given $\boldsymbol{\gamma}$, to produce the Pólya urn prior formulation (3.3) where we set $p = p_{\gamma}$, that is, for covariates with $\gamma_k = 1$, $\pi(c_k = j \mid \mathbf{c}_{-k}, \boldsymbol{\gamma}) = n_{-k,j}/(\alpha + p_{\gamma} - 1)$ for $j = 1, \ldots, K_{-k}$ and $\pi(c_k = K_{-k}+1 \mid \mathbf{c}_{-k}, \boldsymbol{\gamma}) = \alpha/(\alpha + p_{\gamma} - 1)$ (3.7). We note that the normalizing expression in the denominator now uses $\alpha + p_{\gamma} - 1$, rather than $\alpha + p - 1$, to account for our reduced clustering set. Trivial covariates are not clustered together, so we a priori expect reduced efficacy in removing trivial covariates from the model space. In other words, this prior construction produces a relatively flatter prior for assignment to nontrivial clusters under model sparsity as compared to (3.5). Yet, we expect an improvement in computational speed as we are clustering only $p_{\gamma} \leq p$ covariates.

4. Markov Chain Monte Carlo Methods

We accomplish posterior inference by employing novel combinations and extensions of existing algorithms for inference with DP models and variable selection. In particular, we adapt the auxiliary Gibbs algorithm of Neal [13] to data augmentation schemes that draw posterior samples for our nonconjugate model formulations.

First we extend our notation and use $\boldsymbol{\rho}_{\gamma}$ to indicate the vector of covariance parameters in which $\rho_k = 1$ when $\gamma_k = 0$, for $k = 1, \ldots, p$. For model (3.5), we collect the covariance parameters in $\boldsymbol{\theta}^{*} = (\boldsymbol{\rho}^{*}, \boldsymbol{\gamma}^{*})$, where the star indicates the unique cluster values of $(\boldsymbol{\rho}, \boldsymbol{\gamma})$. For model (3.7), we have the pair $(\boldsymbol{\rho}^{*}, \boldsymbol{\gamma})$, to reflect the fact that covariate selection is done separately from the clustering.

We recall the augmented data likelihood of Savitsky et al. [1], employed as a generalized formulation for GP models to construct the joint posterior distribution over model parameters, $L(\mathbf{D}, \mathbf{z} \mid \cdot) = f(\mathbf{D} \mid \mathbf{z}, \boldsymbol{\Psi})\,\mathcal{N}_n(\mathbf{z} \mid \mathbf{0}, \mathbf{C})$ (4.1), with $f(\mathbf{D} \mid \mathbf{z}, \boldsymbol{\Psi})$ and $\mathcal{N}_n(\mathbf{z} \mid \mathbf{0}, \mathbf{C})$ to capture the observed data augmented by the unobserved GP variate, $\mathbf{z}$, in the case of latent responses, such as for the survival model (2.4). Note that the joint posterior depends on the concentration parameter for clustering covariates, $\alpha$, through the prior on the set partitions. We collect all other model-specific parameters in $\boldsymbol{\Psi}$; for example, $\boldsymbol{\Psi}$ contains the precision parameter $r$ for the univariate regression model (2.3).

4.1. Clustering All Covariates

We first define our sampling algorithm using the covariate clustering construction (3.5), which includes all covariates, both trivial and selected. Our MCMC algorithm sequentially samples the cluster indicators, the cluster parameters, the DP concentration parameter, and the remaining model-specific parameters in a Gibbs-type fashion. We improve the efficiency of the auxiliary Gibbs algorithm of Neal [13] used to sample the cluster indicators by making a modification that avoids duplicate draws of the trivial cluster. The sampling scheme we propose is as follows.

(1) Update $\mathbf{c}$: The auxiliary Gibbs algorithm of Neal [13] achieves sampling of the cluster indicators by introducing temporary auxiliary parameters, typically generated from the base distribution (3.4). While multiple draws of nontrivial clusters are almost surely unique, repeated draws of the trivial cluster are entirely duplicative. We make a modification to the auxiliary Gibbs algorithm by ensuring that our state space always contains the trivial cluster, therefore avoiding duplicate generations. The algorithm employs a tuning parameter, $m$, as the number of temporary auxiliary parameters to be generated from the prior to facilitate sampling each $c_k$ at every MCMC iteration. We begin by drawing the auxiliary parameters from the conditional prior given the current state space values. If one of the possibly multiple auxiliary parameters has a connection to the trivial state, we sample this value as the trivial cluster; otherwise, the auxiliary parameters are independent of the trivial state and are sampled as nontrivial clusters, as in the original auxiliary Gibbs algorithm. Next, we draw the cluster indicator, $c_k$, in a Gibbs step from the conditional posterior over the set partitions with a state that includes our auxiliary parameters, where the augmented likelihood (4.1) is evaluated at the unique parameter associated with each cluster index. The probability of assigning a covariate to each of the $m$ new clusters is proportional to $\alpha/m$. Neal [13] notes that larger values of $m$ produce posterior draws of lower autocorrelation (a schematic sketch of this update is given after this list).

(2) Update $\boldsymbol{\theta}^{*}$: We update the cluster parameters, $\boldsymbol{\theta}^{*} = (\boldsymbol{\rho}^{*}, \boldsymbol{\gamma}^{*})$, using a Metropolis-within-Gibbs scheme. This scheme consists of two moves: a between-models move to jointly update $(\rho_j^{*}, \gamma_j^{*})$ in a componentwise fashion, and a within-model move to update the $\rho^{*}$ values associated with covariates in the current model after the between-models move. We use uniform proposals for the $\rho^{*}$'s. Under our clustering formulation, we update the clusters, and not the covariates, therefore borrowing strength among co-clustered covariates.

(3) Update $\alpha$: We employ the two-step Gibbs sampling algorithm of Escobar and West [20], constructed as a posterior mixture of two Gamma distributions with the mixture component drawn from a beta distribution. The algorithm is facilitated by the conditional independence of $\alpha$ from the remaining parameters, given the number of clusters, $K$.

(4) Update $\boldsymbol{\Psi}$: The model-specific parameters are updated using Metropolis-Hastings moves. Proposals are generated from Gamma distributions centered at the previously sampled values.
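
To fix ideas, the following is a simplified Python sketch of one auxiliary Gibbs update in the spirit of Neal [13] (Algorithm 8) for a single covariate. The likelihood callback, the base-measure sampler, and the bookkeeping are schematic choices of our own; in particular, the sketch omits the reuse of singleton parameters, the cleanup of emptied clusters, and our modification that keeps the trivial cluster permanently in the state space.

```python
import numpy as np

def update_c_k(k, c, thetas, alpha, m, log_lik, draw_from_base, rng):
    """One auxiliary-Gibbs update of the cluster indicator c[k]. `thetas` maps
    cluster label -> parameter value, `log_lik(theta, k)` returns the
    log-likelihood contribution of assigning parameter theta to covariate k,
    and `draw_from_base()` samples a fresh parameter from the base distribution."""
    others = np.delete(c, k)
    labels, counts = np.unique(others, return_counts=True)
    aux = [draw_from_base() for _ in range(m)]             # temporary auxiliary components
    log_w = np.concatenate([
        np.log(counts) + np.array([log_lik(thetas[j], k) for j in labels]),
        np.log(alpha / m) + np.array([log_lik(t, k) for t in aux]),
    ])
    w = np.exp(log_w - log_w.max())                        # stabilized weights
    idx = rng.choice(len(w), p=w / w.sum())
    if idx < len(labels):                                   # join an existing cluster
        c[k] = labels[idx]
    else:                                                    # open a new cluster from an auxiliary draw
        new_label = max(thetas) + 1 if thetas else 0
        thetas[new_label] = aux[idx - len(labels)]
        c[k] = new_label
    return c, thetas
```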

4.2. Clustering Selected Covariates

We describe the steps to perform updates for the cluster indicators, the selection indicators, the unique covariance parameters, the concentration parameter, and the model-specific parameters under model (3.7).

(1) Update $\mathbf{c}$: Obtain draws in a Gibbs step for the cluster indicators of the currently selected covariates, employing the auxiliary Gibbs algorithm in the usual way according to the conditional posterior.

(2) Update $\boldsymbol{\gamma}$: We update $\boldsymbol{\gamma}$ componentwise by employing a Metropolis-within-Gibbs algorithm. Notice that an update that adds a covariate also adds a cluster and, similarly, the removal of a covariate also discards a cluster in the case where the cluster contains only the deleted covariate.

(3) Update $\boldsymbol{\rho}^{*}$: We update the unique $\rho^{*}$ values for the previously selected covariates in a componentwise fashion using a Metropolis-within-Gibbs algorithm and uniform proposals.

(4) Update $\alpha$: As in Step 3 of the previous algorithm.

(5) Update $\boldsymbol{\Psi}$: As in Step 4 of the previous algorithm.

For both MCMC schemes, final selection of the variables is accomplished by employing a cutoff value for the marginal posterior probabilities of inclusion of single variables based on a target expected False Discovery Rate (EFDR), in a fashion similar to Newton et al. [21]. For example, let $p_k$ be the posterior probability of the event $\{\gamma_k = 1\}$, that is, of a significant association of the $k$th predictor with the response. We fix $\tau$, a prespecified false discovery rate, and select those covariates with posterior probabilities of exclusion under the null hypothesis, $q_k = 1 - p_k$, that are below a threshold, $\kappa$, that is, those with $\mathbb{1}(q_k \leq \kappa) = 1$, with $\mathbb{1}(\cdot)$ the indicator function. As noted by Kim et al. [12], the optimal posterior threshold, $\kappa^{*}$, may be determined as the maximum value in the set $\{\kappa : \sum_{k=1}^{p} q_k\,\mathbb{1}(q_k \leq \kappa) / \sum_{k=1}^{p}\mathbb{1}(q_k \leq \kappa) \leq \tau\}$.
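
A minimal sketch of this thresholding rule follows; given the posterior exclusion probabilities, it returns the largest threshold whose expected false discovery rate stays below the target, together with the implied selection. The function name and interface are our own.

```python
import numpy as np

def efdr_threshold(q, tau):
    """q[k] = 1 - Pr(gamma_k = 1 | data) is the posterior exclusion probability.
    Return the largest threshold kappa (over the observed q values) whose
    expected FDR, the mean of q over the selected set, does not exceed tau,
    together with the boolean selection it implies."""
    best = None
    for kappa in np.sort(np.unique(q)):
        selected = q <= kappa
        if q[selected].mean() <= tau:
            best = kappa
    if best is None:
        return None, np.zeros_like(q, dtype=bool)
    return best, q <= best
```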

5. Simulation Study

We explore performances of the proposed models on simulated data and on a benchmark dataset. Results will show that the application of DP priors may supply a reduction in posterior sampling variability that, in turn, enhances prediction performances, for cases where there is an expected clustering among covariates. We investigate performances under the two prior constructions described above.

5.1. Hyperparameters Setting

In all examples below, we generally follow the guidelines for hyperparameter settings given in Savitsky et al. [1] for prior settings related to the mixture prior construction of Section 3 and to the specific data models. In particular, we employ priors on the covariance parameters $\lambda_a$ and $\lambda_z$. In addition, we center and normalize the response and transform the design matrix to lie in $[0,1]^p$ to produce a small intercept term, which in turn supplies a better conditioned GP covariance matrix. Savitsky et al. [1] note little sensitivity of the results to the choice of $w$, the prior expectation of covariate inclusion, and we use a common value of $w$ in all examples below. In the univariate regression model (2.3), the parameters of the prior on the error precision term, $r$, should be set to reflect the a priori expected residual variance, and we choose them accordingly.

As for the DP priors, we choose the hyperparameters of the Gamma prior on $\alpha$ to produce a prior expected number of clusters of about 7.5 for the number of covariates used below. We briefly discuss sensitivity to the choice of these hyperparameter settings in the simulation results.
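
For a fixed value of $\alpha$, the prior expected number of clusters among $p$ exchangeable draws from a DP follows the standard identity $E[K] = \sum_{i=1}^{p} \alpha/(\alpha + i - 1)$, which can be used to guide this choice. A short sketch follows; the particular values of $\alpha$ and $p$ are placeholders, not the settings used in the paper.

```python
import numpy as np

def expected_num_clusters(alpha, p):
    """Prior expected number of clusters for p draws from a DP(alpha, F0):
    E[K] = sum_{i=1}^{p} alpha / (alpha + i - 1)."""
    i = np.arange(1, p + 1)
    return np.sum(alpha / (alpha + i - 1))

# Placeholder values, for illustration only:
print(expected_num_clusters(alpha=2.0, p=30))
```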

5.2. Clustering versus No Clustering

We first consider the univariate regression model (2.3) and compare the performance of covariate clustering under (3.5) with the original GP construction (3.1) of Savitsky et al. [1]. With the latter approach, we employ their MCMC-scheme 2 algorithm to accomplish posterior inference. Results we report were obtained by using a response kernel that includes both linear and non-linear associations and where subgroups of covariates share the same functional form, to induce clustering. The response kernel is constructed from the first 7 of the simulated covariates. We do 10,000 MCMC iterations, discarding half as burn-in. Results are presented in Figure 1; plots (a) and (b) present box plots of the posterior samples of the $\rho_k$'s without clustering and under the clustering model (3.5), respectively. Only the first 20 covariates are displayed, to help visualization. One readily notes both the reduced spread between covariates sharing the same functional form and the reduced spread of the within-covariate sampled values of $\rho_k$ for all covariates under application of our clustering model. Such reduction in the within and between spreads of the posterior samples affects, in turn, prediction performances. In particular, we assessed prediction accuracy by looking at the mean-squared prediction error (MSPE) normalized by the variance of the randomly selected test set, which we term "normalized MSPE". The normalized MSPE declines from 0.12 to 0.02 under application of our clustering model. We further applied the least-squares posterior clustering algorithm of Dahl [22], which chooses among the sampled partitions, post burn-in, those that are closest to the empirical pairwise clustering probabilities obtained from averaging over posterior samples. Our model returned the correct 3 clusters.
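
For reference, the normalized MSPE we report is simply the test-set mean-squared prediction error divided by the variance of the test responses; a short Python sketch (function name ours):

```python
import numpy as np

def normalized_mspe(y_test, y_pred):
    """Mean-squared prediction error on a held-out test set, normalized by the
    variance of the test responses."""
    return np.mean((y_test - y_pred) ** 2) / np.var(y_test)
```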

5.3. Clustering All versus Selected Covariates

Next we compare performances for the two prior models (3.5) and (3.7), clustering all covariates and selected covariates only, respectively. We conduct this comparison under the Cox survival model (2.4). The latent response kernel is again constructed from a subset of the simulated covariates, in a way designed to induce clustering among them. We generate the observed survival times from the latent responses and subsequently randomly censor 5% of them. Figure 2 presents the results for clustering all covariates (plots (a), (c)) and only selected covariates (plots (b), (d)). Again, we see the expected clustering behavior among the selected covariates in both models, with a slightly less sharp cluster separation in the latter case, indicating a reduction in the borrowing of strength among co-clustered covariates. We further experimented with the prior expected number of clusters by varying the hyperparameters of the Gamma prior on $\alpha$, and found a further slight reduction of the within-covariate sampling spread for the selected covariates as the prior expected number of clusters increases, likely resulting from the greater tendency to produce more clusters.
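
Since the exact simulation design is not reproduced here, the following is only a generic sketch of how right-censored survival data can be generated from latent GP values under a proportional hazards model; the unit-rate exponential baseline hazard and the simple 5% random-censoring mechanism are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_censored_survival(z, censor_frac=0.05):
    """Generate right-censored survival data from latent values z under a
    proportional hazards model with hazard exp(z_i) (unit-rate exponential
    baseline, an illustrative assumption)."""
    t = rng.exponential(scale=1.0 / np.exp(z))                  # failure times
    event = np.ones(len(t), dtype=int)
    censored = rng.choice(len(t), size=int(censor_frac * len(t)), replace=False)
    t[censored] *= rng.uniform(0.0, 1.0, size=len(censored))    # censoring time before failure
    event[censored] = 0
    return t, event
```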

It is worth noting that, under model sparsity, the MCMC algorithm for the model clustering selected covariates only is about 13-14 times faster than the one for the model that clusters all covariates. More precisely, results presented here for the former model were obtained with a computation of 4600 CPU-seconds, as compared to 63000 CPU-seconds for the latter model clustering all covariates. This is not surprising under model sparsity, since the model formulation clustering all covariates assigns all $p$ covariates to clusters on every MCMC iteration, while the construction clustering selected covariates assigns only the $p_{\gamma} \leq p$ selected ones. Computation times do of course increase for both clustering methods proportionally to the number of true clusters. Reported CPU run times were based on utilization of Matlab with a 2.4 GHz Quad Core (Q6600) PC with 4 GB of RAM running 64-bit Windows XP.

6. Benchmark Data Application

We analyze the Boston Housing data of Breiman and Friedman [23] using the covariate clustering model (3.5) that includes all covariates. This data set relates a set of predictors to the median value of owner-occupied homes in census tracts in the Boston metropolitan area; see Breiman and Friedman [23] for a detailed description of the predictors. We held out a randomly chosen validation set of 250 observations. Figure 3 compares box plots of marginal posterior samples of $\rho_k$ for all covariates in the following two models: (a) excluding clustering (result reproduced from Savitsky et al. [1]), and (b) clustering all covariates using (3.5). The normalized MSPEs were 0.1 and 0.09, respectively. Clustering covariates therefore induces a relatively modest improvement in performance, though by itself this is not a clear indicator in favor of this formulation.

We again observe some reduction in spread in the posterior samples for the clustered covariates with respect to the formulation of Savitsky et al. [1] that does not cluster covariates, though the effect is less pronounced than what we observed in our simulations. When clustering covariates, posterior samples for covariates that are frequently co-clustered during the MCMC tend to express greater location alignment and general distributional similarity of the sampled values. Based on alignment of posterior sampled values, a simple visual inspection of plot (b) of Figure 3 suggests two clusters. Indeed, the posterior configuration with the minimum score under the least squares clustering algorithm of Dahl [22], which provides an analytical approach for selecting clusters from among the posterior partitions, contained one of these two groups along with a separate cluster capturing the remaining covariates, and the set partition with the second lowest least squares deviation score identifies the second group as well. These results then generally support our subjective visual interpretation.
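
For completeness, a small sketch of Dahl's least-squares selection as used here; the array layout and function name are our own. Given post burn-in samples of the cluster label vectors, it computes the empirical pairwise co-clustering probabilities and returns the sampled partition closest to them in squared distance.

```python
import numpy as np

def dahl_least_squares(partitions):
    """partitions: (S, p) array of sampled cluster-label vectors (post burn-in).
    Returns the sampled partition minimizing the squared distance to the
    empirical pairwise co-clustering probabilities, and those probabilities."""
    partitions = np.asarray(partitions)
    S, p = partitions.shape
    co = np.zeros((p, p))
    for c in partitions:
        co += c[:, None] == c[None, :]
    co /= S                                                  # empirical pairwise probabilities
    scores = [np.sum(((c[:, None] == c[None, :]) - co) ** 2) for c in partitions]
    return partitions[int(np.argmin(scores))], co
```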

7. Discussion

In this paper, we have expanded the framework for Bayesian variable selection for generalized Gaussian process (GP) models by employing spiked Dirichlet process (DP) prior constructions over set partitions containing covariates. Our approach results in a nonparametric treatment of the distribution of the covariance parameters of the GP covariance matrix that in turn induces a clustering of the covariates. We have proposed MCMC schemes for posterior inference that use modifications of the auxiliary Gibbs algorithm of Neal [13] to facilitate posterior sampling under model sparsity while avoiding the generation of duplicate trivial clusters. Our simulation results have shown a reduction in posterior sampling variability and enhanced prediction performances. In addition, we have evaluated two prior constructions: the first employs a mixture of a point mass and a continuous distribution as the centering distribution for the DP prior, thereby clustering all covariates; the second employs a mixture of a spike and a DP prior with a continuous distribution as the centering distribution, which induces clustering of the selected covariates only. While the former prior construction achieves a better cluster separation, clustering only the selected covariates is computationally more efficient.

In the future, it may be interesting to extend our nonparametric covariate clustering models to hierarchical structures that impose some prior dependence among covariates. Another possible extension of our modeling framework includes augmentation with the simultaneous employment of a complementary clustering of observations in a Dirichlet mixture construction incorporating the regression error term of the model. There is no inherent conflict between these two schemes since all observations are in every covariate cluster.

Acknowledgments

M. Vannucci is partially supported by NIH-NHGRI, Grant number R01-HG003319 and by NSF-DMS, Grant number 1007871. T. Savitsky was supported under NIH-NCI Grant T32 CA096520.