About this Journal Submit a Manuscript Table of Contents
International Journal of Biomedical Imaging
Volume 2011 (2011), Article ID 350838, 13 pages
http://dx.doi.org/10.1155/2011/350838
Research Article

Multiclass Sparse Bayesian Regression for fMRI-Based Prediction

1PARIETAL Team, INRIA Saclay-Île-de-France, 91191 Saclay, France
2Laboratoire de Mathématiques, Université Paris-Sud 11, 91400 Orsay, France
3CEA, DSV, I2BM, NeuroSpin, 91191 Gif-sur-Yvette, France
4CEA, DSV, I2BM, INSERM U562, 91191 Gif-sur-Yvette, France
5SELECT Team, INRIA Saclay-Île-de-France, 91400, France

Received 23 December 2010; Revised 3 March 2011; Accepted 7 April 2011

Academic Editor: Kenji Suzuki

Copyright © 2011 Vincent Michel et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Inverse inference has recently become a popular approach for analyzing neuroimaging data, by quantifying the amount of information contained in brain images on perceptual, cognitive, and behavioral parameters. As it outlines brain regions that convey information for an accurate prediction of the parameter of interest, it allows to understand how the corresponding information is encoded in the brain. However, it relies on a prediction function that is plagued by the curse of dimensionality, as there are far more features (voxels) than samples (images), and dimension reduction is thus a mandatory step. We introduce in this paper a new model, called Multiclass Sparse Bayesian Regression (MCBR), that, unlike classical alternatives, automatically adapts the amount of regularization to the available data. MCBR consists in grouping features into several classes and then regularizing each class differently in order to apply an adaptive and efficient regularization. We detail these framework and validate our algorithm on simulated and real neuroimaging data sets, showing that it performs better than reference methods while yielding interpretable clusters of features.

1. Introduction

In the context of neuroimaging, machine learning approaches have been used so far to address diagnostic problems, where patients were classified into different groups based on anatomical or functional data. By contrast, in cognitive studies, the standard framework for functional or anatomical brain mapping was based on mass univariate inference procedures [1]. Recently, a new way of analyzing functional neuroimaging data has emerged [2, 3], and it consists in assessing how well behavioral information or cognitive states can be predicted from brain activation images such as those obtained with functional magnetic resonance imaging (fMRI). This approach opens new ways for understanding the mental representation of various perceptual and cognitive parameters, which can be regarded as the study of the corresponding neural code, albeit at a relatively low spatial resolution. The accuracy of the prediction of the behavioral or cognitive target variable, as well as the spatial layout of predictive regions, can provide valuable information about functional brain organization; in short, it helps to decode the brain system [4].

Many different pattern recognition and machine leaning methods have been used to extract information from brain images and compare it to the corresponding target. Among them, Linear Discriminant Analysis (LDA) [3, 5], Support Vector Machine (SVM) [69], or regularized prediction [10, 11] has been particularly used. The major bottleneck in this kind of analytical framework is that there are far more features than samples, so that the problem is plagued by the curse of dimensionality, leading to overfitting. Dimension reduction can be used to extract relevant information from the data, the standard approach in functional neuroimaging being feature selection (e.g., Anova) [3, 6, 11, 12]. However, by performing feature selection and parameter estimation separately, such approach is not optimal. Thus, a popular combined selection/estimation scheme, Recursive Feature Elimination [13], may be used. However, this approach relies on a specific heuristic, which does not guarantee the optimality of the solution and is particularly costly. By contrast, there is great interest in sparsity-inducing regularizations, which optimize both simultaneously.

In this paper, we assume that the code under investigation is about some scalar parameter that characterizes the stimuli, such as a scale/shape parameters but possibly also position, speed (assuming a 1-D space), or cardinality. Thus, we focus on regression problems and defer the generalization to classification to future work. Let us introduce the following predictive linear model: 𝐲=𝐗𝐰+𝑏,(1) where 𝑦 represents the behavioral variable and (𝐰,𝑏) are the parameters to be estimated on a training set. A vector 𝐰𝑝 can be seen as an image; 𝑝 is the number of features (or voxels), and 𝑏 is called the intercept. The matrix 𝐗𝑛×𝑝 is the design matrix. Each row is a 𝑝-dimensional sample, that is, an activation map related to the observation. With 𝑛𝑝, the estimation of 𝐰 is ill posed.

To cope with the high dimensionality of the data, one can penalize the estimation of 𝐰, for example, based on the 2 norm of the weights. Classical regularization schemes have been used in functional neuroimaging, such as the Ridge regression [14], Lasso [15], or Elastic Net regression [16]. However, these approaches require the amount of penalization to be fixed beforehand and possibly optimized by cross-validation. To deal with the choice of the amount of penalization, one can use the Bayesian regression techniques, which include the estimation of regularization parameters in the whole estimation procedure. Standard Bayesian regularization schemes are based on the fact that a penalization by weighted 2 norm is equivalent to setting the Gaussian priors on the weights 𝐰: 𝐰𝒩0,𝐴1𝛼,𝐴=diag1,,𝛼𝑝,[]𝑖1,,𝑝,𝛼𝑖+,(2) where 𝒩 is the Gaussian distribution and 𝛼𝑖 the precision of the 𝑖th feature. The model in (2) defines two classical Bayesian regression schemes. The first one is Bayesian Ridge Regression (BRR) [17], which corresponds to the particular case 𝛼1==𝛼𝑚. By regularizing all the features identically, BRR is not well suited when only few features are relevant. The second classical scheme is Automatic Relevance Determination (ARD) [18], which corresponds to the case 𝛼𝑖𝛼𝑗 if 𝑖𝑗. The regularization performed by ARD is very adaptive, as all the weights are regularized differently. However, by regularizing each feature separately, ARD is prone to underfitting when the model contains too many regressors [19] and also suffers from convergence issues [20].

These classical Bayesian regularization schemes have been used in fMRI inverse inference studies [10, 14, 21]. However, these studies used only sparsity as built-in feature selection and do not consider neuroscientific assumptions for improving the regularization (i.e., within the design of the matrix 𝐴). Indeed, due to the intrinsic smoothness of functional neuroimaging data [22], predictive information is rather encoded in different groups of features sharing similar information. A potentially more adapted approach is the Bayesian regression scheme presented in [23], which regularizes patterns of voxels differently. The weights of the model are defined by 𝐰=𝑈𝜂, where 𝑈 is a matrix defined as set of spatial patterns (one pattern by column) and 𝜂 are the parameters of the decomposition of 𝐰 in the basis defined by 𝑈. The regularization is controlled through the covariance of 𝜂, which is assumed to be diagonal with only 𝑚 possible different values cov(𝜂)=exp(𝜆1)𝐈(𝟏)++exp(𝜆𝑚)𝐈(𝐦).

The matrices 𝐈(𝐢) are diagonal and defined subsets of columns of 𝑈 sharing similar variance exp(𝜆𝑖). Due to its class-based model, this approach is similar to the one proposed in this paper, but the construction of 𝐼 relies on ad hoc voxel selection steps, so that there is no proof that the solution is correct. A contrario, the proposed approach jointly optimizes, within the same framework, the construction of the pattern of voxels and the regularization parameter of each pattern.

In this paper, we detail a model for the Bayesian regression in which features are grouped into 𝐾 different classes that are subject to different regularization penalties. The estimation of the penalty is performed in each class separately, leading to a stable and adaptive regularization. The construction of the group of features and the estimation of the predictive function are performed jointly. This approach, called Multiclass Sparse Bayesian Regression (MCBR), is thus an intermediate solution between BRR and ARD. It requires less parameters to estimate than ARD and is far more adaptive than BRR. Another asset of the proposed approach in fMRI inverse inference is that it creates a clustering of the features and thus yields useful maps for brain mapping. After introducing our model and giving some details on the parameter estimation algorithms (the variational Bayes or Gibbs sampling procedures), we show that the proposed algorithm yields better accuracy than reference methods, while providing more interpretable models.

2. Multiclass Sparse Bayesian Regression

We first detail the notations of the problem and describe the priors and parameters of the model. Then, we detail the two different algorithms used for model inference.

2.1. Model and Priors

We recall the linear model for regression: 𝐲=𝑓(𝐗,𝐰,𝑏)=𝐗𝐰+𝑏.(3) We denote by 𝐲𝑛 the targets to be predicted and 𝐗𝑛×𝑝 the set of activation images related to the presentation of different stimuli. The integer 𝑝 is the number of voxels and 𝑛 the number of samples (images). Typically, 𝑝103to105 (for a whole volume), while 𝑛10to102.

Priors on the Noise
We use classical priors for regression, and we model the noise on 𝐲 as an i.i.d. Gaussian variable: 𝜖𝒩0,𝛼1𝐈𝐧,𝛼Γ𝛼;𝛼1,𝛼2,(4) where 𝛼 is the precision parameter and Γ stands for the gamma density with two hyperparameters 𝛼1,𝛼2: Γ𝑥;𝛼1,𝛼2=𝛼𝛼12𝑥𝛼11exp𝑥𝛼2Γ𝛼1.(5)

Priors on the Class Assignment
In order to combine the sparsity of ARD with the stability of BRR, we introduce an intermediate representation, in which each feature 𝑗 belongs to one class among 𝐾 indexed by a discrete variable 𝑧𝑗 (𝐳={𝑧1,,𝑧𝑝}).All the features within a class 𝑘{1,,𝐾} share the same precision parameter 𝜆𝑘, and we use the following prior on 𝐳: 𝐳𝑝𝐾𝑗=1𝑘=1𝜋𝛿𝑗𝑘𝑘,(6) where 𝛿 is Kronecker's 𝛿, defined as 𝛿𝑗𝑘=0if𝑧𝑗𝑘,1if𝑧𝑗=𝑘.(7)
We finally introduce an additional Dirichlet prior [24] on 𝜋: 𝜋Dir(𝜂)(8) with a hyperparameter 𝜂. By updating at each step the probability 𝜋𝑘 of each class, it is possible to prune classes. This model has no spatial constraints and thus is not spatially regularized.

Priors on the Weights
As in ARD, we make use of an independent Gaussian prior for the weights: 𝐰𝒩0,𝐀1𝜆withdiag(𝐀)=𝑧1,,𝜆𝑧𝑝,(9) where 𝜆𝑧𝑗 is the precision parameter of the 𝑗th feature, with 𝑧𝑗{1,,𝐾}. We introduce the following prior on 𝜆𝑘: 𝜆𝑘𝜆Γ𝑘;𝜆1,𝑘,𝜆2,𝑘(10) with hyperparameters 𝜆1,𝑘, 𝜆2,𝑘. The complete generative model is summarized in Figure 1.

350838.fig.001
Figure 1: Graphical model of Multiclass Sparse Bayesian Regression (MCBR). We denote by 𝐲𝑛 the targets to be predicted and by 𝐗𝑛×𝑝 the set of activation images. both the weights of the model 𝐰 depend on a discrete variable 𝐳 that assigns each feature to a class 𝑘 among 𝐾. Both the noise 𝜖 and the weights 𝐰 have a Gamma prior on their precisions. The variable 𝐳 follows a Dirichlet prior 𝜋.
2.1.1. Link with Other Bayesian Regularization Schemes

The link between the proposed MCBR model and the other regularization methods, Bayesian Ridge Regression and Automatic Relevance Determination, is obvious.(i)With 𝐾=1, that is, 𝜆𝑧1==𝜆𝑧𝑝, we retrieve the BRR model, (ii)With 𝐾=𝑝, that is, 𝜆𝑧𝑖𝜆𝑧𝑗 if 𝑖𝑗, and assigning each feature to a singleton class (i.e., 𝑧𝑗=𝑗), we retrieve the ARD model.

Moreover, the proposed approach is related to the one developed in [25]. In this paper, the authors proposed, for the distribution of weights of the features, a binary mixture of Gaussians with small and large precisions. This model is used for variable selection and estimated by the Gibbs sampling. Our work can be viewed as a generalization of this model to a number of classes 𝐾2.

2.2. Model Inference

For models with latent variables, such as MCBR, some singularities can exist. For instance in a mixture of components, a singularity is a component with one single sample and thus zero variance. In such cases, maximizing the log likelihood yields flawed solutions, and one can use the posterior distribution of the latent variables 𝑝(𝐳𝐗,𝐲) for this maximization. However, the posterior distribution of the latent variables given the data does not have a closed-form expression, and some specific estimation methods, such as variational Bayes or Gibbs sampling, have to be used.

We propose two different algorithms for inferring the parameters of the MCBR model. We first estimate the model by the variational Bayes, and the resulting algorithm is thus called VB-MCBR. We also detail an algorithm, called Gibbs-MCBR, based on a Gibbs sampling procedure.

2.2.1. Estimation by Variational Bayes: VB-MCBR

The variational Bayes (or VB) approach provides an approximation 𝑞(Θ) of 𝑝(Θ𝐲), where 𝑞(Θ) is taken in a given family of distributions and Θ=[𝐰,𝜆,𝛼,𝐳,𝜋]. Additionally, the variational Bayes approach often uses the following mean field approximation, which allows the factorization between the approximate distribution of the latent variables and the approximate distributions of the parameters:𝑞(Θ)=𝑞(𝐰)𝑞(𝜆)𝑞(𝛼)𝑞(𝐳)𝑞(𝜋).(11)

We introduce the Kullback-Leibler divergence 𝒟(𝑞(Θ)) that measures the similarity between the true posterior 𝑝(Θ𝐲) and the variational approximation 𝑞(Θ). One can decompose the marginal log-likelihood log𝑝(𝐲) as log𝑝(𝐲Θ)=(𝑞(Θ))+𝒟(𝑞(Θ))(12) with (𝑞(Θ))=𝑑Θ𝑞(Θ)log𝑝(𝐲,Θ),𝑞(Θ)𝒟(𝑞(Θ))=𝑑Θ𝑞(Θ)log𝑞(Θ),𝑝(Θ𝐲)(13) where (𝑞(Θ)) is called free energy and can be seen as the measure of the quality of the model. As 𝒟(𝑞(Θ))0, the free energy is a lower bound on log𝑝(𝐲) with equality if and only if 𝑞(Θ)=𝑝(Θ𝐲). So, inferring the density 𝑞(Θ) of the parameters corresponds to maximizing with respect to the free distribution 𝑞(Θ). In practice, the VB approach consists in maximizing the free energy iteratively with respect to the approximate distribution 𝑞(𝐳) of the latent variables and with respect to the approximate distributions of the parameters of the model 𝑞(𝐰), 𝑞(𝜆), 𝑞(𝛼), and 𝑞(𝜋).

The variational distributions and the pseudocode of the VB-MCBR algorithm are provided in Appendix A. This algorithm maximizes the free energy . In practice, iterations are performed until convergence to a local maximum of . With an ARD prior (i.e., 𝐾=𝑝 and fixing 𝑧𝑗=𝑗), we retrieve the same formulas as the ones found for Variational ARD [18].

2.2.2. Estimation by Gibbs Sampling: Gibbs-MCBR

We develop here an estimation of the MCBR model using Gibbs sampling [26]. The resulting algorithm is called Gibbs-MCBR; the pseudocode of the algorithm and the candidate distributions are provided in Appendix B. The Gibbs sampling algorithm is used for generating a sequence of samples from the joint distribution to approximate marginal distributions. The main idea is to use conditional distributions that should be known and possibly easy to sample from, instead of directly computing the marginals from the joint law by integration (the joint law may not be known or hard to sample from). The sampling is done iteratively among the different parameters, and the final estimation of parameters is obtained by averaging the values of the different parameters across the different iterations (one may not consider the first iterations, this is called the burn in).

2.2.3. Initialization and Priors on the Model Parameters

Our model needs few hyperparameters; we choose here to use slightly informative and class-specific hyperparameters in order to reflect a wide range of possible behaviors for the weight distribution. This choice of priors is equivalent to setting heavy-tailed centered Student’s t-distributions with variance at different scales, as priors on the weight parameters. We set 𝐾=9, with weakly informative priors 𝜆1,𝑘=10𝑘4, 𝑘[1,,𝐾] and 𝜆2,𝑘=102, 𝑘[1,,𝐾]. Moreover, we set 𝛼1=𝛼2=1. Starting with a given number of classes and letting the model automatically prune the classes can be seen as a means of avoiding costly model selection procedures. The choice of class-specific priors is also useful to avoid label switching issues and thus speeds up convergence. Crucially, the priors used here can be used in any regression problem, provided that the target data is approximately scaled to the range of values used in our experiments. In that sense, the present choice of priors can be considered as universal. We also randomly initialize 𝑞(𝐳) for VB-MCBR (or 𝐳 for Gibbs-MCBR).

2.3. Validation and Model Evaluation
2.3.1. Performance Evaluation

Our method is evaluated with a cross-validation procedure that splits the available data into training and validation sets. In the following, (𝐗𝑙,𝐲𝑙) are a learning set (𝐗𝑡,𝐲𝑡) is a test set, and ̂𝐲𝑡=𝐹(𝐗𝑡𝐰) refers to the predicted target, where 𝐰 is estimated from the training set. The performance of the different models is evaluated using 𝜁, the ratio of explained variance: 𝜁𝐲𝑡,̂𝐲𝑡=𝐲var𝑡𝐲var𝑡̂𝐲𝑡var(𝐲𝑡).(14) This is the amount of variability in the response that can be explained by the model (perfect prediction yields 𝜁=1, while 𝜁<0 if prediction is worse than chance).

2.3.2. Competing Methods

In our experiments, the proposed algorithms are compared to different state-of-the-art regularization methods. (i)Elastic Net Regression [27], which requires setting two parameters 𝜆1 and 𝜆2. In our analyzes, a cross-validation procedure within the training set is used to optimize these parameters. Here, we use 𝜆1̃̃̃̃{0.2𝜆,0.1𝜆,0.05𝜆,0.01𝜆}, where ̃𝜆=𝐗𝑇𝐲, and 𝜆2{0.1,0.5,1.,10.,100.}. Note that 𝜆1 and 𝜆2 parametrize heterogeneous norms. (ii)Support Vector Regression (SVR) with a linear kernel [28], which is the reference method in neuroimaging. The 𝐶 parameter is optimized by cross-validation in the range of 10−3 to 101 in multiplicative steps of 10. (iii)Bayesian Ridge Regression (BRR), which is equivalent to MCBR with 𝐾=1 and 𝜆1=𝜆2=𝛼1=𝛼2=106, that is, weakly informative priors.(iv)Automatic Relevance Determination (ARD), which is equivalent to MCBR with 𝐾=𝑝 and 𝜆1=𝜆2=𝛼1=𝛼2=106, that is, weakly informative priors.

All these methods are used after an Anova-based feature selection as this maximizes their performance. Indeed, irrelevant features and redundant information can decrease the accuracy of a predictor [29]. The optimal number of voxels is selected within the range {50,100,250,500}, using a nested cross-validation within the training set. We do not directly select a threshold on 𝑃 value or cluster size, but rather a predefined number of features. The estimation of the parameters of the learning function is also performed using a nested cross-validation within the training set, to ensure a correct validation and an unbiased comparison of the methods. All methods are developed in 𝐶 and used in Python. The implementation of elastic net is based on coordinate descent [30], while SVR is based on LibSVM [31]. Methods are used from Python via the Scikit-learn open source package [32].

For VB-MCBR and Gibbs-MCBR, in order to avoid a costly internal cross-validation, we select 500 voxels, and this selection is performed on the training set. The number of iterations used is fixed to 5000 (burn in of 4000 iterations) for Gibbs-MCBR and 500 for VB-MCBR. Preliminary results on both simulated and real data showed that these values are sufficient enough for an accurate inference of the model. As explained previously, we set 𝐾=9, with weakly informative priors 𝜆1,𝑘=10𝑘4, 𝑘[1,,𝐾] and 𝜆2,𝑘=102, 𝑘[1,,𝐾]. Moreover, we set 𝛼1=𝛼2=1, and we randomly initialize 𝑞(𝐳) for VB-MCBR (or 𝐳 for Gibbs-MCBR).

3. Experiments and Results

3.1. Experiments on Simulated Data

We now evaluate and illustrate MCBR on two different sets of simulated data.

3.1.1. Details on Simulated Regression Data

We first test MCBR on a simulated data set, designed for the study of ill-posed regression problem, that is, 𝑛𝑝. Data are simulated as follows: 𝐗𝐗𝒩(0,1)with𝜖𝒩(0,1),𝐲=21+𝐗2𝐗3𝐗4𝐗+0.55+𝐗6𝐗7𝐗8+𝜖.(15) We have 𝑝=200 features, 𝑛𝑙=50 images for the training set, and 𝑛𝑡=50 images for the test set. We compare MCBR to the reference methods, but we do not use feature selection, as the number of features is not very high.

3.1.2. Results on Simulated Regression Data

We average the results of 15 different trials, and the average explained variance is shown in Table 1. Gibbs-MCBR outperforms the other approaches, yielding higher prediction accuracy than the reference elastic net and ARD methods. The prediction accuracy is also more stable than the other methods. VB-MCBR falls into the local maximum of and does not yield an accurate prediction. BBR has a low prediction accuracy compared to other methods such as ARD. Indeed, it cannot finely adapt the weights of the relevant features, as these features are regularized similarly as the irrelevant ones. SVR has also low accuracy, due to the fact that we do not perform any feature selection. Thus, SVR suffers from the curse of dimensionality, unlike other methods such as ARD or elastic net, which performs feature selection and model estimation jointly.

tab1
Table 1: Simulated regression data. Explained variance 𝜁 for different methods (average of 15 different trials). The P-values are computed using a paired t-test.

In Figure 2, we represent the probability density function of the distributions of the weights obtained with BRR (a), Gibbs-MCBR (b), and ARD (c). With BRR, the weights are grouped in a monomodal density. ARD is far more adaptive and sets lots of weights to zero. The Gibbs-MCBR algorithm creates a multimodal distribution, lots of weights being highly regularized (pink distributions), and informative features are allowed to have higher weights (blue distributions).

fig2
Figure 2: Results on simulated regression data. Probability density function of the weight distributions obtained with BRR (a), Gibbs-MCBR (b), and ARD (c). Each color represents a different component of the mixture model.

With MCBR, weights are clustered into different groups, depending on their predictive power, which is interesting in application such as fMRI inverse inference, as it can yield more interpretable models. Indeed, the class to the features with higher weights ({𝐗1,𝐗2,𝐗3,𝐗4}) belong which is small (average size of 6 features) but has a high purity (percentage of relevant features in the class) of 74%.

3.1.3. Comparison between VB-MCBR and Gibbs-MCBR

We now look at the values of 𝑤1 and 𝑤2 for the different steps of the two algorithms (see Figure 3). We can see that VB-MCBR (b) quickly falls into a local maximum, while Gibbs-MCBR (a) visits the space and reaches the region of the correct set of parameters (red dot). VB-MCBR is not optimal in this case.

fig3
Figure 3: Results on simulated regression data. Weights of the first two features found for the different steps of Gibbs-MCBR (a) and VB-MCBR (b). The red dot represents the ground truth of both weights, and the green dot represents the final state found by the two algorithms. VB-MCBR is stuck in a local maximum, and Gibbs-MCBR finds the correct weights.
3.2. Simulated Neuroimaging Data
3.2.1. Details on Simulated Neuroimaging Data

The simulated data set 𝐗 consists of 𝑛=100 images (size 12×12×12 voxels) with a set of four square regions of interest (ROI) (size 2×2×2). We call the support of the ROI (i.e., the 32 resulting voxels of interest). Each of the four ROIs has a fixed weight in {0.5,0.5,0.5,0.5}. We call 𝑤𝑖,𝑗,𝑘 the weight of the (𝑖,𝑗,𝑘) voxel. The resulting images are smoothed with a Gaussian kernel with a standard deviation of 2 voxels, to mimic the correlation structure observed in real fMRI data. To simulate the spatial variability between images (intersubject variability, movement artifacts in intrasubject variability), we define a new support of the ROIs, called such that, for each image 𝑙th, 50% (randomly chosen) of the weights 𝐰 are set to zero. Thus, we have . We simulate the target 𝐲 for the 𝑙th image as 𝑦𝑙=(𝑖,𝑗,𝑘)𝑤𝑖,𝑗,𝑘𝑋𝑖,𝑗,𝑘,𝑙+𝜖𝑙(16) with the signal in the (𝑖,𝑗,𝑘) voxel of the 𝑙th image simulated as𝑋𝑖,𝑗,𝑘,𝑙𝒩(0,1),(17) and 𝜖𝑙𝒩(0,𝛾) is a Gaussian noise with standard deviation 𝛾>0. We choose 𝛾 in order to have a signal-to-noise ratio of 5 dB.

3.2.2. Results on Simulated Neuroimaging Data

We compare VB-MCBR and Gibbs-MCBR with the different competing algorithms. The resulting images of weights are given in Figure 4, with the true weights (a) and resulting Anova F-scores (b). The reference methods can detect the truly informative regions (ROIs), but elastic net (f) and ARD (h) retrieve only part of the support of the weights. Moreover, elastic net yields an overly sparse solution. BRR (g) also retrieves the ROIs but does not yield a sparse solution, as all the features are regularized in the same way. We note that the weights in the feature space estimated by SVR (e) are nonzero everywhere and do not outline the support of the ground truth. VB-MCBR (c) converges to a local maximum similar to the solution found by BRR (g); that is, it creates only one nonempty class, and thus regularizes all the features similarly. We can thus clearly see that, in this model, the variational Bayes approach is very sensitive to the initialization and can fall into nonoptimal local maxima, for very sparse support of the weights. Finally, Gibbs-MCBR (d) retrieves most of the true support of the weights by performing an adapted regularization.

fig4
Figure 4: Two-dimensional slices of the three-dimensional volume of simulated data. Weights found by different methods, the true target (a) and F-score (b). The Gibbs-MCBR method (d) retrieves almost the whole spatial support for the weights. The sparsity-promoting reference methods, elastic net (f) and ARD (h), find an overly sparse support of the weights. VB-MCBR (c) converges to a local maximum similar to BRR (g) and thus does not yield a sparse solution. SVR (e) yields smooth maps that are not similar to the ground truth.
3.3. Experiments and Results on Real fMRI Data

In this section, we assess the performance of MCBR in an experiment on the mental representation of object size, where the aim is to predict the size of an object seen by the subject during the experiment, in both intrasubject and intersubject cases. The size (or scale parameter) of the object will be the target variable 𝐲.

3.3.1. Details on Real Data

We apply the different methods on a real fMRI dataset related to an experiment studying the representation of objects, on ten subjects, as detailed in [33]. During this experiment, ten healthy volunteers viewed objects of 4 shapes in 3 different sizes (yielding 12 different experimental conditions), with 4 repetitions of each stimulus in each of the 6 sessions. We pooled data from the 4 repetitions, resulting in a total of 𝑛=72 images by subject (one image of each stimulus by session). Functional images were acquired on a 3-T MR system with an eight-channel head coil (Siemens Trio, Erlangen, Germany) as T2*-weighted echo-planar image (EPI) volumes. Twenty transverse slices were obtained with a repetition time of 2 s (echo time: 30 ms; flip angle: 70°; 2×2×2-mm voxels; 0.5 mm gap). Realignment, normalization to MNI space, and general linear model (GLM) fit were performed with the SPM5 software (http://www.fil.ion.ucl.ac.uk/spm/software/spm5/). The normalization is the conventional method of SPM (implying affine and nonlinear transformations) and not the one using unified segmentation. The normalization parameters are estimated on the basis of a whole-head EPI acquired in addition and are then applied to the partial EPI volumes. The data are not smoothed. In the GLM, the effect of each of the 12 stimuli convolved with a standard hemodynamic response function was modeled separately, while accounting for serial autocorrelation with an AR(1) model and removing low-frequency drift terms using a high-pass filter with a cutoff of 128 s. The GLM is fitted separately in each session for each subject, and we used in the present work the resulting session-wise parameter estimate images (the 𝛽-maps are used as rows of 𝐗). The four different shapes of objects were pooled across for each one of the three sizes, and we are interested in finding discriminative information on sizes. This reduces to a regression problem, in which our goal is to predict a simple scalar factor (size of an object). All the analyzes are performed without any prior selection of regions of interest and use the whole acquired volume.

Intrasubject Regression Analysis
First, we perform an intrasubject regression analysis. Each subject is evaluated independently, in a 12-fold cross-validation. The dimensions of the real data set for one subject are 𝑝7×104 and 𝑛=72 (divided in 3 different sizes, 24 images per size). We evaluate the performance of the method by a leave-one-condition-out cross-validation (i.e., leave-6-image-out), and doing so the GLM is performed separately for the training and test sets. The parameters of the reference methods are optimized with a nested leave-one-condition-out cross-validation within the training set, in the ranges given before.

Intersubject Regression Analysis
Additionally, we perform an intersubject regression analysis on the sizes. The intersubject analysis relies on subject-specific fixed-effect activations that is, for each condition, the 6 activation maps corresponding to the 6 sessions are averaged together. This yields a total of 12 images per subject, one for each experimental condition. The dimensions of the real data set are 𝑝7×104 and 𝑛=120 (divided into 3 different sizes). We evaluate the performance of the method by cross-validation (leave-one-subject-out). The parameters of the reference methods are optimized with a nested leave-one-subject-out cross-validation within the training set, in the ranges given before.

3.3.2. Results on Real Data

Intrasubject Regression Analysis
The results obtained by the different methods are given in Table 2. The 𝑃-values are computed using a paired 𝑡-test across subjects. VB-MCBR outperforms the other methods. Compared to the results on simulated data, VB-MCBR still falls in a local maximum similar to the Bayesian ridge regression which performs well in this experiment. Moreover, both Gibbs-MCBR and VB-MCBR are more stable than the reference methods.

tab2
Table 2: Intrasubject analysis. Explained variance 𝜁 for the three different methods. The 𝑃-values are computed using a paired 𝑡-test. VB-MCBR yields the best prediction accuracy, while being more stable than the reference methods.

Intersubject Regression Analysis
The results obtained with the different methods are given in Table 3. As in the intrasubject analysis, both MCBR approaches outperform the reference methods, SVR, BRR, and ARD. However, the prediction accuracy is similar to that of elastic net. In this case, Gibbs-MCBR performs slightly better than VB-MCBR, but the difference is not significant.
One major asset of MCBR (and more particularly Gibbs-MCBR, as VB-MCBR often falls into a one-class local maximum) is that it creates a clustering of the features, based on the relevance of the features in the predictive model. This clustering can be accessed using the variable 𝐳, which is implied in the regularization performed on the different features. In Figure 5, we give the histogram of the weights of Gibbs-MCBR for the intersubject analysis. We keep the weights and the values of 𝐳 of the last iteration; the different classes are represented as dots of different colors and are superimposed on the histogram. We can notice than the pink distribution represented at the bottom of the histogram corresponds to relevant features. This cluster is very small (19 voxels), compared to the two blue classes represented at the top of the histogram that contain many voxels (746 voxels) which are highly regularized, as they are noninformative.
The maps of weights found by the different methods are detailed in Figure 6. The methods are used combined with an Anova-based univariate feature selection (2500 voxels selected, in order to have a good support of the weights). As elastic net, Gibbs-MCBR yields a sparse solution but extracts a few more voxels. The map found by elastic net is not easy to interpret, with very few informative voxels scattered in the whole occipital cortex. The map found by SVR is not sparse in the feature space and is thus difficult to interpret, as the spatial layout of the neural code is not clearly extracted. VB-MCBR does not yield a sparse map either, all the features having nonnull weights

tab3
Table 3: Intersubject analysis. Explained variance 𝜁 for the different methods. The 𝑃-values are computed using a paired 𝑡-test. MCBR yields highest prediction accuracy than the two other Bayesian regularizations BRR and ARD.
350838.fig.005
Figure 5: Intersubject analysis. Histogram of the weights found by Gibbs-MCBR and corresponding 𝐳 values (each color of dots represents a different class), for the intersubject analyzes. We can see that Gibbs-MCBR creates clusters of informative and noninformative voxels and that the different classes are regularized differently, according to the relevance of the features in each of them.
350838.fig.006
Figure 6: Intersubject analysis. Maps of weights found by the different methods on the 2500 most relevant features by Anova. The map found by elastic net is difficult to interpret as the very few relevant features are scattered within the whole brain. SVR and VB-MCBR do not yield a sparse solution. Gibbs-MCBR, by performing an adaptive regularization, draws a compromise between the other approaches and yields a sparse solution, but also extracts small groups of relevant features.

4. Discussion

It is well known that in high-dimensional problems, regularization of feature loadings significantly increases the generalization ability of the predictive model. However, this regularization has to be adapted to each particular dataset. In place of costly cross-validation procedures, we cast regularization in a Bayesian framework and treat the regularization weights as hyperparameters. The proposed approach yields an adaptive and efficient regularization and can be seen as a compromise between a global regularization (Bayesian Ridge Regression) that does not take into account the sparse or focal distribution of the information and automatic relevance determination. Additionally, MCBR creates a clustering of the features based on their relevance and thus explicitly extracts groups of informative features.

Moreover, MCBR can cope with the different issues of ARD. ARD is subject to an underfitting in the hyperparameter space that corresponds to an underfitting in model selection (i.e., on the features to be pruned) [19]. Indeed, as ARD is estimated by maximizing evidence, models with less selected features are preferred, as the integration is done on less dimensions, and thus evidence is higher. ARD will choose the sparsest model across models with similar accuracy. A contrario, MCBR requires far less hyperparameter (2×𝐾, with 𝐾𝑝) and suffers less from this issue, as the sparsity of the model is defined by groups. Moreover, a full Bayesian framework for estimating ARD requires to set some priors on the hyperparameters (e.g., 𝛼1 and 𝛼2), and it may be sensitive to specific choice of these hyperparameters. A solution is to use an internal cross-validation for optimizing these parameters, but this approach can be computationally expensive. In the case of MCBR, the distributions of the hyperparameters are bound to a class and not to each feature. Thus, the proposed approach is less sensitive to the choice of the hyperparameters. Indeed, the choice of good hyperparameters for the features is dealt with at the class level.

On simulated data, our approach performs better than other classical methods such as SVR, BRR, ARD, and elastic net and yields a more stable prediction accuracy. Moreover, by adapting the regularization to different groups of voxels, MCBR retrieves the true support of the weights and recovers a sparse solution. Results on real data show that MCBR yields more accurate predictions than other regularization methods. As it yields less sparse solution than elastic net, it gives access to more plausible loading maps which are necessary for understanding the spatial organization of brain activity, that is, retrieving the spatial layout of the neural coding. On real fMRI data, the explicit clustering of Gibbs-MCBR is also an interesting aspect of the model, as it can extract few groups of relevant features from many voxels.

In some experiments, the variational Bayes algorithm yields less accurate predictions than the Gibbs sampling approach, which can be explained by the difficulty of initializing the different variables (especially 𝐳) when the support of the weight is overly sparse. Moreover, the VB-MCBR algorithm relies on a variational Bayes approach, which may not be optimal, due to strong approximations in model inference. A contrario Gibbs-MCBR is more time consuming but yields a better model inference. Finally, the variability in the results may be explained by the difficulty to estimate the model (optimality is not ensured).

The question of model selection (i.e., the number of classes 𝐾) has not been addressed in this paper. One can use the free energy in order to select the best model, but due to the instability of VB-MCBR, this approach does not seem promising. A more interesting method is the one detailed in [34], which can be used with the Gibbs sampling algorithm. Here, model selection is performed implicitly by emptying classes that do not fit the data well. In that respect, the choice of heterogeneous priors for different classes is crucial: replacing our priors with class-independent priors (i.e., 𝜆1,𝑘=102, 𝑘[1,,𝐾]) in the intersubject analysis on size prediction leads Gibbs-MCBR to a local maximum similar to VB-MCBR.

Finally, this model is not restricted to the Bayesian regularization and can be used for classification, within a probit or logit model [35, 36]. The proposed model may thus be used for diagnosis in medical imaging, for the prediction of both continuous or discrete variables.

5. Conclusion

In this paper, we have proposed a model for adaptive regression, called MCBR. The proposed method integrates, in the same Bayesian framework, BRR and ARD and performs a different regularization for relevant and irrelevant features. It can tune the regularization to the possible different level of sparsity encountered in fMRI data analysis, and it yields interpretable information for fMRI inverse inference, namely, the 𝐳 variable (latent class variable). Experiments on both simulated and real data show that our approach is well suited for neuroimaging, as it yields accurate and stable predictions compared to the state-of-the-art methods.

Appendices

A. VB-MCBR Algorithm

The variational Bayes approach yields the following variational distributions:

(i)𝑞(𝐰)𝒩(𝐰𝜇,𝚺) with 𝐀=diag𝑙1,,𝑙𝑝with𝑙𝑗=𝐾𝑘=1𝑞𝑧𝑗𝑙=𝑘1,𝑘𝑙2,𝑘𝑎𝑗{1,,𝑝},(A.1)𝚺=1𝑎2𝐗𝑇𝐗+𝐀1,𝑎(A.2)𝜇=1𝑎2𝚺𝐗𝑇𝐲;(A.3)(ii)𝑞(𝜆𝑘)Γ(𝑙1,𝑘,𝑙2,𝑘) with 𝑙1,𝑘=𝜆1,𝑘+12𝑝𝑗=1𝑞𝑧𝑗,𝑙=𝑘(A.4)2,𝑘=𝜆2,𝑘+12𝑝𝑗=1𝜇2𝑗𝑗+Σ𝑗𝑗𝑞𝑧𝑗;=𝑘(A.5)(iii)𝑞(𝛼)Γ(𝑎1,𝑎2) with 𝑎1=𝛼1+𝑛2,𝑎(A.6)2=𝛼2+12(𝐲𝐗𝜇)𝑇1(𝐲𝐗𝜇)+2Tr𝚺𝐗𝑇𝐗;(A.7)(iv)𝑞(𝑧𝑗=𝑘)exp(𝜌𝑗𝑘) with 𝜌𝑗𝑘1=2𝜇2𝑗+Σ𝑗𝑗𝑙1,𝑘𝑙2,𝑘𝜋+ln𝑘+12Ψ𝑙1,𝑘𝑙log2,𝑘𝜋,(A.8)𝑘=exp{Ψ(𝑑𝑘)Ψ(𝑘=𝐾𝑘=1𝑑𝑘)}𝑑,(A.9)𝑘=𝜂𝑘+𝑝𝑗=1𝑞𝑧𝑗,=𝑘(A.10) where Ψ is the digamma function Ψ(𝑥)=Γ(𝑥)/Γ(𝑥). The VB-MCBR algorithm is provided in pseudo-code in Algorithm 1.

alg1
Algorithm 1: VB-MCBR algorithm.

B. Gibbs-MCBR Algorithm

With Θ=[𝐰,𝜆,𝛼,𝐳,𝜋], we have the following candidate distributions (i.e., the distributions used for the sampling of the different parameters):

(i)𝑝(𝐰Θ{𝐰})𝒩(𝐰𝜇,𝚺) with 𝐗𝚺=𝑇𝐗𝛼+𝐀1𝜆with𝐀=diag𝑧1,,𝜆𝑧𝑝,(B.1)𝜇=𝚺𝛼𝐗𝑇𝐲;(B.2)(ii)𝑝(𝜆Θ{𝜆})𝐾𝑘=1Γ(𝜆𝑘𝑙1,𝑘,𝑙2,𝑘) with 𝑙1,𝑘=𝜆1,𝑘+12𝑝𝑗=1𝛿𝑧𝑗,𝑙=𝑘(B.3)2,𝑘=𝜆2,𝑘+12𝑝𝑗=1𝛿𝑧𝑗𝑤=𝑘2𝑗;(B.4)(iii)𝑝(𝛼Θ{𝛼})Γ(𝑎1,𝑎2) with 𝑎1=𝛼1+𝑛2,𝑎(B.5)2=𝛼2+12(𝐲𝐗𝜇)𝑇(𝐲𝐗𝜇);(B.6)(iv)𝑝(𝑧𝑗Θ{𝐳})mult(exp𝜌𝑗,1,,exp𝜌𝑗,𝐾) with 𝜌𝑗𝑘1=2𝑤2𝑗𝜆𝑘𝜋+ln𝑘+12log𝜆𝑘;(B.7)(v)𝑝(𝜋𝑘Θ{𝜋})Dir(𝑑𝑘) with 𝑑𝑘=𝜂𝑘+𝑝𝑗=1𝛿𝑧𝑗.=𝑘(B.8)

The algorithm is provided in pseudocode in Algorithm 2.

alg2
Algorithm 2: Gibbs-MCBR algorithm.

Acknowledgment

The authors acknowledge support from the ANR Grant ViMAGINE ANR-08-BLAN-0250-02.

References

  1. K. J. Friston, A. P. Holmes, K. J. Worsley, J. P. Poline, C. D. Frith, and R. S. J. Frackowiak, “Statistical parametric maps in functional imaging: a general linear approach,” Human Brain Mapping, vol. 2, no. 4, pp. 189–210, 1994. View at Scopus
  2. S. Dehaene, G. Le Clec'H, L. Cohen, J. B. Poline, P. F. van de Moortele, and D. Le Bihan, “Inferring behavior from functional brain images,” Nature Neuroscience, vol. 1, no. 7, pp. 549–550, 1998. View at Scopus
  3. D. D. Cox and R. L. Savoy, “Functional magnetic resonance imaging (fMRI) "brain reading": detecting and classifying distributed patterns of fMRI activity in human visual cortex,” NeuroImage, vol. 19, no. 2, pp. 261–270, 2003. View at Publisher · View at Google Scholar · View at Scopus
  4. P. Dayan and L. F. Abbott, Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems, MIT Press, 2001.
  5. J. D. Haynes and G. Rees, “Predicting the stream of consciousness from activity in human visual cortex,” Current Biology, vol. 15, no. 14, pp. 1301–1307, 2005. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  6. T. M. Mitchell, R. Hutchinson, R. S. Niculescu et al., “Learning to decode cognitive states from brain images,” Machine Learning, vol. 57, no. 1-2, pp. 145–175, 2004. View at Publisher · View at Google Scholar · View at Scopus
  7. S. LaConte, S. Strother, V. Cherkassky, J. Anderson, and X. Hu, “Support vector machines for temporal classification of block design fMRI data,” NeuroImage, vol. 26, no. 2, pp. 317–329, 2005. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  8. J. Mourão-Miranda, A. L. W. Bokde, C. Born, H. Hampel, and M. Stetter, “Classifying brain states and determining the discriminating activation patterns: Support Vector Machine on functional MRI data,” NeuroImage, vol. 28, no. 4, pp. 980–995, 2005. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  9. S. J. Hanson and Y. O. Halchenko, “Brain reading using full brain support vector machines for object recognition: there is no face identification area,” Neural Computation, vol. 20, no. 2, pp. 486–503, 2008.
  10. O. Yamashita, M. A. Sato, T. Yoshioka, F. Tong, and Y. Kamitani, “Sparse estimation automatically selects voxels relevant for the decoding of fMRI activity patterns,” NeuroImage, vol. 42, no. 4, pp. 1414–1429, 2008. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  11. S. Ryali, K. Supekar, D. A. Abrams, and V. Menon, “Sparse logistic regression for whole-brain classification of fMRI data,” NeuroImage, vol. 51, no. 2, pp. 752–764, 2010. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  12. F. De Martino, G. Valente, N. Staeren, J. Ashburner, R. Goebel, and E. Formisano, “Combining multivariate voxel selection and support vector machines for mapping and classification of fMRI spatial patterns,” NeuroImage, vol. 43, no. 1, pp. 44–58, 2008. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  13. I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene selection for cancer classification using support vector machines,” Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002. View at Publisher · View at Google Scholar · View at Scopus
  14. C. Chu, Y. Ni, G. Tan, C. J. Saunders, and J. Ashburner, “Kernel regression for fMRI pattern prediction,” NeuroImage, vol. 56, no. 2, pp. 662–673, 2011. View at Publisher · View at Google Scholar · View at PubMed
  15. H. Liu, M. Palatucci, and J. Zhang, “Blockwise coordinate descent procedures for the multi-task Lasso, with applications to neural semantic basis discovery,” in Proceedings of the 26th International Conference On Machine Learning (ICML '09), pp. 649–656, June 2009. View at Scopus
  16. M. K. Carroll, G. A. Cecchi, I. Rish, R. Garg, and A. R. Rao, “Prediction and interpretation of distributed neural activity with sparse models,” NeuroImage, vol. 44, no. 1, pp. 112–122, 2009. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  17. C. M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics, Springer, Berlin, Germany, 1st edition, 2007.
  18. M. Tipping, The Relevance Vector Machine, Morgan Kaufmann, 2000.
  19. Y. Qi, T. P. Minka, R. W. Picard, and Z. Ghahramani, “Predictive automatic relevance determination by expectation propagation,” in Proceedings of the 21st International Conference on Machine Learning (ICML '04), ACM Press, 2004.
  20. D. Wipf and S. Nagarajan, “A new view of automatic relevance determination,” in Advances in Neural Information Processing Systems, vol. 20, pp. 1625–1632, MIT Press, 2008.
  21. Y. Ni, C. Chu, C. J. Saunders, and J. Ashburner, “Kernel methods for fmri pattern prediction,” in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN '08), pp. 692–697, 2008.
  22. K. Uǧurbil, L. Toth, and D. S. Kim, “How accurate is magnetic resonance imaging of brain function?” Trends in Neurosciences, vol. 26, no. 2, pp. 108–114, 2003. View at Publisher · View at Google Scholar · View at Scopus
  23. K. Friston, C. Chu, J. Mourão-Miranda et al., “Bayesian decoding of brain images,” NeuroImage, vol. 39, no. 1, pp. 181–205, 2008. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  24. H. Steck and T. S. Jaakkola, “On the dirichlet prior and bayesian regularization,” in Advances in Neural Information Processing Systems, vol. 15, pp. 697–704, 2002.
  25. E. I. George and R. E. McCulloch, “Variable selection via gibbs sampling,” Journal of the American Statistical Association, vol. 88, no. 423, pp. 881–889, 1993.
  26. S. Geman and D. Geman, Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images, Morgan Kaufmann, 1987.
  27. H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society. Series B, vol. 67, no. 2, pp. 301–320, 2005. View at Publisher · View at Google Scholar · View at MathSciNet · View at Scopus
  28. C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995. View at Publisher · View at Google Scholar · View at Scopus
  29. G. Hughes, “On the mean accuracy of statistical pattern recognizers,” IEEE Transactions on Information Theory, vol. 14, no. 1, pp. 55–63, 1968.
  30. J. Friedman, T. Hastie, and R. Tibshirani, “Regularization paths for generalized linear models via coordinate descent,” Journal of Statistical Software, vol. 33, no. 1, pp. 1–22, 2010. View at Scopus
  31. C.-C. Chang and C.-J. Lin, “LIBSVM: a library for support vector machines,” 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
  32. scikit-learn, version 0.2, 2010, http://scikit-learn.sourceforge.net/.
  33. E. Eger, C. A. Kell, and A. Kleinschmidt, “Graded size sensitivity of object-exemplar-evoked activity patterns within human LOC subregions,” Journal of Neurophysiology, vol. 100, no. 4, pp. 2038–2047, 2008. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  34. S. Chib and I. Jeliazkov, “Marginal Likelihood from the Metropolis-Hastings Output,” Journal of the American Statistical Association, vol. 96, no. 453, pp. 270–281, 2001. View at Scopus
  35. J. H. Albert and S. Chib, “Bayesian analysis of binary and polychotomous response data,” Journal of the American Statistical Association, vol. 88, no. 422, pp. 669–679, 1993.
  36. R. E. McCulloch, N. G. Polson, and P. E. Rossi, “A Bayesian analysis of the multinomial probit model with fully identified parameters,” Journal of Econometrics, vol. 99, no. 1, pp. 173–193, 2000. View at Scopus