Abstract

Inverse inference has recently become a popular approach for analyzing neuroimaging data, as it quantifies the amount of information contained in brain images about perceptual, cognitive, and behavioral parameters. By outlining the brain regions that convey information for an accurate prediction of the parameter of interest, it helps to understand how the corresponding information is encoded in the brain. However, it relies on a prediction function that is plagued by the curse of dimensionality, as there are far more features (voxels) than samples (images), and dimension reduction is thus a mandatory step. We introduce in this paper a new model, called Multiclass Sparse Bayesian Regression (MCBR), which, unlike classical alternatives, automatically adapts the amount of regularization to the available data. MCBR consists in grouping features into several classes and then regularizing each class differently, so as to apply an adaptive and efficient regularization. We detail this framework and validate our algorithm on simulated and real neuroimaging data sets, showing that it performs better than reference methods while yielding interpretable clusters of features.

1. Introduction

In the context of neuroimaging, machine learning approaches have been used so far to address diagnostic problems, where patients were classified into different groups based on anatomical or functional data. By contrast, in cognitive studies, the standard framework for functional or anatomical brain mapping was based on mass univariate inference procedures [1]. Recently, a new way of analyzing functional neuroimaging data has emerged [2, 3], and it consists in assessing how well behavioral information or cognitive states can be predicted from brain activation images such as those obtained with functional magnetic resonance imaging (fMRI). This approach opens new ways for understanding the mental representation of various perceptual and cognitive parameters, which can be regarded as the study of the corresponding neural code, albeit at a relatively low spatial resolution. The accuracy of the prediction of the behavioral or cognitive target variable, as well as the spatial layout of predictive regions, can provide valuable information about functional brain organization; in short, it helps to decode the brain system [4].

Many different pattern recognition and machine learning methods have been used to extract information from brain images and relate it to the corresponding target. Among them, Linear Discriminant Analysis (LDA) [3, 5], Support Vector Machines (SVM) [6–9], and regularized prediction [10, 11] have been widely used. The major bottleneck in this kind of analytical framework is that there are far more features than samples, so that the problem is plagued by the curse of dimensionality, leading to overfitting. Dimension reduction can be used to extract relevant information from the data, the standard approach in functional neuroimaging being feature selection (e.g., Anova) [3, 6, 11, 12]. However, by performing feature selection and parameter estimation separately, such an approach is not optimal. A popular combined selection/estimation scheme, Recursive Feature Elimination [13], may thus be used; however, it relies on a specific heuristic, which does not guarantee the optimality of the solution and is particularly costly. By contrast, there is great interest in sparsity-inducing regularizations, which perform feature selection and parameter estimation simultaneously.

In this paper, we assume that the code under investigation concerns some scalar parameter that characterizes the stimuli, such as a scale or shape parameter, but possibly also position, speed (assuming a 1-D space), or cardinality. Thus, we focus on regression problems and defer the generalization to classification to future work. Let us introduce the following predictive linear model:
$$\mathbf{y} = \mathbf{X}\mathbf{w} + b, \qquad (1)$$
where $\mathbf{y}$ represents the behavioral variable and $(\mathbf{w}, b)$ are the parameters to be estimated on a training set. The vector $\mathbf{w} \in \mathbb{R}^p$ can be seen as an image, where $p$ is the number of features (or voxels), and $b \in \mathbb{R}$ is called the intercept. The matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$ is the design matrix; each row is a $p$-dimensional sample, that is, an activation map related to one observation. With $n \ll p$, the estimation of $\mathbf{w}$ is ill-posed.

To cope with the high dimensionality of the data, one can penalize the estimation of $\mathbf{w}$, for example, based on the $\ell_2$ norm of the weights. Classical regularization schemes have been used in functional neuroimaging, such as Ridge regression [14], Lasso [15], or Elastic Net regression [16]. However, these approaches require the amount of penalization to be fixed beforehand, possibly optimized by cross-validation. To deal with the choice of the amount of penalization, one can use Bayesian regression techniques, which include the estimation of the regularization parameters in the whole estimation procedure. Standard Bayesian regularization schemes are based on the fact that a penalization by a weighted $\ell_2$ norm is equivalent to setting Gaussian priors on the weights $\mathbf{w}$:
$$\mathbf{w} \sim \mathcal{N}\left(0, A^{-1}\right), \quad A = \operatorname{diag}\left(\alpha_1, \ldots, \alpha_p\right), \quad \alpha_i \in \mathbb{R}^{+} \;\; \forall i \in \{1, \ldots, p\}, \qquad (2)$$
where $\mathcal{N}$ is the Gaussian distribution and $\alpha_i$ the precision of the $i$th feature. The model in (2) encompasses two classical Bayesian regression schemes. The first one is Bayesian Ridge Regression (BRR) [17], which corresponds to the particular case $\alpha_1 = \cdots = \alpha_p$. By regularizing all the features identically, BRR is not well suited when only a few features are relevant. The second classical scheme is Automatic Relevance Determination (ARD) [18], which corresponds to the case $\alpha_i \neq \alpha_j$ for $i \neq j$. The regularization performed by ARD is very adaptive, as all the weights are regularized differently. However, by regularizing each feature separately, ARD is prone to underfitting when the model contains too many regressors [19] and also suffers from convergence issues [20].
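
For illustration, the following minimal sketch contrasts the two schemes on an $n \ll p$ toy problem; it relies on scikit-learn's BayesianRidge and ARDRegression estimators, which implement BRR- and ARD-type priors (an illustration of ours, not a description of the code used in this paper).

import numpy as np
from sklearn.linear_model import BayesianRidge, ARDRegression

rng = np.random.RandomState(0)
n, p = 50, 200
X = rng.randn(n, p)
w_true = np.zeros(p)
w_true[:8] = [1, 1, -1, -1, 0.5, 0.5, -0.5, -0.5]  # only 8 relevant features
y = X.dot(w_true) + rng.randn(n)

brr = BayesianRidge().fit(X, y)   # one precision shared by all weights
ard = ARDRegression().fit(X, y)   # one precision per weight
print(np.sum(np.abs(brr.coef_) > 1e-3))   # dense solution
print(np.sum(np.abs(ard.coef_) > 1e-3))   # much sparser solution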

These classical Bayesian regularization schemes have been used in fMRI inverse inference studies [10, 14, 21]. However, these studies use sparsity only as a built-in feature selection and do not exploit neuroscientific assumptions to improve the regularization (i.e., within the design of the matrix $A$). Indeed, due to the intrinsic smoothness of functional neuroimaging data [22], predictive information is rather encoded in different groups of features sharing similar information. A potentially more adapted approach is the Bayesian regression scheme presented in [23], which regularizes patterns of voxels differently. The weights of the model are defined by $\mathbf{w} = U\eta$, where $U$ is a matrix whose columns form a set of spatial patterns and $\eta$ contains the parameters of the decomposition of $\mathbf{w}$ in the basis defined by $U$. The regularization is controlled through the covariance of $\eta$, which is assumed to be diagonal with only $m$ possible different values:
$$\operatorname{cov}(\eta) = \exp(\lambda_1)\, \mathbf{I}^{(1)} + \cdots + \exp(\lambda_m)\, \mathbf{I}^{(m)}.$$

The matrices $\mathbf{I}^{(i)}$ are diagonal and define the subsets of columns of $U$ that share a similar variance $\exp(\lambda_i)$. Due to its class-based model, this approach is similar to the one proposed in this paper, but the construction of the $\mathbf{I}^{(i)}$ relies on ad hoc voxel selection steps, so that there is no guarantee that the solution is correct. A contrario, the proposed approach jointly optimizes, within the same framework, the construction of the patterns of voxels and the regularization parameter of each pattern.

In this paper, we detail a model for Bayesian regression in which features are grouped into $K$ different classes that are subject to different regularization penalties. The penalty is estimated in each class separately, leading to a stable and adaptive regularization. The construction of the groups of features and the estimation of the predictive function are performed jointly. This approach, called Multiclass Sparse Bayesian Regression (MCBR), is thus an intermediate solution between BRR and ARD: it requires fewer parameters to estimate than ARD and is far more adaptive than BRR. Another asset of the proposed approach for fMRI inverse inference is that it creates a clustering of the features and thus yields useful maps for brain mapping. After introducing our model and giving some details on the parameter estimation algorithms (variational Bayes and Gibbs sampling procedures), we show that the proposed algorithm yields better accuracy than reference methods, while providing more interpretable models.

2. Multiclass Sparse Bayesian Regression

We first detail the notations of the problem and describe the priors and parameters of the model. Then, we detail the two different algorithms used for model inference.

2.1. Model and Priors

We recall the linear model for regression:
$$\mathbf{y} = f(\mathbf{X}, \mathbf{w}, b) = \mathbf{X}\mathbf{w} + b. \qquad (3)$$
We denote by $\mathbf{y} \in \mathbb{R}^n$ the targets to be predicted and by $\mathbf{X} \in \mathbb{R}^{n \times p}$ the set of activation images related to the presentation of different stimuli. The integer $p$ is the number of voxels and $n$ the number of samples (images). Typically, $p \sim 10^3$ to $10^5$ (for a whole volume), while $n \sim 10$ to $10^2$.

Priors on the Noise
We use classical priors for regression, and we model the noise on $\mathbf{y}$ as an i.i.d. Gaussian variable:
$$\epsilon \sim \mathcal{N}\left(0, \alpha^{-1}\mathbf{I}_n\right), \qquad \alpha \sim \Gamma\left(\alpha; \alpha_1, \alpha_2\right), \qquad (4)$$
where $\alpha$ is the precision parameter and $\Gamma$ stands for the Gamma density with two hyperparameters $\alpha_1, \alpha_2$:
$$\Gamma\left(x; \alpha_1, \alpha_2\right) = \frac{\alpha_2^{\alpha_1}\, x^{\alpha_1 - 1}\, \exp\left(-x\,\alpha_2\right)}{\Gamma\left(\alpha_1\right)}. \qquad (5)$$
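
The Gamma density in (5) is parameterized by a shape $\alpha_1$ and a rate $\alpha_2$. The short check below, assuming SciPy is available, verifies that (5) coincides with scipy.stats.gamma once the rate is converted into a scale; it is only an illustrative sketch.

import numpy as np
from scipy.stats import gamma
from scipy.special import gammaln

def gamma_pdf(x, a1, a2):
    # direct transcription of (5): a2**a1 * x**(a1 - 1) * exp(-a2 * x) / Gamma(a1)
    return np.exp(a1 * np.log(a2) + (a1 - 1) * np.log(x) - a2 * x - gammaln(a1))

x, a1, a2 = 2.3, 1.5, 0.5
print(gamma_pdf(x, a1, a2))
print(gamma.pdf(x, a=a1, scale=1.0 / a2))  # identical value (shape-scale parameterization)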

Priors on the Class Assignment
In order to combine the sparsity of ARD with the stability of BRR, we introduce an intermediate representation, in which each feature $j$ belongs to one class among $K$, indexed by a discrete variable $z_j$ (with $\mathbf{z} = \{z_1, \ldots, z_p\}$). All the features within a class $k \in \{1, \ldots, K\}$ share the same precision parameter $\lambda_k$, and we use the following prior on $\mathbf{z}$:
$$\mathbf{z} \sim \prod_{j=1}^{p} \prod_{k=1}^{K} \pi_k^{\delta_{jk}}, \qquad (6)$$
where $\delta$ is Kronecker's delta, defined as
$$\delta_{jk} = \begin{cases} 0 & \text{if } z_j \neq k, \\ 1 & \text{if } z_j = k. \end{cases} \qquad (7)$$
We finally introduce an additional Dirichlet prior [24] on $\pi$:
$$\pi \sim \operatorname{Dir}(\eta), \qquad (8)$$
with a hyperparameter $\eta$. By updating the probability $\pi_k$ of each class at each step, it is possible to prune classes. This model has no spatial constraints and thus is not spatially regularized.

Priors on the Weights
As in ARD, we make use of an independent Gaussian prior for the weights:
$$\mathbf{w} \sim \mathcal{N}\left(0, \mathbf{A}^{-1}\right) \quad \text{with} \quad \operatorname{diag}(\mathbf{A}) = \left\{\lambda_{z_1}, \ldots, \lambda_{z_p}\right\}, \qquad (9)$$
where $\lambda_{z_j}$ is the precision parameter of the $j$th feature, with $z_j \in \{1, \ldots, K\}$. We introduce the following prior on $\lambda_k$:
$$\lambda_k \sim \Gamma\left(\lambda_k; \lambda_{1,k}, \lambda_{2,k}\right), \qquad (10)$$
with hyperparameters $\lambda_{1,k}$ and $\lambda_{2,k}$. The complete generative model is summarized in Figure 1.
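
The full set of priors (4)-(10) amounts to the following ancestral-sampling scheme, sketched here with NumPy (the hyperparameter values are illustrative and are not the settings used in the experiments):

import numpy as np

rng = np.random.RandomState(0)
n, p, K = 50, 200, 9
eta = np.ones(K)                               # Dirichlet hyperparameter
pi = rng.dirichlet(eta)                        # class probabilities, (8)
z = rng.choice(K, size=p, p=pi)                # class assignment of each feature, (6)
lam = rng.gamma(shape=1.0, scale=1.0, size=K)  # per-class precisions, (10)
w = rng.randn(p) / np.sqrt(lam[z])             # w ~ N(0, A^{-1}), diag(A) = lambda_{z_j}, (9)
alpha = rng.gamma(shape=1.0, scale=1.0)        # noise precision, (4)
X = rng.randn(n, p)
y = X.dot(w) + rng.randn(n) / np.sqrt(alpha)   # linear model (3) plus Gaussian noise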

2.1.1. Link with Other Bayesian Regularization Schemes

The link between the proposed MCBR model and the other regularization methods, Bayesian Ridge Regression and Automatic Relevance Determination, is obvious:
(i) with $K = 1$, that is, $\lambda_{z_1} = \cdots = \lambda_{z_p}$, we retrieve the BRR model;
(ii) with $K = p$, that is, $\lambda_{z_i} \neq \lambda_{z_j}$ for $i \neq j$, and assigning each feature to a singleton class (i.e., $z_j = j$), we retrieve the ARD model.

Moreover, the proposed approach is related to the one developed in [25]. In that paper, the authors proposed, for the distribution of the feature weights, a binary mixture of Gaussians with small and large precisions. This model is used for variable selection and estimated by Gibbs sampling. Our work can be viewed as a generalization of this model to a number of classes $K \geq 2$.

2.2. Model Inference

For models with latent variables, such as MCBR, singularities can exist. For instance, in a mixture of components, a singularity is a component with one single sample and thus zero variance. In such cases, maximizing the log-likelihood yields flawed solutions, and one can instead use the posterior distribution of the latent variables $p(\mathbf{z} \mid \mathbf{X}, \mathbf{y})$ for this maximization. However, the posterior distribution of the latent variables given the data does not have a closed-form expression, and specific estimation methods, such as variational Bayes or Gibbs sampling, have to be used.

We propose two different algorithms for inferring the parameters of the MCBR model. We first estimate the model by variational Bayes; the resulting algorithm is called VB-MCBR. We also detail an algorithm, called Gibbs-MCBR, based on a Gibbs sampling procedure.

2.2.1. Estimation by Variational Bayes: VB-MCBR

The variational Bayes (or VB) approach provides an approximation $q(\Theta)$ of $p(\Theta \mid \mathbf{y})$, where $q(\Theta)$ is taken in a given family of distributions and $\Theta = [\mathbf{w}, \lambda, \alpha, \mathbf{z}, \pi]$. Additionally, the variational Bayes approach often uses the following mean-field approximation, which allows the factorization between the approximate distribution of the latent variables and the approximate distributions of the parameters:
$$q(\Theta) = q(\mathbf{w})\, q(\lambda)\, q(\alpha)\, q(\mathbf{z})\, q(\pi). \qquad (11)$$

We introduce the Kullback-Leibler divergence $\mathcal{D}(q(\Theta))$, which measures the discrepancy between the true posterior $p(\Theta \mid \mathbf{y})$ and the variational approximation $q(\Theta)$. One can decompose the marginal log-likelihood $\log p(\mathbf{y})$ as
$$\log p(\mathbf{y}) = \mathcal{F}(q(\Theta)) + \mathcal{D}(q(\Theta)), \qquad (12)$$
with
$$\mathcal{F}(q(\Theta)) = \int d\Theta\, q(\Theta) \log \frac{p(\mathbf{y}, \Theta)}{q(\Theta)}, \qquad \mathcal{D}(q(\Theta)) = \int d\Theta\, q(\Theta) \log \frac{q(\Theta)}{p(\Theta \mid \mathbf{y})}, \qquad (13)$$
where $\mathcal{F}(q(\Theta))$ is called the free energy and can be seen as a measure of the quality of the model. As $\mathcal{D}(q(\Theta)) \geq 0$, the free energy is a lower bound on $\log p(\mathbf{y})$, with equality if and only if $q(\Theta) = p(\Theta \mid \mathbf{y})$. Thus, inferring the density $q(\Theta)$ of the parameters corresponds to maximizing $\mathcal{F}$ with respect to the free distribution $q(\Theta)$. In practice, the VB approach consists in maximizing the free energy $\mathcal{F}$ iteratively with respect to the approximate distribution $q(\mathbf{z})$ of the latent variables and with respect to the approximate distributions of the parameters of the model $q(\mathbf{w})$, $q(\lambda)$, $q(\alpha)$, and $q(\pi)$.
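
For completeness, the decomposition (12)-(13) follows directly from Bayes' rule, $p(\mathbf{y}, \Theta) = p(\Theta \mid \mathbf{y})\, p(\mathbf{y})$, and from the fact that $q(\Theta)$ integrates to one:
$$\mathcal{F}(q(\Theta)) + \mathcal{D}(q(\Theta)) = \int d\Theta\, q(\Theta) \log \left(\frac{p(\mathbf{y}, \Theta)}{q(\Theta)} \cdot \frac{q(\Theta)}{p(\Theta \mid \mathbf{y})}\right) = \int d\Theta\, q(\Theta) \log p(\mathbf{y}) = \log p(\mathbf{y}).$$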

The variational distributions and the pseudocode of the VB-MCBR algorithm are provided in Appendix A. This algorithm maximizes the free energy $\mathcal{F}$; in practice, iterations are performed until convergence to a local maximum of $\mathcal{F}$. With an ARD prior (i.e., $K = p$ and fixing $z_j = j$), we retrieve the same update formulas as those of variational ARD [18].

2.2.2. Estimation by Gibbs Sampling: Gibbs-MCBR

We develop here an estimation of the MCBR model using Gibbs sampling [26]. The resulting algorithm is called Gibbs-MCBR; the pseudocode of the algorithm and the candidate distributions are provided in Appendix B. The Gibbs sampling algorithm generates a sequence of samples from the joint distribution in order to approximate the marginal distributions. The main idea is to use conditional distributions, which are known and easy to sample from, instead of directly computing the marginals from the joint law by integration (the joint law may be unknown or hard to sample from). The sampling is done iteratively over the different parameters, and the final parameter estimates are obtained by averaging the sampled values across iterations (the first iterations may be discarded; this is called the burn-in).
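
Schematically, and independently of the particular conditional distributions given in Appendix B, such a sampler can be sketched as follows (the conditional samplers passed in `conditionals` are hypothetical placeholders, and the averaged quantity is assumed to be stored under the key "w"):

import numpy as np

def gibbs(init, conditionals, n_iter=5000, burn_in=4000, seed=0):
    """conditionals: dict mapping a parameter name to a function(state, rng)
    that returns a draw from its full conditional distribution."""
    rng = np.random.RandomState(seed)
    state = dict(init)
    kept = []
    for it in range(n_iter):
        for name, sampler in conditionals.items():
            state[name] = sampler(state, rng)   # draw from p(name | all other parameters)
        if it >= burn_in:
            kept.append(np.copy(state["w"]))    # keep post-burn-in draws of the weights
    return np.mean(kept, axis=0)                # Monte Carlo estimate of E[w | y]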

2.2.3. Initialization and Priors on the Model Parameters

Our model requires only a few hyperparameters; we choose here to use slightly informative and class-specific hyperparameters, in order to reflect a wide range of possible behaviors of the weight distribution. This choice of priors is equivalent to setting, as priors on the weight parameters, heavy-tailed centered Student's t-distributions with variances at different scales. We set $K = 9$, with weakly informative priors $\lambda_{1,k} = 10^{k-4}$ and $\lambda_{2,k} = 10^{-2}$ for $k \in \{1, \ldots, K\}$. Moreover, we set $\alpha_1 = \alpha_2 = 1$. Starting with a given number of classes and letting the model automatically prune the classes can be seen as a means of avoiding costly model selection procedures. The choice of class-specific priors is also useful to avoid label-switching issues and thus speeds up convergence. Crucially, the priors used here can be used in any regression problem, provided that the target data are approximately scaled to the range of values used in our experiments. In that sense, the present choice of priors can be considered as universal. We also randomly initialize $q(\mathbf{z})$ for VB-MCBR (or $\mathbf{z}$ for Gibbs-MCBR).
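
This setting can be written compactly as follows (NumPy sketch, including the random initialization of $q(\mathbf{z})$ for an arbitrary number of features):

import numpy as np

K, p = 9, 500
lambda_1 = 10.0 ** (np.arange(1, K + 1) - 4)  # lambda_{1,k} = 10^(k-4): 1e-3, 1e-2, ..., 1e5
lambda_2 = np.full(K, 1e-2)                   # lambda_{2,k} = 10^-2 for all classes
alpha_1 = alpha_2 = 1.0
rng = np.random.RandomState(0)
q_z = rng.dirichlet(np.ones(K), size=p)       # random initialization of q(z_j = k)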

2.3. Validation and Model Evaluation
2.3.1. Performance Evaluation

Our method is evaluated with a cross-validation procedure that splits the available data into training and validation sets. In the following, $(\mathbf{X}_l, \mathbf{y}_l)$ denotes a learning set, $(\mathbf{X}_t, \mathbf{y}_t)$ a test set, and $\hat{\mathbf{y}}_t = f(\mathbf{X}_t, \mathbf{w}, b)$ the predicted target, where $\mathbf{w}$ and $b$ are estimated from the training set. The performance of the different models is evaluated using $\zeta$, the ratio of explained variance:
$$\zeta\left(\mathbf{y}_t, \hat{\mathbf{y}}_t\right) = \frac{\operatorname{var}\left(\mathbf{y}_t\right) - \operatorname{var}\left(\mathbf{y}_t - \hat{\mathbf{y}}_t\right)}{\operatorname{var}\left(\mathbf{y}_t\right)}. \qquad (14)$$
This is the amount of variability in the response that can be explained by the model (perfect prediction yields $\zeta = 1$, while $\zeta < 0$ if prediction is worse than chance).
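
A direct transcription of (14) reads as follows; perfect prediction gives 1 and predicting the mean of the test targets gives 0.

import numpy as np

def explained_variance_ratio(y_true, y_pred):
    # zeta(y_t, y_hat_t) = (var(y_t) - var(y_t - y_hat_t)) / var(y_t), see (14)
    return (np.var(y_true) - np.var(y_true - y_pred)) / np.var(y_true)

y = np.array([1.0, 2.0, 3.0, 4.0])
print(explained_variance_ratio(y, y))                          # 1.0
print(explained_variance_ratio(y, np.full_like(y, y.mean())))  # 0.0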

2.3.2. Competing Methods

In our experiments, the proposed algorithms are compared to different state-of-the-art regularization methods:
(i) Elastic Net regression [27], which requires setting two parameters $\lambda_1$ and $\lambda_2$. In our analyses, a cross-validation procedure within the training set is used to optimize these parameters. Here, we use $\lambda_1 \in \{0.2\tilde{\lambda}, 0.1\tilde{\lambda}, 0.05\tilde{\lambda}, 0.01\tilde{\lambda}\}$, where $\tilde{\lambda} = \|\mathbf{X}^T\mathbf{y}\|_\infty$, and $\lambda_2 \in \{0.1, 0.5, 1, 10, 100\}$. Note that $\lambda_1$ and $\lambda_2$ parametrize heterogeneous norms.
(ii) Support Vector Regression (SVR) with a linear kernel [28], which is the reference method in neuroimaging. The $C$ parameter is optimized by cross-validation in the range $10^{-3}$ to $10^{1}$ in multiplicative steps of 10.
(iii) Bayesian Ridge Regression (BRR), which is equivalent to MCBR with $K = 1$ and $\lambda_1 = \lambda_2 = \alpha_1 = \alpha_2 = 10^{-6}$, that is, weakly informative priors.
(iv) Automatic Relevance Determination (ARD), which is equivalent to MCBR with $K = p$ and $\lambda_1 = \lambda_2 = \alpha_1 = \alpha_2 = 10^{-6}$, that is, weakly informative priors.

All these methods are used after an Anova-based feature selection, as this maximizes their performance. Indeed, irrelevant features and redundant information can decrease the accuracy of a predictor [29]. The optimal number of voxels is selected within the range $\{50, 100, 250, 500\}$ using a nested cross-validation within the training set; we do not directly select a threshold on $P$-value or cluster size, but rather a predefined number of features. The parameters of the learning function are also estimated using a nested cross-validation within the training set, to ensure a correct validation and an unbiased comparison of the methods. All methods are implemented in C and called from Python: the implementation of elastic net is based on coordinate descent [30], SVR is based on LibSVM [31], and both are accessed through the Scikit-learn open-source package [32].
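
As an illustration, such a nested scheme can be written for one of the reference methods (SVR) with a recent scikit-learn API (SelectKBest with f_regression, Pipeline, GridSearchCV); this is a sketch of ours, not the exact code used in the experiments.

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([("anova", SelectKBest(f_regression)),
                 ("svr", SVR(kernel="linear"))])
param_grid = {"anova__k": [50, 100, 250, 500],          # number of selected voxels
              "svr__C": [1e-3, 1e-2, 1e-1, 1e0, 1e1]}   # C grid given above
inner = GridSearchCV(pipe, param_grid, cv=5)            # nested cross-validation on the training set
# The outer loop, e.g. sklearn.model_selection.cross_val_score(inner, X, y, cv=...),
# then provides the reported prediction scores.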

For VB-MCBR and Gibbs-MCBR, in order to avoid a costly internal cross-validation, we select 500 voxels, this selection being performed on the training set. The number of iterations is fixed to 5000 (with a burn-in of 4000 iterations) for Gibbs-MCBR and to 500 for VB-MCBR; preliminary results on both simulated and real data showed that these values are sufficient for an accurate inference of the model. As explained previously, we set $K = 9$, with weakly informative priors $\lambda_{1,k} = 10^{k-4}$ and $\lambda_{2,k} = 10^{-2}$ for $k \in \{1, \ldots, K\}$. Moreover, we set $\alpha_1 = \alpha_2 = 1$, and we randomly initialize $q(\mathbf{z})$ for VB-MCBR (or $\mathbf{z}$ for Gibbs-MCBR).

3. Experiments and Results

3.1. Experiments on Simulated Data

We now evaluate and illustrate MCBR on two different sets of simulated data.

3.1.1. Details on Simulated Regression Data

We first test MCBR on a simulated data set designed for the study of ill-posed regression problems, that is, $n \ll p$. Data are simulated as follows:
$$\mathbf{X} \sim \mathcal{N}(0, 1), \quad \epsilon \sim \mathcal{N}(0, 1), \quad \mathbf{y} = 2\left(\mathbf{X}_1 + \mathbf{X}_2 - \mathbf{X}_3 - \mathbf{X}_4\right) + 0.5\left(\mathbf{X}_5 + \mathbf{X}_6 - \mathbf{X}_7 - \mathbf{X}_8\right) + \epsilon. \qquad (15)$$
We have $p = 200$ features, $n_l = 50$ images for the training set, and $n_t = 50$ images for the test set. We compare MCBR to the reference methods, but we do not use feature selection, as the number of features is not very high.
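
A sketch of this simulation with NumPy, using the coefficients as reconstructed in (15) above:

import numpy as np

rng = np.random.RandomState(0)
p, n_train, n_test = 200, 50, 50
X = rng.randn(n_train + n_test, p)
eps = rng.randn(n_train + n_test)
y = (2.0 * (X[:, 0] + X[:, 1] - X[:, 2] - X[:, 3])
     + 0.5 * (X[:, 4] + X[:, 5] - X[:, 6] - X[:, 7]) + eps)
X_train, y_train = X[:n_train], y[:n_train]   # 50 training images
X_test, y_test = X[n_train:], y[n_train:]     # 50 test images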

3.1.2. Results on Simulated Regression Data

We average the results of 15 different trials; the average explained variance is shown in Table 1. Gibbs-MCBR outperforms the other approaches, yielding higher prediction accuracy than the reference elastic net and ARD methods. Its prediction accuracy is also more stable than that of the other methods. VB-MCBR falls into a local maximum of $\mathcal{F}$ and does not yield an accurate prediction. BRR has a low prediction accuracy compared to other methods such as ARD: it cannot finely adapt the weights of the relevant features, as these features are regularized in the same way as the irrelevant ones. SVR also has low accuracy, because no feature selection is performed here; it thus suffers from the curse of dimensionality, unlike methods such as ARD or elastic net, which perform feature selection and model estimation jointly.

In Figure 2, we represent the probability density function of the distributions of the weights obtained with BRR (a), Gibbs-MCBR (b), and ARD (c). With BRR, the weights are grouped in a unimodal density. ARD is far more adaptive and sets many weights to zero. The Gibbs-MCBR algorithm creates a multimodal distribution: many weights are highly regularized (pink distributions), while informative features are allowed to have higher weights (blue distributions).

With MCBR, weights are clustered into different groups depending on their predictive power, which is of interest in applications such as fMRI inverse inference, as it can yield more interpretable models. Indeed, the class to which the features with the highest weights ($\{\mathbf{X}_1, \mathbf{X}_2, \mathbf{X}_3, \mathbf{X}_4\}$) belong is small (average size of 6 features) but has a high purity (percentage of relevant features in the class) of 74%.

3.1.3. Comparison between VB-MCBR and Gibbs-MCBR

We now look at the values of the weights $w_1$ and $w_2$ at the different steps of the two algorithms (see Figure 3). We can see that VB-MCBR (b) quickly falls into a local maximum, while Gibbs-MCBR (a) explores the space and reaches the region of the correct set of parameters (red dot). VB-MCBR is thus not optimal in this case.

3.2. Simulated Neuroimaging Data
3.2.1. Details on Simulated Neuroimaging Data

The simulated data set $\mathbf{X}$ consists of $n = 100$ images (of size $12 \times 12 \times 12$ voxels) containing a set of four square regions of interest (ROIs) of size $2 \times 2 \times 2$. We call $\mathcal{R}$ the support of the ROIs (i.e., the 32 resulting voxels of interest). Each of the four ROIs has a fixed weight in $\{-0.5, 0.5, -0.5, 0.5\}$, and we call $w_{i,j,k}$ the weight of the $(i,j,k)$ voxel. The resulting images are smoothed with a Gaussian kernel with a standard deviation of 2 voxels, to mimic the correlation structure observed in real fMRI data. To simulate the spatial variability across images (intersubject variability, movement artifacts in intrasubject variability), we define a new support of the ROIs, called $\tilde{\mathcal{R}}$, such that, for each image $l$, 50% (randomly chosen) of the weights $\mathbf{w}$ are set to zero; thus, $\tilde{\mathcal{R}} \subset \mathcal{R}$. We simulate the target $\mathbf{y}$ for the $l$th image as
$$y_l = \sum_{(i,j,k) \in \tilde{\mathcal{R}}} w_{i,j,k}\, X_{i,j,k,l} + \epsilon_l, \qquad (16)$$
with the signal in the $(i,j,k)$ voxel of the $l$th image simulated as
$$X_{i,j,k,l} \sim \mathcal{N}(0, 1), \qquad (17)$$
and $\epsilon_l \sim \mathcal{N}(0, \gamma)$ a Gaussian noise with standard deviation $\gamma > 0$. We choose $\gamma$ in order to have a signal-to-noise ratio of 5 dB.
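
A NumPy/SciPy sketch of this simulation is given below; the positions of the four ROIs and the per-image pruning of 50% of the weights are our own illustrative reading of the description above.

import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.RandomState(0)
shape, n = (12, 12, 12), 100
w = np.zeros(shape)
corners = [(2, 2, 2), (2, 8, 8), (8, 2, 8), (8, 8, 2)]        # assumed ROI corners
for c, val in zip(corners, [-0.5, 0.5, -0.5, 0.5]):
    w[c[0]:c[0] + 2, c[1]:c[1] + 2, c[2]:c[2] + 2] = val      # four 2x2x2 ROIs, support R

X = np.array([gaussian_filter(rng.randn(*shape), sigma=2) for _ in range(n)])  # smoothed images
y = np.empty(n)
for l in range(n):
    w_l = np.where(rng.rand(*shape) < 0.5, 0.0, w)            # support R-tilde for image l
    y[l] = np.sum(w_l * X[l])
y += rng.randn(n) * y.std() / np.sqrt(10 ** 0.5)              # additive noise, roughly 5 dB SNR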

3.2.2. Results on Simulated Neuroimaging Data

We compare VB-MCBR and Gibbs-MCBR with the different competing algorithms. The resulting images of weights are given in Figure 4, with the true weights (a) and the resulting Anova F-scores (b). The reference methods can detect the truly informative regions (ROIs), but elastic net (f) and ARD (h) retrieve only part of the support of the weights. Moreover, elastic net yields an overly sparse solution. BRR (g) also retrieves the ROIs but does not yield a sparse solution, as all the features are regularized in the same way. We note that the weights in the feature space estimated by SVR (e) are nonzero everywhere and do not outline the support of the ground truth. VB-MCBR (c) converges to a local maximum similar to the solution found by BRR (g); that is, it creates only one nonempty class and thus regularizes all the features similarly. We can thus clearly see that, in this model, the variational Bayes approach is very sensitive to the initialization and can fall into nonoptimal local maxima when the support of the weights is very sparse. Finally, Gibbs-MCBR (d) retrieves most of the true support of the weights by performing an adapted regularization.

3.3. Experiments and Results on Real fMRI Data

In this section, we assess the performance of MCBR in an experiment on the mental representation of object size, where the aim is to predict the size of an object seen by the subject during the experiment, in both intrasubject and intersubject cases. The size (or scale parameter) of the object will be the target variable 𝐲.

3.3.1. Details on Real Data

We apply the different methods on a real fMRI dataset related to an experiment studying the representation of objects, on ten subjects, as detailed in [33]. During this experiment, ten healthy volunteers viewed objects of 4 shapes in 3 different sizes (yielding 12 different experimental conditions), with 4 repetitions of each stimulus in each of the 6 sessions. We pooled data from the 4 repetitions, resulting in a total of $n = 72$ images per subject (one image of each stimulus per session). Functional images were acquired on a 3-T MR system with an eight-channel head coil (Siemens Trio, Erlangen, Germany) as T2*-weighted echo-planar image (EPI) volumes. Twenty transverse slices were obtained with a repetition time of 2 s (echo time: 30 ms; flip angle: 70°; 2×2×2-mm voxels; 0.5 mm gap). Realignment, normalization to MNI space, and general linear model (GLM) fit were performed with the SPM5 software (http://www.fil.ion.ucl.ac.uk/spm/software/spm5/). The normalization is the conventional method of SPM (implying affine and nonlinear transformations) and not the one using unified segmentation. The normalization parameters are estimated on the basis of a whole-head EPI acquired in addition and are then applied to the partial EPI volumes. The data are not smoothed. In the GLM, the effect of each of the 12 stimuli convolved with a standard hemodynamic response function was modeled separately, while accounting for serial autocorrelation with an AR(1) model and removing low-frequency drift terms using a high-pass filter with a cutoff of 128 s. The GLM is fitted separately in each session for each subject, and we used in the present work the resulting session-wise parameter estimate images (the $\beta$-maps are used as rows of $\mathbf{X}$). The four different shapes of objects were pooled for each of the three sizes, as we are interested in finding discriminative information about size. This reduces to a regression problem, in which our goal is to predict a simple scalar factor (the size of an object). All the analyses are performed without any prior selection of regions of interest and use the whole acquired volume.

Intrasubject Regression Analysis
First, we perform an intrasubject regression analysis, in which each subject is analyzed independently with a 12-fold cross-validation. The dimensions of the real data set for one subject are $p \sim 7 \times 10^4$ and $n = 72$ (divided into 3 different sizes, 24 images per size). We evaluate the performance of the method by a leave-one-condition-out cross-validation (i.e., leave-6-images-out), so that the GLM is fitted separately on the training and test sets. The parameters of the reference methods are optimized with a nested leave-one-condition-out cross-validation within the training set, in the ranges given before.

Intersubject Regression Analysis
Additionally, we perform an intersubject regression analysis on the sizes. The intersubject analysis relies on subject-specific fixed-effects activations; that is, for each condition, the 6 activation maps corresponding to the 6 sessions are averaged together. This yields a total of 12 images per subject, one for each experimental condition. The dimensions of the real data set are $p \sim 7 \times 10^4$ and $n = 120$ (divided into 3 different sizes). We evaluate the performance of the method by leave-one-subject-out cross-validation. The parameters of the reference methods are optimized with a nested leave-one-subject-out cross-validation within the training set, in the ranges given before.
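
A sketch of this scheme, assuming scikit-learn's LeaveOneGroupOut splitter:

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

subjects = np.repeat(np.arange(10), 12)      # subject label of each of the 120 images
logo = LeaveOneGroupOut()
print(logo.get_n_splits(groups=subjects))    # 10 folds, one per left-out subject
# for train_idx, test_idx in logo.split(X, y, groups=subjects):
#     fit on (X[train_idx], y[train_idx]) and compute zeta on (X[test_idx], y[test_idx])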

3.3.2. Results on Real Data

Intrasubject Regression Analysis
The results obtained by the different methods are given in Table 2; the $P$-values are computed using a paired $t$-test across subjects. VB-MCBR outperforms the other methods. In contrast with the results on simulated data, VB-MCBR here falls into a local maximum similar to Bayesian ridge regression, which performs well in this experiment. Moreover, both Gibbs-MCBR and VB-MCBR are more stable than the reference methods.

Intersubject Regression Analysis
The results obtained with the different methods are given in Table 3. As in the intrasubject analysis, both MCBR approaches outperform the reference methods SVR, BRR, and ARD; their prediction accuracy is, however, similar to that of elastic net. In this case, Gibbs-MCBR performs slightly better than VB-MCBR, but the difference is not significant.
One major asset of MCBR (and more particularly of Gibbs-MCBR, as VB-MCBR often falls into a one-class local maximum) is that it creates a clustering of the features based on their relevance in the predictive model. This clustering can be accessed through the variable $\mathbf{z}$, which governs the regularization applied to the different features. In Figure 5, we give the histogram of the weights of Gibbs-MCBR for the intersubject analysis. We keep the weights and the values of $\mathbf{z}$ of the last iteration; the different classes are represented as dots of different colors superimposed on the histogram. We can notice that the pink distribution represented at the bottom of the histogram corresponds to relevant features. This cluster is very small (19 voxels), compared to the two blue classes represented at the top of the histogram, which contain many voxels (746 voxels) that are highly regularized, as they are noninformative.
The maps of weights found by the different methods are detailed in Figure 6. The methods are used in combination with an Anova-based univariate feature selection (2500 voxels selected, in order to have a good support of the weights). Like elastic net, Gibbs-MCBR yields a sparse solution, but it extracts a few more voxels. The map found by elastic net is not easy to interpret, with very few informative voxels scattered across the whole occipital cortex. The map found by SVR is not sparse in the feature space and is thus difficult to interpret, as the spatial layout of the neural code is not clearly extracted. VB-MCBR does not yield a sparse map either, all the features having nonzero weights.

4. Discussion

It is well known that in high-dimensional problems, regularization of feature loadings significantly increases the generalization ability of the predictive model. However, this regularization has to be adapted to each particular dataset. In place of costly cross-validation procedures, we cast regularization in a Bayesian framework and treat the regularization weights as hyperparameters. The proposed approach yields an adaptive and efficient regularization and can be seen as a compromise between a global regularization (Bayesian Ridge Regression) that does not take into account the sparse or focal distribution of the information and automatic relevance determination. Additionally, MCBR creates a clustering of the features based on their relevance and thus explicitly extracts groups of informative features.

Moreover, MCBR can cope with the different issues of ARD. ARD is subject to an underfitting in the hyperparameter space that corresponds to an underfitting in model selection (i.e., on the features to be pruned) [19]. Indeed, as ARD is estimated by maximizing the evidence, models with fewer selected features are preferred, since the integration is performed over fewer dimensions and the evidence is thus higher: ARD chooses the sparsest model among models with similar accuracy. A contrario, MCBR requires far fewer hyperparameters ($2 \times K$, with $K \ll p$) and suffers less from this issue, as the sparsity of the model is defined at the group level. Moreover, a full Bayesian framework for estimating ARD requires setting priors on the hyperparameters (e.g., $\alpha_1$ and $\alpha_2$), and it may be sensitive to the specific choice of these hyperparameters. A solution is to use an internal cross-validation to optimize these parameters, but this approach can be computationally expensive. In the case of MCBR, the distributions of the hyperparameters are bound to a class and not to each feature; the proposed approach is thus less sensitive to the choice of the hyperparameters, as the choice of good hyperparameters for the features is handled at the class level.

On simulated data, our approach performs better than classical methods such as SVR, BRR, ARD, and elastic net, and yields a more stable prediction accuracy. Moreover, by adapting the regularization to different groups of voxels, MCBR retrieves the true support of the weights and recovers a sparse solution. Results on real data show that MCBR yields more accurate predictions than the other regularization methods. As it yields a less sparse solution than elastic net, it gives access to more plausible loading maps, which are necessary for understanding the spatial organization of brain activity, that is, retrieving the spatial layout of the neural coding. On real fMRI data, the explicit clustering provided by Gibbs-MCBR is also an interesting aspect of the model, as it can extract a few groups of relevant features from many voxels.

In some experiments, the variational Bayes algorithm yields less accurate predictions than the Gibbs sampling approach, which can be explained by the difficulty of initializing the different variables (especially $\mathbf{z}$) when the support of the weights is overly sparse. Moreover, the VB-MCBR algorithm relies on a variational Bayes approach, which may not be optimal due to the strong approximations made in model inference. A contrario, Gibbs-MCBR is more time-consuming but yields a better model inference. Finally, the variability in the results may be explained by the difficulty of estimating the model (optimality is not ensured).

The question of model selection (i.e., the choice of the number of classes $K$) has not been addressed in this paper. One could use the free energy to select the best model, but, due to the instability of VB-MCBR, this approach does not seem promising. A more interesting method is the one detailed in [34], which can be used with the Gibbs sampling algorithm. Here, model selection is performed implicitly by emptying classes that do not fit the data well. In that respect, the choice of heterogeneous priors for the different classes is crucial: replacing our priors with class-independent priors (i.e., $\lambda_{1,k} = 10^{-2}$ for $k \in \{1, \ldots, K\}$) in the intersubject analysis on size prediction leads Gibbs-MCBR to a local maximum similar to VB-MCBR.

Finally, this model is not restricted to Bayesian regularization for regression and can be used for classification, within a probit or logit model [35, 36]. The proposed model may thus be used for diagnosis in medical imaging, for the prediction of both continuous and discrete variables.

5. Conclusion

In this paper, we have proposed a model for adaptive regression, called MCBR. The proposed method integrates BRR and ARD within the same Bayesian framework and applies a different regularization to relevant and irrelevant features. It can tune the regularization to the different levels of sparsity encountered in fMRI data analysis, and it yields interpretable information for fMRI inverse inference, namely, the latent class variable $\mathbf{z}$. Experiments on both simulated and real data show that our approach is well suited for neuroimaging, as it yields accurate and stable predictions compared to state-of-the-art methods.

Appendices

A. VB-MCBR Algorithm

The variational Bayes approach yields the following variational distributions:

(i)π‘ž(𝐰)βˆΌπ’©(π°βˆ£πœ‡,𝚺) with 𝐀=diag𝑙1,…,𝑙𝑝with𝑙𝑗=πΎξ“π‘˜=1π‘žξ€·π‘§π‘—ξ€Έπ‘™=π‘˜1,π‘˜π‘™2,π‘˜ξ‚΅π‘Žβˆ€π‘—βˆˆ{1,…,𝑝},(A.1)𝚺=1π‘Ž2𝐗𝑇𝐗+π€ξ‚Άβˆ’1,π‘Ž(A.2)πœ‡=1π‘Ž2πšΊπ—π‘‡π²;(A.3)(ii)π‘ž(πœ†π‘˜)βˆΌΞ“(𝑙1,π‘˜,𝑙2,π‘˜) with 𝑙1,π‘˜=πœ†1,π‘˜+12π‘βˆ‘π‘—=1π‘žξ€·π‘§π‘—ξ€Έ,𝑙=π‘˜(A.4)2,π‘˜=πœ†2,π‘˜+12π‘βˆ‘π‘—=1ξ€·πœ‡2𝑗𝑗+Ξ£π‘—π‘—ξ€Έπ‘žξ€·π‘§π‘—ξ€Έ;=π‘˜(A.5)(iii)π‘ž(𝛼)βˆΌΞ“(π‘Ž1,π‘Ž2) with π‘Ž1=𝛼1+𝑛2,π‘Ž(A.6)2=𝛼2+12(π²βˆ’π—πœ‡)𝑇1(π²βˆ’π—πœ‡)+2ξ€·TrπšΊπ—π‘‡π—ξ€Έ;(A.7)(iv)π‘ž(𝑧𝑗=π‘˜)∼exp(πœŒπ‘—π‘˜) with πœŒπ‘—π‘˜1=βˆ’2ξ€·πœ‡2𝑗+Σ𝑗𝑗𝑙1,π‘˜π‘™2,π‘˜ξ€·πœ‹+lnπ‘˜ξ€Έ+12Ψ𝑙1,π‘˜ξ€Έξ€·π‘™βˆ’log2,π‘˜πœ‹ξ€Έξ€Έ,(A.8)π‘˜=exp{Ξ¨(π‘‘π‘˜βˆ‘)βˆ’Ξ¨(π‘˜=πΎπ‘˜=1π‘‘π‘˜)}𝑑,(A.9)π‘˜=πœ‚π‘˜+𝑝𝑗=1π‘žξ€·π‘§π‘—ξ€Έ,=π‘˜(A.10) where Ξ¨ is the digamma function Ξ¨(π‘₯)=Ξ“ξ…ž(π‘₯)/Ξ“(π‘₯). The VB-MCBR algorithm is provided in pseudo-code in Algorithm 1.

Initialize π‘Ž 1 = 𝛼 1 , π‘Ž 2 = 𝛼 2 , 𝑙 1 = πœ† 1 , 𝑙 2 = πœ† 2 and 𝑑 π‘˜ = πœ‚ π‘˜
Randomly initialize π‘ž ( 𝑧 𝑗 = π‘˜ )
Set a number of iterations max steps
repeat
 Compute 𝐴 using (A.1), Ξ£ using (A.2) and πœ‡ using (A.3).
 Compute 𝑙 1 using (A.4) and 𝑙 2 using (A.5).
 Compute π‘Ž 1 using (A.6) and π‘Ž 2 using (A.7).
 Compute 𝜌 𝑗 π‘˜ using (A.8).
 Compute πœ‹ π‘˜ using (A.9) and 𝑑 π‘˜ using (A.10).
until max steps;
Return   πœ‡ .
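
For concreteness, a compact NumPy transcription of the updates (A.1)-(A.10), following Algorithm 1, could read as follows; this is an illustrative sketch under the hyperparameter settings of Section 2.2.3, not the authors' implementation.

import numpy as np
from scipy.special import digamma

def vb_mcbr(X, y, K=9, n_iter=500, seed=0):
    n, p = X.shape
    rng = np.random.RandomState(seed)
    alpha_1 = alpha_2 = 1.0
    lambda_1 = 10.0 ** (np.arange(1, K + 1) - 4)   # class-specific priors
    lambda_2 = np.full(K, 1e-2)
    eta = np.ones(K)
    a1, a2 = alpha_1, alpha_2
    l1, l2 = lambda_1.copy(), lambda_2.copy()
    d = eta.copy()
    r = rng.dirichlet(np.ones(K), size=p)          # q(z_j = k), randomly initialized
    XtX, Xty = X.T.dot(X), X.T.dot(y)
    for _ in range(n_iter):
        A = r.dot(l1 / l2)                                     # (A.1): E_q[lambda_{z_j}]
        Sigma = np.linalg.inv(a1 / a2 * XtX + np.diag(A))      # (A.2)
        mu = a1 / a2 * Sigma.dot(Xty)                          # (A.3)
        m2 = mu ** 2 + np.diag(Sigma)                          # E_q[w_j^2]
        l1 = lambda_1 + 0.5 * r.sum(axis=0)                    # (A.4)
        l2 = lambda_2 + 0.5 * r.T.dot(m2)                      # (A.5)
        a1 = alpha_1 + 0.5 * n                                 # (A.6)
        resid = y - X.dot(mu)
        a2 = alpha_2 + 0.5 * resid.dot(resid) + 0.5 * np.trace(Sigma.dot(XtX))   # (A.7)
        log_pi = digamma(d) - digamma(d.sum())                 # log of (A.9)
        rho = (-0.5 * np.outer(m2, l1 / l2) + log_pi
               + 0.5 * (digamma(l1) - np.log(l2)))             # (A.8)
        r = np.exp(rho - rho.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)                      # normalize q(z_j = .)
        d = eta + r.sum(axis=0)                                # (A.10)
    return mu, r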

B. Gibbs-MCBR Algorithm

With Θ=[𝐰,πœ†,𝛼,𝐳,πœ‹], we have the following candidate distributions (i.e., the distributions used for the sampling of the different parameters):

(i) $p(\mathbf{w} \mid \Theta - \{\mathbf{w}\}) \propto \mathcal{N}(\mathbf{w} \mid \mu, \Sigma)$ with
$$\Sigma = \left(\alpha\, \mathbf{X}^T\mathbf{X} + \mathbf{A}\right)^{-1} \quad \text{with} \quad \mathbf{A} = \operatorname{diag}\left(\lambda_{z_1}, \ldots, \lambda_{z_p}\right), \qquad (B.1)$$
$$\mu = \Sigma\, \alpha\, \mathbf{X}^T\mathbf{y}; \qquad (B.2)$$
(ii) $p(\lambda \mid \Theta - \{\lambda\}) \propto \prod_{k=1}^{K} \Gamma\left(\lambda_k \mid l_{1,k}, l_{2,k}\right)$ with
$$l_{1,k} = \lambda_{1,k} + \frac{1}{2} \sum_{j=1}^{p} \delta_{z_j = k}, \qquad (B.3)$$
$$l_{2,k} = \lambda_{2,k} + \frac{1}{2} \sum_{j=1}^{p} \delta_{z_j = k}\, w_j^2; \qquad (B.4)$$
(iii) $p(\alpha \mid \Theta - \{\alpha\}) \propto \Gamma\left(a_1, a_2\right)$ with
$$a_1 = \alpha_1 + \frac{n}{2}, \qquad (B.5)$$
$$a_2 = \alpha_2 + \frac{1}{2} (\mathbf{y} - \mathbf{X}\mu)^T (\mathbf{y} - \mathbf{X}\mu); \qquad (B.6)$$
(iv) $p(z_j \mid \Theta - \{\mathbf{z}\}) \propto \operatorname{mult}\left(\exp \rho_{j,1}, \ldots, \exp \rho_{j,K}\right)$ with
$$\rho_{jk} = -\frac{1}{2} w_j^2\, \lambda_k + \ln\left(\pi_k\right) + \frac{1}{2} \log \lambda_k; \qquad (B.7)$$
(v) $p(\pi_k \mid \Theta - \{\pi\}) \propto \operatorname{Dir}\left(d_k\right)$ with
$$d_k = \eta_k + \sum_{j=1}^{p} \delta_{z_j = k}. \qquad (B.8)$$

The algorithm is provided in pseudocode in Algorithm 2.

Initialize $\alpha_1$, $\alpha_2$, $\lambda_1$, $\lambda_2$, and $\eta_k$
Randomly initialize $\mathbf{z}$
Set a number of iterations burn_number for the burn-in
Set a number of iterations max_steps
repeat
 Compute $\Sigma$ using (B.1) and $\mu$ using (B.2).
 Sample $\mathbf{w}$ from $\mathcal{N}(\mathbf{w} \mid \mu, \Sigma)$.
 Compute $l_1$ using (B.3) and $l_2$ using (B.4).
 Sample $\lambda$ from $\prod_{k=1}^{K} \Gamma\left(\lambda_k \mid l_{1,k}, l_{2,k}\right)$.
 Compute $a_1$ using (B.5) and $a_2$ using (B.6).
 Sample $\alpha$ from $\Gamma\left(a_1, a_2\right)$.
 Compute $\rho_{jk}$ using (B.7).
 Sample $\mathbf{z}$ from $\operatorname{mult}\left(\exp \rho_{j,1}, \ldots, \exp \rho_{j,K}\right)$.
 Compute $d_k$ using (B.8).
 Sample $\pi_k$ from $\operatorname{Dir}\left(d_k\right)$.
until max_steps;
Return the average value of $\mathbf{w}$ after burn_number iterations.
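
Similarly, a compact NumPy transcription of the conditional draws (B.1)-(B.8), following Algorithm 2, could read as follows (an illustrative sketch, not the authors' implementation):

import numpy as np

def gibbs_mcbr(X, y, K=9, n_iter=5000, burn_in=4000, seed=0):
    n, p = X.shape
    rng = np.random.RandomState(seed)
    alpha_1 = alpha_2 = 1.0
    lambda_1 = 10.0 ** (np.arange(1, K + 1) - 4)
    lambda_2 = np.full(K, 1e-2)
    eta = np.ones(K)
    z = rng.randint(K, size=p)                    # random initialization of z
    lam, alpha, pi = np.ones(K), 1.0, np.full(K, 1.0 / K)
    XtX, Xty = X.T.dot(X), X.T.dot(y)
    w_sum = np.zeros(p)
    for it in range(n_iter):
        Sigma = np.linalg.inv(alpha * XtX + np.diag(lam[z]))                 # (B.1)
        mu = alpha * Sigma.dot(Xty)                                          # (B.2)
        w = rng.multivariate_normal(mu, Sigma)                               # sample w
        l1 = lambda_1 + 0.5 * np.bincount(z, minlength=K)                    # (B.3)
        l2 = lambda_2 + 0.5 * np.bincount(z, weights=w ** 2, minlength=K)    # (B.4)
        lam = rng.gamma(shape=l1, scale=1.0 / l2)                            # sample lambda (scale = 1/rate)
        a1 = alpha_1 + 0.5 * n                                               # (B.5)
        resid = y - X.dot(w)
        a2 = alpha_2 + 0.5 * resid.dot(resid)                                # (B.6)
        alpha = rng.gamma(shape=a1, scale=1.0 / a2)                          # sample alpha
        rho = -0.5 * np.outer(w ** 2, lam) + np.log(pi) + 0.5 * np.log(lam)  # (B.7)
        prob = np.exp(rho - rho.max(axis=1, keepdims=True))
        prob /= prob.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(K, p=prob[j]) for j in range(p)])           # sample z_j
        d = eta + np.bincount(z, minlength=K)                                # (B.8)
        pi = rng.dirichlet(d)                                                # sample pi
        if it >= burn_in:
            w_sum += w
    return w_sum / (n_iter - burn_in)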

Acknowledgment

The authors acknowledge support from the ANR Grant ViMAGINE ANR-08-BLAN-0250-02.