Computational Intelligence and Neuroscience

Volume 2016, Article ID 1068434, 22 pages

http://dx.doi.org/10.1155/2016/1068434

## Evaluation of Second-Level Inference in fMRI Analysis

Department of Data Analysis, Ghent University, H. Dunantlaan 1, 9000 Ghent, Belgium

Received 9 July 2015; Revised 21 August 2015; Accepted 4 October 2015

Academic Editor: Pierre L. Bellec

Copyright © 2016 Sanne P. Roels et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

We investigate the impact of decisions in the second-level (i.e., over subjects) inferential process in functional magnetic resonance imaging on (1) the balance between false positives and false negatives and on (2) the data-analytical stability, both proxies for the reproducibility of results. Second-level analysis based on a mass univariate approach typically consists of 3 phases. First, one proceeds via a general linear model for a test image that consists of pooled information from different subjects. We evaluate models that take into account first-level (within-subjects) variability and models that do not take into account this variability. Second, one proceeds via inference based on parametrical assumptions or via permutation-based inference. Third, we evaluate 3 commonly used procedures to address the multiple testing problem: familywise error rate correction, False Discovery Rate (FDR) correction, and a two-step procedure with minimal cluster size. Based on a simulation study and real data we find that the two-step procedure with minimal cluster size results in most stable results, followed by the familywise error rate correction. The FDR results in most variable results, for both permutation-based inference and parametrical inference. Modeling the subject-specific variability yields a better balance between false positives and false negatives when using parametric inference.

#### 1. Introduction

In cognitive neurosciences, functional Magnetic Resonance Imaging (fMRI) plays an important role to localize brain regions and to study interactions among those regions (resp., functional segregation and functional integration; see, e.g., [1]) The analysis of an fMRI time course in a single subject (first-level analysis) offers some insight into subject-specific brain functioning while group studies that aggregate results over individuals (second-level analysis) yield more generalizable results. In this paper, we focus on the mass univariate approach in which the brain is divided in small volume units or voxels, although alternatives exist (e.g., [2]). For each of these voxels, a general linear model (GLM) is used to model brain activation, at the first and the second level [3]. The activation is then judged at the voxel level, rather than based on topological features. The selection of activated voxels can be viewed as a sequence of different phases [4]. For first-level analyses, Carp [5] demonstrated the large variation in the choices made in each of these different phases which impacts results. In second-level analyses, to a lesser extent, different combinations of choices are possible too. We consider the following phases in the analysis of group studies: aggregation of data over subjects, inference, and correction for multiple testing.

In two commonly used software programs to analyze fMRI data (i.e., SPM and FSL [5]), the expected activation in each voxel is modeled in a two-step approach [6]. In the first-level analysis, the evidence per subject is summarized in a linear contrast of the parameters, necessary to model the study design. These contrast images are then passed to the second-level analysis in which the evidence is weighted over subjects. To pool this information over subjects, one can either take into account subject-specific variability in constructing the voxelwise test statistics or only rely on the estimated contrasts and not take into account this subject-specific variability [7].

After pooling the data, one proceeds to the second phase, the inference phase. While parametric inference offers the advantage of closed-form null distributions that can be used to obtain values, it depends on strong assumptions which are not easy to satisfy in practice [8] and have not been tested extensively [9]. An alternative is to use nonparametric methods such as permutation-based inference to create an empirical null distribution conditional on the observed sample [9–11].

Third, inference must be corrected for the huge multiple testing that is induced by the mass univariate approach in which simultaneously over 100.000 tests are performed. As Bennett et al. [12] and Lieberman and Cunningham [13] discuss, there was (and yet is) no golden standard to address the choice for multiple testing corrections. We consider three different multiple testing procedures: controlling the False Discovery Rate (FDR), controlling the familywise error rate (FWE), and an approach based on uncorrected testing combined with a minimal cluster size. While FDR [14, 15] and FWE control (see, e.g., [8]) have a strong theoretical background with a focus, respectively, on the proportion of false positives among all selected voxels and on the probability to observe at least one false positive, the third approach is purely empirical in nature [13].

These three corrections are designed to control the multiple testing problem at the voxel level. Other popular alternatives that focus on topological features such as cluster size (i.e., the size of a neighboring collection of voxels) or cluster height exist as well. In a recent study, Woo et al. [16] advocate against the use of cluster-based inference and demonstrate its problematic use when studies are sufficiently powered. By definition, it is cumbersome to interpret the findings resulting from “significant clusters” because these may not reflect a set of significant constituting voxels (see also [9]). On the other hand, the third approach [13] resembles cluster-based testing but instead of setting a threshold for cluster size based on cluster significance, a fixed prespecified threshold for the minimum cluster size is set. For completeness, we therefore also extend the third approach by choosing the threshold as in cluster-based inference. However, it is important to point out that we do not intend to investigate cluster-based testing which is fundamentally different from the approach taken here and relies on different topological assumptions. Instead, we focus on voxelwise testing (for an elaborate investigation of cluster-based testing, we refer to [4]).

The choices made in each of the 3 phases of a second-level analysis is crucial steps in the analysis of fMRI data and may consequently influence results. The use of such second-level analyses or group studies is widespread [6, 10, 17, 18] but the impact of varying procedures at the different phases has not yet been extensively validated. One can distinguish three different aspects in the evaluation of methods [4]: validity, reliability, and stability. The validity can be assessed by verifying whether the false positive rate is controlled at a predefined, nominal level. Further, the balance between type I errors (false positives) and type II errors (false negatives) has long been the main interest in the validation of testing procedures (e.g., [8]). One has also acknowledged the importance of investigating the reliability of methods (e.g., [19, 20]). The extent to which a method is reliable can be measured through the overlap between activated brain regions over repeated measures, for example, in test-retest settings.

The concept of data-analytical stability, originally developed in genetics [21], was recently introduced into the context of fMRI data analysis [4]. This measure allows us to quantify reproducibility of results through the variability on different measures, for example, the variance on the number of selected voxels over replications (either in simulation studies with a known ground truth or through subsampling of real data). Stable methods are characterized by a low variability on the number of selected voxels. Data-analytical stability is thus a useful additional criterion to distinguish between methods. In this paper, we assess the influence of different choices made in the three phases on the reproducibility of results. We hereby focus on the balance between false positives and false negatives and on the stability as measures for reproducibility.

In Section 2 we give a brief overview of the different techniques. Next, we describe the details and the results of our simulation study. In Section 4, we present the results and the details from the real data application. In Discussion, we summarize our findings and end with some recommendations for the practitioner.

#### 2. Methods

In this section we provide an overview on the different inferential techniques that we will consider in the simulation study and real data example. First, we describe the methods for pooling the evidence over subjects in the mass univariate GLM approach for fMRI data at the second level. Next, we summarize different multiple testing strategies that are frequently exploited in the fMRI literature, such as approaches that control the familywise error rate, approaches for control of the False Discovery Rate, and a two-step procedure based on an uncorrected threshold but requiring a minimum cluster size. Finally, we discuss the construction of test statistics under the null hypothesis that rely on parametric assumptions versus nonparametric approaches.

##### 2.1. Voxel-Based GLM Approach to Analyzing fMRI Data at the Group Level

Group-level inference typically proceeds via a two-step procedure [6]. In the first step, an analysis is conducted at the voxel level for each subject separately (with ), and an appropriate contrast of interest is constructed. In a second step, these contrast images are combined to weight evidence over the subjects.

###### 2.1.1. First-Level Analysis

For each subject , the BOLD signal is sampled on time points in every voxel (with ) during an fMRI experiment. For every voxel , a general linear model (GLM) is then used to relate the voxels’ time course (i.e., the BOLD signal) to the expected BOLD signal under brain activation in the experimental setup (the design matrix ) (see, e.g., [22–25]):

The design matrix is the product of a convolution of the stimulus onset function with a hemodynamic response function (HRF) (e.g., [26]). When fitting model (1), one needs to account for the residual correlation between consecutive time points. Let represent the variance-covariance matrix of in model (1). To deal with the temporal correlation, a matrix is typically constructed such that holds. If and are correctly specified, can be unbiasedly estimated via a simple least squares approach. By relying on “decorrelated” or whitened outcome and predictor, that is, and are premultiplied by , an unbiased estimator for the variance of the estimator for is obtained (see, e.g., [3, 27, 28]). Testing for specific differences between the activation in conditions for voxel is then possible by testing the appropriate contrasts of the elements of with a contrast vector , that is, test .

###### 2.1.2. Second-Level Analysis

Next we focus on the group level analysis for a specific voxel (). For ease of notation, we will drop the voxel index in the text below. For the contrast of interest, let denote , the estimated contrasts at the first level for subjects to . Obviously, those contrasts are not exactly known but estimated with some imprecision. Suppose for now that those contrasts are known and denoted by , then a GLM can be used to weight the group evidence (e.g., [18]):where denotes the design matrix. In the simplest case where one is interested in knowing whether there is activation over all subjects, the design matrix equals a simple column matrix consisting of elements 1. Alternatively, in the presence of between-subjects conditions or groups (e.g., one wants to know whether the activation is different between males and females), can take more complex forms with additional regressors. Furthermore is the group error vector, with with the identity matrix of dimension and the between-subject variance.

In practice however is unknown, and instead is used as outcome:with and . Since , it follows that the variance-covariance matrix consists of the sum of two parts:

The first term in the right hand side of (4) is inherent to the uncertainty associated with the estimation of , the within-subject variability, while the second term is related to the variability in the estimation of , that is, the between-subjects variance.

In the literature on multisubject fMRI data analysis, two ways of dealing are frequently used. Below, we refer to these two approaches as the Ordinary Least Squares (OLS) approach and the Weighted Least Squares (WLS) approach, respectively.

*OLS: The Homoscedastic Case*. In the first case, described in Holmes and Friston [17], one assumes that within-subject variances do not differ over subjects and that the residual noise is homogeneous across all subjects. Assume that simplifies the form of (in model (6)) to This implies that the within- and between-subject variability cannot be disentangled.

Mumford and Nichols [18] demonstrate that in model (3) (p 1470, in ) can then be estimated as while the residual error variance is estimated as . Hence, this simply amounts to solving the normal equations in the simple linear regression case and inference proceeds as usual under the GLM [28]. This is implemented in FSL [29] under OLS while in SPM [30] this is the standard implementation. In AFNI [31] this is implemented under 3dttest++ (see also [32]).

*WLS: Allowing for Heteroscedasticity*. The WLS approach, or more generally the Generalized Least Squares (GLS) approach, explicitly models the two components of the variance-covariance of in (6):More specifically, a weighting matrix is constructed such that more variable estimates are down-weighted in the estimation of . In the special case where the design matrix only consists of a column of 1’s, the closed form expression for the estimator of equals [18]More generally, equalswith the weighting matrix:

Inference for the variance components is more complex since no closed form solutions exist. Several (restricted) maximal likelihood approaches have been suggested in the literature (see, e.g., [32]). In practice, the within-subject variance is often set to the first-level variance estimates ([18], also in the FSL software package).

In FSL this is implemented under Flame1 while in AFNI this is implemented under 3dMEMA (see also [33]).

##### 2.2. Dealing with the Multiple Testing Problem

It is well-known that the mass-univariate approach in which () voxels are tested simultaneously is faced with huge multiple testing problem, even at the second level. Indeed, if 100.000 tests for which is true are conducted simultaneously, each at a significance level of , then, by chance alone, 5000 voxels will be declared active. Hence, the number of false positives (FP, see Table 1) becomes unacceptably high. While the interest lies in minimizing both the number of FPs and false negatives (FNs), multiple testing procedures aim to control FP rates (type I error rates).