Computational Intelligence and Neuroscience

Volume 2016, Article ID 2961727, 15 pages

http://dx.doi.org/10.1155/2016/2961727

## Analysis of Residual Dependencies of Independent Components Extracted from fMRI Data

^{1}Dipartimento di Ingegneria dell’Informazione, University of Pisa, 56122 Pisa, Italy^{2}Laboratory of Clinical Biochemistry, Department of Experimental Pathology, University of Pisa Medical School, 56126 Pisa, Italy^{3}Fondazione Toscana Gabriele Monasterio, 56124 Pisa, Italy

Received 2 September 2015; Accepted 22 November 2015

Academic Editor: Ricardo Aler

Copyright © 2016 N. Vanello et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Independent component analysis (ICA) of functional magnetic resonance imaging (fMRI) data can be employed as an exploratory method. The lack in the ICA model of strong a priori assumptions about the signal or about the noise leads to difficult interpretations of the results. Moreover, the statistical independence of the components is only approximated. Residual dependencies among the components can reveal informative structure in the data. A major problem is related to model order selection, that is, the number of components to be extracted. Specifically, overestimation may lead to component splitting. In this work, a method based on hierarchical clustering of ICA applied to fMRI datasets is investigated. The clustering algorithm uses a metric based on the mutual information between the ICs. To estimate the similarity measure, a histogram-based technique and one based on kernel density estimation are tested on simulated datasets. Simulations results indicate that the method could be used to cluster components related to the same task and resulting from a splitting process occurring at different model orders. Different performances of the similarity measures were found and discussed. Preliminary results on real data are reported and show that the method can group task related and transiently task related components.

#### 1. Introduction

Functional magnetic resonance imaging (fMRI) is a widespread and well-established technique for the in vivo functional exploration of the brain. fMRI data analysis methods can be roughly classified as* confirmatory* or hypothesis-driven methods and* exploratory* or data-driven methods [1–3]. The former are used in order to test the validity of the experimenters’ hypotheses but do not allow the detection of unexpected phenomena, that is, effects that are not modelled a priori. On the other hand, data-driven methods provide results that are based on general assumptions about signal generation but are often difficult to be fully interpreted [4].

Independent component analysis (ICA) is one of the most used exploratory methods both for task-associated neural responses and for resting state signal processing and is based on the assumption of statistical independence of the components to be extracted [4]. This method has proven its capabilities of separating physiological components of different origins (e.g., vascular and respiratory), of detecting unexpected phenomena, such as activations transiently time-locked with the stimulus, and of isolating artefact-related signal changes, such as those due to head movements [5, 6].

Nonetheless, ICA approaches present different drawbacks. In fact, the extracted components are difficult to classify since they do not have an explicit order or relationship among themselves [7]. The stochastic nature of the algorithm, the finite number of the observations, and the presence of noise influence the solution reliability and stability. For this reason several approaches have been proposed to face this issue [8–10]. In [8] the reliability of the solution is investigated by clustering the results obtained with multiple runs. A different approach was proposed for solving the same issue in [9] where an alignment of different maps obtained from multiple runs of the algorithm is obtained using information criteria and then the reliability of the components is estimated looking at the consistency and variability with tests. In [10], the independent components are classified according to different measures, as maps dipolarity, which is thought to be related to physiological plausibility of the component, and a consistency measure obtained through the use of surrogate decomposition.

Another issue affects the estimation of ICA model. In fact, the best number of components to be extracted, that is, the model order, is not known a priori. Underestimating the model order may cause an information loss, while overestimating it may produce spurious results or the splitting of interesting components into more components [7, 11]. In [11], the effects of varying model order are investigated in simulated datasets. Noise and signal intensity values as well as dataset dimensions were highlighted as parameters largely influencing this choice. Moreover, the authors suggested that adopting a high model order might lead to different conclusions in the case of group analysis with respect to single subject analysis. Specifically, in the case of intersubject variability of the maps, increasing model order might lead to results difficult to be interpreted, while in single subject analysis a higher model order was seen as a possibility to distinguish subnetworks characterized by a different temporal behavior. In [12], the effects of increasing model orders in a probabilistic group ICA are highlighted. While the stability of ICA decomposition was found to decrease with increasing model order, the use of higher model orders allowed detection of interesting neuroanatomical and functional components. To overcome order indeterminacy problem, various methods have been proposed [13–18]. Specifically, information theoretic criteria, such as Akaike Information Criterion, Minimum Description Length, and Bayesian Information Criterion, have been applied to solve this issue in fMRI data [13–16]. In [13], the classical information theoretic criteria were adapted to account for temporal and spatial correlations in fMRI data using a subsampling technique to obtain independent and identical distributed samples. In [16], the model order selection was obtained by introducing a noisy ICA model within a Bayesian framework [17]. In [18], bootstrap stability technique as applied to PCA reduction step is proposed. At this time, however, there is no general agreement about the best approach to choose ICA model order.

Apart from stability and model order indeterminacy, the interpretation of ICA results must take into account the fact that the statistical independence among the estimated ICs is only approximated by available algorithms. In fact, the presence of noise and the finite number of observations for each measurement do not allow accurate estimation of the higher order statistics [19] used to search for statistical independence. Thus, residual dependence between the extracted components can still be found and can be used to reveal some structure in the dataset. A topographic approach was suggested for ICA [20], where the ICA model was modified to take into account a topographic order among the extracted components: the distance between two elements in the topographic maps is related to the residual dependency. In [21], a comparison of ICA approaches that exploit clustering techniques to estimate the model is performed. Specifically, the topographic approaches are compared with standard ones, as well as with an approach defining a tree-like dependency structure among the components [22]. This work demonstrated that the information inherent in the dependencies among the components can separate artifacts and task related components as well as detecting interesting relationships among components. In a work by Ma et al. [23], the residual dependencies among components were estimated using a similarity measure based on Mutual Information (MI) [24] between the components estimated using a kernel density estimation (KDE) approach [25]. Afterwards, the components are classified using a hierarchical clustering technique [26]. This latter work showed that this approach could group interesting components, as applied to group fMRI data acquired during rest, across different algorithms and model orders. Specifically, it is claimed that the higher dimensionality of clustered components at higher orders is related to the merging of split components.

In this work, we propose an approach similar to the one developed by Ma et al. [23] which differentiates for a number of methodological aspects. Firstly, we explored two different methods for MI estimation, specifically a histogram-based technique and the KDE approach. Moreover, we adopted a different MI-based measure that is a metric in a mathematical sense and a different criterion for the clustering of the algorithm results. Finally, in [23], the robustness of the approach against model order indeterminacy was deduced from the analysis results of real resting state fMRI data. Differently, here we performed an analysis on the ICs estimated from simulated dataset with increasing model orders. By the analysis of the results from simulated and real dataset, we investigated the possibility of improving the grouping of the components originated from a splitting process due to model order overestimation.

#### 2. Theory and Methodology

Datasets from fMRI are formed by time sequences of images or volumes, acquired while a subject is performing some sensory/motor or cognitive task at rest. In this work we focused on single subject fMRI data acquired during the execution of a block design task, alternating in time two different conditions. Synthetic data were simulated accordingly.

##### 2.1. Simulated Data

A synthetic brain dataset was used for our preliminary studies of the clustering method [27]. Signal increase in response to neural activation was simulated convolving the time course describing the task with a typical hemodynamic response function: we chose the three-parameter gamma variate function defined as [28], where is a constant proportional to the amplitude of the signal increase. The task description was modeled as a square wave alternating in time: 15 seconds ON with 15 seconds OFF conditions. The total time length of each simulated dataset was three minutes, for a total number of scans equal to 60 (simulated TR = 3 s). The activated regions were created modulating the baseline intensities of a group of voxels, selected using a mask, with the time series previously described. Activated regions are supposed to be limited in a region of connected voxels and were created using a mask formed by region smoothed with a Gaussian kernel to simulate the spatial point spread function due to the vasculature. The Gaussian kernel was a bidimensional one with 3 mm full width half maximum (FWHM) parameter. All the simulations that we present in this paper were obtained by defining two nonoverlapping activated regions of about 2 centimeters of diameter. In the following, these activated regions will be indicated as region of interest (ROI) number 1 and number 2. The signal change was about 2 percent as compared to the baseline level in the center of the activated regions. Different time delays between the activation time courses of the two regions were simulated: 1.25, 2.5, and 5 seconds.

The noise in the images was taken to be i.i.d. Gaussian distributed with zero mean and variance . For each delay value, different noise levels were simulated. The simulated noise standard deviation was 0.33%, 0.66%, 1%, and 1.33% of the mean image value at the baseline level. The contrast to noise ratio, defined as , where is the signal change following an activation, was equal to, approximately, 6, 3, 2, and 1.5 (resp.): this was the maximum value at the center of the activated regions.

##### 2.2. Real Data

Brain activity was measured in a 25-year-old right-handed healthy female. The subject signed a written informed consent prior to the enrolment into the study under a protocol approved by the Ethics Committee of the University of Pisa, Italy. The scanner used was a 1.5 Tesla, GE Signa Cv/i. An anatomical image was acquired with a 3D SPGR sequence. The functional scans were gradient echo-EPI with TR = 3 sec, TE = 40 msec, and FA = 90 degrees with bandwidth of 62.5 kHz. Twenty axial slices, covering the entire brain, were acquired with slice thickness of 5 mm, 240 mm FOV, and in plane 64 × 64 spatial matrix, and the number of time scans was 63, for a total acquisition time of 189 s. The subject head was restrained with foam in order to minimize head movements. The subject performed a simple finger tapping sequence with her right hand fingers: the task was a block design paradigm alternating 15 sec ON and 15 seconds OFF conditions. The number of scans was 60 for a total run length of three minutes. The images time series were interpolated to correct for slice timing effects and volume registered to a reference scan to reduce movement related effects. The images were then spatially transformed to the Talairach-Tournoux Atlas reference system [29]. These preprocessing steps were performed using AFNI [30].

##### 2.3. ICA Model

In spatial ICA we can write the observed data as , where is the th image or volume in the sequence and is a spatial index for each volume element (voxel). The observed data can be written as a linear mixing of spatial ICs : where . Both and the mixing matrix are unknown. In this model and are seen as random variables and is an index for the observations of each random variable.

The ICA problem consists in finding an unmixing matrix so that the estimated independent components can be written as . Under the hypothesis that the mixing matrix is invertible, , each estimated component can be written as a linear combination of the observed variables , with the th column of and . Each can be seen as a spatial map, individuating a value for every voxel. The th spatial fixed map is time modulated by the corresponding time course, given by the th column of . The components, whose associated time courses highly correlate with the paradigm, are considered as task related components or consistently task related (CTR). The components, whose activation is related only partially with the paradigm, are called transiently task related (TTR) components [5]. Several approaches to estimate the ICA have been described (see [6] and references therein) and for the reasons highlighted in the introduction they all lead to components that are only approximately statistically independent. One way to estimate the ICs is a method based on the maximization of the non-Gaussianity of the [31].

A fast fixed point algorithm [32] can be used to find the weights such that non-Gaussianity of is maximized. The fastICA algorithm can exploit different nonlinear functions to approximate negentropy whose maximization leads to non-Gaussianity maximization.

In this work the ICA decomposition was obtained using the fastICA algorithm with tanh as nonlinearity [33]. Different model orders were applied, specifically 5, 10, 15, and 20. After the extraction step, each independent map was transformed into map statistics to find the voxels contributing significantly to the corresponding component. Given an independent map , the map can be computed as , where is the mean of the values of and are their standard deviation.

##### 2.4. Proposed Method

We hypothesized that the components resulting from a splitting process due to a model order overestimation show a higher residual dependence with respect to other components. Under this hypothesis, the residual dependencies among the components could be explored to detect split components. We proposed to estimate the residual dependencies using pairwise distance measure between two components, and , based upon the definition of mutual information. A hierarchical clustering approach was then employed to classify and visualize the similarities, in terms of distance measure, between the extracted ICs. The clustering results can be visualized by a dendrogram that highlights the merging of the components due to the similarity criterion.

###### 2.4.1. Similarity Measure

The similarity measure between two components and is defined as follows:where is the joint entropy and is the mutual information between two sources. The choice to use rather than is based on the fact that the latter is not a distance metric in the mathematical sense [34]. In this work histogram-based technique and a kernel density estimation approach to compute were compared.

*(1) Histogram-Based Technique*. The probability that a variable value, that is, independent component, lies in the th interval can be estimated as the frequency of occurrence, so that we can write , where is the total number of observations for and is the number of times belongs to the th interval. This probability in the following will be referred to as .

The probability that the variable lies in the th interval while the variable lies in the th interval is given by , where is the number of times the couple belongs to the bidimensional bin . This quantity can be written as .

We estimated the joint entropy as while the Mutual Information was computed as where and are the number of states, or bins, for variable and , respectively.

The number of bins was chosen as [35], where is the number of observations, that is, the voxels in the image after masking out of brain voxels. A rank ordering operation was used before performing the histogram-based technique; that is, the data values were replaced by their ranks before the histogram is built. This operation facilitates the computation of mutual information as the distributions of the components are transformed into uniform distributions. Moreover, since we were interested in similarities between two distributions, this approach allows the results not to be strongly affected by the marginal distributions of the components. The correlations between two components are preserved and hence the validity of the results.

*(2) Kernel Density Estimation*. Kernel density estimation (KDE) can be used to estimate mutual information [25]. This method consists in estimating the probability density function (pdf) of a variable with a linear combination of kernel functions such thatwhere is the th value of the variable and is a probability density function itself.

A Gaussian kernel can be used to write the pdf asThe parameter is a smoothing parameter, called bandwidth: small values lead to taking into account finer details that can be not interesting or originate from noise; larger values can hide interesting features in the distributions. For Gaussian distributed variables, with variance , the values that minimize the mean square error in estimating the density have been found to be [36].

In the bidimensional case, the bidimensional kernel function is given bywith . The bandwidth is related to the underlying distribution as in the monodimensional case. For Gaussian variables an optimal value has been found to be with and is the average marginal standard deviation [37].

Given the probability density functions , , and the mutual information and the entropy between two variables and can be estimated asThese quantities can be estimated with standard procedures for numerical integration. We adopted adaptive Simpson quadrature approach.

###### 2.4.2. Clustering

The hierarchical clustering approach is based on the Ward method [38]. This method consists in merging every possible cluster pair and choosing the one which minimizes the information loss. To estimate this quantity, the error sum-of-squares (ESS) is used: where is the mean of cluster and are the data points. The distance between two clusters is defined byAt each stage, the Ward method merges two groups such that their will be minimized. The merging of the components can be visualized with a dendrogram: the abscissa of the dendrogram indicates the components, while the ordinate level, at which two components or groups of components are merged, is related to the change in the error sum-of-squares after joining groups.

##### 2.5. Method Validation

The performance of the proposed method can be evaluated on simulated datasets by verifying that the dendrogram correctly merges the components that were split by the ICA algorithm. To identify the components maps that were related to the simulated activations, receiver operating characteristics (ROC) curves [39] were used and applied to the obtained ICA maps. ROC curves are plots of true positive detection fraction against false positive detection fraction obtained varying the threshold level of the maps. In the case of simulated dataset, the true activated areas are known and serve as ground truth measure, so that it is possible to estimate true positives by looking at the activated voxels, that is, over threshold, in one independent component and by checking whether they correspond to the known activated regions. An area under curve (AUC) estimated from the ROC curve ranging from 0.7 to 0.8 is considered a result showing a fair accuracy of the test performed, whereas an AUC between 0.6 and 0.7 is considered a poor accuracy index [40]. An independent component whose AUC is significant for both ROIs is a component that merges both the two activated regions.

Since we focused on fMRI data acquired during a task execution, the classification of the results can take into account the corresponding time course of each component. Specifically, interesting CTR and TTR components can be highlighted.

#### 3. Results

##### 3.1. Simulated Datasets

Results from the application of the ICA algorithm on simulated datasets are summarized in Tables 1, 2, and 3. Each table is related to the result obtained from the datasets with time delays between the activations time courses of 1.25, 2.5, and 5 seconds, respectively. For each table different noise levels results are summarized. The two activated regions are indicated by ROIs #1 and #2. In each column, the ICs whose maps result in an AUC greater than 0.6 are shown. Specifically the IC index is shown, in brackets, along with the corresponding AUC.