Abstract

Alzheimer’s disease (AD) imposes not only a substantial financial burden on the health care system but also a significant emotional burden on patients and their families. Predicting the cognitive performance of subjects from their magnetic resonance imaging (MRI) measures and identifying the relevant imaging biomarkers are important research topics in the study of Alzheimer’s disease. Recently, multitask learning (MTL) methods with sparsity-inducing norms (e.g., the $\ell_{2,1}$-norm) have been widely studied to select a discriminative feature subset from MRI features by incorporating the inherent correlations among multiple clinical cognitive measures. However, these previous works formulated the prediction tasks as linear regression problems; their major limitation is the assumption of a linear relationship between the MRI features and the cognitive outcomes. Several multikernel-based MTL methods have been proposed and have shown better generalization ability owing to their nonlinearity. We quantify the power of existing linear and nonlinear MTL methods by evaluating their performance on cognitive score prediction in Alzheimer’s disease. Moreover, we extend the traditional $\ell_{2,1}$-norm to a more general $\ell_{q,1}$-norm ($q \ge 1$). Experiments on the Alzheimer’s Disease Neuroimaging Initiative database showed that the nonlinear $\ell_1$-$\ell_q$-MKMTL method not only achieved better prediction performance than state-of-the-art competitive methods but also effectively fused the multimodality data.

1. Introduction

Alzheimer’s disease (AD) is a severe neurodegenerative disorder that results in a loss of mental function due to the deterioration of brain tissue, leading directly to death [1]. It accounts for 60–70% of age-related dementia, affecting an estimated 30 million individuals in 2011, and the number is projected to exceed 114 million by 2050 [2]. The cause of AD is poorly understood, and there is currently no cure. AD has a long preclinical phase, lasting a decade or more. There is increasing research emphasis on detecting AD in the preclinical phase, before the onset of the irreversible neuron loss that characterizes the dementia phase of the disease, since therapies and treatments are most likely to be effective in this early phase. The Alzheimer’s Disease Neuroimaging Initiative (ADNI, http://adni.loni.usc.edu/) has been facilitating the scientific evaluation of neuroimaging data, including magnetic resonance imaging (MRI) and positron emission tomography (PET), along with other biomarkers and clinical and neuropsychological assessments, for predicting the onset and progression of MCI (mild cognitive impairment) and AD. Early diagnosis of AD is key to the development, assessment, and monitoring of new treatments for AD.

Recently, rather than predicting categorical variables as in classification, various studies have started to estimate continuous clinical variables from brain images. Instead of classifying a subject into binary or multiple predetermined categories or stages of the disease, regression focuses on estimating continuous values, which may help to assess a patient’s disease progression. The most commonly used cognitive measures are the Alzheimer’s Disease Assessment Scale (ADAS) cognitive total score, the Mini Mental State Exam (MMSE) score, and the Rey Auditory Verbal Learning Test (RAVLT). Regression analyses are commonly used to predict cognitive scores from imaging measures. The relationship between commonly used cognitive measures and structural changes measured with MRI has been previously studied by regression models, and the results demonstrated that there exists a relationship between baseline MRI features and cognitive measures [3, 4]. For example, Wan et al. proposed an elegant regression model called CORNLIN that employs a sparse Bayesian learning algorithm to predict multiple cognitive scores based on 98 structural MRI regions of interest (ROIs) for Alzheimer’s disease patients. The polynomial model used in CORNLIN can detect either a nonlinear or a linear relationship between brain structure and cognitive decline [3]. Stonnington et al. adopted relevance vector regression, a sparse kernel method formulated in a Bayesian framework, to predict four sets of cognitive scores using MRI voxel-based morphometry measures [4]. One of the biggest challenges in inferring cognitive outcomes from MRI is the high dimensionality, which affects the computational performance and leads to wrong estimation and identification of the relevant predictors. To reduce the high dimensionality and identify the relevant biomarkers, sparse methods have attracted a great amount of research effort in the neuroimaging field due to their sparsity-inducing property. Ye et al. applied sparse logistic regression with stability selection to ADNI data for robust feature selection [5], successfully predicted the conversion from MCI to probable AD, and identified a small subset of biosignatures.

It is known that there exist inherent correlations among the multiple clinical cognitive variables of a subject. However, many works do not model the dependence among multiple tasks and neglect the correlation between clinical tasks, which is potentially useful. When the tasks are believed to be related, learning multiple related tasks jointly can improve the performance relative to learning each task separately. Multitask learning (MTL) is a statistical learning framework which aims at learning several models in a joint manner. It has been commonly used to obtain better generalization performance than learning each task individually [6, 7]. The critical issues in MTL are to identify how the tasks are related and to build learning models that capture such task relatedness. The most recent studies [6, 8, 9] employed multitask learning with $\ell_{2,1}$-norm [7] regularization and aimed to select features that could predict all or most clinical scores. Because the $\ell_{2,1}$-norm couples the tasks, the $\ell_{2,1}$-norm regularized regression model is able to select a common subset of features across all the tasks. However, in these learning methods, each task is traditionally performed by formulating a linear regression problem, in which the cognitive score is a linear function of the neuroimaging measures.

Kernel methods have been studied to model the cognitive scores as nonlinear functions of the neuroimaging measures. Recently, many kernel-based classification or regression methods with faster optimization or stronger generalization performance have been proposed and investigated through theoretical analysis and experimental evaluation [10, 11]. Multiple kernel learning (MKL) [12], which learns the optimal kernel for a given task as a weighted linear combination of predefined candidate kernels, has been introduced to handle the problem of kernel selection. The multiple kernel learning method not only learns an optimal combination of the given base kernels but also provides a flexible framework for exploiting the nonlinear relationship between MRI measures and cognitive scores.

Kernels have been widely used in building predictive models for classification or regression in AD; it is therefore important to extend the existing kernel-based learning methods to the case of multitask learning. In this paper, we adopt two nonlinear multikernel-based multitask learning methods, building on [13], for constructing regression models that exploit and investigate the nonlinear relationship between MRI measures and cognitive scores. Moreover, an $\ell_{q,1}$-norm is used to extend the traditional $\ell_{2,1}$-norm. The goals of our work are to (1) predict subjects’ cognitive scores in a number of neuropsychological assessments using their MRI measures across the entire brain; (2) determine how the nonlinear methods perform compared with the linear $\ell_{2,1}$-norm MTL and other MTL methods with different assumptions (no previous studies have systematically and extensively examined the prediction performance of linear and nonlinear MTL methods); and (3) assess the learning capacity of the multikernel framework for fusing multimodality data.

The rest of the paper is organized as follows. In Section 2, we provide a description of the multitask learning formulation. The linearized MTL method with the $\ell_{q,1}$-norm is presented in Section 3, and two multikernel-based MTL methods are presented in Section 4. In Section 5, we present the experimental results on the ADNI-1 dataset and compare the performance of the linearized and kernelized MTL methods. The conclusion is drawn in Section 6.

2. Multitask Learning

Consider a multitask learning (MTL) setting with $T$ tasks. Let $d$ be the number of covariates, shared across all the tasks, and $n$ be the number of samples. Let $X \in \mathbb{R}^{n \times d}$ denote the matrix of covariates, $Y \in \mathbb{R}^{n \times T}$ the matrix of responses with each row corresponding to a sample, and $W \in \mathbb{R}^{d \times T}$ the parameter matrix, with column $w_t$ corresponding to task $t$ and row $w^j$ corresponding to feature $j$.

The MTL formulation focuses on the following regularized loss function:

$$\min_{W} \; l(W) + \Omega(W), \quad (1)$$

where $l(\cdot)$ denotes the loss function and $\Omega(\cdot)$ is the regularizer. In the current context, we assume the loss to be the square loss; that is,

$$l(W) = \sum_{i=1}^{n} \| y_i - x_i W \|_2^2, \quad (2)$$

where $y_i$ and $x_i$ are the $i$th rows of $Y$ and $X$, respectively, corresponding to the multitask response and covariates of the $i$th sample. We note that the MTL framework can easily be extended to other loss functions. Based on prior knowledge, we then add the penalty $\Omega(W)$ to encode the relatedness among the tasks.
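To make the setup concrete, the following minimal NumPy sketch evaluates the regularized objective (1)-(2) for a given weight matrix; the names (square_loss, mtl_objective) and the toy data are illustrative, not taken from the original implementation.

```python
import numpy as np

def square_loss(X, Y, W):
    """Square loss l(W) = sum_i ||y_i - x_i W||_2^2 over all samples and tasks."""
    R = Y - X @ W                       # residual matrix, shape (n, T)
    return np.sum(R ** 2)

def mtl_objective(X, Y, W, penalty):
    """Regularized MTL objective (1): square loss plus a penalty Omega(W)."""
    return square_loss(X, Y, W) + penalty(W)

# toy usage: n = 10 samples, d = 5 features, T = 3 tasks
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 5))
W = rng.standard_normal((5, 3))
Y = X @ W + 0.1 * rng.standard_normal((10, 3))
print(mtl_objective(X, Y, W, penalty=lambda W: 0.1 * np.sum(W ** 2)))
```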

3. $\ell_{q,1}$-Norm Regularized Linearized Multitask Learning, $\ell_{q,1}$-MTL

The $\ell_{2,1}$-norm is popularly used in multitask feature learning [14]. All the existing algorithms for multitask feature learning assume a linear relationship between MRI features and cognitive scores and aim to learn a common subset of features for all tasks. Since the $\ell_{2,1}$-norm regularizer imposes sparsity across the features and nonsparsity across the tasks, the features that are discriminative for all tasks receive large weights. However, the $\ell_{2,1}$-norm is a fixed, nonadaptive penalty. To obtain an adaptive regularization better suited to different data structures, we extend the $\ell_{2,1}$-norm to a larger class of mixed norms, $\|W\|_{q,1} = \sum_{j=1}^{d} \|w^j\|_q$, that can be adapted to the data. The objective function of the linear $\ell_{q,1}$-MTL is formulated as

$$\min_{W} \; \sum_{i=1}^{n} \|y_i - x_i W\|_2^2 + \lambda \|W\|_{q,1}. \quad (3)$$

When $q = 1$, problem (3) reduces to the $\ell_1$-regularized (Lasso) problem; when $q = 2$, problem (3) reduces to the $\ell_{2,1}$-regularized problem.
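The penalty itself is straightforward to compute: an $\ell_q$ norm over each row of $W$ (one feature across all tasks), summed over the rows. A minimal sketch, with lq1_norm as an illustrative name:

```python
import numpy as np

def lq1_norm(W, q):
    """l_{q,1} norm: the l_q norm of each row (one feature across tasks), summed."""
    return np.sum(np.sum(np.abs(W) ** q, axis=1) ** (1.0 / q))

W = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 0.0]])
print(lq1_norm(W, q=2))  # 5 + 0 + 1 = 6.0, the usual l_{2,1} norm
print(lq1_norm(W, q=1))  # 8.0, the plain l_1 norm of all entries
```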

An efficient algorithm for solving the $\ell_{q,1}$-regularized problem is based on the accelerated gradient method and is applicable for all values of $q$ larger than 1.

First, construct the following model for approximating the composite function $f(W) = l(W) + \lambda\|W\|_{q,1}$ at the point $S$:

$$M_{\gamma, S}(W) = \left[ l(S) + \langle \nabla l(S), W - S \rangle + \frac{\gamma}{2}\|W - S\|_F^2 \right] + \lambda \|W\|_{q,1}, \quad (4)$$

where $\gamma > 0$. In the model $M_{\gamma,S}(W)$, the first-order Taylor expansion at the point $S$ (including all terms in the square bracket) is applied to the smooth loss function $l(\cdot)$, and the nonsmooth penalty $\lambda\|W\|_{q,1}$ is put directly into the model. The regularization term $\frac{\gamma}{2}\|W - S\|_F^2$ prevents $W$ from walking far away from $S$, and thus the model can be a good approximation to $f(W)$ in the neighborhood of $S$.

The accelerated gradient method is based on two sequences $\{W_i\}$ and $\{S_i\}$, in which $\{W_i\}$ is the sequence of approximate solutions and $\{S_i\}$ is the sequence of search points. The search point $S_i$ is the affine combination of $W_{i-1}$ and $W_i$:

$$S_i = W_i + \alpha_i (W_i - W_{i-1}), \quad (5)$$

where $\alpha_i$ is a properly chosen coefficient. The approximate solution $W_{i+1}$ is computed as the minimizer of $M_{\gamma_i, S_i}(W)$:

$$W_{i+1} = \arg\min_{W} M_{\gamma_i, S_i}(W), \quad (6)$$

where $\gamma_i$ is determined by line search, for example, the Armijo-Goldstein rule, so that $\gamma_i$ is appropriate for $S_i$.

The key subroutine is (6), which can be computed as $W_{i+1} = \pi_q\left(S_i - \frac{1}{\gamma_i}\nabla l(S_i)\right)$, where $\pi_q(V)$ is the $\ell_{q,1}$-regularized Euclidean projection problem:

$$\pi_q(V) = \arg\min_{W} \; \frac{1}{2}\|W - V\|_F^2 + \frac{\lambda}{\gamma_i}\|W\|_{q,1}. \quad (7)$$

Note that the feature groups (rows) in (7) are independent. In [15], the method is developed for general collections of different independent groups; that is, $\|W\|_{q,1} = \sum_{g=1}^{G}\|w^{g}\|_q$, where $G$ is the number of independent groups. In our paper, we focus on how the method deals with the multitask learning problem in (7), where $G$ is equal to the number of features $d$, and each group contains the weights of the corresponding feature shared across the multiple tasks. Thus, the optimization in (7) decouples into a set of $d$ independent $\ell_q$-regularized Euclidean projection problems:

$$\min_{w^j} \; \frac{1}{2}\|w^j - v^j\|_2^2 + \frac{\lambda}{\gamma_i}\|w^j\|_q, \quad j = 1, \ldots, d. \quad (8)$$

The optimal solution of (8) can then be obtained efficiently. For $q = 2$, it has the well-known group soft-thresholding form

$$w^{j*} = \max\left(0, \; 1 - \frac{\lambda/\gamma_i}{\|v^j\|_2}\right) v^j. \quad (9)$$

For general $q > 1$, $w^{j*} = 0$ if and only if $\|v^j\|_{\bar{q}} \le \lambda/\gamma_i$, where $\bar{q}$ denotes the conjugate exponent of $q$ satisfying $1/q + 1/\bar{q} = 1$; otherwise, $w^{j*}$ and $v^j$ satisfy a componentwise relationship parameterized by the unique root of an auxiliary one-dimensional function, which can be found efficiently by bisection; see [15] for the detailed derivation.
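For $q = 2$, the decoupled projection (8) therefore reduces to row-wise group soft-thresholding, as in the following sketch (prox_l21 is an illustrative name; the general-$q$ root-finding case is omitted):

```python
import numpy as np

def prox_l21(V, tau):
    """Row-wise solution of (8) for q = 2 (group soft-thresholding):
    min_W 0.5 * ||W - V||_F^2 + tau * sum_j ||w^j||_2."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)            # ||v^j||_2 per row
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return scale * V                                            # small rows shrink to zero

V = np.array([[3.0, 4.0],
              [0.1, 0.1]])
print(prox_l21(V, tau=1.0))  # first row scaled by 0.8, second row set to 0
```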

The algorithm $\ell_{q,1}$-MTL is summarized in Algorithm 1.

Input: $X$, $Y$, $\lambda$, $q$
Output: $W$.
(1) Initialize $W_1 = W_0$, $t_0 = 0$, $t_1 = 1$, and $\gamma_1 > 0$.
(2) $i = 1$
(3) repeat
(4) Set $\alpha_i = (t_{i-1} - 1)/t_i$, $S_i = W_i + \alpha_i (W_i - W_{i-1})$
(5) Find the smallest $\gamma = \gamma_{i-1}, 2\gamma_{i-1}, 4\gamma_{i-1}, \ldots$ such that
$f(W_{i+1}) \le M_{\gamma, S_i}(W_{i+1})$,
where $W_{i+1} = \pi_q\left(S_i - \frac{1}{\gamma}\nabla l(S_i)\right)$
(6) Set $\gamma_i = \gamma$ and $t_{i+1} = \left(1 + \sqrt{1 + 4 t_i^2}\right)/2$
(7) $i = i + 1$
(8) until the convergence criterion is satisfied
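Putting the pieces together, the following is a minimal FISTA-style sketch of Algorithm 1 for the $q = 2$ case, using a constant step size in place of the Armijo-Goldstein line search of step (5); all names and the toy data are illustrative.

```python
import numpy as np

def lq1_mtl_q2(X, Y, lam, n_iter=200):
    """Accelerated proximal gradient for min_W ||XW - Y||_F^2 + lam * ||W||_{2,1}.
    Uses a constant step 1/gamma, with gamma an upper bound on the gradient's
    Lipschitz constant, instead of a line search."""
    d, T = X.shape[1], Y.shape[1]
    gamma = 2.0 * np.linalg.norm(X, 2) ** 2          # Lipschitz constant of grad l
    W_prev = W = np.zeros((d, T))
    t_prev, t = 0.0, 1.0
    for _ in range(n_iter):
        S = W + ((t_prev - 1.0) / t) * (W - W_prev)  # search point, cf. (5)
        V = S - 2.0 * X.T @ (X @ S - Y) / gamma      # gradient step at S
        norms = np.linalg.norm(V, axis=1, keepdims=True)
        W_new = np.maximum(0.0, 1.0 - (lam / gamma) / np.maximum(norms, 1e-12)) * V
        W_prev, W = W, W_new                         # projection step, cf. (7)-(9)
        t_prev, t = t, (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
    return W

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
W_true = np.zeros((20, 3))
W_true[:5] = rng.standard_normal((5, 3))
Y = X @ W_true + 0.01 * rng.standard_normal((50, 3))
W_hat = lq1_mtl_q2(X, Y, lam=1.0)
print(np.nonzero(np.linalg.norm(W_hat, axis=1) > 1e-6)[0])   # selected feature rows
```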

4. Kernelized Multitask Learning

4.1. Multikernel Learning

The limitation of the traditional $\ell_{2,1}$-norm MTL model is that a subject’s cognitive score under a task is modeled as a linear function of his/her MRI measures. Kernel methods, for example, SVM or SVR, can model a nonlinear distribution of the data by mapping the input data into a nonlinear feature space through kernel embedding. In this section, we consider extending the $\ell_{q,1}$-norm regularized MTL to kernel methods. Let us define the mapping $\phi: \mathcal{X} \to \mathcal{H}$, which maps data samples from an input space $\mathcal{X}$ to a feature space, a high-dimensional Hilbert space $\mathcal{H}$, where $x \in \mathcal{X}$ is a sample from the input space. A kernel function $k(\cdot, \cdot)$ is capable of attaining the inner product of two mapped samples in $\mathcal{H}$, $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$, in the original space without explicitly computing the mapped data. The associated Gram matrix $K$ has entries $K_{ij} = k(x_i, x_j)$.
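As an illustration, a Gaussian (RBF) Gram matrix can be computed directly from pairwise squared distances; rbf_gram is an illustrative helper, not part of the original implementation.

```python
import numpy as np

def rbf_gram(X, sigma):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T    # pairwise squared distances
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

X = np.random.default_rng(0).standard_normal((6, 4))
K = rbf_gram(X, sigma=1.0)
print(K.shape, np.allclose(K, K.T), np.allclose(np.diag(K), 1.0))
```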

The most suitable types and parameters of the kernels for a particular task are often unknown, and selecting the optimal kernel by exhaustive search over a predefined pool of kernels is usually time-consuming and sometimes causes overfitting. Multiple kernel learning (MKL) attempts to achieve better results by combining several base kernels instead of using only one specific kernel. MKL assumes that $x$ can be mapped to $m$ different Hilbert spaces $\mathcal{H}_1, \ldots, \mathcal{H}_m$ implicitly, with nonlinear mapping functions $\phi_1, \ldots, \phi_m$, and the objective of MKL is to seek the optimal kernel combination $k_{\eta} = \sum_{k=1}^{m} \eta_k k_k$, where $\eta = (\eta_1, \ldots, \eta_m)^{\top}$ is the kernel weight vector. The primal objective function of the multiple kernel regression model can be written as follows:

$$\min_{\eta \ge 0} \; \min_{\{w_k\}, b, \xi} \; \frac{1}{2} \sum_{k=1}^{m} \frac{\|w_k\|_2^2}{\eta_k} + C \sum_{i=1}^{n} \xi_i^2 \quad \text{s.t.} \quad y_i = \sum_{k=1}^{m} \langle w_k, \phi_k(x_i) \rangle + b + \xi_i, \quad \sum_{k=1}^{m} \eta_k = 1. \quad (10)$$

MKL learns both the weights of the kernel combination and the parameters of the regression by solving a single joint optimization problem.

Using $\alpha$ to denote the Lagrange multipliers, the objective value of the dual problem of (10) can be written as follows:

$$\max_{\alpha} \; y^{\top}\alpha - \frac{1}{4C}\alpha^{\top}\alpha - \frac{1}{2}\alpha^{\top} K_{\eta} \alpha \quad \text{s.t.} \quad \mathbf{1}^{\top}\alpha = 0, \quad (11)$$

where $K_{\eta} = \sum_{k=1}^{m} \eta_k K_k$ is the combined Gram matrix and $\{K_1, \ldots, K_m\}$ is the given set of base kernels.
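A sketch of the kernel-combination step in (11): given precomputed base Gram matrices and nonnegative weights, the combined kernel is their weighted sum. For illustration, $K_\eta$ is plugged into a least-squares kernel solver that drops the bias constraint of (11); combine_kernels and kernel_ls_fit are illustrative simplifications, not the paper's solver.

```python
import numpy as np

def combine_kernels(Ks, eta):
    """K_eta = sum_k eta_k * K_k, for Ks of shape (m, n, n) and weights eta >= 0."""
    return np.einsum('k,kij->ij', eta, Ks)

def kernel_ls_fit(K, y, C):
    """Dual coefficients alpha = (K + I/(2C))^{-1} y for the squared-loss model,
    ignoring the equality (bias) constraint of (11) for simplicity."""
    return np.linalg.solve(K + np.eye(len(y)) / (2.0 * C), y)

rng = np.random.default_rng(0)
X, y = rng.standard_normal((30, 5)), rng.standard_normal(30)
Ks = np.stack([X @ X.T, (X @ X.T + 1.0) ** 2])   # linear and degree-2 polynomial kernels
eta = np.array([0.7, 0.3])                        # fixed weights summing to one
alpha = kernel_ls_fit(combine_kernels(Ks, eta), y, C=10.0)
print(alpha[:3])
```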

4.2. $\ell_{q,1}$-Norm Regularized Multikernel Multitask Learning, $\ell_{q,1}$-MKMTL

We follow the multiple kernel learning scheme and use the $\ell_{q,1}$-norm to model the relationship between the tasks, learning a common kernel representation by imposing a sparsity constraint on the kernel weights. The method, called $\ell_{q,1}$-MKMTL, assumes that only a few base kernels are important for the tasks: it encourages a linear combination of only a few kernels and assumes that the few selected kernels are shared across the tasks. The formulation of $\ell_{q,1}$-MKMTL can be expressed as follows:

$$\min_{\{w_{kt}\}, \{b_t\}} \; \frac{1}{2}\left[\sum_{k=1}^{m}\left(\sum_{t=1}^{T} \|w_{kt}\|_2^{q}\right)^{1/q}\right]^{2} + C \sum_{t=1}^{T}\sum_{i=1}^{n} \xi_{it}^2 \quad \text{s.t.} \quad y_{it} = \sum_{k=1}^{m}\langle w_{kt}, \phi_k(x_i)\rangle + b_t + \xi_{it}, \quad (12)$$

where $w_{kt}$ is the weight vector of task $t$ in the $k$th kernel feature space. The outer $\ell_1$-type sum over the kernels promotes kernel sparsity, while the inner $\ell_q$-norm over the tasks couples the tasks so that they prefer the same kernels.

We now rewrite this formulation in a convenient form which can be efficiently solved using mirror-descent based algorithms. To this end, we note the following variational result [16].

Lemma 1. Let $a \in \mathbb{R}^{n}$ with $a_i \ge 0$ and let $1 \le r \le 2$. Then, for $\bar{r} = r/(2-r)$,

$$\left(\sum_{i=1}^{n} a_i^{r}\right)^{2/r} = \min_{\eta \ge 0, \, \|\eta\|_{\bar{r}} \le 1} \; \sum_{i=1}^{n} \frac{a_i^{2}}{\eta_i},$$

and the minimum is attained at

$$\eta_i^{*} = \frac{a_i^{2-r}}{\left(\sum_{j=1}^{n} a_j^{r}\right)^{(2-r)/r}},$$

with the convention that $a/0$ is $0$ if $a = 0$ and is $\infty$ if $a \neq 0$.
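A quick numeric check of the lemma as reconstructed above (the closed-form minimizer attains the left-hand side and lies on the $\ell_{\bar r}$-ball):

```python
import numpy as np

rng = np.random.default_rng(0)
a, r = np.abs(rng.standard_normal(5)), 1.5
lhs = np.sum(a ** r) ** (2.0 / r)                          # (sum_i a_i^r)^(2/r)
eta = a ** (2.0 - r) / np.sum(a ** r) ** ((2.0 - r) / r)   # claimed minimizer
rhs = np.sum(a ** 2 / eta)                                 # objective value at eta
rbar = r / (2.0 - r)
print(np.isclose(lhs, rhs), np.isclose(np.sum(eta ** rbar), 1.0))  # True True
```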

Using the result of the lemma (with $r = 1$, applied over the kernels to $a_k = (\sum_{t} \|w_{kt}\|_2^{q})^{1/q}$) and introducing variables $\gamma_k \ge 0$ with $\sum_k \gamma_k \le 1$, we have

$$\left[\sum_{k=1}^{m}\left(\sum_{t=1}^{T} \|w_{kt}\|_2^{q}\right)^{1/q}\right]^{2} = \min_{\gamma \ge 0, \, \sum_k \gamma_k \le 1} \; \sum_{k=1}^{m} \frac{\left(\sum_{t=1}^{T} \|w_{kt}\|_2^{q}\right)^{2/q}}{\gamma_k}.$$

Now, introducing dual variables and using the notion of the dual norm [17], the inner $\ell_q$-norm over the tasks can likewise be replaced by a variational bound, so that the objective in the $\ell_{q,1}$-MKMTL formulation becomes jointly convex in the task weights and the variational variables.

Using $\alpha_t$ to denote the Lagrange multipliers of task $t$, we form the Lagrangian of this problem.

Recalling Lagrange duality, we can solve the original problem as a saddle point problem: we maximize, over the dual variables, the minimum of the Lagrangian over the primal variables.

To begin, we attack the inner minimization: for fixed $\alpha_t$, we solve for the minimizing $w_{kt}$ and $b_t$ by setting the derivatives of the Lagrangian with respect to $w_{kt}$ and $b_t$ to zero. Doing this, we find that each $w_{kt}$ is a linear combination of the mapped training points, $w_{kt} \propto \Phi_{kt}\alpha_t$, where $\alpha_t$ is the multiplier vector corresponding to the $t$th task in the $\ell_{q,1}$-MKMTL formulation and $\Phi_{kt}$ is the data matrix with columns $\phi_k(x_{it})$. So, we can solve the problem by maximizing the Lagrangian with respect to $\alpha_t$, where we substitute the above expressions for $w_{kt}$ and $b_t$. Thus, we have an unconstrained maximization.

Here, $y_t$ is the vector of scores of the $t$th task’s training data points and $K_{kt} = \Phi_{kt}^{\top}\Phi_{kt}$ represents the Gram matrix of the $t$th task’s training data points with respect to the $k$th kernel. The resulting dual objective is just a quadratic in $\alpha_t$. As such, we can find the optimum as the solution of a linear system.

Substituting the optimal $\alpha_t$ back, the $\ell_{q,1}$-MKMTL formulation can be written in terms of the task-wise Gram matrices and the variational variables alone, and the resulting problem over the kernel weights can be transformed into a form that is solvable by mirror descent.

The algorithm $\ell_{q,1}$-MKMTL is summarized in Algorithm 2.

Input: base Gram matrices $\{K_{kt}\}$, $C$, $q$
Output: $\{\alpha_t\}$, $\{b_t\}$, kernel weights $\gamma$
(1) Initialize the kernel weights $\gamma$ uniformly.
(2) repeat
(3) form the per-task combined kernels from the current $\gamma$ and the variational variables
(4) for $t = 1$ to $T$ do
(5) with $\gamma$ and the variational variables fixed, compute $\alpha_t$ by using an SVR solver
(6) end for
(7) optimize $\gamma$ with the mirror-descent algorithm
(8) update the variational task-coupling variables in closed form using Lemma 1
(9) until the convergence criterion is satisfied
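The alternating structure of Algorithm 2 can be sketched as follows, with a least-squares kernel solver standing in for the SVR solver and a simple normalized multiplicative update standing in for the exact mirror-descent step; this is a schematic simplification under those assumptions, not the paper’s exact updates.

```python
import numpy as np

def mkmtl_sketch(Ks, Ys, C=10.0, n_outer=20):
    """Schematic alternating optimization: per-task dual solves with the kernel
    weights fixed, then a normalized multiplicative kernel-weight update.
    Ks: (m, n, n) base Gram matrices (shared design across tasks here).
    Ys: (T, n) task targets."""
    m, n, _ = Ks.shape
    T = Ys.shape[0]
    eta = np.full(m, 1.0 / m)                    # start from uniform kernel weights
    alphas = np.zeros((T, n))
    for _ in range(n_outer):
        K_eta = np.einsum('k,kij->ij', eta, Ks)
        for t in range(T):                       # least-squares stand-in for SVR
            alphas[t] = np.linalg.solve(K_eta + np.eye(n) / C, Ys[t])
        # reward kernels that carry more of the fit, then renormalize to the simplex
        score = np.array([sum(alphas[t] @ Ks[k] @ alphas[t] for t in range(T))
                          for k in range(m)])
        eta *= np.sqrt(np.maximum(score, 1e-12))
        eta /= eta.sum()
    return alphas, eta

rng = np.random.default_rng(0)
X = rng.standard_normal((25, 4))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Ks = np.stack([X @ X.T, (X @ X.T + 1.0) ** 2, np.exp(-d2 / 2.0)])
Ys = rng.standard_normal((3, 25))
alphas, eta = mkmtl_sketch(Ks, Ys)
print(eta)                                       # learned kernel weights
```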
4.3. $\ell_1$-$\ell_q$-Norm Regularized Multikernel Multitask Learning, $\ell_1$-$\ell_q$-MKMTL

The linearized $\ell_{q,1}$-MTL assumes a linear relationship between the MRI features and the cognitive outcomes; such a model lacks the capability to capture nonlinear predictive information in the features. Although $\ell_{q,1}$-MKMTL builds a nonlinear relationship between the features and the tasks by mapping to a high-dimensional space, it assumes that the tasks to be learned share a common subset of kernel representations, without capturing the interrelationships between different cognitive measures over the feature space.

To overcome the weaknesses of the previous two methods, we project the original feature vectors to a high-dimensional space using multiple nonlinear mapping functions, perform the regression task in a nonlinear manner, and utilize multitask learning in the multiple kernel spaces to model the disease’s cognitive scores with a joint $\ell_1$-$\ell_q$ sparsity-inducing regularizer. Moreover, we construct new features as orthogonal transforms of the given kernel-induced features, that is, $\tilde{\phi}_k(x) = R_k^{\top}\phi_k(x)$, where $R_k$ is an orthogonal matrix which is to be learned. Again, low empirical risk over each task implies minimizing the quadratic loss $\sum_{t}\sum_{i}\big(y_{it} - \sum_{k}\langle w_{kt}, \tilde{\phi}_k(x_i)\rangle - b_t\big)^2$. The regularization term we employ groups the transformed weights per feature dimension across the tasks and couples an $\ell_1$-type penalty over these groups with an $\ell_q$-norm over the kernels. Different from $\ell_{q,1}$-MKMTL, the $\ell_q$-norm in $\ell_1$-$\ell_q$-MKMTL is employed over the kernels rather than the tasks.

Mathematically, the $\ell_1$-$\ell_q$-MKMTL formulation is a joint minimization over the task weights, the biases, and the orthogonal matrices $R_k \in \mathcal{O}^{d_k}$, where $\mathcal{O}^{d_k}$ represents the set of all orthogonal matrices of dimensionality $d_k$ and $d_k$ is the dimensionality of the feature space induced by the $k$th kernel. In the following text, we rewrite this formulation in a form which is convenient to solve using a mirror-descent (MD) based algorithm.

Using the result of Lemma 1 and introducing new variational variables, the $\ell_1$-type penalty over the transformed feature groups can be replaced by a weighted quadratic bound; again using the lemma and introducing further variables, the $\ell_q$-norm over the kernels can be bounded in the same way, so the regularizer becomes a weighted sum of squared norms.

Now, we perform a change of variables that absorbs the orthogonal matrices and the variational weights into the task weights. Using this, one can rewrite the $\ell_1$-$\ell_q$-MKMTL formulation in terms of a diagonal matrix whose entries are the variational weights.

Now, using $\alpha_t$ to denote the Lagrange multipliers, we form the Lagrangian of this problem.

This can be solved like $\ell_{q,1}$-MKMTL: we first attack the inner minimization and then maximize over the dual variables.

Again, we substitute the resulting expressions for the primal variables into the Lagrangian. Eliminating the auxiliary variables then leads to a reduced problem in the dual variables and the variational weights.

The difficulty in working with this formulation is that the explicit mappings $\phi_k$ are required. We now describe a way of overcoming this problem and efficiently kernelizing the formulation (refer to [1] also). Let $\Phi_k$ denote the mapped data matrix and let the compact SVD of $\Phi_k$ be $U_k \Sigma_k V_k^{\top}$. Then, we introduce a symmetric positive semidefinite matrix with the same rank as that of $\Phi_k$ such that the formulation depends on the mapped data only through inner products. By eliminating the explicit mappings, we can rewrite the above problem using the Gram matrices $K_k = \Phi_k^{\top}\Phi_k$. Note that this calculation does not require the kernel-induced features explicitly, and hence the formulation is kernelized. It can be transformed into a form in which the combined kernel is a block diagonal matrix with the task-wise Gram matrices as its blocks.

The subproblem over the kernel weights can be solved by mirror descent; the gradient with respect to the kernel weights is calculated using the optimal dual variables obtained while evaluating the inner problem.

The algorithm $\ell_1$-$\ell_q$-MKMTL is summarized in Algorithm 3.

Input: base Gram matrices $\{K_{kt}\}$, $C > 0$, $q$
Output: $\{\alpha_t\}$, kernel weights
(1) repeat
(2) optimize the kernel weights with the mirror-descent algorithm
(3) for $t = 1$ to $T$ do
(4) with the kernel weights fixed, compute $\alpha_t$ by using an SVR solver
(5) end for
(6) until the convergence criterion is satisfied

5. Experimental Results and Discussions

5.1. Experimental Setup

We use 10-fold cross validation to evaluate the models and conduct the comparison. In each of the ten trials, a 5-fold nested cross validation procedure is employed to tune the regularization parameters. The data were $z$-scored before applying the regression methods, and each regularization parameter was tuned over a prespecified grid. The candidate kernels are as follows: Gaussian kernels with six different bandwidths, polynomial kernels of degrees 1 to 3, and a linear kernel, which yields 10 kernels in total. The kernel matrices were precomputed and normalized to have unit trace. The reported results are the best results of each method under its optimal parameters. For the quantitative performance evaluation, we employed the metrics of Correlation Coefficient (CC) and Root Mean Squared Error (rMSE) between the predicted clinical scores and the target clinical scores for each regression task. Moreover, to evaluate the overall performance on all the tasks, the normalized mean squared error (nMSE) [7, 18] and weighted R-value (wR) [4] are used. The nMSE and wR are defined as follows:

$$\mathrm{nMSE}(Y, \hat{Y}) = \frac{\sum_{t=1}^{T} \|y_t - \hat{y}_t\|_2^2 / \sigma(y_t)}{\sum_{t=1}^{T} n_t}, \qquad \mathrm{wR}(Y, \hat{Y}) = \frac{\sum_{t=1}^{T} \mathrm{Corr}(y_t, \hat{y}_t)\, n_t}{\sum_{t=1}^{T} n_t},$$

where $y_t$ and $\hat{y}_t$ are the ground truth and the predicted cognitive scores of task $t$, respectively, $n_t$ is the number of test samples of task $t$, $\sigma(y_t)$ is the variance of $y_t$, and $\mathrm{Corr}$ denotes the correlation coefficient.

A smaller value of nMSE and rMSE and a higher value of CC and wR represent better regression performance. We report the mean and standard deviation over 10 iterations of experiments on different splits of the data for all comparative experiments.
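For reference, the four metrics can be computed as follows, assuming the nMSE and wR definitions reconstructed above and equal test-set sizes across tasks; the function name and toy data are illustrative.

```python
import numpy as np

def evaluate(Y_true, Y_pred):
    """Per-task rMSE and CC, plus overall nMSE and wR, for arrays of shape (n, T)."""
    n, T = Y_true.shape
    rmse = np.sqrt(np.mean((Y_true - Y_pred) ** 2, axis=0))
    cc = np.array([np.corrcoef(Y_true[:, t], Y_pred[:, t])[0, 1] for t in range(T)])
    nmse = sum(np.sum((Y_true[:, t] - Y_pred[:, t]) ** 2) / np.var(Y_true[:, t])
               for t in range(T)) / (n * T)
    wr = np.sum(cc * n) / (n * T)        # equal weights since n_t = n for every task
    return rmse, cc, nmse, wr

rng = np.random.default_rng(0)
Y = rng.standard_normal((100, 4))
rmse, cc, nmse, wr = evaluate(Y, Y + 0.1 * rng.standard_normal(Y.shape))
print(nmse, wr)
```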

In ADNI, all participants received 1.5-Tesla (T) structural MRI. The MRI features used in our experiments are based on the imaging data from the ADNI database processed by a team from UCSF (University of California at San Francisco), who performed cortical reconstruction and volumetric segmentation with the FreeSurfer image analysis suite (http://surfer.nmr.mgh.harvard.edu/) according to the atlas generated in [19]. In total, 68 cortical regions and 44 subcortical regions were generated. For each cortical region, the cortical thickness average (TA), standard deviation of thickness (TS), surface area (SA), and cortical volume (CV) were calculated as features. For each subcortical region, the subcortical volume was calculated as a feature. The SA of the left and right hemispheres and the total intracranial volume (ICV) were also included. This yielded a total of 319 MRI features extracted from the cortical/subcortical ROIs of both hemispheres (275 cortical and 44 subcortical features). Details of the analysis procedure are available at http://adni.loni.usc.edu/methods/mri-analysis/.

Ten widely used clinical/cognitive assessment scores [3, 20, 21] were employed in this study: the Alzheimer’s Disease Assessment Scale (ADAS) cognitive total score; the Mini Mental State Exam (MMSE) score; the Rey Auditory Verbal Learning Test (RAVLT), involving the total score of the first 5 learning trials (TOTAL), the Trial 6 total number of words recalled (TOT6), the 30-minute delay score (T30), and the 30-minute delay recognition score (RECOG); Fluency (FLU), involving the animal total score (ANIM) and the vegetable total score (VEG); and TRAILS, including the Trail Making Test A and B scores.

5.2. Comparison with the State-of-the-Art MTL Methods

To compare the kernelized MTL methods with the linearized one and to illustrate how well the two multikernel-based MTL methods model the correlation among the tasks, we comprehensively compare our proposed methods with several popular state-of-the-art related methods; a baseline setup sketch follows this list. Representative comparison algorithms include (1) Ridge [22]: $\min_{W} \|XW - Y\|_F^2 + \lambda\|W\|_F^2$; (2) Lasso [23]: $\min_{W} \|XW - Y\|_F^2 + \lambda\|W\|_1$; (3) MKL [24]: single-task support vector regression with a learned kernel combination $K_{\eta} = \sum_k \eta_k K_k$, such that $\eta_k \ge 0$ and $\sum_k \eta_k = 1$; (4) Robust Multitask Feature Learning (RMTL) [25]: $\min_{L, S} \|X(L + S) - Y\|_F^2 + \lambda_1\|L\|_{*} + \lambda_2\|S^{\top}\|_{2,1}$, subject to $W = L + S$, which assumes that the model can be decomposed into two components: a shared low-rank structure $L$ capturing task relatedness and a group-sparse structure $S$ detecting outlier tasks; (5) Clustered Multitask Learning (CMTL) [16]: $\min_{W, F} \|XW - Y\|_F^2 + \lambda_1[\operatorname{tr}(W^{\top}W) - \operatorname{tr}(F^{\top}W^{\top}WF)] + \lambda_2\|W\|_F^2$, where $F$ is an orthogonal cluster indicator matrix and the tasks are clustered into $c$ clusters, incorporating a regularization term that induces clustering between tasks and then shares information only among tasks belonging to the same cluster; in CMTL, the number of clusters is set to 11 since the 20 tasks belong to 11 sets of cognitive functions; (6) trace-norm regularized multitask learning (Trace) [17]: $\min_{W} \|XW - Y\|_F^2 + \lambda\|W\|_{*}$, assuming that all models share a common low-dimensional subspace; (7) sparse regularized multitask learning (SRMTL) [26]: $\min_{W} \|XW - Y\|_F^2 + \lambda_1 \sum_{t=1}^{T}\|w_t - \frac{1}{T}\sum_{s=1}^{T} w_s\|_2^2 + \lambda_2\|W\|_1$, containing two regularization processes: (1) all tasks are regularized toward their mean value, so that knowledge from one task can be utilized by the other tasks via the mean; (2) sparsity is enforced in the learning with the $\ell_1$-norm.
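The independent baselines and an $\ell_{2,1}$-regularized linear MTL baseline are available off the shelf in scikit-learn; the following sketch uses placeholder hyperparameter values, which in the experiments are tuned by the nested cross validation described above.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, MultiTaskLasso

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((100, 30)), rng.standard_normal((100, 10))

models = {
    "Ridge": Ridge(alpha=1.0),              # independent l2-regularized tasks
    "Lasso": Lasso(alpha=0.1),              # independent l1-regularized tasks
    "l21-MTL": MultiTaskLasso(alpha=0.1),   # joint l_{2,1} row sparsity across tasks
}
for name, model in models.items():
    W = model.fit(X, Y).coef_.T             # shape (n_features, n_tasks)
    active = int(np.sum(np.linalg.norm(W, axis=1) > 1e-8))
    print(name, "active feature rows:", active)
```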

Experimental results are reported in Tables 1 and 2, where the best results are boldfaced. A first glance at the results shows that $\ell_1$-$\ell_q$-MKMTL generally outperforms all the other compared methods on both metrics and across all the cognitive tasks. Additionally, a statistical analysis is performed on the results. As can be seen, our proposed method achieves statistically significant improvements over all the other methods in most cases. These results reveal several interesting points: (1) all the compared multitask learning methods ($\ell_{q,1}$-MTL, $\ell_{q,1}$-MKMTL, and $\ell_1$-$\ell_q$-MKMTL) improve the predictive performance over the independent regression algorithms (Ridge, Lasso, and MKL), which justifies the motivation of learning multiple tasks simultaneously; (2) the two multikernel-based MTL methods outperform the linearized $\ell_{q,1}$-MTL in terms of nMSE, and $\ell_1$-$\ell_q$-MKMTL outperforms the linearized $\ell_{q,1}$-MTL in terms of wR, which indicates that nonlinear MTL models based on kernel functions can capture complex patterns between brain images and the corresponding cognitive measures; (3) with the appropriate regularization, the $\ell_1$-$\ell_q$-MKMTL model enables us (a) to capture nonlinear associations between MRI measures and cognitive outcomes, (b) to capture the intrinsic relationships between multiple related tasks in the kernel-induced feature space, and (c) to promote sparse kernel combinations that support interpretability and scalability; the outcomes demonstrate that $\ell_1$-$\ell_q$-MKMTL outperforms $\ell_{q,1}$-MTL and $\ell_{q,1}$-MKMTL, which neglect, respectively, the inherently nonlinear relationship between MRI measures and cognitive outcomes and the correlation among multiple related tasks in the feature space; (4) compared with the other multitask learning methods with different assumptions, our proposed methods belong to the multitask feature learning family with sparsity-inducing norms, which gives them an advantage over the other comparative multitask learning methods: since not all brain regions are associated with AD, many features are irrelevant and redundant, so sparsity-based MTL methods are appropriate for the task of predicting cognitive measures and perform better than the non-sparsity-based MTL methods.

We also show scatter plots of the actual versus predicted values for the ADAS, MMSE, TOTAL, and ANIM scores on the testing data in Figure 1.

5.3. Multimodality Fusion

To estimate the effect of combining multimodality imaging data with the linearized and kernelized MTL methods and to provide a more comprehensive comparison of the comparable MTL models, we further perform four experiments: (1) using only the MRI modality, (2) using only the PET modality, (3) combining two modalities, PET and MRI (MP), and (4) combining three modalities, PET, MRI, and demographic information including age, gender, years of education, and ApoE genotyping (MPD). Different from the above experiments, the samples are drawn from ADNI-2 instead of ADNI-1, since ADNI-2 contains a sufficient number of patients with PET. From ADNI-2, we obtained all the patients with both MRI and PET, 756 samples in total. The PET imaging data are from the ADNI database processed by the UC Berkeley team, who use a native-space MRI scan for each subject that is segmented and parcellated with FreeSurfer to generate summary cortical and subcortical ROIs; they coregister each florbetapir scan to the corresponding MRI and calculate the mean florbetapir uptake within the cortical and reference regions. The image processing procedure is described at http://adni.loni.usc.edu/updated-florbetapir-av-45-pet-analysis-results/. In $\ell_{q,1}$-MKMTL and $\ell_1$-$\ell_q$-MKMTL, the ten different kernel functions described in the first experiment are used for each modality. To show the advantage of the kernel-based methods, we compare them with the linear $\ell_{q,1}$-MTL method, which concatenates the features of the multiple modalities into one long feature vector.
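In the multikernel framework, fusing modalities amounts to enlarging the candidate kernel set: each modality contributes its own base kernels, and the learned kernel weights determine the contribution of each modality-kernel pair. A schematic sketch, in which the feature matrices, dimensions, and bandwidths are placeholders rather than the actual ADNI data:

```python
import numpy as np

def rbf_gram(X, sigma):
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
n = 50
modalities = {                                  # one feature matrix per modality
    "MRI": rng.standard_normal((n, 319)),
    "PET": rng.standard_normal((n, 68)),
    "DEM": rng.standard_normal((n, 4)),         # age, gender, education, ApoE
}
candidates = []
for name, X in modalities.items():
    for sigma in (0.5, 1.0, 2.0):               # a few bandwidths per modality
        K = rbf_gram(X, sigma)
        candidates.append(K / np.trace(K))      # unit-trace normalization, as above
    K_lin = X @ X.T
    candidates.append(K_lin / np.trace(K_lin))  # linear kernel per modality
Ks = np.stack(candidates)                       # candidate set for the MKMTL solvers
print(Ks.shape)                                 # (12, 50, 50)
```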

The prediction performance results are shown in Tables 3 and 4. From the results, it is clear that the methods using multiple modalities outperform the methods using a single modality of data. This validates our assumption that the complementary information among different modalities is helpful for cognitive function prediction. Whether two or three modalities are used, $\ell_1$-$\ell_q$-MKMTL achieved better performance than the linear multitask learning in most cases, as in the single-modality learning task above.

6. Conclusion

Many multitask learning methods with sparsity-inducing regularization for modeling AD cognitive outcomes have been proposed in the past decades. However, the current formulations remain restricted to linear models and cannot capture nonlinear relationships between the MRI features and cognitive outcomes. To address this shortcoming, we applied two multikernel multitask learning methods with joint sparsity-inducing regularization to model the more complicated but more flexible relationship between MRI features and cognitive outcomes, and we demonstrated their effectiveness compared with linearized multitask learning methods by applying them to the ADNI data for predicting cognitive outcomes from MRI scans. Extensive experiments on the ADNI dataset illustrate that the multikernel multitask learning methods not only yield superior regression performance but also provide a powerful tool for fusing multimodality data.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (no. 61502091), the Fundamental Research Funds for the Central Universities (no. N161604001 and no. N150408001), and the National Science Foundation for Distinguished Young Scholars of China under Grants no. 71325002 and no. 61225012.