Computational and Mathematical Methods in Medicine

Volume 2017 (2017), Article ID 7691937, 14 pages

https://doi.org/10.1155/2017/7691937

## IPF-LASSO: Integrative -Penalized Regression with Penalty Factors for Prediction Based on Multi-Omics Data

^{1}Department of Medical Informatics, Biometry and Epidemiology, University of Munich (LMU), Marchioninistr. 15, 81377 Munich, Germany^{2}Department of Mathematics, University of Oslo, Moltke Moes Vei 3, 0851 Oslo, Norway^{3}Novartis Institutes for BioMedical Research, 250 Massachusetts Avenue, Cambridge, MA 02139, USA^{4}Biogen, 225 Binney Street, Cambridge, MA 02142, USA

Correspondence should be addressed to Anne-Laure Boulesteix

Received 20 January 2017; Accepted 14 March 2017; Published 4 May 2017

Academic Editor: Andreas Mayr

Copyright © 2017 Anne-Laure Boulesteix et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

As modern biotechnologies advance, it has become increasingly frequent that different modalities of high-dimensional molecular data (termed “omics” data in this paper), such as gene expression, methylation, and copy number, are collected from the same patient cohort to predict the clinical outcome. While prediction based on omics data has been widely studied in the last fifteen years, little has been done in the statistical literature on the integration of multiple omics modalities to select a subset of variables for prediction, which is a critical task in personalized medicine. In this paper, we propose a simple penalized regression method to address this problem by assigning different penalty factors to different data modalities for feature selection and prediction. The penalty factors can be chosen in a fully data-driven fashion by cross-validation or by taking practical considerations into account. In simulation studies, we compare the prediction performance of our approach, called IPF-LASSO (Integrative LASSO with Penalty Factors) and implemented in the R package ipflasso, with the standard LASSO and sparse group LASSO. The use of IPF-LASSO is also illustrated through applications to two real-life cancer datasets. All data and codes are available on the companion website to ensure reproducibility.

#### 1. Introduction

Most drugs cannot treat all patients with a given disease. It is thus crucial to identify biomarkers (genetic, genomic, proteomic, or any measurable biological entities) that can predict the patient’s response to a given therapy. Ultimately, the biomarkers are to be built into companion diagnostic kits. Ideally, the number of biomarkers should be small to reduce the labor and cost.

High-throughput molecular data, termed “omics data” in this paper, have been used for developing prediction models for more than fifteen years. As a well-known example, gene expression data have often been found to be useful for predicting survival response to therapy of cancer patients; the overwhelming enthusiasm in the initial years has meanwhile been tempered by more critical studies [1]. In the last few years, bioassay technology improvement and cost reduction have made collecting several types of high-dimensional data in the same study feasible.

For example, methylation data, copy number data, and mRNA expression may be available for the same patient cohort. Other data types include microRNA expression, proteomic data, metabolomic data, and single nucleotide polymorphisms (SNPs). In this paper, we denote each group of variables of the same type as a “modality” and the whole dataset as a “multi-omics” dataset. For example, in this paper, we consider as illustration a breast cancer dataset with a clinical modality and a gene expression modality [2] and a leukemia dataset from The Cancer Genome Atlas [3] with a clinical modality, a gene expression modality, and a copy number variation modality.

As multiple modalities of biomarker measurements become available for the same patients, the research interest starts to focus on the integration of data modalities to identify biomarkers and build prediction models with good accuracy [4, 5]. Although using omic markers for prediction has been a well-studied topic, it is not clear how the different modalities should be handled. The most straightforward yet naive approach is to merge all datasets and ignore the source of the variables. In contrast, other authors suggest analyzing each modality on its own and then merging the results [6], whereby merging can be performed at different stages of the analysis [7]. However, the literature is often vague on when to use different strategies.

The case of variables from one low-dimensional modality (typically, a few clinical variables relevant to the outcome to be predicted) and one high-dimensional modality (e.g., a microarray gene expression dataset) has been extensively investigated by De Bin et al. [8], where they assess the “residual” two-step approach and the “favoring” approach (see Section 2.2 for more details).

There has been a large amount of statistical and bioinformatic literature on the integration of multiple omics datasets investigating their correlation structure [9]. However, the focus of these works is not prediction. Our motivation here is to suggest a simple method based on a well-investigated framework, which takes into account the data modalities while integrating them into a sparse prediction model. Our method is based on -penalized regression (LASSO) [10] and takes the data structure into account by assigning different penalty factors to the modalities. The penalty factors are either determined by cross-validation or prespecified by the user. We name this method IPF-LASSO (Integrative LASSO with Penalty Factors).

In simulation studies, we show that IPF-LASSO performs better than the standard LASSO when the proportions of relevant variables are different in different modalities and generates parsimonious prediction rules compared with sparse group LASSO. An R package called ipflasso implementing this method is made publicly available on R/CRAN website. Being directly based on LASSO, our approach has two major advantages: its conceptual simplicity within a well-established framework and its* transportability* [11] (e.g., in the case of binary outcomes, users only need to know the fitted regression coefficients to apply the prediction rule).

This paper is structured as follows. After a short introduction into -penalized regression, the newly proposed method is described in detail in Section 2. The results from simulation studies and two real-life applications are presented in Sections 3 and 4, respectively. All data and codes are available on http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/ipflasso/ to ensure reproducibility.

#### 2. Methods

##### 2.1. IPF-LASSO

###### 2.1.1. Principle

We denote the standardized predictor variable measured from subject as and the centered (continuous) response values as , where and . The standard LASSO method [10] solves the -penalized regression problem by finding which minimizes , where denotes the -norm. The -penalty shrinks some of the coefficients to 0, thus leading to an intrinsic variable selection. For a historical overview of the development of LASSO regression and some variations, readers can refer to Tibshirani [12].

This framework can be generalized to logistic regression (in the case of a binary outcome) and to Cox proportional hazards regression (in the case of a censored time to event). The term is replaced by (where stands for the log-likelihood function and for the intercept) in the logistic LASSO and is replaced by (where stands for the partial log-likelihood) in the Cox LASSO. Our new method is a modification of LASSO dedicated to the case where multiple data modalities (data types) from the same subjects are to be used. Let us denote the variables from modality (for ) as and their values for subject (for ) as , where is the number of variables from modality . Similarly, denotes the coefficient of variable .

We propose the use of a weighted sum of the norms of the coefficient vectors of each modality () as the penalty term, aiming to account for their different relevancies. In our method, the estimated coefficients are those that minimizewhere is the penalty applied to the variables from modality . We call this method “IPF-LASSO,” standing for Integrative LASSO with Penalty Factors. The term “penalty factors” refers to the multiplicative factors applied to the penalty term. Without restriction of generality, we consider the first modality as reference modality—with penalty and penalty factor 1—and define the penalty factor of modality as .

Similar to the standard LASSO, our proposed framework can be applied to -penalized regression with linear, binary, or time-to-event outcomes. The rationale of the penalty term given in (1) is that in reality the proportion of relevant variables is often highly different from one modality to another; hence, it makes sense to penalize the modalities differently.

The Bayesian interpretation of the LASSO is useful to outline the motivation of the different penalty parameters. Park and Casella [13] show that the LASSO estimate for linear regression parameters can be interpreted as a Bayesian posterior mode estimate when the regression parameters have independent Laplace (i.e., double-exponential) priors. In this perspective, using different penalties for different modalities amounts to setting different parameters for the Laplace priors. It can be seen as a way of using the available prior information to improve the estimation of coefficients and, ultimately, prediction accuracy.

Note that our approach may also be seen as connected with the adaptive LASSO [14], in the sense that the coefficients of variables that are identified as informative are less penalized than the coefficients of noninformative variables. However, in contrast to adaptive LASSO [14], this modification of the penalty strength does not happen through a first LASSO step for each variable individually but at the level of the whole modality.

###### 2.1.2. Estimation

From a computational point of view, IPF-LASSO with fixed penalty factors is not more complex than the respective form of LASSO (linear, logistic, or Cox) with the same penalty for all variables, in that estimates can be simply obtained with any standard LASSO algorithm by preliminarily scaling the variables using their respective penalty. More precisely, the standard estimation algorithm is run with the same penalty parameter for all variables on the transformed data Estimates are obtained and rescaled as to obtain the IPF-LASSO estimates.

##### 2.2. Connections between IPF-LASSO and Other LASSO Variations for Omics Data

There have been LASSO variations for single and multiple data modalities proposed by several groups. In this section, we discuss the connections of IPF-LASSO to these methods. In the scenario investigated by De Bin et al. [8], we have two modalities (). The first modality includes only a small number of clinical variables, such that a classical regression approach can be applied to this modality (the rule of thumb that the number of variables times 5 or 10 should not exceed the number of observations is typically satisfied). The second modality is high-dimensional with . In this case, it is sensible to penalize only the second modality, that is, to consider the penalty term . In the terminology of De Bin et al. [8], the above is denoted as a “favoring” method, because the smaller clinical modality is not penalized; in other words, it is “favored.” Another method, namely, the “residual” method as proposed in De Bin et al. [8], takes two steps fitting the data. First a classical (linear, logistic, or Cox) regression is fit to the first modality to estimate ; the resulting linear predictor is then considered as an offset during the estimation of through LASSO regression. These two methods, however, cannot be applied when there are multiple high-dimensional modalities because it would not be feasible to estimate the coefficients. Moreover, they may lead to a decrease of accuracy if the favored modality is in reality not the most relevant for prediction.

Another two-step approach for prediction is proposed by Zhao et al. [6]: they first apply LASSO regression to multi-omics data to select a small number (10 in their application) of variables from each modality and then use the selected variables in a -penalized Cox regression model. This approach does not take correlations between variables from different modalities into account.

Group LASSO [15, 16] and sparse group LASSO [17] represent another category of LASSO extensions for data with a group structure. In the case of multiple modalities, the term “group” is essentially “modality.” The principle of group LASSO is that variables from the same group should be either all selected or all discarded. It makes sense, for example, when each group consists of the dummy variables coding the same multicategorical variable. The penalty considered in the group LASSO method has the form With multiple large omics modalities considered in our paper, it is most likely that at most a few variables from each modality are truly relevant for prediction; hence, this “none-versus-all” assumption is not reasonable in this case.

Sparse group LASSO [17] relaxes the “none-versus-all” assumption by introducing some sparsity within groups. This is achieved by combining the penalty of group LASSO with an penalty, yielding the penalty term , where is a so-called “mixing parameter” comprised between 0 and 1. Sparse group LASSO can be used in cases where IPF-LASSO is aimed at addressing where a modality is treated as a group. However, these two methods are fundamentally different. In sparse group LASSO, a single mixing parameter balances the impact of group structure and overall sparsity; thus, a model that strongly reflects the group structure is obtained at the price of reduced sparsity; moreover, the degree of shrinkage is the same for all groups (modalities) as controlled by , which often does not reflect reality. IPF-LASSO, on the other hand, is more flexible in varying the shrinkage parameters for different modalities—at the price of more tuning parameters (one for each modality) in the case of more than two modalities. In Section 3, we will compare the performances of IPF-LASSO and sparse group LASSO.

Another recently proposed approach handling two modalities in the framework of penalized regression is collaborative regression [18]. The idea is using a penalty not only based on the - or -norms of the coefficients but also penalizing the difference between the fitted linear predictors resulting from each of the two modalities. The two modalities “collaborate” in the sense that they are forced to yield similar contributions to prediction. In the mathematical terms and adapted to our notation, the penalty term considered in Gross and Tibshirani [18] is , where is the general notation for a penalty term that can, for example, be based on the - or -norm and denotes the data matrix for modality (with here). Note that Gross and Tibshirani [18] state that this method is not well suited for prediction but rather finds common patterns shared by the two modalities by forcing the fitted linear predictors from each modality to be similar.

For the sake of exhaustiveness, let us also mention an applied paper on plant breeding [19] using the idea of applying different penalties to variables from two different modalities (genetic markers and metabolomic traits in their case), however, in the different context of ridge regression (i.e., -penalty as opposed to LASSO) for a continuous outcome. In their study, published in a genetics journal and focusing on the agricultural application, they apply this method to their dataset and do not investigate it from a methodological point of view. A similar approach based on -penalized logistic regression [20] formalizes and extends this idea with the purpose to better integrate external data such as annotation or external values.

In summary, the IPF-LASSO proposed here is aimed at using multiple high-dimensional data modalities in a flexible way by weighing them differently in feature selection and prediction modelling, which is a critical yet unsolved problem in biomedical research.

##### 2.3. Cross-Validation for the Choice of the Penalty Parameters

In this section, we discuss the choice of the parameters . Similar to in the case of the standard LASSO, values for the penalty factors can be determined by cross-validation (CV) based on prediction performance. In our study, we use 5-fold CV with 10 repeats as a good compromise between performance and computation time [21]. Common metrics quantifying prediction performance include the mean squared error for continuous outcomes, the misclassification rate (or ), the area under the ROC curve (AUC) for binary outcomes, or the partial likelihood for time-to-event outcomes. In practice, we implement the procedure as follows. We consider different candidate vectors of penalty factors of the form , with ; for each candidate vector of penalty factors, we apply CV with the chosen performance metric to select the optimal ; the vector of penalty factors whose optimal yields the best fit according to the chosen performance metric is finally selected.

##### 2.4. Software Implementation

IPF-LASSO is implemented in our new R package ipflasso, which is publicly available from the CRAN. It is based on the R package glmnet and includes the following features and improvements.

*Rescaling Procedure*. The rescaling procedure described in Section 2.1.2 is implemented in the R package glmnet [22] through the argument penalty.factor. This argument has the form penalty.factor=c(rep(1,p1),...,rep(pfM,pM)) in IPF-LASSO, where p1,…,pM are the sizes of the modalities and pfM stands for (note that the result is invariant against multiplication of the vector of penalty factors by a scalar). The functions from our package ipflasso automatically generate the argument penalty.factor when given the indices of the variables from each modality and the values ().

*Cross-Validation for the Choice of *. For a fixed set of penalty factors, can be selected using the function cv.glmnet from package glmnet [22]. However, cv.glmnet cannot perform repeated CV in its current version. An extended version of cv.glmnet allowing repetition of CV is implemented in the function cvr.glmnet from our package ipflasso. Repeated CV applied in combination with penalty factors for variables from different modalities is implemented in the function cvr.ipflasso.

*Cross-Validation for the Choice of Penalty Factors*. Finally, the R package ipflasso also includes a function, cvr2.ipflasso, to perform CV in the two dimensions of the grid: choice of for fixed penalty factors and choice of the penalty factors (). It takes the candidate sets of penalty factors as an argument. The function cvr2.ipflasso allows one to set a maximal number of variables to be included in the final model. The CV-based choice of the parameters is then performed only over values yielding models of this size or sparser.

As an example, the following simple code performs 5-fold cross-validation repeated 10 times to choose the best penalty factors out of , , , , and , where the 200 predictor variables come from two modalities (one consisting of the 50 first variables and the other consisting of the 150 last variables).

> X<-matrix (rnorm(50200),50,200)

> Y<-rbinom (50,1,0.5)

> cvr2.ipflasso(X=X,Y=Y,

family="binomial",type.measure="class",

standardize=TRUE, blocks=list (block1=1 : 50,block2=51 : 200),

pflist=list(c(1,1),c(1,2),c(2,1),c(1,4),c(4,1)),nfolds=5,ncv=10)

The criteria used for cross-validation currently implemented in ipflasso are the mean squared error for continuous outcomes, the misclassification rate or the area under curve (AUC) for binary outcomes, and the partial likelihood for time-to-event outcomes.

#### 3. Simulations

##### 3.1. Simulation Design

The goal of simulation studies is to investigate the performance of IPF-LASSO and compare it with other methods. We consider a binary dependent variable and two high-dimensional data modalities. The two modalities of variables vary in (i) their total numbers of variables and , (ii) their numbers of truly relevant variables and , and (iii) the effects and of the relevant variables. In all settings, datasets of size are successively randomly generated as follows. The binary class is drawn from the Bernoulli distribution with probability of success . The variables are then drawn from the multivariate normal distributions: where the covariance matrix is set to the identity matrix in the main design and the mean vector is given as

In the main design, we consider the settings (i.e., combinations of , , , , , and ) displayed in Table 1. Setting A reflects the unrealistic situation of two modalities that are perfectly identical in terms of size (), number/proportion of relevant variables (), and effects (). In setting B, the proportions of truly relevant variables are the same in both modalities () and their effects are also equal, but modality 1 is much smaller () than modality 2 (). In setting C, the sizes of the modalities are as in setting B and the effects are also equal, but the numbers of truly relevant variables () are such that the* proportions* of truly relevant variables are different in the two modalities ( versus ). This difference is more pronounced in setting D: the proportions are 0.20 for modality 1 and 0 for modality 2, a quite common situation in practice (“useless omics data”). Setting E also reflects a common situation: the small modality 1 () contains strong predictors (), which is, for instance, often the case of clinical variables or a small hypothesis-driven biomarker panel. In contrast, the large modality 2 () contains weak predictor variables (). Finally, in setting F, the sizes of the modalities are the same as those in setting E but there are more truly relevant variables in modality 1 () and less ones in modality 2 (), and their effects are equal (). This situation, which is intermediate between settings D and E, is also common in practice.