BioMed Research International

Volume 2017, Article ID 3017948, 10 pages

https://doi.org/10.1155/2017/3017948

## Joint Covariate Detection on Expression Profiles for Selecting Prognostic miRNAs in Glioblastoma

College of Information and Computer Engineering, Northeast Forestry University, Harbin 150001, China

Correspondence should be addressed to Xudong Zhao; nc.ude.ufen@gnoduxoahz

Received 4 October 2016; Revised 18 January 2017; Accepted 27 February 2017; Published 20 March 2017

Academic Editor: Xia Li

Copyright © 2017 Chengqi Sun and Xudong Zhao. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

An important application of expression profiles is to stratify patients into high-risk and low-risk groups using limited but key covariates associated with survival outcomes. Prior to that, variables considered to be associated with survival outcomes are selected. A combination of single variables, each of which is significantly related to survival outcomes, is always regarded to be candidates for posterior patient stratification. Instead of individually significant variables, a combination that contains not only significant but also insignificant variables is supposed to be concentrated on. By means of bottom-up enumeration on each pair of variables, we propose a joint covariate detection strategy to select candidates that not only correspond to close association with survival outcomes but also help to make a clear stratification of patients. Experimental results on a publicly available dataset of glioblastoma multiforme indicate that the selected pair composed of an individually significant and an insignificant miRNA keeps a better performance than the combination of significant single variables. The selected miRNA pair is ultimately regarded to be associated with the prognosis of glioblastoma multiforme by further pathway analysis.

#### 1. Introduction

Survival analysis, which is a branch of statistics for analyzing time-to-event data, is commonly used in cancer research. In particular, it helps to assess the prognosis of patients having specific types of cancer in informing not only the categories of patients with differing survival outcomes but also the possible molecular cause of the risk of death. Narrow down to gliomas, expression profiles are utilized to discover the subtypes of patients with different survival risks [1]. This kind of data provides a supplementary predictor of survival due to the limited effectiveness of current clinical diagnoses. Numerous studies which attempted to use selected signatures from expression profiles for discrete stratification (e.g., recurrence, metastasis, and chemotherapy efficacy) have shown the effectiveness. Correspondingly, several methods that classified patients into subgroups with differing survival time have also been performed.

Considering the continuity of the observations’ survival time with right censoring, Cox proportional hazards regression analysis [2] was extensively utilized to seek covariates associated with the overall survival of patients in invasive breast cancer [3], non-small-cell lung cancer [4], follicular lymphoma [5], glioblastoma [6–8], and so forth. Due to the requirement of more observations than covariates, Cox proportional hazards regression model was combined with some methods for dimension reduction or shrinkage such as partial least squares [9] and principle component analysis [10]. However, these strategies can only provide a combination of variables other than reporting meaningful covariates. Since projections derived from these variables are made, one can only tell these variables together but not which variables are effective. Besides, top-down methods of tree-structured survival analysis [11] and random survival forests [12] associated with hazards regression were proposed for selection of covariates. Unlike bottom-up enumeration strategies, these heuristic approaches may get local optimal solutions although they infinitely approximate to global optimal solutions despite their efficiencies.

Hence, univariable regression analyses have been placed firmly in the mainstream. Due to the high-dimensional space of variables compared to the small observation size, a penalized Cox hazards model using least-angle regression was proposed in order to solve the overfitting problem of parameter learning [13]. In addition, a sparse kernel method was proposed on condition that the correlation between the logarithm of the hazard ratio and covariates was linear, and a survival supporting vector machine that maximized the classification margin other than Cox regression was presented [14]. In practice, univariable Cox regression analysis was applied to each variable, which was regarded to be significant considering its correlation with survival time or its distinct stratification of patients. Significant variables were selected using either Wald -test on regression coefficients [15] or log-rank test with permutations after dividing patients into high- and low-risk groups by univariable risk-score analysis [16, 17]. A risk score of each observation was obtained using a linear combination of the expression levels of selected variables weighted by multivariable regression coefficients. A cut-off threshold was derived from the median risk score or was determined by receiver operating characteristics (ROC) analysis [18], and patients within the training set were divided into high- and low-risk groups. The major problem of univariable Cox regression strategy roots in the assumption that covariates are derived from individual variables, each of which is significantly associated with survival outcomes. In essence, a meaningful set of covariates are probably composed of different variables, each of which is either correlated with or apparently unrelated to survival outcomes.

In order to solve this problem, we propose a joint covariate detection strategy for selection of variable pairs instead of a combination of variables individually correlated with survival time from expression profiles. Meanwhile, stratification of patients is also considered. That is, predictors not only associated with survival outcomes but also helpful to classify patients into high-risk and low-risk groups are chosen. Cox proportional hazards regression is used in order to detect variable pairs that are most associated with survival time. In order to overcome the overfitting problem, variable pairs which may most possibly help to stratify patients with differing survival risks are further selected. In particular, patients are stratified according to the corresponding risk-score analysis derived from Cox proportional hazards regression. Besides, log-rank test is performed for further confirmation whether the selected variable pairs contribute mainly to the stratification or not. In order to show the effectiveness of our method, miRNA expression profiles containing 548 patients with glioblastoma multiforme (GBM) downloaded from the Cancer Genome Atlas (TCGA) database are introduced in. The final selected miRNA pair of significance as representing the covariates not only most associated with survival outcomes but also effective to stratification of patients is ultimately testified using KEGG pathway analysis.

#### 2. Materials and Methods

##### 2.1. Microarray Data

We use the miRNA expression data (Level 3) of 548 patients with GBM downloaded from TCGA (http://cancergenome.nih.gov) in order to illustrate the effectiveness of identifying prognostic miRNAs in glioblastoma using the joint covariate selection method. In total, these 548 GBM cases with overall survival information are selected from 581 miRNA expression profiles, which were downloaded during May, 2015. We choose all the patients, for we discover that splitting samples using a random dichotomy or by balancing survival outcomes between training and testing group cannot achieve the same set of variables as it is derived from the whole samples. That is to say, how to reasonably split samples into training and testing ones is still under discussion. The reason derives from two aspects. One is that survival outcomes are continuous compared to discrete stratification (e.g., recurrence, metastasis, and chemotherapy efficacy). Thus, the distribution of survival outcomes is to be estimated before splitting samples. The other is that it is hard to estimate the distribution of survival outcomes because of including censored following time. Moreover, the survival time of each patient is recorded, which ranges between 0 and 3881 days. Among them, 450 are dead (uncensored) during the study and 98 are still alive (censored) at the end of the study.* MatlabR2013b* is selected as the experimental platform. Coefficients of Cox regression are obtained by calling the library function* coxphfit*.

##### 2.2. Joint Covariate Detection

Here, it represents a twofold consideration on detection of variables, which are both associated with survival outcomes and helpful to classify patients into different risk groups. In order to seek variables associated with survival outcomes, Cox hazards regression is firstly introduced. The partial likelihood function is given by the expressionwhere the product is over the distinct ordered survival time without any follow-up of right censoring assuming that there is no tied time. and denote the th expression levels and the regression coefficients of the detected variables, respectively. The summation in the denominator is over all subjects in the risk set at ordered survival time , denoted by . The maximum partial likelihood estimator is obtained by differentiating the right hand side of the logarithm transformation of (1) with respect to , setting the derivative equal to zero, and solving for . As to each component of , a Wald statistic that represents the ratio of the estimated coefficient to its estimated standard error is presented. That is,

The value of the th component of is obtained by looking up a table assuming that the Wald statistic in (2) follows the standard normal distribution. In order to enlarge the sample size, we make a permutation test by reordering the survival outcomes for times. And the corresponding value is expressed as follows:where denotes a null statistics by a random rearrangement of survival outcomes. Enumeration on each single variable or on each pair is made. Therefore, covariates significantly associated with survival outcomes are selected according to the individuals or the pairs with smallest values.

Meanwhile, we consider a best stratification of patients with differing survival outcomes as an indicator for selection of covariates. In practice, patients are commonly classified into low-risk and high-risk groups, which conforms to the daily doctors’ decision making process. Following the case, the risk score is the linear portion of Cox regression model, of which the estimator for the th sample containing covariates isMedian risk score is utilized as a cut-off value for stratification, in order to keep the equivalent number between high-risk and low-risk patients. Assuming that the survival function is the same in each of the two groups, the estimator of the expected number of deaths in high-risk group is expressed as

where and represent the number at risk and of deaths at the observation of ordered survival time , respectively. denotes the number at risk in high-risk group. Correspondingly, the estimator of the variance of on the hypergeometric distribution is defined as follows:where denotes the number at risk in low-risk group. Under the null hypothesis that survival functions of the two groups are the same, the statistic of log-rank test is expressed as follows:

The corresponding value is obtained using the distribution with one degree of freedom. In the same way, we make a permutation test similarly expressed in (3). That is,where also represents a null statistics by a random rearrangement of survival outcomes. After enumerating on each individual variable or on each pair, covariates that significantly categorize patients with differing survival outcomes are detected according to smallest values.

By enumeration on each variable and each pair, significant covariates most associated with survival time are chosen on condition that each component keeps a small value as expressed in (3). Moreover, the variables for stratification of patients using the risk score defined by (4) correspond to small values as seen in (8). In fact, this conception derives from Integrative Hypothesis Testing (IHT) proposed by Xu [19]. The obtained covariates may indicate not only a close correlation with survival time but also distinct stratification of patients.

##### 2.3. KEGG Pathway Analysis

In order to show the effectiveness of our method, we submit the final selected miRNA pair, which is not only most associated with survival outcomes but also effective to stratification of patients to low-risk and high-risk groups, to DIANA miRPath [20]. We only use TarBase [21] to select the targets of the miRNA pair, considering that it is a database of published experiments validated miRNA-gene interactions. Focusing on the pathways related to the selected miRNA pair instead of those corresponding to each component of the selected pair, we can find significant pathways, which may support our finding and show the effectiveness of our method.

#### 3. Results

##### 3.1. Joint Covariate Detection for GBM Survival Analysis

In this part, we apply joint covariate detection to seeking miRNAs which are associated with the risk of death and the stratification of high-risk and low-risk patients in GBM. The representation of “joint” is twofold. First, it is a strategy that combines Cox regression for seeking survival-associated variables with log-rank test on risk scores for evaluation of the classification results. Second, it also exhibits the steps from enumerations on each individual variable to those on enumerable covariate tuples. Considering the computational cost, joint covariate detection terminates after finishing enumeration on miRNA pairs.

For each miRNA, values expressed in (3) and (8) were obtained after 10000 rounds of permutations. The miRNAs with values ≤ 0.01 were regarded to be individually significant. We obtained six significant miRNAs, as listed in Table 1. Using the expression levels of the selected significant miRNAs, we made Kaplan-Meier survival analyses on high-risk and low-risk groups derived from cut-off values by calculating the median risk scores expressed in (4), as illustrated in Figure 1. Besides, values of each significant miRNA were also shown in Figure 1. On assumption that hazard ratio (HR) is constant over survival time, we listed HRs in Figure 1, too.