Abstract

In the past decade, researchers in oncology have sought to develop survival prediction models using gene expression data. The least absolute shrinkage and selection operator (lasso) has been widely used to select genes that truly correlated with a patient’s survival. The lasso selects genes for prediction by shrinking a large number of coefficients of the candidate genes towards zero based on a tuning parameter that is often determined by a cross-validation (CV). However, this method can pass over (or fail to identify) true positive genes (i.e., it identifies false negatives) in certain instances, because the lasso tends to favor the development of a simple prediction model. Here, we attempt to monitor the identification of false negatives by developing a method for estimating the number of true positive (TP) genes for a series of values of a tuning parameter that assumes a mixture distribution for the lasso estimates. Using our developed method, we performed a simulation study to examine its precision in estimating the number of TP genes. Additionally, we applied our method to a real gene expression dataset and found that it was able to identify genes correlated with survival that a CV method was unable to detect.

1. Introduction

In the past decade, researchers have predicted survival in a cancer patient based on gene expression data [14]. Revealing the relationship between gene expression profiles and the time to an event of interest (e.g., overall survival, metastasis-free survival) can improve treatment strategies and establish accurate prognostic markers. The Cox proportional hazard model is the most popular method for relating covariates to survival times [5]. However, due to the high dimensionality of gene expression data (i.e., the number of genes expressed exceeds the number of patients), it is not possible to take an estimation approach based on the Cox log partial likelihood. To overcome this problem, a penalized estimation approach, which includes a shrinkage estimation of coefficients, is frequently taken [68].

In penalized estimation approaches, the least absolute shrinkage and selection operator (lasso) [9, 10] is often used because of its attractive ability to simultaneously select the genes correlated with survival and estimate the coefficients in the Cox model. The lasso shrinks most of the coefficients towards zero exactly by adding norm to the Cox log partial likelihood, and the amount of shrinkage is dependent on the tuning parameter. The value of the tuning parameter is often determined by a cross-validation (CV), which maximizes the out-of-data prediction accuracy [11].

Several researchers have investigated the operating characteristics of the lasso. Goeman [12] used the lasso to analyze a publicly available gene expression dataset, obtained from the articles of van’t Veer et al. [2] and van de Vijver et al. [3] in which a 70-gene signature for prediction of metastasis-free survival in breast cancer patients had been established. This data included 295 patients with 4919 genes that were prescreened from 24,885 genes based on the quality criteria in van’t Veer et al.’s work [2]. The lasso selected 16 genes with which to develop a prediction model of overall survival when using the tuning parameter that was determined using a CV. Goeman [12] also conducted ridge regression using all 4919 genes to develop a model by adding norm to the Cox log partial likelihood. The prediction accuracy of the lasso and ridge regression were compared, and the ridge regression with 4919 genes slightly outperformed the lasso with 16 genes. Goeman [12] concluded that the lasso potentially passes over genes that are correlated with survival in order to develop a simple prediction model. Bøvelstad et al. [7] reached the same conclusion in a review of the survival prediction methods available for analyzing breast cancer gene expression datasets. Table 1 summarizes a typical result of gene selection by the lasso.

The CV method determines the value of the tuning parameter by considering the trade-off between the number of true positives (TP) and false positives (FP), and so the possibility of identifying false negatives (FN) cannot be eliminated. One solution for identifying more outcome-predictive genes is to monitor the number of TP in several values of the tuning parameter and, subsequently, determine its final value. In this study, we developed a method for estimating the number of TP for a series of values of the tuning parameter. We assumed a mixture distribution with components of TP and FP for the lasso estimates, and these could be used to estimate the number of TP and FP. It is possible to generate the solution path that includes the lasso estimates for a series of values of the tuning parameter using the methods developed by Goeman [12]. Here, we proposed an algorithm to sequentially fit the mixture distribution for this solution path, and we used a simulation study to test the precision of the algorithm when estimating the number of TP. We further demonstrated the proposed algorithm using a well-known diffuse large B-cell lymphoma (DLBCL) dataset comprising overall survival of 240 DLBCL patients and gene expression data of 7399 genes [1].

2. Materials and Methods

2.1. Lasso in the Cox Proportional Hazard Model

The Cox proportional hazard model is the most popular method for evaluating the relationship between gene expression and time to an event of interest [5]. The hazard function of an event at time for a patient    with the gene expression levels is given bywhere is a parameter vector and is the baseline hazard, which is the hazard for the respective individual when all variable values are equal to zero. In the general setting where , the coefficients are estimated by maximizing Cox log partial likelihood as follows:where is an indicator, which is 1, if the survival time is observed, or 0, if censored. is the risk set of the individuals at .

In the lasso for the high-dimensional setting where , the coefficients are estimated by maximizing the following penalized likelihood function [9, 10]:where is the tuning parameter, which determines the amount of shrinkage.

2.2. Solution Path of the Lasso Estimates

Goeman [12] introduced a method to calculate the solution path of the lasso estimates as a function of , , which is based on the algorithm developed by Park and Hastie [13]. The method maximizes at a fixed based on a combination of gradient ascent optimization with the Newton-Raphson algorithm. are calculated for successively, starting from (which gives because the value has zero gradients). is chosen arbitrarily but is often set to in analyses of gene expression data [14]. The lasso estimates at a current step are set to initial values for calculation of the subsequent step. Step length is the minimum decrement to change the number of selected genes ; that is, only one gene is newly selected or excluded from to .

2.3. Mixture Distribution for Estimating the Number of TP in the Lasso Estimates

To estimate the number of TP in the lasso estimates at a fixed value of , we assumed a mixture distribution developed in our previous study [15]. We introduced the mixture distribution based on the two features of the lasso: (i) the lasso selects at most genes because of the nature of the convex optimization problem when [16, 17] and (ii) in the Bayesian paradigm the lasso estimates are the posterior mode with the independent Laplace prior distribution , where is the probability density function of Laplace distribution with location parameter and scale parameter [9]. Therefore, the mixture distribution assumed for the lasso estimates at was as follows:where and are mixed proportions ; is the probability density function of the normal distribution with mean ≠0 and variance in component ; is the number of components, which is determined by model selection criteria; and is the constant value, which is boundlessly close to 0; for example, . The unknown parameters, , , , , and , are estimated by maximizing the log-likelihood function of (4) by using the Newton-Raphson method.

The mixture distribution defined in (4) is formulated on the basis of the following concepts: since the lasso selects a maximum of genes when , the coefficients for genes are exactly zero; therefore, (4) consists of 2 terms ( term and term). In the term, the Laplace distribution with location parameter 0 and scale parameter was assumed to be the distribution for the FP on the basis of the lasso feature (ii) discussed above, while the component normal distribution with location parameter and scale parameter was assumed as the distribution for the TP. In the term, the Laplace distribution with location parameter 0 and scale parameter was assumed as the distribution of genes based on the aforementioned lasso feature (i).

The with location parameter 0 and scale parameter was assumed to be the distribution for the FP on the basis of lasso feature (i), discussed above. The with location parameter and scale parameter was assumed as the distribution for the TP. The of the term was assumed as the distribution of genes based on the aforementioned lasso feature (ii). Given a cut-off value (>0), the estimated proportions of the FP and TP are the area under the estimated Laplace and normal distribution in the term of (4), respectively, and can be written as follows:Figure 1 illustrates the calculation in (5) when the number of components, , is 1. Using (5), the number of TP and FP was estimated by

2.4. Algorithm for Estimating Number of TP in a Series of Values  

Here, we propose an algorithm to sequentially fit the mixture distribution in (4) to the solution path of the lasso estimates, which was described in Section 2.2. In this algorithm, we assumed that the number of TP changed when the newly selected or excluded gene from to was truly correlated to survival, based on the maximum log-likelihood of (4). First, we approximated and in (5) by assuming a suitably small cut-off value (≈0). We then obtained and from (6) and (7), respectively, where is an estimate of the number of TP in component . For , the proposed algorithm was as follows.

Step 1

Step 1.1. In this step, we assumed that the newly selected or excluded gene from to was FP. denotes the proportion of FP and is set asFor the other components, , set .

Step 1.2. Given and , calculate the maximum log-likelihood of (4), .

Step 2

Step 2.1. Set .

Step 2.2. In this step, we assumed that the newly selected or excluded gene from to was TP. For the component , setFor the other components, set and .

Step 2.3. Given and , calculate the maximum log-likelihood of (4), .

Step 2.4. Set . Repeat Steps 2.2 and 2.3 until .

Step 3. In this step, we determined whether the newly selected or excluded gene from to was TP or FP based on the maximum log-likelihood which was calculated in Steps 1.2 and 2.3. If was the largest in , we assumed that the newly selected or excluded gene was FP; if not, we assumed that it was TP. Therefore, calculate . If , update as follows:If , update as follows:Here, calculate the estimated TP at by .

3. Results

3.1. Simulation Study

We performed a simulation study to examine the precision of our estimated TP. In this study, the number of patients, , was set to 200. The number of genes, , was set to 1000, which included the =5 or 30 outcome-predictive genes that are randomly chosen from genes in each simulation. The coefficient for gene , , was set to 1.5 for the outcome-predictive genes and 0 for the remaining none-outcome-predictive genes. We set to 5 and the number of components, , to 1 throughout (although was determined using a model selection criterion in practice). The gene expression levels for patient , , were generated from the multivariate normal distribution with mean vector and covariance matrix so that the variance was 1 and the correlation or [18]. The survival time for patient was generated based on the exponential model where is the uniform random variable between 0 and 1 [19]. In order to evaluate the precision of the estimated TP for various values of , we report a number of selected genes, including true TP, and estimated TP and FP, for .

Table 2 shows the average of , a number of selected genes, true TP, and estimated TP and FP, through 1000 repeats. We observed that the precision of estimated TP varied depending on the value of both and (see Table 2). When , the precision of the estimates was sufficient for , 50, 100, and 150, while TP was slightly underestimated for . However, when , the precision of the estimates was sufficient for , 10, and 150, while TP was overestimated for and 100. For example, when , , and , the average number of true and estimated TP was 29.9 and 35.3, respectively. The values of did not greatly affect the accuracy of the estimated TP.

3.2. Real Data Analysis

To illustrate how our proposed algorithm could be used to determine , we applied it to the DLBCL dataset, comprising survival of 240 DLBCL patients and gene expression data from 7399 genes [1]. In the gene expression data from the 240 patients, we identified 434 genes with complete sets of gene expression values; all other genes had missing expression values, with an average of 24.7 missing values per gene. Here, we used 0.0 as the missing expression value for descriptive purposes. Similar to Rosenwald et al. [1], we divided the data into two: training data consisting of 160 patients and validation data consisting of 80 patients.

For the training data, we obtained the solution path of the lasso estimates; . was calculated as described in Section 2.2. We set according to Simon et al. [14].

We applied our proposed algorithm to the obtained solution path. We assumed three mixture distributions on the lasso estimates with , 2, or 3 and compared their goodness of fit for the by the Akaike information criterion (AIC). As a result, we chose because it had the best AIC for all .

Figure 2 shows the estimated number of TP in a series of values of . We found that the lasso selected at most 42 TP, with the number of selected genes at 96, when = 0.86 as . Therefore, we selected as the optimum , and the estimated mixture distribution for the value of was as follows:In order to identify the 42 TP from the 96 selected genes, we arranged the 96 in descending order of and identified the first 42 listed genes with a cut-off value . Subsequently, the model that included these 42 genes is identified as the “42 TP-model.”

In comparison to the 42 TP-model, we performed CV. Briefly, the -fold CV was given bywhere and are the log partial likelihood and the lasso estimate with left th fold out, respectively. The optimal value of was obtained by maximizing . On the basis of 5-fold CV, 12 genes were selected with =1.43 as . Subsequently, the model including these 12 genes is identified as the “CV-model.” Notably, both the 42 TP-model with 42 genes and the CV-model with 12 genes selected 4 genes in common. Table 3 shows the GenBank accession number and description for each of the 4 genes selected by both models.

We compared the prediction accuracy of the 42 TP-model and the CV-model using validation data consisting of 80 patients. For this data, we calculated 3 values that served as comparison criteria: values for the log-rank test and prognostic index and the deviance. The 80 patients were categorized into 2 groups, the “better” and “worse” prognostic groups, using the boundary of the median of prognostic index . The Kaplan-Meier curves between the 2 groups were compared with a log-rank test. Next, we calculated the value for the parameter multiplied by the prognostic index in the Cox proportional hazard model . Finally, the deviance was calculated by , where and are the Cox log partial-likelihood function for the estimated coefficients by using the training data and zero vector , respectively. For each criterion, the lower value suggested better prediction accuracy.

Table 4 shows the values of the 3 criteria for each model. We found that the values of all 3 criteria for the 42 TP-model were lower than those for the CV-model, suggesting that the model based on the proposed method was more accurate (see Table 4). Additionally, Figure 3 shows that the Kaplan-Meier curves for the 42 TP-model distinguished the “better” and “worse” prognostic groups more definitely than those for the CV-model (42 TP-model, ; CV-model, ). Therefore, by using our proposed algorithm, we determined and were able to select important genes, likely to be correlated with survival, in which the CV was unable to select.

4. Discussions

In this study, we proposed an algorithm for estimating the number of TP on the solution path of lasso estimates. Monitoring and determining the number of TP for a series of values are important because they can increase the probability of uncovering all outcome-predictive genes. The number of TP should be estimated with appropriate accuracy. To confirm the accuracy of our TP, we conducted a simulation study using a typical gene expression dataset. We found that the precision of our algorithm for estimating the number of TP was adequate, although an overestimation occurred with some values of . However, the overestimation occurred when the true number of TP was saturated, and so it may not cause a problem by passing over genes that truly correlated with survival. In the simulation study where and , the maximum average estimated number of TP was 35.3 at (see Table 2). Using this to select TP, an average selection of 29.9 TP within 30 outcome-predictive genes can be made, with the number of TP genes that are passed over being negligible in practice.

The data that have been provided in Table 2 showed that the number of false positives increased, while the number of true positives increased and then plateaued as the tuning parameter decreased. To decrease the number of FP identified while maintaining an adequate number of TP, we should determine the value of by monitoring both the number of TP and the false positive rate in the proposed method.

Additionally, our proposed algorithm was applied to DLBCL data. We determined the value of the tuning parameter based on the maximum number of estimated TP uncovered by the algorithm. We identified 42 TP genes among 96 selected genes based on the ranking of the absolute values of the lasso estimates. We can also identify TP based on model evaluation criteria such as AIC among all possible combinations of 42 genes from 96, that is, (>1027) combinations in total; however, calculation of AIC for all possible gene combinations is a distant approach. To evaluate the efficiency of the approach using the ranking of the lasso estimates, we calculated the AIC for 10,000 randomly chosen models among all the possible models and subsequently compared it with the AIC of our approach. From 10,000 models, the AIC of 425 models (4.25%) was better than that of our approach. This result indicated that our ranking-based approach has a satisfactory performance in practice with respect to the identification of 42 genes. Although investigation of all possible gene combinations is ideal, our approach is a good alternative.

In the application to DLBCL data, in comparison to a CV method by which 12 genes were identified, we identified 42 TP genes with our algorithm, and we improved the prediction accuracy of the model. In practice, some researchers might be satisfied with identifying a few promising genes and would not be unduly worried about passing over others. In such a situation, the CV would be preferable because it developed the model to uncover a few genes with just a small loss of prediction accuracy. However, genes that are selected by the lasso are often investigated with greater scrutiny by genetic researchers, and so passing over outcome-predictive genes by the lasso could represent a major problem. Indeed, if the lasso passes over outcome-predictive genes, some genetic research may not take place. Therefore, when identifying all outcome-predictive genes is a priority, our proposed algorithm will be most useful.

5. Conclusions

We developed a method for estimating the number of true positives for a series of values of a tuning parameter in the lasso. We demonstrated the utility of the developed method through a simulation study and an application to a real dataset. Our results indicated that our developed method was useful for determining a value for the tuning parameter in the lasso and reducing the probability of passing over genes that are truly correlated with survival.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.