Abstract

Microarray technology results in high-dimensional and low-sample size data sets. Therefore, fitting sparse models is substantial because only a small number of influential genes can reliably be identified. A number of variable selection approaches have been proposed for high-dimensional time-to-event data based on Cox proportional hazards where censoring is present. The present study applied three sparse variable selection techniques of Lasso, smoothly clipped absolute deviation and the smooth integration of counting, and absolute deviation for gene expression survival time data using the additive risk model which is adopted when the absolute effects of multiple predictors on the hazard function are of interest. The performances of used techniques were evaluated by time dependent ROC curve and bootstrap .632+ prediction error curves. The selected genes by all methods were highly significant . The Lasso showed maximum median of area under ROC curve over time (0.95) and smoothly clipped absolute deviation showed the lowest prediction error (0.105). It was observed that the selected genes by all methods improved the prediction of purely clinical model indicating the valuable information containing in the microarray features. So it was concluded that used approaches can satisfactorily predict survival based on selected gene expression measurements.

1. Introduction

Head and neck squamous cell carcinoma (HNSCC) is the sixth most common cancer worldwide that has variation across subsites (oral cavity, oropharynx, larynx, or hypopharynx) in many characteristics including age, sex, ethnicity, histologic grade, treatment modality, and prognosis [13]. HNSCC as the most common smoking related cancer after lung cancer is one of the most aggressive malignancies in human population [1, 3] and causes about 438,000 smoking-attributable mortalities in the world each year. The most common anatomic site of HNSCC counting for approximately 50% of all HNSCC is oral squamous cell carcinoma (OSCC) which contains two most commonly diagnosed oral premalignant lesions (OPL) including Leukoplakia and erythroplakia [3].

Although tremendous efforts were committed to early detection, prevention, and treatment during the last decades, HNSCC prognosis remains very poor with the rising incidence in developed countries and younger population [1]. Even for HNSCC diagnosed at early stages, surgery (current standard care) is a debilitating, substantially morbid procedure that leads to severe impairments in quality of life for the patients [3]. Therefore, developing new approaches including diagnosis of the disease before the cancerous stage and preventing development of invasive cancers such as HNSCC is very substantial. The HNSCC is a multistep process and genetic factors play an important role in its etiology [1]. Generally, it is well accepted that malignant tumors, on the molecular level, are disease of genes [1].

The rapid development of biotechniques during the past decade has brought in a wealth of biomedical data including DNA microarrays which can be used to measure the expression of thousands of genes in a sample of cells or to identify hundreds of thousands of single nucleotide polymorphisms for an individual at the same time [4]. This kind of information leads to high-dimensional and low-sample size (HDLSS) data sets (i.e., where and are the number of covariates and patients) which pose tremendous challenges to effective statistical inference especially for the time-to-event due to the presence of censoring and the use of much more complicated models. In this case, a number of the outcome variables are typically censored. Besides, establishing link between high-dimensional biomedical data and patient’s survivals would be informative in the presence of clinical data.

In this regard, a fundamental objective is to identify a small set of genes from a huge number of features whose expression levels are significantly correlated with a given clinical outcome such as cancer. Identified genes are then often used to create a predictive model for conducting prediction of outcome for new patients [5].

Recently, several variable selection techniques based on the maximization of a penalized likelihood have been proposed for HDLSS time-to-event data under the Cox proportional hazards model. Some of the most common penalization methods are the least absolute shrinkage and selection operator (lasso) [6], smoothly clipped absolute deviation (SCAD) [7], Dantzig selector [8], LARS [9], and the smooth integration of counting and absolute deviation (SICA) penalty [10]. These techniques can yield sparse models and hence perform simultaneous variable selection and estimation. To select ideal penalty functions for model selection Fan and Li [7] advocated penalty functions giving rise to estimators with three desired properties of sparsity, unbiasedness, and continuity. This class of penalty functions has been studied by Lv and Fan [11] and Fan and Lv [12] in (generalized) linear models contexts. The reasonable performance of penalization techniques such as Lasso, SCAD, and SICA was also investigated by Lin and Lv [4].

Despite the extensive study of the proportional hazards model for survival data with high-dimensional covariates, a few authors have utilized variable selection methods based on additive risk model which may provide more insights beyond the proportional hazards model analysis [4, 13]. Additive models assume that the covariate effects under consideration contribute additively to the conditional hazard [13].

The aim of the present study was to compare three sparse variable selection methods of Lasso, SICA, and SCAD for high-dimensional low-sample size data using an additive hazard approach to predict survival time in patients with OSCC and to determine the influential genes on survival time.

2. Methods

2.1. Data Source

The proposed techniques are illustrated with publicly available microarray data from patients with OPL [3]. The dataset consists of gene expression measurements for 29096 genes and survival outcomes to develop oral cancer on 86 patients. These 86 patients were selected from the 162 patients who were involved in a chemoprevention trial. The number of 35 patients who developed oral cancer during the study period was selected for gene expression profiling. Besides, 51 samples (ad hoc choice) from patients who did not develop OSCC were randomly selected among 106 patients. The median follow-up time was 5 years for this sample. The data analyzed in this study is available from the NIH Gene Expression Omnibus database at www.ncbi.nlm.nih.gov/geo under the accession number GSE26549. Potentially important clinical covariates were age, histology at baseline, deltaNp63, and podoplanin expression at baseline.

2.2. Variable Selection via Additive Hazards Model

Regularization techniques are particularly useful for variable selection in high-dimensional setting where the number of variables is much greater than the sample size and have gained increasing popularity. These estimation techniques pose a penalty term on the coefficients in objective function and shrink the estimates of the coefficients towards zero relative to the maximum likelihood estimates. The goal of this shrinkage is to prevent overfitting which arises due to either collinearity of the covariates or high-dimensionality [14]. When the outcome is survival time, the objective function is usually written based on hazards at the failure time and posing penalties on the coefficients can yield sparse models and hence perform simultaneous variable selection and estimation.

Consider a sample of independent observations. Let be the time to an event like death from cancer for the th subject and conditionally independent of the censoring time , given the p-dimensional possibly time-dependent covariate vector . Let and for right censored data, where is the indicator function. Then the observed data consists of . The hazard function of a failure time given a -vector of possibly time-dependent covariates based on Lin and Ying additive hazards model is as follows: where is an unknown baseline hazard function which is common to all subjects and is a -vector of regression coefficients [4, 15]. The pseudo score linear in function for Lin and Ying [16] model can be defined as follows: where is the observed-failure counting process, is at-risk indicator, , and is the maximum follow-up time. Performing some algebraic manipulation results in where and . Because is positive semidefinite, integrating with respect to results in the least squares type loss function as Then, the penalized estimator is a solution to the regularization problem where is the likelihood of beta for additive model, , is a penalty function based on the regularization parameter and is often rewritten as [4].

The present study considered three commonly used sparse penalty functions which correspond to Lasso, SCAD, and SICA. In this regard, the Lasso uses the -penalty; that is, . The SCAD penalty is given by the derivative with some as a shape parameter and SICA penalty takes the form and is a shape parameter. Estimation of is then accomplished via the coordinate descent algorithm [4].

Selecting the optimal regularization parameter is conducted through 10-fold cross-validation based on the following cross-validation score function: where is the least squares type loss function computed from the th part of the data, and is the estimate from the data with the th part removed [4].

The SCAD and SICA have one additional tuning parameter . For the SCAD penalty, was suggested by Fan and Li [7] from a Bayesian perspective. Selecting for the SICA penalty requires a little more caution, because small values of that often needs to yield a superior theoretical performance may sometimes lead to the computational instability [4]. This study used the method proposed by Lin and Lv [4] to determine in SICA.

2.3. Properties of Used Penalties

In spite of choosing the penalty function, the performance of the regularized estimators depends on various factors, such as the dimensionality of the model, the correlation among the variables, and the choice of the regularization parameter. Also, penalization techniques may suffer from some drawbacks.

Although the Lasso method enjoys the advantage of computational simplicity, it suffers from several shortcomings. All coefficients are shrunken toward zero by Lasso. So, the large elements of coefficients tend to be underestimated. The SCAD penalty is proposed to eliminate the bias caused by the Lasso and does not have this drawback. Also, it has been shown to enjoy the oracle property. It means that the resulting estimator performs asymptotically as well as the oracle estimator which knew the true sparse model in advance [4]. SCAD does not require the irrepresentable condition to consistency. In spite of nonconvexity of SCAD, algorithms are available to compute its solution [7, 17]. On the other hand Lv and Fan showed that the regularity conditions needed for the Lasso can be substantially relaxed by using concave penalties due to nonasymptotic weak oracle property [11]. The SICA family proposed in their study has the noticeable feature that it can be used to define a sequence of regularization problems with varying theoretical performance and computational complexity [4].

2.4. Predictive Performance

Assessment of the performance of three variable selection methods was performed by using time-dependent receiver operator characteristic (ROC) curves [18] and bootstrap .632+ prediction error curves [19] which were used to evaluate the performance of the models. The prediction error was obtained as squared difference between the true state (0 for being still under risk and 1 if an event of cancer occurred) at time t and predicted survival probability. Lower prediction errors suggest better performance.

2.5. Software

Analyses were performed by using the R software programming (http://www.r-project.org) based on a publically available R package which has been provided by Lin and Lv [4] (http://www-scf.usc.edu/~linwei/software.html). In addition, evaluating the predictive performance of used methods was utilized using “pec,” “peperr,” and “survAUC” R packages. Besides, to select the most effective genes, “timeROC” package was utilized.

3. Results

Three variable selection techniques of Lasso, SCAD, and SICA were implemented on microarray gene expression data of 86 leukoplakia samples of OPL patients based on additive hazards model. The frequency of occurrences of the genes along with means of coefficients and standard errors over 100 replicates, obtained from three techniques, was shown in Table 1. As shown in Table 1, the number of selected genes by Lasso was greater than SCAD and SICA. The SICA selected only a small set of genes due to the sparseness. In addition, there are seven common genes (7905589, 7908407, 8106919, 8126931, 8161169, 8174970, and 8180388) among three variable selection techniques.

In order to assess predictive performance, the median area under ROC curve over time (AUC()) was computed and plotted based on 69 training and 17 test sets for each method. Results are shown in Figure 1.

For the follow-up period, the median AUC for Lasso, SCAD, and SICA was 0.95, 0.91, and 0.83, respectively. As can be seen from Figure 1, the predictive performance of the Lasso was superior to SCAD and SICA in this data analysis.

In addition, to evaluate prediction performance improvement by including selected genes over a purely clinical model, bootstrap .632+ prediction error curves were plotted based on bootstrap samples drawn without replacement. Figure 2 shows the results. As shown, including selected microarray features in the models clearly improved prediction performance over the purely clinical model indicating the valuable information containing in the microarray features. Also, it can be seen that the performance of SCAD (bootstrap .632+ prediction error = 0.105) was slightly superior to Lasso (bootstrap .632+ prediction error = 0.124) and SICA (bootstrap .632+ prediction error = 0.132). In order to compare, the prediction error curve of the method utilized by Saintigny et al. [3] to select effective genes named component-wise likelihood-based boosting was also provided in Figure 2. As shown, based on prediction error curve, the performance of SCAD was better compared to component-wise likelihood-based boosting.

Finally, five most effective genes (8126931, 8120206, 7971191, 8161169, and 7897663) were selected using area under ROC curve by means of timeROC package. The median AUC of these genes was 0.825, 0.816, 0.787, 0.786, and 0.702, respectively. Genes 8126931, 8120206, and 7897663 were significant (, , and ) by using additive model and the expression of all three genes decreased the survival time.

4. Discussion

Due to the low observations for microarray data, only a small number of influential genes can reliably be identified [20]. Therefore, fitting sparse models with only a few nonzero coefficients is interesting [20].

In this study, the performance of three sparse variable selection techniques was investigated based on an additive hazards model for survival data, in the presence of high-dimensional covariates. It was indicated that, based on AUC criterion, Lasso penalty resulted in higher capability of prediction than other methods. The vast majority of genes selected by Lasso approach had frequency greater than 90 percent indicating the stability of the selected gene in this approach. In a study conducted by Ma and Huang [13] the high-dimensional survival time modeled using the additive hazards model. Their results, based on two real data sets, showed appropriate performance of Lasso. Also, in a study conducted by Lin and Lv [4] performance of Lasso in an additive hazards model was reasonable.

Although Lasso performs shrinkage and variable selection simultaneously for better prediction and model interpretation, if there is a group of variables among which the pairwise correlations are very high, then the LASSO tends to arbitrarily select only one variable from the group [21]. This restriction might lead to some problems in the analysis of gene expression data where identification of an entire set of correlated genes may lead to an improved understanding of the biological pathway [5]. Superior performance of the elastic net penalty compared to Lasso was confirmed by Engler and Li based on a Cox proportional hazards model. It is suggested to use the elastic net penalization approach to deal with high-dimensional low-sample size survival data in an additive manner.

Despite the lower number of selected genes, the AUC of the SCAD and SICA was also comparable to the Lasso. On the other hand, based on the prediction error criterion, the SCAD showed minimum value indicating better prediction. Lin and Lv [4] conducted a simulation study to investigate and to compare some sparse variable selection techniques. Their results showed that the performance of SCAD and SICA was good with slight superiority of SICA.

In addition, the selected genes exhibited some consistency across three used methods in terms of selecting common genes. However, variability was observed due to the low-sample size and ultrahigh-dimensionality as well as different penalty functions which were utilized by different techniques [4]. Reducing the dimensionality by some techniques including sure independence screening [22] is suggested before using the regularization methods to gain more stability. Fan and Lv [22] showed that the prediction performance of all methods tends to improve.

Based on the AUC, three genes were diagnosed as the most effective genes on OPL patients’ survival. Accordingly, these genes can predict the survival time of the patients with OPL, if they are taken into account simultaneously in an additive hazards model. The expression of genes 8126931, 8120206, and 7897663 can decrease the survival time.

Saintigny et al. [3] conducted a study on the same dataset using a boosting approach which is also a sparse technique. They utilized Cox proportional hazards model and identified a small cluster of genes expression. However, due to the sparseness of utilized techniques, there is only a small overlap with the present study, with only one common microarray feature (8095441). Also the performance of boosting based on AUC (AUC = 0.95) was comparable with three methods used in the present study. However, based on prediction error curve the performance of SCAD was superior. It is suggested that simulation studies are conducted to further assessment of the performance of these techniques as well as different data sets.

One important restriction in the present study was that gene expression data were not identified for a large part of the dataset. Eliminating this part of dataset may result in selection bias in the present study’s result. Besides, we could not assess the biological mechanisms and the effect of the selected genes on the pathogenesis or progress of OPL; therefore, it is suggested that the pathogenic effect of reported genes is evaluated in the future studies.

The present study introduced a new set of influential microarray features in predicting OPL patients’ survival from a different perspective of the proportional hazards. According to the result of the present study, despite the small number of selected genes three methods of Lasso, SCAD, and SICA showed reasonable performance in additive manner and the selected genes improved prediction performance over a purely clinical model.

The additive risk models and used variable selection approaches provide a useful alternative to existing dimension reduction techniques based on Cox’s model for survival data with high-dimensional covariates. The performance of the different models and dimension reduction techniques are data dependent with no method dominating the others [23]. Comprehensive simulation studies and data analysis will be required to draw more definitive conclusions.

5. Conclusion

The present study indicated that three feature selection approaches of the Lasso, SCAD, and SICA performed reasonably well in handling high-dimensional time-to-event data corresponding to OPL patients based on additive risk. However, the SICA selected smaller number of genes than the Lasso and SCAD. Furthermore, the selected genes by these methods improved the prediction of purely clinical model and can satisfactorily predict survival.

Conflict of Interests

The authors declare that there is no conflict of interests.