Research Article  Open Access
Simpler Evaluation of Predictions and Signature Stability for Gene Expression Data
Abstract
Scientific advances are raising expectations that patient-tailored treatment will soon be available. The development of the resulting clinical approaches needs to be based on well-designed experimental and observational procedures that provide data to which proper biostatistical analyses are applied. Gene expression microarray and related technology are rapidly evolving and now provide extremely large gene expression profiles containing many thousands of measurements. Choosing a subset of these gene expression measurements to include in a gene expression signature is one of the many challenges to be met. The choice of signature depends on many factors, including the selection of patients in the training set, so the reliability and reproducibility of the resultant prognostic gene signature need to be evaluated in such a way as to be relevant to the clinical setting. A relatively straightforward approach is based on cross-validation, with separate selection of genes at each iteration to avoid selection bias. Within this approach we developed two different methods, one based on forward selection, the other on genes that were statistically significant in all training blocks of data. We demonstrate our approach to gene signature evaluation with a well-known breast cancer data set.
1. Introduction
The era of personalised medicine is being widely heralded, with claims including “all human illness can be studied by microarray analysis, and the ultimate goal is to develop effective treatments or cures for every human disease by 2050” [1]. The first step is the selection of predictors from the many (often tens of) thousands of measurements that can be made using modern microarray technology.
As is well known amongst statisticians, genes identified from a microarray analysis as predictors of outcome, and so included in a molecular signature, depend greatly on the selection of patients in the training set. Hence it has been advocated that validation by repeated random sampling (RRS) should be used [2]. RRS is advocated elsewhere, for example, for selecting genes for classification [3]. The RRS approach can quickly become extremely computer-intensive. Alternatively, the far less computer-intensive method of 10-fold cross-validation has been shown empirically, for two microarray data sets, to perform very well when compared with repeated random sampling [4].
K-fold cross-validation is a resampling method that is widely used in practice for estimating performance of a prediction rule (“signature”) when there is insufficient data to split the data into the ideal 3 parts: training, validation, and testing. For microarray experiments where the number, p, of measurements (say gene-expression values on an Affymetrix chip) is extremely large compared with the number, s, of samples (chips), it is starting to become clear that some traditional statistical practices may not always be appropriate.
There is considerable confusion about the application of cross-validation, and many investigators have not understood the importance of signature selection at each of the separate steps [4]. If the signal-to-noise ratio of the underlying function is relatively strong, then the signature may not vary very much. However, if it is not strong, and this is very likely when the number of measurements is very large compared with the number of samples, then the signature may vary appreciably; see [5, 6]. Further, it has been noted that “the most striking finding when comparing the significant lists [from different studies] is the virtually complete lack of agreement in the included genes … the present lack of coherence warrants further examination” [7].
The reason for the differences could be attributed to a number of factors, including methodology (model, algorithm, gene selection, preprocessing steps such as normalisation, transformation, etc.), data (different samples, heterogeneous samples, poor measurements, poor experimental designs leading to the presence of confounders, etc.), and the underlying biology (different genes from the same pathway, considered later). The fragility of a prediction model has been demonstrated [8] by showing that published results could not be replicated using the same data when training and test sets were interchanged and the same algorithms applied.
Here we consider a relatively large study [9] that was used to predict distant metastasis of lymph-node-negative primary breast cancer based on determination of a single signature. The data are available in series GSE2034 at the NCBI Gene Expression Omnibus (GEO). These data consist of gene expression values from 286 Affymetrix U133a chips, each with gene expression values for over 22 000 “genes”. Of these patients, 209 were oestrogen receptor-positive (ER+) and 77 were oestrogen receptor-negative (ER−). Briefly, the samples were a subset of frozen tumour samples from patients with lymph-node-negative breast cancer which had been submitted for steroid-hormone receptor measurement from an intake of 25 hospitals; further details are in [9]. Metastasis status was determined from follow-up examinations or confirmed following patient report. The (statistical) sample selection was nonrandom, and the data show considerable variation in many known breast cancer prognostic factors, such as therapy type, age, menopausal status, tumour grade, and stage. Data on these factors were not available on the web.
In the following, the ER+ patient data are chosen for illustrative purposes, noting that this selection also controls some of the overall patient heterogeneity. Use of 10-fold cross-validation is explored to determine those genes that are included in molecular signatures (lists). In other words, signature stability is evaluated, as is the variability of prediction measures.
2. Methods
First, gene expression values were log-transformed (base 2) after zero values were set to 0.1.
Two methods, both of which correct for selection bias [4], were developed and applied to these breast cancer data. Before detailing the methods we outline the basic steps of K-fold cross-validation:
(a) randomly divide the data into K separate and approximately equal parts called validation sets;
(b) leave out one of the validation sets, then (b1) perform the analysis using the combined data of the remaining K − 1 parts (referred to as the “training set”) and, usually, (b2) validate the analysis (that is, often making prediction/s) on the “validation set”;
(c) repeat (b) K times, so that following the (usual) (b2) step each one of the K (separate) validation sets is used once (e.g., for prediction/s).
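The steps above can be sketched in a few lines. This is an illustrative Python sketch only (the analyses in this paper were done in R); `train_fn` and `predict_fn` are hypothetical placeholders for the analyst's model-building (including gene selection) and validation steps.

```python
import random

def kfold_split(n_samples, k, seed=0):
    """Step (a): randomly partition sample indices into k
    approximately equal validation sets."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Deal indices round-robin so fold sizes differ by at most one.
    return [idx[i::k] for i in range(k)]

def cross_validate(n_samples, k, train_fn, predict_fn):
    """Steps (b) and (c): each fold serves once as the validation set;
    the model (signature) is re-selected on each training set."""
    folds = kfold_split(n_samples, k)
    results = []
    for i, validation in enumerate(folds):
        training = [j for f in folds[:i] + folds[i + 1:] for j in f]
        model = train_fn(training)                     # step (b1)
        results.append(predict_fn(model, validation))  # step (b2)
    return results
```

The essential point, stressed later in the paper, is that `train_fn` must repeat all selection steps on each training set.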
For method (i), within each training set at step (b1), we first did an initial screening by applying a univariate Cox proportional hazards regression model (Cox PHM) to select those genes in each of the training sets that were statistically significant for metastasis-free survival. A (partial) likelihood ratio test, comparing the partial likelihood estimated for the gene to that of the null model, was used. Genes were retained when the likelihood ratio test was statistically significant at level α. Next, restricting attention to just these retained genes, we used a Cox PHM with forward selection to select amongst the genes available following the initial gene selection. The selection was terminated when the statistical significance for inclusion into the Cox PHM, using a log likelihood ratio test, was greater than level α. For step (b2), the resulting model was used to predict relapse-free survival at 5 years on the validation data set. To evaluate predictive performance, a binary outcome was defined [2], namely whether the predicted probability that a patient was relapse-free was greater than, or less than, 0.5. Then, three measures of predictive performance (also called accuracy of classification) were estimated, namely the true positive fraction (TPF), also known as sensitivity, the false positive fraction (FPF), and the proportion of correctly predicted samples. Specificity is 1 − FPF. In a clinical setting, sensitivity and specificity are more relevant than the proportion correctly predicted (i.e., 1 − classification error) that is often the measure reported in the bioinformatics literature. It is straightforward to show that the proportion of correctly predicted samples is a weighted average of sensitivity and specificity, with weights given by the prevalence of each outcome. Note that the proportion correctly predicted can be misleading if the outcome is rare or very common.
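The three performance measures, and the weighted-average relationship between them, can be made concrete with a minimal sketch (illustrative Python, not the authors' R code):

```python
def performance(pred, truth):
    """Estimate TPF (sensitivity), FPF (1 - specificity) and the
    proportion correctly predicted from binary predictions."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fn = sum(1 for p, t in zip(pred, truth) if (not p) and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and (not t))
    tn = sum(1 for p, t in zip(pred, truth) if (not p) and (not t))
    tpf = tp / (tp + fn)          # sensitivity
    fpf = fp / (fp + tn)          # 1 - specificity
    correct = (tp + tn) / len(truth)
    return tpf, fpf, correct
```

The proportion correct equals prevalence × sensitivity + (1 − prevalence) × specificity, which is why it can look good for a rare outcome even when sensitivity is poor.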
Method (ii) followed method (i) except that, instead of forward selection, we selected those genes which were significant at level α in all training sets. Then all these genes were used in the signature, with a multivariate Cox PHM being fitted at each iteration. The validation at step (b2) for this second method was as described for method (i). This approach has been used successfully on data from proteomic profiling to distinguish malignant pancreatic cancer from benign disease [10].
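The core of method (ii) is a set intersection: only genes significant in every training set enter the signature. A minimal sketch of that selection step (illustrative Python; the per-fold significance lists would come from the univariate Cox screening):

```python
def common_genes(per_fold_significant):
    """Method (ii) selection: keep only the genes that are
    statistically significant at level alpha in every one of the
    K training sets, i.e. the intersection of the K gene lists."""
    sets = [set(genes) for genes in per_fold_significant]
    return sorted(set.intersection(*sets))
```

Because the intersection shrinks as K grows or α tightens, the resulting signature is small and, by construction, identical across folds, which is the source of its stability.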
There is no general rule for the value of K; values including 5, 10, and s (where s is the sample size) have all been suggested; the trade-off is between bias and variance [5]. Following Ambroise & McLachlan [4], we chose K = 10. This choice seems to balance the trade-off between the bias observed for low values of K and the variability observed for high values, particularly K = s (the so-called leave-one-out cross-validation). Note that any two 10-fold cross-validation training sets have approximately 80% of the total sample in common.
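The 80% figure follows directly from the fold structure: any two distinct training sets each omit one different validation fold, so their overlap is the remaining K − 2 folds out of K, i.e. 8/10 for K = 10. A one-line sketch of this arithmetic:

```python
def training_overlap_fraction(k):
    """Fraction of the total sample shared by any two distinct
    K-fold training sets: each training set omits one different
    fold, so the overlap is the remaining k - 2 folds out of k."""
    return (k - 2) / k
```

This large overlap is one reason successive training-set signatures are correlated rather than independent replicates.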
Here the cross-validation was applied in stratified form, so that approximately equal numbers of distant-metastasis (i.e., beyond 5 years) patients were included in the validation sets. We found that our substantive conclusions changed little if this restriction was relaxed.
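Stratification can be implemented by splitting the indices by outcome and dealing each stratum across the folds separately, so every fold receives its share of events. A minimal sketch under that assumption (illustrative Python; `event` is the binary metastasis indicator):

```python
import random

def stratified_folds(event, k, seed=0):
    """Assign sample indices to k folds so that events (e.g. distant
    metastases) are spread approximately equally across the folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    # Shuffle within each outcome stratum, then deal that stratum
    # round-robin across the folds.
    for value in (True, False):
        stratum = [i for i, e in enumerate(event) if e == value]
        rng.shuffle(stratum)
        for pos, i in enumerate(stratum):
            folds[pos % k].append(i)
    return folds
```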
The Cox proportional hazards model was chosen for modelling metastasis-free survival because of censoring in the data. The Cox PHM models the conditional hazard as the product of a baseline hazard, h_{0}(t), and an exponential form of a linear function of covariates, z, here log_{2} gene expression values, as follows:

h(t; z) = h_{0}(t) exp(β′z),

where β is the vector of regression coefficients.
The assumption of proportional hazards was evaluated and found adequate for a small number of significant genes.
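Under this model the ratio of hazards for two covariate vectors is exp(β′(z1 − z2)), in which the baseline hazard cancels; constancy of this ratio over time is exactly what the proportional-hazards assumption asserts. A minimal sketch with hypothetical coefficient values (not fitted estimates from these data):

```python
import math

def hazard_ratio(beta, z1, z2):
    """Relative hazard exp(beta . (z1 - z2)) between two patients
    under a Cox PHM; the baseline hazard h0(t) cancels, so the
    ratio does not depend on t."""
    return math.exp(sum(b * (a - c) for b, a, c in zip(beta, z1, z2)))
```

For example, with a single gene whose coefficient is log 2 on the log2-expression scale, a one-unit (i.e., two-fold) increase in expression doubles the hazard.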
Known gene function was examined using a text search on annotation available from the Affymetrix web site. The R software [11] with the package “survival” was used for analysis.
3. Results
3.1. Method (i)
The initial screening of genes at step (b1), using univariate Cox regression analyses with significance level α, identified from 210 to 342 genes in the 10 training sets. Altogether, across the 10 training sets, 675 unique genes were found.
The 10 signatures, built using a Cox PHM with forward selection at level α in each training set, consisted of 11 to 18 genes. Genes were generally uncorrelated within each signature.
The 10 signatures have very few genes in common; that is, the signatures are very unstable. For example, one gene occurred in 6 signatures (the most common), 4 genes occurred in 4 signatures, and 8 genes in 3 signatures. Seventy genes occurred only once. Also, very few of the 60 genes in Wang et al.’s [9] signature for the ER+ subset of patients occurred in the 10 signatures. Three of the signatures had no genes in common with the Wang signature, and the signature with the largest overlap had only four genes in common.
The discrepancies between the ten signatures might be thought to be occurring because correlated genes in the same pathway/s are being selected. Table 1 shows the number of genes found in each signature with different functional classes. These classes are based on those in Wang et al. [9, 12]. Functional classes that were found in none or only one signature are not shown. This analysis is not meant to be definitive but to illustrate the lack of consistency in apparent function between the signatures. We note that the variation demonstrated here may reflect the complexity of signalling processes in possibly many pathways, and/or the presence of tumour subtypes [13], and/or the heterogeneity in the sample.

Table 2 shows signature performance indices estimated on the training sets. As is now well known, when the performance is estimated on the same set as that used to train the signature, the performance is over-optimistic (i.e., biased [4]). Table 2 shows that each signature predicts distant metastasis very well for the data on which it was trained, even though the signatures were composed of a relatively small number of genes (average 15). However, Table 3 shows that the performance is considerably worse when the signature is used to predict metastasis-free status on the validation set. The test error, estimated as the proportion misclassified, using distant metastasis within 5 years as the defining mark, varies between 0.19 and 0.55. The average unbiased estimate of error rate is 32.6%, while the biased estimate of error rate is only 11.5%. The variance of the unbiased test error is 122.7.


3.2. Method (ii)
We demonstrate the second method using two values of the significance level α for the initial screening. With the larger value, there were 59 genes that were statistically significant in all training sets for metastasis-free survival, using a univariate Cox PHM. To assess performance of this method we estimated the Cox PHM using these 59 genes and predicted relapse-free survival at 5 years in each corresponding validation set. The true positive fractions (TPF) and false positive fractions (FPF) are plotted on the left in Figure 1. The location of the average pair (TPF, FPF) = (0.793, 0.581) is shown. The average TPF is slightly lower than for method (i), but so also is the FPF, which is desirable. Given the variability of the estimates, these averages can be considered approximately the same. The lower range of the FPF is less here than for method (i), and comparison of the FPF values in the figure with the corresponding column in Table 3 indicates an improvement over method (i). The average error rate from the validation sets is 33.3%, which is similar to method (i), but with variance 74.4, which is much lower.
When the smaller value of α was used for the initial gene selection, 14 genes were identified for inclusion in the signature. The pairs (TPF, FPF) estimated on the validation sets for this signature are plotted on the right of Figure 1. The averages of TPF and FPF are 0.896 and 0.548, respectively, with associated variances 0.009 and 0.042. The average estimate of error rate has decreased to 25.5%, with variance 50.0.
4. Discussion
Cross-validation is shown to be a useful tool to assess performance and stability of molecular signatures used for prediction from microarray data. Such an evaluation is necessary since the number of samples is small relative to the number of measurements. Alternatives to cross-validation for estimating predictive performance include the bootstrap and repeated random sampling [3, 4], for example; but for clarity in demonstrating instability of predictors, and because it is much less computer-intensive, we concentrated here on cross-validation.
Within the cross-validation framework for developing gene signatures, there are many choices, including which initial screening approach to use and which method to build the signature. Each choice may lead to different genes being included in a signature. The choices made in the initial gene selection step here are somewhat arbitrary. Our initial choice, to select genes with a statistically significant association with breast cancer survival, was a strategy to reduce the number of genes so that the potential number was of the same order of magnitude as the number of samples. This initial gene selection step is not required for K-fold cross-validation in applications where the number of variables is closer to the sample size. This screening of genes has interesting connections to the recent, independently derived, proposal of iterative sure independence screening (ISIS) for the linear model in ultra-high dimensional feature space. ISIS was motivated by the need to deal with problems of large dimensionality for which accuracy of estimation and computational cost are two concerns. For the linear model, Fan and Lv [14] proposed reducing the dimensionality from high to a moderate scale that is below the sample size, by iteratively implementing a variable screening procedure and fitting the proposed model.
The presence of censored data indicates the use of a survival model for building the signatures. Since the Cox PHM has been found to be reasonably robust under misspecification [15], is well known in medical fields, and was used by Wang et al. [9], it was chosen for developing the signatures. This is preferable to the approach used by Ein-Dor et al. [16], which was based on genes correlated with survival, as “correlation” is not statistically appropriate in this context.
In method (i), to train the signature on the training data we used a stepwise procedure. It is well known that such methods find only a fraction of the models that fit the data well and can have the undesirable effect of overfitting the data. However, efficient screening of all models is not computationally possible given the number of genes (variables). Again, the use of a particular significance level α is somewhat arbitrary. In further analyses (not reported), increasing α increased the number of genes and the lack of agreement between the signatures. The use of cross-validation helps to ascertain the extent of any overfitting, but there is as yet no known solution to finding the “optimal” model in this setting. It is good practice to use a number of approaches and to be satisfied only when substantive conclusions are unaltered by the algorithm, and basic assumptions are satisfied in so far as they can be tested. The importance of implementing the same gene selection rule/s for each of the training sets, in order to obtain an unbiased estimate of the prediction error, is stressed. Here there are two gene selection steps, and both need to be implemented at each training step. This is necessary because neither gene selection method is independent of the prediction method.
Method (ii) is novel, with its incorporation of selection based on variables that are common to all blocks using a cross-validation type of approach. This would appear to overcome the highly unstable lists of genes identified as predictors of prognosis that are found using either repeated random sampling [3] or method (i), but it needs to be further evaluated on other data sets. Note that method (ii) is comparatively simple and computationally fast. Current research concerns evaluating the performance and stability of signatures from both our cross-validation variants combined with SIS-style approaches to screening before fitting of the model/s. Since the data used for the selection of genes are not fully independent of the data used for validation, some bias may exist in estimates of performance [17], and alternative methods, including the use of external cross-validation, will be investigated.
There are many ways in which cross-validation can be developed so as to correct for selection bias [4], and we have used two. Another approach that has many similarities to our methods is to first consider a range of candidate numbers, d, of genes to be used in the signature. Then, within each of the K training blocks, perform a (K − 1)-fold cross-validation and evaluate the signature for each value of d, selecting the value that gives the best classification rate for that block. The value of d selected may vary between the training blocks; see [18] for further details. With the additional layer of cross-validation, as well as consideration of a range of values of d within each training block, this is more computer-intensive than our methods, although the associated variability may well be less. Comparative performance of these approaches is a future research project.
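The inner-loop selection of d described above can be sketched as follows (illustrative Python; `fit_and_score` is a hypothetical placeholder that would build a d-gene signature on the inner training indices and return its classification rate on the inner validation indices):

```python
def choose_signature_size(train_idx, d_values, inner_k, fit_and_score):
    """Within one outer training block, run an inner cross-validation
    for each candidate signature size d and keep the d with the best
    average classification rate."""
    def folds(idx, k):
        # Simple deterministic round-robin split of the index list.
        return [idx[i::k] for i in range(k)]

    best_d, best_rate = None, -1.0
    for d in d_values:
        rates = []
        for val in folds(train_idx, inner_k):
            val_set = set(val)
            tr = [j for j in train_idx if j not in val_set]
            rates.append(fit_and_score(tr, val, d))
        mean_rate = sum(rates) / len(rates)
        if mean_rate > best_rate:
            best_d, best_rate = d, mean_rate
    return best_d, best_rate
```

Because this loop runs inside every outer training block, the cost is roughly K × (inner folds) × (number of candidate d values) model fits, which is the extra computational burden noted above.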
It is important to distinguish between the use of statistical models for prediction and for explanation [19]. In much of the literature this distinction is blurred. Here we have been concerned with choosing signatures for prediction. Some investigators attempt to use a signature to build pathway/s for explanatory purposes but such explanations are dubious when the signatures can be very unstable.
Sample design is an important, but often overlooked, consideration in microarray studies. It has been identified as a key issue [20], and we emphasize that poor sample design cannot be overcome with increased sample size. Ideally the sample should be representative of the population on which the signature is to be used. Again resampling methods can help to highlight problems. Note that most of the theoretical work underpinning justification for methodologies assumes that the samples are representative. There are many unknowns about how nonhomogeneity in the sample affects the performance of the predictors.
This data set includes potential confounders (age, menopausal status, T stage, grade, and PR status) that may be important predictors, possibly in interaction with particular genes; they have not been considered here because the data are not publicly available. Although Wang et al. [9] showed that these and other confounding factors were not statistically significant after adjusting for their gene signature in the training set, their inclusion may be useful in predicting “new” samples. Their usefulness will arguably depend on how well these factors are “accounted for” by the genes in the signature. The “best” signatures will in the end, most probably, be developed by a combination of biological expertise and computational algorithms.
Here we have also demonstrated the usefulness of the false positive and false negative fractions for assessing the performance of signatures. Particularly for medical applications, these measures are more appropriate for assessing the relevance of test performance than the overall classification (or misclassification) rate.
A promising approach has been to apply a protein-network-based method [21] that, instead of identifying markers as individual genes, considers them as subnetworks extracted from protein interaction databases. Unfortunately the results suffer from selection bias. Two data sets were used [9, 22], although one of the data sets [22] included both node-negative and node-positive patients; just node-negative patients could have been selected for inclusion in the analyses. It is important to correct for selection bias when estimating classification accuracy from use of the subnetwork markers. As outlined above, to remove selection bias, the subnetwork features should be identified at each step of the cross-validation, not “identified using all microarray samples before classification” [21, page 4].
5. Conclusion
We demonstrate the usefulness of the 10-fold cross-validation framework for assessing performance of signatures on a well-known breast cancer data set. Within this approach we developed two different methods, one based on forward selection, the other on genes that were statistically significant in all training blocks of data.
This paper demonstrates a lack of agreement between signatures estimated using a forward selection method on randomly selected subsets of the same data. Our novel method overcomes the instability of the forward selection method. It is computationally fast and is shown to have lower test error variance and FPF.
The 10-fold cross-validation framework is a very useful and less computer-intensive method than some other resampling methods. It allows estimation of unbiased misclassification error as well as an assessment of a signature’s sensitivity and specificity. Such evaluations are of utmost importance, especially when the number of samples is small relative to the number of measurements.
The use of the pair of measurements, true positive fraction (sensitivity) and false positive fraction (1 − specificity), provides a more clinically relevant evaluation of overall performance than the single measure given by the correctly predicted fraction (or its complement).
Although the hope of finding a single diagnostic tool is laudable, it is still too early to be making any strong claims. Given the complexity of human breast tumours, we agree that it is unlikely that “robust and internationally agreed signature gene lists will be accumulated in the near future” [7].
Finally, we have as yet to fully understand the enormous quantities of data that the rapidly evolving microarray technologies are giving us. Moreover, “although the use of geneexpression profiles in clinical practice is very appealing, we should be very cautious …” [23].
Acknowledgments
Part of this research was supported by the Australian Research Council (ARC) Grant DP0343727 and by the ARC Centre of Excellence for Mathematics and Statistics of Complex Systems (MASCOS). The authors thank the reviewers for their helpful comments.
References
[1] M. Schena, Microarray Analysis, Wiley-Liss, New York, NY, USA, 2003.
[2] S. Michiels, S. Koscielny, and C. Hill, “Prediction of cancer outcome with microarrays: a multiple random validation strategy,” The Lancet, vol. 365, no. 9458, pp. 488–492, 2005.
[3] S. G. Baker and B. S. Kramer, “Identifying genes that contribute most to good classification in microarrays,” BMC Bioinformatics, vol. 7, article 407, pp. 1–7, 2006.
[4] C. Ambroise and G. J. McLachlan, “Selection bias in gene extraction on the basis of microarray gene-expression data,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 10, pp. 6562–6566, 2002.
[5] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics, Springer, New York, NY, USA, 2001.
[6] R. Simon, “Diagnostic and prognostic prediction using gene expression profiles in high-dimensional microarray data,” British Journal of Cancer, vol. 89, no. 9, pp. 1599–1604, 2003.
[7] T.-K. Jenssen and E. Hovig, “Gene-expression profiling in breast cancer,” The Lancet, vol. 365, no. 9460, pp. 634–635, 2005.
[8] R. Tibshirani, “Immune signatures in follicular lymphoma,” The New England Journal of Medicine, vol. 352, no. 14, pp. 1496–1497, 2005.
[9] Y. Wang, J. G. M. Klijn, Y. Zhang et al., “Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer,” The Lancet, vol. 365, no. 9460, pp. 671–679, 2005.
[10] C. J. Scarlett, R. C. Smith, A. Saxby et al., “Proteomic classification of pancreatic adenocarcinoma tissue using protein chip technology,” Gastroenterology, vol. 130, no. 6, pp. 1670–1678, 2006.
[11] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, 2005.
[12] Gene Expression Omnibus, http://www.ncbi.nlm.nih.gov/geo.
[13] J. Downward, “Cancer biology: signatures guide drug choice,” Nature, vol. 439, no. 7074, pp. 274–275, 2006.
[14] J. Fan and J. Lv, “Sure independence screening for ultrahigh dimensional feature space,” Journal of the Royal Statistical Society B, vol. 70, no. 5, pp. 849–911, 2008.
[15] J. M. Neuhaus, “Misspecification,” in Encyclopedia of Biostatistics, P. Armitage and T. Colton, Eds., vol. 5, pp. 3305–3307, John Wiley & Sons, Chichester, UK, 2004.
[16] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany, “Outcome signature genes in breast cancer: is there a unique set?” Bioinformatics, vol. 21, no. 2, pp. 171–178, 2005.
[17] I. A. Wood, P. M. Visscher, and K. L. Mengersen, “Classification based upon gene expression data: bias and precision of error rates,” Bioinformatics, vol. 23, no. 11, pp. 1363–1370, 2007.
[18] G. J. McLachlan, J. Chevelu, and J. Zhu, “Correcting for selection bias via cross-validation in the classification of microarray data,” in Beyond Parametrics in Interdisciplinary Research: A Festschrift to P. K. Sen, N. Balakrishnan, E. Pena, and M. J. Silvapulle, Eds., IMS Lecture Notes–Monograph Series, pp. 383–395, Institute of Mathematical Statistics, Hayward, Calif, USA, 2007.
[19] B. Ripley, “Computer intensive methods,” in Encyclopedia of Biostatistics, P. Armitage and T. Colton, Eds., vol. 2, pp. 1071–1075, John Wiley & Sons, Chichester, UK, 2004.
[20] E. Biganzoli, N. Lama, F. Ambrogi, L. Antolini, and P. Boracchi, “Prediction of cancer outcome with microarrays,” The Lancet, vol. 365, no. 9472, pp. 1683–1686, 2005.
[21] H.-Y. Chuang, E. Lee, Y.-T. Liu, D. Lee, and T. Ideker, “Network-based classification of breast cancer metastasis,” Molecular Systems Biology, vol. 3, article 140, pp. 1–10, 2007.
[22] M. J. van de Vijver, Y. D. He, L. J. van ’t Veer et al., “A gene-expression signature as a predictor of survival in breast cancer,” The New England Journal of Medicine, vol. 347, no. 25, pp. 1999–2009, 2002.
[23] A. Abdullah-Sayani, J. M. Bueno-de-Mesquita, and M. J. van de Vijver, “Technology insight: tuning into the genetic orchestra using microarrays—limitations of DNA microarrays in clinical practice,” Nature Clinical Practice Oncology, vol. 3, no. 9, pp. 501–516, 2006.
Copyright
Copyright © 2009 Yvonne E. Pittelkow and Susan R. Wilson. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.