Abstract

Accurate differential analysis of microarray data strongly depends on effective treatment of intergene correlation. Such dependence is ordinarily accounted for in terms of its effect on significance cutoffs. In this paper, it is shown that correlation can, in fact, be exploited to share information across tests and reorder expression differentials for increased statistical power, regardless of the threshold. Significantly improved differential analysis is the result of two simple measures: (i) adjusting test statistics to exploit information from identifiable genes (the large subset of genes represented on a microarray that can be classified a priori as nondifferential with very high confidence), and (ii) doing so in a way that accounts for linear dependencies among identifiable and nonidentifiable genes. A method is developed that builds upon the widely used two-sample t-statistic approach and uses analysis in Hilbert space to decompose the nonidentified gene vector into two components that are correlated and uncorrelated with the identified set. In the application to data derived from a widely studied prostate cancer database, the proposed method outperforms some of the most highly regarded approaches published to date. Algorithms in MATLAB and in R are available for public download.

1. Preamble

In certain ways, this paper represents a departure from current trends in scientific publishing. The Worldwide Web has made available extraordinary resources in the form of databases for comparative analysis of methods in bioinformatics and numerous other disciplines. The benefits of using common sets of real data to compare and contrast new algorithms are obvious. In some fields of investigation, especially, perhaps, research in early stages of knowledge (e.g., genomics), there is an equally obvious drawback in using real data: the "correct answers" are not known, making it difficult to interpret differences in performance as anything more than differences.

Lest the reader be preparing for an argument promoting classic simulation studies, we hasten to state at the outset that this argument is not forthcoming. Before the age of the internet, simulation studies using reasonably justified data models (Gaussian errors, etc.) were a time-honored standard in all areas of math, science, and engineering. The ready availability of rich data resources makes it irrational to advocate a return to "pure simulation" using models that are untested against these existing data sets. The authors of this paper in no way promote a return to such methods and appeal to the reader to recognize that "models detached from reality" are not used in any way in this paper.

This research centers on some rather straightforward adjustments to classic hypothesis-testing procedures for use in differential analysis of microarray expression data. The salient result is a “reranking” of the order in which gene expression data are considered for “truly differential” status. In an effort to objectively compare these modified methods with the performance of established algorithms, it was decided to create a database of microarray expression simulations in which the data retained as much second-order statistical character as possible relative to a widely used prostate cancer database. The up- and downregulations of gene expressions were indeed synthetically modulated onto a carefully constructed baseline. The consequence is a set of comparative performance results that are objectively based on data that were “guided by nature” but which were, quite openly stated, synthesized with this natural guidance. Rather than dismissing these results as “simulations,” the reader is urged to consider whether there is merit in moving the uncertainty to the data generation side (if that uncertainty can be intelligently controlled), if it permits objective results on the data analysis side. The current modus operandi is to accept uncertainty in the performance results with the benefit of authenticity in the data generation. The authors hope that this question might engender some debate and research.

Using the testing approach employing "nature-guided simulation," the results for the reranking method presented here are remarkably good relative to two established methods developed by respected authorities. Many people have reviewed these results, including some very eminent statisticians and bioinformaticians. Reactions have ranged from encouragement and amazement to deep skepticism. The comment "too good to be true" has been used at least twice, including, once, as constructive criticism by a distinguished editor of this journal. We understand this response: unfortunately (or fortunately, depending on one's perspective), no one has been able to find a flaw in the methods. We suggest two possibilities: (i) some aspect of the simulation procedure is creating a bias toward the developed detection method, or (ii) authors with a somewhat different perspective (two signal processing engineers and a cancer researcher, with over a century of combined research experience, but without extensive work in bioinformatics) were able to see some relatively simple algorithmic adjustments that eluded researchers focusing on deeper issues.

Rather than viewing the results of this paper as a claim of superiority of the new method over respected algorithms, the authors appeal to the reader to accept the report in the spirit it is offered: interesting, potentially useful results that raise many questions and possibilities for further research. The "intelligent simulation" approach in itself may offer some grounds for innovation in the field. Attempts to verify that the present results are, indeed, "too good to be true" may reveal technical information benefiting differential expression detection methods. This is the classical way in which research moves forward. We are grateful to the ISRN Journal of Bioinformatics for the opportunity to bring these ideas to the attention of the research community.

2. Introduction

The DNA microarray was initially touted as a tool that would revolutionize the understanding of complex diseases and usher in an era of personalized medicine. This optimism is on display in Lander's 1999 Nature Genetics article entitled "Array of hope" [1]. It is not unusual, however, for near-term impacts of emerging technologies to be overestimated when first deployed, then to have the expectations moderated as the technologies reveal new complexities in the problems they are designed to solve. Over the past decade, early optimism about the microarray has given way to a pragmatic understanding of challenges and the need for further research and development. This normal course of events led to Frantz's 2005 article in Nature Reviews Drug Discovery entitled "An array of problems" [2]. The study of microarray data has shown the need for exceeding care in the design and regularization of experiments and in the data collection and preprocessing, but the biggest hindrance to progress has been the lack of definitive methods for interpretation of microarray results.

One of the main challenges to proper analysis is the presence of significant correlation among gene expressions manifest in the microarray results [3, 4]. One measurable indication of the uncertainty caused by correlated differential-expression tests is the resultant increase in the variance of the false discovery rate (FDR) [3]. Linear statistical dependence among gene expressions, therefore, can be quantifiably linked to higher-risk detection algorithms for the discovery of active genes. Among many causes, intergene correlation is attributable to coexpression of genes [5] and to unmodeled factors that introduce systematic effects across genes [6, 7]. As a result, for most real data, the assumption of independence or weak dependence among gene expressions is unfounded, and methods treating correlation are necessary [8, 9]. In fact, a few investigators have even questioned the adequacy of accounting for correlation alone and have examined the implications of nonlinear dependence on the discovery of genes [10–12].

Correctly detecting differentially expressed genes—or the related task of estimating the FDR—in the presence of substantial intergene correlation is a challenging problem that has received much recent attention since reported in papers by Owen, Efron, and others (e.g., [3, 4, 9, 13]). For example, Storey et al. [14] present an approach to the notion of sharing information across scores, which they describe as "borrowing strength across the tests" for a potential increase in statistical power. Tibshirani and Wasserman [15] discuss a quantity called the "correlation-shared" t-statistic and derive theoretical bounds on its performance. Hu et al. [16] examine the covariance structure of the expression data and discover benefits of linking coexpression and differential expression in a distance measure—reflecting the more recent interest in characterizing broader statistical patterns in microarray data.

Recent research that is jointly concerned with differential expression and coexpression has also yielded results and methods that could ultimately benefit the gene discovery problem. Because the differential coexpression research is often concerned with differing phenotypes, rather than with different treatment conditions, two given research efforts involving differential coexpression might seek answers to different sets of genetic questions through expression data. Like the "treatment conditions researchers," however, the "phenotype" researchers have encountered their own forms of confounding dependencies, notably the relative gene locations, the expression time sequencing, and phase information (e.g., [17–19]). Papers have been published addressing these issues, including the exposition of new statistical approaches—for example, "CorScor" [20], the "ECF-statistic" [21], gene-set coexpression analysis [22], fuzzy expression level assignments [23], expression-profile mining with decorrelation [24], and detection of microarray outliers [25]—as well as new clustering methods—for example, a web-based expression analyzer [26], high-order preclustering methods [27], and the "BioSym" distance measure [28]. A recent review of clustering methods in genomics appears in the paper by Dalton et al. [29]. A more general examination of the performance of classifiers of microarray expressions appears in the paper by Ancona et al. [30].

This paper is focused exclusively on the differential expression problem. Research in this area has largely focused on understanding harmful correlation effects on the choice of the threshold demarcating the statistical boundary between differential and nondifferential expression. In fact, however, the nominally confounding correlation can be used to advantage in increasing the statistical power of microarray studies. This paper presents a differential analysis method that exploits identifiability and uses a gene expression reranking criterion that accounts for intergene correlation. The framework is readily generalizable for use in studies with multiple or continuous covariates, as well as to other multiple comparison applications. An example method presented here builds upon the widely used two-sample t-statistic approach with a decomposition of the expression vectors into subspaces of correlated and uncorrelated components.

3. Problem Formulation

Suppose that expression data for $G$ genes are measured on $n$ microarrays, resulting in a $G \times n$ gene expression matrix, say $X$, with element $x_{gi}$ giving the expression of gene $g$ on microarray $i$. Each of the $n$ microarray experiments takes place under one of two conditions (indexed by $c = 1, 2$) such as control and treatment. These two subsets of the data are called treatment groups in the paper.

Based on the gene expression matrix, $X$, we seek to identify a "small" number, $R$, of genes that are significantly differentially expressed between the two groups. One widely used strategy (e.g., [31, 32]) is to hypothesize that each gene is not differentially expressed. We refer to this as the null hypothesis, denoted $H_0$. For convenience, we use the shorthand notation $g \in H_0$ to indicate that the null hypothesis is known to be true for gene $g$, and, conversely, $g \notin H_0$ indicates that gene $g$ does not satisfy $H_0$. Gene $g$ is tested against $H_0$ using a two-sample t-statistic, say, $t_g$. The magnitudes of the statistics $t_1, t_2, \ldots, t_G$ establish a gene ranking, and the $R$ genes with the largest t-scores are reported as statistically significant discoveries. The investigator can either supply a value for $R$ or rely on an estimation of the number of false discoveries (type I errors, false positives), say $V$, or, equivalently, the FDR, defined as $V/R$, to find a maximal $R$ with the allowable $V$ or FDR (e.g., [3, 5, 9, 13, 33–36]).

As discussed in [4, 11], for an "overpowered" matrix $X$, there may be significantly fewer tail-area null counts than expected, whereas for an "underpowered" $X$, the situation can worsen with an excessive number of tail-area null counts. It is important to note that techniques for estimating the FDR change the numbers of genes in reported lists, not the gene rankings. The present research was motivated by the hypothesis that, for an "underpowered" $X$, it would be possible to exploit correlation across t-scores to establish a gene ranking with more statistical power than the raw t-based ranking. The method that resulted from an exploration of this question indeed seems to improve the statistical power of all matrices.

The new method uses a vector of t-statistics, $\mathbf{t} = (t_1, \ldots, t_G)^T$, and an estimate of the covariance matrix of the vector $\mathbf{t}$, to output a substantially revised version of $\mathbf{t}$, denoted $\mathbf{t}^*$, the entries of which provide an improved gene ranking. For expediency, we will refer to the procedure that produces $\mathbf{t}^*$ from $\mathbf{t}$ as correlation-adjusted reranking, or simply reranking. The essence of reranking is embodied in some fundamental data-conditioning procedures to effect the two outcomes mentioned in the introduction: exploiting identifiability and nullifying the effects of intergene correlation between identified and nonidentified genes.

In the following sections, we develop the theoretical basis for the reranking process. The performance of reranking is then compared with that of state-of-the-art methods on data derived from real expression experiments. Results of some judiciously developed simulation studies are also reported for their value in understanding certain aspects of the performance.

4. Methods

4.1. Per Gene Summary Statistic

Let us first supply the details surrounding the vector of statistics $\mathbf{t}$ introduced above. $\mathbf{t}$ will be viewed as a random vector with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. Henceforth, we write $\mathbf{t} \sim (\boldsymbol{\mu}, \boldsymbol{\Sigma})$. This is a theoretical model only, as $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are generally unknown. No further distribution information is required.

Let $\bar{d}_g$ denote the average differential expression level for gene $g$, $\bar{d}_g = \bar{x}_g^{(2)} - \bar{x}_g^{(1)}$. Further, let $\bar{x}_g^{(c)}$ be the average expression level for gene $g$ in treatment condition $c$. Then, the (unpaired) t-statistic for gene $g$ is computed as

$$t_g = \frac{\bar{d}_g}{s_g \sqrt{1/n_1 + 1/n_2}}, \qquad (3)$$

where $s_g$ is the pooled within-group sample standard deviation of gene $g$ and $n_1, n_2$ are the group sizes. If $g \in H_0$, then we expect $t_g \sim t_{\nu}$, where $\nu$ is the number of degrees of freedom, obtained from either the unpaired t-test theory or the permutation null calculations [4]. Otherwise, we expect $t_g \sim (\mu_g, \Sigma_{gg})$, with $\mu_g$ and $\Sigma_{gg}$ denoting the $g$th element of $\boldsymbol{\mu}$ and the $g$th diagonal element of $\boldsymbol{\Sigma}$, respectively. For $g \notin H_0$, $\mu_g$ and $\Sigma_{gg}$ depend on the amount of up- or downregulation of the gene expression, the number of samples in each treatment group, and the number of degrees of freedom, $\nu$.
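
For concreteness, the per-gene statistic can be computed as in the following R sketch. The function name and the exact scaling convention for the pooled standard deviation are illustrative assumptions, not the authors' released code; the expression matrix X is taken to have genes in rows and arrays in columns, with a label vector giving the treatment group of each column.

    # Per-gene unpaired two-sample t-statistics (illustrative sketch).
    two_sample_t <- function(X, labels) {
      # X: G x n expression matrix (rows = genes); labels: vector of 1s and 2s
      X1 <- X[, labels == 1, drop = FALSE]
      X2 <- X[, labels == 2, drop = FALSE]
      n1 <- ncol(X1); n2 <- ncol(X2)
      m1 <- rowMeans(X1); m2 <- rowMeans(X2)
      v1 <- apply(X1, 1, var); v2 <- apply(X2, 1, var)
      sp <- sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))  # pooled SD
      (m2 - m1) / (sp * sqrt(1 / n1 + 1 / n2))
    }
    # Ranking by |t| then reports the R genes with the largest scores:
    # tstat <- two_sample_t(X, labels); top <- order(abs(tstat), decreasing = TRUE)[1:R]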

4.2. Invoking Identifiability: The Zero Assumption

The direct use of t-scores for ranking neglects some important information that is inherent in the microarray matrix $X$. Fundamentally, the use of raw t-scores does not exploit identifiability, the strongly justified assumption that certain genes almost surely satisfy $H_0$. Formal backing for the creation of an identifiable set is found in Efron's zero assumption (ZA) [5], which states that a fraction, say $\zeta$, of the genes—those with the smallest statistics $|t_g|$—satisfies $H_0$. The ZA plays a central role in the literature on estimating the proportion of null genes, as in [13, 37]. The ZA is equally crucial for the two-group model approach developed in the Bayesian microarray literature, as in [38–40]. Furthermore, the assertion that the method developed by Storey [35] improves upon the well-known Benjamini-Hochberg FDR procedure [33] (in terms of statistical power) crucially relies on an adaptive version of the ZA.

The use of the ZA is justified in the reranking procedure as long as the parameter $\zeta$ is sufficiently small. For example, we set $\zeta \approx 0.5$ in the experiments below based on the almost certain knowledge that ~50% of the genes in a cell would not be differentially expressed in the formation of prostate cancer [5]. Accordingly, in the initial step of the reranking process, we invoke the ZA to partition the $G$ genes into an identified set of $D$ elements assumed to satisfy $H_0$ and a candidate set of $K = G - D$ genes, so-named because they remain "candidates" to become "discovered" genes (i.e., to be among the $R$ genes declared to be differentially expressed). For convenience and without any significant loss of generality (especially for large $G$), we will assume that if the fraction, $\zeta$, rather than the cardinality, $D$, is used to specify the size of the identified set, then $\zeta$ is selected so that $\zeta G$ is an integer. That integer is therefore $D = \zeta G$, and $K = G - D$.
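
A minimal R sketch of this partitioning step follows; the function name and the default zeta = 0.5 mirror the discussion above but are otherwise illustrative, and the floor() call simply sidesteps the integer-valued assumption on zeta*G.

    # Partition genes by the zero assumption (illustrative sketch).
    za_partition <- function(t_scores, zeta = 0.5) {
      G <- length(t_scores)
      D <- floor(zeta * G)                 # size of the identified set
      ord <- order(abs(t_scores))          # ascending |t|
      list(identified = ord[1:D],          # assumed null under the ZA
           candidate  = ord[(D + 1):G])    # remain candidates for discovery
    }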

A simple corollary to the ZA is that the value $t_g$ for each gene $g$ in the identified set represents a "noise" value in the t-score for that gene. This is because the expected value of this statistic is zero whenever $g \in H_0$. Accordingly, we can view the value $t_g$ for an identified gene as a random variation around the nominal value of zero differential expression. The reranking procedure exploits this information to adjust the values of the candidate gene statistics. This "adjustment" is a consequence of decorrelating candidate expression values from those in the identified set. That these two subsets would be correlated may, at first, seem counterintuitive because the "interesting" differentially expressed genes must, by definition, come from the candidate set. Would it not be the case, therefore, that the significant intergene correlation would occur among candidate expression values that similarly respond to the change in treatment condition? Whereas two coexpressed genes would likely have correlated expression differentials, it is not this dependence that potentially causes false discoveries. To the extent that the correlation between genes is a reflection of response to treatment conditions, the correlation is expected and informative. Correlation between truly expressed genes and unexpressed genes (possibly identified) reflects normal variations in expression unrelated to treatment condition. Correcting for such correlation (whether its effect is to increase or decrease the expression level) is important to the proper assessment of candidate t-scores. Moreover, the existence of the identified set, with its "noise only" interpretation of t-scores, makes this correction possible; the ultimate result is a set of reranked expression statistics.

To proceed, we must add some formality to the descriptions of the identified and candidate sets. Without loss of generality, we may assume that the complete set of genes is indexed so that $|t_1| \le |t_2| \le \cdots \le |t_D| \le |t_{D+1}| \le \cdots \le |t_G|$, in which $D$ is the number of genes declared null under the ZA. Then, genes with indices $1, \ldots, D$ are assumed null and therefore comprise the identified set as defined above. Let us partition the set of statistics into those corresponding to genes declared null under the ZA, $\mathbf{t}_1 = (t_1, \ldots, t_D)^T$, and those for the remaining genes that continue to compete for the nonnull designation (the candidate set), $\mathbf{t}_2 = (t_{D+1}, \ldots, t_G)^T$. For convenience, express the vector $\mathbf{t}$ in terms of these two partitions:

$$\mathbf{t} = \begin{bmatrix} \mathbf{t}_1 \\ \mathbf{t}_2 \end{bmatrix}. \qquad (4)$$

The random vector $\mathbf{t}$ has the following moments:

$$E[\mathbf{t}] = \begin{bmatrix} \mathbf{0} \\ \boldsymbol{\mu}_2 \end{bmatrix}, \qquad \operatorname{cov}(\mathbf{t}) = \boldsymbol{\Sigma} = \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{bmatrix}, \qquad (5)$$

in which $\boldsymbol{\Sigma}_{21} = \boldsymbol{\Sigma}_{12}^T$ and $\boldsymbol{\Sigma}_{11}$, $\boldsymbol{\Sigma}_{22}$ are the covariance matrices of $\mathbf{t}_1$ and $\mathbf{t}_2$, respectively. Recall that the goal of the reranking process is to find a vector $\mathbf{t}^*$ the elements of which represent a reordering of the elements of $\mathbf{t}$, such that the gene ranking represented by $\mathbf{t}^*$ has better statistical power for detecting nonnull genes than that based on $\mathbf{t}$ itself. In the light of the newly defined notation in (5), we can more specifically say that $\mathbf{t}_2^*$ represents a reevaluation and reordering of the elements in $\mathbf{t}_2$. The remaining elements of the revised vector, comprising the vector partition, say "$\mathbf{t}_1^*$," are effectively set to zero since these genes are assumed to represent null genes. Recall that the t-scores in the vector $\mathbf{t}_1$ (before processing) are assumed to represent noise variations around the nominal zero differential value for a null gene.

4.3. Theoretical Estimator of $\mathbf{t}_2^*$

Conditioned upon the random vector $\mathbf{t}_1$, we seek a revised vector, $\mathbf{t}_2^*$, in which the expression statistics for the candidate gene set (originally $\mathbf{t}_2$) are uncorrelated with those of the identified set (originally $\mathbf{t}_1$). There are many ways to derive the desired result, each with its own interpretation, but all, of course, ultimately equivalent. Whatever the approach, it is expedient to remove the mean from the vector $\mathbf{t}_2$ and work with the centered vector $\check{\mathbf{t}}_2 = \mathbf{t}_2 - \boldsymbol{\mu}_2$. The centered counterpart to the reranking vector, $\mathbf{t}_2^*$, will be denoted $\check{\mathbf{t}}_2^*$. The constant vector $\boldsymbol{\mu}_2$ will be returned to the result at the end. Recall that the mean of the vector $\mathbf{t}_1$ is $\mathbf{0}$, so that "centering" is unnecessary for $\mathbf{t}_1$ (throughout the paper, the notation $\mathbf{0}$ denotes the zero matrix (vector) of the appropriate dimensions).

To derive the desired expression for $\check{\mathbf{t}}_2^*$, we adopt a simple approach based on well-known ideas from the theory of linear operators (e.g., [41–44]). Let us view the space of random vectors in $\mathbb{R}^K$ (we are assuming $D \le K$, that is, the identified set is no larger than the candidate set; if the converse is true, simply reverse the roles of the $\mathbf{t}_1$ and $\check{\mathbf{t}}_2$ spaces in this development) as a Hilbert space, $\mathcal{H}$, with inner product $\langle \mathbf{u}, \mathbf{v} \rangle = E[\mathbf{u}^T \mathbf{v}]$ for all $\mathbf{u}, \mathbf{v} \in \mathcal{H}$. The inner product induces a norm on $\mathcal{H}$ given by $\|\mathbf{u}\| = \langle \mathbf{u}, \mathbf{u} \rangle^{1/2}$ and, in turn, a metric

$$d(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\| = \left( E\left[ (\mathbf{u} - \mathbf{v})^T (\mathbf{u} - \mathbf{v}) \right] \right)^{1/2}. \qquad (6)$$

Now let $\mathcal{S}$ be a closed subspace of $\mathcal{H}$. Given some $\mathbf{u} \in \mathcal{H}$, we wish to find the closest element of $\mathcal{S}$, say $\hat{\mathbf{u}}$, to $\mathbf{u}$, in the sense that

$$\hat{\mathbf{u}} = \arg \min_{\mathbf{s} \in \mathcal{S}} d(\mathbf{u}, \mathbf{s}),$$

in which $d$ is the metric in (6). The Hilbert projection theorem states that the solution exists and is unique. Moreover, $\hat{\mathbf{u}}$ has the property that the vector difference between $\mathbf{u}$ and $\hat{\mathbf{u}}$ is orthogonal to all vectors in the subspace $\mathcal{S}$; that is, $\mathbf{u} - \hat{\mathbf{u}} \in \mathcal{S}^{\perp}$, in which $\mathcal{S}^{\perp}$ is the orthogonal complement to subspace $\mathcal{S}$ in $\mathcal{H}$. In particular, the random vector $\mathbf{u} - \hat{\mathbf{u}}$ is orthogonal to $\hat{\mathbf{u}}$, which means that $\langle \mathbf{u} - \hat{\mathbf{u}}, \hat{\mathbf{u}} \rangle = 0$ by definition. Thus, $E[(\mathbf{u} - \hat{\mathbf{u}})^T \hat{\mathbf{u}}] = 0$, implying that $\mathbf{u} - \hat{\mathbf{u}}$ and $\hat{\mathbf{u}}$ are stochastically orthogonal (uncorrelated if zero mean).

The Hilbert-space formulation provides the structure in which to achieve the desired decomposition of the random vector $\check{\mathbf{t}}_2$ into components that are correlated, and uncorrelated, with $\mathbf{t}_1$. Vector $\check{\mathbf{t}}_2$ resides in the space $\mathcal{H}$. We seek a vector in a $D$-dimensional subspace of $\mathcal{H}$ (because $D$ is the dimension of $\mathbf{t}_1$) that is close to $\check{\mathbf{t}}_2$ (this will be the component of $\check{\mathbf{t}}_2$ that is correlated with $\mathbf{t}_1$). The problem at hand is only subtly different from the one described in the generalities above. The difference is that we are not given a subspace "$\mathcal{S}$" in which to find an optimal vector; rather, we are given only a vector $\mathbf{t}_1$ which may reside in an uncountable number of subspaces of dimension $D$. The goal will be to perform a linear operation on the given vector $\mathbf{t}_1$ to "make it close" to $\check{\mathbf{t}}_2$. In the process, we will inherently construct the subspace in which the optimal result resides. The subspace and resulting vector will be of dimension $D$ because they are merely different representations of $\mathbf{t}_1$ and its implicit initial subspace.

Given $\mathbf{t}_1$, let us conceptualize a $D$-dimensional subspace of $\mathcal{H}$, $\mathcal{S}_A$, consisting of all the (random) vectors $A\mathbf{t}_1$, where $A$ represents some (yet unknown) linear operator of dimensions $K \times D$ and of rank $D$. According to the Hilbert projection theorem, for a fixed $\check{\mathbf{t}}_2$, there is a unique vector $A\mathbf{t}_1 \in \mathcal{S}_A$, hence a unique linear operator, $A$, such that

$$A\mathbf{t}_1 = \arg \min_{\mathbf{s} \in \mathcal{S}_A} d(\check{\mathbf{t}}_2, \mathbf{s}).$$

Since the metric is nonnegative, we may equivalently seek the $A$ that minimizes

$$J(A) = E\left[ (\check{\mathbf{t}}_2 - A\mathbf{t}_1)^T (\check{\mathbf{t}}_2 - A\mathbf{t}_1) \right].$$

Now, interpreting the expectation to be a mean-square average, we compute the gradient of $J$ with respect to $A$ (e.g., see [45]) and set the result to $\mathbf{0}$:

$$\nabla_A J = -2\, E\left[ (\check{\mathbf{t}}_2 - A\mathbf{t}_1)\,\mathbf{t}_1^T \right] = \mathbf{0},$$

the solution to which is

$$A = E\left[ \check{\mathbf{t}}_2 \mathbf{t}_1^T \right] \left( E\left[ \mathbf{t}_1 \mathbf{t}_1^T \right] \right)^{-1}.$$

Recognizing that $E[\mathbf{t}_1] = \mathbf{0}$ (recall (5)) and $E[\check{\mathbf{t}}_2] = \mathbf{0}$, the solution becomes

$$A = \boldsymbol{\Sigma}_{21} \boldsymbol{\Sigma}_{11}^{-1}.$$

Now the random vector $A\mathbf{t}_1 = \boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\mathbf{t}_1$ is as strongly linearly related to (correlated with) $\check{\mathbf{t}}_2$ as is possible in the $D$-dimensional subspace of $\mathcal{H}$ that is spanned by the columns of operator $A$. The correlation matrix relating $A\mathbf{t}_1$ and $\check{\mathbf{t}}_2$ is

$$E\left[ A\mathbf{t}_1 \check{\mathbf{t}}_2^T \right] = \boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}.$$

Although this matrix is of dimension $K \times K$, it is clearly singular (rank $D$), reflecting the inability of $A\mathbf{t}_1$ to be linearly related to any component of $\check{\mathbf{t}}_2$ in the subspace $\mathcal{S}_A^{\perp}$. In fact, $A\mathbf{t}_1$ is the component of $\check{\mathbf{t}}_2$ that embodies the potentially destructive correlation of the identified genes with the candidate genes. The decorrelated version of $\check{\mathbf{t}}_2$ that we seek is therefore

$$\check{\mathbf{t}}_2^* = \check{\mathbf{t}}_2 - A\mathbf{t}_1 = \check{\mathbf{t}}_2 - \boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\mathbf{t}_1.$$

Note that $\check{\mathbf{t}}_2^*$ is precisely the difference vector that is guaranteed by the Hilbert space theory to reside in $\mathcal{S}_A^{\perp}$ and therefore to be orthogonal to, hence uncorrelated with, every linear transformation of $\mathbf{t}_1$ (in particular, $A\mathbf{t}_1$). Finally, reinserting the mean value, $\boldsymbol{\mu}_2$, of the candidate vector, we have

$$\mathbf{t}_2^* = \boldsymbol{\mu}_2 + \check{\mathbf{t}}_2^* = \mathbf{t}_2 - \boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\mathbf{t}_1. \qquad (15)$$
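
The orthogonality property can be checked numerically. The following R sketch uses a small synthetic covariance (purely illustrative values, not the paper's data), forms the decorrelated vector with the true covariance partitions, and verifies that its sample cross-covariance with the identified vector is near zero.

    # Numerical check of the decorrelation identity (illustrative only).
    set.seed(1)
    D <- 3; K <- 4; N <- 200000                # dimensions and Monte Carlo draws
    L <- matrix(rnorm((D + K)^2), D + K)       # random factor for a PD covariance
    Sigma <- crossprod(L) + diag(D + K)        # full covariance of (t1, t2)
    Z  <- matrix(rnorm(N * (D + K)), N) %*% chol(Sigma)   # zero-mean draws
    t1 <- Z[, 1:D]; t2c <- Z[, (D + 1):(D + K)]           # t1 and centered t2
    S11 <- Sigma[1:D, 1:D]
    S21 <- Sigma[(D + 1):(D + K), 1:D]
    A <- S21 %*% solve(S11)                    # A = Sigma21 Sigma11^{-1}
    t2star <- t2c - t1 %*% t(A)                # decorrelated candidate vectors
    max(abs(cov(t2star, t1)))                  # ~0 up to Monte Carlo error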

4.4. Sample Estimator of $\mathbf{t}_2^*$

Estimates of the elements of $\boldsymbol{\Sigma}_{11}$ and $\boldsymbol{\Sigma}_{21}$ are required. For this purpose, we make several observations that are easily verified through simulation. Note that (15) does not require the covariance between $t_g$ and $t_{g'}$ when $g \notin H_0$ and $g' \notin H_0$.

Let $\sigma_{uv}$ and $\rho_{uv}$ denote the scalar covariance and correlation, respectively, between real, scalar random variables $u$ and $v$:

$$\sigma_{uv} = E\left[ (u - E[u])(v - E[v]) \right], \qquad \rho_{uv} = \frac{\sigma_{uv}}{\sqrt{\sigma_{uu}\,\sigma_{vv}}}.$$

In these terms, we make the following observations.

Observation 1. If $g, g' \in H_0$, then [3, 4]

$$\sigma_{t_g t_{g'}} \approx \sigma_t^2\, \rho_{x_g x_{g'}}, \qquad (17)$$

where $\sigma_t^2$ denotes the common variance of the null t-statistics and $\rho_{x_g x_{g'}}$ the within-group intergene correlation of the expression levels of genes $g$ and $g'$.

Observation 2. If $g \in H_0$ and $g' \notin H_0$ (or conversely), then

$$\sigma_{t_g t_{g'}} \approx \sigma_t^2\, \bar{\rho}_{x_g x_{g'}}, \qquad (18)$$

where $\bar{\rho}_{x_g x_{g'}}$ is a sample-size-weighted average of the within-group correlations $\rho^{(1)}_{x_g x_{g'}}$ and $\rho^{(2)}_{x_g x_{g'}}$ in the two treatment groups. Equation (18) accommodates the possibility that the correlation between a null and a nonnull gene may change between treatment groups. If this does not occur, then (18) reduces to (17).

Observation 3. Furthermore, if $n_1 \approx n_2$ (true for most microarray studies), then (18) simplifies to

$$\sigma_{t_g t_{g'}} \approx \sigma_t^2\, \frac{\rho^{(1)}_{x_g x_{g'}} + \rho^{(2)}_{x_g x_{g'}}}{2}. \qquad (19)$$

Equations (17) and (19) suggest that we may use the sample covariance to estimate $\sigma_{t_g t_{g'}}$:

$$\hat{\sigma}_{t_g t_{g'}} = \frac{1}{n} \sum_{i=1}^{n} \tilde{x}_{gi}\, \tilde{x}_{g'i}, \qquad (20)$$

where $\tilde{x}_{gi}$ denotes the expression level of the $g$th gene measured on the $i$th microarray after subtracting the average response within the treatment group to which microarray $i$ belongs ($c = 1$ or $2$). The scale factor cancels when the terms $\hat{\boldsymbol{\Sigma}}_{21}$ and $\hat{\boldsymbol{\Sigma}}_{11}^{-1}$ are multiplied in (15), so that estimating $\sigma_t^2$ is not required.
In the light of (20), (15) takes the practical form

$$\hat{\mathbf{t}}_2^* = \mathbf{t}_2 - \hat{P}_{21}\hat{P}_{11}^{-1}\mathbf{t}_1, \qquad (21)$$

where $\hat{P}_{21}$ and $\hat{P}_{11}$ are the partitions (similarly to (5)) of $\hat{P}$, the sample correlation matrix of the gene expression matrix (after removing the treatment effects). For notational compactness, we have defined $\hat{\mathbf{t}}_2^*$ to denote this sample-based estimate. In principle, the set of elements in the vector $\mathbf{t}_2^*$ of (15) embodies the gene-expression reranking in light of the compensatory measures taken to incorporate identifiability and to remove the effects of correlation. In practice, we rely on the estimate $\hat{\mathbf{t}}_2^*$ of $\mathbf{t}_2^*$ in (21).
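
In code, the practical form (21) amounts to one regularized linear solve. The R sketch below assumes the correlation partitions and t-score partitions have already been formed; the ridge constant eps anticipates the diagonal augmentation described in Section 5.2, and its default value here is an illustrative placeholder rather than the paper's setting.

    # Sample-estimator form of (21), applied to explicit partitions (sketch).
    # P11: D x D correlation block of the identified genes; P21: K x D cross
    # block; t1, t2: identified and candidate t-scores.
    apply_eq21 <- function(t1, t2, P11, P21, eps = 1e-3) {
      P11r <- P11 + eps * diag(nrow(P11))     # ridge term; see Section 5.2
      t2 - drop(P21 %*% solve(P11r, t1))      # t2* = t2 - P21 P11^{-1} t1
    }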

5. Implementation

5.1. Gene Reranking Algorithm

A stepwise procedure for the gene reranking is given in Algorithm 1. The process begins by reindexing the genes based on their two-sample t-statistics (equation (3)). Then, based on the ZA, the first $D$ genes are declared to satisfy $H_0$. In the experiments reported below, $\zeta$, the fraction of genes declared to be identifiable, is set to 0.5 by default (the precise value is adjusted slightly so that $\zeta G$ is an integer for the database used, which has $G = 12625$ total genes; see the second paragraph of Section 4.2 for an explanation). Although the choice is somewhat arbitrary, this fraction is clearly justifiable, and it has worked well empirically in the data sets tested.

Input: labeled gene expression matrix $X$;
       desired size $R$ of differential gene list.
Steps:
(1) Calculate two-sample (unpaired) t-statistics $t_g$ as in equation (3).
(2) Re-index genes such that $|t_1| \le |t_2| \le \cdots \le |t_G|$.
(3) Create vector $\mathbf{t}$ with elements $t_g$ (t-scores) in order of ascending magnitude.
(4) Set $D = \zeta G$ (default value $\zeta \approx 0.5$). Set the first $D$ elements of $\mathbf{t}^*$ to zero (the $\mathbf{t}_1^*$ partition).
(5) Convert $X$ to $X_c$ by subtracting each gene's average response within each treatment group.
(6) Compute the sample correlation matrix $\hat{P}$ of $X_c$ (Section 4.4).
(7) Find $\hat{\mathbf{t}}_2^*$ as in (21). (See the discussion of the product $\hat{P}_{11}^{-1}\mathbf{t}_1$ in Section 5.2.)
(8) Create a list of re-ranked genes in descending order of the values $|\hat{t}_g^*|$, in which $\hat{t}_g^*$ is the $g$th
     element of $\hat{\mathbf{t}}_2^*$.
(9) Report the $R$ genes with the largest re-ordered scores as statistical discoveries.
Output: list of the $R$ (experimentally determined) most differentially expressed genes following
      the decorrelation processing.
For convenience, it is assumed that $\zeta$ is chosen so that $\zeta G$ is an integer.
The original vector, $\mathbf{t}$, of raw t-scores (Step 3) is ordered by ascending values of $|t_g|$. This ordering
simplifies the definitions of certain quantities in the formal developments. Note, however, that the
output list (Steps 8 and 9) is created according to descending $|\hat{t}_g^*|$ values. This is a more natural
ordering for the end result, as we are interested in only the $R$ largest values.

In order to nullify any genuine treatment differences, $X$ is converted to $X_c$ by subtracting each gene's average response within each treatment group. The sample correlation matrix $\hat{P}$ of $X_c$ is subsequently computed. The critical step is to compute $\hat{\mathbf{t}}_2^*$ (equation (21)). The elements of $\hat{\mathbf{t}}_2^*$ determine the gene ranking: gene $g$ is ranked more highly than gene $g'$ if $|\hat{t}_g^*| > |\hat{t}_{g'}^*|$, in which $\hat{t}_g^*$ is the element of $\hat{\mathbf{t}}_2^*$ corresponding to gene $g$. The first $R$ genes in the reranked list are reported as differentially expressed.
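
An end-to-end R sketch that mirrors the steps of Algorithm 1 is given below. All function and variable names are illustrative rather than taken from the authors' released MATLAB/R code, the ridge constant is an assumed placeholder, and the full $G \times G$ correlation matrix is formed explicitly, which is memory intensive for large $G$.

    # Sketch of the reranking pipeline following Algorithm 1 (illustrative).
    rerank_genes <- function(X, labels, R, zeta = 0.5, eps = 1e-3) {
      g1 <- labels == 1; g2 <- labels == 2
      n1 <- sum(g1); n2 <- sum(g2)
      m1 <- rowMeans(X[, g1]); m2 <- rowMeans(X[, g2])
      v1 <- apply(X[, g1], 1, var); v2 <- apply(X[, g2], 1, var)
      sp <- sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
      tstat <- (m2 - m1) / (sp * sqrt(1 / n1 + 1 / n2))       # Step 1: t-statistics
      G   <- nrow(X); D <- floor(zeta * G)
      ord <- order(abs(tstat))                                 # Steps 2-3: ascending |t|
      id  <- ord[1:D]; cand <- ord[(D + 1):G]                  # Step 4: ZA partition
      Xc  <- X
      Xc[, g1] <- X[, g1] - m1; Xc[, g2] <- X[, g2] - m2       # Step 5: remove group means
      P   <- cor(t(Xc))                                        # Step 6: sample correlation
      P11 <- P[id, id] + eps * diag(D)                         # ridge term (Section 5.2)
      t2s <- tstat[cand] - drop(P[cand, id] %*% solve(P11, tstat[id]))  # Step 7: eq. (21)
      cand[order(abs(t2s), decreasing = TRUE)][1:R]            # Steps 8-9: top R genes
    }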

5.2. Numerical Stability and Computational Complexity

Ordinarily, $n \ll G$, so that the sample correlation matrix $\hat{P}$ is severely rank deficient. A small positive quantity is added to the diagonal entries of $\hat{P}_{11}$ to make it invertible. After this augmentation, the algorithm above exhibits excellent numerical stability.

If implemented in a naïve way, the matrix inversion needed to compute $\hat{P}_{11}^{-1}\mathbf{t}_1$ in (21) would be a prohibitive operation in most computing environments, since microarray data sets may have several tens of thousands of genes. Determining the rightmost product in (21), $\hat{P}_{11}^{-1}\mathbf{t}_1$, by solving the system of simultaneous linear equations $\hat{P}_{11}\mathbf{y} = \mathbf{t}_1$ for the vector $\mathbf{y}$ is much faster than explicitly computing the matrix inverse and forming the product. In particular, we can employ the Cholesky decomposition to exploit the fact that the matrix $\hat{P}_{11}$ is symmetric and positive definite (e.g., [46]). The MATLAB implementation uses the built-in function linsolve with appropriate settings, which, in turn, uses the highly optimized routines of LAPACK (http://www.netlib.org/lapack/).
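
The same idea can be expressed in R, where chol followed by two triangular solves (or simply solve(P11, t1)) replaces the explicit inverse; the snippet is illustrative and is not the authors' MATLAB linsolve-based implementation.

    # Solve P11 %*% y = t1 without forming an explicit inverse (illustrative).
    # chol() returns upper-triangular U with t(U) %*% U = P11; two triangular
    # solves then replace the costly inverse-and-multiply.
    solve_identified_system <- function(P11, t1) {
      U <- chol(P11)                         # requires symmetric positive definite P11
      backsolve(U, forwardsolve(t(U), t1))   # y = P11^{-1} t1
    }
    # Equivalent to solve(P11, t1); both avoid solve(P11) %*% t1.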

The prostate cancer data [47] used in the experiments of Section 6 include $G = 12625$ genes and $n = 102$ samples. For these data, the algorithm above, implemented using MATLAB version R2006b on a computer with a 2.2 GHz dual-core AMD Opteron processor and 8 GB of RAM, required ~40 seconds to report the final gene list. A similar implementation with explicit matrix inversions requires ~10 minutes. These times clearly indicate the relative benefit of avoiding the explicit matrix inversion, but the faster reporting time of ~40 seconds should certainly not be interpreted as a lower bound for a problem of this scale. Indeed, workstations with 32 GB or more of RAM and with faster processors with eight or more processing cores are commercially available at modest cost. Of course, MATLAB is designed for modularity and ease of use, not computational efficiency. Dedicated, lower-level coding of the reranking steps, implemented on a faster machine with more parallelism, could reduce the reranking time significantly.

6. Experimental Results

6.1. Technical Comparisons

The reranking method developed above is compared with two leading techniques, SAM (Significance Analysis of Microarrays [31]) and EDGE (Extraction and Analysis of Differential Gene Expression [14, 48]). SAM adds a small exchangeability factor, $s_0$, to the pooled sample standard deviation when computing the two-sample t-statistic, whereas EDGE is based on a general framework for sharing information across tests. EDGE is reported to show substantial improvement in statistical power over five prominent techniques including SAM [14], the t-test [49, 50], the shrunken t-test [51], the empirical Bayes local FDR [40], and the a posteriori probability approach [52]. It is noteworthy that the reranking procedure developed here shows a significant performance improvement over EDGE in the experiments conducted. To determine the performance quality of the various techniques, we focus primarily on the numbers of false positives, $V$, and the corresponding FDR values, $V/R$, in the reported gene lists. Broadly speaking, the smaller the FDR, the better the technique.
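
When the truly differential genes are known, as in the nature-guided simulations below, $V$ and the FDR of a reported list can be scored directly; a small R sketch (names illustrative) is:

    # Score a reported gene list against known truth: V = false positives,
    # FDR = V / R for a list of length R (illustrative helper).
    score_list <- function(reported, truly_differential) {
      V <- sum(!(reported %in% truly_differential))
      c(V = V, FDR = V / length(reported))
    }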

6.2. Results
6.2.1. Prostate Cancer Data

The primary experiments reported in this paper are based on the prostate cancer data from the work of Singh et al. [47]. This database includes expression data for 12625 genes on 102 oligonucleotide microarrays, comparing 50 healthy males with 52 prostate cancer patients. The purpose of the Singh study is to identify genes that might anticipate the clinical behavior of prostate cancer. The .CEL files for the prostate study are publicly available at http://www-genome.wi.mit.edu/MPR/prostate. The general purpose of the present experiments is to compare the performance of the reranking algorithm with the published state-of-the-art methods EDGE and SAM. The software RMAExpress [53] was used to obtain high-quality gene expressions from the posted data files. RMAExpress applied its in-built background adjustment; however, quantile normalization was not used. To increase normality and stabilize across-group variances [54], each gene was represented in the final expression matrix by the log of its expression level.

Comparative algorithm performance and insight into the inner workings of the reranking method required expression matrices for which truly differentially expressed genes were known a priori. Of course, in the nascent field of genomics, such knowledge is not available, and it is the very purpose of techniques like those discussed in this paper to seek such information. We approached this circular problem by using the prostate database to create test expression data with intergene correlation resembling that in the real microarray data. This was accomplished by first row standardizing the expression matrix from the prostate database. In particular, the true prostate matrix $X$ was transformed to $\tilde{X}$ by subtracting each gene's average response within each treatment group and by normalizing within-group sample mean squares. That is, for the individual treatment groups (for $c = 1$, then for $c = 2$) and for each gene $g$:

$$\tilde{x}_{gi} = \frac{x_{gi} - \bar{x}_g^{(c)}}{\sqrt{\dfrac{1}{n_c}\displaystyle\sum_{j \in \text{group } c} \left( x_{gj} - \bar{x}_g^{(c)} \right)^2}}, \qquad i \in \text{group } c,$$

in which $\tilde{x}_{gi}$ is the $(g, i)$ element of matrix $\tilde{X}$. Each row represents one gene (and two conditions), so that, with this transformation, all genes have equal energy and yet the same within-group intergene correlation structure as the original $X$. Normalizing within-group sample mean squares to unity is not implemented in the reranking algorithm. The normalization is done here prior to any processing as a first step in creating an expression matrix with known differentiation of expression across groups for each gene, but with realistic (derived from real data) intergene correlation.

To generate a test data set from $\tilde{X}$, its 102 columns were randomly divided into groups of $n_1$ and $n_2$. Next, $M_{\text{up}}$ ($M_{\text{down}}$) genes were randomly chosen for up- (down-) regulation by adding a positive (negative) offset $\delta_{\text{up}}$ ($\delta_{\text{down}}$) to the corresponding entries in group 2. The total number of truly differentially expressed genes is denoted $M = M_{\text{up}} + M_{\text{down}}$ ($M$ is to be contrasted with $R$, the number of genes determined by experimentation to be differentially expressed). We also denote by $\pi = M/G$ the proportion of truly differentially expressed genes and, for future purposes, the ratio of the size of the desired gene-discovery list to the number of truly differential genes, $R/M$. In the experiments, various choices of the simulation parameters (group sizes, numbers of up- and downregulated genes, and regulation offsets) were tested to represent a range of data scenarios encountered in practice. Also associated with each trial is a set of parameters characterizing the gene-discovery experiments, principally the list size $R$. These parameter sets are detailed below.
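
A sketch of this construction in R follows; the argument names (offsets and counts of up- and downregulated genes) are illustrative stand-ins for the simulation parameters described above, and the within-group standardization is applied with respect to the original treatment labels before the random regrouping.

    # Nature-guided test-data construction (illustrative sketch).
    make_test_data <- function(X, groups, n1, n2, M_up, M_down, delta_up, delta_down) {
      # X: real expression matrix; groups: original treatment labels (1 or 2)
      Xs <- X
      for (grp in 1:2) {                                # standardize within each
        Z <- X[, groups == grp]                         # original treatment group
        Z <- Z - rowMeans(Z)
        Xs[, groups == grp] <- Z / sqrt(rowMeans(Z^2))  # unit within-group mean square
      }
      cols <- sample(ncol(X), n1 + n2)                  # random split into new groups
      g1 <- cols[1:n1]; g2 <- cols[(n1 + 1):(n1 + n2)]
      G <- nrow(X)
      up   <- sample(G, M_up)                           # truly upregulated genes
      down <- sample(setdiff(1:G, up), M_down)          # truly downregulated genes
      Xs[up,   g2] <- Xs[up,   g2] + delta_up
      Xs[down, g2] <- Xs[down, g2] + delta_down         # delta_down is negative
      list(X = Xs[, c(g1, g2)], labels = rep(1:2, c(n1, n2)), truth = c(up, down))
    }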

In all experiments, results for the existing EDGE and SAM methods were obtained using the subroutines statex.r from the EDGE software package (http://www.genomine.org/edge/) and samr.r from the SAMR package (http://www-stat.stanford.edu/~tibs/SAM/), respectively. Both routines computed their native gene summary statistics given the matrix and corresponding column labels. These statistics, in turn, were used to determine the top $R$ genes. Results based on reranking proceed from the steps outlined in Algorithm 1, with the reported genes corresponding to the $R$ largest scores of the reranked list.

Three experiments (cases) involving the prostate data are reported here. The first two cases were designed, by choice of the proportion of truly differential genes, $\pi$, to represent typical conditions of two general classes of gene-discovery problems. The third case was carried out to test robustness of the technique to small sample size.

Case 1 (Small $\pi$). In the first case, the proportion $\pi$ of truly differentially expressed genes is small, meaning that there are relatively few truly differentially expressed genes. The smaller $\pi$ is consistent with microarray investigations seeking genes that distinguish subtypes of cancer or diabetes, for example. The complete simulation parameter set for Case 1 specifies the group sizes $n_1$ and $n_2$, the numbers $M_{\text{up}}$ and $M_{\text{down}}$ of up- and downregulated genes, and the offsets $\delta_{\text{up}}$ and $\delta_{\text{down}}$. For numerical simplicity, we based the experiments on a subset of Singh's [47] gene expressions. Two sets of experiment parameters are used in Case 1, differing only in the size of the list of discovered genes.
Subcase 1.1 (Small $\pi$, Small $R$). In the first experiment, the size of the gene-discovery list, $R$, is significantly smaller than the number of truly differential genes, $M$ (i.e., $R/M \ll 1$). In practice, a relatively small $R$ would be chosen to identify high-quality, class-distinguishing features for expression-profiling-based clinical diagnosis and prognosis, in which the goal is to build accurate classifiers and predictors. Whereas Singh et al. [47] build a classifier around only 16 of 12625 features, they discuss the need to include as many reliable features as possible.
Figure 1(a) presents results for the test pair of Subcase 1.1. Remarkably, for 36 of 40 matrices, the reranking procedure reports gene lists with no false discoveries at all ($V = 0$), while the other techniques fail to obtain a single gene list with $V = 0$. This result is typical of many "small $\pi$" experiments carried out with an array of parameter sets. In particular, the quality of the results notwithstanding (as measured by the FDR; see below), the reranking strategy uniformly outperformed EDGE and SAM in every scenario.
In any rational detection algorithm built around a parametrized stochastic framework, it is possible to find regions of the parameter space in which performance deteriorates. In the "small $\pi$" gene-identification problem, for fixed $G$ and $M$, increasing $R$ (more specifically, increasing the ratio $R/M$) or decreasing the "signal" magnitudes of either up- ($\delta_{\text{up}}$) or down- ($\delta_{\text{down}}$) regulation all create increasing probabilistic risk of false discoveries, $V$. As $R$ was allowed to approach $M$ in the Case 1 experiment above, the performance of all methods (EDGE, SAM, and reranking) deteriorated as measured by the FDR, yet the reranking approach remained consistently superior to the others according to this measure. "Better," however, does not always mean "good." For illustration, we report a second Case 1 experiment.
Subcase 1.2 (Small $\pi$, Large $R$). In this variation of the Case 1 experiment, we take the list size $R$ to be comparable to the number of truly differential genes, $M$. This is still a "small $\pi$" situation, but with a "greedy" approach to gene discovery: an attempt to identify "all" genes (estimated to be) differentially expressed. Such a strategy would be employed in a microarray study designed to liberally identify a set of genes to be explored further—experimentally or computationally—to gain better understanding of underlying gene networks.
Figure 1(b) shows plots of the number of false positives, $V$, over 40 data sets for the Subcase 1.2 experiment. The corresponding FDR, $V/R$, is shown on the secondary ordinate axis. With the larger $R$ and with relatively weak differential expression "signals" ($\delta_{\text{up}}$, $\delta_{\text{down}}$), identifying a good gene list is not an easy task, as evident from the results. Among all methods, only reranking achieved sufficiently low values of $V$ to rescue a few matrices, but, clearly, even reranking would not provide scientifically useful or reliable gene lists in this high-risk environment.

Case 2 (Large $\pi$). The second case employs a larger $\pi$, typical of studies comparing healthy versus diseased cell activities. Relative to Case 1, there are many more truly differentially expressed genes in Case 2 (the number $M$ is increased by a factor of 4), thus decreasing the risk of false discoveries, especially for a small ratio $R/M$. This is akin to an increased prior probability of a differentially expressed gene in a Bayesian detection strategy. To further challenge the algorithms in the light of the "increased prior," the up/downregulation of expression was made considerably weaker in Case 2 (the offsets were reduced by a factor of five relative to Case 1). As in Case 1, two sets of experiment parameters were used in Case 2, differing only in the size of the list of discovered genes.
Subcase 2.1 (Large $\pi$, Small $R$). In the first Case 2 test, the test pair represents a large proportion $\pi$ but a relatively small list size $R$. The experimental results for this case over 40 trials are shown in Figure 2(a). In spite of the decreased signal strength, the reranking procedure produces no false discoveries in a vast majority of trials, similarly to the small $\pi$, small $R$ experiment of Subcase 1.1. EDGE and SAM consistently report a very large proportion of false discoveries (typically around 250, or 90%).
Subcase 2.2 (Large $\pi$, Large $R$). A second Case 2 experiment was run to show the effects of "greedy" discovery lists, or large $R/M$ ratios. The parameters remain identical to those in the Subcase 2.1 experiment, except that the gene-discovery list is enlarged to $R = 1200$. Results are shown in Figure 2(b). As in the large $R$ experiment of Subcase 1.2, the reranking approach significantly outperforms EDGE and SAM, with typically much smaller $V$ and FDR for reranking than for the standard methods. The sample variances of $V$ and FDR are notably smaller in the Subcase 2.2 trials relative to similar trials in Subcase 1.2. Also as in Subcase 1.2, reranking consistently outperforms the standard methods for large $R$ throughout the range tested. At some application-dependent point in this range, reranking, although "better," is not sufficiently "good" at producing reliable discoveries. Clearly, in the present experiment, the typical reranking result is not indicative of reliable gene discoveries. For a few trials, the rate drops as low as ~0.45, but even this best-case rate implies ~540 false discoveries in the reported list of 1200 genes.

Case 3 (Tests of Robustness to Small Sample Size). The number of microarray chips available to a study, $n$, is ordinarily quite small compared with the number of genes investigated, $G$. The gene-discovery operation is therefore required to draw conclusions from a sparse sampling of the gene expressions. To gain some insight into the robustness of the gene-discovery methods to small sample sizes, variations on the Case 2 (large $\pi$) experiments were repeated with the further stressor of a significant reduction in the sample space: the group sizes $n_1$ and $n_2$ were each reduced to 20. So that observations could be more attributable to the sample size, the signal levels $\delta_{\text{up}}$ and $\delta_{\text{down}}$ were increased back to their Case 1 values. The remaining simulation parameters used in Case 3 are identical to those of Case 2. As in the previous cases, we ran two experiments, the first (Subcase 3.1) with a "conservative" gene-discovery list (relatively small $R$) and the second (Subcase 3.2) with a "greedy" gene-discovery list (relatively large $R$).
To create the data set for Case 3, 20 columns per treatment group were chosen randomly from the original prostate cancer expression matrix $X$. The data generation process (including row standardization) detailed in Section 6.2.1 was then applied to the selected columns. Some compensation for the reduction in the number of samples is potentially present in the increased differential signal. The $V$ and FDR values over 40 trials for the three methods are shown in Figure 3(a) for Subcase 3.1 and in Figure 3(b) for Subcase 3.2. Even with the significantly reduced sample size, the reranking process provides consistently superior performance with respect to the existing methods. As in previous experiments, however, the better performance in the large $R$ experiment of panel (b) does not mean that the results are necessarily reliable or useful. Nevertheless, these results suggest that the reranking procedure increases power in the analysis of small-sample data sets.

6.2.2. Simulated Data

Before devising the test data setup of Section 6.2.1, the reranking method was tested on several simulated data sets. We discuss some of these simulation results that shed further light on the small sample behavior of the method.

Let us denote by $\mathbf{x}_i$ the $i$th column of a simulated expression matrix $X$. We assume that the random vector $\mathbf{x}_i$ is multivariate Gaussian with (stationary with index $i$) mean $\mathbf{m}$ and covariance matrix $\mathbf{C}$. Each such column represents $G$ genes, which, in their null expressions, are modeled by a covariance matrix that introduces roughly the same amount of linear dependence as found in the BRCA data of [55]. The experiments were run for two sizes of the discovery list, a smaller and a larger $R$.
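
The following R sketch illustrates this kind of simulation; the AR(1)-style correlation used here is only a stand-in for the BRCA-derived dependence level described in the text, the columns are generated with zero mean for simplicity, and all parameter names are illustrative.

    # Simulated expression data with correlated Gaussian columns (sketch).
    simulate_expression <- function(G, n1, n2, rho = 0.5,
                                    M_up = 0, M_down = 0,
                                    delta_up = 1, delta_down = -1) {
      Sigma <- rho^abs(outer(1:G, 1:G, "-"))       # illustrative correlation model
      U <- chol(Sigma)
      X <- t(matrix(rnorm((n1 + n2) * G), n1 + n2) %*% U)  # columns ~ N(0, Sigma)
      up   <- sample(G, M_up)                       # truly upregulated genes
      down <- sample(setdiff(1:G, up), M_down)      # truly downregulated genes
      g2 <- (n1 + 1):(n1 + n2)
      X[up,   g2] <- X[up,   g2] + delta_up
      X[down, g2] <- X[down, g2] + delta_down
      list(X = X, labels = rep(1:2, c(n1, n2)), truth = c(up, down))
    }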

Figures 4(a) and 4(b) show plots of the $V$ and FDR values over 40 trials for the smaller and the larger list size, respectively. With the smaller sample size of these simulations, the preeminence of the reranking method scales down. Nevertheless, reranking achieves a very low FDR for 26 out of 40 simulated realizations and retains an advantage over the alternatives in 30 of 40 trials.

Table 1 shows results for some of the realizations for the larger list size (Figure 4(b)). Shown are the largest 100 values of $|\hat{t}_g^*|$ and each corresponding original $t_g$ with its concomitant rank. Even in this challenging case, the results indicate favorable aspects of the reranking procedure. First, it is noteworthy that reranking results in fewer false discoveries in the list of 100 genes than are obtained when the raw statistics are used. Further, all but two of the false discoveries reported by reranking received an even higher ranking by t-statistics. On the other hand, 46 of the correct discoveries by reranking would not have appeared in the list of 100 genes reported by t-statistics. Results like these were observed repeatedly in our data analysis. Consistent with the results of the prostate data studies, the reliability and utility of all techniques lessen as the discovery list grows, yet reranking persistently outperforms the other methods.

It is notable that, with the smaller sample size, SAM outperforms EDGE and the use of raw t-scores. This is not entirely surprising, as a smaller sample size can make the noise in the per-gene pooled variance (and possibly the equivalent quantity in the EDGE algorithm) more prominent. SAM mitigates this issue in some measure by using the exchangeability factor $s_0$ to adjust the effective pooled variance [31].

7. Discussion and Conclusions

In most microarray data, there are at least three resources that can be used to advantage: (i) identifiability, (ii) parallel structure, and (iii) intergene correlation itself. Analysis in papers by Efron [56, 57] suggests this view of the rich information structure inherent in the data. In this light, reranking can be viewed as exploiting more than correlation as a means of sharing information across tests, as it also involves identifiability.

Limited time and resources often require biomedical researchers to work on only a small number of "hot (gene) prospects." Even under such highly conservative conditions, however, misleading results can occur, as is evident in the results of Figures 1–4. For all their expert development and statistical power, even state-of-the-art tools like SAM and EDGE can report spurious gene lists. The extra statistical power of reranking promises to further guard against anomalous results that can have serious consequences for the study of gene function, causation, and interaction.

In summary, this paper has reported the development and testing of a novel framework for the detection of differential gene expression. The framework exploits identifiability—the fact that in most microarray data sets, a large proportion of genes can be identified a priori as nondifferential—to reduce the correlation in the expression data for the remaining gene candidates. When applied to the widely used two-sample t-statistic approach, this viewpoint yielded a simple differential analysis technique which requires as inputs only a gene expression matrix, the related two-sample labels, and the size $R$ of the desired output gene list. The method was tested on data constructed from the prostate cancer database of Singh et al. [47] and on some simulated data. Compared with SAM [31], EDGE [14], and the raw t-statistic approach itself, reranking shows substantial improvement in statistical power. As is the case with all published techniques, the power of the reranking process tends to increase considerably with an increase in the number of microarray samples. However, even for small sample sizes, performance was significantly better than the alternatives in the experiments conducted here.

Protection of Human Subjects

Human subject data used in this study are publicly available and anonymous and are therefore exempt from continuing Institutional Review Board scrutiny according to U.S. Health and Human Services Policy 45 CFR 46, Subpart A, 46.101(b)(4).

Acknowledgments

This work was supported in part by the Quantitative Biology Initiative at Michigan State University. The authors are grateful to the personnel in the MSU High Performance Computing Center for assistance in implementing the extensive computer simulations, and to Dr. K. H. Desai for many contributions to this work.