Abstract
Microarray data are used to determine which genes are active in response to a changing cell environment. Genes are “discovered” when they are significantly differentially expressed in the microarray data collected under the differing conditions. In one prevalent approach, all genes are assumed to satisfy a null hypothesis, , of no difference in expression. A false discovery (type 1 error) occurs when is incorrectly rejected. The quality of a detection algorithm is assessed by estimating its number of false discoveries, . Work involving the second-moment modeling of the z-value histogram (representing gene expression differentials) has shown significantly deleterious effects of intergene expression correlation on the estimate of . This paper suggests that nonlinear dependencies could likewise be important. With an applied emphasis, this paper extends the “moment framework” by including third-moment skewness corrections in an estimator of . This estimator combines observed correlation (corrected for sampling fluctuations) with the information from easily identifiable null cases. Nonlinear-dependence modeling reduces the estimation error relative to that of linear estimation. Third-moment calculations involve empirical densities of covariance matrices estimated using very few samples. The principle of entropy maximization is employed to connect estimated moments to inference. Model results are tested with BRCA and HIV data sets and with carefully constructed simulations.
1. Introduction
This work is motivated by analytical challenges that arise in the use of microarray data to discover genes that are differentially expressed across experimental conditions such as control and treatment. Although the discussion centers around this genomics task, the developed methods are quite general and should be useful in other multiple-testing applications in which there is substantial dependence among test measures, and in which a small sample size may cause significant fluctuations in statistics employed in the testing. The specific aim of this work is to develop a reliable estimator of the number of false discoveries (type I errors—denoted ) in a multiple-testing problem in such an adverse setting.
The classic and contemporary literature in cell biology, and the more recent literature in genomics (both in print and posted on the Internet), is replete with tutorial information at all levels about cell anatomy and physiology, and genomics, as well as the microarray technology deployed in the present application. A good entry point for accessing information about contemporary developments in the genomics field is the web site of the US National Genome Research Institute [1]. The papers by Page et al. [2] and Wang [3] provide relatively current reviews of microarray technology and methods. A brief description of the biological aspects of the genomics application underlying this work is found in Appendix A of this paper.
In the gene-discovery application, each gene is tested against a null hypothesis, , that the gene is not differentially expressed across experimental conditions. All genes are initially assumed to satisfy in this analysis, and is estimated conservatively. This “all-null’’ presumption is consistent with this application in which is true for a vast majority of the genes in any experiment. Beyond the gene detection problem, however, this presumption is realistic in many practical applications of large-scale testing in which the prior probability of null cases, say , is large, and in which the goal is to identify a small set of interesting “nonnull’’ cases [4]. With , it is also possible to impose identifiability (the strongly justified assumption that a given gene satisfies ) on some of the genes, yielding crucial information with which to condition the estimation of [5].
One of the earliest reports of research using the microarray (or “gene chip’’) appeared in a paper by Schena et al. in Science in 1995 [6]. Generally speaking, research that employs the microarray to analyze gene expression data has one (or both) of the following underlying aims: the discovery of gene coexpression, or the discovery of gene differential expression. To the extent that these problems have been investigated separately, the coexpression problem has frequently been addressed by clustering methods (e.g., [7–13]), whereas differential expression has been studied using variations of classical statistical hypothesis testing (e.g., [5, 14]). Whereas differential-expression/hypothesis-testing research was, and is, concerned with expression in response to differing cell conditions (normal versus pathology, medical treatment versus control, etc.), the early coexpression/clustering research was often focused on phenotypic manifestations of the gene expression.
As the technology has matured, the dichotomy suggested above has blurred with many current applications of the microarray involving “hybrid’’ research questions into both differential and coexpression. Application areas include discovery and exploration of gene regulatory systems, tissue and tumor classification, biomarker prediction, discovery and reverse engineering of gene expression networks—not to mention the microarray’s deployment in the study of protein synthesis, metabolism, evolution, and other areas related to cell biology. Technical approaches to these problems have gone well beyond classical clustering and hypothesis-testing methods. Indeed, in the past few years, statistical (i.e., “nonclustering’’) methods to address the coexpression problem have been reported (e.g., [15, 16]), while the hybrid of the two problems—that of detecting and analyzing differentially coexpressed genes—has been researched using an ever-increasing number of methods including clustering with complex and dynamic feature selection methods, image transformation and processing of expression data, biclustering, graph and network theory, hypothesis testing, and other statistical approaches (e.g., [17–28]). In this paper, we return to the focused problem of detecting differential expression across treatment conditions, but it will become clear—as it has to the community working on this problem—that differential expression cannot be studied independently of coexpression.
Early work on classical statistical techniques for microarray-based gene discovery is summarized in the 2002 review paper by Pan [29]. Initially, it was customary to treat gene expression outcomes as realizations of independent random variables (RVs). More recent papers, however—notably, those of Owen [30], Efron [31], and Pawitan et al. [32]—caution researchers of the detrimental effects of correlated gene-expressions on the validity of “discovered’’ genes. In particular, it was reported that highly correlated tests increase the variance of (or, its normalized counterpart, the false discovery rate, , where is the number of “discovered’’ genes), thus making estimates of less reliable. In particular, high variance renders the average, , an unreliable estimator of [30]. Among many causes, intergene correlation is attributable to coexpressed genes [4] and to unmodeled factors that introduce systematic effects across genes [33, 34]. As a result, for most real data, the assumption of independence or weak dependence among gene expressions is unfounded, and methods treating correlation are necessary [35, 36].
Accordingly, there has been significant recent interest in improving statistical gene detection methods in light of this detrimental correlation. For example, Storey et al. [37] present an approach to the notion of sharing information across scores, which they describe as “borrowing strength across the tests’’ for a potential increase in statistical power. Tibshirani and Wasserman [38] discuss a quantity called the “correlation-shared” -statistic and derives theoretical bounds on its performance. Hu et al. [39] examine the covariance structure of the expression data and discover benefits of linking coexpression and differential expression in a distance measure—thus, moving toward the “hybrid’’ problem described above.
Recent research into the hybrid differential coexpression problem has also yielded results and methods that could ultimately benefit the differential expression problem. Because the differential coexpression research is often concerned with differing phenotypes, rather than with different treatment conditions, two given research efforts involving differential coexpression might seek answers to different sets of genetic questions through expression data. Like the “dual conditions researchers,’’ however, the “phenotype’’ researchers have encountered their own forms of confounding dependencies, notably the relative gene locations, the expression time sequencing, and phase information (e.g., [40–42]). Papers have been published addressing these issues, including the exposition of new statistical approaches—for example, “CorScor’’ developed by Dettling et al. [21], the “ECF-statistic’’ of Lai et al. [22], “the gene-set coexpression analysis’’ of Choi and Kendziorski [15]—as well as new clustering methods—for example, the web-based expression analyzer of Xiang et al. [43], high-order preclustering method of Wong et al. [44], and the “BioSym’’ distance measure of Bandyopadhyay and Bhattacharyya [45]. A recent review of clustering methods in genomics appears in the paper by Dalton et al. [46]. A more general examination of the performance of classifiers of microarray expressions appears in the paper by Ancona et al. [47].
The present paper returns to the problem of gene discovery by statistical hypothesis testing, but with the concern for the effects of nonlinear dependencies on (the estimation of) the number of false results. Empirical work below provides cogent evidence that accounting for intergene correlation alone does not sufficiently mitigate the adverse effects of dependency. Recent work by Hu et al. [48] has shown the importance of accounting for nonlinear dependence in imputing missing values in microarray data. Modeling nonlinear dependencies is a challenging problem, and the present work makes only a modest—nevertheless, empirically significant—step into the realm of nonlinear dependence by modeling the third-moment characteristics of the quantity . In principle, the proposed extension admits any order moment, but computational constraints limit the present developments. However, even the single step to a third-moment extension under severe sampling fluctuations is very challenging, and, in spite of this modest modeling enhancement, it is a hard-won extension yielding significantly improved estimates for a range of real and simulated examples (see Section 5).
Thus, a central finding of this work is that null statistic histogram approaches can be improved by including third-moment skewness corrections. Advancing the techniques to model higher order dependencies is challenging, but the effort could have a substantial payoff. Errors in gene detection are expensive in financial terms, but the derailing of biomedical research resulting from a false gene discovery could be profoundly costly in many ways. Even modest improvements in genomic techniques are potentially very significant.
2. Notation and Terminology
Because this paper has a practical aim, we will assume, without comment, “friendly’’ mathematical conditions such as existence of distributions, measurability, and sure convergence of integrals. Even so, the mathematical developments in this paper necessarily involve extensive notation and we strive for consistency and clarity in its use. Quantities are generally formulated as RVs unless stated otherwise. This excludes obviously deterministic quantities like sequence indices, integers defined in the abstract (e.g., the number of microarrays, “’’), and statistical expectations. Precise formulations require that probability distributions ordinarily be denoted formally as, for example, for the joint distribution of RVs and , but the more common abusive notation “’’ is more expedient in a few cases. The notation may denote either a discrete or continuous (i.e., density) distribution, depending, of course, on the RV(s) being modeled. The meaning should be clear in context. We deliberately allow this ambiguity because it avoids some notational awkwardness as discrete distributions are fitted with densities. On the other hand, the notation is used to denote a probability assignment to a measurable event .
Many developments in this paper are centered on second-order statistical concepts. It is important to carefully define terminology used in this regard, since the vocabulary has nuanced differences across disciplines. The elementary notation for scalar RVs in Table 1 is standard and is used conventionally in this paper. A caveat arises in the discussion of related matrices, however. The term “correlation matrix’’ is used in this paper in a way consistent with its use in many statistical developments, but not in a way that is universal across disciplines. The following definitions are used throughout this paper.
Definitions 1. Consider a random vector with mean vector . Then, the covariance matrix associated with is defined as in which the element is . The correlation matrix of is in which the element is and is a diagonal matrix with element , the standard deviation of the th RV, .
In this paper, the term “correlation matrix’’ will refer to the definition in (2). On the contrary, in much of the engineering literature, the outer product is called the (spatial) correlation matrix. In this case the element of the matrix is the scalar correlation . In our definition the elements are correlation coefficients, which are, in fact, normalized covariances. One significant implication of this fact is that mean values of RVs have no effect on either matrix. This should be kept in mind in the developments to follow.
3. Problem Formulation
genes are to be studied using microarray experiments. The expression values are recorded in an matrix, . For analytical purposes, the expression quantities are generally RVs. Each of the microarray experiments takes place under one of two conditions (indexed by ) such as control and treatment. These two subsets of the data are called treatment groups in the paper. There are samples (i.e., microarrays) in treatment group . Based on the evidence in , we seek to identify a “small’’ number, , of genes that are significantly differentially expressed across the two treatment groups. One widely used strategy (e.g., [5, 14]) is to posit that each of the genes, for satisfies a null hypothesis, All genes are tested against this hypothesis using two-sample null statistics [14]. The magnitudes of scores establish a gene ranking, and the genes with the largest scores are reported as statistically significant discoveries.
Clearly, the list of discovered genes is only meaningful to the extent that is very small. Of course, can only be estimated since the state of any gene (i.e., whether or not it should be “discovered’’) is unknown. Strong causal relationships among genes give rise to highly correlated scores and greatly complicate the estimate of [30, 31]. Moreover, in spite of their declining cost, microarrays are still a relatively expensive technology. Consequently, the number of microarrays, , in an experiment is usually smaller than number of genes, , by as much as four orders of magnitude. Typically, existing microarrays record expression data for at least a few thousand genes. The fact that further complicates the problem because the knowledge about the underlying gene-gene correlation structure is critically sparse in the observations. At the same time, the consequences of correlation on differential analysis cannot be overlooked [35]. In fact, the present work will suggest that even nonlinear dependencies must be accounted for in order to properly estimate . Theoretical justifications for this contention are given momentarily.
This paper develops a moment-based estimator of by giving the null histogram a stochastic interpretation. The observed null counts are viewed as realizations of a more fundamental random model shaped by inter- dependence. A small zero-symmetric bin in the space of statistics is designated as the center area, and it is posited that no scores from “nonnull’’ genes fall in this range. Then, the null count in the center area, say , is observable, and by conditioning on , the variance of can be reduced significantly [5, 49]. We relate to through the discrete joint distribution . To obtain an approximation of , we estimate its first three moments, then fit the maximum entropy function (Appendix B). This approach inherently yields an estimate of the conditional distribution . A large number of estimates of the distribution of would theoretically be more useful than a point estimate because of the noisy nature of large-scale inferences [30]. Compared to histogram-curve-fitting techniques like empirical null [5], however, the present approach enjoys the attractive feature that covariance is separately estimated, and then explicitly incorporated into the inference.
Efron [31] reports that RVs and are found to be extremely negatively correlated in a number of real experiments. He provides an explanation for this finding, then employs these insights to develop a Poisson-model-based second-order estimator of which, like the present approach, relies on the center area concept. While Efron’s work is extremely important, his own research has gone on to show that purely second-order estimates suffer from over- and under-estimation effects. The second-moment estimates of reported later in the present paper (see Section 5), as well as those in the cited Efron paper, all show these adverse effects. There are three contributory factors: (i) is bounded below by zero, (ii) the mean of is small, and (iii) intergene covariance causes the variance of to inflate. All of these factors suggest that skewness corrections—reflecting nonlinear dependence—are vital.
4. Methods
4.1. Moments of the Joint Distribution
4.1.1. Count Models
The process begins by transforming test statistics to values as , where is the putative null cumulative distribution function (c.d.f.) of the test statistic, and is the inverse c.d.f. of the unit normal density, . The values, modeled as RVs, provide the analytical convenience of multivariate normal form while describing the joint -statistic behavior. We formally define the fundamental quantities: in which denotes the number of elements in the discrete set . is the sample space of values. The interval corresponding to count is called the center area, and the semi-infinite interval associated with count is the left tail area. For proper comparison with Efron’s results [31], we work with a left tail area; however, the present approach can employ right- or double-sided tail areas equally well.
The premise that very few nonnull scores fall in and, hence, that is practically observable is of prime importance. A similar assumption plays a central role in the literature on estimating the proportion of null genes, as in, for example, papers by Efron [31], Pawitan et al. [32], and Langaas et al. [50]. The empirical null approach [5] relies on similar reasoning. We exploit the observability of to: (i) estimate the moments of , (ii) use them to infer the distribution (estimate) , and then (iii) report (an estimated) which in turn could be used to find an estimator of conditioned upon . Initially, all cases are treated as null. Improvement is possible by estimating [50, 51].
4.1.2. Assumptions
Assumptions
The following assumptions underlie these developments:(1) is large, say (Efron discusses this bound in [4]).(2) is a unit normal variate [~] for all .(3)The scores are jointly Gaussian to the third order (but not uncorrelated).(4)Recall that [element of the expression matrix ] denotes (the RV model for) the expression of gene on microarray . Let denote the marginal RV for , that is, the model for the expression outcomes of gene . The realizations of are the elements of row of an observed . Then, it is assumed that for all , (justified below).
4.1.3. The -Value Histogram
It is convenient and computationally efficient to obtain the moments of using the moments of the -value histogram. We seek central moments because they facilitate working with the maximum entropy distribution (Appendix B). The moment estimation is carried out as follows. is partitioned into disjoint bins, , where the th bin has center and width (constant with ). Then, the -histogram bin counts are in which is the indicator function for the event “score falls in bin .’’
Consider bin with count for some . The mean of count is The second-order joint central moment covariance of the pair , where may equal , is Because of the bivariate normality of and , (7) can be approximated by: where is the Kronecker delta, and is the bivariate Gaussian density with mean vector , and covariance matrix (equivalent to the correlation matrix in this case) That is, defining the vector of arguments , where denotes the determinant. In the approximation (8), the density in (10) is evaluated at and .
As reflected in the second line of (10), the covariance matrix is fully specified by a a single-scalar parameter, for each pair. Thus, we can express the Gaussian density in (10) as being parameterized by this autocorrelation coefficient for the given -score pair, . Now, suppose that we can derive an empirical density [52], say , over the interval , fitted to the discrete set of autocorrelation coefficients, . This empirical distribution allows the summation over gene indices in (10) to be replaced by a continuous computation. This smoothed computation represents a further approximation of , which is stated in the form of a lemma below. This result is similar to those of Owen [30, Theorem 1] and Efron [31, Lemma 2].
Lemma 1. Let denote the empirical density of , derived from the -pair sample correlation coefficients, . Then, the second joint central moment of a histogram count pair , where may equal , is approximated by where , and
Further, the third joint central moment of a triplet , where two or more indices may be equal, is: where in which is the Kronecker sequence over .
The assumed trivariate normality of the scores implies that the joint distribution for each score triplet is specified by a covariance matrix. Let us denote the covariance (equivalent to correlation) matrix for the -value triplet by For each -score triplet, is an element of the space—call it —of all symmetric positive-semidefinite matrices in with element magnitudes no greater than unity. is continuously distributed over .
Again, we need an empirical way to compute the joint moments of the scores. Let be the empirical density of the correlation matrices, . This density must be inferred from the observed data. Of practical importance is the fact that, although each is distributed over a subspace of , the domain of is effectively a three-dimensional manifold of that space because each covariance matrix is unambiguously determined by its three values above or below its main diagonal (with unity diagonal elements). The argument may be thought of as a vector of these three elements (over the continuum of allowable values), but we will continue to denote it as a matrix as a reminder of its association with . We have, in conjunction with (6)–(14), the following useful approximation.
Lemma 2. The third-order joint central moment of a histogram count triplet , where two or more indices may be equal, is approximated by where , , , and are defined in (6) and (11), and is the -histogram bin width.
To obtain the moments of it is simply necessary to combine the moments of the corresponding counts. For example,
The key quantities in the lemmas for approximating moments are the empirical covariance densities. Obtaining these densities in the presence of severe sampling errors is discussed next.
4.2. Empirical Correlation Densities
4.2.1. Approach
Because of severe sampling fluctuations, the current methods can recover only from the data. This density is then used to estimate . For this purpose, as well as to facilitate the calculations of Lemmas 1 and 2, we seek to parameterize the requisite densities. For most real examples, a single omnibus parameter is found to be sufficient.
4.2.2. Data Normalization
For all-null false discovery rate calculations, normalization of the per-microarray expression results has been found to be beneficial [31, Remark E]. The columns of the data matrix are standardized to mean zero and unity variance. This standardization normalizes output “brightness’’ among microarrays [53, 54]. It also forces the sum of covariances, and approximately the sum of correlations, to be zero. This permits the fitting of a zero symmetric density to , which, in turn, has profound consequences for the form of .
Formally, let denote the residual expression matrix, obtained by subtracting from each gene’s average response within each treatment group, and let denote the marginal random variable modeling the residual expression outcomes for gene [like of Assumption (4), p. 5]. All further discussion of gene expression values will refer to these normalized residual values.
4.2.3. Obtaining
The empirical densities and , as well as others to be introduced below, clearly play a key role in moment estimation above. In each case, the empirical density—a surrogate for the true statistical density of the correlation coefficient(s) being modeled—is a distribution of a correlation function or matrix over a continuum, but it must be inferred from the data samples.
To deduce , we require —the empirical density of the gene expression correlation coefficients. The mapping between the domains of and is needed, in principle, to calibrate . However, for the usual two-sample -statistic, assuming independent columns in , [recall Assumption (4), p. 5]; hence, . We make the assumption of equality of these densities below.
Let denote the sample covariance between rows (genes) and of . For convenience, we define the notation These are the values to be fit with density .
To reduce the variability added by sampling errors, we apply the Fisher transform: For bivariate normal samples, the Fisher transform has remarkable normalizing and variance stabilizing properties [55], and each is well modeled by the distribution , where is the Fisher-transformed underlying correlation coefficient. Assuming a sampling model where is the distribution of the Fisher-transformed underlying correlation coefficients, we can interpret the histogram of Fisher-transformed sample correlations, say the “-histogram,’’ as a convolution of , the statistical correlation density on the scale resulting from the Fisher transform, and the histogram of sampling errors, say, the “-histogram,’’ also on the -scale. Then, the underlying is obtained by deconvolving this density from the convolved pair, . For a wide variety of microarray data sets studied in this work (also see [30]), the normal distribution fits nicely to the histogram. For bivariate normal samples, where , the estimate of , say , takes the normal form . It is this estimate that will serve as the empirical density of the Fisher-transformed correlations, .
Having obtained the underlying , we must, in principle, undo the mapping (20) to obtain , then deduce from . Recall, however, that we assume that the correlation coefficients of the and variables are identical [31], so that we may directly seek from . A distribution that fits the inverse-transformed well is a class of densities in the general Beta distribution family given by: where is the Beta function and and are nonnegative shape parameters. Comparing (22) and (23) gives a useful probabilistic interpretation of the correlation coefficient, say , modeled by the empirical density : We see from (22) that . Therefore, will be a factor of four greater than the variance of a Beta-distributed random variable with parameters . That is, Thus, using as an estimate of , we obtain the parameter , hence, the distribution .
4.2.4. Obtaining
We now pursue as an extension of . Like , the effective domain of is of only three dimensions. Also similarly to the scalar density, for the two-sample -statistic, . Hence, we can pursue indirectly by finding .
We seek a joint distribution on the space of all correlation matrices such that all the inherent marginal distributions (i.e., the distributions of for ) are equivalent to . Such a result can be obtained from the inverse-Wishart distribution whose marginalization properties are helpful when studying a subset of variables [56]. Suppose that the true statistical covariance matrix of , say, follows the inverse-Wishart distribution , , where is the single parameter that characterizes the distribution, and indicates the trace. The goal is to relate to parameter of (22) and to determine the distribution of any of the covariance submatrices of .
Following the separation strategy of Barnard et al. [57], we decompose into its variances and normalized covariances (i.e., correlation coefficients) as where is the diagonal matrix whose th diagonal element, , is the standard deviation of the gene residual expression [recall (2)]. is the correlation matrix of the residual expression matrix . Under the transformation , the Jacobian is given by [58, Theorem 3]. Thus, after marginalization over where is the th diagonal element of . The product arises because of independence of the elements. Substituting yields which leads to an expression for the probability density of the matrix : where denotes the principal submatrix of , and where we have used the fact that . For with probability density (30), the marginal density of its arbitrary correlation submatrix also has a useful expression.
Lemma 3. For a correlation matrix with the probability density (30), the correlation submatrix, , has the density
Proof. Suppose that the statistical covariance submatrix undergoes the transformation , where is the diagonal scaling matrix of appropriate standard deviations [recall (27)]. Then due to the marginalization property of inverse-Wishart, . Following steps(26)–(30) for yields (31).
Substituting in result (31) yields Note that this density is the function of a scalar argument, the single value of the off-diagonal elements of . We have indicated this by use of the subscript “’’ in the second term in the expression. The critical property of this result is that it has the same uniparametric form as (22). By setting we can force the inherent marginal densities of the entries () to equal —the specific aim of this derivation.
Finally, substituting in result (31), we obtain in which is the element of the evaluated matrix (in the abstract) . The density of a particular covariance matrix in , say , involves the use of the three elements in the upper triangle of the matrix, reinforcing earlier assertions that the domain of is a manifold of the matrix space. Recall that (assumed equivalent to ) is the original notation for the covariance matrix of a score triplet , and, by extension, . In the present discussion, assumes the role of a submatrix of , namely, .
For large , the assumption of the inverse-Wishart distribution in (26) is not well justified. However, the assumption is used here strictly for its value in deducing from . There is no concern for a model of the entire matrix . Further, single-parameter distributions on a positive definite matrix space are few. The inverse-Wishart distribution is chosen for its useful marginalization property. Of course, a tenuous assumption is not justified by a useful property if it leads to an unuseful procedure. The practical validation of the assumption is manifest in Section 5. The “Bayesian correlation priors’’ point of view from the work of Liechty et al. [59] was especially helpful in formulating these ideas. Exploring other ways to obtain is a worthwhile but challenging endeavor.
Finally, we have derived only up to a proportionality constant. However, for , normalization is straightforward. For it is necessary to resort to Monte Carlo methods which only require densities up to a proportionality constant.
5. Results
5.1. Testing on Real Data Sets
A MATLAB implementation of the approach based on the work above is available at the website http://www.egr.msu.edu/~deller/. The methods were tested on two real data sets, both showing a significant amount of intergene covariance but exactly opposite behaviors. Calculations below are for left-sided tail-area parameter , and center area parameter [recall (4)]. Comparisons with Efron’s [31] second-order estimator of are made.
The first data set is from the breast cancer (BRCA) study of Hedenfalk et al. [60]. These data record the expression of genes on microarrays with seven samples assigned to BRCA1 mutations and eight to BRCA2. The original research seeks to identify genuine mRNA activity differences between these two categories. In the present paper, the logarithm is applied to the mRNA levels to increase Gaussianity [61].
In a study of the human immunodeficiency virus (HIV), Van ’t Wout et al. [62] investigated genes over microarrays with four samples assigned to an HIV infected condition and the remaining four to the control. To produce the test cells, the control cells (CD4 T cell lines) were infected by the virus. The paper reports raw mRNA levels which, like the BRCA data, are converted to logarithms in the present work.
The present analysis reduces an entire expression matrix to two numbers: the center area null count, , and the omnibus parameter, . As is evident in Figure 1, the parametrization described in Section 4 is realistic. For the BRCA data, , and for HIV, . The Figure 1 caption provides details.
(a)
(b)
The next step is to compute the moments of per Lemmas 1 and 2. These calculations require parameters , , , , and . We set . These estimated moments are used to find the maximum-entropy (maxent) distribution (Appendix B). Figure 2, for example, reports the moments and the corresponding maxent distribution for the BRCA data. and exhibit strong negative correlation of −0.89, a value similar to that in [31, Table 1]. Furthermore, shows significant positive skewness, which causes to exhibit negative skewness. This is not surprising as is bounded below by zero, and yet has small mean but inflated variance. The third-moment provides an additional level of detail about the joint behavior of and .
During the maxent numerical optimization, a equispaced mesh was found sufficient for the BRCA data; however, for the HIV data the resolution had to be increased to . This is because, in addition to the larger , the HIV also exhibits more covariance. The BRCA optimization required 30 iterations, while HIV took ~70 iterations, to converge to an estimated distribution.
Figure 3 reports the estimated , where refers to the observed data. Second- and third-moment estimates are shown separately. In the framework of statistical inference such a distribution is the ultimate goal, but this result could later be used for other purposes like computing point estimates and associated confidence intervals.
(a)
(b)
If the mean of the estimated is used as a point estimate of , then for the BRCA data, third-moment calculations suggest 104 false discoveries versus 79 for the second-moment while the usual suggests only 20 false discoveries. These numbers must be put in perspective by noting that the actual count falling in the left-sided tail-area is 116. For the HIV data, the third-moment analysis found eight false discoveries compared to 19 for second-moment, while the mean estimator produces 48. This time the count in the left-sided tail area is 46. Clearly, extensive analysis of intergene dependence can lead to very different conclusions from the same data, relative to those of the mean estimate of false discoveries (and, in turn, the procedures built around it).
The second-order estimator designed by Efron [31] found 77 false discoveries for BRCA. Efron compares that to the results of nonparametric analysis in the same paper and concludes underestimation, but the issue is not further pursued. Our findings show that nonlinear dependence (as reflected in the present case by moments higher than second) is potentially very important in characterizing the null histogram.
We note in passing that the availability of permits the application of the bound as a control measure, as recommended by Lehmann and Romano [63]. It is not a trivial matter to choose and such that a fair comparison with other error measures is possible; however, for illustrative purposes, we set and . With this constraint, the present approach reports 174 discoveries (108 for second-moment) for the HIV . This compares favorably with the results of Efron [31], where the Benjamini-Hochberg procedure, with false discover rate control level 0.10 and an empirical null from [5], yields 180 discoveries. Without covariance modeling, the Benjamini-Hochberg procedure reports only 20 discoveries.
5.2. Testing on Simulated Data
Insight is gained by testing the approach on simulated data for which the “correct answer’’ is known. In the studies below, all cases are null (no treatment, residuals only). The goal is to see how well the realized left-sided tail-area count can be estimated from the center count. To maintain realism, we simulate raw mRNA levels. The testing scenario is a “two-group study,’’ so en route to -values we take the usual two-sample -statistic.
Let the mRNA expression level, , of gene measured by microarray , be distributed as the Gamma density: For , where is the Gamma function. and are called the shape parameter and the scale parameter of the distribution, respectively. This distribution is similar to the Gamma-Gamma model used by Newton et al. [64]. In (34) the shape parameter is common to all genes, while the scale parameters characterize varying mRNA levels from gene to gene, but are assumed i.i.d. as The intuition that genes with larger underlying mRNA levels would have higher variance is supported by model (34) since the mean of the th gene is and variance is .
The parameter set in (34) and (35) can be chosen on the basis of the overall gene expression histogram of real microarray data. Results for three such parameter sets: , and , are presented. These numbers were chosen to preserve the total sample variance, and the particular values are based on the HIV data of Van ’t Wout et al. [62] which were collected using Affymetrix microarrays. In particular, models variables that are exponentially distributed, models a unimodal distribution with heavy tails and a noticeable departure from Gaussianity. Case represents an approximation to a Gaussian distribution, but with slightly heavier tails.
Substantial row-wise covariance was added via the Gaussian copula technique: A matrix, say , of i.i.d. unit normal RVs was used to produce a correlated matrix, , via the mapping in which is the lower-triangular Cholesky factor (e.g., [65]) of , the correlation matrix of the actual expression matrix from the BRCA study, plus a small diagonal load to prevent singularity due to the fact that . Several other dense matrices, , generated through a different method [66], yielded similar results. This process imposes the covariance of the real BRCA data on the simulated substrate of independent Gaussian variables. The resulting elements were mapped to values, , then further transformed to simulated expression variables, , through the inverse Gamma c.d.f. as in (34). The result is the simulated expression matrix .
Figures 4 and 5 compare second- and third-moment estimates for and , respectively. In both cases, for the center area (see Section 6). For each , 800 matrices were processed. On each the approach was applied in its entirety and no additional knowledge was assumed. The a posteriori mean was used as the final estimate. The usual mean estimator consistently reported 20 for and 73 for , regardless of the particular .
(a)
(b)
(c)
(a)
(b)
(c)
Strikingly, for all three parameter sets , the third-moment skewness corrections make the estimation process more accurate. For some of the scenarios third-moment estimates saturate somewhat, but the effect is minor compared to that in the second-order approaches. To the extent that these parameter sets cover a wide range of realistic gene expression data, the practical utility of the proposed approach is evident.
6. Discussion
Advances in DNA microarray technology, improved standardization procedures, and a careful execution of laboratory protocols collectively lead to testing situations with marginally correct but strongly correlated null hypotheses. If correlation is the result of intrinsic gene-gene interactions, no experimental design can circumvent it. Correlation can cause the realized to vary significantly from case to case [63], and the control of via the usual may no longer represent the basic facts. The moment theory of the null statistic histogram can be used to deduce an estimator of which explicitly combines identifiability and covariance. Though we have explored these ideas in the differential analysis context above, the findings are quite general.
It is reasonable to question the necessity of the heavy mathematical machinery of the foregoing sections since it is possible to simulate a number of sets of -scores from the distribution , then estimate the moments. However, due to sampling fluctuations, the underlying covariance matrix is ordinarily unattainable; however, pursuing quantities like and is still possible. Also, as gets larger (~25,000 for recent microarray studies) computational demands, as well as the large number of -score sets required, become prohibitive.
Permutation calculations, as in [31, Section 4], offer an alternative way to estimate the moments. They too can run into computational difficulties, especially when the test statistic itself is computationally intensive. Further difficulty arises when samples are few. For a two-group study like HIV (four samples each condition), the data provide only 70 unique permutations.
When a direct extraction of inter-hypotheses covariance is not feasible, single omnibus parameter models remain useful in that the investigator can still use judgment to intelligently incorporate some form of covariance effect by setting a value of the parameter .
The distribution of interest resides over support domain as shown in Figure 6, and the maxent algorithm is adept at handling such complicated support regions. At a more fundamental level, through maxent, we seek to minimize the amount of unintentional prior information brought into the inference.
Apart from the numerical parameters (bin width) and the mesh resolution in maxent, the only open choice of parameterization in the present method is , the center area boundary. The selection of in this paper is based on the first eigenvector analysis of Efron [31] which suggests that (within certain approximations) the interval has completely opposite count behavior from the rest of the space.
One surprising result of this and similar studies is that more inter- covariance does not translate into more extreme covariance between variables and . In the BRCA data, for example, the coefficient between and is −0.89, while for the HIV data the covariance drops to −0.75. Further research to gain insight into this behavior would be very useful.
Finally, while covariance was viewed in the present paper as a “destructive factor’’ in the attempt to estimate , inter- covariance can, in fact, be exploited to increase power by finding a superior ranking of potential discoveries. The recent multiple testing literature has begun to address this possibility.
Ethical Approval
Human subject data used in this study are publicly available and anonymous and are, therefore, exempted from continuing Internal Review Board scrutiny according to US Health and Human Services Policy 45 CFR 36, Subpart A, 46.101 (2.b.4).
Appenddix
A. Genes and Microarrays
To understand the core statistical methods developed in this paper, it is only necessary to know some very general facts about the biological application. The “blueprint’’ for building the components of every cell in an organism (except some viruses) is encoded in macromolecules collectively called deoxyribonucleic acid or “DNA.’’ Each DNA molecule consists of a complementary pair of long sequences (polymers) of small molecular elements known as nucleotides. The characteristic “double helix’’ configuration of DNA molecules results from chemical forces as the nucleotides on the complementary polymers strongly bind to one another. Each nucleotide is built around one of the four nitrogen bases guanine, adenine, thymine and cytosine, commonly known by their initial letters G, A, T, and C. It is the sequence of these bases that encodes the information for producing proteins that, in turn, determine the structures and functions of cells. In the human genome (the complete set of life-sustaining instructions in the genetic material of an organism), the DNA consists of about three billion nucleotides organized into 23 sets of structures called chromosomes.
Genes are certain identifiable sections of the DNA—varying in length from a few hundred to a few million bases—that encode a particular trait or “hereditary unit’’ of biological information by specifying and regulating the types of proteins produced. Almost every gene is present in the DNA of every cell of a given organism, but only a relatively few genes are active in any given cell type. Genes comprise only about 1% of the DNA material in the human geneome. The reasons for the existence of the remaining 99% of the DNA remain the subject of much conjecture, hypothesis, and research. It is currently estimated that there are on the order of 25,000 genes in the human genome, 99% of which are identical among all humans. The slight variances in the other 1% of the genes (gene alleles) determine differences among individuals (e.g., eye, hair, and skin color, blood type, etc.). Apart from normal variations (alleles), mutations in genes can lead to the formation of abnormal proteins with beneficial, neutral, or negative consequences for cell function and replication.
A gene is defined by its sequence of nucleotide bases, which, in turn, can be expressed as a sequence whose elements come from the set of four letters G, A, T, and C as described above—thus reducing the information carried by the gene to a simple code like “ATCGCT’’. A three letter code (i.e., a three base set like “AGA’’), called a codon, ultimately specifies one of the 20 amino acids that are the building blocks of proteins. There are 64 possible codons, and 61 of these are used to indicate 20 amino acids, so there is redundant representation of the amino acids in the codons. The remaining three codons are used for regulating the protein synthesis.
The manifestation of genes as proteins is called gene expression. The degree to which a gene is “active’’ in a given collection of cells (e.g., a basal cell (skin) tumor) in a given set of conditions (e.g., untreated versus treated with radiation or chemotherapy) can be ascertained by measuring the levels of certain molecules [messenger RNA (mRNA) or complementary DNA (cDNA)] related to to proteins manufactured by the gene. A microarray consists of thousands of binding sites (“probes’’), each populated with DNA fragments that can be associated with particular genes. The extent to which a gene is expressed in a particular preparation determines the extent to which the related microarray site “lights up’’ as the phosphorescent mRNA or cDNA in the preparation binds to its site. A single microarray experiment can be used to simultaneously quantify the expression levels of thousands of genes in a particular tissue preparation.
The applied purpose of the statistical modeling work in this paper is to develop methods for determining from microarray data which genes are expressed at significantly different levels when a cell type is exposed to different conditions. Ultimately, this information can be used to understand normal and abnormal cell function and replication, to target genes for medical therapies, and to develop drugs for treating myriad diseases and systemic disorders at the cellular level.
B. Estimating by Maximum Entropy Optimization (MAXENT)
B.1. MAXENT Distribution
is a discrete distribution with support domain Any moment-based inference involves computation over whose increasing cardinality () makes processing difficult for a large . However, computation can be reduced substantially by truncating the domain to the set where, Chebyshev’s inequality guides the choice of parameter . For , the loss of accuracy due to truncation is negligible.
Computation can be reduced further by recognizing that distributions imposed on the basis of a small number of moment constraints often enjoy a high level of regularity so that a sparser mesh should be adequate. The computation-accuracy trade-off becomes much easier to analyze if the problem is posed as one of learning a density function over a continuous domain, say where,, and similarly for (see Figure 6). Note that the continuous domain is scaled by the factor to improve numerical stability. The moment constraints must be scaled accordingly. Let denote the space of feasible distributions . Then, for all : with . In (B.6), corresponds to the constraint , while together with imply that every has mean . We are concerned with problems for which . For convenience, we have defined the notation as a shorthand for the scaled joint central moment of and .
The selection of a unique is based on the principle of entropy maximization (MAXENT) which seeks a with maximum information entropy [67]. The information entropy essentially measures the spread of the distribution, and hence, maxent can be seen as a criterion, which, within one’s knowledge constraints, maximizes the representation of unknown information (“ignorance")—arguably, a suitable approach for statistical inference.
MAXENT seeks the following solution in : The solution takes the following exponential form: in which the are Lagrange multipliers. The derivation of the exponential form (B.8) and a procedure to determine optimal multipliers are given in the following subsection.
B.2. MAXENT Solution Details
The information entropy functional is concave [68], and the constraints in (B.6) are linear in . Thus, the problem in (B.7) is a convex program that be solved in a Lagrangian dual framework, where one works with an unconstrained upper bound that is easy to optimize (e.g., [69]). More importantly, in the present case, the framework allows the conversion of the original infinite-dimension problem of functional variation into a finite-dimension problem with as few variables as the number of constraints.
Lemma 4. The dual, , of the concave optimization problem (B.7) is given by: where is the set of Lagrange multipliers, ; is the multiplier corresponding to the constraint; and , and are defined in (B.6).
Proof. By the definition of the Lagrangian dual function Taking the functional variation of the bracketed term in (B.10) with respect to the unknown density , and using the fact that , we obtain the maximizer of (B.10) Inserting (B.11) into (B.10) yields where we have used the facts , , and from (B.6).
It is easy to verify that the Hessian, denoted , of (B.9) is positive definite and hence is convex. Suppose that is the minimum of . Then the corresponding primal solution —obtained via (B.11)—indeed maximizes (B.7). To verify this, let be the maximizer of (B.7). Then, from (B.10), for all . Now from general optimization theory, the functional variation of the Lagrangian with respect to evaluated at must be zero, which implies that can be written in the form (B.11) for some . But, then, . Consequently, , so that . Hence is the maximizer of (B.7).
Newton’s method (e.g., [70, Section 4.6]) specifies that if a multivariable function is twice differentiable and the initial value is chosen close enough to the optimal , then the sequence over indices , converges to . In (B.13), denotes the gradient of evaluated at . The parameter allows a finer control of step sizes to avoid numerical instabilities. Intuitively, at the th iteration, is replaced by its second-order Taylor expansion around and then minimized exactly, which produces the minimum . At the st iteration, becomes the point of expansion and the method continues until the desired convergence level is achieved.
The elements of the gradient are given by: where denotes the central moment of the distribution of (B.11) parameterized by . Similarly, the elements of the Hessian are given by: The gradient calculations that occur as a part of the Hessian essentially involve integration over the planar domain . Many advanced techniques for implementing numerical integration on a quadrangle like are available in the literature; however, an equispaced rectangular mesh is found to be sufficient for the present purpose. We initiate the sequence (B.13) with which implies a uniform distribution over .
Acknowledgments
This work was supported in part by the Quantitative Biology Initiative at Michigan State University. H. Wang received support from the National Science Foundation of China (Grant no. 61002003) and from the Zhejiang Provincial Natural Science Foundation of China (Grant no. Z1111051). The authors are grateful to the personnel in the MSU High Performance Computing Center for assistance in implementing the extensive simulations, and to Dr. K.H. Desai for many contributions to this work.