Abstract

Microarray data are used to determine which genes are active in response to a changing cell environment. Genes are “discovered” when they are significantly differentially expressed in the microarray data collected under the differing conditions. In one prevalent approach, all genes are assumed to satisfy a null hypothesis, $\mathcal{H}_0$, of no difference in expression. A false discovery (type I error) occurs when $\mathcal{H}_0$ is incorrectly rejected. The quality of a detection algorithm is assessed by estimating its number of false discoveries, 𝔉. Work involving second-moment modeling of the z-value histogram (representing gene expression differentials) has shown significantly deleterious effects of intergene expression correlation on the estimate of 𝔉. This paper suggests that nonlinear dependencies could likewise be important. With an applied emphasis, the paper extends the “moment framework” by including third-moment skewness corrections in an estimator of 𝔉. This estimator combines observed correlation (corrected for sampling fluctuations) with the information from easily identifiable null cases. Nonlinear-dependence modeling reduces the estimation error relative to that of linear estimation. Third-moment calculations involve empirical densities of 3×3 covariance matrices estimated using very few samples. The principle of entropy maximization is employed to connect estimated moments to the inference of 𝔉. Model results are tested with BRCA and HIV data sets and with carefully constructed simulations.

1. Introduction

This work is motivated by analytical challenges that arise in the use of microarray data to discover genes that are differentially expressed across experimental conditions such as control and treatment. Although the discussion centers around this genomics task, the developed methods are quite general and should be useful in other multiple-testing applications in which there is substantial dependence among test measures, and in which a small sample size may cause significant fluctuations in statistics employed in the testing. The specific aim of this work is to develop a reliable estimator of the number of false discoveries (type I errors—denoted 𝔉) in a multiple-testing problem in such an adverse setting.

The classic and contemporary literature in cell biology, and the more recent literature in genomics (both in print and posted on the Internet), is replete with tutorial information at all levels about cell anatomy and physiology, and genomics, as well as the microarray technology deployed in the present application. A good entry point for accessing information about contemporary developments in the genomics field is the web site of the US National Human Genome Research Institute [1]. The papers by Page et al. [2] and Wang [3] provide relatively current reviews of microarray technology and methods. A brief description of the biological aspects of the genomics application underlying this work is found in Appendix A of this paper.

In the gene-discovery application, each gene is tested against a null hypothesis, $\mathcal{H}_0$, that the gene is not differentially expressed across experimental conditions. All genes are initially assumed to satisfy $\mathcal{H}_0$ in this analysis, and 𝔉 is estimated conservatively. This “all-null’’ presumption is consistent with this application, in which $\mathcal{H}_0$ is true for the vast majority of the genes in any experiment. Beyond the gene detection problem, however, this presumption is realistic in many practical applications of large-scale testing in which the prior probability of null cases, say $\pi_0$, is large, and in which the goal is to identify a small set of interesting “nonnull’’ cases [4]. With $\pi_0 \approx 1$, it is also possible to impose identifiability (the strongly justified assumption that a given gene satisfies $\mathcal{H}_0$) on some of the genes, yielding crucial information with which to condition the estimation of 𝔉 [5].

One of the earliest reports of research using the microarray (or “gene chip’’) appeared in a paper by Schena et al. in Science in 1995 [6]. Generally speaking, research that employs the microarray to analyze gene expression data has one (or both) of the following underlying aims: the discovery of gene coexpression, or the discovery of gene differential expression. To the extent that these problems have been investigated separately, the coexpression problem has frequently been addressed by clustering methods (e.g., [7–13]), whereas differential expression has been studied using variations of classical statistical hypothesis testing (e.g., [5, 14]). Whereas differential-expression/hypothesis-testing research was, and is, concerned with expression in response to differing cell conditions (normal versus pathology, medical treatment versus control, etc.), the early coexpression/clustering research was often focused on phenotypic manifestations of the gene expression.

As the technology has matured, the dichotomy suggested above has blurred, with many current applications of the microarray involving “hybrid’’ research questions about both differential expression and coexpression. Application areas include discovery and exploration of gene regulatory systems, tissue and tumor classification, biomarker prediction, and discovery and reverse engineering of gene expression networks—not to mention the microarray’s deployment in the study of protein synthesis, metabolism, evolution, and other areas related to cell biology. Technical approaches to these problems have gone well beyond classical clustering and hypothesis-testing methods. Indeed, in the past few years, statistical (i.e., “nonclustering’’) methods to address the coexpression problem have been reported (e.g., [15, 16]), while the hybrid of the two problems—that of detecting and analyzing differentially coexpressed genes—has been researched using an ever-increasing number of methods including clustering with complex and dynamic feature selection methods, image transformation and processing of expression data, biclustering, graph and network theory, hypothesis testing, and other statistical approaches (e.g., [17–28]). In this paper, we return to the focused problem of detecting differential expression across treatment conditions, but it will become clear—as it has to the community working on this problem—that differential expression cannot be studied independently of coexpression.

Early work on classical statistical techniques for microarray-based gene discovery is summarized in the 2002 review paper by Pan [29]. Initially, it was customary to treat gene expression outcomes as realizations of independent random variables (RVs). More recent papers, however—notably, those of Owen [30], Efron [31], and Pawitan et al. [32]—caution researchers about the detrimental effects of correlated gene expressions on the validity of “discovered’’ genes. In particular, it was reported that highly correlated tests increase the variance of 𝔉 (or of its normalized counterpart, the false discovery rate, $\dot{\mathfrak{F}} \stackrel{\text{def}}{=} \mathfrak{F}/\widetilde{G}$, where $\widetilde{G}$ is the number of “discovered’’ genes), thus making estimates of 𝔉 less reliable. Moreover, high variance renders the average, $\mu_{\mathfrak{F}} \stackrel{\text{def}}{=} \mathcal{E}\{\mathfrak{F}\}$, an unreliable estimator of 𝔉 [30]. Among many causes, intergene correlation is attributable to coexpressed genes [4] and to unmodeled factors that introduce systematic effects across genes [33, 34]. As a result, for most real data, the assumption of independence or weak dependence among gene expressions is unfounded, and methods treating correlation are necessary [35, 36].

Accordingly, there has been significant recent interest in improving statistical gene detection methods in light of this detrimental correlation. For example, Storey et al. [37] present an approach to sharing information across $t$ scores, which they describe as “borrowing strength across the tests’’ for a potential increase in statistical power. Tibshirani and Wasserman [38] discuss a quantity called the “correlation-shared” $t$-statistic and derive theoretical bounds on its performance. Hu et al. [39] examine the covariance structure of the expression data and discover benefits of linking coexpression and differential expression in a distance measure—thus moving toward the “hybrid’’ problem described above.

Recent research into the hybrid differential coexpression problem has also yielded results and methods that could ultimately benefit the differential expression problem. Because the differential coexpression research is often concerned with differing phenotypes, rather than with different treatment conditions, two given research efforts involving differential coexpression might seek answers to different sets of genetic questions through expression data. Like the “dual conditions researchers,’’ however, the “phenotype’’ researchers have encountered their own forms of confounding dependencies, notably the relative gene locations, the expression time sequencing, and phase information (e.g., [40–42]). Papers have been published addressing these issues, including the exposition of new statistical approaches—for example, “CorScor’’ developed by Dettling et al. [21], the “ECF-statistic’’ of Lai et al. [22], and the “gene-set coexpression analysis’’ of Choi and Kendziorski [15]—as well as new clustering methods—for example, the web-based expression analyzer of Xiang et al. [43], the high-order preclustering method of Wong et al. [44], and the “BioSym’’ distance measure of Bandyopadhyay and Bhattacharyya [45]. A recent review of clustering methods in genomics appears in the paper by Dalton et al. [46]. A more general examination of the performance of classifiers of microarray expressions appears in the paper by Ancona et al. [47].

The present paper returns to the problem of gene discovery by statistical hypothesis testing, but with the concern for the effects of nonlinear dependencies on (the estimation of) the number of false results. Empirical work below provides cogent evidence that accounting for intergene correlation alone does not sufficiently mitigate the adverse effects of dependency. Recent work by Hu et al. [48] has shown the importance of accounting for nonlinear dependence in imputing missing values in microarray data. Modeling nonlinear dependencies is a challenging problem, and the present work makes only a modest—nevertheless, empirically significant—step into the realm of nonlinear dependence by modeling the third-moment characteristics of the quantity 𝔉. In principle, the proposed extension admits any order moment, but computational constraints limit the present developments. However, even the single step to a third-moment extension under severe sampling fluctuations is very challenging, and, in spite of this modest modeling enhancement, it is a hard-won extension yielding significantly improved estimates for a range of real and simulated examples (see Section 5).

Thus, a central finding of this work is that null statistic histogram approaches can be improved by including third-moment skewness corrections. Advancing the techniques to model higher order dependencies is challenging, but the effort could have a substantial payoff. Errors in gene detection are expensive in financial terms, but the derailing of biomedical research resulting from a false gene discovery could be profoundly costly in many ways. Even modest improvements in genomic techniques are potentially very significant.

2. Notation and Terminology

Because this paper has a practical aim, we will assume, without comment, “friendly’’ mathematical conditions such as existence of distributions, measurability, and sure convergence of integrals. Even so, the mathematical developments in this paper necessarily involve extensive notation, and we strive for consistency and clarity in its use. Quantities are generally formulated as RVs unless stated otherwise. This excludes obviously deterministic quantities like sequence indices, integers defined in the abstract (e.g., the number of microarrays, “$M$’’), and statistical expectations. Precise formulations require that probability distributions ordinarily be denoted formally as, for example, $p_{v_1,v_2}(\xi_1,\xi_2)$ for the joint distribution of RVs $v_1$ and $v_2$, but the more common abusive notation “$p(v_1,v_2)$’’ is more expedient in a few cases. The notation $p(\cdot)$ may denote either a discrete or continuous (i.e., density) distribution, depending, of course, on the RV(s) being modeled. The meaning should be clear in context. We deliberately allow this ambiguity because it avoids some notational awkwardness as discrete distributions are fitted with densities. On the other hand, the notation $\mathbb{P}(A)$ is used to denote a probability assignment to a measurable event $A$.

Many developments in this paper are centered on second-order statistical concepts. It is important to carefully define terminology used in this regard, since the vocabulary has nuanced differences across disciplines. The elementary notation for scalar RVs in Table 1 is standard and is used conventionally in this paper. A caveat arises in the discussion of related matrices, however. The term “correlation matrix’’ is used in this paper in a way consistent with its use in many statistical developments, but not in a way that is universal across disciplines. The following definitions are used throughout this paper.

Definitions 1. Consider a random vector $\mathbf{v}^T=[v_1\;\cdots\;v_G]$ with mean vector $\boldsymbol{\mu}_{\mathbf{v}}\stackrel{\text{def}}{=}\mathcal{E}\{\mathbf{v}\}$. Then, the covariance matrix associated with $\mathbf{v}$ is defined as
$$\boldsymbol{\Sigma}_{\mathbf{v}} \stackrel{\text{def}}{=} \mathcal{E}\left\{\bigl(\mathbf{v}-\boldsymbol{\mu}_{\mathbf{v}}\bigr)\bigl(\mathbf{v}-\boldsymbol{\mu}_{\mathbf{v}}\bigr)^T\right\} \in \mathbb{R}^{G\times G}, \tag{1}$$
in which the $(i,j)$ element is $\varphi(v_i,v_j)$, the covariance of $v_i$ and $v_j$. The correlation matrix of $\mathbf{v}$ is
$$\mathbf{R}_{\mathbf{v}} \stackrel{\text{def}}{=} \mathbf{S}^{-1}\,\mathcal{E}\left\{\bigl(\mathbf{v}-\boldsymbol{\mu}_{\mathbf{v}}\bigr)\bigl(\mathbf{v}-\boldsymbol{\mu}_{\mathbf{v}}\bigr)^T\right\}\mathbf{S}^{-1}, \tag{2}$$
in which the $(i,j)$ element is $\rho(v_i,v_j)$, the correlation coefficient of $v_i$ and $v_j$, and $\mathbf{S}$ is a diagonal matrix whose $(i,i)$ element is $\sigma_{v_i}$, the standard deviation of the $i$th RV, $v_i$.

In this paper, the term “correlation matrix’’ will refer to the definition in (2). By contrast, in much of the engineering literature, the outer product $\mathcal{E}\{\mathbf{v}\mathbf{v}^T\}=\boldsymbol{\Sigma}_{\mathbf{v}}+\boldsymbol{\mu}_{\mathbf{v}}\boldsymbol{\mu}_{\mathbf{v}}^T$ is called the (spatial) correlation matrix. In that case, the $(i,j)$ element of the matrix is the scalar correlation $\mathcal{E}\{v_iv_j\}$. In our definition, the elements are correlation coefficients, which are, in fact, normalized covariances. One significant implication of this fact is that the mean values of the RVs have no effect on either matrix defined above. This should be kept in mind in the developments to follow.

3. Problem Formulation

$G$ genes are to be studied using $M$ microarray experiments. The expression values are recorded in a $G\times M$ matrix, $\mathbf{X}=[x_{gm}]$. For analytical purposes, the expression quantities $x_{gm}$ are generally RVs. Each of the $M$ microarray experiments takes place under one of two conditions (indexed by $k=1$ or $2$) such as control and treatment. These two subsets of the data are called treatment groups in the paper. There are $M_k$ samples (i.e., microarrays) in treatment group $k$. Based on the evidence in $\mathbf{X}$, we seek to identify a “small’’ number, $\widetilde{G}\ll G$, of genes that are significantly differentially expressed across the two treatment groups. One widely used strategy (e.g., [5, 14]) is to posit that each of the genes, $g=1,2,\ldots,G$, satisfies a null hypothesis,
$$\mathcal{H}_{0,g}:\ \text{Gene } g \text{ is not differentially expressed in the two treatment groups.} \tag{3}$$
All $G$ genes are tested against this hypothesis using two-sample null statistics $z_1,z_2,\ldots,z_G$ [14]. The magnitudes of the $z_g$ scores establish a gene ranking, and the $\widetilde{G}$ genes with the largest scores are reported as statistically significant discoveries.
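
By way of illustration only (this sketch is not the authors' implementation), the ranking step can be realized with a per-gene two-sample $t$ statistic; the matrix and grouping names below are hypothetical.

```python
# Illustrative sketch of the gene-ranking step (hypothetical names): rank the
# rows of a G x M expression matrix by the magnitude of the two-sample t
# statistic computed between the two treatment groups.
import numpy as np
from scipy import stats

def rank_genes(X, group, n_top=50):
    """X: (G, M) expression matrix; group: length-M NumPy array of 1s and 2s."""
    t_stat, _ = stats.ttest_ind(X[:, group == 1], X[:, group == 2], axis=1)
    order = np.argsort(-np.abs(t_stat))   # genes with largest |t| first
    return order[:n_top], t_stat
```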

Clearly, the list of $\widetilde{G}$ discovered genes is only meaningful to the extent that $\mathfrak{F}$ is very small. Of course, $\mathfrak{F}$ can only be estimated, since the state of any gene (i.e., whether or not it should be “discovered’’) is unknown. Strong causal relationships among genes give rise to highly correlated $z_g$ scores and greatly complicate the estimation of $\mathfrak{F}$ [30, 31]. Moreover, in spite of their declining cost, microarrays are still a relatively expensive technology. Consequently, the number of microarrays, $M$, in an experiment is usually smaller than the number of genes, $G$, by as much as four orders of magnitude. Typically, existing microarrays record expression data for at least a few thousand genes. The fact that $M\ll G$ further complicates the problem because knowledge about the underlying gene-gene correlation structure is critically sparse in the observations. At the same time, the consequences of correlation on differential analysis cannot be overlooked [35]. In fact, the present work will suggest that even nonlinear dependencies must be accounted for in order to properly estimate $\mathfrak{F}$. Theoretical justifications for this contention are given momentarily.

This paper develops a moment-based estimator of $\mathfrak{F}$ by giving the null $z$ histogram a stochastic interpretation. The observed null counts are viewed as realizations of a more fundamental random model shaped by inter-$z_g$ dependence. A small zero-symmetric bin in the space of $z$ statistics is designated as the center area, and it is posited that no $z_g$ scores from “nonnull’’ genes fall in this range. Then, the null count in the center area, say $C$, is observable, and by conditioning $\mathfrak{F}$ on $C$, the variance of $\mathfrak{F}$ can be reduced significantly [5, 49]. We relate $\mathfrak{F}$ to $C$ through the discrete joint distribution $p(\mathfrak{F},C)$. To obtain an approximation of $p(\mathfrak{F},C)$, we estimate its first three moments and then fit the maximum entropy function (Appendix B). This approach inherently yields an estimate of the conditional distribution $p(\mathfrak{F}\mid C)$. An estimate of the distribution of $\dot{\mathfrak{F}}$ is theoretically more useful than a point estimate because of the noisy nature of large-scale inferences [30]. Compared with histogram-curve-fitting techniques like the empirical null [5], however, the present approach enjoys the attractive feature that covariance is separately estimated and then explicitly incorporated into the inference.

Efron [31] reports that RVs 𝔉 and 𝐶 are found to be extremely negatively correlated in a number of real experiments. He provides an explanation for this finding, then employs these insights to develop a Poisson-model-based second-order estimator of 𝔉 which, like the present approach, relies on the center area concept. While Efron’s work is extremely important, his own research has gone on to show that purely second-order 𝔉 estimates suffer from over- and under-estimation effects. The second-moment estimates of 𝔉 reported later in the present paper (see Section 5), as well as those in the cited Efron paper, all show these adverse effects. There are three contributory factors: (i) 𝔉 is bounded below by zero, (ii) the mean of 𝔉 is small, and (iii) intergene covariance causes the variance of 𝔉 to inflate. All of these factors suggest that skewness corrections—reflecting nonlinear dependence—are vital.

4. Methods

4.1. Moments of the Joint Distribution 𝑃(𝔉,𝐶)
4.1.1. Count Models

The process begins by transforming the test $t$ statistics to $z$ values as $z_g = P_{\mathcal{G}_u}^{-1}\{P_0(t_g)\}$, $g=1,\ldots,G$, where $P_0$ is the putative null cumulative distribution function (c.d.f.) of the test statistic and $P_{\mathcal{G}_u}^{-1}$ is the inverse c.d.f. of the unit normal density, $p_{\mathcal{G}_u}\sim\mathcal{G}(0,1)$. The $z$ values, modeled as RVs, provide the analytical convenience of multivariate normal form while describing the joint $t$-statistic behavior. We formally define the fundamental quantities
$$\mathfrak{F} \stackrel{\text{def}}{=} \#\left\{z_g \,:\, z_g \le -\delta,\ \mathcal{H}_{0,g}\ \text{is true}\right\},\qquad C \stackrel{\text{def}}{=} \#\left\{z_g \,:\, |z_g| \le c,\ \mathcal{H}_{0,g}\ \text{is true}\right\}, \tag{4}$$
in which $\#\{\mathcal{S}\}$ denotes the number of elements in the discrete set $\mathcal{S}$. $\mathcal{Z}$ is the sample space of $z$ values. The interval $\mathcal{Z}_C \stackrel{\text{def}}{=} \{z\in\mathcal{Z} : |z|<c\}$ corresponding to count $C$ is called the center area, and the semi-infinite interval $\mathcal{Z}_{\mathfrak{F}} \stackrel{\text{def}}{=} \{z\in\mathcal{Z} : z\le-\delta\}$ associated with count $\mathfrak{F}$ is the left tail area. For proper comparison with Efron’s results [31], we work with a left tail area; however, the present approach can employ right- or double-sided tail areas equally well.
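
The following sketch illustrates the $t$-to-$z$ conversion and the two counts of (4), under an assumed Student-$t$ null with `df` degrees of freedom; all names and defaults are illustrative rather than taken from the paper.

```python
# Sketch of the t-to-z mapping z_g = Phi^{-1}(P_0(t_g)) and the tail/center
# counts of (4); `df`, `delta`, and `c` are illustrative parameter names.
import numpy as np
from scipy import stats

def z_values(t_stat, df):
    return stats.norm.ppf(stats.t.cdf(t_stat, df))

def tail_and_center_counts(z, delta=2.5, c=1.0):
    F = int(np.sum(z <= -delta))        # left tail area count
    C = int(np.sum(np.abs(z) <= c))     # center area count
    return F, C
```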

The premise that very few nonnull $z_g$ scores fall in $\mathcal{Z}_C$—and, hence, that $C$ is practically observable—is of prime importance. A similar assumption plays a central role in the literature on estimating the proportion of null genes, as in, for example, the papers by Efron [31], Pawitan et al. [32], and Langaas et al. [50]. The empirical null approach [5] relies on similar reasoning. We exploit the observability of $C$ to: (i) estimate the moments of $p(\mathfrak{F},C)$, (ii) use them to infer the distribution estimate $\hat{p}(\mathfrak{F},C)$, and then (iii) report (an estimate of) $p(\mathfrak{F}\mid C)$, which in turn could be used to find an estimator of $\mathfrak{F}$ conditioned upon $C$. Initially, all cases are treated as null. Improvement is possible by estimating $\pi_0$ [50, 51].

4.1.2. Assumptions

The following assumptions underlie these developments:
(1) $\pi_0$ is large, say $\pi_0 \ge 0.9$ (Efron discusses this bound in [4]).
(2) $z_g$ is a unit normal variate [$\sim\mathcal{G}(0,1)$] for all $g=1,2,\ldots,G$.
(3) The $z$ scores are jointly Gaussian to the third order (but not uncorrelated).
(4) Recall that $x_{gm}$ [element $(g,m)$ of the expression matrix $\mathbf{X}$] denotes (the RV model for) the expression of gene $g$ on microarray $m$. Let $x_g$ denote the marginal RV for $x_{gm}$, that is, the model for the expression outcomes of gene $g$. The realizations of $x_g$ are the elements of row $g$ of an observed $\mathbf{X}$. Then, it is assumed that $\varphi(z_g,z_{g'}) = \rho(z_g,z_{g'}) \approx \rho(x_g,x_{g'})$ for all $g$, $g'$ (justified below).

4.1.3. The 𝑧-Value Histogram

It is convenient and computationally efficient to obtain the moments of $p(\mathfrak{F},C)$ using the moments of the $z$-value histogram. We seek central moments because they facilitate working with the maximum entropy distribution (Appendix B). The moment estimation is carried out as follows. $\mathcal{Z}$ is partitioned into $B$ disjoint bins, $\mathcal{Z}=\bigcup_{b=1}^{B}\mathcal{Z}_b$, where the $b$th bin has center $z^{[b]}$ and width $\Delta$ (constant with $b$). Then, the $z$-histogram bin counts are
$$Y_b = \#\left\{z_g \in \mathcal{Z}_b\right\} = \sum_{g=1}^{G} I_b(z_g),\qquad b=1,\ldots,B, \tag{5}$$
in which $I_b(z_g)$ is the indicator function for the event “score $z_g$ falls in bin $\mathcal{Z}_b$.’’

Consider bin $b$ with count $Y_b$ for some $b\in\{1,\ldots,B\}$. The mean of count $Y_b$ is
$$\mu_{Y_b} \stackrel{\text{def}}{=} \mathcal{E}\{Y_b\} = \sum_{g=1}^{G}\mathcal{E}\{I_b(z_g)\} = \sum_{g=1}^{G}\mathbb{P}\{z_g\in\mathcal{Z}_b\} = G\int_{z^{[b]}-\Delta/2}^{z^{[b]}+\Delta/2} p_{\mathcal{G}_u}(\xi)\,d\xi \;\approx\; \hat{\mu}_{Y_b} \stackrel{\text{def}}{=} G\,\Delta\,p_{\mathcal{G}_u}\bigl(z^{[b]}\bigr),\qquad\text{where } p_{\mathcal{G}_u}(\xi)\sim\mathcal{G}(0,1). \tag{6}$$
The second-order joint central moment (covariance) of the pair $(Y_b,Y_{b'})$, where $b$ may equal $b'$, is
$$\varphi(Y_b,Y_{b'}) \stackrel{\text{def}}{=} \mathcal{E}\left\{\bigl(Y_b-\mu_{Y_b}\bigr)\bigl(Y_{b'}-\mu_{Y_{b'}}\bigr)\right\} = \mathcal{E}\Biggl\{\sum_{g=1}^{G} I_b(z_g)\sum_{g'=1}^{G} I_{b'}(z_{g'})\Biggr\} - \mu_{Y_b}\mu_{Y_{b'}} = \sum_{g\ne g'}\mathbb{P}\bigl\{z_g\in\mathcal{Z}_b,\,z_{g'}\in\mathcal{Z}_{b'}\bigr\} + \sum_{g}\mathbb{P}\bigl\{z_g\in\mathcal{Z}_b,\,z_g\in\mathcal{Z}_{b'}\bigr\} - \mu_{Y_b}\mu_{Y_{b'}}. \tag{7}$$
Because of the bivariate normality of $z_g$ and $z_{g'}$, (7) can be approximated by
$$\hat{\varphi}(Y_b,Y_{b'}) \stackrel{\text{def}}{=} \sum_{g\ne g'}\Delta^2\,p_{\mathcal{G}}\bigl(z^{[b]},z^{[b']};\mathbf{0},\boldsymbol{\Sigma}_z^{[g,g']}\bigr) - \hat{\mu}_{Y_b}\hat{\mu}_{Y_{b'}} + \delta_{bb'}\,\hat{\mu}_{Y_b}, \tag{8}$$
where $\delta_{bb'}$ is the Kronecker delta, and $p_{\mathcal{G}}(\zeta_1,\zeta_2;\mathbf{0},\boldsymbol{\Sigma}_z^{[g,g']})\sim\mathcal{G}(\mathbf{0},\boldsymbol{\Sigma}_z^{[g,g']})$ is the bivariate Gaussian density with mean vector $\mathbf{0}=[0\;0]^T$ and covariance matrix (equivalent to the correlation matrix in this case)
$$\boldsymbol{\Sigma}_z^{[g,g']} \stackrel{\text{def}}{=} \mathcal{E}\left\{\begin{bmatrix} z_g \\ z_{g'} \end{bmatrix}\begin{bmatrix} z_g & z_{g'} \end{bmatrix}\right\} = \begin{bmatrix} 1 & \rho(z_g,z_{g'}) \\ \rho(z_g,z_{g'}) & 1 \end{bmatrix} = \mathbf{R}_z^{[g,g']}. \tag{9}$$
That is, defining the vector of arguments $\boldsymbol{\zeta}\stackrel{\text{def}}{=}[\zeta_1\;\zeta_2]^T$,
$$p_{\mathcal{G}}\bigl(\boldsymbol{\zeta};\mathbf{0},\boldsymbol{\Sigma}_z^{[g,g']}\bigr) = \frac{1}{2\pi\bigl|\boldsymbol{\Sigma}_z^{[g,g']}\bigr|^{1/2}}\exp\left(-\frac12\,\boldsymbol{\zeta}^T\bigl[\boldsymbol{\Sigma}_z^{[g,g']}\bigr]^{-1}\boldsymbol{\zeta}\right) = \frac{1}{2\pi\sqrt{1-\rho^2(z_g,z_{g'})}}\exp\left(-\frac{\zeta_1^2 - 2\rho(z_g,z_{g'})\,\zeta_1\zeta_2 + \zeta_2^2}{2\bigl[1-\rho^2(z_g,z_{g'})\bigr]}\right), \tag{10}$$
where $|\cdot|$ denotes the determinant. In the approximation (8), the density in (10) is evaluated at $\zeta_1=z^{[b]}$ and $\zeta_2=z^{[b']}$.

As reflected in the second line of (10), the covariance matrix $\boldsymbol{\Sigma}_z^{[g,g']}$ is fully specified by a single scalar parameter, $\rho(z_g,z_{g'})$, for each $g,g'$ pair. Thus, we can express the Gaussian density in (10) as being parameterized by this correlation coefficient for the given $z$-score pair, $p_{\mathcal{G}}[\boldsymbol{\zeta};\mathbf{0},\rho(z_g,z_{g'})]$. Now, suppose that we can derive an empirical density [52], say $q_{\rho_z}(\cdot)$, over the interval $[-1,1]$, fitted to the discrete set of $G^2$ correlation coefficients, $\{\rho(z_g,z_{g'}),\,1\le g,g'\le G\}$. This empirical distribution allows the summation over gene indices in (8) to be replaced by a continuous computation. This smoothed computation represents a further approximation of $\varphi(Y_b,Y_{b'})$, which is stated in the form of a lemma below. This result is similar to those of Owen [30, Theorem 1] and Efron [31, Lemma 2].

Lemma 1. Let $q_{\rho_z}(\xi)$ denote the empirical density of $\rho_z$, derived from the $G^2$ $z$-pair sample correlation coefficients, $\hat{\rho}(z_g,z_{g'})$, $1\le g,g'\le G$. Then, the second joint central moment of a histogram count pair $(Y_b,Y_{b'})$, where $b$ may equal $b'$, is approximated by
$$\varphi(Y_b,Y_{b'}) \approx \hat{\varphi}(Y_b,Y_{b'}) \stackrel{\text{def}}{=} [G!]_2\,\Delta^2\,Q_{\rho_z}\bigl(z^{[b]},z^{[b']}\bigr) - \hat{\mu}_{Y_b}\hat{\mu}_{Y_{b'}} + \delta_{bb'}\,\hat{\mu}_{Y_b}, \tag{11}$$
where $[G!]_\ell \stackrel{\text{def}}{=} G(G-1)\cdots(G-[\ell-1])$ for $1\le\ell\le G$, and
$$Q_{\rho_z}(\zeta_1,\zeta_2) \stackrel{\text{def}}{=} \int_{-1}^{+1}\frac{q_{\rho_z}(\xi)}{2\pi\sqrt{1-\xi^2}}\exp\left(-\frac{\zeta_1^2 - 2\xi\zeta_1\zeta_2 + \zeta_2^2}{2\bigl(1-\xi^2\bigr)}\right)d\xi. \tag{12}$$
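
Because (12) is a one-dimensional integral, it can be evaluated with ordinary quadrature. The sketch below is an assumed illustration (not the paper's code) that accepts any candidate density $q$ on $[-1,1]$, for example, the Beta-family form fitted in Section 4.2.3.

```python
# Numerical sketch of Q_{rho_z}(zeta1, zeta2) in (12): the unit-variance
# bivariate Gaussian density at (zeta1, zeta2) with correlation xi, averaged
# over an empirical correlation density q(xi) on [-1, 1].
import numpy as np
from scipy.integrate import quad

def Q_rho(zeta1, zeta2, q, eps=1e-8):
    def integrand(xi):
        one_m = 1.0 - xi * xi
        coef = q(xi) / (2.0 * np.pi * np.sqrt(one_m))
        expo = -(zeta1**2 - 2.0 * xi * zeta1 * zeta2 + zeta2**2) / (2.0 * one_m)
        return coef * np.exp(expo)
    value, _ = quad(integrand, -1.0 + eps, 1.0 - eps)
    return value
```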

Further, the third joint central moment of a triplet $(Y_b,Y_{b'},Y_{b''})$, where two or more indices may be equal, is
$$\gamma(Y_b,Y_{b'},Y_{b''}) \stackrel{\text{def}}{=} \mathcal{E}\left\{\bigl(Y_b-\mu_{Y_b}\bigr)\bigl(Y_{b'}-\mu_{Y_{b'}}\bigr)\bigl(Y_{b''}-\mu_{Y_{b''}}\bigr)\right\} = \mathcal{E}\bigl\{Y_bY_{b'}Y_{b''}\bigr\} - \mu_{Y_b}\mu_{Y_{b'}}\mu_{Y_{b''}} - \Bigl[\mu_{Y_b}\varphi(Y_{b'},Y_{b''}) + \mu_{Y_{b'}}\varphi(Y_b,Y_{b''}) + \mu_{Y_{b''}}\varphi(Y_b,Y_{b'})\Bigr], \tag{13}$$
where
$$\mathcal{E}\bigl\{Y_bY_{b'}Y_{b''}\bigr\} = \mathcal{E}\Biggl\{\sum_{g=1}^{G} I_b(z_g)\sum_{g'=1}^{G} I_{b'}(z_{g'})\sum_{g''=1}^{G} I_{b''}(z_{g''})\Biggr\} = \sum_{g\ne g'\ne g''}\mathbb{P}\bigl\{z_g\in\mathcal{Z}_b,\,z_{g'}\in\mathcal{Z}_{b'},\,z_{g''}\in\mathcal{Z}_{b''}\bigr\} + \delta_{bb'}\sum_{g\ne g''}\mathbb{P}\bigl\{z_g\in\mathcal{Z}_b,\,z_{g''}\in\mathcal{Z}_{b''}\bigr\} + \delta_{b'b''}\sum_{g\ne g'}\mathbb{P}\bigl\{z_g\in\mathcal{Z}_b,\,z_{g'}\in\mathcal{Z}_{b'}\bigr\} + \delta_{bb''}\sum_{g\ne g'}\mathbb{P}\bigl\{z_g\in\mathcal{Z}_b,\,z_{g'}\in\mathcal{Z}_{b'}\bigr\} + \delta_{bb'b''}\sum_{g}\mathbb{P}\bigl\{z_g\in\mathcal{Z}_b\bigr\}, \tag{14}$$
in which $\delta_{bb'b''}$ is the three-index Kronecker delta over $\mathbb{N}\times\mathbb{N}\times\mathbb{N}$ (unity only when $b=b'=b''$).

The assumed trivariate normality of the $z$ scores implies that the joint distribution for each score triplet is specified by a $3\times3$ covariance matrix. Let us denote the covariance (equivalent to correlation) matrix for the $z$-value triplet $(z_g,z_{g'},z_{g''})$ by
$$\mathbf{R}_z^{[g,g',g'']} = \mathcal{E}\left\{\begin{bmatrix} z_g \\ z_{g'} \\ z_{g''} \end{bmatrix}\begin{bmatrix} z_g & z_{g'} & z_{g''} \end{bmatrix}\right\} = \begin{bmatrix} 1 & \rho(z_g,z_{g'}) & \rho(z_g,z_{g''}) \\ \rho(z_g,z_{g'}) & 1 & \rho(z_{g'},z_{g''}) \\ \rho(z_g,z_{g''}) & \rho(z_{g'},z_{g''}) & 1 \end{bmatrix}. \tag{15}$$
For each $z$-score triplet, $\mathbf{R}_z$ is an element of the space—call it $\mathcal{R}_3$—of all symmetric positive-semidefinite matrices in $\mathbb{R}^{3\times3}$ with element magnitudes no greater than unity. $\mathbf{R}_z$ is continuously distributed over $\mathcal{R}_3$.

Again, we need an empirical way to compute the joint moments of the $z$ scores. Let $q_{\mathbf{R}_z}(\boldsymbol{\Xi})$ be the empirical density of the $G^3$ correlation matrices, $\mathbf{R}_z$. This density must be inferred from the observed data. Of practical importance is the fact that, although each $\mathbf{R}_z$ is distributed over a subspace of $\mathbb{R}^{3\times3}$, the domain of $q_{\mathbf{R}_z}$ is effectively a three-dimensional manifold of that space because each covariance matrix is unambiguously determined by its three values above (or below) its main diagonal, the diagonal elements being unity. The argument $\boldsymbol{\Xi}$ may be thought of as a vector of these three elements (over the continuum of allowable values), but we will continue to denote it as a matrix as a reminder of its association with $\mathbf{R}_z$. We have, in conjunction with (6)–(14), the following useful approximation.

Lemma 2. The third-order joint central moment of a histogram count triplet $(Y_b,Y_{b'},Y_{b''})$, where two or more indices may be equal, is approximated by
$$\gamma(Y_b,Y_{b'},Y_{b''}) \approx \hat{\gamma}(Y_b,Y_{b'},Y_{b''}) \stackrel{\text{def}}{=} [G!]_3\,\Delta^3\,Q_{\mathbf{R}_z}\bigl(z^{[b]},z^{[b']},z^{[b'']}\bigr) + [G!]_2\,\Delta^2\Bigl[\delta_{bb'}\,Q_{\rho_z}\bigl(z^{[b]},z^{[b'']}\bigr) + \delta_{b'b''}\,Q_{\rho_z}\bigl(z^{[b]},z^{[b']}\bigr) + \delta_{bb''}\,Q_{\rho_z}\bigl(z^{[b]},z^{[b']}\bigr)\Bigr] + \delta_{bb'b''}\,\hat{\mu}_{Y_b} - \hat{\mu}_{Y_b}\hat{\mu}_{Y_{b'}}\hat{\mu}_{Y_{b''}} - \Bigl[\hat{\mu}_{Y_b}\hat{\varphi}(Y_{b'},Y_{b''}) + \hat{\mu}_{Y_{b'}}\hat{\varphi}(Y_b,Y_{b''}) + \hat{\mu}_{Y_{b''}}\hat{\varphi}(Y_b,Y_{b'})\Bigr], \tag{16}$$
where
$$Q_{\mathbf{R}_z}(\zeta_1,\zeta_2,\zeta_3) = \int_{\mathcal{R}_3}\frac{q_{\mathbf{R}_z}(\boldsymbol{\Xi})}{(2\pi)^{3/2}\,|\boldsymbol{\Xi}|^{1/2}}\exp\left(-\frac12\,[\zeta_1\;\zeta_2\;\zeta_3]\,\boldsymbol{\Xi}^{-1}\,[\zeta_1\;\zeta_2\;\zeta_3]^T\right)d\boldsymbol{\Xi}, \tag{17}$$
$\hat{\mu}$, $Q_{\rho_z}$, $\hat{\varphi}$, and $[G!]_\ell$ are defined in (6) and (11), and $\Delta$ is the $z$-histogram bin width.

To obtain the moments of $p(\mathfrak{F},C)$, it is simply necessary to combine the moments of the corresponding $Y_b$ counts. For example,
$$\sigma^2_{\mathfrak{F}} = \mathcal{E}\left\{\bigl(\mathfrak{F}-\mu_{\mathfrak{F}}\bigr)^2\right\} \approx \sum_{b,b'\,:\,\mathcal{Z}_b,\mathcal{Z}_{b'}\subset\mathcal{Z}_{\mathfrak{F}}} \hat{\varphi}(Y_b,Y_{b'}),\qquad \sigma^2_{C} = \mathcal{E}\left\{\bigl(C-\mu_C\bigr)^2\right\} \approx \sum_{b,b'\,:\,\mathcal{Z}_b,\mathcal{Z}_{b'}\subset\mathcal{Z}_C} \hat{\varphi}(Y_b,Y_{b'}),\qquad \varphi(C,\mathfrak{F}) = \mathcal{E}\left\{\bigl(C-\mu_C\bigr)\bigl(\mathfrak{F}-\mu_{\mathfrak{F}}\bigr)\right\} \approx \sum_{b,b'\,:\,\mathcal{Z}_b\subset\mathcal{Z}_{\mathfrak{F}},\,\mathcal{Z}_{b'}\subset\mathcal{Z}_C} \hat{\varphi}(Y_b,Y_{b'}). \tag{18}$$
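
As an illustration of how (18) might be assembled in practice, the following sketch sums precomputed bin covariances over the tail and center bins; the array names are assumptions, not the authors' code.

```python
# Sketch of assembling (18) from precomputed bin covariances: `phi` is a
# (B, B) array of approximated covariances phi_hat(Y_b, Y_b'); `tail` and
# `center` are index arrays of the bins lying in Z_F and Z_C, respectively.
import numpy as np

def moments_from_bins(phi, tail, center):
    var_F = phi[np.ix_(tail, tail)].sum()
    var_C = phi[np.ix_(center, center)].sum()
    cov_FC = phi[np.ix_(tail, center)].sum()
    return var_F, var_C, cov_FC
```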

The key quantities in the lemmas for approximating moments are the empirical covariance densities. Obtaining these densities in the presence of severe sampling errors is discussed next.

4.2. Empirical Correlation Densities
4.2.1. Approach

Because of severe sampling fluctuations, the current methods can recover only 𝑞𝜌𝑧(𝜉) from the data. This density is then used to estimate 𝑞𝐑𝑧(Ξ). For this purpose, as well as to facilitate the calculations of Lemmas 1 and 2, we seek to parameterize the requisite densities. For most real examples, a single omnibus parameter 𝛼 is found to be sufficient.

4.2.2. Data Normalization

For all-null false discovery rate calculations, normalization of the per-microarray expression results has been found to be beneficial [31, Remark E]. The columns of the data matrix 𝐗 are standardized to mean zero and unity variance. This standardization normalizes output “brightness’’ among microarrays [53, 54]. It also forces the sum of covariances, and approximately the sum of correlations, to be zero. This permits the fitting of a zero symmetric density to 𝑞𝜌𝑧(𝜉), which, in turn, has profound consequences for the form of 𝑞𝐑𝑧(Ξ).

Formally, let 𝐗𝑜 denote the residual expression matrix, obtained by subtracting from 𝐗 each gene’s average response within each treatment group, and let 𝑥𝑜𝑔 denote the marginal random variable modeling the residual expression outcomes for gene 𝑔 [like 𝑥𝑔 of Assumption (4), p. 5]. All further discussion of gene expression values will refer to these normalized residual values.
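
A minimal sketch of this preprocessing, assuming a NumPy matrix with genes in rows and a group-label vector (all names are illustrative):

```python
# Sketch of the normalization above: standardize each microarray (column) to
# zero mean and unit variance, then subtract each gene's within-group mean
# to form the residual matrix X_o.
import numpy as np

def residual_matrix(X, group):
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # per-array standardization
    Xo = Xs.copy()
    for k in np.unique(group):
        cols = (group == k)
        Xo[:, cols] -= Xs[:, cols].mean(axis=1, keepdims=True)
    return Xo
```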

4.2.3. Obtaining 𝑞𝜌𝑧(𝜉)

The empirical densities 𝑞𝜌𝑧 and 𝑞𝐑𝑧, as well as others to be introduced below, clearly play a key role in moment estimation above. In each case, the empirical density—a surrogate for the true statistical density of the correlation coefficient(s) being modeled—is a distribution of a correlation function or matrix over a continuum, but it must be inferred from the data samples.

To deduce $q_{\rho_z}(\xi)$, we require $q_{\rho_x}(\xi)$—the empirical density of the $G^2$ gene-expression correlation coefficients. The mapping between the domains of $q_{\rho_z}$ and $q_{\rho_x}$ is needed, in principle, to calibrate $q_{\rho_z}$. However, for the usual two-sample $t$-statistic, assuming independent columns in $\mathbf{X}^o$, $\rho(z_g,z_{g'}) \approx \rho(x_g,x_{g'})$ [recall Assumption (4), p. 5]; hence, $q_{\rho_z}(\xi) \approx q_{\rho_x}(\xi)$. We make the assumption of equality of these densities below.

Let $\hat{\varphi}(x^o_g,x^o_{g'})$ denote the sample covariance between rows (genes) $g$ and $g'$ of $\mathbf{X}^o$. For convenience, we define the notation
$$\hat{\rho}_{gg'} \stackrel{\text{def}}{=} \hat{\rho}\bigl(x^o_g,x^o_{g'}\bigr) = \frac{\hat{\varphi}\bigl(x^o_g,x^o_{g'}\bigr)}{\sqrt{\hat{\varphi}\bigl(x^o_g,x^o_g\bigr)\,\hat{\varphi}\bigl(x^o_{g'},x^o_{g'}\bigr)}}. \tag{19}$$
These are the values to be fit with the density $q_{\rho_x}(\xi)$.

To reduce the variability added by sampling errors, we apply the Fisher transform
$$\hat{\tau}_{gg'} = \frac12\log\frac{1+\hat{\rho}_{gg'}}{1-\hat{\rho}_{gg'}}. \tag{20}$$
For bivariate normal samples, the Fisher transform has remarkable normalizing and variance-stabilizing properties [55], and each $\hat{\tau}_{gg'}$ is well modeled by the distribution $\hat{\tau}_{gg'}\sim\mathcal{G}(\tau_{gg'},\sigma^2_{\varepsilon})$, where $\tau_{gg'}$ is the Fisher-transformed underlying correlation coefficient and $\sigma^2_{\varepsilon}$ is the sampling variance on the transformed scale [approximately $(M-3)^{-1}$ for $M$ bivariate normal sample pairs]. Assuming a sampling model
$$\hat{\tau}_{gg'} = \tau_{gg'} + \varepsilon;\qquad \tau_{gg'}\sim p_{\tau}(\xi), \tag{21}$$
where $p_{\tau}$ is the distribution of the Fisher-transformed underlying correlation coefficients, we can interpret the histogram of Fisher-transformed sample correlations, say the “$\hat{\tau}_{gg'}$-histogram,’’ as a convolution of $p_{\tau}(\xi)$, the statistical correlation density on the scale resulting from the Fisher transform, and the histogram of sampling errors, say the “$\varepsilon$-histogram,’’ also on the $\tau$-scale. Then, the underlying $p_{\tau}(\xi)$ is obtained by deconvolving the error density from the convolved pair, $p_{\hat{\tau}}(\xi)=p_{\tau}(\xi)*p_{\varepsilon}(\xi)$. For a wide variety of microarray data sets studied in this work (also see [30]), the normal distribution $\mathcal{G}(0,\hat{\sigma}^2)$ fits the $\hat{\tau}_{gg'}$ histogram nicely. For bivariate normal samples, where $\varepsilon\sim\mathcal{G}(0,\sigma^2_{\varepsilon})$, the estimate of $p_{\tau}(\xi)$, say $\hat{p}_{\tau}(\xi)$, takes the normal form $\mathcal{G}(0,\hat{\sigma}^2-\sigma^2_{\varepsilon})$. It is this estimate that will serve as the empirical density of the Fisher-transformed correlations, $q_{\tau}\equiv\hat{p}_{\tau}$.
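
A compact sketch of this deconvolution step under the assumptions above (names are illustrative; the nominal sampling variance $(M-3)^{-1}$ is the standard Fisher-transform value for $M$ bivariate normal pairs):

```python
# Sketch of the Fisher-transform deconvolution: fit a zero-mean normal to the
# transformed sample correlations and subtract the nominal sampling variance
# to estimate the variance of q_tau.
import numpy as np

def qtau_variance(rho_hat, M):
    """rho_hat: array of pairwise sample correlations (g != g');
    M: number of samples behind each correlation."""
    tau_hat = 0.5 * np.log((1.0 + rho_hat) / (1.0 - rho_hat))  # Fisher transform (20)
    var_eps = 1.0 / (M - 3.0)            # nominal sampling variance on the tau scale
    return max(np.var(tau_hat) - var_eps, 0.0)
```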

Having obtained the underlying $q_{\tau}(\xi)$, we must, in principle, undo the mapping (20) to obtain $q_{\rho_x}(\xi)$, then deduce $q_{\rho_z}$ from $q_{\rho_x}$. Recall, however, that we assume that the correlation coefficients of the $z$ and $x^o$ variables are identical [31], so that we may directly seek $q_{\rho_z}(\xi)=q_{\rho_x}(\xi)$ from $q_{\tau}(\xi)$. A distribution that fits the inverse-transformed $q_{\tau}(\xi)$ well is
$$q_{\rho_z}(\xi) \propto \bigl(1-\xi^2\bigr)^{\alpha-1} \propto \left[\tfrac12(\xi+1)\right]^{\alpha-1}\left[1-\tfrac12(\xi+1)\right]^{\alpha-1},\qquad |\xi|\le1, \tag{22}$$
a class of densities in the general Beta distribution family given by
$$p_B(\xi;\alpha,\beta) = \frac{1}{B(\alpha,\beta)}\,\xi^{\alpha-1}(1-\xi)^{\beta-1},\qquad 0\le\xi\le1, \tag{23}$$
where $B$ is the Beta function and $\alpha$ and $\beta$ are nonnegative shape parameters. Comparing (22) and (23) gives a useful probabilistic interpretation of the correlation coefficient, say $\tilde{\rho}_z$, modeled by the empirical density $q_{\rho_z}$: we see from (22) that $\tfrac12(\tilde{\rho}_z+1)\sim p_B(\,\cdot\,;\alpha,\alpha)$. Therefore, $\sigma^2_{\tilde{\rho}_z}$ will be a factor of four greater than the variance of a Beta-distributed random variable with parameters $\alpha=\beta$. That is,
$$\sigma^2_{\tilde{\rho}_z} = 4\,\frac{\alpha^2}{(2\alpha)^2(2\alpha+1)} = \frac{1}{2\alpha+1}\;\Longrightarrow\;\alpha = \frac{1-\sigma^2_{\tilde{\rho}_z}}{2\,\sigma^2_{\tilde{\rho}_z}}. \tag{24}$$
Thus, using $\hat{\sigma}^2_{\rho_z}$ as an estimate of $\sigma^2_{\tilde{\rho}_z}$, we obtain the parameter $\alpha$ and, hence, the distribution $q_{\rho_z}$.
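
For concreteness, a sketch of the calibration implied by (22)–(24): invert (24) to obtain $\alpha$ from a variance estimate and evaluate the fitted density (all names are illustrative assumptions):

```python
# Sketch of calibrating the omnibus parameter: alpha from the variance of the
# (inverse-transformed) correlation density via (24), and evaluation of the
# fitted density q_{rho_z} in (22), with (rho+1)/2 ~ Beta(alpha, alpha).
import numpy as np
from scipy.special import beta as beta_fn

def alpha_from_variance(var_rho):
    return (1.0 - var_rho) / (2.0 * var_rho)          # inversion of (24)

def q_rho_fitted(xi, alpha):
    u = 0.5 * (xi + 1.0)
    return 0.5 * u**(alpha - 1.0) * (1.0 - u)**(alpha - 1.0) / beta_fn(alpha, alpha)
```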

4.2.4. Obtaining 𝑞𝑅𝑥(Ξ)

We now pursue $q_{\mathbf{R}_x}(\boldsymbol{\Xi})$ as an extension of $q_{\rho_x}(\xi)$. Like that of $q_{\mathbf{R}_z}$, the effective domain of $q_{\mathbf{R}_x}$ is of only three dimensions. Also similarly to the scalar density, for the two-sample $t$-statistic, $q_{\mathbf{R}_z}(\boldsymbol{\Xi}) \approx q_{\mathbf{R}_x}(\boldsymbol{\Xi})$. Hence, we can pursue $q_{\mathbf{R}_z}$ indirectly by finding $q_{\mathbf{R}_x}$.

We seek a joint distribution on the space of all $3\times3$ correlation matrices such that all the inherent marginal distributions [i.e., the distributions of $\rho(x^o_g,x^o_{g'})$ for $g\ne g'$] are equivalent to $q_{\rho_x}(\xi)$. Such a result can be obtained from the inverse-Wishart distribution, whose marginalization properties are helpful when studying a subset of variables [56]. Suppose that the true statistical covariance matrix of $\mathbf{X}^o$, say
$$\boldsymbol{\Sigma}_{\mathbf{X}^o} \stackrel{\text{def}}{=} \mathcal{E}\left\{\mathbf{X}^o\bigl(\mathbf{X}^o\bigr)^T\right\} \in \mathbb{R}^{G\times G}\qquad\Bigl(\text{for simplicity, }\boldsymbol{\Sigma}^o \stackrel{\text{def}}{=} \boldsymbol{\Sigma}_{\mathbf{X}^o}\Bigr), \tag{25}$$
follows the inverse-Wishart distribution $\mathcal{W}^{-1}_G(\mathbf{I},\nu)$, $\nu\ge G$,
$$\boldsymbol{\Sigma}^o \sim p_{\boldsymbol{\Sigma}^o}(\boldsymbol{\Xi}\mid\nu,G) \propto |\boldsymbol{\Xi}|^{-0.5(\nu+G+1)}\exp\left\{-0.5\,\mathrm{tr}\bigl(\boldsymbol{\Xi}^{-1}\bigr)\right\},\qquad\boldsymbol{\Xi}\in\mathbb{R}^{G\times G}, \tag{26}$$
where $\nu$ is the single parameter that characterizes the distribution and $\mathrm{tr}\{\cdot\}$ indicates the trace. The goal is to relate $\nu$ to parameter $\alpha$ of (22) and to determine the distribution of any of the $3\times3$ covariance submatrices of $\boldsymbol{\Sigma}^o$.

Following the separation strategy of Barnard et al. [57], we decompose $\boldsymbol{\Sigma}^o$ into its variances and normalized covariances (i.e., correlation coefficients) as
$$\boldsymbol{\Sigma}^o = \mathbf{S}\,\mathbf{R}^o\,\mathbf{S}, \tag{27}$$
where $\mathbf{S}\in\mathbb{R}^{G\times G}$ is the diagonal matrix whose $i$th diagonal element, $s_i$, is the standard deviation of the gene-$i$ residual expression [recall (2)], and $\mathbf{R}^o \stackrel{\text{def}}{=} \mathbf{R}_{\mathbf{X}^o}$ is the $G\times G$ correlation matrix of the residual expression matrix $\mathbf{X}^o$. Under the transformation $\boldsymbol{\Sigma}^o\to(\mathbf{S},\mathbf{R}^o)$, the Jacobian is given by $\bigl(2\prod_i s_i\bigr)^G$ [58, Theorem 3]. Thus, after marginalization over $\mathbf{S}$,
$$\mathbf{R}^o \sim p_{\mathbf{R}^o}(\boldsymbol{\Xi}\mid\nu) \propto |\boldsymbol{\Xi}|^{-0.5(\nu+G+1)}\prod_{i=1}^{G}\int_0^{\infty}s_i^{-(\nu+1)}\,e^{-\xi^{ii}/2s_i^2}\,ds_i, \tag{28}$$
where $\xi^{ii}$ is the $i$th diagonal element of $\boldsymbol{\Xi}^{-1}$. The product arises because of the independence of the $s_i$ elements. Substituting $\omega_i=\xi^{ii}/2s_i^2$ yields
$$\mathbf{R}^o \sim p_{\mathbf{R}^o}(\boldsymbol{\Xi}\mid\nu) \propto |\boldsymbol{\Xi}|^{-0.5(\nu+G+1)}\prod_i\bigl(\xi^{ii}\bigr)^{-0.5\nu}\times\prod_i\int_0^{\infty}\omega_i^{0.5(\nu-2)}\,e^{-\omega_i}\,d\omega_i, \tag{29}$$
which leads to an expression for the probability density of the matrix $\mathbf{R}^o$:
$$p_{\mathbf{R}^o}(\boldsymbol{\Xi}\mid\nu) \propto |\boldsymbol{\Xi}|^{0.5(\nu-1)(G-1)-1}\prod_i\bigl|(\boldsymbol{\Xi})_{ii}\bigr|^{-0.5\nu}, \tag{30}$$
where $(\mathbf{A})_{ii}$ denotes the $i$th principal submatrix of $\mathbf{A}$, and where we have used the fact that $\xi^{ii}=|(\boldsymbol{\Xi})_{ii}|/|\boldsymbol{\Xi}|$. For $\mathbf{R}^o$ with probability density (30), the marginal density of an arbitrary correlation submatrix also has a useful expression.

Lemma 3. For a correlation matrix $\mathbf{R}^o\in\mathcal{R}_G$ with the probability density (30), the $\kappa\times\kappa$ correlation submatrix, $\mathbf{R}^o_\kappa$, $\kappa\le G$, has the density
$$p_{\mathbf{R}^o_\kappa}\bigl(\boldsymbol{\Xi}_\kappa\mid\nu\bigr) \propto \bigl|\boldsymbol{\Xi}_\kappa\bigr|^{0.5(\nu-G+\kappa-1)(\kappa-1)-1}\prod_i\bigl|(\boldsymbol{\Xi}_\kappa)_{ii}\bigr|^{-0.5(\nu-G+\kappa)},\qquad\boldsymbol{\Xi}_\kappa\in\mathcal{R}_\kappa. \tag{31}$$

Proof. Suppose that the $\kappa\times\kappa$ statistical covariance submatrix $\boldsymbol{\Sigma}^o_\kappa$ undergoes the transformation $\boldsymbol{\Sigma}^o_\kappa\to(\mathbf{S}_\kappa,\mathbf{R}^o_\kappa)$, where $\mathbf{S}_\kappa$ is the diagonal scaling matrix of the appropriate standard deviations [recall (27)]. Then, due to the marginalization property of the inverse-Wishart distribution, $\boldsymbol{\Sigma}^o_\kappa\sim\mathcal{W}^{-1}_\kappa(\mathbf{I},\nu-G+\kappa)$. Following steps (26)–(30) for $\boldsymbol{\Sigma}^o_\kappa$ yields (31).

Substituting $\kappa=2$ into result (31) yields
$$p_{\mathbf{R}^o_2}\bigl(\boldsymbol{\Xi}_2\mid\nu\bigr) \propto p_{\rho_{12}}(\xi\mid\nu,G) \propto \bigl(1-\xi^2\bigr)^{0.5(\nu-G-1)},\qquad|\xi|\le1. \tag{32}$$
Note that this density is a function of a scalar argument, the single value of the off-diagonal elements of $\mathbf{R}^o_2$. We have indicated this by use of the subscript “$\rho_{12}$’’ in the second term of the expression. The critical property of this result is that it has the same uniparametric form as (22). By setting $\nu-G=2\alpha+1$, we can force the inherent marginal densities of the $\mathbf{R}^o$ entries [$\rho(x^o_g,x^o_{g'})$, $g\ne g'$] to equal $q_{\rho_x}(\xi)$—the specific aim of this derivation.

Finally, substituting $\kappa=3$ into result (31), we obtain
$$q_{\mathbf{R}_x}(\boldsymbol{\Xi}) \propto \frac{\bigl(1-\xi_{12}^2-\xi_{23}^2-\xi_{13}^2+2\,\xi_{12}\xi_{23}\xi_{13}\bigr)^{2(\alpha+1)}}{\Bigl[\bigl(1-\xi_{12}^2\bigr)\bigl(1-\xi_{23}^2\bigr)\bigl(1-\xi_{13}^2\bigr)\Bigr]^{\alpha+2}}, \tag{33}$$
in which $\xi_{ij}$ is the $(i,j)$ element of the evaluated matrix (in the abstract) $\boldsymbol{\Xi}$. The density of a particular covariance matrix in $\mathcal{R}_3$, say $\boldsymbol{\Xi}=\mathbf{R}_x$, involves the use of the three elements in the upper triangle of the matrix, reinforcing earlier assertions that the domain of $q_{\mathbf{R}_x}$ is a manifold of the matrix space. Recall that $\mathbf{R}_x^{[g,g',g'']}$ (assumed equivalent to $\mathbf{R}_z^{[g,g',g'']}$) is the original notation for the $3\times3$ covariance matrix of a score triplet $(z_g,z_{g'},z_{g''})$ and, by extension, $(x^o_g,x^o_{g'},x^o_{g''})$. In the present discussion, $\mathbf{R}_x$ assumes the role of a $3\times3$ submatrix of $\mathbf{R}^o$, namely, $\mathbf{R}^o_3$.
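
For reference, the unnormalized density (33) can be evaluated directly from the three upper-triangle entries of a candidate 3×3 correlation matrix; the following sketch is illustrative only (names are assumptions).

```python
# Sketch of evaluating the unnormalized density (33) for a 3x3 correlation
# matrix parameterized by its upper-triangle entries (xi12, xi13, xi23) and
# the omnibus parameter alpha.
import numpy as np

def q_R3_unnormalized(xi12, xi13, xi23, alpha):
    det3 = 1.0 - xi12**2 - xi23**2 - xi13**2 + 2.0 * xi12 * xi23 * xi13  # |Xi_3|
    if det3 <= 0.0:
        return 0.0                      # outside the positive-definite manifold
    denom = ((1.0 - xi12**2) * (1.0 - xi23**2) * (1.0 - xi13**2)) ** (alpha + 2.0)
    return det3 ** (2.0 * (alpha + 1.0)) / denom
```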

For large $G$, the assumption of the inverse-Wishart distribution in (26) is not well justified. However, the assumption is used here strictly for its value in deducing $q_{\mathbf{R}_x}(\boldsymbol{\Xi})$ from $q_{\rho_x}(\xi)$. There is no concern for a model of the entire matrix $\mathbf{R}^o=\mathbf{R}_{\mathbf{X}^o}$. Further, single-parameter distributions on a positive-definite matrix space are few; the inverse-Wishart distribution is chosen for its useful marginalization property. Of course, a tenuous assumption is not justified by a useful property if it leads to a useless procedure; the practical validation of the assumption is manifest in Section 5. The “Bayesian correlation priors’’ point of view from the work of Liechty et al. [59] was especially helpful in formulating these ideas. Exploring other ways to obtain $q_{\mathbf{R}_x}$ is a worthwhile but challenging endeavor.

Finally, we have derived $p_{\mathbf{R}^o_\kappa}(\boldsymbol{\Xi}_\kappa\mid\nu)$ only up to a proportionality constant. However, for $q_{\mathbf{R}_x}(\boldsymbol{\Xi})$ ($\kappa=3$), normalization is straightforward. For $\kappa\ge4$, it is necessary to resort to Monte Carlo methods, which require densities only up to a proportionality constant.

5. Results

5.1. Testing on Real Data Sets

A MATLAB implementation of the approach based on the work above is available at the website http://www.egr.msu.edu/~deller/. The methods were tested on two real data sets, both showing a significant amount of intergene covariance but exactly opposite 𝔉 behaviors. Calculations below are for left-sided tail-area parameter 𝛿=2.5, and center area parameter 𝑐=1 [recall (4)]. Comparisons with Efron’s [31] second-order estimator of 𝔉 are made.

The first data set is from the breast cancer (BRCA) study of Hedenfalk et al. [60]. These data record the expression of 𝐺=3226 genes on 𝑀=15 microarrays with seven samples assigned to BRCA1 mutations and eight to BRCA2. The original research seeks to identify genuine mRNA activity differences between these two categories. In the present paper, the logarithm is applied to the mRNA levels to increase Gaussianity [61].

In a study of the human immunodeficiency virus (HIV), Van ’t Wout et al. [62] investigated 𝐺=7680 genes over 𝑀=8 microarrays, with four samples assigned to an HIV-infected condition and the remaining four to the control. To produce the test cells, the control cells (CD4 T-cell lines) were infected with the HIV-1 (BRU) virus. The paper reports raw mRNA levels which, like the BRCA data, are converted to logarithms in the present work.

The present analysis reduces an entire expression matrix to two numbers: the center area null count, 𝐶, and the omnibus parameter, 𝛼. As is evident in Figure 1, the parametrization described in Section 4 is realistic. For the BRCA data, 𝛼=17.77, and for HIV, 𝛼=3.51. The Figure 1 caption provides details.

The next step is to compute the moments of $p(\mathfrak{F},C)$ per Lemmas 1 and 2. These calculations require parameters $G$, $\alpha$, $c$, $\delta$, and $\Delta$. We set $\Delta=0.1$. These estimated moments are used to find the maximum-entropy (maxent) distribution $\hat{p}(\mathfrak{F},C)$ (Appendix B). Figure 2, for example, reports the moments and the corresponding maxent distribution for the BRCA data. $\mathfrak{F}$ and $C$ exhibit a strong negative correlation of −0.89, a value similar to that in [31, Table 1]. Furthermore, $\mathfrak{F}$ shows significant positive skewness, which causes $C$ to exhibit negative skewness. This is not surprising, as $\mathfrak{F}$ is bounded below by zero and yet has a small mean but inflated variance. The third moment provides an additional level of detail about the joint behavior of $\mathfrak{F}$ and $C$.

During the maxent numerical optimization, a 100×500 equispaced mesh was found sufficient for the BRCA data; however, for the HIV data the resolution had to be increased to 400×2000. This is because, in addition to the larger 𝐺, the HIV 𝐗 also exhibits more covariance. The BRCA optimization required 30 iterations, while HIV took ~70 iterations, to converge to an estimated distribution.

Figure 3 reports the estimated $p(\mathfrak{F}\mid\mathcal{O})$, where $\mathcal{O}$ refers to the observed data. Second- and third-moment estimates are shown separately. In the framework of statistical inference, such a distribution is the ultimate goal, but the result can also be used for other purposes such as computing point estimates and associated confidence intervals.

If the mean of the estimated $\hat{p}(\mathfrak{F}\mid\mathcal{O})$ is used as a point estimate of $\mathfrak{F}$, then for the BRCA data, third-moment calculations suggest 104 false discoveries versus 79 for the second moment, while the usual $\mu_{\mathfrak{F}}=\mathcal{E}\{\mathfrak{F}\}$ suggests only 20 false discoveries. These numbers must be put in perspective by noting that the actual $z_g$ count falling in the left-sided tail area is 116. For the HIV data, the third-moment analysis found eight false discoveries compared with 19 for the second moment, while the mean estimator $\mu_{\mathfrak{F}}=\mathcal{E}\{\mathfrak{F}\}$ produces 48. This time the $z_g$ count in the left-sided tail area is 46. Clearly, extensive analysis of intergene dependence can lead to very different conclusions from the same data, relative to those of the mean estimate of false discoveries (and, in turn, the procedures built around it).

The second-order estimator designed by Efron [31] found 77 false discoveries for BRCA. Efron compares that to the results of nonparametric analysis in the same paper and concludes underestimation, but the issue is not further pursued. Our findings show that nonlinear dependence (as reflected in the present case by moments higher than second) is potentially very important in characterizing the null 𝑧 histogram.

We note in passing that the availability of $\hat{p}(\mathfrak{F}\mid\mathcal{O})$ permits the application of the bound $\mathbb{P}(\mathfrak{F}/\widetilde{G}\ge\gamma\mid\mathcal{O})\le\lambda$ as a control measure, as recommended by Lehmann and Romano [63]. It is not a trivial matter to choose $\gamma$ and $\lambda$ such that a fair comparison with other error measures is possible; however, for illustrative purposes, we set $\gamma=0.15$ and $\lambda=0.5$. With this constraint, the present approach reports 174 discoveries (108 for the second moment) for the HIV $\mathbf{X}$. This compares favorably with the results of Efron [31], where the Benjamini-Hochberg procedure, with false discovery rate control level 0.10 and an empirical null from [5], yields 180 discoveries. Without covariance modeling, the Benjamini-Hochberg procedure reports only 20 discoveries.

5.2. Testing on Simulated Data

Insight is gained by testing the approach on simulated data for which the “correct answer’’ is known. In the studies below, all cases are null (no treatment, residuals only). The goal is to see how well the realized left-sided tail-area count can be estimated from the center count. To maintain realism, we simulate raw mRNA levels. The testing scenario is a “two-group study,’’ so en route to 𝑧-values we take the usual two-sample 𝑡-statistic.

Let the mRNA expression level, $x_{gm}$, of gene $g$ measured by microarray $m$ be distributed as the Gamma density: for $m=1,\ldots,M$,
$$x_{gm} \sim p_{\Gamma}(\xi;\kappa,\theta) = \frac{\xi^{\kappa-1}e^{-\xi/\theta}}{\theta^{\kappa}\,\Gamma(\kappa)},\qquad\xi\ge0,\ \kappa,\theta>0, \tag{34}$$
where $\Gamma(\kappa)=\int_0^{\infty}v^{\kappa-1}e^{-v}\,dv$ is the Gamma function. $\kappa$ and $\theta$ are called the shape parameter and the scale parameter of the distribution, respectively. This distribution is similar to the Gamma-Gamma model used by Newton et al. [64]. In (34), the shape parameter $\kappa$ is common to all genes, while the scale parameters $\{\theta_g\}_{g=1}^{G}$ characterize varying mRNA levels from gene to gene, but are assumed i.i.d. as
$$\theta_g \overset{\text{i.i.d.}}{\sim} p_{\Gamma}\bigl(\xi;\kappa_0,\theta_0\bigr),\qquad g=1,\ldots,G. \tag{35}$$
The intuition that genes with larger underlying mRNA levels should have higher variance is supported by model (34), since the mean of the $g$th gene is $\kappa\theta_g$ and the variance is $\kappa\theta_g^2$.

The parameter set (𝜅,𝜅0,𝜃0) in (34) and (35) can be chosen on the basis of the overall gene expression histogram of real microarray data. Results for three such parameter sets: (1,0.6,500),(2,0.39,384), and (3,0.33,300), are presented. These numbers were chosen to preserve the total sample variance, and the particular values are based on the HIV data of Van ’t Wout et al. [62] which were collected using Affymetrix microarrays. In particular, 𝜅=1 models 𝑥𝑔𝑚 variables that are exponentially distributed, 𝜅=2 models a unimodal distribution with heavy tails and a noticeable departure from Gaussianity. Case 𝜅=3 represents an approximation to a Gaussian distribution, but with slightly heavier tails.

Substantial row-wise covariance was added via the Gaussian copula technique: a $G\times M$ matrix, say $\mathbf{Z}$, of i.i.d. unit normal RVs was used to produce a correlated matrix, $\mathbf{Z}_c$, via the mapping
$$\mathbf{Z}_c = \mathbf{L}^T\mathbf{Z},\qquad\text{where }\mathbf{Z}\text{ has elements }z_{gm}\overset{\text{i.i.d.}}{\sim}\mathcal{G}(0,1), \tag{36}$$
in which $\mathbf{L}$ is the Cholesky factor (e.g., [65]) of $\mathbf{R}_{z,G}+\varepsilon\mathbf{I}_G$, the correlation matrix of the actual expression matrix $\mathbf{X}$ from the BRCA study plus a small diagonal load to prevent singularity due to the fact that $M<G$. Several other dense matrices, $\mathbf{R}$, generated through a different method [66], yielded similar results. This process imposes the covariance of the real BRCA data on the simulated substrate of independent Gaussian variables. The resulting elements $z^c_{gm}$ were mapped to $P$ values, $P_{\mathcal{G}_u}(z^c_{gm})$, then further transformed to simulated expression variables, $x_{gm}$, through the inverse Gamma c.d.f. as in (34). The result is the simulated expression matrix $\mathbf{X}=[x_{gm}]\in\mathbb{R}^{G\times M}$.
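
A condensed sketch of this simulation pipeline is given below (illustrative names only). Note that it uses NumPy's lower-triangular Cholesky convention $\mathbf{R}=\mathbf{L}\mathbf{L}^T$, so the correlated normals are formed as $\mathbf{L}\mathbf{Z}$.

```python
# Sketch of the simulation above: impose a target gene-gene correlation R on
# i.i.d. unit normals via a Cholesky factor, then map through the Gaussian
# c.d.f. and the inverse Gamma c.d.f. to obtain simulated mRNA levels.
import numpy as np
from scipy import stats

def simulate_expression(R, M, kappa, theta_g, eps=1e-6, seed=0):
    """R: (G, G) target correlation; theta_g: per-gene scale parameters."""
    rng = np.random.default_rng(seed)
    G = R.shape[0]
    L = np.linalg.cholesky(R + eps * np.eye(G))       # lower-triangular factor
    Zc = L @ rng.standard_normal((G, M))              # correlated unit normals
    P = stats.norm.cdf(Zc)                            # map to P values
    return stats.gamma.ppf(P, a=kappa, scale=theta_g[:, None])  # inverse Gamma c.d.f.
```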

Figures 4 and 5 compare second- and third-moment estimates for $\delta=2.0$ and $\delta=2.5$, respectively. In both cases, $c=1$ for the center area (see Section 6). For each $(\kappa,\kappa_0,\theta_0)$, 800 matrices $\mathbf{X}$ were processed. On each $\mathbf{X}$, the approach was applied in its entirety and no additional knowledge was assumed. The a posteriori mean was used as the final estimate. The usual mean estimator $\mu_{\mathfrak{F}}=\mathcal{E}\{\mathfrak{F}\}$ consistently reported 20 for $\delta=2.5$ and 73 for $\delta=2.0$, regardless of the particular $\mathbf{X}$.

Strikingly, for all three parameter sets (𝜅,𝜅0,𝜃0), the third-moment skewness corrections make the estimation process more accurate. For some of the scenarios third-moment estimates saturate somewhat, but the effect is minor compared to that in the second-order approaches. To the extent that these parameter sets cover a wide range of realistic gene expression data, the practical utility of the proposed approach is evident.

6. Discussion

Advances in DNA microarray technology, improved standardization procedures, and careful execution of laboratory protocols collectively lead to testing situations with marginally correct but strongly correlated null hypotheses. If correlation is the result of intrinsic gene-gene interactions, no experimental design can circumvent it. Correlation can cause the realized $\dot{\mathfrak{F}}$ to vary significantly from case to case [63], and the control of $\mathcal{E}\{\dot{\mathfrak{F}}\}$ via the usual $\mu_{\mathfrak{F}}=\mathcal{E}\{\mathfrak{F}\}$ may no longer represent the basic facts. The moment theory of the null statistic histogram can be used to deduce an estimator of $\mu_{\mathfrak{F}}$ that explicitly combines identifiability and covariance. Though we have explored these ideas in the differential analysis context above, the findings are quite general.

It is reasonable to question the necessity of the heavy mathematical machinery of the foregoing sections, since it is possible to simulate a number of sets of $z$-scores $\{z_g\}_{g=1}^{G}$ from the distribution $\mathbf{z}\sim\mathcal{G}(\mathbf{0},\boldsymbol{\Sigma}_z)$, $\boldsymbol{\Sigma}_z\in\mathbb{R}^{G\times G}$, and then estimate the moments. However, due to sampling fluctuations, the underlying covariance matrix $\boldsymbol{\Sigma}_z$ is ordinarily unattainable, whereas pursuing quantities like $q_{\rho_z}(\xi)$ and $q_{\mathbf{R}_z}(\boldsymbol{\Xi})$ is still possible. Also, as $G$ gets larger (~25,000 for recent microarray studies), the computational demands, as well as the large number of $z$-score sets required, become prohibitive.

Permutation calculations, as in [31, Section 4], offer an alternative way to estimate the moments. They too can run into computational difficulties, especially when the test statistic itself is computationally intensive. Further difficulty arises when samples are few. For a two-group study like HIV (four samples each condition), the data provide only 70 unique permutations.

When a direct extraction of inter-hypotheses covariance is not feasible, single omnibus parameter models remain useful in that the investigator can still use judgment to intelligently incorporate some form of covariance effect by setting a value of the parameter 𝛼.

The distribution of interest 𝑝(𝔉,𝐶) resides over support domain 𝒟 as shown in Figure 6, and the maxent algorithm is adept at handling such complicated support regions. At a more fundamental level, through maxent, we seek to minimize the amount of unintentional prior information brought into the inference.

Apart from the numerical parameters Δ (bin width) and the mesh resolution in maxent, the only open choice of parameterization in the present method is 𝑐, the center area boundary. The selection of 𝑐=1 in this paper is based on the first eigenvector analysis of Efron [31] which suggests that (within certain approximations) the interval [1,1] has completely opposite count behavior from the rest of the 𝒵 space.

One surprising result of this and similar studies is that more inter-$z_g$ covariance does not translate into more extreme covariance between the variables $\mathfrak{F}$ and $C$. In the BRCA data, for example, the correlation coefficient between $\mathfrak{F}$ and $C$ is −0.89, while for the HIV data the correlation weakens to −0.75. Further research to gain insight into this behavior would be very useful.

Finally, while covariance was viewed in the present paper as a “destructive factor’’ in the attempt to estimate 𝔉, inter-𝑧𝑔 covariance can, in fact, be exploited to increase power by finding a superior ranking of potential discoveries. The recent multiple testing literature has begun to address this possibility.

Ethical Approval

Human subject data used in this study are publicly available and anonymous and are, therefore, exempted from continuing Institutional Review Board scrutiny according to US Health and Human Services Policy 45 CFR 46, Subpart A, §46.101 (2.b.4).

Appendices

A. Genes and Microarrays

To understand the core statistical methods developed in this paper, it is only necessary to know some very general facts about the biological application. The “blueprint’’ for building the components of every cell in an organism (except some viruses) is encoded in macromolecules collectively called deoxyribonucleic acid or “DNA.’’ Each DNA molecule consists of a complementary pair of long sequences (polymers) of small molecular elements known as nucleotides. The characteristic “double helix’’ configuration of DNA molecules results from chemical forces as the nucleotides on the complementary polymers strongly bind to one another. Each nucleotide is built around one of the four nitrogen bases guanine, adenine, thymine and cytosine, commonly known by their initial letters G, A, T, and C. It is the sequence of these bases that encodes the information for producing proteins that, in turn, determine the structures and functions of cells. In the human genome (the complete set of life-sustaining instructions in the genetic material of an organism), the DNA consists of about three billion nucleotides organized into 23 sets of structures called chromosomes.

Genes are certain identifiable sections of the DNA—varying in length from a few hundred to a few million bases—that encode a particular trait or “hereditary unit’’ of biological information by specifying and regulating the types of proteins produced. Almost every gene is present in the DNA of every cell of a given organism, but only a relatively few genes are active in any given cell type. Genes comprise only about 1% of the DNA material in the human genome. The reasons for the existence of the remaining 99% of the DNA remain the subject of much conjecture, hypothesis, and research. It is currently estimated that there are on the order of 25,000 genes in the human genome, 99% of which are identical among all humans. The slight variances in the other 1% of the genes (gene alleles) determine differences among individuals (e.g., eye, hair, and skin color, blood type, etc.). Apart from normal variations (alleles), mutations in genes can lead to the formation of abnormal proteins with beneficial, neutral, or negative consequences for cell function and replication.

A gene is defined by its sequence of nucleotide bases, which, in turn, can be expressed as a sequence whose elements come from the set of four letters G, A, T, and C as described above—thus reducing the information carried by the gene to a simple code like “ATCGCT’’. A three-letter code (i.e., a three-base set like “AGA’’), called a codon, ultimately specifies one of the 20 amino acids that are the building blocks of proteins. There are 64 possible codons, and 61 of these are used to indicate the 20 amino acids, so there is redundant representation of the amino acids in the codons. The remaining three codons are used to regulate protein synthesis.

The manifestation of genes as proteins is called gene expression. The degree to which a gene is “active’’ in a given collection of cells (e.g., a basal cell (skin) tumor) in a given set of conditions (e.g., untreated versus treated with radiation or chemotherapy) can be ascertained by measuring the levels of certain molecules [messenger RNA (mRNA) or complementary DNA (cDNA)] related to the proteins manufactured by the gene. A microarray consists of thousands of binding sites (“probes’’), each populated with DNA fragments that can be associated with particular genes. The extent to which a gene is expressed in a particular preparation determines the extent to which the related microarray site “lights up’’ as the fluorescently labeled mRNA or cDNA in the preparation binds to its site. A single microarray experiment can be used to simultaneously quantify the expression levels of thousands of genes in a particular tissue preparation.

The applied purpose of the statistical modeling work in this paper is to develop methods for determining from microarray data which genes are expressed at significantly different levels when a cell type is exposed to different conditions. Ultimately, this information can be used to understand normal and abnormal cell function and replication, to target genes for medical therapies, and to develop drugs for treating myriad diseases and systemic disorders at the cellular level.

B. Estimating 𝑃(𝔉,𝐶) by Maximum Entropy Optimization (MAXENT)

B.1. MAXENT Distribution

$p(\mathfrak{F},C)$ is a discrete distribution with support domain
$$\mathcal{D} = \left\{(i,j)\,:\,i,j\in\mathbb{Z};\ 0\le i\le G,\ 0\le j\le G,\ 0\le i+j\le G\right\}. \tag{B.1}$$
Any moment-based inference involves computation over $\mathcal{D}$, whose increasing cardinality [$O(G^2)$] makes processing difficult for large $G$. However, computation can be reduced substantially by truncating the domain $\mathcal{D}$ to the set
$$\mathcal{D}_t = \left\{(i,j)\,:\,\mathfrak{F}_{\min}\le i\le\mathfrak{F}_{\max},\ C_{\min}\le j\le C_{\max},\ 0\le i+j\le G\right\}, \tag{B.2}$$
where
$$\mathfrak{F}_{\min} = \max\bigl(\mu_{\mathfrak{F}}-h\,\sigma_{\mathfrak{F}},\,0\bigr),\quad \mathfrak{F}_{\max} = \min\bigl(\mu_{\mathfrak{F}}+h\,\sigma_{\mathfrak{F}},\,G\bigr),\quad C_{\min} = \max\bigl(\mu_{C}-h\,\sigma_{C},\,0\bigr),\quad C_{\max} = \min\bigl(\mu_{C}+h\,\sigma_{C},\,G\bigr). \tag{B.3}$$
Chebyshev’s inequality guides the choice of the truncation parameter $h$. For $h\ge6$, the loss of accuracy due to truncation is negligible.

Computation can be reduced further by recognizing that distributions imposed on the basis of a small number of moment constraints often enjoy a high level of regularity, so that a sparser mesh should be adequate. The computation-accuracy trade-off becomes much easier to analyze if the problem is posed as one of learning a density function over a continuous domain, say
$$\mathcal{S}_t = \left\{(v,w)\,:\,v\in\mathcal{I}(\mathfrak{F}),\ w\in\mathcal{I}(C),\ -\bar{\mu}\le v+w\le 1-\bar{\mu}\right\}, \tag{B.4}$$
where $\bar{\mu}\stackrel{\text{def}}{=}(\mu_{\mathfrak{F}}+\mu_{C})/G$,
$$\mathcal{I}(\mathfrak{F}) \stackrel{\text{def}}{=} \text{range of }\mathfrak{F} = \text{the interval}\left[\frac{\mathfrak{F}_{\min}-\mu_{\mathfrak{F}}}{G},\ \frac{\mathfrak{F}_{\max}-\mu_{\mathfrak{F}}}{G}\right], \tag{B.5}$$
and similarly for $\mathcal{I}(C)$ (see Figure 6). Note that the continuous domain is scaled by the factor $G$ to improve numerical stability. The moment constraints must be scaled accordingly. Let $\mathcal{P}_c$ denote the space of feasible distributions $p(\xi_1,\xi_2)$. Then, for all $p\in\mathcal{P}_c$,
$$\int_{\mathcal{S}_t}\xi_1^{\,i}\,\xi_2^{\,j}\,p(\xi_1,\xi_2)\,d\xi_1\,d\xi_2 = \frac{\mathcal{E}\left\{\bigl(\mathfrak{F}-\mu_{\mathfrak{F}}\bigr)^{i}\bigl(C-\mu_{C}\bigr)^{j}\right\}}{G^{\,i+j}} \stackrel{\text{def}}{=} \eta_{ij}, \tag{B.6}$$
with $0\le i+j\le J$. In (B.6), $(i,j)=(0,0)$ corresponds to the constraint $\int_{\mathcal{S}_t}p(\xi_1,\xi_2)\,d\xi_1\,d\xi_2=1$, while $(i,j)=(1,0)$ together with $(i,j)=(0,1)$ imply that every $p\in\mathcal{P}_c$ has mean $[0\;0]^T$. We are concerned with problems for which $J\le3$. For convenience, we have defined the notation $\eta_{ij}$ as a shorthand for the scaled $(i,j)$ joint central moment of $\mathfrak{F}$ and $C$.

The selection of a unique $p(\xi_1,\xi_2)$ is based on the principle of entropy maximization (MAXENT), which seeks a $p\in\mathcal{P}_c$ with maximum information entropy [67]. The information entropy essentially measures the spread of the distribution; hence, maxent can be seen as a criterion which, within one’s knowledge constraints, maximizes the representation of unknown information (“ignorance’’)—arguably, a suitable approach for statistical inference.

MAXENT seeks the following solution in $\mathcal{P}_c$:
$$p^{*}(\xi_1,\xi_2) = \arg\max_{p\in\mathcal{P}_c}\left\{-\int_{\mathcal{S}_t}p(\xi_1,\xi_2)\ln p(\xi_1,\xi_2)\,d\xi_1\,d\xi_2\right\}. \tag{B.7}$$
The solution takes the following exponential form:
$$p_{\boldsymbol{\lambda}}(\xi_1,\xi_2) = \frac{\exp\left(\sum_{1\le i+j\le J}\lambda_{ij}\,\xi_1^{\,i}\,\xi_2^{\,j}\right)}{\int_{\mathcal{S}_t}\exp\left(\sum_{1\le i+j\le J}\lambda_{ij}\,\xi_1^{\,i}\,\xi_2^{\,j}\right)d\xi_1\,d\xi_2}, \tag{B.8}$$
in which the $\lambda_{ij}$ are Lagrange multipliers. The derivation of the exponential form (B.8) and a procedure to determine the optimal multipliers $\lambda_{ij}$ are given in the following subsection.

B.2. MAXENT Solution Details

The information entropy functional is concave [68], and the constraints in (B.6) are linear in $p(\xi_1,\xi_2)$. Thus, the problem in (B.7) is a convex program that can be solved in a Lagrangian dual framework, in which one works with an unconstrained upper bound that is easy to optimize (e.g., [69]). More importantly, in the present case, the framework converts the original infinite-dimensional problem of functional variation into a finite-dimensional problem with as few variables as the number of constraints.

Lemma 4. The dual, $\Psi(\boldsymbol{\lambda})$, of the concave optimization problem (B.7) is given by
$$\Psi(\boldsymbol{\lambda}) = \ln\int_{\mathcal{S}_t}\exp\left(\sum_{1\le i+j\le J}\lambda_{ij}\,\xi_1^{\,i}\,\xi_2^{\,j}\right)d\xi_1\,d\xi_2 - \sum_{2\le i+j\le J}\lambda_{ij}\,\eta_{ij}, \tag{B.9}$$
where $\boldsymbol{\lambda}$ is the set of Lagrange multipliers, $\{\lambda_{ij}\}_{i,j}$; $\lambda_{ij}$ is the multiplier corresponding to the $(i,j)$th constraint; and $i$, $j$, $J$, and $\eta_{ij}$ are defined in (B.6).

Proof. By the definition of the Lagrangian dual function,
$$\Psi(\boldsymbol{\lambda}) = \sup_{p\in\mathcal{P}_c}\left\{-\int_{\mathcal{S}_t}p(\xi_1,\xi_2)\ln p(\xi_1,\xi_2)\,d\xi_1\,d\xi_2 + \sum_{i+j\le J}\lambda_{ij}\left[\int_{\mathcal{S}_t}\xi_1^{\,i}\,\xi_2^{\,j}\,p(\xi_1,\xi_2)\,d\xi_1\,d\xi_2 - \eta_{ij}\right]\right\}. \tag{B.10}$$
Taking the functional variation of the bracketed term in (B.10) with respect to the unknown density $p(\xi_1,\xi_2)$, and using the fact that $\int_{\mathcal{S}_t}p(\xi_1,\xi_2)\,d\xi_1\,d\xi_2=1$, we obtain the maximizer of (B.10),
$$p_{\boldsymbol{\lambda}}(\xi_1,\xi_2) = \frac{\exp\left(\sum_{1\le i+j\le J}\lambda_{ij}\,\xi_1^{\,i}\,\xi_2^{\,j}\right)}{\int_{\mathcal{S}_t}\exp\left(\sum_{1\le i+j\le J}\lambda_{ij}\,\xi_1^{\,i}\,\xi_2^{\,j}\right)d\xi_1\,d\xi_2}. \tag{B.11}$$
Inserting (B.11) into (B.10) yields
$$\Psi(\boldsymbol{\lambda}) = \int_{\mathcal{S}_t}p_{\boldsymbol{\lambda}}(\xi_1,\xi_2)\,\ln\left[\int_{\mathcal{S}_t}\exp\left(\sum_{1\le i+j\le J}\lambda_{ij}\,\xi_1^{\,i}\,\xi_2^{\,j}\right)d\xi_1\,d\xi_2\right]d\xi_1\,d\xi_2 - \sum_{1\le i+j\le J}\lambda_{ij}\,\eta_{ij} = \ln\int_{\mathcal{S}_t}\exp\left(\sum_{1\le i+j\le J}\lambda_{ij}\,\xi_1^{\,i}\,\xi_2^{\,j}\right)d\xi_1\,d\xi_2 - \sum_{2\le i+j\le J}\lambda_{ij}\,\eta_{ij}, \tag{B.12}$$
where we have used the facts that $\int_{\mathcal{S}_t}p(\xi_1,\xi_2)\,d\xi_1\,d\xi_2 - \eta_{00} = 0$, $\eta_{10}=0$, and $\eta_{01}=0$ from (B.6).

It is easy to verify that the Hessian of (B.9) is positive definite; hence, $\Psi(\boldsymbol{\lambda})$ is convex. Suppose that $\boldsymbol{\lambda}^{*}$ is the minimizer of $\Psi(\boldsymbol{\lambda})$. Then the corresponding primal solution $p_{\boldsymbol{\lambda}^{*}}(\xi_1,\xi_2)$—obtained via (B.11)—indeed maximizes (B.7). To verify this, let $p^{o}(\xi_1,\xi_2)$ be the maximizer of (B.7), and let $\mathbb{H}\{p\}$ denote the entropy objective in (B.7). Then, from (B.10), $\Psi(\boldsymbol{\lambda})\ge\mathbb{H}\{p^{o}(\xi_1,\xi_2)\}$ for all $\boldsymbol{\lambda}$. Now, from general optimization theory, the functional variation of the Lagrangian with respect to $p(\xi_1,\xi_2)$, evaluated at $p^{o}(\xi_1,\xi_2)$, must be zero, which implies that $p^{o}(\xi_1,\xi_2)$ can be written in the form (B.11) for some $\boldsymbol{\lambda}^{o}$. But then $\Psi(\boldsymbol{\lambda}^{*})\le\Psi(\boldsymbol{\lambda}^{o})=\mathbb{H}\{p^{o}(\xi_1,\xi_2)\}\le\Psi(\boldsymbol{\lambda}^{*})$. Consequently, $\Psi(\boldsymbol{\lambda}^{*})=\mathbb{H}\{p^{o}(\xi_1,\xi_2)\}$, so that $\boldsymbol{\lambda}^{*}=\boldsymbol{\lambda}^{o}$. Hence $p_{\boldsymbol{\lambda}^{*}}(\xi_1,\xi_2)$ is the maximizer of (B.7).

Newton’s method (e.g., [70, Section 4.6]) specifies that if a multivariable function $\Psi(\boldsymbol{\lambda})$ is twice differentiable and the initial value $\boldsymbol{\lambda}_0$ is chosen close enough to the optimum $\boldsymbol{\lambda}^{*}$, then the sequence over indices $t=0,1,2,\ldots$,
$$\boldsymbol{\lambda}_{t+1} = \boldsymbol{\lambda}_t - \gamma\,\bigl[\nabla^2\Psi(\boldsymbol{\lambda}_t)\bigr]^{-1}\nabla\Psi(\boldsymbol{\lambda}_t), \tag{B.13}$$
converges to $\boldsymbol{\lambda}^{*}$. In (B.13), $\nabla\Psi(\boldsymbol{\lambda}_t)$ and $\nabla^2\Psi(\boldsymbol{\lambda}_t)$ denote, respectively, the gradient and the Hessian of $\Psi(\boldsymbol{\lambda})$ evaluated at $\boldsymbol{\lambda}_t$. The parameter $\gamma>0$ allows finer control of step sizes to avoid numerical instabilities. Intuitively, at the $t$th iteration, $\Psi(\boldsymbol{\lambda})$ is replaced by its second-order Taylor expansion around $\boldsymbol{\lambda}_t$ and then minimized exactly, which produces the minimizer $\boldsymbol{\lambda}_{t+1}$. At the $(t+1)$st iteration, $\boldsymbol{\lambda}_{t+1}$ becomes the point of expansion, and the method continues until the desired convergence level is achieved.

The elements of the gradient $\nabla\Psi(\breve{\boldsymbol{\lambda}})$ are given by
$$\frac{\partial\Psi(\breve{\boldsymbol{\lambda}})}{\partial\lambda_{ij}} = \int_{\mathcal{S}_t}\xi_1^{\,i}\,\xi_2^{\,j}\,\frac{\exp\left(\sum_{1\le i'+j'\le J}\breve{\lambda}_{i'j'}\,\xi_1^{\,i'}\,\xi_2^{\,j'}\right)}{\int_{\mathcal{S}_t}\exp\left(\sum_{1\le i'+j'\le J}\breve{\lambda}_{i'j'}\,\xi_1^{\,i'}\,\xi_2^{\,j'}\right)d\xi_1\,d\xi_2}\,d\xi_1\,d\xi_2 - \eta_{ij} = \int_{\mathcal{S}_t}\xi_1^{\,i}\,\xi_2^{\,j}\,\breve{p}(\xi_1,\xi_2)\,d\xi_1\,d\xi_2 - \eta_{ij} = \breve{\eta}_{ij} - \eta_{ij}, \tag{B.14}$$
where $\breve{\eta}_{ij}$ denotes the $(i,j)$ central moment of the distribution (B.11) parameterized by $\breve{\boldsymbol{\lambda}}$. Similarly, the elements of the Hessian are given by
$$\frac{\partial^2\Psi(\breve{\boldsymbol{\lambda}})}{\partial\lambda_{ij}\,\partial\lambda_{i'j'}} = \int_{\mathcal{S}_t}\xi_1^{\,i+i'}\,\xi_2^{\,j+j'}\,\breve{p}(\xi_1,\xi_2)\,d\xi_1\,d\xi_2 - \left[\int_{\mathcal{S}_t}\xi_1^{\,i}\,\xi_2^{\,j}\,\breve{p}(\xi_1,\xi_2)\,d\xi_1\,d\xi_2\right]\left[\int_{\mathcal{S}_t}\xi_1^{\,i'}\,\xi_2^{\,j'}\,\breve{p}(\xi_1,\xi_2)\,d\xi_1\,d\xi_2\right] = \breve{\eta}_{(i+i')(j+j')} - \breve{\eta}_{ij}\,\breve{\eta}_{i'j'}. \tag{B.15}$$
The moment calculations that occur as part of the gradient and Hessian essentially involve integration over the planar domain $\mathcal{S}_t$. Many advanced techniques for implementing numerical integration on a quadrangle like $\mathcal{S}_t$ are available in the literature; however, an equispaced rectangular mesh is found to be sufficient for the present purpose. We initiate the sequence (B.13) with $\boldsymbol{\lambda}=\mathbf{0}$, which implies a uniform distribution over $\mathcal{S}_t$.
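
A generic sketch of the damped Newton iteration (B.13) is shown below, assuming callables that return the gradient and Hessian of $\Psi$ [e.g., from the mesh-based moments of (B.14) and (B.15)]; all names, defaults, and tolerances are assumptions.

```python
# Illustrative sketch of the damped Newton iteration (B.13) on the dual
# Psi(lambda): `grad` and `hess` are callables returning the gradient vector
# and Hessian matrix of Psi at a given multiplier vector.
import numpy as np

def minimize_dual(grad, hess, n_params, gamma=0.5, tol=1e-8, max_iter=200):
    lam = np.zeros(n_params)                    # lambda = 0: uniform start
    for _ in range(max_iter):
        g = grad(lam)
        if np.linalg.norm(g) < tol:
            break
        step = np.linalg.solve(hess(lam), g)    # Newton direction
        lam = lam - gamma * step                # damped update (B.13)
    return lam
```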

Acknowledgments

This work was supported in part by the Quantitative Biology Initiative at Michigan State University. H. Wang received support from the National Science Foundation of China (Grant no. 61002003) and from the Zhejiang Provincial Natural Science Foundation of China (Grant no. Z1111051). The authors are grateful to the personnel in the MSU High Performance Computing Center for assistance in implementing the extensive simulations, and to Dr. K.H. Desai for many contributions to this work.