Abstract
Consider the multiple testing problem of testing $m$ null hypotheses, among which $m_0$ hypotheses are truly null. Given the $P$-values for each hypothesis, the question of interest is how to combine the $P$-values to find out which hypotheses are false nulls and possibly to make a statistical inference on $m_0$. Benjamini and Hochberg proposed a classical procedure that can control the false discovery rate (FDR). The FDR control is somewhat unsatisfactory in that it only concerns the expectation of the false discovery proportion (FDP). The control of the actual random variable FDP has recently drawn much attention. For any level $\alpha \in (0,1)$, this paper proposes a procedure to construct an $\alpha$ upper prediction bound (UPB) for the FDP for a fixed rejection region. When $\alpha = 50\%$, our procedure is very close to the classical Benjamini and Hochberg procedure. Simultaneous UPBs for the FDPs of all rejection regions and an upper confidence bound for the unknown $m_0$ are presented consequently. This newly proposed procedure works for finite samples and hence avoids the slow convergence problem of the asymptotic theory.
1. Introduction
In this paper, we consider the problem of testing $m$ null hypotheses $H_1, \dots, H_m$, among which $m_0$ hypotheses are truly null. We shall assume that $P$-values are available for the individual hypotheses. In a seminal paper, Benjamini and Hochberg [1] proposed the false discovery rate (FDR) as an alternative to the classically defined family-wise error rate (FWER). The proposed FDR achieves a good balance between the raw $P$-value itself and the FWER correction [2]; the former may give too many false positives, and the latter may give too many false negatives. However, the control of the FDR is somewhat unsatisfactory in that it only concerns the expectation of the false discovery proportion (FDP). In practice, researchers may be interested in more detailed statistical inference on the actual random variable FDP, not just its expectation. The goal of this paper is to provide a simple procedure to control the FDP.
Let us first introduce some notation. Given $m$ hypotheses $H_1, \dots, H_m$, let the complete index set be $\mathcal{M} = \{1, \dots, m\}$, $\mathcal{M}_0$ the unknown subset of $\mathcal{M}$ for which the null hypotheses are true, and $\mathcal{M}_1 = \mathcal{M} \setminus \mathcal{M}_0$ the subset for which the null hypotheses are false. Denote $m_0 = |\mathcal{M}_0|$ and $m_1 = |\mathcal{M}_1|$, where $|\cdot|$ denotes the cardinality of a set. The $P$-values for testing the $m$ hypotheses are $P_1, \dots, P_m$. A fixed rejection region for the $P$-values can conveniently be taken as $[0, t]$ ($0 \le t \le 1$). The value of $t$ could be 0.05, for example. Define the number of all rejected hypotheses and the number of falsely rejected hypotheses, respectively,
$$R(t) = \#\{i \in \mathcal{M} : P_i \le t\}, \qquad V(t) = \#\{i \in \mathcal{M}_0 : P_i \le t\}. \tag{1.1}$$
Following the notation of Korn et al. [3], Genovese and Wasserman [4], and Lehmann and Romano [5], the false discovery proportion is defined to be the proportion of falsely rejected null hypotheses among the rejected ones,
$$\mathrm{FDP}(t) = \frac{V(t)}{R(t)},$$
where the ratio is defined to be zero when the denominator is zero. For a given fixed rejection region $[0,t]$, $R(t)$, $V(t)$, and $\mathrm{FDP}(t)$ are random variables. $R$, $V$, and $\mathrm{FDP}$ will be shorthand for $R(t)$, $V(t)$, and $\mathrm{FDP}(t)$, respectively, when the rejection region is clear from the context. The false discovery rate of Benjamini and Hochberg [1] is
$$\mathrm{FDR}(t) = E[\mathrm{FDP}(t)].$$
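For concreteness, the quantities $R(t)$, $V(t)$, and $\mathrm{FDP}(t)$ can be computed in a few lines of code when the truth is known, as in a simulation. In the Python sketch below, the simulated data, the power-transform alternative, and all variable names are our own illustrative choices, not part of the paper:

```python
import random

rng = random.Random(0)
m, m0 = 1000, 800                      # m hypotheses, m0 of them true nulls
# True null P-values are Uniform(0,1); false null P-values are pushed toward
# zero here by a power transform (an arbitrary illustrative choice).
pvals = [rng.random() for _ in range(m0)] + \
        [rng.random() ** 4 for _ in range(m - m0)]
is_null = [i < m0 for i in range(m)]

t = 0.05                               # fixed rejection region [0, t]
R = sum(p <= t for p in pvals)                          # rejections R(t)
V = sum(p <= t for p, h in zip(pvals, is_null) if h)    # false rejections V(t)
FDP = V / R if R > 0 else 0.0          # the ratio is defined as 0 when R(t) = 0
```

In real data $V(t)$ and $\mathrm{FDP}(t)$ are unobservable; only $R(t)$ is available, which is why the paper bounds $\mathrm{FDP}(t)$ rather than computing it.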
A good understanding of $\mathrm{FDP}(t)$ will lead investigators to pick an appropriate rejection region of $P$-values. As $\mathrm{FDP}(t)$ is an unobservable random variable depending on the observed $P$-values and the rejection region $[0,t]$, the quantity $\mathrm{FDR}(t)$ just describes the expectation of this random variable. One way to obtain a more detailed statistical inference on the random variable $\mathrm{FDP}(t)$ is to derive its distribution, which is very difficult unless a strong assumption can be imposed on the $P$-values from the false null hypotheses. A conservative approach is to compute an upper prediction bound for $\mathrm{FDP}(t)$ so that we can safeguard against excessive type I errors. In Section 2, for a fixed rejection region $[0,t]$, we can compute an upper prediction bound (UPB) $\overline{\mathrm{FDP}}_\alpha(t)$ for $\mathrm{FDP}(t)$ such that
$$P\{\mathrm{FDP}(t) \le \overline{\mathrm{FDP}}_\alpha(t)\} \ge \alpha.$$
If the UPB were nonrandom, it would have to be no less than the $\alpha$ quantile of the random variable $\mathrm{FDP}(t)$. When $\alpha = 50\%$, our procedure is very close to the classical BH procedure of Benjamini and Hochberg [1]. In other words, the BH procedure gives us an approximate 50% upper prediction bound (UPB) for $\mathrm{FDP}(t)$. With different degrees of conservativeness, one can take $\alpha$ at 0.9, 0.95, or 0.99 to ensure high coverage of the false discovery proportion. We also describe how to compute an upper confidence bound (UCB) for $m_0$, the number of true null hypotheses. The UCB for $m_0$ can be used to improve the UPB estimate. In practice, the rejection region needs to be adapted to the actual dataset. In Section 3, we give a procedure to construct an upper prediction band for $\mathrm{FDP}(t)$ for all $t \in [0,1]$, and this upper prediction band can be used to pick a data-defined rejection region of $P$-values such that the false discovery proportion can be controlled at target level $\beta$ with prediction accuracy $\alpha$, that is,
$$P\{\mathrm{FDP} \le \beta\} \ge \alpha.$$
Thus, with probability at least the prediction accuracy $\alpha$, the value of the FDP is at the target level $\beta$ or less. For independent true null $P$-values, Genovese and Wasserman [4] and Meinshausen and Rice [6] also worked on the control of the FDP in the sense of the above equation. However, their results are based on asymptotic theory, while our focus is on finite-sample results, which avoid the slow convergence problem of the asymptotic theory. Other works, such as Lehmann and Romano [5], Romano and Shaikh [7, 8], and van der Laan et al. [9], proposed procedures that allow dependence in the $P$-values but potentially lose statistical power as the dependence information is not exploited. Section 4 presents a focused statistical inference by restricting the rejection regions onto a subinterval of thresholds, which unifies the results of Sections 2 and 3. Section 5 generalizes the results from independent data to less-independent situations, and Section 6 gives our discussion.
2. Finding a UPB for the False Discovery Proportion for a Fixed Rejection Region
For the sake of simplicity, we will first assume that the $P$-values from the true null hypotheses are mutually independent, each following the uniform distribution $U(0,1)$. We make no further assumptions on the $P$-values from the false null hypotheses. This assumption is the same as in Benjamini and Hochberg [1]. In Section 5 we will generalize the result to less-independent situations. For a fixed rejection region $[0,t]$ of the $P$-values, we would like to find the upper prediction bound (UPB) for the false discovery proportion $\mathrm{FDP}(t)$. As we mentioned in Section 1, the distribution of $\mathrm{FDP}(t)$ is unknown. However, for any given experimental data, the total number of rejections, $R(t)$, can easily be obtained by (1.1). Under the assumption that the true null $P$-values are independently distributed as $U(0,1)$, $V(t)$ has a binomial distribution $\mathrm{Bin}(m_0, t)$. Let $U_1, \dots, U_{m_0}$ be random variables mutually independently distributed as $U(0,1)$, and $V^*(t) = \#\{i : U_i \le t\}$ distributed as $\mathrm{Bin}(m_0, t)$; hence,
The $\alpha$ quantile for $V(t)$ is the $\alpha$ quantile of the distribution $\mathrm{Bin}(m_0, t)$. Here the $\alpha$ quantile of $\mathrm{Bin}(n, t)$ is defined as
$$Q_\alpha(n, t) = \min\{k : P(\mathrm{Bin}(n, t) \le k) \ge \alpha\}. \tag{2.2}$$
As $R(t)$ can be computed from the observed data, a UPB for $\mathrm{FDP}(t)$ can be estimated by
$$\overline{\mathrm{FDP}}_\alpha(t) = \frac{\min\{Q_\alpha(m_0, t),\, R(t)\}}{R(t)}, \tag{2.3}$$
where the ratio is defined to be zero when $R(t) = 0$.
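The binomial quantile and the resulting bound are straightforward to compute. The Python sketch below is our own reconstruction: it assumes the UPB of (2.3) takes the form of the $\alpha$ quantile of $\mathrm{Bin}(m_0, t)$ divided by the observed $R(t)$, capped at one, and all function names are ours:

```python
import math

def binom_cdf(k, n, p):
    """P(Bin(n, p) <= k), with each pmf term computed in log space."""
    total = 0.0
    for j in range(k + 1):
        logpmf = (math.lgamma(n + 1) - math.lgamma(j + 1)
                  - math.lgamma(n - j + 1)
                  + j * math.log(p) + (n - j) * math.log(1 - p))
        total += math.exp(logpmf)
    return min(total, 1.0)

def binom_quantile(alpha, n, p):
    """Smallest k with P(Bin(n, p) <= k) >= alpha."""
    k = 0
    while binom_cdf(k, n, p) < alpha:
        k += 1
    return k

def fdp_upb(R, m0, t, alpha=0.95):
    """Sketch of an alpha-level UPB for FDP(t): binomial quantile over R(t)."""
    if R == 0:
        return 0.0                     # FDP is defined to be 0 when R(t) = 0
    return min(binom_quantile(alpha, m0, t) / R, 1.0)
```

For example, with $m_0 = 800$ true nulls, $t = 0.05$, and $R(t) = 90$ observed rejections, `fdp_upb(90, 800, 0.05)` bounds the FDP by the 95% quantile of $\mathrm{Bin}(800, 0.05)$ divided by 90.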
Lemma 2.1. For any given , (a) , (b) , and (c) let and . The values that and can take are only zero or one.
Proof. By noting that when , we have . Applying the definition of , we obtain the result for part (a).
Note that
Therefore,
Similarly, we can obtain that
By noting that
we have
Combining this with (2.5) and (2.6) leads to the result for part (b).
Parts (a) and (b), respectively, say that and are both increasing functions of . Simple algebra can establish that . Both and are nonnegative due to the increasing property of the functions and , and hence and . The values that and can take are only zero and one, as the functions and take only integer values. This completes the proof of part (c).
Lemma 2.2. For any given $t$, the bound of (2.3) is a UPB for the false discovery proportion $\mathrm{FDP}(t)$, that is,
The proof is straightforward by using the fact that $V(t) \le R(t)$. We have In the third line, we have used the fact that the set is the same as the set , which is obtained by noting that $V(t)$ must be zero when $R(t) = 0$. Following this proof, we can easily see that The basic construction in (2.3) is the idea central to formulating prediction inference for $\mathrm{FDP}(t)$. In practice, $m_0$ is an unknown parameter. The most conservative approach is to replace $m_0$ with $m$, in which case we obtain a conservative UPB for $\mathrm{FDP}(t)$. The independence assumption among the true null $P$-values can be used to give a confidence inference for $m_0$; thus, we can find a better estimate of the UPB for $\mathrm{FDP}(t)$. For any given confidence level, a UCB for $m_0$ is given by (2.12), where is as defined in Lemma 2.1(c). Since takes values of only zero and one, there exists at least one such that when . Therefore, the bound in (2.12) is well defined. The parameter in (2.12) is used to construct a UCB for $m_0$; more discussion about it can be found in Remark 2.6 following the theorem.
Theorem 2.3.
(a) The quantity in (2.12) is a conservative UCB for $m_0$, that is,
(b) In particular, if the threshold in (2.12) takes the same value as $t$ in the $P$-value rejection region, then
Proof. Use as shorthand for in this proof. We want to establish that
The fact that function is increasing in leads to . On the other hand, if , then is strictly less than , and we must have according to (2.12). is the maximum of such that , and hence as . The increasing property of leads to . Combining this with , we obtain that ; therefore, we conclude
and complete the proof of (2.15).
Note that
we have
Hence, we have the proof of part (a). When is set to be , we have that the set contains from the above derivation, and that from the derivation of Lemma 2.2. Therefore,
Thus, we have the proof of part (b) of this theorem.
Remark 2.4. Theorem 2.3 gives researchers a good sense of the total number of true null hypotheses. Other papers, for example, Storey et al. [10], Benjamini and Hochberg [11], and Langaas et al. [12], gave only point estimates of $m_0$ or $m_0/m$. Part (a) gives a confidence inference for $m_0$, and part (b) gives a simultaneous statement for the UCB of $m_0$ and the UPB of the FDP, which is more interesting. Meinshausen [13] gives a confidence bound for $m_0$ by using resampling methods, while ours exploits the independence information so that it works for finite samples.
Remark 2.5. Theorem 2.3 of Göb [14] implies that for a binomial distribution, the difference between the median and the mean is less than 1. From (2.3), we know the 50% UPB for $\mathrm{FDP}(t)$ can be estimated by the median of $\mathrm{Bin}(m_0, t)$ divided by $R(t)$. This UPB for $\mathrm{FDP}(t)$ is therefore within $1/R(t)$ of $m_0 t / R(t)$. Replacing $m_0$ by $m$ in $m_0 t / R(t)$ is equivalent to the classical BH procedure. For a very large $R(t)$, the term $1/R(t)$ can be ignored, and the BH procedure offers an approximate estimate of the 50% UPB for $\mathrm{FDP}(t)$.
Remark 2.6. When $m_0$ is large, the distribution $\mathrm{Bin}(m_0, t)$ can be closely approximated by the normal distribution $N(m_0 t, m_0 t(1-t))$. Let $z_\gamma$ be the $\gamma$ quantile of the standard normal distribution. After some algebraic manipulations, we obtain a UCB for $m_0$.
Taking $\gamma = 50\%$, we have $z_\gamma = 0$, which is equivalent to (2.3) of Storey et al. [10]. For most practical applications, one can set the value of the parameter at 0.5. Fine-tuning of the parameter will be discussed in Section 3.3.
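A minimal sketch of such a normal-approximation UCB is given below. The exact algebraic form of the paper's bound is not preserved in this excerpt, so the plug-in variance and all names here are our assumptions; only the reduction to Storey's point estimate at the 50% level is taken from the text:

```python
import math
from statistics import NormalDist

def m0_ucb(pvals, lam=0.5, gamma=0.95):
    """Sketch of a normal-approximation UCB for m0 (our reconstruction).

    The count of P-values above the threshold lam is roughly
    N(m0*(1 - lam), m0*lam*(1 - lam)) under independence; inverting this
    with a plug-in variance gives an approximate gamma-level UCB.  With
    gamma = 0.5 the normal quantile is 0 and the bound reduces to the
    Storey-type point estimate (m - R(lam)) / (1 - lam).
    """
    m = len(pvals)
    x = sum(1 for p in pvals if p > lam)   # m - R(lam)
    z = NormalDist().inv_cdf(gamma)        # gamma quantile of N(0, 1)
    m0_hat = x / (1 - lam)                 # point estimate
    ucb = (x + z * math.sqrt(max(m0_hat, 1.0) * lam * (1 - lam))) / (1 - lam)
    return min(ucb, m)                     # m0 can never exceed m
```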
When the rejection region is small, the UCB for $m_0$ obtained from part (b) of Theorem 2.3 may be too conservative. It may be advantageous to have separate threshold values for the rejection region and for the UCB construction. Part (a) of Theorem 2.3 implies the following.
Corollary 2.7. Replacing $m_0$ by its upper confidence bound and by in (2.3), we define Then is a conservative UPB for the false discovery proportion $\mathrm{FDP}(t)$.
3. Upper Prediction Bounds and Simultaneous Inferences
3.1. The Setup
In Section 2, the UPBs for $\mathrm{FDP}(t)$ are only valid for a fixed rejection region of $P$-values. In practice, researchers will not fix the rejection region but adapt it to the actual data. The logic is the same as with single hypothesis testing. In single hypothesis testing with nested rejection regions, for an observed statistic, one will find the rejection region that contains the observed statistic with the smallest type I error, that is,
The same logic can be applied to our false discovery proportion. In this case, we will try to find the largest rejection region $[0,t]$ such that the false discovery proportion is not more than $\beta$, say 10%, with probability $\alpha$. Define
$$t^* = \max\{t : \overline{\mathrm{FDP}}_\alpha(t) \le \beta\}.$$
We then reject any hypothesis whose $P$-value is no greater than $t^*$. If $t^*$ were independent of $V$ and $R$, then we could expect that
Asymptotically, the data-chosen threshold and the FDP may be independent; this question is open for future research. To overcome the independence assumption, we seek an alternative approach: to find simultaneous UPBs for all rejection regions $[0,t]$, $0 \le t \le 1$, that is, to find an upper prediction band such that
Hence we have simultaneous inferences on $\mathrm{FDP}(t)$ for each rejection region $[0,t]$. Following the definition in (2.2) used to construct the UPB for $\mathrm{FDP}(t)$, we want to define simultaneous critical values for $V(t)$. Using the distribution of $\sup_{0 \le t \le 1} V(t)$ is unwise, as this supremum equals $m_0$ with probability one. A better approach is to center $V(t)$, that is, to consider the process $V(t) - m_0 t$.
This leads to a test statistic related to the Kolmogorov-Smirnov test statistic, which gives an upper confidence band for a cumulative distribution function. It turns out that this method leads to very high UPBs when $t$ is close to zero or one. Therefore, we normalize the centered process, that is,
$$W_{m_0} = \sup_{0 < t < 1} \frac{V(t) - m_0 t}{\sqrt{m_0 t (1 - t)}}. \tag{3.6}$$
Note that $W_{m_0}$ is continuously distributed even though each $V(t)$ is discretely distributed. Let $c_\alpha(m_0)$ be the $\alpha$ quantile of $W_{m_0}$, that is,
$$P\{W_{m_0} \le c_\alpha(m_0)\} = \alpha. \tag{3.7}$$
We can then redefine the UPB as
$$\overline{\mathrm{FDP}}_\alpha(t) = \frac{\min\{m_0 t + c_\alpha(m_0)\sqrt{m_0 t(1-t)},\, R(t)\}}{R(t)}. \tag{3.8}$$
Corresponding to Lemma 2.2 and Corollary 2.7, we have similar results below.
Corollary 3.1. For any given $\alpha$, the band of (3.8) is an exact $\alpha$-level upper prediction band for the false discovery proportion $\mathrm{FDP}(t)$, that is,
Corollary 3.2. Denote that . Define Let . Then is a conservative upper prediction band for the false discovery proportion , that is,
Remark 3.3. Using the same idea as in the proof of Lemma 2.2, the proofs of the above corollaries are straightforward after converting the comparison between $\mathrm{FDP}(t)$ and its bound into the comparison between $V(t)$ and the corresponding critical values. This conversion provides a powerful tool for understanding the false discovery proportion.
Remark 3.4. The formulation of the critical values in (3.8) is motivated by the normal approximation of $V(t)$. But our definition gives exact UPBs simultaneously for all $t$ due to the exactness of the quantile.
Remark 3.5. Meinshausen and Rice [6] and Donoho and Jin [15] also utilize the empirical process. However, they focus on the asymptotic theory as the number of hypotheses goes to infinity, which may face the slow convergence problem described in the next section. Our focus is on finite-sample control.
Remark 3.6. Let $U_1, \dots, U_{m_0}$ be mutually independent $U(0,1)$ random variables. After some simplifications, we have that, for each fixed value $k$ of $V(t)$, the ratio $(k - m_0 t)/\sqrt{m_0 t(1-t)}$ is a decreasing function in $t$, for $0 < t < 1$. Equation (3.6) can then be simplified to
$$W_{m_0} = \max_{1 \le i \le m_0} \frac{i - m_0 U_{(i)}}{\sqrt{m_0 U_{(i)}(1 - U_{(i)})}},$$
where $U_{(i)}$ is the $i$th smallest among the $m_0$ samples of the $U(0,1)$ distribution. This formulation facilitates the computation of the distribution of $W_{m_0}$ by Monte Carlo methods. The standard error associated with the Monte Carlo simulations in computing the probability in (3.7) is no greater than $1/(2\sqrt{B})$, where $B$ is the number of simulations.
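The Monte Carlo recipe described in this remark can be implemented directly. In the Python sketch below, the simulation size, seed handling, and function names are our own choices, and the statistic implements our reading of the simplified form of (3.6):

```python
import math
import random

def w_statistic(n, rng):
    """One draw of the normalized statistic from n ordered uniforms:
    max over i of (i - n*U_(i)) / sqrt(n * U_(i) * (1 - U_(i)))."""
    u = sorted(rng.random() for _ in range(n))
    return max((i - n * ui) / math.sqrt(n * ui * (1 - ui))
               for i, ui in enumerate(u, start=1))

def w_quantile(n, alpha=0.95, n_sim=2000, seed=0):
    """Monte Carlo alpha quantile of the statistic for n true nulls.
    The paper recommends far more simulations for accurate tail quantiles."""
    rng = random.Random(seed)
    draws = sorted(w_statistic(n, rng) for _ in range(n_sim))
    return draws[min(math.ceil(alpha * n_sim) - 1, n_sim - 1)]
```

With $B$ simulations the Monte Carlo standard error of an estimated probability is at most $1/(2\sqrt{B})$, which is why a large number of simulations is recommended when the tail quantiles are needed.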
3.2. Computing the Distribution of
In order to make simultaneous inferences, we need to know the distribution of the statistic defined in (3.6). Example 1 of Jaeschke [16] showed that asymptotically, for any , This implies that converges to 1 in probability as $m_0$ goes to infinity. Jaeschke [16] claimed that this probability convergence is of almost no practical use. This is where we need to be cautious in using asymptotic results. Figure 1 shows the poor approximation of the asymptotic result, even for a very large $m_0$. Noe and Vandewiele [17] gave an iterative algorithm to compute the exact probability. Their algorithm is only good for very small $m_0$ due to the computational time and the propagation of precision errors in representing real numbers on a computer. Equation (24) of their paper gives an approximate formula, which is very good for but deviates from the true probability when . For our applications, the 50% quantile (median) of the statistic is very useful, but the approximation of (3.14) is poor there.
In order to overcome the above poor approximation, we propose to use the Monte Carlo method to obtain the probability. Figures 2 and 3 give the probability for with simulations for , and . The Monte Carlo quantiles for are almost the same as the quantiles that can be computed by the exact algorithm in Noe and Vandewiele [17]. Our two figures show that the distribution of the statistic does not change dramatically with $m_0$. This property is beneficial for a multiple testing problem with a large number of hypotheses, as it will not be overpenalized. Table 1 gives the quantiles of the statistic with .
3.3. More about the Upper Confidence Bound for $m_0$
In computing the UCB for $m_0$ and consequently the UPB for $\mathrm{FDP}(t)$, we rely on an unspecified tuning parameter. A conventional choice for it is 0.5. It is tempting to minimize the UCB over this parameter and use the minimum as the best UCB for $m_0$. This approach should be avoided, as it may lead to an overoptimistic UCB. We can use the same idea as in computing the simultaneous upper prediction bounds for $\mathrm{FDP}(t)$ to find a UCB for $m_0$. Equation (2.20) motivates the following theorem.
Theorem 3.7. Define as where Let Using to replace in (3.8) results in . We have Thus, simultaneously, is a UCB for $m_0$ and is an upper prediction band.
Proof. Note that when , if and only if Therefore, The last step follows from two facts: (i) is no less than for any , and (ii) . Fact (ii) gives Following the same idea as in the proof of Theorem 2.3 part (b), we can show that the set is a superset of . Therefore, Readers should note that the maximum quantile defined in (3.16) is only used to construct . The construction of itself does not use the maximum but the quantile , while still depends on the maximum quantiles indirectly through .
3.4. The Algorithm
Putting all these pieces together, we describe the procedure to compute the upper prediction band for $\mathrm{FDP}(t)$ and the UCB for $m_0$ in Algorithm 1. Note that we have to compute the quantile and . This is very time consuming for large $m$, which is typically from thousands to tens of thousands. The computational time can be reduced by the following strategies.
(1) After careful study of the two equations (3.8) and (3.15), we find that if we replace all and by , where , the conclusions of Corollaries 3.1 and 3.2 and Theorem 3.7 still hold.
(2) The quantile is an increasing function of , as shown by the Monte Carlo simulations in Figures 2 and 3. A rigorous mathematical proof of this finding is open to future research. For practical applications, we can first use Monte Carlo simulations to verify this property for the range of relevant to the project and then replace all by .
(3) Figure 2 shows that is very close to if is close to a large , say more than 100. Therefore, in practical computations, we can first compute and store a representative sequence of the quantiles for , and consequently we can get an upper bound for for . In computing the quantiles, we recommend at least Monte Carlo simulations in order to get an accurate quantile computation for the tail part. Even with simulations, we still see a small amount of random noise in the tail part in Figure 3.
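Strategy (1), replacing the per-$m_0$ quantiles by a single conservative quantile with $m_0$ set to $m$, yields a particularly simple band. The Python sketch below is our reconstruction of that conservative pipeline; `c_alpha` stands for the Monte Carlo quantile of Section 3.2, and the band shape follows our reading of (3.8):

```python
import math

def fdp_band(pvals, c_alpha):
    """Conservative simultaneous FDP band with m0 replaced by m,
    evaluated at the sorted P-values, where R(t) jumps."""
    m = len(pvals)
    ps = sorted(pvals)
    band = []
    for k, t in enumerate(ps, start=1):   # R(t) = k at the k-th smallest P-value
        crit = m * t + c_alpha * math.sqrt(m * t * (1 - t))
        band.append(min(crit / k, 1.0))
    return ps, band

def pick_threshold(pvals, c_alpha, beta=0.1):
    """Largest observed P-value whose simultaneous FDP bound is <= beta."""
    best = 0.0
    for t, b in zip(*fdp_band(pvals, c_alpha)):
        if b <= beta:
            best = max(best, t)
    return best
```

Because the band holds simultaneously over all $t$, the data-dependent threshold returned by `pick_threshold` inherits the prediction guarantee, which is the point of Section 3.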
4. A Focused Inference on the FDP and $m_0$: A Unified Approach
In many applications, it may be unnecessary to compute the simultaneous UPBs for $\mathrm{FDP}(t)$ for all $t$ in $[0,1]$, or to use all thresholds in $[0,1]$ to derive a UCB for $m_0$. In most applications, it may be reasonable to restrict the rejection thresholds onto an interval $[t_1, t_2]$. The lower endpoint $t_1$ can take the value $0.01/m$, based on Bonferroni FWER control at level 0.01; it is rare to consider a smaller rejection region than this. The upper endpoint $t_2$ can take the value 0.05, as it is rare to consider a larger rejection region even in a single hypothesis testing problem. For the same reason, we can also restrict the threshold used for the UCB of $m_0$ onto an interval $[\lambda_1, \lambda_2]$ in (3.17). The interval $[\lambda_1, \lambda_2]$ can be taken as a region close to one, as the minimum of the estimate of $m_0$ is reached when the threshold is close to 1 [18]; but if the threshold is too close to 1, the estimate is not stable. One good choice of $\lambda_1$ can be 0.8, and $\lambda_2$ can be 0.95.
The above scenario is a focused inference on the FDP and $m_0$. The quantile in Section 3.2 will be a little more conservative than necessary for us. We can redefine it as follows:
From this we can define the quantile , and derive results similar to Theorem 3.7. Figure 4 shows the quantiles for and . Table 2 gives the numerical values for . It clearly shows that the restricted quantile is around 10% smaller than the unrestricted quantile. For small values of , say , the former is at least 25% smaller than the latter.
Corollary 4.1. Define as where . Let Define Replacing by results in . We have that is a UCB for $m_0$, and is an upper prediction band for $\mathrm{FDP}(t)$ for $t$ in the restricted interval, that is,
Note that the UCB for $m_0$ takes not only the minimum over the rejection-threshold interval, but also the minimum over the interval near one. This advantage is due to the construction of the quantile, which takes the maximum over these two intervals. The details of the calculation are summarized in Algorithm 2. The proof of this corollary is the same as that of Theorem 3.7.
By setting the two intervals to single points, this corollary is equivalent to Theorem 2.3 through some algebraic manipulations, while Theorem 2.3 uses the exact confidence bound from the binomial distribution without relying on the quantiles of the normalized empirical process. Furthermore, Theorem 3.7 is exactly a special case of Corollary 4.1, obtained by widening the intervals and by considering open intervals rather than closed intervals. The focused inference thus unifies both the fixed rejection region approach and the simultaneous approach. We should be cautious about selecting the interval endpoints based on the observed data, which may result in overoptimistic false discovery proportions. These settings have to be decided before the data are generated. A careful study of choosing appropriate values for the endpoints is open for future research.
5. Generalizing the Results to Less-Independent Situations
The results of Sections 2, 3, and 4 are based on the assumption that the true null $P$-values are independently distributed as $U(0,1)$. Given this, we need no further assumptions concerning the false null $P$-values. This independence assumption can be weakened as follows:
Binomial Dominant Condition: one has $V(t) \preceq \mathrm{Bin}(m_0, t)$ for each fixed $t$.
The notation $X \preceq Y$ means that the random variable $X$ is stochastically no greater than the random variable $Y$, that is, $P(X > x) \le P(Y > x)$ for any $x$. Replacing the independence assumption by the binomial dominant condition, the results corresponding to Lemma 2.2, Theorem 2.3, and Corollary 2.7 in Section 2 still hold for a fixed rejection region. For the simultaneous UPBs in Sections 3 and 4, we need a stronger assumption than the binomial dominant condition, as the joint distribution of the process $\{V(t) : 0 \le t \le 1\}$ needs to be specified. We can replace the binomial dominant condition by the following.
Joint Binomial Dominant Condition: for any $k$ and any $0 \le t_1 < \cdots < t_k \le 1$, $(V(t_1), \dots, V(t_k)) \preceq (V^*(t_1), \dots, V^*(t_k))$. Here $V^*(t) = \#\{i : U_i \le t\}$, and $U_1, \dots, U_{m_0}$ are mutually independently distributed as the $U(0,1)$ distribution. For random vectors, the notation $\preceq$ means $P(V(t_1) > v_1, \dots, V(t_k) > v_k) \le P(V^*(t_1) > v_1, \dots, V^*(t_k) > v_k)$ for any $v_1, \dots, v_k$.
Replacing the independence assumption by this joint binomial dominant condition, the results in Sections 3 and 4 are still valid for the upper prediction band for $\mathrm{FDP}(t)$ and the UCB for $m_0$. A special case of this joint binomial dominant condition is when the true null $P$-values are independent with distribution stochastically no smaller than $U(0,1)$. This happens when the null hypothesis is composite or when the statistic used to test the null hypothesis is not a continuous random variable.
More generally, we would like the construction of the upper prediction band for the FDP not to rely on the independence assumption or any kind of weak-dependence assumption (the binomial dominant condition or the joint binomial dominant condition). The method of Romano and Shaikh [7, 8] can be applied without any assumptions on the dependence but may potentially lose power because the correlation structure of the data is not exploited. Resampling procedures [13, 19] have been proposed to address this limitation.
6. Discussion
The method of this paper applies to data where the true null $P$-values are independent, or to slightly dependent data where the joint binomial dominant condition is satisfied. This assumption does not involve any specification of the false null $P$-values. In this paper we used the idea of a fixed rejection region to construct a UPB for $\mathrm{FDP}(t)$ and a UCB for $m_0$. By utilizing the normalized empirical process, we found simultaneous UPBs for $\mathrm{FDP}(t)$ for all $t$ and further modified the construction of the UCB for $m_0$. The result of Theorem 3.7 gives a joint statement about the UCB for $m_0$ and the simultaneous UPBs for the false discovery proportions. The focused approach in Corollary 4.1 unifies the fixed rejection region method and the simultaneous approach.
The method in this paper is based on finite samples and avoids the slow convergence problem of the asymptotic theory for the empirical process. The Monte Carlo simulations give very accurate estimates of the quantiles. The standard error associated with the Monte Carlo simulations in computing the probability in (3.7) is no greater than $1/(2\sqrt{B})$, where $B$ is the number of simulations.
In datasets where the test statistics are not independent or do not satisfy the joint binomial dominant condition, the method in this paper is not guaranteed to work. One can alternatively use the methods proposed in Romano and Shaikh [7, 8], Meinshausen [13], and Ge et al. [19]. The method proposed in this paper can potentially be extended to dependent data by using resampling, and this work is open for future research.
Acknowledgments
The authors thank Terry Speed, Stuart Sealfon, Carol Bodian, Sylvan Wallenstein, John Mandeli, Samprit Chatterjee, and Jim Godbold for their discussions of this work. They thank the editor and referees for helpful comments that have led to an improved paper. This work was partly supported by National Institute of Allergy and Infectious Diseases with the contract HHSN266200500021C.