Abstract

The false discovery proportion (FDP), the proportion of incorrect rejections among all rejections, is a direct measure of abundance of false positive findings in multiple testing. Many methods have been proposed to control FDP, but they are too conservative to be useful for power analysis. Study designs for controlling the mean of FDP, which is false discovery rate, have been commonly used. However, there has been little attempt to design study with direct FDP control to achieve certain level of efficiency. We provide a sample size calculation method using the variance formula of the FDP under weak-dependence assumptions to achieve the desired overall power. The relationship between design parameters and sample size is explored. The adequacy of the procedure is assessed by simulation. We illustrate the method using estimated correlations from a prostate cancer dataset.

1. Introduction

Modern biomedical research frequently involves parallel measurements of a large number of quantities of interest, such as gene expression levels, single nucleotide polymorphism (SNP) and DNA copy number variations. The scientific question can often be formulated as a multiple testing problem. In order to address the multiplicity issue, many methods have been proposed to control the family-wise error rate (FWER), false discovery rate (FDR) or false discovery proportion (FDP). Controlling FDR has been widely used in high-dimensional data analysis [13]. FDR is the expected value of the FDP, which is the proportion of incorrect rejections among all rejections. Controlling FDR ensures that the average of FDP from many independently repeated experiments is under control. However, the variability of FDP is ignored, and the actual FDP could be much greater than FDR with high probability, especially when test statistics are correlated. Consequently, researchers have proposed many procedures to control FDP directly [411]: PFDP𝑟1𝑐1(1.1) for given 𝑟1 and 𝑐1. This is a more stringent criterion than FDR because the proportion of false rejections is bounded above by 𝑟1 with high probability. The FDP controlling procedures are generally too conservative to be useful for the purpose of study design or power analysis.

When we design studies involving multiple testing, it is important to determine sample size to ensure adequate statistical power. Methods for calculating sample size have been proposed to control various criteria, for example, FWER [1214], FDR [1520], the number of false discoveries [19, 21] and FDP [22]. For controlling FDP, Oura et al. [22] provided a method to calculate sample size using the beta-binomial model for the sum of rejection status of true alternative hypotheses. It is assumed that only test statistics of true alternative hypotheses are dependent, with a parametric correlation structure. This assumption is restrictive because null test statistics can also be correlated and the dependence structure can be more complicated than the assumed parametric correlation structure. Furthermore, the computation is intensive because computation of the beta-binomial distribution is required. However, to our knowledge this is the only paper that directly deals with this important design problem.

In this paper, we provide a more general method of sample size calculation for controlling FDP under weak-dependence assumptions. Under some assumptions on dependence among test statistics, explicit formulas for the mean and variance of FDP have been derived for each fixed effect size [23]. The formulas elucidate the effects of various design parameters on the variance of FDP. Moreover, the formulas provide a convenient tool to calculate sample size for controlling the FDP. As in [13, 18, 19, 24], we consider the probability of detecting at least a specified proportion of true alternative hypotheses as the power criterion. An iterative computation algorithm for calculating sample size is provided. Simulation experiments indicate that studies with the resultant sample sizes satisfy the power criterion at the given rejection threshold. We illustrate the sample size calculation procedure using a prostate cancer dataset.

2. Methods

2.1. Notation

Suppose that 𝑚 hypotheses are tested simultaneously. Let 𝑀0 denote the index set of 𝑚0 tests for which null hypotheses are true and 𝑀1 the index set of 𝑚1=𝑚𝑚0 tests for which alternative hypotheses are true. Denote the proportion of true null hypotheses by 𝜋0=𝑚0/𝑚. We reject a hypothesis if the 𝑃 value is less than some threshold 𝛼, and denote the rejection status of the 𝑖th test by 𝑅𝑖(𝛼)=𝐼(𝑝𝑖<𝛼), where 𝑝𝑖 denotes the 𝑃 value of the 𝑖th test and 𝐼() is an indicator function. The number of rejections is 𝑅=𝑚𝑖=1𝑅𝑖(𝛼). Let the comparison-wise type II error of the 𝑖th test be 𝛽𝑖 and the average type II error be 1𝛽=𝑚1𝑖𝑀1𝛽𝑖.(2.1) Table 1 summarizes the outcomes of 𝑚 tests and their expected values.

Denote the Pearson correlation coefficient of two rejection indicators by 𝜃𝑖𝑗𝑅=corr𝑖(𝛼),𝑅𝑗.(𝛼)(2.2) Furthermore, for 𝑖,𝑗𝑀0, define 𝜃𝑉𝑖𝑗𝑅=corr𝑖(𝛼),𝑅𝑗.(𝛼)(2.3) Let the average correlation be denoted as 𝜃𝑉=𝑖,𝑗𝑀0,𝑖𝑗𝜃𝑉𝑖𝑗𝑚0𝑚0.1(2.4) Similarly, for 𝑖,𝑗𝑀1, we define 𝜃𝑈𝑖𝑗𝑅=corr𝑖(𝛼),𝑅𝑗.(𝛼)(2.5) The average correlation is 𝜃𝑈=𝑖,𝑗𝑀1,𝑖𝑗𝜃𝑈𝑖𝑗𝑚1𝑚1.1(2.6) In addition, for 𝑖𝑀1,𝑗𝑀0, denote 𝜃𝑖𝑗𝑈𝑉𝑅=corr𝑖(𝛼),𝑅𝑗.(𝛼)(2.7) Denote the average correlation by 𝜃𝑈𝑉=𝑖𝑀1𝑗𝑀0𝜃𝑖𝑗𝑈𝑉𝑚0𝑚1.(2.8)

2.2. The Effect of Design Parameters on the Variance of FDP

It has been shown via numerical studies that the variability of FDP increases when test statistics are dependent [25, 26]. But the relationship between design parameters and the variance of FDP has not been examined through analytical formulas. Under the assumptions of common effect size and weak dependence among test statistics, explicit formulas for the mean (𝜇𝑄) and variance (𝜎2𝑄) of the FDP have been derived [23]: 𝜇𝑄𝜋0𝛼𝜋0𝛼+1𝜋01𝛽,𝜎(2.9)2𝑄𝜋01𝜋02𝛼(1𝛼)1𝛽𝜋0𝛼+1𝜋01𝛽4Σ,(2.10) where 1Σ=𝑚1𝜋𝛽+01𝜋0𝜔𝛽+𝜋01𝑚1𝛽𝜃𝑉+𝜋0𝜔𝛽𝜃𝑈2𝜋0𝜔𝛽1𝛽𝜃𝑈𝑉(2.11) and 𝜔=𝛼/(1𝛼).

The variance formula (2.10) elucidates the effects of various design parameters on the variance of FDP. To explore the effects, in Figure 1 we calculated 𝜎𝑄 using (2.10) and plotted it against 𝑚 for different correlations 𝜃𝑉. We set 𝜋0=0.7 and 𝑚 in the range of 1000 to 10000. The average correlations 𝜃𝑈 and 𝜃𝑈𝑉 are fixed to be 0.001 and 0, respectively. The levels of 𝛼 and 𝛽 are chosen such that FDR is 3% or 5%. At each value of 𝜃𝑉, 𝜎𝑄 decreases as the number of tests 𝑚 increases. The solid line shows the standard deviation of the FDP when 𝜃𝑉 is 0. When 𝜃𝑉 is not 0, 𝜎𝑄 increases evidently. If test statistics are highly correlated, FDP can be much greater than its mean FDR at a given rejection threshold due to its large variability.

In Figure 2, the relationship between 𝜎𝑄 and 𝜋0 was investigated. When other parameters are fixed, 𝜎𝑄 increases as 𝜋0 increases.

Figure 3 shows that 𝜎𝑄 increases as 𝛽 increases. When other factors are fixed, the variability of FDP is smaller when the comparison-wise type II error is smaller.

2.3. Power and Sample Size Analysis

Under some general regularity conditions including weak dependence among test statistics, the FDP follows an asymptotic normal distribution 𝑁(𝜇𝑄,𝜎2𝑄) [23, 27]. As was pointed out by Shang et al. [23], 𝑌=log(FDP) also has an asymptotic normal distribution by the delta method, and under weak dependence log(FDP) is closer to normal than the FDP itself. The approximate mean and variance of 𝑌=log(FDP) are [23] 𝜇𝑌𝜇log𝑄𝜋log0𝛼𝜋0𝛼+1𝜋01𝛽,𝜎(2.12)2𝑌1𝜋02(1𝛼)1𝛽𝜋0𝛼𝜋0𝛼+1𝜋01𝛽2Σ,(2.13) where Σ is in (2.11).

To control FDP with desired power, criterion (1.1) has to be satisfied. Asymptotic normality of log(FDP) implies that Φlog𝑟1𝜇𝑌𝜎𝑌𝑐1,(2.14) where Φ() is the cumulative distribution function (CDF) of standard normal distribution, 𝜇𝑌 is in (2.12), and 𝜎2𝑌 is in (2.13).

There are two commonly used power criteria in multiple testing: the average power, defined as E(𝑈/𝑚1), and the overall power, defined as P(𝑈/𝑚1𝑟2) for given 𝑟2. When a study is designed using the average power criterion, the proportion of true alternative hypotheses rejected will be greater than a prespecified number on average. However, under dependence among test statistics the variability of 𝑈/𝑚1 increases [18], and the study can be underpowered with high probability. Consequently, the overall power has been used in [13, 18, 19, 24] and we also use this power criterion, P𝑈𝑚1𝑟2𝑐2(2.15) for given 𝑟2 and 𝑐2.

Under the weak-dependence assumptions in [18, 23], 𝑈/𝑚1 has an asymptotic normal distribution: 𝑈𝑚1𝑁1𝛽,𝛽1𝛽𝑚11+𝜃𝑈𝑚1.1(2.16) Setting the inequality in (2.15) to equality, the following equation for 𝛽 can be obtained as in [18]: 𝛽=1𝑟212𝑟2+4𝑚1𝑟21𝑟2+12𝑚1,+2(2.17) where 𝑚1=𝑚1/{1+𝜃𝑈(𝑚11)}𝑧21𝑐2 and Φ(𝑧1𝑐2)=𝑐2.

For illustration, consider that a two-sample one-sided 𝑡-test is performed. Let 𝛿 denote the effect size (mean difference divided by the common standard deviation), and 𝑎1 and 1𝑎1 denote the allocation proportion for two groups. We first find 𝛼 and 𝛽 which fulfill criteria (1.1) and (2.15). The required sample size 𝑛 is the smallest integer satisfying the following inequality: 1𝛽1Γ𝑛2𝑡𝑛2,𝛼|||𝛿𝑎11𝑎1𝑛,(2.18) where Γ𝑛2(𝑏) is the CDF of a noncentral 𝑡-distribution with 𝑛2 degrees of freedom and noncentrality parameter 𝑏 and 𝑡𝑛2,𝛼 is the upper 𝛼 critical value of the central 𝑡-distribution with 𝑛2 degrees of freedom.

Following the notation defined in Section 2.1, the correlations can be calculated as: 𝜃𝑉𝑖𝑗=Ψ𝑛2𝑡𝑛2,𝛼,𝑡𝑛2,𝛼;𝜌𝑖𝑗𝛼2,𝜃𝛼(1𝛼)(2.19)𝑈𝑖𝑗=Ψ𝑛2𝑡𝑛2,𝛽,𝑡𝑛2,𝛽;𝜌𝑖𝑗1𝛽2𝛽1𝛽,𝜃(2.20)𝑖𝑗𝑈𝑉=Ψ𝑛2𝑡𝑛2,𝛼,𝑡𝑛2,𝛽;𝜌𝑖𝑗𝛼1𝛽𝛼(1𝛼)𝛽1𝛽,(2.21) where Ψ𝑛2 is the CDF of a bivariate 𝑡-distribution with 𝑛2 degrees of freedom and 𝜌𝑖𝑗 denotes the Pearson correlation between the 𝑖th and 𝑗th test statistics. As can be seen from these formulas, the correlations depend on 𝛼 and 𝛽. No analytical solutions can be found for these two parameters. We use the following iterative computation algorithm to calculate sample size.

Algorithm.(1)Input design parameters 𝑟1,𝑐1,𝑟2,𝑐2,𝑚,𝜋0,𝛿,𝑎1,𝜌𝑈,𝜌𝑉and𝜌𝑈𝑉. (2)Start from 𝜃𝑈 = 0, 𝜃𝑉 = 0 and 𝜃𝑈𝑉 = 0. (3)Calculate 𝛽 from (2.17). (4)Using the current values of 𝜃𝑈, 𝜃𝑉, 𝜃𝑈𝑉 and 𝛽, solve for 𝛼 from equation Φ((log𝑟1𝜇𝑌)/𝜎𝑌)=𝑐1. (5)Using the current estimates of 𝛽 and 𝛼, calculate 𝜃𝑉, 𝜃𝑈 and 𝜃𝑈𝑉 from (2.19), (2.20) and (2.21), respectively. Obtain the average correlations 𝜃𝑉, 𝜃𝑈 and 𝜃𝑈𝑉. (6)With updated estimates of 𝜃𝑉, 𝜃𝑈 and 𝜃𝑈𝑉, repeat steps 3 to 5 until the estimates of 𝛽 and 𝛼 converge.(7)Plug the estimated 𝛽 and 𝛼 into (2.18) to solve the sample size.
The estimates of rejection threshold 𝛼 and comparison-wise type II error 𝛽 can also be obtained.

3. Numerical Studies

3.1. Simulation

The proposed sample size calculation procedure was illustrated for one-sided 𝑡-test comparing the mean of two groups. The effect size 𝛿=1 and allocation proportion 𝑎1=0.5. Two types of correlation structures were used: blockwise correlation and autoregressive correlation structure. In the blockwise correlation structure, a proportion of test statistics were correlated in units of blocks. The correlation coefficient within block was a constant, and test statistics were independent across blocks. True null test statistics and true alternative test statistics were independent. In the autoregressive correlation structure, the correlation matrix (𝜎𝑖𝑗) for dependent test statistics was parameterized by 𝜎𝑖𝑗(𝜌)=𝜌|𝑖𝑗|, where 𝜎𝑖𝑗(𝜌) is the Pearson correlation coefficient for the 𝑖th and 𝑗th test statistics and 𝜌 is a correlation parameter.

Oura et al. [22] provided a sample size calculation method for controlling FDP using the beta-binomial model. Only test statistics of true alternative hypotheses are allowed to be dependent, with blockwise correlation structure. For comparison, this method and the sample size calculation procedure for controlling FDR with dependence adjustment in [18] were also assessed. Specifically, the criteria for controlling FDP are 𝑈P(FDP0.05)0.95,P𝑚10.90.8.(3.1) The criteria for controlling FDR are 𝑈FDR0.05,P𝑚10.90.8.(3.2)

Table 2 presents the sample size estimates for the blockwise correlation structure. Several parameter configurations were used. The block size is 20 or 100, for 𝑚 = 2000 or 10000, respectively. We observe that the sample size increases as the correlation between test statistics gets stronger, represented by a greater correlation parameter or a larger proportion of correlated test statistics. When the correlation is fixed, as the number of tests 𝑚 increases, the required sample size decreases. With the other parameters fixed, when the number of true alternative hypotheses increases (𝜋0 decreases), the required sample size decreases.

The sample sizes for controlling FDP are greater than those for controlling FDR because controlling FDP is in general more stringent. In the case that 𝜋0 = 0.9, 𝑝𝑣 = 0.3, 𝜌𝑣 = 0.6 and 𝑚 = 2000 (see Table 2), the sample size for controlling FDP is 81, which is 23% greater than the sample size for controlling FDR. The sample sizes using the method in [22] are in parentheses and are slightly smaller than ours. In terms of computational efficiency, our algorithm converges very fast and generally within 10 steps. The computation is not heavy, and in fact, very similar and comparable to that in [18] for controlling FDR with dependence adjustment. The method of Oura et al. [22] is more computationally intensive. It becomes not feasible when the number of tests or the number of blocks of dependent test statistics is large. Simulation studies show that FDP is controlled and the power is achievable with the sample size given by our procedure at the calculated rejection threshold 𝛼 (results not shown).

Table 3 presents the sample sizes for the autoregressive correlation structure. Similar trends for sample size are observed as the design parameters vary. The method in [22] is not applicable to this dependence structure.

3.2. Sample Size Calculation Based on a Prostate Cancer Dataset

We use a prostate cancer dataset as source of correlation structure to illustrate the proposed sample size calculation method while ensuring overall power. The study by Wang et al. [28] investigated the association between mRNA gene expression levels and the aggressive phenotype of prostate cancer. The dataset contains 13935 mRNA measured from 62 patients with aggressive prostate cancer and 63 patients with nonaggressive disease. The method in [23] was used to estimate the correlation between gene expression levels. The estimated average correlation of expression levels of null genes, alternative genes and between null genes and alternative genes are 0.0040, 0.0043 and −0.0005, respectively.

Sample size was calculated for one-sided 𝑡-test. The total number of genes is 𝑚=10000, and the following criteria are to be satisfied: P(FDP0.10)0.7 and P(𝑈/𝑚10.9)0.8. Table 4 presents the sample sizes for various values of 𝑚1. We performed simulation studies to confirm that these sample sizes provided adequate power at the rejection threshold given by our algorithm. Simulation data were generated with blockwise dependence structure such that the average correlation was close to the estimated correlation from the real dataset.

4. Discussion

In practice, when planning a study one typically needs to make some assumptions. For designing multiple testing studies, a common assumption is that the dependence between test statistics is weak. In this paper, we provide a computationally effective method of sample size calculation for controlling FDP under weak dependence while achieving the desired overall power. This approach uses semiparametric assumptions on dependence structure. We only need to estimate the Pearson correlation between test statistics, and thus this method is applicable to many realistic settings where weak dependence can be assumed. The variance formula of FDP provides a convenient tool to uncover the relationship between the design parameters and the variability of FDP under the assumption of weak dependence. Simulation studies indicate that the algorithm is computationally efficient and stable.

We have used one-sided 𝑡-test to illustrate the method, and the procedure can be easily extended to two-sided 𝑡-test and other tests. Common effect size is assumed in the sample size calculation algorithm. In practice, one can try different effect sizes, see the range of sample sizes and the variability and then make a decision. Effects of variations on other parameters such as 𝜋0 can be examined similarly.

Acknowledgments

We would like to thank the reviewers and the Editor for their careful reading of our paper and the constructive suggestions. This research was partially supported by a Stony Wold-Herbert Foundation grant and NIH grants 2P30 CA16087, 5P30 ES00260, UL1 TR000038 and R03 CA153083.