Journal of Probability and Statistics
Volume 2012, Article ID 935621, 10 pages
Research Article

Sample Size Growth with an Increasing Number of Comparisons

Chi-Hong Tseng1 and Yongzhao Shao2

1Department of Medicine, UCLA School of Medicine, Los Angeles, CA 90095, USA
2Division of Biostatistics, NYU School of Medicine, New York, NY 10016, USA

Received 30 March 2012; Accepted 8 June 2012

Academic Editor: Wei T. Pan

Copyright © 2012 Chi-Hong Tseng and Yongzhao Shao. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


An appropriate sample size is crucial for the success of many studies that involve a large number of comparisons. Sample size formulas for testing multiple hypotheses are provided in this paper. They can be used to determine the sample sizes required to provide adequate power while controlling the familywise error rate or the false discovery rate, to derive the growth rate of the sample size with respect to an increasing number of comparisons or a decreasing effect size, and to assess the reliability of study designs. It is demonstrated that practical sample sizes can often be achieved even when adjustments for a large number of comparisons are made, as in many genomewide studies.

1. Introduction

With the recent advancement in high-throughput technologies, simultaneous testing of a large number of hypotheses has become a common practice for many types of genomewide studies. Examples include genetic association studies and DNA microarray studies. In a genomewide association analysis, a large number of genetic markers are tested for association with the disease [1]. In DNA microarray studies, the interest is typically to identify differentially expressed genes between patient groups among a large number of candidate genes [2].

The challenges in designing such large-scale studies include the selection of features of scientific importance to be investigated, the selection of an appropriate sample size to provide adequate power, and the choice of methods appropriate for the adjustment of multiple testing [3–7]. There have been recent methodological breakthroughs in multiple comparisons, notably in controlling the false discovery rate (FDR) [8, 9], which is particularly useful for the study of DNA microarrays and protein arrays and is also increasingly used in genomewide association studies [10]. On the other hand, the Bonferroni-type adjustment is still surprisingly useful. For example, Klein et al. [1] successfully identified two SNPs that are associated with age-related macular degeneration (AMD) using a Bonferroni adjustment. Witte et al. [11] provided an interesting observation that the relative sample size, based on the Bonferroni adjustment, is approximately linear in the logarithm of the number of comparisons.

An appropriate sample size is crucial for the success of studies involving a large number of comparisons. However, an optimal and reliable sample size is extremely challenging to identify, as it typically depends on other design parameters that often have to be estimated from preliminary data. Preliminary data are often limited at the design stage, which leads to unreliable estimates of design parameters and creates extra uncertainty in sample size estimation. Thus, it is of great practical interest to examine the relationship between sample size and other design parameters, such as the number of comparisons to be made. In this paper, we go beyond Witte et al.'s [11] observation by providing explicit sample size formulas, examining various genomic analyses, and deriving a sample size formula for FDR control. The explicit sample size formulas are desirable because they elucidate how changes in other design parameters affect the sample size. This is of fundamental importance for understanding the reliability of study designs.

2. Sample Size Formulas

For testing a single hypothesis, the sample size problem is typically formulated as finding the number of subjects needed to ensure a desired power 1 − β for detecting an effect size Δ at a prespecified significance level α. Consider a one-sided test for equality of two normal means with known variances σ1² and σ2², respectively. The sample size per group (n) is as follows [12]:

n = (z_α + √C z_β)² / Δ²,  (2.1)

where Δ = |μ1 − μ2| / √(σ1² + σ2²), C = 1, z_t is defined by Φ(z_t) = 1 − t, and Φ(z) is the cumulative distribution function (CDF) of the standard normal distribution.
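As a quick numerical check of (2.1), the following sketch (Python, using only the standard library's NormalDist; the function name is ours, not the paper's) computes the per-group sample size for a one-sided two-sample z-test:

```python
import math
from statistics import NormalDist

def sample_size_per_group(alpha, beta, delta, c=1.0):
    """Per-group sample size from (2.1): n = (z_alpha + sqrt(C)*z_beta)^2 / Delta^2.

    z_t denotes the upper-t standard normal quantile, i.e., Phi(z_t) = 1 - t.
    """
    z = NormalDist().inv_cdf          # standard normal quantile function
    z_alpha = z(1 - alpha)            # upper-alpha quantile
    z_beta = z(1 - beta)              # upper-beta quantile
    return (z_alpha + math.sqrt(c) * z_beta) ** 2 / delta ** 2

# Detect Delta = 0.5 with 90% power (beta = 0.10) at one-sided alpha = 0.05:
n = sample_size_per_group(alpha=0.05, beta=0.10, delta=0.5)
print(math.ceil(n))  # 35 subjects per group
```

Rounding up to the next integer, as is conventional, gives 35 subjects per group for this setting.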

Many of the most widely used statistical tests have similar sample size formulas as in (2.1). For example, the commonly used Mann-Whitney test for comparing two continuous distributions without normality assumption has the same form of sample size formula as in (2.1). Similarly, for testing equality of two binomial proportions, using independent samples or using correlated samples as in McNemar’s test, the sample size formulas are also of form (2.1) as discussed in Rosner [12].

For testing a single hypothesis, the influences of 𝛼, 𝛽, and Δ on the sample size 𝑛 can be inferred easily from the above sample size formula (2.1), and are well known. When testing multiple hypotheses, one must guard against an abundance of false-positive results. The traditional criterion for error control in such situations is the familywise error rate (FWER), which is the probability of rejecting one or more true null hypotheses. The simplest and most commonly used method for controlling FWER is the Bonferroni correction, which is discussed in the next subsection.

2.1. FWER Control

In this section, we present sample size formulas for multiple comparisons in the context of controlling the familywise error rate (FWER). Suppose we make M comparisons, each with the same effect size Δ. If we wish to retain a familywise error rate α and power 1 − β, then with the Bonferroni adjustment α_bon = α/M, the sample size corresponding to (2.1) becomes

n_M = (z_{α/M} + √C z_β)² / Δ².  (2.2)

To see how n_M changes as M increases, we can use the following well-known bounds: when α < 0.5,

φ(z_α)(1/z_α − 1/z_α³) ≤ 1 − Φ(z_α) ≤ φ(z_α)/z_α.

Since α/M = 1 − Φ(z_{α/M}), we can approximate z_{α/M} by z̃_{α/M}, where

z̃²_{α/M} = 2 log(M/α) − log(2π) − log(2 log(M/α)).  (2.3)

The explicit approximation of z̃²_{α/M} in (2.3) works extremely well for M ranging from 10 to 10^10. Substituting (2.3) into (2.2) yields the following approximation ñ_M of the required sample size n_M:

ñ_M = (z̃_{α/M} + √C z_β)² / Δ².  (2.4)

Then, for fixed (α, β, Δ), from (2.3) and (2.4), we have

n_M ~ ñ_M ~ (2/Δ²) log(M/α),  as M → +∞.  (2.5)

A few facts are self-evident from the above approximation. First, n_M is an approximately linear function of log M with slope 2/Δ² (on the natural-log scale; plotting against base-10 logarithms rescales the slope by log 10). Second, the impact of β on n_M (or ñ_M) is negligible when M is large. Third, a decrease in α affects n_M (or ñ_M) in the same way as a proportional increase in M, since the two enter only through the ratio M/α. The impact of Δ on n_M (or ñ_M) is demonstrated in Figure 1 with α = 0.05, 1 − β = 0.90, and Δ = 0.5, 1, and 2, respectively. It shows that n_M (open circles) can indeed be approximated well by a linear function of log M. The lines are calculated from the approximate normal quantiles in (2.4), that is, ñ_M. Moreover, when Δ is large (e.g., Δ = 2), the slope is very small.
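To illustrate (2.2)–(2.5) numerically, here is a small sketch (Python; the function names are ours) comparing the exact Bonferroni sample size n_M with the closed-form approximation ñ_M, and checking the near-linearity in log M:

```python
import math
from statistics import NormalDist

Z = NormalDist().inv_cdf  # standard normal quantile function

def n_exact(m, alpha=0.05, beta=0.10, delta=1.0, c=1.0):
    """n_M from (2.2), using the exact quantile z_{alpha/M}."""
    return (Z(1 - alpha / m) + math.sqrt(c) * Z(1 - beta)) ** 2 / delta ** 2

def n_approx(m, alpha=0.05, beta=0.10, delta=1.0, c=1.0):
    """n~_M from (2.4), with z_{alpha/M} approximated via (2.3)."""
    t = 2 * math.log(m / alpha)  # 2 log(M/alpha), natural log
    z_tilde = math.sqrt(t - math.log(2 * math.pi) - math.log(t))
    return (z_tilde + math.sqrt(c) * Z(1 - beta)) ** 2 / delta ** 2

# The approximation is close over a wide range of M ...
for m in (10, 10**4, 10**8):
    exact, approx = n_exact(m), n_approx(m)
    assert abs(exact - approx) / exact < 0.03
# ... and consecutive decades of M add a nearly constant increment to n_M,
# i.e., n_M is roughly linear in log10(M) (the increment approaches
# 2*log(10)/Delta^2 as M grows).
inc1 = n_exact(10**6) - n_exact(10**5)
inc2 = n_exact(10**8) - n_exact(10**7)
```

With Δ = 1 and α = 0.05, each tenfold increase in M adds only about 5 to 6 subjects per group, consistent with the slow growth shown in Figure 1.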

Figure 1: Sample size versus log M (base 10) to detect effect sizes Δ = 0.5, 1, or 2 with 1 − β = 90% power at the familywise significance level α = 5%, when the Bonferroni adjustment is used. The open circles represent the sample sizes calculated from exact normal quantiles (2.2).

The simple Bonferroni correction is very useful when the number of true alternatives is small, as often occurs, for example, in candidate gene association studies. The Bonferroni approach is also easy to apply: it remains convenient when the hypotheses involve many covariates and nuisance parameters, whereas permutation approaches may not be applicable because they require some symmetry or exchangeability of the null hypotheses [13, 14]. Next, we give two practical examples to illustrate the growth rate of the sample size relative to the number of tests M to be performed.

The AMD Example
Age-related macular degeneration (AMD) is a major cause of blindness in the elderly. Klein et al. [1] reported a genomewide screen of 96 cases and 50 controls for polymorphisms associated with AMD. They examined 116,204 single-nucleotide polymorphisms (SNPs), two of which were found to be strongly associated with the disease phenotype. This is an example of testing equality of two binomial proportions in two independent groups (cases and controls). The required sample size for each marker is given by (2.2) or (2.4) with Δ² = (p1 − p2)² / (2 p̄ q̄), C = (p1 q1 + p2 q2) / (2 p̄ q̄), p̄ = (p1 + p2)/2, and q̄ = 1 − p̄. Sample size growth with the Bonferroni correction is plotted in Figure 2 against log M using the SNP rs1329428 (Table 1) identified in Klein et al. [1]. Using the Bonferroni adjustment, the sample sizes are calculated to provide 90% power to detect the association at the familywise significance level α = 5%. The open circles and plus signs are the sample sizes n_M from (2.2) according to the dominant and recessive odds ratios, respectively. The corresponding lines are the sample sizes ñ_M based on (2.4).
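A sketch of this two-proportion calculation (Python; the case/control frequencies below are illustrative placeholders, not the actual rs1329428 values, and the Δ² and C expressions are those of the standard one-sided two-proportion z-test):

```python
import math
from statistics import NormalDist

def n_per_group_binomial(p1, p2, m, alpha=0.05, beta=0.10):
    """Per-group sample size for a one-sided two-proportion z-test at the
    Bonferroni-adjusted level alpha/M, plugging
    Delta^2 = (p1 - p2)^2 / (2*pbar*qbar) and
    C = (p1*q1 + p2*q2) / (2*pbar*qbar) into (2.2)."""
    z = NormalDist().inv_cdf
    pbar = (p1 + p2) / 2
    qbar = 1 - pbar
    delta2 = (p1 - p2) ** 2 / (2 * pbar * qbar)
    c = (p1 * (1 - p1) + p2 * (1 - p2)) / (2 * pbar * qbar)
    return (z(1 - alpha / m) + math.sqrt(c) * z(1 - beta)) ** 2 / delta2

# Hypothetical allele frequencies 0.6 (cases) vs 0.4 (controls),
# Bonferroni-corrected over the 116,204 SNPs of the AMD screen:
n = n_per_group_binomial(0.6, 0.4, 116204)
```

Even after correcting for more than a hundred thousand tests, the required group size for this moderate effect remains in the hundreds, not the thousands.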

Table 1: An SNP from Klein et al. [1].
Figure 2: Sample sizes to detect the association at rs1329428 versus the number of SNPs in the genomewide screen of the AMD study.

The TDT Example
To test for linkage or association in family-based studies, the transmission/disequilibrium test (TDT) of Spielman et al. [15] examines the transmission of an allele from heterozygous parents to their affected offspring. If an allele is associated with disease risk, it may be transmitted more than 50% of the time. Risch and Merikangas [16] studied the required sample size for the TDT in affected sib pairs. The TDT is equivalent to McNemar's test for two correlated proportions with the hypotheses H0: p = 0.5 versus H1: p > 0.5, for the specified alternative p = p_A, where p_A is the probability that an A/B parent transmits allele A to an affected offspring. The sample size (in matched pairs) is given by (2.1) with C = 4 p_A (1 − p_A) and Δ² = [2(p_A − 0.5)]² p_D, where p_D is the projected proportion of discordant pairs among all matched pairs. If we assume that each family used in the analysis has only one marker-heterozygous parent, then n is the number of families required. Sample sizes for the TDT are plotted in Figure 3 using the setup given in Risch and Merikangas [16]. Using the Bonferroni adjustment, the sample sizes are calculated to provide 1 − β = 90% power to identify a disease gene at the familywise significance level α = 5%. The plus signs and open triangles are the sample sizes n_M calculated from (2.2) corresponding to disease frequencies of 0.1 and 0.5, respectively. The corresponding lines are for ñ_M based on (2.4).
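A sketch of the TDT calculation (Python; the values p_A = 0.7, p_D = 0.5, and M = 10^6 below are illustrative placeholders, and the formula is the standard McNemar-type sample size as in Rosner [12], expressed directly in terms of discordant pairs):

```python
import math
from statistics import NormalDist

def n_families_tdt(p_a, p_d, m, alpha=0.05, beta=0.10):
    """Number of families for a one-sided TDT (H0: p = 0.5 vs H1: p > 0.5)
    at the Bonferroni-adjusted level alpha/M.

    p_a: transmission probability under the alternative;
    p_d: projected proportion of informative (discordant) pairs.
    """
    z = NormalDist().inv_cdf
    z_a = z(1 - alpha / m)
    z_b = z(1 - beta)
    # Discordant pairs needed: [z_a/2 + z_b*sqrt(p_a*(1-p_a))]^2 / (p_a - 0.5)^2
    n_disc = (0.5 * z_a + z_b * math.sqrt(p_a * (1 - p_a))) ** 2 / (p_a - 0.5) ** 2
    return n_disc / p_d  # scale up to total matched pairs (families)

n = n_families_tdt(p_a=0.7, p_d=0.5, m=10**6)
```

For this hypothetical strong transmission distortion, a genomewide Bonferroni correction over a million markers still requires only on the order of five hundred families.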

Figure 3: Number of families needed versus log𝑀 (base 10). Sample size for the TDT in the example of Risch and Merikangas [16], with disease frequencies of 0.1 (plus signs) and 0.5 (open triangles).
2.2. FDR Control

For the test of multiple hypotheses, such as the analysis of many genes using microarray, the outcomes can be described in Table 2.

Table 2: Possible outcome for testing 𝑀 hypotheses.

It is likely that many genes are differentially expressed in a microarray study [7]. A natural way to control the overall number of false positives is to control the expected proportion of false positives. Benjamini and Hochberg [8] defined the false discovery rate (FDR), in the notation of Table 2, as

FDR = E(V/R | R > 0) P(R > 0),  with FDR = 0 for R = 0.  (2.6)

Storey [9] defined the positive FDR (pFDR) as pFDR = FDR / P(R > 0). When M is large, as assumed in what follows, P(R > 0) ≈ 1 unless the power 1 − β is too small, so FDR ≈ pFDR.

The required sample size for multiple testing depends on α, 1 − β, M, and the effect size Δ of each individual gene. For ease of exposition, we assume an equal effect size Δ for all differentially expressed genes, say m1 genes; thus, the power 1 − β of detecting any individual differentially expressed gene is the same for all of the m1 genes between samples of two conditions of sizes n1 and n2. The expected outcomes in multiple testing can be expressed as functions of α, β, m0, and m1 and are summarized in Table 3.

Table 3: Expected outcome for testing 𝑀 hypotheses.

By the law of large numbers, from Table 3, FDR ≈ m0 α / (m0 α + m1 (1 − β)). Denote the desired FDR level by f. Then, from the above equation, we have

α_fdr = [f / (1 − f)] · [m1 / (M − m1)] · (1 − β).  (2.7)

To account for the dependence among tests, we follow Shao and Tseng [17]. Let T_i be the test statistic of a one-sided two-sample z-test for the i-th alternative hypothesis, let p_i be its P value, and let u_i = I(p_i < α) be the rejection status at level α; u_i = 1 if the i-th test results in a rejection and 0 otherwise. Furthermore, if we denote the pairwise correlation coefficient between two test statistics by ρ_Uij = Corr(T_i, T_j), then it can be shown that the correlation between u_i and u_j, θ_Uij = Corr(u_i, u_j), can be derived from the correlations of the test statistics as follows:

θ_Uij = [F(z̃_α, z̃_α; ρ_Uij) − (1 − β)²] / [β (1 − β)],  (2.8)

where F is the CDF of the standard bivariate normal distribution and z̃_α = −z_α + Δ/√(1/n1 + 1/n2) [18]. Under local dependence assumptions, the total number of true discoveries, U = Σ_{i=1}^{m1} u_i, has an approximately normal distribution: U ~ N(m1 (1 − β), σ_U²), where σ_U² = m1 β (1 − β)[1 + θ̄_U (m1 − 1)] and θ̄_U = [m1 (m1 − 1)]⁻¹ Σ_{i≠j} θ_Uij is the average correlation among the true discoveries. The local dependence assumption can be viewed as a simplified formulation of the conditions for the central limit theorem under "strong mixing" given in Theorem 27.4 of Billingsley [19]. "Mixing" means, roughly, that random variables far apart from one another are nearly independent. We think the local dependence assumption is reasonable in many genetic studies. For example, linkage disequilibrium can result in local dependence of genetic markers. In biomarker studies, biomarkers in the same pathway are often correlated, resulting in local dependence.
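The comparison-wise level in (2.7) is a one-liner; the sketch below (Python; names are ours) also verifies that plugging α_fdr back into the FDR expression recovers the target f:

```python
def alpha_fdr(f, m, m1, beta):
    """Comparison-wise alpha from (2.7), which solves
    f = m0*alpha / (m0*alpha + m1*(1 - beta)) with m0 = M - m1."""
    return (f / (1 - f)) * (m1 / (m - m1)) * (1 - beta)

# Example: target FDR f = 5%, M = 1000 tests, m1 = 100 true alternatives,
# comparison-wise power 1 - beta = 0.9:
a = alpha_fdr(0.05, 1000, 100, 0.1)
m0 = 1000 - 100
fdr = m0 * a / (m0 * a + 100 * 0.9)  # expected V over expected R, as in Table 3
```

Here `fdr` equals the target 0.05 exactly, confirming the algebra behind (2.7).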

It is often desirable to find the sample size that ensures a familywise power Ψ of identifying at least a given fraction r ∈ (0, 1) of the m1 true discoveries: Ψ = P(U ≥ [m1 r]). The above normal approximation of U allows a closed-form solution for the comparison-wise β:

β_fdr = 1 − r − [1 − 2r + √(4 m1* r (1 − r) + 1)] / (2 m1* + 2),  (2.9)

where m1* = m1 / {[1 + θ̄_U (m1 − 1)] z²_{1−Ψ}}. When m1 is large, to have a familywise power Ψ of detecting at least 100r% of the m1 true alternatives with an FDR level f, the sample size needed for a one-sided z-test is given by (2.1), with α and β determined by (2.7) and (2.9) iteratively.
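Putting the pieces together (Python; a sketch under our own naming, with θ̄_U supplied as a fixed constant rather than computed from (2.8) — with θ̄_U fixed, (2.9) does not involve α, so a single pass suffices; the paper's iteration handles the case where θ̄_U itself depends on the design): solve (2.9) for the comparison-wise β, then (2.7) for α, and plug both into (2.1) with C = 1.

```python
import math
from statistics import NormalDist

def design_for_fdr(f, m, m1, delta, r=0.9, psi=0.9, theta=0.0):
    """Per-group sample size for a one-sided z-test (C = 1) controlling FDR
    at level f with familywise power psi of detecting at least a fraction r
    of the m1 true alternatives; theta is the average correlation theta_U."""
    z = NormalDist().inv_cdf
    z_psi = z(psi)  # z_{1-Psi}, since Phi(z_t) = 1 - t
    m1_star = m1 / ((1 + theta * (m1 - 1)) * z_psi ** 2)
    # Comparison-wise beta from (2.9):
    beta = 1 - r - (1 - 2 * r + math.sqrt(4 * m1_star * r * (1 - r) + 1)) \
        / (2 * m1_star + 2)
    # Comparison-wise alpha from (2.7):
    alpha = (f / (1 - f)) * (m1 / (m - m1)) * (1 - beta)
    # Sample size per group from (2.1) with C = 1:
    n = (z(1 - alpha) + z(1 - beta)) ** 2 / delta ** 2
    return n, alpha, beta

# Settings in the spirit of the microarray example below: M = 7129 genes,
# m1 = 40 true alternatives, theta = 0.07, f = 5%, psi = 90%, delta = 1:
n, alpha, beta = design_for_fdr(f=0.05, m=7129, m1=40, delta=1.0, theta=0.07)
```

Note how mild the resulting comparison-wise level is relative to a Bonferroni correction at 0.05/7129: controlling FDR rather than FWER buys a much larger working α and hence a smaller n.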

A Microarray Example
We now consider a well-known leukemia dataset from Golub et al. [2] to demonstrate the relationship between sample size and the number of comparisons when controlling FDR. The original purpose of the experiment described in Golub et al. [2] was to identify susceptibility genes related to clinical heterogeneity in two subclasses of leukemia: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). The dataset contains 7129 genes from 47 patients with ALL and 25 patients with AML. We can apply (2.1), (2.7), and (2.9) iteratively to obtain the required sample size when controlling FDR. Figure 4 shows three different settings for controlling FDR at f = 5% with Ψ = 90%. Based on the top 100 most differentially expressed genes in Golub et al. [2], θ̄_U = 0.07 (see (2.9)). The open circles represent the sample sizes n_M needed when the number of true alternatives m1 stays constant (m1 = 40); in this case, the sample size is a linear function of log M as M increases. The plus signs denote the sample sizes n_M when the number of true alternatives increases at a slower pace than M (m1 = 2 log M); the sample size is again approximately a linear function of log M. The triangles denote the sample sizes n_M when the proportion of true alternatives is constant (m1/M = 10%); the sample sizes remain roughly constant as the number of tests increases, which is expected from (2.7). The lines in Figure 4 represent sample sizes ñ_M based on (2.4).

Figure 4: Sample size versus log M (base 10) for controlling FDR at f = 5% with Ψ = 90%. The open circles represent the sample sizes needed when the number of true alternatives m1 stays constant (m1 = 40), the plus signs give the sample sizes when m1 = 2 log M, and the triangles are the sample sizes when the proportion of true alternatives is constant (m1 = M/10).

3. Discussion

In this short paper, we have shown that a large increase in the number of comparisons often requires only a small increase in the sample size. We further demonstrated that, when controlling FDR, the sample size may sometimes even stay constant as the number of comparisons increases (Figure 4). The sample size required for testing M hypotheses generally grows no faster than a linear function of log M, even when a simple Bonferroni adjustment is used, and the slope of the linear growth (in log M) is small when detecting a large effect size. These results have important implications in practice because of the wide use of multiple comparisons.

In this paper, we discussed sample size formulas based on a fixed effect size under the alternative hypotheses. In reality, the effect sizes may follow a distribution, and simulation methods may be useful for determining the sample size. We used the z-test to derive the sample size formulas because a large sample size is usually required for studies with multiple comparisons. If the effect size is large and the sample size is small, the t-test may be more appropriate. However, we expect the relationship between the sample size and the logarithm of the number of comparisons to remain approximately linear.

In practice, if feasible, using a conservative sample size can reduce the chance of obtaining false-positive results and ensure reproducibility [6]. The simple sample size formulas provided in this paper might be used to select a suitable sample size by varying other design parameters and by taking into consideration the reliability of the proposed designs. While FDR is very useful and is increasingly used in multiple comparisons, our experience in helping biomedical investigators and the analysis in this paper indicate that the simple Bonferroni approach can often provide conservative but useful sample sizes in many situations.


  1. R. J. Klein, C. Zeiss, E. Y. Chew et al., “Complement factor H polymorphism in age-related macular degeneration,” Science, vol. 308, no. 5720, pp. 385–389, 2005.
  2. T. R. Golub, D. K. Slonim, P. Tamayo et al., “Molecular classification of cancer: class discovery and class prediction by gene expression monitoring,” Science, vol. 286, no. 5439, pp. 531–537, 1999.
  3. P. H. Westfall and S. S. Young, Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment, John Wiley & Sons, New York, NY, USA, 1993.
  4. J. C. Hsu, Multiple Comparisons: Theory and Methods, Chapman & Hall, London, UK, 1996.
  5. P. H. Westfall and R. D. Wolfinger, “Multiple tests with discrete distributions,” American Statistician, vol. 51, no. 1, pp. 3–8, 1997.
  6. N. J. Risch, “Searching for genetic determinants in the new millennium,” Nature, vol. 405, no. 6788, pp. 847–856, 2000.
  7. J. D. Storey and R. Tibshirani, “Statistical significance for genomewide studies,” Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 16, pp. 9440–9445, 2003.
  8. Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: a practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society, Series B, vol. 57, no. 1, pp. 289–300, 1995.
  9. J. D. Storey, “A direct approach to false discovery rates,” Journal of the Royal Statistical Society, Series B, vol. 64, no. 3, pp. 479–498, 2002.
  10. Q. Yang, J. Cui, I. Chazaro, L. A. Cupples, and S. Demissie, “Power and type I error rate of false discovery rate approaches in genome-wide association studies,” BMC Genetics, vol. 6, supplement 1, article S134, 2005.
  11. J. S. Witte, R. C. Elston, and L. R. Cardon, “On the relative sample size required for multiple comparisons,” Statistics in Medicine, vol. 19, no. 3, pp. 369–372, 2000.
  12. B. Rosner, Fundamentals of Biostatistics, Duxbury, Los Angeles, Calif, USA, 2006.
  13. Y. Ge, S. Dudoit, and T. P. Speed, “Resampling-based multiple testing for microarray data analysis,” Test, vol. 12, no. 1, pp. 1–77, 2003.
  14. Y. Huang, H. Xu, V. Calian, and J. C. Hsu, “To permute or not to permute,” Bioinformatics, vol. 22, no. 18, pp. 2244–2248, 2006.
  15. R. S. Spielman, R. E. McGinnis, and W. J. Ewens, “Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM),” American Journal of Human Genetics, vol. 52, no. 3, pp. 506–516, 1993.
  16. N. Risch and K. Merikangas, “The future of genetic studies of complex human diseases,” Science, vol. 273, no. 5281, pp. 1516–1517, 1996.
  17. Y. Shao and C. H. Tseng, “Sample size calculation with dependence adjustment for FDR-control in microarray studies,” Statistics in Medicine, vol. 26, no. 23, pp. 4219–4237, 2007.
  18. H. Ahn and J. J. Chen, “Generation of over-dispersed and under-dispersed binomial variates,” Journal of Computational and Graphical Statistics, vol. 4, no. 1, pp. 55–64, 1995.
  19. P. Billingsley, Probability and Measure, John Wiley & Sons, New York, NY, USA, 1995.