Research Article | Open Access
Conditional and Unconditional Tests (and Sample Size) Based on Multiple Comparisons for Stratified 2 × 2 Tables
The Mantel-Haenszel test is the asymptotic test most frequently used for analyzing stratified 2 × 2 tables. Its exact alternative is the test of Birch, which has recently been reconsidered by Jung. Both tests have a conditional origin: Pearson’s chi-squared test and Fisher’s exact test, respectively. But both tests share the drawback that the result of the global test (the stratified test) may not be compatible with the result of the individual tests (the test for each stratum). In this paper, we propose to carry out the global test using a multiple comparisons method (MC method) which does not have this disadvantage. By refining the method (MCB method) an alternative to the Mantel-Haenszel and Birch tests may be obtained. The new MC and MCB methods have the advantage that they may be applied from an unconditional point of view, a methodology which until now has not been applied to this problem. We also propose some sample size calculation methods.
In statistics it is very common to test whether an association exists between two dichotomous qualities. This is especially frequent in medicine, for example, where the aim is to assess whether the presence or absence of a risk factor conditions the presence or absence of a disease, or to compare two treatments whose responses are success or failure, and so forth. In all these cases the problem produces data whose frequencies are presented in a 2 × 2 table: the two levels of one of the qualities are set out in the rows, the two levels of the other quality in the columns, and the observed frequencies are set out inside the table.
The exact and the asymptotic analyses of a 2 × 2 table have their roots in the origins of statistics, and hundreds of papers have been devoted to the problem. It is traditional to carry out the exact independence test using the Fisher exact test, which is a conditional test (because it assumes that the marginals of the rows and columns are previously fixed). More than thirty years have passed since the situation changed, and it is well known that the unconditional exact test tends to be less conservative and more powerful than the conditional test [2–4], because the loss of information as a result of conditioning may be as high as 26%. The unconditional tests treat as fixed only the values that really were fixed in advance: the marginal of the rows, the marginal of the columns, or the total data in the table. This gives rise to two types of unconditional test: that of the double binomial model (the first two cases) and that of the multinomial model (the third case). The same can be said of the asymptotic tests, generally based on Pearson’s chi-squared statistic with different corrections for continuity (cc). However, the unconditional exact tests have the great disadvantage of being very laborious to compute. An overall view of the problem can be seen in Martín Andrés [1, 6].
Frequently the individuals who take part in the study are stratified in groups based on a covariate such as sex or age, which gives rise to several 2 × 2 tables. In this case the aim is to test the independence of the two original dichotomous qualities, bearing in mind the heterogeneity of the populations defined by the strata. To this end, the most frequent approach is to test the Mantel-Haenszel null hypothesis, under which the odds ratio (or the risk ratio) is equal to unity in all the strata. For this purpose the most frequent asymptotic tests are those of Cochran and of Mantel and Haenszel, which are very similar; the exact version of the test is due to Birch (and has recently been reconsidered by Jung). In all these cases the proposed tests are conditional and, when there is only one stratum, they reduce to the test for a single table (Fisher’s exact test or Pearson’s chi-squared test). Moreover, Jung and Jung et al. propose a sample size calculation method, asymptotic in the first and exact in the second.
The procedures indicated share the drawback of almost all tests for a global null hypothesis such as the one in question: the result of the global (stratified) test may not be compatible with that of the individual tests (the test for each stratum). In this paper, we propose a global test (MC test) which does not have this disadvantage because it is based on a multiple comparisons method: the global test is significant if and only if at least one of the individual tests is significant. In return the MC test will have the drawback of being less powerful, given that it must control both the alpha error of the global test and the alpha errors in the individual tests. Because of this, another procedure is proposed (MCB test) which only controls the alpha error of the global test (just as in the classic stratified tests), although the alpha error in the individual tests will only exceed the nominal value on a few occasions (and generally by very little). The two procedures are applicable from both the conditional and the unconditional point of view and also when carrying out an asymptotic test or an exact test. The advantage of applying them in the form of an unconditional test is that in this way the loss of power mentioned above is reduced with regard to the classic global tests. In addition this paper shows that the asymptotic tests perform well, even for small samples, if they are carried out with the appropriate continuity correction. And finally, the sample size for almost all the cases studied (exact or asymptotic tests, conditional or unconditional tests) is determined.
2. Hypothesis Test
2.1. Notation, Models, and Example
In the following (without loss of generality) it will be assumed that each table refers to the successes or failures in two treatments which are applied to and individuals, respectively. Let be the number of strata, () the total of individuals in the stratum , the total sample size, and the number of successes and the number of failures with the treatments 1 and 2, respectively, and and the total number of successes and failures in the stratum respectively. These data may be summarized as shown in Table 1. Once the experiment has been performed, the values obtained will be written with an extra subindex “0,” that is, .
Let and ( and ) be the probabilities of success (failure) with treatments 1 and 2 in the stratum , respectively. The odds ratio for each stratum is , and the aim is to contrast the null hypothesis : against an alternative hypothesis with one tail (: for some ) or with two tails (K: for some j). This paper addresses only the case of one-sided test; for the two-tail test the procedure is similar.
In the previous description it was assumed that the data () of each stratum j proceed from a double binomial distribution of sizes and and probabilities and in groups 1 and 2, respectively. Because in each stratum there are two previously fixed values ( and ) the model will be referred to as Model 2; the model is very frequently used in practice so that it will serve here as a basis for defining and illustrating the procedures MC and MCB. If in each stratum there is conditioning in the observed value , then one has Model 3; now the three values , , and are previously fixed in each stratum and the only variable arises from a hypergeometric distribution. If only the values of are fixed in each stratum , one will get Model 1: proceeding from a multinomial distribution. Finally, if only the global sample size is fixed (so that now even the values for are obtained at random), one will have Model 0. With conditioning in the appropriate marginal, the model leads to the model (). Therefore, whatever the initial model (i.e., whatever the sampling method for the data obtained), by conditioning in all the nonfixed marginals one always obtains Model 3 (which is the one covered by Birch and Mantel and Haenszel).
Each model produces a different sample space, which is formed by the set of all possible values of the set of variables involved in the same. For example, the sample space of stratum under Model 2 consists of possible values of . Each transition from a Model to Model constitutes a loss of information, because the number of points of the new sample space is very much smaller than that of the previous one. Probably the most dramatic transition is that of Models 2 to 3, a transition in which the loss of information may reach 26% for . In addition, each transition implies using a conditional rather than an unconditional method of eliminating nuisance parameters, something which is generally never advisable .
The data in Table 2, which are given by Li et al. , are taken from preliminary analysis of an experiment of three groups to evaluate whether thymosin (treatment 1), compared to a placebo (treatment 2), has any effect on the treatment of bronchogenic carcinoma patients receiving radiotherapy. The one-sided values are by global conditional stratified exact test and , , and by Fisher’s individual conditional exact test in each stratum. If the global test is carried out to an error we conclude , so that now at least once. However no individual test has significance if these are carried out to an alpha error that respects the former global error; for example, by using Bonferroni’s method, the smaller of the three values . The same thing occurs if asymptotic tests are used. Our aim is to define procedures in which these incompatibilities will not occur.
2.2. Conditional Tests Obtained by Using Classic Methods (Model 3)
The value of exact test is . Table 3 shows this value and the remaining values in this paper. This result is based on determining the probability of all the configurations , , such as . Here is a test statistic determining the order in which the points of the sample space () enter the region , a region whose probability under yields the value of . Note that as the sample spaces in each stratum are , , and , the possible values of will be , which is the total number of points in the global sample space; of these, four belong to (three with and one with ), so that . Moreover note that, under the original Model 2, the number of points in the sample space of strata 1, 2, and 3 are , , and , respectively. The total points for the global sample space will be : more than two million, compared to only 24 in Model 3. To determine the value have developed various programs (see references in ); an easy way to get it is through http://www.openepi.com/Menu/OE_Menu.htm (option “Two by Two Table”).
Note: MH = Mantel-Haenszel test; MC = multiple comparisons method; MCB = method based on the multiple comparisons.
The asymptotic test of Mantel-Haenszel based on is asymptotically normal with mean and variance . Therefore the contrast statistic is , whose value patently does not agree with . However, because the variable is discrete, it is convenient to carry out a continuity correction. As S jumps one unit at a time, the cc should be 0.5 and so the statistic with cc will be . The new value is itself already compatible with the exact value.
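The effect of the 0.5 correction can be illustrated with a minimal sketch of the one-sided Mantel-Haenszel statistic. The code uses the usual hypergeometric mean and variance of the number of successes with treatment 1 in each stratum; the function names, the `(x1, n1, x2, n2)` table layout, and the numbers in the usage note are illustrative assumptions, not the data of Table 2.

```python
import math

def mantel_haenszel_z(tables, cc=0.5):
    """One-sided Mantel-Haenszel z statistic for stratified 2x2 tables.

    tables: list of (x1, n1, x2, n2), the successes and group sizes for
    treatments 1 and 2 in each stratum (hypothetical layout).
    cc: continuity correction subtracted from S - E (0.5 because S moves
    in unit steps; use 0.0 for the uncorrected statistic).
    """
    s = e = v = 0.0
    for x1, n1, x2, n2 in tables:
        n = n1 + n2              # stratum size (must exceed 1)
        a = x1 + x2              # total successes in the stratum
        s += x1                  # observed successes with treatment 1
        e += n1 * a / n          # hypergeometric mean of x1
        v += n1 * n2 * a * (n - a) / (n * n * (n - 1))  # hypergeometric variance
    return (s - e - cc) / math.sqrt(v)

def upper_tail_normal(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

For two made-up strata `[(8, 10, 4, 10), (7, 10, 5, 10)]`, calling `mantel_haenszel_z(tables, cc=0.0)` and `mantel_haenszel_z(tables)` shows that the corrected statistic is the smaller of the two, so its one-sided p value is the larger (more conservative) one.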
2.3. MC and MCB Tests Based on the Criterion of the Multiple Comparisons: General Observations
Let us suppose that in each stratum the hypotheses : versus : are contrasted to error . Thereby and . If the global null hypothesis is rejected when there exists at least one in which the individual test rejects , then the alpha error of the global test ( versus ) will be given by expression (1). In particular, if , method MC is obtained (the “method of the multiple comparisons”), and its global alpha error will be given by expression (2). Method MC guarantees the compatibility of the results of the global test and of the individual tests, because the global test is significant if and only if at least one of the individual tests is so. When , the global test is the same as the individual test.
On the basis of the above, in general the test can be defined as follows. In each stratum an order statistic will have been defined which allows the value for each one of its points to be determined. If the points from all strata are mixed, they are ordered from the lowest value of their value to the highest and will be introduced one by one into the global critical region until a given condition (stopping rule) has been verified; then , with the critical region formed by the points in the stratum which belong to . Let be the largest of the values of the points in . The real global alpha error of the test constructed thus will be given by expression (1).
When the stopping rule is “stop introducing points into when the maximum of the is as close as possible to (but less than or equal to ),” with given by expression (3), then method MC is obtained, and this method simultaneously controls the global error and the individual error . Now, the critical region of each stratum consists of all the points whose value is smaller than or equal to , , and the real global error will be .
Obtaining the value of some observed data is a simpler process. Let be the -value of the individual test in stratum . The first individual alpha error for which is concluded will be , so that, by expression (2), the value of the global test will be given by expression (4).
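The relations behind method MC can be sketched in a few lines. The sketch assumes independent strata and the Šidák-type relation between a common individual error and the global error (global error = 1 − (1 − individual error)^K for K strata), which is our reading of the expressions lost from this text; the function names are hypothetical.

```python
def sidak_individual_level(alpha, k):
    """Common individual error that yields global error alpha over k
    independent strata: alpha' = 1 - (1 - alpha)**(1/k)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / k)

def mc_p_value(p_values):
    """Global MC p value from the individual p values of the strata:
    1 - (1 - min p)**k, with k the number of strata."""
    k = len(p_values)
    return 1.0 - (1.0 - min(p_values)) ** k

def mc_is_significant(p_values, alpha):
    """The global MC test rejects if and only if at least one individual
    test rejects at the common individual level."""
    level = sidak_individual_level(alpha, len(p_values))
    return any(p <= level for p in p_values)
```

With a single stratum the global p value equals the individual one, and for any number of strata `mc_p_value(p) <= alpha` holds exactly when `mc_is_significant(p, alpha)` is true, which is the compatibility property that motivates the method.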
When the stopping rule is “stop introducing points into when is the closest possible to (but smaller than or equal to ),” method MCB is obtained (the method “based on the multiple comparisons”). Because now only the global error is controlled, its goal is similar to that of Jung’s method . The method MCB causes that , , and the real global error is . Note that , since , something to be expected given that method MC controls two errors and the MCB method controls only one of these.
Let us see how we can obtain the value of some observed data in which for example. The region which yields the first significance of the global test is obtained when the observed point in stratum 1 is the last introduced into , that is, when ; in the other strata it should be , but as close as possible to . Thus the value will be . It can now be seen that where are the values of the MC test when this is carried out to the error . Therefore and, for the purposes of calculating the value , the values and the regions will be written simply as and , respectively. Thus, if is the largest value in stratum which is smaller than or equal to , the value is given by expression (5).
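The MCB calculation just described can be sketched for Model 3 with Fisher's exact test as the individual test. The sketch assumes that the MCB p value is one minus the product, over strata, of one minus the largest attainable individual p value not exceeding the minimum observed p value (taken as 0 when no point qualifies). The function names and the `(x1, n1, x2, n2)` layout are hypothetical.

```python
from math import comb

def fisher_upper_p_values(n1, n2, a):
    """All attainable one-sided Fisher p values P(X >= x) for a 2x2 table
    with group sizes n1, n2 and a total successes (Model 3 sample space)."""
    lo, hi = max(0, a - n2), min(n1, a)
    total = comb(n1 + n2, a)
    tail, out = 0.0, {}
    for x in range(hi, lo - 1, -1):      # accumulate from the largest x down
        tail += comb(n1, x) * comb(n2, a - x) / total
        out[x] = tail
    return out

def mcb_p_value(tables):
    """tables: list of observed (x1, n1, x2, n2) per stratum.
    Returns (minimum individual p value, MCB global p value)."""
    spaces = [fisher_upper_p_values(n1, n2, x1 + x2) for x1, n1, x2, n2 in tables]
    p_obs = [space[t[0]] for space, t in zip(spaces, tables)]
    p_min = min(p_obs)
    prod = 1.0
    for space in spaces:
        # largest attainable p value that does not exceed p_min (0 if none)
        p_star = max((p for p in space.values() if p <= p_min + 1e-12), default=0.0)
        prod *= 1.0 - p_star
    return p_min, 1.0 - prod
```

Because each per-stratum term is at most the minimum observed p value, the result never exceeds the MC p value computed from the same data, and it never falls below the minimum individual p value.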
Methods MC and MCB may be applied with exact methods or with asymptotic methods and to any of the three models, as illustrated in the following sections.
2.4. MC and MCB Tests under Model 3
The p values of the Fisher exact test in each stratum are , , and . So, and by expression (4). In order to apply method MCB the critical regions ( and 2) must be determined to the objective error . For , with ; thus and . The same occurs for (). By expression (5), (smaller than ). Generally speaking, the critical region of Birch and Jung has the form , while that of method MCB has the form , with . It can be proved that this generally implies that the Birch method will yield a p value smaller than or equal to that of method MCB when the p values are similar or when the observed values are the highest possible.
Let us now apply an asymptotic test. In general, whatever the model is, the appropriate statistic is the chi-squared statistic of expression (6). The appropriate value for the continuity correction depends on the assumed model, and that value is what causes the results of the three models to differ. When , Pearson’s classic chi-squared statistic is obtained. In the present case of Model 3, by making , the classic statistic (or the Yates chi-squared statistic) is obtained. Its maximum value is reached in stratum 3 (), which yields the p values and . In order to apply method MCB, one must obtain in the other two strata the first value of which is larger than or equal to . As there is none, , = 0.15132 and . Note that the asymptotic p values are similar to the exact ones, both with method MC and with method MCB. Despite the small size of the samples, the asymptotic methods perform well (something which also occurs with the rest of the methods, as will be seen).
2.5. MC and MCB Tests under Model 2
The data in the example actually come from Model 2. In determining the p value of an observed table of Model 2 () the same steps are followed as in Model 3 (except the last, which is special): (1) define an order statistic , which does not need to be the same one in each stratum; (2) determine the set of points ; (3) calculate the probability of under : given by ; and (4) determine the p value as , where is the nuisance parameter that is eliminated by maximization (the most complicated step). Note that is the marginal probability of columns under . In the case of Model 3 there is only one possible order statistic , because the convexity of the region must be verified and the points ordered “from the largest to the smallest value of .” In the case of Model 2 there are many possible test statistics. One of these is the order of Boschloo: order the points from the smallest to the largest value of the one-tailed p value obtained using the Fisher exact test. It is already known that the unconditional test based on the order is uniformly more powerful (UMP) than Fisher’s own exact test. Although no unconditional order is UMP compared to the rest, the generally most powerful order is the complex statistic of Barnard.
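Steps (1) to (4) can be sketched for a single stratum using Boschloo's order. This is a minimal illustration, not the SMP.EXE program: the maximization over the nuisance parameter is approximated on an equally spaced grid (an assumption of this sketch, so the result can slightly underestimate the true maximum), and the function names are hypothetical.

```python
from math import comb

def fisher_p(x1, n1, x2, n2):
    """One-sided Fisher exact p value P(X1 >= x1 | x1 + x2 fixed)."""
    a = x1 + x2
    hi = min(n1, a)
    num = sum(comb(n1, x) * comb(n2, a - x) for x in range(x1, hi + 1))
    return num / comb(n1 + n2, a)

def boschloo_p_value(x1, n1, x2, n2, grid=1000):
    """Unconditional p value under the double binomial model (Model 2).
    Returns (Fisher p value, approximate unconditional p value)."""
    # Step 1: order statistic = Fisher's one-sided p value of each point.
    p_obs = fisher_p(x1, n1, x2, n2)
    # Step 2: region = points at least as extreme as the observed one.
    region = [(u, v) for u in range(n1 + 1) for v in range(n2 + 1)
              if fisher_p(u, n1, v, n2) <= p_obs + 1e-12]
    # Steps 3-4: probability of the region under the null hypothesis,
    # maximized over the common success probability pi (grid search).
    best = 0.0
    for i in range(1, grid):
        pi = i / grid
        prob = sum(comb(n1, u) * comb(n2, v)
                   * pi ** (u + v) * (1 - pi) ** (n1 + n2 - u - v)
                   for u, v in region)
        best = max(best, prob)
    return p_obs, best
```

For example, `boschloo_p_value(7, 8, 2, 8)` returns the conditional p value together with an unconditional p value that cannot exceed it, in line with the statement that this order is uniformly more powerful than Fisher's test.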
As far as we know, the only program that carries out the above calculations for the statistic is SMP.EXE, which may be obtained free of charge at http://www.ugr.es/local/bioest/software.htm. The program also gives the solution for other simpler test statistics. Using this program, because the minimum p value is then . In order to obtain one has to proceed as in the previous section, although the process is now somewhat more difficult. In stratum 1, the table is the one that gives the largest p value , still smaller than or equal to . In stratum 2 the results are and . So, , a value which is similar to that of (the results are alike if other order statistics of the program SMP.EXE are used). It can be seen that the use of the unconditional method allows the inherent conservatism in the definitions of methods MC and MCB to be reduced.
In order to carry out the asymptotic test we shall use the optimal version of expression (6) for Model 2: is the value of expression (6) when (or 2) if (or ). Now the maximum value is , whereby and (a value that is very near the 0.1602 of the exact method). Proceeding as above, the first values of ( or 2) which are larger than or equal to are for and for . This makes , , and (which is also very close to the 0.1533 of the exact method).
2.6. MC and MCB Tests under Models 1 and 0
Let us suppose now that the data in the example in Table 2 come from Model 1. Determining the p value of an observed table proceeds as in Model 2, but now the calculations are more complicated because the nuisance parameters must be eliminated (the marginal probabilities of rows and columns under ). Again there are many possible test statistics [1, 21], although none of them is UMP compared to the others. The generally more powerful statistic is again Barnard’s statistic and, as far as we know, the only program to apply it is TMP.EXE, which may be obtained free of charge at http://www.ugr.es/local/bioest/software.htm. The program also gives the solution using other simpler test statistics. Using this program, the minimum p value is and from this (substantially smaller than ).
In order to carry out the asymptotic test we shall use the optimal version of expression (6) for Model 1: is the value of expression (6) when . The statistic is given by Pirie and Hamdan . Now the maximum value is , with the result that and .
Method MCB (which is very laborious to calculate) is omitted here, because the large number of points in the sample space will make and so . Note that stratum 1 under Model 2 consists of points, but under Model 1 it consists of 2,925 points. For similar reasons, Model 0 can be treated as if it were Model 1 (by conditioning in the real obtained values ).
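The sample-space sizes quoted in this paper can be checked with a short sketch. The counting formulas are standard: one free cell with bounded range under Model 3, two independent binomial counts under Model 2, and a four-cell multinomial under Model 1. The function names are hypothetical, and the stratum size of 24 used in the usage note is inferred from the count of 2,925, not read from Table 2.

```python
from math import comb, prod

def points_model3(n1, n2, a):
    """Model 3 (all marginals fixed): x1 ranges over a hypergeometric support."""
    return min(n1, a) - max(0, a - n2) + 1

def points_model2(n1, n2):
    """Model 2 (row totals fixed): x1 in 0..n1 and x2 in 0..n2."""
    return (n1 + 1) * (n2 + 1)

def points_model1(n):
    """Model 1 (only the stratum size fixed): 4-cell tables summing to n."""
    return comb(n + 3, 3)

def global_points(per_stratum_counts):
    """The global sample space is the product of the stratum sample spaces."""
    return prod(per_stratum_counts)
```

A stratum of 24 individuals gives `points_model1(24) == 2925`, matching the figure quoted for stratum 1 under Model 1, and `global_points` applied to the per-stratum counts reproduces the kind of product that separates the roughly two million points of Model 2 from the 24 points of Model 3.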
3. Sample Size under Model 2
3.1. Example and Conditional Solutions Obtained by Classic Methods
Jung proposes a sample size calculation for his stratified exact test. For the example described in Section 2.1, he accepts Model 2 and sets out a case study with and . The aim is to determine the value of for the alternative hypothesis , a type I error of and a power of . Jung also assumes that , so that under the alternative hypothesis . His solution is . From what can be deduced from other parts of his paper, the detailed solution is , , , and . These values are included in Table 4 (together with the most relevant ones obtained in what follows). This sample size provides a real error of and a real power of .
Note: : of Mantel-Haenszel; MC = multiple comparisons method; of Model 2.
Let us suppose that generally , with known values, and that the aim is to determine the values which guarantee the desired power, which implies using Model 2. The reasoning that follows is the same as that with which Casagrande et al. and Fleiss et al. obtained the classic formula for sample size in the comparison of two independent proportions. The solutions without cc that follow are a special case of those of Jung et al. The test in Section 2.2 is based on the statistic , where . Because is distributed asymptotically as a normal distribution with the mean and the variance , will be asymptotically normal with the mean and the variance . Under , , with the result that the mean and variance of will be and of , respectively, with j. Because under the nuisance parameter is estimated by , it is usual to substitute it by its average value under , that is, by ; hence . Consequently the statistic will reach significance in the critical value which verifies , in which the number 0.5 corresponds to the cc indicated above and refers to a standard normal variable. Therefore , with the -percentile of the standard normal distribution. Under the parameters and are obtained in the values and which specify : and . Given the above, the error beta will be given by expression (7).
If the solution is restricted to the case of , by making equal to the fraction of expression (7) and by working out , one obtains the equation , where , , and ; therefore expression (8) is obtained. The solutions and are those of the tests and , respectively. Frequently ; in this case expression (8) takes the explicit form of expression (9).
For the example at the beginning of this section (in which ), if at first we restrict the solution to , expression (9) indicates that and . Assuming that in this example the values of are allowed to differ at most by 1, then the solution that is sought must be without or with cc. In the second phase, expression (7) indicates that in and is the first time that (=0.183) 0.2, so that this is the solution with cc that was being sought (). The solution without cc is obtained in the same way (, , and ), but it is too liberal.
3.2. Solution Using the Exact Method MC
For fixed values of the global error and the sample sizes (, ), the method MC described in Section 2.3 allows one to obtain the critical region and the real type I error . Moreover, let be the error beta for each individual test, with equal to the probability of the region under . Because of the way method MC was defined, the real global error beta will be given by expression (10). If , these values guarantee the desired power. If , it is necessary to increase some values of and/or and to repeat the previous procedure.
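The power calculation for method MC can be sketched as follows. The sketch assumes independent strata, so that the global test fails to reject only when every individual test fails and the global error beta is the product of the individual ones (our reading of the lost expression). It also assumes Fisher's conditional test at the common Šidák-type level as the individual test, evaluated under the double binomial Model 2. The function names and the design layout `(n1, n2, p1, p2)` are hypothetical.

```python
from math import comb

def fisher_p(x1, n1, x2, n2):
    """One-sided Fisher exact p value P(X1 >= x1 | x1 + x2 fixed)."""
    a = x1 + x2
    hi = min(n1, a)
    num = sum(comb(n1, x) * comb(n2, a - x) for x in range(x1, hi + 1))
    return num / comb(n1 + n2, a)

def binom_pmf(x, n, p):
    """Binomial probability of x successes in n trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def mc_power(designs, alpha):
    """designs: list of (n1, n2, p1, p2) per stratum, with p1 and p2 the
    success probabilities assumed under the alternative hypothesis.
    Returns the global power 1 - prod(beta_j) of the MC test."""
    k = len(designs)
    alpha_ind = 1.0 - (1.0 - alpha) ** (1.0 / k)   # common individual level
    beta_global = 1.0
    for n1, n2, p1, p2 in designs:
        # probability that the stratum falls in its critical region
        power_j = sum(binom_pmf(u, n1, p1) * binom_pmf(v, n2, p2)
                      for u in range(n1 + 1) for v in range(n2 + 1)
                      if fisher_p(u, n1, v, n2) <= alpha_ind)
        beta_global *= 1.0 - power_j
    return 1.0 - beta_global
```

A sample-size search would call `mc_power` repeatedly, increasing the group sizes until the returned power reaches the target. Setting `p1 == p2` in every stratum returns the real global alpha error at that value of the nuisance parameter, which for this choice of individual test cannot exceed the nominal level.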
Let us initially assume that . The process for determining the sample sizes may be shortened if it begins with a value like that of expression (8). With the method MC, one obtains that is not a solution because , but is a solution because . The solution can now be refined allowing values to differ by a maximum of one. The final solution is , (), , and .
Unconditional tests are more powerful when the sample sizes are slightly different , since this reduces the number of ties produced by whatever statistic is used. By planning and making the values of consecutive, the solution , , and () is obtained, with and (the solution based on is worse). Actually, stratum 1 is of virtually no interest since in it . Despite everything, if it is introduced, the configuration , , and () is correct because and .
3.3. Solution Using the Asymptotic Method MC Based on the Chi-Square Test with cc
In the following the procedure is the same as in Section 2.1, assuming for the moment that and can be any values. The numerator of may be written as , where is the cc of Model 2 ( or 1 depending on whether and are equal or different, respectively) and (the base statistic for the test) is asymptotically normal with mean and variance .
Under , and is asymptotically normal with mean 0 and variance , with . Because under the nuisance parameter is estimated by , it is usual to substitute it by its average value under , that is, by ; hence . If each individual test is realized to the error of expression (3), the critical value for will verify , in which the value corresponds to the cc indicated above; therefore .
Under , is asymptotically normal with mean and variance . Thus and, applying the first equality in expression (10), expression (11) is obtained; in particular, if , then , and expression (12) follows.
For the data in the example, and by making the solution, the solution based on expression (12) is . This solution can be refined by allowing the values of to differ by a maximum of one, in which case the new solution, now based on expression (11), is , () with . If a cc is not carried out the solution is too liberal: , () with . By planning and making the values of consecutive, the solution , , and is obtained (as in the exact method), with = . This is the same result as for the configuration at the end of the previous section.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
This research was supported by the Ministerio de Economía y Competitividad, Spain, Grant no. MTM2012-35591.
- A. Martín Andrés, “Entry Fisher's exact and Barnard's tests,” in Encyclopedia of Statistical Sciences, S. Kotz, N. L. Johnson, and C. B. Read, Eds., vol. 2, pp. 250–258, Wiley-Interscience, New York, NY, USA, 1998.
- L. L. McDonald, B. M. Davis, and G. A. Milliken, “A non-randomized unconditional test for comparing two proportions in a contingency table,” Technometrics, vol. 19, no. 2, pp. 145–157, 1977.
- A. M. Andrés and A. S. Mato, “Choosing the optimal unconditioned test for comparing two independent proportions,” Computational Statistics & Data Analysis, vol. 17, no. 5, pp. 555–574, 1994.
- A. Agresti, Categorical Data Analysis, Wiley-Interscience, 3rd edition, 2013.
- Y. Zhu and N. Reid, “Information, ancillarity, and sufficiency in the presence of nuisance parameters,” The Canadian Journal of Statistics, vol. 22, no. 1, pp. 111–123, 1994.
- A. Martín Andrés, “Comments on ‘Chi-squared and Fisher-Irwin tests of two-by-two tables with small sample recommendations’,” Statistics in Medicine, vol. 27, no. 10, pp. 1791–1796, 2008.
- W. G. Cochran, “The χ² correction for continuity,” Iowa State College Journal of Science, vol. 16, pp. 421–436, 1942.
- N. Mantel and W. Haenszel, “Statistical aspects of the analysis of data from retrospective studies of disease,” Journal of the National Cancer Institute, vol. 22, no. 4, pp. 719–748, 1959.
- M. W. Birch, “The detection of partial association. I. The 2 × 2 case,” Journal of the Royal Statistical Society. Series B. Methodological, vol. 26, no. 2, pp. 313–324, 1964.
- S.-H. Jung, “Stratified Fisher's exact test and its sample size calculation,” Biometrical Journal, vol. 56, no. 1, pp. 129–140, 2014.
- S.-H. Jung, S.-C. Chow, and E. M. Chi, “A note on sample size calculation based on propensity analysis in nonrandomized trials,” Journal of Biopharmaceutical Statistics, vol. 17, no. 1, pp. 35–41, 2007.
- S. H. Li, R. M. Simon, and J. J. Gart, “Small sample properties of the Mantel-Haenszel test,” Biometrika, vol. 66, no. 1, pp. 181–183, 1979.
- A. Agresti, “Dealing with discreteness: making ‘exact’ confidence intervals for proportions, differences of proportions, and odds ratios more exact,” Statistical Methods in Medical Research, vol. 12, no. 1, pp. 3–21, 2003.
- A. Agresti, “A survey of exact inference for contingency tables,” Statistical Science, vol. 7, no. 1, pp. 131–177, 1992.
- D. R. Cox, “The continuity correction,” Biometrika, vol. 57, pp. 217–219, 1970.
- Z. Šidák, “Rectangular confidence regions for the means of multivariate normal distributions,” Journal of the American Statistical Association, vol. 62, pp. 626–633, 1967.
- L. J. Davis, “Exact tests for 2 × 2 contingency tables,” The American Statistician, vol. 40, no. 2, pp. 139–141, 1986.
- R. D. Boschloo, “Raised conditional level of significance for the 2 × 2 table when testing the equality of two probabilities,” Statistica Neerlandica, vol. 24, no. 1, pp. 1–35, 1970.
- E. S. Pearson, “The choice of statistical tests illustrated on the interpretation of data classed in a 2 × 2 table,” Biometrika, vol. 34, pp. 139–167, 1947.
- G. A. Barnard, “Significance tests for 2 × 2 tables,” Biometrika, vol. 34, pp. 123–138, 1947.
- G. Shan and G. Wilding, “Unconditional tests for association in contingency tables in the total sum fixed design,” Statistica Neerlandica, vol. 69, no. 1, pp. 67–83, 2015.
- M. Andrés and T. García, “Optimal unconditional test in 2×2 multinomial trials,” Computational Statistics and Data Analysis, vol. 31, no. 3, pp. 311–321, 1999.
- W. R. Pirie and M. A. Hamdan, “Some revised continuity corrections for discrete distributions,” Biometrics, vol. 28, no. 3, pp. 693–701, 1972.
- J. T. Casagrande, M. C. Pike, and P. G. Smith, “An improved approximate formula for calculating sample sizes for comparing two binomial distributions,” Biometrics, vol. 34, no. 3, pp. 483–486, 1978.
- J. L. Fleiss, A. Tytun, and H. K. Ury, “A simple approximation for calculating sample sizes for comparing independent proportions,” Biometrics, vol. 36, pp. 343–346, 1980.
Copyright © 2015 A. Martín Andrés et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.