Abstract

We propose a statistical method for constructing simultaneous confidence intervals on all linear combinations of means without assuming equal variances, a setting in which the classical Scheffé simultaneous confidence intervals no longer preserve the familywise error rate (FWER). The proposed method is useful when the number of comparisons on linear combinations of means is extremely large. The FWERs of the proposed simultaneous confidence intervals under various configurations of variances are assessed through simulations and are found to preserve the predefined nominal level very well. An example of pairwise comparisons on heteroscedastic means is given to illustrate the proposed method.

1. Introduction

Multiple comparisons on a large number of linear combinations of means are of general interest in many applications. If an inferential procedure depends on the number of comparisons, it becomes increasingly challenging as that number grows. In addition, we often cannot assume that all variances are equal. Many authors have proposed methods for multiple comparisons on means. Scheffé [1] proposed a method to construct simultaneous confidence intervals for all linear combinations of means while keeping the Type I error under control. Because Scheffé's method covers all possible linear combinations of means, it has a clear advantage when a large number of such comparisons is required. Three major assumptions are needed for Scheffé's simultaneous confidence intervals to be valid: (1) the samples are independent, (2) the populations are normally distributed, and (3) the populations have equal variances. The third assumption, often referred to as homoscedasticity, is the most vulnerable. Violation of homoscedasticity often inflates the familywise error rate (FWER). As pointed out by Scheffé [2], his method has a certain robustness to unequal variances when the group sample sizes are the same. However, the FWER is out of control in situations where both the variances and the sample sizes are unequal. No explicit formula has been available so far for simultaneous confidence intervals on all linear combinations of means in the case of unequal variances.

The problem of comparing two means when the population variances are unequal is known as the Behrens-Fisher problem [3]. Dunnett [4, 5] and Nel and van der Merwe [6] published simulation-based results assessing different pairwise mean comparison procedures in the unequal variance case. Kim [7] proposed a practical solution to the Behrens-Fisher problem using the geometry of confidence ellipsoids for two mean vectors. Wilcox [8] tackled the Behrens-Fisher problem via trimmed means. Christensen and Rencher [9] compared Type I error rates and power levels in the Behrens-Fisher problem. Fouladi and Yockey [10] conducted a Monte Carlo study to evaluate the performance of tests on means under normality and nonnormality. Hoover [11] discussed behavioral interventions with heterogeneous subgroup effects in clinical trials. In this paper, a method for constructing simultaneous confidence intervals on all linear combinations of means with unequal variances is proposed. Since there is no limit on the number of linear combinations of means, the proposed method may be used in situations where comparisons on a large number of linear combinations are deemed necessary. The proposed simultaneous confidence intervals, which we refer to as the generalized Scheffé confidence intervals, have an explicit form similar to that of their classical counterparts. The equal variance assumption is no longer needed. In addition, these simultaneous confidence intervals reduce to the classical Scheffé confidence intervals when all population variances and sample sizes are equal. Most importantly, the simulations reported below indicate that the proposed simultaneous confidence intervals preserve the FWER across the configurations of variances and sample sizes considered.

2. Generalized Scheffé Confidence Intervals

Suppose that we have $I$ populations and let $(\mu_i, \sigma_i^2)$ be the true mean and variance of population $i$. Let $(n_i, D_i, S_i^2)$ be the sample size, sample mean, and sample variance of the $i$th population. In the case of equal variances among the $I$ populations, that is, $\sigma_i^2 \equiv \sigma^2$, Scheffé's simultaneous confidence intervals on all linear combinations of means $\sum_{i=1}^{I} c_i\mu_i$ are given by

$$\sum_{i=1}^{I} c_i D_i \pm \sqrt{I\,F_{\alpha,I,N-I}\,\mathrm{MSE}\sum_{i=1}^{I}\frac{c_i^2}{n_i}}, \tag{2.1}$$

where the mean squared error $\mathrm{MSE} = \sum_{i=1}^{I}(n_i-1)S_i^2/(N-I)$ is the pooled estimate of the common variance from the $I$ populations; $F_{\alpha,I,N-I}$ is the upper $\alpha$th quantile of the $F$ distribution with degrees of freedom $I$ and $N-I$; and $N = \sum_{i=1}^{I} n_i$ is the total sample size. If the $I$ constants $c_1, c_2, \ldots, c_I$ satisfy $\sum_{i=1}^{I} c_i = 0$, Scheffé's simultaneous confidence intervals on all contrasts $\sum_{i=1}^{I} c_i\mu_i$ are given by

$$\sum_{i=1}^{I} c_i D_i \pm \sqrt{(I-1)\,F_{\alpha,I-1,N-I}\,\mathrm{MSE}\sum_{i=1}^{I}\frac{c_i^2}{n_i}}. \tag{2.2}$$

If pairwise comparisons are of interest, we can set one pair $(c_i, c_j)$ to $(1, -1)$ and the remaining $c_i$ to zero. This is a special case of a contrast. Note that Scheffé's intervals are useful when dealing with a large number of linear combinations of means. Once the total number of observations and the number of populations are fixed, the quantity $F_{\alpha,I,N-I}$ stays the same regardless of the number of simultaneous confidence intervals. Under the Bonferroni approach, by contrast, the confidence intervals become wider as the number of linear combinations of means increases. Suppose that we have 10 populations, each with a sample size of 10. If we have 100 simultaneous confidence intervals for linear combinations of means, the critical quantity in Scheffé's method, $\sqrt{10 \times F_{0.05,10,100-10}}$, is given as 1.9635. If we apply Bonferroni's approach, the critical value is $|t_{0.05/200,\,100-10}| = 3.6118$. This means that Scheffé's intervals may be shorter than Bonferroni's intervals. There is a break-even point: once the number of linear combinations of means exceeds it, Scheffé's intervals are shorter than Bonferroni's intervals. This alters the common perception that Scheffé's intervals are more conservative than Bonferroni's intervals.
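The break-even point mentioned above can be located numerically. The following is a minimal sketch, assuming the Scheffé multiplier has the form $\sqrt{I\,F_{\alpha,I,N-I}}$ from (2.1) and that the Bonferroni multiplier is the two-sided $t$ critical value at level $\alpha/m$ with $N-I$ degrees of freedom for $m$ intervals. The function names are illustrative and do not come from the paper.

```python
import math
from scipy import stats


def scheffe_multiplier(alpha, n_groups, n_total):
    """Critical multiplier sqrt(I * F_{alpha, I, N-I}) as in (2.1)."""
    f_crit = stats.f.ppf(1.0 - alpha, n_groups, n_total - n_groups)
    return math.sqrt(n_groups * f_crit)


def bonferroni_multiplier(alpha, n_intervals, n_groups, n_total):
    """Two-sided t critical value at level alpha / n_intervals, N - I df."""
    tail = alpha / (2.0 * n_intervals)
    return stats.t.isf(tail, n_total - n_groups)


def break_even_count(alpha, n_groups, n_total):
    """Smallest number of intervals for which Bonferroni exceeds Scheffe."""
    target = scheffe_multiplier(alpha, n_groups, n_total)
    # Bonferroni exceeds the target once alpha / (2m) < P(T > target).
    tail_at_target = stats.t.sf(target, n_total - n_groups)
    return math.floor(alpha / (2.0 * tail_at_target)) + 1


if __name__ == "__main__":
    alpha, n_groups, n_total = 0.05, 10, 100
    print("Scheffe multiplier:", scheffe_multiplier(alpha, n_groups, n_total))
    for m in (10, 100, 1000, 10000):
        print(m, bonferroni_multiplier(alpha, m, n_groups, n_total))
    print("break-even count:", break_even_count(alpha, n_groups, n_total))
```

The Scheffé multiplier does not depend on the number of intervals, while the Bonferroni multiplier grows with it, so the comparison reduces to a single tail-probability calculation.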

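We now consider the problem of constructing simultaneous intervals without assuming equal variances. Let $a_i = \sigma_i^2 / \sum_{i=1}^{I}\sigma_i^2$ and define

$$R_1 = \sum_{i=1}^{I} a_i\left(\frac{D_i-\mu_i}{\sigma_i/\sqrt{n_i}}\right)^2 = \sum_{i=1}^{I} a_i Y_i, \qquad R_2 = \sum_{i=1}^{I} \frac{a_i}{n_i-1}\cdot\frac{(n_i-1)S_i^2}{\sigma_i^2} = \sum_{i=1}^{I} \frac{a_i}{n_i-1} Z_i. \tag{2.3}$$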

Note that $Y_i \sim \chi^2_1$ and $Z_i \sim \chi^2_{n_i-1}$. Therefore, $R_1$ and $R_2$ are linear combinations of $\chi^2$ variables with $E(R_1) = E(R_2) = 1$.

Finding the exact distribution of a linear combination of $\chi^2$ variables, known as Satterthwaite's problem, is rather difficult. Satterthwaite approximated this type of variable by a $\chi^2_\nu$ random variable divided by its degrees of freedom $\nu$ (see [12]). The degrees of freedom $\nu$ are then obtained by the method of moments. As noted in Casella and Berger [12], for a variable $Y \sim \chi^2_\nu/\nu$, we have $E(Y) = 1$ and $\mathrm{Var}(Y) = 2/\nu$. Hence

$$\nu = \frac{2(EY)^2}{\mathrm{Var}(Y)} = \frac{2}{\mathrm{Var}(Y)}. \tag{2.4}$$

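We then set $R_1 \sim \chi^2_{\nu_1}/\nu_1$ and $R_2 \sim \chi^2_{\nu_2}/\nu_2$, where $\nu_1$ and $\nu_2$ are the respective degrees of freedom for $R_1$ and $R_2$. By applying the results above we can estimate $\nu_1$ and $\nu_2$. First we consider $\nu_1$, which can be found as

$$\nu_1 = \frac{2}{\sum_{i=1}^{I} a_i^2\,\mathrm{Var}(Y_i)} = \frac{1}{\sum_{i=1}^{I} a_i^2} = \frac{\left(\sum_{i=1}^{I}\sigma_i^2\right)^2}{\sum_{i=1}^{I}\sigma_i^4}. \tag{2.5}$$

A natural estimate of $\nu_1$ is given by $\hat\nu_1 = \left(\sum_{i=1}^{I} S_i^2\right)^2 / \sum_{i=1}^{I} S_i^4$. For $\nu_2$, we have

$$\nu_2 = \frac{2}{\sum_{i=1}^{I} \left(a_i^2/(n_i-1)^2\right)\mathrm{Var}(Z_i)} = \frac{1}{\sum_{i=1}^{I} a_i^2/(n_i-1)} = \frac{\left(\sum_{i=1}^{I}\sigma_i^2\right)^2}{\sum_{i=1}^{I}\sigma_i^4/(n_i-1)}. \tag{2.6}$$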

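It can be estimated by $\hat\nu_2 = \left(\sum_{i=1}^{I} S_i^2\right)^2 / \sum_{i=1}^{I} \left(S_i^4/(n_i-1)\right)$. Furthermore, note that $R_1$ is independent of $R_2$; therefore, $R = R_1/R_2$ has approximately the $F$ distribution with degrees of freedom $\nu_1$ and $\nu_2$. It turns out that $R = R_1/R_2$ has a very simple form:

$$R = \frac{R_1}{R_2} = \frac{\sum_{i=1}^{I} n_i\,(D_i-\mu_i)^2}{\sum_{i=1}^{I} S_i^2} \sim F_{\nu_1,\nu_2}. \tag{2.7}$$

Note that if the $I$ populations have equal variances, $\sigma_i^2 \equiv \sigma^2$, we have $\nu_1 = I$; additionally, if all populations have the same sample size, that is, $n_i \equiv n$, then $\nu_2 = N - I$.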
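The two estimated degrees of freedom are simple functions of the sample variances and sample sizes. The sketch below computes $\hat\nu_1$ and $\hat\nu_2$ as defined above; the function name `satterthwaite_df` is an illustrative choice, not the paper's.

```python
import numpy as np


def satterthwaite_df(sample_vars, sample_sizes):
    """Return (nu1_hat, nu2_hat) from sample variances S_i^2 and sizes n_i."""
    s2 = np.asarray(sample_vars, dtype=float)
    n = np.asarray(sample_sizes, dtype=float)
    total = s2.sum()
    nu1 = total**2 / np.sum(s2**2)               # (sum S_i^2)^2 / sum S_i^4
    nu2 = total**2 / np.sum(s2**2 / (n - 1.0))   # (sum S_i^2)^2 / sum S_i^4/(n_i-1)
    return float(nu1), float(nu2)


if __name__ == "__main__":
    # Equal variances and equal sizes: the estimates reduce to I and N - I.
    print(satterthwaite_df([2.0, 2.0, 2.0, 2.0], [10, 10, 10, 10]))
    # Unequal variances pull nu1_hat below the number of groups.
    print(satterthwaite_df([9.0, 9.0, 1.0, 1.0], [10, 10, 50, 50]))
```

With four groups of equal sample variance and size 10, the first call reproduces the classical degrees of freedom $I = 4$ and $N - I = 36$.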

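To derive the generalized Scheffé intervals we need the following projection lemma (see [13], pages 231-232). For $I$ real numbers $z_1, z_2, \ldots, z_I$, the inequality

$$\sum_{i=1}^{I} a_i y_i - r\left(\sum_{i=1}^{I} a_i^2\right)^{1/2} \le \sum_{i=1}^{I} a_i z_i \le \sum_{i=1}^{I} a_i y_i + r\left(\sum_{i=1}^{I} a_i^2\right)^{1/2} \tag{2.8}$$

holds for all $\mathbf{a} = (a_1, a_2, \ldots, a_I) \in \mathbb{R}^I$ if and only if $\sum_{i=1}^{I} (z_i - y_i)^2 \le r^2$. We then choose $z_i = \sqrt{n_i}\,\mu_i$ and let $\mathbf{z} = (z_1, z_2, \ldots, z_I)$ satisfy $\sum_{i=1}^{I} (z_i - \sqrt{n_i}\,D_i)^2 \le F_{\alpha,\hat\nu_1,\hat\nu_2}\sum_{i=1}^{I} S_i^2$, which constitutes the interior of an $I$-dimensional sphere centered at the point $(\sqrt{n_1}D_1, \sqrt{n_2}D_2, \ldots, \sqrt{n_I}D_I)$ with radius $\sqrt{F_{\alpha,\hat\nu_1,\hat\nu_2}\sum_{i=1}^{I} S_i^2}$. By applying the projection lemma to the vector $\mathbf{a} = (c_1/\sqrt{n_1}, c_2/\sqrt{n_2}, \ldots, c_I/\sqrt{n_I})$, we have, for all $(c_1, c_2, \ldots, c_I)$,

$$\begin{aligned}
\sum_{i=1}^{I}\left(\sqrt{n_i}\,D_i - \sqrt{n_i}\,\mu_i\right)^2 \le F_{\alpha,\hat\nu_1,\hat\nu_2}\sum_{i=1}^{I} S_i^2
&\iff \sum_{i=1}^{I}\frac{c_i}{\sqrt{n_i}}\sqrt{n_i}\,\mu_i \in \sum_{i=1}^{I}\frac{c_i}{\sqrt{n_i}}\sqrt{n_i}\,D_i \pm \sqrt{F_{\alpha,\hat\nu_1,\hat\nu_2}\sum_{i=1}^{I} S_i^2\sum_{i=1}^{I}\frac{c_i^2}{n_i}} \\
&\iff \sum_{i=1}^{I} c_i\mu_i \in \sum_{i=1}^{I} c_i D_i \pm \sqrt{F_{\alpha,\hat\nu_1,\hat\nu_2}\sum_{i=1}^{I} S_i^2\sum_{i=1}^{I}\frac{c_i^2}{n_i}}.
\end{aligned} \tag{2.9}$$

Choosing $F_{\alpha,\hat\nu_1,\hat\nu_2}$, the $1-\alpha$ quantile of an $F$ distribution with $\hat\nu_1$ and $\hat\nu_2$ degrees of freedom, and using the result in (2.7), we have approximately

$$P\left\{\sum_{i=1}^{I}\left(\sqrt{n_i}\,D_i - \sqrt{n_i}\,\mu_i\right)^2 \le F_{\alpha,\hat\nu_1,\hat\nu_2}\sum_{i=1}^{I} S_i^2\right\} \approx 1-\alpha. \tag{2.10}$$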
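The projection lemma can be checked numerically on a small example. The sketch below is only an illustration under arbitrary assumed values ($I = 4$, $r = 2$, random $\mathbf{y}$): it places one point inside the sphere and one outside, then tests inequality (2.8) over many random directions and over the worst-case direction $\mathbf{a} = \mathbf{z} - \mathbf{y}$.

```python
import numpy as np


def holds_for_directions(z, y, r, directions):
    """Check |a . (z - y)| <= r * ||a|| for every row a of `directions`."""
    gap = np.abs(directions @ (z - y))
    bound = r * np.linalg.norm(directions, axis=1)
    return bool(np.all(gap <= bound + 1e-12))


rng = np.random.default_rng(0)
y = rng.normal(size=4)
unit = rng.normal(size=4)
unit /= np.linalg.norm(unit)
r = 2.0
z_inside = y + 0.9 * r * unit    # distance 0.9 r from y: inside the sphere
z_outside = y + 1.1 * r * unit   # distance 1.1 r from y: outside the sphere

random_directions = rng.normal(size=(10_000, 4))
# Inside the sphere, (2.8) holds for every direction (Cauchy-Schwarz).
inside_ok = holds_for_directions(z_inside, y, r, random_directions)
# Outside the sphere, the direction a = z - y violates (2.8).
worst_direction = (z_outside - y)[None, :]
outside_ok = holds_for_directions(z_outside, y, r, worst_direction)
print(inside_ok, outside_ok)
```

The worst-case direction is what makes the condition necessary as well as sufficient: a point outside the sphere always fails the inequality along the line joining it to the center.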

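Applying the projection lemma, this probability statement can be pivoted to give the following generalized $1-\alpha$ simultaneous confidence intervals for $\sum_{i=1}^{I} c_i\mu_i$:

$$\sum_{i=1}^{I} c_i D_i \pm \sqrt{F_{\alpha,\hat\nu_1,\hat\nu_2}\sum_{i=1}^{I} S_i^2\sum_{i=1}^{I}\frac{c_i^2}{n_i}}. \tag{2.11}$$

For the population means $\mu_i$ and their pairwise differences $\mu_i - \mu_j$, the generalized Scheffé confidence intervals are

$$D_i \pm \sqrt{F_{\alpha,\hat\nu_1,\hat\nu_2}\sum_{k=1}^{I} S_k^2\cdot\frac{1}{n_i}}, \tag{2.12}$$

$$D_i - D_j \pm \sqrt{F_{\alpha,\hat\nu_1,\hat\nu_2}\sum_{k=1}^{I} S_k^2\left(\frac{1}{n_i}+\frac{1}{n_j}\right)}, \tag{2.13}$$

where $1 \le i < j \le I$. By comparing (2.1) with (2.11), it can be seen that the generalized Scheffé confidence intervals are very similar to their classical counterparts.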
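Interval (2.11) needs only the group sizes, means, and variances. The following is a minimal sketch of that formula; the function name and the summary statistics in the usage example are hypothetical and are not taken from the paper's data.

```python
import numpy as np
from scipy import stats


def generalized_scheffe_interval(c, means, sample_vars, sample_sizes, alpha=0.05):
    """Interval (2.11) for sum_i c_i * mu_i from per-group summary statistics."""
    c = np.asarray(c, dtype=float)
    d = np.asarray(means, dtype=float)
    s2 = np.asarray(sample_vars, dtype=float)
    n = np.asarray(sample_sizes, dtype=float)

    total = s2.sum()
    nu1 = total**2 / np.sum(s2**2)               # estimated numerator df
    nu2 = total**2 / np.sum(s2**2 / (n - 1.0))   # estimated denominator df
    f_crit = stats.f.ppf(1.0 - alpha, nu1, nu2)  # non-integer df are allowed

    center = float(c @ d)
    half_width = float(np.sqrt(f_crit * total * np.sum(c**2 / n)))
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical summary statistics for four groups (illustration only).
    means = [10.0, 12.0, 15.0, 11.0]
    variances = [9.0, 9.0, 1.0, 1.0]
    sizes = [10, 10, 50, 50]
    print(generalized_scheffe_interval([1, 0, 0, 0], means, variances, sizes))
    print(generalized_scheffe_interval([1, -1, 0, 0], means, variances, sizes))
```

Passing a unit vector for `c` gives (2.12), and a vector with entries $1$ and $-1$ gives (2.13), so one function covers both special cases.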

3. Assessment of Familywise Error Rate

The familywise error rate (FWER) in multiple comparisons is the probability of incorrectly rejecting at least one of the null hypotheses that make up the family. The validity of the proposed generalized Scheffé confidence intervals rests largely on controlling the FWER at a given nominal level $\alpha$.

Two major factors, the sample sizes and the population variances, affect the performance of Scheffé's confidence intervals. We will show through simulation that the FWER is inflated when the population variances are unequal.

A variety of configurations of variances and sample sizes will be selected to assess the performance of the generalized Scheffé method. To this end, the number of groups is chosen to be 𝐼=4. Without loss of generality, we use 0 for all population means, that is, (𝜇1,𝜇2,𝜇3,𝜇4)=(0,0,0,0). The specification of sample sizes and variances is given in Table 1.

Although Scheffé's intervals apply to inference on all linear combinations, for simplicity we focus on two sets of inferences only: the population means and their pairwise differences. For each configuration we conducted 5,000 simulation runs, and for each run the 95% Scheffé intervals and the generalized Scheffé intervals on both the population means and the pairwise mean differences were computed. We then obtained the coverage rates, that is, the proportions of runs in which the intervals contain the true values, which all equal 0.
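The simulation design just described can be sketched as follows for a single configuration. This is an illustrative reimplementation rather than the authors' code: it assumes normal data with all means equal to 0, takes the family to be the $I$ means plus all pairwise differences, and counts a run as an error when any interval in the family misses 0. The function names and the seed are arbitrary.

```python
import itertools
import numpy as np
from scipy import stats


def inference_family(n_groups):
    """Coefficient vectors for the I means and all pairwise differences."""
    rows = list(np.eye(n_groups))
    for i, j in itertools.combinations(range(n_groups), 2):
        c = np.zeros(n_groups)
        c[i], c[j] = 1.0, -1.0
        rows.append(c)
    return np.array(rows)


def simulate_fwer(sigmas, sizes, n_runs=5000, alpha=0.05, seed=0):
    """Empirical FWER of the classical and generalized Scheffe intervals."""
    rng = np.random.default_rng(seed)
    sigmas = np.asarray(sigmas, dtype=float)
    sizes = np.asarray(sizes, dtype=int)
    n_groups, n_total = len(sizes), int(sizes.sum())

    # Raw normal samples per group; all true means are 0.
    draws = [rng.normal(0.0, s, size=(n_runs, n)) for s, n in zip(sigmas, sizes)]
    d = np.column_stack([x.mean(axis=1) for x in draws])
    s2 = np.column_stack([x.var(axis=1, ddof=1) for x in draws])

    contrasts = inference_family(n_groups)
    scale = (contrasts**2 / sizes).sum(axis=1)   # sum_i c_i^2 / n_i
    estimates = np.abs(d @ contrasts.T)          # true values are all 0

    # Classical Scheffe half-widths, as in (2.1).
    mse = (s2 * (sizes - 1)).sum(axis=1) / (n_total - n_groups)
    f_classical = stats.f.ppf(1 - alpha, n_groups, n_total - n_groups)
    half_classical = np.sqrt(n_groups * f_classical * mse[:, None] * scale)

    # Generalized Scheffe half-widths, as in (2.11).
    total = s2.sum(axis=1)
    nu1 = total**2 / (s2**2).sum(axis=1)
    nu2 = total**2 / (s2**2 / (sizes - 1)).sum(axis=1)
    f_general = stats.f.ppf(1 - alpha, nu1, nu2)
    half_general = np.sqrt((f_general * total)[:, None] * scale)

    fwer_classical = np.mean((estimates > half_classical).any(axis=1))
    fwer_general = np.mean((estimates > half_general).any(axis=1))
    return float(fwer_classical), float(fwer_general)


if __name__ == "__main__":
    print(simulate_fwer((3, 3, 1, 1), (10, 10, 50, 50)))
    print(simulate_fwer((1, 1, 1, 1), (10, 10, 10, 10)))
```

Because results depend on the random seed and the number of runs, the output should be read as a rough check against the rates reported in Table 1, not as a reproduction of them.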

Table 1 reports the coverage rates of both methods. Note that the empirical FWER is one minus the coverage rate. In the case of equal variances, both methods give very similar coverage rates for balanced and unbalanced designs alike. In the unequal variance case, the coverage rate of Scheffé's method drops. However, for balanced designs its FWER still stays close to the nominal level $\alpha = 0.05$. This confirms Scheffé's observation that his method is robust to heteroscedasticity when the sample sizes are equal. The FWER is inflated when the sample sizes differ among the populations. Table 1 shows that when $(\sigma_1,\sigma_2,\sigma_3,\sigma_4) = (0.3, 0.3, 0.1, 0.1)$, for sample sizes $(n_1,n_2,n_3,n_4) = (5,5,10,10)$, $(5,5,20,20)$, $(10,10,20,20)$, and $(10,10,50,50)$, the FWERs are 12.3%, 27%, 11.6%, and 26.5%, respectively. When $(\sigma_1,\sigma_2,\sigma_3,\sigma_4) = (3, 3, 1, 1)$, for the same four sets of sample sizes, the FWERs are 12.8%, 23.5%, 12.95%, and 27.45%, respectively. These FWERs are all considerably greater than the nominal level $\alpha = 0.05$. The greater the difference in sample sizes, the larger the corresponding FWER. By contrast, the generalized Scheffé method is much more robust. For the same configurations, the FWERs based on the generalized Scheffé intervals are between 0.025 and 0.038. Although the method is conservative, its FWER stays well within the nominal level $\alpha = 0.05$.

It is also of interest to see how the widths of the two types of intervals differ. Comparing (2.1) with (2.12), one can see that the difference between them is due to the following two terms:

$$Q_1 = F_{\alpha,I,N-I}\,I\,\mathrm{MSE}, \qquad Q_2 = F_{\alpha,\hat\nu_1,\hat\nu_2}\sum_{i=1}^{I} S_i^2. \tag{3.1}$$

The averaged 𝑄1 and 𝑄2 from 5,000 simulation runs are presented in Table 2.

It can be seen that they are very close to each other in the case of equal variances. However, in the case of unequal variances, $Q_1$ becomes over-optimistically smaller than $Q_2$, which leads to the inflation of the FWER. Finally, Scheffé's intervals are derived from the fact that the $F$ statistic

$$F = \frac{\sum_{i=1}^{I} n_i\,(D_i-\mu_i)^2}{I\,\mathrm{MSE}} \tag{3.2}$$

follows the $F_{I,N-I}$ distribution under a number of assumptions. When these assumptions are violated, the performance of Scheffé's intervals depends on how far the above $F$ statistic deviates from the $F_{I,N-I}$ distribution. For the generalized Scheffé intervals, the FWER largely depends on how accurately the distribution of $R_1/R_2$ is approximated by $F_{\nu_1,\nu_2}$. Figure 1 plots the empirical distribution functions of $R_1/R_2$ and of the $F$ statistic in (3.2), along with their designated $F$ distributions. We selected the following four configurations of variances and sample sizes, which correspond to the homoscedastic/heteroscedastic and balanced/unbalanced cases: (1) $(\sigma_1,\sigma_2,\sigma_3,\sigma_4) = (1,1,1,1)$ with $(n_1,n_2,n_3,n_4) = (10,10,10,10)$ or $(10,10,50,50)$; (2) $(\sigma_1,\sigma_2,\sigma_3,\sigma_4) = (3,3,1,1)$ with $(n_1,n_2,n_3,n_4) = (10,10,10,10)$ or $(10,10,50,50)$.

Configuration (1) corresponds to equal variances for the four means with equal or different sample sizes. Configuration (2) corresponds to unequal variances for the four means with equal or different sample sizes. We calculated the empirical distribution function (EDF) of $R_1/R_2$, and it nearly overlaps with $F_{\nu_1,\nu_2}$ in all four configurations of variances and sample sizes (1(a)–4(a) in Figure 1). This overlap suggests that the $F$ distribution is an excellent approximation to the distribution of the ratio $R_1/R_2$. In addition, the EDF of the $F$ statistic also matches the $F_{I,N-I}$ distribution well (1(b)–3(b) in Figure 1), except in the unbalanced heteroscedastic case, where Scheffé's method fails (4(b) in Figure 1). This explains why the FWER is inflated when both the variances and the sample sizes are unequal.
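The agreement between the EDF of $R_1/R_2$ and $F_{\nu_1,\nu_2}$ can also be summarized by a single number, the largest vertical gap between the two curves (the Kolmogorov-Smirnov distance). The sketch below is an assumption-laden illustration, not the procedure behind Figure 1: it draws the sample means and variances directly from their normal-theory distributions and uses the true variances to compute $\nu_1$ and $\nu_2$.

```python
import numpy as np
from scipy import stats


def simulate_ratio(sigmas, sizes, n_runs=5000, seed=0):
    """Draw R = sum_i n_i (D_i - mu_i)^2 / sum_i S_i^2 with all mu_i = 0."""
    rng = np.random.default_rng(seed)
    sigmas = np.asarray(sigmas, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    shape = (n_runs, len(sizes))
    # Under normality: D_i ~ N(0, sigma_i^2 / n_i) and
    # S_i^2 ~ sigma_i^2 * chi2(n_i - 1) / (n_i - 1), independently.
    d = rng.normal(0.0, sigmas / np.sqrt(sizes), size=shape)
    s2 = sigmas**2 * rng.chisquare(sizes - 1, size=shape) / (sizes - 1)
    return (sizes * d**2).sum(axis=1) / s2.sum(axis=1)


def ks_distance_to_f(sigmas, sizes, n_runs=5000, seed=0):
    """Largest gap between the EDF of R and the F(nu1, nu2) CDF."""
    var = np.asarray(sigmas, dtype=float) ** 2
    n = np.asarray(sizes, dtype=float)
    nu1 = var.sum() ** 2 / np.sum(var**2)              # as in (2.5)
    nu2 = var.sum() ** 2 / np.sum(var**2 / (n - 1))    # as in (2.6)
    ratio = simulate_ratio(sigmas, sizes, n_runs, seed)
    return float(stats.kstest(ratio, stats.f(nu1, nu2).cdf).statistic)


if __name__ == "__main__":
    for sigmas in ((1, 1, 1, 1), (3, 3, 1, 1)):
        for sizes in ((10, 10, 10, 10), (10, 10, 50, 50)):
            print(sigmas, sizes, ks_distance_to_f(sigmas, sizes))
```

A small distance in all four configurations would be consistent with the near overlap described for panels 1(a)–4(a).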

As a final comment, the above simulation results suggest that the generalized Scheffé intervals tend to be wider than the Scheffé intervals. This is our overall impression, but it is not always true. In the simulations, we occasionally observed narrower generalized Scheffé intervals. The data analysis example in the next section illustrates this feature.

4. Example of Data Analysis

Solomon et al. [14] studied smoking behavior in pregnant women. They examined the women's determination to quit smoking while pregnant. They interviewed 349 women at their first prenatal visit, all of whom were smokers when they became pregnant, and classified them into four groups: precontemplation (PC), contemplation (C), preparation (P), and action (A). Their intention was to look at the subsequent smoking behavior of these subjects during the course of pregnancy, but one important consideration was how much these women smoked when they became pregnant. The sample sizes, means, and standard deviations of these four groups, in terms of cigarettes smoked per day when they became pregnant, are given in Table 4. Since the smallest sample size is 37, the normality assumption is not a serious concern even though the response of interest is an integer count.

Table 3 presents the 95% Scheffé’s intervals and the generalized Scheffé intervals for the four group means and their differences. Since both sample sizes and variances are quite different from each other, the generalized Scheffé intervals are more reliable.

One may make a number of inferences with a joint confidence level of 95%. For example, women in the preparation (P) group smoked on average between 26.09 and 31.51 cigarettes per day, which appears to make them the heaviest-smoking group. No significant difference is found between group P and group PC, because the confidence interval for their difference, $(-8.87, 0.87)$, includes 0. It is also interesting to note that the generalized Scheffé intervals are even narrower than the Scheffé intervals.

5. Discussion

The Scheffé method is one of the most commonly used methods for making simultaneous inferences on all linear combinations of means. Scheffé intervals cover all possible linear combinations of means, which is a benefit when a large number of linear combinations need to be compared. The assumption of equal variances is needed to control the Type I error. When this assumption is violated, the proposed method can be conveniently used to construct simultaneous confidence intervals for which the Type I error is controlled at a prespecified nominal level. Results from simulations show that the FWER of the proposed simultaneous confidence intervals is well preserved at the nominal level, so that the equal variance assumption is not required.