Journal of Probability and Statistics

Special Issue

Advanced Designs and Statistical Methods for Genetic and Genomic Studies of Complex Diseases


Research Article | Open Access

Volume 2012 |Article ID 537474 | https://doi.org/10.1155/2012/537474

Yukun Liu, Pengfei Li, Yuejiao Fu, "Testing Homogeneity in a Semiparametric Two-Sample Problem", Journal of Probability and Statistics, vol. 2012, Article ID 537474, 15 pages, 2012. https://doi.org/10.1155/2012/537474

Testing Homogeneity in a Semiparametric Two-Sample Problem

Academic Editor: Yongzhao Shao
Received 18 Nov 2011
Accepted 24 Jan 2012
Published 01 Apr 2012

Abstract

We study a two-sample homogeneity testing problem in which one sample comes from a population with density $f(x)$ and the other comes from a mixture population with density $(1-\lambda)f(x)+\lambda g(x)$. This problem arises naturally in many statistical applications, such as testing for partial differential gene expression in microarray studies or genetic studies of gene mutation. Under the semiparametric assumption $g(x)=f(x)e^{\alpha+\beta x}$, a penalized empirical likelihood ratio test could be constructed, but its implementation is hindered by the fact that there is neither a feasible algorithm for computing the test statistic nor any available result on its theoretical properties. To circumvent these difficulties, we propose an EM test based on the penalized empirical likelihood. We prove that the EM test has a simple chi-square limiting distribution, and we demonstrate its competitive performance by simulations. A real-data example illustrates the proposed methodology.

1. Introduction

Let $x_1,\ldots,x_{n_0}$ be a random sample from a population with distribution function $F$, and let $y_1,\ldots,y_{n_1}$ be a random sample from a population with distribution function $H$. Testing whether the two populations have the same distribution, that is, $H_0: F=H$ versus $H_1: F\neq H$, with both $F$ and $H$ completely unspecified, requires a nonparametric test. Since $H_1: F\neq H$ is a very broad hypothesis, one often wants to consider a more specific alternative, for example, that the two populations differ only in location. In the present paper, we consider a specific alternative in which one of the two samples has a mixture structure. More specifically, we have

$$x_1,\ldots,x_{n_0}\ \overset{\text{i.i.d.}}{\sim}\ f(x),\qquad y_1,\ldots,y_{n_1}\ \overset{\text{i.i.d.}}{\sim}\ h(y)=(1-\lambda)f(y)+\lambda g(y), \tag{1.1}$$

where $f(x)=dF(x)/dx$, $g(y)=dG(y)/dy$, $h(y)=dH(y)/dy$, and $\lambda\in(0,1)$ is an unknown parameter sometimes called the contamination proportion. The problem of interest is to test $H_0: f=h$ or, equivalently, $\lambda=0$. This particular two-sample problem arises naturally in a variety of statistical applications, such as testing for partial differential gene expression in microarray studies, genetic studies of gene mutation, case-control studies with contaminated controls, or testing for a treatment effect in the presence of nonresponders in biological experiments (see Qin and Liang [1] for details).

If no auxiliary information is available, this is merely the usual two-sample goodness-of-fit problem. There is an extensive literature on it; see Zhang [2] and references therein. However, these tests are not well suited to the specific alternative with a mixture structure, as they may be inferior to methods designed for that alternative. In this paper, we propose an empirical likelihood-based testing procedure for this specific mixture alternative under Anderson’s semiparametric assumption [3]. Motivated by the logistic regression model, the semiparametric assumption proposed by Anderson [3] links the two distribution functions $F$ and $G$ through the following equation:

$$\log\frac{g(x)}{f(x)}=\alpha+\beta x, \tag{1.2}$$

where $\alpha$ and $\beta$ are both unknown parameters. There are many examples where the logarithm of the density ratio is linear in the observations.

Example 1.1. Let $F$ and $G$ be the distribution functions of Binomial$(m,p_1)$ and Binomial$(m,p_2)$, respectively. Here the densities $f$ and $g$ refer to the probability mass functions corresponding to $F$ and $G$, respectively. Then,

$$\log\frac{g(x)}{f(x)}=m\log\frac{1-p_2}{1-p_1}+\log\left\{\frac{p_2(1-p_1)}{p_1(1-p_2)}\right\}x. \tag{1.3}$$

Example 1.2. Let $F$ be the distribution function of $N(\mu_1,\sigma^2)$ and $G$ the distribution function of $N(\mu_2,\sigma^2)$. Then,

$$\log\frac{g(x)}{f(x)}=\frac{1}{2\sigma^2}\left(\mu_1^2-\mu_2^2\right)+\frac{1}{\sigma^2}\left(\mu_2-\mu_1\right)x. \tag{1.4}$$

In practice, one may need to apply some sort of transformation to the data (e.g., logarithm transformation) in order to justify the use of the semiparametric model assumption (1.2).

Example 1.3. Let $F$ and $G$ be the distribution functions of $\log N(\mu_1,\sigma^2)$ and $\log N(\mu_2,\sigma^2)$, respectively. It is clear that the log density ratio is a linear function of the log-transformed data:

$$\log\frac{g(x)}{f(x)}=\frac{1}{2\sigma^2}\left(\mu_1^2-\mu_2^2\right)+\frac{1}{\sigma^2}\left(\mu_2-\mu_1\right)\log x. \tag{1.5}$$

Example 1.4. Let $F$ and $G$ be the distribution functions of Gamma$(m_1,\theta)$ and Gamma$(m_2,\theta)$, respectively. In this case,

$$\log\frac{g(x)}{f(x)}=\log\frac{\Gamma(m_1)}{\Gamma(m_2)}+(m_1-m_2)\log\theta+(m_2-m_1)\log x. \tag{1.6}$$
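
The linear forms in Examples 1.1 and 1.4 can be checked numerically. The following is a minimal sketch, not part of the original paper: the function names are hypothetical, and the gamma density is assumed to be parameterized by shape $m$ and scale $\theta$, which is the parameterization under which (1.6) holds.

```python
import math

def binom_log_ratio(x, m, p1, p2):
    """log g(x)/f(x) computed directly from the two binomial pmfs."""
    log_c = math.log(math.comb(m, x))
    log_f = log_c + x * math.log(p1) + (m - x) * math.log(1 - p1)
    log_g = log_c + x * math.log(p2) + (m - x) * math.log(1 - p2)
    return log_g - log_f

def binom_alpha_beta(m, p1, p2):
    """Intercept and slope read off from (1.3)."""
    alpha = m * math.log((1 - p2) / (1 - p1))
    beta = math.log(p2 * (1 - p1) / (p1 * (1 - p2)))
    return alpha, beta

def gamma_log_ratio(x, m1, m2, theta):
    """log g(x)/f(x) from Gamma(shape, scale=theta) densities (assumed form)."""
    def log_pdf(v, m):
        return (m - 1) * math.log(v) - v / theta - math.lgamma(m) - m * math.log(theta)
    return log_pdf(x, m2) - log_pdf(x, m1)

def gamma_alpha_beta(m1, m2, theta):
    """Intercept and slope (in log x) read off from (1.6)."""
    alpha = math.lgamma(m1) - math.lgamma(m2) + (m1 - m2) * math.log(theta)
    beta = m2 - m1
    return alpha, beta
```

For any admissible inputs, `binom_log_ratio(x, m, p1, p2)` should agree with `alpha + beta * x`, and `gamma_log_ratio(x, m1, m2, theta)` with `alpha + beta * math.log(x)`, up to rounding error.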

The semiparametric modeling assumption (1.2) is very flexible and has the advantage of not putting any specific restrictions on the functional form of $f$. Under this assumption, various approaches have been proposed to test homogeneity in the two-sample problem (see [1, 4, 5] and references therein). This paper adds to this literature by introducing a new type of test statistic based on the empirical likelihood [6, 7].

The empirical likelihood (EL) is a nonparametric likelihood method with many nice properties that parallel those of parametric likelihood methods: for example, it is range-preserving, transformation-respecting, and Bartlett correctable, and it offers a systematic approach to incorporating auxiliary information [8–11]. In general, if the parameters are identifiable, the empirical likelihood ratio (ELR) test has a chi-square limiting distribution under the null hypothesis. However, for the aforementioned testing problem, the parameters under $H_0$ are not identifiable, which results in an intractable null limiting distribution for the ELR test. To circumvent this problem, we add a penalty to the log EL that penalizes values of $\lambda$ too close to zero. Working like a soft threshold, the penalty makes the parameters roughly identifiable. Intuitively, the penalized (or modified) ELR test should restore the usual chi-square limiting distribution. Unfortunately, two things hinder the direct use of the penalized ELR test. First, to the best of our knowledge, there is no feasible algorithm to compute the penalized ELR test statistic. Second, there has been no research on the asymptotic properties of the penalized ELR test. Therefore, one cannot obtain critical values for the penalized ELR test, either through simulations or from an asymptotic reference distribution. We find that the EM test [12, 13] based on the penalized EL is a nice solution to the testing problem.

The remainder of this paper is organized as follows. In Section 2, we introduce the ELR and the penalized ELR. The penalized EL-based EM test is given in Section 3. A key computational issue of the EM test is discussed in Section 4. Sections 5 and 6 contain a simulation study and a real-data application, respectively. For clarity, all proofs are postponed to the appendix.

2. Empirical Likelihood

Let $\{t_1,\ldots,t_{n_0},t_{n_0+1},\ldots,t_n\}=\{x_1,\ldots,x_{n_0},y_1,\ldots,y_{n_1}\}$ denote the combined two-sample data, where $n=n_0+n_1$. Under Anderson’s semiparametric assumption (1.2), the likelihood of the two-sample data (1.1) is

$$L=\prod_{i=1}^{n_0}dF(t_i)\prod_{j=n_0+1}^{n}\left[1-\lambda+\lambda e^{\alpha+\beta t_j}\right]dF(t_j). \tag{2.1}$$

Let $p_h=dF(t_h)$, $h=1,\ldots,n$. The EL is just the likelihood $L$ subject to the constraints $p_h\ge 0$, $\sum_{h=1}^{n}p_h=1$, and $\sum_{h=1}^{n}p_h\left(e^{\alpha+\beta t_h}-1\right)=0$. The corresponding log-EL is

$$l=\sum_{h=1}^{n}\log p_h+\sum_{j=1}^{n_1}\log\left[1-\lambda+\lambda e^{\alpha+\beta y_j}\right]. \tag{2.2}$$

We are interested in testing

$$H_0:\ \lambda=0\ \text{ or }\ (\alpha,\beta)=(0,0). \tag{2.3}$$

Under the null hypothesis, the constraint $\sum_{h=1}^{n}p_h\left(e^{\alpha+\beta t_h}-1\right)=0$ always holds and $\sup_{H_0}l=-n\log n$. Under the alternative hypothesis, for any fixed $(\lambda,\alpha,\beta)$, maximizing $l$ with respect to the $p_h$’s leads to the log-EL function of $(\lambda,\alpha,\beta)$:

$$l(\lambda,\alpha,\beta)=-\sum_{h=1}^{n}\log\left[1+\xi\left(e^{\alpha+\beta t_h}-1\right)\right]-n\log n+\sum_{j=1}^{n_1}\log\left[1-\lambda+\lambda e^{\alpha+\beta y_j}\right], \tag{2.4}$$

where $\xi$ is the solution to the following equation:

$$\sum_{h=1}^{n}\frac{e^{\alpha+\beta t_h}-1}{1+\xi\left(e^{\alpha+\beta t_h}-1\right)}=0. \tag{2.5}$$

Hence, the EL ratio function is $R(\lambda,\alpha,\beta)=2\{l(\lambda,\alpha,\beta)+n\log n\}$, and the ELR is denoted by $R=\sup R(\lambda,\alpha,\beta)$.
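
To make (2.4) and (2.5) concrete, the sketch below evaluates the log-EL for given $(\lambda,\alpha,\beta)$. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and bisection is used for (2.5) because the left-hand side is strictly decreasing in $\xi$ on the interval where every $1+\xi(e^{\alpha+\beta t_h}-1)$ is positive.

```python
import math

def solve_xi(t, alpha, beta, iterations=200):
    """Solve (2.5) for xi by bisection; t is the pooled sample."""
    d = [math.exp(alpha + beta * th) - 1.0 for th in t]
    d_min, d_max = min(d), max(d)
    if d_min == 0.0 and d_max == 0.0:
        return 0.0  # alpha = beta = 0: every xi satisfies (2.5)
    if d_min >= 0.0 or d_max <= 0.0:
        raise ValueError("(2.5) has no solution: all terms share one sign")
    # 1 + xi * d_h > 0 for every h exactly when xi lies in (lo, hi).
    lo, hi = -1.0 / d_max, -1.0 / d_min

    def score(xi):
        return sum(dh / (1.0 + xi * dh) for dh in d)

    # score decreases from +infinity to -infinity across (lo, hi).
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def log_el(lam, alpha, beta, x, y):
    """Profile log-EL l(lambda, alpha, beta) of (2.4)."""
    t = list(x) + list(y)
    n = len(t)
    xi = solve_xi(t, alpha, beta)
    first = sum(math.log(1.0 + xi * (math.exp(alpha + beta * th) - 1.0)) for th in t)
    second = sum(math.log(1.0 - lam + lam * math.exp(alpha + beta * yj)) for yj in y)
    return -first - n * math.log(n) + second
```

At $(\alpha,\beta)=(0,0)$ the function returns $-n\log n$, in line with $\sup_{H_0}l=-n\log n$.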

The null hypothesis $H_0$ holds for $\lambda=0$ regardless of $(\alpha,\beta)$, or for $(\alpha,\beta)=(0,0)$ regardless of $\lambda$. This implies that the parameter $(\lambda,\alpha,\beta)$ is not identifiable under $H_0$, resulting in rather complicated asymptotic properties of the ELR. One may consider the modified or penalized likelihood method [14] and define the penalized log-EL function $pl(\lambda,\alpha,\beta)=l(\lambda,\alpha,\beta)+\log(\lambda)$. Accordingly, the penalized EL ratio function is

$$pR(\lambda,\alpha,\beta)=2\{pl(\lambda,\alpha,\beta)-pl(1,0,0)\}=-2\sum_{h=1}^{n}\log\left[1+\xi\left(e^{\alpha+\beta t_h}-1\right)\right]+2\sum_{j=1}^{n_1}\log\left[1-\lambda+\lambda e^{\alpha+\beta y_j}\right]+2\log(\lambda), \tag{2.6}$$

where $\xi$ is the solution to (2.5). The penalty function $\log(\lambda)$ goes to $-\infty$ as $\lambda$ approaches 0. Therefore, $\lambda$ is bounded away from 0, and the null hypothesis in (2.3) then reduces to $(\alpha,\beta)=(0,0)$. That is, the parameters in the penalized log-EL function are asymptotically identifiable. However, the asymptotic behavior of the penalized ELR test is still complicated. Moreover, computing the penalized ELR test statistic is another obstacle to implementing the penalized ELR method: no feasible and stable algorithm has been found for this purpose. The EL-based EM test proposed in this paper provides an efficient way to solve the problem.

3. EL-Based EM Test

Motivated by Chen and Li [12] and Li et al. [13], we propose an EM test based on the penalized EL to test the hypothesis (2.3). The EM test statistics are derived iteratively. We first choose a finite set $\Lambda=\{\lambda_1,\ldots,\lambda_L\}\subset(0,1]$, for instance, $\Lambda=\{0.1,0.2,\ldots,0.9,1.0\}$, and a positive integer $K$ (2 or 3 in general). For each $l=1,\ldots,L$, we carry out the following steps.

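Step 1. Let $k=1$ and $\lambda_l^{(k)}=\lambda_l$. Calculate $(\alpha_l^{(k)},\beta_l^{(k)})=\arg\max_{\alpha,\beta}pR(\lambda_l^{(k)},\alpha,\beta)$.

Step 2. Update $(\lambda,\alpha,\beta)$ by applying the following algorithm $K-1$ times.
Substep 2.1. Calculate the posterior distribution,

$$w_{jl}^{(k)}=\frac{\lambda_l^{(k)}\exp\left(\alpha_l^{(k)}+\beta_l^{(k)}y_j\right)}{1-\lambda_l^{(k)}+\lambda_l^{(k)}\exp\left(\alpha_l^{(k)}+\beta_l^{(k)}y_j\right)},\quad j=1,\ldots,n_1, \tag{3.1}$$

and update $\lambda$ by

$$\lambda_l^{(k+1)}=\arg\max_{\lambda}\left\{\sum_{j=1}^{n_1}\left(1-w_{jl}^{(k)}\right)\log(1-\lambda)+\sum_{j=1}^{n_1}w_{jl}^{(k)}\log(\lambda)+\log(\lambda)\right\}. \tag{3.2}$$

Substep 2.2. Update $(\alpha,\beta)$ by $(\alpha_l^{(k+1)},\beta_l^{(k+1)})=\arg\max_{\alpha,\beta}pR(\lambda_l^{(k+1)},\alpha,\beta)$.
Substep 2.3. Let $k=k+1$ and continue.

Step 3. Define the test statistic $M_n^{(K)}(\lambda_l)=pR(\lambda_l^{(K)},\alpha_l^{(K)},\beta_l^{(K)})$.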
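
Substep 2.1 can be sketched in a few lines. This is an illustrative fragment with hypothetical function names, not the authors' code; it uses the closed form of the maximizer of (3.2), $(\sum_j w_j+1)/(n_1+1)$, which the appendix states for the same objective.

```python
import math

def posterior_weights(lam, alpha, beta, y):
    """E-step (3.1): posterior probability that each y_j comes from g."""
    weights = []
    for yj in y:
        num = lam * math.exp(alpha + beta * yj)
        weights.append(num / (1.0 - lam + num))
    return weights

def update_lambda(weights):
    """M-step (3.2): closed-form maximizer of the penalized objective."""
    n1 = len(weights)
    return (sum(weights) + 1.0) / (n1 + 1.0)
```

When $(\alpha,\beta)=(0,0)$ every weight equals the current $\lambda$, so the update moves $\lambda$ to $(n_1\lambda+1)/(n_1+1)$, a small shift toward 1 caused by the penalty term.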

The EM test statistic is defined as

$$EM_n^{(K)}=\max\left\{M_n^{(K)}(\lambda_l),\ l=1,\ldots,L\right\}. \tag{3.3}$$

We reject the null hypothesis $H_0$ when the EM test statistic is greater than some critical value determined by the following limiting distribution.

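Theorem 3.1. Suppose $\rho=n_1/n\in(0,1)$ is a constant. Assume the null hypothesis $H_0$ holds and $E(t_h)=0$ and $\mathrm{Var}(t_h)=\sigma^2\in(0,\infty)$ for $h=1,\ldots,n$. For $l=1,\ldots,L$ and any fixed $k$, it holds that

$$\lambda_l^{(k)}-\lambda_l=o_p(1),\qquad \alpha_l^{(k)}=O_p\left(n^{-1}\right),\qquad \beta_l^{(k)}=\frac{\bar y-\bar x}{\lambda_l\sigma^2}+o_p\left(n^{-1/2}\right), \tag{3.4}$$

where $\bar x=(1/n_0)\sum_{i=1}^{n_0}x_i$ and $\bar y=(1/n_1)\sum_{j=1}^{n_1}y_j$.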

Remark 3.2. The assumption $E(t_h)=0$ is made only for convenience and is not necessary. If it does not hold, we can replace $t_h$ and $\alpha$ with $t_h-E(t_h)$ and $\alpha+\beta E(t_h)$.

Theorem 3.3. Assume the conditions of Theorem 3.1 hold and $1\in\Lambda$. Under the null hypothesis (2.3), $EM_n^{(K)}\to\chi_1^2$ in distribution as $n\to\infty$.

We finish this section with an additional remark.

Remark 3.4. We point out that the idea of the EM test can also be extended to more general models such as $\log(g(x)/f(x))=\alpha+\beta_1x+\cdots+\beta_kx^k$ for some integer $k$, or $\log(g(x)/f(x))=\alpha+\beta_1t_1(x)+\cdots+\beta_kt_k(x)$ with the $t_i(\cdot)$’s being known functions.

4. Computation of the EM Test

A key step of the EM test procedure is to maximize $pR(\lambda,\alpha,\beta)$ with respect to $(\alpha,\beta)$ for fixed $\lambda$. In this section, we propose a computation strategy that provides a stable solution to this optimization problem. Throughout this section, $\lambda$ is held fixed.

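The objective function is $pR(\lambda,\alpha,\beta)=G(\xi^*,\alpha,\beta)$, where

$$G(\xi,\alpha,\beta)=-2\sum_{h=1}^{n}\log\left[1+\xi\left(e^{\alpha+\beta t_h}-1\right)\right]+2\sum_{j=1}^{n_1}\log\left[1-\lambda+\lambda e^{\alpha+\beta y_j}\right]+2\log(\lambda) \tag{4.1}$$

and $\xi^*=\xi^*(\alpha,\beta)$ is the solution to

$$\frac{\partial G}{\partial\xi}=-2\sum_{h=1}^{n}\frac{e^{\alpha+\beta t_h}-1}{1+\xi\left(e^{\alpha+\beta t_h}-1\right)}=0. \tag{4.2}$$

If $(\alpha,\beta)$ is the maximum point of $pR(\lambda,\alpha,\beta)$, it should generally satisfy

$$\frac{\partial G}{\partial\alpha}=-2\sum_{h=1}^{n}\frac{\xi e^{\alpha+\beta t_h}}{1+\xi\left(e^{\alpha+\beta t_h}-1\right)}+2\sum_{j=1}^{n_1}\frac{\lambda e^{\alpha+\beta y_j}}{1-\lambda+\lambda e^{\alpha+\beta y_j}}=0. \tag{4.3}$$

Equation (4.2) implies $\sum_{h=1}^{n}1/\{1+\xi(e^{\alpha+\beta t_h}-1)\}=n$, and hence $\sum_{h=1}^{n}e^{\alpha+\beta t_h}/\{1+\xi(e^{\alpha+\beta t_h}-1)\}=n$. Combining (4.2) and (4.3) therefore leads to

$$\xi=\frac{1}{n}\sum_{j=1}^{n_1}\frac{\lambda e^{\alpha+\beta y_j}}{1-\lambda+\lambda e^{\alpha+\beta y_j}}. \tag{4.4}$$

Putting this expression for $\xi$ back into (4.1), we have a new function

$$H(\alpha,\beta)=-2\sum_{h=1}^{n}\log\left[1+\left(e^{\alpha+\beta t_h}-1\right)\frac{1}{n}\sum_{j=1}^{n_1}\frac{\lambda e^{\alpha+\beta y_j}}{1-\lambda+\lambda e^{\alpha+\beta y_j}}\right]+2\sum_{j=1}^{n_1}\log\left[1-\lambda+\lambda e^{\alpha+\beta y_j}\right]. \tag{4.5}$$

It can be verified that $H(\alpha,\beta)$ is almost surely concave in a neighborhood of $(0,0)$ given $\lambda$, which means that maximizing $H(\alpha,\beta)$ with respect to $(\alpha,\beta)$ gives the maximum of $pR(\lambda,\alpha,\beta)$ for fixed $\lambda$. The stability of the method is illustrated by the following simulation study.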
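
The strategy of this section can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the coordinate pattern search (chosen only because it needs no derivatives), and its step, tolerance, and move-cap settings are all assumptions. The last function adds $2\log\lambda$ back to the maximum of $H$, relying on the section's statement that the maximizer of $H$ also maximizes $pR$ for fixed $\lambda$.

```python
import math

def h_objective(alpha, beta, lam, x, y):
    """H(alpha, beta) of (4.5) for fixed lam; x and y are the two samples."""
    t = list(x) + list(y)
    n = len(t)
    ey = [math.exp(alpha + beta * yj) for yj in y]
    xi = sum(lam * e / (1.0 - lam + lam * e) for e in ey) / n  # equation (4.4)
    first = sum(math.log(1.0 + xi * (math.exp(alpha + beta * th) - 1.0)) for th in t)
    second = sum(math.log(1.0 - lam + lam * e) for e in ey)
    return -2.0 * first + 2.0 * second

def maximize_h(lam, x, y, step=0.25, tol=1e-8, max_moves=500):
    """Coordinate pattern search started at (0, 0); accepts only improvements."""
    point = [0.0, 0.0]
    best = h_objective(0.0, 0.0, lam, x, y)
    moves = 0
    while step > tol and moves < max_moves:
        improved = False
        for i in (0, 1):
            for delta in (step, -step):
                trial = list(point)
                trial[i] += delta
                value = h_objective(trial[0], trial[1], lam, x, y)
                if value > best:
                    point, best, improved = trial, value, True
                    moves += 1
        if not improved:
            step *= 0.5  # no neighbor improves: refine the grid
    return point[0], point[1], best

def penalized_elr_max(lam, x, y):
    """Approximate max over (alpha, beta) of pR(lam, alpha, beta)."""
    alpha, beta, h_max = maximize_h(lam, x, y)
    return alpha, beta, h_max + 2.0 * math.log(lam)
```

Because $H(0,0)=0$ and the search only accepts improvements, the returned maximum of $H$ is never negative. A gradient-based optimizer could replace the pattern search when faster convergence is needed.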

5. Simulation Study

We consider the two models in Examples 1.3 and 1.4, with $\mu_1=0$, $\mu_2=\mu$, and $\sigma^2=1$ for Example 1.3, and $m_1=1$, $m_2=m$, and $\theta=1$ for Example 1.4. Nominal levels of 0.01, 0.05, and 0.10 are considered. The logarithm transformation is applied to the original data before using the EM test. The initial set $\Lambda=\{0.1,0.2,\ldots,1\}$ and iteration number $K=3$ are used to calculate the EM test statistic.

One competing method for testing homogeneity under the semiparametric two-sample model is the score test proposed by Qin and Liang [1]. This method is based on

$$S(\alpha,\beta)=\left.\frac{\partial l(\lambda,\alpha,\beta)}{\partial\lambda}\right|_{\lambda=0}=\sum_{j=1}^{n_1}\left(e^{\alpha+\beta y_j}-1\right), \tag{5.1}$$

where $l(\lambda,\alpha,\beta)$ is the log empirical likelihood function given in (2.4). Let $(\hat\alpha_1,\hat\beta_1)=\arg\max_{\alpha,\beta}l(1,\alpha,\beta)$. The score test statistic is defined as $T_1=S(\hat\alpha_1,\hat\beta_1)/(1+n_1/n_0)$, which has a $\chi_1^2$ limiting distribution under the null hypothesis.

We compare the EM test and the score test in terms of type I error and power. We calculate the type I errors of each method under the null hypothesis based on 20,000 repetitions and the power under the alternative models based on 2,000 repetitions. For a fair comparison, simulated critical values are used to calculate the power. We consider two sample sizes, 50 and 200, and $K=1,2,3$. Tables 1 and 2 contain the simulation results for the log-normal models, and Tables 3 and 4 contain those for the gamma models.
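
A simulation of this kind can be organized as in the sketch below for the log-normal setting. It is an illustration with hypothetical function names, not the authors' code: the data are generated directly on the log scale, where a log-normal$(\mu,1)$ variable becomes $N(\mu,1)$, and the group sizes `n0` and `n1` are left to the caller because the text reports only the overall sample sizes 50 and 200.

```python
import random

def simulate_log_data(n0, n1, lam, mu, rng):
    """Log-scale samples: x_i ~ N(0, 1); y_j ~ (1 - lam) N(0, 1) + lam N(mu, 1)."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n0)]
    y = [rng.gauss(mu if rng.random() < lam else 0.0, 1.0) for _ in range(n1)]
    return x, y

def rejection_rate(statistic, critical_value, n0, n1, lam, mu, reps, seed=0):
    """Monte Carlo estimate of P(statistic > critical_value).

    With lam = 0 this estimates the type I error; otherwise it estimates power.
    `statistic` is any function of the two samples (x, y).
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x, y = simulate_log_data(n0, n1, lam, mu, rng)
        if statistic(x, y) > critical_value:
            rejections += 1
    return rejections / reps
```

Simulated critical values, as used for the power comparison, can be obtained in the same way by collecting the statistic over null replications (`lam = 0`) and taking the appropriate empirical quantile.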


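Table 1: Type I error (rows with λ = 0) and power, in percent, of the EM tests and the score (SC) test for the log-normal models with sample size 50. Level is the nominal level in percent.

λ μ Level EM_n(1) EM_n(2) EM_n(3) SC

0 - 10 11.9 12.2 12.2 11.5
0 - 5 6.3 6.5 6.5 6.4
0 - 1 1.6 1.6 1.6 1.9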
0.1 1 10 14.8 14.5 14.5 14.6
0.1 1 5 8.5 8.6 8.6 8.6
0.1 1 1 2.5 2.5 2.5 2.4
0.1 2 10 27.2 28.1 28 25.6
0.1 2 5 17.8 18.4 18.4 16.7
0.1 2 1 6.6 6.7 6.7 6.2
0.1 3 10 47.1 48.3 48.3 41.4
0.1 3 5 34.2 35.6 35.4 30.8
0.1 3 1 15.4 15.6 15.6 14.9
0.2 1 10 25.5 25.9 26 25.6
0.2 1 5 16.4 16.4 16.4 17
0.2 1 1 5.4 5.3 5.3 5.7
0.2 2 10 62.2 62.7 62.7 56.7
0.2 2 5 50.6 51.3 51.2 45.9
0.2 2 1 28.4 28.5 28.5 24.7
0.2 3 10 88.3 88.8 88.8 81
0.2 3 5 81 82.3 82.3 73.4
0.2 3 1 61.7 61.9 61.9 51.5
0.3 1 10 43.3 42.9 42.8 42.8
0.3 1 5 31.3 31.1 31.1 31.6
0.3 1 1 14.2 14.2 14.2 13.9
0.3 2 10 88.1 88.5 88.5 84.2
0.3 2 5 80.8 80.8 80.8 76.8
0.3 2 1 61.5 61.5 61.5 55.3
0.3 3 10 99.3 99.3 99.3 97
0.3 3 5 98 98.2 98.2 94.8
0.3 3 1 93 93.2 93.2 85.2


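Table 2: Type I error (rows with λ = 0) and power, in percent, of the EM tests and the score (SC) test for the log-normal models with sample size 200. Level is the nominal level in percent.

λ μ Level EM_n(1) EM_n(2) EM_n(3) SC

0 - 10 10.4 10.5 10.6 10.2
0 - 5 5.5 5.6 5.6 5.4
0 - 1 1.2 1.2 1.2 1.2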
0.1 1 10 26.5 26.7 26.5 26.2
0.1 1 5 17.2 17.2 17.2 16.4
0.1 1 1 5.8 5.9 6 5.6
0.1 2 10 68.3 69 69.2 58.4
0.1 2 5 58.5 58.8 58.9 47.4
0.1 2 1 37 37.1 37.4 25.1
0.1 3 10 96.4 96.8 97 84.4
0.1 3 5 94.6 94.8 95.2 77.6
0.1 3 1 86.2 87.2 87.4 58.6
0.2 1 10 63 62.9 62.8 62.1
0.2 1 5 50.2 50 50 49.4
0.2 1 1 27.8 27.6 27.5 26.2
0.2 2 10 99.2 99.3 99.4 97.5
0.2 2 5 98.6 98.6 98.6 95
0.2 2 1 95.1 95.2 95.2 85.5
0.2 3 10 100 100 100 100
0.2 3 5 100 100 100 99.9
0.2 3 1 100 100 100 99.2
0.3 1 10 89.5 89.5 89.6 89
0.3 1 5 84 83.9 83.9 82.6
0.3 1 1 65.1 64.9 64.6 63
0.3 2 10 100 100 100 100
0.3 2 5 100 100 100 99.9
0.3 2 1 100 100 100 99.7
0.3 3 10 100 100 100 100
0.3 3 5 100 100 100 100
0.3 3 1 100 100 100 100


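Table 3: Type I error (rows with λ = 0) and power, in percent, of the EM tests and the score (SC) test for the gamma models with sample size 50. Level is the nominal level in percent.

λ m Level EM_n(1) EM_n(2) EM_n(3) SC

0 - 10 12.2 12.5 12.5 12.1
0 - 5 6.4 6.6 6.6 6.7
0 - 1 1.4 1.4 1.4 2.3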
0.1 2 10 14.9 15.1 15.2 12
0.1 2 5 8.8 8.9 8.9 6.4
0.1 2 1 2.8 2.8 2.8 0.6
0.1 3 10 19.6 19.9 19.9 14.1
0.1 3 5 13.2 13.2 13.2 7.7
0.1 3 1 4.3 4.4 4.4 1
0.1 4 10 25.5 26.4 26.5 17
0.1 4 5 17.5 17.9 17.9 9.2
0.1 4 1 6.3 6.4 6.4 1.1
0.2 2 10 22.9 22.7 22.8 17.6
0.2 2 5 14.4 14.3 14.3 9.2
0.2 2 1 4.5 4.7 4.7 1.2
0.2 3 10 39.6 39.9 40 27.4
0.2 3 5 29.1 29.5 29.5 16.7
0.2 3 1 14.3 14.4 14.4 4
0.2 4 10 61.1 61.7 61.7 37
0.2 4 5 49.2 49.6 49.6 24.1
0.2 4 1 28.4 28.6 28.6 6.6
0.3 2 10 36.3 36.4 36.4 28.6
0.3 2 5 26.1 25.9 25.9 16.9
0.3 2 1 11.9 11.9 11.9 3.1
0.3 3 10 67.2 67.2 67.2 48.9
0.3 3 5 55.8 55.8 55.8 35.1
0.3 3 1 34 34 34.1 11.3
0.3 4 10 87.9 88.1 88.2 67.5
0.3 4 5 81.8 82.2 82.2 53.4
0.3 4 1 63.1 63.3 63.4 21.4


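Table 4: Type I error (rows with λ = 0) and power, in percent, of the EM tests and the score (SC) test for the gamma models with sample size 200. Level is the nominal level in percent.

λ m Level EM_n(1) EM_n(2) EM_n(3) SC

0 - 10 11.2 11.3 11.3 11.1
0 - 5 5.9 5.9 6 5.7
0 - 1 1.2 1.2 1.2 1.4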
0.1 2 10 23.1 22.7 22.7 19.7
0.1 2 5 14.2 14.2 14.2 11.7
0.1 2 1 5.1 5.1 5.2 3.1
0.1 3 10 39.6 39.8 39.9 29.5
0.1 3 5 29 29.4 29.6 19
0.1 3 1 13.2 13.4 13.5 4.8
0.1 4 10 62.3 62.5 62.7 37.5
0.1 4 5 52.2 52.5 52.8 26.2
0.1 4 1 32.5 33.2 33.7 8.5
0.2 2 10 49 48.9 48.9 43.8
0.2 2 5 36.6 36.7 36.5 30.6
0.2 2 1 19.4 19.4 19.4 11.4
0.2 3 10 88.2 88.2 88.4 73
0.2 3 5 81.5 81.6 81.6 61.2
0.2 3 1 64.6 64.6 64.8 34.6
0.2 4 10 98.9 98.9 98.9 87.1
0.2 4 5 98 98.1 98.1 79.7
0.2 4 1 94.3 94.2 94.2 54.5
0.3 2 10 78.5 78.5 78.6 73
0.3 2 5 70.1 70 70 62.5
0.3 2 1 48.7 48.8 48.8 34.9
0.3 3 10 99.2 99.2 99.2 96.1
0.3 3 5 98.8 98.8 98.8 93
0.3 3 1 96.5 96.5 96.5 78.8
0.3 4 10 100 100 100 99.4
0.3 4 5 100 100 100 98.7
0.3 4 1 100 100 100 92.5

The results show that the EM test and the score test have similar type I errors. For both methods, the type I errors are somewhat larger than the nominal levels when the sample size is $n=50$; they are close to the nominal levels when the sample size is increased to $n=200$. For the log-normal models, the two methods have almost the same power when the two component distributions are close to each other, such as $\mu=1$; the EM test becomes much more powerful when the components are far apart and the sample size increases. In the case of $n=50$, $\lambda=0.2$, $\mu=3$, and nominal level 0.01, the EM test gains about 10 percentage points in power over the score test; the gain rises to almost 30 percentage points when $\lambda=0.1$, $\mu=3$, and the sample size increases to $n=200$. For the gamma models, the advantage of the EM test is more obvious: for both sample sizes $n=50$ and 200, the EM test is more powerful than the score test.

6. Real Example

We apply our EM test procedure to the drug abuse data [15] in a study of addiction to morphine in rats. In this study, rats got morphine by pressing a lever and the frequency of lever presses (self-injection rates) after six-day treatment with morphine was recorded as response variable. The data consist of the number of lever presses for five groups of rats: four treatment groups with different dose levels and one saline group (control group).

We analyzed the response variables (the number of lever presses by rats) of the treatment group at the first dose level and of the control group. The data are tabulated in Table 3 of Fu et al. [5]. Following Boos and Brownie [16] and Fu et al. [5], we analyze the transformed data $\log_{10}(R+1)$, with $R$ being the number of lever presses by rats. Instead of using parametric models as in Boos and Brownie [16] and Fu et al. [5], we adopt Anderson’s semiparametric approach. That is, we assume that the response variables in the control group come from $f(x)$, while the response variables in the treatment group come from $h(x)=(1-\lambda)f(x)+\lambda g(x)$ with $g(x)/f(x)=\exp(\alpha+\beta x)$. The EM test statistics for testing homogeneity under the semiparametric two-sample model are found to be $EM_n^{(1)}=14.090$, $EM_n^{(2)}=14.150$, and $EM_n^{(3)}=14.167$. Calibrated by the $\chi_1^2$ limiting distribution, the $P$ values are all around 0.02%. We also applied the score test of Qin and Liang [1]. The score test statistic is 9.417, with a $P$ value of 0.2% calibrated by the $\chi_1^2$ limiting distribution. We also used permutation methods to obtain the $P$ values of the two types of tests. Based on 50,000 permutations, the $P$ values of the three EM test statistics are all around 0.03%, and the $P$ value of the score test is around 0.5%. In accordance with Fu et al. [5], both methods suggest a significant treatment effect, while the proposed EM test provides much stronger evidence than the score test.
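
The $\chi_1^2$ calibration quoted above is easy to reproduce. The sketch below uses a hypothetical helper name and the identity $P(\chi_1^2>s)=P(|Z|>\sqrt{s})=\operatorname{erfc}(\sqrt{s/2})$ for a standard normal $Z$; only the statistics reported in the text are used as inputs.

```python
import math

def chi2_1_pvalue(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))

# Statistics reported in the text: three EM tests and the score test.
for name, value in [("EM(1)", 14.090), ("EM(2)", 14.150),
                    ("EM(3)", 14.167), ("score", 9.417)]:
    print(name, f"{100 * chi2_1_pvalue(value):.3f}%")
```

The permutation $P$ values cannot be reproduced this way, since they require the original data.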

Appendix

Proofs

The proofs of Theorems 3.1 and 3.3 are based on the three lemmas given below. Lemma A.1 assesses the order of the maximum empirical likelihood estimators of 𝛼 and 𝛽 with 𝜆 bounded away from 0 under the null hypothesis. Lemma A.2 shows that the EM iteration updates the value of 𝜆 by the amount of order 𝑜𝑝(1). Theorem 3.1 is then proved by iteratively using Lemmas A.1 and A.2. Lemma A.3 gives an approximation of the penalized ELR for any 𝜆 bounded away from 0, based on which we prove Theorem 3.3.

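Lemma A.1. Assume the conditions of Theorem 3.1. Let $\bar\lambda\in[\epsilon,1]$ for some constant $\epsilon>0$ and $(\bar\alpha,\bar\beta)=\arg\max_{\alpha,\beta}pR(\bar\lambda,\alpha,\beta)$. Then, we have

$$\bar\alpha=O_p\left(n^{-1}\right),\qquad \bar\beta=\frac{\bar y-\bar x}{\bar\lambda\sigma^2}+o_p\left(n^{-1/2}\right) \tag{A.1}$$

with $\bar x=(1/n_0)\sum_{i=1}^{n_0}x_i$ and $\bar y=(1/n_1)\sum_{j=1}^{n_1}y_j$.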

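Proof. Since $\bar\lambda\ge\epsilon>0$, the parameters $(\alpha,\beta)$ in the empirical likelihood ratio are identifiable. Therefore, $(\bar\alpha,\bar\beta)$ is $\sqrt{n}$-consistent for the true value $(0,0)$, that is, $\bar\alpha=O_p(n^{-1/2})$ and $\bar\beta=O_p(n^{-1/2})$ [10].
Following the arguments in Section 4, the maximum empirical likelihood estimate $(\bar\alpha,\bar\beta)$ should satisfy (here $\bar\lambda$ is written as $\lambda$ for brevity)

$$\frac{\partial G(\bar\xi,\bar\alpha,\bar\beta)}{\partial\alpha}=-2\sum_{h=1}^{n}\frac{\bar\xi e^{\bar\alpha+\bar\beta t_h}}{1+\bar\xi\left(e^{\bar\alpha+\bar\beta t_h}-1\right)}+2\sum_{j=1}^{n_1}\frac{\lambda e^{\bar\alpha+\bar\beta y_j}}{1-\lambda+\lambda e^{\bar\alpha+\bar\beta y_j}}=0, \tag{A.2}$$

$$\frac{\partial G(\bar\xi,\bar\alpha,\bar\beta)}{\partial\beta}=-2\sum_{h=1}^{n}\frac{\bar\xi e^{\bar\alpha+\bar\beta t_h}t_h}{1+\bar\xi\left(e^{\bar\alpha+\bar\beta t_h}-1\right)}+2\sum_{j=1}^{n_1}\frac{\lambda e^{\bar\alpha+\bar\beta y_j}y_j}{1-\lambda+\lambda e^{\bar\alpha+\bar\beta y_j}}=0 \tag{A.3}$$

with

$$\bar\xi=\frac{1}{n}\sum_{j=1}^{n_1}\frac{\lambda e^{\bar\alpha+\bar\beta y_j}}{1-\lambda+\lambda e^{\bar\alpha+\bar\beta y_j}}. \tag{A.4}$$

Applying a Taylor expansion to the right-hand side of (A.4), we get

$$\bar\xi=\frac{n_1}{n}\lambda+o_p(1). \tag{A.5}$$

Further applying a first-order Taylor expansion to (A.2) and using (A.4), we get

$$n\bar\xi\left(1-\bar\xi\right)\bar\alpha+\bar\xi\left(1-\bar\xi\right)\sum_{h=1}^{n}t_h\bar\beta-n_1\lambda(1-\lambda)\bar\alpha-\lambda(1-\lambda)\sum_{j=1}^{n_1}y_j\bar\beta=O_p(n)\left(\bar\alpha^2+\bar\beta^2\right). \tag{A.6}$$

Note that both $\bar\alpha$ and $\bar\beta$ are of order $O_p(n^{-1/2})$ and that both $\sum_{h=1}^{n}t_h$ and $\sum_{j=1}^{n_1}y_j$ have order $O_p(n^{1/2})$. Combining (A.5) and (A.6) yields $n_1\lambda\left(\lambda-(n_1/n)\lambda\right)\bar\alpha=O_p(1)$. Therefore, $\bar\alpha=O_p(n^{-1})$.
Similarly, a first-order Taylor expansion of (A.3) results in

$$0=-\bar\xi\sum_{h=1}^{n}t_h-\bar\xi\left(1-\bar\xi\right)\sum_{h=1}^{n}t_h\bar\alpha-\bar\xi\left(1-\bar\xi\right)\sum_{h=1}^{n}t_h^2\bar\beta+\lambda\sum_{j=1}^{n_1}y_j+\lambda(1-\lambda)\sum_{j=1}^{n_1}y_j\bar\alpha+\lambda(1-\lambda)\sum_{j=1}^{n_1}y_j^2\bar\beta+O_p(n)\left(\bar\alpha^2+\bar\beta^2\right). \tag{A.7}$$

With the same reasoning as for $\bar\alpha$, it follows from (A.7) that

$$\left\{n_1\lambda\left(1-\frac{n_1}{n}\lambda\right)\sigma^2-n_1\lambda(1-\lambda)\sigma^2\right\}\bar\beta=\lambda\sum_{j=1}^{n_1}y_j-\frac{n_1}{n}\lambda\sum_{h=1}^{n}t_h+o_p\left(n^{1/2}\right). \tag{A.8}$$

After some algebra, we have $\bar\beta=(\bar y-\bar x)/(\lambda\sigma^2)+o_p(n^{-1/2})$, which completes the proof.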
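
The algebra behind the last step of the proof can be spelled out as follows. This is a worked sketch added for clarity: $\bar\beta$ denotes the maximizer in Lemma A.1, and the only facts used are $n=n_0+n_1$ and $\sum_{h=1}^{n}t_h=n_0\bar x+n_1\bar y$.

```latex
% Coefficient of \bar\beta on the left-hand side of (A.8)
n_1\lambda\Bigl(1-\tfrac{n_1}{n}\lambda\Bigr)\sigma^2-n_1\lambda(1-\lambda)\sigma^2
  = n_1\lambda^2\Bigl(1-\tfrac{n_1}{n}\Bigr)\sigma^2
  = \frac{n_0n_1}{n}\,\lambda^2\sigma^2 .

% Leading term on the right-hand side of (A.8)
\lambda\sum_{j=1}^{n_1}y_j-\frac{n_1}{n}\lambda\sum_{h=1}^{n}t_h
  = \lambda n_1\bar y-\frac{n_1}{n}\lambda\,(n_0\bar x+n_1\bar y)
  = \frac{n_0n_1}{n}\,\lambda\,(\bar y-\bar x) .

% Dividing, and noting that n_0 n_1 / n is of order n,
\bar\beta=\frac{\bar y-\bar x}{\lambda\sigma^2}+o_p\bigl(n^{-1/2}\bigr).
```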

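Suppose that $\bar\lambda$, $\bar\alpha$, and $\bar\beta$ have the properties given in Lemma A.1. For $j=1,\ldots,n_1$, let $\bar w_j=\bar\lambda\exp(\bar\alpha+\bar\beta y_j)/\{1-\bar\lambda+\bar\lambda\exp(\bar\alpha+\bar\beta y_j)\}$. The updated value of $\lambda$ is

$$\lambda^*=\arg\max_{\lambda}\left\{\sum_{j=1}^{n_1}\left(1-\bar w_j\right)\log(1-\lambda)+\sum_{j=1}^{n_1}\bar w_j\log(\lambda)+\log(\lambda)\right\}. \tag{A.9}$$

It can be verified that the closed form of $\lambda^*$ is given by $\lambda^*=\left(\sum_{j=1}^{n_1}\bar w_j+1\right)/(n_1+1)$. We now show that the above iteration only changes the value of $\lambda$ by an $o_p(1)$ term.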
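
The closed form follows from a one-line first-order condition. In the sketch below, $W$ is shorthand (introduced here, not in the original) for the sum of the weights in (A.9), and $Q$ denotes the objective being maximized.

```latex
% Objective of (A.9), with W = \sum_{j=1}^{n_1} w_j \in (0, n_1)
Q(\lambda)=(n_1-W)\log(1-\lambda)+(W+1)\log\lambda .

% First-order condition
Q'(\lambda)=-\frac{n_1-W}{1-\lambda}+\frac{W+1}{\lambda}=0
\;\Longleftrightarrow\;(W+1)(1-\lambda)=(n_1-W)\lambda
\;\Longleftrightarrow\;\lambda^*=\frac{W+1}{n_1+1}.

% Q is strictly concave on (0,1), so this stationary point is the maximizer.
```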

Lemma A.2. Assume the conditions of Lemma A.1 hold. Then, $\lambda^*=\bar\lambda+o_p(1)$.

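Proof. Let $\hat\lambda=\sum_{j=1}^{n_1}\bar w_j/n_1$. According to Lemma A.1, $\bar\alpha=o_p(1)$ and $\bar\beta=o_p(1)$. Applying a first-order Taylor expansion, we have

$$\hat\lambda=\frac{1}{n_1}\sum_{j=1}^{n_1}\frac{\bar\lambda\exp\left(\bar\alpha+\bar\beta y_j\right)}{1-\bar\lambda+\bar\lambda\exp\left(\bar\alpha+\bar\beta y_j\right)}=\bar\lambda+O_p(1)\left(\bar\alpha+\bar\beta\right)=\bar\lambda+o_p(1). \tag{A.10}$$

Some simple algebra shows that

$$\lambda^*-\hat\lambda=\frac{1-\hat\lambda}{n_1+1}=o_p(1). \tag{A.11}$$

Therefore, $\lambda^*=\bar\lambda+o_p(1)$, and this finishes the proof.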

Proof of Theorem 3.1. With the above two technical lemmas, the proof is the same as that of Theorem 1 in Li et al. [13] and is therefore omitted.

The next lemma is a technical preparation for proving Theorem 3.3. It investigates the asymptotic approximation of the penalized ELR for any 𝜆 bounded away from 0.

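Lemma A.3. Assume the conditions of Theorem 3.1 and $\bar\lambda\in[\epsilon,1]$ for some $\epsilon>0$. Then,

$$pR\left(\bar\lambda,\bar\alpha,\bar\beta\right)=n\rho(1-\rho)\sigma^{-2}\left(\bar y-\bar x\right)^2+2\log\bar\lambda+o_p(1). \tag{A.12}$$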

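Proof. With Lemma A.1, we have $\bar\alpha=O_p(n^{-1})$ and $\bar\beta=O_p(n^{-1/2})$. Applying a second-order Taylor expansion to $pR(\bar\lambda,\bar\alpha,\bar\beta)$ and noting that $\partial pR/\partial\alpha|_{(\alpha,\beta)=(0,0)}=0$, we have (again writing $\lambda$ for $\bar\lambda$ on the right-hand side)

$$pR\left(\bar\lambda,\bar\alpha,\bar\beta\right)=2\left\{-\bar\xi\sum_{h=1}^{n}t_h+\lambda\sum_{j=1}^{n_1}y_j\right\}\bar\beta-\left\{\bar\xi\left(1-\bar\xi\right)\sum_{h=1}^{n}t_h^2-\lambda(1-\lambda)\sum_{j=1}^{n_1}y_j^2\right\}\bar\beta^2+2\log\lambda+o_p(1). \tag{A.13}$$

Using (A.5) and the facts that both $\sum_{h=1}^{n}t_h^2/n$ and $\sum_{j=1}^{n_1}y_j^2/n_1$ converge to $\sigma^2$ in probability, the above expression can be simplified to

$$pR\left(\bar\lambda,\bar\alpha,\bar\beta\right)=2\frac{n_1n_0}{n}\lambda\left(\bar y-\bar x\right)\bar\beta-\frac{n_1n_0}{n}\lambda^2\sigma^2\bar\beta^2+2\log\lambda+o_p(1). \tag{A.14}$$

Plugging in the approximation $\bar\beta=(\bar y-\bar x)/(\lambda\sigma^2)+o_p(n^{-1/2})$, we get

$$pR\left(\bar\lambda,\bar\alpha,\bar\beta\right)=\frac{n_1n_0}{n}\frac{\left(\bar y-\bar x\right)^2}{\sigma^2}+2\log\lambda+o_p(1)=n\rho(1-\rho)\sigma^{-2}\left(\bar y-\bar x\right)^2+2\log\lambda+o_p(1). \tag{A.15}$$

This completes the proof.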

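Proof of Theorem 3.3. Without loss of generality, we assume $0<\lambda_1<\lambda_2<\cdots<\lambda_L=1$. According to Theorem 3.1 and Lemma A.3, for $l=1,\ldots,L$, we have

$$pR\left(\lambda_l^{(K)},\alpha_l^{(K)},\beta_l^{(K)}\right)=n\rho(1-\rho)\sigma^{-2}\left(\bar y-\bar x\right)^2+2\log\lambda_l+o_p(1). \tag{A.16}$$

Since $\log\lambda_l\le 0$ with equality at $\lambda_L=1$, this leads to

$$EM_n^{(K)}=\max_{1\le l\le L}pR\left(\lambda_l^{(K)},\alpha_l^{(K)},\beta_l^{(K)}\right)=n\rho(1-\rho)\sigma^{-2}\left(\bar y-\bar x\right)^2+o_p(1), \tag{A.17}$$

where the remainder is still $o_p(1)$ since the maximum is taken over a finite set.
Note that when $n$ tends to infinity, $\sqrt{n}\left(\bar y-\bar x\right)\to N\left(0,\sigma^2/[\rho(1-\rho)]\right)$ in distribution. Therefore,

$$EM_n^{(K)}\longrightarrow\chi_1^2 \tag{A.18}$$

in distribution as $n$ goes to infinity. This completes the proof.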
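
The leading term in (A.17) is simple to compute, which gives a quick sanity check for an implementation of the EM test under the null. The sketch below is illustrative only: the function name is hypothetical, and since the proof treats $\sigma^2$ as the true variance, the pooled sample variance used here is an assumed plug-in estimate.

```python
def asymptotic_em_approx(x, y):
    """Leading term n * rho * (1 - rho) * (ybar - xbar)^2 / sigma^2 of (A.17).

    sigma^2 is replaced by the pooled sample variance (an assumption of
    this sketch; the proof uses the true variance).
    """
    n0, n1 = len(x), len(y)
    n = n0 + n1
    rho = n1 / n
    x_bar = sum(x) / n0
    y_bar = sum(y) / n1
    pooled_mean = (sum(x) + sum(y)) / n
    sigma2 = sum((v - pooled_mean) ** 2 for v in list(x) + list(y)) / n
    return n * rho * (1.0 - rho) * (y_bar - x_bar) ** 2 / sigma2
```

Under $H_0$ and for large samples, this quantity and $EM_n^{(K)}$ should be close, and both are approximately $\chi_1^2$ distributed.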

References

  1. J. Qin and K. Y. Liang, “Hypothesis testing in a mixture case-control model,” Biometrics, vol. 67, pp. 182–193, 2011.
  2. J. Zhang, “Powerful two-sample tests based on the likelihood ratio,” Technometrics, vol. 48, no. 1, pp. 95–103, 2006.
  3. J. A. Anderson, “Multivariate logistic compounds,” Biometrika, vol. 66, no. 1, pp. 17–26, 1979.
  4. T. Lancaster and G. Imbens, “Case-control studies with contaminated controls,” Journal of Econometrics, vol. 71, no. 1-2, pp. 145–160, 1996.
  5. Y. Fu, J. Chen, and J. D. Kalbfleisch, “Modified likelihood ratio test for homogeneity in a two-sample problem,” Statistica Sinica, vol. 19, no. 4, pp. 1603–1619, 2009.
  6. A. B. Owen, “Empirical likelihood ratio confidence intervals for a single functional,” Biometrika, vol. 75, no. 2, pp. 237–249, 1988.
  7. A. B. Owen, “Empirical likelihood ratio confidence regions,” The Annals of Statistics, vol. 18, no. 1, pp. 90–120, 1990.
  8. P. Hall and B. La Scala, “Methodology and algorithms of empirical likelihood,” International Statistical Review, vol. 58, pp. 109–127, 1990.
  9. T. DiCiccio, P. Hall, and J. Romano, “Empirical likelihood is Bartlett-correctable,” The Annals of Statistics, vol. 19, no. 2, pp. 1053–1061, 1991.
  10. J. Qin and J. Lawless, “Empirical likelihood and general estimating equations,” The Annals of Statistics, vol. 22, no. 1, pp. 300–325, 1994.
  11. S. E. Ahmed, A. Hussein, and S. Nkurunziza, “Robust inference strategy in the presence of measurement error,” Statistics & Probability Letters, vol. 80, no. 7-8, pp. 726–732, 2010.
  12. J. Chen and P. Li, “Hypothesis test for normal mixture models: the EM approach,” The Annals of Statistics, vol. 37, no. 5, pp. 2523–2542, 2009.
  13. P. Li, J. Chen, and P. Marriott, “Non-finite Fisher information and homogeneity: an EM approach,” Biometrika, vol. 96, no. 2, pp. 411–426, 2009.
  14. J. Chen, “Penalized likelihood-ratio test for finite mixture models with multinomial observations,” The Canadian Journal of Statistics, vol. 26, no. 4, pp. 583–599, 1998.
  15. J. R. Weeks and R. J. Collins, “Primary addiction to morphine in rats,” Federation Proceedings, vol. 30, p. 277, 1971.
  16. D. D. Boos and C. Brownie, “Mixture models for continuous data in dose-response studies when some animals are unaffected by treatment,” Biometrics, vol. 47, pp. 1489–1504, 1991.

Copyright © 2012 Yukun Liu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

