Journal of Probability and Statistics
Volume 2012 (2012), Article ID 913560, 14 pages
http://dx.doi.org/10.1155/2012/913560
Research Article

Robust Semiparametric Optimal Testing Procedure for Multiple Normal Means

1Department of Statistics, Iowa State University, Ames, IA 50011, USA
2Department of Veterinary Diagnostic and Production Animal Medicine, Iowa State University, Ames, IA 50011, USA

Received 27 March 2012; Accepted 10 May 2012

Academic Editor: Yongzhao Shao

Copyright © 2012 Peng Liu and Chong Wang. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

In high-dimensional gene expression experiments such as microarray and RNA-seq experiments, the number of measured variables is huge while the number of replicates is small. As a consequence, hypothesis testing is challenging because the power of tests can be very low after controlling the multiple testing error. Optimal testing procedures that attain high average power while controlling the false discovery rate are preferred. Many methods have been constructed to achieve high power by borrowing information across genes. Some of these methods can be shown to achieve the optimal average power across genes, but only under a normality assumption for the alternative means. However, the assumption of a normal distribution is likely violated in practice. In this paper, we propose a novel semiparametric optimal testing (SPOT) procedure for high-dimensional data with small sample size. Our procedure is more robust because it does not depend on any parametric assumption for the alternative means. We show that the proposed test achieves the maximum average power asymptotically as the number of tests goes to infinity. Both a simulation study and the analysis of a real microarray dataset with spike-in probes show that the proposed SPOT procedure performs better than other popularly applied procedures.

1. Introduction

The problem of statistically testing mean difference for each of thousands of variables is commonly encountered in genomic studies. For example, the popularly applied microarray technology allows the gene expression study of tens of thousands of genes simultaneously. The recent advance of next-generation sequencing technology allows the measurement of gene expression in an even higher dimension. These high-throughput technologies have revolutionized the way genomic studies progress and provided rich data to explore. However, these experiments are expensive, and as a consequence, such experiments typically involve only a few samples for each treatment group. This results in the “large 𝑝, small 𝑛” problem for hypothesis testing, and the power of the statistical tests can be very low after controlling the multiple testing error, such as the false discovery rate (FDR).

The normalized signal intensities from microarray experiments are generally assumed to follow normal distributions [1–4]. The recently emerging next-generation sequencing data may also be modeled approximately using normal distributions when the number of reads is large or under certain transformations [5]. Thus the multiple testing problem for normal means has wide applications in genetic and genomic studies, and it is also a general statistical question of interest.

Several testing procedures have been proposed in the context of microarray studies, including the SAM test [6], Efron's 𝑡-test [7], the regularized 𝑡-test [8], the 𝐵-statistic [1] and its multivariate counterpart, the 𝑀𝐵-statistic [9], the test of Wright and Simon [10], the moderated 𝑡-test [2], the 𝐹𝑆 test [3] and the similar test of Tong and Wang [11], the 𝐹SS test [4], and the LEMMA test [12]. Although numerous procedures have been proposed, very few can be justified to achieve the optimal power. Among these procedures, Hwang and Liu [4] proposed a framework within which an optimal testing procedure can be derived. They also proposed a test with maximum average power (the MAP test) and an approximated version, the 𝐹SS test. Here optimality is defined in terms of maximizing the power averaged across all tests for which the null hypotheses are false while controlling FDR. This method provides a theoretical guide for developing optimal multiple testing procedures. The popularly applied moderated 𝑡-statistic developed by Smyth [2] can also be shown to achieve optimal power asymptotically, under distributional assumptions different from those of the 𝐹SS test. Both the moderated 𝑡-statistic and the 𝐹SS test assume that the mean expression levels (or the means of interesting contrasts) of all genes follow a normal distribution, although the parameters of this distribution differ between the two tests. Yet in practice this distribution depends on the population of genes selected in a particular study and often does not follow the prespecified parametric form. This raises concerns about the robustness of the moderated 𝑡 and 𝐹SS tests.

The objective of this paper is to develop an optimal and robust multiple testing procedure without any distributional assumption on the mean. As in Hwang and Liu [4], optimality is defined in terms of maximizing the power averaged across all tests for which the null hypotheses are false while controlling FDR. We develop a semiparametric optimal testing procedure, which we abbreviate as the SPOT procedure. The distribution of the mean expression across genes is not assumed to follow a parametric model, which makes our method robust to violations of the normality assumption. We find that the SPOT procedure works very well in simulation studies and in an analysis of real microarray data with spike-in probes.

The remainder of this paper is organized as follows. We first introduce the necessary notation in Section 2. Then, in Section 3, we describe the general concepts of optimal testing procedures. We propose our semiparametric optimal testing (SPOT) procedure in Section 4 and describe its implementation in Section 5. Section 6 presents simulation studies, Section 7 shows the analysis results for a real microarray dataset, and Section 8 concludes with a summary.

2. Notations

An appropriate linear model is typically fitted for each gene based on the design of a microarray experiment. Section 2 of Smyth [2] provides a nice description of this topic. Given the linear model, suppose that we have a contrast of interest to test for each gene. This contrast may be the difference between the means of two treatment groups or a linear combination of means from several treatment groups. For simplicity of description, we call genes whose contrast means are not zero differentially expressed (DE) genes and genes whose contrast means equal zero equivalently expressed (EE) genes. After fitting the linear model for each gene, we obtain an estimate of the contrast for each gene, $X_g$, as well as an estimate of the sample residual variance, $s^2_g$. For each $g = 1, \dots, G$, $X_g$ and $s^2_g$ are related to the true parameters, $\mu_g$ and $\sigma^2_g$, by
\[
X_g \mid \mu_g, \sigma^2_g \sim N(\mu_g, \nu_g \sigma^2_g), \qquad s^2_g \mid \sigma^2_g \sim \frac{\sigma^2_g}{d_g}\,\chi^2_{d_g},
\]
where $\mu_g$ is the contrast mean for gene $g$, $\sigma^2_g$ is the true residual variance for gene $g$, and the coefficients $\nu_g$ and $d_g$ are determined by the design of the experiment. Two examples are given as follows.

Example 2.1. Two-channel microarray experiment to compare two treatments. Assume that each sample from treatment A is paired randomly with a sample from treatment B and each pair of samples is cohybridized onto one slide. After normalization and appropriate transformation, the difference of normalized expression measurements between the two samples on each slide is analyzed for each gene. Hence, this is a paired-sample case and the number of data points for each gene is $n$, the number of slides. We are interested in identifying DE genes. In this case, $X_g$ is the mean difference of the paired samples for gene $g$, and $s^2_g$ is the sample variance for gene $g$. So $\nu_g = 1/n$ and $d_g = n - 1$.

Example 2.2. Affymetrix microarray experiment with two independent samples. Assume sample sizes are $n_1$ and $n_2$ for treatment A and treatment B, respectively. The statistic $X_g$ is the difference in sample means of normalized expression measurements between the two groups for gene $g$, and $s^2_g$ is the pooled sample variance. Then $\nu_g = 1/n_1 + 1/n_2$ and $d_g = n_1 + n_2 - 2$.
Given the data $X_g$ and $s^2_g$, an ordinary $t$-test with statistic $t_g = X_g / (\sqrt{\nu_g}\, s_g)$ may be used to test the null hypothesis $H_{0g}: \mu_g = 0$. However, the power of such tests is low after controlling the multiple testing error. So statistical methods with higher power are in demand for high-dimensional testing problems such as those encountered in gene expression studies.
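As a concrete illustration of this notation, the quantities $X_g$, $s^2_g$, $\nu_g$, and $d_g$ and the ordinary $t$-statistic for the two-independent-sample setting of Example 2.2 can be computed as in the following minimal sketch (the function name and the genes-by-replicates array layout are our own choices, not from the paper):

```python
import numpy as np

def contrast_stats(a, b):
    """Per-gene contrast estimate X_g, pooled residual variance s2_g,
    design constants nu_g and d_g, and the ordinary t-statistic for a
    two-independent-sample comparison (Example 2.2).
    a, b: (genes x replicates) arrays of normalized expression."""
    n1, n2 = a.shape[1], b.shape[1]
    x = a.mean(axis=1) - b.mean(axis=1)                      # X_g
    s2 = ((n1 - 1) * a.var(axis=1, ddof=1) +
          (n2 - 1) * b.var(axis=1, ddof=1)) / (n1 + n2 - 2)  # pooled s2_g
    nu = 1.0 / n1 + 1.0 / n2                                 # nu_g
    d = n1 + n2 - 2                                          # d_g
    t = x / np.sqrt(nu * s2)                                 # ordinary t_g
    return x, s2, nu, d, t
```

The same return values $(X_g, s^2_g, \nu_g, d_g)$ are the inputs to all the test statistics compared later in the paper.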

3. Optimal Testing Procedures

In the analysis of high-dimensional gene expression data such as microarray data, we are more interested in the average behavior of the tests across all genes than in the performance of any individual test. Because the number of tests is huge, multiple testing errors should be controlled to avoid too many type I errors. Controlling FDR is an important and widely used approach in genomic studies. Although many testing procedures have been developed, as reviewed in Section 1, the paper by Hwang and Liu [4] provides theoretical guidance on deriving optimal testing procedures within an empirical Bayes framework. The optimal tests are defined to be the ones that maximize the power averaged across all genes for which the null hypotheses are false while controlling FDR. Such optimal tests have been called MAP tests, where MAP stands for maximum average power [4].

In a Bayesian framework, we assume that model parameters such as $\mu_g$ and $\sigma^2_g$ follow some distributions. The residual variances $\sigma^2_g$ have been modeled by prior distributions such as the inverse gamma [2, 10] or the log-normal [4], independently of whether the null hypothesis is true or false. For EE genes, the contrast mean $\mu_g$ is equal to 0. For DE genes, $\mu_g$ is nonzero almost surely. Denote the alternative distribution of $\mu_g$ by $\pi_1(\cdot)$. By the Neyman-Pearson fundamental lemma, for a randomly selected gene $g$, the most powerful test statistic for testing $H_{0g}: \mu_g = 0$ versus $H_{1g}: \mu_g \sim \pi_1(\mu_g)$ is
\[
T^{\mathrm{NP}}_g = \frac{\iint f(X_g, s^2_g \mid \mu_g, \sigma^2_g)\, \pi_1(\mu_g)\, \pi(\sigma^2_g)\, d\mu_g\, d\sigma^2_g}{\int f(X_g, s^2_g \mid \mu_g = 0, \sigma^2_g)\, \pi(\sigma^2_g)\, d\sigma^2_g}, \tag{3.1}
\]
where $\pi(\cdot)$ denotes the prior distribution of $\sigma^2_g$, and the test rejects the null hypothesis $H_{0g}$ when $T^{\mathrm{NP}}_g$ is large. The simultaneous testing procedure in which all genes are tested using the most powerful statistics $T^{\mathrm{NP}}_g$, $g = 1, 2, \dots, G$, achieves the highest average power while controlling FDR, as proved in Hwang and Liu [4].

One popular multiple-testing method for microarray data is the moderated 𝑡-test proposed by Smyth [2], in which the residual variance $\sigma^2_g$ is modeled with the prior distribution
\[
\frac{1}{\sigma^2_g} \sim \frac{1}{d_0 s^2_0}\, \chi^2_{d_0}, \tag{3.2}
\]
where $\chi^2_{d_0}$ denotes a chi-square distribution with $d_0$ degrees of freedom and $s^2_0$ is another hyperparameter. This prior distribution is equivalent to an inverse-gamma distribution and has been shown to fit real data well. Compared to the standard $t$-statistic $t_g = x_g / (\sqrt{\nu_g}\, s_g)$, Smyth's moderated $t$-statistic takes the form
\[
\tilde{t}_g = \frac{x_g}{\sqrt{\nu_g}\, \tilde{s}_g}, \tag{3.3}
\]
where
\[
\tilde{s}^2_g = \frac{s^2_g d_g + s^2_0 d_0}{d_g + d_0} \tag{3.4}
\]
is a shrinkage estimator of $\sigma^2_g$ obtained by shrinking $s^2_g$ toward $s^2_0$.
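For instance, given values of the hyperparameters $d_0$ and $s^2_0$ (which in practice are estimated from the data, as discussed next), the shrinkage estimator (3.4) and the moderated $t$-statistic (3.3) amount to the following minimal sketch:

```python
import numpy as np

def moderated_t(x, s2, nu, d, d0, s0_sq):
    """Moderated t-statistic (3.3): replace s2_g by the shrinkage
    estimator (3.4), which pulls s2_g toward the prior value s0_sq
    with weight d0. Hyperparameters d0 and s0_sq are passed in."""
    s2_tilde = (d * s2 + d0 * s0_sq) / (d + d0)  # (3.4)
    return x / np.sqrt(nu * s2_tilde)            # (3.3)
```

Setting `d0 = 0` recovers the ordinary $t$-statistic, while letting `d0` grow forces every gene's variance estimate toward the common value `s0_sq`.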

In practice, the unknown hyperparameters $d_0$ and $s^2_0$ for the distribution of the variance $\sigma^2_g$ can be estimated consistently by the method of moments, that is, by equating the empirical and expected first two moments of $\log s^2_g$ [2]. Smyth [2] showed that the moderated 𝑡-test is equivalent to the 𝐵-statistic proposed in Lönnstedt and Speed [1], which was derived as the posterior odds under the assumption that the distribution of $\mu_g$ under the alternative hypothesis is $N(0, \nu_0 \sigma^2_g)$, where $\nu_0$ is a constant. In fact, we can prove that the moderated 𝑡-test achieves the optimal average power asymptotically under their assumptions on $\mu_g$ and $\sigma^2_g$. The proof is in the appendix.

However, note that the assumption that the $\mu_g$ of DE genes follow a normal distribution with mean zero is restrictive. For some studies there may be, for example, more upregulated genes than downregulated genes, which suggests that the mean of $\mu_g$ should be positive. Hwang and Liu [4] have proposed a more general normal prior distribution of $\mu_g$ for DE genes:
\[
\pi_1(\mu_g) = N(\theta, \tau^2_g), \tag{3.5}
\]
where the mean of this distribution is not necessarily zero but is estimated from the data. In addition, the variance of this distribution does not depend on the residual variance. Under this model, they derived an optimal test and an approximated version of the test statistic (the 𝐹SS test) that is computationally faster. The 𝐹SS statistic shrinks both the estimate of the mean $\mu_g$ and the estimate of the variance $\sigma^2_g$.

Both the moderated 𝑡-test and the 𝐹SS test have been shown to achieve optimal power asymptotically under the assumption of normal distribution for the alternative means. Simulation studies also confirm that the power of the tests is superior under the model assumptions. However, a single normal distribution assumption on 𝜇𝑔 for DE genes may not be appropriate for all cases and the distribution of 𝜋1(𝜇𝑔) may consist of a mixture of different subgroup distributions, for example, a mixture of two normal distributions with one having a negative mean and the other having a positive mean. If the parametric distributional assumptions of 𝜋1(𝜇𝑔) are violated, the power of an optimal test built under those assumptions will suffer.

4. Semiparametric Optimal Testing (SPOT) Procedure

To obtain a more robust procedure, we propose to model the distribution of the mean 𝜇𝑔 nonparametrically while still deriving the optimal procedure. For the variance 𝜎2𝑔, the inverse gamma distributional assumption is reasonable and works well in practice, so we still keep this assumption. Hence, we will derive a semiparametric optimal testing procedure that we call the SPOT procedure.

Note that the numerator and denominator of the most powerful test statistic (3.1) are the joint marginal distributions of $(X_g, s^2_g)$ under the alternative and null hypotheses, respectively. Denoting the marginal distributions by
\[
m_1(X_g, s^2_g) = \iint f(X_g, s^2_g \mid \mu_g, \sigma^2_g)\, \pi_1(\mu_g)\, \pi(\sigma^2_g)\, d\mu_g\, d\sigma^2_g, \qquad
m_0(X_g, s^2_g) = \int f(X_g, s^2_g \mid \mu_g = 0, \sigma^2_g)\, \pi(\sigma^2_g)\, d\sigma^2_g, \tag{4.1}
\]
statistic (3.1) becomes
\[
T^{\mathrm{NP}}_g = \frac{m_1(X_g, s^2_g)}{m_0(X_g, s^2_g)}. \tag{4.2}
\]
The null marginal distribution $m_0(X_g, s^2_g)$ only involves integration with respect to the variance $\sigma^2_g$. With consistent estimators of the hyperparameters as proposed in Smyth [2], we can estimate $m_0(X_g, s^2_g)$ consistently. For the alternative marginal distribution $m_1(X_g, s^2_g)$, it is hard to find a consistent estimator without any distributional assumption on $\mu_g$. If we knew which genes were DE, we could estimate $m_1(X_g, s^2_g)$ nonparametrically from the observed values of $(X_g, s^2_g)$ for the DE gene population. Many nonparametric density estimators are consistent, for example, histogram estimators and kernel density estimators with proper choices of bandwidths [13]. However, knowledge of differential expression is the research question of the study and of course is not available for all genes. Considering all genes together, without separating those that are differentially expressed from those that are not, we have a mixture of differentially expressed and nondifferentially expressed genes. The mixture marginal density, denoted by $m_m(X_g, s^2_g)$, can be estimated consistently by nonparametric density estimators applied to the observed $(X_g, s^2_g)$ for all genes $g = 1, \dots, G$. Can this consistent estimator of $m_m(X_g, s^2_g)$, together with a consistent estimator of $m_0(X_g, s^2_g)$, help us construct a most powerful test statistic?

Suppose that $p_0$ and $p_1$ are the proportions of EE and DE genes, respectively, with $0 \le p_0, p_1 \le 1$ and $p_0 + p_1 = 1$. Then the mixture marginal density is
\[
m_m(X_g, s^2_g) = p_0\, m_0(X_g, s^2_g) + p_1\, m_1(X_g, s^2_g). \tag{4.3}
\]

The ratio of the mixture marginal density $m_m(X_g, s^2_g)$ to the null marginal density $m_0(X_g, s^2_g)$ is a monotonic function of the statistic $T^{\mathrm{NP}}_g$ in (4.2) because
\[
\frac{m_m(X_g, s^2_g)}{m_0(X_g, s^2_g)} = \frac{p_0\, m_0(X_g, s^2_g) + p_1\, m_1(X_g, s^2_g)}{m_0(X_g, s^2_g)} = p_0 + p_1\, \frac{m_1(X_g, s^2_g)}{m_0(X_g, s^2_g)}. \tag{4.4}
\]
Thus the test that rejects the null hypothesis when $m_m(X_g, s^2_g)/m_0(X_g, s^2_g)$ is large is also a most powerful test. Note that to calculate this statistic, we only need to estimate $m_m(X_g, s^2_g)$ and $m_0(X_g, s^2_g)$; we do not have to estimate the proportions $p_0$ and $p_1$.
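Because (4.4) shows the mixture ratio equals $p_0 + p_1\, m_1/m_0$, ranking genes by $m_m/m_0$ is equivalent to ranking them by the unobservable Neyman-Pearson ratio $m_1/m_0$. A small numerical sketch (with made-up density values, not real data) confirms this monotonicity:

```python
import numpy as np

def spot_statistic(m_m_hat, m_0_hat):
    """SPOT statistic: ratio of the estimated mixture marginal density
    to the estimated null marginal density at each gene's (X_g, s2_g).
    Large values lead to rejection of the null."""
    return m_m_hat / m_0_hat

# Monotonicity behind (4.4): with p0 + p1 = 1, the mixture ratio
# p0 + p1 * (m1/m0) orders genes exactly as m1/m0 does.
p0, p1 = 0.8, 0.2
np_ratio = np.array([0.1, 5.0, 2.0, 9.0])            # hypothetical m1/m0 values
mix_ratio = spot_statistic(p0 + p1 * np_ratio, 1.0)  # m_m/m_0 with m0 set to 1
assert (np.argsort(mix_ratio) == np.argsort(np_ratio)).all()
```

This is why the unknown proportions $p_0$ and $p_1$ never need to be estimated: they only shift and scale the ranking statistic.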

Let $\hat{m}_m(X_g, s^2_g)$ denote any consistent density estimator of $m_m(X_g, s^2_g)$, and let $\hat{m}_0(X_g, s^2_g)$ denote any consistent estimator of $m_0(X_g, s^2_g)$, such that
\[
\hat{m}_m(X_g, s^2_g) \xrightarrow{P} m_m(X_g, s^2_g), \qquad \hat{m}_0(X_g, s^2_g) \xrightarrow{P} m_0(X_g, s^2_g), \quad \text{as } G \to \infty, \tag{4.5}
\]
where $\xrightarrow{P}$ denotes convergence in probability. Then the statistic $\hat{m}_m(X_g, s^2_g)/\hat{m}_0(X_g, s^2_g)$ asymptotically attains the optimal testing power. Notice that the convergence is with respect to $G$, which is usually huge in microarray and RNA-seq studies.

We have already discussed, in Section 3, the availability of a parametric consistent estimator of $m_0(X_g, s^2_g)$ through estimating the hyperparameters $d_0$ and $s^2_0$ of the distribution of $\sigma^2_g$. For $m_m(X_g, s^2_g)$, any theoretically consistent density estimator $\hat{m}_m(X_g, s^2_g)$ of the joint data $(X_g, s^2_g)$ can be used to construct a test statistic of the form (4.4) with asymptotically optimal average power. For example, nonparametric estimators such as histograms, kernel density estimates, and local polynomial estimators can all be utilized. As our test statistic $\hat{m}_m(X_g, s^2_g)/\hat{m}_0(X_g, s^2_g)$ involves both parametric and nonparametric parts, we name it the semiparametric optimal test (SPOT).

5. Implementation of SPOT

In this section, we discuss the details of implementing the proposed SPOT procedure.

5.1. Estimation of 𝑚0(𝑋𝑔,𝑠2𝑔)

The null marginal density is
\[
\begin{aligned}
m_0(X_g, s^2_g) &= \int f(X_g, s^2_g \mid \mu_g = 0, \sigma^2_g)\, \pi(\sigma^2_g)\, d\sigma^2_g \\
&= \int \frac{e^{-x^2_g/(2\nu_g \sigma^2_g)}}{(2\pi \nu_g \sigma^2_g)^{1/2}}
\left(\frac{d_g}{2\sigma^2_g}\right)^{d_g/2} \frac{s_g^{2(d_g/2-1)}\, e^{-d_g s^2_g/(2\sigma^2_g)}}{\Gamma(d_g/2)}
\left(\frac{d_0 s^2_0}{2}\right)^{d_0/2} \frac{\sigma_g^{-2(d_0/2+1)}\, e^{-d_0 s^2_0/(2\sigma^2_g)}}{\Gamma(d_0/2)}\, d\sigma^2_g \\
&= C_2\, s_g^{2(d_g/2-1)} \left(\frac{x^2_g}{\nu_g} + d_0 s^2_0 + d_g s^2_g\right)^{-(1+d_0+d_g)/2},
\end{aligned} \tag{5.1}
\]
where $C_2$ is a constant. As in Smyth [2] and Hwang and Liu [4], we assume that the distribution of $\sigma^2_g$ does not depend on whether a gene is DE or EE. Then all genes are used to estimate the parameters $d_0$ and $s^2_0$. We apply the method of moments proposed in Smyth [2] to obtain estimates of $d_0$ and $s^2_0$. Replacing the unknown parameters $d_0$ and $s^2_0$ in $m_0(X_g, s^2_g)$ by their consistent method-of-moments estimates leads to a consistent estimator $\hat{m}_0(X_g, s^2_g)$ of $m_0(X_g, s^2_g)$.
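Since only ratios of densities matter for the SPOT statistic, formula (5.1) can be evaluated with the constant $C_2$ dropped; this sketch assumes the method-of-moments estimates of $d_0$ and $s^2_0$ are supplied:

```python
import numpy as np

def m0_hat(x, s2, nu, d, d0_hat, s0_sq_hat):
    """Null marginal density (5.1) up to the constant C2, with d0 and
    s0^2 replaced by their method-of-moments estimates. x and s2 are
    per-gene arrays; nu, d, d0_hat, s0_sq_hat are scalars."""
    return (s2 ** (d / 2.0 - 1.0) *
            (x ** 2 / nu + d0_hat * s0_sq_hat + d * s2)
            ** (-(1.0 + d0_hat + d) / 2.0))
```

As expected of a null density for a mean-zero contrast, the result is symmetric in $x_g$ and decreases as $|x_g|$ grows.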

5.2. A Hybrid Method for Estimation of 𝑚𝑚(𝑋𝑔,𝑠2𝑔)

Although any consistent estimator $\hat{m}_m(X_g, s^2_g)$ can be used to construct a SPOT statistic of the form $\hat{m}_m(X_g, s^2_g)/\hat{m}_0(X_g, s^2_g)$, in practice a density estimator that converges fast is always preferred. It is known that the accuracy of density estimators deteriorates quickly as the dimension increases [13]. We tried several two-dimensional density estimators for $m_m(X_g, s^2_g)$, including kernel estimators. Due to the curse of dimensionality, direct two-dimensional density estimators do not perform as satisfactorily as a hybrid estimator that we developed and suggest using. This hybrid estimator has a component similar to a kernel estimator, but it also utilizes the prior information on the variances $\sigma^2_g$ to help improve accuracy.

In constructing this estimator, we first estimate the marginal density of $X_g$ by the usual kernel density estimate
\[
\hat{f}(x_g) = \frac{1}{G} \sum_{i=1}^{G} \frac{1}{h}\, K\!\left(\frac{x_g - x_i}{h}\right), \tag{5.2}
\]
where $h$ is a positive value known as the bandwidth. We estimate the conditional density $f(s^2_g \mid x_g)$ by using
\[
f(s^2_g \mid x_g) = \int f(s^2_g \mid \sigma^2_g, x_g)\, f(\sigma^2_g \mid x_g)\, d\sigma^2_g = \int f(s^2_g \mid \sigma^2_g)\, f(\sigma^2_g \mid x_g)\, d\sigma^2_g, \tag{5.3}
\]
where the second equality results from the independence between $s^2_g$ and $x_g$ given the parameter $\sigma^2_g$. The distribution of $s^2_g \mid \sigma^2_g$ is $(\sigma^2_g/d_g)\chi^2_{d_g}$ for normally distributed observations. Now we need to estimate $f(\sigma^2_g \mid x_g)$. Denote the set of genes that lie within bandwidth distance of gene $g$ by $A_g = \{i : |x_i - x_g| < h\}$. We estimate $f(\sigma^2 \mid x_g)$ by the following approximation based on the neighborhood $A_g$ of $x_g$:
\[
\hat{f}(\sigma^2 \mid x_g) = \frac{1}{\#\{A_g\}} \sum_{i \in A_g} \frac{f(s^2_i \mid \sigma^2)\, \pi(\sigma^2)}{\int f(s^2_i \mid \sigma^2)\, \pi(\sigma^2)\, d\sigma^2}. \tag{5.4}
\]

Here $\#\{A_g\}$ in formula (5.4) denotes the number of genes in the set $A_g$. Substituting the exact parametric forms of $f(s^2_g \mid \sigma^2)$ and $\pi(\sigma^2)$ into the formulas above leads to the explicit form
\[
\hat{f}(s^2_g \mid x_g) = C_3\, \frac{1}{\#\{A_g\}} \sum_{i \in A_g} s_g^{2(d_g/2-1)} \left(d_g s^2_i + d_0 s^2_0 + d_g s^2_g\right)^{-(d_0+d_g)/2}, \tag{5.5}
\]
where $C_3$ is a constant. The product of the kernel estimate $\hat{f}(x_g)$ and the conditional estimate $\hat{f}(s^2_g \mid x_g)$ provides a joint density estimator of the mixture:
\[
\hat{m}_m(X_g, s^2_g) = \hat{f}(x_g)\, \hat{f}(s^2_g \mid x_g). \tag{5.6}
\]
With this approximation, we cannot theoretically show that the resulting estimator $\hat{m}_m(X_g, s^2_g)$ is consistent, but in practice it works better than the consistent kernel density estimator of the joint density $m_m(X_g, s^2_g)$.
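Putting (5.2), (5.5), and (5.6) together, one possible implementation of the hybrid estimator looks like the following. The Gaussian kernel and the single bandwidth `h` shared between (5.2) and the neighborhood $A_g$ are our illustrative choices, and the constant $C_3$ is dropped since only density ratios are used:

```python
import numpy as np

def hybrid_mm_hat(x, s2, d, d0, s0_sq, h):
    """Hybrid estimator (5.6) of the mixture marginal m_m(X_g, s2_g),
    up to a constant: a Gaussian-kernel estimate of f(x_g) from (5.2)
    times the neighborhood-based estimate (5.5) of f(s2_g | x_g)."""
    G = len(x)
    mm = np.empty(G)
    for g in range(G):
        u = (x[g] - x) / h
        f_x = np.mean(np.exp(-0.5 * u * u) / np.sqrt(2.0 * np.pi)) / h  # (5.2)
        nbr = np.abs(x - x[g]) < h                                      # A_g
        f_s2 = np.mean(                                                 # (5.5), C3 dropped
            s2[g] ** (d / 2.0 - 1.0)
            * (d * s2[nbr] + d0 * s0_sq + d * s2[g]) ** (-(d0 + d) / 2.0))
        mm[g] = f_x * f_s2                                              # (5.6)
    return mm
```

The SPOT statistic is then the elementwise ratio of this estimate to the parametric null marginal estimate.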

6. Simulation Study

In order to evaluate the performance of our proposed SPOT procedure, we performed three simulation studies. The gene expression data were simulated from Normal$(\mu_{gi}, \sigma^2_g)$ for observations of gene $g$ in treatment group $i$. The way $\mu_{gi}$ and $\sigma^2_g$ were sampled differs across the simulation studies. We assume that there are two treatment groups and 3 replicates per treatment group. For each simulation setting, one hundred sets of gene expression data were independently simulated, and each dataset included 10,000 genes. The performances of the SPOT, moderated 𝑡, 𝐹SS, and ordinary 𝑡-test statistics were evaluated by comparing their behavior averaged across the 100 datasets.

6.1. Simulation Study I

In the first simulation study, we have two settings that differ in the number of DE genes. For the first setting, $G_1 = 2{,}500$ genes are DE whereas the other $G_0 = 7{,}500$ are EE. In the second setting, only $G_1 = 1{,}800$ are DE while the other $G_0 = 8{,}200$ are EE. Gene expression means $\mu_{gi}$ and variances $\sigma^2_g$ were simulated as follows:
\[
\begin{aligned}
\mu_{g1} &= 0 \quad \forall g; \\
\mu_{g2} &\sim \text{Normal}(0.5, 0.3^2) \quad \text{for } g = 1 \text{ to } 0.3 G_1; \\
\mu_{g2} &\sim \text{Normal}(1, 0.3^2) \quad \text{for } g = 0.3 G_1 + 1 \text{ to } 0.9 G_1; \\
\mu_{g2} &\sim t_1(0.5) \quad \text{for } g = 0.9 G_1 + 1 \text{ to } G_1; \\
\mu_{g2} &= 0 \quad \text{for } g = G_1 + 1 \text{ to } 10{,}000; \\
\sigma^2_g &\sim \text{Gamma}(2, 4) \quad \forall g.
\end{aligned} \tag{6.1}
\]
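One replicate dataset under setting 1 can be generated as follows. Two details are our reading of the notation in (6.1) and should be treated as assumptions: $t_1(0.5)$ is taken as a $t$-distribution with 1 degree of freedom shifted to be centered at 0.5, and Gamma(2, 4) is read as shape 2 and scale 4.

```python
import numpy as np

rng = np.random.default_rng(0)
G, G1, n = 10_000, 2_500, 3              # setting 1: 2,500 DE genes, 3 reps/group

mu1 = np.zeros(G)                        # mu_g1 = 0 for all g
mu2 = np.zeros(G)                        # mu_g2 = 0 for EE genes (g > G1)
mu2[: int(0.3 * G1)] = rng.normal(0.5, 0.3, int(0.3 * G1))
mu2[int(0.3 * G1): int(0.9 * G1)] = rng.normal(1.0, 0.3, int(0.6 * G1))
# t_1(0.5): assumed to mean a t distribution with 1 df centered at 0.5
mu2[int(0.9 * G1): G1] = 0.5 + rng.standard_t(1, int(0.1 * G1))
sigma2 = rng.gamma(2.0, 4.0, G)          # Gamma(2, 4), read as shape=2, scale=4

sd = np.sqrt(sigma2)[:, None]
y1 = rng.normal(mu1[:, None], sd, (G, n))  # group 1 expression values
y2 = rng.normal(mu2[:, None], sd, (G, n))  # group 2 expression values
```

Each of the one hundred simulated datasets in the study would be one independent draw of this kind.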

For each simulated dataset, the SPOT, moderated 𝑡, 𝐹SS, and ordinary 𝑡-test statistics were calculated and evaluated using the number of true positives selected at various FDR levels. Plots of the number of true positives versus FDR for the four statistics are shown in Figure 1. Simulation settings 1 and 2 generated similar results. The ordinary 𝑡-test is the poorest method under comparison. The moderated 𝑡-test is considerably better than the ordinary 𝑡-test, although it is worse than the 𝐹SS test. Our proposed SPOT test is superior to the other three methods, yielding the largest number of true positives at the same FDR levels.
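The evaluation used throughout the simulation studies, counting true positives at a given realized FDR, can be sketched as follows. The function name and the exact cutoff rule (the largest rejection list whose realized false discovery proportion stays at or below the target) are our own choices; the paper does not spell out this detail:

```python
import numpy as np

def tp_at_fdr(stat, is_de, fdr_levels):
    """True positives when genes are ranked by `stat` (larger = more
    significant) and the list is cut at the largest size whose realized
    false discovery proportion does not exceed each target level.
    `is_de` is the known 0/1 truth available in simulation."""
    order = np.argsort(-stat)
    tp = np.cumsum(is_de[order])              # true positives per list size
    n_sel = np.arange(1, len(stat) + 1)
    fdp = (n_sel - tp) / n_sel                # realized FDP per list size
    return [int(tp[fdp <= lev].max()) if (fdp <= lev).any() else 0
            for lev in fdr_levels]
```

Evaluating this over a grid of FDR levels produces curves like those in Figures 1 to 4.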

Figure 1: Simulation study I: plots of number of true positives (# TP) versus false discovery rate (FDR) from analyses using SPOT, moderated t, and 𝐹SS methods.
6.2. Simulation Study II

To check how the variance distribution affects the relative ranking of the SPOT procedure, we conducted another simulation study, identical to setting 1 of simulation study I except that the variances were simulated from a log-normal distribution, the assumption under which the 𝐹SS test was derived. As Figure 2 shows, the results are similar to those from simulation study I: the SPOT procedure still performs much better than the other three methods.

Figure 2: Simulation study II: plots of number of true positives (# TP) versus false discovery rate (FDR) from analyses using SPOT, moderated t, and 𝐹SS methods.
6.3. Simulation Study III

Typically, a parametric test achieves higher power than a nonparametric test when the parametric assumption is appropriate. To check the robustness of the SPOT procedure, we simulated data under the parametric assumptions for both $\mu_{gi}$ and $\sigma^2_g$ under which the 𝐹SS test was derived. Specifically, for the 2,500 differentially expressed genes, $\mu_{gi}$ were drawn from a normal distribution with mean 1.2 and standard deviation 0.3, and $\sigma^2_g$ were sampled from a log-normal distribution with parameters −0.96 and 0.8. Figure 3 shows that the SPOT procedure and the 𝐹SS test are comparable when FDR is small (less than 0.05), and both are much better than the moderated 𝑡-test and the ordinary 𝑡-test. When FDR is between 0.05 and 0.15, the 𝐹SS test is the best while the SPOT procedure is the next best, still much better than the moderated 𝑡-test and the ordinary 𝑡-test.

Figure 3: Simulation study III: plots of number of true positives (# TP) versus false discovery rate (FDR) from analyses using SPOT, moderated t, and 𝐹SS methods.

7. Evaluation Using the Golden Spike Microarray Data

In this section, we compare the performances of different methods using a real microarray dataset generated with Affymetrix GeneChips in the Golden Spike Project. The Golden Spike Project generated microarray datasets comparing two replicated groups in which the relative concentrations of a large number of genes are known. The two groups are the spike-in group and the control group, each with three chips. Data and information related to this project are available through the website http://www2.ccr.buffalo.edu/halfon/spike/. More specifically, the Golden Spike dataset included 1309 individual cRNAs "spiked in" at known relative concentrations between the two groups. The fold changes between the spike-in and control groups were assigned at different levels for different cRNAs, ranging from 1.2 to 4. Hence, these cRNAs were truly "differentially expressed" between groups, and we consider them DE genes. In addition, a background sample of 2551 RNA species was present at identical concentrations in both samples, so these 2551 RNA species were not differentially expressed between the two groups. With the true differential expression status known, this real microarray dataset provides an ideal case for evaluating the performances of different methods without imposing any distributional assumption on variances and means, as is usually done in simulation studies.

With the summary dataset downloaded from the Golden Spike Project website, we calculated the SPOT, moderated 𝑡, ordinary 𝑡, and 𝐹SS statistics and evaluated their performances using the true statuses of the RNAs based on the design. Figure 4 plots the number of true positives versus FDR for the four procedures over the FDR range [0, 0.15], which is of most practical interest. The SPOT procedure improves over the other three methods throughout this whole range of FDR, and the improvement is substantial at lower FDR levels. For example, at the 1% FDR level the SPOT procedure detects 754 true positives while the moderated 𝑡-test detects only 466 and the 𝐹SS test only 442 (Table 1).

Table 1: Golden Spike data: number of true positives selected by three testing procedures at critical FDR levels.
Figure 4: Golden spike data: plots of number of true positive (TP) genes versus false discovery rate (FDR) from analysis using SPOT, moderated 𝑡,  and 𝐹SS tests.

8. Summary

In this paper, we have derived a semiparametric optimal testing (SPOT) procedure for high-dimensional gene expression data analysis. Although the method is illustrated with microarray data, it can be applied to any high-dimensional testing problem with a normal model. Our test statistic is shown to be asymptotically most powerful, without any assumption on the mean parameter of differential expression. The asymptotic property is derived as the number of genes grows large, which is reasonable for high-dimensional gene expression studies. We also provided an approximate version to implement the SPOT procedure in practice and evaluated the performance of our proposed test statistic using both simulation studies and real microarray data analysis. The proposed SPOT method is shown to outperform the popularly applied moderated 𝑡 and 𝐹SS statistics, which are optimal only under certain normality conditions on the mean. There is still potential to improve the performance of the SPOT procedure if better density estimates can be found for the marginal distributions $m_m(X_g, s^2_g)$ and $m_0(X_g, s^2_g)$.

Appendix

Proof of the Claim That the Moderated t-Test Achieves the Optimal Average Power Asymptotically under the Assumptions That $\mu_g \sim N(0, \nu_0 \sigma^2_g)$ and $1/\sigma^2_g \sim (1/(d_0 s^2_0))\,\chi^2_{d_0}$

Under Smyth's [2] model assumptions, the most powerful test statistic (3.1) derived from the Neyman-Pearson lemma becomes
\[
T^{\mathrm{NP}}_g = \frac{\iint f(X_g, s^2_g \mid \mu_g, \sigma^2_g)\, \pi_1(\mu_g)\, \pi(\sigma^2_g)\, d\mu_g\, d\sigma^2_g}{\int f(X_g, s^2_g \mid \mu_g = 0, \sigma^2_g)\, \pi(\sigma^2_g)\, d\sigma^2_g},
\]
where both integrands contain the sampling densities
\[
\frac{e^{-(x_g - \mu_g)^2/(2\nu_g \sigma^2_g)}}{(2\pi \nu_g \sigma^2_g)^{1/2}}
\left(\frac{d_g}{2\sigma^2_g}\right)^{d_g/2} \frac{s_g^{2(d_g/2-1)}\, e^{-d_g s^2_g/(2\sigma^2_g)}}{\Gamma(d_g/2)},
\]
the numerator uses the priors $\mu_g \sim N(0, \nu_0 \sigma^2_g)$ and $\pi(\sigma^2_g)$ given by (3.2), and the denominator sets $\mu_g = 0$ and uses $\pi(\sigma^2_g)$ only. Integrating out $\mu_g$ and then $\sigma^2_g$ in the numerator, and $\sigma^2_g$ in the denominator, gives
\[
T^{\mathrm{NP}}_g = C_1 \left( \frac{x^2_g/(\nu_0 + \nu_g) + d_0 s^2_0 + d_g s^2_g}{x^2_g/\nu_g + d_0 s^2_0 + d_g s^2_g} \right)^{-(1+d_0+d_g)/2}, \tag{A.1}
\]
where $C_1$ is a constant. Since $d_0 s^2_0 + d_g s^2_g = (d_0 + d_g)\tilde{s}^2_g$ by (3.4), the statistic (A.1) is a monotonically increasing function of the square of Smyth's [2] moderated 𝑡-statistic (3.3). Thus the claim follows from the existence of consistent estimates of $d_0$ and $s^2_0$, which has been shown in Smyth [2].

References

  1. I. Lönnstedt and T. Speed, “Replicated microarray data,” Statistica Sinica, vol. 12, no. 1, pp. 31–46, 2002.
  2. G. K. Smyth, “Linear models and empirical Bayes methods for assessing differential expression in microarray experiments,” Statistical Applications in Genetics and Molecular Biology, vol. 3, article 3, 2004.
  3. X. Cui, J. Hwang, J. Qiu, N. J. Blades, and G. A. Churchill, “Improved statistical tests for differential gene expression by shrinking variance components estimates,” Biostatistics, vol. 6, no. 1, pp. 59–75, 2005.
  4. J. T. G. Hwang and P. Liu, “Optimal tests shrinking both means and variances applicable to microarray data analysis,” Statistical Applications in Genetics and Molecular Biology, vol. 9, article 36, 2010.
  5. T. Cai, J. Jeng, and H. Li, “Robust detection and identification of sparse segments in ultra-high dimensional data analysis,” Journal of the Royal Statistical Society: Series B, vol. 74, part 4, 2012.
  6. V. G. Tusher, R. Tibshirani, and G. Chu, “Significance analysis of microarrays applied to the ionizing radiation response,” Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 9, pp. 5116–5121, 2001.
  7. B. Efron, R. Tibshirani, J. D. Storey, and V. Tusher, “Empirical Bayes analysis of a microarray experiment,” Journal of the American Statistical Association, vol. 96, no. 456, pp. 1151–1160, 2001.
  8. P. Baldi and A. D. Long, “A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes,” Bioinformatics, vol. 17, no. 6, pp. 509–519, 2001.
  9. Y. C. Tai and T. P. Speed, “A multivariate empirical Bayes statistic for replicated microarray time course data,” The Annals of Statistics, vol. 34, no. 5, pp. 2387–2412, 2006.
  10. G. W. Wright and R. M. Simon, “A random variance model for detection of differential gene expression in small microarray experiments,” Bioinformatics, vol. 19, no. 18, pp. 2448–2455, 2003.
  11. T. Tong and Y. Wang, “Optimal shrinkage estimation of variances with applications to microarray data analysis,” Journal of the American Statistical Association, vol. 102, no. 477, pp. 113–122, 2007.
  12. H. Bar, J. Booth, E. Schifano, and M. T. Wells, “Laplace approximated EM microarray analysis: an empirical Bayes approach for comparative microarray experiments,” Statistical Science, vol. 25, no. 3, pp. 388–407, 2010.
  13. L. Wasserman, All of Nonparametric Statistics, Springer Texts in Statistics, Springer, New York, NY, USA, 2006.