Abstract

The problem of estimating the proportion, $\pi_0$, of true null hypotheses in a multiple testing problem is important in cases where large-scale parallel hypothesis tests are performed independently. While $\pi_0$ is a quantity of interest in its own right in applications, an estimate of $\pi_0$ can be used for assessing or controlling an overall false discovery rate. In this article, we develop an innovative nonparametric maximum likelihood approach to estimate $\pi_0$. The nonparametric likelihood is restricted to multinomial models, and an EM algorithm is developed to approximate the estimate of $\pi_0$. Simulation studies show that the proposed method outperforms other existing methods. Using experimental microarray datasets, we demonstrate that the new method provides a satisfactory estimate in practice.

1. Introduction

Estimating the proportion of true null hypotheses in a multiple testing setup is crucial for assessing and/or controlling the false discovery rate, which is of particular importance in genomics and in disease and cancer studies. Langaas et al. [1] remarked, “An important reason for wanting to estimate $\pi_0$ is that it is a quantity of interest in its own right. In addition, a reliable estimate of $\pi_0$ is important when we want to assess or control multiple error rates, such as the false discovery rate FDR of Benjamini and Hochberg [2].” In the case of testing for differential expression in DNA microarrays, the proportion of differentially expressed genes is $1 - \pi_0$, and it is important to know whether, for example, 5% or 35% of the genes are differentially expressed, even if we cannot identify these genes (see Langaas et al. [1]). Multiple testing refers to any instance that involves the simultaneous testing of several hypotheses. A common feature of genomic studies is the analysis of a large number of simultaneous measurements on a small number of samples. One must decide whether the findings are truly causative correlations or just the byproducts of multiple hypothesis testing (Gyorffy et al. [3]). If one does not take the multiplicity of tests into account, then the probability that some of the true null hypotheses are rejected by chance alone may be unduly large.

In a multiple hypothesis testing problem, $m$ null hypotheses are tested simultaneously; that is, we test $H_{0i}$ versus $H_{1i}$, for $i = 1, \dots, m$, simultaneously. Assume that the $m$ tests are constructed based on the observed $p$ values, $p_1, \dots, p_m$, respectively. The unknown quantity to be estimated is the proportion $\pi_0$ of the true null hypotheses among $H_{01}, \dots, H_{0m}$. Introduce the i.i.d. Bernoulli random variables $H_1, \dots, H_m$ with $H_i = 0$ if the $i$th null hypothesis is true and $H_i = 1$ otherwise. Then $\pi_0$ can be interpreted in terms of the multiple testing problem as follows: $\pi_0 = P(H_i = 0)$, for $i = 1, \dots, m$.

We assume the $p$ values, $p_1, \dots, p_m$, are continuous and independent random variables, so that the $p$ values are independently and identically distributed as $U(0,1)$ when the null hypotheses are all true. One chooses to reject or fail to reject each null hypothesis based on the corresponding $p$ value. The consequences of the tests are summarized in Table 1.

In Table 1, $m$ is the number of null hypotheses and $R$ is the observable random variable representing the number of hypotheses rejected. Note that all the other random variables, $U$, $V$, $S$, and $T$, in Table 1 are unobservable.

The problem of estimating the proportion $\pi_0$ arises naturally in assessing or controlling an overall false rejection rate in simultaneous hypothesis testing problems. A reliable estimate of $\pi_0$ is crucial when we want to control and/or assess the false discovery rate (FDR) proposed by Benjamini and Hochberg [2], defined as $\mathrm{FDR} = E(Q)$, where $Q = V/R$ if $R > 0$ and $Q = 0$ if $R = 0$, with $V$ and $R$ being the number of falsely rejected and the total number of rejected null hypotheses in Table 1.

Benjamini and Hochberg [2] prove that Simes' procedure (Simes [4]) has the FDR controlled at level $\alpha$ if the underlying test statistics and the corresponding $p$ values are continuous and independently and identically distributed. Specifically, they show $\mathrm{FDR} \le \pi_0 \alpha \le \alpha$. Consequently, if $\alpha$ is the FDR level that one wishes to achieve and if $\pi_0$ can be estimated efficiently, say by $\hat{\pi}_0$, then the nominal level can be chosen to be $\alpha/\hat{\pi}_0$ to gain additional testing power in the multiple testing problem with the FDR still under control. If $\pi_0$ is overestimated substantially, however, the value of $\alpha/\hat{\pi}_0$ is underestimated substantially, leading to significantly narrower simultaneous confidence intervals for multiple comparisons and a significant reduction of testing power for the multiple testing problems. On the other hand, if the rejection rule is chosen via some other procedure, say the Bonferroni method in the multiple comparison problem, assessing the FDR accurately via estimating $\pi_0$ efficiently is then of great interest.
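To make the plug-in idea concrete, the following minimal R sketch (not the authors' code) runs the Benjamini-Hochberg procedure at the relaxed level $\alpha/\hat{\pi}_0$; the simulated $p$ values and the value of pi0_hat are purely illustrative assumptions.

```r
## Sketch: plugging an estimate of pi_0 into the BH procedure.
set.seed(1)
p <- c(runif(900), rbeta(100, 1, 20))   # hypothetical p-values: 90% null, 10% alternative
alpha   <- 0.05                          # target FDR level
pi0_hat <- 0.9                           # assume an estimate of pi_0 is available

rej_bh       <- p.adjust(p, method = "BH") <= alpha              # standard BH
rej_adaptive <- p.adjust(p, method = "BH") <= alpha / pi0_hat    # BH at the relaxed level

c(BH = sum(rej_bh), adaptive = sum(rej_adaptive))
```

Running BH against the threshold $\alpha/\hat{\pi}_0$ typically rejects at least as many hypotheses as BH at level $\alpha$, which is the power gain referred to above.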

The paper is organized as follows. Section 2 is a review of existing estimating methods; Section 3 introduces the new estimating procedure; Section 4 contains simulation results; and Section 5 presents applications of the new estimating procedure to real-life examples.

2. Existing Estimating Methods

2.1. Mixture Model Framework

A mixture model can be used to fit the $p$ values in order to estimate the proportion, $\pi_0$, of the true null hypotheses, where large-scale parallel hypothesis tests are performed independently. The estimate for $\pi_0$ can be based on the mixture model for the common density of the $p$ values described as follows:
$$ f(p) = \pi_0 + (1 - \pi_0)\, h(p), \qquad 0 \le p \le 1, $$
where $h$ is the conditional probability density function of the $p$ value under an alternative (see Langaas et al. [1]). Using this mixture representation, we are able to characterize the maximum likelihood estimate for $\pi_0$. The estimation method is derived under the assumption of independent and identically distributed $p$ values. The null $p$ values are uniformly distributed on $[0,1]$. We should note that $h$ describes the configuration of the true alternative populations among the underlying populations; it seems that a nonparametric approach to the estimation of $\pi_0$ is more appealing. Without loss of generality, we define $H_i = 1$ as indicating that the alternative hypothesis is true, which will be used throughout this section. Three nonparametric estimators proposed recently by other authors are described and discussed in the sequel subsections.

2.2. Storey’s Method

Consider the common marginal mixture density of the $p$ values, for any $p \in [0,1]$:
$$ f(p) = \pi_0 + (1 - \pi_0)\, h(p). \qquad (6) $$
On the basis that $h(p)$ is typically small for $p$ near 1, a large majority of the $p$ values in the interval $(\lambda, 1]$ should correspond to the true null hypotheses and thus be uniformly distributed on the interval $(\lambda, 1]$, since most of the $p$ values under the alternatives should be close to 0, so that $h(p)$ is approximately zero on $(\lambda, 1]$. Let $W(\lambda) = \#\{p_i > \lambda\}$. Note that $W(\lambda)$ should be approximately equal to the product of $m\pi_0$ and the length of the interval $(\lambda, 1]$; that is, $W(\lambda) \approx m\pi_0(1-\lambda)$. Hence, Storey [6] proposes that the proportion of true null hypotheses, $\pi_0$, be estimated by (with an appropriately chosen $\lambda$)
$$ \hat{\pi}_0(\lambda) = \frac{W(\lambda)}{m(1-\lambda)} = \frac{\#\{p_i > \lambda\}}{m(1-\lambda)}. $$
The value of $\lambda$ has an impact on the behavior of $\hat{\pi}_0(\lambda)$: $\hat{\pi}_0(\lambda)$ has a large bias and small variance when $\lambda$ is small and a small bias and large variance when $\lambda$ is large. Since both extreme values of $\lambda$ involve a bias-variance trade-off, Storey et al. [7] propose bootstrapping, which is a resampling technique, to choose $\lambda$ when estimating $\pi_0$ so as to minimize the mean squared error of $\hat{\pi}_0(\lambda)$. The resulting estimator is denoted by $\hat{\pi}_0^{\mathrm{St}}$.
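As a concrete illustration (not the authors' code), a minimal R sketch of Storey's estimator for a fixed $\lambda$ is given below; the choice $\lambda = 0.5$ and the simulated $p$ values are illustrative assumptions.

```r
## Minimal sketch of Storey's estimator: pi0_hat(lambda) = #{p_i > lambda} / (m (1 - lambda)).
storey_pi0 <- function(p, lambda = 0.5) {
  m <- length(p)
  sum(p > lambda) / (m * (1 - lambda))
}

## Hypothetical example: 90% null (uniform) and 10% alternative p-values
set.seed(2)
p <- c(runif(900), rbeta(100, 1, 20))
storey_pi0(p, lambda = 0.5)   # roughly 0.9, typically slightly above (conservative)
```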

Note that the idea leading to Storey's estimator is to treat the additional term $(1-\pi_0)\int_\lambda^1 h(p)\,dp\,/(1-\lambda)$ as zero to get rid of the complication caused by the unknown alternative distribution. As a consequence, Storey's estimator will tend to overestimate $\pi_0$ by roughly this amount, at least theoretically. However, the anticipated size of the overestimate becomes less significant when $\pi_0$ is closer to 1, that is, when $1-\pi_0$ is closer to 0. The bias of $\hat{\pi}_0(\lambda)$ as an estimator for $\pi_0$ can be dominating when $1 - \pi_0$ is close to 1, as is evident in the simulation results in Section 4 and as observed by other authors as well, for example, Langaas et al. [1].

2.3. Convest Method

Let $f$ and $h$ be defined as in (6). Langaas et al. [1] prove that if $h$ is twice differentiable, convex, and decreasing, $f$ can be expressed as
$$ f(p) = \int_0^1 k_\theta(p)\, d\mu(\theta), $$
where the kernel density $k_\theta(p) = (2/\theta)(1 - p/\theta)$ for $0 \le p \le \theta$ (and $k_\theta(p) = 0$ for $p > \theta$), with $k_0$ interpreted as the uniform density and $\mu$ being any probability measure on $[0,1]$. Thus Langaas et al. [1] are able to characterize the nonparametric maximum likelihood estimate for the density $f$ as $\hat{f}(p) = \int_0^1 k_\theta(p)\, d\hat{\mu}(\theta)$, where $\hat{\mu}$ is the nonparametric maximum likelihood estimate of $\mu$. Let $p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}$ be the order statistics of the $p$ values. The nonparametric maximum likelihood estimator of $f$ is given by maximizing $\sum_{i=1}^m \log f(p_{(i)})$ over all convex decreasing densities of this form. Langaas et al. [1] then propose to estimate the proportion by $\hat{\pi}_0^{\mathrm{Con}} = \hat{f}(1)$. It is noted that, as the estimator is constructed via a density estimate at the upper bound of the support, namely $p = 1$, it can be conservative and overestimate $\pi_0$ when the assumption that $h(1) = 0$ is questionable or when $h(p)$ decreases slowly as $p \to 1$. The overestimating problem can be even more severe when $\pi_0$ is not large, that is, when $1 - \pi_0$ is not so small. However, the remarks at the end of Section 2.2 are applicable to $\hat{\pi}_0^{\mathrm{Con}}$ as well.
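For reference, a minimal usage sketch of the convest estimator via the Bioconductor package limma is shown below; it assumes limma is installed and uses the function's default settings, with illustrative simulated $p$ values.

```r
## Sketch: the convest estimate computed with limma (assumes limma is installed).
library(limma)

set.seed(3)
p <- c(runif(900), rbeta(100, 1, 20))  # hypothetical p-values, true pi_0 = 0.9
pi0_convest <- convest(p)              # NPMLE-based estimate, i.e., f_hat evaluated at 1
pi0_convest
```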

2.4. Average Estimate Approach

The average estimate approach was motivated by Storey's method. Jiang and Doerge [8] observe (and many other authors point out as well) that Storey's estimator $\hat{\pi}_0(\lambda)$ has a large bias and small variance when $\lambda$ is small and a small bias and large variance when $\lambda$ is large. Since both extremes of $\lambda$ involve a bias-variance trade-off, Jiang and Doerge [8] propose to combine Storey's estimates with different values of $\lambda$ that vary from a small extreme to a large extreme.

Let $\lambda_1 < \lambda_2 < \dots < \lambda_n$, and suppose that, for each $\lambda_i$, we have Storey's estimate $\hat{\pi}_0(\lambda_i)$. Jiang and Doerge [8] propose to estimate $\pi_0$ by the average of $\hat{\pi}_0(\lambda_i)$ over the values of $\lambda_i$; that is, $\hat{\pi}_0 = \frac{1}{n}\sum_{i=1}^{n} \hat{\pi}_0(\lambda_i)$. This estimate aims to minimize both the bias and the variance at the same time, if the $\lambda_i$'s are selected appropriately.

Define $0 = t_1 < t_2 < \dots < t_n < 1$ as equally spaced points in the interval $[0,1)$ such that the interval is divided into $n$ small intervals with equal length $1/n$; specifically, $t_i = (i-1)/n$. Let $W(t_i) = \#\{p_j \ge t_i\}$ and $\hat{\pi}_0(t_i) = W(t_i)/[m(1-t_i)]$ for $i = 1, \dots, n$. Define
$$ \hat{\pi}_0^{\mathrm{Ave}}(n) = \frac{1}{n}\sum_{i=1}^{n} \hat{\pi}_0(t_i). $$
Jiang and Doerge [8] then propose to estimate $\pi_0$ by $\hat{\pi}_0^{\mathrm{Ave}}(n)$. To apply the estimate, one has to choose a value of $n$. Jiang and Doerge [8] develop a bootstrapping algorithm to pick the optimal $n$. It should be noted that, as an average of Storey's estimates, this estimate is expected to inherit conservativeness from Storey's method.
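A minimal R sketch consistent with the description above follows; the number of points $n = 20$ is a fixed illustrative choice here (Jiang and Doerge select it by a bootstrap procedure), and the simulated $p$ values are assumptions of this sketch.

```r
## Sketch of the average estimate: Storey's estimator averaged over n equally
## spaced thresholds t_i = (i - 1)/n.
average_pi0 <- function(p, n = 20) {
  m <- length(p)
  t <- (seq_len(n) - 1) / n                                   # equally spaced points in [0, 1)
  storey <- sapply(t, function(ti) sum(p >= ti) / (m * (1 - ti)))
  mean(storey)
}

set.seed(4)
p <- c(runif(900), rbeta(100, 1, 20))  # hypothetical p-values, true pi_0 = 0.9
average_pi0(p, n = 20)
```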

3. New Method

We propose a finite mixture model of the uniform distribution $U(0,1)$ and a multinomial distribution with a mixing proportion $\pi_0$ to fit the $p$ values. Denote $\mathrm{UM}(\pi_0, \boldsymbol{\eta})$ for this finite mixture distribution. By this approach, the alternative distribution $h$ defined in (6) is restricted to a multinomial distribution family, $\mathrm{M}$. Alternatively, the multinomial distribution can be viewed as a parametric approximation to the nonparametric unknown density $h$, similar to the idea of the empirical likelihood method (see Owen [9]). So this procedure is considered nonparametric.

To apply the approach, we need to settle two things first: (a) convert the continuous-type observations $p_1, \dots, p_m$ into discrete data with $k$ categories and (b) select an integer $k$.

3.1. Selection of $k$

It is often the case in applications that the $p$ values are highly skewed (see Storey and Tibshirani [10], Zhao et al. [11], and Markitsis and Lai [12]). In this case, we recommend Doane's modification of Sturges' rule for the selection of $k$, to account for the skewness of the mixture distribution (6) (see Doane [13] and Sturges [14]):
$$ k = 1 + \log_2 m + \log_2\!\left(1 + \frac{|\hat{\gamma}|}{\sigma_{\hat{\gamma}}}\right), \qquad \sigma_{\hat{\gamma}} = \sqrt{\frac{6(m-2)}{(m+1)(m+3)}}, $$
where $\hat{\gamma}$ is an estimate of the skewness coefficient. In the case of symmetry, we recommend adapting Sturges' rule to determine the selection of $k$:
$$ k = 1 + \log_2 m. $$
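A short R sketch of these two rules follows; the sample-skewness estimator and the use of ceiling to round $k$ up are assumptions of this sketch rather than prescriptions of the text.

```r
## Sketch: choosing the number of categories k by Doane's rule, with Sturges'
## rule as the symmetric special case.
doane_k <- function(p) {
  m  <- length(p)
  g1 <- mean((p - mean(p))^3) / (mean((p - mean(p))^2))^(3/2)  # skewness estimate
  sg <- sqrt(6 * (m - 2) / ((m + 1) * (m + 3)))                # its standard error
  ceiling(1 + log2(m) + log2(1 + abs(g1) / sg))
}

sturges_k <- function(p) ceiling(1 + log2(length(p)))

set.seed(5)
p <- c(runif(900), rbeta(100, 1, 20))   # hypothetical, right-skewed p-values
c(doane = doane_k(p), sturges = sturges_k(p))
```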

3.2. Transformation of the $p_i$'s

To transform the $p_i$'s, partition the unit interval $[0,1]$ into $k$ subintervals with equal width $1/k$. Define $x_{ij} = 1$ if $p_i$ falls into the $j$th subinterval $((j-1)/k, j/k]$ and $x_{ij} = 0$ otherwise, for $j = 1, \dots, k$, $i = 1, \dots, m$. Keep in mind that $\sum_{j=1}^{k} x_{ij} = 1$. From Storey's Bayesian interpretation for the multiple testing problem, given that the alternative is true with a probability of $1 - \pi_0$, the $p$ value follows the distribution $h$. In the same way, the transformed data $\mathbf{x}_i = (x_{i1}, \dots, x_{ik})$ can be interpreted as follows. Given that the alternative is true with a probability of $1 - \pi_0$, $\mathbf{x}_i$ is a multinomial random vector and is distributed as $\mathrm{M}(1; \eta_1, \dots, \eta_k)$, for $i = 1, \dots, m$; under the null, $\mathbf{x}_i$ is distributed as the discretized uniform $\mathrm{M}(1; 1/k, \dots, 1/k)$. Therefore, $\mathbf{x}_1, \dots, \mathbf{x}_m$ are independently and identically distributed as the finite mixture distribution $\mathrm{UM}(\pi_0, \boldsymbol{\eta})$, and the maximum likelihood estimate for $\pi_0$ thus results from the transformed data $\mathbf{x}_i$'s. Explicitly, $\hat{\pi}_0$ together with $\hat{\boldsymbol{\eta}} = (\hat{\eta}_1, \dots, \hat{\eta}_k)$ maximizes the log-likelihood function
$$ \ell(\pi_0, \boldsymbol{\eta}) = \sum_{i=1}^{m} \sum_{j=1}^{k} x_{ij} \log\!\left(\frac{\pi_0}{k} + (1 - \pi_0)\,\eta_j\right). \qquad (22) $$
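The transformation can be carried out with a few lines of R; the sketch below (an illustration, not the authors' code) converts $p$ values into the per-bin counts $n_j = \sum_i x_{ij}$, which are all the likelihood (22) needs. The value $k = 11$ is only an example.

```r
## Sketch: binning p-values into k equal-width categories.
bin_pvalues <- function(p, k) {
  bins <- cut(p, breaks = seq(0, 1, length.out = k + 1), include.lowest = TRUE)
  tabulate(as.integer(bins), nbins = k)   # counts n_j = sum_i x_ij
}

set.seed(6)
p <- c(runif(900), rbeta(100, 1, 20))
n_j <- bin_pvalues(p, k = 11)
n_j
```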

3.3. EM Algorithm

Note that maximizing the nonlinear log-likelihood function (22) directly can be complicated. The EM algorithm may be used to obtain an approximation to $(\hat{\pi}_0, \hat{\boldsymbol{\eta}})$ easily. In order to do so, we introduce a latent Bernoulli variable $z_i$ that indicates the component membership in the finite mixture distribution $\mathrm{UM}(\pi_0, \boldsymbol{\eta})$. That is, $P(z_i = 1) = \pi_0$, and given $z_i = 0$, $\mathbf{x}_i$ follows $\mathrm{M}(1; \eta_1, \dots, \eta_k)$. Note that $(\mathbf{x}_i, z_i)$ has the joint distribution
$$ P(\mathbf{x}_i = \mathbf{x}, z_i = z) = \left[\pi_0 \prod_{j=1}^{k} \left(\frac{1}{k}\right)^{x_j}\right]^{z} \left[(1-\pi_0)\prod_{j=1}^{k} \eta_j^{x_j}\right]^{1-z} \qquad (23) $$
for $x_j = 0$ or 1 and $z = 0$ or 1, $j = 1, \dots, k$.

Let $(\mathbf{x}_i, z_i)$, $i = 1, \dots, m$, be a random sample of size $m$ from model (23). In the present problem, the $\mathbf{x}_i$'s are the only available data for analysis, and the $z_i$'s are unobservable and so considered as missing values. This defines a missing-value model. The log-likelihood with the complete data is given by
$$ \ell_c(\pi_0, \boldsymbol{\eta}) = \sum_{i=1}^{m} \left\{ z_i \left[\log \pi_0 - \log k\right] + (1 - z_i)\left[\log(1-\pi_0) + \sum_{j=1}^{k} x_{ij}\log \eta_j\right] \right\}, $$
where $\mathbf{x}_i = (x_{i1}, \dots, x_{ik})$ and $\boldsymbol{\eta} = (\eta_1, \dots, \eta_k)$, for $i = 1, \dots, m$. We are ready to describe the EM algorithm.

E-Step. Let $(\pi_0^{(s)}, \boldsymbol{\eta}^{(s)})$ be the current approximations to the maximum likelihood estimates under the model $\mathrm{UM}(\pi_0, \boldsymbol{\eta})$. For the next approximation, the E-step establishes the expected log-likelihood function as
$$ Q(\pi_0, \boldsymbol{\eta}) = \sum_{i=1}^{m} \left\{ w_i^{(s)}\left[\log \pi_0 - \log k\right] + \left(1 - w_i^{(s)}\right)\left[\log(1-\pi_0) + \sum_{j=1}^{k} x_{ij}\log \eta_j\right] \right\}, $$
where $w_i^{(s)} = E\big(z_i \mid \mathbf{x}_i;\, \pi_0^{(s)}, \boldsymbol{\eta}^{(s)}\big)$ is the posterior probability that $\mathbf{x}_i$ comes from the uniform (null) component.

M-Step. In the M-step, $Q(\pi_0, \boldsymbol{\eta})$ is maximized to yield the next approximations $\pi_0^{(s+1)}$ and $\boldsymbol{\eta}^{(s+1)}$. Consider
$$ \eta_j^{(s+1)} = \frac{\sum_{i=1}^{m}\left(1 - w_i^{(s)}\right)x_{ij}}{\sum_{i=1}^{m}\left(1 - w_i^{(s)}\right)}, \quad j = 1, \dots, k, $$
where
$$ w_i^{(s)} = \frac{\pi_0^{(s)}/k}{\pi_0^{(s)}/k + \left(1-\pi_0^{(s)}\right)\sum_{j=1}^{k} x_{ij}\,\eta_j^{(s)}}, \quad i = 1, \dots, m. $$

From $\partial Q/\partial \pi_0 = 0$, we have
$$ \pi_0^{(s+1)} = \frac{1}{m}\sum_{i=1}^{m} w_i^{(s)}. $$

To sum up, let $(\pi_0^{(s)}, \boldsymbol{\eta}^{(s)})$ be the $s$th approximations to the maximum likelihood estimates of $(\pi_0, \boldsymbol{\eta})$ that maximize the log-likelihood function defined in (22). Then the $(s+1)$th approximation with the EM algorithm is given by
$$ \pi_0^{(s+1)} = \frac{1}{m}\sum_{i=1}^{m} w_i^{(s)}, \qquad \eta_j^{(s+1)} = \frac{\sum_{i=1}^{m}\left(1 - w_i^{(s)}\right)x_{ij}}{\sum_{i=1}^{m}\left(1 - w_i^{(s)}\right)}, \quad j = 1, \dots, k. $$
So the EM iteration process goes as shown in Algorithm 1; an illustrative implementation sketch is given after the algorithm.

  Input: $\mathbf{x}_1, \dots, \mathbf{x}_m$ transformed from the observed $p$-values
  Output: Estimate $\hat{\pi}_0$ of $\pi_0$
(1) begin
(2)  Initialization: Set $s = 0$ and choose initial values $\pi_0^{(0)}$ and $\boldsymbol{\eta}^{(0)}$.
(3)  repeat
(4)     Set $(\pi_0^{(s)}, \boldsymbol{\eta}^{(s)})$ to be the $s$th (current) approximation.
       Compute
            $w_i^{(s)} = \dfrac{\pi_0^{(s)}/k}{\pi_0^{(s)}/k + (1-\pi_0^{(s)})\sum_{j=1}^{k} x_{ij}\eta_j^{(s)}}$, $i = 1, \dots, m$,
            $\pi_0^{(s+1)} = \dfrac{1}{m}\sum_{i=1}^{m} w_i^{(s)}$, $\eta_j^{(s+1)} = \dfrac{\sum_{i=1}^{m}(1-w_i^{(s)})x_{ij}}{\sum_{i=1}^{m}(1-w_i^{(s)})}$, $j = 1, \dots, k$; set $s = s + 1$.
(5)  until the stopping criterion based on Aitken's acceleration (described below) is satisfied;
(6) Then $\hat{\pi}_0 = \pi_0^{(s)}$.
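The following R sketch implements the EM iteration above in terms of bin counts. It is an illustration written from the updates given in this section, not the authors' code; the initial values, the choice of $k$, the simple log-likelihood stopping tolerance, and the simulated data are assumptions of the sketch.

```r
## Sketch of the EM algorithm for the uniform-multinomial mixture UM(pi0, eta).
em_pi0 <- function(p, k, pi0 = 0.5, max_iter = 1000, tol = 1e-8) {
  m   <- length(p)
  j   <- pmin(floor(p * k) + 1, k)               # bin index of each p-value
  n_j <- tabulate(j, nbins = k)                  # counts per bin
  eta <- rep(1 / k, k)                           # initial multinomial probabilities

  loglik <- function(pi0, eta) sum(n_j * log(pi0 / k + (1 - pi0) * eta))

  ll_old <- loglik(pi0, eta)
  for (s in seq_len(max_iter)) {
    ## E-step: posterior probability that each observation is from the null (uniform) component
    w <- (pi0 / k) / (pi0 / k + (1 - pi0) * eta[j])
    ## M-step: update pi0 and eta
    pi0 <- mean(w)
    eta <- as.numeric(tapply(1 - w, factor(j, levels = 1:k), sum))
    eta[is.na(eta)] <- 0
    eta <- eta / sum(eta)
    ll_new <- loglik(pi0, eta)
    if (abs(ll_new - ll_old) < tol) break        # simple stopping rule used in this sketch
    ll_old <- ll_new
  }
  list(pi0 = pi0, eta = eta, loglik = ll_new, iterations = s)
}

set.seed(7)
p <- c(runif(900), rbeta(100, 1, 20))            # hypothetical data, true pi_0 = 0.9
em_pi0(p, k = 11)$pi0
```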

It is well known that each EM iteration gets closer to the maximum of the log-likelihood but only at a linear convergence rate. If the components are similar in their densities, then the convergence is extremely slow. The convergence will also be slow when the maximum likelihood solution requires some of the weight parameters to be zero, because the algorithm can never reach such a boundary point. An additional and related problem is that of deciding when to stop the algorithm. One risk to a naive user is the natural tendency to use a stopping rule based on the changes in the parameters or in the likelihood being sufficiently small. Taking small steps does not indicate that we are close to the solution.

To combat this problem, Lindsay [15] exploited the regularity of the EM algorithm process to predict, via the device known as Aitken acceleration, the value of the log-likelihood at the maximum likelihood solution. Aitken's acceleration rule is usually recommended to predict the maximum efficiently and suitably whenever one is using a linearly convergent algorithm with a slow rate of convergence. If we let $\ell^{(s)}$, $\ell^{(s+1)}$, and $\ell^{(s+2)}$ be the log-likelihood values for three consecutive iterations, then the predicted final value is
$$ \ell_\infty = \ell^{(s+1)} + \frac{\ell^{(s+2)} - \ell^{(s+1)}}{1 - c^{(s)}}, $$
where $c^{(s)} = \big(\ell^{(s+2)} - \ell^{(s+1)}\big)\big/\big(\ell^{(s+1)} - \ell^{(s)}\big)$. Terminate the EM iteration when the difference between the predicted final value $\ell_\infty$ and the current log-likelihood is sufficiently small.
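A minimal R sketch of this prediction step is given below; the synthetic log-likelihood sequence is an illustrative assumption chosen so that the limit is known.

```r
## Sketch of Aitken's acceleration: predict the limiting log-likelihood from
## three consecutive EM log-likelihood values ll = c(l_s, l_s1, l_s2).
aitken_predict <- function(ll) {
  c_s <- (ll[3] - ll[2]) / (ll[2] - ll[1])  # estimated linear rate of convergence
  ll[2] + (ll[3] - ll[2]) / (1 - c_s)       # predicted log-likelihood at the MLE
}

## Example with a hypothetical linearly convergent sequence l_s = -100 - 5 * 0.8^s
ll <- -100 - 5 * 0.8^(1:3)
aitken_predict(ll)                          # returns -100, the known limit
```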

4. Simulation Studies

In order to investigate the properties of the estimators described in Section 2 and compare their performance with that of the newly developed estimating procedure described in Section 3, we conducted simulation experiments. The generation of simulated data and the calculation of the estimates were both done in the R language. Simulation studies were conducted with the $p$ values based on a one-sided $z$-test in a finite normal mixture model, with various values of $\pi_0$, for performance comparisons.

Monte Carlo data $z_1, \dots, z_m$ were simulated independently from the finite normal mixture in which an observation comes from the null component $N(0,1)$ with probability $\pi_0$ and from an alternative normal component otherwise, and each $p$ value was computed by $p_i = 1 - \Phi(z_i)$, where $\Phi$ is the cumulative distribution function of the standard normal $N(0,1)$. For the generation of simulated data, three true values of $\pi_0$ were considered, with sample sizes including $m = 1{,}000$. In each case, Monte Carlo trials were performed. In the implementation of the EM algorithm to compute the proposed new estimate $\hat{\pi}_0^{\mathrm{New}}$, the termination rule of the EM algorithm is the one based on Aitken's acceleration described in Section 3.3. The EM algorithm converged for all the simulated datasets. In each case, the performance of the proposed new estimate was compared with those of several existing procedures through the same set of Monte Carlo data. Specifically, in the simulations, we considered the following existing methods: (1) Storey's bootstrap estimate, denoted by $\hat{\pi}_0^{\mathrm{St}}$; (2) Langaas et al.'s convex estimate, denoted by $\hat{\pi}_0^{\mathrm{Con}}$; (3) Jiang and Doerge's average method estimate, denoted by $\hat{\pi}_0^{\mathrm{Ave}}$. Storey's estimate was computed using the package nFDR and the convex estimate using the package limma with the function convest. Table 2 summarizes the simulation results. It is clearly shown that the new estimate substantially outperformed the existing methods with comparable standard errors, while the conservative existing estimates performed much worse, as expected.
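For orientation, a single simulation replicate consistent with this setup can be generated as below; the values of $\pi_0$, the alternative mean, and $m$ are illustrative assumptions, not the paper's exact settings, and in a full study each of the four estimators would be applied to the same simulated $p$ values.

```r
## Sketch of one replicate: z-statistics from a normal mixture, one-sided p-values.
set.seed(8)
m   <- 1000
pi0 <- 0.8                                   # true proportion of nulls (illustrative)
mu  <- 2                                     # alternative mean (illustrative)
is_null <- rbinom(m, 1, pi0) == 1
z <- ifelse(is_null, rnorm(m), rnorm(m, mean = mu))
p <- 1 - pnorm(z)                            # one-sided z-test p-values

## quick sanity check with Storey's simple estimator at lambda = 0.5
sum(p > 0.5) / (m * 0.5)
```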

5. Application to Real Life Microarray Data

To further evaluate the performance of the new method in comparison with the three existing methods, consider the real-life data from the DNA microarray experiments reported in Golub et al. [5] that can be downloaded from the R package multtest. The dataset was used by many authors (see [11, 16] and the references therein) to illustrate the applications of proposed estimating methods for the proportion of true null hypotheses in multiple testing problems. The dataset consists of 38 bone marrow samples, 27 acute lymphoblastic leukemia (ALL) samples and 11 acute myeloid leukemia (AML) samples, obtained from acute leukemia patients at the time of diagnosis, before chemotherapy. RNA prepared from bone marrow mononuclear cells was hybridized to high-density oligonucleotide microarrays, produced by Affymetrix and containing probes for 7,129 human genes. After preprocessing, however, only 3,051 genes were accurately readable, resulting in a $3{,}051 \times 38$ microarray matrix. For comparison of gene expressions between the two groups, let $\mu_1$ and $\mu_2$ be the true mean intensities for each gene in groups 1 and 2, respectively; to determine whether the gene is differentially expressed, we test
$$ H_0\colon \mu_1 = \mu_2 \quad \text{versus} \quad H_1\colon \mu_1 \ne \mu_2. $$
With each of the 3,051 genes, a two-sample Welch $t$-statistic is computed and its two-sided $p$ value is computed under a central Student's $t$ distribution with 36 degrees of freedom. A histogram of these $p$ values is displayed in Figure 1. It is clear that the $p$ values are highly skewed to the right.
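The gene-by-gene $p$ values can be reproduced approximately with the sketch below, which assumes the Bioconductor package multtest is installed and uses base R's t.test (Welch by default); note that t.test uses the Welch-Satterthwaite degrees of freedom rather than the fixed 36 used in the text, so the resulting $p$ values are an approximation of those described above.

```r
## Sketch: Welch t-test p-values for the 3,051 x 38 Golub data.
library(multtest)
data(golub)   # golub: 3051 x 38 expression matrix; golub.cl: 0 = ALL, 1 = AML

group <- factor(golub.cl, labels = c("ALL", "AML"))
pvals <- apply(golub, 1, function(x) t.test(x ~ group)$p.value)  # Welch test by default

hist(pvals, breaks = 30, main = "Two-sided Welch t-test p values", xlab = "p value")
```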

The four estimates of the proportion $\pi_0$ of nondifferentially expressed genes among the 3,051 genes, $\hat{\pi}_0^{\mathrm{St}}$, $\hat{\pi}_0^{\mathrm{Con}}$, $\hat{\pi}_0^{\mathrm{Ave}}$, and $\hat{\pi}_0^{\mathrm{New}}$, are given in Table 3. It appears that the four estimates differ substantially, varying from as low as 0.4150 to considerably higher values (see Table 3). Noting that it has been commented by many authors that the estimates $\hat{\pi}_0^{\mathrm{St}}$, $\hat{\pi}_0^{\mathrm{Con}}$, and $\hat{\pi}_0^{\mathrm{Ave}}$ are usually conservative, we conclude that the lower estimate of 0.42 by the proposed new estimating procedure appears as expected and might be closer to the true value of $\pi_0$.

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.