Abstract

Multiple testing has been widely adopted for genome-wide studies such as microarray experiments. For effective gene selection in these genome-wide studies, the optimal discovery procedure (ODP), which maximizes the number of expected true positives for each fixed number of expected false positives, was developed as a multiple testing extension of the most powerful test for a single hypothesis by Storey (Journal of the Royal Statistical Society, Series B, vol. 69, no. 3, pp. 347–368, 2007). In this paper, we develop an empirical Bayes method for implementing the ODP based on a semiparametric hierarchical mixture model using the “smoothing-by-roughening" approach. Under the semiparametric hierarchical mixture model, (i) the prior distribution can be modeled flexibly, (ii) the ODP test statistic and the posterior distribution are analytically tractable, and (iii) computations are easy to implement. In addition, we provide a significance rule based on the false discovery rate (FDR) in the empirical Bayes framework. Applications to two clinical studies are presented.

1. Introduction

The comprehensive, gene expression microarrays are a powerful tool for screening the differentially expressed genes among different phenotypes such as clinical subtypes and prognostic classes of disease from a large pool of candidate genes. Gene screening studies have the potential to be useful for elucidating disease biology and aggressiveness, identifying new therapeutic targets, and developing new molecular diagnostics for optimized medicine for individual patients [14]. The high dimensionality of microarray data, however, has posed special challenges in extracting a small number of relevant genes from a large quantity of noise variables by gene screening analysis. In addition to the control of false positives [57], improvement of the efficiency of gene screening analysis is important.

For efficient screening of differentially expressed genes, Storey [8] developed the optimal discovery procedure (ODP), which can be interpreted as an extension of the most powerful test for a single hypothesis testing [9] to multiple testing. Storey defined an optimality criterion for multiple testing that maximizes the expected number of true positives (ETP) for each fixed level of expected false positives (EFP) [8]. The ODP was developed as a testing procedure that achieves this optimality; it improves the overall performance of multiple tests by “borrowing strength” across tests. However, in applying the ODP, the following two components must be estimated: (a) the true status of each significance test (null or alternative) and (b) the true probability distribution corresponding to each test [8]. To address these estimation problems, Storey et al. [10] constructed the generalized likelihood ratio statistic [11], using maximum likelihood estimates. Cao et al. [12] also proposed a Bayesian approach, and Woo et al. [13] developed a computationally efficient method called “modular ODP.”

More recently, Noma and Matsui [14] developed an empirical Bayes approach for the ODP that can effectively circumvent the estimation problems of (a) and (b). Under a hierarchical, random effects model, they showed that the ODP was derived as a testing rule based on the marginal likelihood ratio statistic. In their numerical evaluations based on simulations, the empirical Bayes method nearly achieved the theoretical bound of ETP (i.e., average power) and performed well, compared with the existing methods under a broad range of conditions [14].

A critical aspect of this empirical Bayes approach is that a fully parametric form in the hierarchical modelling must be specified. Although most of the empirical Bayes methods previously discussed in microarray studies [1521] assume fully parametric natural conjugate models for the prior distribution, this assumption might be restrictive because of possible substantial diversity of effect size among informative genes caused by complex molecular pathways, whose distribution is generally unknown in microarray studies. In this article, we develop an empirical Bayes method for the ODP based on a semiparametric hierarchical mixture model, in which the prior distribution for effect sizes is not specified parametrically. We estimate the nonparametric component of the prior distribution by applying the “smoothing-by-roughening” approach [22, 23] and provide an effective computational method for the ODP statistic.

This article is organized as follows. In Section 2, we provide a brief overview of the theoretical results regarding the ODP methods by Storey [8] and Noma and Matsui [14]. In Section 3, we describe the semiparametric hierarchical mixture model and how to implement the ODP method. We present applications to prostate cancer and lymphoma clinical studies in Section 4. Finally, we provide a discussion in Section 5.

2. The Optimal Discovery Procedure in Multiple Significance Testing

2.1. Storey’s Optimal Discovery Procedure

In this section, we briefly review the theoretical results regarding the ODP methods. We denote the datasets from genes as , where . In microarray studies, corresponds to a set of log-transformed, normalized gene expressions from samples for the th gene (see Section 4). Also, suppose that the datasets are random vectors defined in a common probability space. The optimal goal when identifying differentially expressed genes is to find as many true positives as possible, without incurring too many false positives [10]. Storey [8] defined an optimal criterion of multiple testing as a rule of statistical significance that maximizes ETP for a certain fixed EFP level.

He derived the following lemma that gives the multiple testing procedure that achieves the optimal criterion.

Lemma 1 (Storey [8]). Suppose the th test has a null density and alternative density . Also, assume that the null hypothesis is true for the th test and the alternative is true for , without loss of generality. The following significance threshold statistic achieves the ODP criterion: Given a fixed cut-off chosen to attain an acceptable EFP level, the null hypothesis for the th test is rejected if and only if .

It should be noted that the ODP statistic is composed of all the density functions on tests. In other words, through borrowing strengths across tests, the overall power is improved more than the conventional most powerful test for single hypothesis tests [9, 11]. Also, the optimality result is held regardless of correlation structure among the datasets .

As mentioned in Section 1, for applying , there are two practical issues: (a) estimation of the true status of each hypothesis (null or alternative), corresponding to the numerator and denominator of , and (b) estimation of the true probability density functions ’s and ’s at which ’s are evaluated. Storey et al. [10] provided an practical estimating method, motivated by the generalized likelihood ratio test for single significance tests [11]. They noticed that the significance rule based on is equivalent to that of in Lemma 1. For each test , let be an estimate of the null density function with all the unknown parameters replaced by their maximum likelihood estimates under the constraints of the null hypothesis, and let be the analogous estimate for the alternative hypothesis using the unconstrained maximum likelihood estimates. Then, Storey et al. [10] proposed to use the canonical plug-in estimate: where is an estimate of the status of the th hypothesis. For estimating , Storey et al. [10] ranked the tests based on a univariate statistic (e.g., the -statistic for two-class comparison) and estimated it as when the th test fell below the estimated proportion of true nulls by Storey [6], and otherwise. Other estimating procedures for were also proposed by Cao et al. [12] and Woo et al. [13].

2.2. Empirical Bayes Approach by Noma and Matsui [14]

As Storey [8] pointed out, the ODP is analogous to the well-known shrinkage estimation for multiple-point estimation [24, 25]. Because the shrinkage estimation is interpreted as the empirical Bayes estimation [2628], hierarchical modelling would be a natural formulation for information sharing across tests, both in deriving the ODP and in developing estimation methods for implementing it [14].

Suppose that the probability density functions for have the same parametric form , where is the parameter of interest and is the nuisance parameter for the th test. We assume the following prior distribution for the parameters (): where and are the hyperparameters of the prior distribution.

Lemma 2 (Noma and Matsui [14]). Under the empirical Bayes framework with the prior distribution (4), the following significance threshold function achieves the ODP criterion: For a fixed cut-off chosen to reach an acceptable EFP level, the null hypothesis for the th test is rejected if and only if .

See Noma and Matsui [14] for interpretations of .

For the specification of the prior distributions and , Noma and Matsui [14] adopted the empirical Bayes method [26, 27]. The empirical Bayes implementation is used to obtain estimates for and from the data, which can then be substituted into the ODP statistic, , where and are the estimators of and , respectively.

In comparison with the existing ODP estimation methods based on , the empirical Bayes approach has two advantages. First, it estimates only the hyperparameters and for the random effects models, and not the parameters in the density function , for which a large number of parameters (thousands or more in microarray studies) must be estimated as in the procedure of Storey et al. [10]. Second, in the empirical Bayes approach, the estimation of the true status, null or alternative, for each test is not needed.

3. Semiparametric Hierarchical Mixture Model and the ODP

3.1. Semiparametric Hierarchical Mixture Model

In this section, we consider applying the ODP to concrete microarray analyses of two-class comparisons (with classes denoted as “0” and “1”), for example, good prognosis and poor prognosis, on the basis of the expression levels of genes from samples. The gene expression data considered here comprise normalized log ratios from two-color complementary DNA arrays or normalized log signals from oligonucleotide arrays (e.g., Affymetrix GeneChip). Let and be the sample sizes of classes 0 and 1, respectively. Let and be the means of the gene expression levels for classes 0 and 1, respectively. We consider the detection of differentially expressed genes to be when . That is, testing problem for

We assume that gene expressions of the th gene are normally distributed within each class, where is a common within-class variance (), as in Storey et al. [10]. Let be the difference of the mean of the two classes for the th gene (), , which is the parameter of interest. Let , denote the sample means of the gene expression levels for classes 0 and 1, respectively. We consider the following translation: Using this translation, the nuisance parameters ’s are eliminated in the models of and . Specifically, where . Note that the testing problems in (7) are equivalent to testing Let and .

For applying the ODP to the testing problems under model (10), we assume the following random effects distributions for the model parameters (): where in the first component of is the Dirac delta function, representing nondifferential expression between two classes. The second component of represents a component of the differential expression, and and are the mixing proportions for the components . The random effects distribution of the gene-specific variances is specified as the parametric conjugate model for analytical tractability. For modeling , instead of adopting a fully parametric conjugate normal distribution model as done by Noma and Matsui [14], we apply the “smoothing-by-roughening” approach of Laird and Louis [22] and Shen and Louis [23]. This approach was proposed to obtain the nonparametric maximum likelihood estimate of the prior distribution in empirical Bayes analysis [29, 30]. This approach starts with a smooth estimate of the prior distribution and uses the EM algorithm to “roughen” it towards the nonparametric maximum likelihood estimate. For modeling , we place supports at equally spaced discrete mass points at a series of nonzero points , for sufficiently large values of . We assume that where . We denote the hyperparameters of the distribution function for the differential component as . Shen and Louis [23] provided a detailed guideline for implementing the smoothing-by-roughening approach based on their numerical evaluations. Shen and Louis [23] recommended that the number of grid points should be at least 200. For simplicity, they suggested the grid points can be equally spaced on the support of the distribution. Increasing the number of grid points and spacing them according to the prior density will provide a better approximation of and therefore will improve the estimation, particularly for the tail areas. However, too large numbers might not necessarily be beneficial. For more details involving analytical and numerical evaluations, see the guideline in Shen and Louis [23]. In this article, we refer to this hierarchical model as the “semiparametric hierarchical mixture model.” The important property of this model is that the posterior distributions of and can be obtained analytically. We provide the analytical representation of the posterior distributions in Section 3.4.

The maximum likelihood estimates of the hyperparameters are obtained by the EM algorithm [31]. Because the posterior distributions of and are obtained analytically, the E-step of the EM algorithm can be implemented easily. The details of the EM algorithm are presented in the appendix.

3.2. The Optimal Discovery Procedure

The ODP statistic for the testing problems (11) is obtained as the marginal likelihood ratio. The marginal likelihood of the null hypothesis corresponds to the marginal likelihood of the nondifferential component: Similarly, the marginal likelihood of the alternative hypothesis in (11) corresponds to that of the differential component: under the semiparametric hierarchical mixture model.

Thus, the ODP statistic (the marginal likelihood ratio) is given as By plugging-in the maximum likelihood estimate of , the empirical Bayes testing procedure is implemented based on Note that the ODP statistic is obtained in closed form using the maximum likelihood estimate of .

3.3. Assessing Significance

For assessing significance, we can obtain an empirical Bayes estimator of FDR [57] under the semiparametric hierarchical mixture model as follows. For any given threshold , we denote the significant region for each test. Let denote the set of indices for the significant genes. In the framework of Bayesian selection rules, the FDR accords to the misclassification error rate of differential/non-differential classification [6, 7, 32], and one of the well-known Bayesian estimator of the FDR [33, 34] is where Accordingly, an empirical Bayes estimator of for the significant region can be obtained by plugging-in the maximum likelihood estimate of .

Note that we conducted small simulations for checking correctness of the proposed ODP procedure and these results were provided in the Supplementary Materials available online at http://dx.doi.org/10.1155/2013/568480.

3.4. The Posterior Distribution

Under the semiparametric hierarchical mixture model, the posterior distributions of and can be obtained analytically as noted above; this gives a computational advantage for implementing Bayesian inference. The posterior distribution will be used to derive estimates and confidence intervals of ’s, in addition to the statistical significance measures such as the -value.

Let and denote indicator variables to which the th gene component belongs . The joint posterior distribution of is expressed as a mixture of differential and non-differential components: where . The weight is given in Section 3.3, and . The posterior distribution of the differential component is expressed as where The marginal posterior distribution of can be obtained as follows:

4. Applications

We illustrate our method using real data sets from two clinical studies with microarrays. Because previous simulation studies have indicated that the Storey’s method [10] is more powerful than many other multiple testing methods [10, 12, 35], we chose to use this method as a reference.

4.1. Prostate Cancer Example

Efron [36, 37] analyzed a modified gene expression data set from Singh et al. [38] with genes for samples; prostate cancer patients and healthy controls. For the proposed empirical Bayes approach, we placed the grid points for equally spaced on by 0.01 (except for 0; ). The hyperparameters were estimated as ,,,, and the estimated prior distribution for is given in Figure 1. The estimated prior distribution was skewed and multimodal. This indicates that analytically tractable parametric models (for example, the normal distribution) might be inadequate in this case.

Figure 2 presents the results of multiple testing by the method of Storey et al. [10] and the empirical Bayes method developed in this article. This plot summarizes the correspondence of the numbers of significant genes and -values [6, 7] for varying values of the cut-off . For a fair comparison, the same FDR-controlling method presented in Section 3.3 was applied to these two methods. For any levels of the -values, the number of significant genes identified by the proposed method was greater than that of the Storey et al. [10] method.

4.2. Lymphoma Example

Dave et al. [39] analyzed Affymetrix HG-U133A and U133B (Affymetrix, Santa Clara, CA) microarray data for 191 biopsy specimens from patients with untreated follicular lymphoma with . The data are available from the BRB-ArrayTools Data Archive for Human Cancer Gene Expression [40]. We compared patients who died within 5 years (poor prognosis; ) with patients who survived for more than 5 years (good prognosis; ).

For the maximum likelihood estimate of the semiparametric prior distribution, the hyperparameters were estimated as and. The estimated prior distribution for is given in Figure 3. (The grid points were placed on equally spaced by 0.01 except for 0; .) Compared with the prostate cancer data, large masses of the estimated are located in the regions of small , and thus the signal on differential expression contained in the data would be weaker than that in the prostate cancer data. In Figure 4, the relationships of the numbers of significant genes and their -values are presented. The number of significant genes identified by the empirical Bayes method was greater than that of the method of Storey et al. [10].

5. Discussion

For efficient gene screening using high-dimensional microarray data, multiple testing methodologies are effective tools for controlling false positives findings. Although the multiple testing can provide a relevant framework to ensure control of false positives for a set of significant genes (e.g., FDR [57]), researchers would like to find as many true positive genes as possible for further investigations. The optimal discovery procedure (ODP), introduced by Storey [8], would provide an efficient solution for this requirement. Storey et al. [10] provided desirable results in many numerical evaluations [10, 12, 35].

In this article, we proposed an empirical Bayes ODP method based on a semiparametric hierarchical model, through relaxing the parametric prior assumption in the empirical Bayes ODP method by Noma and Matsui [14]. Hierarchical mixture models and empirical Bayes methods form a basis for information sharing across genes and provide a framework for efficient multiple testing [1521]. Although empirical Bayes methods are more robust to prior misspecifications than eliciting a single prior, parametric empirical Bayes methods can suffer from a lack of robustness when the true prior is not from the assumed parametric family, as pointed out by Morris [28], such as a failure to incorporate the mixture structure or specify the form of the effect-size distribution . Actually, the estimated prior distribution in the prostate cancer example (Section 4.1) had a multimodal and skewed shape. In such examples, natural conjugate families might be inadequate for modeling an unknown random effects distribution. The “smoothing-by-roughening” approach [22, 23] is one of the flexible modeling methods in empirical Bayes inference [4143] and could be a basis for effective empirical Bayes inference for a broad range of situations. An extension of our method is to specify a nonparametric distribution for the random effect distribution of through applying the smoothing-by-roughening approach, although it requires substantial computation for estimating a large number of parameters and can suffer from unstable parameter estimates.

Previously, many efficient gene selection methods have been proposed for microarray studies, not only the multiple testing methodologies, for example, Bayesian ranking methods [20, 21, 43]. For evaluating practical values of these methods, comprehensive numerical investigations, including large-scale simulations and applications to many real datasets, would be a worthwhile subject.

Appendix

A. The EM Algorithm

The maximum likelihood estimate of the hyperparameters is obtained using the EM algorithm [31]. Regarding s, s, s, and s as missing variables, the log complete data likelihood is expressed as The first and second terms in the sum do not depend on the hyperparameters. Thus the objective function in each -step is given by where , . The superscripts indicate the number of the iterations. Solving the optimization problem, the -step is expressed as Also, the optimization problem for and is equivalent to the maximum likelihood estimation of the shape and scale parameters for the inverse gamma distribution. It can be solved simply by any numerical optimization methods provided in standard software (e.g., nlm or optim in R).

Acknowledgment

This work was supported by a Grant-in-Aid from Japan Society for the Promotion of Science (24800081).

Supplementary Materials

Small simulation results for evaluating the validity of the proposed method and its performances compared with the Storey et al. (Biostatistics 2007; 8: 347-368)’s method.

  1. Supplementary Material