Journal of Probability and Statistics

Volume 2016, Article ID 3937056, 7 pages

http://dx.doi.org/10.1155/2016/3937056

## Estimating the Proportion of True Null Hypotheses in Multiple Testing Problems

^{1}Manufacturing, Toxicology and Applied Statistical Sciences, Janssen Research & Development, Spring House, PA 19002, USA

^{2}Department of Mathematics and Statistics, Bowling Green State University, Bowling Green, OH 43403, USA

Received 26 July 2016; Revised 19 October 2016; Accepted 8 November 2016

Academic Editor: Shein-chung Chow

Copyright © 2016 Oluyemi Oyeniran and Hanfeng Chen. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The problem of estimating the proportion, $\pi_0$, of true null hypotheses in a multiple testing problem is important when large-scale parallel hypothesis tests are performed independently. While $\pi_0$ is a quantity of interest in its own right in applications, an estimate of $\pi_0$ can also be used for assessing or controlling an overall false discovery rate. In this article, we develop an innovative nonparametric maximum likelihood approach to estimating $\pi_0$. The nonparametric likelihood is restricted to multinomial models, and an EM algorithm is developed to approximate the estimate of $\pi_0$. Simulation studies show that the proposed method outperforms other existing methods. Using experimental microarray datasets, we demonstrate that the new method provides satisfactory estimates in practice.

#### 1. Introduction

Estimating the proportion of true null hypotheses, denoted $\pi_0$, in a multiple testing setup is crucial for assessing and/or controlling the false discovery rate, which is of great importance in genomics, disease discovery, and cancer discovery. Langaas et al. [1] remarked, “An important reason for wanting to estimate $\pi_0$ is that it is a quantity of its own right. In addition, a reliable estimate of $\pi_0$ is important when we want to assess or control multiple error rates, such as the false discovery rate FDR of Benjamini and Hochberg [2].” In testing for differential expression in DNA microarrays, the proportion of differentially expressed genes is $1 - \pi_0$, and it is important to know whether 5% or 35% of the genes, for example, are differentially expressed, even if we cannot identify these genes (see Langaas et al. [1]). Multiple testing refers to any instance that involves the simultaneous testing of several hypotheses. A common feature of genomic studies is the analysis of a large number of simultaneous measurements in a small number of samples. One must decide whether the findings are truly causative correlations or just the byproducts of multiple hypothesis testing (Gyorffy et al. [3]). If the multiplicity of tests is not taken into account, then the probability that some of the true null hypotheses are rejected by chance alone may be unduly large.
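To make the last point concrete, consider the simplest illustrative case, which is an assumption added here rather than a setting taken from the article: all $m$ null hypotheses are true, the tests are independent, and each is carried out at the same level $\alpha$. The probability of at least one false rejection is then

$$
P(\text{at least one true null hypothesis is rejected}) = 1 - (1 - \alpha)^m .
$$

For example, with $\alpha = 0.05$ and $m = 100$, this probability is $1 - 0.95^{100} \approx 0.994$, so an unadjusted analysis is almost certain to produce at least one false discovery.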

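In a multiple hypothesis testing problem, *m* null hypotheses are tested simultaneously; that is, we test $H_{0i}$ versus $H_{1i}$ for $i = 1, \ldots, m$, simultaneously. Assume that the *m* tests are constructed based on the observed *p* values, $p_1, \ldots, p_m$, respectively. The unknown quantity to be estimated is the proportion $\pi_0$ of true null hypotheses among $H_{01}, \ldots, H_{0m}$. Introduce the i.i.d. Bernoulli random variables $\Delta_1, \ldots, \Delta_m$ with $P(\Delta_i = 0) = \pi_0$. Then $\Delta_i$ can be interpreted in terms of the multiple testing problem as follows:

$$
\Delta_i =
\begin{cases}
0, & \text{if } H_{0i} \text{ is true}, \\
1, & \text{if } H_{0i} \text{ is false},
\end{cases}
$$

for $i = 1, \ldots, m$.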

We assume the *p* values, $p_1, \ldots, p_m$, are continuous and independent random variables, so that the *p* values are independently and identically distributed as $U(0, 1)$ when the null hypotheses are all true. One chooses to reject or fail to reject each null hypothesis based on the corresponding *p* value. Consequences of the tests are summarized in Table 1.
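The setup described above can be illustrated with a small simulation. The following is a minimal sketch, not the authors' method: it draws each hypothesis as a true null with a chosen probability (called `pi0` in the code), gives true nulls a uniform *p* value on (0, 1), and cross-tabulates truth against the reject decision. The function names, the Beta(0.5, 4) distribution used for *p* values under the alternative, and the fixed rejection threshold `alpha` are all assumptions made for illustration.

```python
import random


def simulate_p_values(m, pi0, seed=0):
    """Draw m p-values from a two-group mixture (illustrative assumption).

    Each null hypothesis is true with probability pi0. A true null yields a
    Uniform(0, 1) p-value; a false null yields a Beta(0.5, 4) p-value, an
    arbitrary choice that concentrates mass near zero.
    """
    rng = random.Random(seed)
    null_is_true = [rng.random() < pi0 for _ in range(m)]
    p_values = [
        rng.random() if is_null else rng.betavariate(0.5, 4.0)
        for is_null in null_is_true
    ]
    return p_values, null_is_true


def count_outcomes(p_values, null_is_true, alpha=0.05):
    """Cross-tabulate the truth of each null against the decision at level alpha."""
    counts = {
        ("true null", "rejected"): 0,
        ("true null", "not rejected"): 0,
        ("false null", "rejected"): 0,
        ("false null", "not rejected"): 0,
    }
    for p, is_null in zip(p_values, null_is_true):
        truth = "true null" if is_null else "false null"
        decision = "rejected" if p <= alpha else "not rejected"
        counts[(truth, decision)] += 1
    return counts


if __name__ == "__main__":
    p_values, null_is_true = simulate_p_values(m=10000, pi0=0.8, seed=42)
    for key, value in count_outcomes(p_values, null_is_true).items():
        print(key, value)
```

In a real analysis only the *p* values and the decisions are observed; the truth labels are hidden, which is why the proportion of true nulls has to be estimated rather than counted.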