Abstract

Detection of differentially expressed genes from expressed sequence tag (EST) data has received much attention. An empirical Bayesian method is introduced in which gene expression patterns are estimated and used to define detection statistics. Significantly differentially expressed genes are then declared on the basis of these statistics. Simulations are carried out to evaluate the performance of the proposed method, and two real applications are studied.

1. Introduction

It is important to detect differentially expressed genes, for example, when exploring the key genes related to certain diseases. As EST sequencing technology develops, a large number of EST databases from a variety of tissues have become available. These enormous EST collections provide opportunities to quantify gene expression levels [1], and efficient statistical methods are in great demand.

Several methods have been proposed to detect significantly differentially expressed (SDE) genes from EST data [2]. Fisher's exact test was used by the Cancer Genome Anatomy Project [3]. Audic and Claverie [4] developed a Bayesian method. The GT statistic [5] and the R statistic [6] were proposed for multilibrary comparison. In each method, a gene-specific detection statistic quantifies the difference in gene expression levels, and SDE genes are declared according to the ranking of these statistics.
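As a concrete illustration of one of the existing approaches, Fisher's exact test can be applied gene by gene to a 2x2 table of EST counts. The following is a minimal sketch, not the implementation used in [3]; the function name and its arguments are illustrative assumptions, and the two-sided p-value sums the probabilities of all tables no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(x1, n1, x2, n2):
    """Two-sided Fisher's exact test p-value for a single gene.

    x1, x2: EST counts of the gene in libraries 1 and 2 (hypothetical names).
    n1, n2: total numbers of ESTs sampled from libraries 1 and 2.
    """
    k = x1 + x2                      # ESTs of this gene across both libraries
    denom = comb(n1 + n2, k)

    def prob(a):
        # Hypergeometric probability that a of the k ESTs come from library 1.
        return comb(n1, a) * comb(n2, k - a) / denom

    lo, hi = max(0, k - n2), min(k, n1)
    probs = [prob(a) for a in range(lo, hi + 1)]
    p_obs = prob(x1)
    # Sum over all tables that are no more likely than the observed one;
    # the small relative tolerance guards against floating-point ties.
    p_value = sum(p for p in probs if p <= p_obs * (1 + 1e-9))
    return min(1.0, p_value)
```

Genes can then be ranked by ascending p-value, so that the smallest p-values correspond to the first declared SDE genes.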

An empirical Bayesian method is proposed to detect SDE genes. The relative gene expression abundances are estimated in each library, and a new detection statistic is derived for each gene. In Section 2, simulation experiments suggest that the proposed method outperforms the existing methods, and two real applications are studied. The statistical methods are described in Section 3. The possibility of extending the method to multiple libraries is indicated in Section 4.

2. Results

Let and be the gene expression patterns in two libraries, where is the relative abundance of gene in library . The absolute difference between relative abundances is . Given a sample of ESTs from library , an empirical Bayes estimator for is defined in Section 3. Given gene seen in both samples, define . Given gene seen in only one sample, for example, sample 2, define if and otherwise, which is conservative in the sense that possibly underestimates . Gene is declared to be SDE if is relatively large.

2.1. Simulation

In a simulation experiment, EST frequencies are generated from a multinomial distribution with sample size and probability vector , where , , from , from , and and are two distributions over . The proposed method, Fisher's exact test, test, AC statistic, and R statistic are studied. Given a cutoff point , the efficiency of a statistical method is measured by , the expected percentage of the true first SDE genes being correctly declared as the first SDE genes. The average of estimated is calculated from 500 replications.
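The simulation design described above can be sketched as follows. This is only an illustration under stated assumptions: the abundance vectors `p1` and `p2` are supplied by the caller rather than drawn from the distributions used in the experiments, all function names are hypothetical, and the detection statistic is a placeholder (the absolute difference of raw relative frequencies), not the empirical Bayes statistic of Section 3. The efficiency is computed as the fraction of the true top-k genes that also appear among the declared top-k genes, averaged over replications.

```python
import random
from collections import Counter


def sample_library(probs, n, rng):
    """Draw n ESTs from a multinomial over genes; return per-gene counts."""
    draws = rng.choices(range(len(probs)), weights=probs, k=n)
    counts = Counter(draws)
    return [counts.get(j, 0) for j in range(len(probs))]


def top_k(scores, k):
    """Indices of the k largest scores (ties broken by gene index)."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    return set(order[:k])


def efficiency(true_diff, stat, k):
    """Fraction of the true top-k genes recovered among the declared top-k."""
    return len(top_k(true_diff, k) & top_k(stat, k)) / k


def simulate(p1, p2, n1, n2, k, reps, seed=0):
    """Average efficiency over reps replications for a placeholder statistic."""
    rng = random.Random(seed)
    true_diff = [abs(a - b) for a, b in zip(p1, p2)]
    total = 0.0
    for _ in range(reps):
        x1 = sample_library(p1, n1, rng)
        x2 = sample_library(p2, n2, rng)
        # Placeholder statistic: difference of raw relative frequencies.
        stat = [abs(a / n1 - b / n2) for a, b in zip(x1, x2)]
        total += efficiency(true_diff, stat, k)
    return total / reps
```

Any other detection statistic can be compared in the same framework by replacing the line that computes `stat`.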

In the first four experiments, and the results are presented in Figure 1. Note that , , , and , , , respectively, where is the uniform distribution on , is degenerate at , is transformed from the beta distribution with shape parameters and by for , and is the gamma distribution with shape and scale . For each cutoff point are calculated. In these experiments, the proposed method performs better than the other methods.

In the second four experiments, , , and respectively, and the results are presented in Figure 2. Note that and in Figures 2(a) and 2(b) and and in Figures 2(c) and 2(d). The proposed method is usually the best one among all methods studied.

2.2. Real Applications

One example concerns the Chinese spring wheat drought-stressed leaf cDNA library (7235) and root cDNA library (#ASP), available from the TIGR gene indexes database (downloaded from http://www.tigr.org/tdb/tgi, 01/06/2006). The two EST samples contain 790 and 1306 sequenced ESTs, respectively. After removing the 103 and 194 unannotated ESTs, the annotated ESTs are clustered into 465 and 804 groups, with each group associated with a unique gene. Only the well-annotated ESTs are used. The first 20 SDE genes by the proposed method are listed in Table 1, among which 7, 7, 7, and 7 genes are in the set of the first 20 SDE genes by Fisher's exact test, test, AC statistic, and R statistic, respectively.

Another example concerns the comparison of Pinus gene expression levels between the root gravitropism April 2003 test library (#FH3) and the root control 2 (late) library (#FH4), also from TIGR, in which 2513 and 1132 ESTs associated with 1211 and 605 genes are well annotated and clustered. Table 2 lists the first 20 SDE genes by the proposed method, among which 4, 4, 5, and 3 genes are in the set of the first 20 SDE genes by Fisher's exact test, test, AC statistic, and R statistic, respectively.

3. Methods

Suppose that there are genes in a library. Let be the number of ESTs from gene , a Poisson variable with mean . Given a prior distribution on the , the posterior mean of is , where is a Poisson mixture. A gene is observed if and only if . Conditioning on , follows a zero-truncated Poisson mixture or a mixture of truncated Poisson distributions, where . Let be the odds that a gene is unseen. Write if and otherwise.
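For orientation, the standard Poisson-mixture identities behind this setup can be written out explicitly. The notation below ($X$ for the EST count of a gene, $\lambda$ for its Poisson mean, $G$ for the prior, $h_G$ for the mixture, $\theta$ for the odds of being unseen) is assumed for illustration and may differ from the authors' symbols; the formulas are textbook results, not a reconstruction of the authors' displayed equations.

```latex
% Poisson mixture (marginal distribution of the EST count X):
h_G(x) = \int_0^\infty \frac{e^{-\lambda}\lambda^{x}}{x!}\, dG(\lambda),
\qquad x = 0, 1, 2, \ldots

% Posterior mean of the Poisson mean given X = x:
E(\lambda \mid X = x) = \frac{(x+1)\, h_G(x+1)}{h_G(x)}

% Zero-truncated mixture for an observed gene (X > 0):
P(X = x \mid X > 0) = \frac{h_G(x)}{1 - h_G(0)},
\qquad x = 1, 2, \ldots

% Odds that a gene is unseen:
\theta = \frac{h_G(0)}{1 - h_G(0)}
```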

Let denote the number of genes with exactly ESTs in the sample. The nonparametric maximum likelihood estimator for is , whose calculation is discussed in [7]. It is difficult to estimate well [8]. There are lower bound estimators, for example, [9], where is the number of observed expressed genes. An empirical Bayes estimator for is . As the relative abundance satisfies , let , where
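The frequency counts used here are straightforward to compute from the per-gene EST counts. The sketch below also shows the classical Robbins estimate of a Poisson mean, which plugs the observed frequency counts directly into the posterior-mean formula for a Poisson mixture. It is included only to illustrate the empirical Bayes idea; it is not the estimator proposed in this section, which relies on a nonparametric maximum likelihood fit instead, and the function names are hypothetical.

```python
from collections import Counter


def frequency_counts(est_counts):
    """Map x to the number of observed genes with exactly x ESTs (x >= 1)."""
    return Counter(x for x in est_counts if x > 0)


def robbins_estimate(x, freq):
    """Classical Robbins estimate (x + 1) * n_{x+1} / n_x of the Poisson mean.

    freq is the mapping returned by frequency_counts. The estimate is
    undefined when no gene has exactly x ESTs, so that case raises an error.
    """
    n_x = freq.get(x, 0)
    if n_x == 0:
        raise ValueError("no gene observed with exactly x ESTs")
    return (x + 1) * freq.get(x + 1, 0) / n_x
```

The Robbins estimate can be erratic when the frequency counts are small or zero, which is one motivation for smoothing through an estimated prior.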

4. Discussion

A new statistical method is proposed to compare the gene expression patterns in two cDNA libraries. It can be extended to multilibrary comparison, for example, by considering all pairwise comparisons among multiple libraries [3].
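The pairwise extension mentioned above amounts to applying a two-library comparison to every unordered pair of libraries. A minimal sketch follows, assuming the libraries are stored as a mapping from a library name to per-gene EST counts and that `compare` is any two-library procedure; both names are hypothetical.

```python
from itertools import combinations


def pairwise_comparisons(libraries, compare):
    """Apply a two-library comparison to every unordered pair of libraries.

    libraries: dict mapping a library name to its per-gene EST counts.
    compare:   callable taking two count lists and returning any result.
    """
    return {
        (a, b): compare(libraries[a], libraries[b])
        for a, b in combinations(sorted(libraries), 2)
    }
```

With m libraries this produces m(m - 1)/2 comparisons, so some adjustment for multiple testing may be worth considering when m is large.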