Table of Contents Author Guidelines Submit a Manuscript
International Journal of Plant Genomics
Volume 2008 (2008), Article ID 892927, 12 pages
Research Article

Bayesian Mixture Model Analysis for Detecting Differentially Expressed Genes

1Department of Pathology & Laboratory Medicine, University of California, Irvine, CA 92697, USA
2Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, USA

Received 27 August 2007; Accepted 18 November 2007

Academic Editor: Nengjun Yi

Copyright © 2008 Zhenyu Jia and Shizhong Xu. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Control-treatment design is widely used in microarray gene expression experiments. The purpose of such a design is to detect genes that express differentially between the control and the treatment. Many statistical procedures have been developed to detect differentially expressed genes, but all have pros and cons and room is still open for improvement. In this study, we propose a Bayesian mixture model approach to classifying genes into one of three clusters, corresponding to clusters of downregulated, neutral, and upregulated genes, respectively. The Bayesian method is implemented via the Markov chain Monte Carlo (MCMC) algorithm. The cluster means of down- and upregulated genes are sampled from truncated normal distributions whereas the cluster mean of the neutral genes is set to zero. Using simulated data as well as data from a real microarray experiment, we demonstrate that the new method outperforms all methods commonly used in differential expression analysis.

1. Introduction

Current microarray technology allows us to measure the expression of thousands of genes simultaneously in a small chip. Many advanced statistical methods have been developed to analyze data generated from this technology. For example, a Bayesian model has been proposed to normalize gene expression data and select genes that are differentially expressed [1]. Jörnsten et al. [2] developed a Bayesian method to impute missing values prior to any statistical analysis. Efron et al. [3], Broët et al. [4], Edwards et al. [5], Do et al. [6] used a Bayesian approach to identify differentially expressed genes. Methods of [79] are examples for Bayesian clustering analysis.

In this study, we focus on detecting differentially expressed genes in two-sampled designs, in which only two conditions (control and treatment) are examined and each condition is replicated several times. The data are assumed to have been properly imputed and normalized as needed. Much effort has been made on this study to detect genes whose expression levels respond to the treatment.

Numerous methods have been suggested based on the P-values of a test statistic. The 𝑃-values are obtained from separate tests and reported for all genes [10, 11]. Cui and Churchill [12] reviewed the test statistics for differential expression for microarray experiments. The gene-specific summary statistic provides a basis for the rank ordering of genes and the creation of a short list of genes declared as differentially expressed. However, genes are studied separately in the sense that calculating the test statistic for one gene does not use information from other genes. In limma (linear model for microarray data), Smyth [13] cleverly borrowed information from the ensemble of genes to make inference for individual gene based on the moderate t-statistic. Some other researchers also took advantages of shared information by examining data jointly. Efron et al. [3] proposed a mixture model methodology implemented via an empirical Bayes approach. For each gene, the expression levels were mapped to a single 𝑧-score, which is assumed to arise from two distributions (affected and unaffected). Pan [14] used a mixture model to cluster genes based on gene specific 𝑡-statistic. They are examples where the idea of clustering was first used in differential expression analysis (clustering methods are usually applied to microarray data generated from multiple conditions). Similarly, the methods of [46, 15] all used a mapped quantity, like the 𝑧-score or the 𝑡-statistic, to classify genes into different groups. These methods are different from the proposed Bayesian mixture model in that genes were clustered based on a summary statistic for these methods while our Bayesian method clusters genes based on the pattern of expression. It is more desirable to use the expression profile than to use a summary score for cluster analysis because information is bound to be lost when mapping the expression profile into a single score. Recently, Newton et al. [16] developed a new method in which gene expression are directly used for classification. Newton et al. [16] used a hierarchical model that consists of two levels. The first level of the model describes the conditional distributions of the observed measurement of gene expression given the means of the control (𝜇1𝑖) and the treatment (𝜇2𝑖), where 𝑖 indexes the gene. The second level describes the distribution of 𝜇1𝑖 and 𝜇2𝑖 jointly by a mixture of three multivariate distributions with constrained parameters of 𝜇1>𝜇2, 𝜇1=𝜇2, and 𝜇1<𝜇2. Genes become linked by virtue of having 𝜇1𝑖 and 𝜇2𝑖 drawn from a common distribution. Parameter estimation was accomplished via the Expectation-Maximization (EM) algorithm [17].

For the Bayesian clustering method presented in this study, the observed gene expression levels are described by a regression model as done by Gusnanto et al. [15]. For each gene, the irrelevant intercept is removed by a special normalization scheme. The slope of the regression represents the difference of expression under the two conditions. The Bayesian method is implemented via the Markov chain Monte Carlo (MCMC) algorithm. The regression coefficient of each gene is assumed to be sampled from a mixture of three normal distributions with constrained parameters. The three distributions are 𝑁(𝛽𝑘,𝜈𝑘) for 𝑘=1,2,3, where 𝛽1<0, 𝛽2=0, and 𝛽3>0 are the constrained parameters. The proposed new method actually turns the problem of a complicated multivariate mixture distribution of Newton et al. [16] into that of a univariate mixture distribution.

The new Bayesian method proposed in this study (Method I) is compared to five different methods that are commonly used in differential expression analysis. These four methods, called Methods II, III, IV, V, and VI respectively, are described below. Method II is the regularized 𝑡-test (SAM, Tusher et al. [10]), in which a 𝑡-like score is calculated for each gene. Genes with scores greater than an adjusted threshold are said to have altered expressions under the treatment. Permutations are used to calculate the false discovery rate (FDR, Benjamini and Liu [18]). Method III is the model-based cluster analysis of Pan [14] where the variable used for clustering is the 𝑡-test statistic. Genes that are assigned into a cluster with a mean 𝑡 value significantly different from zero are declared to be differentially expressed. Method IV is the hierarchical mixture model method of Newton et al. [16], where the expected expression values of the 𝑖th gene under the control (𝜇1𝑖) and the treatment (𝜇2𝑖) are assumed to arise from a mixture of three distributions. Clustering is actually conducted based on the latent parameters (𝜇1𝑖 and 𝜇2𝑖). The cluster of genes with 𝜇1𝑖𝜇2𝑖 are declared as differentially expression genes. A short list of significant genes is created based on FDR at 0.05. Method V is the linear model and empirical Bayes method (limma) of Smyth [13]. In limma, a hierarchical linear model is used to describe the gene expression levels. A moderated 𝑡-statistic, with extra information borrowed from other genes, is calculated for each gene. Adjusted 𝑃-values are calculated for genes based on the moderated 𝑡-statistics to achieve FDR control. Method VI is the mixture mixed model analysis of Gusnanto et al. [15], where a similar linear model is used to describe gene expressions. However, the variable used for clustering is the mean difference between the treatment and the control. The mathematical differences between the proposed method and the four methods compared will be evident after the new method is introduced. Advantages of the new method over the four methods will be demonstrated in Section 3 and further addressed in the Section 4.

2. Methods

2.1. Linear Model

Let 𝑌𝑖𝑗(𝑖=1,,𝑚and𝑗=1,,𝑁) be the log transformed expression level for gene 𝑖 from chip 𝑗, where 𝑚 is the total number of genes and 𝑁 is the total number of chips. Let 𝑋𝑗 be a dichotomous variable indicating the status of the chip with 𝑋𝑗=0 for the control and 𝑋𝑗=1 for the treatment. We use the following regression model to describe 𝑌𝑖𝑗: 𝑌𝑖𝑗=𝛼𝑖+𝑋𝑗𝛾𝑖+𝜀𝑖𝑗,(1) where 𝛼𝑖 is the intercept of the regression model, 𝛾𝑖 is the regression coefficient and 𝜀𝑖𝑗 is the residual error. Note that the regression coefficient represents the difference in gene expression between the control and the treatment (𝜇2𝜇1). A large 𝛾𝑖 implies that gene 𝑖 is differentially expressed. The intercept 𝛼𝑖 represents the mean expression level of gene 𝑖 in the control (𝜇1). The intercept 𝛼𝑖 is irrelevant to the differential expression analysis and thus should be removed from the model (see next section for detail). The regression coefficient 𝛾𝑖 is assumed to be sampled from one of three normal distributions, corresponding to the downregulated genes, the neutral genes, and the upregulated genes, respectively. The residual error is assumed to be 𝑁(0,𝜎2) distributed. In matrix notation, model (1) can be expressed as 𝑌𝑖=𝑊𝛼𝑖+𝑋𝛾𝑖+𝜀𝑖,(2) where 𝑌𝑖={𝑌𝑖𝑗}𝑁𝑗=1 is a column vector for the expressions of gene 𝑖 across all samples, 𝑋={𝑋𝑗}𝑁𝑗=1 is the incidence matrix (a column vector) for the linear model, 𝜀𝑖={𝜀𝑖𝑗}𝑁𝑗=1 is a vector for the residual errors, and 𝑊 is an 𝑁×1 unity vector, that is, a vector of 1 for all elements.

2.2. Normalization

As stated earlier, 𝛼𝑖 is irrelevant to the differential expression analysis. We propose to remove the effect of 𝛼 by using a linear contrast. This process, which is called normalization, is similar to the way of removing the fixed effects in the restricted maximum likelihood (REML) method for variance component analysis [19]. For each gene, we subtract the observed expression level by the average level across all chips, that is, 𝑌𝑖=𝑌𝑖𝑊𝑌𝑖,(3) where 𝑌𝑖.=𝑁1𝑁𝑗=1𝑌𝑖𝑗 is the mean expression for gene 𝑖 across all the chips. The mean expression can also be expressed in matrix notation as 𝑌𝑖.=𝑊𝑇𝑊1𝑊𝑇𝑌𝑖..(4) Therefore, the normalized expression is described as a linear combination of the original expressions 𝑌𝑖=𝑊𝐼𝑊𝑇𝑊1𝑊𝑇𝑌𝑖=𝐿𝑌𝑖,(5) where 𝑊𝐿=𝐼𝑊𝑇𝑊1𝑊𝑇.(6) We can prove that 𝐿𝑊=0. Therefore, 𝑌𝑖=𝐿𝑌𝑖=𝐿𝑋𝛾𝑖+𝐿𝜀𝑖.(7) Let 𝑋=𝐿𝑋 and 𝜀𝑖=𝐿𝜀𝑖, we have 𝑌𝑖=𝑋𝛾𝑖+𝜀𝑖.(8) Now the intercept has been removed by using linear contrasts 𝐿𝑌𝑖 in place of the original 𝑌𝑖. The residual variance-covariance matrix of the linear contrasts is Var(𝜀𝑖)=𝐿𝐿𝑇𝜎2. Note that the rank of matrix 𝐿𝐿𝑇 is 𝑁𝑟(𝑊)=𝑁1, indicating that the inverse of matrix 𝐿𝐿𝑇 does not exist. The model-based cluster analysis, however, requires the inverse. Therefore, we delete the last row of matrix 𝐿 to form a new (𝑁1)×𝑁 matrix, called 𝐿, to build the linear contrasts. This treatment is equivalent to deleting the last element of 𝑌𝑖, that is, the dimension of 𝑌𝑖 is (𝑁1)×1. The dimensions of matrix 𝑋 and 𝜀𝑖 also change into (𝑁1)×1 accordingly. Let 𝑅=𝐿(𝐿)𝑇 be an (𝑁1)×(𝑁1) matrix of full rank. We now have Var(𝜀𝑖)=𝑅𝜎2, which will be used in the model-based cluster analysis.

To simplify the notation, let us define 𝑦𝑖=𝑌𝑖, 𝑥=𝑋 and 𝜖𝑖=𝜀𝑖, all with a dimension of (𝑁1)×1. The new model with the intercept removed becomes 𝑦𝑖=𝑥𝛾𝑖+𝜖𝑖.(9) Note that the special way of normalization described above only applies to linear effects. Different methods should be used for adjusting nonlinear effects. In subsequent analysis, we assume that normalization via linear contrasts has been conducted and thus model (9) will be the basis for parameter estimation and statistical inference.

2.3. Mixture Model and the Bayesian Setup

Conditional on 𝛾𝑖, the probability density of 𝑦𝑖 is 𝑝𝑦𝑖𝛾𝑖,𝜎2=𝜙𝑛𝑦𝑖;𝑥𝛾𝑖,𝑅𝜎2,(10) where 𝑛=𝑁1, 𝜙𝑛(𝑦𝑖;𝑥𝛾𝑖,𝑅𝜎2) is the 𝑛-dimensional normal density of variable 𝑦𝑖 with mean 𝑥𝛾𝑖 and variance-covariance matrix 𝑅𝜎2. The gene-specific effect 𝛾𝑖 is assumed to be sampled from one of three normal distributions (clusters), 𝑁(𝛽𝑘,𝜈𝑘) for 𝑘=1,2,3. We define variable 𝑧𝑖 as the cluster assignment of gene 𝑖, where 𝑧𝑖=𝑘 if gene 𝑖 comes from cluster 𝑘. For notational simplicity, we define 𝜂(𝑧𝑖,𝑘) for 𝑘=1,2,3 as a redundant variable to 𝑧𝑖, where 𝜂(𝑧𝑖,𝑘)=1 if 𝑧𝑖=𝑘 and 𝜂(𝑧𝑖,𝑘)=0 otherwise. Let 𝜂(𝑧𝑖)={𝜂(𝑧𝑖,𝑘)}3𝑘=1 for 𝜂(𝑧𝑖,𝑘)=0,1 and 3𝑘=1𝜂(𝑧𝑖,𝑘)=1. The new variable 𝜂(𝑧𝑖) will be used in place of 𝑧𝑖 in subsequent derivation. Let 𝛽={𝛽𝑘}3𝑘=1 and 𝜈={𝜈𝑘}3𝑘=1. The density of the mixture distribution of 𝛾𝑖 is 𝑝𝛾𝑖𝑧𝜂𝑖=,𝛽,𝜈3𝑘=1𝜂𝑧𝑖𝜙,𝑘1𝛾𝑖;𝛽𝑘,𝜈𝑘.(11) Note that variable 𝜂(𝑧𝑖,𝑘) is unknown and it is treated as missing value. In fact, inferring the probability distribution of 𝜂(𝑧𝑖,𝑘) is the main purpose of the proposed mixture model analysis. Define 𝜋={𝜋𝑘}3𝑘=1 as the mixing proportions of the three components for 𝜋𝑘0 and 3𝑘=1𝜋𝑘=1. The distribution of 𝜂(𝑧𝑖) is multinomial with one observation, 𝑝𝜂𝑧𝑖=𝜋3𝑘=1𝜋𝜂𝑧𝑖,𝑘𝑘.(12)

We have introduced the probability densities of the data, the regression coefficients, and the missing values for a single gene. We now combine the densities of individual genes to form the joint density of all genes. The probability density of the data 𝑦={𝑦𝑖}𝑚𝑖=1 is 𝑝𝑦𝛾,𝜎2=𝑚𝑖=1𝑝𝑦𝑖𝛾𝑖,𝜎2.(13) The density of 𝛾={𝛾𝑖}𝑚𝑖=1 is 𝑝(𝛾𝜂,𝛽,𝜈)=𝑚𝑖=1𝑝𝛾𝑖𝑧𝜂𝑖,𝛽,𝜈.(14) The density of 𝜂={𝜂(𝑧𝑖)}𝑚𝑖=1 is 𝑝(𝜂𝜋)=3𝑘=1𝜋𝑚𝑘𝑘,(15) where 𝑚𝑘=𝑚𝑖=1𝜂(𝑧𝑖,𝑘) for 3𝑘=1𝑚𝑘=𝑚. The joint density of (𝑦,𝛾,𝜂) is 𝑝(𝑦,𝛾,𝜂𝜃)=𝑝𝑦𝛾,𝜎2𝑝(𝛾𝜂,𝛽,𝜈)𝑝(𝜂𝜋),(16) where 𝜃=(𝜎2,𝛽,𝜋,𝜈) is the parameter vector.

The next step of the analysis is to find a suitable prior for 𝜃. The following vague prior is chosen for each of the variance components, 𝜎2Inv𝜒2(0,0) and 𝜈𝑘Inv𝜒2(0,0) for 𝑘=1,2,3. They are called the scaled inverse chi-square distributions with zero degrees of freedom and zero-scale parameter. The actual forms of the priors are 𝑝(𝜎2)=1/𝜎2 and 𝑝(𝜈𝑘)=1/𝜈𝑘 for 𝑘=1,2,3. The scaled inverse chi-square distribution is conjugate, and thus the posterior distribution of the variance is also scaled inverse chi-square. The three clusters of genes are distinguished by the overall patterns of responses to the treatment. The first cluster consists of all the downregulated genes, the second cluster represents all the neutral genes, and the third cluster contains all the upregulated genes. The characteristics of the three clusters can be represented by enforcing the following constraints on the means of the three clusters: 𝛽1<0, 𝛽2=0, and 𝛽3>0. Therefore, the prior distributions of the means of the three clusters are 𝑝(𝛽𝑘)=1\ (𝛽𝑘Ω𝑘). Here, we adopt a special notation to indicate that 𝑝(𝛽𝑘)=1 if (𝛽𝑘Ω𝑘) and 𝑝(𝛽𝑘)=0 if (𝛽𝑘Ω𝑘), where Ω1(𝛽1<0), Ω2(𝛽2=0), and Ω3(𝛽3>0). The prior for 𝜋 is the multivariate generalization of the Beta distribution 𝑝(𝜋𝛿)=𝐷(𝜋𝛿,𝛿,𝛿), called the Dirichlet distribution, where 𝛿=1 is used in this study. The joint prior for all the parameters is 𝑝(𝜃)=(1/𝜎2)3𝑘=1(1/𝜈𝑘). The posterior distribution of all the unknowns 𝑝(𝛾,𝜂,𝜃𝑦) is proportional to the joint distribution of all variables, 𝑝(𝛾,𝜂,𝜃𝑦)𝑝(𝑦,𝛾,𝜂,𝜃)=𝑝𝑦𝛾,𝜎2𝑝(𝛾𝜂)𝑝(𝜂𝜋)𝑝(𝜃).(17) Statistical inference of this distribution is the theme of the Bayesian analysis. There is no explicit form for the joint distribution (17). Therefore, we draw observations of the unknowns from the conditional distributions. Fortunately, with the current Bayesian setup, the conditional distribution of any single variable, given all other variables, has a known distribution. Therefore, the MCMC process can be proceeded exclusively using the Gibbs sampler.

2.4. Markov Chain Monte Carlo

The detailed steps of the Markov chain Monte Carlo (MCMC) process are described as follows.

Step 1. Set 𝑡=0 and initialize all variables (𝛾(𝑡),𝜂(𝑡),𝛽(𝑡), 𝜎2(𝑡),𝜈(𝑡),𝜋(𝑡)).

Step 2. Simulate 𝛾𝑖(𝑡+1) from 𝛾𝑖𝑁(𝛾𝑖,𝑠2𝛾𝑖) conditional on 𝑧𝑖=𝑘, where 𝑡 where all variables in the right-hand side take the most current values. In this step, the most current values of the variables in the right-hand side are the values at iteration 𝑡. This statement also holds for subsequent steps, except that the most current values of the variables in the right-hand side are values at iteration 𝑡+1 or 𝜂(𝑧𝑖)(𝑡+1), depending on whether the variable has been updated or not in the current sweep.

Step 3. Simulate 𝜂𝑧𝑖Multinomial1,𝜋𝑖1,𝜋𝑖2,𝜋𝑖3,(19) from 𝜋𝑖𝑘 a multinomial distribution with one observation, where 𝜋𝑖𝑘=𝜋𝑘𝑝𝛾𝑖𝑧𝜂𝑖,𝑘,𝛽𝑘,𝜈𝑘3𝑘=1𝜋𝑘𝑝𝛾𝑖𝑧𝜂𝑖,𝑘,𝛽𝑘,𝜈𝑘.(20) is

Step 4. Simulate 𝛽𝑘𝑁𝛽𝑘,𝑠2𝛽𝑘1𝛽𝑘Ω𝑘.(21) from a truncated normal distribution 𝛽𝑘=1𝑚𝑖=1𝜂𝑧𝑖,𝑘𝑚𝑖=1𝜂𝑧𝑖𝛾,𝑘𝑖,𝑠2𝛽𝑘=𝜈𝑘𝑚𝑖=1𝜂𝑧𝑖,,𝑘(22) The mean and the variance of the normal distribution are 𝛽2=0 respectively. Note that 𝛽2 is enforced and no sampling for 𝛽1 is required. The general inverse transformation method [20] is used to sample 𝛽3 and 𝜎2(𝑡+1) from the corresponding truncated normal distributions.

Step 5. Simulate 𝜎2Inv𝜒2𝑦(𝑚𝑛,𝑖𝑥𝛾𝑖𝑇𝑅1𝑦𝑖𝑥𝛾𝑖),(23) from 𝑚×𝑛 where 𝜈𝑘(𝑡+1) is the degree of freedom, and the term with summation is the scale parameter of the inverse chi-square distribution.

Step 6. Simulate 𝜈𝑘Inv𝜒2(𝑚𝑘,𝜂𝑧𝑖𝛾,𝑘𝑖𝛽𝑘2),(24) from 𝑚𝑘=𝑚𝑖=1𝜂(𝑧𝑖,𝑘) where 𝜋(𝑡+1).

Step 7. Simulate 𝑚𝜋Dirichlet1+1,𝑚2+1,𝑚3+1.(25) from 𝑡

So far, every variable has been updated. We then increment (𝑡=𝑡+1) by 1𝑖 and repeat Steps 37 until the chain gets sufficiently long to allow reliable post-MCMC inference about the parameters of interest. Gene𝑘 is assigned to group 𝜂(𝑧𝑖,𝑘)=max{𝜂(𝑧𝑖,𝑘)}3𝑘=1 if 𝜂(𝑧𝑖,𝑘), where 𝜂(𝑧𝑖,𝑘) is the posterior mean of 𝑚=1000 calculated from the posterior sample.

Schemes for sampling variables from the aforementioned distributions are discussed by Gelman et al. [21].

3. Applications

3.1. Analysis of Simulated Data

We simulated the expression of 𝑁=16 genes from six different groups on (𝑚1=20) microarray chips. Chips 1–8 represent the control, and chips 9–16 represent the treatment. The true parameter values of the six groups of genes are shown in Table 1. If we ignore the intercepts, the six groups of genes really come from three functional clusters: Cluster 1 represents the (𝑚2=960) downregulated genes (Group 3); Cluster 2 contains the (𝑚3=20) neutral genes (Groups 2 and 4); and Cluster 3 consists of the 𝛽 upregulated genes (Groups 1, 5, and 6). The true parameters of the original six groups and the parameters for the combined three clusters are given in Table 1. Note that there are actually three components (Group 1, 5, and 6) for the combined Cluster 3. The weighted 𝛽 of three simulated components is treated as the true 1.2×0.25+0.8×0.25+0.9×0.5=0.95 for this cluster, that is, 𝛼𝑘. The expression patterns are shown in Figure 1 for the original six groups and Figure 2 for the combined three clusters.

Table 1: Parameters used in the simulation experiment and their estimates from the Bayesian mixture model analysis.
Figure 1: Original expression patterns for the six simulated groups of genes. In each plot, the first half represents the observed data from chips 1–8 and the second half represents the observed data from chips 9–16.
Figure 2: Expression patterns for the three combined clusters after normalization. See Figure 1 for the legends.

The data were analyzed using the Bayesian mixture model analysis reported in this study (Method I). We set the number of iterations equal to 12000 in the MCMC process. The results generated from the first 2000 iterations were discarded for the burn-in period. For the remaining 10000 iterations, we saved one observation in every 20 iterations to reduce series correlation between consecutive observations. Therefore, the posterior sample contained 500 observations. The posterior mean of each variable is considered as the Bayesian estimate. Hypothesis tests were only performed on the cluster means 𝛽 rather than on individual genes. Genes assigned into a significant cluster were simultaneously declared to be differentially expressed. From Table 1, we can see that the estimated parameters agree well with the true parameters. Genes are correctly assigned into three clusters except that two genes in Cluster 2 (the neutral cluster) are incorrectly assigned into the other two clusters (down- and upregulated clusters) (see Table 2). Upon enforcing constraints on the cluster means, we are able to fix the number of clusters to three, rather than finding the optimal number of clusters using the Bayes factor or its BIC [22] approximation.

Table 2: Numbers of genes assigned into each of the three clusters for six different methods of differential expression analysis. (The sum of each column within a method represents the true number of genes simulated from that cluster and the sum of each row represents the number of genes assigned into that cluster.)

An interesting question is what would happen if the cluster means are not constrained? To answer this question, we also analyzed the simulated data with unconstrained 𝑁(𝑘,𝑘). We varied the number of clusters from 3 to 6 and found out that 4 was the optimal number of clusters (maximum value of Bayes factor). One cluster mean was close to zero. Another cluster had a negative mean. The third and fourth clusters all had positive means but the two means were almost identical. Therefore, Clusters 3 and 4 may be well considered as a single cluster. When treated as three clusters, the results were similar to those reported earlier when the cluster means were restricted (data now shown).

For comparison, the data were also analyzed using the five methods (Methods II–VI) described in Section 1. The numbers of genes classified into each of the three clusters are given in Table 2 for all the six methods, including Method I (the new method). Cutoff value 0.05 was used for gene detection in Method II and Method V to achieve FDR control. All methods have successfully found the 40 truly differentially expressed genes. However, Methods II, III, IV, V, and VI detected more false-positive genes than Method I. The model-based cluster analysis of Pan (Method III) required BIC to decide the optimal number of clusters. For simplicity of comparison, we only used the result of Method III under three clusters. We noticed that Method III always put the down- and upregulated genes into a cluster with a large variance, but placed the neutral genes into the other two cluster with small variances. The two clusters with small variances can be combined as a single cluster. Because all the methods have successfully detected the 40 nonneutral genes, none of them have suffered the Type II error. However, different methods tended to have different numbers of genes from the neutral cluster misplaced into the functional clusters. This means that different methods have different Type I errors.

Let 𝑘 be the number of genes from cluster 𝑘 assigned into cluster 𝑘,𝑘=1,2,3 for 𝑁(𝑘)=3𝑘=1𝑁(𝑘,𝑘). Then 𝑘 is the number of genes in cluster TypeIerror=𝑁(2,1)+𝑁(2,3)𝑁(2).(26). Recall that Clusters 1 and 3 contain differentially expressed genes and Cluster 2 is reserved for the neutral genes. The empirical Type I error is then defined as TypeIIerror=𝑁(1,2)+𝑁(3,2)𝑁(1)+𝑁(3).(27) The empirical Type II error is defined as Power=1TypeIIerror.(28) The empirical power is defined as 𝛽3 The empirical Type I errors for all the five methods are listed in Table 2. Method I (the new method) have the smallest Type I error. Note that this method of validation is valid only for simulation study where the true gene assignment is known.

3.2. Analysis of Mice Data

We analyzed a cDNA microarray dataset published by Dudoit et al. [11]. The data are publicly available on the website ( The data consist of expression measurements of 6342 genes from a study of lipid metabolism in mice [23]. This experiment was set up to find genes that are differentially expressed in the livers of mice with very low HDL cholesterol levels (treatment) in contrast to a group of normal inbred mice (control). The treatment group consists of eight scavenger receptor BI (SR-BI) transgenic mice and the control group consists of eight normal inbred mice.

We first analyzed the data with the Bayesian mixture model method proposed in this study (Method I). We used exactly the same setup of the MCMC as that used in the simulation experiment. The estimated parameters are given in Table 3. Only 𝜋1=0.1724 (cluster mean of up-regulated genes) is significantly different from zero. The mixing proportions of the three clusters show that majority of the genes (82.46%) are neutral. Note that the estimated mixing proportions (𝜋2=0.8246, 𝜋3=0.0030), 𝑚1=0 are some nuisance parameters, which do not necessarily match the actual numbers of genes assigned to the three clusters (𝑚2=6332, 𝑚3=10, 𝛽𝑘). Report should be made based on the actual numbers of genes assigned to the clusters.

Table 3: Parameters estimated for the mice data using the Bayesian mixture model analysis.

All six methods that were used to analyze the simulated data were applied here to analyze the mice data. The numbers of genes assigned to the three clusters are given in Table 4 for all the five methods. Clearly, Methods I and VI were similar to each other and Methods II, III, IV, and V placed much more genes into the nonneutral clusters, a phenomenon which has been observed early in the simulation experiment.

Table 4: Numbers of genes assigned to the three clusters for the mice data for six different methods.

In the original study [23], individual 𝑡-values were calculated based on the 𝑃-test statistics for all the genes and then the adjusted -values were calculated using the method described in Westfall and Young [24]. Five genes (represented by eight array elements) were detected by Callow et al. [23] as differentially expressed between the treatment and the control (see Table 5). Note that several array elements may represent the same gene. We found that gene EST AI746730 is not included in the dataset and thus was not analyzed with any of the methods (marked as NA in Table 5).

Table 5: Some differentially expressed genes for the mice data detected by six different methods.

The Bayesian mixture model analysis detected 10 genes (or spots), four of which were reported in Callow et al. [23]. The expressions of the 10 genes are plotted in Figure 3 (the first two rows). Significant differences between the control and treatment groups can be easily seen. Note that element 2374 (the first plot of the second row of Figure 3) was a “blank” spot on the chip. However, it was detected by all methods due to the remarkable difference between the control and the treatment. The third row of the plots represents five randomly selected genes which were not detected by any of the methods. Genes of this kind have the same expression levels under the control and treatment. The last row of the plots (in Figure 3) shows the expression patterns of five genes that were not detected by the Bayesian mixture model analysis (Method I), but detected by the other five methods (Methods II, III, IV, V, and VI). When examining these five genes, we found that the differences between the treatment and the control are not obvious. The statistical significance may be caused entirely by the outliers. Therefore, the five methods (II, III, IV, V, and VI) are perhaps too sensitive to outliers. For example, the first plot (5203) of the last row in Figure 3 is the spot that represents 𝛽>0-globin in the dataset. We cannot tell much difference between the control and the treatment. Genes with numbers 2375, 2377, 2379, and 2384 are all “blank” spots. The observed differences between the control and the treatment are purely caused by chance, yet these blank spots were detected by Methods II, III, IV, V, or VI. Finally, Method III [14] and Method V [13] missed gene SR-BI at array element number 3, which is known to have altered the expression between the control and the SR-BI transgenic mice.

Figure 3: Expression patterns of some genes from the mice data. The first two rows (representing ten genes) are genes detected by all methods. The third row (five genes) are genes detected by none of the methods. The last row (five genes) are those detected by Methods II–VI but not by Methods I.

When we examined the plots of the ten genes detected by the Bayesian mixture model analysis, we found that seven of them have increased the expression level by the treatment while three of them have decreased the expression level by the treatment. Interestingly, our method assigned the three genes with negative regression coefficients into the cluster with positive regression coefficients. If we look at the estimated parameters (Table 3) again, we realized that both Clusters 1 and 2 may be considered as the neutral cluster (mean close to zero and variance very small). The third cluster contains both the up-and downregulated genes because it has a much larger variance than the other two clusters. The same phenomenon also occurred for Method VI [15], but the issue was not discussed by Gusnanto et al. [15]. A possible explanation is that there are too few genes with negative regression coefficients, which made them hard to form a single cluster by themselves. However, why were these three genes assigned into Cluster 3 (𝛽=0) instead of Cluster 2 (𝜈), which is closer to the three genes in terms of the cluster mean? The reason may be that Cluster 3 has a larger variance than Cluster 2.

A separate simulation experiment has been conducted to verify the notion that 𝛽 plays a more important role than 𝜋𝑖𝑘 in calculating the posterior probability of cluster assignment 𝑡 (data not shown).

4. Discussion

Similar to Method IV [16], our method is based on a hierarchical model in which the parameters of interest (gene effects) are further described by a mixture distribution. Clustering is made based on the parameter rather than on a summary statistic such as the 𝑃-like statistical score. This allows us to incorporate the error structure of the gene expression profile into the linear model (see (10)), and thus capture the most information from the data. This explains why our method is different from (or even better than) the other four methods (Methods II, III, V, and VI). But why is our method better than (or different from) Method IV? This may be contributed by the two differences between Methods I and IV. One difference is that the incorporated normalization scheme in our model allows us to deal with only the effect of differential expression (regression coefficient) whereas Method IV deals with effects of gene expressions in both the control and the treatment. In other words, we are dealing with a single variable (regression coefficient, the only parameter of interest) but Method IV deals with two variables (intercept and regression coefficients). The dimension reduction from two to one and the simplified Gaussian mixture distribution of our model may largely contribute to the higher efficiency of our method. The second difference between our method and Method IV is that we used the probabilities of cluster assignment to classify genes into three clusters, and genes assigned into the neutral cluster are excluded from the list of differentially expressed genes, but Method IV goes one step further by employing an FDR criterion to select a list of differentially expressed genes. It is obvious that the FDR generated list of genes depends largely on the subjective cutting rule set by the investigator. We consider that the extra step of FDR analysis after gene clustering is not only redundant but also leads to subjective decision for differential expression analysis.

Method II [10] sorts genes based on the 𝑡-value for a regularized 𝑃-statistic calculated from each gene. Information from other genes plays no role in calculating the 𝑡-value for the current gene of interest. This, combined with the subjectiveness of the cutting rule for gene selection, may largely explains the difference between Method II and the proposed new method.

Method III [14] uses a 𝑡-statistic as the data point for cluster analysis. The original gene expression profile is converted into the 𝑡 score. Some information may be lost during the conversion because the method fails to incorporate the error for calculating the 𝑡 score. In addition, the 𝑡 score is the differential gene expression divided by the standard deviation of the difference. However, the calculated standard deviation is subject to large error when the number of replicates per condition is small, which is usually the case in microarray experiments.

Similar to Method II, Method V [13] creates a list of significant genes based on the moderated 𝑋-statistic. Extra information is borrowed, on the basis of the hierarchical model, from the ensemble of genes which can assist in inference about each gene individually. However, users need to subjectively specify the mixing proportions before the algorithm is applied. The authors pointed out that the estimations of mixing proportions are somewhat unstable and suggested using 0.01 as a universal prior. In real mice data, the proportion of differentially expressed genes is about 0.002, which is overestimated by limma. This may explain why limma detected too many false positives both in simulation study and mice data analysis.

Method VI [15] is quite similar to the proposed method except that the observed differential expression between the control and the treatment is used for cluster analysis instead of using the expected differential expression (parameter) as the basis for clustering. Again, the error structure of the expression profile is not properly incorporated into the model when the normalization-like procedure is used. This explains why the proposed new method outperforms Method VI. In addition, Method VI is based on the EM algorithm, while the proposed method is based on the MCMC algorithm. The EM algorithm sometimes may stuck in the so-called local optimal, while the MCMC has reasonable chance to jump out of the “trap.” This may explain why Method VI failed in the situation where the borders of clusters are fuzzy with the effects of differentially expressed genes varying within a wide span, rather than a narrow span (data not shown).

In a quantitative trait-associated microarray data analysis, Jia and Xu [25] classified genes into several clusters based on the association of gene expression and the phenotypic value of a quantitative trait. The Bayesian method developed here can be used for such a quantitative trait-associated microarray data analysis. Recall that, in the differential expression analysis, the design matrix 𝑋 is a variable indicating whether a gene comes from the control or the treatment. To make such an extension, the design matrix 𝑋 in the differential expression analysis is simply replaced by the phenotypic values of the trait in the quantitative trait-associated microarray analysis. The method developed here has the following extra features compared to that of Jia and Xu [25]: (a) Bayesian method implemented via the MCMC algorithm, (b) constraints on the cluster means, (c) an imbedded normalization step, and (d) a fixed number of clusters.

In differential gene expression analysis, we usually deal with two conditions, control and treatment, with a purpose of identifying genes whose expression levels have altered between the two conditions. In time-course and dose-response [26] microarray experiments, however, gene expression are measured from multiple conditions. The Bayesian mixture model analysis developed here may be extended to detect differentially expressed genes across multiple conditions. To make such an extension, the dimensions of 𝛾𝑖 and 𝑑+1 in model (2) need to be changed to reflect the multiple conditions. Let 𝑋 be the number of conditions. The modified dimensions of 𝛾𝑖 and 𝑁×𝑑 should be 𝑑×1 and 𝛾𝑖, respectively. The change of 𝛽𝑘=0 from a scaler into a vector leads to the following consequences: (1) the clusters of down- and upregulated genes are not explicitly defined, although the neutral cluster is still defined as that with 𝛾𝑖, and (2) the variance of 𝐶 in the experiment with two conditions becomes a variance-covariance matrix in the experiment with multiple conditions. The first problem may be solved via the following approaches. Let 𝛽1=0 be the total number of clusters from which these genes are sampled. Let the first cluster be the one consisting of all the neutral genes and thus 𝛽2𝛽𝐶 is enforced for this cluster. The means of the remaining clusters 𝐶 are not restricted, that is, they are sampled freely from their posterior distributions (multivariate normal). The number of clusters 𝑑 are found based on the value of Bayesian factor or BIC. Genes not classified into the neutral cluster are claimed to be differentially expressed. The actual number of neutral genes may not be sensitive to the choice of the total number of clusters. Therefore, one may simply choose any reasonable number of clusters for the analysis. This number may be a function of 𝐶=3𝑑, say 𝐶=3. Note that 𝐶=3×3=9 for a single dimension (one regression coefficient) as done in this study. For two regression coefficients, 𝑑. Therefore, for 𝐶=3𝑑 regression coefficients, the number of clusters becomes 𝛼𝑖. The second problem may be tackled as follows. The scaled inverse chi-square distribution chosen for the variance may be replaced by the generalization of the scaled inverse chi-square distribution for the variance-covariance matrix, called the inverse-Wishart distribution [21]. This prior distribution for the variance-covariance matrix is conjugate, and thus the standard sampling algorithm for a random vector from an inverse-Wishart distribution applies.

The intercept 𝑖 of gene 𝑊 in model (2) is irrelevant to the differential expression analysis and thus it is removed from the model via a special normalization process, called linear contrasting. In fact, all effects not related to differential expression can be removed via such a normalization process. Deleting these irrelevant effects (e.g., dye effect, block effect, etc.) may avoid tedious estimating procedure for unspecified parameters used in Zhang et al. [1] and save substantial time in calculation. To remove these effects, matrices 𝛼𝑖 and need to be customized to reflect the general nature of the normalization. Let 𝑊 be the number of irrelevant effects. The dimensions of 𝛼𝑖 and 𝑁× should be ×1 and 𝐿, respectively. The coefficient matrix for the linear contrasts 𝐿 remains the same as that defined by model (6). However, matrix 𝑁𝑟(𝑊) takes the first 𝐿 eigenvectors of matrix 𝛼𝑖. This generalized approach of normalization is conceptually similar to the ANOVA analysis proposed by Kerr et al. [27], where two steps are involved in the ANOVA. In the first step, 𝑌𝑖=𝑊𝛼𝑖+𝜖𝑖 is estimated under model 𝑌𝑖=𝑌𝑖𝑊𝛼𝑖 and then 𝑌𝑖, the residuals adjusted by the irrelevant effects, is used as the raw data for further differential expression analysis. In the second step, 𝑌𝑖=𝑋𝛾𝑖+𝜖𝑖 is described by model 𝛾𝑖 and then 𝑖 is estimated and tested to make an inference about the significance of gene 𝛼𝑖. Three issues need to be emphasized for the comparison: (1) the generalized approach of normalization proposed in this study removes 𝛼𝑖 using no explicit estimate of , (2) the reduced degrees of freedom after adjusting for the irrelevant effects are used for the new method, and (3) appropriate covariance structure for the residuals is used for the new method.

5. Conclusions

In this paper, we propose a Bayesian mixture model approach to detect genes that differentially expressed in control-treatment microarray experiments. Genes are assigned into one of three constrained clusters, corresponding to clusters of downregulated, neutral, and upregulated genes, respectively. The Bayes method is implemented via the Markov chain Monte Carlo (MCMC) algorithm. Genes that have been assigned into nonneutral clusters are the target genes which we would like to disclose. Using simulated data as well as data from real microarray experiments, we demonstrate that the new method outperforms the methods commonly used in differential expression analysis. Although the new method was demonstrated using data generated from laboratory animals, it is able to generalize to genome studies for plants.


This research was supported by the National Institute of Health Grant R01-GM55321, and the National Science Foundation Grant DBI-0345205 to SX.


  1. D. Zhang, M. T. Wells, C. D. Smart, and W. E. Fry, “Bayesian normalization and identification for differential gene expression data,” Journal of Computational Biology, vol. 12, no. 4, 391 pages, 2005. View at Publisher · View at Google Scholar
  2. R. Jörnsten, H.-Y. Wang, W. J. Welsh, and M. Ouyang, “DNA microarray data imputation and significance analysis of differential expression,” Bioinformatics, vol. 21, no. 22, 4155 pages, 2005. View at Publisher · View at Google Scholar
  3. B. Efron, R. Tibshirani, J. D. Storey, and V. Tusher, “Empirical bayes analysis of a microarray experiment,” Journal of the American Statistical Association, vol. 96, no. 456, 1151 pages, 2001. View at Publisher · View at Google Scholar
  4. P. Broët, S. Richardson, and F. Radvanyi, “Bayesian hierarchical model for identifying changes in gene expression from microarray experiments,” Journal of Computational Biology, vol. 9, no. 4, 671 pages, 2002. View at Publisher · View at Google Scholar
  5. J. W. Edwards, G. P. Page, G. Gadbury et al., “Empirical Bayes estimation of gene-specific effects in micro-array research,” Functional and Integrative Genomics, vol. 5, no. 1, 32 pages, 2005. View at Publisher · View at Google Scholar
  6. K.-A. Do, P. Müller, and F. Tang, “A Bayesian mixture model for differential gene expression,” Journal of the Royal Statistical Society. Series C, vol. 54, no. 3, 627 pages, 2005. View at Publisher · View at Google Scholar
  7. M. Medvedovic and S. Sivaganesan, “Bayesian infinite mixture model based clustering of gene expression profiles,” Bioinformatics, vol. 18, no. 9, 1194 pages, 2002. View at Publisher · View at Google Scholar
  8. C. Vogl, F. Sanchez-Cabo, G. Stocker, S. Hubbard, O. Wolkenhauer, and Z. Trajanoski, “A fully bayesian model to cluster gene-expression profiles,” Bioinformatics, vol. 21, 2, 130 pages, 2005. View at Publisher · View at Google Scholar
  9. A. E. Teschendorff, Y. Wang, N. L. Barbosa-Morais, J. D. Brenton, and C. Caldas, “A variational Bayesian mixture modelling framework for cluster analysis of gene-expression data,” Bioinformatics, vol. 21, no. 13, 3025 pages, 2005. View at Publisher · View at Google Scholar
  10. V. G. Tusher, R. Tibshirani, and G. Chu, “Significance analysis of microarrays applied to the ionizing radiation response,” Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 9, 5116 pages, 2001. View at Publisher · View at Google Scholar
  11. S. Dudoit, Y. H. Yang, M. J. Callow, and T. P. Speed, “Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments,” Statistica Sinica, vol. 12, no. 1, 111 pages, 2002. View at Google Scholar
  12. X. Cui and G. A. Churchill, “Statistical tests for differential expression in cDNA microarray experiments,” Genome Biology, vol. 4, no. 4, 210 pages, 2003. View at Publisher · View at Google Scholar
  13. G. K. Smyth, “Linear models and empirical bayes methods for assessing differential expression in microarray experiments,” Statistical Applications in Genetics and Molecular Biology, vol. 3, no. 1, article 3, 2004. View at Publisher · View at Google Scholar
  14. W. Pan, “A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments,” Bioinformatics, vol. 18, no. 4, 546 pages, 2002. View at Publisher · View at Google Scholar
  15. A. Gusnanto, A. Ploner, and Y. Pawitan, “Fold-change estimation of differentially expressed genes using mixture mixed-model,” Statistical Applications in Genetics and Molecular Biology, vol. 4, no. 1, article 26, 1 pages, 2005. View at Publisher · View at Google Scholar
  16. M. A. Newton, A. Noueiry, D. Sarkar, and P. Ahlquist, “Detecting differential gene expression with a semiparametric hierarchical mixture method,” Biostatistics, vol. 5, no. 2, 155 pages, 2004. View at Publisher · View at Google Scholar
  17. A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the em algorithm,” Journal of the Royal Statistical Society: Series B, vol. 39, 1 pages, 1977. View at Google Scholar
  18. Y. Benjamini and W. Liu, “A step-down multiple hypotheses testing procedure that controls the false discovery rate under independence,” Journal of Statistical Planning and Inference, vol. 82, no. 1-2, 163 pages, 1999. View at Publisher · View at Google Scholar
  19. H. D. Patterson and R. Thompson, “Recovery of inter-block information when block sizes are unequal,” Biometrika, vol. 58, 545 pages, 1971. View at Publisher · View at Google Scholar
  20. L. Devroye, Non-Uniform Random Variate Generation, Springer, New York, NY, USA, 1986.
  21. A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin, Bayesian Data Analysis, Chapman & Hall/CRC, New York, NY, USA, 1995.
  22. G. Schwartz, “Estimating the dimensions of a model,” The Annals of Statistics, vol. 6, 461 pages, 1978. View at Google Scholar
  23. M. J. Callow, S. Dudoit, E. L. Gong, T. P. Speed, and E. M. Rubin, “Microarray expression profiling identifies genes with altered expression in HDL-deficient mice,” Genome Research, vol. 10, 2022 pages, 2000. View at Publisher · View at Google Scholar
  24. P. H. Westfall and S. S. Young, Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment, Wiley, New York, NY, USA, 1993.
  25. Z. Jia and S. Xu, “Clustering expressed genes on the basis of their association with a quantitative phenotype,” Genetical Research, vol. 86, no. 3, 193 pages, 2005. View at Publisher · View at Google Scholar
  26. S. D. Peddada, E. K. Lobenhofer, L. Li, C. A. Afshari, C. R. Weinberg, and D. M. Umbach, “Gene selection and clustering for time-course and dose-response microarray experiments using order-restricted inference,” Bioinformatics, vol. 19, no. 7, 834 pages, 2003. View at Publisher · View at Google Scholar
  27. M. K. Kerr, M. Martin, and G. A. Churchill, “Analysis of variance for gene expression microarray data,” Journal of Computational Biology, vol. 7, no. 6, 819 pages, 2001. View at Publisher · View at Google Scholar