Abstract

Control-treatment design is widely used in microarray gene expression experiments. The purpose of such a design is to detect genes that are differentially expressed between the control and the treatment. Many statistical procedures have been developed to detect differentially expressed genes, but each has its pros and cons, and there is still room for improvement. In this study, we propose a Bayesian mixture model approach that classifies genes into one of three clusters, corresponding to downregulated, neutral, and upregulated genes, respectively. The Bayesian method is implemented via the Markov chain Monte Carlo (MCMC) algorithm. The cluster means of the down- and upregulated genes are sampled from truncated normal distributions, whereas the cluster mean of the neutral genes is set to zero. Using simulated data as well as data from a real microarray experiment, we demonstrate that the new method outperforms the methods commonly used in differential expression analysis.

1. Introduction

Current microarray technology allows us to measure the expression of thousands of genes simultaneously on a small chip. Many advanced statistical methods have been developed to analyze data generated by this technology. For example, a Bayesian model has been proposed to normalize gene expression data and select genes that are differentially expressed [1]. Jörnsten et al. [2] developed a Bayesian method to impute missing values prior to any statistical analysis. Efron et al. [3], Broët et al. [4], Edwards et al. [5], and Do et al. [6] used Bayesian approaches to identify differentially expressed genes. The methods of [7–9] are examples of Bayesian clustering analysis.

In this study, we focus on detecting differentially expressed genes in two-sample designs, in which only two conditions (control and treatment) are examined and each condition is replicated several times. The data are assumed to have been properly imputed and normalized as needed. Much effort has been devoted to detecting genes whose expression levels respond to the treatment.

Numerous methods have been suggested based on the P-values of a test statistic. The P-values are obtained from separate tests and reported for all genes [10, 11]. Cui and Churchill [12] reviewed the test statistics for differential expression in microarray experiments. A gene-specific summary statistic provides a basis for rank ordering the genes and creating a short list of genes declared as differentially expressed. However, the genes are studied separately, in the sense that calculating the test statistic for one gene does not use information from other genes. In limma (linear models for microarray data), Smyth [13] cleverly borrowed information from the ensemble of genes to make inference about each individual gene based on a moderated t-statistic. Some other researchers also took advantage of shared information by examining the data jointly. Efron et al. [3] proposed a mixture model methodology implemented via an empirical Bayes approach. For each gene, the expression levels are mapped to a single z-score, which is assumed to arise from one of two distributions (affected and unaffected). Pan [14] used a mixture model to cluster genes based on a gene-specific t-statistic. These are examples where the idea of clustering was first used in differential expression analysis (clustering methods are usually applied to microarray data generated from multiple conditions). Similarly, the methods of [4–6, 15] all used a mapped quantity, such as the z-score or the t-statistic, to classify genes into different groups. These methods differ from the proposed Bayesian mixture model in that they cluster genes based on a summary statistic, whereas our Bayesian method clusters genes based on the pattern of expression. It is more desirable to use the expression profile than a summary score for cluster analysis because information is bound to be lost when the expression profile is mapped into a single score. Recently, Newton et al. [16] developed a new method in which the gene expression levels are used directly for classification. Newton et al. [16] used a hierarchical model consisting of two levels. The first level describes the conditional distributions of the observed expression measurements given the means of the control ($\mu_{1i}$) and the treatment ($\mu_{2i}$), where $i$ indexes the gene. The second level describes the joint distribution of $\mu_{1i}$ and $\mu_{2i}$ by a mixture of three multivariate distributions with the constraints $\mu_1 > \mu_2$, $\mu_1 = \mu_2$, and $\mu_1 < \mu_2$. Genes become linked by virtue of having $\mu_{1i}$ and $\mu_{2i}$ drawn from a common distribution. Parameter estimation was accomplished via the Expectation-Maximization (EM) algorithm [17].

For the Bayesian clustering method presented in this study, the observed gene expression levels are described by a regression model as done by Gusnanto et al. [15]. For each gene, the irrelevant intercept is removed by a special normalization scheme. The slope of the regression represents the difference of expression under the two conditions. The Bayesian method is implemented via the Markov chain Monte Carlo (MCMC) algorithm. The regression coefficient of each gene is assumed to be sampled from a mixture of three normal distributions with constrained parameters. The three distributions are 𝑁(π›½π‘˜,πœˆπ‘˜) for π‘˜=1,2,3, where 𝛽1<0, 𝛽2=0, and 𝛽3>0 are the constrained parameters. The proposed new method actually turns the problem of a complicated multivariate mixture distribution of Newton et al. [16] into that of a univariate mixture distribution.

The new Bayesian method proposed in this study (Method I) is compared to five different methods that are commonly used in differential expression analysis. These five methods, called Methods II, III, IV, V, and VI, respectively, are described below. Method II is the regularized t-test (SAM, Tusher et al. [10]), in which a t-like score is calculated for each gene. Genes with scores greater than an adjusted threshold are declared to have altered expression under the treatment. Permutations are used to calculate the false discovery rate (FDR, Benjamini and Liu [18]). Method III is the model-based cluster analysis of Pan [14], where the variable used for clustering is the t-test statistic. Genes assigned to a cluster with a mean t value significantly different from zero are declared to be differentially expressed. Method IV is the hierarchical mixture model method of Newton et al. [16], where the expected expression values of the $i$th gene under the control ($\mu_{1i}$) and the treatment ($\mu_{2i}$) are assumed to arise from a mixture of three distributions. Clustering is actually conducted on the latent parameters ($\mu_{1i}$ and $\mu_{2i}$). The cluster of genes with $\mu_{1i} \neq \mu_{2i}$ is declared as the differentially expressed genes. A short list of significant genes is created based on an FDR of 0.05. Method V is the linear model and empirical Bayes method (limma) of Smyth [13]. In limma, a hierarchical linear model is used to describe the gene expression levels. A moderated t-statistic, with extra information borrowed from other genes, is calculated for each gene. Adjusted P-values are calculated based on the moderated t-statistics to achieve FDR control. Method VI is the mixture mixed model analysis of Gusnanto et al. [15], where a similar linear model is used to describe gene expressions. However, the variable used for clustering is the mean difference between the treatment and the control. The mathematical differences between the proposed method and the five compared methods will be evident after the new method is introduced. Advantages of the new method over the five methods will be demonstrated in Section 3 and further addressed in Section 4.

2. Methods

2.1. Linear Model

Let $Y_{ij}$ ($i = 1, \ldots, m$ and $j = 1, \ldots, N$) be the log-transformed expression level for gene $i$ from chip $j$, where $m$ is the total number of genes and $N$ is the total number of chips. Let $X_j$ be a dichotomous variable indicating the status of the chip, with $X_j = 0$ for the control and $X_j = 1$ for the treatment. We use the following regression model to describe $Y_{ij}$:
$$ Y_{ij} = \alpha_i + X_j \gamma_i + \varepsilon_{ij}, \qquad (1) $$
where $\alpha_i$ is the intercept of the regression model, $\gamma_i$ is the regression coefficient, and $\varepsilon_{ij}$ is the residual error. Note that the regression coefficient represents the difference in gene expression between the treatment and the control ($\mu_2 - \mu_1$). A large (in absolute value) $\gamma_i$ implies that gene $i$ is differentially expressed. The intercept $\alpha_i$ represents the mean expression level of gene $i$ in the control ($\mu_1$). The intercept $\alpha_i$ is irrelevant to the differential expression analysis and thus should be removed from the model (see the next section for details). The regression coefficient $\gamma_i$ is assumed to be sampled from one of three normal distributions, corresponding to the downregulated genes, the neutral genes, and the upregulated genes, respectively. The residual error is assumed to be $N(0, \sigma^2)$ distributed. In matrix notation, model (1) can be expressed as
$$ Y_i = W\alpha_i + X\gamma_i + \varepsilon_i, \qquad (2) $$
where $Y_i = \{Y_{ij}\}_{j=1}^N$ is a column vector of the expressions of gene $i$ across all samples, $X = \{X_j\}_{j=1}^N$ is the incidence matrix (a column vector) for the linear model, $\varepsilon_i = \{\varepsilon_{ij}\}_{j=1}^N$ is a vector of residual errors, and $W$ is an $N \times 1$ unity vector, that is, a vector with all elements equal to 1.
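As a minimal sketch of model (2) for a single gene, the design can be set up as follows. The chip layout, parameter values, and variable names are hypothetical illustrations, not values taken from the paper.

```python
import numpy as np

# Minimal sketch of model (2) for one gene; all numbers and names are hypothetical.
rng = np.random.default_rng(1)

N = 8                                   # total number of chips
X = np.repeat([0.0, 1.0], N // 2)       # incidence vector: X_j = 0 (control), X_j = 1 (treatment)
W = np.ones(N)                          # N x 1 unity vector multiplying the intercept

alpha_i = 5.0                           # intercept: mean expression of gene i in the control
gamma_i = 1.2                           # regression coefficient: treatment-minus-control difference
sigma = 0.5                             # residual standard deviation

# Model (2): Y_i = W * alpha_i + X * gamma_i + epsilon_i
Y_i = W * alpha_i + X * gamma_i + rng.normal(0.0, sigma, size=N)
print(Y_i.round(2))
```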

2.2. Normalization

As stated earlier, $\alpha_i$ is irrelevant to the differential expression analysis. We propose to remove the effect of $\alpha_i$ by using a linear contrast. This process, which we call normalization, is similar to the way fixed effects are removed in the restricted maximum likelihood (REML) method for variance component analysis [19]. For each gene, we subtract the average expression level across all chips from the observed expression level, that is,
$$ Y_i^* = Y_i - W\bar{Y}_{i\cdot}, \qquad (3) $$
where $\bar{Y}_{i\cdot} = N^{-1}\sum_{j=1}^{N} Y_{ij}$ is the mean expression of gene $i$ across all chips. The mean expression can also be expressed in matrix notation as
$$ \bar{Y}_{i\cdot} = (W^T W)^{-1} W^T Y_i. \qquad (4) $$
Therefore, the normalized expression is a linear combination of the original expressions,
$$ Y_i^* = \left[ I - W (W^T W)^{-1} W^T \right] Y_i = L Y_i, \qquad (5) $$
where
$$ L = I - W (W^T W)^{-1} W^T. \qquad (6) $$
We can prove that $LW = 0$. Therefore,
$$ Y_i^* = L Y_i = L X \gamma_i + L \varepsilon_i. \qquad (7) $$
Letting $X^* = LX$ and $\varepsilon_i^* = L\varepsilon_i$, we have
$$ Y_i^* = X^* \gamma_i + \varepsilon_i^*. \qquad (8) $$
Now the intercept has been removed by using the linear contrasts $LY_i$ in place of the original $Y_i$. The residual variance-covariance matrix of the linear contrasts is $\mathrm{Var}(\varepsilon_i^*) = LL^T\sigma^2$. Note that the rank of matrix $LL^T$ is $N - r(W) = N - 1$, so the inverse of $LL^T$ does not exist. The model-based cluster analysis, however, requires this inverse. Therefore, we delete the last row of matrix $L$ to form a new $(N-1)\times N$ matrix, called $L^*$, to build the linear contrasts. This treatment is equivalent to deleting the last element of $Y_i^*$, that is, the dimension of $Y_i^*$ becomes $(N-1)\times 1$. The dimensions of $X^*$ and $\varepsilon_i^*$ change to $(N-1)\times 1$ accordingly. Let $R = L^*(L^*)^T$ be an $(N-1)\times(N-1)$ matrix of full rank. We now have $\mathrm{Var}(\varepsilon_i^*) = R\sigma^2$, which will be used in the model-based cluster analysis.
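The contrast construction above can be sketched in a few lines of code, continuing the hypothetical setting of the previous snippet.

```python
import numpy as np

# Sketch of the normalization contrasts (equations (3)-(8)); N is hypothetical.
N = 8
W = np.ones((N, 1))                                   # N x 1 unity vector
L = np.eye(N) - W @ np.linalg.inv(W.T @ W) @ W.T      # equation (6): the centering matrix
assert np.allclose(L @ W, 0.0)                        # LW = 0, so the intercept is removed

L_star = L[:-1, :]                                    # drop the last row: (N-1) x N contrast matrix
R = L_star @ L_star.T                                 # (N-1) x (N-1) and full rank
print(np.linalg.matrix_rank(L @ L.T), np.linalg.matrix_rank(R))   # both equal N-1
```

Because $L$ is symmetric and idempotent, $LL^T = L$ has rank $N-1$, which is why one contrast must be dropped to obtain the full-rank matrix $R$.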

To simplify the notation, define $y_i = Y_i^*$, $x = X^*$, and $\epsilon_i = \varepsilon_i^*$, all with dimension $(N-1)\times 1$. The new model with the intercept removed becomes
$$ y_i = x\gamma_i + \epsilon_i. \qquad (9) $$
Note that the special normalization described above only applies to linear effects; different methods should be used to adjust for nonlinear effects. In the subsequent analysis, we assume that normalization via linear contrasts has been carried out, and model (9) is therefore the basis for parameter estimation and statistical inference.

2.3. Mixture Model and the Bayesian Setup

Conditional on $\gamma_i$, the probability density of $y_i$ is
$$ p(y_i \mid \gamma_i, \sigma^2) = \phi_n(y_i;\, x\gamma_i,\, R\sigma^2), \qquad (10) $$
where $n = N - 1$ and $\phi_n(y_i; x\gamma_i, R\sigma^2)$ is the $n$-dimensional normal density of variable $y_i$ with mean $x\gamma_i$ and variance-covariance matrix $R\sigma^2$. The gene-specific effect $\gamma_i$ is assumed to be sampled from one of three normal distributions (clusters), $N(\beta_k, \nu_k)$ for $k = 1, 2, 3$. We define the variable $z_i$ as the cluster assignment of gene $i$, where $z_i = k$ if gene $i$ comes from cluster $k$. For notational simplicity, we define $\eta(z_i, k)$ for $k = 1, 2, 3$ as a variable redundant with $z_i$, where $\eta(z_i, k) = 1$ if $z_i = k$ and $\eta(z_i, k) = 0$ otherwise. Let $\eta(z_i) = \{\eta(z_i, k)\}_{k=1}^3$, with $\eta(z_i, k) \in \{0, 1\}$ and $\sum_{k=1}^3 \eta(z_i, k) = 1$. The new variable $\eta(z_i)$ will be used in place of $z_i$ in the subsequent derivation. Let $\beta = \{\beta_k\}_{k=1}^3$ and $\nu = \{\nu_k\}_{k=1}^3$. The density of the mixture distribution of $\gamma_i$ is
$$ p(\gamma_i \mid \eta(z_i), \beta, \nu) = \sum_{k=1}^{3} \eta(z_i, k)\, \phi_1(\gamma_i;\, \beta_k, \nu_k). \qquad (11) $$
Note that $\eta(z_i, k)$ is unknown and is treated as a missing value. In fact, inferring the probability distribution of $\eta(z_i, k)$ is the main purpose of the proposed mixture model analysis. Define $\pi = \{\pi_k\}_{k=1}^3$ as the mixing proportions of the three components, with $\pi_k \geq 0$ and $\sum_{k=1}^3 \pi_k = 1$. The distribution of $\eta(z_i)$ is multinomial with one observation,
$$ p(\eta(z_i) \mid \pi) = \prod_{k=1}^{3} \pi_k^{\eta(z_i, k)}. \qquad (12) $$
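As a small numerical illustration of (11) and (12), the following sketch evaluates the component densities for one value of $\gamma_i$ and the resulting cluster-membership probabilities, which reappear later as equation (20). All numbers are hypothetical and chosen only to respect the constraints $\beta_1 < 0$, $\beta_2 = 0$, $\beta_3 > 0$.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical cluster parameters for the three-component mixture of gamma_i.
beta = np.array([-1.0, 0.0, 1.0])      # cluster means: down, neutral, up
nu   = np.array([0.2, 0.05, 0.2])      # cluster variances
pi   = np.array([0.1, 0.8, 0.1])       # mixing proportions

gamma_i = 0.9
comp = norm.pdf(gamma_i, loc=beta, scale=np.sqrt(nu))   # phi_1(gamma_i; beta_k, nu_k), k = 1, 2, 3
pi_ik = pi * comp / np.sum(pi * comp)                   # membership probabilities, cf. equation (20)
print(pi_ik.round(3))
```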

We have introduced the probability densities of the data, the regression coefficients, and the missing values for a single gene. We now combine the densities of the individual genes to form the joint density of all genes. The probability density of the data $y = \{y_i\}_{i=1}^m$ is
$$ p(y \mid \gamma, \sigma^2) = \prod_{i=1}^{m} p(y_i \mid \gamma_i, \sigma^2). \qquad (13) $$
The density of $\gamma = \{\gamma_i\}_{i=1}^m$ is
$$ p(\gamma \mid \eta, \beta, \nu) = \prod_{i=1}^{m} p(\gamma_i \mid \eta(z_i), \beta, \nu). \qquad (14) $$
The density of $\eta = \{\eta(z_i)\}_{i=1}^m$ is
$$ p(\eta \mid \pi) = \prod_{k=1}^{3} \pi_k^{m_k}, \qquad (15) $$
where $m_k = \sum_{i=1}^{m} \eta(z_i, k)$ and $\sum_{k=1}^{3} m_k = m$. The joint density of $(y, \gamma, \eta)$ is
$$ p(y, \gamma, \eta \mid \theta) = p(y \mid \gamma, \sigma^2)\, p(\gamma \mid \eta, \beta, \nu)\, p(\eta \mid \pi), \qquad (16) $$
where $\theta = (\sigma^2, \beta, \pi, \nu)$ is the parameter vector.

The next step of the analysis is to find a suitable prior for $\theta$. The following vague priors are chosen for the variance components: $\sigma^2 \sim \mathrm{Inv}\text{-}\chi^2(0, 0)$ and $\nu_k \sim \mathrm{Inv}\text{-}\chi^2(0, 0)$ for $k = 1, 2, 3$. These are scaled inverse chi-square distributions with zero degrees of freedom and a zero scale parameter. The actual forms of the priors are $p(\sigma^2) = 1/\sigma^2$ and $p(\nu_k) = 1/\nu_k$ for $k = 1, 2, 3$. The scaled inverse chi-square distribution is conjugate, so the posterior distribution of each variance is also scaled inverse chi-square. The three clusters of genes are distinguished by the overall patterns of response to the treatment. The first cluster consists of all the downregulated genes, the second cluster contains all the neutral genes, and the third cluster contains all the upregulated genes. These characteristics are represented by enforcing the following constraints on the cluster means: $\beta_1 < 0$, $\beta_2 = 0$, and $\beta_3 > 0$. Therefore, the prior distributions of the cluster means are $p(\beta_k) = 1(\beta_k \in \Omega_k)$. Here we adopt the indicator notation, meaning that $p(\beta_k) = 1$ if $\beta_k \in \Omega_k$ and $p(\beta_k) = 0$ if $\beta_k \notin \Omega_k$, where $\Omega_1 \equiv (\beta_1 < 0)$, $\Omega_2 \equiv (\beta_2 = 0)$, and $\Omega_3 \equiv (\beta_3 > 0)$. The prior for $\pi$ is the multivariate generalization of the Beta distribution, $p(\pi \mid \delta) = D(\pi \mid \delta, \delta, \delta)$, called the Dirichlet distribution, where $\delta = 1$ is used in this study. The joint prior for all the parameters is $p(\theta) = (1/\sigma^2)\prod_{k=1}^{3}(1/\nu_k)$, up to a constant, since the priors for $\beta$ and $\pi$ are flat. The posterior distribution of all the unknowns, $p(\gamma, \eta, \theta \mid y)$, is proportional to the joint distribution of all variables,
$$ p(\gamma, \eta, \theta \mid y) \propto p(y, \gamma, \eta, \theta) = p(y \mid \gamma, \sigma^2)\, p(\gamma \mid \eta, \beta, \nu)\, p(\eta \mid \pi)\, p(\theta). \qquad (17) $$
Statistical inference from this distribution is the theme of the Bayesian analysis. There is no explicit form for the joint posterior (17); therefore, we draw observations of the unknowns from their conditional distributions. Fortunately, with the current Bayesian setup, the conditional distribution of any single variable, given all other variables, has a known form. Therefore, the MCMC process can proceed exclusively by Gibbs sampling.
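The scaled inverse chi-square full conditionals that result from these priors are drawn in Steps 5 and 6 of the sampler below. One way to generate such draws is sketched here; reading Inv-χ²(df, S), with S the sum-of-squares term, as S divided by a chi-square variate is our interpretation of the notation, and the helper name is hypothetical.

```python
import numpy as np
from scipy.stats import chi2

# Sketch: one draw from the scaled inverse chi-square used in Steps 5 and 6.
# Assumption: Inv-chi^2(df, S), with S the sum-of-squares term, corresponds to S / X
# where X ~ chi^2(df); the vague prior 1/sigma^2 adds nothing to df or S.
def draw_scaled_inv_chi2(df, sum_of_squares, rng):
    """Return one draw from the scaled inverse chi-square with the given df and sum of squares."""
    return sum_of_squares / chi2.rvs(df, random_state=rng)

rng = np.random.default_rng(0)
print(draw_scaled_inv_chi2(df=100, sum_of_squares=25.0, rng=rng))   # roughly 25/100 = 0.25
```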

2.4. Markov Chain Monte Carlo

The detailed steps of the Markov chain Monte Carlo (MCMC) process are described below; a compact code sketch of the full sampler is given at the end of this subsection.

Step 1. Set $t = 0$ and initialize all variables $(\gamma^{(t)}, \eta^{(t)}, \beta^{(t)}, \sigma^{2(t)}, \nu^{(t)}, \pi^{(t)})$.

Step 2. Simulate $\gamma_i^{(t+1)}$ from $\gamma_i \sim N(\hat{\gamma}_i, s^2_{\gamma_i})$ conditional on $z_i = k$, where
$$ s^2_{\gamma_i} = \left( \frac{x^T R^{-1} x}{\sigma^2} + \frac{1}{\nu_k} \right)^{-1}, \qquad \hat{\gamma}_i = s^2_{\gamma_i} \left( \frac{x^T R^{-1} y_i}{\sigma^2} + \frac{\beta_k}{\nu_k} \right), \qquad (18) $$
and all variables on the right-hand side take their most current values. In this step, the most current values of the variables on the right-hand side are the values at iteration $t$. This statement also holds for the subsequent steps, except that the most current value of a variable is its value at iteration $t+1$ rather than $t$ whenever that variable has already been updated in the current sweep.

Step 3. Simulate $\eta(z_i)^{(t+1)}$ from a multinomial distribution with a single observation,
$$ \eta(z_i) \sim \mathrm{Multinomial}(1;\, \pi_{i1}, \pi_{i2}, \pi_{i3}), \qquad (19) $$
where
$$ \pi_{ik} = \frac{\pi_k\, p(\gamma_i \mid \eta(z_i, k), \beta_k, \nu_k)}{\sum_{k'=1}^{3} \pi_{k'}\, p(\gamma_i \mid \eta(z_i, k'), \beta_{k'}, \nu_{k'})}. \qquad (20) $$

Step 4. Simulate $\beta_k^{(t+1)}$ from the truncated normal distribution
$$ \beta_k \sim N(\hat{\beta}_k, s^2_{\beta_k})\, 1(\beta_k \in \Omega_k). \qquad (21) $$
The mean and the variance of the normal distribution are
$$ \hat{\beta}_k = \frac{1}{\sum_{i=1}^{m} \eta(z_i, k)} \sum_{i=1}^{m} \eta(z_i, k)\, \gamma_i, \qquad s^2_{\beta_k} = \frac{\nu_k}{\sum_{i=1}^{m} \eta(z_i, k)}, \qquad (22) $$
respectively. Note that $\beta_2 = 0$ is enforced, so no sampling is required for $\beta_2$. The general inverse transformation method [20] is used to sample $\beta_1$ and $\beta_3$ from the corresponding truncated normal distributions.

Step 5. Simulate $\sigma^{2(t+1)}$ from
$$ \sigma^2 \sim \mathrm{Inv}\text{-}\chi^2\!\left( mn,\ \sum_{i=1}^{m} (y_i - x\gamma_i)^T R^{-1} (y_i - x\gamma_i) \right), \qquad (23) $$
where $m \times n$ is the degrees of freedom and the term with the summation is the scale parameter of the inverse chi-square distribution.

Step 6. Simulate $\nu_k^{(t+1)}$ from
$$ \nu_k \sim \mathrm{Inv}\text{-}\chi^2\!\left( m_k,\ \sum_{i=1}^{m} \eta(z_i, k)\,(\gamma_i - \beta_k)^2 \right), \qquad (24) $$
where $m_k = \sum_{i=1}^{m} \eta(z_i, k)$ is the degrees of freedom.

Step 7. Simulate $\pi^{(t+1)}$ from
$$ \pi \sim \mathrm{Dirichlet}(m_1 + 1,\, m_2 + 1,\, m_3 + 1). \qquad (25) $$

So far, every variable has been updated. We then increment $t$ by 1 ($t = t + 1$) and repeat Steps 2–7 until the chain is sufficiently long to allow reliable post-MCMC inference about the parameters of interest. Gene $i$ is assigned to cluster $k$ if $\bar{\eta}(z_i, k) = \max\{\bar{\eta}(z_i, k')\}_{k'=1}^{3}$, where $\bar{\eta}(z_i, k)$ is the posterior mean of $\eta(z_i, k)$ calculated from the posterior sample.

Schemes for sampling variables from the aforementioned distributions are discussed by Gelman et al. [21].
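To make Steps 1–7 concrete, the following Python sketch runs the full Gibbs sampler on a small simulated data set. This is our own minimal illustration, not the authors' code: the data dimensions, true parameter values, starting values, chain length, and the NumPy/SciPy routines are all assumptions, and equation (18) is implemented as the standard normal-normal conjugate update.

```python
import numpy as np
from scipy.stats import norm, chi2

# Compact sketch of the Gibbs sampler (Steps 1-7) on simulated data; all settings below
# (m = 200 genes, N = 16 chips, cluster parameters) are illustrative assumptions.
rng = np.random.default_rng(7)

m, N = 200, 16
X = np.repeat([0.0, 1.0], N // 2)                        # control / treatment indicator
true_gamma = np.concatenate([rng.normal(-1.0, 0.3, 10),  # downregulated genes
                             np.zeros(180),              # neutral genes
                             rng.normal(1.0, 0.3, 10)])  # upregulated genes
Y = 5.0 + np.outer(true_gamma, X) + rng.normal(0.0, 0.5, (m, N))

# Normalization via linear contrasts (Section 2.2)
W = np.ones((N, 1))
L = np.eye(N) - W @ W.T / N                              # equation (6)
Ls = L[:-1, :]                                           # L*, (N-1) x N
R = Ls @ Ls.T
Ri = np.linalg.inv(R)
x = Ls @ X                                               # normalized incidence vector
y = (Ls @ Y.T).T                                         # normalized expressions, one gene per row
n = N - 1

def trunc_norm(mu, sd, lower, upper, rng):
    """Inverse-CDF draw from N(mu, sd^2) truncated to (lower, upper)."""
    a, b = norm.cdf((lower - mu) / sd), norm.cdf((upper - mu) / sd)
    u = min(max(rng.uniform(a, b), 1e-12), 1.0 - 1e-12)
    return mu + sd * norm.ppf(u)

# Step 1: initialize
gamma = np.zeros(m)
beta = np.array([-0.5, 0.0, 0.5])
nu = np.array([0.5, 0.1, 0.5])
pi = np.array([1.0 / 3.0] * 3)
sigma2 = 1.0
z = np.ones(m, dtype=int)                                # start every gene in the neutral cluster

xRx = x @ Ri @ x
xRy = x @ Ri @ y.T
keep = []
for t in range(3000):
    # Step 2: gamma_i given z_i = k (equation (18))
    s2 = 1.0 / (xRx / sigma2 + 1.0 / nu[z])
    gamma = rng.normal(s2 * (xRy / sigma2 + beta[z] / nu[z]), np.sqrt(s2))

    # Step 3: cluster assignments (equations (19)-(20))
    dens = pi * norm.pdf(gamma[:, None], loc=beta, scale=np.sqrt(nu))
    prob = dens / dens.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(3, p=p) for p in prob])
    mk = np.bincount(z, minlength=3)

    # Step 4: cluster means (equations (21)-(22)); beta_2 = 0 is enforced, never sampled
    for k, (lo, hi) in zip((0, 2), ((-np.inf, 0.0), (0.0, np.inf))):
        if mk[k] > 0:
            beta[k] = trunc_norm(gamma[z == k].mean(), np.sqrt(nu[k] / mk[k]), lo, hi, rng)

    # Step 5: residual variance (equation (23))
    res = y - np.outer(gamma, x)
    ss = np.einsum('ij,jk,ik->', res, Ri, res)
    sigma2 = ss / chi2.rvs(m * n, random_state=rng)

    # Step 6: cluster variances (equation (24))
    for k in range(3):
        if mk[k] > 1:
            nu[k] = np.sum((gamma[z == k] - beta[k]) ** 2) / chi2.rvs(mk[k], random_state=rng)

    # Step 7: mixing proportions (equation (25))
    pi = rng.dirichlet(mk + 1)

    if t >= 1000 and t % 20 == 0:                        # burn-in and thinning
        keep.append(beta.copy())

print("posterior means of beta:", np.mean(keep, axis=0).round(2))
```

With this setup, the posterior means of the cluster means should settle near the simulated values of roughly −1, 0, and 1.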

3. Applications

3.1. Analysis of Simulated Data

We simulated the expression of $m = 1000$ genes from six different groups on $N = 16$ microarray chips. Chips 1–8 represent the control, and chips 9–16 represent the treatment. The true parameter values for the six groups of genes are shown in Table 1. If we ignore the intercepts, the six groups of genes really come from three functional clusters: Cluster 1 represents the $m_1 = 20$ downregulated genes (Group 3); Cluster 2 contains the $m_2 = 960$ neutral genes (Groups 2 and 4); and Cluster 3 consists of the $m_3 = 20$ upregulated genes (Groups 1, 5, and 6). The true parameters of the original six groups and the parameters for the combined three clusters are given in Table 1. Note that there are actually three components (Groups 1, 5, and 6) in the combined Cluster 3. The weighted average of the $\beta$ values of the three simulated components, $1.2 \times 0.25 + 0.8 \times 0.25 + 0.9 \times 0.5 = 0.95$, is treated as the true $\beta$ for this cluster. The expression patterns are shown in Figure 1 for the original six groups and in Figure 2 for the combined three clusters.

The data were analyzed using the Bayesian mixture model analysis reported in this study (Method I). We set the number of iterations to 12000 in the MCMC process. The results generated from the first 2000 iterations were discarded as the burn-in period. For the remaining 10000 iterations, we saved one observation in every 20 iterations to reduce the serial correlation between consecutive observations. Therefore, the posterior sample contained 500 observations. The posterior mean of each variable is taken as the Bayesian estimate. Hypothesis tests were only performed on the cluster means $\beta_k$ rather than on individual genes. Genes assigned into a significant cluster were simultaneously declared to be differentially expressed. From Table 1, we can see that the estimated parameters agree well with the true parameters. Genes are correctly assigned into the three clusters, except that two genes in Cluster 2 (the neutral cluster) are incorrectly assigned into the other two clusters (the down- and upregulated clusters) (see Table 2). By enforcing constraints on the cluster means, we are able to fix the number of clusters at three, rather than finding the optimal number of clusters using the Bayes factor or its BIC approximation [22].
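The burn-in and thinning arithmetic described above amounts to simple array slicing; a sketch with a placeholder chain (not the actual output of the analysis) is shown below.

```python
import numpy as np

# Placeholder chain of 12000 draws of three quantities (contents hypothetical).
chain = np.random.default_rng(0).normal(size=(12000, 3))
posterior_sample = chain[2000::20]          # discard 2000 burn-in draws, keep every 20th thereafter
print(posterior_sample.shape[0])            # 500 retained observations
estimates = posterior_sample.mean(axis=0)   # posterior means serve as the Bayesian estimates
```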

An interesting question is what would happen if the cluster means were not constrained. To answer this question, we also analyzed the simulated data with unconstrained cluster means. We varied the number of clusters from 3 to 6 and found that 4 was the optimal number of clusters (maximum value of the Bayes factor). One cluster mean was close to zero, another cluster had a negative mean, and the third and fourth clusters both had positive means that were almost identical. Therefore, Clusters 3 and 4 may well be considered a single cluster. When treated as three clusters, the results were similar to those reported earlier with the restricted cluster means (data not shown).

For comparison, the data were also analyzed using the five methods (Methods II–VI) described in Section 1. The numbers of genes classified into each of the three clusters are given in Table 2 for all six methods, including Method I (the new method). A cutoff value of 0.05 was used for gene detection in Methods II and V to achieve FDR control. All methods successfully found the 40 truly differentially expressed genes. However, Methods II, III, IV, V, and VI detected more false-positive genes than Method I. The model-based cluster analysis of Pan (Method III) requires BIC to decide the optimal number of clusters. For simplicity of comparison, we only used the result of Method III under three clusters. We noticed that Method III always put the down- and upregulated genes into one cluster with a large variance, but placed the neutral genes into the other two clusters with small variances. The two clusters with small variances can be combined into a single cluster. Because all the methods successfully detected the 40 nonneutral genes, none of them suffered from Type II error. However, different methods tended to have different numbers of genes from the neutral cluster misplaced into the functional clusters. This means that different methods have different Type I errors.

Let $N(k, k')$ be the number of genes from cluster $k$ assigned into cluster $k'$, for $k, k' = 1, 2, 3$. Then $N(k) = \sum_{k'=1}^{3} N(k, k')$ is the number of genes truly belonging to cluster $k$. Recall that Clusters 1 and 3 contain the differentially expressed genes and Cluster 2 is reserved for the neutral genes. The empirical Type I error is defined as
$$ \text{Type I error} = \frac{N(2,1) + N(2,3)}{N(2)}. \qquad (26) $$
The empirical Type II error is defined as
$$ \text{Type II error} = \frac{N(1,2) + N(3,2)}{N(1) + N(3)}. \qquad (27) $$
The empirical power is defined as
$$ \text{Power} = 1 - \text{Type II error}. \qquad (28) $$
The empirical Type I errors for all six methods are listed in Table 2. Method I (the new method) has the smallest Type I error. Note that this method of validation is valid only for a simulation study, where the true gene assignment is known.
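A sketch of how (26)–(28) are computed from a cluster-assignment table is given below; the counts are hypothetical, chosen to mimic the two misassigned neutral genes mentioned above.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true cluster k, columns = assigned cluster k',
# so Nmat[k-1, k'-1] = N(k, k').
Nmat = np.array([[20,   0,  0],    # true cluster 1 (downregulated)
                 [ 1, 958,  1],    # true cluster 2 (neutral)
                 [ 0,   0, 20]])   # true cluster 3 (upregulated)

N1, N2, N3 = Nmat.sum(axis=1)
type_I  = (Nmat[1, 0] + Nmat[1, 2]) / N2          # equation (26)
type_II = (Nmat[0, 1] + Nmat[2, 1]) / (N1 + N3)   # equation (27)
power   = 1.0 - type_II                           # equation (28)
print(round(type_I, 4), round(type_II, 4), round(power, 4))
```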

3.2. Analysis of Mice Data

We analyzed a cDNA microarray dataset published by Dudoit et al. [11]. The data are publicly available on the website (http://www.stat.berkeley.edu/users/terry). The data consist of expression measurements of 6342 genes from a study of lipid metabolism in mice [23]. This experiment was set up to find genes that are differentially expressed in the livers of mice with very low HDL cholesterol levels (treatment) in contrast to a group of normal inbred mice (control). The treatment group consists of eight scavenger receptor BI (SR-BI) transgenic mice and the control group consists of eight normal inbred mice.

We first analyzed the data with the Bayesian mixture model method proposed in this study (Method I). We used exactly the same MCMC setup as in the simulation experiment. The estimated parameters are given in Table 3. Only $\hat{\beta}_3$ (the cluster mean of the upregulated genes) is significantly different from zero. The mixing proportions of the three clusters show that the majority of the genes (82.46%) are neutral. Note that the estimated mixing proportions ($\hat{\pi}_1 = 0.1724$, $\hat{\pi}_2 = 0.8246$, $\hat{\pi}_3 = 0.0030$) are nuisance parameters, which do not necessarily match the actual numbers of genes assigned to the three clusters ($m_1 = 0$, $m_2 = 6332$, $m_3 = 10$). Reports should be based on the actual numbers of genes assigned to the clusters.

All six methods that were used to analyze the simulated data were applied here to analyze the mice data. The numbers of genes assigned to the three clusters are given in Table 4 for all six methods. Clearly, Methods I and VI were similar to each other, while Methods II, III, IV, and V placed many more genes into the nonneutral clusters, a phenomenon that was also observed earlier in the simulation experiment.

In the original study [23], individual P-values were calculated based on the t-test statistics for all the genes, and adjusted P-values were then calculated using the method described in Westfall and Young [24]. Five genes (represented by eight array elements) were detected by Callow et al. [23] as differentially expressed between the treatment and the control (see Table 5). Note that several array elements may represent the same gene. We found that gene EST AI746730 is not included in the dataset and thus was not analyzed by any of the methods (marked as NA in Table 5).

The Bayesian mixture model analysis detected 10 genes (or spots), four of which were reported by Callow et al. [23]. The expressions of the 10 genes are plotted in Figure 3 (the first two rows). Significant differences between the control and the treatment groups are easily seen. Note that element 2374 (the first plot in the second row of Figure 3) was a "blank" spot on the chip. However, it was detected by all methods because of the remarkable difference between the control and the treatment. The third row of plots shows five randomly selected genes that were not detected by any of the methods. Genes of this kind have the same expression levels under the control and the treatment. The last row of plots (in Figure 3) shows the expression patterns of five genes that were not detected by the Bayesian mixture model analysis (Method I) but were detected by the other five methods (Methods II, III, IV, V, and VI). When examining these five genes, we found that the differences between the treatment and the control are not obvious; the statistical significance may be caused entirely by outliers. Therefore, the five methods (II, III, IV, V, and VI) are perhaps too sensitive to outliers. For example, the first plot (5203) in the last row of Figure 3 is the spot that represents β-globin in the dataset. We cannot tell much difference between the control and the treatment. The spots numbered 2375, 2377, 2379, and 2384 are all "blank" spots. The observed differences between the control and the treatment are purely caused by chance, yet these blank spots were detected by Methods II, III, IV, V, or VI. Finally, Method III [14] and Method V [13] missed gene SR-BI at array element number 3, which is known to have altered expression between the control and the SR-BI transgenic mice.

When we examined the plots of the ten genes detected by the Bayesian mixture model analysis, we found that seven of them showed increased expression under the treatment while three of them showed decreased expression. Interestingly, our method assigned the three genes with negative regression coefficients into the cluster with positive regression coefficients. If we look at the estimated parameters (Table 3) again, we realize that both Clusters 1 and 2 may be considered as the neutral cluster (means close to zero and very small variances). The third cluster contains both the up- and downregulated genes because it has a much larger variance than the other two clusters. The same phenomenon also occurred for Method VI [15], but the issue was not discussed by Gusnanto et al. [15]. A possible explanation is that there are too few genes with negative regression coefficients, which makes it hard for them to form a cluster by themselves. However, why were these three genes assigned into Cluster 3 ($\beta_3 > 0$) instead of Cluster 2 ($\beta_2 = 0$), which is closer to the three genes in terms of the cluster mean? The reason may be that Cluster 3 has a larger variance than Cluster 2.

A separate simulation experiment was conducted to verify the notion that the cluster variance $\nu_k$ plays a more important role than the cluster mean $\beta_k$ in determining the posterior probability of cluster assignment $\pi_{ik}$ (data not shown).

4. Discussion

Similar to Method IV [16], our method is based on a hierarchical model in which the parameters of interest (gene effects) are further described by a mixture distribution. Clustering is based on the parameter rather than on a summary statistic such as a t-like score. This allows us to incorporate the error structure of the gene expression profile into the linear model (see (10)) and thus capture the most information from the data. This explains why our method is different from (or even better than) the other four methods (Methods II, III, V, and VI). But why is our method better than (or different from) Method IV? This may be attributed to two differences between Methods I and IV. One difference is that the normalization scheme incorporated in our model allows us to deal only with the effect of differential expression (the regression coefficient), whereas Method IV deals with the gene expression effects in both the control and the treatment. In other words, we deal with a single variable (the regression coefficient, the only parameter of interest) while Method IV deals with two variables (intercept and regression coefficient). The dimension reduction from two to one and the simplified Gaussian mixture distribution of our model may largely contribute to the higher efficiency of our method. The second difference between our method and Method IV is that we use the probabilities of cluster assignment to classify genes into three clusters, and genes assigned into the neutral cluster are excluded from the list of differentially expressed genes, whereas Method IV goes one step further by employing an FDR criterion to select a list of differentially expressed genes. The FDR-generated list of genes depends largely on the subjective cutoff rule set by the investigator. We consider the extra step of FDR analysis after gene clustering to be not only redundant but also a source of subjective decisions in differential expression analysis.

Method II [10] sorts genes based on a regularized t-like statistic calculated for each gene. Information from other genes plays no role in calculating the statistic for the current gene of interest. This, combined with the subjectiveness of the cutoff rule for gene selection, may largely explain the difference between Method II and the proposed new method.

Method III [14] uses a t-statistic as the data point for cluster analysis. The original gene expression profile is converted into a t score. Some information may be lost during the conversion because the method fails to incorporate the error involved in calculating the t score. In addition, the t score is the differential gene expression divided by the standard deviation of the difference. The calculated standard deviation, however, is subject to large error when the number of replicates per condition is small, which is usually the case in microarray experiments.

Similar to Method II, Method V [13] creates a list of significant genes based on the moderated t-statistic. Extra information is borrowed, through the hierarchical model, from the ensemble of genes to assist in the inference about each individual gene. However, users need to specify the mixing proportion subjectively before the algorithm is applied. The authors pointed out that estimates of the mixing proportions are somewhat unstable and suggested using 0.01 as a universal prior. In the real mice data, the proportion of differentially expressed genes is about 0.002, so the default prior of 0.01 overestimates it. This may explain why limma detected too many false positives in both the simulation study and the mice data analysis.

Method VI [15] is quite similar to the proposed method except that the observed differential expression between the control and the treatment is used for cluster analysis instead of the expected differential expression (the parameter) serving as the basis for clustering. Again, the error structure of the expression profile is not properly incorporated into the model when the normalization-like procedure is used. This explains why the proposed new method outperforms Method VI. In addition, Method VI is based on the EM algorithm, while the proposed method is based on the MCMC algorithm. The EM algorithm may get stuck in a local optimum, whereas MCMC has a reasonable chance of escaping such a trap. This may explain why Method VI failed in situations where the cluster borders are fuzzy, that is, when the effects of the differentially expressed genes vary over a wide range rather than a narrow one (data not shown).

In a quantitative trait-associated microarray data analysis, Jia and Xu [25] classified genes into several clusters based on the association between gene expression and the phenotypic value of a quantitative trait. The Bayesian method developed here can be used for such a quantitative trait-associated microarray data analysis. Recall that, in the differential expression analysis, the design matrix $X$ is a variable indicating whether a chip comes from the control or the treatment. To make such an extension, the design matrix $X$ in the differential expression analysis is simply replaced by the phenotypic values of the trait in the quantitative trait-associated microarray analysis. The method developed here has the following extra features compared to that of Jia and Xu [25]: (a) a Bayesian method implemented via the MCMC algorithm, (b) constraints on the cluster means, (c) an embedded normalization step, and (d) a fixed number of clusters.

In differential gene expression analysis, we usually deal with two conditions, control and treatment, with the purpose of identifying genes whose expression levels have altered between the two conditions. In time-course and dose-response [26] microarray experiments, however, gene expression is measured under multiple conditions. The Bayesian mixture model analysis developed here may be extended to detect differentially expressed genes across multiple conditions. To make such an extension, the dimensions of $\gamma_i$ and $X$ in model (2) need to be changed to reflect the multiple conditions. Let $d$ be the number of conditions. The modified dimensions of $\gamma_i$ and $X$ should be $d \times 1$ and $N \times d$, respectively. The change of $\gamma_i$ from a scalar into a vector leads to the following consequences: (1) the clusters of down- and upregulated genes are not explicitly defined, although the neutral cluster is still defined as the one with $\beta_1 = 0$ (a null vector), and (2) the variance of $\gamma_i$ in the two-condition experiment becomes a variance-covariance matrix in the multiple-condition experiment. The first problem may be solved via the following approach. Let $C$ be the total number of clusters from which the genes are sampled. Let the first cluster be the one consisting of all the neutral genes, so that $\beta_1 = 0$ is enforced for this cluster. The means of the remaining clusters, $\beta_2, \ldots, \beta_C$, are not restricted, that is, they are sampled freely from their posterior distributions (multivariate normal). The number of clusters $C$ is found based on the value of the Bayes factor or BIC. Genes not classified into the neutral cluster are claimed to be differentially expressed. The actual number of neutral genes may not be sensitive to the choice of the total number of clusters; therefore, one may simply choose any reasonable number of clusters for the analysis. This number may be a function of $d$, say $C = 3^d$. Note that $C = 3$ for a single dimension (one regression coefficient), as done in this study. For two regression coefficients, $C = 3 \times 3 = 9$. Therefore, for $d$ regression coefficients, the number of clusters becomes $C = 3^d$. The second problem may be tackled as follows. The scaled inverse chi-square distribution chosen for the variance may be replaced by its generalization for a variance-covariance matrix, the inverse-Wishart distribution [21]. This prior distribution for the variance-covariance matrix is conjugate, and thus the standard algorithm for sampling from an inverse-Wishart distribution applies.
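As a schematic summary of this extension, the multiple-condition model can be written as follows; the notation is ours (in particular, $V_k$ denotes the cluster variance-covariance matrix) and only restates the description above.

```latex
% Sketch of the multiple-condition extension (our notation, restating the text above)
\begin{gather*}
  y_i = x\,\gamma_i + \epsilon_i, \qquad x \in \mathbb{R}^{(N-1)\times d}, \quad \gamma_i \in \mathbb{R}^{d}, \\
  \gamma_i \mid z_i = k \;\sim\; N_d(\beta_k, V_k), \qquad k = 1, \ldots, C,
    \qquad \beta_1 = \mathbf{0} \ \text{(neutral cluster)}, \\
  V_k \sim \text{Inverse-Wishart}, \qquad C = 3^{d}
    \ \ (\text{e.g., } C = 3 \text{ for } d = 1,\; C = 9 \text{ for } d = 2).
\end{gather*}
```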

The intercept $\alpha_i$ of gene $i$ in model (2) is irrelevant to the differential expression analysis, and thus it is removed from the model via a special normalization process, called linear contrasting. In fact, all effects not related to differential expression can be removed via such a normalization process. Deleting these irrelevant effects (e.g., dye effects, block effects, etc.) may avoid the tedious estimation procedures for the nuisance parameters used in Zhang et al. [1] and save substantial computation time. To remove these effects, the matrix $W$ and the vector $\alpha_i$ need to be customized to reflect the general nature of the normalization. Let $h$ be the number of irrelevant effects. The dimensions of $W$ and $\alpha_i$ should be $N \times h$ and $h \times 1$, respectively. The coefficient matrix for the linear contrasts, $L$, remains the same as that defined in (6). However, matrix $L^*$ now takes the first $N - r(W)$ eigenvectors of matrix $L$. This generalized approach of normalization is conceptually similar to the ANOVA analysis proposed by Kerr et al. [27], where two steps are involved. In the first step, $\alpha_i$ is estimated under the model $Y_i = W\alpha_i + \epsilon_i$, and then $Y_i^* = Y_i - W\hat{\alpha}_i$, the residuals adjusted for the irrelevant effects, are used as the raw data for further differential expression analysis. In the second step, $Y_i^*$ is described by the model $Y_i^* = X\gamma_i + \epsilon_i$, and then $\gamma_i$ is estimated and tested to make an inference about the significance of gene $i$. Three issues need to be emphasized in this comparison: (1) the generalized normalization proposed in this study removes $\alpha_i$ without requiring an explicit estimate of $\alpha_i$, (2) the reduced degrees of freedom after adjusting for the irrelevant effects are used in the new method, and (3) an appropriate covariance structure for the residuals is used in the new method.
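A sketch of this generalized normalization is given below; the design held in $W$ (an intercept plus one additional irrelevant effect) is hypothetical.

```python
import numpy as np

# Sketch: W holds h columns of irrelevant effects; L* is built from the eigenvectors of L
# with eigenvalue 1 instead of by deleting a row.
rng = np.random.default_rng(3)

N, h = 16, 2
dye = np.repeat([0.0, 1.0], N // 2)[rng.permutation(N)]   # a hypothetical second irrelevant effect
W = np.column_stack([np.ones(N), dye])                     # N x h

L = np.eye(N) - W @ np.linalg.pinv(W.T @ W) @ W.T          # same form as equation (6)
eigval, eigvec = np.linalg.eigh(L)                         # L is symmetric and idempotent
L_star = eigvec[:, eigval > 0.5].T                         # the N - r(W) eigenvectors with eigenvalue 1
R = L_star @ L_star.T

assert np.allclose(L_star @ W, 0.0)                        # all h irrelevant effects are removed
print(L_star.shape, np.allclose(R, np.eye(N - np.linalg.matrix_rank(W))))
```

Because the retained eigenvectors are orthonormal, $R$ reduces to the identity matrix under this construction, which simplifies the residual covariance to $\mathrm{Var}(\epsilon_i^*) = I\sigma^2$.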

5. Conclusions

In this paper, we propose a Bayesian mixture model approach to detect genes that are differentially expressed in control-treatment microarray experiments. Genes are assigned into one of three constrained clusters, corresponding to downregulated, neutral, and upregulated genes, respectively. The Bayesian method is implemented via the Markov chain Monte Carlo (MCMC) algorithm. Genes assigned into the nonneutral clusters are the target genes that we aim to identify. Using simulated data as well as data from a real microarray experiment, we demonstrate that the new method outperforms the methods commonly used in differential expression analysis. Although the new method was demonstrated using data from laboratory animals, it readily generalizes to genome studies of plants.

Acknowledgments

This research was supported by the National Institutes of Health Grant R01-GM55321 and the National Science Foundation Grant DBI-0345205 to SX.