We propose a Bayesian procedure to cluster temporal gene expression microarray profiles,
based on a mixed-effect smoothing-spline model, and design a Gibbs sampler to sample from
the desired posterior distribution. Our method can determine the cluster number automatically
based on the Bayesian information criterion, and handle missing data easily. When applied
to a microarray dataset on the budding yeast, our clustering algorithm provides biologically
meaningful gene clusters according to a functional enrichment analysis.
1. Introduction
Microarray technology enables the
scientist to measure the mRNA expression levels of thousands of genes
simultaneously. For a particular species of interest, one can make microarray
measurements under many different conditions and for different types of cells
(if it is a multicellular organism). Genes' expression profiles under these
conditions often give the scientist some clues on biological roles of these
genes. A group of genes with similar profiles are often “coregulated” or
participants of the same biological functions.
When a series of microarray experiments are conducted
sequentially during a biological process, we call the resulting dataset a
“temporal” microarray dataset, which can provide insights on the underlying
biology and help decipher the dynamic gene regulatory network. Clustering genes
with similar temporal profiles is a crucial first step to reveal potential
relationships among the genes.
Conventional clustering methods, such
as
the K-means and hierarchical clustering, do not take
into consideration the correlation in the gene expression
levels over time. Although it is possible to use a general multivariate
Gaussian model to account for the correlation structure, such a model ignores
the time order of the gene expressions. As evidenced in our example, the time
factor is important in interpreting the results of gene expression clustering
in temporal data. It is also possible to use an autoregression model to
describe the gene expression time series, but such a model often requires
stationarity, which is unlikely to hold in most temporal microarray data.
Recently, nonparametric analysis of data in
the form of curves, that is, functional data, is
subject to active research, see [1, 2]
for a comprehensive treatment of functional data analysis; and curve-based
functional clustering methods have emerged [3–7], but a rigorous assessment
of the estimation precision is still lacking.
In this paper, we propose a Bayesian clustering
method, which optimally combines the available information and provides a
proper uncertainty measure for all estimated quantities. Our method is based on
a mixture of mixed-effect smoothing splines models. For each cluster, we model
its mean profile as a smoothing spline function and describe its individual
gene's variation by a parametric random effect. Based on the theory of
reproducing-kernel Hilbert spaces [8], we represent the mean expression curve as a linear
combination of certain basis functions, which enables us to derive the full
posterior distribution up to a normalizing constant. All the conditional
distributions needed by a Gibbs sampler are also easy to compute and to sample
from. Our method automatically takes care of the missing data and infers the
number of clusters in the data. Using the method, we analyzed a microarray dataset
of budding yeast, we found that the majority of the clusters we had obtained
are enriched for known and expected biological functions.
Our method is not restricted to temporal microarray
data, and can be applied to all curve clustering problems, especially for
sparsely and irregularly sampled temporal data.
2. Material and Methods
2.1. Mixed-Effect Representation of Gene Expression Profile
Let the expression value of the th gene at time be . To accommodate missing data that occasionally occurs
in microarray experiment, we denote and , where is the number
of measurements of th gene. Our
mixed-effect smoothing spline model [9] for genes in one cluster iswhere is the
cluster's mean profile, is the random
effect to capture the intragene correlation, is the known
design matrix for the random effect, and is the random
error independent of and of each
other.
By taking different vectors, we can
accommodate different nonrandom effects. For example, when and , the expression profile of the th gene is
parallel to the mean profile (Figure 1). If and , the difference between the th gene
profile and the mean profile is a linear function in time. More complicated
structures such as periodicity can be modeled by
letting the be basis of a
certain functional space.
Figure 1: A
smoothing-spline mixed effect model for temporal gene expression.
By considering in a
reproducing kernel Hilbert space in which is a square
seminorm, we can represent aswhere is a set
consisting of all distinct , is the number
of , and is the kernel
of . The choice of yields the
cubic smoothing spline withwhere [10].
Writing (2) in a vector-matrix form, we
havewhere is with the th entry and is with the th entry . Substituting (5) into (1), we
haveDenoting and , , , similarly, we
have the matrix representationwhere .
The prior distributions are specified as
follows:where IG and IW are
inverse-Gamma and inverse-Wishart distributions, respectively.
These priors lead to the following full conditional
posteriors, which are used in our Gibbs sampler:where , , , and .
2.2. The Mixture Model with Unknown Number of Components
When more than one cluster is considered, we assume
that the expression of the th gene has a
Gaussian mixture distribution:where and are the mean
and covariance matrix for the th component,
as given by (7); is the fraction
of th component,
and is the number
of Gaussian components.
2.3. Class Labels and Cluster Numbers
To ease the
computation, we introduce a “latent” membership labeling variable for the th gene so
thatWhen the number of Gaussian
components is known, we
can get the joint posterior probability aswhere , , , and is the joint
prior distribution.
Since is unknown, we
used the following Bayesian information criterion (BIC):where is the current
model with parameters , is the
estimate, and is the total
number of parameters in our model. A small BIC score indicates the adequacy of
the corresponding model. An alternative to our current approach (i.e., each
clustering configuration is equally likely given the number of clusters , and is determined
by BIC) is to use a Polya Urn prior (also called the “Chinese restaurant”
process), which postulates that when a new member comes in, its a priori
probability for joining an existing cluster of size is , and for forming a new cluster of its own is , where is the total
number of existing members. This prior, however, favors unbalanced cluster
configurations (e.g., very large and very small clusters) and may not be appropriate
in our applications.
2.3.1. Gibbs Sampling from the Posterior
To complete our Bayesian analysis, we employ the
Dirichlet prior Di for , the cluster proportions. Thus, given the cluster
indicator , the posterior distribution of the 's is again a
Dirichlet distribution.
Given , we have the
conditional distribution of :
With an initial value of , which gives rise to a partition of , and the initial values of , , , , where , as well as , we iterate the following iterative conditional
sampling steps:(i) for , draw a new from the
conditional distribution from (14) to replace the old one;(ii) conditional on , sequentially (a)update by a draw from , where ,(b)update from , where ,(c) update from , where ,(d) update , and , where ,(e) update ,(f)update , where is the number
of genes in the th cluster.
3. Results and Discussion
To study oxygen-responsive gene network, Lai et al.
[11] used cDNA
microarray to monitor the gene expression changes of wild-type budding yeast (Saccharomyces cerevisiae) under aerobic condition in galactose medium.
Under aerobic condition, the oxygen concentration was lowered gradually until
oxygen was exhausted during a period of ten minutes. Microarray experiments
were conducted at 14 time points under aerobic condition. A reference sample
was obtained from a pooled RNA collected from all time points for
hybridization.
For the analysis, Lai et al. [11] normalized gene expression
after time 0 to gene expression of time 0 to set the common starting point.
They identified 2388 genes whose expression is differentially expressed at one
or more time points. Using our method, 2388 genes was clustered to 31 clusters,
which yields the smallest BIC. FunSpec [12] was used for gene annotation and biological function
enrichment analysis, where the Bonferroni-corrected functional enrichment -values
based on hypergeometric distributions are reported.
We found 23 clusters out of 31 clusters discovered have biological functions
over-represented. Among them, estimated mean gene expression profiles of three
clusters are given in Figure 2.
Figure 2: Estimated mean expression curves for cluster A, B, and C (from top to bottom)
discovered in the yeast aerobic expression data.
In cluster A, which consists of 40 genes, the estimated
mean expression goes up progressively as oxygen level goes down, which suggests
that the genes in this cluster were transiently upregulated in response to
aerobisis. Accordingly, genes involved in stress response (function enrichment -value ) as well as
cell rescue and defense are over-represented in this cluster (function
enrichment -value ). Furthermore,
genes involved in molecular functions of oxidoreductase and coproporphyrinogen
oxidase are also presented, which explains the upregulation of the gene
expression levels.
We have 92 genes in cluster B, where the estimated
mean gene expression drops down at the beginning rapidly and then goes up
gradually. In this cluster, 34 genes are involved in protein synthesis
(function enrichment -value ). Moreover,
ribosome biogenesis are also over-represented (function enrichment -value ). These
processes were affected by oxygen level initially, but were quickly adjusted to
high expression levels to maintain living of yeast.
Contrast to cluster B, cluster C (68 genes) consists
of genes involved in galactose fermentation (function enrichment -value ), carbon
utilization (functional enrichment -value ), and
carbohydrate metabolism (function enrichment -value ). The initial
upregulation of gene expression under aerobic condition can be partly explained
by the fact that the cell increases the energy
uptaking through the carbon utilization as oxygen level goes down; but as the
oxygen level continues to drop down, these processes are replaced by the more
energy-efficient processes, which drives the expression levels of genes to be
downregulated.
4. Conclusions
Conventional clustering methods do not take into consideration the correlation
in the gene expression levels over time. Multivariate Gaussian models and time
series analysis cannot model the time factor and correlation properly. These
limitations can be readily overcome by the full Bayesian approach developed
here. Although certain prior distributions and the related hyperparameters need
to be input by the user, we found the clustering results rather robust to
variations in such inputs. Moreover, our Bayesian clustering algorithm serves
as a platform to incorporate more biological knowledge. Open source R code is
available at www.stat.uiuc.edu/~pingma/BayesianFDAClust.htm.
Acknowledgments
The authors thank C. I. Castillo-Davis for his help in designing Figure 1 and for the
many constructive suggestions and discussion during the early stages of this
article. The authors thank Ji Young Kim for designing the software website. The authors also thank
Kurt Kwast for providing the yeast microarray data.