Advances in Bioinformatics

Volume 2010 (2010), Article ID 749848, 17 pages

http://dx.doi.org/10.1155/2010/749848

## Modelling Nonstationary Gene Regulatory Processes

^{1}Department of Statistics, TU Dortmund University, 44221 Dortmund, Germany
^{2}Biomathematics and Statistics Scotland, JCMB, The King's Buildings, Edinburgh EH9 3JZ, UK

Received 18 December 2009; Accepted 29 April 2010

Academic Editor: Yves Van de Peer

Copyright © 2010 Marco Grzegorczyk et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

An important objective in systems biology is to infer gene regulatory networks from postgenomic data, and dynamic Bayesian networks have been widely applied as a popular tool to this end. The standard approach for nondiscretised data is restricted to a linear model and a homogeneous Markov chain. Recently, various generalisations based on changepoint processes and free allocation mixture models have been proposed. The former aim to relax the homogeneity assumption, whereas the latter are more flexible and, in principle, more adequate for modelling nonlinear processes. In our paper, we compare both paradigms and discuss theoretical shortcomings of the latter approach. We show that a model based on the changepoint process yields systematically better results than the free allocation model when inferring nonstationary gene regulatory processes from simulated gene expression time series. We further cross-compare the performance of both models on three biological systems: macrophages challenged with viral infection, circadian regulation in *Arabidopsis thaliana*, and morphogenesis in *Drosophila melanogaster*.

#### 1. Introduction

An important objective in systems biology is to infer regulatory networks from postgenomic data. Bayesian networks have been widely applied as a popular tool to this end—see, for example, [1]—and novel fast Markov chain Monte Carlo (MCMC) algorithms can be applied to systematically search the space of network structures for those that are most consistent with the data [2, 3]. One reason for the popularity of Bayesian networks has been the tractability of the marginal likelihood of the network structure. This term describes how well the model structure explains the data. Its computation is usually challenging, as it requires the solution of an integral over the entire parameter space. To obtain a closed-form expression, two probabilistic models with their respective conjugate prior distributions have been employed: the multinomial distribution with the Dirichlet prior (BDe) and the Gaussian distribution with the normal-Wishart prior (BGe). These approaches are restricted in that they either require a data discretisation (BDe: Bayesian Dirichlet equivalence score) or can only capture linear relationships (BGe: Bayesian Gaussian equivalence score). Recently, a generalisation of the BGe model based on a combination of a mixture model and the allocation sampler has been proposed [4], the Bayesian Gaussian Mixture (BGM) model. In the BGM model, data points are assigned to different compartments (subsets of the data) with the allocation sampler [5]. Model parameters (and their distributions) are allowed to differ between compartments, while information is shared among the compartments via a common network structure. Given the network structure each compartment is modelled separately and independently with the Gaussian BGe model.

The present work proposes a modification of the BGM model which is more suitable for dynamic gene regulatory networks and gene expression time series. The proposed model, which we refer to as the cpBGM (changepoint BGM) model, replaces the free allocation model by a multiple changepoint process, as, for example, employed in [6], to divide the time series into segments and thereby take the time structure into account.

In a preliminary study [7], we focused on synthetic data from small network domains. We found that the new model avoids spurious self-feedback loops when self-feedback loops (e.g., in molecular biology: transcription factors regulating their own transcription) are not excluded from the analysis. In this paper, we present the mathematical details of the new model in greater depth, and we demonstrate that cpBGM yields better inference results for real biological systems. To this end, we analyse two gene expression time series from macrophages and *Arabidopsis thaliana* and cross-compare the inference results of the BGM model and the new changepoint process model.

Furthermore, we combine both models (the free allocation model and the changepoint process) with discrete Bayesian network methodology, and we compare the performance of both models on inferring the morphogenetic stages of muscle development in *Drosophila melanogaster* from binary gene expression time series.

We note that our modelling paradigm is complementary to other recently proposed approaches. A nonlinear nondiscretised model based on heteroscedastic regression has been proposed in [8]. However, this approach no longer allows the marginal likelihood to be obtained in closed form and requires a restrictive approximation (the Laplace approximation, that is, an approximation based on a 2nd-order Taylor series expansion of the log likelihood) to be adopted. Another nonlinear model based on node-specific Gaussian mixture models has been proposed in [9, 10]. Again, the marginal likelihood is intractable. The authors resort to the Bayesian information criterion (BIC) of [11] for model selection, which is only a good approximation to the marginal likelihood in the limit of very large data sets. A more flexible approach based on changing graphs between changepoints has been proposed in [12, 13]. Conceptually, the assumption of changing networks is reasonable for some biological scenarios, like morphogenesis, where gene-regulatory processes have been measured over a long period of time. However, for cellular processes on a shorter time scale, it is questionable whether it is the network structure rather than just the strength of the regulatory interactions that changes with time. For example, what presumably changes with time is not the principal ability of a transcription factor to bind to the promoter of a gene and thereby initiate transcription (i.e., not the network structure), but the extent to which this happens (i.e., the regulatory interaction strength). We therefore argue that, especially for short time series, it is more appropriate to leave the network structure invariant among segments and to allow the interaction parameters to vary with time by modelling the time segments (between changepoints) separately and independently with the Gaussian BGe model.
The systematic sharing of information among segments via a common network structure also reduces potential problems with model overflexibility and overfitting, which a more flexible approach that allows for different network structures is susceptible to when the time series are short.

#### 2. Methods

##### 2.1. The Dynamical BGe Network

*Dynamical Bayesian networks* (DBNs) are flexible models for representing probabilistic relationships among interacting variables $X_1,\ldots,X_N$. The graph $\mathcal{G}$ of a DBN describes the relationships between the variables (nodes) $X_1,\ldots,X_N$, which have been measured at equidistant time points $t=1,\ldots,m$, in the form of conditional probability distributions. An edge pointing from $X_i$ to $X_j$ means that the realisation of $X_j$ at time point $t$, symbolically $X_j(t)$, is influenced by the realisation of $X_i$ at the previous time point, symbolically $X_i(t-1)$. Effectively, there is a bipartite graph structure between the two time steps $t-1$ and $t$, so that the acyclicity constraint—known from static Bayesian networks—is guaranteed to be satisfied. Therefore, in principle, each node can be its own parent node in DBNs. Such self-feedback loops model autocorrelations, and it depends on the application whether such self-feedback loops should be valid edges or ruled out altogether to focus the inference on the interactions between genes. $\pi_n$ denotes the parent (node) set of node $X_n$ in $\mathcal{G}$; that is, the set of all nodes from which an edge points to node $X_n$ in $\mathcal{G}$. Given a data set $\mathcal{D}$, where $\mathcal{D}_{n,t}$ and $\mathcal{D}_{\pi_n,t}$ are the $t$th realisations $X_n(t)$ and $\pi_n(t)$ of $X_n$ and $\pi_n$, respectively, DBNs are based on the following homogeneous Markov chain expansion:

$$P(\mathcal{D}\mid\mathcal{G},\boldsymbol{\theta})=\prod_{n=1}^{N}\prod_{t=2}^{m}P\big(X_n(t)\mid\pi_n(t-1),\boldsymbol{\theta}_n\big),\tag{1}$$

where $\boldsymbol{\theta}$ is the total parameter vector, composed of node-specific subvectors $\boldsymbol{\theta}_n$, and $\boldsymbol{\theta}_n$ specifies the $n$th local conditional distribution in the factorisation. The BGe model specifies the distributional form as a multivariate Gaussian distribution, and it assumes a normal-Wishart distribution as prior distribution [14]. The local probability distributions are then given by conditional linear Gaussian distributions. For discrete data, the BDe model specifies the distributional form of the likelihood as a set of multinomial distributions, and assumes a Dirichlet distribution as the prior for the unknown parameters [15]. In principle, the BDe model yields a higher modelling flexibility than the BGe model, but BDe requires a data discretisation that usually incurs a substantial information loss. Under fairly weak conditions imposed on the parameter vector (prior independence and modularity) and the prior distribution (conjugacy), so that $P(\boldsymbol{\theta}\mid\mathcal{G})=\prod_{n=1}^{N}P(\boldsymbol{\theta}_n\mid\mathcal{G})$, the parameters can be integrated out analytically. For both scoring metrics, BGe and BDe [16], the marginal likelihood then satisfies the same expansion rule as the Bayesian network with fixed parameters:

$$P(\mathcal{D}\mid\mathcal{G})=\prod_{n=1}^{N}\Psi\big(\mathcal{D}_n^{\pi_n}\big),\tag{2}$$

where

$$\Psi\big(\mathcal{D}_n^{\pi_n}\big)=\int\prod_{t=2}^{m}P\big(X_n(t)\mid\pi_n(t-1),\boldsymbol{\theta}_n\big)P(\boldsymbol{\theta}_n\mid\mathcal{G})\,d\boldsymbol{\theta}_n\tag{3}$$

and $\mathcal{D}_n^{\pi_n}$ denotes the subset of the data pertaining to node $X_n$ and its parent set $\pi_n$. For the Gaussian BGe model, the (local) factors $\Psi(\cdot)$ in (3) can be computed in closed form, according to (15) and (24) in Geiger and Heckerman [14]. For the discrete multinomial BDe model, the (local) factors can be computed in closed form according to (19) in [16].
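The per-node factorisation described above can be made concrete with a short Python sketch: a DBN is scored by summing independent local scores over (node, parent-set) data slices, where each slice pairs a node's value at time $t$ with its parents' values at $t-1$. The `local_score` below is a crude stand-in (a residual-variance Gaussian log-likelihood on a summed parent signal), not the closed-form BGe score of Geiger and Heckerman; all names here are illustrative.

```python
import math

def lagged_slices(data, node, parents):
    """Pair each realisation X_node(t) with its parents' realisations at t-1.
    data is a list of rows, one row of N values per time point."""
    X = [data[t][node] for t in range(1, len(data))]
    Z = [sum(data[t - 1][p] for p in parents) for t in range(1, len(data))]
    return X, Z

def local_score(X, Z):
    """Stand-in for the local factor Psi: Gaussian log-likelihood of the
    residuals of a 1-D least-squares fit of X on the summed parent signal Z.
    (A real implementation would use the closed-form BGe score instead.)"""
    n = len(X)
    zm, xm = sum(Z) / n, sum(X) / n
    szz = sum((z - zm) ** 2 for z in Z)
    beta = (sum((z - zm) * (x - xm) for z, x in zip(Z, X)) / szz) if szz else 0.0
    rss = sum((x - xm - beta * (z - zm)) ** 2 for z, x in zip(Z, X)) + 1e-12
    return -0.5 * n * math.log(rss / n)

def network_score(data, parent_sets):
    # The marginal likelihood factorises into one independent term per node.
    return sum(local_score(*lagged_slices(data, n, ps))
               for n, ps in parent_sets.items())

data = [[0.1, 0.2], [0.4, 0.1], [0.2, 0.5], [0.6, 0.3], [0.3, 0.6]]
# Self-feedback (node 0 -> node 0) is legal: edges always point from t-1 to t.
score = network_score(data, {0: [0], 1: [0]})
```

Because the score decomposes over nodes, a local change to one parent set only requires recomputing that node's factor, which is what makes structure MCMC over DBNs cheap per move.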

##### 2.2. Gaussian Mixture Bayesian Network Model

In the Gaussian BGe model, the local distributions are conditional linear Gaussian distributions. We generalise the BGe model by the introduction of a latent allocation vector $V$, which assigns the data points to $K$ different mixture components, where $K$ is inferred from the data by applying changepoint birth and death moves, along the lines of the changepoint model (e.g., see [6, 17]). As in the BGM model [4], conditional on the latent vector $V$, a separate BGe score can be computed for each of the $K$ mixture components.

The allocation vector $V$ of size $m$ describes the allocation of the time points to the $K$ components; $\{t:V(t)=k\}$ denotes all time points that are allocated to component $k$. The posterior probability of $\mathcal{G}$, $V$, and $K$ is proportional to the joint distribution,

$$P(\mathcal{G},V,K\mid\mathcal{D})\propto P(\mathcal{G},V,K,\mathcal{D}),\tag{4}$$

and the joint distribution can be factorised as follows:

$$P(\mathcal{G},V,K,\mathcal{D})=P(\mathcal{G})\,P(K)\,P(V\mid K)\,P(\mathcal{D}\mid\mathcal{G},V,K),\tag{5}$$

$$P(\mathcal{D}\mid\mathcal{G},V,K)=\prod_{n=1}^{N}\prod_{k=1}^{K}\Psi\big(\mathcal{D}_n^{\pi_n}[k]\big),\tag{6}$$

where $\mathcal{D}_n^{\pi_n}[k]$ denotes the realisations of node $X_n$ and its parent set $\pi_n$ for those time points that have been allocated to the $k$th component. It can be seen from these equations that $V$ acts as a filter which divides the data $\mathcal{D}$ into different compartments $\mathcal{D}[k]$, for which separate independent BGe scores can be computed in closed form using (2) and (3). The BGM counterpart of (3) is given by

$$\Psi\big(\mathcal{D}_n^{\pi_n}[k]\big)=\int\prod_{t:\,V(t)=k}P\big(X_n(t)\mid\pi_n(t-1),\boldsymbol{\theta}_n^k\big)P(\boldsymbol{\theta}_n^k\mid\mathcal{G})\,d\boldsymbol{\theta}_n^k.\tag{7}$$

For instance, if there is one changepoint between time points $h$ and $h+1$, so that $V$ assigns the time points $2,\ldots,h$ to the first and the remaining time points $h+1,\ldots,m$ to the second segment, then separate local BGe scores are computed for the data subsets $\mathcal{D}_n^{\pi_n}[1]$ and $\mathcal{D}_n^{\pi_n}[2]$, according to (15) and (24) in Geiger and Heckerman [14].
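The role of the allocation vector as a filter can be illustrated in a few lines of Python (a hypothetical helper, not code from the paper): changepoints on the time axis induce the segment membership of each observed time point, and the per-segment data subsets are then scored independently.

```python
def allocation_vector(m, changepoints):
    """V[t] = k if observation t (t = 2..m) falls into segment k, where the
    segments are delimited by the given changepoints."""
    bounds = [2] + sorted(changepoints) + [m + 1]  # pseudo-changepoints framing the segments
    return {t: next(k for k in range(1, len(bounds))
                    if bounds[k - 1] <= t < bounds[k])
            for t in range(2, m + 1)}

# One changepoint between t = 5 and t = 6 splits t = 2..10 into two segments:
V = allocation_vector(10, [6])
segments = {k: [t for t in V if V[t] == k] for k in set(V.values())}
```

Each list in `segments` would then be handed to a separate local score, exactly as in the two-segment example above.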

When a data compartment $\mathcal{D}[k]$ is empty, we set the corresponding factors equal to 1 ($\Psi(\mathcal{D}_n^{\pi_n}[k])=1$). For $P(\mathcal{G})$, we take a uniform distribution over all graph structures subject to a fan-in restriction. For $P(K)$, we take a truncated Poisson distribution, restricted to $1\le K\le K_{\max}$, as prior. We note that the MCMC inference scheme, which we will discuss in Section 2.4, does not sample $P(V\mid K)$ directly, but is based on local modifications of $V$ through changepoint birth, death, and reallocation moves. That is, different from the free allocation in the BGM model [4], we change the assignment of data points to components via a changepoint process (e.g., see [6, 17]). This reduces the complexity of the allocation space substantially and incorporates our prior knowledge that adjacent time points are likely to be assigned to the same component. We will refer to the new changepoint BGM model as the cpBGM model.

We identify $K$ components with $K-1$ changepoints $\tau_1<\cdots<\tau_{K-1}$ on the continuous interval $[2,m]$, and for notational convenience we introduce the pseudo-changepoints $\tau_0=2$ and $\tau_K=m$. The observation at time point $t$ is assigned to the $k$th component, symbolically $V(t)=k$, if $\tau_{k-1}\le t<\tau_k$ (with the last time point $t=m$ assigned to component $K$). Following Green [17], we assume that the changepoints are distributed as the even-numbered order statistics of $2(K-1)+1$ points uniformly and independently distributed on the interval $[2,m]$. The motivation for this prior, instead of taking $K-1$ uniformly distributed points, is to encourage *a priori* an equal spacing between the changepoints, that is, to discourage mixture components that contain only a short compartment of the time series. We note that the even-numbered order statistics prior on the changepoint locations induces a prior distribution $P(V\mid K)$ on the allocation vector $V$. Deriving a closed-form expression is involved but not required, as our MCMC inference scheme does not sample $V$ directly. The inference is based on local modifications of $V$. In particular, we employ three different move types: (i) an additional component ($K\to K+1$) can be generated by setting a new changepoint on the interval $[2,m]$ (changepoint birth move), (ii) the number of components can be decremented ($K\to K-1$) by removing one of the existing changepoints (changepoint death move), and, finally, (iii) the allocation vector $V$ can be changed without affecting the number of components by changing the position of one of the existing changepoints (changepoint reallocation move). For the acceptance probabilities of these local moves, only $P(V\mid K)$ ratios, which are straightforward to compute, are required.
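Green's even-numbered order statistics prior can be sampled directly, which makes its spacing-encouraging effect easy to verify empirically. The sketch below is illustrative; the interval bounds are placeholders for $[2,m]$.

```python
import random

def sample_changepoints(K, lo=2.0, hi=10.0, rng=random):
    """Draw the K-1 changepoints as the even-numbered order statistics of
    2(K-1)+1 points drawn uniformly and independently on [lo, hi]."""
    pts = sorted(rng.uniform(lo, hi) for _ in range(2 * (K - 1) + 1))
    return pts[1::2]  # keep the 2nd, 4th, ... smallest points

random.seed(0)
cps = sample_changepoints(K=3)  # two changepoints delimiting three segments
```

Discarding the odd-numbered order statistics forces at least one unused point between (and outside) consecutive changepoints, which is what discourages very short segments a priori.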

##### 2.3. Multinomial Mixture Bayesian Network Model

The discrete multinomial BDe model can be generalised in analogy to the continuous Gaussian BGe model. As before, the allocation vector $V$ divides the data into compartments, and each compartment of the data is modelled separately and independently with the multinomial BDe scoring metric. We note that closed-form solutions for the $\Psi(\cdot)$-terms in (7) can be obtained with (19) in [16]. We will refer to the free allocation model BGM with the BGe score being replaced by the discrete multinomial BDe score as the Bayesian Discrete Mixture (BDM) model, and accordingly we will refer to the discrete BDe counterpart of cpBGM as the cpBDM model. We note that the cpBDM model is similar to the model of Robinson and Hartemink [12], except that cpBDM leaves the inferred network structure invariant in time to allow for more information sharing among segments. We will employ the discrete counterparts BDM and cpBDM for an independent comparison of the suitability of the free allocation model and the changepoint process for inferring dynamic gene-regulatory processes (from discrete data).

##### 2.4. MCMC Inference

We now describe an MCMC inference algorithm that can be used to obtain a sample from the posterior distribution $P(\mathcal{G},V,K\mid\mathcal{D})$. Our algorithm combines the structure MCMC algorithm for Bayesian networks [18] with the changepoint model (e.g., see [6, 17]), and draws on the fact that, conditional on the allocation vector $V$, separate BGe scores can be computed for the data compartments. Note that this approach is equivalent to the idea underlying the allocation sampler [5]. The resulting algorithm is effectively an RJMCMC sampling scheme in the discrete space of network structures and latent allocation vectors, where the Jacobian in the acceptance criterion is always 1 and can be omitted. With probability $p_{\mathcal{G}}$, we perform a traditional structure MCMC move on the current graph $\mathcal{G}$ and leave the latent vector $V$ and the number of mixture components $K$ unchanged, symbolically $V^{new}=V$ and $K^{new}=K$. A new candidate graph $\mathcal{G}^{new}$ is randomly drawn out of the set $\mathcal{N}(\mathcal{G})$ of graphs that can be reached from the current graph by deletion or addition of one single edge. The proposed graph is accepted with probability

$$A\big(\mathcal{G}^{new}\mid\mathcal{G}\big)=\min\left\{1,\;\frac{P(\mathcal{D}\mid\mathcal{G}^{new},V,K)}{P(\mathcal{D}\mid\mathcal{G},V,K)}\cdot\frac{P(\mathcal{G}^{new})}{P(\mathcal{G})}\cdot\frac{|\mathcal{N}(\mathcal{G})|}{|\mathcal{N}(\mathcal{G}^{new})|}\right\},\tag{8}$$

where $|\cdot|$ is the cardinality, and the marginal likelihood terms have been specified in (6). The graph is left unchanged, symbolically $\mathcal{G}^{new}=\mathcal{G}$, if the move is not accepted. We note that the network reconstruction will be based on the marginal posterior probabilities of the individual edges, which can be estimated for each edge from the MCMC sample $\mathcal{G}_1,\ldots,\mathcal{G}_T$ by the fraction of graphs in the sample that contain the edge of interest:

$$\hat{P}\big(X_i\to X_j\mid\mathcal{D}\big)=\frac{1}{T}\sum_{s=1}^{T}I_{i\to j}\big(\mathcal{G}_s\big),\tag{9}$$

where $I_{i\to j}$ is the indicator function with $I_{i\to j}(\mathcal{G})=1$ if there is an edge from $X_i$ to $X_j$ in $\mathcal{G}$, and $I_{i\to j}(\mathcal{G})=0$ otherwise.
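A structure move of this kind can be sketched as follows (illustrative Python with a toy scoring function, not the paper's implementation). Note the neighbourhood-cardinality correction in the Hastings ratio, which is needed because graphs near the fan-in boundary have fewer neighbours.

```python
import math, random

def neighbours(G, nodes, fan_in=3):
    """All graphs reachable by adding or deleting one edge (fan-in restricted).
    G is a frozenset of (parent, child) pairs; self-loops are legal in a DBN."""
    out = []
    for i in nodes:
        for j in nodes:
            e = (i, j)
            if e in G:
                out.append(G - {e})
            elif sum(1 for (p, c) in G if c == j) < fan_in:
                out.append(G | {e})
    return out

def structure_move(G, nodes, log_score, rng=random):
    """One structure-MCMC step: the acceptance ratio combines the score ratio
    with the neighbourhood cardinalities |N(G)| / |N(G_new)|."""
    Ns = neighbours(G, nodes)
    G_new = rng.choice(Ns)
    log_A = (log_score(G_new) - log_score(G)
             + math.log(len(Ns)) - math.log(len(neighbours(G_new, nodes))))
    return G_new if math.log(rng.random()) < min(0.0, log_A) else G

random.seed(1)
G = frozenset()
for _ in range(50):  # toy score favouring sparse graphs
    G = structure_move(G, nodes=[0, 1], log_score=lambda g: -0.5 * len(g))
```

In the actual sampler, `log_score` would be the log marginal likelihood of (6) plus the log graph prior, and the move would be interleaved with changepoint moves on $(V,K)$.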

With the complementary probability $1-p_{\mathcal{G}}$, we leave the graph unchanged, $\mathcal{G}^{new}=\mathcal{G}$, and perform a move on $(V,K)$. We change the current number of components $K$ via a changepoint birth or death move, or the allocation vector $V$ by a changepoint reallocation move, along the lines of the reversible jump Markov chain Monte Carlo (RJMCMC) algorithm [17].

The changepoint birth (death) move increases (decreases) $K$ by 1 and may also have an effect on $V$. The changepoint reallocation move leaves $K$ unchanged and may have an effect on $V$. If, with probability $1-p_{\mathcal{G}}$, a changepoint move on $(V,K)$ is performed, we randomly draw the move type. Under fairly mild regularity conditions (ergodicity), the MCMC sampling scheme converges to the desired posterior distribution if the equation of detailed balance is fulfilled [17]. The condition of detailed balance implies that for each move a complementary move is defined, and that the acceptance probability depends on the proposal probability of the complementary move. The moves presented below are designed such that there is a unique complementary death move for each birth move and vice versa. Moreover, each reallocation move can be reversed by a single (complementary) reallocation move. The acceptance probabilities for these three changepoint moves are of the following form [17]:

$$A=\min\{1,\;L\times R\times B\},\tag{10}$$

where $L=P(\mathcal{D}\mid\mathcal{G},V^{new},K^{new})\,/\,P(\mathcal{D}\mid\mathcal{G},V,K)$ is the marginal likelihood ratio, $R$ is the prior probability ratio, and $B$ is the inverse proposal probability ratio. The exact form of these factors depends on the move type.

(i) For a changepoint reallocation ($K^{new}=K$), we randomly select one of the existing changepoints $\tau_j$, and the replacement value $\tau_j^{new}$ is drawn from a uniform distribution on the interval $[\tau_{j-1},\tau_{j+1}]$, where $\tau_{j-1}$ and $\tau_{j+1}$ are the neighbouring (pseudo-)changepoints. Hence, the proposal probability ratio is one, the prior probabilities $P(K)$ cancel out, and the remaining prior probability ratio can be obtained from page 720 in Green's RJMCMC paper [17]:

$$R=\frac{\big(\tau_{j+1}-\tau_j^{new}\big)\big(\tau_j^{new}-\tau_{j-1}\big)}{\big(\tau_{j+1}-\tau_j\big)\big(\tau_j-\tau_{j-1}\big)}.\tag{11}$$

If there is no changepoint ($K=1$), the move is rejected and the Markov chain is left unchanged.

(ii) If a changepoint birth move ($K^{new}=K+1$) on $(V,K)$ is proposed, the location $\tau^{new}$ of the new changepoint is randomly drawn from a uniform distribution on the interval $[2,m]$; the proposal probability for this move is $b_K/(m-2)$, where $b_K$ is the ($K$-dependent) probability of selecting a birth move. The reverse death move, which is selected with probability $d_{K+1}$, consists in discarding randomly one of the $K$ changepoints. The inverse proposal probability ratio is thus given by $B=[d_{K+1}/K]\,/\,[b_K/(m-2)]$. The prior probability ratio is given by the expression at the bottom of page 720 in Green's RJMCMC paper [17], slightly modified to allow for the fact that $K$ components correspond to $K-1$ changepoints, and we obtain

$$R=\frac{P(K+1)}{P(K)}\cdot\frac{2K(2K+1)}{(m-2)^2}\cdot\frac{\big(\tau_{j+1}-\tau^{new}\big)\big(\tau^{new}-\tau_j\big)}{\tau_{j+1}-\tau_j},\tag{12}$$

where $\tau_j$ and $\tau_{j+1}$ are the changepoints (or pseudo-changepoints) flanking $\tau^{new}$. For $K=K_{\max}$, the birth of a new changepoint is invalid and the Markov chain is left unchanged.

We note that the proposal probabilities $b_K$ and $d_K$ for birth and death moves can be chosen as follows:

$$b_K=c\cdot\min\left\{1,\frac{P(K+1)}{P(K)}\right\},\qquad d_{K+1}=c\cdot\min\left\{1,\frac{P(K)}{P(K+1)}\right\},\tag{13}$$

where $c$ is a constant that can be chosen as large as possible subject to the constraint $b_K+d_K\le 0.9$ for all $K$. This choice yields both a reasonable acceptance rate of the MCMC sampling scheme [17] and a simple prior probability (Hastings) ratio. From (13) it follows that $b_K/d_{K+1}=P(K+1)/P(K)$, so that the ratio $d_{K+1}/b_K$ in the expression $B$ cancels out against the prior ratio $P(K+1)/P(K)$ in the expression $R$, and the combined ratio simplifies to

$$R\times B=\frac{2(2K+1)}{m-2}\cdot\frac{\big(\tau_{j+1}-\tau^{new}\big)\big(\tau^{new}-\tau_j\big)}{\tau_{j+1}-\tau_j}.\tag{14}$$
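The cancellation just described is easy to check numerically. The sketch below is illustrative: the Poisson rate and the constant `c` are assumed values, and the truncation form is an assumption; it computes the birth and death probabilities and verifies that their ratio reproduces the prior ratio.

```python
import math

def truncated_poisson(lam, k_max):
    """An assumed truncated Poisson prior: P(K) proportional to lam^K / K!
    on K = 1..k_max, renormalised."""
    w = [lam ** k / math.factorial(k) for k in range(1, k_max + 1)]
    s = sum(w)
    return {k + 1: v / s for k, v in enumerate(w)}

def move_probs(prior, c=0.3):
    """b_K = c*min(1, P(K+1)/P(K)) and d_{K+1} = c*min(1, P(K)/P(K+1)),
    so that b_K / d_{K+1} = P(K+1) / P(K) and the prior ratio cancels."""
    b, d = {}, {}
    for K in list(prior)[:-1]:
        r = prior[K + 1] / prior[K]
        b[K] = c * min(1.0, r)
        d[K + 1] = c * min(1.0, 1.0 / r)
    return b, d

prior = truncated_poisson(lam=1.0, k_max=5)
b, d = move_probs(prior)
```

With these choices, neither $b_K$ nor $d_{K+1}$ ever exceeds `c`, so the remaining probability mass is always available for the reallocation move.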

(iii) A changepoint death move ($K^{new}=K-1$) is the reverse of the birth move, and we obtain

$$R\times B=\frac{m-2}{2(2K-1)}\cdot\frac{\tau_{j+1}-\tau_j}{\big(\tau_{j+1}-\tau^{old}\big)\big(\tau^{old}-\tau_j\big)},\tag{15}$$

where $\tau^{old}$ is the changepoint selected for removal and $\tau_j<\tau^{old}<\tau_{j+1}$ are its neighbouring (pseudo-)changepoints.

#### 3. Data

We have evaluated the proposed cpBGM model on various synthetic data sets. For illustration purposes, we present results obtained for two studies with small networks, where the nonlinearity was implemented by a sinusoidal transformation. In a second study, we focus on the two gene expression time series from macrophages and *Arabidopsis thaliana* which have been used for evaluating the BGM model. More details on the experimental settings can be found in the paper on BGM [4]. In a third study, we switch from the Gaussian BGe score to the discrete multinomial BDe score to analyse discrete nonstationary time series from *Drosophila melanogaster*.

##### 3.1. Small Synthetic Networks

The first synthetic network consists of two nodes $X$ and $Y$ and possesses two edges. Node $X$ has a recurrent feedback loop, symbolically $X\to X$, and acts as a regulator of node $Y$, symbolically $X\to Y$. We consider the scenario of a nonlinear regulatory influence that $X$ exerts on $Y$, whereby we implement the nonlinearity by a sinusoid transformation of $X$'s signal on $Y$. The state-space equations are given by

$$X(t+1)=X(t)+w+\sigma_X\cdot\epsilon_X(t),\qquad Y(t+1)=\sin\big(X(t)\big)+\sigma_Y\cdot\epsilon_Y(t),$$

where $w$, $\sigma_X$, and $\sigma_Y$ are constants, and the $\epsilon_X(t)$ and $\epsilon_Y(t)$ are iid standard normally distributed random variables.

The second synthetic network is a generalisation of the two-node domain, where three nodes $Y_1$, $Y_2$, and $Y_3$ are regulated by $X$. The relationships are again realised by sinusoids, whereby we shift the periods:

$$Y_i(t+1)=\sin\big(X(t)+\phi_i\big)+\sigma_Y\cdot\epsilon_i(t),$$

with phase shifts $\phi_1$, $\phi_2$, and $\phi_3$. For both networks, we set the drift term $w$ such that (on average) the complete period of the sinusoid is involved, and we generate observations for four different combinations of the coefficients $\sigma_X$ and $\sigma_Y$.

##### 3.2. Bone Marrow Derived Macrophages

Interferons (IFNs) play a pivotal role in the innate and adaptive mammalian immune response against infection, and central research efforts therefore aim to elucidate their regulatory interactions [19]. For the present study, we have analysed gene expression time series from bone marrow-derived macrophages, which were sampled at equidistant time points. The macrophages were subjected to three external conditions: (1) infection with Cytomegalovirus (CMV), (2) treatment with Interferon Gamma (IFN), and (3) infection with Cytomegalovirus after pretreatment with IFN (CMV+IFN). Samples derived from the macrophages were hybridised to Agilent mouse genome arrays. We focus on the Interferon-regulatory factors Irf1, Irf2, and Irf3. These factors are the key regulators in the response of the macrophage cell to pathogens. The macrophages data sets used in the study are publicly available from http://www.bioss.ac.uk/associates/marco/supplement/.

##### 3.3. Circadian Regulation in *Arabidopsis thaliana*

We have also analysed two gene expression time series from *Arabidopsis thaliana* cells, which were sampled at $m=13$ time points in 2-hour intervals with Affymetrix microarray chips. The expressions were measured twice independently under experimentally generated constant-light condition, but the two experiments differed with respect to their prehistories. In the first experimental scenario, the plants were entrained in a 10 h : 10 h light/dark cycle, while the plants in the second experimental setting were entrained in a 14 h : 14 h light/dark cycle. The analysis focuses on $N=9$ genes, namely LHY, CCA1, TOC1, ELF4, ELF3, GI, PRR9, PRR5, and PRR3, which are known to be involved in circadian regulation [20, 21]. The Arabidopsis data sets used in the study are publicly available from http://www.bioss.ac.uk/associates/marco/supplement/.

##### 3.4. Muscle Development in *Drosophila melanogaster*

The gene expressions in *Drosophila melanogaster* cells were sampled during four different morphological stages of life: the embryonic, larval, pupal, and adult stages. Since these phases cover time periods of different lengths, gene expression profiles were collected at nonequidistant time points. The true morphological transitions occur at the time points of the embryonic-to-larval, larval-to-pupal, and pupal-to-adult stage changes [22]. Like other researchers [12], we focus our analysis on genes involved in muscle growth and muscle development: EVE, GFL, TWI, MLC1, SLS, MHC, PRM, ACTN, UP, MYO61F, and MSP300. The quantile-discretised binary data set of these 11 Drosophila genes is available from Robinson and Hartemink [12].
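As an illustration of what a quantile discretisation of an expression profile looks like, the following sketch binarises one profile at its median; the exact quantile scheme used in [12] may differ, and the values are invented.

```python
def quantile_binarise(series):
    """Binarise one gene's expression profile at its median: 1 above, 0 at or
    below (a common quantile-discretisation scheme, shown for illustration)."""
    s = sorted(series)
    n = len(s)
    median = s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])
    return [1 if x > median else 0 for x in series]

bits = quantile_binarise([0.2, 1.4, 0.9, 2.2, 0.1, 1.7])  # -> [0, 1, 0, 1, 0, 1]
```

Such a discretisation is what makes the multinomial BDe score applicable, at the cost of the information loss discussed in Section 2.1.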

#### 4. Simulations and Evaluation

We impose a fan-in restriction on the maximal number of parent nodes per node, as done in related articles [2, 4]. For the Gaussian BGe model, the hyperparameters of the normal-Wishart prior were chosen maximally uninformative subject to certain regularity constraints [14]. For the discrete multinomial BDe model, the hyperparameters were chosen as explained in Heckerman and Geiger [16], whereby the total prior precision parameter was set to 1.

The burn-in and the sampling phase of the MCMC runs were set to 500,000 iterations each, and we sampled every 1,000 iterations during the sampling phase. For each data set, we started 5 independent MCMC simulations from different initialisations and computed the potential scale reduction factor [23] based on the marginal edge posterior probabilities. As we observed a sufficient degree of convergence for all data sets, we report only the results of the empty-seeded (graphs without any edges) MCMC runs. We note that each single MCMC simulation (even for the bigger domains, Arabidopsis with 9 genes and Drosophila with 11 genes) was accomplished within a few hours using Matlab^{©} code on a SunFire X4100M2 machine with an AMD Opteron 2224 SE dual-core processor.

More generally, we note that the computational complexity in network structure space is of polynomial order $O(N^{\mathcal{F}+1})$, where $N$ is the number of nodes and $\mathcal{F}$ is the maximum number of parent nodes per node. The complexity in changepoint configuration space is of polynomial order $O(m^{K-1})$, where $K-1$ is the number of changepoints and $m$ is the number of time points. Hence, the problem is of polynomial complexity in $N$ and $m$, and *not* of exponential complexity (i.e., it is not NP-hard). Polynomial complexity—as opposed to exponential complexity—does not impose any principled restrictions on the network size. However, the practical feasibility will depend on various factors, like the efficiency of the software implementation and the capacity of parallel clusters. It also depends on the information content of the data, as increasing the number of nodes with a limited number of experimental replications will increase the intrinsic uncertainty of the inference. The practical decision on how many nodes and how large a network to consider will therefore usually be based on some preliminary data analysis.

We note that we allowed for self-feedback loops for the synthetic data only. For the three real applications to biological systems, we ruled out self-feedback loops altogether to enable direct comparability with the results reported for the BGM model [4].

If two independent data sets $\mathcal{D}_1$ and $\mathcal{D}_2$ are available for a network domain, predictive probabilities of the models BGM and cpBGM can be estimated straightforwardly [4]. For example, having sampled $(\mathcal{G}_s,V_s,K_s)$, $s=1,\ldots,T$, from the posterior distribution $P(\mathcal{G},V,K\mid\mathcal{D}_1)$ via the MCMC inference scheme described above, the predictive probability of $\mathcal{D}_2$ can be estimated by

$$P\big(\mathcal{D}_2\mid\mathcal{D}_1\big)\approx\frac{1}{T}\sum_{s=1}^{T}P\big(\mathcal{D}_2\mid\mathcal{G}_s,V_s,K_s,\mathcal{D}_1\big),$$

where

$$P\big(\mathcal{D}_2\mid\mathcal{G},V,K,\mathcal{D}_1\big)=\prod_{n=1}^{N}\prod_{k=1}^{K}\Psi\big(\mathcal{D}_{2,n}^{\pi_n}[k]\mid\mathcal{D}_1[k]\big).$$

That is, when computing the BGe score for the compartment $\mathcal{D}_2[k]$ with (6) and (7), the local prior distributions $P(\boldsymbol{\theta}_n^k\mid\mathcal{G})$ in (7) are replaced by posterior distributions $P(\boldsymbol{\theta}_n^k\mid\mathcal{D}_1[k])$. This results in a straightforward modification of the BGe score as follows. In (13) in Geiger and Heckerman [14], those training data that have been allocated to component $k$, symbolically $\mathcal{D}_1[k]$, are included in the conditioning part of the distribution, and the sufficient statistics are adjusted accordingly.
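The Monte-Carlo average over posterior samples can be sketched in a few lines; `log_lik_d2` below is a hypothetical stand-in for the modified BGe score described above, and the log-sum-exp trick is only a numerical-stability detail.

```python
import math

def log_predictive(samples, log_lik_d2):
    """Monte-Carlo estimate of log P(D2 | D1): average, over posterior samples
    (G, V, K) drawn given D1, of the predictive likelihood of D2, computed in
    log space via log-sum-exp. log_lik_d2 maps a sample to log P(D2 | sample, D1)."""
    logs = [log_lik_d2(s) for s in samples]
    mx = max(logs)
    return mx + math.log(sum(math.exp(v - mx) for v in logs) / len(logs))

# Hypothetical per-sample predictive log-likelihoods:
lp = log_predictive([1, 2, 3], log_lik_d2=lambda s: -10.0 * s)
```

Averaging the likelihoods (not the log-likelihoods) is what makes this a valid estimate of the predictive probability; the log-sum-exp form avoids underflow when the per-sample likelihoods are tiny.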

#### 5. Results

##### 5.1. Inference on Synthetic Data

For the synthetic sinusoid data, the true underlying network topologies are known, so that the network reconstruction accuracy can be assessed via the area under the ROC (receiver operating characteristic) curve (AUC); this is a standard criterion that has been applied in numerous related articles (e.g., [24]). Figure 1 shows histograms of the average AUC scores obtained from the marginal edge posterior probabilities for the sinusoid networks with two and four nodes. Figure 1 is laid out as a matrix, in which the rows and columns correspond to different $\sigma_X$ and $\sigma_Y$ coefficients (noise levels). We note that an increase of $\sigma_X$ reduces the autocorrelation of $X$, while increasing $\sigma_Y$ blurs the functional dependence of $Y$ on $X$; the autocorrelation of $Y$ is jointly influenced by both parameters. From the histograms it can be seen that the novel cpBGM model leads to a better network reconstruction accuracy in terms of average AUC values than the standard BGe model and the BGM model in the majority of scenarios. Further investigations showed that BGe and BGM yield lower AUC scores, since they tend to infer spurious self-feedback loops on node $Y$ (two-node network) or on the nodes $Y_1$, $Y_2$, and $Y_3$ (four-node network), respectively.
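AUC scores of this kind can be computed from the marginal edge posterior probabilities with the rank-sum (Mann-Whitney) identity: the AUC equals the probability that a randomly chosen true edge outscores a randomly chosen non-edge. The posterior values below are purely illustrative.

```python
def auc(scores, truth):
    """Area under the ROC curve via the rank-sum identity: the fraction of
    (true edge, non-edge) pairs where the true edge scores higher (ties 1/2)."""
    pos = [s for s, t in zip(scores, truth) if t]
    neg = [s for s, t in zip(scores, truth) if not t]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Illustrative edge posteriors for the four possible edges of a two-node DBN,
# in the order X->X, X->Y, Y->X, Y->Y; the true edges are X->X and X->Y.
a = auc([0.9, 0.7, 0.1, 0.4], [1, 1, 0, 0])
```

Because the AUC depends only on the ranking of the posterior probabilities, no threshold on the edge posteriors has to be chosen.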

This trend can be visualised by histograms of the average edge posterior probabilities. As an example, Figure 2 shows the average marginal edge posterior probabilities of the four possible edges for the two-node sinusoid network. Consistently, all three models under comparison assign the highest posterior probability to the true self-loop $X\to X$ and the lowest posterior probability to the false edge $Y\to X$. But BGe and BGM favour the spurious feedback loop $Y\to Y$ over the true edge $X\to Y$, while the proposed cpBGM model suppresses the false self-feedback loop and assigns a higher edge posterior probability to the true edge $X\to Y$. This shows that cpBGM yields a higher network reconstruction accuracy (see Figure 1), as it is less susceptible to inferring spurious self-feedback loops (see Figure 2).

##### 5.2. Inference on Macrophages Data

For the macrophages data, the BGM model inferred a biologically plausible state change in the host macrophage brought about by infection (CMV) or immune activation (IFN), and a less pronounced state change in the combined condition CMV+IFN [4]. We compare these findings with results obtained with the novel cpBGM model. The fraction of sampled states for which two time points $t_1$ and $t_2$ are allocated to the same component ($V(t_1)=V(t_2)$) can be used as a connectivity measure, and the resulting temporal connectivity structures are displayed graphically as heat maps in Figure 3. All six heat maps in Figure 3 reflect the two-stage nature of the gene-regulatory processes in the host macrophages: the first part and the last part of the three time series are allocated to different components. For all three conditions, a stronger separation between the two regulatory states is inferred by the cpBGM model (see Figures 3(d), 3(e), and 3(f)). It appears that the cpBGM inference results are more consistent, as even for the combined condition (CMV+IFN) a clear trend towards a dichotomous regulatory process can be found (see Figure 3(d)). This finding (stronger separation) is consistent with our conjecture that the novel cpBGM model is a priori more likely to assign neighbouring time points to the same compartment. Interestingly, the outlying time point inferred by BGM in Figure 3(b) corresponds to a certain trend towards a subdivision of the second compartment under the cpBGM model: instead of one outlying time point, two substages are inferred (see Figure 3(e)). To provide statistical evidence that the new cpBGM model does not overfit the data, we compute predictive probabilities for the cpBGM model and compare them with those reported for the BGM model [4].

To this end, we treat the three experiments as independent replications and check via predictive probabilities whether the superiority of the proposed cpBGM model can be confirmed statistically. Table 1 gives the predictive probabilities reported in the BGM paper [4] and those obtained with the proposed cpBGM model. As the predictive probabilities for cpBGM are systematically better than those of BGM, we conclude that the cpBGM model yields more stable inference results, that is, a better generalisation performance.

##### 5.3. Inference on *Arabidopsis thaliana* Data

For the *Arabidopsis thaliana* data, the BGM model also inferred a biologically plausible two-stage process [4]. In this application, the two stages are likely to be related to the diurnal nature of the dark-light cycle influencing the circadian genes. The plants were subjected to different prehistories, related to different lengths of the artificial, experimentally controlled light/dark cycle. The plants in the second experimental scenario were entrained in an increased day length of 14 hours light followed by 14 hours darkness, and the plants in the first experiment were entrained in a decreased day length of 10 hours light followed by 10 hours darkness. As an effect of these two entrainments, a phase shift in the gene-regulatory processes between the two experiments was expected [4]. The BGM model inferred a certain trend for a phase shift of the changepoint (subjective day to subjective night) of about 4–6 hours as a consequence of the increased day length. The heat maps in Figures 4(a) and 4(b) show that the connected blocks (compartments) of the time series are shifted along the diagonal by 2-3 time points (4–6 hours). The cpBGM model infers the same trend but with a stronger separation between these compartments (see Figures 4(c) and 4(d)). We note that the cpBGM model is based on changepoints, so that compartments once left cannot be revisited. That is, while the BGM model tends to allocate the first time points and the last time points of one experiment to one single component (light grey shading in the top right and bottom left areas of the corresponding heat map in Figure 4), the cpBGM model has to allocate the last time points to an additional third component, as the first compartment cannot be reused after the transition to the second compartment.

As for the macrophage data, predictive probabilities can be computed by treating the two experiments as independent replications. Table 2 gives the predictive probabilities reported for the BGM model [4] and those obtained with the new model. Consistent with the results for the macrophage data, both resulting predictive probabilities for the proposed changepoint model are better than those of the BGM model. A scatter plot of the inferred (marginal) posterior probabilities of the individual edges in the two experiments, plotted against each other, is given in Figure 5. As the individual edge posterior probabilities do not differ drastically, we extract a network structure from the averaged probabilities.

Figure 6 shows the gene interaction network that is predicted when keeping all edges with marginal posterior probability above 0.5. There are two groups of genes. Empty circles in the figure represent morning genes (i.e., genes whose expression peaks in the morning), shaded circles represent evening genes (i.e., genes whose expression peaks in the evening). There are several directed edges pointing from the group of morning genes to the evening genes such that each evening gene is regulated by at least one morning gene. Moreover, the two genes LHY and CCA1 seem to play a central role. This result is consistent with biological findings, where the morning genes were found to activate the evening genes, with LHY and CCA1 being central regulators [25]. Our reconstructed network also contains edges pointing in the opposite direction, from the evening genes back to the morning genes. This finding is consistent with biological observations [25], where the evening genes were found to inhibit the morning genes via negative feedback loops. In the reconstructed network, there are 9 edges (drawn in black) originating from the four morning genes, while only 7 edges (drawn in grey) originate from the group of five evening genes. Biologically, this means that the activity of the morning genes is stronger than the activity of the group of evening genes and that the regulatory mechanisms are dominated by the morning genes in the network topology. This finding is consistent with the fact that following the light-dark cycle entrainment, the experiments were carried out in constant-light condition, resulting in a higher overall activity of the morning genes. Within the group of evening genes, the reconstructed network contains an edge between GI and TOC1. This interaction has been confirmed independently [26]. 
Hence, while a proper evaluation of the reconstruction accuracy is currently unfeasible—like many related studies, we lack a gold-standard owing to the unknown nature of the true interaction network—our study suggests that the essential features of the reconstructed network are biologically plausible and consistent with the literature.
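The edge extraction step described above (averaging edge posteriors across the two experiments and keeping edges above 0.5) can be sketched as follows. The posterior values below are purely hypothetical placeholders for illustration, not the probabilities inferred in this study:

```python
# hypothetical per-edge marginal posterior probabilities from the two experiments
post_a = {("LHY", "TOC1"): 0.8, ("GI", "TOC1"): 0.6, ("TOC1", "LHY"): 0.3}
post_b = {("LHY", "TOC1"): 0.7, ("GI", "TOC1"): 0.55, ("TOC1", "LHY"): 0.45}

# average the two posteriors edge-wise, then threshold at 0.5
avg = {e: (post_a[e] + post_b[e]) / 2 for e in post_a}
network = sorted(e for e, p in avg.items() if p > 0.5)
print(network)
```

With these placeholder numbers, only the two edges pointing into TOC1 survive the 0.5 threshold.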

##### 5.4. Inference on *Drosophila melanogaster* Data

For an independent comparison of the free allocation model and the changepoint model, we carried out an analysis similar to Robinson and Hartemink [12] on the binary Drosophila muscle development gene expression time series. This time series can be analysed with the discrete counterparts (BDM and its changepoint variant) of our Gaussian mixture models; see Section 2.3 for details. The graphical heat map representations in Figure 7 show that the BDM model does not infer the morphological stages of *Drosophila melanogaster*: almost all time points are strongly connected (white shading), and no separated blocks of connected time points have been inferred. Only a few time points are allocated (as outliers) to other mixture components (black vertical and horizontal lines).

The novel changepoint model infers a clear block structure with different separated compartments. The connectivity structure corresponds well to the first two morphological transitions, (i) embryonic to larval and (ii) larval to pupal, whereby the separation between the embryonic and the larval stage is less pronounced (grey shading) than the separation between the larval and the pupal stage (black shading). The exact third morphological transition, pupal to adult, has not been inferred, but it can be seen that two changepoints occur during the pupal stage before the third transition to the adult stage. Figure 8 shows a graphical presentation of the changepoint location posterior probabilities inferred with the changepoint model, and the same trends become obvious: the first two stage transitions have been inferred correctly, while two further changepoints occur before the morphological stage transition from pupal to adult. We note that Robinson and Hartemink found the same trends with their model and conclude that the third, premature transition in the gene regulatory process can be explained biologically, since the gene expression program governing the transition from pupal to adult morphology should be active well before the time of the real morphological transition [12].

#### 6. Discussion

Our empirical results have shown that the proposed model performs consistently better than the BGM model. A possible explanation could be related to the latent space complexity of the models. For two components, the changepoint model only has to choose among the possible changepoint locations, which grow linearly with the length of the time series, whereas the free allocation model has to choose among exponentially many allocations of the time points to the components. This difference in latent space complexity is aggravated for more components. The difference in performance between the changepoint model and BGM could therefore be a consequence of different degrees of convergence of the MCMC simulations. However, our convergence diagnostics based on potential scale reduction factors [23] did not indicate any significant difference in convergence. It therefore appears that the higher latent space complexity of the BGM model does not adversely affect the MCMC convergence for gene expression time series of the length investigated in our study. This suggests that another explanation for the better performance of the changepoint model over BGM has to be found.
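The gap in latent space complexity can be made concrete with a rough count. The sketch below ignores label symmetries and the constraint that no component be left empty, so the numbers are only indicative of the scaling, not exact state-space sizes of either model:

```python
from math import comb

def n_changepoint_configs(m, K):
    # K-1 changepoints placed in the m-1 gaps between consecutive time points
    return comb(m - 1, K - 1)

def n_free_allocations(m, K):
    # each of the m time points may be assigned to any of the K components
    return K ** m

# series lengths from this study: Arabidopsis, macrophages, Drosophila
for m in (13, 25, 67):
    print(m, n_changepoint_configs(m, 2), n_free_allocations(m, 2))
```

For two components, the changepoint space grows linearly in the number of time points, while the free allocation space grows exponentially.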

We will argue that the performance difference is most likely a consequence of the different prior probabilities intrinsic to the models, which determine the factor in (10). Since both models employ the same truncated Poisson distribution as prior for the number of mixture components, the difference between the two models is imposed by the prior distribution of the allocation vector. While the BGM model is based on a free allocation, the changepoint model takes the time structure into account and employs a changepoint process. In this section we describe the differences between the two models in detail by means of three theoretical examinations, and these examinations reveal trends that appear to be immediately related to some of our empirical findings.

In the first study, we compare the (temporal) connectivity structures that are introduced by the allocation vector prior. To this end, we infer the prior distribution of both models via MCMC simulations that are purely prior-driven. That is, we employ an empty data set (without any data points), so that all acceptance probabilities depend on the prior probability ratios only, as the likelihood is then the same for all combinations of network structure, allocation vector, and number of components. Note that sampling network structures via edge operation moves on the graph becomes obsolete, because (i) the graphs do not have any effect on the likelihoods for an empty data set, and (ii) the graph priors cancel out in the prior probability ratio if a uniform graph prior is used. We fix the number of time points; note that in this theoretical consideration based on empty data, this number only determines the length of the allocation vector.

After running 10 independent MCMC simulations to infer the prior distribution of the allocation vector, we can compute the average prior connectivity strengths from the sampled allocation vectors. As before, the fraction of sampled allocation vectors for which two time points are allocated to the same component can be used as a connectivity measure. Figure 9 shows heat maps of the inferred prior connectivity structure for the free allocation BGM model and the proposed changepoint process model. The heat maps confirm our earlier conjecture that the proposed model, which takes the time structure of the data into account, is a priori more likely than the BGM model to allocate neighbouring time points to the same compartment. More precisely, it can be seen from Figure 9(a) that the prior connectivity strength is the same for all pairs of distinct time points in the BGM model. In the changepoint model (Figure 9(b)), by contrast, the connectivity strength decreases with the temporal distance between the two time points. This finding explains why the proposed model yields a stronger separation between the light : darkness induced stages in *Arabidopsis thaliana* (see Figure 4): the two-stage structure of gene regulation in Arabidopsis is of a temporal form that is supported by the allocation vector prior of the changepoint model.
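The qualitative contrast between the two prior connectivity structures can be reproduced with a toy simulation. For simplicity we sample directly from two simple stand-in priors (independent assignments for the free allocation, a uniform changepoint for the changepoint process) rather than running the prior-driven MCMC described above; the exact priors of BGM and the proposed model differ in detail:

```python
import random

def connectivity(samples, m):
    # fraction of sampled allocation vectors in which time points i and j
    # are assigned to the same component
    C = [[0.0] * m for _ in range(m)]
    for V in samples:
        for i in range(m):
            for j in range(m):
                if V[i] == V[j]:
                    C[i][j] += 1.0 / len(samples)
    return C

m, n_samples = 10, 5000
random.seed(0)

# free-allocation stand-in (K = 2): each point assigned independently
free = [[random.randint(0, 1) for _ in range(m)] for _ in range(n_samples)]

# changepoint stand-in (K = 2): uniform changepoint; points before it form segment 0
cp = []
for _ in range(n_samples):
    tau = random.randint(1, m - 1)
    cp.append([0 if t < tau else 1 for t in range(m)])

C_free, C_cp = connectivity(free, m), connectivity(cp, m)
# under the changepoint prior, connectivity decays with temporal distance;
# under the free allocation it is flat for all distinct pairs
print(round(C_free[0][1], 2), round(C_cp[0][1], 2), round(C_cp[0][m - 1], 2))
```

Neighbouring time points are almost always co-allocated under the changepoint prior, while the first and last points never are; under the free allocation, every distinct pair is co-allocated with the same probability.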

In the second and third theoretical studies, we cross-compare the BGM model and the proposed changepoint model in terms of the prior probability ratios between the heterogeneous (nonstationary) and the homogeneous (stationary) state. That is, both models can be in the homogeneous state (the complete time series is modelled as one single compartment) or in a heterogeneous state (the time series is divided into different compartments). For the BGM as well as the changepoint model, the prior probability ratio between (i) the heterogeneous state consisting of two segments and (ii) the homogeneous state reduces to the prior probability of the allocation vector, weighted by the prior ratio of the component numbers, since the allocation vector is unique in the homogeneous state for both models. We consider the scenario where the allocation vector divides the time series into two (nonempty) connected segments. The prior probability of this allocation vector under the BGM model then depends only on the numbers of time points allocated to the two segments; see [4] for details. The proposed changepoint model requires a changepoint between the two segments, and its prior probability can be derived straightforwardly from (12). In the second theoretical study, we vary the length of the time series and consider a heterogeneous time series consisting of two segments of equal length; for the changepoint model, this means that the changepoint has to be located in the middle of the time series. Figures 10(a) and 10(b) show the resulting (logarithmic) prior probability ratios in dependence on the length of the time series. It can be seen that the prior ratio for the BGM model is considerably lower than for the changepoint model. Moreover, the logarithmic plot in Figure 10(b) shows that the prior ratio of the BGM model decreases much more strongly with the sample size than that of the changepoint model. This suggests that the BGM model imposes a more severe penalty for complexity (nonstationarity), and that this penalty increases with the sample size. 
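The two trends can be illustrated with a toy calculation. The free-allocation prior below is a hypothetical stand-in (a symmetric Dirichlet with unit concentration, marginalised out), not the exact expression from [4]; the changepoint prior is uniform over the possible changepoint locations:

```python
from math import lgamma, log

def log_prior_free(n1, n2):
    # log P(V | K=2) under a free allocation with a symmetric Dirichlet(1, 1)
    # marginalised out: P(V) = n1! * n2! / (m + 1)!  (illustrative stand-in)
    m = n1 + n2
    return lgamma(n1 + 1) + lgamma(n2 + 1) - lgamma(m + 2)

def log_prior_cp(m):
    # log P(V | K=2) under a uniform changepoint among the m - 1 gaps
    return -log(m - 1)

# trend 1: symmetric split, growing series length
for m in (12, 24, 48):
    print(m, round(log_prior_free(m // 2, m // 2), 1), round(log_prior_cp(m), 1))

# trend 2: fixed length, varying breakpoint; the symmetric split is
# penalised most heavily by the free-allocation prior
m = 20
print(min(range(1, m), key=lambda n1: log_prior_free(n1, m - n1)))
```

Under these stand-in priors the free-allocation log prior of the equal-length segmentation is far below the changepoint value and falls much faster with the series length, and it is lowest exactly for the symmetric split, mirroring Figures 10 and 11 qualitatively.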
This tendency may explain the finding in [4] for the macrophage gene expression time series, which we have reproduced in the present study (Figure 3(c)): the BGM model does not infer a clear two-phase nature of the time series under simultaneous immune activation (with IFN) and viral infection (with CMV). A possible biological explanation was offered in [4]. However, the novel changepoint model does not support the hypothesis of a decreased probability for the two-phase nature (Figure 3(f)). Moreover, the preceding analysis has revealed that a strong penalty against the two-phase process is inherent in the BGM model. This suggests that the results reported in [4], which we have reproduced in our study, might be an artefact of the BGM model rather than of genuine biological nature.

In the third theoretical study, we fix the length of the time series and vary the last time point of the first segment in the heterogeneous state, to illustrate the effect of unequal segment lengths. For the BGM model, this scenario varies the numbers of time points allocated to the two segments; for the changepoint model, it varies the interval in which the changepoint has to be located. Figures 10(c) and 10(d) show the resulting prior probability ratios in dependence on the changepoint location. The figure reveals contrary trends for the BGM and the changepoint model. Figure 10(c) shows that the prior ratio of the changepoint model peaks in the middle, that is, for equally long time series segments, and decreases monotonically with increasing asymmetry of the segment lengths. The BGM model (Figure 10(d)) exhibits the converse behaviour: the more disproportionate the segment lengths, the higher the prior ratio. This behaviour becomes even more obvious when varying the time series length and the breakpoint location simultaneously. Figure 11 shows the logarithmic prior probabilities in dependence on both the time series length and the segment length proportions. For the BGM model (Figure 11(a)), the penalty term increases drastically with the sample size if both segments are of a certain length: the higher the sample size, the stronger the penalty for symmetric segment lengths. Only if the breakpoint location is either very low or very high, that is, if the segment lengths are strongly disproportionate, does the size of the prior probability ratio not change drastically with the sample size. Figure 11(b) shows that the prior probability ratio of the proposed changepoint model is less sensitive to both the sample size and the segment length proportions. We can conclude that it is the equability of the prior ratio of the proposed changepoint model that renders it superior when modelling nonstationary behaviour in long time series. 
That is, the penalty for dividing a time series into segments does not change drastically with the length of the time series, and symmetric segment lengths are supported by the prior distribution. On the contrary, the penalty term of the BGM model for dividing a time series into segments increases substantially with the length of the time series: the longer the time series, the lower the prior probability. Only if the segment lengths are strongly asymmetric, such that one segment is very long and the other very short, is the prior probability of comparable size for different time series lengths. This tendency provides a possible explanation for the failure of the BGM model on the *Drosophila melanogaster* gene expression time series. The heat map in Figure 7 shows that the proposed changepoint model divides the time series into segments that are consistent with the morphogenesis of *Drosophila melanogaster*. The BGM model also tends to detect the correct segment boundaries. However, it then erroneously infers short segments consisting of only a few time points around these boundaries. This pattern is *not* consistent with the morphogenetic findings. The segmentation inferred with the BGM model thus suffers from artefacts that are an immediate consequence of what has been discussed above, namely that the BGM model penalises symmetric partitions much more strongly than asymmetric partitions and thereby encourages the formation of segments that consist of a single or only a few time points. Note that the BGM model becomes more susceptible to these artefacts as the length of the time series increases, as demonstrated in Figure 11(a). This explains why the difference between the BGM and the changepoint model is less pronounced for the Arabidopsis and macrophage expression time series (Figures 3 and 4), which are considerably shorter (Arabidopsis: 13 time points, macrophages: 25 time points, Drosophila: 67 time points). 
The susceptibility of the BGM model to short-segment artefacts is not completely avoided here either, though, as can be seen from the heat map in Figure 3(b): while the changepoint model infers three segments for the macrophage data under this condition (see Figure 3(e)), BGM tends to allocate a single data point to a separate segment.

On the synthetic sinusoid data, the proposed model yields a higher network reconstruction accuracy than the BGM model, as the latter is more susceptible to inferring spurious self-feedback loops. This tendency can be explained from the preceding mathematical analysis. Both the BGM and the changepoint model effectively approximate the nonlinear regulation function by a piecewise linear function. A good approximation of the sine wave requires three segments of approximately the same length, corresponding to the ascending, stationary, and descending phases. As opposed to the changepoint model, the prior inherent in the BGM model heavily penalises this equal-length segmentation (Figures 10 and 11); see the previous discussion. Now, from (16) it becomes clear that the data are strongly autocorrelated: the values of the regulated gene tend to exhibit a strong autocorrelation by virtue of the autocorrelation of the regulator values and their influence on the regulated gene. Given that the prior implicit in the BGM model impedes the proper piecewise linear model with equal-length segments, the BGM model tends to infer the second-best explanation of the data: explaining the observations of the regulated gene by directly modelling the autocorrelation between these observations themselves. This corresponds to (spurious) self-feedback loops.
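The piecewise linear argument can be checked with a toy least-squares fit, unrelated to the actual network models: a single line fits a sampled sine wave poorly, while three equal-length linear segments reduce the residual substantially.

```python
import math

def sse_linear(xs, ys):
    # sum of squared errors of the least-squares line through (xs, ys)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    a = my - b * mx
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# one full period of a sine, sampled at 60 points
ts = [2 * math.pi * t / 60 for t in range(60)]
ys = [math.sin(t) for t in ts]

whole = sse_linear(ts, ys)  # one global line
# three equal-length segments fitted separately
parts = sum(sse_linear(ts[i:i + 20], ys[i:i + 20]) for i in (0, 20, 40))
print(round(whole, 2), round(parts, 2))
```

The three-segment fit leaves a far smaller residual, which is exactly the equal-length segmentation that the free allocation prior penalises.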

We note that the novel changepoint model has been particularly designed for dynamic data with a temporal structure. The BGM model is not restricted to such data and can equally be applied to both static (steady-state) and dynamic (time series) data. However, the greater flexibility of the BGM model, and the intrinsic implications for the effective prior probability on segment lengths and numbers, render its application to time series data suboptimal. This suggests that the proposed changepoint model offers a useful new tool for the analysis of dynamic processes.

#### 7. Conclusions

Two paradigms for relaxing the nonhomogeneity/nonlinearity restriction of dynamic Bayesian networks have been proposed in the literature: the changepoint process and the free allocation mixture model. The latter provides the proper approximation of a nonlinear regulation process by a piecewise linear process. The former provides a similar approximation, but under the assumption that the temporal processes are sufficiently smooth, as the assignment of observations to mixture components is done in the temporal domain rather than in the domain of the regulatory variables. Inference in the free allocation model obviously has a considerably higher computational complexity than in the changepoint process. However, we have additionally discussed several principled shortcomings that are intrinsic to the methodology per se. We have proposed a new variation of the BGM model [4], which has turned out to be more suitable for the reconstruction of regulatory networks from nonstationary gene expression time series. Like the BGM model, the new model is based on a mixture model dividing the data points into compartments. The network structure is kept fixed among the time series segments, and each segment is modelled separately and independently with the Gaussian BGe model for Bayesian networks. The methodological difference is that the new model employs a changepoint process to divide the time series into segments instead of a free, unrestricted allocation of data points. The practical inference follows the Bayesian paradigm and samples the network, the number of changepoints, and the changepoint locations from the posterior distribution with Markov chain Monte Carlo (MCMC).

In a first step, the inference problem was based on synthetic data from small network domains possessing self-feedback loops. Our empirical results show that the proposed model suppresses spurious self-feedback loops and yields a higher network reconstruction accuracy than the standard BGe model or the BGM model. We also cross-compared the performance of the models on three real biological systems. On gene expression time series related to (i) viral challenge of macrophages and (ii) circadian rhythms in Arabidopsis, heat maps of the connectivity scores revealed that the new model infers the biologically expected two-stage structures ((i) the dichotomy between the healthy and the diseased state of the cell and (ii) the diurnal contrast between light and darkness) more clearly than the BGM model. To assess the statistical significance of the improvement, we focused on predictive probabilities, and the proposed model yields consistently higher scores than the BGM model. We extracted a gene regulatory network for the circadian clock-regulated genes in the *Arabidopsis thaliana* domain from the inference results, and the reconstructed network shows features that are consistent with the biological literature.

Furthermore, for an independent comparison, we combined the free allocation model and the changepoint process model with the discrete multinomial BDe scoring metric for Bayesian networks. Empirical results on binary gene expression time series related to muscle development in *Drosophila melanogaster* were consistent with the results from the first study on continuous data. The discrete changepoint process model infers a time series segmentation that is more consistent with the morphogenetic stages of *Drosophila melanogaster* than the free allocation model (BDM).

We note that, from a biological point of view, the ideal approach for these three applications would be a supervised one, for example as described in Werhli and Husmeier [27], exploiting the biological knowledge we have about the experimental conditions (e.g., the morphogenetic stages of Drosophila) to segment the data before inference. However, we elected to use these data as a test case for evaluating the efficiency of unsupervised learning for the proposed changepoint-process DBN model.

Finally, we cross-compared both models from a theoretical point of view. The theoretical study was based on a comparison of (i) the a priori imposed temporal connectivity structures and (ii) the prior probability ratios between the heterogeneous and the homogeneous states of the models. These results are consistent with our empirical findings, and lead to a deeper insight into the intrinsic difference between the two models.

#### Acknowledgments

M. Grzegorczyk is supported by the Graduate School “Statistische Modellbildung” of the Department of Statistics, TU Dortmund University. D. Husmeier is supported by the Scottish Government Rural and Environment Research and Analysis Directorate (RERAD).

#### References

- C. J. Needham, J. R. Bradford, A. J. Bulpitt, and D. R. Westhead, “A primer on learning in Bayesian networks for computational biology,” *PLoS Computational Biology*, vol. 3, no. 8, article e129, 2007.
- N. Friedman and D. Koller, “Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks,” *Machine Learning*, vol. 50, no. 1-2, pp. 95–125, 2003.
- M. Grzegorczyk and D. Husmeier, “Improving the structure MCMC sampler for Bayesian networks by introducing a new edge reversal move,” *Machine Learning*, vol. 71, no. 2-3, pp. 265–305, 2008.
- M. Grzegorczyk, D. Husmeier, K. D. Edwards, P. Ghazal, and A. J. Millar, “Modelling non-stationary gene regulatory processes with a non-homogeneous Bayesian network and the allocation sampler,” *Bioinformatics*, vol. 24, no. 18, pp. 2071–2078, 2008.
- A. Nobile and A. T. Fearnside, “Bayesian finite mixtures with an unknown number of components: the allocation sampler,” *Statistics and Computing*, vol. 17, no. 2, pp. 147–162, 2007.
- M. A. Suchard, R. E. Weiss, K. S. Dorman, and J. S. Sinsheimer, “Inferring spatial phylogenetic variation along nucleotide sequences: a multiple changepoint model,” *Journal of the American Statistical Association*, vol. 98, no. 462, pp. 427–437, 2003.
- M. Grzegorczyk and D. Husmeier, “Modelling non-stationary gene regulatory processes with a non-homogeneous dynamic Bayesian network and the change point process,” in *Proceedings of the 6th International Workshop on Computational Systems Biology*, T. Manninen, C. Wiuf, H. Lähdesmäki et al., Eds., vol. 48 of *TICSP Series*, pp. 51–54, Tampere, Finland, 2009.
- S. Imoto, S. Kim, T. Goto et al., “Bayesian network and nonparametric heteroscedastic regression for nonlinear modeling of genetic network,” *Journal of Bioinformatics and Computational Biology*, vol. 1, no. 2, pp. 231–252, 2003.
- Y. Ko, C. Zhai, and S. L. Rodriguez-Zas, “Inference of gene pathways using Gaussian mixture models,” in *Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM '07)*, pp. 362–367, Fremont, Calif, USA, November 2007.
- Y. Ko, C. X. Zhai, and S. Rodriguez-Zas, “Inference of gene pathways using mixture Bayesian networks,” *BMC Systems Biology*, vol. 3, article 54, 2009.
- G. Schwarz, “Estimating the dimension of a model,” *Annals of Statistics*, vol. 6, pp. 461–464, 1978.
- J. W. Robinson and A. J. Hartemink, “Non-stationary dynamic Bayesian networks,” in *Advances in Neural Information Processing Systems 21*, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, Eds., pp. 1369–1376, 2009.
- S. Lèbre, *Analyse de processus stochastiques pour la génomique : étude du modèle MTD et inférence de réseaux bayésiens dynamiques*, Ph.D. thesis, Université d'Evry-Val-d'Essonne, 2008.
- D. Geiger and D. Heckerman, “Learning Gaussian networks,” in *Proceedings of the 10th Annual Conference on Uncertainty in Artificial Intelligence (UAI '94)*, R. López de Mántaras and D. Poole, Eds., Morgan Kaufmann, Seattle, Wash, USA, July 1994.
- G. F. Cooper and E. Herskovits, “A Bayesian method for the induction of probabilistic networks from data,” *Machine Learning*, vol. 9, no. 4, pp. 309–347, 1992.
- D. Heckerman and D. Geiger, “Learning Bayesian networks: a unification for discrete and Gaussian domains,” in *Proceedings of the 11th Annual Conference on Uncertainty in Artificial Intelligence (UAI '95)*, P. Besnard and S. Hanks, Eds., Morgan Kaufmann, Montreal, Canada, August 1995.
- P. J. Green, “Reversible jump Markov chain Monte Carlo computation and Bayesian model determination,” *Biometrika*, vol. 82, pp. 711–732, 1995.
- D. Madigan and J. York, “Bayesian graphical models for discrete data,” *International Statistical Review*, vol. 63, pp. 215–232, 1995.
- K. Honda, A. Takaoka, and T. Taniguchi, “Type I interferon gene induction by the interferon regulatory factor family of transcription factors,” *Immunity*, vol. 25, no. 3, pp. 349–360, 2006.
- P. A. Salomé and C. R. McClung, “The Arabidopsis thaliana clock,” *Journal of Biological Rhythms*, vol. 19, no. 5, pp. 425–435, 2004.
- P. Más, “Circadian clock function in Arabidopsis thaliana: time beyond transcription,” *Trends in Cell Biology*, vol. 18, no. 6, pp. 273–281, 2008.
- M. N. Arbeitman, E. E. M. Furlong, F. Imam et al., “Gene expression during the life cycle of Drosophila melanogaster,” *Science*, vol. 297, no. 5590, pp. 2270–2275, 2002.
- A. Gelman and D. B. Rubin, “Inference from iterative simulation using multiple sequences,” *Statistical Science*, vol. 7, pp. 457–472, 1992.
- D. Husmeier, “Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks,” *Bioinformatics*, vol. 19, no. 17, pp. 2271–2282, 2003.
- C. R. McClung, “Plant circadian rhythms,” *Plant Cell*, vol. 18, no. 4, pp. 792–803, 2006.
- J. C. Locke, M. M. Southern, L. Kozma-Bognár et al., “Extension of a genetic network model by iterative experimentation and mathematical analysis,” *Molecular Systems Biology*, vol. 1, article 13, 2005.
- A. V. Werhli and D. Husmeier, “Gene regulatory network reconstruction by Bayesian integration of prior knowledge and/or different experimental conditions,” *Journal of Bioinformatics and Computational Biology*, vol. 6, no. 3, pp. 543–572, 2008.