Inference of Gene Regulatory Networks Using Bayesian Nonparametric Regression and Topology Information

Fan, Yue; Wang, Xiao; Peng, Qinke

doi:https://doi.org/10.1155/2017/8307530

Computational and Mathematical Methods in Medicine

Research Article | Open Access

Volume 2017 | Article ID 8307530 | https://doi.org/10.1155/2017/8307530

Yue Fan,¹ Xiao Wang,¹ and Qinke Peng¹

Academic Editor: Konstantin Blyuss

Received 18 Aug 2016

Accepted 24 Nov 2016

Published 04 Jan 2017

Abstract

Gene regulatory networks (GRNs) play an important role in cellular systems and are important for understanding biological processes. Many algorithms have been developed to infer GRNs. However, most algorithms consider only gene expression data and ignore topology information during inference, even though incorporating this information can partially compensate for the lack of reliable expression data. Here we develop a Bayesian group lasso with spike and slab priors to perform gene selection and estimation for nonparametric models. B-spline basis functions are used to capture nonlinear relationships flexibly, and penalties are used to avoid overfitting. Further, we incorporate the topology information into the Bayesian method as a prior. We apply our method to the DREAM3 and DREAM4 datasets and to two real biological datasets. The results show that our method performs better than existing methods and that the topology information prior can improve the result.

1. Introduction

Gene regulatory networks play an important role in diverse cellular functions. A reliable method to identify the structure and dynamics of such regulation is important for understanding complex biological processes and is helpful for the treatment of diseases. With the development of high-throughput technologies in recent years, gene expression data has provided a useful way to investigate the cellular system.

Generally, two types of gene expression data are used to predict the structure of GRNs: steady-state data and time-series data. Steady-state data measures the steady-state expression levels in different samples, while time-series data measures the expression levels at several successive time points. Since time-series data contains the dynamic information of the network while steady-state data does not [1], we focus on time-series data in this paper.

Over the last several years, a number of network inference methods have been developed to tackle this problem, including Bayesian networks [2, 3], dynamic Bayesian networks [4, 5], Boolean networks [6, 7], ordinary differential equations [8, 9], and mutual information [10, 11]. Comprehensive reviews can be found in [12, 13]. Among these methods, dynamic Bayesian networks have become the major focus for inferring gene regulatory networks because they can infer causal interactions, can model cyclic interactions, and have lower computational complexity than ordinary differential equations.

Inferring a GRN from time-series data is known to be challenging, partly because the number of genes is high relative to the number of data points. More importantly, the interactions between genes are typically nonlinear; thus a linear model may fail to recognize them. A flexible way to solve this problem is to use B-spline functions to describe the nonlinear interactions, and B-spline functions have been used to infer GRNs in previous studies [14, 15]. A key problem in spline regression is knot selection, which greatly influences the curve fitting. Reference [14] suggested using penalized splines to avoid overfitting and reduce the number of parameters to be estimated. Among the many penalized methods, the lasso [16] is the most popular because it selects and estimates simultaneously and can produce estimates that are exactly 0. The group lasso [17] was developed to select grouped variables. Reference [18] proposed the group lasso or Bayesian group lasso when spline regression is used, because the predictors belonging to the same gene form a natural group. Reference [19] also developed a Bayesian adaptive group lasso to perform simultaneous model selection and estimation for B-spline regression. However, Bayesian spline regression methods still predict a lot of false positive interactions because of the indirect effects existing in GRNs.

Recently, [20] proposed a new method that uses network topology information to improve gene regulatory network inference. They used as a prior the observation that both prokaryotic and eukaryotic transcription networks exhibit an approximately scale-free out-degree distribution, while the in-degree distribution is a more restricted exponential function; this structural property is described in [21]. Reference [20] also pointed out that 79% or more of genes have fewer than 3 regulators. This property means that most genes in a GRN are regulated by only a few regulators, and it may be possible to combine it with B-spline regression to improve the results of GRN inference.

In this paper, we work with a dynamic Bayesian network and use spline regression to detect the nonlinear interactions between genes. A Bayesian group lasso is used to avoid overfitting and reduce the number of parameters to be estimated. Compared with the group lasso, the Bayesian group lasso is a better choice because Bayesian selection methods have two major advantages: (1) the tuning parameter can be set flexibly, and (2) the topology information can be incorporated easily. Further, instead of a traditional Bayesian group lasso, we use a Bayesian group lasso model with spike and slab priors, since this problem only requires sparsity at the group level and spike and slab priors can exclude or include an entire group of B-spline basis functions. Finally, we incorporate the topology information as a prior in the Bayesian approach, which controls the size of the selected model. The method is assessed on the DREAM3 and DREAM4 datasets and on two real biological datasets.

2. Method

2.1. The Nonlinear Regression Model for GRN Inference

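Consider an $n \times p$ matrix $X$, where $n$ is the number of time points at which the gene expression levels are measured and $p$ is the number of genes. A DBN model represents probabilistic relationships between genes via a directed acyclic graph $\mathcal{G} = (V, E)$. In this graph, genes are represented by a set of nodes $V$ and the interactions between genes are represented by a set of directed edges $E$. A directed edge from node $j$ to node $i$ means that gene $j$ is a regulator of gene $i$. The probability distribution of the genes given their parents can be expressed as

$$P(X) = \prod_{t=2}^{n} \prod_{i=1}^{p} P\big(x_i(t) \mid \mathrm{pa}_i(t)\big),$$

where $x_i(t)$ is the expression level of gene $i$ at time $t$ and $\mathrm{pa}_i(t)$ is the set of all the parent nodes of gene $i$ at time $t$. In the case of the regression-based DBN, the conditional distribution can be written as $P\big(x_i(t) \mid x_{-i}(t-1)\big)$, where $x_i$ is the expression level of gene $i$ and $x_{-i}$ is the expression vector without $x_i$:

$$x_i(t) = f_{i,t}\big(x_{-i}(t-1)\big) + \varepsilon_i(t).$$

We assume that the GRN is a time-invariant network; thus $f_{i,t} = f_i$ and the error term $\varepsilon_i(t) \sim N(0, \sigma^2)$. Although $f_i$ can be characterized by any nonlinear functional representation, [15] suggested using B-spline basis functions instead of Fourier bases, wavelets, or other nonlinear basis functions because the pattern of the relationship between genes is unknown. Therefore, we also use B-spline basis functions in this article, and the regulatory relationships can be written as

$$x_i(t) = \mu_i + \sum_{j \neq i} f_{ij}\big(x_j(t-1)\big) + \varepsilon_i(t),$$

where $\mu_i$ is the intercept and $f_{ij}(x) = \sum_{k=1}^{m} \beta_{ijk} B_{jk}(x)$. Here $B_{jk}$, $k = 1, \ldots, m$, are B-spline basis functions of degree $q$ and $\beta_{ijk}$ are the parameters to estimate from data. Let $\xi_1 < \cdots < \xi_K$ be the set of equally spaced knots, which together with the degree $q$ determines the number $m$ of basis functions per regulator. We drop the subscript $i$ from the variables for simplicity of notation. Then the regression equation can be written as

$$Y = \mu \mathbf{1} + \sum_{g=1}^{G} X_g \beta_g + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I),$$

where $Y$ stacks the responses $x_i(t)$, $G = p - 1$ is the number of candidate regulators, $X_g$ is the basis matrix of regulator $g$ with one row per response and $m$ columns, and $\beta_g$ is the corresponding coefficient vector.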
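As an illustration of the basis expansion described above, the following is a minimal sketch of how one regulator's expression values could be turned into a B-spline basis matrix using the Cox-de Boor recursion. The function name `bspline_basis`, the clamped boundary knots, and the resulting column count (interior knots + degree + 1) are assumptions made for this example; the article does not state how its basis matrices were built.

```python
import numpy as np

def bspline_basis(x, n_interior=10, degree=3):
    """Evaluate a clamped B-spline basis with equally spaced interior knots.

    Returns an array of shape (len(x), n_interior + degree + 1).
    Assumes x is not constant.
    """
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    interior = np.linspace(lo, hi, n_interior + 2)[1:-1]
    # Repeat each boundary knot (degree + 1) times so the basis is clamped.
    knots = np.concatenate([[lo] * (degree + 1), interior, [hi] * (degree + 1)])

    # Degree-0 basis: indicator of each half-open knot span.
    B = np.zeros((len(x), len(knots) - 1))
    for k in range(len(knots) - 1):
        B[:, k] = (knots[k] <= x) & (x < knots[k + 1])
    # The right end point belongs to the last non-empty span.
    B[x == hi, degree + n_interior] = 1.0

    # Cox-de Boor recursion, raising the degree one step at a time.
    for d in range(1, degree + 1):
        B_new = np.zeros((len(x), len(knots) - d - 1))
        for k in range(len(knots) - d - 1):
            left_den = knots[k + d] - knots[k]
            right_den = knots[k + d + 1] - knots[k + 1]
            if left_den > 0:
                B_new[:, k] += (x - knots[k]) / left_den * B[:, k]
            if right_den > 0:
                B_new[:, k] += (knots[k + d + 1] - x) / right_den * B[:, k + 1]
        B = B_new
    return B

x = np.linspace(0.0, 1.0, 50)
B = bspline_basis(x)          # cubic basis with 10 interior knots
print(B.shape)                # (50, 14)
```

Each row of the returned matrix sums to 1, so in a model with an intercept one column is redundant; whether the authors dropped a column or relied on the penalty to handle this is not stated in the article.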

2.2. Incorporating the Topology Information and Bayesian Inference

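We use the Bayesian group lasso method proposed in [22]. With the intercept absorbed by centering $Y$, $N$ denoting the length of $Y$, and $\gamma_g \in \{0, 1\}$ indicating whether the $g$th group of coefficients is included, the hierarchical Bayesian model is

$$Y \mid \beta, \sigma^2 \sim N_N\Big(\sum_{g=1}^{G} X_g \beta_g,\ \sigma^2 I_N\Big),$$

$$\beta_g \mid \gamma_g, \sigma^2, \tau_g^2 \sim \gamma_g\, N_m\big(0,\ \sigma^2 \tau_g^2 I_m\big) + (1 - \gamma_g)\, \delta_0(\beta_g), \qquad g = 1, \ldots, G,$$

$$\tau_g^2 \sim \mathrm{Gamma}\Big(\frac{m+1}{2},\ \frac{\lambda^2}{2}\Big), \qquad \sigma^2 \sim \text{Inverse-Gamma}(a_\sigma, b_\sigma), \qquad \gamma_g \mid w \sim \mathrm{Bernoulli}(w), \qquad w \sim \mathrm{Beta}(a, b),$$

where $\delta_0$ denotes a point mass at $0 \in \mathbb{R}^m$, so that $\gamma_g = 1$ for $\beta_g \neq 0$ and $\gamma_g = 0$ otherwise. Here we use a spike and slab prior on $\beta_g$ and get the ranking of the potential regulatory links from the posterior inclusion probabilities $P(\gamma_g = 1 \mid Y)$. Although we can place a positive and very small $w$ as a prior when the in-degree of the target gene is small, a lot of false positive interactions are still predicted. Inspired by the maxP technique proposed by [20], we use a prior proposed in [23] to place a restriction on $\gamma$ that only allows models of small size:

$$p(\gamma \mid w, t) \propto w^{|\gamma|} (1 - w)^{G - |\gamma|}\, \mathbf{1}(|\gamma| \le t), \qquad |\gamma| = \sum_{g=1}^{G} \gamma_g.$$

Here the integer-valued hyperparameter $t$ restricts the maximum number of parents for the target gene in each iteration. However, some genes are still regulated by a large number of genes, so a fixed $t$ will affect the accuracy of the prediction. Thus a uniform prior on $\{1, \ldots, T\}$ is placed on $t$, where $T$ is a predetermined integer. Then the model becomes the hierarchy above with the independent Bernoulli prior on $\gamma$ replaced by the joint prior

$$p(\gamma, w, t) \propto w^{a + |\gamma| - 1} (1 - w)^{b + G - |\gamma| - 1}\, \mathbf{1}(|\gamma| \le t)\, \mathbf{1}\big(t \in \{1, \ldots, T\}\big).$$

The likelihood is

$$p(Y \mid \beta, \sigma^2) \propto (\sigma^2)^{-N/2} \exp\Big\{-\frac{1}{2\sigma^2} \Big\| Y - \sum_{g=1}^{G} X_g \beta_g \Big\|^2\Big\}.$$

According to the prior and the likelihood above, the joint posterior distribution given the data is

$$p(\beta, \gamma, \tau^2, \sigma^2, w, t \mid Y) \propto p(Y \mid \beta, \sigma^2) \Big[\prod_{g=1}^{G} p(\beta_g \mid \gamma_g, \sigma^2, \tau_g^2)\, p(\tau_g^2)\Big]\, p(\sigma^2)\, p(\gamma, w, t).$$

The Gibbs sampling scheme is as follows. We use $\beta_{(g)}$ to denote the coefficient vector without the $g$th group and $X_{(g)}$ to denote the covariate matrix corresponding to $\beta_{(g)}$. Writing $r_g = Y - X_{(g)} \beta_{(g)}$ and $\gamma_{(g)}$ for $\gamma$ without $\gamma_g$, the full conditionals of $\gamma_g$ and $\beta_g$ are

$$p(\gamma_g = 1, \beta_g \mid \text{rest}) \propto w\, \mathbf{1}\big(|\gamma_{(g)}| < t\big)\, (2\pi \sigma^2 \tau_g^2)^{-m/2} \exp\Big\{-\frac{(\beta_g - \mu_g)^T \Sigma_g^{-1} (\beta_g - \mu_g) + \|r_g\|^2 - \mu_g^T \Sigma_g^{-1} \mu_g}{2\sigma^2}\Big\},$$

$$p(\gamma_g = 0, \beta_g \mid \text{rest}) \propto (1 - w)\, \delta_0(\beta_g) \exp\Big\{-\frac{\|r_g\|^2}{2\sigma^2}\Big\},$$

where $\Sigma_g = \big(X_g^T X_g + \tau_g^{-2} I_m\big)^{-1}$ and $\mu_g = \Sigma_g X_g^T r_g$.

Integrating out $\beta_g$, we have

$$P(\gamma_g = 1 \mid \text{rest}) \propto w\, \mathbf{1}\big(|\gamma_{(g)}| < t\big)\, (\tau_g^2)^{-m/2} |\Sigma_g|^{1/2} \exp\Big\{\frac{\mu_g^T \Sigma_g^{-1} \mu_g - \|r_g\|^2}{2\sigma^2}\Big\}, \qquad P(\gamma_g = 0 \mid \text{rest}) \propto (1 - w) \exp\Big\{-\frac{\|r_g\|^2}{2\sigma^2}\Big\}.$$

From these equations, we can draw $\gamma_g$ through

$$P(\gamma_g = 1 \mid \text{rest}) = \frac{w R_g}{w R_g + 1 - w}\, \mathbf{1}\big(|\gamma_{(g)}| < t\big), \qquad R_g = (\tau_g^2)^{-m/2} |\Sigma_g|^{1/2} \exp\Big\{\frac{\mu_g^T \Sigma_g^{-1} \mu_g}{2\sigma^2}\Big\}.$$

Then the full conditional posterior distribution of $\beta_g$ is a point mass at $0$ when $\gamma_g = 0$, and when $\gamma_g = 1$ it satisfies

$$p(\beta_g \mid \gamma_g = 1, \text{rest}) \propto \exp\Big\{-\frac{(\beta_g - \mu_g)^T \Sigma_g^{-1} (\beta_g - \mu_g)}{2\sigma^2}\Big\}.$$

Thus, the full conditional distribution of an included $\beta_g$ is a normal distribution:

$$\beta_g \mid \gamma_g = 1, \text{rest} \sim N_m\big(\mu_g,\ \sigma^2 \Sigma_g\big).$$

The full conditionals of $w$ and $t$ are

$$p(w, t \mid \text{rest}) \propto w^{a + |\gamma| - 1} (1 - w)^{b + G - |\gamma| - 1}\, \mathbf{1}\big(\max(1, |\gamma|) \le t \le T\big).$$

Then the full conditional distribution of $w$ is

$$w \mid \text{rest} \sim \mathrm{Beta}\big(a + |\gamma|,\ b + G - |\gamma|\big).$$

The full conditional distribution of $t$ is

$$t \mid \text{rest} \sim \mathrm{Uniform}\big\{\max(1, |\gamma|), \ldots, T\big\}.$$

Then the conditional posterior distribution of $\tau_g^{-2}$ for an included group is

$$\tau_g^{-2} \mid \gamma_g = 1, \text{rest} \sim \text{Inverse-Gaussian}\big(\mu_g', \lambda'\big),$$

where $\mu_g' = \sqrt{\lambda^2 \sigma^2 / \|\beta_g\|^2}$ and $\lambda' = \lambda^2$; for an excluded group, $\tau_g^2$ is drawn from its $\mathrm{Gamma}\big((m+1)/2, \lambda^2/2\big)$ prior. It can be verified that the conditional posterior distribution of the remaining parameter is

$$\sigma^2 \mid \text{rest} \sim \text{Inverse-Gamma}\Big(a_\sigma + \frac{N + m|\gamma|}{2},\ b_\sigma + \frac{1}{2}\Big\|Y - \sum_{g=1}^{G} X_g \beta_g\Big\|^2 + \frac{1}{2}\sum_{g=1}^{G} \gamma_g \frac{\|\beta_g\|^2}{\tau_g^2}\Big).$$

A Monte Carlo EM algorithm is used to estimate $\lambda$:

$$\lambda^{(k)} = \sqrt{\frac{Gm + G}{\sum_{g=1}^{G} E_{\lambda^{(k-1)}}\big[\tau_g^2 \mid Y\big]}},$$

where $Gm$ is the number of the total regressors and $E_{\lambda^{(k-1)}}[\tau_g^2 \mid Y]$ can be replaced by the sample average of $\tau_g^2$ generated in the E-step of the Gibbs sampler. We keep the second half of the samples, and the result is the average of these samples.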
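The sampling scheme of this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `gibbs_group_ss`, the fixed value of the penalty parameter `lam` (no Monte Carlo EM step), the hyperparameter defaults, and the symbols `w` (inclusion probability), `t` (size cap), and `T` (its upper bound) are all choices made for this example, and `Y` is assumed to be centered.

```python
import numpy as np

def gibbs_group_ss(Y, X_groups, n_iter=2000, lam=1.0, T=3,
                   a=1.0, b=1.0, a_sig=0.1, b_sig=0.1, seed=0):
    """Return posterior inclusion frequencies from the second half of the chain."""
    rng = np.random.default_rng(seed)
    N, G, m = len(Y), len(X_groups), X_groups[0].shape[1]
    beta = np.zeros((G, m))
    gamma = np.zeros(G, dtype=int)
    tau2 = np.ones(G)
    sigma2, w, t = 1.0, 0.5, T
    fitted = np.zeros(N)                      # running value of sum_g X_g beta_g
    kept = []
    for it in range(n_iter):
        for g, Xg in enumerate(X_groups):
            r = Y - fitted + Xg @ beta[g]     # residual with group g removed
            Sigma = np.linalg.inv(Xg.T @ Xg + np.eye(m) / tau2[g])
            Xtr = Xg.T @ r
            mu = Sigma @ Xtr
            if gamma.sum() - gamma[g] >= t:   # size cap: no room for group g
                p_in = 0.0
            else:
                # log of w * R_g / (1 - w); mu' Sigma^{-1} mu equals mu' X_g' r
                log_odds = (np.log(w) - np.log1p(-w)
                            - 0.5 * m * np.log(tau2[g])
                            + 0.5 * np.linalg.slogdet(Sigma)[1]
                            + 0.5 * (mu @ Xtr) / sigma2)
                p_in = 1.0 / (1.0 + np.exp(-np.clip(log_odds, -700, 700)))
            gamma[g] = int(rng.random() < p_in)
            if gamma[g]:
                L = np.linalg.cholesky(Sigma)
                new_beta = mu + np.sqrt(sigma2) * (L @ rng.standard_normal(m))
                # 1 / tau_g^2 is inverse Gaussian for an included group
                ig_mean = lam * np.sqrt(sigma2) / np.linalg.norm(new_beta)
                tau2[g] = 1.0 / rng.wald(ig_mean, lam ** 2)
            else:
                new_beta = np.zeros(m)
                tau2[g] = rng.gamma((m + 1) / 2, 2.0 / lam ** 2)  # prior draw
            fitted += Xg @ (new_beta - beta[g])
            beta[g] = new_beta
        resid = Y - fitted
        shape = a_sig + 0.5 * N + 0.5 * m * gamma.sum()
        rate = (b_sig + 0.5 * (resid @ resid)
                + 0.5 * np.sum(gamma * (beta ** 2).sum(axis=1) / tau2))
        sigma2 = rate / rng.gamma(shape)      # inverse-gamma draw
        w = rng.beta(a + gamma.sum(), b + G - gamma.sum())
        t = rng.integers(max(1, gamma.sum()), T + 1)  # uniform on allowed caps
        if it >= n_iter // 2:                 # discard the first half as burn-in
            kept.append(gamma.copy())
    return np.mean(kept, axis=0)

# Synthetic check: only the first of five groups drives the response.
rng = np.random.default_rng(1)
X_groups = [rng.normal(size=(100, 3)) for _ in range(5)]
Y = X_groups[0] @ np.array([1.5, -1.0, 0.8]) + 0.3 * rng.normal(size=100)
inclusion = gibbs_group_ss(Y - Y.mean(), X_groups, n_iter=600)
print(np.round(inclusion, 2))
```

In a GRN setting each entry of `X_groups` would be the B-spline basis matrix of one candidate regulator, and the returned frequencies would serve as the ranking scores for the incoming edges of the target gene.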

3. Results

To demonstrate the effectiveness of the topology information and the B-spline functions, we use our method to infer GRNs from in silico time-series data and real biological data; a linear model with topology information and a nonlinear model without topology information are applied as competing methods. We use the time-series data from the DREAM3 and DREAM4 challenges as the in silico data, and a cell cycle regulatory subnetwork in Saccharomyces cerevisiae and a human HeLa cell network as the real biological datasets. We generate 10000 samples from the posterior distribution and use the second half of the samples to derive the results. The posterior estimates of all the parameters are obtained as the posterior averages of the chains. For the B-spline functions, we adopt the setting of [14] and use a cubic B-spline with 10 interior knots. Here we choose in our experiments.

3.1. Application to In Silico Networks

We first evaluate our method on the DREAM4 challenge networks of sizes 10 and 100 [24–26]. The size 10 data consists of 5 simulated networks, each with 21 time points and 5 replicates. The size 100 data also consists of 5 simulated networks, each with 21 time points and 10 replicates. We also evaluate our method on the DREAM3 challenge networks of size 10, which are also used in [27]. This data consists of 5 simulated networks, each with 21 time points. Steady-state data is also provided by the DREAM4 challenge; however, we focus only on time-series data in this article. Although the winning entry in the DREAM4 competition used only the knock-out data [28], and combining the time-series and steady-state data can achieve much better results [27, 29], it is infeasible in practice to do knock-out experiments for all genes, and knock-out experiments are generally done for only a small subset of genes [30].
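To make the data layout concrete, the sketch below shows one way replicate time series could be stacked into lag-1 regression pairs for a single target gene. The helper name `lagged_pairs` and the choice to pool all replicates into one regression are assumptions for illustration; random numbers stand in for the DREAM4 measurements.

```python
import numpy as np

def lagged_pairs(series_list, target):
    """Stack (x(t-1), x_target(t)) pairs across replicate time series.

    series_list: list of arrays, each of shape (n_timepoints, n_genes).
    Returns (X_prev, y); X_prev excludes the target gene's own column.
    """
    X_rows, y_rows = [], []
    for series in series_list:
        series = np.asarray(series, dtype=float)
        X_rows.append(np.delete(series[:-1], target, axis=1))  # regulators at t-1
        y_rows.append(series[1:, target])                      # target at t
    return np.vstack(X_rows), np.concatenate(y_rows)

# Same shape as one DREAM4 size 10 network: 5 replicates of 21 time points.
rng = np.random.default_rng(0)
replicates = [rng.normal(size=(21, 10)) for _ in range(5)]
X_prev, y = lagged_pairs(replicates, target=3)
print(X_prev.shape, y.shape)  # (100, 9) (100,)
```

Each replicate contributes 20 transitions, so 5 replicates yield 100 rows, and dropping the target's own column leaves 9 candidate regulators.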

Each of the five networks is inferred using all available time-series data, and the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPR) are computed against the gold standard network topology provided by the DREAM3 and DREAM4 challenges. The prediction performances on the DREAM4 10-gene and 100-gene networks are summarized in Tables 1 and 2. Table 1 shows that the Bayesian lasso and the Bayesian group lasso perform similarly on the size 10 data, while BGL_prior performs better than both in average AUROC and AUPR. For net 2 and net 5, BGL_prior outperforms the other methods significantly. We also compared our method with two other dynamic Bayesian network methods [31]; the result of our method is comparable to these methods. Table 2 shows that the nonlinear model performs poorly on the size 100 data, while the topology information remarkably improves its prediction performance: the Bayesian group lasso with topology information outperforms the Bayesian group lasso in both AUROC and AUPR. These methods also have a higher AUROC than the linear model, although their AUPR is a little worse. Compared with the two other DBN methods, the Bayesian lasso gives similar results, and our method still has the highest average AUROC. The prediction performances on the DREAM3 10-gene networks are summarized in Table 3. Here we also compared our method with another additive model based on ODEs [27] and with Inferelator 1.0 [32]. For Ecoli 1, Ecoli 2, Yeast 2, and Yeast 3, the three additive models perform better than the linear model; for Yeast 1, although BL performs much better than BGL, BGL_prior still gets slightly better results. The average results show that BGL_prior outperforms the other methods in both AUROC and AUPR.
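For reference, the two summary scores can be computed from a matrix of edge confidences as sketched below. This is a simplified version written for illustration: the function name `auroc_aupr` is hypothetical, ties between scores are not handled specially, and AUPR is computed step-wise (average precision), which may differ slightly from the interpolation used by the official DREAM scoring scripts.

```python
import numpy as np

def auroc_aupr(scores, truth):
    """AUROC and AUPR for ranked candidate edges, with self-loops excluded.

    scores[i, j] is the confidence in the edge i -> j; truth[i, j] is True
    when that edge is in the gold standard (at least one edge is assumed).
    """
    scores = np.asarray(scores, dtype=float)
    truth = np.asarray(truth, dtype=bool)
    off_diag = ~np.eye(scores.shape[0], dtype=bool)
    s, y = scores[off_diag], truth[off_diag]
    y = y[np.argsort(-s, kind="stable")]       # labels sorted by decreasing score
    tp = np.cumsum(y)
    fp = np.cumsum(~y)
    recall = tp / y.sum()
    precision = tp / (tp + fp)
    tpr = np.concatenate([[0.0], recall])
    fpr = np.concatenate([[0.0], fp / (~y).sum()])
    auroc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)   # trapezoidal rule
    aupr = np.sum(np.diff(tpr) * precision)                   # step-wise area
    return auroc, aupr

truth = np.zeros((3, 3), dtype=bool)
truth[0, 1] = truth[1, 2] = True
scores = np.array([[0.0, 0.9, 0.8],
                   [0.1, 0.0, 0.7],
                   [0.2, 0.3, 0.0]])
print(auroc_aupr(scores, truth))
```

In this toy example one false edge (0 -> 2) is ranked above a true edge (1 -> 2), which lowers both scores below 1.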

3.2. Application to IRMA Network

The IRMA network is a subnetwork embedded in Saccharomyces cerevisiae consisting of 5 genes: CBF1, GAL4, SWI5, GAL80, and ASH1. The time-series gene expression data includes both switch-on data and switch-off data. The switch-on data is taken from 5 experiments and the switch-off data is taken from 4 experiments, with a total of 142 samples measured by [33] and also used in [20, 34]. The IRMA network is well studied and is a gold standard network. It also has a fixed topology, and the genes in the network are not regulated by other yeast genes. Here we use the precision rate (), recall rate (), and to evaluate the performance and select a best threshold as [35]. The signs of the interactions and self-regulations are not considered; thus the total number of potential interactions is 20. Table 4 shows the inference performance for the IRMA network. The nonlinear model again performs better than the linear model. The method with the prior has a higher TP than the Bayesian group lasso, which implies that the topology information improves the performance. Compared with another B-spline based method [14] and the methods used and compared in [36], our method does not achieve the best performance, but it is comparable to TSNIF and performs much better than the other B-spline based method.
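The thresholded evaluation described above can be sketched as follows. The function name `precision_recall` and the convention that an edge is predicted when its score is at least the threshold are assumptions for this example; the rule for selecting the best threshold follows [35] in the article and is not reproduced here.

```python
import numpy as np

def precision_recall(scores, truth, threshold):
    """Precision and recall of edges with score >= threshold, ignoring self-loops."""
    scores = np.asarray(scores, dtype=float)
    truth = np.asarray(truth, dtype=bool)
    off_diag = ~np.eye(scores.shape[0], dtype=bool)
    predicted = scores[off_diag] >= threshold
    actual = truth[off_diag]
    tp = int(np.sum(predicted & actual))      # true edges that were predicted
    fp = int(np.sum(predicted & ~actual))     # predicted edges that are not true
    fn = int(np.sum(~predicted & actual))     # true edges that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

n_genes = 5
print(n_genes * (n_genes - 1))  # 20 candidate edges once self-loops are dropped

truth = np.zeros((3, 3), dtype=bool)
truth[0, 1] = truth[1, 2] = True
scores = np.array([[0.0, 0.9, 0.8],
                   [0.1, 0.0, 0.7],
                   [0.2, 0.3, 0.0]])
print(precision_recall(scores, truth, threshold=0.75))  # (0.5, 0.5)
```

With 5 genes and self-regulation excluded, the candidate set has 5 × 4 = 20 directed edges, matching the count stated in the text.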

3.3. Application to HeLa Network

We then apply our method to the cell cycle genes in human cancer cell lines (HeLa), which were analyzed by Whitfield [37]. A subnetwork consisting of 9 HeLa cell genes was extracted by Sambo et al. [38], and the topology of this gene regulatory network is determined from the BioGRID database. They also developed a method called CNET to analyze the HeLa network. This network was also analyzed by Lozano et al. [39] and by Shojaie and Michailidis [40], who proposed two penalized methods, grpLasso and TAlasso, to infer causal interactions.

Following the previous studies, we use the third experiment of Whitfield [37], which consists of 47 samples. The results of CNET, grpLasso, and TAlasso are taken from [40]. Table 5 shows the inference performance for the HeLa network. Compared with BGL, BGL_prior has a higher precision. Compared with the other methods, the penalized method seems to perform better than the other B-spline based method and performs similarly to the other two penalized methods. All the true positives of Morrissey's method are also found by BGL and BGL_prior. On the other hand, the interactions from RFC4 to CDC2 and from CDC2 to CCNE1 are found not only by BGL and BGL_prior but also by two of the three other methods compared. This may be because these interactions exist in the real regulatory network but are not included in the BioGRID database.

4. Conclusion

In this study, we propose a fully Bayesian method, based on B-splines, the group lasso, and topology information, to infer gene regulatory networks from time-series data. We use B-spline functions to capture the nonlinear interactions between genes, an $\ell_2$-norm penalty to prevent overfitting, and topology information as a prior, namely, the knowledge that the in-degree decreases exponentially so that most genes have only a small number of regulators. A spike and slab prior is used to facilitate variable selection by putting a multivariate point mass at $\mathbf{0}_{m \times 1}$ for an $m$-dimensional coefficient group. The performance of the proposed method is demonstrated by applications to the DREAM4 in silico network challenges of sizes 10 and 100 and to the real biological data of the IRMA and HeLa cell networks. The results show that the topology information indeed contributes to gene regulatory network inference: it remarkably improves the AUROC on the DREAM4 in silico data and improves the results on the IRMA network and HeLa cell data. The B-spline regression model also performs better than the linear model on the real biological data. Therefore, our method is an effective way of inferring gene regulatory networks from time-series data.

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported in part by the National Science Foundation of China, under Grant 61173111.

References

[1] L.-Z. Liu, F.-X. Wu, and W.-J. Zhang, “Properties of sparse penalties on inferring gene regulatory networks from time-course gene expression data,” IET Systems Biology, vol. 9, no. 1, pp. 16–24, 2015.
[2] A. V. Werhli and D. Husmeier, “Reconstructing gene regulatory networks with Bayesian networks by combining expression data with multiple sources of prior knowledge,” Statistical Applications in Genetics and Molecular Biology, vol. 6, no. 1, 2007.
[3] Y. Watanabe, S. Seno, Y. Takenaka, and H. Matsuda, “An estimation method for inference of gene regulatory network using Bayesian network with uniting of partial problems,” BMC Genomics, vol. 13, supplement 1, p. S12, 2012.
[4] M. Grzegorczyk and D. Husmeier, “Improvements in the reconstruction of time-varying gene regulatory networks: dynamic programming and regularization by information sharing among genes,” Bioinformatics, vol. 27, no. 5, Article ID btq711, pp. 693–699, 2011.
[5] N. Xuan Vinh, M. Chetty, R. Coppel, and P. P. Wangikar, “Gene regulatory network modeling via global optimization of high-order dynamic Bayesian network,” BMC Bioinformatics, vol. 13, article no. 131, 2012.
[6] T. Akutsu, S. Kuhara, O. Maruyama, and S. Miyano, “Identification of genetic networks by strategic gene disruptions and gene overexpressions under a Boolean model,” Theoretical Computer Science, vol. 298, no. 1, pp. 235–251, 2003.
[7] M. I. Davidich and S. Bornholdt, “Boolean network model predicts cell cycle sequence of fission yeast,” PLoS ONE, vol. 3, no. 2, Article ID e1672, 2008.
[8] K.-C. Chen, T.-Y. Wang, H.-H. Tseng, C.-Y. F. Huang, and C.-Y. Kao, “A stochastic differential equation model for quantifying transcriptional regulatory network in Saccharomyces cerevisiae,” Bioinformatics, vol. 21, no. 12, pp. 2883–2890, 2005.
[9] A. Polynikis, S. J. Hogan, and M. di Bernardo, “Comparing different ODE modelling approaches for gene regulatory networks,” Journal of Theoretical Biology, vol. 261, no. 4, pp. 511–530, 2009.
[10] A. A. Margolin, I. Nemenman, K. Basso et al., “ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context,” BMC Bioinformatics, vol. 7, no. 1, article no. S7, 2006.
[11] J. J. Faith, B. Hayete, J. T. Thaden et al., “Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles,” PLoS Biology, vol. 5, no. 1, p. e8, 2007.
[12] G. Michailidis and F. d'Alché-Buc, “Autoregressive models for gene regulatory network inference: sparsity, stability and causality issues,” Mathematical Biosciences, vol. 246, no. 2, pp. 326–334, 2013.
[13] L. E. Chai, S. K. Loh, S. T. Low, M. S. Mohamad, S. Deris, and Z. Zakaria, “A review on the computational approaches for gene regulatory network construction,” Computers in Biology and Medicine, vol. 48, no. 1, pp. 55–65, 2014.
[14] E. R. Morrissey, M. A. Juárez, K. J. Denby, and N. J. Burroughs, “Inferring the time-invariant topology of a nonlinear sparse gene regulatory network using fully Bayesian spline autoregression,” Biostatistics, vol. 12, no. 4, pp. 682–694, 2011.
[15] Y. Ni, F. C. Stingo, and V. Baladandayuthapani, “Bayesian nonlinear model selection for gene regulatory networks,” Biometrics, vol. 71, no. 3, pp. 585–595, 2015.
[16] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society Series B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.
[17] M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,” Journal of the Royal Statistical Society, Series B: Statistical Methodology, vol. 68, no. 1, pp. 49–67, 2006.
[18] S. McKay Curtis, S. Banerjee, and S. Ghosal, “Fast Bayesian model assessment for nonparametric additive regression,” Computational Statistics & Data Analysis, vol. 71, pp. 347–358, 2014.
[19] X.-N. Feng, G.-C. Wang, Y.-F. Wang, and X.-Y. Song, “Structure detection of semiparametric structural equation models with Bayesian adaptive group lasso,” Statistics in Medicine, vol. 34, no. 9, pp. 1527–1547, 2015.
[20] A. Nair, M. Chetty, and P. P. Wangikar, “Improving gene regulatory network inference using network topology information,” Molecular BioSystems, vol. 11, no. 9, pp. 2449–2463, 2015.
[21] R. Albert, “Scale-free networks in cell biology,” Journal of Cell Science, vol. 118, no. 21, pp. 4947–4957, 2005.
[22] X. Xu and M. Ghosh, “Bayesian variable selection and estimation for group lasso,” Bayesian Analysis, vol. 10, no. 4, pp. 909–936, 2015.
[23] Z. Shang and P. Li, “High-dimensional Bayesian inference in nonparametric additive models,” Electronic Journal of Statistics, vol. 8, no. 2, pp. 2804–2847, 2014.
[24] D. Marbach, R. J. Prill, T. Schaffter, C. Mattiussi, D. Floreano, and G. Stolovitzky, “Revealing strengths and weaknesses of methods for gene network inference,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 14, pp. 6286–6291, 2010.
[25] D. Marbach, T. Schaffter, C. Mattiussi, and D. Floreano, “Generating realistic in silico gene networks for performance assessment of reverse engineering methods,” Journal of Computational Biology, vol. 16, no. 2, pp. 229–239, 2009.
[26] R. J. Prill, D. Marbach, J. Saez-Rodriguez et al., “Towards a rigorous assessment of systems biology models: the DREAM3 challenges,” PLoS ONE, vol. 5, no. 2, Article ID e9202, 2010.
[27] J. Henderson and G. Michailidis, “Network reconstruction using nonparametric additive ODE models,” PLoS ONE, vol. 9, no. 4, Article ID A1455, 2014.
[28] A. Pinna, N. Soranzo, and A. de la Fuente, “From knockouts to networks: establishing direct cause-effect relationships through graph analysis,” PLoS ONE, vol. 5, no. 10, Article ID e12912, 2010.
[29] A. Shojaie, A. Jauhiainen, M. Kallitsis, and G. Michailidis, “Inferring regulatory networks by combining perturbation screens and steady state gene expression profiles,” PLoS ONE, vol. 9, no. 2, Article ID e82393, 2014.
[30] W. C. Young, A. E. Raftery, and K. Y. Yeung, “Fast Bayesian inference for gene regulatory networks using ScanBMA,” BMC Systems Biology, vol. 8, article 47, 2014.
[31] M. Bansal, V. Belcastro, A. Ambesi-Impiombato, and D. Di Bernardo, “How to infer gene networks from expression profiles,” Molecular Systems Biology, vol. 3, article no. 78, 2007.
[32] R. Bonneau, D. J. Reiss, P. Shannon et al., “The inferelator: an algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo,” Genome Biology, vol. 7, no. 5, article R36, 2006.
[33] I. Cantone, L. Marucci, F. Iorio et al., “A yeast synthetic network for in vivo assessment of reverse-engineering and modeling approaches,” Cell, vol. 137, no. 1, pp. 172–181, 2009.
[34] A. Emad and O. Milenkovic, “CaSPIAN: a causal compressive sensing algorithm for discovering directed interactions in gene networks,” PLoS ONE, vol. 9, no. 3, Article ID e90781, 2014.
[35] T. Hasegawa, R. Yamaguchi, M. Nagasaki, S. Miyano, and S. Imoto, “Inference of gene regulatory networks incorporating multi-source biological knowledge via a state space model with L1 regularization,” PLoS ONE, vol. 9, no. 8, Article ID e105942, 2014.
[36] M. Ceccarelli, L. Cerulo, and A. Santone, “De novo reconstruction of gene regulatory networks from time series data, an approach based on formal methods,” Methods, vol. 69, no. 3, pp. 298–305, 2014.
[37] M. L. Whitfield, “Identification of genes periodically expressed in the human cell cycle and their expression in tumors,” Molecular Biology of the Cell, vol. 13, no. 6, pp. 1977–2000, 2002.
[38] F. Sambo, B. Di Camillo, and G. Toffolo, “CNET: an algorithm for reverse engineering of causal gene networks,” in Proceedings of the Network Tools and Applications in Biology Workshops (NETTAB '08), Varenna, Italy, 2008.
[39] A. C. Lozano, N. Abe, Y. Liu, and S. Rosset, “Grouped graphical Granger modeling for gene expression regulatory networks discovery,” Bioinformatics, vol. 25, no. 12, pp. i110–i118, 2009.
[40] A. Shojaie and G. Michailidis, “Discovering graphical granger causality using the truncating lasso penalty,” Bioinformatics, vol. 26, no. 18, pp. i517–i523, 2010.

Copyright

Copyright © 2017 Yue Fan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
