Abstract

We have proposed a methodology for the reverse engineering of biologically plausible gene regulatory networks from temporal genetic expression data. We have used established information and the fundamental mathematical theory for this purpose. We have employed the Recurrent Neural Network formalism to extract the underlying dynamics present in the time series expression data accurately. We have introduced a new hybrid swarm intelligence framework for the accurate training of the model parameters. The proposed methodology has been first applied to a small artificial network, and the results obtained suggest that it can produce the best results available in the contemporary literature, to the best of our knowledge. Subsequently, we have implemented our proposed framework on experimental (in vivo) datasets. Finally, we have investigated two medium sized genetic networks (in silico) extracted from GeneNetWeaver, to understand how the proposed algorithm scales up with network size. Additionally, we have implemented our proposed algorithm with half the number of time points. The results indicate that a reduction of 50% in the number of time points does not have an effect on the accuracy of the proposed methodology significantly, with a maximum of just over 15% deterioration in the worst case.

1. Introduction

With the ongoing evolution of technology, massive amounts of temporal genetic expression data for different diseases are becoming available to researchers. The analysis of these data can potentially reveal many unknown cellular activities of living organisms [1, 2]. These data have enough hidden information embedded in them that if suitably analysed can revolutionise biological science and its allied applications like drug design. Accordingly, this has attracted and motivated the research fraternity to undertake detailed investigations in this domain and subsequently develop computational tools required for biologically credible analysis of these data [36]. In this paper, we have examined the reverse engineering of gene regulatory networks (GRNs) from temporal genetic expression datasets. These types of datasets contain crucial underlying information concerning the network dynamics among the genes (through protein).

A GRN represents the complex interregulatory relationships among genes. The transcriptional regulation of genes by other genes involves DNA, RNA, and protein as well as other molecules. The genetic interactions are indirect; that is, a gene does not interact with other genes directly. The indirect interactions take place with the help of proteins (a.k.a. transcription factors). The regulatory relationships (depending on the nature of the control) may be of two types, namely, activation (where there is an increase in the expression value of the target gene) and repression (where the expression value of the target gene decreases). The various processes involved in genetic regulation have been shown in Figure 1.

Genetic expression datasets deal with the expression values of a vast number of interacting genes. Moreover, the number of genes in a dataset is generally two to three times more than the number of time points, at the very least. This imposes a well-known computational problem known as the curse of dimensionality [7]. Another difficulty imposed by microarray datasets is considerable noise contamination [8]. The current work deals with small to medium sized networks and hence is not faced with the former problem. However, the authors do focus on the performance of the proposed methodology in the presence of noise.

In this paper, we have proposed a new methodology for the accurate extraction of the topology of a GRN from any given noisy temporal genetic expression dataset using a statistical paradigm based on the theory of combination. The methodology has an underlying hybrid swarm intelligence framework which is basically a Bat Algorithm (BA) inspired Particle Swarm Optimization (PSO) algorithm christened BAPSO by the authors. Here, better results have been obtained compared to the contemporary literature for the benchmark networks considered. The proposed methodology uses the Recurrent Neural Network (RNN) for modelling the required network dynamics.

According to Bolouri and Davidson [9], a gene in a GRN is usually regulated by 4 to 8 other genes. We have proposed a novel GRN construction strategy (based on this concept) that generates candidate architectures with a limit to the maximum number of regulators for each gene in the network. Since, in this work, we have studied only small-scale and medium-scale networks, we have assumed the maximum number of regulators to be 4 [9]. The fundamental mathematical theory of combination has been applied to search all the candidate solutions in the discrete search space of network constructions exhaustively. The corresponding RNN model parameters have been trained by the proposed hybrid metaheuristic technique that can replicate the original network dynamics faithfully. The quality of a solution architecture depends on the quantum of error in the predicted dynamics. The authors have observed in this investigation that biologically plausible candidate architectures return much-reduced prediction errors compared with those which are far removed from real-world network structures.

We have implemented our proposed algorithm on three different types of data:(i)A synthetic dataset generated from an artificial network which has been studied quite extensively concerning reverse engineering of GRNs.(ii)A real-world experimental dataset (in vivo), that is, the DNA SOS repair network of E. coli.(iii)An artificial dataset generated in silico from a real-world network of E. coli.(iv)Another artificial dataset generated in silico from a real-world network of yeast.

Also, we have incorporated networks from small to medium scale in this work (i.e., 4-gene to 20-gene networks). In the case of the synthetic dataset, GRNs predicted by our proposed methodology generate improved results concerning the prediction of correct as well as incorrect regulations, compared to the best existing results in the contemporary literature (to the best of our knowledge). In the case of the in silico experiments, the results suggest that our proposed algorithm is robust enough to return fewer incorrect predictions along with an increase in the number of correct predictions compared to the best available outcomes in recent research endeavours. For the real experimental datasets, it has been observed that the proposed methodology can identify all the possible gene regulatory relations, some of which are quite elusive to the contemporary as well as previous researchers.

The rest of the paper has been organized as follows. The background of temporal genetic expression dataset study has been presented in the next section with an outline of the existing methodologies for reverse engineering of GRNs. The subsequent section presents our proposed framework in detail. Experimental results have been presented and discussed next. The final section concludes the paper, highlighting some future research scopes.

2. Preliminaries

2.1. Background

Traditional investigations in the domain of molecular biology provide vital information about the functioning of the genetic system in a living cell. Regrettably, this information so far is inadequate for us to comprehend the complex gene regulatory mechanisms fully. Among such techniques, southern blotting was first reported by Augenlicht and Kobrin [10], and it is the origin of DNA microarray technology [11]. Improvements in this technology have allowed us to measure the expression levels of thousands of genes simultaneously under various circumstances. These data help us in the pursuit of disclosing the knowledge about the regulatory interactions between genes of the entire genome of a living organism. Despite these advancements, there remain numerous open challenges in system biological research domain.

Various approaches exist in the contemporary literature for the construction of GRNs from time series genetic expression datasets. At the outset, researchers attempted to employ clustering algorithms on temporal expression data based on pairwise correlation coefficients [12] and Euclidian distances [13] for the reconstruction of GRNs. Information theory based approaches that made use of “mutual information” between different expression profiles have also been implemented by researchers for defining similarity between genes [1416]. Application of Bayesian networks for modelling of GRNs is also quite popular among the researcher fraternity [1720].

GRNs can be effectively constructed using the dynamical modelling formalisms [21] such as Boolean networks [22], where Boolean variables are used to represent the interaction between genes, and the ordinary differential equations based method, -systems [2325], where in-depth biochemical kinetic models are used to simulate gene network architectures. All of the above can reproduce the structure as well as the temporal expression profiles from temporal genetic expression profiles.

Additive regulation networks have also been used by researchers to represent the dynamics of a GRN [26]. The collective regulatory effect of a group of genes on a target gene can be represented in this formalism. The intensity and type of a particular interaction between a target () and a regulator () are defined by : a positive value denotes expression (facilitation) and a negative value denotes repression while a zero (0) value implies that there is no interaction between and . Thus, a GRN can be represented by a weight matrix , where the number of genes in the GRN is equal to [27, 28]. Another model somewhat analogous to this model is the Recurrent Neural Network (RNN) model, which has been effectively used in the reconstruction of GRNs from temporal expression data by several contemporary researchers [2935]. The theoretical background of the RNN formalism has been discussed in detail in the next section. This forms the basis of our proposed modified framework. Figure 2 shows the representation of a GRN by an RNN model.

2.2. Recurrent Neural Networks

The regulation of the expression of any particular gene, by another gene or a group of genes, can be expressed with the help of the Recurrent Neural Network formalism [30, 3638] as shown in Figure 2. Each node symbolises a particular gene and the edges between the nodes represent the regulatory interactions among the genes. Each tier of the neural network defines the genetic expression level of the genes at a specified time . The level of expression of any particular gene at a time depends upon the genetic expression level of all the genes () at the preceding time and the weights of their corresponding connecting edges () with that particular gene. Thus, the total regulatory effect of all the genes in a network, on any gene , can be summarised as follows:This can be transformed using a sigmoid function, within an interval , as has been shown by Vohradsky [31]. Here, symbolises an external input, which may be visualised as a reaction delay parameter. A higher (large) value of this parameter indicates a reduction of the effect (influence) of on . The actual genetic expression rate is subsequently modulated by a multiplicative constant that defines the peak expression level of a particular gene [31]. The rate of expression of any gene can be defined as the total of the regulatory effects of other genes on it minus its degradation . This is represented bywhere the degradation factor can be modelled based on the kinetic framework of a first-order biochemical equation represented as [31]. The term represents the entire regulatory effect on the expression of the gene, (represented as ). The constant signifies the rate constant of degradation of the gene product . Thus,where denotes the sigmoid transfer function and signifies the concentrations of the elements of the given system (for ; ). The above expresses the dynamics of expression of a gene and denotes a “node” function [31]. Each node can be connected with all the other nodes to form a neural network (Figure 2). The weight matrix describes the connection between the nodes of the network; a nonzero value of means that a connection between nodes and exists. The magnitude of the weight signifies the strength of an interaction, or a regulatory effect, between the two nodes. The neural network is completely defined by the differential equations respective to the particular nodes, and the number of equations is determined by the number of nodes. The quantum of genetic expression at any arbitrary time can be calculated by solving the set of differential equations. Equation (3) represents a special case of a class of RNNs described by the more general equation:This is a continuous model that has been used for modelling brain activity pattern and the study of its dynamics. If the weight matrix is symmetrical in nature, the network it represents reaches stability in finite time. Now, in real world, time series data are obtained at discrete time points only for which (5) can be rewritten in its discrete format as follows:

The dynamics of a GRN can be parametrically modelled using appropriate dynamical methodologies such as Bayesian networks, Boolean networks, Recurrent Neural Networks, and -systems. This indicates the significance of identification of the underlying information regarding genetic interactions present in the temporal expression data of a regulatory network. The purpose of any reverse engineering framework is the accurate inference of the applied model’s parameters for the faithful reproduction of the given time series data. This can be viewed as an optimization problem, where the model parameters are trained to minimise the difference between the simulated and the original time series data. Determination of the mean square error (MSE) from the above can be a suitable measure of this specification:Here, is the total number of genes (nodes) in the network, is the total number of time points available, is the original expression data, and is the simulated data at any point of time .

2.3. Major Concerns

One of the major hurdles in the reverse engineering of GRN from temporal gene expression data is the curse of dimensionality. It arises from the fact that the number of genes in a dataset is usually two to three orders higher than the number of time points, and it severely reduces the prediction capacity of the given formalisms. Researchers have attempted to solve this problem to some extent in [28, 30, 32, 33, 39, 40]. The present work focuses on small- to medium-scale networks only (4 genes to 20 genes) and thus does not face the entire severity of this problem.

The RNN methodology has been implemented, in this paper, to model the temporal expression data. For that purpose, the RNN model parameters require training, which, in essence, is an optimization problem. Several metaheuristic techniques, like Simulated Annealing [30, 41], Genetic Algorithm (GA) [32, 33, 40, 42, 43], Differential Evolution [44], Particle Swarm Optimization [34, 45, 46], and so forth, have been and are being implemented for this purpose with various levels of accuracy. The proposed methods, however, have largely been ineffective to accurately infer even small-scale real-life GRNs. A few have been able to identify all the true regulations but in the process have also inferred unwanted false regulations. Moreover, the “No Free Lunch” (NFL) theorem [47] rationally states that there is no single metaheuristic that is most appropriate for solving all types of optimization problems. Therefore, finding out the most suitable and efficient optimization techniques for the accurate inference of small GRNs is still an open problem for researchers.

Nevertheless, the number of parameters in need of training undergoes quadratic scaling with respect to the number of genes in a GRN. This fact imposes severe hindrance in keeping the dimension of the optimization problem at a reasonable computational limit. As a result, optimization of model parameters becomes implausible for practical values of (i.e., , 1000, etc.). To solve this difficulty, researchers have proposed strategies like decomposition of the problem of global optimization of parameters into local problems of parameter optimization for a single target gene only [40, 43, 48, 49]. Other strategies, such as interpolation [36, 50, 51], the addition of noisy duplicate copies [52], and use of suitable thresholds [37, 52], have also been implemented to limit the number of optimizable parameters. Interpolation strategies usually have a drawback: they are incapable of faithfully summarising the dynamics between any two time points. According to van Someren et al., such strategies can bring about only a minimal reduction in the dimension of the optimization problem, irrespective of the number of additional time points [52]. Other strategies also fail to improve upon the situation as they also fail to add any supplementary information to the network dynamics.

Fortuitously, extensive biological research, in the perspective of reverse engineering of GRNs, confirms that there exist only a handful of genes that act as regulators in a GRN [7, 9, 19, 36, 37]; that is, GRNs are connected sparsely. Mathematically, this implies that we can assign a zero value to a large number of the model parameters that represent the one-on-one regulatory relationships. Thus, a considerable reduction in the dimensionality of the optimization problem can be achieved.

Researchers strived to develop a suitable optimization environment integrating this sparseness concept and achieve a significant improvement in the problem solution. This entailed some form of architectural constraint to be imposed on the predicted networks. Researchers have also found that it is possible to decouple the structural and the dynamic aspects of the given reverse engineering problem. In other words, there is scope for the application of a suitable technique that can decouple the problem into two independent subproblems: the search for candidate architectures in the discrete search space of network structures and the search for suitable model parameters in the corresponding continuous search space of parameters of dynamical formalisms [34, 35, 45, 46, 53, 54]. Thus, an endeavour to locate suitable model parameters may supervise the pursuit of detection of the candidate networks. The accuracy of a trained model, assessed from the perspective of reproducing the original dynamics, determines the appropriateness of a predicted architecture. The level of precision can be ascertained from the MSE calculated for the predicted models as per (7). A genetic interaction is appended to a predicted architecture or removed from it based on the value of the calculated MSE. Thus, the extra burden of searching the biologically plausible network architectures within a separate discrete search space of candidate network architectures is compensated by a considerable reduction in the dimension of the problem of training the dynamic model parameters.

3. Methods

3.1. Decoupled Strategy

In this section, we have explained in detail the methodology implemented in this work. Firstly, we have represented a GRN with the help of a directed graph; represents a GRN, where denotes the set of all nodes (genes) and is the set of all edges (the interaction between a pair of genes). An edge, , is present in the set if and only if there exists an interaction between node (gene) and node (gene) . Here, signifies that gene regulates gene , and this convention has been used right through this work. A directed graph can be represented by an adjacency matrix for computational purposes. An adjacency matrix , where is the number of nodes in the graph (i.e., the number of genes in the network). The element can take any value, 0 or 1, depending on the absence or presence of a directed edge from node to node , respectively.

Now, the methodology proposed in this work, for the reverse engineering of GRNs from temporal expression datasets, employs the decoupling strategy discussed in the previous section [34, 35, 45, 46, 53, 54]. Here, we have first reduced the search space of candidate network structures by restricting the number of regulators [9] on a particular gene in a GRN. Subsequently, we have implemented the theory of combination to exhaustively search the reduced candidate network architecture space. In other words, if there are genes in a GRN and is the maximum number of regulators allowed for a gene, then the search space dimension is, by definition, or . This is much less than the original search space dimension of . Additionally, since our proposed algorithm is performing exhaustive search in the reduced space, it has a high chance of obtaining biologically plausible candidate network architectures. Mathematically, there are GRN structures, each represented by ().

In the next phase, the RNN formalism has been implemented to model the underlying dynamics from the temporal genetic expression profiles based on the candidate network structures obtained in the previous phase. In other words, the weight matrix of the RNN formalism has been initialised based on all ’s defined. We have used the proposed BAPSO technique to train the RNN model parameters, that is, , , and , accurately such that the predicted expression profiles match the original expression profiles faithfully. The MSE defined by (7) determines the quality of a candidate solution , and the candidate solution with the least MSE has been chosen as the most reasonable from all the or candidates.

It would be interesting to note here that each of the genes in a GRN may not always be regulated by the maximum number of allowed regulators; that is, genes. Therefore, we have gradually incremented the value of from 1 to 4, and the MSE has been calculated for each case. A satisfactorily low value of MSE ~ has been used as the selection criterion for a candidate solution.

A further problem encountered in this endeavour is the dimensionality of the RNN model parameter training problem. For genes in a GRN, there are parameters to be trained for a particular RNN instance with the help of the BAPSO technique, and this essentially becomes computationally unrealistic for large values of . To further reduce the computational load, in this work, we have decomposed this problem into subproblems where, in each subproblem, parameters are trained for each of the genes, independently. In case of each of the independent subproblems, the aim is to minimise the error term defined as

Here, and which is subsequently used for the calculation of MSE. Hence,

The MSE governs the overall quality of the candidate solutions. The lower the value of the term , the more efficient the reduction in the difference between the predicted temporal expression profile and the original one and the more suitable the candidate network architecture. It is, therefore, the ultimate objective of the proposed methodology to reconstruct a network architecture that is biologically plausible and at the same time capable of replicating the original network dynamics more accurately.

3.2. The Proposed Metaheuristic

The training of the model parameters of the RNN instances has been achieved using the proposed BAPSO algorithm. Among all the proposed swarm intelligence techniques to date, Particle Swarm Optimization (PSO) [5558] is conspicuous for being simple yet efficient, robust, easily tractable, and easy to code. PSO yields solutions that are of the same or better quality compared to GA for a wide array of problems and possesses a faster convergence rate. A particle swarm comprises some particles arbitrarily dispersed in a search space. The positions of these individual particles denote candidate solutions. The intention of any particle is to find the optimum solution utilising the knowledge acquired through social interactions with its neighbours. Each particle in a swarm is specified by its position , its velocity , and its memory of the best solution achieved by it so far . Another memory element denotes the best solution attained thus far by the swarm and is shared among all particles.

The position of a particle signifies the vector containing all the parameters of an RNN instance. The fitness of a particle is calculated using either (8) or (7), depending upon whether the decoupled strategy has been implemented or not, respectively. In other words, if someone chooses not to use the decoupled strategy, then the quality of the solution is determined by (7). However, since, in the decoupled strategy, each gene is dealt with separately, the quality of a solution is determined by (8), and we have used this only. For each generation, the updated position and velocity of a particle for the next generation are calculated based on its best solution achieved so far and the best solution obtained by the entire swarm thus far. Hence,where is the inertia weight term, and it controls the dynamic balance between exploration and exploitation undertaken by a particle. Again, and are random numbers in the range and usually . The terms and determine the effect (on the particle velocity) of the best solutions achieved by a particle and the swarm, respectively. The terms are all in a matrix format and thus it is sensible to point out that elementwise multiplications are necessary here and have been symbolised by .

BA has been recently formulated by Yang based on the echolocation property of real bats [59, 60]. In BA, the virtual bats locate food and inform others about the food source with the help of sound waves. The virtual bats are assumed to have the ability to modulate the sound waves according to the need, that is, locating food/prey or communicating with others. The virtual bats are also scattered in the search space, with the position of each virtual bat denoting possible solutions. A virtual bat is completely specified by its position , its velocity , loudness , and frequency . A memory element stores the position of the best food source found so far. The velocity and position of a virtual bat are updated according to the following equations:where is a random vector. In this investigation, if standalone BA had been used, then the pertinent values would have been and . At the outset, each virtual bat is arbitrarily allocated a frequency from , drawn uniformly. This frequency term controls the movement of the virtual bats in the search space, similar to what the inertia weight term does in case of PSO, as can be seen in (10). There are various ways of updating the inertia weight for PSO.

In this paper, we have proposed a new technique based on the update technique of frequency of virtual bats in BA. We have proposed to update the inertia weight in PSO in each iteration using the following equation:where is also a random vector. We have assumed and . In each iteration, for each particle, an inertia weight is drawn uniformly from . This somewhat counterbalances the problem of being trapped at local minima, which is one of the few but major shortcomings of PSO. The proposed novel BAPSO algorithm, with the new inertia weight update technique, used for the particular problem domain dealt with herein, has been able to produce better results than individual PSO or BA algorithms (as suggested by other investigations carried out by the authors).

Another change, inspired by the virtual bats in BA that has been incorporated in the proposed BAPSO algorithm, is the initialization of the velocity vector of each particle to zero instead of a random vector. The authors observe that this might help in preventing the particles from having an initial unguided velocity that may divert them away from a potential optimal solution in the search space.

4. Experimental Results and Discussion

Owing to the stochastic nature of the proposed framework implemented, it is quite normal that, for any given temporal genetic expression dataset, the predicted GRN would vary in its topology for each independent solution generated. To circumvent this problem, we have employed, in this investigation, a collaborative learning method. We have performed independent experiments and have stored in memory each inferred GRN. Additionally, a selection scheme has been implemented based on a plausibility score , assigned to each edge , as given below. This has been done to identify the most consistent predicted edges for the construction of the final GRN:

In (14), and . On the evaluation of for all and , the final predicted network, thus, can be generated and represented by . Whether the value of a particular element is 0 or 1 can be evaluated using the following relation:

In the above equation, is a threshold defined for the purpose of inclusion of an interaction in a GRN. In other words, it governs whether an edge is included in or omitted altogether. In order to estimate the accuracy of the proposed methodology, we have compared with the original GRN, denoted by . In addition, we have quantitatively compared the results of the proposed framework with those from the contemporary literature based on certain metrics. Before explaining the metrics, it would be prudent to mention that an edge can be characterised into four types: true positive (TP), false positive (FP), true negative (TN), and false negative (FN), with their mathematical definitions as follows:TP: if and ; TN: if and .FP: if and ; FN: if and .

Next, we have defined the metrics one by one based on which the proposed methodology can be evaluated.

(i) True Positive Rate (TPR)/Sensitivity/Recall. This signifies the fraction of the total number of existing edges in the original network, correctly predicted in the inferred network.

(ii) True Negative Rate/Specificity (SPC). This signifies the fraction of the total number of nonexistent edges in the original network, correctly identified as nonexistent in the inferred network as well.

(iii) False Positive Rate (FPR)/Complimentary Specificity. This signifies the fraction of the total number of nonexistent edges, incorrectly predicted in the inferred network.

(iv) False Negative Rate (FNR)/Complimentary Sensitivity. This signifies the fraction of the total number of nonexistent edges in the original network, incorrectly guessed in the predicted network.

(v) Positive Predictive Value (PPV)/Precision. This signifies the fraction of the total number of inferred edges, which is correct.

(vi) False Discovery Rate (FDR)/Complimentary Precision. This signifies the fraction of the total number of inferred edges, which is incorrect.

(vii) Accuracy (ACC). This signifies the fraction of the total number of all possible connections, in the original network, truly predicted.

(viii) F-Score. This signifies the harmonic mean of the precision and sensitivity.

Mathematically speaking,The statistical BAPSO methodology has been applied primarily on an artificial network (4 genes). Subsequently, we have applied the proposed methodology on a group of experimental (in vivo) time series genetic expression datasets of a real-world network (the 8-gene E. coli SOS DNA repair network). Finally, we have experimented with two networks extracted from the genome of Saccharomyces cerevisiae (10 genes) and Escherichia coli (20 genes) with the help of GeneNetWeaver [63]. Additionally, we have implemented our proposed algorithm on each of these networks, but with half the number of time points initially used for experimentation. This has been done to observe the accuracy of the method if a lesser number of time points are available for training.

All the simulations have been run on a desktop computer running on a 3.4 GHz Intel Core i7 processor with 8 GB 1600 MHz RAM. The codes have been run on Matlab 2014a, running in a Windows 7 64-bit environment.

4.1. Artificial Network

This artificial network consists of 4 genes and 8 interactions. This network has been extensively studied by researchers for the purpose of preliminary validation of their methodologies with respect to reverse engineering of GRNs from time series genetic microarray data [32, 34, 46]. The time series expression data have been generated using (3). The parameters and their related values necessary for calculations have been given in Table 1. The generated expression profiles have been shown in Figure 3. We have assumed for this case and have generated 500 time points with the help of (3).

However, in real-world experiments, such a high number of time points do not typically exist. Therefore, we have sampled the data evenly into 50 time points and have implemented our proposed methodology on the sampled dataset. Further, we have evenly sampled this reduced dataset to produce another dataset with 25 time points.

The reverse engineering initiative involves = 10 independent experiments. We have conducted each experiment with a swarm population of (where ) particles and 100000 iterations. The statistical properties of the final inferred network have been shown in Figure 4, for . Utilising just a single time series, the results show marked improvement over those published by Xu et al. [34] and Kentzoglanakis and Poole [46], with respect to both true and false positives albeit with a stricter threshold than that used by the authors () in [46]. The average MSE for the experiments are ~ and ~ for the dataset with 50 time points and the one with 25 time points, respectively. The computational times for both experiments are 15.6 minutes and 8.6 minutes, respectively.

4.2. E. coli DNA SOS Repair Network

In this section, the proposed methodology for reverse engineering of GRNs from temporal genetic expression profiles has been employed to identify the causal relationships among the genes from an in vivo (experimental) microarray dataset. The said dataset summarises the dynamics of the well-illustrated transcriptional network involved in the SOS DNA repair mechanism of E. coli studied experimentally by Ronen et al. [64]. The study included eight genes heavily involved in the SOS repair mechanism: recA, lexA (the master repressor), uvrA, uvrD, uvrY, umuD, ruvA, and polB. The original network has been shown in Figure 5. Four experimental datasets had been generated using two different UV light intensities on E. coli (for experiments 1 and 2: UV intensity used → 20 Jm−2; for experiments 3 and 4: UV intensity used → 5 Jm−2). In each of the experiments, expression data had been observed for 50 time points each using temporal resolution of 6 minutes. These datasets are one of the most useful ones concerning the qualitative investigations on computational methods for reconstruction of GRNs from time series genetic expression data (which for ready reference is at http://wws.weizmann.ac.il/mcb/UriAlon/sites/mcb.UriAlon/files/uploads/DownloadableData/sosdata.zip).

In this case also, independent experiments have been performed for each of the four datasets. A swarm population of (where ) has been used with a maximum number of iterations set to 5000. The expression value of each gene in each dataset at the first time point is zero and hence has been ignored. Subsequently, all expression values have been normalised to the range . The dataset thus contains 49 time points. We have also taken alternative time points and created a truncated dataset with 25 points. The statistical properties of the predicted GRN in each experiment with a plausibility score threshold, set at , have been shown in Table 2.

Table 3 displays a quantitative comparison of the characteristics of the predicted GRNs (with a plausibility score threshold, ) with those presented in a recent investigative work (with an inclusion threshold, ) [46]. The proposed methodology is consistent regarding the number of true (and false) positives predicted compared to results presented in [46] for different experimental datasets. The method proposed in [46] fails to identify any true positive in the fourth experiment whereas the framework proposed in this paper does not fail to identify true positives for any experiment. However, we have to concede that the proposed framework cannot match the isolated best result obtained by the eDSF model [46] in the case of the second experiment. However, it may be noted that the regulatory relationship between recA and lexA was not inferred in any of the experiments conducted by Kentzoglanakis and Poole [46], whereas the proposed methodology can identify this particular interaction in one of the four experiments. This suggests that it probably is among a few proposed computational frameworks that are capable of identifying all the regulatory interactions present in the SOS response network of E. coli. A qualitative comparison of several such methodologies [19, 34, 46, 61, 62] is given in Table 4. The performance of the methodology with half the number of time points is also admirable and has been included in Tables 2 and 3.

4.3. 10-Gene Network Extracted from GeneNetWeaver (GNW)

The in silico datasets have been extracted from the genome of yeast and E. coli stored in GNW [63]. First, we have considered the yeast network, made up of 10 genes and 12 genetic interactions as shown Figure 6. We have generated the network dynamics with the help of GeneNetWeaver [63] in keeping with DREAM4 settings [65]. Two sets of genetic expression data have been generated, one with 50 time points and the other with 25 time points (taking the alternate time points of the former). The number of independently generated solutions is . Since there are 10 genes in the GRN, the problem has been divided into 10 subproblems, each with 12 parameters to optimise. For each of the suboptimization problems, a swarm population of (where ) has been used, and the maximum number of iterations has been set to 10000.

The results achieved in this experiment have been summarised in Table 5. The proposed methodology can correctly predict 5 (for ) out of a possible 12 interactions present in the original network using the dataset with 50 time points. The proposed methodology can also correctly predict 4 (for ) out of a possible 12 interactions present in the original network using the dataset with 21 time points. With the increase in , the number of incorrect predictions goes down from 13 to 9, increasing the accuracy from 81% to 84%, and from 15 to 11, increasing the accuracy from 77% to 81%, respectively, in the two cases. The final predicted GRNs for the two cases have been shown in Figures 7 and 8.

The results have been compared with previous similar work published in [46] and have been shown in Table 6. The proposed methodology indicates improvement from the perspective of true predictions. Even for the most stringent value of the threshold, that is, , the number of true predictions is significantly more (5 compared to 3) with 50 time points and still better (4 compared to 3) with 25 time points. The true positive rate and the precision of the predicted network are almost always better than the compared network. Considering the nature of the inferred relationships (whether activation or repression), the proposed methodology has correctly identified the nature of 80% of the predicted relationships.

4.4. 20-Gene Network Extracted from GeneNetWeaver (GNW)

The second network extracted from GNW is a 20-gene network consisting of 24 interactions. The datasets for this network have been generated using the same settings as the previous one. We have generated independent solutions. There are 20 genes in this GRN, and hence the problem has been divided into 20 subproblems, each with 22 parameters to optimise. For each of the suboptimization problems, a swarm population of (where ) has been used, and the maximum number of iterations has been set to 10000. The original network is shown in Figure 9.

The proposed RNN based framework does not scale up with the size of the GRN, efficiently. For the 20-gene network considered here, it was able to predict only 3 out of a possible 24 interactions correctly, but with a large number of false positives. Fascinatingly, however, the proposed method can correctly predict 5 interactions out of a possible 24, with 2 less false positives. The predicted correct relations have been shown in Table 7.

5. Conclusion

In this paper, we have investigated the domain of reconstruction of GRNs from time series microarray datasets with modifications in the existing methodologies. For this purpose, we have implemented a decoupled technique based on the novel BAPSO algorithm, the fundamental mathematical theory of combination, and RNN. The main objective of the investigation is to detect the biologically relevant GRNs from the large discrete network architecture search space. Prior knowledge and the fundamental theory of combination have been used for the purpose of reducing the dimension of the optimisation problem, thus reducing the computational load. Also, the proposed methodology ensures a higher probability of identifying a more biologically relevant network as it searches all possible candidate architectures (i.e., all possible combinations).

The proposed novel hybrid swarm intelligence scheme, BAPSO, has been implemented in the present investigation to train the RNN model parameters and the results obtained show that the predicted networks reproduce the dynamics of the given dataset to a better extent for small-scale GRNs. The results suggest that the proposed decoupled reverse engineering approach is robust and consistent with respect to the number of correct and incorrect predictions while using different types of microarray datasets (synthetic, in silico, and in vivo) for most of the small-scale GRNs studied in the contemporary literature.

However, it is an entirely different scenario for medium-scale networks (20 genes). The methodology fails to reproduce any of the successes it had against smaller GRNs. There are too few true predictions and a large number of incorrect predictions. The methodology, implemented in this paper, thus, needs to be enriched further by studying its performance in larger networks. This provides a vital scope for further research.

Also, the assumption of the value of the threshold, , is based on the knowledge of the final network to be obtained. In real-world cases, where the final GRN is not known, the setting of a suitable threshold for the ensemble learning scheme used in this work needs further research. Additionally, the reduction in false positives is also an important research endeavour for the future.

Another point to be noted in this context is the performance of the methodology with a lesser number of time points, half to be exact. The results indicate that a 50% reduction in the number of time points leads to only a small drop in accuracy of the predicted models, a maximum of just over 15% in the worst-case scenario. However, interestingly, the methodology slightly improves upon the poor results obtained for the 20-gene GRN, with a lesser number of time points available.

This also provides an opportunity for future research into the prediction of GRNs from genetic expression profiles with lesser time points and will surely help to reduce time and cost of data generation in the future.

Disclosure

Appropriate references have been given in the text; no figure or table has been reproduced without permission, and none of the results has been fabricated, to the best of the authors’ knowledge.

Competing Interests

The authors declare that they have no competing interests.

Authors’ Contributions

All authors have equally contributed to this work.