Advances in Bioinformatics

Volume 2017, Article ID 4827171, 14 pages

https://doi.org/10.1155/2017/4827171

## Multiple Linear Regression for Reconstruction of Gene Regulatory Networks in Solving Cascade Error Problems

^{1}Department of Software Engineering, College of Computer Science & IT, Universiti Tenaga Nasional, Jalan IKRAM-UNITEN, 43000 Kajang, Malaysia

^{2}Centre of Artificial Intelligence, Faculty of Information Sciences & Technology, Universiti Kebangsaan Malaysia (UKM), 43650 Bangi, Malaysia

^{3}Department of Information Technology, Faculty of Computing and Information Technology in Rabigh, King Abdulaziz University, Rabigh, Saudi Arabia

Correspondence should be addressed to Faridah Hani Mohamed Salleh; faridahh@uniten.edu.my

Received 29 June 2016; Revised 10 October 2016; Accepted 19 October 2016; Published 29 January 2017

Academic Editor: Klaus Jung

Copyright © 2017 Faridah Hani Mohamed Salleh et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Gene regulatory network (GRN) reconstruction is the process of identifying regulatory gene interactions from experimental data through computational analysis. One of the main reasons for the reduced performance of previous GRN methods has been the inaccurate prediction of cascade motifs. A cascade error is defined as the wrong prediction of a cascade motif, where an indirect interaction is misinterpreted as a direct interaction. Despite active research on various GRN prediction methods, discussion of specific methods for solving problems related to cascade errors is still lacking. In fact, the experiments conducted in past studies were not specifically geared towards proving the ability of GRN prediction methods to avoid cascade errors. Hence, this research proposes Multiple Linear Regression (MLR) to infer GRN from gene expression data and to avoid wrongly inferring an indirect interaction (A → B → C) as a direct interaction (A → C). Since the number of observations in the real experiment datasets was far less than the number of predictors, some predictors were eliminated by extracting random subnetworks from global interaction networks via an established extraction method. In addition, the experiment was extended to assess the effectiveness of MLR in dealing with cascade errors by using a novel experimental procedure proposed in this work. The experiment revealed that the number of cascade errors was minimal. Apart from that, the Belsley collinearity test proved that multicollinearity did affect the datasets used in this experiment greatly. All the tested subnetworks obtained satisfactory results, with AUROC values above 0.5.

#### 1. Introduction

GRN inference-related works have fueled many major breakthroughs in finding drug targets for the treatment of human diseases, including cancer [1–3]. Being able to predict gene expression more accurately therefore provides a way to explore how drugs affect a system of genes and to identify the genes that are interrelated in a process. Besides, rebuilding GRN from gene expression profiles allows the discovery of various functions across diverse domains such as molecular biology, biochemistry, bioengineering, and pharmaceutics [2].

One of the main reasons for the reduced performance of previous GRN methods has been the inaccurate prediction of cascade motifs. Although various gene prediction methods have been developed and presented in leading journals, discussion of specific methods for solving problems related to cascade errors is still lacking. The studies in [4–11] discussed the issue of cascade errors. However, the experiments conducted were not specifically geared towards proving the ability of GRN prediction methods to avoid cascade errors. Distinguishing between direct and indirect regulation (cascade errors) is a well-known difficulty in GRN inference, but it has never been quantitatively assessed.

Inferring GRNs remains challenging because of several limitations: (1) living cells are high dimensional, with tens of thousands of genes acting in different temporal and spatial combinations; (2) one gene or gene product may interact with multiple partners, either directly or indirectly, and thus the possible relationships are dynamic and nonlinear; (3) current high-throughput technologies generate data that involve a substantial amount of noise [9, 12]; (4) the sample size is extremely low compared with the number of genes [13, 14], and hidden nodes may be present [9]. Using the case of a simple cascade A → B → C, when the intermediate node B is hidden, nodes A and C become isolated from each other. All indirect paths between them then become hidden, which interrupts the prediction of the whole GRN.

With that, this research proposes Multiple Linear Regression (MLR) to infer GRN from gene expression data and to avoid wrongly inferring an indirect interaction (A → B → C) as a direct interaction (A → C). MLR was selected because it takes into account a combination of effects and simultaneous observations. This work differs from other regression analysis-based studies such as [10, 11, 15–18] in that it presents a novel experimental procedure to assess the effectiveness of a GRN inference method, here MLR, in dealing with cascade errors. Although MLR achieved an acceptable level of performance when dealing with cascade motifs, two main problems were detected from our experience in using MLR for GRN inference: MLR is unable to process datasets of structure n < p (n = observations and p = variables), and it does not cater for the multicollinearity problem among the predictors.
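The intuition behind using MLR against cascade errors can be illustrated with a small simulation. The sketch below is not the authors' procedure: the cascade A → B → C is simulated with arbitrary, assumed weights (0.9 and 0.8) and Gaussian noise, and the helper names (`solve_linear_system`, `fit_mlr`, `pearson`) are hypothetical. It compares the pairwise correlation between A and C with the MLR coefficient of A when B is included as a second predictor.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_mlr(predictors, response):
    """Ordinary least squares with an intercept; returns [b0, b1, ..., bp]."""
    rows = [[1.0] + list(values) for values in zip(*predictors)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, response)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Simulated cascade A -> B -> C (weights and noise levels are assumptions).
rng = random.Random(0)
n_obs = 1000
gene_a = [rng.gauss(0, 1) for _ in range(n_obs)]
gene_b = [0.9 * a + rng.gauss(0, 0.5) for a in gene_a]
gene_c = [0.8 * b + rng.gauss(0, 0.5) for b in gene_b]

# Pairwise view: A and C look strongly related although A acts only via B.
r_ac = pearson(gene_a, gene_c)

# MLR view: regress C on A and B together.
intercept, coef_a, coef_b = fit_mlr([gene_a, gene_b], gene_c)

print("corr(A, C):", round(r_ac, 3))
print("MLR coefficient of A on C:", round(coef_a, 3))
print("MLR coefficient of B on C:", round(coef_b, 3))
```

In this toy setting the pairwise correlation would suggest an edge A → C, whereas the MLR coefficient of A stays near zero once B is in the model. Real expression data are noisier and have far more predictors than this example.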

#### 2. Past Researches

Various methods have been applied in GRN construction. We categorize the methods into nine categories. The information-theoretic approach is dominated by methods such as the Path Consistency Algorithm based on Conditional Mutual Information [7], the Mutual Information Test based on Dynamic Bayesian Network [19], and Mutual Information [20]. As for filter-based approaches, the Unscented Kalman Filter [21] and the Fractional Kalman Filter [22] were proposed. Under the graph-based category, methods such as Random Forests or Extra-Trees [23] were applied. The Probability and Statistics category has methods such as the Gaussian Graphical Model [24] and Double -test [25]. Emerging algorithms such as Particle Swarm Optimization and Ant Colony Optimization [26] are categorized under the nature-inspired category. For the category of correlation and dependence, methods such as Local Expression Pattern [27] and three DC- (Distance Correlation-) based algorithms, CLR-DC, MRNET-DC, and REL-DC [28], were proposed. For the machine learning category, the Markov Logic network [29] was applied. We purposely grouped the remaining past approaches into a category called hybrid methods. The methods in this category combine more than one method, such as Mutual Information with Regression [30], Ordinary Differential Equation-based Recursive Optimization (RO) with Mutual Information (MI) [12], and Linear Regression with a Bayesian Model [31].

#### 3. Problem Statements

The findings obtained by Salleh et al. [32] on the topics discussed in this study showed that most of the false positives were due to cascade errors. Meanwhile, the methods in [4, 33] were strongly affected by cascade motifs, for which they systematically predicted false positive interactions [34]. In addition, the studies in [10, 12, 35–37] expressed a similar opinion, namely, that the main source of false positive predictions was *indirect effects* or *cascade errors*. Apart from the term *cascade error*, other terms, such as *indirect effects*, are also used in the manuscript [10].

Despite active research on various gene prediction methods, discussion of specific methods for solving problems related to cascade errors is still lacking. In fact, the experiments conducted in past studies were not specifically geared towards proving the ability of GRN prediction methods to avoid cascade errors. Only recently has GNW (GeneNetWeaver), developed by [34], had a tremendous positive impact on the area of systems biology, especially GRN prediction. GNW provides many features for assessing GRN inference performance, including network motif analysis. However, one problem that hampers network motif analysis is that, if the GRN inference method is tested on complex experimental data, the results generated by GNW can be quite distorted. Thus, the difficulty of handling complex data and predicting certain types of gene interactions motivated the researchers to design, develop, and assess the proposed method for solving cascade errors.

#### 4. Overview of Data

In this study, real experiment datasets from M3D [38] were used. M3D provides manually curated metadata for its chip measurements. The expression data can be obtained from http://m3d.mssm.edu/. The predicted *E. coli* interactions were validated against gold standard networks of *E. coli* obtained from GNW [34]. There were 4297 genes, with a maximum of 907 chips (observations). Other studies that used similar datasets include [10, 12, 25].

#### 5. GRN Prediction Methods by Using the Regression-Based Technique

In recent years, methods in regression analysis category have received ever increasing attention in the GRN inference research area. The existing research was conducted using the regression models such as Multiple Regression [17], LASSO [15], Ridge Partial Least Squares Regression [16], and ANOVA [10].

Regression analysis is known as a mathematically demanding method that takes time to apply. Nowadays, with the many improvements made in certain software, the implementation of regression analysis has been simplified, though not completely. The success of regression-based methods in modeling gene expression and DNA microarray data depends on the choice of model and of the predictors used as input [15]. Reference [15] proposed a method named GEMULA, a four-stage method based on LASSO that is used to identify and prioritize the synergistic interactions among predictors. Reference [16] proposed a new method of identifying genes using Partial Least Squares. The estimation problem was solved by combining Ridge Partial Least Squares with RFE and the Brier error within a two-nested CV scheme. The Ridge method has been receiving increasing attention from researchers because of its ability to tackle problems related to multicollinearity [39].

One of the main issues to consider when applying regression analysis is how to make GRN predictions with a limited number of observations. Reference [18] stated that the low number of samples is one of the key issues that need to be addressed. Reference [10] emphasizes that ANOVA can be applied to gene expression data without a nonlinear discretization step. Discretization is the process used to convert a continuous equation into a form that can be used to calculate the numerical solution. Another study, [17], aims to improve the accuracy of inference for large networks. It uses MLR with parallel processing techniques. However, that study was conducted on data already in an ideal state, namely, 1000 × 1000 gene perturbation experiments, in which the number of observations equals the number of genes. Their algorithm was parallelized to handle large problems in a computationally efficient manner by distributing the overall computational burden among different processors to reduce the total execution time. However, their paper did not explain in detail how the separate predictions were combined into a complete prediction for the whole dataset. Apart from the studies in [10, 11], none of the studies reviewed in this manuscript discusses specifically how to solve the issue of cascade errors. The next paragraph covers the studies that cater for cascade motifs.
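The remark that the Ridge method helps with multicollinearity can be illustrated with a small example. This is a toy sketch under stated assumptions and not the method of [16] or [39]: two simulated regulators are almost identical, the response depends only on the first one, and the function name `fit_two_predictors` and the penalty value 1.0 are hypothetical choices.

```python
import random


def fit_two_predictors(x1, x2, y, ridge_penalty=0.0):
    """Least squares for y ~ b1*x1 + b2*x2 without intercept.

    With ridge_penalty > 0 the penalty is added to the diagonal of X'X,
    which gives the ridge regression estimate.
    """
    s11 = sum(a * a for a in x1) + ridge_penalty
    s22 = sum(b * b for b in x2) + ridge_penalty
    s12 = sum(a * b for a, b in zip(x1, x2))
    t1 = sum(a * c for a, c in zip(x1, y))
    t2 = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)


# Two nearly collinear regulators (zero-mean by construction, so no
# intercept is fitted); only the first one drives the response.
rng = random.Random(0)
n_obs = 50
reg_1 = [rng.gauss(0, 1) for _ in range(n_obs)]
reg_2 = [v + rng.gauss(0, 0.01) for v in reg_1]
target = [v + rng.gauss(0, 0.5) for v in reg_1]

ols = fit_two_predictors(reg_1, reg_2, target)
ridge = fit_two_predictors(reg_1, reg_2, target, ridge_penalty=1.0)

ols_norm = (ols[0] ** 2 + ols[1] ** 2) ** 0.5
ridge_norm = (ridge[0] ** 2 + ridge[1] ** 2) ** 0.5

print("OLS coefficients:  ", ols)
print("Ridge coefficients:", ridge)
```

The ridge estimate is stable but splits the effect between the two correlated regulators, so it trades variance for bias. That trade-off is worth keeping in mind when coefficients are read as evidence for direct interactions.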

The study in [9] is one of the main benchmarks for the viability of the silencing method in GRN prediction on large networks. Reference [9] proposed several formulas that emphasize direct relationships between genes over indirect ones; hence, a direct relationship can be predicted more easily, without interference from indirect relationships. Apart from the effects of indirect relationships, or cascade errors, GRN prediction is made more challenging by datasets in which the number of experimental observations is far smaller than the number of genes. Reference [11] stated that, when the experimental dataset has fewer observations than genes, the weights cannot be estimated for the whole set of regulators. If the complete regulator set of a GRN cannot be used in the calculation, some method has to be implemented to use only part of the genes in the calculation without affecting the overall GRN prediction. Reference [9] conducted experiments on data with 4,511 nodes and 805 observations. Because of the small number of observations, [9] followed a DREAM5 protocol that focuses only on the correlations among 141 transcription factors.
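Why fewer observations than regulators prevents weight estimation can be seen in a tiny numerical example. The numbers below are invented purely for illustration (2 observations, 3 candidate regulators), and the function name `predict` is hypothetical. Two different weight vectors reproduce the same target values exactly, so the data alone cannot decide between them.

```python
def predict(expression_rows, weights):
    """Linear prediction: one weighted sum of regulator values per observation."""
    return [sum(x * w for x, w in zip(row, weights)) for row in expression_rows]


# 2 observations (rows) of 3 candidate regulators (columns): n < p.
expression = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]
target = [1.0, 4.0]

# Two different weight vectors; they differ by (1, -2, 1), which the
# expression matrix maps to zero.
weights_1 = [1.0, 0.0, 0.0]
weights_2 = [2.0, -2.0, 1.0]

fit_1 = predict(expression, weights_1)
fit_2 = predict(expression, weights_2)

print(fit_1, fit_2)
```

Both fits match the target perfectly, which is why approaches such as restricting the predictors to a subset of genes are needed before least squares can return a unique answer.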

Regression analysis is a technique for modeling the relationships between two (or more) variables [40]. Multiple Regression models allow one to test several predictor variables that may explain different attributes of the response variable. Though complex, they let one test all the factors thought to have an effect on a given response variable. This is unlike simpler models that allow only one predictor variable. Moreover, using several variables also improves the accuracy of prediction. The terms *dependent variables*, *response variables*, and others have been used interchangeably in the existing regression literature. The meaning of each term, as well as the terms used throughout this manuscript, is explained in this section. Dependent variables are also known as *response variables* or *target variables*. Independent variables are also known as *regressors* or *predictors* [41]. For consistency, the terms *response variables* and *predictors* are used throughout the manuscript. A GRN represents a scenario where the predictor variables are likely to be correlated with each other and could all influence the response variables. Questions then arise, such as how to determine which variables are significant and how large a role each variable plays. All these questions can be answered by using regression analysis. The scenario of MLR in the context of GRN is illustrated in Figure 1.
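As a minimal sketch of this setting, the expression of one target gene can be written as a linear function of its candidate regulators. The symbols used here ($y_j$, $x_i$, $\beta_i$, $\varepsilon_j$, $p$) are notation assumed for illustration and do not come from the original text:

$$
y_j = \beta_0 + \sum_{i=1}^{p} \beta_i x_i + \varepsilon_j
$$

Here $y_j$ is the response variable (the expression of target gene $j$), $x_1, \dots, x_p$ are the predictors (the expression of the $p$ candidate regulator genes), $\beta_0$ is the intercept, $\beta_i$ is the weight of regulator $i$, and $\varepsilon_j$ is the error term. Under this reading, the size and significance of each $\beta_i$ indicate how large a role regulator $i$ plays for gene $j$ once the other predictors are accounted for.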