Abstract

In order to elucidate the overall relationships between gene expressions and genetic perturbations, we propose a network inference method to infer gene regulatory network where single nucleotide polymorphism (SNP) is involved as a regulator of genes. In the most of the network inferences named as SNP-gene regulatory network (SGRN) inference, pairs of SNP-gene are given by separately performing expression quantitative trait loci (eQTL) mappings. In this paper, we propose a SGRN inference method without predefined eQTL information assuming a gene is regulated by a single SNP at most. To evaluate the performance, the proposed method was applied to random data generated from synthetic networks and parameters. There are three main contributions. First, the proposed method provides both the gene regulatory inference and the eQTL identification. Second, the experimental results demonstrated that integration of multiple methods can produce competitive performances. Lastly, the proposed method was also applied to psychiatric disorder data in order to explore how the method works with real data.

1. Introduction

In order to understand more accurate causal relationships between a complex disease and genetic variations, we need to consider how the genotypic perturbations affect expression phenotypes that are potentially associated with a target disease. In other words, it is more crucial to look at the overall mechanisms considering a series of three factors, which include genetic variations, altering gene regulations, and caused diseases rather than partial mappings between them. Therefore it is important to evaluate how genetic perturbations affect genes on regulatory networks that are associated with a target disease phenotype. In practice, when biological networks are inferred with high throughput data, we have to consider not only the relationships among genes but also how genetic factors such as single nucleotide polymorphism (SNP) and copy number variation (CNV) can affect genes in gene regulatory network (GRN). Over the last decade, research for mapping genotype to expression phenotype or disease phenotype such as expression quantitative trait loci (eQTL) study and genome wide association study have been actively performed [1]. However, we are now required to do a network-based analysis with genotype data and gene expression because it is more effective in discovering underlying biological process from genotype to phenotype. In doing so, the analysis of SNP-gene regulatory networks (SGRN) will provide more definite relationships of genotypic causes and phenotypic effects so that it will facilitate prognosis and drug designs for therapies.

In this paper we propose a SGRN inference method. In order to identify regulatory interactions among genes, quite a number of network inference methods have been developed by using gene expression data such as gene microarray. Those methods can be generally classified into different theoretical categories: Boolean networks [2, 3], mutual information [4, 5], Bayesian networks (BN) [6, 7], and regression [8, 9]. As each method has its own advantages and limitations under different assumptions and network models such as acyclic or cyclic network and directed or undirected network, there should be trade-offs in inferences given different target network structure and applications [10]. For example, the MI-based approach is very simple and fast so that it can build a large scale network (e.g., genome wide scale) but it cannot estimate direction of edges. It produces worse performance than other approaches in detecting linear cascading structures [10]. The BN-based inference is limited to imply only acyclic network with high computational cost while the regression-based approach supports both directed and cyclic network, which are assumed in SGRN. In addition to directed network model, it should be considered that SGRN is different from conventional GRN inference. In SGRN inference, a gene can be regulated by SNPs as well as other genes, but SNPs are assumed to not be regulated by other SNPs. That is, a SNP cannot be a child node in the network.

Recently, a number of approaches have been suggested to infer SGRNs integrating genetic variation and gene expression data. Kim et al. [11] considered genetic perturbations, gene expression, and disease phenotypes together to find the causal genes to a disease. The electric circuit approach and heuristic search were used to infer SGRN where causal genes are mapped to SNP in the preliminary step before network inference. Keurentjes et al. [12] built a SNP-gene network associated with a particular phenotype, but this method also performed eQTL mapping (SNP-gene) to define the candidate regulator genes before genetic network construction. In addition, Kim and Xing [13] used lasso regression considering the case that a SNP is weakly associated with highly correlated multiple traits rather than a single trait. Chen et al. [14] focused on identifying which pathway among those already known pathways was more likely to be affected by changes of genotype and gene expression rather than inferring a new pathway. The related works we especially noted are the methods that are based on structural equation modeling (SEM) [15ā€“18]. SEM allows us to not only incorporate eQTL information to gene expression in a single model but also identify eQTL simultaneously. However, Logsdon and Mezey [17] assumed that every gene has at least one eQTL, and eQTL mapping was performed by preprocessing but not in a network inference step. Cai et al. [18] introduced sparsity-aware maximum likelihood (SML), which can be potentially extended for eQTL identification. However, SNP-gene pairs were still given in evaluations and implementations of the SML algorithm.

In this paper, we proposed a novel method to infer SGRN where both eQTL identification and SGRN inference are performed simultaneously given a set of gene expression and genotype data without assuming eQTLs are known. The proposed method is based on SEM and multiple steps of edge filtering such as elastic net regression and iterative adaptive lasso. Basically SEM is a regression-based model which is likely to select as many variables causing an overfitting, so the sparsity is enforced by lasso (-regularized least square estimation) considering the sparsity of biological network. Initial weights of edges are estimated by ridge regression [19] and elastic net regression [20], and then the second step is to identify final eQTLs from candidate SNPs selected in the first steps. In the last step, the final network is constructed by iterative adaptive lasso. The first two steps are to fix SNPs before selecting genes. In the third step, edges are selected by iteratively giving more penalties to the edge whose weight is relatively low until network structure is converged.

To evaluate the method, we explore the performance with a simulated data set, that is, generated from random networks with different number of samples and nodes and expected number of edges per node. The result shows that the method can achieve a high detection rate of true edges with low false discovery rate without eQTL information. In addition, to explore the performance in real expression phenotype and SNP data, the method was applied to the psychiatric disorder data. After genes and SNPs were selected from related Genome-Wide Association Study (GWAS), it was tested how the method identify true positive edges between genes and SNPs without eQTL information.

2. Method

2.1. Problem Definitions

We define the problem and notations here. Let denote the matrix of gene expression levels of genes and samples where a row vector is observed expression level of th gene. is matrix to denote genotypes of individuals, where represents the number of minor alleles of th SNP of th sample as an element of matrix supposing that the number of minor alleles should be zero, one, or two in real data. So, represents a relative quantity of minor alleles of samples. As a gene can be regulated by other genes and genetic variations (SNPs), we define SEM as where denotes th row vector of square matrix ; denotes th row vector of square matrix ; is a model bias; and is a residual modeled as zero-mean Gaussian with a variance . As we assume there is no self-regulation (self-loop edge), , where denotes th element of . The parameters of and decide the network structure defining the weight of regulation from every possible gene and SNP to a target gene . For example, if there is no regulation relationship (directed edge) from th gene to th gene, is set to zero. Similarly has nonzero value as a weight of regulation from th SNP to th gene if th SNP is identified as an eQTL for th gene. It is assumed that each gene has at least one eQTL but it is unknown which SNP among a given set of SNPs is an eQTL for a target gene. Our goal in this model is to find and that best fit to observed gene expression and genotype data. To make the problem simpler, we remove from (1) by applying mean centering for row vectors and to have zero mean. The goal is to find and that minimize a residual , so (1) can be expressed in a least square minimization problem as However, regression tends to select as many genes and SNPs as possible to explain the expression level of target gene . To avoid the overfitting, sparse regression methods such as ridge regression, elastic net, and lasso are used.

2.2. The Algorithm

The method we propose is based on -regularized linear regression known as lasso [21] that yields a sparsity of variable selection. The algorithm consists of 3 steps, (i) elastic net, (ii) lasso, and (iii) iterative adaptive lasso. The first two steps are to decide where SNPs are selected but their coefficients can be changed in the third step. Then, is finalized by iterative adaptive lasso in the last step.

2.2.1. Ridge Regression (Step 1-1)

In ridge regression, the coefficient values of irrelevant SNPs and genes to a target gene shrink to zero (but not exactly zero) while those of eQTLs and regulator genes of a target gene tend to be higher. Ridge regression of (2) is defined as Given penalty weights, and , the optimal and can be obtained by closed form solution given by Replacing (5) for in (4) yields where After calculating first in (6) then (5) can be solved. In this manner, matrices and are estimated by computing each and , . Parameters and that decide the degree of sparsity of and are determined by -fold cross-validation. is set to 5 in our experiments.

2.2.2. Elastic Net (Step 1-2)

Note that zero weighted coefficient cannot be recovered back to nonzero in adaptive lasso of Step 3. Therefore, in order to carefully keep only SNPs that are more likely to be true eQTLs in , we give -norm penalty to only but not using elastic net defined as As the objective function is convex, which guarantees a convergence, can be optimized by using coordinate descent iteration given parameters, and . To find the optimal , the derivative of (8) with respect to is considered as follows: Since the derivative of (8) with respect to is the same as (5), in (9) is substituted with (5), and then (9) is simplified to where indicates row vector whose th element is removed, denotes matrix whose th row is removed, and is th row vector of . After defining and in (10), the update rule in the coordinate descent algorithm is written as Algorithm 1 describes the procedures to solve (8) in Step 2. If is nonzero, th SNP is a candidate eQTL for th gene.

(1)ā€‚ procedure ELASTIC and are optimal parameters estimated by cross validation
(2)ā€‚ā€ƒwhileā€‰ā€‰ ā€‰ā€‰do
(3)ā€‚ā€ƒā€ƒ
(4)ā€‚ā€ƒā€ƒforā€‰ā€‰ ā€‰ā€‰do
(5)ā€‚ā€ƒā€ƒā€ƒUpdate via (12)
(6)ā€‚ā€ƒā€ƒend for
(7)ā€‚ā€ƒā€ƒUpdate via (5)
(8)ā€‚ā€ƒā€ƒ
(9)ā€‚ā€ƒend while
(10) ā€ƒreturnā€‰ā€‰ and
(11) end procedure

2.2.3. Lasso (Step 2)

In order to finalize a SNP (a single nonzero of ) for each gene , we apply lasso to combined matrix of and as follows: where denotes indices of low vectors where . So, is a matrix whose rows are removed. If the number of rows of is greater than predefined heuristic number (i.e., 5 in our experiments), only top highest of absolute values of but not all nonzero are selected for . In Step 2, we iteratively estimate , decreasing from a high value that lets have a zero vector. Regardless of elements of for , we note only which element of for has a nonzero value first assuming that the corresponding candidate SNP to is more likely to regulate a target gene if for a row vector of has nonzero value earlier than other elements of during decreases.

2.2.4. Adaptive Lasso (Subroutine of Step 3)

Adaptive lasso is defined as where In (16), penalty weights, vectors and , are defined as where and are estimated in Step 2 that yields a sparsity to but not . Zero coefficient of in Step 2 is not considered as an eQTL for gene . So, zero yields zero in (17), and then if is zero, will never have nonzero value in adaptive lasso of Step 3 (16). The parameters and decide how much previous estimation such as or is reflected to next estimation of or . Therefore, that has smaller penalty weight is more likely to have nonzero value. In addition, we consider a special case that and are set to zero supposing that (i) we do not give a penalty weight to or by setting or to 1 if or is nonzero and (ii) we do not estimate elements of or by setting or to infinity if or is zero. The solution is similar to Step 2 in which either or is optimized by coordinate descent algorithm but it is applied to solve both and in Step 3. Derivative of (15) with respect to yields where indicates row vector whose th element is removed and denotes matrix whose th row is removed. After setting and , the update rule for is as follows: We can also estimate in similar way. After defining and , the update rule for is given as When and are updated, updated single element or immediately affects updating the next elements. In addition, updating order of elements can be changed since convex objective function is converged in any order of elements to update. Algorithm 2 shows the optimization procedure of adaptive lasso.

(1)ā€‚ procedure ADAPTIVE LASSO and are optimal parameters preliminary
ā€ƒā€‚estimated by cross validation
(2)ā€‚ā€ƒCompute and ( )
(3)ā€‚ā€ƒwhileā€‰ā€‰ ā€‰ā€‰do
(4)ā€‚ā€ƒā€ƒ
(5)ā€‚ā€ƒā€ƒforā€‰ā€‰ ā€‰ā€‰do
(6)ā€‚ā€ƒā€ƒā€ƒUpdate via (19)
(7)ā€‚ā€ƒā€ƒend for
(8)ā€‚ā€ƒā€ƒforā€‰ā€‰ ā€‰ā€‰do
(9)ā€‚ā€ƒā€ƒā€ƒUpdate via (20)
(10)ā€ƒā€ƒend for
(11) ā€ƒā€ƒ
(12) ā€ƒend while
(13) ā€ƒreturnā€‰ā€‰ and
(14) end procedure

(1)ā€‚ procedure ITERATIVE ADAPTIVE LASSO denote the number of non-zero elements in and
(2)ā€‚ā€ƒ[ ,
(3)ā€‚ā€ƒ ,
(4)ā€‚ā€ƒforā€‰ā€‰ ā€‰ā€‰do
(5)ā€‚ā€ƒā€ƒ[
(6)ā€‚ā€ƒend for
(7)ā€‚ā€ƒwhileā€‰ā€‰ are decreased by increased ā€‰ā€‰do
(8)ā€‚ā€ƒā€ƒwhileā€‰ā€‰ are decreasedā€‰ā€‰do
(9)ā€‚ā€ƒā€ƒā€ƒforā€‰ā€‰ ā€‰ā€‰do
(10) ā€ƒā€ƒā€ƒā€ƒ[ ,
(11) ā€‰ā€ƒā€ƒā€ƒā€‰end for
(12) ā€ƒā€ƒend while
(13) ā€ƒā€ƒ
(14) ā€ƒend while
(15) ā€ƒreturnā€‰ā€‰ and
(16) end procedure

2.2.5. Iterative Adaptive Lasso (Step 3)

Even if and are estimated in Steps 1 and 2, there should be still many false positive edges yet. The primary goal of Steps 1 and 2 is to carefully get rid of only edges that are more unlikely to be true positive edges. So, instead of simply applying adaptive lasso, we developed iterative adaptive lasso to improve the performance of naive adaptive lasso. The motivation of iterative adaptive lasso is that the coefficient value of the variable considerably depends on the value of and which are fixed to 1 and 0.5 in [17, 18], respectively. In iterative adaptive lasso, adaptive lasso is iteratively applied incrementally changing and until there is no more change in the total number of selected edges of and so that more coefficients of irrelevant variables can be shrunk to zero.

Algorithm 3 presents a detailed procedure of iterative adaptive lasso. and estimated in Step 2 are used as arguments. On line 2, and are initialized by ridge regression. is a vector of optimal for estimated by Ridge regression but there is no penalty to (i.e.). When is estimated, only non-zero elements of that is estimated in Step 2 are updated. On line 5, and are estimated by adaptive lasso in order that elements of are updated by weights of (i.e. that has a small value can shrink to zero). Before line 7 starts, (a vector of for on line 10) is estimated by cross validation of adaptive lasso. On line 7ā€“14, more elements of shrink to zero increasing . The second while loop updates until no change in . Once the second while loop is terminated, is increased, and then the second loop is performed again. If the second while loop is terminated without any change of , the first while loop is terminated.

3. Results

3.1. Simulation Studies

To evaluate the proposed method, we first perform simulations based on randomly generated acyclic networks. The simulation settings are similar to those in [17, 18]. denotes the number of genes and SNPs and is set to 10, 20, and 30. matrix is initialized to zero matrix where is a sample size; then elements of are randomly selected as directed edges. The selected has random coefficient value uniformly distributed over 0.5~1 or āˆ’0.5~āˆ’1. Since we consider a single eQTL per gene (), a single element () is selected from each row vector (). So, is a diagonal matrix. is randomly set as 1, 2, or 3 with the probabilities 0.25, 0.5, and 0.25, respectively. is generated by calculating , where is generated from Gaussian distribution with zero mean and variance 0.01. The number of samples for each network size is . The number of edges per gene on average is set to . Given data and , performances of predicting and are evaluated by comparing true network and inferred network.

Figure 1 displays the examples of networks, where SNP nodes are excluded. For the evaluation, true positive (TP), false positive (FP), true negative (TN), and false negative (FN) edges are counted to measure the accuracy criteria such as true positive rate (TPR) and false discovery rate (FDR) that are defined as(i)TPR = TP/(TP + FN),(ii)FDR = FP/(TP + FP).

In order to evaluate our method, IAL is compared to SML [18]. As SML infers only with known nonzero element indices of , we consider two versions of IAL, IAL without eQTL information and IAL with eQTL information, where Steps 1 and 2 are skipped and only Step 3 is performed with nonzero element index of . SML is tested by using the code the author implemented in [18]. The abbreviations of algorithms to compare in Figure 2 and Table 1 are listed below:(i)SML: sparsity-aware maximum likelihood algorithm with eQTL information [18],(ii)IAL1: IAL with eQTL information,(iii)IAL2: IAL without eQTL information.

Ten replicate simulations are performed and each simulation has a different topology. The results of the different settings ( and ) are displayed in Figure 2. It is shown that IAL1 is superior to SML in all data sets regardless of sample size. We also note that TPR of IAL2 is higher than 0.9 and FDR is less than 0.1 on average in any sample size. It validates that the proposed IAL works very effectively when eQTL is known. In addition, the performance of IAL1 is consistent in different sample sizes while the performance of SML tends to be decreased with small sample size and complicated network (). In network inference, it is known that the performance of inference is very sensitive to the network size and density. In the inference of densely connected and large networks, the computational cost will exponentially increase and the FDR may increase because there are more possible variables that may explain a target node better than true regulators. IAL1 performed consistently in all three different network sizes while the performance of SML is affected by the network size in dense networks (). However, IAL2 shows consistent TPRs and FDRs in all three different network sizes when the network density is normal () while TPR of IAL2 in Figures 2(g) and 2(k) is lower than Figure 2(c) and also FDR increases in Table 1 when the network size increases in more dense networks ().

The result shows that the performance is better in sparse networks () than dense networks () because a complicated structure is more likely to cause false positive edges because of indirect regulations. For example, TPRs in Figures 2(a), 2(e), and 2(i) are much better than in Figures 2(c), 2(g), and 2(k). Similarly FDR is quite increased with in Figures 2(d), 2(h), and 2(l) compared to the case of in Figures 2(b), 2(f), and 2(j).

Overall results imply that the proposed IAL1 works perfectly with known in any network size and density. It means that the performance of IAL2 is significantly affected by false positive inference of in steps 1 and 2 because of unknown . More precisely without sparsity in step 2 is more likely to have false positive nonzero elements even though a number of candidate elements of are filtered in step 1. Therefore, the selection of nonzero element of in IAL2 is the most critical part since IAL1 is able to correctly infer only if is given as eQTL information.

3.2. Experiments with Psychiatric Disorder Data

In this section, the proposed method is applied to real gene expression and genotype data for psychiatric disorder. In the application to real data, we explore the performance of GRN inferences and eQTL identifications through the inferred networks. As far as we know, the proposed method is the first solution to provide both GRN inference and eQTL identification. Thus, the performance comparison with other methods was not performed. The psychiatric disorder data consists of gene expression data of 25833 genes and 852963 SNPs for 131 samples, which were measured from human brain. Since we focus on the network inference but not gene selection, the network construction is performed with a predefined set of genes and SNPs that are selected by preliminary test of multiple sets of genes and eQTLs based on related GWAS for psychiatric disorders. The result of SGRN inference is displayed in Figure 3 where two yellow colored genes, EGFR and CACNA1C, are selected from [23, 24] and the rest of two pairs are from [22]. In applying IAL2 to the data, the weights of and are set to 0.5 instead of 1. Otherwise, tends to be zero. The reason for this is that gene variables are more correlated with their eQTLs because generally eQTLs are independently selected to other genes. In Figure 3, SNP and gene are distinguished by node shape, and a red edge indicates a correct edge from eQTL to corresponding gene. A blue edge represents false positive eQTL mapping. For eQTL identification, one false positive edge appears and thirteen true positive edges are detected (TPR = 0.9286, FDR = 0.0714).

4. Discussion

The most difficult part in network inference is to identify directions of edges. In the adjacency matrix , both and could have a high coefficient value. In this case, regression-based methods tend to show better performance than MI-based methods because candidate edges are evaluated together in regression-based methods but each edge is independently evaluated to other edges in MI-based methods. Despite the advantage, the regression-based method needs to be integrated with other methods that can provide different information of structure. Another issue to improve in IAL is the computational cost to estimate two different per each row. Intuitively, a searched optimal per each row of and should provide a better result but it causes a high computation cost. Lastly, we also assumed that a gene has at least a single eQTL given a set of genes and SNPs, but multiple eQTLs should be considered and a gene may not have any eQTL in practice. Thus, the multiple eQTL of a gene is a future work in SGRN inference.

5. Conclusion

In this paper, we proposed a novel network inference method that provides both eQTL identification and network construction of both genes and SNPs. In order to understand gene regulatory mechanisms for a target disease phenotype, the regulatory network inference needs to consider effect of genetic variation and expression phenotype together but not only gene expression data. To achieve the high quality of reliable inference with better TPR and FDR, three different regression skills are integrated. Ridge regression and elastic net are used to remove more likely false positive edges and select eQTL as preliminary steps, and then the finial network is estimated by iterative adaptive lasso removing more false positive edges between genes. Through the experiments with synthetic data, it was demonstrated that IAL1 outperforms SML in SGRN inference and also IAL2 performs eQTL identification effectively. The method was also applied to psychiatric disorder data. Using the genes and eQTLs selected from GWAS of psychiatric disorder, we explored the ability of eQTL identification through inferred SGRN.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.