Abstract
In genomic data analysis, it is commonplace that underlying regulatory relationship over multiple genes is hardly ascertained due to unknown genetic complexity and epigenetic regulations. In this paper, we consider a joint mean and constant covariance model (JMCCM) that elucidates conditional dependent structures of genes with controlling for potential genotype perturbations. To this end, the modified Cholesky decomposition is utilized to parametrize entries of a precision matrix. The JMCCM maximizes the likelihood function to estimate parameters involved in the model. We also develop a variable selection algorithm that selects explanatory variables and Cholesky factors by exploiting the combination of the GCV and BIC as benchmarks, together with Rao and Wald statistics. Importantly, we notice that sparse estimation of a precision matrix (or equivalently gene network) is effectively achieved via the proposed variable selection scheme and contributes to exploring significant hub genes shown to be concordant to a priori biological evidence. In simulation studies, we confirm that our model selection efficiently identifies the true underlying networks. With an application to miRNA and SNPs data from yeast (a.k.a. eQTL data), we demonstrate that constructed gene networks reproduce validated biological and clinical knowledge with regard to various pathways including the cell cycle pathway.
1. Introduction
Generally, joint estimation of mean and covariance has been developed to address problems related to biomedical data. In longitudinal data analysis, for instance, identifying correct correlation structures within each subject is a major focus and many studies come up with variants of joint mean and covariance estimation to enhance statistical efficiency (see Ye and Pan [1] and references therein). With regard to graphical models, particularly, conditional Gaussian graphical models (cGGM) aiming to elucidate conditional dependence structures subject to the mean have increasingly received much attention [2–5] and conceptually can also be viewed as examples of joint mean and covariance estimation in essence. On a methodological side, joint estimation has been found to be practically applicable in the sense that applications of joint estimation methods cover bioinformatics, such as hormone and transcriptome [1, 4, 6–8], and stock price analysis [9]. A majority of existing methods involve variable selection of covariates for the mean vector and sparsity of covariance mainly based on the penalization methods. However, in this paper, we purposely focus on variable selection and joint estimation with a pure implementation of the likelihoodbased method, unlike common penalization techniques, to better analyze eQTL data in light of genetic feature selection.
Over the decades, genomics research has focused on comprehensive understanding of regulatory networks in the context of system biology. Commonly we are interested in a gene network, which pictures the interplay among genetic factors (e.g., gene regulation and activation). Particularly, it is important to investigate how a given genotype (genetic variants) at a particular quantitative trait locus (QTL) affects measured phenotypes and traits at that locus. For instance, gene expression quantitative loci (eQTL) make use of gene expressions as quantitative traits. eQTL analysis has been widely applied to figure out the effect of genetic perturbations associated with diseases as well as to construct regulatory networks describing how genes regulate expressions of other genes [10, 11]. More precisely, the location of single nucleotide polymorphisms (SNPs) may affect multiple gene expression levels, and this accidentally causes misleading inference for dependency structure among genes [4]. Many popular methods [2–5, 12] have been introduced to identify gene networks, aiming at learning networks subject to perturbation effects by genetic variants on the basis of population gene expression and genotype data.
Yin and Li [4] proposed the sparse conditional graphical Gaussian model with and the adaptive lasso penalty function. Li et al. [3] suggested the twostage estimation framework: estimating a nonsparse conditional covariance matrix of genes based on a conditional variance operator between the reproducing kernel Hilbert spaces of marker genes and then using and adaptive lasso penalty to obtain sparse estimates of a precision matrix under the cGGM. Cai et al. [12] studied the covariateadjusted precision matrix estimation (CAPME) method using the constrained optimization.
While most recent studies encourage sparsity estimation of a precision matrix based on the penalized likelihood, we instead rely on classical variable methods based on the standard likelihood. Strictly speaking, the penalized likelihood, according to its definition, cannot be viewed as the likelihood. It naturally poses a question whether the likelihoodbased method performs better than the penalized likelihood approaches. To address this question, we consider a joint mean and constant covariance model (JMCCM) inspired by Pourahmadi [13] and propose methods for variable selection which effectively identify sparse conditional gene networks and covariates relevant to gene regulations. We employ the modified Cholesky decomposition to guarantee positive definiteness of an estimated precision matrix [13]. With this reparametrization of the precision matrix, the loglikelihood function corresponding to our model can be decomposed into an additive form of each response in terms of Cholesky parameters. This facilitates a coordinate descent type implementation we adopt for the precision matrix estimation. The combination of both generalized crossvalidation (GCV) and Bayesian information criterion (BIC) performs variable selection as a benchmark. Rao and Wald statistics are also utilized to add and delete genetic markers for each gene expression and the Cholesky factors for the precision matrix. To the best of our knowledge, only a few works use Rao and Wald statistics for variable selection in the joint estimation problem, particularly with an application to eQTL analysis.
In simulation studies, extensive simulation scenarios experimentally confirm improved estimation efficiency and precise variable selection of the proposed method. For real data applications, we perform eQTL analysis via the JMCCM with gene expressions and SNPs of yeast data, in pursuit of detecting the effects of variant perturbations and underlying gene regulations. We find that the JMCCM effectively uncovers biological pathways that may potentially account for known biological processes. Taken together, the JMCCM is shown to be effective in identifying conditional dependence structures among variables compared to the existing graphical models with penalization methods such as the sparse cGGM [5] and CAPME [12].
The paper is outlined as follows. In Section 2, we describe our JMCCM with modified Cholesky decomposition and maximum likelihood estimates. The variable selection algorithm using Rao and Wald statistics for SNPs and Cholesky factors is explained in Section 3. Section 4 deals with simulation studies to demonstrate performance of variable selection and estimation of the proposed model. The yeast cell cycle pathway genes with SNPs data analysis are presented in Section 5. Concluding remarks and discussion in Section 6 are followed by the Appendix, which provides mathematical details of our method.
2. Model And Estimators
2.1. JMCCM with the Modified Cholesky Decomposition
Contrary to previous methods [3, 4], the JMCCM primarily aims at simultaneous estimation over the mean and precision matrix of the Gaussian graphical model. Suppose that is a pair of the vector of genetic markers and the random vector of expression levels. Let denote the vector of SNPs and let denote the gene expression traits following variate multivariate normal distribution with the mean and covariance matrix as follows:where the th entry of is for , and is the linear regression coefficient vector indicating effects of SNPs perturbations to gene expressions and . Importantly, note that does not depend on . The coefficient is assumed to be sparse, since each gene is known to have only a few genetic regulators according to Cai et al. [12]. The precision matrix represents a gene network (graph), as in an undirected GGM, by corresponding the nonzero th element of the precision matrix with an edge between two vertices and [14]. This edge represents conditional dependence of genes and given all other gene expression levels. Thus, our goal is to identify nonzero entries of the precision matrix in order to construct a conditional dependence genetic network after the effects of SNPs perturbations are removed. The precision matrix is also expected to be sparse [4].
One of our primary interests is to estimate the precision matrix , which is symmetrically positive definite, so we need to ensure that the estimate of the precision matrix also satisfies symmetrical positive definiteness. To this end, we apply the modified Cholesky decomposition [13] to the precision matrix, denoted by , as follows:where is an upper triangular matrix with diagonal entries and abovediagonal elements consisting of negative of and is a diagonal matrix containing with for all as diagonal entries. Here, the superscript "" denotes the transpose of a matrix. Positive definiteness of is shown in Appendix A. Throughout this paper, a vector in the parenthesis is considered as a column vector. Let for . Then we can write . The parameter space for JMCCM is defined by , where represents the set of dimensional vectors of positive real numbers.
2.2. Maximum Likelihood Estimation
Suppose that we have the independent observations , , sampled from (1), where and represent the th observation of and , respectively. With , , and , the loglikelihood function corresponding to (1) is given bywhere and "tr" denotes the trace of a matrix. A derivation of (3) can be found in Appendix B. The maximum likelihood estimator (MLE) of is defined by
Some notations are needed to express MLE of in conjunction with variable selection. For , let be an index set of active (or significant) coefficients of and denote by the number of elements in . Let and . Obviously, and if , then all variables are significant for all . Define and for . Write for , , and , where denotes th column of . Let , , and , where and represent direct sum and Kronecker product, respectively, and is the identity matrix. Then the MLE of can be expressed asObserve that estimating involves .
We can estimate by obtaining MLE of and . For , let with . Let , , and represent the , , and components of the lower th principal submatrix of . Then we can expresswhere By (6), we can obtain from for each with and Thus, instead of finding a solution to the optimization problem with (3), we optimize each with respect to and . Thus, the MLE and are expressed in terms of in a way thatwith from (5) and . We estimate by and it is defined by Derivations of (5), (6), and (8) are presented in Appendix C.
3. Consecutive Variable Selection Algorithm
As mentioned above, and are commonly believed to be sparse in genomic data analysis. To address sparsity, the lassotype penalty imposed on both regression coefficients and precision matrix has been popularly applied to diverse graphical models [4, 5]. Stepping aside the lassotype approach, we develop a variable selection technique that mainly relies on the combination of classical variable selection methods. Generally, the numbers of SNPs and genes tend to be considerably huge so that computational costs normally become prohibitive. In order to address this problem, the proposed variable selection algorithm proceeds with largely two stages: preliminary variable selection for mean and precision matrix and secondary variable selection in the middle of the joint model estimation. It is important to note that the first stage leads possible variables to be limited in scope (i.e., working parameters in the model) in order to circumvent high computational complexity.
3.1. Preliminary Variable Selection
Preliminary variable selection is largely twofold: variable selection for the mean part and covariance part. The idea behind that is to add variables (or equivalently parameters) to the joint model with the maximum Rao statistic and to delete ones with the minimum Wald statistic. You may refer to Koo [15] for the basis selection method or Kooperberg et al. [16] that explain variable selection schemes based on Rao and Wald statistics.
3.1.1. Mean Part
When it comes to the mean part, we carry out selecting predictor variables (i.e., SNPs) for each response variable (i.e., gene expression), dealing with a univariate multiple regression problem. In the addition stage, we start off with a model including only an intercept term. The MLE is used as estimator for and maximum Rao statistic (Rao [17]) is the criterion for adding a predictor together with GCV (Friedman [18]) as a stopping rule. In the deletion stage, Wald statistic is calculated to exclude predictor variables such that the updated model minimizes Wald statistics. The final model for the mean regression is chosen by the minimum GCV. Details are summarized in Algorithm 1. Once Algorithm 1 is done, the number of predictor variables included in the joint model no longer increases, while variable reduction can happen in Algorithm 3 (Section 3.2).

3.1.2. Covariance Part
The rationale behind variable selection in the precision matrix estimation is that each , , is regressed on , and the regression coefficients are negative of offdiagonal entries of , the byproduct of the modified Cholesky decomposition (2) [13, 19, 20]. Clearly, this is one of the major benefits of reparametrization via Cholesky decomposition [13] in pursuit of improved interpretation.
Subsequent to selection for predictor variables, we compute . Given initial , we start with computing for as in (8) using obtained from Algorithm 1. Then compute Rao statistic to add one variable out of to the current model that builds on the response variable . We repeatedly update for , where . Next, we choose one variable, say , for and , such that Rao statistic is maximized and thereby update and , respectively. When calculating and , the BIC is computed, and the addition process stops if the BIC no longer decreases. At the completion of addition, we build a model with all selected variables as a full model and subsequently begin deletion from a full model. Deletion process is similar to the addition process except that the minimum Wald statistic is used for deletion in place of the maximum Rao statistic. Successive deletion continues until the BIC stops decreasing. In the last stage, the final model is also selected by the BIC. Details are presented in Algorithm 2. Once Algorithm 2 is finished, the number of included in the joint model no longer increases, while variable reduction could occur in the joint estimation (Algorithm 3, Section 3.2). Sparsity of is achieved through this variable selection scheme along with Algorithm 3.


3.2. Secondary Variable Selection in the Middle of the Joint Model Estimation
With variables (i.e., parameters) selected by the preliminary stage above, we implement simultaneous parameter estimation for both mean and precision in the joint model and additional variable selection. Once is computed via Algorithm 2 with fixed , we start joint estimation by updating , which is formed with weight least square estimates as in (5) and weights coming from . Then and , ultimately , are newly computed using updated , and again this updated serves as a weight for updating . Over the updates, the interplay between and continues until the loglikelihood function converges. Afterwards, deletion to current parameters begins by Wald statistic, excluding one parameter with the minimum Wald statistic. Finally, the BIC is used as a stopping rule for deletion and selection to finalize model estimation. Algorithm 3 contains details about this procedure. While the preliminary variable selection works for each mean and covariance part under the assumption of or are fixed, this joint estimation procedure is designed to improve estimation and selection accuracy by reflecting the changes of and .
4. Experimental Studies
In order to assess the performance of our proposed method, we carry out experimental simulations and compare the sparse conditional Gaussian graphical model (SCGGM), Zhang and Kim [5], covariateadjusted precision matrix estimation (CAPME), Cai et al. [12] and joint model with lasso penalty (JML), and Jhong et al. [21], all of which are based on penalized likelihood approaches. The competing methods are run with their default setting regarding tuning parameters. To evaluate similarity, the estimated precision matrix and true matrix are benchmarked by the Steins loss function: where is an estimate of the true precision matrix and denotes the determinant of a matrix. The Frobenius norm of difference between and , denoted by , where , is also considered. In addition to the Steins loss function, to measure how efficiently our model recovers the true conditional dependent relationship among genes, specificity (SPE) and sensitivity (SEN) are used, as defined by where TN, TP, FN, and FP are the numbers of true negatives, true positives, false negatives, and false positives with regard to offdiagonal elements of a precision matrix. Here, we treat a nonzero entry of a precision matrix as “positive.” To combine sensitivity and specificity, Youden’s index (=) is used. The smaller values of and are better, whereas the larger values of SPE, SEN, and Youden are better.
Inspired by Yin and Li [4], we generate simulation data sets in the form of eQTL data sets such that nonzero entries of a precision matrix, commonly called a link (or edge), are randomly assigned with probability , where is the number of genes and is some positive constant. For a link generated at the th entry of the precision matrix, denoted by , the corresponding element is sampled from the uniform distribution over . For each row, offdiagonal elements are divided by the sum of their absolute values multiplied by 1.5. And we obtain the true precision matrix by symmetrizing and setting diagonals as 1. To create the regression coefficients matrix , we first generate a indicator matrix that has 1 as entry with probability for some positive constant . If the th element of this indicator matrix is 1, is randomly generated from , where is the smallest absolute value of generated.
Producing and , we generate SNPs, with for . Finally, we simulated gene expressions by generating from the multivariate normal distribution given , . We generate a data set of i.i.d. random vectors , and simulations are repeated 50 times. In Table 1, we outline the six simulation scenarios of smallscale setup. Table 2 contains the three simulation scenarios of largescale setup.
The smallscale simulation results in Table 3 suggest that the JMCCM produces better estimates than all other methods across all six models. Due to computational issues, we drop some results of JML from Table 3. By comparing the results of model 3 with 6 and 2 with 4, when and the number of genes is fixed, we can see that JMCCM and CAPME show less changes in and than they appear in the SCGGM as the number of SNPs increases. This indicates that JMCCM and CAPME are less subject to the increment of the number of SNPs than SCGGM, when modest size of genes is involved. Estimation performance of CAPME seems to be affected by the number of genes more easily than JMCCM because Stein loss and Frobenius norm of CAPME for models 1, 2, and 3 with fixed increase more rapidly than those of JMCCM do. For identifying structures of the precision matrix, JMCCM surpasses SCGGM in discovering nonzero elements (higher sensitivity) as complexity of the model rises. This is possibly due to the fact that SCGGM tends to produce sparse estimates more than the true precision matrix. Higher sensitivity implies that our proposed model is less likely to miss the influential conditional dependency among genes. While CAPME and JML score near 1 in sensitivity, they mark poor number in specificity because 5fold crossvalidation for CAPME and validation approach for JML with default setting for tuning parameters selection tend to choose small ones, leading to dense estimates. JMCCM produces higher Youden’s index across all simulation scenarios compared to all other methods, and the performance gap between JMCCM and others is increasingly widened as the models increase in sample size and complexity. The results for the largescale simulations (models 7–9) are summarized in Table 4 and Table S3. Due to long computing time, the results of CAPME for models 8 and 9 are not reported. Overall, the results are consistent with the smallscale simulations. The gap between JMCCM and SCGGM in estimation performance widened as the complexity of the model increases (Table 4).
We also simulate SNPs to mimic the linkage disequilibrium (LD) which is known to be a common phenomenon in DNA sequence and to assess the performance of our approach. We randomly generate two groups of SNPs which lie on LD, each of which contains LD block including 10 correlated SNPs such that correlation is greater than 0.9. Together with these SNPs, 10 SNPs are generated independently from Bernoulli trials with probability of 0.5 for a total of 20 SNPs and 20 genes: and . Table 5 presents the simulation results using these data of sample size and with 50 repetitions. Compared to the results of model 4 in Table 3, which is based on 20 independent SNPs with , LD is found to have little impact on all methods, except for slight decreases in Stein loss and Frobenius norm of SCGGM.
5. eQTL Analysis of Yeast Data
In this section, we apply the proposed algorithm to genomic eQTL (i.e., expression quantitative trait loci) data in order to examine whether the proposed method effectively recovers true dependency of gene expressions, which builds on known molecular mechanisms. To this end, we collect a set of yeast data [22], including polymorphic genotypes and mRNA expressions. The yeast data have been widely applied to elucidate the biological interactions between nucleotide polymorphisms and their responding genes (e.g., perturbation effects) [23–26]. Thus, our primary goal is to identify conditional dependency among genes with an adjustment of SNPs perturbations to each gene expression level.
The data sets are collected for two yeast parent strains, BY4716 (BY) and RM111a (RM), and their 112 segregants. We obtain SNPs for 1,260 loci to the exclusion of the redundant SNPs observed in neighboring genetic regions and leave 3,684 expression genes after screening out genes of missing more than 5%. In order to validate whether or not the estimated gene network is consistent with unknown biological knowledge, we take the true signaling pathway for comparison. Out of 3,684 yeast genes, we purposely focus on a set of 64 genes that are ascertained in the cell cycle yeast pathway available in the KEGG database [24]. Together with 1,260 SNPs as predictors, we construct a gene network model of 64 genes. JMCCM selects 222 nonzero elements of the precision matrix and leaves nonzero regression coefficients, for a total of 111 links among genes. A total of 489 regression coefficients of SNPs over gene expression levels are included in the final model. Figure 2 displays the conditional gene network estimated by the proposed joint model. A pair of two genes is linked if an offdiagonal element of the estimated precision matrix is nonzero, and so 51 genes are shown to be linked and assemble in a module.
Given this estimated gene network, we notice that the gene network is found to be concordant with the true cell cycle pathway. For instance, according to the true cell cycle pathway in Figure 1, MCM3 is linked to ORC3, ORC5, MCM7, and MCM4. MCM5 is connected to TAH11 and ORC1. PHO4 is closely linked to CLN2 and SIC1, whose molecular function is related to MAP kinase orthologs (i.e., MAPK pathway genes). It is interesting to note that CLN3 and SWI4 are mutually connected together, linking to their downstream gene MBP1. Hence, this undirected graph contributes to recovering lots of links among the 64 genes of the pathway.
With regard to genetic variant effects, JMCCM is shown to effectively identify some of the wellknown direct genetic perturbations. Gene expressions are regulated by some genetic variants, which, unless otherwise taken into account, may falsely capture the interplay of genes in a network. Our founding includes Clbspecific Cdk inhibitor (SIC1) as influencing the molecular interface between cyclinCdk complexes (e.g., binding to and blocking Cdk1/Clb activity, ultimately to maneuver the timing of DNA replication (see Table S1 in Supplementary Materials; [27]). In addition, our gene networks demonstrate that SIC1 is strongly perturbed by CLB3, while SCGGM did not detect any perturbation effects. More interestingly, many previous experiments validated the association between SIC1 and CLB3 to uncover the underlying mechanism of the cell cycle [28–30]. Clearly, this implies that JMCCM does a better job in accounting for SNPs perturbation as compared to SCGGM.
Moreover, the gene module detected from JMCCM is shown in Table 6. Pertaining to these gene modules, we hypothesize if gene modules, each containing hub genes and their neighboring genes, are enriched with common biological processes or not. To test this hypothesis, we conduct the gene ontology (GO) enrichment analysis [31] over the detected gene network modules (see Table S2 in Supplementary Materials) from both JMCCM and SCGGM using Fisher’s exact tests. Table 6 demonstrates that the proposed method outstandingly performs detecting modules biologically associated with many molecular processes, while none is detected by SCGGM. More importantly, the pathway of the organic acid metabolic process enriched in gene module 1 is also reported and validated on the basis of the network modules constructed with largescale integration of yeast data in Zhu et al. [32].
Putting all things together, we conclude that the proposed method facilitates recovering the true SNPs perturbations in the midst of gene regulations and elucidating the underlying interplay of gene interactions. These fortes of the proposed algorithm are favored for reinforcing a priori biological knowledge and address a novel hypothesis related to clinical and translational potential.
6. Conclusion and Discussion
In this paper, we propose JMCCM to efficiently identify conditional dependent structures of gene expressions with adjustments to perturbation effects of SNPs. Contrary to the existing conditional graphical models, the precision matrix commonly used to reveal the true relationship among genes is parameterized via the modified Cholesky decomposition. The maximum likelihood estimates of the precision matrix were computed, while variable selection of SNPs and Cholesky factors are carried out separately and jointly by the GCV and BIC criterion. From experimental studies, it is clearly shown that JMCCM performs better than the existing penalization methods. Besides, JMCCM in the application to yeast cell cycle data successfully recovers many parts of the cell cycle pathway with adjustments of SNPs to each gene expressions level. Notably, the model entails the estimation of precision matrix, of which components are assumed to be constant. So, in the future, we may relax this somewhat strong assumption in the way that the model can parametrize over and in pursuit of more accurate estimation [13]. We leave this for next study.
Appendix
A. Positive Definiteness of
The matrix representation of the modified Cholesky decomposition (2) is given by
For simplicity, write and . For any nonzero vector , Since for all , if and only if , which cannot happen here because is a nonzero vector. Thus is a positive definite matrix.
B. A Derivation of the LogLikelihood Function
The probability density function of (1) is given by where . The loglikelihood function corresponding to JMCCM can be derived as follows: where the last equality comes from , , and
C. Derivations of MLE , , and
Using notations presented in Section 2.2, optimization problem for is expressed as Denote the score function of by . Observe that By solving the normal equation , we have (5).
To derive (6), let , th row of . Then we can see that . The loglikelihood function (3) can be expressed as the sum of as follows:
Finally, for any given , the gradient of loglikelihood function with respect to and is given by Then (8) is obtained by solving with respect to and .
Data Availability
The source of the eQTL data analyzed in this paper is [22]. An R package JMCCM and the eQTL data set are available at author’s webpage: https://sites.google.com/site/sunghwanshome/.
Disclosure
A part of results from the earlier version of this research were presented in the poster session at 2016 Spring Seminar held by the Korean Statistical Society.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this manuscript.
Acknowledgments
This research is supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF2015R1D1A1A01057747 and NRF2017R1C1B5017528).
Supplementary Materials
Table S1 shows the estimated coefficient matrix from JMCCM applied to the genes of cell cycle pathway. The columns refer to genes and the rows to SNPs. Table S2 contains the gene lists of each module estimated from both SCGGM and JMCCM. Table S3 shows results of simulation with a large data set of m = 1000 (genes), p = 4000 (SNPs), and N = 500. (Supplementary Materials)