Computational and Mathematical Methods in Medicine

Volume 2018, Article ID 4626307, 13 pages

https://doi.org/10.1155/2018/4626307

## Variable Selection and Joint Estimation of Mean and Covariance Models with an Application to eQTL Data

^{1}Department of Statistics, Korea University, Seoul 02841, Republic of Korea^{2}Department of Applied Statistics, Konkuk University, Seoul 05029, Republic of Korea

Correspondence should be addressed to SungHwan Kim; moc.liamg@747ssiws

Received 1 October 2017; Revised 25 December 2017; Accepted 18 April 2018; Published 25 June 2018

Academic Editor: Nadia A. Chuzhanova

Copyright © 2018 JungJun Lee et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

In genomic data analysis, it is commonplace that underlying regulatory relationship over multiple genes is hardly ascertained due to unknown genetic complexity and epigenetic regulations. In this paper, we consider a joint mean and constant covariance model (JMCCM) that elucidates conditional dependent structures of genes with controlling for potential genotype perturbations. To this end, the modified Cholesky decomposition is utilized to parametrize entries of a precision matrix. The JMCCM maximizes the likelihood function to estimate parameters involved in the model. We also develop a variable selection algorithm that selects explanatory variables and Cholesky factors by exploiting the combination of the GCV and BIC as benchmarks, together with Rao and Wald statistics. Importantly, we notice that sparse estimation of a precision matrix (or equivalently gene network) is effectively achieved via the proposed variable selection scheme and contributes to exploring significant hub genes shown to be concordant to* a priori* biological evidence. In simulation studies, we confirm that our model selection efficiently identifies the true underlying networks. With an application to miRNA and SNPs data from yeast (a.k.a. eQTL data), we demonstrate that constructed gene networks reproduce validated biological and clinical knowledge with regard to various pathways including the cell cycle pathway.

#### 1. Introduction

Generally, joint estimation of mean and covariance has been developed to address problems related to biomedical data. In longitudinal data analysis, for instance, identifying correct correlation structures within each subject is a major focus and many studies come up with variants of joint mean and covariance estimation to enhance statistical efficiency (see Ye and Pan [1] and references therein). With regard to graphical models, particularly, conditional Gaussian graphical models (cGGM) aiming to elucidate conditional dependence structures subject to the mean have increasingly received much attention [2–5] and conceptually can also be viewed as examples of joint mean and covariance estimation in essence. On a methodological side, joint estimation has been found to be practically applicable in the sense that applications of joint estimation methods cover bioinformatics, such as hormone and transcriptome [1, 4, 6–8], and stock price analysis [9]. A majority of existing methods involve variable selection of covariates for the mean vector and sparsity of covariance mainly based on the penalization methods. However, in this paper, we purposely focus on variable selection and joint estimation with a pure implementation of the likelihood-based method, unlike common penalization techniques, to better analyze eQTL data in light of genetic feature selection.

Over the decades, genomics research has focused on comprehensive understanding of regulatory networks in the context of system biology. Commonly we are interested in a gene network, which pictures the interplay among genetic factors (e.g., gene regulation and activation). Particularly, it is important to investigate how a given genotype (genetic variants) at a particular quantitative trait locus (QTL) affects measured phenotypes and traits at that locus. For instance, gene expression quantitative loci (eQTL) make use of gene expressions as quantitative traits. eQTL analysis has been widely applied to figure out the effect of genetic perturbations associated with diseases as well as to construct regulatory networks describing how genes regulate expressions of other genes [10, 11]. More precisely, the location of single nucleotide polymorphisms (SNPs) may affect multiple gene expression levels, and this accidentally causes misleading inference for dependency structure among genes [4]. Many popular methods [2–5, 12] have been introduced to identify gene networks, aiming at learning networks subject to perturbation effects by genetic variants on the basis of population gene expression and genotype data.

Yin and Li [4] proposed the sparse conditional graphical Gaussian model with and the adaptive lasso penalty function. Li* et al.* [3] suggested the two-stage estimation framework: estimating a nonsparse conditional covariance matrix of genes based on a conditional variance operator between the reproducing kernel Hilbert spaces of marker genes and then using and adaptive lasso penalty to obtain sparse estimates of a precision matrix under the cGGM. Cai* et al.* [12] studied the covariate-adjusted precision matrix estimation (CAPME) method using the constrained optimization.

While most recent studies encourage sparsity estimation of a precision matrix based on the penalized likelihood, we instead rely on classical variable methods based on the standard likelihood. Strictly speaking, the penalized likelihood, according to its definition, cannot be viewed as the likelihood. It naturally poses a question whether the likelihood-based method performs better than the penalized likelihood approaches. To address this question, we consider a joint mean and constant covariance model (JMCCM) inspired by Pourahmadi [13] and propose methods for variable selection which effectively identify sparse conditional gene networks and covariates relevant to gene regulations. We employ the modified Cholesky decomposition to guarantee positive definiteness of an estimated precision matrix [13]. With this reparametrization of the precision matrix, the log-likelihood function corresponding to our model can be decomposed into an additive form of each response in terms of Cholesky parameters. This facilitates a coordinate descent type implementation we adopt for the precision matrix estimation. The combination of both generalized cross-validation (GCV) and Bayesian information criterion (BIC) performs variable selection as a benchmark. Rao and Wald statistics are also utilized to add and delete genetic markers for each gene expression and the Cholesky factors for the precision matrix. To the best of our knowledge, only a few works use Rao and Wald statistics for variable selection in the joint estimation problem, particularly with an application to eQTL analysis.

In simulation studies, extensive simulation scenarios experimentally confirm improved estimation efficiency and precise variable selection of the proposed method. For real data applications, we perform eQTL analysis via the JMCCM with gene expressions and SNPs of yeast data, in pursuit of detecting the effects of variant perturbations and underlying gene regulations. We find that the JMCCM effectively uncovers biological pathways that may potentially account for known biological processes. Taken together, the JMCCM is shown to be effective in identifying conditional dependence structures among variables compared to the existing graphical models with penalization methods such as the sparse cGGM [5] and CAPME [12].

The paper is outlined as follows. In Section 2, we describe our JMCCM with modified Cholesky decomposition and maximum likelihood estimates. The variable selection algorithm using Rao and Wald statistics for SNPs and Cholesky factors is explained in Section 3. Section 4 deals with simulation studies to demonstrate performance of variable selection and estimation of the proposed model. The yeast cell cycle pathway genes with SNPs data analysis are presented in Section 5. Concluding remarks and discussion in Section 6 are followed by the Appendix, which provides mathematical details of our method.

#### 2. Model And Estimators

##### 2.1. JMCCM with the Modified Cholesky Decomposition

Contrary to previous methods [3, 4], the JMCCM primarily aims at simultaneous estimation over the mean and precision matrix of the Gaussian graphical model. Suppose that is a pair of the vector of genetic markers and the random vector of expression levels. Let denote the vector of SNPs and let denote the gene expression traits following -variate multivariate normal distribution with the mean and covariance matrix as follows:where the th entry of is for , and is the linear regression coefficient vector indicating effects of SNPs perturbations to gene expressions and . Importantly, note that does not depend on . The coefficient is assumed to be sparse, since each gene is known to have only a few genetic regulators according to Cai* et al.* [12]. The precision matrix represents a gene network (graph), as in an undirected GGM, by corresponding the nonzero th element of the precision matrix with an edge between two vertices and [14]. This edge represents conditional dependence of genes and given all other gene expression levels. Thus, our goal is to identify nonzero entries of the precision matrix in order to construct a conditional dependence genetic network after the effects of SNPs perturbations are removed. The precision matrix is also expected to be sparse [4].

One of our primary interests is to estimate the precision matrix , which is symmetrically positive definite, so we need to ensure that the estimate of the precision matrix also satisfies symmetrical positive definiteness. To this end, we apply the modified Cholesky decomposition [13] to the precision matrix, denoted by , as follows:where is an upper triangular matrix with diagonal entries and above-diagonal elements consisting of negative of and is a diagonal matrix containing with for all as diagonal entries. Here, the superscript "" denotes the transpose of a matrix. Positive definiteness of is shown in Appendix A. Throughout this paper, a vector in the parenthesis is considered as a column vector. Let for . Then we can write . The parameter space for JMCCM is defined by , where represents the set of -dimensional vectors of positive real numbers.

##### 2.2. Maximum Likelihood Estimation

Suppose that we have the independent observations , , sampled from (1), where and represent the th observation of and , respectively. With , , and , the log-likelihood function corresponding to (1) is given bywhere and "tr" denotes the trace of a matrix. A derivation of (3) can be found in Appendix B. The maximum likelihood estimator (MLE) of is defined by

Some notations are needed to express MLE of in conjunction with variable selection. For , let be an index set of active (or significant) coefficients of and denote by the number of elements in . Let and . Obviously, and if , then all variables are significant for all . Define and for . Write for , , and , where denotes th column of . Let , , and , where and represent direct sum and Kronecker product, respectively, and is the identity matrix. Then the MLE of can be expressed asObserve that estimating involves .

We can estimate by obtaining MLE of and . For , let with . Let , , and represent the , , and components of the lower th principal submatrix of . Then we can expresswhere By (6), we can obtain from for each with and Thus, instead of finding a solution to the optimization problem with (3), we optimize each with respect to and . Thus, the MLE and are expressed in terms of in a way thatwith from (5) and . We estimate by and it is defined by Derivations of (5), (6), and (8) are presented in Appendix C.

#### 3. Consecutive Variable Selection Algorithm

As mentioned above, and are commonly believed to be sparse in genomic data analysis. To address sparsity, the lasso-type penalty imposed on both regression coefficients and precision matrix has been popularly applied to diverse graphical models [4, 5]. Stepping aside the lasso-type approach, we develop a variable selection technique that mainly relies on the combination of classical variable selection methods. Generally, the numbers of SNPs and genes tend to be considerably huge so that computational costs normally become prohibitive. In order to address this problem, the proposed variable selection algorithm proceeds with largely two stages: preliminary variable selection for mean and precision matrix and secondary variable selection in the middle of the joint model estimation. It is important to note that the first stage leads possible variables to be limited in scope (i.e., working parameters in the model) in order to circumvent high computational complexity.

##### 3.1. Preliminary Variable Selection

Preliminary variable selection is largely twofold: variable selection for the mean part and covariance part. The idea behind that is to add variables (or equivalently parameters) to the joint model with the maximum Rao statistic and to delete ones with the minimum Wald statistic. You may refer to Koo [15] for the basis selection method or Kooperberg* et al.* [16] that explain variable selection schemes based on Rao and Wald statistics.

###### 3.1.1. Mean Part

When it comes to the mean part, we carry out selecting predictor variables (i.e., SNPs) for each response variable (i.e., gene expression), dealing with a univariate multiple regression problem. In the addition stage, we start off with a model including only an intercept term. The MLE is used as estimator for and maximum Rao statistic (Rao [17]) is the criterion for adding a predictor together with GCV (Friedman [18]) as a stopping rule. In the deletion stage, Wald statistic is calculated to exclude predictor variables such that the updated model minimizes Wald statistics. The final model for the mean regression is chosen by the minimum GCV. Details are summarized in Algorithm 1. Once Algorithm 1 is done, the number of predictor variables included in the joint model no longer increases, while variable reduction can happen in Algorithm 3 (Section 3.2).