Computational and Mathematical Methods in Medicine

Volume 2017, Article ID 8520480, 10 pages

https://doi.org/10.1155/2017/8520480

## Node-Structured Integrative Gaussian Graphical Model Guided by Pathway Information

^{1}Department of Statistics, Keimyung University, Daegu, Republic of Korea

^{2}The Institute of Natural Science, Keimyung University, Daegu, Republic of Korea

^{3}Department of Statistics, Korea University, Seoul, Republic of Korea

^{4}Graduate School of Information Security, Korea University, Seoul, Republic of Korea

^{5}School of Industrial Management Engineering, Korea University, Seoul, Republic of Korea

Correspondence should be addressed to ByungYong Lee; rom109101@gmail.com and SungWon Han; swhan@korea.ac.kr

Received 31 October 2016; Revised 20 February 2017; Accepted 6 March 2017; Published 12 April 2017

Academic Editor: Hongmei Zhang

Copyright © 2017 SungHwan Kim et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

To date, many cancer-related biological pathways have been extensively applied, thanks to the output of burgeoning biomedical research. This leads to a new technical challenge of exploring and validating biological pathways that can characterize transcriptomic mechanisms across different disease subtypes. To accommodate multiple studies, the joint Gaussian graphical model was previously proposed to incorporate nonzero edge effects. However, this model inevitably depends on post hoc analysis to confirm biological significance. To circumvent this drawback, we attempt not only to combine transcriptomic data but also to embed pathway information, which is itself well-ascertained biological evidence, into the model. To this end, we propose a novel statistical framework for fitting a joint Gaussian graphical model simultaneously with informative pathways consistently expressed across multiple studies. In theory, structured nodes can be prespecified with multiple genes. The optimization rule employs the structured input-output lasso model in order to estimate a sparse precision matrix shaped by the simultaneous effects of multiple studies and structured nodes. In an application to breast cancer data sets, we found that the proposed model is superior in efficiently capturing structures of biological evidence (e.g., pathways). An R software package, nsiGGM, is publicly available at the author’s webpage.

#### 1. Introduction

Genomic data have been extensively applied to analyze disease mechanisms on the basis of predictive signatures from DNA alterations (e.g., genotyping and mutation), RNA transcription (e.g., gene or isoform expression and fusion transcripts), and gene regulation by epigenetic changes (e.g., methylation, protein-DNA interaction, and miRNA expression). In particular, gene regulation is a complicated system that builds on the interactions of tens of thousands of cellular components and on diverse activities across multiple layers. Biological networks are the most widely used data resource for sketching this interconnectivity of gene regulation. High-throughput genomic technologies are paving the way toward systematically characterizing diverse types of biological networks and suggesting underlying gene regulation mechanisms. And yet complete inference of a network’s complexity has long been a concern in the field of systems biology.

To circumvent the shortcomings of single feature-based analysis, the activity of a gene or of a whole biological process in a disease can be assessed by sets of genes (a.k.a. gene set enrichment analysis or pathway analysis). In doing so, a large number of pathways have been identified through cancer-related research [1]. Pathway information describes cellular functions and biological processes or represents a unique signature of deregulation of a given gene [2]. For example, the pathway or signature associated with the activity of a given oncogene is defined as the set of genes most differentially expressed under perturbation of that oncogene [3–5]. Importantly, the use of pathway information is increasingly prevalent in biomedicine. For instance, drugs that target a candidate pathway are regarded as a practical alternative to traditional drug discovery, which usually adopts the one-drug-one-target approach. This strategy takes into account the fact that disease occurrence is usually the result of complex interactions of molecular events.

In recent years, the volume of large-scale genomic data generated from relevant biological experiments or clinical hypotheses has grown rapidly, as high-throughput experimental technologies have markedly advanced [6]. These data have been made publicly available in data repositories (e.g., Gene Expression Omnibus and Sequence Read Archive). This abundance of biological experiments poses a new challenge: exploring and validating biological signatures and pathways across multiple data sets. More precisely, a question in network analysis often relates to how to characterize underlying transcriptomic patterns or molecular mechanisms across disease subtypes or between case-control groups, because it is commonplace that biological signals are not coherently present across studies. Generally, a single network [7–9] is found to accurately estimate underlying dependency with an adjustment for gene perturbation effects (e.g., polymorphic genotype alteration [10, 11]). Nonetheless, these methods hardly discover network patterns of subtle signals and dynamic features in the midst of coupled networks under diverse conditions. Moreover, single networks can generate many false positive signals (edges) attributable to experimental biases and errors. To address this challenge, recent data analysis has turned to data integration, which draws on multiple data sets to achieve more accurate network inference. To this end, many have proposed methods that combine multiple networks within a unified model [12–14]. This approach is also known as integrative analysis and is analogous to traditional meta-analysis.

The joint Gaussian graphical model (JGGM; Danaher et al. [12]) focuses on incorporating nonzero edge effects (i.e., off-diagonal entries of the precision matrix) to combine multiple studies for integrative analysis. This model, however, inevitably depends on post hoc analysis when validating biological significance. It is therefore of interest to combine not only DNA and/or transcriptomic changes but also pathway information, which is itself well-ascertained biological evidence. Normally we perform post hoc analysis to see whether the estimated gene networks are enriched for any pathways. In contrast, it is also sensible to estimate gene networks with an adjustment for pathway information. Yet pathway information is rarely incorporated at the estimation stage, in spite of its biological significance. To the best of our knowledge, no method has been proposed that can accommodate overlapping node structures, mainly because gene annotations overlap across pathway gene sets. To tackle this problem, we propose a new graphical model called the “node-structured integrative Gaussian graphical model (nsiGGM),” which jointly leverages a priori knowledge of pathway information. This method handles the overlapping group lasso problem, making it possible to integrate genes shared by several pathways. It is worthwhile to let biological pathways inform the network estimation in order to reveal the true gene regulatory network. The nsiGGM builds on prespecified structured nodes with multiple genes as building blocks when estimating a precision matrix. The implementation employs the lasso penalty of the structured input-output lasso model [15] in order to estimate a sparse precision matrix that accounts for the simultaneous effects of multiple studies and structured nodes. In an application to simulated and breast cancer genomic data, the proposed model is found to be superior in efficiently capturing transcriptional modules predefined by a pathway database. A software package (nsiGGM) is publicly available at the author’s webpage (https://sites.google.com/site/sunghwanshome/).

This paper is organized as follows. In Section 2, we review background knowledge of the standard and joint Gaussian graphical models and then propose the node-structured integrative Gaussian graphical model (nsiGGM). In Section 3, we describe an implementation strategy that is primarily based on the input-output lasso. In Section 4, we compare the performance of our proposed method with that of other methods using real breast cancer data (TCGA) and simulated data. In Section 5, conclusions and further studies are discussed.

#### 2. Method

In this section, we briefly discuss the methodological background of Gaussian graphical models (GGM) for constructing gene networks. We then propose the node-structured integrative Gaussian graphical model (nsiGGM), which can accommodate a priori biological knowledge (e.g., pathway data or targeted predictive genes of miRNA).

##### 2.1. Gaussian Graphical Models for Gene Networks

A Gaussian graphical model describes the conditional dependency of multiple random variables, $X = (X_1, \ldots, X_p)$, with a graph $G = (V, E)$, where $V$ is a set of nodes and $E$ is a set of edges indicating that nodes are linked and conditionally dependent. Let $X$ follow the multivariate Gaussian distribution $N(\mu, \Sigma)$, where $\Sigma$ is a covariance matrix. Let $\Theta = \Sigma^{-1}$ denote the inverse covariance matrix (also known as a precision matrix). More precisely, each nonzero off-diagonal element $\theta_{ij}$ of $\Theta$ implies conditional dependency between the $i$th and $j$th nodes given all the other variables, $X_{-(i,j)}$, whereas the covariance $\sigma_{ij}$ presents marginal dependencies without considering other variables. This model is also called a GGM [16]. The graphical lasso [9, 17] produces a sparse Gaussian graphical model whose edges correspond to the entries of $\Theta$ that are not shrunk to zero by the penalty. The graphical lasso minimizes the negative log-likelihood with the lasso penalty:

$$-\log\det\Theta + \operatorname{tr}(S\Theta) + \lambda\|\Theta\|_1, \tag{1}$$

where $\operatorname{tr}(A)$ is the trace of matrix $A$, $S$ is the sample covariance matrix, and $\lambda$ is the regularization parameter adjusting the degree of sparsity. The optimal value for $\lambda$ can be chosen by cross-validation or the Bayesian information criterion (BIC; Schwarz [18]; Yuan and Lin [8]).
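As a rough illustration of the objective in (1), the following Python sketch evaluates the penalized negative log-likelihood for a candidate precision matrix. The function name `graphical_lasso_objective` is hypothetical, and the sketch assumes that the $\ell_1$ norm is taken elementwise over all entries of $\Theta$, including the diagonal; some formulations penalize only the off-diagonal entries.

```python
import numpy as np

def graphical_lasso_objective(theta, S, lam):
    """Evaluate -log det(theta) + tr(S theta) + lam * ||theta||_1.

    Assumption: the l1 norm sums |theta_ij| over ALL entries (diagonal included).
    Returns +inf when theta is not positive definite.
    """
    theta = np.asarray(theta, dtype=float)
    S = np.asarray(S, dtype=float)
    try:
        # Cholesky succeeds only for positive definite matrices.
        L = np.linalg.cholesky(theta)
    except np.linalg.LinAlgError:
        return np.inf
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -log_det + np.trace(S @ theta) + lam * np.abs(theta).sum()
```

Evaluating this function over a grid of $\lambda$ values is one simple way to see how the penalty trades likelihood fit against sparsity; it does not by itself perform the minimization.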

##### 2.2. Joint Gaussian Graphical Models for Combining Multiple Studies

In this section, we revisit the joint Gaussian graphical model (JGGM) proposed by Danaher et al. [12]. Simply put, the JGGM combines multiple studies and constructs multiple networks in a unified model. Let $K$ denote the number of studies in our data and $\Theta^{(1)}, \ldots, \Theta^{(K)}$ the true precision matrices. Consider genomic data of $K$ studies, $X^{(1)}, \ldots, X^{(K)}$, each of which consists of $n_k$ samples with $p$ common features, where $X^{(k)} \in \mathbb{R}^{n_k \times p}$. We assume that observations are independent and that those of each data set follow the multivariate normal distribution as $x_1^{(k)}, \ldots, x_{n_k}^{(k)} \sim N(\mu_k, \Sigma_k)$ for $k = 1, \ldots, K$. It is well known in meta-analysis that multiple data sets share common associations and genomic characteristics among features (e.g., genetic association intensity). It is therefore worth estimating the precision matrices across studies jointly rather than separately. To this end, we assume that the features within each data set are centered and take the form of a penalized log-likelihood with the group sparsity-inducing penalty that maximizes (2) with respect to $\{\Theta\} = \{\Theta^{(1)}, \ldots, \Theta^{(K)}\}$:

$$\sum_{k=1}^{K} n_k\left[\log\det\Theta^{(k)} - \operatorname{tr}\left(S^{(k)}\Theta^{(k)}\right)\right] - \lambda_1\sum_{k=1}^{K}\sum_{i\neq j}\left|\theta_{ij}^{(k)}\right| - \lambda_2\sum_{i\neq j}\sqrt{\sum_{k=1}^{K}\left(\theta_{ij}^{(k)}\right)^2}, \tag{2}$$

subject to $\Theta^{(1)}, \ldots, \Theta^{(K)}$ being positive definite, where $S^{(k)}$ is the sample covariance matrix of $X^{(k)}$ and $\lambda_1$, $\lambda_2$ are nonnegative tuning parameters. It is interesting to note that the $\lambda_2$-penalty captures similarity across the $K$ precision matrices. Due to this property, the penalty terms of (2) are also referred to as the joint graphical lasso (JGL). Moreover, the $\lambda_1$ penalty induces the estimated precision matrices to be sparse.
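To make the two penalty terms of the joint model concrete, the sketch below computes them for a stack of $K$ precision matrices. It assumes the group-lasso form described by Danaher et al. [12], in which the lasso term sums absolute off-diagonal entries over all studies and the group term takes an $\ell_2$ norm across studies for each off-diagonal position. The function name `jgl_penalty` is illustrative only.

```python
import numpy as np

def jgl_penalty(thetas, lam1, lam2):
    """Lasso plus across-study group penalty on off-diagonal entries.

    thetas: array-like of shape (K, p, p), one precision matrix per study.
    """
    thetas = np.asarray(thetas, dtype=float)
    p = thetas.shape[1]
    off_diag = ~np.eye(p, dtype=bool)          # mask selecting i != j
    entries = thetas[:, off_diag]              # shape (K, p * (p - 1))
    lasso_term = np.abs(entries).sum()         # sum_k sum_{i != j} |theta_ij^(k)|
    group_term = np.sqrt((entries ** 2).sum(axis=0)).sum()  # l2 norm across studies
    return lam1 * lasso_term + lam2 * group_term
```

Because the group term couples the same edge across studies, it is zero for an edge only when that edge is zero in every study, which is what encourages a shared sparsity pattern.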

##### 2.3. Node-Structured Integrative Gaussian Graphical Model

In this section, we propose an integrative graphical model that can accommodate a priori known structure of genomic features. When learning gene networks, the sparseness of the precision matrix can be guided to some extent by known feature modules (e.g., pathway information). Typically, data integration allows picturing the interplay of underlying biological factors. In this regard, it is worthwhile accommodating known feature module information ascertained in previous experiments. In doing so, we seek to embed a priori feature module information across multiple networks via an additional group penalty. The following objective function is minimized with respect to $\{\Theta\}$:

$$-\sum_{k=1}^{K} n_k\left[\log\det\Theta^{(k)} - \operatorname{tr}\left(S^{(k)}\Theta^{(k)}\right)\right] + \lambda_1\sum_{k=1}^{K}\sum_{i\neq j}\left|\theta_{ij}^{(k)}\right| + \lambda_2\sum_{i\neq j}\sqrt{\sum_{k=1}^{K}\left(\theta_{ij}^{(k)}\right)^2} + \lambda_3\sum_{m=1}^{M}\sqrt{\sum_{k=1}^{K}\sum_{(i,j)\in\mathcal{G}_m}\left(\theta_{ij}^{(k)}\right)^2}, \tag{3}$$

where $\mathcal{G}_m$ is a subset of off-diagonal entry indices of $\Theta^{(k)}$ for $1 \le k \le K$, $1 \le m \le M$, $M$ is the number of a priori feature modules, and $\mathcal{G} = \{\mathcal{G}_1, \ldots, \mathcal{G}_M\}$. Importantly, the elements of $\mathcal{G}$ can overlap (e.g., genes duplicated in two different pathways). The third penalty, adjusted by $\lambda_3$, pertains to structured feature modules (i.e., structured nodes in networks) on the basis of a priori known information. Here, unbiased regularization of each feature should be taken into consideration, because feature overlap inevitably comes into play.
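The module-level penalty can be illustrated with a short sketch. It assumes that each module contributes one $\ell_2$ norm taken over all of its off-diagonal index pairs and over all studies, which is one reading of the third penalty; the exact grouping and any group weights used in nsiGGM may differ. The function name `module_penalty` and the index-pair representation are hypothetical.

```python
import numpy as np

def module_penalty(thetas, modules, lam3):
    """Group penalty over a priori feature modules (modules may overlap).

    thetas:  array-like of shape (K, p, p).
    modules: list of modules, each a list of (i, j) off-diagonal index pairs.
    """
    thetas = np.asarray(thetas, dtype=float)
    total = 0.0
    for index_pairs in modules:
        rows = [i for i, _ in index_pairs]
        cols = [j for _, j in index_pairs]
        block = thetas[:, rows, cols]          # shape (K, number of pairs)
        total += np.sqrt((block ** 2).sum())   # one l2 norm per module
    return lam3 * total
```

An entry that belongs to two modules enters two separate norms, so it is penalized twice. This is the overlap issue that motivates a dedicated solver rather than a plain group lasso.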

In what follows, we present a toy example to demonstrate how a priori information constructs feature modules in $\mathcal{G}$. In Figure S1, in Supplementary Material available online at https://doi.org/10.1155/2017/8520480, we take an example of networks consisting of 5 common nodes (e.g., genomic features) across three studies. In Figure S1A, the second penalty with $\lambda_2$ captures matched common edges (e.g., $\theta_{ij}^{(1)}$, $\theta_{ij}^{(2)}$, and $\theta_{ij}^{(3)}$ for a given pair $(i, j)$), identical to the joint graphical lasso. In addition, the third group lasso penalty with $\lambda_3$ accommodates the six edges of the three features in a predefined module so that feature regulatory effects can be further modeled in the context of data integration (see Figure S1B). Importantly, note that this module structure (e.g., pathway) is knowledge available a priori. It is interesting that this approach is in line with the integrative cluster [19], which allows for *cis*-regulatory effects and target gene prediction for miRNAs.

In the case of multiple modules in a network, suppose that we are given a set of five genes and a precision matrix $\Theta^{(k)}$ for $1 \le k \le K$. Let a priori information generate two feature modules, Module 1 and Module 2, each a subset of the five genes; we can then enumerate the precision-matrix indices of each module for all $k$, say, $\mathcal{G}_1$ and $\mathcal{G}_2$. Of note, when the two modules share genes, the corresponding component $\theta_{ij}^{(k)}$ is simultaneously present in both $\mathcal{G}_1$ and $\mathcal{G}_2$, implying that a suitable implementation is required to regularize the overlapped component. To estimate solutions to (3), we apply the structured input-output lasso [15], which can handle overlapped features, making it possible to learn a model allowing for both single-node effects across studies and predefined node structures (e.g., pathway modules). Inspired by the integrative nature of this method, we call this graphical model the node-structured integrative Gaussian graphical model (nsiGGM). When it comes to tuning the penalty parameters ($\lambda_1$, $\lambda_2$, and $\lambda_3$), the BIC is applied to determine the optimal sparseness of the networks’ edges.
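The enumeration of module indices and the resulting overlap can be sketched as follows. The gene memberships used here are purely hypothetical (they are not the modules of the original example), and the helper name `module_index_set` is illustrative. Genes are labeled 0 to 4, and each module is mapped to the set of ordered off-diagonal index pairs among its genes.

```python
from itertools import permutations

def module_index_set(genes):
    """Ordered off-diagonal index pairs (i, j), i != j, among a module's genes."""
    return set(permutations(sorted(genes), 2))

# Hypothetical memberships, chosen only to show overlap between two modules.
module_1 = module_index_set({0, 1, 2})
module_2 = module_index_set({1, 2, 3, 4})

# Index pairs present in both modules; these entries are penalized by both groups.
shared = module_1 & module_2
```

With three genes in a module, the enumeration yields six ordered index pairs, matching the six off-diagonal entries of the corresponding symmetric block. The `shared` set identifies the overlapped components that require special handling in the regularization.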

#### 3. Implementation Strategy

##### 3.1. Structured Alternating Directions Method of Multipliers Algorithm

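In this section, we delineate the implementation strategy for the nsiGGM. We solve problem (3) by using the structured alternating directions method of multipliers algorithm (sADMM). The alternating directions method of multipliers algorithm (ADMM) was previously introduced to tackle the problem of the JGL [12]. Similar to the JGL, the sADMM, proposed in the spirit of the ADMM, is designed to adopt the structured input-output lasso in order to embed node structures into the model. We first reformulate (3) with $\{\Theta\}$ and $\{Z\}$ as

$$\min_{\{\Theta\},\{Z\}} \; -\sum_{k=1}^{K} n_k\left[\log\det\Theta^{(k)} - \operatorname{tr}\left(S^{(k)}\Theta^{(k)}\right)\right] + \lambda_1\sum_{k=1}^{K}\sum_{i\neq j}\left|z_{ij}^{(k)}\right| + \lambda_2\sum_{i\neq j}\sqrt{\sum_{k=1}^{K}\left(z_{ij}^{(k)}\right)^2} + \lambda_3\sum_{m=1}^{M}\sqrt{\sum_{k=1}^{K}\sum_{(i,j)\in\mathcal{G}_m}\left(z_{ij}^{(k)}\right)^2}, \tag{4}$$

where $\{Z\} = \{Z^{(1)}, \ldots, Z^{(K)}\}$; $Z^{(k)} = \Theta^{(k)}$ for $k = 1, \ldots, K$ and $\Theta^{(k)}$ that satisfies positive definiteness. Writing $P(\{Z\})$ for the sum of the three penalty terms in (4), Boyd et al. [20] proposed the scaled augmented Lagrangian to solve problem (4) by

$$L_\rho(\{\Theta\},\{Z\},\{U\}) = -\sum_{k=1}^{K} n_k\left[\log\det\Theta^{(k)} - \operatorname{tr}\left(S^{(k)}\Theta^{(k)}\right)\right] + P(\{Z\}) + \frac{\rho}{2}\sum_{k=1}^{K}\left\|\Theta^{(k)} - Z^{(k)} + U^{(k)}\right\|_F^2, \tag{5}$$

where $\{U\} = \{U^{(1)}, \ldots, U^{(K)}\}$ are dual variables, $\rho > 0$ is the penalty parameter of the augmented Lagrangian, and $\|A\|_F$ denotes the Frobenius norm of matrix $A$ (i.e., $\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$). The sADMM algorithm repeatedly solves the three-step optimization with respect to $\{\Theta\}$, $\{Z\}$, and $\{U\}$, starting with initial values of the related parameters: $\Theta^{(k)}_{(0)} = I$, $Z^{(k)}_{(0)} = 0$, and $U^{(k)}_{(0)} = 0$ for $k = 1, \ldots, K$. The iteration is repeated until convergence as follows.

In the $\Theta$-step, for $k = 1, \ldots, K$, update $\Theta^{(k)}_{(t)}$ that minimizes

$$-n_k\left[\log\det\Theta^{(k)} - \operatorname{tr}\left(S^{(k)}\Theta^{(k)}\right)\right] + \frac{\rho}{2}\left\|\Theta^{(k)} - Z^{(k)}_{(t-1)} + U^{(k)}_{(t-1)}\right\|_F^2. \tag{6}$$

In the $Z$-step, for $k = 1, \ldots, K$, update $\{Z_{(t)}\}$ that minimizes

$$\frac{\rho}{2}\sum_{k=1}^{K}\left\|Z^{(k)} - A^{(k)}\right\|_F^2 + P(\{Z\}), \tag{7}$$

where $A^{(k)} = \Theta^{(k)}_{(t)} + U^{(k)}_{(t-1)}$. To find the optimal solution of (7), we directly apply the structured input-output lasso [15] to (7), using both a coordinate descent algorithm and the KKT conditions to boost computational speed. For more details, see [15].

In the $U$-step, for $k = 1, \ldots, K$, update $U^{(k)}_{(t)}$ as $U^{(k)}_{(t)} = U^{(k)}_{(t-1)} + \Theta^{(k)}_{(t)} - Z^{(k)}_{(t)}$.

Update the three parameters repeatedly until a stopping rule for convergence is satisfied. Putting these together, Algorithm 1 encapsulates the structured alternating directions method of multipliers algorithm.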
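Two of the three ADMM updates admit simple closed forms and can be sketched in a few lines. The $\Theta$-step below uses the eigendecomposition solution from the standard ADMM derivation for the joint graphical lasso [12]; that nsiGGM implements the step in exactly this way is an assumption. The function names `theta_step` and `dual_step` are hypothetical, and the $Z$-step is deliberately omitted because it relies on the structured input-output lasso solver [15].

```python
import numpy as np

def theta_step(S, Z, U, n, rho):
    """Minimize -n[log det(T) - tr(S T)] + (rho / 2) * ||T - Z + U||_F^2 over T.

    Assumes S, Z, and U are symmetric p x p matrices and rho > 0.
    """
    # Eigendecomposition of the symmetric matrix S - rho * (Z - U) / n.
    d, V = np.linalg.eigh(S - rho * (Z - U) / n)
    # Each eigenvalue solves a scalar quadratic; the positive root is taken,
    # so the returned matrix is positive definite.
    d_new = n / (2.0 * rho) * (-d + np.sqrt(d ** 2 + 4.0 * rho / n))
    return (V * d_new) @ V.T

def dual_step(theta, Z, U):
    """Scaled dual update: U <- U + theta - Z."""
    return U + theta - Z
```

The positive root in `theta_step` keeps each iterate positive definite without an explicit constraint. A full solver would alternate these two updates with a $Z$-step and stop once successive iterates change by less than a chosen tolerance.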