BioMed Research International

Volume 2015, Article ID 573956, 11 pages

http://dx.doi.org/10.1155/2015/573956

## Low-Rank and Sparse Matrix Decomposition for Genetic Interaction Data

^{1}Center for Quantitative Biology, Peking University, Beijing 100871, China
^{2}Institute of Computing Technology, Chinese Academy of Science, Beijing 100190, China
^{3}School of Mathematical Sciences, Peking University, Beijing 100871, China
^{4}Center for Statistical Sciences, Peking University, Beijing 100871, China

Received 8 January 2015; Accepted 13 March 2015

Academic Editor: Junwen Wang

Copyright © 2015 Yishu Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

*Background*. Epistatic miniarray profile (EMAP) studies have enabled the mapping of large-scale genetic interaction networks and generated large amounts of data in model organisms. One approach to analyzing EMAP data is to identify gene modules with densely interacting genes. In addition, the genetic interaction score (S score) reflects the degree of synergizing or mitigating effect of two mutants, which is also informative. Statistical approaches that exploit both modularity and the pairwise interactions may provide more insight into the underlying biology. However, the high missing rate in EMAP data hinders the development of such approaches. To address this problem, we adopted the matrix decomposition methodology “low-rank and sparse decomposition” (LRSDec) to decompose an EMAP data matrix into a low-rank part and a sparse part. *Results*. LRSDec is demonstrated to be an effective technique for analyzing EMAP data. We applied it to a synthetic dataset and to an EMAP dataset studying RNA-related processes in *Saccharomyces cerevisiae*. Global views of the genetic cross talk between different RNA-related protein complexes and processes have been obtained, and novel functions of genes have been predicted.

#### 1. Introduction

Genetic interactions, which represent the degree to which the presence of one mutation modulates the phenotype of a second mutation, can now be measured systematically and quantitatively [1, 2]. Genetic interactions can reveal functional relationships between genes and pathways. Furthermore, genetic networks measured via high-throughput technologies can reveal the schematic wiring of biological processes and predict novel functions of genes [3]. Recently, several high-throughput technologies have been developed to identify genetic interactions at genome scale, including Synthetic Genetic Array (SGA) [4], Diploid-Based Synthetic Lethality Analysis on Microarrays (dSLAM) [5], and epistatic miniarray profile (EMAP) [6]. In particular, EMAP systematically constructs double deletion strains by crossing query strains with a library of test strains and identifies genetic interactions by measuring a growth phenotype. An S score is calculated for each pair of genes based on statistical methods, where negative scores represent synthetic sick/lethal interactions and positive scores indicate alleviating interactions [6].

Consequently, for each pair of genes, there are two different measures of relationship in the EMAP platform. First, the genetic interaction score (S score) represents the degree of synergizing or mitigating effects of the two mutations in combination. Second, the similarity (typically measured as a correlation) of their genetic interaction profiles represents the congruency of the phenotypes of the two mutations across a wide variety of genetic backgrounds. So there are two important aspects in exploiting EMAP data. On the one hand, cellular functions and processes are carried out in series of interacting events, so genes participating in the same biological process tend to interact with each other. Therefore, algorithms that detect gene modules composed of densely interacting genes are of great interest. Within these modules, genes tend to have similar genetic interaction profiles; thus the submatrix for these genes tends to have a low-rank structure. On the other hand, the cross talks between modules are usually indicated by gene pairs with high absolute S scores (so that the genetic interaction is significant), and removing them results in a better low-rank structure. Intuitively, these gene pairs form a sparse overlay on the low-rank matrix and connect different low-rank areas. These cross talks reveal the relationships of different biological processes or protein complexes. Meanwhile, gene pairs exhibiting high absolute S scores may encode proteins that are physically associated or enriched in protein-protein interactions [7–9]. So the investigation of the S score is equally important. However, current methodologies in genetic interaction network analysis do not efficiently address these two important issues simultaneously.

In order to identify modules and between-module cross talks in genetic interaction networks, we employ the “low-rank and sparse decomposition” (LRSDec) to decompose the EMAP data matrix into a low-rank part and a sparse part. We propose that the low-rank structure accounts for gene modules, in which genes have high correlations, and the sparse matrix captures the significant S scores. In particular, entries in the sparse matrix found by LRSDec correspond to two sources of biologically meaningful interactions: within-module interactions and between-module links. In this paper, we focus our discussion of the sparse matrix on the between-module links, while the results for within-module interactions can be found in the Supplementary Material available online at http://dx.doi.org/10.1155/2015/573956 (Supplementary Data 1).

Low-rank and sparse matrix structures have been extensively studied in matrix completion and compressed sensing [10, 11]. Robust principal component analysis (RPCA) [12] proved that the low-rank and the sparse components of a matrix can be exactly recovered if it has a unique and precise “low-rank + sparse” decomposition. RPCA offers a blind separation of low-rank data and sparse noise: it assumes $X = L + S$ (where $S$ is the sparse noise) and exactly decomposes $X$ into $L$ and $S$ without predefined $\mathrm{rank}(L)$ and $\mathrm{card}(S)$. Another successful matrix decomposition method, GoDec, studies the approximated “low-rank + sparse” decomposition of a matrix by estimating the low-rank part $L$ and the sparse part $S$ from $X$, allowing additive noise, that is, $X = L + S + E$, and constrains the rank range of $L$ and the cardinality range of $S$ [13]. GoDec has been reported to outperform earlier algorithms [13].

In this paper, we modified the GoDec matrix decomposition method and developed the “low-rank and sparse decomposition” (LRSDec) to estimate the low-rank part $L$ and the sparse part $S$ of $X$. LRSDec minimizes the nuclear norm of $L$ and predefines the cardinality range of $S$, while allowing additive noise $E$. Different from GoDec, which directly constrains the rank range of $L$, LRSDec minimizes the corresponding convex surrogate, the nuclear norm of $L$. It has been shown that the nuclear-norm estimator outperforms the rank-restricted estimator [14]. Furthermore, in the presence of missing data, LRSDec can impute the missing entries while decomposing, with no need for data pretreatment, whereas GoDec cannot accomplish decomposition and imputation simultaneously. We then stated the convergence properties of our algorithm and proved that, given the two regularization parameters, the objective value of LRSDec monotonically decreases. By applying both methods to a synthetic dataset, we demonstrated the superiority of LRSDec over GoDec. Finally, we analyzed a genetic interaction dataset (EMAP) using our algorithm and identified many biologically meaningful modules and cross talks between them.

#### 2. Model

Let $X$ be an $m \times n$ matrix that represents a genetic interaction dataset, where $m$ is the number of query genes and $n$ is the number of library genes. We propose to decompose $X$ as

$$X = L + S + E, \quad (1)$$

where $L$ denotes the low-rank part, $S$ denotes the sparse part, and $E$ is the noise. Here, we introduce $L$ to account for modules, in which genes are highly correlated. These modules correspond to protein complexes and biological pathways, in which genes tend to share similar genetic interaction profiles [15]. $S$ is introduced to account for significant S scores, which are either gene pairs in the same module that have genetic interactions or cross talks among different functional modules.

Based on the assumptions above, we propose to solve the following optimization problem:

$$\min_{L,S} \; \frac{1}{2}\|X - L - S\|_F^2 + \lambda\,\mathrm{rank}(L) \quad \text{s.t.} \quad \mathrm{card}(S) \le k, \quad (2)$$

where $\lambda$ is a regularization parameter that controls the error tolerance, and $\mathrm{card}(S)$ denotes the number of nonzero entries in the matrix $S$.

To make the minimization problem tractable, we relax the rank operator on $L$ with the nuclear norm, which has been proven to be an effective convex surrogate of the rank operator [14]:

$$\min_{L,S} \; \frac{1}{2}\|X - L - S\|_F^2 + \lambda\,\|L\|_* \quad \text{s.t.} \quad \mathrm{card}(S) \le k, \quad (3)$$

where $\|L\|_*$ is the nuclear norm of $L$ ($\|L\|_* = \sum_{i=1}^{r} d_i$, where $d_1, \ldots, d_r$ are the singular values of $L$ and $r$ is the rank of $L$).
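As a quick numerical illustration (not part of the original method), the nuclear norm is simply the sum of the singular values and can be computed directly with NumPy:

```python
import numpy as np

# A diagonal matrix makes the singular values easy to read off.
A = np.diag([3.0, 2.0, 1.0])

# Nuclear norm = sum of singular values.
nuclear_norm = np.linalg.svd(A, compute_uv=False).sum()
print(nuclear_norm)  # 6.0

# The rank, by contrast, only counts the nonzero singular values.
print(np.linalg.matrix_rank(A))  # 3
```

Penalizing the sum of singular values rather than their count yields a convex problem while still driving small singular values to zero.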

However, missing data is commonly encountered in EMAP data, confounding techniques such as cluster analysis and matrix factorization. Here, we extend our basic model (3) to handle EMAP data with missing values by imputing the missing entries of the matrix while estimating the low-rank matrix $L$ and the sparse matrix $S$. Suppose that we only observe a subset of the entries of $X$, indexed by $\Omega$, and the missing entries are indexed by $\bar{\Omega}$. In order to find a low-rank matrix $L$ and a sparse matrix $S$ based on the observed data, we propose to solve the following optimization problem:

$$\min_{L,S} \; \frac{1}{2}\|P_{\Omega}(X - L - S)\|_F^2 + \lambda\,\|L\|_* \quad \text{s.t.} \quad \mathrm{card}(S) \le k. \quad (4)$$

#### 3. Algorithm

Similar to GoDec, the optimization problem (3) can be solved by alternately optimizing the following two subproblems until convergence:

$$L^{t} = \arg\min_{L} \; \frac{1}{2}\|X - L - S^{t-1}\|_F^2 + \lambda\,\|L\|_*, \quad (5a)$$

$$S^{t} = \arg\min_{\mathrm{card}(S) \le k} \; \|X - L^{t} - S\|_F^2. \quad (5b)$$

In each iteration, we optimize the objective function by alternately updating $L$ and $S$. First, the subproblem (5a) can be solved by singular value soft-thresholding [14]. For fixed $S$, the solution of (5a) is

$$L = \mathcal{S}_{\lambda}(X - S) = U D_{\lambda} V^{T}, \quad D_{\lambda} = \mathrm{diag}\left[(d_1 - \lambda)_+, \ldots, (d_r - \lambda)_+\right]. \quad (6)$$

Here, $\lambda$ is a regularization parameter controlling the nuclear norm of the estimated $L$, $U D V^{T}$ is the *Singular Value Decomposition* (SVD) of $X - S$ with $D = \mathrm{diag}[d_1, \ldots, d_r]$, and $t_+ = \max(t, 0)$. The operator $\mathcal{S}_{\lambda}$ refers to *soft-thresholding* [14].
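The soft-thresholding step can be sketched in NumPy as follows (`svt` is our own helper name, not from the paper):

```python
import numpy as np

def svt(Z, lam):
    """Singular value soft-thresholding: U diag((d_i - lam)_+) V^T."""
    U, d, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(d - lam, 0.0)) @ Vt
```

Shrinking every singular value by `lam` and truncating at zero both lowers the rank of the estimate and biases the surviving singular values downward, which is exactly the effect the nuclear-norm penalty induces.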

Next, the subproblem (5b) in (3) can be solved via entry-wise hard thresholding of $X - L$ for fixed $L$. Before giving the solution, we define an orthogonal projection operator $P_{\Omega}$. Suppose there is a subset of the entries of a matrix $Z$, indexed by $\Omega$; then $Z$ can be projected onto the linear space of matrices supported by $\Omega$:

$$\left[P_{\Omega}(Z)\right]_{ij} = \begin{cases} Z_{ij}, & (i,j) \in \Omega, \\ 0, & (i,j) \notin \Omega. \end{cases} \quad (7)$$

And $P_{\bar{\Omega}}$ is its complementary projection; that is, $P_{\bar{\Omega}}(Z) = Z - P_{\Omega}(Z)$.

Then the solution of (5b) is given as follows:

$$S = P_{\Phi}(X - L), \quad (8)$$

where $P_{\Phi}$ is the orthogonal projection operator defined above and $\Phi$ is the index set of the $k$ largest-magnitude entries of $X - L$; that is, $X - L$ is projected onto the linear space of matrices supported by $\Phi$.
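The $S$-step thus amounts to keeping only the $k$ largest-magnitude entries of the residual. A minimal NumPy sketch (the helper name `hard_threshold` is ours):

```python
import numpy as np

def hard_threshold(R, k):
    """Keep the k largest-magnitude entries of R; zero out the rest."""
    S = np.zeros_like(R)
    if k > 0:
        flat = np.argsort(np.abs(R), axis=None)[-k:]  # flat indices of top-k |R|
        idx = np.unravel_index(flat, R.shape)
        S[idx] = R[idx]
    return S
```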

So far we have developed the algorithm for solving problem (3). As for problem (4), due to the existence of missing values, we take the optimization over the observed data, indexed by $\Omega$. We update $L$ and $S$ by solving the following optimization subproblems, respectively:

$$L^{t} = \arg\min_{L} \; \frac{1}{2}\|P_{\Omega}(X - L - S^{t-1})\|_F^2 + \lambda\,\|L\|_*, \quad (11a)$$

$$S^{t} = \arg\min_{\mathrm{card}(S) \le k} \; \|P_{\Omega}(X - L^{t} - S)\|_F^2. \quad (11b)$$

The term $\|P_{\Omega}(X - L - S)\|_F^2$ is the sum of squared errors on the observed entries indexed by $\Omega$.

The subproblem (11a) can be solved by iteratively updating $L$, with an arbitrary initialization, using [14]

$$L \leftarrow \mathcal{S}_{\lambda}\left(P_{\Omega}(X - S) + P_{\bar{\Omega}}(L)\right). \quad (12)$$

The solution of subproblem (11b) is

$$S = P_{\Phi}\left(P_{\Omega}(X - L)\right), \quad (13)$$

where $\Phi$ is the index set of the $k$ largest-magnitude entries of $P_{\Omega}(X - L)$.

Now we have the following algorithm.

*Algorithm 1 (LRSDec).* (i) Input: $X$, $\Omega$, $\lambda$, $k$. Initialize $L = 0$, $S = 0$.

(ii) Iterate until convergence:

$L$-step: iteratively update $L$ using (12).

$S$-step: solve $S$ using (13).

(iii) Output: $\hat{L}$, $\hat{S}$.
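The alternating scheme can be sketched as follows. This is a simplified illustration with our own variable names; the inner iterative $L$-update and convergence checks are reduced to a fixed iteration count, and the unobserved entries are filled from the current estimate at each pass:

```python
import numpy as np

def lrsdec(X, mask, lam, k, n_iter=50):
    """Alternating decomposition X ~ L + S on the observed entries.

    mask is boolean: True where X is observed. Missing entries are
    imputed with the current fit L + S at every iteration.
    """
    L = np.zeros_like(X)
    S = np.zeros_like(X)
    for _ in range(n_iter):
        # Complete X on the unobserved entries with the current fit.
        X_fill = np.where(mask, X, L + S)
        # L-step: soft-threshold the singular values of (X_fill - S).
        U, d, Vt = np.linalg.svd(X_fill - S, full_matrices=False)
        L = U @ np.diag(np.maximum(d - lam, 0.0)) @ Vt
        # S-step: keep the k largest-magnitude observed residuals.
        R = np.where(mask, X - L, 0.0)
        S = np.zeros_like(R)
        top = np.unravel_index(np.argsort(np.abs(R), axis=None)[-k:], R.shape)
        S[top] = R[top]
    return L, S
```

In practice one would stop when the decrease in the objective value falls below a tolerance rather than after a fixed number of iterations.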

The convergence analysis of our algorithms is provided in the Supplementary Material.

#### 4. Parameter Tuning

We have two parameters that need to be tuned in our model: $\lambda$ and $k$. Here, we propose a 10-fold cross validation strategy to select them. The idea is as follows: let $\Omega$ be the index set of the observed entries of $X$. We randomly partition $\Omega$ into 10 equal-size subsets $\Omega_1, \ldots, \Omega_{10}$ and choose training entries $\Omega_{\text{train}} = \Omega \setminus \Omega_i$ and testing entries $\Omega_{\text{test}} = \Omega_i$, $i = 1, \ldots, 10$. We solve the following problem on a grid of $(\lambda, k)$ values on the training data:

$$\min_{L,S} \; \frac{1}{2}\|P_{\Omega_{\text{train}}}(X - L - S)\|_F^2 + \lambda\,\|L\|_* \quad \text{s.t.} \quad \mathrm{card}(S) \le k. \quad (14)$$

Then we evaluate the prediction error (15) on the testing data:

$$\mathrm{err}(\lambda, k) = \|P_{\Omega_{\text{test}}}(X - \hat{L} - \hat{S})\|_F^2. \quad (15)$$

The cross validation process is repeated 10 times. Then we find the optimal parameters $(\lambda, k)$ that minimize the mean prediction error.
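The fold construction can be sketched as follows (a minimal sketch; the function name is ours). For each $(\lambda, k)$ on the grid, one then runs the decomposition with a fold's entries held out and records the squared error on them:

```python
import numpy as np

def cv_folds(mask, n_folds=10, seed=0):
    """Randomly partition the observed entry indices into disjoint folds."""
    rng = np.random.default_rng(seed)
    obs = np.flatnonzero(mask.ravel())  # linear indices of observed entries
    rng.shuffle(obs)
    return np.array_split(obs, n_folds)
```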

#### 5. Results

##### 5.1. Synthetic Data

We simulated a synthetic dataset and then applied the LRSDec and GoDec algorithms to it. Specifically, the low-rank part, the sparse part, and the noise are generated as follows.

(i) Low-rank part: the covariance matrix $\Sigma$ is generated by $\Sigma = F F^{T}$, where $F \in \mathbb{R}^{n \times r}$ and $r \ll n$. Here $r$ is the number of hidden modules, and the random entries of $F$ are drawn from $N(0, 1)$. The rows of $L$ are then drawn from $N(0, \Sigma)$, so that $L$ has rank $r$.

(ii) Sparse part: the nonzero entries in the sparse matrix $S$ are generated from the tail of a Gaussian distribution, beyond its upper quantile. We randomly selected 70% of them and assigned them the negative sign. This is consistent with EMAP datasets, in which negative genetic interactions are much more prevalent than positive ones.

(iii) Noise: $E = \sigma G$, wherein $G$ is a standard Gaussian matrix.

A low-rank matrix $L$ with rank 25 and a sparse matrix $S$ with cardinality 250 were generated, respectively. Now we have

$$X = L + S + E. \quad (16)$$
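A generation of this kind can be sketched as follows. The rank, cardinality, and 70% negative-sign fraction follow the text; the exact distributions, tail cutoff, and noise level are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r, k = 100, 100, 25, 250

# Low-rank part: a product of two Gaussian factors has rank r = 25.
L = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

# Sparse part: k entries from a Gaussian tail (magnitude >= 3, an assumed
# cutoff), with ~70% flipped to the negative sign as in EMAP data.
S = np.zeros(m * n)
support = rng.choice(m * n, size=k, replace=False)
magnitudes = 3.0 + np.abs(rng.standard_normal(k))
signs = np.where(rng.random(k) < 0.7, -1.0, 1.0)
S[support] = signs * magnitudes
S = S.reshape(m, n)

# Additive Gaussian noise with an assumed level sigma = 0.1.
X = L + S + 0.1 * rng.standard_normal((m, n))
```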

The first step is parameter tuning, and the result is shown in Figure 1. The minimal prediction error was achieved at a rank of 25 and a cardinality of 250, which coincides with the rank and cardinality of the synthetic data. This demonstrates the effectiveness of the cross validation procedure.