Complexity

Volume 2017, Article ID 4216797, 11 pages

https://doi.org/10.1155/2017/4216797

## Robust Nonnegative Matrix Factorization via Joint Graph Laplacian and Discriminative Information for Identifying Differentially Expressed Genes

School of Information Science and Engineering, Qufu Normal University, Rizhao 276826, China

Correspondence should be addressed to Jin-Xing Liu; sdcavell@126.com and Chun-Hou Zheng; zhengch99@126.com

Received 17 January 2017; Accepted 6 March 2017; Published 6 April 2017

Academic Editor: Fang X. Wu

Copyright © 2017 Ling-Yun Dai et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Differential expression plays an important role in cancer diagnosis and classification. In recent years, many methods have been used to identify differentially expressed genes. However, the recognition rate and reliability of gene selection still need to be improved. In this paper, a novel constrained method named robust nonnegative matrix factorization via joint graph Laplacian and discriminative information (GLD-RNMF) is proposed for identifying differentially expressed genes, in which manifold learning and discriminative label information are incorporated into the traditional nonnegative matrix factorization model to train the objective matrix. Specifically, $L_{2,1}$-norm minimization is enforced on both the error function and the regularization term, which makes the model robust to outliers and noise in gene expression data. Furthermore, the multiplicative update rules and the details of the convergence proof are given for the new model. The experimental results on two publicly available cancer datasets demonstrate that GLD-RNMF is an effective method for identifying differentially expressed genes.

#### 1. Introduction

Cancer is one of the most serious diseases endangering human health, and millions of people die of cancer every year. With the development of gene sequencing and other gene detection technologies, vast amounts of gene expression data have been generated [1, 2]. Therefore, it is important and challenging for scientists to find pathogenic genes in these data. A microarray chip usually measures the expression of a large number of genes, while the number of samples is far smaller than the number of genes, which makes the identification of differentially expressed genes difficult [3]. In addition, irrelevant or noisy variables may reduce the accuracy of the results. In recent years, many effective mathematical methods have been applied to identify differentially expressed genes. For example, principal component analysis (PCA) [4, 5] and penalized matrix decomposition (PMD) [6] have been used to analyze gene expression data. Liu et al. used robust principal component analysis (RPCA) to discover differentially expressed genes [7]. Zheng et al. employed nonnegative matrix factorization (NMF) for the selection of tumor genes [8]. Cai et al. proposed an algorithm named graph regularized nonnegative matrix factorization (GNMF) for data representation [9]. Wang et al. used robust graph regularized nonnegative matrix factorization (RGNMF) to identify differentially expressed genes [10]. A Class-Information-Based Penalized Matrix Decomposition (CIPMD) algorithm, which introduces class information via a total scatter matrix, was proposed to identify differentially expressed genes in RNA-Seq data [11]. The Consensus Clustering methodology was proposed for microarray data analysis by Giancarlo and Utro [12].

However, two characteristics of gene expression data pose a serious challenge to existing methods. Firstly, many researchers hold that gene expression data probably reside on a low dimensional manifold embedded in a high dimensional ambient space. It is therefore critical to consider the geometrical structure of the original gene expression data. Manifold learning is an effective way to preserve the geometric structure embedded in the original gene expression data [13, 14]. Cai et al. proposed GNMF [9], in which the geometrical structure of the data is encoded by an affinity graph. Another variant of NMF, manifold regularized discriminative nonnegative matrix factorization (MD-NMF), was also introduced [15]; MD-NMF considers both the local geometry of the data and the discriminative information of different classes simultaneously. Long et al. proposed graph regularized discriminative nonnegative matrix factorization (GDNMF) [16], in which both the geometrical structure and discriminative label information are incorporated into the objective function. Secondly, gene expression data often contain many outliers and much noise, which existing methods cannot effectively eliminate; least squares methods, for example, are sensitive to outliers and noise. In recent years, many researchers have been devoted to improving robustness to outliers and noise. Zheng et al. proposed a generalized hierarchical fuzzy c-means algorithm [17], which is robust to noise and outliers. Wang et al. used the $L_{2,1}$-norm to reduce the effect of outliers and noise [10].

A novel algorithm, which we call robust nonnegative matrix factorization via joint graph Laplacian and discriminative information (GLD-RNMF), is proposed to overcome the aforementioned problems together. The proposed algorithm preserves the geometric structure of the data space by constructing an affinity graph and improves the discriminative ability by using supervised label information. To this end, a new matrix decomposition objective function integrating the geometric structure and label information is constructed. In addition, we employ the $L_{2,1}$-norm instead of the squared $L_{2}$-norm on both the error function and the regularization term to reduce the influence of outliers and noise. For completeness, the convergence proof of our iterative scheme is given in the Appendix. Experimental results indicate that the GLD-RNMF algorithm gives better results than other existing algorithms for identifying differentially expressed genes.

The remainder of the paper is arranged as follows. In Section 2, we briefly introduce the relevant mathematical foundations and present the GLD-RNMF algorithm in detail. In Section 3, the results of differentially expressed gene selection using our GLD-RNMF method and four other methods (GNMF, NMFSC, RGNMF, and GDNMF) are shown for comparison. Finally, we conclude the paper in Section 4.

#### 2. Materials and Methods

##### 2.1. Mathematical Definition of the $L_{2,1}$-Norm

The mathematical definition of the $L_{2,1}$-norm [18] of a matrix $A \in \mathbb{R}^{m \times n}$ is

$$\|A\|_{2,1} = \sum_{i=1}^{m}\sqrt{\sum_{j=1}^{n} a_{ij}^{2}} = \sum_{i=1}^{m}\|a^{i}\|_{2},$$

where $a^{i}$ is the $i$th row of $A$. The $L_{2,1}$-norm is interpreted as follows. Firstly, we compute the $L_{2}$-norm of each row $a^{i}$, and then we compute the $L_{1}$-norm of the resulting vector $(\|a^{1}\|_{2}, \ldots, \|a^{m}\|_{2})^{T}$. The value of each element of this vector represents the importance of the corresponding dimension, and minimizing the $L_{2,1}$-norm makes the vector sparse, achieving the purpose of dimension reduction.
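As a quick numerical illustration of this definition (a minimal sketch of ours, not code from the paper), the $L_{2,1}$-norm can be computed by taking the $L_2$-norm of each row and summing:

```python
import numpy as np

def l21_norm(M):
    # L2-norm of each row, then the L1-norm (sum) of those row norms
    return float(np.sum(np.sqrt(np.sum(M ** 2, axis=1))))

M = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [5.0, 12.0]])
# row norms are 5, 0, and 13, so the L2,1-norm is 18
print(l21_norm(M))  # 18.0
```

Note how the all-zero row contributes nothing, which is exactly the row-sparsity that $L_{2,1}$ minimization encourages.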

##### 2.2. Manifold Learning

The purpose of this work is to obtain the best approximation of the original data, while also making the new representation respect the intrinsic Riemannian structure. Many researchers hold that high dimensional data often reside on a much lower dimensional manifold. The “manifold assumption” is that data points that are nearby in the intrinsic geometric structure should also be close under the new basis; such points usually have similar characteristics and can be categorized into the same class. In this paper, we employ manifold learning to achieve this goal.

For a graph with $n$ vertices, each vertex corresponds to a data point. For each data point, we find its $p$ nearest neighbors and connect it with them by edges. There are many ways to define the weight matrix $W$ on the graph, for example, 0-1 weighting, heat kernel weighting, and dot-product weighting. Since 0-1 weighting is the simplest and easiest to compute, we choose it as the measure in this paper.

*0-1 Weight*. $W_{jl} = 1$ if and only if the two nodes $j$ and $l$ are connected by an edge. That is,

$$W_{jl} = \begin{cases} 1, & \text{if } x_{l} \in N_{p}(x_{j}) \text{ or } x_{j} \in N_{p}(x_{l}), \\ 0, & \text{otherwise}, \end{cases}$$

where $N_{p}(x_{j})$ consists of the $p$ nearest neighbors of $x_{j}$ that have the same label as $x_{j}$.

Therefore, the smoothness of the low dimensional representation can be measured as follows:

$$\frac{1}{2}\sum_{j,l=1}^{n}\|v_{j} - v_{l}\|^{2} W_{jl} = \operatorname{Tr}(VDV^{T}) - \operatorname{Tr}(VWV^{T}) = \operatorname{Tr}(VLV^{T}),$$

where $\operatorname{Tr}(\cdot)$ denotes the trace of a matrix, $D$ is a diagonal matrix whose entries are the row sums (or column sums, since $W$ is symmetric) of $W$, that is, $D_{jj} = \sum_{l} W_{jl}$, and $L = D - W$ is the graph Laplacian matrix. The distance between two points in the low dimensional space is measured by the squared Euclidean distance $\|v_{j} - v_{l}\|^{2}$.
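The trace form of the smoothness measure can be verified numerically. The sketch below (illustrative variable names and random data of our choosing) builds a random symmetric 0-1 weight matrix and compares the pairwise-distance sum with the Laplacian trace:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3
V = rng.random((k, n))                      # columns v_j: low dimensional points
W = np.triu(rng.integers(0, 2, (n, n)), 1)  # random 0-1 weights above the diagonal
W = W + W.T                                 # symmetric weight matrix, zero diagonal
D = np.diag(W.sum(axis=1))                  # degree matrix (row sums of W)
L = D - W                                   # graph Laplacian

lhs = 0.5 * sum(W[j, l] * np.sum((V[:, j] - V[:, l]) ** 2)
                for j in range(n) for l in range(n))
rhs = float(np.trace(V @ L @ V.T))
```

Both quantities agree up to floating point error, confirming the identity used in the regularization term.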

##### 2.3. Nonnegative Matrix Factorization (NMF)

We review standard NMF in this section. Although NMF has been widely used in many applications, it still has several shortcomings.

Given $n$ nonnegative samples in $\mathbb{R}^{m}$, arranged as the columns of a matrix $X = [x_{1}, \ldots, x_{n}] \in \mathbb{R}^{m \times n}$; in this paper, each column of $X$ represents the transcriptional response of the $m$ genes in one sample and each row of $X$ represents the expression level of one gene across all samples. Letting $U \in \mathbb{R}^{m \times k}$, $V \in \mathbb{R}^{k \times n}$, and $k < \min(m, n)$, NMF decomposes $X$ into the product of $U$ and $V$; that is, $X \approx UV$.

To find an approximate factorization $X \approx UV$, two update rules were introduced [19]. One common objective function minimizes the square of the Euclidean distance between $X$ and $UV$. The optimization problem is described as follows:

$$\min_{U \geq 0,\, V \geq 0} \|X - UV\|_{F}^{2},$$

where $\|\cdot\|_{F}$ denotes the matrix Frobenius norm. The corresponding update rules are as follows:

$$u_{ik} \leftarrow u_{ik}\frac{(XV^{T})_{ik}}{(UVV^{T})_{ik}}, \qquad v_{kj} \leftarrow v_{kj}\frac{(U^{T}X)_{kj}}{(U^{T}UV)_{kj}}.$$

The convergence of the above optimization rules has been proven [19].
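These updates can be sketched on toy random data as follows; the small `eps` added to the denominators is our own numerical guard against division by zero, not part of [19]:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 20, 15, 4
X = rng.random((m, n))
U = rng.random((m, k))
V = rng.random((k, n))
eps = 1e-12                                   # guard against division by zero

errors = []
for _ in range(100):
    U *= (X @ V.T) / (U @ V @ V.T + eps)      # multiplicative update for U
    V *= (U.T @ X) / (U.T @ U @ V + eps)      # multiplicative update for V
    errors.append(float(np.linalg.norm(X - U @ V, "fro") ** 2))
# the squared Frobenius error is nonincreasing across iterations
```

Tracking `errors` is a simple empirical check of the convergence property proven in [19].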

##### 2.4. Graph Regularized Discriminative Nonnegative Matrix Factorization (GDNMF)

Supervised label information is added to the objective function of GNMF [16]. The definition and iterative rules of GDNMF are presented below.

The class indicator matrix $S \in \mathbb{R}^{c \times n}$ is defined as follows:

$$s_{lj} = \begin{cases} 1, & \text{if } y_{j} = l, \\ 0, & \text{otherwise}, \end{cases}$$

where $y_{j}$ is the class label of sample $x_{j}$ and $c$ is the total number of classes in $X$.
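Constructing $S$ from a vector of labels is straightforward; a minimal sketch (the function name is ours):

```python
import numpy as np

def class_indicator(y, c):
    # S[l, j] = 1 if sample j carries class label l, else 0
    S = np.zeros((c, len(y)))
    S[y, np.arange(len(y))] = 1.0
    return S

# four samples with labels 0, 1, 1, 2 among c = 3 classes
S = class_indicator(np.array([0, 1, 1, 2]), 3)
```

Each column of $S$ contains exactly one 1, marking the class of the corresponding sample.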

The objective function of GDNMF is formulated as follows:

$$\min_{U \geq 0,\, V \geq 0,\, Q \geq 0} \|X - UV\|_{F}^{2} + \lambda \operatorname{Tr}(VLV^{T}) + \mu \|S - QV\|_{F}^{2},$$

where $Q \in \mathbb{R}^{c \times k}$ maps the new representation $V$ onto the label space.

The corresponding update rules are as follows:

$$u_{ik} \leftarrow u_{ik}\frac{(XV^{T})_{ik}}{(UVV^{T})_{ik}},$$

$$v_{kj} \leftarrow v_{kj}\frac{(U^{T}X + \lambda VW + \mu Q^{T}S)_{kj}}{(U^{T}UV + \lambda VD + \mu Q^{T}QV)_{kj}},$$

$$q_{lk} \leftarrow q_{lk}\frac{(SV^{T})_{lk}}{(QVV^{T})_{lk}},$$

where $Q$ is initialized to a random nonnegative matrix and $\lambda$ and $\mu$ are nonnegative regularization parameters. Essentially, GDNMF incorporates the graph Laplacian and supervised label information into the objective function of NMF, which keeps the algorithm consistent with the intuitive geometric structure of the data and improves the discriminative power between different classes.

##### 2.5. Robust Nonnegative Matrix Factorization via Joint Graph Laplacian and Discriminative Information (GLD-RNMF)

###### 2.5.1. The Objective Function

For the purpose of dimension reduction, NMF represents the original data by a product of a nonnegative basis matrix $U$ and a coefficient matrix $V$. The approximation error is calculated from the squared residuals; that is, $\|X - UV\|_{F}^{2} = \sum_{i,j}(X - UV)_{ij}^{2}$. Because of the squared terms in the objective function, even a few outliers can lead to large errors. In this paper, we enforce the $L_{2,1}$-norm on the objective function to reduce the impact of outliers and noise.

By employing the $L_{2,1}$-norm in the GDNMF model, we can formulate the objective function of GLD-RNMF as follows:

$$\min_{U \geq 0,\, V \geq 0,\, Q \geq 0} O = \|X - UV\|_{2,1} + \lambda \operatorname{Tr}(VLV^{T}) + \mu \|S - QV\|_{2,1}.$$

This objective function can handle high dimensional, nonnegative, noisy, and sparse data simultaneously; it keeps the factorization consistent with the intuitive geometric structure of the data and improves the discriminative power between different classes.

###### 2.5.2. The Multiplicative Update Rules of GLD-RNMF

Although the objective function is not jointly convex in $(U, V, Q)$, it is convex with respect to any one of these variables when the other two are fixed. Using the row-wise definition of the $L_{2,1}$-norm, the objective function can be rewritten as follows:

$$O = \operatorname{Tr}\big((X - UV)^{T}D_{1}(X - UV)\big) + \lambda \operatorname{Tr}(VLV^{T}) + \mu \operatorname{Tr}\big((S - QV)^{T}D_{2}(S - QV)\big),$$

where $D_{1}$ and $D_{2}$ are both diagonal matrices whose diagonal elements are as follows:

$$(D_{1})_{ii} = \frac{1}{2\|(X - UV)^{i}\|_{2} + \varepsilon}, \qquad (D_{2})_{ll} = \frac{1}{2\|(S - QV)^{l}\|_{2} + \varepsilon},$$

in which $(\cdot)^{i}$ denotes the $i$th row of a matrix and $\varepsilon$ is an infinitesimal positive number that prevents the denominators from vanishing.

In order to solve the above optimization problem, we introduce the Lagrange multipliers $\Phi$, $\Psi$, and $\Theta$ for the constraints $U \geq 0$, $V \geq 0$, and $Q \geq 0$, respectively. Firstly, we formulate the Lagrange function of GLD-RNMF as follows:

$$\mathcal{L} = O + \operatorname{Tr}(\Phi U^{T}) + \operatorname{Tr}(\Psi V^{T}) + \operatorname{Tr}(\Theta Q^{T}).$$

Taking the partial derivatives of $\mathcal{L}$ with respect to $U$, $V$, and $Q$, setting them to zero, and treating $D_{1}$ and $D_{2}$ as fixed within each iteration, we get

$$\frac{\partial \mathcal{L}}{\partial U} = -2D_{1}XV^{T} + 2D_{1}UVV^{T} + \Phi = 0,$$

$$\frac{\partial \mathcal{L}}{\partial V} = -2U^{T}D_{1}X + 2U^{T}D_{1}UV + 2\lambda VL - 2\mu Q^{T}D_{2}S + 2\mu Q^{T}D_{2}QV + \Psi = 0,$$

$$\frac{\partial \mathcal{L}}{\partial Q} = -2D_{2}SV^{T} + 2D_{2}QVV^{T} + \Theta = 0.$$

According to the KKT (Karush-Kuhn-Tucker) conditions [20], that is, $\Phi_{ik}u_{ik} = 0$, $\Psi_{kj}v_{kj} = 0$, and $\Theta_{lk}q_{lk} = 0$, we can obtain the following equations:

$$\big(-D_{1}XV^{T} + D_{1}UVV^{T}\big)_{ik}\, u_{ik} = 0,$$

$$\big(-U^{T}D_{1}X + U^{T}D_{1}UV + \lambda VL - \mu Q^{T}D_{2}S + \mu Q^{T}D_{2}QV\big)_{kj}\, v_{kj} = 0,$$

$$\big(-D_{2}SV^{T} + D_{2}QVV^{T}\big)_{lk}\, q_{lk} = 0.$$

Then, substituting $L = D - W$, we can get the multiplicative update rules as follows:

$$u_{ik} \leftarrow u_{ik}\frac{(D_{1}XV^{T})_{ik}}{(D_{1}UVV^{T})_{ik}},$$

$$v_{kj} \leftarrow v_{kj}\frac{(U^{T}D_{1}X + \lambda VW + \mu Q^{T}D_{2}S)_{kj}}{(U^{T}D_{1}UV + \lambda VD + \mu Q^{T}D_{2}QV)_{kj}},$$

$$q_{lk} \leftarrow q_{lk}\frac{(D_{2}SV^{T})_{lk}}{(D_{2}QVV^{T})_{lk}}.$$

The details of our method are described in Algorithm 1. The iterative procedure is performed until the algorithm converges.
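The iterative scheme can be sketched on toy data as follows. This is an illustrative implementation under our own reconstruction of the row-wise $L_{2,1}$ updates; the regularization parameters, graph construction, and variable names are illustrative choices, not values taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k, c = 30, 20, 4, 2
X = rng.random((m, n))                          # gene expression: genes x samples
y = rng.integers(0, c, n)                       # class labels of the samples
S = np.zeros((c, n)); S[y, np.arange(n)] = 1.0  # class indicator matrix

# supervised 0-1 weight graph: connect samples that share a label
W = (y[:, None] == y[None, :]).astype(float)
np.fill_diagonal(W, 0.0)
Dg = np.diag(W.sum(axis=1))                     # degree matrix
Lg = Dg - W                                     # graph Laplacian

U, V, Q = rng.random((m, k)), rng.random((k, n)), rng.random((c, k))
lam, mu, eps = 0.1, 0.1, 1e-9                   # illustrative parameter values

def l21(M):                                     # row-wise L2,1-norm
    return float(np.linalg.norm(M, axis=1).sum())

def objective():
    return l21(X - U @ V) + lam * float(np.trace(V @ Lg @ V.T)) + mu * l21(S - Q @ V)

obj_start = objective()
for _ in range(200):
    # reweighting matrices, recomputed from the current residuals
    D1 = np.diag(1.0 / (2.0 * np.linalg.norm(X - U @ V, axis=1) + eps))
    D2 = np.diag(1.0 / (2.0 * np.linalg.norm(S - Q @ V, axis=1) + eps))
    U *= (D1 @ X @ V.T) / (D1 @ U @ V @ V.T + eps)
    V *= (U.T @ D1 @ X + lam * V @ W + mu * Q.T @ D2 @ S) / \
         (U.T @ D1 @ U @ V + lam * V @ Dg + mu * Q.T @ D2 @ Q @ V + eps)
    Q *= (D2 @ S @ V.T) / (D2 @ Q @ V @ V.T + eps)
obj_end = objective()
```

Monitoring the objective value before and after the loop gives a quick empirical check that the iterations reduce the $L_{2,1}$-regularized objective while keeping all three factors nonnegative.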