Abstract

Differential expression analysis plays an important role in cancer diagnosis and classification. In recent years, many methods have been used to identify differentially expressed genes. However, the recognition rate and reliability of gene selection still need to be improved. In this paper, a novel constrained method named robust nonnegative matrix factorization via joint graph Laplacian and discriminative information (GLD-RNMF) is proposed for identifying differentially expressed genes, in which manifold learning and discriminative label information are incorporated into the traditional nonnegative matrix factorization model to train the objective matrix. Specifically, $L_{2,1}$-norm minimization is enforced on both the error function and the regularization term, which makes the model robust to outliers and noise in gene data. Furthermore, the multiplicative update rules and the details of the convergence proof are given for the new model. The experimental results on two publicly available cancer datasets demonstrate that GLD-RNMF is an effective method for identifying differentially expressed genes.

1. Introduction

Cancer is one of the most serious diseases that endanger human health. Millions of people die of cancer every year. With the development of gene sequencing and other gene detection technologies, huge amounts of gene data have been generated [1, 2]. Therefore, it is important and challenging for scientists to find pathogenic genes in large volumes of gene expression data. Microarray datasets usually contain many gene expression measurements per chip, and the number of samples is far smaller than the number of genes, which makes the identification of differentially expressed genes difficult [3]. In addition, irrelevant or noisy variables may reduce the accuracy of the results. In recent years, many effective mathematical methods have been applied to identify differentially expressed genes. For example, principal component analysis (PCA) [4, 5] and penalized matrix decomposition (PMD) [6] have been used to analyze gene expression data. Liu et al. used robust principal component analysis (RPCA) to discover differentially expressed genes [7]. Zheng et al. employed nonnegative matrix factorization (NMF) for the selection of tumor genes [8]. Cai et al. proposed an algorithm named graph regularized nonnegative matrix factorization (GNMF) for data representation [9]. Wang et al. used robust graph regularized nonnegative matrix factorization (RGNMF) for identifying differentially expressed genes [10]. A CIPMD (Class-Information-Based Penalized Matrix Decomposition) algorithm was proposed to identify differentially expressed genes in RNA-Seq data, introducing the class information via a total scatter matrix [11]. The Consensus Clustering methodology was proposed for microarray data analysis by Giancarlo and Utro [12].

However, two characteristics of gene expression data pose a serious challenge to the existing methods. Firstly, many researchers hold that gene expression data probably reside on a low dimensional manifold embedded in a high dimensional ambient space. Therefore, it is critical to consider the geometrical structure of the original gene expression data. Manifold learning is an effective way to preserve the geometric structure embedded in the original gene expression data [13, 14]. Cai et al. proposed GNMF [9], in which the geometrical structure of the data is encoded by an affinity graph. Another variant of NMF called manifold regularized discriminative nonnegative matrix factorization (MD-NMF) was also introduced [15]; MD-NMF considers both the local geometry of the data and the discriminative information of different classes simultaneously. Long et al. proposed a method called graph regularized discriminative nonnegative matrix factorization (GDNMF) [16], in which both the geometrical structure and discriminative label information are incorporated into the objective function. Secondly, gene expression data often contain many outliers and much noise, which existing methods cannot effectively eliminate; for example, least squares methods are sensitive to outliers and noise. In recent years, many researchers have therefore worked on improving robustness to outliers and noise. Zheng et al. proposed an algorithm named generalized hierarchical fuzzy -means [17], which is robust to noise and outliers. Wang et al. used the $L_{2,1}$-norm to reduce the effect of outliers and noise [10].

A novel algorithm, which we call robust nonnegative matrix factorization via joint graph Laplacian and discriminative information (GLD-RNMF), is proposed to overcome both of the aforementioned problems. The proposed algorithm preserves the geometric structure of the data space by constructing an affinity graph and improves the discriminative ability through supervised label information. To do so, a new matrix decomposition objective function is constructed by integrating the geometric structure and the label information. In addition, we employ the $L_{2,1}$-norm instead of the $L_{2}$-norm on both the error function and the regularization term to reduce the influence of outliers and noise. For completeness, the convergence proof of our iterative scheme is given in the Appendix. Experimental results indicate that the GLD-RNMF algorithm outperforms other existing algorithms for identifying differentially expressed genes.

The remainder of the paper is arranged as follows. In Section 2, we briefly introduce some relevant mathematical foundation and propose the GLD-RNMF algorithm in detail. In Section 3, the results of differentially expressed gene selection using our GLD-RNMF method and the other four methods (GNMF, NMFSC, RGNMF, and GDNMF) are shown for comparison. Finally, we conclude this paper in Section 4.

2. Materials and Methods

2.1. Mathematical Definition of the $L_{2,1}$-Norm

The mathematical definition of the $L_{2,1}$-norm [18] is
$$\|A\|_{2,1} = \sum_{i=1}^{m} \|a^{i}\|_{2} = \sum_{i=1}^{m} \sqrt{\sum_{j=1}^{n} a_{ij}^{2}},$$
where $a^{i}$ is the $i$th row of $A$ and $A$ is an $m \times n$ matrix. The $L_{2,1}$-norm is interpreted as follows. Firstly, we compute the $L_{2}$-norm of each row vector $a^{i}$; then we compute the $L_{1}$-norm of the resulting vector $(\|a^{1}\|_{2}, \ldots, \|a^{m}\|_{2})^{T}$. The value of each element of this vector represents the importance of the corresponding dimension, and the $L_{1}$ step makes the vector sparse, which achieves the purpose of dimension reduction.
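As a quick sanity check, the definition above can be computed in a few lines of NumPy (the function name `l21_norm` is ours, not from the paper):

```python
import numpy as np

def l21_norm(A):
    """L2,1-norm: the L1-norm of the vector of row-wise L2-norms of A."""
    return np.sqrt((A ** 2).sum(axis=1)).sum()

A = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [5.0, 12.0]])
# Row norms are 5, 0, and 13, so the L2,1-norm of A is 18.
l21 = l21_norm(A)
```

Note that the zero row contributes nothing, while each nonzero row contributes its full Euclidean length; this is what makes the norm encourage row sparsity.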

2.2. Manifold Learning

The purpose of this work is to obtain the best approximation of the original data. We also hope that the new representation respects the intrinsic Riemannian structure. Many researchers hold that high dimensional data often reside on a much lower dimensional manifold. The "manifold assumption" is that data points that are nearby in the intrinsic geometry remain close under the new basis. Therefore, they usually have similar characteristics and can be categorized into the same class. In this paper, we employ manifold learning to achieve this goal.

For a graph whose vertices correspond to the data points, we can find the $k$ nearest neighbors of each data point and connect it with them. There are many ways to define the weight matrix on the graph, for example, 0-1 weighting, heat kernel weighting, and dot-product weighting. Considering that 0-1 weighting is the simplest and easiest to compute, we choose 0-1 weighting as the measure in this paper.

0-1 Weighting. $W_{jl} = 1$ if and only if the two nodes $j$ and $l$ are connected by an edge. That is,
$$W_{jl} = \begin{cases} 1, & \text{if } x_{l} \in N_{k}(x_{j}), \\ 0, & \text{otherwise}, \end{cases}$$
where $N_{k}(x_{j})$ consists of the $k$ nearest neighbors of $x_{j}$ and the neighbors have the same label as $x_{j}$.

Therefore, the smoothness of the low dimensional representation can be measured as follows:
$$R = \frac{1}{2} \sum_{j,l=1}^{N} \|v_{j} - v_{l}\|^{2} W_{jl} = \mathrm{Tr}(V D V^{T}) - \mathrm{Tr}(V W V^{T}) = \mathrm{Tr}(V L V^{T}),$$
where $\mathrm{Tr}(\cdot)$ represents the trace of a matrix and $v_{j}$ is the low dimensional representation of the $j$th data point. $D$ is a diagonal matrix whose entries are the row sums (or column sums, because $W$ is symmetric) of $W$; that is, $D_{jj} = \sum_{l} W_{jl}$. $L = D - W$ is the graph Laplacian matrix. We measure the distance between two points in the low dimensional space by the Euclidean distance $\|v_{j} - v_{l}\|^{2}$.
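As an illustration, the 0-1 graph, the Laplacian $L = D - W$, and the smoothness term $\mathrm{Tr}(V L V^{T})$ can be built in a few lines of NumPy. This unsupervised sketch omits the same-label constraint for brevity, and all names are ours:

```python
import numpy as np

def knn_01_graph(X, k):
    """0-1 weight matrix: W[j, l] = 1 iff l is among the k nearest
    neighbours of j (symmetrised). Columns of X are the data points."""
    n = X.shape[1]
    d = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # pairwise squared distances
    W = np.zeros((n, n))
    for j in range(n):
        W[j, np.argsort(d[j])[1:k + 1]] = 1.0               # skip the point itself
    return np.maximum(W, W.T)

rng = np.random.default_rng(0)
X = rng.random((4, 10))                  # 10 points in R^4, one per column
W = knn_01_graph(X, k=3)
D = np.diag(W.sum(axis=1))               # degree matrix
L = D - W                                # graph Laplacian

V = rng.random((2, 10))                  # a 2-dimensional representation
smoothness = np.trace(V @ L @ V.T)       # equals 0.5 * sum_jl W_jl * ||v_j - v_l||^2
```

The trace form and the pairwise-distance form are equal, which is the identity the regularizer relies on.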

2.3. Nonnegative Matrix Factorization (NMF)

We review the standard NMF in this section. Although NMF has been widely used in many fields, it still has several shortcomings.

Given a nonnegative data matrix $X \in \mathbb{R}^{m \times n}$, in this paper each row of $X$ represents the transcriptional response of the genes in one sample and each column of $X$ represents the expression level of a gene across all samples. Letting $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{r \times n}$ be nonnegative matrices with $r < \min(m, n)$, NMF decomposes $X$ into the product of $U$ and $V$; that is, $X \approx UV$.

To ensure an approximate factorization $X \approx UV$, two update rules are introduced [19]. One objective function is constructed by minimizing the square of the Euclidean distance between $X$ and $UV$. The optimization problem is described as follows:
$$\min_{U \geq 0,\, V \geq 0} \|X - UV\|_{F}^{2},$$
where $\|\cdot\|_{F}$ denotes the matrix Frobenius norm. The corresponding optimization rules are as follows:
$$u_{ik} \leftarrow u_{ik} \frac{(X V^{T})_{ik}}{(U V V^{T})_{ik}}, \qquad v_{kj} \leftarrow v_{kj} \frac{(U^{T} X)_{kj}}{(U^{T} U V)_{kj}}.$$

The convergence of the above optimization rules has been proven [19].
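A minimal NumPy sketch of these Lee-Seung updates follows; the small `eps` added to the denominators to avoid division by zero is our own safeguard, not part of the original rules:

```python
import numpy as np

def nmf(X, r, n_iter=200, eps=1e-10):
    """Lee-Seung multiplicative updates for min ||X - U V||_F^2 s.t. U, V >= 0."""
    rng = np.random.default_rng(0)
    m, n = X.shape
    U = rng.random((m, r))
    V = rng.random((r, n))
    for _ in range(n_iter):
        U *= (X @ V.T) / (U @ V @ V.T + eps)   # update rule for U
        V *= (U.T @ X) / (U.T @ U @ V + eps)   # update rule for V
    return U, V

X = np.abs(np.random.default_rng(1).random((20, 12)))
U, V = nmf(X, r=3)
err = np.linalg.norm(X - U @ V, "fro")         # reconstruction error
```

Because the updates only multiply by nonnegative ratios, $U$ and $V$ stay nonnegative throughout.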

2.4. Graph Regularized Discriminative Nonnegative Matrix Factorization (GDNMF)

In GDNMF, supervised label information is added to the objective function of GNMF [16]. The definition and iterative rules of GDNMF are presented below.

The class indicator matrix $S \in \mathbb{R}^{c \times n}$ is defined as follows:
$$S_{ij} = \begin{cases} 1, & \text{if } c_{j} = i, \\ 0, & \text{otherwise}, \end{cases}$$
where $c_{j}$ is the class label of the $j$th sample and $c$ is the total number of classes in $X$.
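Built explicitly, the indicator matrix is one-hot per column; a minimal sketch (classes are indexed from 0 here, and the helper name is ours):

```python
import numpy as np

def class_indicator(labels, n_classes):
    """S[i, j] = 1 if sample j belongs to class i, and 0 otherwise."""
    S = np.zeros((n_classes, len(labels)))
    for j, c in enumerate(labels):
        S[c, j] = 1.0
    return S

labels = [0, 1, 1, 2, 0]
S = class_indicator(labels, 3)
# Each column of S is the one-hot encoding of the corresponding sample's class.
```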

The objective function of GDNMF is formulated as follows:
$$\min_{U \geq 0,\, V \geq 0,\, A \geq 0} \|X - UV\|_{F}^{2} + \lambda\, \mathrm{Tr}(V L V^{T}) + \beta\, \|S - AV\|_{F}^{2}.$$

The corresponding optimization rules are as follows:
$$u_{ik} \leftarrow u_{ik} \frac{(X V^{T})_{ik}}{(U V V^{T})_{ik}}, \qquad a_{ik} \leftarrow a_{ik} \frac{(S V^{T})_{ik}}{(A V V^{T})_{ik}},$$
$$v_{kj} \leftarrow v_{kj} \frac{(U^{T} X + \lambda V W + \beta A^{T} S)_{kj}}{(U^{T} U V + \lambda V D + \beta A^{T} A V)_{kj}},$$
where $A$ is initialized to a random nonnegative matrix and $\lambda$ and $\beta$ are nonnegative regularization parameters. Essentially, GDNMF incorporates the graph Laplacian and supervised label information into the objective function of NMF, which keeps the representation consistent with the intuitive geometric structure of the data and improves the discriminative power between different classes.

2.5. Robust Nonnegative Matrix Factorization via Joint Graph Laplacian and Discriminative Information (GLD-RNMF)
2.5.1. The Objective Function

For the purpose of dimension reduction, NMF represents the original data by a product of a nonnegative basis matrix $U$ and coefficient matrix $V$. The approximation error is calculated from the squared residuals; that is, $\|X - UV\|_{F}^{2} = \sum_{i,j} (x_{ij} - (UV)_{ij})^{2}$. Due to the squared terms in the objective function, even moderate outliers can lead to large errors. In this paper, we enforce an $L_{2,1}$-norm constraint on the objective function to reduce the impact of outliers and noise.

By employing the $L_{2,1}$-norm in the GDNMF model, we can formulate the objective function of GLD-RNMF as follows:
$$\min_{U \geq 0,\, V \geq 0,\, A \geq 0} \|X - UV\|_{2,1} + \lambda\, \mathrm{Tr}(V L V^{T}) + \beta\, \|S - AV\|_{2,1}. \qquad (9)$$

This objective function can handle high dimensional, nonnegative, noisy, and sparse data simultaneously, keep the representation consistent with the intuitive geometric structure of the data, and improve the discriminative power between different classes.

2.5.2. The Multiplicative Update Rules of GLD-RNMF

Although the objective function is not jointly convex in $(U, V, A)$, it is convex with respect to any one of the variables when the other two are fixed. The objective function can be expanded as follows:
$$O = \mathrm{Tr}\big((X - UV)^{T} D_{1} (X - UV)\big) + \lambda\, \mathrm{Tr}(V L V^{T}) + \beta\, \mathrm{Tr}\big((S - AV)^{T} D_{2} (S - AV)\big),$$
where $D_{1}$ and $D_{2}$ are both diagonal matrices whose diagonal elements are
$$(D_{1})_{ii} = \frac{1}{2\|(X - UV)^{i}\|_{2} + \varepsilon}, \qquad (11)$$
$$(D_{2})_{ii} = \frac{1}{2\|(S - AV)^{i}\|_{2} + \varepsilon}, \qquad (12)$$
in which $(\cdot)^{i}$ denotes the $i$th row of a matrix and $\varepsilon$ is an infinitesimally small positive number.

In order to solve the optimization problem in (9), we introduce the Lagrange multipliers $\Psi$, $\Theta$, and $\Phi$ for the constraints $U \geq 0$, $V \geq 0$, and $A \geq 0$, respectively. Firstly, we formulate the Lagrange function of GLD-RNMF as follows:
$$\mathcal{L} = \mathrm{Tr}\big((X - UV)^{T} D_{1} (X - UV)\big) + \lambda\, \mathrm{Tr}(V L V^{T}) + \beta\, \mathrm{Tr}\big((S - AV)^{T} D_{2} (S - AV)\big) + \mathrm{Tr}(\Psi U^{T}) + \mathrm{Tr}(\Theta V^{T}) + \mathrm{Tr}(\Phi A^{T}).$$

Taking the partial derivatives of $\mathcal{L}$ with respect to $U$, $V$, and $A$, setting them to zero, and in view of $L = D - W$, we get
$$\frac{\partial \mathcal{L}}{\partial U} = -2 D_{1} X V^{T} + 2 D_{1} U V V^{T} + \Psi = 0,$$
$$\frac{\partial \mathcal{L}}{\partial V} = -2 U^{T} D_{1} X + 2 U^{T} D_{1} U V + 2\lambda V L - 2\beta A^{T} D_{2} S + 2\beta A^{T} D_{2} A V + \Theta = 0,$$
$$\frac{\partial \mathcal{L}}{\partial A} = -2 D_{2} S V^{T} + 2 D_{2} A V V^{T} + \Phi = 0.$$

According to the KKT (Karush-Kuhn-Tucker) conditions [20], that is, $\Psi_{ik} u_{ik} = 0$, $\Theta_{kj} v_{kj} = 0$, and $\Phi_{ik} a_{ik} = 0$, we can obtain the following equations:
$$\big(D_{1} U V V^{T} - D_{1} X V^{T}\big)_{ik} u_{ik} = 0,$$
$$\big(U^{T} D_{1} U V + \lambda V D + \beta A^{T} D_{2} A V - U^{T} D_{1} X - \lambda V W - \beta A^{T} D_{2} S\big)_{kj} v_{kj} = 0,$$
$$\big(D_{2} A V V^{T} - D_{2} S V^{T}\big)_{ik} a_{ik} = 0.$$

Then we can get the multiplicative updating rules as follows:
$$u_{ik} \leftarrow u_{ik} \frac{(D_{1} X V^{T})_{ik}}{(D_{1} U V V^{T})_{ik}}, \qquad (16)$$
$$a_{ik} \leftarrow a_{ik} \frac{(D_{2} S V^{T})_{ik}}{(D_{2} A V V^{T})_{ik}}, \qquad (17)$$
$$v_{kj} \leftarrow v_{kj} \frac{(U^{T} D_{1} X + \lambda V W + \beta A^{T} D_{2} S)_{kj}}{(U^{T} D_{1} U V + \lambda V D + \beta A^{T} D_{2} A V)_{kj}}. \qquad (18)$$

The details of our method are described in Algorithm 1. The iterative procedure is performed until the algorithm converges.

Input: Data matrix $X$, indicator matrix $S$, parameters $\lambda$, $\beta$, and $k$.
Output: Matrices $U$, $V$, and $A$.
Initialization: Randomly initialize three nonnegative matrices $U$, $V$,
    and $A$; initialize $D_{1}$ and $D_{2}$ to be identity
   matrices. Set $t = 0$.
Repeat
   Update $U$, $A$, and $V$ separately by (16), (17), and (18).
   Calculate the diagonal matrices $D_{1}$ and $D_{2}$ by (11) and (12), separately.
Until convergence.
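Algorithm 1 can be sketched in NumPy as follows. This is our own reading of the reconstructed update rules, not the authors' reference implementation; the function names, the settings `lam=100` and `beta=100`, the toy same-label graph, and the convention that columns of `X` are the labeled samples are all illustrative assumptions:

```python
import numpy as np

def row_weight_matrix(E, eps=1e-10):
    """Diagonal D with D_ii = 1 / (2 * ||e_i||_2 + eps) for each row e_i of E:
    the reweighting that relates the L2,1-norm to a weighted Frobenius norm."""
    return np.diag(1.0 / (2.0 * np.sqrt((E ** 2).sum(axis=1)) + eps))

def gld_rnmf(X, S, W, r, lam=100.0, beta=100.0, n_iter=100, eps=1e-10):
    """Multiplicative updates for the reconstructed GLD-RNMF objective
    ||X - UV||_{2,1} + lam * Tr(V L V^T) + beta * ||S - AV||_{2,1}."""
    rng = np.random.default_rng(0)
    m, n = X.shape
    c = S.shape[0]
    U = rng.random((m, r))
    V = rng.random((r, n))
    A = rng.random((c, r))
    Dg = np.diag(W.sum(axis=1))                   # graph degree matrix
    for _ in range(n_iter):
        D1 = row_weight_matrix(X - U @ V)         # reweighting for the error term
        D2 = row_weight_matrix(S - A @ V)         # reweighting for the label term
        U *= (D1 @ X @ V.T) / (D1 @ U @ V @ V.T + eps)
        A *= (D2 @ S @ V.T) / (D2 @ A @ V @ V.T + eps)
        V *= (U.T @ D1 @ X + lam * V @ W + beta * A.T @ D2 @ S) / \
             (U.T @ D1 @ U @ V + lam * V @ Dg + beta * A.T @ D2 @ A @ V + eps)
    return U, V, A

# Toy usage: 15 features x 8 labeled samples.
rng = np.random.default_rng(2)
X = np.abs(rng.random((15, 8)))
labels = [0, 0, 1, 1, 0, 1, 0, 1]
S = np.zeros((2, 8))
for j, c in enumerate(labels):
    S[c, j] = 1.0                                 # one-hot class indicator
W = np.clip(S.T @ S - np.eye(8), 0.0, 1.0)        # toy graph: connect same-label samples
U, V, A = gld_rnmf(X, S, W, r=3)
```

Recomputing `D1` and `D2` from the current residuals at each pass is the iteratively reweighted step that realizes the $L_{2,1}$-norm.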

Considering the three update rules above, we ensure the convergence of the algorithm by the following theorem.

Theorem 1. The objective function is nonincreasing under the iterative rules in (16), (17), and (18).
The detailed proof of the theorem is shown in the Appendix.

3. Results and Discussion

In order to verify the effectiveness of GLD-RNMF algorithm for identifying differentially expressed genes, we perform experiments on real gene expression datasets to compare our algorithm with the other four feature extraction algorithms: (a) GNMF algorithm (Cai et al. [9]); (b) NMFSC algorithm (Hoyer [21]); (c) RGNMF algorithm (Wang et al. [10]); (d) GDNMF algorithm (Long et al. [16]). We conduct these experiments on two publicly available cancer datasets: pancreatic cancer dataset (PAAD) and cholangiocarcinoma dataset (CHOL).

3.1. Identifying Differentially Expressed Genes by GLD-RNMF

In this section, we use GLD-RNMF to identify differentially expressed genes. The matrix $X$ of size $m \times n$ is the original gene expression data. Each row of $X$ indicates the transcriptional response of the genes in one sample. Each column of $X$ indicates the expression level of one gene across all samples. Therefore, $X$ can be written as follows:
$$X \approx UV,$$
where $U$ is the basis matrix of size $m \times r$ and $V$ is the coefficient matrix of size $r \times n$, with $r < \min(m, n)$. Since the matrix $V$ contains all of the genes, the differentially expressed genes can be identified from the matrix $V$ [10]. By GLD-RNMF, the evaluating vector $\hat{v}$, whose $j$th entry scores the $j$th gene according to the magnitude of the $j$th column of $V$, is obtained and its elements are sorted in descending order:
$$\hat{v}_{1} \geq \hat{v}_{2} \geq \cdots \geq \hat{v}_{n}.$$

Generally, the larger an entry of $\hat{v}$ is, the more differentially expressed the corresponding gene is. Therefore, the differentially expressed genes can be obtained by taking the first num largest elements in $\hat{v}$.

The objective of the experiment is to identify the differentially expressed genes by the GLD-RNMF algorithm. The identification process is described below.
(1) Obtain the nonnegative matrix $X$ from the genomic dataset.
(2) Construct the label matrix $S$ and the diagonal matrices $D_{1}$ and $D_{2}$.
(3) Obtain the Laplacian matrix $L$ and the factor matrices via the GLD-RNMF algorithm.
(4) Identify differentially expressed genes through the evaluating vector $\hat{v}$.
(5) Check the differentially expressed genes with a gene ontology tool.
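Step (4) can be sketched as follows. The paper's exact scoring rule for the evaluating vector is not reproduced in the text above, so this sketch assumes the column-wise $L_{2}$-norm of the coefficient matrix $V$ as an illustrative score:

```python
import numpy as np

def rank_genes(V, num):
    """Score the j-th gene by the L2-norm of the j-th column of V and
    return the indices of the `num` top-scoring genes (descending)."""
    scores = np.sqrt((V ** 2).sum(axis=0))
    order = np.argsort(scores)[::-1]
    return order[:num], scores

V = np.array([[0.1, 2.0, 0.0, 0.5],
              [0.2, 1.0, 0.1, 3.0]])
top, scores = rank_genes(V, num=2)
# Columns 3 and 1 have the largest norms, so genes 3 and 1 rank highest.
```

The returned indices would then be mapped back to gene identifiers and passed to a gene ontology tool in step (5).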

3.2. Parameter Selection

We assign the parameters of our GLD-RNMF algorithm in the same way as proposed by Long et al. [16]. In particular, there are two regularization parameters, $\lambda$ and $\beta$, in the GLD-RNMF method.

Fortunately, if $\lambda$ and $\beta$ are set within a reasonable range, they have little effect on the performance of the algorithm [15, 16]. In our experiments, $\lambda$ and $\beta$ are fixed to values within this range for the GLD-RNMF algorithm. Another important parameter in our GLD-RNMF algorithm is $k$, which is used to construct a $k$-nearest-neighbor graph. Empirically, we choose $k$ and adopt the heat kernel as the weighting mode, as in LPP [22]. Besides, we set the reduced dimension to 5 for all the methods. All the parameters of the other methods are kept in line with those described in their papers [10, 16, 21, 23].

3.3. Gene Ontology Analysis

The gene ontology (GO) tool can interpret an input gene list and discover the functions that these genes may have in common. As a web-based tool [24], GO Enrichment Analysis can find important GO items from a large number of genes and provide important information for the biological interpretation of high-throughput experiments. Another online tool that we use is ToppFun, which is usually used to interpret differentially expressed genes.

To be fair, we extract 100 genes from the gene expression data with each of the GNMF, NMFSC, RGNMF, GDNMF, and GLD-RNMF methods. The threshold parameters of ToppFun are set as follows: the maximum $p$ value is set to 0.01 and the minimum number of gene products is set to 2.

3.4. Pancreatic Cancer Dataset

Pancreatic cancer is a highly malignant tumor that is difficult to diagnose and treat. Early diagnosis of pancreatic cancer is difficult and the mortality rate is high. The cause of pancreatic cancer remains unclear. In the experiment, the dataset contains 20502 genes in 180 samples.

The top 10 GO items extracted and the $p$ values of the five methods are listed in Table 1. In this table, "ID" and "Name" represent items and their names associated with the GO in the whole genome, and the lowest $p$ value among the five methods is marked in bold font. As we can see from Table 1, the $p$ values produced by GLD-RNMF are much smaller than those of the other four methods on PAAD. Therefore, the GLD-RNMF method is superior to the other four methods on PAAD. The name of "GO:0030198" is extracellular matrix organization. It contains DPT (Dermatopontin), POSTN (Periostin), SEC24D (SEC24-related protein D), and other genes which are related to pancreatic cancer [25–27]. For example, POSTN can create a tumor-supportive microenvironment in the pancreas [28]. Genes and gene products associated with extracellular structure organization (GO:0043062) can be found by the GO tool, among which DPT, POSTN, SEC24D, UXS1 (UDP-Glucuronate Decarboxylase 1), FKRP (Fukutin Related Protein), and other genes have been shown to be associated with pancreatic cancer [29, 30]. For example, SEC24D is ubiquitously expressed but exhibits predominant expression in heart, placenta, liver, and pancreas. The other GO items can also be shown to be related to pancreatic cancer by the relevant literature. Clearly, the GLD-RNMF method is an effective method for identifying differentially expressed genes.

Comparing the 100 genes extracted by GLD-RNMF with what we obtain from Gene Cards (http://www.genecards.org/) about pancreatic cancer, 82 of the 100 genes are associated with pancreatic cancer. Many genes, which were previously thought to be unrelated to clinical outcomes, are identified. We present the top 10 of the 82 genes with the highest relevance scores in Table 2, including their gene ID, names, function, and related diseases. Among the identified differentially expressed genes, MMP2, MMP7, IGFBP3, INS, and other genes have been demonstrated to be related to pancreatic cancer [31–34]. For example, the role of MMP-2 and its activators MT1-MMP, MT2-MMP, and MT3-MMP in pancreatic tumor cell invasion and in the development of the desmoplastic reaction characteristic of pancreatic cancer tissues has been discussed [35]. Fukuda et al. demonstrated that the serum MMP7 level in human pancreatic ductal adenocarcinoma patients is correlated with metastatic disease and survival. Conditioned medium from Capan-1 pancreatic cancer cells, which contains abundant IGFBP-3, has also been mentioned [36]. Other genes identified by GLD-RNMF have likewise been shown to be related to pancreatic cancer in the relevant literature.

On the other hand, we use the Kyoto Encyclopedia of Genes and Genomes (KEGG) online analysis tool to analyze the differentially expressed genes identified by GLD-RNMF. In this experiment, putting the identified 100 genes into KEGG, we can obtain the corresponding disease pathway. Figure 1 is the pathway of pancreatic cancer. The genes that have been found by GLD-RNMF are marked in red. The chromosomal instability pathway records the disease progression from normal duct to pancreatic cancer. Infiltrating ductal adenocarcinoma is the most common malignancy of the pancreas. Normal duct epithelium progresses to the stage of infiltrating cancer through a series of histologically defined precursors. These differentially expressed genes contain two oncogenes: K-Ras and HER2/neu. p16, p53, BRCA2, and Smad4 are tumor suppressors. Slebos et al. assessed that K-ras oncogene mutations and p53 protein accumulation are associated with known or postulated risk factors for pancreatic cancer [37]. Gu et al. found that the expression of Smad4 and p16 is significantly lower in pancreatic cancer tissue compared with normal tissue; the lower expression of these proteins may impact the development of pancreatic cancer [38]. From Figure 1, we can see that the pancreatic cancer pathway map contains seven other pathways: the PI3K-Akt signaling pathway, ErbB signaling pathway, MAPK signaling pathway, VEGF signaling pathway, Jak-STAT signaling pathway, p53 signaling pathway, and TGF-β signaling pathway.

The PI3K-Akt signaling pathway is presented in Figure 2, in which we can find four proteinases among the differentially expressed genes identified by GLD-RNMF. Therefore, our algorithm achieves better results.

3.5. Cholangiocarcinoma Dataset

Cholangiocarcinoma is diagnosed in 12,000 patients in the US each year, but only 10 percent are discovered early enough to allow for successful surgical treatment. In our experiments, we apply the five methods to CHOL, which contains 20502 genes in 45 samples. The 8 GO items closely related to cholangiocarcinoma and the $p$ values of the five methods are listed in Table 3. In this table, "ID" and "Name" represent items and their names associated with the GO in the whole genome. The lowest $p$ value among the five methods is marked in bold font. We can see from the table that GLD-RNMF is much better than the other four methods. Genes and gene products associated with blood microparticle (GO:0072562) can be found by the GO tool, among which APOA4 (Apolipoprotein A4), KNG1 (Kininogen 1), and other genes have been shown to be associated with cholangiocarcinoma [39–41]. The name of "GO:0060205" is cytoplasmic vesicle lumen. It contains ADA (Adenosine Deaminase), DBH (Dopamine Beta-Hydroxylase), and other genes which are related to cholangiocarcinoma [42, 43]. The top 8 genes identified by GLD-RNMF are listed in Table 4, including the gene ID, names, gene annotations, and related diseases. Consistent with previous studies, ALB (Albumin), HP (Haptoglobin), SERPINC1 (Serpin Family C Member 1), C3 (Complement C3), and other genes are successfully identified; they represent potential biomarkers for cholangiocarcinoma and potential targets for clarifying the molecular mechanisms associated with cholangiocarcinoma. For example, HP was proposed as a pronucleating protein because it is highly expressed in the fast nucleating bile of patients with cholesterol stones [44]. Waghray et al. predicted survival in patients with hilar cholangiocarcinoma using serum albumin [45]. C3 and HP were identified as more abundant in cholangiocarcinoma [46]. The relevant literature illustrates that the other genes identified by GLD-RNMF are also associated with cholangiocarcinoma.

4. Conclusions

By introducing the $L_{2,1}$-norm, a manifold graph, and discriminative label information, we propose an efficient algorithm named robust nonnegative matrix factorization via joint graph Laplacian and discriminative information (GLD-RNMF) in this paper. The $L_{2,1}$-norm reduces the influence of outliers and noise, the manifold graph captures the low dimensional manifold embedded in the high dimensional data space and the intrinsic structure of the observed data, and the label information increases the discriminative power between different classes. Nonnegative matrix factorization naturally accommodates high dimensional, nonnegative data. As a result, GLD-RNMF can handle nonnegative, high dimensional data contaminated by outliers and noise while improving the discriminative power between different classes. Experimental results on two datasets show that GLD-RNMF is superior to the state-of-the-art methods for identifying differentially expressed genes.

Appendix

Detailed Proof of Theorem 1

To prove Theorem 1, we need to show that the objective function in (9) is nonincreasing under the iterative rules in (16), (17), and (18). For the objective function, we fix $V$ and $A$ when we update $U$; similarly, we fix $U$ and $A$ when we update $V$, and we fix $U$ and $V$ when we update $A$. Because the update rules for $U$ and $A$ in GLD-RNMF are similar to those in NMF, the detailed proof can be found in [19]. Hence, we only need to prove that the objective function is nonincreasing under the iterative rule in (18). We use an auxiliary function similar to the one used in the Expectation-Maximization algorithm [47]. We first present the definition of an auxiliary function [16].

Definition. $G(v, v')$ is an auxiliary function of $F(v)$ if the conditions $G(v, v') \geq F(v)$ and $G(v, v) = F(v)$ are satisfied.

The auxiliary function is vital due to the following lemmas.

Lemma A.1. If $G$ is an auxiliary function of $F$, then $F$ is nonincreasing under the update rule
$$v^{(t+1)} = \arg\min_{v} G\big(v, v^{(t)}\big). \qquad (A.2)$$

Proof. Obviously,
$$F\big(v^{(t+1)}\big) \leq G\big(v^{(t+1)}, v^{(t)}\big) \leq G\big(v^{(t)}, v^{(t)}\big) = F\big(v^{(t)}\big).$$
Now, we show that the update step for $V$ in (18) is exactly the update in (A.2) with an appropriate auxiliary function. It is enough to prove that each $F_{ab}$, the part of the objective function relevant to the entry $v_{ab}$, is nonincreasing under the update rule, because the update is essentially elementwise. Therefore, we present the following lemma.

Lemma A.2. The function
$$G\big(v, v_{ab}^{(t)}\big) = F_{ab}\big(v_{ab}^{(t)}\big) + F'_{ab}\big(v_{ab}^{(t)}\big)\big(v - v_{ab}^{(t)}\big) + \frac{\big(U^{T} D_{1} U V + \lambda V D + \beta A^{T} D_{2} A V\big)_{ab}}{v_{ab}^{(t)}} \big(v - v_{ab}^{(t)}\big)^{2} \qquad (A.5)$$
is an auxiliary function of $F_{ab}$.

Proof. We only need to demonstrate that $G(v, v_{ab}^{(t)}) \geq F_{ab}(v)$, because $G(v, v) = F_{ab}(v)$ is obvious. Consequently, comparing the Taylor series expansion of $F_{ab}(v)$,
$$F_{ab}(v) = F_{ab}\big(v_{ab}^{(t)}\big) + F'_{ab}\big(v_{ab}^{(t)}\big)\big(v - v_{ab}^{(t)}\big) + \frac{1}{2} F''_{ab}\big(v_{ab}^{(t)}\big)\big(v - v_{ab}^{(t)}\big)^{2},$$
with (A.5), we can find that $G(v, v_{ab}^{(t)}) \geq F_{ab}(v)$ is equivalent to
$$\frac{\big(U^{T} D_{1} U V + \lambda V D + \beta A^{T} D_{2} A V\big)_{ab}}{v_{ab}^{(t)}} \geq \frac{1}{2} F''_{ab}\big(v_{ab}^{(t)}\big) = \big(U^{T} D_{1} U\big)_{aa} + \lambda L_{bb} + \beta \big(A^{T} D_{2} A\big)_{aa}. \qquad (A.7)$$
Actually, we have
$$\big(U^{T} D_{1} U V\big)_{ab} = \sum_{k} \big(U^{T} D_{1} U\big)_{ak} v_{kb}^{(t)} \geq \big(U^{T} D_{1} U\big)_{aa} v_{ab}^{(t)},$$
$$\lambda (V D)_{ab} = \lambda v_{ab}^{(t)} D_{bb} \geq \lambda v_{ab}^{(t)} L_{bb}, \qquad \beta \big(A^{T} D_{2} A V\big)_{ab} \geq \beta \big(A^{T} D_{2} A\big)_{aa} v_{ab}^{(t)}.$$

Therefore, (A.7) holds and $G(v, v_{ab}^{(t)}) \geq F_{ab}(v)$. Now we can prove the convergence stated in Theorem 1.

Proof of Theorem 1. Replacing $G(v, v_{ab}^{(t)})$ in (A.2) by (A.5), we can obtain the following update rule:
$$v_{ab}^{(t+1)} = v_{ab}^{(t)} \frac{\big(U^{T} D_{1} X + \lambda V W + \beta A^{T} D_{2} S\big)_{ab}}{\big(U^{T} D_{1} U V + \lambda V D + \beta A^{T} D_{2} A V\big)_{ab}}.$$
Since (A.5) is an auxiliary function of $F_{ab}$, $F_{ab}$ is nonincreasing under this update rule.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported in part by grants from the National Natural Science Foundation of China, Nos. 61572284, 61502272, 61373027, and 61672321.