Abstract
Differential expression analysis plays an important role in cancer diagnosis and classification. In recent years, many methods have been proposed to identify differentially expressed genes. However, the recognition rate and reliability of gene selection still need to be improved. In this paper, a novel constrained method named robust nonnegative matrix factorization via joint graph Laplacian and discriminative information (GLDRNMF) is proposed for identifying differentially expressed genes, in which manifold learning and discriminative label information are incorporated into the traditional nonnegative matrix factorization model to train the objective matrix. Specifically, $L_{2,1}$-norm minimization is enforced on both the error function and the regularization term, which makes the model robust to outliers and noise in gene data. Furthermore, the multiplicative update rules and the details of the convergence proof are given for the new model. The experimental results on two publicly available cancer datasets demonstrate that GLDRNMF is an effective method for identifying differentially expressed genes.
1. Introduction
Cancer is one of the most serious diseases that endanger human health. Millions of people die of cancer every year. With the development of gene sequencing technology and other gene detection technologies, huge amounts of gene data have been generated [1, 2]. Therefore, it is important and challenging for scientists to find pathogenic genes in a large body of gene expression data. Microarray datasets usually contain many gene expression measurements per chip, and the number of samples is far smaller than the number of genes, which makes the identification of differentially expressed genes difficult [3]. In addition, irrelevant or noisy variables may reduce the accuracy of the results. In recent years, many effective mathematical methods have been applied to identify differentially expressed genes. For example, principal component analysis (PCA) [4, 5] and penalized matrix decomposition (PMD) [6] have been used to analyze gene expression data. Liu et al. used robust principal component analysis (RPCA) to discover differentially expressed genes [7]. Zheng et al. employed nonnegative matrix factorization (NMF) for the selection of tumor genes [8]. Cai et al. proposed an algorithm named graph regularized nonnegative matrix factorization (GNMF) for data representation [9]. Wang et al. used robust graph regularized nonnegative matrix factorization (RGNMF) for identifying differentially expressed genes [10]. A CIPMD (Class-Information-based Penalized Matrix Decomposition) algorithm was proposed to identify differentially expressed genes in RNA-Seq data, which introduced the class information via a total scatter matrix [11]. The consensus clustering methodology was proposed for microarray data analysis by Giancarlo and Utro [12].
However, two characteristics of gene expression data pose a serious challenge to the existing methods. Firstly, many researchers hold that gene expression data probably reside on a low-dimensional manifold embedded in a high-dimensional ambient space. Therefore, it is critical to consider the geometrical structure of the original gene expression data. Manifold learning is an effective way to preserve the geometric structure embedded in the original gene expression data [13, 14]. Cai et al. proposed GNMF [9], in which the geometrical structure of the data is encoded by an affinity graph. Another variant of NMF, called manifold regularized discriminative nonnegative matrix factorization (MDNMF), was also introduced [15]. MDNMF considers both the local geometry of the data and the discriminative information of different classes simultaneously. Long et al. proposed a method called graph regularized discriminative nonnegative matrix factorization (GDNMF) [16], in which both the geometrical structure and discriminative label information are incorporated into the objective function. Secondly, gene expression data often contain many outliers and much noise, which existing methods cannot effectively eliminate. For example, least squares methods are sensitive to outliers and noise. In recent years, many researchers have been devoted to improving robustness to outliers and noise. Zheng et al. proposed an algorithm named generalized hierarchical fuzzy means [17], which is robust to noise and outliers. Wang et al. used the $L_{2,1}$-norm to reduce the effect of outliers and noise [10].
A novel algorithm, which we call robust nonnegative matrix factorization via joint graph Laplacian and discriminative information (GLDRNMF), is proposed to overcome the aforementioned problems together. The proposed algorithm preserves the geometric structure of the data space by constructing an affinity graph and improves the discriminative ability by using the supervised label information. To do so, a new matrix decomposition objective function integrating the geometric structure and the label information is constructed. In addition, we employ the $L_{2,1}$-norm instead of the Frobenius norm on the error function and the regularization term to reduce the influence of outliers and noise. For completeness, the convergence proof of our iterative scheme is given in the Appendix. Experimental results indicate that the GLDRNMF algorithm obtains better results than existing algorithms for identifying differentially expressed genes.
The remainder of the paper is arranged as follows. In Section 2, we briefly introduce some relevant mathematical foundation and propose the GLDRNMF algorithm in detail. In Section 3, the results of differentially expressed gene selection using our GLDRNMF method and the other four methods (GNMF, NMFSC, RGNMF, and GDNMF) are shown for comparison. Finally, we conclude this paper in Section 4.
2. Materials and Methods
2.1. Mathematical Definition of the $L_{2,1}$-Norm
The mathematical definition of the $L_{2,1}$-norm [18] is
\[ \|X\|_{2,1} = \sum_{i=1}^{m} \|x^{i}\|_{2} = \sum_{i=1}^{m} \sqrt{\sum_{j=1}^{n} x_{ij}^{2}}, \]
where $x^{i}$ is the $i$th row of $X$ and $X$ is an $m \times n$ matrix. The $L_{2,1}$-norm is interpreted as follows. Firstly, we compute the $L_{2}$-norm of each row vector $x^{i}$, and then we compute the $L_{1}$-norm of the resulting vector $(\|x^{1}\|_{2}, \ldots, \|x^{m}\|_{2})$. The value of each element of this vector represents the importance of the corresponding row. The $L_{2,1}$-norm drives this vector toward sparsity, which achieves the purpose of dimension reduction.
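As a sanity check, the row-then-sum definition above can be computed in a few lines; the sketch below (the function name is ours) follows that reading directly:

```python
import numpy as np

def l21_norm(X):
    """L2,1-norm: the L2-norm of each row, then the L1-norm (sum) of those values."""
    row_norms = np.sqrt((X ** 2).sum(axis=1))  # ||x^i||_2 for each row i
    return row_norms.sum()

X = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [5.0, 12.0]])
# row L2-norms are 5, 0, and 13, so ||X||_{2,1} = 18
```

Unlike the squared Frobenius norm, residual rows enter this quantity linearly, so a single aberrant row (an outlier sample) cannot dominate the objective.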
2.2. Manifold Learning
The purpose of this work is to obtain the best approximation of the original data. We also hope that the new representation respects the intrinsic Riemannian structure. Recently, many researchers have held that high-dimensional data often reside on a much lower-dimensional manifold. The "manifold assumption" is that data points that are nearby in the intrinsic geometric structure should also be close under the new basis. Therefore, they usually have similar characteristics and can be categorized into the same class. In this paper, we employ manifold learning to achieve this goal.
For a graph with $N$ vertices, each vertex corresponds to a data point. For each data point, we find its $k$ nearest neighbors and connect it with them by edges. There are many ways to define the weight matrix $W$ on the graph, for example, 0-1 weighting, heat kernel weighting, and dot-product weighting. Considering that 0-1 weighting is the simplest and easiest to compute, we choose it as the measure in this paper.
0-1 Weighting. $W_{ij} = 1$ if and only if the two nodes $i$ and $j$ are connected by an edge. That is,
\[ W_{ij} = \begin{cases} 1, & x_{j} \in N_{k}(x_{i}) \ \text{or} \ x_{i} \in N_{k}(x_{j}), \\ 0, & \text{otherwise}, \end{cases} \]
where $N_{k}(x_{i})$ consists of the $k$ nearest neighbors of $x_{i}$ and the neighbors have the same label as $x_{i}$.
Therefore, the smoothness of the low-dimensional representation can be measured as follows:
\[ R = \frac{1}{2} \sum_{i,j=1}^{N} \|u_{i} - u_{j}\|^{2} W_{ij} = \mathrm{Tr}\big(U^{T} T U\big) - \mathrm{Tr}\big(U^{T} W U\big) = \mathrm{Tr}\big(U^{T} L U\big), \]
where $\mathrm{Tr}(\cdot)$ represents the trace of a matrix, $u_{i}$ is the low-dimensional representation of the $i$th data point, $T$ is a diagonal matrix whose entries are the row sums (or column sums, because $W$ is symmetric) of $W$, that is, $T_{ii} = \sum_{j} W_{ij}$, and $L = T - W$ is the graph Laplacian matrix. We measure the distance between two points in the low-dimensional space by the squared Euclidean distance $\|u_{i} - u_{j}\|^{2}$.
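To make the construction concrete, here is a small sketch (names ours) of the 0-1 weighted, label-constrained nearest-neighbor graph and its Laplacian; the closing comment restates the trace identity above:

```python
import numpy as np

def knn_graph_laplacian(X, labels, k=2):
    """0-1 weighted k-NN graph over the rows (samples) of X, restricted to
    same-label neighbors, and its graph Laplacian L = T - W."""
    n = X.shape[0]
    # pairwise squared Euclidean distances between samples
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(d2[i])
        neighbors = [j for j in order if j != i and labels[j] == labels[i]][:k]
        W[i, neighbors] = 1.0
    W = np.maximum(W, W.T)          # symmetrize: connected if either is a k-NN
    T = np.diag(W.sum(axis=1))      # degree matrix
    return W, T - W                 # weight matrix and Laplacian L = T - W

W, L = knn_graph_laplacian(np.random.default_rng(0).random((6, 3)),
                           np.array([0, 0, 0, 1, 1, 1]), k=2)
# for any representation U (one row per sample):
# Tr(U.T @ L @ U) == 0.5 * sum_ij W[i, j] * ||U[i] - U[j]||^2
```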
2.3. Nonnegative Matrix Factorization (NMF)
We review the standard NMF in this section. Although the algorithm has been widely used in many aspects, there are still many shortcomings.
Given $m$ nonnegative samples in $\mathbb{R}^{n}$, arranged as the rows of a matrix $X \in \mathbb{R}_{+}^{m \times n}$; in this paper, each row of $X$ represents the transcriptional response of the genes in one sample and each column of $X$ represents the expression level of a gene across all samples. Letting matrices $U \in \mathbb{R}_{+}^{m \times k}$ and $V \in \mathbb{R}_{+}^{k \times n}$ with $k \ll \min(m, n)$, NMF decomposes $X$ into the product of $U$ and $V$; that is, $X \approx UV$.
To ensure an approximate factorization $X \approx UV$, two update rules are introduced [19]. One of the objective functions is constructed by minimizing the square of the Euclidean distance between $X$ and $UV$. The optimization problem is described as follows:
\[ \min_{U \geq 0,\, V \geq 0} \|X - UV\|_{F}^{2}, \]
where $\|\cdot\|_{F}$ denotes the matrix Frobenius norm. The corresponding optimization rules are as follows:
\[ U_{ij} \leftarrow U_{ij} \frac{\big(X V^{T}\big)_{ij}}{\big(U V V^{T}\big)_{ij}}, \qquad V_{ij} \leftarrow V_{ij} \frac{\big(U^{T} X\big)_{ij}}{\big(U^{T} U V\big)_{ij}}. \]
The convergence of the above optimization rules has been proven [19].
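The two multiplicative rules are straightforward to implement; the sketch below (a minimal illustration, with a small eps added to the denominators to avoid division by zero) simply alternates them:

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-10):
    """Lee-Seung-style multiplicative updates minimizing ||X - U V||_F^2."""
    rng = np.random.default_rng(0)
    m, n = X.shape
    U = rng.random((m, k))
    V = rng.random((k, n))
    for _ in range(n_iter):
        U *= (X @ V.T) / (U @ V @ V.T + eps)   # update basis matrix
        V *= (U.T @ X) / (U.T @ U @ V + eps)   # update coefficient matrix
    return U, V

X = np.random.default_rng(1).random((20, 30))  # nonnegative toy data
U, V = nmf(X, k=5)
err = np.linalg.norm(X - U @ V, "fro")
```

Because the updates are multiplicative and all factors start nonnegative, $U$ and $V$ stay nonnegative throughout, which is what the constraints require.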
2.4. Graph Regularized Discriminative Nonnegative Matrix Factorization (GDNMF)
Supervised label information is added to the objective function of GNMF [16]. The definition and iterative rules of GDNMF are presented below.
The class indicator matrix $S \in \mathbb{R}^{m \times c}$ is defined as follows:
\[ S_{ij} = \begin{cases} 1, & \text{if } y_{i} = j, \\ 0, & \text{otherwise}, \end{cases} \]
where $y_{i}$ is the class label of the $i$th sample $x_{i}$ and $c$ is the total number of classes in $X$.
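The indicator matrix is simply a one-hot encoding of the sample labels; a minimal sketch (function name ours):

```python
import numpy as np

def class_indicator(y, c):
    """S[i, j] = 1 iff sample i carries class label j, else 0."""
    S = np.zeros((len(y), c))
    S[np.arange(len(y)), y] = 1.0
    return S

S = class_indicator(np.array([0, 1, 1, 2]), c=3)
# each row has exactly one 1, in the column of that sample's class
```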
The objective function of GDNMF is formulated as follows:
\[ \min_{U \geq 0,\, V \geq 0,\, A \geq 0} \|X - UV\|_{F}^{2} + \lambda\, \mathrm{Tr}\big(U^{T} L U\big) + \mu\, \|S - UA\|_{F}^{2}, \]
where $A \in \mathbb{R}_{+}^{k \times c}$ maps the low-dimensional representation $U$ to the class indicator matrix $S$.
The corresponding optimization rules are as follows:
\[ U_{ij} \leftarrow U_{ij} \frac{\big(X V^{T} + \mu S A^{T} + \lambda W U\big)_{ij}}{\big(U V V^{T} + \mu U A A^{T} + \lambda T U\big)_{ij}}, \qquad V_{ij} \leftarrow V_{ij} \frac{\big(U^{T} X\big)_{ij}}{\big(U^{T} U V\big)_{ij}}, \qquad A_{ij} \leftarrow A_{ij} \frac{\big(U^{T} S\big)_{ij}}{\big(U^{T} U A\big)_{ij}}, \]
where $A$ is initialized to a random nonnegative matrix in the algorithm, and $\lambda$ and $\mu$ are nonnegative regularization parameters. Essentially, GDNMF incorporates the graph Laplacian and supervised label information into the objective function of NMF, which keeps the algorithm consistent with the intuitive geometric structure of the data and improves the discriminative power between different classes.
2.5. Robust Nonnegative Matrix Factorization via Joint Graph Laplacian and Discriminative Information (GLDRNMF)
2.5.1. The Objective Function
For the purpose of dimension reduction, NMF represents the original data $X$ by the product of a nonnegative basis matrix $U$ and a coefficient matrix $V$. The approximation error is calculated from the squared residuals; that is, $\|X - UV\|_{F}^{2}$. Due to the squared terms in the objective function, even a small number of outliers can lead to large errors. In this paper, we enforce the $L_{2,1}$-norm on the objective function to reduce the impact of outliers and noise.
By employing the $L_{2,1}$-norm in the GDNMF model, we can formulate the objective function of GLDRNMF as follows:
\[ \min_{U \geq 0,\, V \geq 0,\, A \geq 0} \|X - UV\|_{2,1} + \lambda\, \mathrm{Tr}\big(U^{T} L U\big) + \mu\, \|S - UA\|_{2,1}. \]
This objective function can handle high-dimensional, nonnegative, noisy, and sparse data simultaneously, keep consistency with the intuitive geometric structure of the data, and improve the discriminative power between different classes.
2.5.2. The Multiplication Update Rules of GLDRNMF
Although the objective function is not jointly convex in $(U, V, A)$, it is convex with respect to any one of the variables when the others are fixed. The objective function can be expanded as follows:
\[ O = \mathrm{Tr}\big((X - UV)^{T} D (X - UV)\big) + \lambda\, \mathrm{Tr}\big(U^{T} L U\big) + \mu\, \mathrm{Tr}\big((S - UA)^{T} G (S - UA)\big), \]
where $D$ and $G$ are both diagonal matrices and the diagonal elements are as follows:
\[ D_{ii} = \frac{1}{2 \|x^{i} - u^{i} V\|_{2} + \varepsilon}, \qquad G_{ii} = \frac{1}{2 \|s^{i} - u^{i} A\|_{2} + \varepsilon}, \]
in which $x^{i}$, $s^{i}$, and $u^{i}$ denote the $i$th rows of $X$, $S$, and $U$, respectively, and $\varepsilon$ is an infinitesimal positive number.
In order to solve the optimization problem in (9), we introduce the Lagrange multipliers $\Psi$, $\Phi$, and $\Gamma$ for the constraints $U \geq 0$, $V \geq 0$, and $A \geq 0$, respectively. Firstly, we formulate the Lagrange function of GLDRNMF as follows:
\[ \mathcal{L} = \mathrm{Tr}\big((X - UV)^{T} D (X - UV)\big) + \lambda\, \mathrm{Tr}\big(U^{T} L U\big) + \mu\, \mathrm{Tr}\big((S - UA)^{T} G (S - UA)\big) + \mathrm{Tr}\big(\Psi U^{T}\big) + \mathrm{Tr}\big(\Phi V^{T}\big) + \mathrm{Tr}\big(\Gamma A^{T}\big). \]
Taking the partial derivatives of $\mathcal{L}$ with respect to $U$, $V$, and $A$, setting them to zero, and in view of $L = T - W$ and the symmetry of $D$ and $G$, we get
\[ \frac{\partial \mathcal{L}}{\partial U} = -2 D X V^{T} + 2 D U V V^{T} + 2 \lambda L U - 2 \mu G S A^{T} + 2 \mu G U A A^{T} + \Psi = 0, \]
\[ \frac{\partial \mathcal{L}}{\partial V} = -2 U^{T} D X + 2 U^{T} D U V + \Phi = 0, \]
\[ \frac{\partial \mathcal{L}}{\partial A} = -2 \mu U^{T} G S + 2 \mu U^{T} G U A + \Gamma = 0. \]
According to the KKT (Karush-Kuhn-Tucker) conditions [20], that is, $\Psi_{ij} U_{ij} = 0$, $\Phi_{ij} V_{ij} = 0$, and $\Gamma_{ij} A_{ij} = 0$, we can obtain the following equations:
\[ \big(-D X V^{T} + D U V V^{T} + \lambda L U - \mu G S A^{T} + \mu G U A A^{T}\big)_{ij} U_{ij} = 0, \]
\[ \big(-U^{T} D X + U^{T} D U V\big)_{ij} V_{ij} = 0, \]
\[ \big(-U^{T} G S + U^{T} G U A\big)_{ij} A_{ij} = 0. \]
Then we can get the multiplicative update rules as follows:
\[ V_{ij} \leftarrow V_{ij} \frac{\big(U^{T} D X\big)_{ij}}{\big(U^{T} D U V\big)_{ij}}, \qquad A_{ij} \leftarrow A_{ij} \frac{\big(U^{T} G S\big)_{ij}}{\big(U^{T} G U A\big)_{ij}}, \qquad U_{ij} \leftarrow U_{ij} \frac{\big(D X V^{T} + \mu G S A^{T} + \lambda W U\big)_{ij}}{\big(D U V V^{T} + \mu G U A A^{T} + \lambda T U\big)_{ij}}. \]
The details of our method are described in Algorithm 1. The iterative procedure is performed until the algorithm converges.
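Under the notational assumptions used above (rows of $X$ are samples, $S$ is the class indicator matrix, $W$ is the 0-1 affinity matrix, and the two $L_{2,1}$ terms are handled by iteratively reweighted diagonal matrices $D$ and $G$), the whole iteration can be sketched as follows. This is our illustrative reading of the update rules, not the authors' reference implementation:

```python
import numpy as np

def gldrnmf(X, S, W, k, lam=1.0, mu=1.0, n_iter=100, eps=1e-10):
    """Sketch of GLDRNMF multiplicative updates (illustrative only)."""
    rng = np.random.default_rng(0)
    m, n = X.shape
    c = S.shape[1]
    U = rng.random((m, k))
    V = rng.random((k, n))
    A = rng.random((k, c))
    T = np.diag(W.sum(axis=1))          # degree matrix; the Laplacian is T - W
    for _ in range(n_iter):
        # IRLS-style diagonal reweighting matrices for the two L2,1 terms
        D = np.diag(1.0 / (2.0 * np.linalg.norm(X - U @ V, axis=1) + eps))
        G = np.diag(1.0 / (2.0 * np.linalg.norm(S - U @ A, axis=1) + eps))
        V *= (U.T @ D @ X) / (U.T @ D @ U @ V + eps)
        A *= (U.T @ G @ S) / (U.T @ G @ U @ A + eps)
        U *= (D @ X @ V.T + mu * (G @ S @ A.T) + lam * (W @ U)) / (
             D @ U @ V @ V.T + mu * (G @ U @ A @ A.T) + lam * (T @ U) + eps)
    return U, V, A
```

Rows with large residuals get small weights in $D$ and $G$, which is exactly how the $L_{2,1}$-norm down-weights outlier samples relative to the squared Frobenius loss.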

Considering the three update rules above, we ensure the convergence of the algorithm by the following theorem.
Theorem 1. The objective function is nonincreasing under the iterative rules in (16), (17), and (18).
The detailed proof of the theorem is shown in the Appendix.
3. Results and Discussion
In order to verify the effectiveness of GLDRNMF algorithm for identifying differentially expressed genes, we perform experiments on real gene expression datasets to compare our algorithm with the other four feature extraction algorithms: (a) GNMF algorithm (Cai et al. [9]); (b) NMFSC algorithm (Hoyer [21]); (c) RGNMF algorithm (Wang et al. [10]); (d) GDNMF algorithm (Long et al. [16]). We conduct these experiments on two publicly available cancer datasets: pancreatic cancer dataset (PAAD) and cholangiocarcinoma dataset (CHOL).
3.1. Identifying Differentially Expressed Genes by GLDRNMF
In this section, we use GLDRNMF to identify differentially expressed genes. The matrix $X$ with size $m \times n$ is the original gene expression data. Each row of $X$ indicates the transcriptional response of the genes in a sample. Each column of $X$ indicates the expression level of a gene in all samples. Therefore, $X$ can be written as follows:
\[ X \approx UV, \]
where $U$ is the basis matrix with size $m \times k$, $V$ is the coefficient matrix with size $k \times n$, and $k \ll \min(m, n)$. Since the matrix $V$ contains all of the genes, the differentially expressed genes can be identified from the matrix $V$ [10]. By GLDRNMF, the evaluating vector is obtained, whose elements are sorted in descending order:
\[ r_{j} = \sum_{i=1}^{k} |V_{ij}|, \quad j = 1, \ldots, n. \]
Generally, the larger an entry of the evaluating vector is, the more differentially expressed the corresponding gene is. Therefore, the differentially expressed genes can be obtained from the first num (num $\ll n$) largest elements of the vector.
The objective of the experiment is to identify the differentially expressed genes by the GLDRNMF algorithm. The identifying process is described below.
(1) Obtain the nonnegative matrix $X$ from the genomic dataset.
(2) Construct the label matrix $S$ and the diagonal matrices $D$ and $G$.
(3) Obtain the Laplacian matrix $L$ and the factor matrices via the GLDRNMF algorithm.
(4) Identify differentially expressed genes through the evaluating vector.
(5) Check the differentially expressed genes with a gene ontology tool.
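Assuming, as in the gene-selection scheme of [10], that each gene is scored by the accumulated absolute weight of its column in the coefficient matrix (our reading of the evaluating vector), steps (3) and (4) above reduce to a few lines:

```python
import numpy as np

def gene_scores(V):
    """Score gene j by the summed absolute weight of column j of V."""
    return np.abs(V).sum(axis=0)

V = np.array([[0.1, 2.0, 0.0],
              [0.3, 1.0, 0.2]])   # toy coefficient matrix, 2 factors x 3 genes
scores = gene_scores(V)            # array([0.4, 3.0, 0.2])
ranking = np.argsort(scores)[::-1] # gene indices in descending score order
num = 2
top_genes = ranking[:num]          # the num most differential genes
```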
3.2. Parameters Selection
We assign the parameters in our GLDRNMF algorithm in the same way as proposed by Long et al. [16]. In addition, there are two regularization parameters, $\lambda$ and $\mu$, in the GLDRNMF method.
Fortunately, if $\lambda$ and $\mu$ are set within a reasonable range, they have little effect on the performance of the algorithm [15, 16], and we set them accordingly in the GLDRNMF algorithm. Another important parameter in our GLDRNMF algorithm is the neighborhood size, which is used to construct a nearest-neighbor graph; empirically, we adopt the heat kernel mode as in LPP [22]. Besides, we set the reduced dimension to 5 for all the methods. All the parameters in the other methods are kept in line with those described in their papers [10, 16, 21, 23].
3.3. Gene Ontology Analysis
The gene ontology (GO) tool can interpret a set of input genes and discover the functions that these genes may have in common. As a web-based tool [24], GO enrichment analysis can find important GO terms from a large number of genes and provide important information for the biological interpretation of high-throughput experiments. Another online tool that we use is ToppFun, which is usually used to interpret the differentially expressed genes.
To be fair, we extract 100 genes from the gene expression data by the GNMF, NMFSC, RGNMF, GDNMF, and GLDRNMF methods. The threshold parameters of ToppFun are set as follows: the maximum $p$ value is set to 0.01 and the minimum number of gene products is set to 2.
3.4. Pancreatic Cancer Dataset
Pancreatic cancer is a tumor with high malignancy, which is difficult to diagnose and treat. Early diagnosis of pancreatic cancer is difficult, and the mortality rate is high. The cause of pancreatic cancer is still not clear. In the experiment, the dataset contains 20502 genes in 180 samples.
The top 10 GO terms extracted and the $p$ values of the five methods are listed in Table 1. In this table, "ID" and "Name" represent the GO terms and their names in the whole genome, and the lowest $p$ value among the five methods is marked in bold font. As we can see from Table 1, the $p$ values generated by GLDRNMF are much smaller than those of the other four methods on PAAD. Therefore, the GLDRNMF method is superior to the other four methods on PAAD. The name of "GO:0030198" is extracellular matrix organization. It contains DPT (Dermatopontin), POSTN (Periostin), SEC24D (Sec24-related protein D), and other genes which are related to pancreatic cancer [25–27]. For example, POSTN can create a tumor-supportive microenvironment in the pancreas [28]. Genes and gene products associated with extracellular structure organization (GO:0043062) can be found by the GO tool, among which DPT, POSTN, SEC24D, UXS1 (UDP-Glucuronate Decarboxylase 1), FKRP (Fukutin-Related Protein), and other genes have been shown to be associated with pancreatic cancer [29, 30]. For example, SEC24D is ubiquitously expressed but exhibits predominant expression in heart, placenta, liver, and pancreas. The other GO terms can also be shown to be related to pancreatic cancer by the relevant literature. Clearly, the GLDRNMF method is effective for identifying differentially expressed genes.
Comparing the 100 genes extracted by GLDRNMF with those we obtain from GeneCards (http://www.genecards.org/) about pancreatic cancer, 82 of the 100 genes are associated with pancreatic cancer. Many genes which were previously thought to be unrelated to clinical outcomes are identified. We present the top 10 of the 82 genes with the highest relevance scores in Table 2, including their gene IDs, names, functions, and related diseases. Among the identified differentially expressed genes, MMP2, MMP7, IGFBP3, INS, and other genes have been demonstrated to be related to pancreatic cancer [31–34]. For example, the effect of MMP2 and its activators MT1-MMP, MT2-MMP, and MT3-MMP on pancreatic tumor cell invasion and on the development of the desmoplastic reaction characteristic of pancreatic cancer tissues has been discussed [35]. Fukuda et al. demonstrated that the serum MMP7 level in human pancreatic ductal adenocarcinoma patients is correlated with metastatic disease and survival. Conditioned medium from Capan-1 pancreatic cancer cells, which contains abundant IGFBP3, has also been mentioned [36]. Other genes identified by GLDRNMF have been shown to be related to pancreatic cancer by the relevant literature as well.
On the other hand, we use the Kyoto Encyclopedia of Genes and Genomes (KEGG) online analysis tool to analyze the differentially expressed genes identified by GLDRNMF. In this experiment, putting the 100 identified genes into KEGG, we can obtain the corresponding disease pathway. Figure 1 is the pathway of pancreatic cancer. The genes found by GLDRNMF are marked in red. The chromosomal instability pathway records the disease progression from normal duct to pancreatic cancer. Infiltrating ductal adenocarcinoma is the most common malignancy of the pancreas. Normal duct epithelium progresses to the stage of infiltrating cancer through a series of histologically defined precursors. These differentially expressed genes contain two oncogenes, K-Ras and HER2/neu, while p16, p53, BRCA2, and Smad4 are tumor suppressors. Slebos et al. assessed that K-ras oncogene mutations and p53 protein accumulation are associated with known or postulated risk factors for pancreatic cancer [37]. Gu et al. found that the expression of Smad4 and p16 is significantly lower in pancreatic cancer tissue compared with normal tissue; the lower expression of these proteins may impact the development of pancreatic cancer [38]. From Figure 1, we can see that the pancreatic cancer pathway map contains seven other pathways: the PI3K-Akt signaling pathway, ErbB signaling pathway, MAPK signaling pathway, VEGF signaling pathway, Jak-STAT signaling pathway, p53 signaling pathway, and TGF-β signaling pathway.
The PI3K-Akt signaling pathway is presented in Figure 2. From Figure 2, we can find four proteinases among the differentially expressed genes identified by GLDRNMF. Therefore, our algorithm achieves better results.
3.5. Cholangiocarcinoma Dataset
Cholangiocarcinoma is diagnosed in 12,000 patients in the US each year, but only 10 percent are discovered early enough to allow successful surgical treatment. In our experiments, we apply the five methods to CHOL, which contains 20502 genes in 45 samples. The 8 GO terms closely related to cholangiocarcinoma and the $p$ values of the five methods are listed in Table 3. In this table, "ID" and "Name" represent the GO terms and their names in the whole genome. The lowest $p$ value among the five methods is marked in bold font. We can see from the table that GLDRNMF performs much better than the other four methods. Genes and gene products associated with blood microparticle (GO:0072562) can be found by the GO tool, among which APOA4 (Apolipoprotein A4), KNG1 (Kininogen 1), and other genes have been shown to be associated with cholangiocarcinoma [39–41]. The name of "GO:0060205" is cytoplasmic vesicle lumen. It contains ADA (Adenosine Deaminase), DBH (Dopamine Beta-Hydroxylase), and other genes which are related to cholangiocarcinoma [42, 43]. The top 8 genes identified by GLDRNMF are listed in Table 4, including the gene IDs, names, gene annotations, and related diseases. Consistent with previous studies, ALB (Albumin), HP (Haptoglobin), SERPINC1 (Serpin Family C Member 1), C3 (Complement C3), and other genes are successfully identified, which represent potential biomarkers for cholangiocarcinoma and potential targets for clarifying the molecular mechanisms associated with cholangiocarcinoma. For example, HP was proposed as a pronucleating protein because it was highly expressed in the fast-nucleating bile of patients with cholesterol stones [44]. Waghray et al. predicted survival in patients with hilar cholangiocarcinoma by serum albumin [45]. C3 and HP were identified as more abundant in cholangiocarcinoma [46]. The relevant literature illustrates that the other genes identified by GLDRNMF are also associated with cholangiocarcinoma.
4. Conclusions
By introducing the $L_{2,1}$-norm, a manifold graph, and discriminative label information, we propose an efficient algorithm named robust nonnegative matrix factorization via joint graph Laplacian and discriminative information (GLDRNMF) in this paper. The $L_{2,1}$-norm can reduce the influence of outliers and noise; the manifold graph can find the low-dimensional manifold in the high-dimensional data space and the intrinsic structure of the observed data; and the label information can increase the discriminative power between different classes. Nonnegative matrix factorization is well suited to high-dimensional nonnegative data. As a result, GLDRNMF can handle nonnegative, high-dimensional data with outliers and noise and improve the discriminative power between different classes. Experimental results on two datasets show that GLDRNMF is superior to the state-of-the-art methods for identifying differentially expressed genes.
Appendix
Detailed Proof of Theorem 1
To prove Theorem 1, we need to show that the objective function in (9) is nonincreasing under the iterative rules in (16), (17), and (18). For the objective function, we fix $V$ and $A$ when we update $U$; similarly, we fix $U$ and $A$ when we update $V$, and we fix $U$ and $V$ when we update $A$. Because the update rules for $V$ and $A$ in GLDRNMF are similar to those in NMF, the detailed proof can be found in [19]. Hence, we just need to prove that the objective is nonincreasing under the iterative rule for $U$ in (18). We use an auxiliary function similar to the one used in the Expectation-Maximization algorithm [47]. First, we present the definition of an auxiliary function [16].

Definition A.1. $G(u, u')$ is an auxiliary function of $F(u)$ if the conditions $G(u, u') \geq F(u)$ and $G(u, u) = F(u)$ are satisfied.
The auxiliary function is vital due to the following lemmas.
Lemma A.1. If $G$ is an auxiliary function of $F$, then $F$ is nonincreasing under the update rule
\[ u^{(t+1)} = \arg\min_{u} G\big(u, u^{(t)}\big). \]
Proof. Obviously, $F(u^{(t+1)}) \leq G(u^{(t+1)}, u^{(t)}) \leq G(u^{(t)}, u^{(t)}) = F(u^{(t)})$. Now we will show that the update step for $U$ in (18) is exactly the update in (A.2) with an appropriate auxiliary function. It is enough to prove that each $F_{ab}$, the part of the objective function relevant only to the element $U_{ab}$, is nonincreasing under the update rule, because the update is essentially element-wise. Therefore, we present the following lemma.
Lemma A.2. The function
\[ G\big(u, u_{ab}^{(t)}\big) = F_{ab}\big(u_{ab}^{(t)}\big) + F'_{ab}\big(u_{ab}^{(t)}\big)\big(u - u_{ab}^{(t)}\big) + \frac{\big(D U V V^{T} + \mu G U A A^{T} + \lambda T U\big)_{ab}}{u_{ab}^{(t)}} \big(u - u_{ab}^{(t)}\big)^{2} \]
is an auxiliary function of $F_{ab}$.
Proof. We only need to demonstrate that $G(u, u_{ab}^{(t)}) \geq F_{ab}(u)$, because $G(u, u) = F_{ab}(u)$ is obvious. Consequently, comparing the Taylor series expansion of $F_{ab}(u)$,
\[ F_{ab}(u) = F_{ab}\big(u_{ab}^{(t)}\big) + F'_{ab}\big(u_{ab}^{(t)}\big)\big(u - u_{ab}^{(t)}\big) + \frac{1}{2} F''_{ab}\big(u_{ab}^{(t)}\big)\big(u - u_{ab}^{(t)}\big)^{2}, \]
with (A.5), we can find that $G(u, u_{ab}^{(t)}) \geq F_{ab}(u)$ is equivalent to
\[ \frac{\big(D U V V^{T} + \mu G U A A^{T} + \lambda T U\big)_{ab}}{u_{ab}^{(t)}} \geq \frac{1}{2} F''_{ab} = D_{aa}\big(V V^{T}\big)_{bb} + \mu G_{aa}\big(A A^{T}\big)_{bb} + \lambda L_{aa}. \]
Actually, we have
\[ \big(D U V V^{T}\big)_{ab} = \sum_{l} D_{aa} U_{al} \big(V V^{T}\big)_{lb} \geq D_{aa} u_{ab}^{(t)} \big(V V^{T}\big)_{bb}, \qquad \mu \big(G U A A^{T}\big)_{ab} \geq \mu G_{aa} u_{ab}^{(t)} \big(A A^{T}\big)_{bb}, \]
\[ \lambda (T U)_{ab} = \lambda T_{aa} u_{ab}^{(t)} \geq \lambda (T - W)_{aa} u_{ab}^{(t)} = \lambda L_{aa} u_{ab}^{(t)}. \]
Therefore, (A.7) holds and $G(u, u_{ab}^{(t)}) \geq F_{ab}(u)$. Now we can prove the convergence theorem.
Proof of Theorem 1. Replacing $G(u, u_{ab}^{(t)})$ in (A.2) by (A.5), we can obtain the following update rule:
\[ u_{ab}^{(t+1)} = u_{ab}^{(t)} - u_{ab}^{(t)} \frac{F'_{ab}\big(u_{ab}^{(t)}\big)}{2 \big(D U V V^{T} + \mu G U A A^{T} + \lambda T U\big)_{ab}} = u_{ab}^{(t)} \frac{\big(D X V^{T} + \mu G S A^{T} + \lambda W U\big)_{ab}}{\big(D U V V^{T} + \mu G U A A^{T} + \lambda T U\big)_{ab}}. \]
Since (A.5) is an auxiliary function, $F_{ab}$ is nonincreasing under this update rule.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work is supported in part by grants from the National Natural Science Foundation of China (nos. 61572284, 61502272, 61373027, and 61672321).