Abstract
Principal Component Analysis (PCA) as a tool for dimensionality reduction is widely used in many areas. In the area of bioinformatics, each involved variable corresponds to a specific gene. In order to improve the robustness of PCAbased method, this paper proposes a novel graphLaplacian PCA algorithm by adopting constraint ( gLPCA) on error function for feature (gene) extraction. The error function based on norm helps to reduce the influence of outliers and noise. Augmented Lagrange Multipliers (ALM) method is applied to solve the subproblem. This method gets better results in feature extraction than other stateoftheart PCAbased methods. Extensive experimental results on simulation data and gene expression data sets demonstrate that our method can get higher identification accuracies than others.
1. Introduction
With the rapid development of genechip and deepsequencing technologies, a lot of gene expression data have been generated. It is possible for biologists to monitor the expression of thousands of genes with the maturation of the sequencing technology [1–3]. It is reported that a growing body of research has been used to select the feature genes from gene expression data [4–6]. Feature extraction is a typical application of gene expression data. Cancer has become a threat to human health. Modern medicine has proved all cancers are directly or indirectly related to genes. How to identify what is believed to be related to cancer has become a hotspot in the field of bioinformatics. The major bottleneck of the development of bioinformatics is how to build an effective approach to integrate and analyze the expression data [7].
One striking feature of gene expression data is the case that the number of genes is far greater than the number of samples, commonly called the highdimensionsmallsamplesize problem [8]. Typically this means that expression data are always with more than thousands of genes, while the size of samples is generally less than 100. The huge expression data make them hard to analyze, but only a small size of genes can control the gene expression. More attention has been attached to the importance of feature genes by modern biologists. Correspondingly, it is especially important how to discover these genes effectively, so many dimensionality reduction approaches are proposed.
Traditional dimensionality reduction methods have been widely used. For example, Principal Component Analysis (PCA) recombines the original data which have a certain relevance into a new set of independent indicators [9–11]. However, because of the sparsity of gene regulation, the weaknesses of traditional approaches in the field of feature extraction become increasingly evident [12, 13]. With the development of deepsequencing technique, the inadequacy of conventional methods is emerging. Within the process of feature selection on biological data, the principal components of PCA are dense, which makes it difficult to give an objective and reasonable explanation on the significance of biology. PCAbased methods have achieved good results in the application of feature extraction [3, 12]. Although this method shows the significance of sparsity in the aspect of handling high dimensional data, there are still a lot of shortcomings in the algorithm.(1)The high dimensionality of data poses a great challenge to the research, which is called data disaster.(2)Facing with millions of data points, it is reasonable to consider the internal geometric structure of the data.(3)Gene expression data usually contain a lot of outliers and noise, but the above methods cannot effectively deal with these problems.
With the development of graph theory [14] and manifold learning theory [15], the embedded structure problem has been effectively resolved. Laplacian embedding as a classical method of manifold learning has been used in machine learning and pattern recognition, whose essential idea is recovery of low dimensional manifold structure from high dimensional sampled data. The performance of feature extraction will be improved remarkably after joining Laplacian in gene expression data. In the case of maintaining the local adjacency relationship of the graph, the graph can be drawn from the high dimensional space to a low dimensional space (drawing graph). However, graphLaplacian cannot dispose outliers.
In the field of dimensionality reduction, norm was getting more and more popular to replace , which was first proposed by Nie et al. [16]. Research shows that a proper value of can achieve a more exact result for dimensionality reduction [17]. Furthermore, Xu et al. developed an simple iterative thresholding representation theory for norm [18], which was similar to the notable iterative soft thresholding algorithm for the solution of [19] and norm [20]. Xu et al. have shown that norm generates more better solution than norm [21]. Besides, among all regularization with in , there is no obvious difference. However, when , the smaller is, the more effective result will be [17]. This provides a motivation to introduce norm constraint into original method. Since the error of each data point is calculated in the form of the square. It will also cause a lot of errors while the data contains some tiny abnormal values.
In order to solve the above problems, we propose a novel method based on norm constraint, graphLaplacian PCA ( gLPCA) which provides a good performance. In summary, the main work of this paper is as follows. (1) The error function based on norm is used to reduce the influence of outliers and noise. (2) GraphLaplacian is introduced to recover low dimensional manifold structure from high dimensional sampled data.
The remainder of the paper is organized as follows. Section 2 provides some related work. We present our formulation and algorithm for norm constraint graphLaplacian PCA in Section 3. We evaluate our algorithm on both simulation data and real gene expression data in Section 4. The correlations between the identified genes and cancer data are also included. The paper is concluded in Section 5.
2. Related Work
2.1. Principal Component Analysis
In the field of bioinformatics, the principal components (PCs) of PCA are used to select feature genes. Assume is the input data matrix, which contains the collection of data column vectors and dimension space. Traditional PCA approaches recombine the original data which have a certain relevance into a new set of independent indicators [9]. More specifically, this method reduces the input data to dim subspace by minimizing:where each column of is the principal directions and is the projected data points in the new subspace.
2.2. GraphLaplacian PCA
Since the traditional PCA has not taken into account the intrinsic geometrical structure within input data, the mutual influences among data may be missed during a research project [9]. With the increasing popularity of the manifold learning theory, people are becoming aware that the intrinsic geometrical structure is essential for modeling input data [15]. It is a wellknown fact that graphLaplacian is the fastest approach in the manifold learning method [14]. The essential idea of graphLaplacian is to recover low dimensional manifold structure from high dimensional sampled data. PCA closely relates to means clustering [22]. The principal components are also the continuous solution of the cluster indicators in the means clustering method. Thus, it provides a motivation to embed Laplacian to PCA whose primary purpose is clustering [23, 24]. Let symmetric weight matrix be the nearest neighbor graph where is the weight of the edge connecting vertices and . The value of is set as follows:where is the set of nearest neighbors of . is supposed as the embedding coordinates of the data and is defined as a diagonal matrix and . can be obtained by minimizing:where is the column or row sums of and is named as Laplacian matrix. Simply put, in the case of maintaining the local adjacency relationship of the graph, the graph can be drawn from the high dimensional space to a low dimensional space (drawing graph). In the view of the function of graphLaplacian, Jiang et al. proposed a model named graphLaplacian PCA (gLPCA), which incorporates graph structure encoded in [23]. This model can be considered as follows:where is a parameter adjusting the contribution of the two parts. This model has three aspects. (a) It is a data representation, where . (b) It uses to embed manifold learning. (c) This model is a nonconvex problem but has a closedform solution and can be efficient to work out.
In (4), from the perspective of data point, it can be rewritten as follows:In this formula, the error of each data point is calculated in the form of the square. It will also cause a lot of errors while the data contains some tiny abnormal values. Thus, the author formulates a robust version using norm as follows:but the major contribution of norm is to generate sparse on rows, in which the effect is not so obvious [3, 25].
3. Proposed Algorithm
Research shows that a proper value of can achieve a more exact result for dimensionality reduction [17]. When , the smaller is, the more effective result will be [17]. Then, Xu et al. developed a simple iterative thresholding representation theory for norm and obtained the desired results [18]. Thus, motivated by former theory, it is reasonable and necessary to introduce norm on error function to reduce the impact of outliers on the data. Based on the half thresholding theory, we propose a novel method using norm on error function by minimizing the following problem:where norm is defined as , is the input data matrix, and and are the principal directions and the subspace of projected data, respectively. We call this model graphLaplacian PCA based on norm constraint ( gLPCA).
At first, the subproblems are solved by using the Augmented Lagrange Multipliers (ALM) method. Then, an efficient updating algorithm is presented to solve this optimization problem.
3.1. Solving the Subproblems
ALM is used to solve the subproblem. Firstly, an auxiliary variable is introduced to rewrite the formulation (4) as follows:The augmented Lagrangian function of (8) is defined as follows:where is Lagrangian multipliers and is the step size of update. By mathematical deduction, the function of (9) can be rewritten asThe general approach of (10) consists of the following iterations:Then, the details to update each variable in (11) are given as follows.
Updating . At first, we solve while fixing and . The update of relates the following issue:which is the proximal operator of norm. Since this formulation is a nonconvex, nonsmooth, nonLipschitz, and complex optimization problem; an iterative half thresholding approach is used for fast solution of norm and summarizes according to the following lemma [18].
Lemma 1. The proximal operator of norm minimizes the following problem:which is given bywhere and is the half threshold operator and defined as follows: where .
Solving and . Here, we solve while fixing others. The update of amounts to solvingLetting , (16) becomes , taking partial derivatives of as follows:Setting the partial derivatives to 0, we haveThen, we solve while fixing others. Similarly, letting , , the update of can be listed as follows:By some algebra, we haveTherefore, (19) can be rewritten as follows:Thus, the optimal can be obtained by calculating eigenvectorswhich corresponds to the first smallest eigenvalues of the matrix .
Updating and . The update of and is standard:where is used to update the parameter . Since the value of is usually bigger than 1, and over a large number of experiments, we find are good choice. We selected in such practice conditions.
The complete procedure is summarized in Algorithm 1.

3.2. Properties of Algorithm
We set through all our gene expression data experiments. Whereas we introduce , is the largest eigenvalue of matrix and to normalize them, respectively. Setting where is the parameter to substitute for , (20) can be rewritten asTherefore, the solution of can be expressed by the eigenvectors of : It is easy to see that should be in the range . Without norm, there will be standard PCA if . Similarly, when , it reduces to Laplacian embedding.
Furthermore, we rewrite the matrix as follows:where is an eigenvector of : . We have , because is centered and it is easy to see that is centered. is semipositive definite, because is the biggest eigenvalue of ; thus is semipositive definite. Meanwhile, it is easy to see that is semipositive definite. Since is a symmetric real matrix that eigenvectors are mutually orthogonal, thus is orthogonal to others. Although we apply in the Laplacian matrix part, the eigenvectors and eigenvalues do not change, which guarantees that the lowest eigenvectors do not include .
4. Experiments
In this section, we compare our algorithm with Laplacian embedding (LE) [26], PCA [9], PCA, PCA [12], gLPCA, and RgLPCA [23] on simulation data and real gene expression data, respectively, to verify the performance of our algorithm. Among them, PCA and LE are obtained by adjusting the parameters of gLPCA and , respectively. Since our algorithm is not sensitive to parameter mu in practice. In the first subsection, we provide the source of simulation data and experimental comparison results. The experimental results and the function of selected genes on real gene expression data with different methods are compared in the next two subsections.
4.1. Results on Simulation Data
4.1.1. Data Source
Here, we describe a method to produce simulation data. Supposing we generate the data matrix , where and are the number of genes and samples, respectively, the simulation data are generated as Let be four 2000dimensional vectors; for instance, , , and , ; , , and , ; , , and , ; , , and , Given a matrix as a noise matrix with 2000dimension and different SignaltoNoise Ratio (SNR), which is added into , the four eigenvectors of can be expressed as , Let the four eigenvectors dominate; the eigenvalues of can be denoted as , , , , and for .
4.1.2. Detailed Results on Simulation Data
In order to give more accurate experiment results, the average values of the results of 30 times are adopted. For fairness and uniformity, 200 genes are selected by the five methods with their unique parameters. Here, we show the accuracy (%) of these methods. In Figure 1, two factors as two different axes are in the figure. In Figure 2, axis is the number of samples. axis is the value of parameter . The accuracy is defined as follows:where is the iterative times and is the identification accuracy of the th time. We define as follows:where denotes the number of genes, is a function that equals to 0 if and equals to 1 if . We use the function to map the identification of labels. In Figure 1, we show the average accuracies of the seven methods with different sparse parameters while the simulation data is and the average accuracy with all parameters is listed in Table 1. In general, if the algorithm is more sensitive to noise and outliers, the deviation will be greater and the accuracy will be greatly reduced. It is worthy to notice that gLPCA works better than other six methods with higher identification accuracies. This means that our algorithm has lower sensitivity to noise and outliers. This table clearly displays the detail of the identification accuracies in different sparse parameters; our method indicates the superiority when the parameter is larger than 0.4 and the curve is more stable. The accuracy of PCA and PCA starts a precipitous decline when the parameter is larger than 0.7 and 0.8. Compared with PCA and PCA, the methods of gLPCA, RgLPCA, gLPCA, PCA, and LE are not sensitive to the parameter, so there is no substantial change. The stability and average accuracy of various methods can be seen from Table 1.
Furthermore, the number of samples in real gene expression data has a significant influence on the identification accuracy when we select feature gene. Based on this theory, we test different numbers of samples with their best parameters and the average values of the results of 30 times. From the results of Figure 1, we select as the parameters of gLPCA, gLPCA, RgLPCA, PCA, and LE. For PCA and PCA, we do not change its parameters, since it can obtain the best result from the author’s description. The details of average identification accuracies which use seven methods with different sample numbers can be seen from Figure 2. As seen in Figure 2, the accuracy of gLPCA is generally better than other methods and increases with the increase of the number of samples. Besides, Table 2 shows the average accuracy and variance of seven different methods on simulation data with different number of samples. From Table 2, our approach performs better than other methods, even though, in the case of a small number of samples, the accuracy is still high.
4.2. Results on Gene Expression Data
In this subsection, the features (genes) are selected by these methods and sent to ToppFun to detect the geneset enrichment analysis, which is a type of GOTermFinder [27]. The primary role of GOTermFinder is to discover the common of large amounts of gene expression data. The analysis of GOTermFinder provides critical information for the experiment of feature extraction. It is available publicly at https://toppgene.cchmc.org/enrichment.jsp. We set value cutoff to 0.01 through all the experiment. For fair comparison, about gLPCA, RgLPCA, and gLPCA, we both set to control the degree of Laplacian embedding through all experiments in this paper. When , , it results in standard PCA and LE, respectively. Since our algorithm is not sensitive to parameter mu in practice, we set through our experiment.
4.2.1. Results on ALLAML Data
The data of ALLAML as a matrix includes 38 samples and 5000 features (genes), which are publicly available at https://sites.google.com/site/feipingnie/file. It is made up of 11 types of acute myelogenous leukemia (AML) and 27 types of acute lymphoblastic leukemia (ALL) [28]. This data contains the difference between AML and ALL, and ALL is divided into T and B cell subtypes. In this experiment, 300 genes are selected and sent to ToppFun. A series of enrichment analyses are conducted on the extracted top 500 genes corresponding to different methods. The complete experimental data have been listed as supplementary data. The value and hit count of top nine terms about molecular function, biological process, and cellular component of ALLAML data by different methods are listed in Table 3. The value is significance for these genes enrichment analysis in these GO terms; the smaller the value is, the more significant these GO terms are. In this Table, the number of hits is the number of genes from input, and the value was influenced by the number of genes from input and so on. Thus, the difference in number of hits is smaller than the difference in value. It shows clearly that our method performs better than compared methods in 8 terms. The lower value shows that the algorithm is less affected by noise and outliers and thus has high efficiency. If the algorithm is affected by noise and outliers significantly, the degree of gene enrichment will be reduced. Nevertheless, LE has the lowest value in term GO: 0098552. From this table, we can see that there are 93 genes in the item of “immune response” which are selected by our method. This item can be considered as the most probable enrichment item, since it has the lowest value. And many researches were focused on the immune status of leukemia [29–32]. Besides, 210 genes associated with leukemia are listed in an article, and 26 out of top 30 genes selected by our method can be found in this article [33]. And 30 genes selected by our method can be found in another published article [34]. The high overlap rate of these genes selected by our method with this published literature approved the effectiveness of our method.
4.2.2. Pathway Search Result on ALLAML Data
For the sake of the correlations between the selected genes and ALLAML data, the genes selected by gLPCA are proved based on geneset enrichment analysis (GSEA) that is publicly available at http://software.broadinstitute.org/gsea/msigdb/annotate.jsp. We make analysis by GSEA to compute overlaps for selected genes. Figure 3 displays the pathway of hematopoietic cell lineage that has highest gene overlaps in this experiment. From Figure 3, 15 genes from our experiment are contained. Among them, HLADR occurs seven times. Hematopoietic cell lineage belongs to organismal systems and immune system. On the subject of acute myeloid leukemia (AML), there is consensus about the target cell within the hematopoietic stem cell hierarchy that is sensitive to leukemic transformation, or about the mechanism, that is, basic phenotypic, genotypic, and clinical heterogeneity [35]. Hematopoietic stem cell (HSC) developing from the bloodcell can undergo selfrenewal and differentiate into a multilineage committed progenitor cell: one is a common lymphoid progenitor (CLP) and the other is called a common myeloid progenitor (CMP) [36]. A CLP causes the lymphoid lineage of white blood cells or leukocytes, the natural killer (NK) cells and the T and B lymphocytes. A CMP causes the myeloid lineage, which comprises the rest of the leukocytes, the erythrocytes (red blood cells), and the megakaryocytes that produce platelets important in blood clotting. Cells express a stage and lineagespecific set of surface markers in the differentiation process. So the specific expression pattern of these genes is one way to identify the cellular stages. Related diseases include hemophilia, BernardSoulier syndrome, and castleman disease. In medicine, leukemia is a kind of malignant clonal disease of hematopoietic stem cells. Bone marrow transplantation is a magic weapon for the cure of leukemia, by recreating the hematopoietic system to cure leukemia. Generally speaking, when a person has problem in hematopoietic system, it might be related to leukemia [37].
4.2.3. Results on TCGA with PAADGE Data
As the largest public database of cancer gene information, The Cancer Genome Atlas (TCGA, https://tcgadata.nci.nih.gov/tcga/) has been producing multimodal genomics, epigenomics, and proteomics data for thousands of tumor samples across over 30 types of cancer. At the same time, as a multidimensional combination of data, five levels of data are involved, such as gene expression (GE), Protein Expression (PE), DNA Methylation (ME), DNA Copy Number (CN), and microRNA Expression (miRExp). Two disease data sets are downloaded from TCGA to be analyzed in the following two experiments. Pancreatic cancer is a type of disease that threatens human health. In this experiment, pancreatic cancer gene expression data (PAADGE) is analyzed by these methods. The data of PAADGE data as a matrix includes 180 samples and 20502 features (genes). In this subsection, we extract PAADGE data to complete this set of comparative experiments and 500 genes are selected and sent to ToppFun. We select top nine terms from molecular function, biological process, and cellular component by gLPCA and compare with other methods. The value and hit count of these terms are listed in Table 4. It is indicated clearly in Table 4 that our method is more stable than other methods, which has lower value in 7 terms. But in terms GO:0045047 and GO:0072599, PCA performs better than other methods. Nevertheless, gLPCA has the same value with gLPCA in terms GO:0045047 and GO:0072599. 196 genes in the item of “extracellular space” are selected by our method.
4.2.4. Pathway Search Result on PAADGE Data
Similarly as the last experiment, we send our result to GSEA and list the highest genes overlap pathway map in Figure 4. In 1982, Ohhashi reported 4 cases with unique clinical pathological features and is different from normal pancreatic cancer cases, and these 4 cases belong to a completely new clinical type, known as “mucus production type carcinoma (mucinproducing carcinoma, MpC).” Focal adhesion belongs to cellular processes and cellular community. More specifically, cellmatrix adhesions play important roles in biological processes including cell motility, cell proliferation, cell differentiation, regulation of gene expression, and cell survival. At the cellextracellular matrix contact points, specialized structures are created and termed focal adhesions, where bundles of actin filaments are fixed to transmembrane receptors of the integrin family through a multimolecular complex of junctional plaque proteins. Integrin signaling is dependent on the nonreceptor tyrosine kinase activities of the FAK and src proteins as well as the adaptor protein functions of FAK, src and Shc to start downstream signaling events. Similar morphological alterations and modulation of gene expression are started by the binding of growth factors to their respective receptors, underling the considerable crosstalk between adhesion and growth factormediated signaling. The early literatures have shown that there is a certain relationship between the pancreatic cancer and focal adhesion [38]. Activation of focal adhesion kinase enhances the adhesion and invasion of pancreatic cancer cells. Besides, Type II diabetes mellitus is another important pathway and is widely believed to be associated with pancreatic cancer; a metaanalysis has examined this association [39].
4.2.5. Correlations between the Selected Genes and PAADGE Data
The function of top 7 genes selected by gLPCA is listed in Table 5 based on literatures and GeneCards (http://www.genecards.org/). As can be clearly seen from the table, most of these genetic lesions would likely incur pancreasrelated diseases. The etiology of pancreatic cancer is not very clear; it is noted that there is a certain relationship between the incidence of chronic pancreatitis and pancreatic cancer, and we find a significant increase in the proportion of chronic pancreatitis patients with pancreatic cancer. This view is consistent with our experimental result. The clinical observation shows that abdominal pain is the most obvious symptom in the early stage of pancreatic cancer. Some literature on these genes also made a further research as follows. The gene PRSS1 variant likely affects disease susceptibility by altering expression of the primary trypsinogen gene [40]. The pancreatic lipase gene (PNLIP) is located within the genomic region of a bovine marbling quantitative trait locus. PNLIP is a positional and functional candidate for the marbling gene [41].
4.3. The Accuracy and Highest Relevance Score
Because ALLAML and PPAD are human disease data sets, we can find them directly from GeneCards and they are publicly available at http://www.genecards.org/.
In order to summarize the experiments on gene expression data, we compute the accuracy and highest relevance score of these methods from GeneCards and list the details in Table 6. The accuracy in Table 6 indicates the proportion of genes which are real associated with the disease in all of the genes selected by these methods. From Table 6, we observe the following. (1) Both PCA and LE commonly provide better accuracy results than PCA and PCA, demonstrating the usefulness of PCA and LE. (2) gLPCA has a good performance in some conditions and is unstable. Thus, it is necessary to reduce the effects of outliers and noise. (3) gLPCA and RgLPCA consistently perform better than other methods, but gLPCA has the highest relevance score and highest accuracy.
5. Conclusions
This paper investigates a new method of graphLaplacian PCA ( gLPCA) by applying norm constraint on the former method. norm constraint is applied on error function to improve the robustness of the PCAbased method. Augmented Lagrange Multipliers (ALM) method is applied to solve the optimization problem. Extensive experiments on both simulation and real gene expression data have been performed. Results on these two kinds of data show that our proposed method performs better than compared methods. Based on our proposed method, many genes have been extracted to analyze. The identified genes are demonstrated that they are closely related to the corresponding cancer data set.
In future, we will modify the model to improve sparse and robustness of the structure at the same time.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported in part by the NSFC under Grants nos. 61572284 and 61502272.
Supplementary Materials
In the supplementary data, "resultALL_AML_data" are the top 500 genes found by our method on the ALLAML dataset, "find_official_name" is the way to find the official name of these genes and "0.5gLPCA" is the experimental code of our method.