BioMed Research International

Volume 2017 (2017), Article ID 5073427, 14 pages

https://doi.org/10.1155/2017/5073427

## Joint $L_{1/2}$-Norm Constraint and Graph-Laplacian PCA Method for Feature Extraction

^{1}School of Information Science and Engineering, Qufu Normal University, Rizhao 276826, China

^{2}Library of Qufu Normal University, Qufu Normal University, Rizhao 276826, China

Correspondence should be addressed to Ying-Lian Gao; yinliangao@126.com

Received 30 December 2016; Revised 12 February 2017; Accepted 1 March 2017; Published 2 April 2017

Academic Editor: Jialiang Yang

Copyright © 2017 Chun-Mei Feng et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Principal Component Analysis (PCA) is widely used as a tool for dimensionality reduction in many areas. In bioinformatics, each variable involved corresponds to a specific gene. To improve the robustness of PCA-based methods, this paper proposes a novel graph-Laplacian PCA algorithm that adopts an $L_{1/2}$ constraint on the error function ($L_{1/2}$ gLPCA) for feature (gene) extraction. The error function based on the $L_{1/2}$-norm helps to reduce the influence of outliers and noise. The Augmented Lagrange Multipliers (ALM) method is applied to solve the subproblems. The method obtains better results in feature extraction than other state-of-the-art PCA-based methods. Extensive experimental results on simulation data and gene expression data sets demonstrate that it achieves higher identification accuracies than the compared methods.

#### 1. Introduction

With the rapid development of gene-chip and deep-sequencing technologies, a large amount of gene expression data has been generated. As sequencing technology matures, biologists can monitor the expression of thousands of genes [1–3]. A growing body of research has focused on selecting feature genes from gene expression data [4–6], and feature extraction is a typical application of such data. Cancer has become a threat to human health, and modern medicine has shown that all cancers are directly or indirectly related to genes. How to identify the genes believed to be related to cancer has therefore become a hotspot in the field of bioinformatics. The major bottleneck in the development of bioinformatics is how to build an effective approach to integrate and analyze expression data [7].

One striking feature of gene expression data is that the number of genes is far greater than the number of samples, commonly called the high-dimension-small-sample-size problem [8]. Typically, expression data contain thousands of genes, while the number of samples is generally less than 100. The sheer size of the data makes them hard to analyze, yet only a small number of genes control gene expression. Modern biologists therefore attach increasing importance to feature genes. Correspondingly, discovering these genes effectively is especially important, and many dimensionality reduction approaches have been proposed for this purpose.

Traditional dimensionality reduction methods have been widely used. For example, Principal Component Analysis (PCA) recombines the original, correlated data into a new set of independent indicators [9–11]. However, because of the sparsity of gene regulation, the weaknesses of traditional approaches in feature extraction have become increasingly evident [12, 13], and the development of deep-sequencing techniques has made the inadequacy of conventional methods more apparent. In feature selection on biological data, the principal components of PCA are dense, which makes it difficult to give an objective and reasonable biological interpretation. PCA-based methods have achieved good results in feature extraction [3, 12]. Although these methods show the value of sparsity in handling high-dimensional data, the algorithms still have several shortcomings.

(1) The high dimensionality of the data poses a great challenge to research, the so-called curse of dimensionality.

(2) When facing millions of data points, it is reasonable to consider the internal geometric structure of the data.

(3) Gene expression data usually contain many outliers and much noise, but the above methods cannot deal with these problems effectively.

With the development of graph theory [14] and manifold learning theory [15], the embedded structure problem has been effectively resolved. Laplacian embedding, a classical method of manifold learning, has been used in machine learning and pattern recognition; its essential idea is to recover the low-dimensional manifold structure from high-dimensional sampled data. Incorporating the Laplacian into the analysis of gene expression data can improve the performance of feature extraction remarkably. While maintaining the local adjacency relationships of the graph, the graph can be mapped from the high-dimensional space to a low-dimensional space (graph drawing). However, graph-Laplacian cannot handle outliers.

In the field of dimensionality reduction, the $L_p$-norm has become more and more popular as a replacement for the $L_1$-norm, an approach first proposed by Nie et al. [16]. Research shows that a proper value of $p$ can achieve a more exact result for dimensionality reduction [17]. Furthermore, Xu et al. developed a simple iterative thresholding representation theory for the $L_{1/2}$-norm [18], which is similar to the well-known iterative thresholding algorithms for the solution of the $L_0$-norm [19] and the $L_1$-norm [20]. Xu et al. have shown that the $L_{1/2}$-norm generates a better solution than the $L_1$-norm [21]. Besides, among all regularizations with $p$ in $(0, 1/2]$, there is no obvious difference. However, when $p \in [1/2, 1)$, the smaller $p$ is, the more effective the result will be [17]. This provides a motivation to introduce the $L_{1/2}$-norm constraint into the original method. Moreover, because the error of each data point is squared in the original method, even a few abnormal values in the data can cause large errors.

To solve the above problems, we propose a novel method based on the $L_{1/2}$-norm constraint, $L_{1/2}$ graph-Laplacian PCA ($L_{1/2}$ gLPCA), which provides good performance. In summary, the main work of this paper is as follows. (1) The error function based on the $L_{1/2}$-norm is used to reduce the influence of outliers and noise. (2) Graph-Laplacian is introduced to recover the low-dimensional manifold structure from high-dimensional sampled data.

The remainder of the paper is organized as follows. Section 2 provides some related work. We present our formulation and algorithm for $L_{1/2}$-norm constraint graph-Laplacian PCA in Section 3. We evaluate our algorithm on both simulation data and real gene expression data in Section 4, which also covers the correlations between the identified genes and cancer data. The paper is concluded in Section 5.

#### 2. Related Work

##### 2.1. Principal Component Analysis

In the field of bioinformatics, the principal components (PCs) of PCA are used to select feature genes. Assume $X = (x_1, \ldots, x_n) \in \mathbb{R}^{m \times n}$ is the input data matrix, which contains $n$ data column vectors in an $m$-dimensional space. Traditional PCA approaches recombine the original, correlated data into a new set of independent indicators [9]. More specifically, this method reduces the input data to a $k$-dimensional subspace by minimizing

$$\min_{U, Q} \|X - UQ^T\|_F^2 \quad \text{s.t. } Q^TQ = I, \tag{1}$$

where each column of $U = (u_1, \ldots, u_k) \in \mathbb{R}^{m \times k}$ is a principal direction and $Q^T = (q_1, \ldots, q_n) \in \mathbb{R}^{k \times n}$ contains the projected data points in the new subspace.
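The PCA formulation above can be illustrated with a short NumPy sketch. It assumes the standard result that, for the constraint $Q^TQ = I$, the reconstruction error $\|X - UQ^T\|_F^2$ is minimized by taking the columns of $Q$ as the leading right singular vectors of $X$ and setting $U = XQ$. The function name `pca_subspace` and the synthetic data are illustrative only, and any centering of $X$ is left to the caller.

```python
import numpy as np

def pca_subspace(X, k):
    """Return (U, Q) minimizing ||X - U Q^T||_F^2 subject to Q^T Q = I.

    X is (m, n): m variables (genes) by n samples, as in the text.
    """
    # Thin SVD: the rows of Vt are orthonormal right singular vectors of X.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Q = Vt[:k].T      # (n, k), projected data with orthonormal columns
    U = X @ Q         # (m, k), principal directions for this Q
    return U, Q

# Synthetic example (not data from the paper): 50 variables, 12 samples.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 12))
U, Q = pca_subspace(X, k=3)
residual = np.linalg.norm(X - U @ Q.T, "fro") ** 2
```

With this choice the residual equals the sum of the squared singular values that were discarded, which is a quick sanity check on the implementation.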

##### 2.2. Graph-Laplacian PCA

Since traditional PCA does not take into account the intrinsic geometrical structure within the input data, the mutual influences among data points may be missed [9]. With the increasing popularity of manifold learning theory, it has become clear that the intrinsic geometrical structure is essential for modeling input data [15]. Graph-Laplacian is reported to be the fastest approach among manifold learning methods [14]. Its essential idea is to recover the low-dimensional manifold structure from high-dimensional sampled data. PCA is closely related to $K$-means clustering [22]: the principal components are also the continuous solution of the cluster indicators in the $K$-means clustering method. This provides a motivation to embed the Laplacian, whose primary purpose is clustering, into PCA [23, 24]. Let the symmetric weight matrix $W \in \mathbb{R}^{n \times n}$ be the nearest neighbor graph, where $W_{ij}$ is the weight of the edge connecting vertices $i$ and $j$. The value of $W_{ij}$ is set as follows:

$$W_{ij} = \begin{cases} 1, & \text{if } x_i \in N_k(x_j) \text{ or } x_j \in N_k(x_i), \\ 0, & \text{otherwise}, \end{cases} \tag{2}$$

where $N_k(x_i)$ is the set of $k$ nearest neighbors of $x_i$. Let $Q^T = (q_1, \ldots, q_n)$ be the embedding coordinates of the data and let $D = \operatorname{diag}(d_1, \ldots, d_n)$ be a diagonal matrix with $d_i = \sum_j W_{ij}$. $Q$ can be obtained by minimizing

$$\min_{Q} \frac{1}{2}\sum_{i,j=1}^{n} \|q_i - q_j\|^2 W_{ij} = \operatorname{tr}\left(Q^T (D - W) Q\right) = \operatorname{tr}\left(Q^T L Q\right) \quad \text{s.t. } Q^TQ = I, \tag{3}$$

where $d_i$ is the column (or row) sum of $W$ and $L = D - W$ is called the Laplacian matrix. Simply put, while maintaining the local adjacency relationships of the graph, the graph can be mapped from the high-dimensional space to a low-dimensional space (graph drawing). In view of this function of graph-Laplacian, Jiang et al. proposed a model named graph-Laplacian PCA (gLPCA), which incorporates the graph structure encoded in $W$ [23]. This model can be written as follows:

$$\min_{U, Q} \|X - UQ^T\|_F^2 + \alpha \operatorname{tr}\left(Q^T L Q\right) \quad \text{s.t. } Q^TQ = I, \tag{4}$$

where $\alpha$ is a parameter adjusting the contribution of the two parts. This model has three aspects. (a) It is a data representation, where $X \approx UQ^T$. (b) It uses $Q$ to embed manifold learning. (c) It is a nonconvex problem but has a closed-form solution and can be solved efficiently.
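As a rough illustration of the graph construction described above, the following NumPy sketch builds a symmetric 0/1 nearest-neighbor weight matrix $W$ from the columns of a data matrix, forms the Laplacian $L = D - W$, and leaves room to check the identity between the pairwise sum and the trace form. The function name `knn_graph_laplacian`, the use of Euclidean distance, and the synthetic data are assumptions made for the example.

```python
import numpy as np

def knn_graph_laplacian(X, n_neighbors):
    """Build W (symmetric 0/1 kNN graph over the columns of X) and L = D - W."""
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    # Squared Euclidean distances between all pairs of columns.
    dist = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    np.fill_diagonal(dist, np.inf)                 # a point is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :n_neighbors]
    W = np.zeros((n, n))
    rows = np.repeat(np.arange(n), n_neighbors)
    W[rows, nearest.ravel()] = 1.0
    W = np.maximum(W, W.T)                         # the "or" rule makes W symmetric
    D = np.diag(W.sum(axis=1))                     # degree matrix, d_i = sum_j W_ij
    return W, D - W

# Synthetic example (not data from the paper): 30 variables, 10 samples.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 10))
W, L = knn_graph_laplacian(X, n_neighbors=3)
```

For any embedding matrix $Q$ with rows $q_i$, `np.trace(Q.T @ L @ Q)` should then agree with half the weighted sum of squared distances $\|q_i - q_j\|^2 W_{ij}$ over all pairs.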

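From the perspective of individual data points, (4) can be rewritten as follows:

$$\min_{U, Q} \sum_{i=1}^{n} \|x_i - Uq_i\|^2 + \alpha \operatorname{tr}\left(Q^T L Q\right) \quad \text{s.t. } Q^TQ = I. \tag{5}$$

In this formula, the error of each data point is squared, so even a few abnormal values in the data can cause large errors. Thus, the authors formulate a robust version using the $L_{2,1}$-norm as follows:

$$\min_{U, Q} \|X - UQ^T\|_{2,1} + \alpha \operatorname{tr}\left(Q^T L Q\right) \quad \text{s.t. } Q^TQ = I. \tag{6}$$

However, the major contribution of the $L_{2,1}$-norm is to generate sparsity on rows, and the effect is not very pronounced [3, 25].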

#### 3. Proposed Algorithm

Research shows that a proper value of $p$ can achieve a more exact result for dimensionality reduction [17]. When $p \in [1/2, 1)$, the smaller $p$ is, the more effective the result will be [17]. Xu et al. developed a simple iterative thresholding representation theory for the $L_{1/2}$-norm and obtained the desired results [18]. Motivated by this theory, it is reasonable to introduce the $L_{1/2}$-norm on the error function to reduce the impact of outliers on the data. Based on the half thresholding theory, we propose a novel method using the $L_{1/2}$-norm on the error function by minimizing the following problem:

$$\min_{U, Q} \|X - UQ^T\|_{1/2}^{1/2} + \alpha \operatorname{tr}\left(Q^T L Q\right) \quad \text{s.t. } Q^TQ = I, \tag{7}$$

where the $L_{1/2}$-norm of a matrix $A$ is defined as $\|A\|_{1/2}^{1/2} = \sum_i \sum_j |a_{ij}|^{1/2}$, $X$ is the input data matrix, and $U$ and $Q$ are the principal directions and the subspace of projected data, respectively. We call this model graph-Laplacian PCA based on the $L_{1/2}$-norm constraint ($L_{1/2}$ gLPCA).
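To make the proposed objective concrete, here is a minimal sketch that evaluates the $L_{1/2}$ error term (the sum of square roots of absolute entries) together with the graph regularizer $\alpha \operatorname{tr}(Q^T L Q)$. The function names are hypothetical, and the small residual matrix `E` is made up only to show how a single large entry is weighted under the $L_{1/2}$ term compared with a squared error.

```python
import numpy as np

def half_quasi_norm(A):
    """Sum of |a_ij|^(1/2) over all entries of A."""
    return np.sum(np.sqrt(np.abs(A)))

def l_half_glpca_objective(X, U, Q, L, alpha):
    """L_{1/2} error of X - U Q^T plus alpha * tr(Q^T L Q)."""
    return half_quasi_norm(X - U @ Q.T) + alpha * np.trace(Q.T @ L @ Q)

# Made-up residual matrix with one large entry (100) acting as an outlier.
E = np.array([[4.0, -9.0],
              [0.0, 100.0]])
half_cost = half_quasi_norm(E)   # the outlier contributes sqrt(100) = 10
squared_cost = np.sum(E ** 2)    # the outlier contributes 100^2 = 10000
```

In this toy case the outlier accounts for two thirds of the $L_{1/2}$ cost but nearly all of the squared cost, which is the intuition behind using the $L_{1/2}$ error function for robustness.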

First, the subproblems are solved using the Augmented Lagrange Multipliers (ALM) method. Then, an efficient updating algorithm is presented to solve the overall optimization problem.

##### 3.1. Solving the Subproblems

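ALM is used to solve the subproblems. First, an auxiliary variable $S$ is introduced to rewrite the formulation (7) as follows:

$$\min_{S, U, Q} \|S\|_{1/2}^{1/2} + \alpha \operatorname{tr}\left(Q^T L Q\right) \quad \text{s.t. } S = X - UQ^T,\; Q^TQ = I. \tag{8}$$

The augmented Lagrangian function of (8) is defined as follows:

$$L_\mu(S, U, Q, \Lambda) = \|S\|_{1/2}^{1/2} + \operatorname{tr}\left(\Lambda^T (S - X + UQ^T)\right) + \frac{\mu}{2}\|S - X + UQ^T\|_F^2 + \alpha \operatorname{tr}\left(Q^T L Q\right), \tag{9}$$

where $\Lambda$ is the matrix of Lagrangian multipliers and $\mu$ is the step size of the update. By completing the square, (9) can be rewritten as

$$L_\mu(S, U, Q, \Lambda) = \|S\|_{1/2}^{1/2} + \frac{\mu}{2}\left\|S - X + UQ^T + \frac{\Lambda}{\mu}\right\|_F^2 + \alpha \operatorname{tr}\left(Q^T L Q\right) - \frac{1}{2\mu}\|\Lambda\|_F^2. \tag{10}$$

The general approach to minimizing (10) consists of the following iterations, where $t$ denotes the iteration index:

$$\begin{aligned} S^{t+1} &= \arg\min_{S} L_\mu(S, U^{t}, Q^{t}, \Lambda^{t}), \\ (U^{t+1}, Q^{t+1}) &= \arg\min_{U,\, Q^TQ = I} L_\mu(S^{t+1}, U, Q, \Lambda^{t}), \\ \Lambda^{t+1} &= \Lambda^{t} + \mu\left(S^{t+1} - X + U^{t+1}(Q^{t+1})^T\right), \\ \mu &\leftarrow \rho\mu. \end{aligned} \tag{11}$$

The details of updating each variable in (11) are given as follows.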

*Updating $S$*. First, we solve for $S$ while fixing $U$ and $Q$. The update of $S$ amounts to the following problem:

$$\min_{S} \|S\|_{1/2}^{1/2} + \frac{\mu}{2}\left\|S - \left(X - UQ^T - \frac{\Lambda}{\mu}\right)\right\|_F^2, \tag{12}$$

which is the proximal operator of the $L_{1/2}$-norm. Since this formulation is a nonconvex, nonsmooth, and non-Lipschitz optimization problem, an iterative half thresholding approach is used to solve it quickly; the approach is summarized in the following lemma [18].

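Lemma 1. *The proximal operator of the $L_{1/2}$-norm minimizes the following problem:*

$$\min_{x} \|y - x\|_2^2 + \lambda \|x\|_{1/2}^{1/2}, \tag{13}$$

*which is given by*

$$x^{*} = H_\lambda(y) = \left(h_\lambda(y_1), h_\lambda(y_2), \ldots, h_\lambda(y_n)\right)^T, \tag{14}$$

*where $y = (y_1, \ldots, y_n)^T$ and $h_\lambda(\cdot)$ is the half threshold operator, defined as follows:*

$$h_\lambda(y_i) = \begin{cases} \dfrac{2}{3}\, y_i \left(1 + \cos\left(\dfrac{2\pi}{3} - \dfrac{2}{3}\varphi_\lambda(y_i)\right)\right), & |y_i| > \dfrac{\sqrt[3]{54}}{4}\lambda^{2/3}, \\ 0, & \text{otherwise}, \end{cases} \tag{15}$$

*where $\varphi_\lambda(y_i) = \arccos\left(\dfrac{\lambda}{8}\left(\dfrac{|y_i|}{3}\right)^{-3/2}\right)$.*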
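The half threshold operator of Lemma 1 can be sketched elementwise in NumPy as below. The sketch assumes the scaling $\min_x (y - x)^2 + \lambda |x|^{1/2}$ for each entry, with threshold $\frac{\sqrt[3]{54}}{4}\lambda^{2/3}$; the function name `half_threshold` is illustrative. Under that scaling, a subproblem of the form $\|S\|_{1/2}^{1/2} + \frac{\mu}{2}\|S - A\|_F^2$ would correspond to $\lambda = 2/\mu$, which is a derivation made for this example rather than a statement taken from the paper.

```python
import numpy as np

def half_threshold(y, lam):
    """Elementwise half thresholding for min_x (y - x)^2 + lam * |x|^(1/2)."""
    y = np.asarray(y, dtype=float)
    thresh = (54.0 ** (1.0 / 3.0) / 4.0) * lam ** (2.0 / 3.0)
    out = np.zeros_like(y)
    mask = np.abs(y) > thresh                 # entries that survive thresholding
    ym = y[mask]
    phi = np.arccos((lam / 8.0) * (np.abs(ym) / 3.0) ** (-1.5))
    out[mask] = (2.0 / 3.0) * ym * (1.0 + np.cos(2.0 * np.pi / 3.0 - 2.0 * phi / 3.0))
    return out

# Small entries are set exactly to zero; large entries are shrunk toward zero.
y = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
x = half_threshold(y, lam=1.0)
```

Unlike soft thresholding, the operator is discontinuous at the threshold: an entry just above it jumps to a clearly nonzero value instead of growing continuously from zero.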

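*Solving $U$ and $Q$*. Here, we solve for $U$ while fixing the others. The update of $U$ amounts to solving

$$\min_{U} \frac{\mu}{2}\left\|S - X + UQ^T + \frac{\Lambda}{\mu}\right\|_F^2. \tag{16}$$

Letting $M = X - S - \Lambda/\mu$, (16) becomes $\min_U \|M - UQ^T\|_F^2$. Taking the partial derivative with respect to $U$ gives

$$\frac{\partial \|M - UQ^T\|_F^2}{\partial U} = -2\left(M - UQ^T\right)Q. \tag{17}$$

Setting the partial derivative to 0 and using $Q^TQ = I$, we have

$$U = MQ. \tag{18}$$

Then, we solve for $Q$ while fixing the others. Similarly, letting $M = X - S - \Lambda/\mu$ and $U = MQ$, the update of $Q$ can be written as follows:

$$\min_{Q} \frac{\mu}{2}\left\|M - MQQ^T\right\|_F^2 + \alpha \operatorname{tr}\left(Q^T L Q\right) \quad \text{s.t. } Q^TQ = I. \tag{19}$$

By some algebra, we have

$$\left\|M - MQQ^T\right\|_F^2 = \operatorname{tr}\left(M^TM\right) - \operatorname{tr}\left(Q^T M^T M Q\right). \tag{20}$$

Therefore, (19) can be rewritten as follows:

$$\min_{Q} \operatorname{tr}\left(Q^T\left(\alpha L - \frac{\mu}{2} M^TM\right)Q\right) \quad \text{s.t. } Q^TQ = I. \tag{21}$$

Thus, the optimal $Q$ can be obtained by calculating the eigenvectors

$$Q = (v_1, v_2, \ldots, v_k), \tag{22}$$

which correspond to the first $k$ smallest eigenvalues of the matrix $\alpha L - \frac{\mu}{2} M^TM$.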
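The closed-form step for the principal directions and the projected data can be sketched as follows, writing $M$ for the matrix being approximated, $L$ for the graph Laplacian, $\alpha$ for the regularization weight, and $\mu$ for the ALM step size. The sketch assumes the penalty is kept in the form $\frac{\mu}{2}\|M - UQ^T\|_F^2 + \alpha \operatorname{tr}(Q^T L Q)$, so that $Q$ collects the eigenvectors of $\alpha L - \frac{\mu}{2}M^TM$ with the smallest eigenvalues and $U = MQ$. The function name `update_U_Q` and the toy data are illustrative.

```python
import numpy as np

def update_U_Q(M, L, alpha, mu, k):
    """Minimize (mu/2)||M - U Q^T||_F^2 + alpha * tr(Q^T L Q) s.t. Q^T Q = I."""
    G = alpha * L - 0.5 * mu * (M.T @ M)
    G = 0.5 * (G + G.T)            # guard against round-off asymmetry
    _, V = np.linalg.eigh(G)       # eigh returns eigenvalues in ascending order
    Q = V[:, :k]                   # eigenvectors of the k smallest eigenvalues
    U = M @ Q                      # optimal U for this Q
    return U, Q

# Toy example: 20 variables, 8 samples, chain graph over the samples.
rng = np.random.default_rng(0)
M = rng.standard_normal((20, 8))
W = np.zeros((8, 8))
i = np.arange(7)
W[i, i + 1] = 1.0
W[i + 1, i] = 1.0
L = np.diag(W.sum(axis=1)) - W
U, Q = update_U_Q(M, L, alpha=0.5, mu=2.0, k=3)
```

With $M = X$ and $\mu = 2$ this reduces to the standard gLPCA objective, so the same routine can be used to check an implementation against the unregularized PCA solution by setting `alpha=0`.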

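*Updating $\Lambda$ and $\mu$*. The update of $\Lambda$ and $\mu$ is standard:

$$\Lambda \leftarrow \Lambda + \mu\left(S - X + UQ^T\right), \qquad \mu \leftarrow \rho\mu, \tag{23}$$

where $\rho$ is used to update the parameter $\mu$. The value of $\rho$ is usually larger than 1; a suitable range for $\rho$ was identified over a large number of experiments, and a value from that range was selected in practice.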

The complete procedure is summarized in Algorithm 1.
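The updates described in this section might be combined into a loop along the following lines. This is a sketch under stated assumptions, not a reproduction of Algorithm 1: the initialization (zero auxiliary variable and multipliers), the update order within one pass, the default values `mu=1.0`, `rho=1.2`, and `n_iter=30`, and the threshold parameter $2/\mu$ in the half thresholding step are all choices made for the example. The names `half_threshold` and `l_half_glpca` are hypothetical.

```python
import numpy as np

def half_threshold(Y, lam):
    """Elementwise half thresholding for min_x (y - x)^2 + lam * |x|^(1/2)."""
    thresh = (54.0 ** (1.0 / 3.0) / 4.0) * lam ** (2.0 / 3.0)
    out = np.zeros_like(Y, dtype=float)
    mask = np.abs(Y) > thresh
    phi = np.arccos((lam / 8.0) * (np.abs(Y[mask]) / 3.0) ** (-1.5))
    out[mask] = (2.0 / 3.0) * Y[mask] * (1.0 + np.cos(2.0 * np.pi / 3.0 - 2.0 * phi / 3.0))
    return out

def l_half_glpca(X, L, k, alpha=1.0, mu=1.0, rho=1.2, n_iter=30):
    """ALM-style loop; all default parameter values are illustrative assumptions."""
    X = np.asarray(X, dtype=float)
    S = np.zeros_like(X)       # auxiliary error variable
    Lam = np.zeros_like(X)     # Lagrangian multipliers
    for _ in range(n_iter):
        # U and Q step: eigenvectors of alpha*L - (mu/2) M^T M, then U = M Q.
        M = X - S - Lam / mu
        G = alpha * L - 0.5 * mu * (M.T @ M)
        _, V = np.linalg.eigh(0.5 * (G + G.T))
        Q = V[:, :k]
        U = M @ Q
        # S step: half thresholding of the shifted residual.
        S = half_threshold(X - U @ Q.T - Lam / mu, 2.0 / mu)
        # Multiplier and step-size updates.
        Lam = Lam + mu * (S - X + U @ Q.T)
        mu = rho * mu
    return U, Q, S

# Synthetic example: rank-2 data with one gross outlier and a chain graph.
rng = np.random.default_rng(0)
n = 15
X = rng.standard_normal((40, 2)) @ rng.standard_normal((2, n))
X[3, 5] += 25.0
W = np.zeros((n, n))
i = np.arange(n - 1)
W[i, i + 1] = 1.0
W[i + 1, i] = 1.0
L = np.diag(W.sum(axis=1)) - W
U, Q, S = l_half_glpca(X, L, k=2)
```

A fixed iteration count is used here for simplicity; a practical implementation would more likely stop when the constraint residual between the auxiliary variable and the reconstruction error falls below a tolerance.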