BioMed Research International

Volume 2017, Article ID 1096028, 8 pages

https://doi.org/10.1155/2017/1096028

## Gene Feature Extraction Based on Nonnegative Dual Graph Regularized Latent Low-Rank Representation

School of Electrical Engineering and Automation, Jiangxi University of Science and Technology, Ganzhou 341000, China

Correspondence should be addressed to Guoliang Yang; ygliang30@126.com and Zhengwei Hu; huzhengwei1993@163.com

Received 21 January 2017; Accepted 13 March 2017; Published 30 March 2017

Academic Editor: Gang Liu

Copyright © 2017 Guoliang Yang and Zhengwei Hu. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

To address the high redundancy and heavy noise of gene expression profiles, a new feature extraction model based on nonnegative dual graph regularized latent low-rank representation (NNDGLLRR) is presented, building on latent low-rank representation (Lat-LRR). By introducing a dual graph manifold regularization constraint, NNDGLLRR preserves the internal spatial structure of the original data while segmenting the subspaces, which improves the final clustering accuracy. The nonnegative constraints give the computed representation a degree of sparsity, which enhances the robustness of the algorithm. Unlike Lat-LRR, a new solution scheme is adopted to reduce the computational complexity. The experimental results show that the proposed algorithm extracts features well from highly redundant, noisy gene expression profiles and achieves better clustering accuracy than LRR and Lat-LRR.

#### 1. Introduction

With the accelerating pace of modern life, the high incidence of cancer poses a great challenge to human health. How to detect, prevent, and treat cancer effectively has become a focus of international medical research. A gene expression profile is cell-specific cDNA sequence data that describes the current physiological function and state of cells. Research shows that tumor cells and normal cells can be distinguished effectively by analyzing and processing the original gene expression data. However, because of the diversity and specificity of cells, gene expression profiles are very large and complex, so traditional methods of data analysis and processing cannot cope with data at this scale.

Feature extraction methods for gene expression profiles fall into two kinds: linear and nonlinear. Early linear transformation methods include principal component analysis (PCA) [1–3], linear discriminant analysis (LDA) [4–6], and independent component analysis (ICA) [7, 8]. The main nonlinear transformation methods include kernel methods [9], neural networks [10, 11], manifold learning [12, 13], and sparse representation [14, 15]. In recent years, LRR [16–18] and neural networks have been widely used in feature extraction and classification of gene expression profiles. Ref. [19] used nonnegative matrix factorization (NMF) for gene feature extraction and achieved satisfactory results. Ref. [20] proposed an ontology-aware classification method for gene expression profiles. Ref. [21] proposed a subcellular cooccurrence matrix feature extraction method. Ref. [22] proposed a gene expression profile classification method based on a neural network with hybrid back-propagation. Ref. [23] proposed a supervised multiview method for tumor prediction.

Gene expression profiles are large, and the samples are interrelated. A linear transformation may destroy the internal spatial structure of the data. In this paper, a feature extraction model based on NNDGLLRR is proposed, building on Lat-LRR; its low-rank and sparse constraints remove the redundant components of gene expression and suppress the noise. The nonnegative constraints give the computation a certain degree of sparsity, agree with the practical meaning of the data, and enhance the robustness of the algorithm. In addition, a manifold regularization constraint is introduced so that the extracted features describe the spatial structure of the original data more completely.

#### 2. Related Work

##### 2.1. LRR

LRR combines low-rank matrix decomposition with sparse decomposition. In recent years, it has been widely used in subspace clustering. LRR assumes that the original data come from different subspaces and performs feature extraction by seeking the lowest-rank representation of the original data. The low-rank representation coefficients reflect the structural information in the spatial distribution of the original data. Let the original data be $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d \times n}$, where each column represents a sample; LRR generally uses the data itself as the dictionary. The model is then as shown in

$$\min_{Z, E} \; \|Z\|_* + \lambda \|E\|_1 \quad \text{s.t.} \quad X = XZ + E. \tag{1}$$

The LRR matrix is $Z = [z_1, z_2, \ldots, z_n]$, where $z_i$ is the linear representation coefficient of the sample $x_i$ under the data dictionary $X$. The original data usually contain a lot of noise, and the sparse constraint on the error term $E$ keeps the algorithm robust. Ref. [24] gives the specific solution process of LRR.

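Let $Z = J$; we construct the following augmented Lagrangian function:

$$\mathcal{L} = \|J\|_* + \lambda \|E\|_1 + \operatorname{tr}\bigl(Y_1^T (X - XZ - E)\bigr) + \operatorname{tr}\bigl(Y_2^T (Z - J)\bigr) + \frac{\mu}{2}\bigl(\|X - XZ - E\|_F^2 + \|Z - J\|_F^2\bigr), \tag{2}$$

where $Y_1$ and $Y_2$ are Lagrangian multipliers and $\mu > 0$ is a penalty constant.

The specific update algorithm is as follows.

Keep $Z$ and $Y_2$ fixed; update $J$:

$$J = \arg\min_{J} \; \frac{1}{\mu}\|J\|_* + \frac{1}{2}\Bigl\|J - \Bigl(Z + \frac{Y_2}{\mu}\Bigr)\Bigr\|_F^2. \tag{3}$$

Keep $J$, $E$, and the multipliers $Y_1$, $Y_2$ fixed; update $Z$:

$$Z = (I + X^T X)^{-1}\Bigl(X^T X - X^T E + J + \frac{X^T Y_1 - Y_2}{\mu}\Bigr). \tag{4}$$

Keep $Z$ and $Y_1$ fixed; update $E$:

$$E = \arg\min_{E} \; \frac{\lambda}{\mu}\|E\|_1 + \frac{1}{2}\Bigl\|E - \Bigl(X - XZ + \frac{Y_1}{\mu}\Bigr)\Bigr\|_F^2. \tag{5}$$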
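The alternating updates of Section 2.1 can be sketched as an inexact augmented Lagrangian loop in NumPy. This is a minimal sketch, assuming the LRR model $\min \|Z\|_* + \lambda\|E\|_1$ s.t. $X = XZ + E$ with the auxiliary variable $J = Z$. The function names (`svt`, `soft`, `lrr_alm`) and the schedule parameters (`mu`, `rho`, `mu_max`, `tol`) are illustrative choices and are not taken from the paper.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: shrink each singular value of M by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def soft(M, tau):
    """Elementwise soft thresholding (proximal operator of the l1 norm)."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def lrr_alm(X, lam=0.1, mu=1e-2, mu_max=1e6, rho=1.5, tol=1e-6, max_iter=500):
    """Inexact ALM for min ||Z||_* + lam*||E||_1  s.t.  X = XZ + E, Z = J."""
    d, n = X.shape
    Z = np.zeros((n, n)); J = np.zeros((n, n)); E = np.zeros((d, n))
    Y1 = np.zeros((d, n)); Y2 = np.zeros((n, n))
    XtX = X.T @ X
    I = np.eye(n)
    for _ in range(max_iter):
        # J-step: nuclear-norm proximal step on Z + Y2/mu
        J = svt(Z + Y2 / mu, 1.0 / mu)
        # Z-step: closed-form solution of a linear system
        Z = np.linalg.solve(I + XtX, XtX - X.T @ E + J + (X.T @ Y1 - Y2) / mu)
        # E-step: l1 proximal step on the current residual
        E = soft(X - X @ Z + Y1 / mu, lam / mu)
        # Multiplier and penalty updates
        R1 = X - X @ Z - E
        R2 = Z - J
        Y1 += mu * R1
        Y2 += mu * R2
        mu = min(rho * mu, mu_max)
        if max(np.abs(R1).max(), np.abs(R2).max()) < tol:
            break
    return Z, E
```

Calling `Z, E = lrr_alm(X)` on a `d × n` data matrix returns the `n × n` representation matrix and the `d × n` sparse error term. Growing `mu` geometrically is a common heuristic for this kind of solver and is an assumption here.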

##### 2.2. Lat-LRR

LRR relies on two conditions: the original data $X$ must contain enough samples, and $X$ must contain enough uncorrupted data. However, these two conditions are almost impossible to satisfy for gene data. On the one hand, the number of gene samples available for research is small because gene sequencing is expensive. On the other hand, noise is inevitably introduced during gene sequencing by the process itself, electromagnetic interference in the instruments, and other factors. To overcome these limitations of LRR, [25] proposed Lat-LRR, which expresses the original observation data as a linear combination of principal features and latent features for feature extraction. Considering the heavy noise in gene expression profiles, we add a sparsity constraint to the model to construct the following Lat-LRR function:

$$\min_{Z, L, E} \; \|Z\|_* + \|L\|_* + \lambda \|E\|_1 \quad \text{s.t.} \quad X = XZ + LX + E. \tag{6}$$
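To make the roles of the Lat-LRR terms concrete, the following sketch splits a data matrix into its principal-feature part, latent-feature part, and residual, and evaluates the objective. It assumes the model form $\min \|Z\|_* + \|L\|_* + \lambda\|E\|_1$ s.t. $X = XZ + LX + E$; the helper names `latlrr_split` and `latlrr_objective` are hypothetical and are used only for illustration.

```python
import numpy as np

def latlrr_split(X, Z, L):
    """Split X (features x samples) into principal, latent, and residual parts."""
    principal = X @ Z          # principal features: samples represented by samples
    latent = L @ X             # latent features: rows (genes) represented by rows
    noise = X - principal - latent
    return principal, latent, noise

def latlrr_objective(Z, L, E, lam):
    """Value of ||Z||_* + ||L||_* + lam * ||E||_1."""
    def nuclear(M):
        return np.linalg.svd(M, compute_uv=False).sum()
    return nuclear(Z) + nuclear(L) + lam * np.abs(E).sum()
```

For example, with `Z = 0.5 * np.eye(n)` and `L = 0.5 * np.eye(d)` the two parts sum to `X` exactly and the residual is zero, which is a feasible but uninformative point; the low-rank terms in the objective are what push the solver away from such trivial solutions.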

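The solution of Lat-LRR is given in [26]. The alternating direction method (ADM) is adopted to solve model (6). Let $Z = J$ and $L = S$; we construct the following augmented Lagrangian function:

$$\begin{aligned} \mathcal{L} = {} & \|J\|_* + \|S\|_* + \lambda \|E\|_1 + \operatorname{tr}\bigl(Y_1^T (X - XZ - LX - E)\bigr) + \operatorname{tr}\bigl(Y_2^T (Z - J)\bigr) + \operatorname{tr}\bigl(Y_3^T (L - S)\bigr) \\ & + \frac{\mu}{2}\bigl(\|X - XZ - LX - E\|_F^2 + \|Z - J\|_F^2 + \|L - S\|_F^2\bigr), \end{aligned} \tag{7}$$

where $Y_1$, $Y_2$, and $Y_3$ are Lagrangian multipliers and $\mu > 0$ is a penalty constant.

Keep $Z$ and $Y_2$ fixed; update $J$:

$$J = \arg\min_{J} \; \frac{1}{\mu}\|J\|_* + \frac{1}{2}\Bigl\|J - \Bigl(Z + \frac{Y_2}{\mu}\Bigr)\Bigr\|_F^2. \tag{8}$$

Keep $L$ and $Y_3$ fixed; update $S$:

$$S = \arg\min_{S} \; \frac{1}{\mu}\|S\|_* + \frac{1}{2}\Bigl\|S - \Bigl(L + \frac{Y_3}{\mu}\Bigr)\Bigr\|_F^2. \tag{9}$$

Keep $J$, $L$, $E$, $Y_1$, and $Y_2$ fixed; update $Z$:

$$Z = (I + X^T X)^{-1}\Bigl(X^T (X - LX - E) + J + \frac{X^T Y_1 - Y_2}{\mu}\Bigr). \tag{10}$$

Keep $S$, $Z$, $E$, and the multipliers $Y_1$, $Y_3$ fixed; update $L$:

$$L = \Bigl((X - XZ - E) X^T + S + \frac{Y_1 X^T - Y_3}{\mu}\Bigr)(I + X X^T)^{-1}. \tag{11}$$

Keep $Z$, $L$, and $Y_1$ fixed; update $E$:

$$E = \arg\min_{E} \; \frac{\lambda}{\mu}\|E\|_1 + \frac{1}{2}\Bigl\|E - \Bigl(X - XZ - LX + \frac{Y_1}{\mu}\Bigr)\Bigr\|_F^2. \tag{12}$$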

#### 3. Method

##### 3.1. NNDGLLRR

Lat-LRR relaxes the strict requirements that LRR places on the dictionary; however, Lat-LRR has limited ability to recover the subspaces, and its solution involves many auxiliary variables, which requires many singular value decompositions (SVD) and matrix inversions and degrades the performance of the algorithm. Ref. [27] proposed a feature extraction method combining a manifold constraint with nonnegative matrix factorization (NMF): while NMF reduces the dimensionality, the manifold regularization constraint maintains the internal spatial structure of the data, and good experimental results are obtained. Refs. [28, 29] proposed an image clustering method combining a manifold regularization constraint with Lat-LRR. Like image data, a gene expression profile is a numerical matrix with high redundancy and heavy noise. Considering this characteristic, we construct a new NNDGLLRR model, model (13), on the basis of the Lat-LRR model (6).

In model (13), $\lambda$, $\alpha$, and $\beta$ are nonnegative constants; the model reduces to a nonnegative latent low-rank representation (NNLLRR) when $\alpha$ and $\beta$ are equal to zero, so model (13) takes a more general form. The dual graph regularization constraint preserves the internal spatial structure of the original data, and the sparse and nonnegative constraints maintain and enhance the robustness of the algorithm. $L_Z$ and $L_L$ are Laplacian matrices, $L_Z = D_Z - W_Z$ and $L_L = D_L - W_L$, where $D_Z$ and $D_L$ are the diagonal degree matrices of the weight matrices $W_Z$ and $W_L$. There are many ways to compute the weight matrices; here we use the Gaussian heat kernel weight:

$$(W_Z)_{ij} = \exp\!\left(-\frac{\|x_i - x_j\|^2}{\sigma}\right), \qquad (W_L)_{ij} = \exp\!\left(-\frac{\|x^i - x^j\|^2}{\sigma}\right), \tag{14}$$

where $\sigma$ is a constant; $x_i$ and $x_j$ represent the $i$th and $j$th columns of $X$ (the $i$th and $j$th samples); and $x^i$ and $x^j$ represent the $i$th and $j$th rows of $X$.
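The dual graph construction can be illustrated with a short NumPy sketch: one Laplacian is built over the samples (columns of $X$) and one over the rows. It assumes a fully connected graph with weight $\exp(-\|p - q\|^2/\sigma)$ between points $p$ and $q$; whether the paper restricts edges to nearest neighbours or scales $\sigma$ differently is not recoverable from the text, so those details are assumptions, and the function names are hypothetical.

```python
import numpy as np

def heat_kernel_laplacian(points, sigma=1.0):
    """Graph Laplacian D - W for points given one per row, shape (m, k)."""
    sq_norms = np.sum(points ** 2, axis=1)
    # Pairwise squared Euclidean distances via the expansion |p|^2 + |q|^2 - 2 p.q
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    sq_dist = np.maximum(sq_dist, 0.0)   # guard against tiny negative round-off
    W = np.exp(-sq_dist / sigma)         # Gaussian heat kernel weights
    D = np.diag(W.sum(axis=1))           # diagonal degree matrix
    return D - W

def dual_graph_laplacians(X, sigma=1.0):
    """Return the sample-graph and row-graph Laplacians of X (rows x samples)."""
    L_Z = heat_kernel_laplacian(X.T, sigma)  # nodes are columns of X (samples)
    L_L = heat_kernel_laplacian(X, sigma)    # nodes are rows of X
    return L_Z, L_L
```

For a `d × n` matrix the sample graph gives an `n × n` Laplacian and the row graph a `d × d` one, matching the sizes of $Z$ and $L$. The self-weights on the diagonal of `W` cancel in `D - W`, so they do not need to be zeroed.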

ADM is used to solve model (13), and the augmented Lagrange function (15) is constructed, where $Y$ is a Lagrangian multiplier and $\mu$ is a constant with $\mu > 0$.

Real-world data are generally nonnegative, and nonnegative constraints give the computation a certain degree of sparseness and enhance the robustness of the algorithm. To keep the extracted features nonnegative, we define the following operators:

The solution of model (15) is divided into three subproblems: first, solving for the variable $Z$; second, solving for the variable $L$; and third, solving for the variable $E$.

*(1) Solving the First Subproblem.* Update $Z$:

Applying a second-order Taylor expansion to (17), the approximate solution of $Z$ is as follows:

The nonnegative constraint on $Z$ is applied as follows:

Define ; ; ; . Ref. [30] gives the solution of ; the solution process is as follows:

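In (20), the matrix to be thresholded is factored by singular value decomposition (SVD), and the singular value thresholding (SVT) operator is applied to the vector of its singular values: each singular value $\sigma_i$ is replaced by $\max(\sigma_i - \tau, 0)$ for a threshold $\tau > 0$.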
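The Taylor-expansion step of the first subproblem follows a standard pattern, which can be written out generically. The derivation below is a sketch for a problem of the form $\min_Z \|Z\|_* + f(Z)$ with $f$ smooth; the symbols $f$, $\eta$, and $Z_k$ are assumptions introduced for illustration and are not the paper's notation, and the paper's own equations (17) to (21) may group the terms differently.

```latex
% Generic linearized step: f is smooth, \eta > 0 bounds the Lipschitz
% constant of \nabla f, and Z_k is the current iterate.
\begin{aligned}
Z_{k+1}
  &= \arg\min_{Z}\; \|Z\|_* + \langle \nabla f(Z_k),\, Z - Z_k \rangle
     + \frac{\eta}{2}\,\|Z - Z_k\|_F^2 \\
  &= \arg\min_{Z}\; \frac{1}{\eta}\,\|Z\|_*
     + \frac{1}{2}\,\Bigl\| Z - \bigl(Z_k - \tfrac{1}{\eta}\nabla f(Z_k)\bigr) \Bigr\|_F^2 \\
  &= U \,\operatorname{diag}\!\bigl(\max(\sigma_i - \tfrac{1}{\eta},\, 0)\bigr)\, V^{T},
  \qquad
  U \operatorname{diag}(\sigma_i)\, V^{T}
     = \operatorname{SVD}\!\bigl(Z_k - \tfrac{1}{\eta}\nabla f(Z_k)\bigr).
\end{aligned}
```

The second line completes the square, and the third is the closed-form minimizer of a nuclear norm plus a squared Frobenius distance, which is exactly singular value thresholding with threshold $1/\eta$. Replacing the exact quadratic term by $\frac{\eta}{2}\|Z - Z_k\|_F^2$ is what removes the matrix inversion from the update.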

*(2) Solving the Second Subproblem.* Similarly, update $L$:

The nonnegative constraint on $L$ is applied as follows:

Define ; .

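*(3) Solving the Third Subproblem.* Update $E$ by applying a soft threshold (ST) operator, which acts elementwise and is defined as follows:

$$S_{\varepsilon}(x) = \operatorname{sign}(x)\,\max(|x| - \varepsilon,\, 0).$$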
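One pass over the three subproblems can be sketched as follows. This is an illustrative sketch only, not the authors' Algorithm 1: it assumes the model $\min \|Z\|_* + \|L\|_* + \lambda\|E\|_1 + \alpha\operatorname{tr}(Z L_Z Z^T) + \beta\operatorname{tr}(L^T L_L L)$ s.t. $X = XZ + LX + E$, $Z \ge 0$, $L \ge 0$, a single multiplier $Y$, a linearized (gradient plus SVT) step for $Z$ and $L$, and projection onto the nonnegative orthant after thresholding. The function name `nndgllrr_step` and the step sizes `eta_z`, `eta_l` are hypothetical.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: shrink each singular value of M by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def soft(M, tau):
    """Elementwise soft threshold: sign(x) * max(|x| - tau, 0)."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def nndgllrr_step(X, Z, L, E, Y, L_Z, L_L, lam, alpha, beta, mu):
    """One sweep over the Z, L, and E subproblems (assumed linearized form)."""
    # Step sizes: upper bounds on the Lipschitz constants of the smooth terms.
    x_norm_sq = np.linalg.norm(X, 2) ** 2
    eta_z = mu * x_norm_sq + 2.0 * alpha * np.linalg.norm(L_Z, 2)
    eta_l = mu * x_norm_sq + 2.0 * beta * np.linalg.norm(L_L, 2)

    # (1) Z: gradient step on the smooth terms, SVT, then projection to Z >= 0.
    R = X - X @ Z - L @ X - E + Y / mu
    grad_z = 2.0 * alpha * Z @ L_Z - mu * X.T @ R
    Z = np.maximum(svt(Z - grad_z / eta_z, 1.0 / eta_z), 0.0)

    # (2) L: the same pattern, with the row-graph Laplacian.
    R = X - X @ Z - L @ X - E + Y / mu
    grad_l = 2.0 * beta * L_L @ L - mu * R @ X.T
    L = np.maximum(svt(L - grad_l / eta_l, 1.0 / eta_l), 0.0)

    # (3) E: elementwise soft threshold of the current residual.
    E = soft(X - X @ Z - L @ X + Y / mu, lam / mu)

    # Multiplier update on the equality constraint X = XZ + LX + E.
    Y = Y + mu * (X - X @ Z - L @ X - E)
    return Z, L, E, Y
```

Starting from zero matrices and calling the function repeatedly keeps `Z` and `L` nonnegative at every iteration. Because no auxiliary variables are introduced, each sweep needs two SVDs and no matrix inversion, which is consistent with the stated goal of a simpler solver than Lat-LRR.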

The iterative process of each variable of NNDGLLRR is given above. The concrete updating process is shown in Algorithm 1.