Gene Feature Extraction Based on Nonnegative Dual Graph Regularized Latent Low-Rank Representation
To address the high redundancy and heavy noise of gene expression profiles, a new feature extraction model based on nonnegative dual graph regularized latent low-rank representation (NNDGLLRR) is presented, building on latent low-rank representation (Lat-LRR). By introducing a dual graph manifold regularized constraint, NNDGLLRR effectively preserves the internal spatial structure of the original data and improves the final clustering accuracy while segmenting the subspace. The nonnegative constraints give the computation a degree of sparsity, which enhances the robustness of the algorithm. Unlike Lat-LRR, a new solution scheme is adopted to reduce the computational complexity. The experimental results show that the proposed algorithm extracts features well from gene expression profiles with heavy redundancy and noise and achieves better clustering accuracy than LRR and Lat-LRR.
With the accelerated pace of modern life, the high incidence of cancer poses a great challenge to human health. How to detect, prevent, and treat cancer effectively has become an international focus of medical research. A gene expression profile is cDNA sequence data specific to a cell, and it describes the cell's current physiological function and state. Research shows that tumor cells and normal cells can be distinguished effectively by analyzing and processing the original gene expression data. However, gene expression profiles are huge and complex because of the diversity and specificity of cells; therefore, traditional methods of data analysis and processing cannot cope with data at this scale.
Feature extraction methods for gene expression profiles are of two kinds: linear and nonlinear. Early linear transformation methods include principal component analysis [1–3] (PCA), linear discriminant analysis [4–6] (LDA), and independent component analysis [7, 8] (ICA). The main nonlinear transformation methods include the kernel method , neural networks [10, 11], manifold learning [12, 13], and sparse representation [14, 15]. In recent years, low-rank representation (LRR) [16–18] and neural networks have been widely used in feature extraction and classification of gene expression profiles. Reference  used NMF for gene feature extraction and achieved more satisfactory results. Ref.  proposed a gene expression profile classification method based on ontology perception. Ref.  proposed a feature extraction method based on a subcellular cooccurrence matrix. Ref.  proposed a gene expression profile classification method using a hybrid back-propagation neural network. Ref.  proposed a supervised multiview method for tumor prediction.
Gene expression profiles are large, and the samples are interrelated. The internal spatial structure of the data may be destroyed in the process of linear transformation. In this paper, a feature extraction model based on NNDGLLRR is proposed on the basis of Lat-LRR; its low-rank and sparse constraints remove the redundant components of gene expression data and suppress the noise. The nonnegative constraints give the computation a certain degree of sparsity, agree with the practical meaning of the data, and enhance the robustness of the algorithm. In addition, a manifold regularized constraint is introduced so that the extracted features describe the spatial structure of the original data more completely.
2. Related Work
LRR combines low-rank matrix decomposition with sparse decomposition and, in recent years, has been widely used in subspace clustering. LRR assumes that the original data are drawn from different subspaces and performs feature extraction by seeking the lowest-rank representation of the original data. The low-rank representation coefficients reflect the structural information in the spatial distribution of the original data. If the original data , each column represents a sample, and generally the LRR uses the data itself as a dictionary. Then the model can be as shown in
The LRR matrix , and is the linear representation coefficient of the sample under the data dictionary . The original data usually contains a lot of noise, while the sparse constraint can maintain the robustness of the algorithm effectively. Ref.  shows the specific solution process of LRR.
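For orientation, the LRR model with a sparse noise term is commonly written in the literature as follows. This is a sketch in assumed notation rather than necessarily the exact form used in this paper: $X$ is the data matrix with one sample per column (also used as the dictionary), $Z$ is the representation coefficient matrix, $E$ is the noise, and $\lambda > 0$ is a trade-off weight. Many LRR formulations place the $\ell_{2,1}$ norm on $E$ instead of the $\ell_1$ norm shown here.

```latex
\[
\min_{Z,E}\ \|Z\|_{*} + \lambda \|E\|_{1}
\quad \text{s.t.} \quad X = XZ + E
\]
```

Here $\|Z\|_{*}$ is the nuclear norm (the sum of singular values), which serves as the convex surrogate for the rank of $Z$.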
Let ; we construct the following Augmented Lagrangian function:
The specific update algorithm is as follows.
Keep , ; update :
Keep , , and ; update :
Keep , ; update :
LRR relies on two conditions: one is that the original data contains enough samples, and the other is that it contains enough uncontaminated data. However, these two conditions are almost impossible to satisfy for gene data. On the one hand, few gene samples are available for research because gene sequencing is expensive. On the other hand, the sequencing process, electromagnetic interference from instruments, and other factors inevitably introduce noise during gene sequencing. To overcome this limitation of LRR,  proposed the Lat-LRR method, which expresses the original observation data as a linear combination of principal features and latent features for feature extraction. Considering the heavy noise in gene expression profiles, we added sparsity constraints to the model to construct the following Lat-LRR function:
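For orientation, the Lat-LRR objective with a sparse noise term is commonly written in the literature as below. This is a sketch in assumed notation, not necessarily the exact function used here: $X$ is the data matrix, $XZ$ carries the principal features, $LX$ carries the latent features, $E$ is the sparse noise, and $\lambda > 0$ weights the sparsity term.

```latex
\[
\min_{Z,L,E}\ \|Z\|_{*} + \|L\|_{*} + \lambda \|E\|_{1}
\quad \text{s.t.} \quad X = XZ + LX + E
\]
```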
Keep and ; update :
Keep , ; update :
Keep , , , , and ; update :
Keep , , , and ; update :
Keep , ; update :
Lat-LRR relaxes the overly strict dictionary requirements of LRR; however, its ability to recover the subspace is limited, and solving it involves too many auxiliary variables together with many singular value decompositions (SVD) and matrix inversions, which degrades the performance of the algorithm. Ref.  proposed a feature extraction method combining a manifold constraint with nonnegative matrix factorization (NMF). While NMF reduces the dimensionality, the manifold regularized constraint maintains the internal spatial structure of the data, and good experimental results are obtained. Refs. [28, 29] proposed an image clustering method combining a manifold regularized constraint with Lat-LRR. Like image data, a gene expression profile is a numerical matrix with high redundancy and heavy noise. Considering this characteristic, we constructed a new NNDGLLRR model on the basis of the original model.
where , , and are nonnegative constants; the model reduces to a nonnegative latent low-rank representation (NNLLRR) when α and are equal to zero. Model (13) takes a more general form. The dual graph regularized constraint is used to preserve the internal spatial structure of the original data, and the sparse and nonnegative constraints are used to maintain and enhance the robustness of the algorithm. and are Laplacian matrices, , . , and are weight matrices; there are many ways to construct , and here we use Gaussian heat kernel weights. The specific solution is as follows:
where is a constant; and represent the th column and th column of (th and th sample); and represent the th row and the th row of , .
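The dual graph construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the weights are dense Gaussian (heat kernel) weights of the assumed form exp(-||a - b||^2 / sigma), the Laplacian is the unnormalized L = D - W, and the function names and the `sigma` parameter are hypothetical. One graph is built over the samples (columns of `X`) and one over the features (rows of `X`).

```python
import numpy as np

def heat_kernel_weights(points, sigma=1.0):
    """Dense Gaussian heat-kernel weights between the rows of `points`.

    Assumed form: W[i, j] = exp(-||p_i - p_j||^2 / sigma), with zero diagonal.
    """
    sq_norms = np.sum(points ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives from round-off
    weights = np.exp(-sq_dists / sigma)
    np.fill_diagonal(weights, 0.0)  # no self-loops
    return weights

def graph_laplacian(weights):
    """Unnormalized graph Laplacian L = D - W, where D holds the row sums of W."""
    degree = np.diag(weights.sum(axis=1))
    return degree - weights

def dual_graph_laplacians(X, sigma=1.0):
    """Return (sample-graph Laplacian, feature-graph Laplacian).

    X is assumed to hold one sample per column, so samples are the rows of X.T.
    """
    L_samples = graph_laplacian(heat_kernel_weights(X.T, sigma))
    L_features = graph_laplacian(heat_kernel_weights(X, sigma))
    return L_samples, L_features
```

For a data matrix with m features and n samples, the two Laplacians have sizes n x n and m x m. In practice the dense weights are often restricted to k nearest neighbours to keep the graphs sparse; that variant is not shown here.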
The alternating direction method (ADM) is used to solve model (12), and the following augmented Lagrangian function is constructed:
where is a Lagrangian multiplier; is a constant and .
Real-world data is generally nonnegative, and nonnegative constraints give the computation a certain degree of sparsity and enhance the robustness of the algorithm. To keep the extracted features nonnegative, we define the following operators:
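One common way to enforce nonnegativity inside an iterative solver is an elementwise projection applied after each update. The sketch below assumes that form; the operator name `project_nonnegative` is hypothetical, and the operators defined in this paper may differ in detail.

```python
import numpy as np

def project_nonnegative(A):
    """Elementwise projection onto the nonnegative orthant: max(A, 0)."""
    return np.maximum(A, 0.0)

# Usage: entries that an unconstrained update pushes below zero become exact
# zeros, which is one way a nonnegative constraint induces sparsity.
update = np.array([[0.7, -0.2], [-1.3, 0.4]])
projected = project_nonnegative(update)
zero_fraction = np.mean(projected == 0.0)
```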
The solution of model (15) is divided into three subproblems: first, solving for variable ; second, solving for variables ; and, third, solving for variable .
(1) Solving the First Subproblem. Update :
Applying a second-order Taylor expansion to (17) gives the following approximate solution of :
The nonnegative constraints on are as follows:
Define ; ; ; . Ref.  gives the solution of ; the solution process is as follows:
In (20), is the singular value decomposition (SVD) of , is the vector form of the singular value thresholding (SVT) operator, and is defined as follows:
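The singular value thresholding step can be sketched as follows. This is the standard SVT operator (the proximal operator of the nuclear norm); the function name `svt` and the threshold name `tau` are illustrative, and how the threshold is derived from the model parameters is not shown.

```python
import numpy as np

def svt(A, tau):
    """Singular value thresholding: shrink each singular value of A by tau.

    Returns U * max(S - tau, 0) * V^T, where A = U * S * V^T is the thin SVD.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (U * s_shrunk) @ Vt  # scale the columns of U, then recombine
```

Singular values below `tau` are set to zero, so the result has rank no higher than that of `A`; this is how the nuclear-norm term drives the representation toward low rank.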
(2) Solving the Second Subproblem. Similarly, update :
The nonnegative constraints on are as follows:
Define ; .
(3) Solving the Third Subproblem. Update :
where is a soft thresholding (ST) operator; is defined as follows:
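The soft thresholding step can be sketched as follows. This is the standard elementwise operator (the proximal operator of the l1 norm); the function name `soft_threshold` and the threshold name `tau` are illustrative.

```python
import numpy as np

def soft_threshold(A, tau):
    """Elementwise soft thresholding: sign(a) * max(|a| - tau, 0)."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)
```

Entries with magnitude at most `tau` become zero and the rest move toward zero by `tau`, which is what makes the noise estimate sparse.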
The iterative process of each variable of NNDGLLRR is given above. The concrete updating process is shown in Algorithm 1.
3.2. Sparse Representation Classifier (SRC)
Sparse representation has been a hotspot in pattern recognition in recent years. SRC has been successfully applied to image classification and has achieved relatively good experimental results . Like image data, gene expression profiles consist of gene samples with high redundancy and heavy noise. In this paper, the latent features extracted by NNDGLLRR are used as the data dictionary to construct the following SRC model:
According to the result of SRC, we can get the classification result of unknown gene sample :
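The SRC decision rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 2: the sparse code is obtained from the Lagrangian (lasso) form of the problem with a plain ISTA loop, the test sample is assigned to the class with the smallest class-wise reconstruction residual, and the names `sparse_code`, `src_classify`, `lam`, and `n_iter` are hypothetical.

```python
import numpy as np

def sparse_code(D, y, lam=0.01, n_iter=500):
    """ISTA for min_a 0.5 * ||y - D a||_2^2 + lam * ||a||_1 (assumed relaxation)."""
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - step * (D.T @ (D @ a - y))  # gradient step on the quadratic term
        a = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return a

def src_classify(D, labels, y, lam=0.01):
    """Assign y to the class whose dictionary atoms reconstruct it best.

    D holds one training feature vector (atom) per column; labels[j] is the
    class of column j.
    """
    labels = np.asarray(labels)
    a = sparse_code(D, y, lam)
    residuals = {}
    for c in np.unique(labels):
        a_c = np.where(labels == c, a, 0.0)  # keep only the coefficients of class c
        residuals[c] = np.linalg.norm(y - D @ a_c)
    return min(residuals, key=residuals.get), residuals

# Usage with a toy dictionary: two atoms per class, columns normalized.
D = np.array([[1.0, 0.9, 0.0, 0.1],
              [0.0, 0.1, 1.0, 0.9]])
D = D / np.linalg.norm(D, axis=0)
predicted, residuals = src_classify(D, [0, 0, 1, 1], np.array([1.0, 0.05]))
```

ISTA is used here only because it is short and self-contained; any l1 solver could take its place.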
The detailed flow of the SRC is shown in Algorithm 2.
3.3. Algorithm Flow
To sum up, the algorithm has two parts: the first uses NNDGLLRR to extract latent features from the original gene expression profile, and the second uses SRC to classify the latent features. The overall flow is shown in Algorithm 3.
4. Results and Discussion
4.1. Selecting the Test Data
To test the feature extraction performance of the algorithm, we used the diffuse large B-cell lymphoma  (DLBCL), mixed lineage leukemia  (MLL), lung cancer  (LC), and acute lymphoblastic leukemia  (ALL) gene data sets; the sample information of each data set is shown in Table 1.
4.2. Accuracy Test
K-means and the sparse representation classifier (SRC) are simple and common classifiers. To compare their results, both are used to classify the original gene expression profiles; the results are shown in Table 2. The classification accuracy of SRC is significantly higher than that of K-means, which is due to the small number of gene expression profiles. To verify the effectiveness of the algorithm for feature extraction, the features extracted by LRR, Lat-LRR, and NNDGLLRR are classified by SRC. These classification results are also shown in Table 2.
Table 2 shows that LRR, Lat-LRR, and NNDGLLRR can each extract features effectively. However, the feature extraction performance of NNDGLLRR is better than that of Lat-LRR. The categories and number of samples, as well as the dimension of the gene expression profile, affect the final recognition accuracy.
4.3. The Influence of Graph Regularized Coefficients
Generally, we set . To verify the influence of graph regularized coefficients on feature extraction, we have compared the recognition results of LRR, Lat-LRR, and NNDGLLRR under the condition of different values. The results are shown in Figure 1.
Figure 1: (a) Dataset of MLL; (b) Dataset of LC.
The test results on MLL and LC show that the manifold regularized constraint clearly improves gene expression profile feature extraction when the values of and are appropriate, and it can significantly improve the recognition accuracy. However, and should not be too large or too small. The optimal graph regularized coefficients may differ between test data sets.
4.4. The Influence of Sparse Representation Coefficients
Gene expression profiles usually contain heavy noise introduced by the sequencing process. To verify the effect of the sparse constraint on feature extraction, we tested the classification accuracy of LRR, Lat-LRR, and NNDGLLRR for feature extraction under different sparse constraint coefficients . The test results are shown in Figure 2.
Figure 2: (a) Dataset of NHL; (b) Dataset of AL.
Figure 2 shows that the sparse constraint coefficient has a considerable effect on the final feature extraction results. When the value of is appropriate, Lat-LRR and NNDGLLRR perform better at feature extraction than LRR. In general, NNDGLLRR performs better than Lat-LRR, which again supports the validity of the manifold constraint.
4.5. Complexity Analysis
, , and , and we set the lowest ranks of and obtained by the algorithm as and . Then the complexity of SVT operation for and is about and , and the complexity of ST operation for is about . The complexity of construction the Laplacian matrix of and is about and ; and the complexity of one positive operation for and is about and . If the iteration of the algorithm is , then the overall complexity of LRR, Lat-LRR, and NNDGLLRR algorithms is shown in Table 3.
Generally, it is considered that for gene expression profiles. Table 3 shows that LRR has the lowest computational complexity, but it is less effective at feature extraction than Lat-LRR and NNDGLLRR and struggles to meet practical needs. Lat-LRR gives acceptable feature extraction results, but its ability to partition the subspace is limited, and it runs slowly because it introduces too many variables. The variable update algorithm of NNDGLLRR not only reduces the amount of computation but also achieves satisfactory feature extraction results.
To address the high redundancy and heavy noise of gene expression profiles, a feature extraction model, NNDGLLRR, is proposed in this paper. In the experiments, we extracted features from different gene expression profiles with LRR, Lat-LRR, and NNDGLLRR and classified the extracted features with SRC. The experimental results show that NNDGLLRR performs better at feature extraction than LRR and slightly better than Lat-LRR, which verifies the comparative advantage of NNDGLLRR. At the same time, compared with Lat-LRR, the overall complexity of NNDGLLRR is reduced through the improved variable update algorithm. The experiments on different gene expression data sets achieved comparatively good results, which supports the validity of the dual graph regularized constraint. In summary, the proposed nonnegative low-rank sparse constraint and dual graph regularized constraint are reasonable, and NNDGLLRR adapts well to different gene expression profiles with high redundancy and heavy noise.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
This paper is supported by the National Natural Science Foundation of China (nos. 51365017 and 61305019) and the Science and Technology Project of Jiangxi Province Education Department (no. GJJ150680).
D. Lutter, K. Stadlthanner, F. Theis et al., “Analyzing gene expression profiles with ICA,” in Proceedings of the 24th IASTED International Conference on Biomedical Engineering (BioMed '06), pp. 25–30, ACTA Press, Innsbruck, Austria, 2006.
I. Hiroyuki, S. Hiroki, A. Kazuhiko et al., “Classification of gastric cancer subtypes by applying ICA to gene expression data and pathway analysis using Bayesian network,” IPSJ SIG Technical Reports, vol. 2012, no. 12, pp. 1–2, 2012.
G. Ye, M. Tang, J. F. Cai et al., “Correction: low-rank regularization for learning gene expression programs,” PLoS ONE, vol. 9, no. 1, Article ID e82146, 2014.
M. Vimaladevi and B. Kalaavathi, “A microarray gene expression data classification using hybrid back propagation neural network,” Genetika, vol. 46, no. 3, pp. 1013–1026, 2014.
G. Lee, A. Singanamalli, H. Wang et al., “Supervised Multi-view Canonical Correlation Analysis (sMVCCA): integrating histologic and proteomic features for predicting recurrent prostate cancer,” IEEE Transactions on Medical Imaging, vol. 34, no. 1, pp. 284–297, 2015.
Z. Zhang, G. Ely, S. Aeron et al., “Novel methods for multilinear data completion and de-noising based on tensor-SVD,” Computer Science, vol. 44, no. 9, pp. 3842–3849, 2014.
A. Bhattacharjee, W. G. Richards, J. Staunton et al., “Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses,” Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 24, pp. 13790–13795, 2001.