Computational and Mathematical Methods in Medicine

Volume 2018 (2018), Article ID 5490513, 11 pages

https://doi.org/10.1155/2018/5490513

## Feature Genes Selection Using Supervised Locally Linear Embedding and Correlation Coefficient for Microarray Classification

^{1}College of Computer and Information Engineering, Henan Normal University, Xinxiang 453007, China

^{2}Engineering Technology Research Center for Computing Intelligence and Data Mining, Henan Province 453007, China

Correspondence should be addressed to Jiucheng Xu

Received 27 September 2017; Revised 17 December 2017; Accepted 21 December 2017; Published 31 January 2018

Academic Editor: Xiaoqi Zheng

Copyright © 2018 Jiucheng Xu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The selection of feature genes with high recognition ability from gene expression profiles has gained great significance in biology. However, most existing methods suffer from high time complexity and poor classification performance. Motivated by this, an effective feature selection method based on locally linear embedding and correlation coefficient algorithms, called supervised locally linear embedding and Spearman’s rank correlation coefficient (SLLE-SC^{2}), is proposed. Supervised locally linear embedding takes class label information into account and thus improves classification performance. Furthermore, Spearman’s rank correlation coefficient is used to remove coexpressed genes. Experimental results obtained on four public tumor microarray datasets illustrate that our method is valid and feasible.

#### 1. Introduction

Cancer develops through either a series of genetic events or external influential factors that cause differential gene expression profiles in the cancerous cells. DNA microarray technology is pervasively used in genomic research for diagnosing cancers [1]. Since the number of genes is typically much larger than the number of samples, classification of microarray data is subject to “the curse of dimensionality.” However, only a small number of genes are required for cancer diagnosis, whereas the search space can be huge. Feature selection is an important step for reducing both the dimension and the redundancy of gene expression data during the classification process (the experimentally obtained expression values also contain obvious inaccuracies). According to the literature [2], selecting feature genes is usually more important than developing the classifier in genomic data analysis. Therefore, how to effectively choose feature genes from gene expression profiles is currently a key problem in bioinformatics.

When mining high-dimensional data, “the curse of dimensionality” is one of the major difficulties to overcome. The aim of feature selection is to reduce computational complexity while conserving the desired inherent information of the data [3, 4]. Manifold learning is an ideal machine learning tool that discovers the structure of high-dimensional data and gives a better understanding of the data [5]. Representative methods include locally linear embedding (LLE), isometric mapping (Isomap), Laplacian eigenmaps (LE), and local tangent space alignment (LTSA) [6], among others. Among these, LLE is one of the most noted manifold learning methods and is widely used in spectral analysis [7], edit propagation [8], fault detection [9, 10], image recognition [11, 12], and so on.

Subsequently, various improved LLE methods have been designed to enhance performance. Lai et al. [31] proposed a unified sparse learning framework by introducing sparsity or L1-norm learning, which further extended LLE-based methods to sparse cases. Theoretical connections between the orthogonal neighborhood preserving projection and the proposed sparse linear embedding were discovered. The ideal sparse embedding derived from the proposed framework is computed by iterating the modified elastic net and singular value decomposition. Cheng et al. [32] relied on incremental locally linear embedding (ILLE) to improve fault-diagnosis performance for a satellite with high-dimensional telemetry data. Similarly, Liu et al. [33] put forward an incremental supervised LLE (I-SLLE) method for submersible plunger pump fault detection. In the I-SLLE algorithm, a block matrix decomposition strategy is used to deal with out-of-sample data, while part of the original low-dimensional coordinates is also updated; on this basis, an iterative method is proposed to update the whole dataset and improve accuracy.

LLE has the advantage of an analytical, globally optimal solution without iteration: computing the low-dimensional embedding reduces to a sparse matrix eigenvalue problem, so its computational complexity is relatively small. However, LLE has low self-learning ability and ignores discriminant information; it therefore struggles to accurately capture patterns in the data and cannot achieve high effectiveness. Furthermore, the purpose of feature selection is to project the original data into a subspace in which samples within a class are as close as possible and samples from different classes are far away from each other. As mentioned before, feature genes selection distinguishes pathogenic genes from normal genes. To solve this problem, de Ridder et al. extended the concept of LLE to multiple manifolds and proposed a supervised locally linear embedding (SLLE) algorithm, which has been demonstrated to be suitable for feature genes selection [34]. The dissimilarity between samples from different classes can be measured by a metric function. It is commonly believed that the neighborhood of a sample in one class should consist of samples belonging to the same class. In the SLLE method, class label information is taken into account: the interclass distance is made larger than the Euclidean distance by adding a parameter to pairs of points belonging to different classes. Otherwise, the distance remains the Euclidean distance.

Feature selection reduces the feature dimension while preserving the integrity of the original dataset. It can improve the efficiency of data mining and produce results that are essentially identical to those obtained on the original dataset; more broadly, it addresses “the curse of dimensionality.” However, the major consideration of SLLE is the relationship between attributes and categories. Whether an attribute is judged redundant is based on whether the attribute affects the discriminative information of the class label. That is to say, SLLE does not fully consider the relationships among the attributes themselves. In practice, attributes are not independent; there is a certain correlation between them. For instance, the clothing index and temperature are usually related: a high temperature means a low clothing index, and vice versa. Placing a large number of associated attributes in the reduction result inevitably causes data redundancy. The correlation coefficient reflects the coexpression relationship between genes: two genes are considered coexpressed when their correlation coefficient exceeds a certain threshold, and one of them can then be removed [35, 36].
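To make the coexpression-removal idea concrete, the following sketch greedily drops any gene whose absolute Spearman correlation with an already-kept gene exceeds a threshold. This is an illustration, not the paper's exact procedure; the function name `remove_coexpressed` and the threshold 0.9 are assumptions.

```python
import numpy as np
from scipy.stats import rankdata

def remove_coexpressed(X, threshold=0.9):
    """Greedy coexpression filter. X is a samples-by-genes matrix; a gene
    is kept only if its absolute Spearman correlation with every
    already-kept gene is at most `threshold`."""
    ranks = np.apply_along_axis(rankdata, 0, X)   # rank each gene column
    rho = np.corrcoef(ranks, rowvar=False)        # Pearson on ranks = Spearman
    kept = []
    for g in range(X.shape[1]):
        if all(abs(rho[g, k]) <= threshold for k in kept):
            kept.append(g)
    return kept
```

Because the filter is greedy in gene order, the first gene of a coexpressed pair survives and the later one is dropped.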

In order to solve the problem of poor classification performance in tumor classification, a novel feature genes selection method, called supervised locally linear embedding and Spearman’s rank correlation coefficient (SLLE-SC^{2}), is put forward in this paper. The supervised LLE algorithm, which takes class label information into account, is utilized to delete redundant genes. Meanwhile, Spearman’s rank correlation coefficient is used to remove coexpressed genes. We also present a biological investigation of the selected genes. Finally, we compare the performance of various classifiers on the selected feature genes datasets. Results show that the SLLE-SC^{2} method selects a small set of nonredundant disease-related genes with high specificity and achieves better efficiency compared with other related methods.

#### 2. Research Methodology

##### 2.1. Locally Linear Embedding

LLE approximates the input data with a low-dimensional surface and reduces its dimensionality by learning a mapping to the surface [37]. It first finds a group of nearest neighbors of each data point. Then it calculates a set of weights for each data point that best describe the point as a linear combination of its neighbors. Finally, it finds the low-dimensional embedding of the points by using an eigenvector-based optimization technique, so that each point is still described by the same linear combination of its neighbors. LLE is designed to establish a feature mapping in which the low-dimensional embedding maintains the same local neighborhood relationships as the high-dimensional space. Under certain conditions, it obtains the corresponding low-dimensional embedding from the geometric properties of the nearest neighbor graph in the high-dimensional space. In effect, LLE considers only the nearest neighbors of each point, rather than distant points.

*(a) Assigning Neighbors to Each Data Point*. To find a group of nearest neighbors, LLE adopts the $k$-nearest-neighbor criterion under the Euclidean distance. Let $X = \{x_1, x_2, \ldots, x_N\}$, $x_i \in \mathbb{R}^D$, be a given dataset of $N$ points; the Euclidean distance is used to calculate the distance between samples and to find the reconstruction neighborhood $N(x_i)$ of the $k$ nearest neighbors of each data point.

*(b) Computing the Weights That Best Linearly Reconstruct Each Point from Its Neighbors*. LLE computes the barycentric coordinates of a point $x_i$ based on its neighbors $x_{ij}$. The original point is reconstructed by a linear combination of its neighbors, given by the weight matrix $W$. Reconstruction errors are measured by the cost function

$$\varepsilon(W) = \sum_{i=1}^{N} \Big\| x_i - \sum_{j=1}^{k} w_{ij} x_{ij} \Big\|^2, \quad (1)$$

where $\varepsilon(W)$ is the reconstruction error. The local Gram matrix $G_i$ has entries

$$G^{i}_{jk} = (x_i - x_{ij})^{\mathrm T} (x_i - x_{ik}), \quad (2)$$

where $G_i$ is a positive definite symmetric matrix. Equation (1) is a constrained least squares problem, and it is minimized under two constraints:

$$w_{ij} = 0 \quad \text{if } x_j \notin N(x_i), \quad (3)$$

$$\sum_{j=1}^{k} w_{ij} = 1, \quad (4)$$

in which (3) is a constraint on the coefficients: each data point is reconstructed only from its neighbors. Equation (4) means the sum of every row of the weight matrix equals 1. Thus (1) is rewritten in constrained optimization form:

$$\min_{w_i} \; w_i^{\mathrm T} G_i w_i \quad \text{s.t. } \mathbf{1}^{\mathrm T} w_i = 1. \quad (5)$$

Equation (5) is solved by the Lagrange multiplier approach. As $G_i$ is a positive definite symmetric matrix, its inverse exists. The optimal weights are calculated by

$$w_i = \frac{G_i^{-1} \mathbf{1}}{\mathbf{1}^{\mathrm T} G_i^{-1} \mathbf{1}}. \quad (6)$$
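Steps (a) and (b) above can be sketched as follows. This is a minimal illustration; the function name `lle_weights`, the regularization constant `reg`, and the default `k` are assumptions rather than part of the original formulation.

```python
import numpy as np

def lle_weights(X, k=5, reg=1e-3):
    """Steps (a)-(b): for each point, find its k nearest neighbours by
    Euclidean distance, solve G w = 1 (the Lagrange solution of the
    constrained least squares problem), and normalise so each row of W
    sums to 1."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]        # skip the point itself
        Z = X[nbrs] - X[i]                   # neighbours centred on x_i
        G = Z @ Z.T                          # local Gram matrix
        G += reg * np.trace(G) * np.eye(k)   # regularise (singular when k > D)
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()             # enforce the sum-to-one constraint
    return W
```

The regularization term keeps the Gram matrix invertible when the number of neighbors exceeds the input dimension, a common situation for microarray-sized data.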

*(c) Computing the Low-Dimensional Embedding Best Reconstructed by the Weights and Finding the Smallest Eigenmodes of a Sparse Symmetric Matrix*. Each point $x_i$ in the high-dimensional space is mapped onto a point $y_i$ in the low-dimensional space. The low-dimensional embedding is calculated by minimizing the following function:

$$\Phi(Y) = \sum_{i=1}^{N} \Big\| y_i - \sum_{j=1}^{k} w_{ij} y_{ij} \Big\|^2. \quad (7)$$

Cost function (7) is based on the locally linear reconstruction errors and can be rewritten as

$$\Phi(Y) = \operatorname{tr}(Y^{\mathrm T} M Y), \qquad M = (I - W)^{\mathrm T} (I - W), \quad (8)$$

where $M$ is a sparse, symmetric, positive semidefinite $N \times N$ matrix ($N$ being the number of data points). Equation (7) is a minimization problem. Significantly, $Y$ can be translated to any position without affecting the reconstruction error. Thus a constraint is added to eliminate this translational degree of freedom in (7): it requires the low-dimensional embedding to be centered at the origin, namely,

$$\sum_{i=1}^{N} y_i = 0. \quad (9)$$

In order to eliminate the rotational and scaling degrees of freedom, we add a unit-covariance constraint:

$$\frac{1}{N} \sum_{i=1}^{N} y_i y_i^{\mathrm T} = I; \quad (10)$$

then (7) is regarded as the constrained optimization problem

$$\min_{Y} \operatorname{tr}(Y^{\mathrm T} M Y) \quad \text{subject to (9) and (10)}. \quad (11)$$

Equation (11) can be solved in multiple ways. One of the most effective is to compute the eigenvectors of the cost matrix $M$ associated with its smallest eigenvalues, which can be obtained using Lagrange multipliers. Notice that the eigenvector associated with the eigenvalue 0 is the all-ones vector; it represents the translational degree of freedom and must be removed. The $d$ retained eigenvectors form the output of LLE.
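Step (c) can be sketched as an eigenproblem on the cost matrix. This is a minimal illustration; `lle_embedding` is an assumed name, and the input `W` is the weight matrix from step (b).

```python
import numpy as np

def lle_embedding(W, d=2):
    """Step (c): build M = (I - W)^T (I - W) and keep the eigenvectors of
    its d smallest nonzero eigenvalues; the bottom eigenvector
    (eigenvalue 0, the all-ones direction) encodes translation and is
    discarded."""
    n = W.shape[0]
    I = np.eye(n)
    M = (I - W).T @ (I - W)
    vals, vecs = np.linalg.eigh(M)      # eigenvalues in ascending order
    return vecs[:, 1:d + 1]             # skip the translation mode
```

Because the retained eigenvectors are orthogonal to the all-ones vector, each embedding coordinate is automatically centered at the origin.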

##### 2.2. Supervised Locally Linear Embedding

LLE is an unsupervised manifold feature selection algorithm that ignores the discriminant information of the data. To improve the classification capability of LLE, discriminant information is incorporated into the cost function of LLE (i.e., SLLE). SLLE is based on the assumption that the distance between data points of the same class is smaller than the distance between data points of different classes, and it adds the discriminant information to the interclass distance. One solution is to increase the Euclidean distance by adding a constant to pairs of points from different classes, while the distance between data points of the same class is kept unchanged.

For a given set $X$, the distance metric is defined as

$$\Delta'(x_i, x_j) = \Delta(x_i, x_j) + \alpha \max(\Delta) (1 - \delta_{ij}), \qquad 0 \le \alpha \le 1, \quad (12)$$

where $\Delta(x_i, x_j)$ is the Euclidean distance between $x_i$ and $x_j$, $\alpha$ is a tunable parameter, $\max(\Delta)$ is the maximum of the set of Euclidean distances, and $\delta_{ij} \in \{0, 1\}$ indicates whether the points belong to the same class: if $x_i$ and $x_j$ belong to the same class, $\delta_{ij} = 1$; otherwise, $\delta_{ij} = 0$.

It is worth noting that when $\alpha = 0$, SLLE reduces to the original unsupervised LLE; when $\alpha = 1$, it is the fully supervised LLE; otherwise ($0 < \alpha < 1$), it is a semisupervised LLE.
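The supervised distance above can be sketched as follows; the function name `slle_distances` is an assumption.

```python
import numpy as np

def slle_distances(X, y, alpha=1.0):
    """Supervised distance matrix: the Euclidean distance is inflated by
    alpha * max(distance) for pairs with different labels and left
    unchanged for same-class pairs. alpha = 0 recovers plain LLE."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    different = (y[:, None] != y[None, :]).astype(float)   # 1 - delta_ij
    return D + alpha * D.max() * different
```

Neighbor search on this inflated matrix then tends to pick same-class neighbors, which is the source of SLLE's discriminative power.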

##### 2.3. Spearman’s Rank Correlation Coefficient

The relationship between attributes and categories affects the effectiveness of feature reduction and the classification accuracy. Similarly, an analogous connection exists between attributes themselves. In general, the connection between attributes is measured by a correlation coefficient. Conventional correlation measures include the bivariate normal correlation, the chi-square test for independence, and rank correlation coefficients, among others. Among them, Spearman’s rank correlation coefficient is a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables). It assesses how well the relationship between two variables can be described by a monotonic function.

Given a dataset of samples $X = \{x_1, x_2, \ldots, x_n\}$ and an attribute $a$, the value sequence of attribute $a$ over the samples is $(a(x_1), a(x_2), \ldots, a(x_n))$. The sequence is sorted in ascending order and each sample is assigned a rank (i.e., the sample with the smallest attribute value receives rank 1, and the sample with the largest attribute value receives rank $n$; samples sharing the same attribute value receive the average of their ranks). Next, restoring the original sample order, we obtain the rank sequence $(R_1, R_2, \ldots, R_n)$.
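The tie-averaged ranking described above matches SciPy's default ranking method, as a small check illustrates (the values are chosen arbitrarily):

```python
from scipy.stats import rankdata

# Smallest value gets rank 1; the two tied values 3.2 would occupy
# ranks 4 and 5, so each receives the average rank 4.5.
values = [3.2, 1.5, 3.2, 0.7, 2.4]
print(rankdata(values))   # [4.5 2.  4.5 1.  3. ]
```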

For attributes $a$ and $b$ of the samples, the rank sequences are $(R_1, \ldots, R_n)$ and $(S_1, \ldots, S_n)$, respectively, yielding $n$ rank pairs $(R_i, S_i)$. Spearman’s rank correlation coefficient of attributes $a$ and $b$ is defined as

$$\rho_{ab} = \frac{\sum_{i=1}^{n} (R_i - \bar{R})(S_i - \bar{S})}{\sqrt{\sum_{i=1}^{n} (R_i - \bar{R})^2 \sum_{i=1}^{n} (S_i - \bar{S})^2}}, \quad (13)$$

where $\bar{R} = \frac{1}{n} \sum_{i=1}^{n} R_i = \frac{n+1}{2}$ and $\bar{S} = \frac{n+1}{2}$. The correlation coefficient $\rho_{ab}$ satisfies the following properties:

$\rho_{ab} = \rho_{ba}$, and $-1 \le \rho_{ab} \le 1$.

The absolute value $|\rho_{ab}|$ always lies between 0 and 1. The values in between act as a scale, where 1 indicates a very strong monotonic link and 0 indicates no link.
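The definition above, Pearson correlation applied to the rank sequences, can be sketched and checked against SciPy; `spearman_rho` is an assumed name.

```python
import numpy as np
from scipy.stats import rankdata

def spearman_rho(x, y):
    """Spearman's coefficient: centre the tie-averaged ranks around the
    mean rank (n + 1) / 2, then take their normalised inner product
    (i.e., the Pearson correlation of the ranks)."""
    rx = rankdata(x) - (len(x) + 1) / 2.0
    ry = rankdata(y) - (len(y) + 1) / 2.0
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))
```

Any strictly increasing transform of an attribute leaves its ranks, and hence the coefficient, unchanged, which is why the measure captures monotonic rather than linear dependence.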

For a more detailed illustration, we work through an example in Table 1 for a small set of samples and attributes.