Abstract

The selection of feature genes with high discriminative ability from gene expression profiles has gained great significance in biology. However, most existing methods have high time complexity and poor classification performance. Motivated by this, an effective feature selection method, called supervised locally linear embedding and Spearman’s rank correlation coefficient (SLLE-SC2), is proposed based on the concepts of locally linear embedding and correlation coefficient algorithms. Supervised locally linear embedding takes class label information into account and improves classification performance. Furthermore, Spearman’s rank correlation coefficient is used to remove coexpressed genes. Experimental results obtained on four public tumor microarray datasets illustrate that our method is valid and feasible.

1. Introduction

Cancer develops through either a series of genetic events or external influential factors that cause differential gene expression profiles in cancerous cells. DNA microarray technology is pervasively used in genomic research for diagnosing cancers [1]. Since the number of genes is typically much larger than the number of samples, classification of microarray data is subject to “the curse of dimensionality.” Moreover, only a small number of genes are required for cancer diagnosis, whereas the search space can be huge. Feature selection is therefore an important step to reduce both the dimension and the redundancy of gene expression data during the classification process (there is some unavoidable inaccuracy in the experiments that produce gene expression data). According to the literature [2], selecting feature genes is usually more important than developing classifiers in genomic data analysis. Therefore, how to choose feature genes from gene expression profiles effectively is a key problem in bioinformatics at present.

When mining high-dimensional data, “the curse of dimensionality” is one of the major difficulties to overcome. The aim of feature selection is to reduce computational complexity while conserving the desired inherent information of the data [3, 4]. Manifold learning is an ideal tool for machine learning that discovers the structure of high-dimensional data and gives a better understanding of the data [5]. Representative methods include locally linear embedding (LLE), isometric mapping (Isomap), Laplacian eigenmaps (LE), and local tangent space alignment (LTSA) [6], among others. Among them, LLE is one of the most noted manifold learning methods and is widely used in spectral analysis [7], edit propagation [8], fault detection [9, 10], image recognition [11, 12], and so on.

Subsequently, various improved LLE methods have been designed to enhance performance. Lai et al. [31] proposed a unified sparse learning framework by introducing sparsity or L1-norm learning, which further extended LLE-based methods to sparse cases. Theoretical connections between the orthogonal neighborhood preserving projection and the proposed sparse linear embedding were discovered, and the ideal sparse embedding derived from the framework is computed by iterating a modified elastic net and singular value decomposition. Cheng et al. [32] relied on incremental locally linear embedding (ILLE) to improve fault-diagnosis performance for a satellite with high-dimensional telemetry data. Similarly, Liu et al. [33] put forward an incremental supervised LLE (I-SLLE) method for submersible plunger pump fault detection. In the I-SLLE algorithm, a block matrix decomposition strategy is used to deal with out-of-sample data while part of the original low-dimensional coordinates is also updated, on top of which an iterative method is proposed to update the whole dataset for improved accuracy.

LLE has the advantage of an analytical, globally optimal solution without iteration: computing the low-dimensional embedding reduces to a sparse matrix eigenvalue problem, so its computational complexity is relatively small. However, LLE has low self-learning ability and ignores discriminant information, so it is difficult for it to accurately capture patterns in the data and achieve high effectiveness. Furthermore, the purpose of feature selection is to project the original data into a subspace with the following characteristic: intraclass samples are as close as possible while interclass samples are far away from each other. As mentioned before, feature gene selection distinguishes pathogenic genes from normal genes. To address this problem, de Ridder et al. extended the concept of LLE to multiple manifolds and proposed a supervised locally linear embedding (SLLE) algorithm, which has been demonstrated to be suitable for feature gene selection [34]. The dissimilarity between samples from different classes can be measured by a metric function, and it is commonly believed that the neighborhood of a sample should consist of samples belonging to the same class. In the SLLE method, by taking class label information into account, the interclass distance is made larger than the Euclidean distance by adding a parameter to pairs of points belonging to different classes; otherwise, the distance remains the Euclidean distance.

Feature selection reduces the feature dimension while preserving the integrity of the original dataset. It can improve the efficiency of data mining and produce results that are essentially identical to those on the original dataset; more broadly, it addresses “the curse of dimensionality.” However, the major consideration of SLLE is the relationship between the attributes and the categories: whether an attribute is judged redundant is based on whether it affects the discriminative information of the class label. That is to say, SLLE does not fully consider the relationships among the attributes. In practice, attributes are not independent of one another, and a certain correlation exists between them. For instance, the clothing index and the temperature are usually related: a high temperature means a low clothing index, and vice versa. It is inevitable that data redundancy will be caused by placing a large number of associated attributes in the reduction result. The correlation coefficient reflects the coexpression relationship between genes: two genes are considered coexpressed when their correlation coefficient is greater than a certain threshold, and thus one of them can be removed [35, 36].

In order to solve the problem of poor classification performance in tumor classification, a novel feature gene selection method, called supervised locally linear embedding and Spearman’s rank correlation coefficient (SLLE-SC2), is put forward in this paper. The supervised LLE algorithm, by taking class label information into account, is utilized to delete redundant genes. Meanwhile, Spearman’s rank correlation coefficient is used to remove coexpressed genes. We also present a biological investigation of the selected genes. Finally, we compare the performance of various classifiers on the selected feature gene datasets. Results show that the SLLE-SC2 method selects a small set of nonredundant disease-related genes with high specificity and achieves better efficiency compared with other related methods.

2. Research Methodology

2.1. Locally Linear Embedding

LLE approximates the input data with a low-dimensional surface and reduces dimensionality by learning a mapping to that surface [37]. It first finds a group of nearest neighbors of each data point. Then it calculates a set of weights for each data point that best describe the point as a linear combination of its neighbors. Finally, it finds the low-dimensional embedding of the points using an eigenvector-based optimization technique, so that each point is still described by the same linear combination of its neighbors. LLE is designed to establish a feature mapping in which the low-dimensional embedding maintains the same local neighborhood relationships as in the high-dimensional space; under certain conditions, the corresponding low-dimensional embedding is obtained from the geometric properties of the nearest neighbor graph in the high-dimensional space. In fact, LLE considers only the nearest neighbors of each point, rather than distant points.

(a) Assigning Neighbors to Each Data Point. To find a group of nearest neighbors, LLE adopts the $k$-nearest-neighbor criterion under the Euclidean distance. Let $X = \{x_1, x_2, \ldots, x_N\}$, $x_i \in \mathbb{R}^D$, be a given dataset of $N$ points; the Euclidean distance is adopted to calculate the distance between samples and to find the reconstruction neighborhood of the $k$ nearest neighbors of each data point.

(b) Computing the Weights That Best Linearly Reconstruct Each Point from Its Neighbors. LLE computes the barycentric coordinates of a point $x_i$ based on its neighbors $x_{ij}$ ($j = 1, \ldots, k$). The original point is reconstructed by a linear combination, given by the weight matrix $W$, of its neighbors. Reconstruction errors are measured by the cost function
$$\varepsilon(W) = \sum_{i=1}^{N} \Bigl\| x_i - \sum_{j=1}^{k} w_{ij} x_{ij} \Bigr\|^2, \tag{1}$$
where $\varepsilon(W)$ is the reconstruction error. Expanding the error point by point introduces the local Gram matrix
$$G^{i}_{jm} = (x_i - x_{ij})^{\mathrm{T}} (x_i - x_{im}), \tag{2}$$
where $G^{i}$ is a positive definite symmetric matrix. Equation (1) is a constrained least squares problem, and it is minimized under two constraints:
$$w_{ij} = 0 \quad \text{if } x_j \text{ is not a neighbor of } x_i, \tag{3}$$
$$\sum_{j=1}^{k} w_{ij} = 1, \tag{4}$$
in which (3) is a sparseness constraint on the coefficients; that is to say, each data point is reconstructed only from its neighbors. Equation (4) means that the sum of every row of the weight matrix equals 1. Thus (1) is rewritten in constrained optimization form:
$$\min_{w_i} \; w_i^{\mathrm{T}} G^{i} w_i \quad \text{s.t. } \mathbf{1}^{\mathrm{T}} w_i = 1. \tag{5}$$
Equation (5) is solved by the Lagrange multiplier approach. As $G^{i}$ is a positive definite symmetric matrix, its inverse exists. The optimal weight vector is calculated by
$$w_i = \frac{(G^{i})^{-1} \mathbf{1}}{\mathbf{1}^{\mathrm{T}} (G^{i})^{-1} \mathbf{1}}. \tag{6}$$
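As an illustration, the following minimal Python sketch implements steps (a) and (b); the function name, the regularization term, and the parameter defaults are our own choices for exposition, not from the paper.

```python
import numpy as np

def lle_weights(X, k=5, reg=1e-3):
    """Sketch of LLE steps (a)-(b): k nearest neighbors and the
    closed-form reconstruction weights of Eqs. (2)-(6)."""
    N = X.shape[0]
    W = np.zeros((N, N))
    for i in range(N):
        # Step (a): k nearest neighbors under the Euclidean distance
        dist = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(dist)[1:k + 1]          # index 0 is the point itself
        # Eq. (2): local Gram matrix of the centered neighbors
        Z = X[nbrs] - X[i]
        G = Z @ Z.T
        G += reg * np.trace(G) * np.eye(k)        # regularize when G is singular
        # Eqs. (5)-(6): Lagrange-multiplier solution, then enforce Eq. (4)
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()
    return W
```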

(c) Computing the Low-Dimensional Embedding Vectors by Finding the Smallest Eigenmodes of a Sparse Symmetric Matrix. Each point $x_i$ in the high-dimensional space is mapped onto a point $y_i$ in the low-dimensional space. The low-dimensional embedding is calculated from the following cost function:
$$\Phi(Y) = \sum_{i=1}^{N} \Bigl\| y_i - \sum_{j=1}^{k} w_{ij} y_{ij} \Bigr\|^2 = \operatorname{tr}\bigl(Y M Y^{\mathrm{T}}\bigr). \tag{7}$$

Cost function (7) is based on the locally linear reconstruction errors, in which $\operatorname{tr}(\cdot)$ denotes the matrix trace and
$$M = (I - W)^{\mathrm{T}} (I - W) \tag{8}$$
is a sparse, symmetric, and positive semidefinite $N \times N$ matrix ($N$ being the number of data points). Equation (7) is a minimization problem. Significantly, we can translate $Y$ to any position without affecting the reconstruction error. Thus a constraint is added to eliminate this translational degree of freedom in (7): it requires the center of the low-dimensional embedding to be at the origin, namely,
$$\sum_{i=1}^{N} y_i = 0. \tag{9}$$

In order to eliminate the rotational and scaling degrees of freedom, we add a constraint of unit covariance:
$$\frac{1}{N} \sum_{i=1}^{N} y_i y_i^{\mathrm{T}} = I; \tag{10}$$
then (7) is regarded as a constrained optimization problem:
$$\min_{Y} \; \operatorname{tr}\bigl(Y M Y^{\mathrm{T}}\bigr) \quad \text{s.t. (9) and (10)}. \tag{11}$$

Equation (11) can be solved in multiple ways. One of the most effective is to compute the eigenvectors of the cost matrix $M$ associated with its smallest eigenvalues, which can be derived using Lagrange multipliers. Notice that the eigenvector associated with the eigenvalue 0 is the all-ones vector; it represents the translational degree of freedom and must be removed. The remaining $d$ bottom eigenvectors form the output of LLE.
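Continuing the sketch above, step (c) can be written as follows. It consumes the `W` returned by `lle_weights` and uses a dense eigendecomposition for simplicity, whereas practical implementations exploit the sparsity of $M$.

```python
import numpy as np

def lle_embedding(W, d=2):
    """Sketch of step (c): the embedding consists of the eigenvectors of
    M = (I - W)^T (I - W) with the smallest nonzero eigenvalues, Eq. (11)."""
    N = W.shape[0]
    I = np.eye(N)
    M = (I - W).T @ (I - W)
    vals, vecs = np.linalg.eigh(M)        # eigenvalues in ascending order
    # Discard the first eigenvector (the all-ones vector, eigenvalue ~0)
    return vecs[:, 1:d + 1]
```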

2.2. Supervised Locally Linear Embedding

LLE is an unsupervised manifold feature selection algorithm, which ignores the discriminant information of the data. In order to improve the classification capability of LLE, discriminant information is embedded in the neighbor-selection step of LLE (i.e., SLLE). SLLE is based on the assumption that the distance between data points from the same class should be less than the distance between data points from different classes, and it adds the discriminant information to the interclass distance. One solution is to increase the Euclidean distance by adding a constant for pairs of points from different classes, while the distance between data points from the same class is kept unchanged.

In a given set $X$, the distance metric is defined as
$$D(x_i, x_j) = d(x_i, x_j) + \alpha \, \max(d) \, \Lambda_{ij}, \quad \alpha \in [0, 1], \tag{12}$$
where $d(x_i, x_j)$ is the Euclidean distance between $x_i$ and $x_j$; $\alpha$ is a tunable parameter; $\max(d)$ is the maximum of the Euclidean distance set; and $\Lambda_{ij}$ is equal to 0 or 1 and is used to indicate whether the points belong to the same class: if $x_i$ and $x_j$ belong to the same class, $\Lambda_{ij} = 0$; otherwise, $\Lambda_{ij} = 1$.

It is worth noting that when $\alpha = 0$, SLLE reduces to the original unsupervised LLE; when $\alpha = 1$, it is the fully supervised LLE; otherwise ($0 < \alpha < 1$), it is a semisupervised LLE.
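A sketch of the supervised distance matrix of Eq. (12) follows; `slle_distances` is a hypothetical helper name, and the class labels `y` are assumed to be a NumPy array.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def slle_distances(X, y, alpha=1.0):
    """Sketch of Eq. (12): Euclidean distances, inflated by alpha * max(d)
    for pairs of points with different class labels."""
    D = squareform(pdist(X))                            # pairwise d(x_i, x_j)
    Lambda = (y[:, None] != y[None, :]).astype(float)   # 1 iff different classes
    return D + alpha * D.max() * Lambda
```

Setting `alpha=0.0` recovers plain LLE neighborhoods, while `alpha=1.0` gives the fully supervised variant used in this paper.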

2.3. Spearman’s Rank Correlation Coefficient

The relationship between attributes and categories affects the effectiveness of feature reduction and the classification accuracy. Similarly, such relationships exist among the attributes themselves. In general, the connection between attributes is measured by a correlation coefficient. Conventional correlation measures include the correlation coefficient for bivariate normal distributions, the chi-square test for independence, and rank correlation coefficients. Among them, Spearman’s rank correlation coefficient is a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables). It assesses how well the relationship between two variables can be described by a monotonic function.

Given a dataset with samples $u_1, u_2, \ldots, u_n$ and an attribute $f$, the sequence of attribute values of $f$ over the samples is $f(u_1), f(u_2), \ldots, f(u_n)$. This sequence is sorted and a rank is assigned to each sample (i.e., the sample with the smallest attribute value receives rank 1 and the sample with the largest attribute value receives rank $n$; samples with the same attribute value receive the average of their ranks). Next, restoring the original sample order, we obtain the rank sequence $R_f = (r_1, r_2, \ldots, r_n)$.

For two attributes $f$ and $g$ of the samples, the rank sequences are $R_f = (r_1^{(f)}, \ldots, r_n^{(f)})$ and $R_g = (r_1^{(g)}, \ldots, r_n^{(g)})$, respectively, so we obtain $n$ pairs of rank combinations $(r_i^{(f)}, r_i^{(g)})$. Spearman’s rank correlation coefficient of attributes $f$ and $g$ is defined as
$$\rho_{fg} = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}, \tag{13}$$
where $d_i = r_i^{(f)} - r_i^{(g)}$ and $n$ is the number of samples. The correlation coefficient satisfies the following property:

$|\rho_{fg}| \le 1$.

$|\rho_{fg}|$ always gives a value between 0 and 1. The numbers in between act as a scale, where 1 indicates a very strong monotonic link and 0 indicates no link.
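Equation (13) can be sketched in a few lines of Python. Note that (13) is exact when there are no ties; SciPy’s `spearmanr` handles ties by computing the Pearson correlation of the average ranks, so the two agree on tie-free data.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

def spearman_rho(f, g):
    """Sketch of Eq. (13); rankdata averages tied ranks, as in the text."""
    rf, rg = rankdata(f), rankdata(g)
    d = rf - rg                                   # rank differences d_i
    n = len(f)
    return 1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1))

# Cross-check against SciPy (identical when there are no ties):
# rho, _ = spearmanr(f, g)
```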

For a more detailed illustration, we work through the example in Table 1, which lists the values of two attributes $f$ and $g$ over a set of samples.

First, obtain the sequence of attribute values of $f$ over the samples.

The sequence is sorted and a rank is assigned to each sample, giving an ordered sequence of attribute $f$ and its rank sequence.

Restoring the original sample order yields the rank sequence $R_f$.

In the same way, the rank sequence $R_g$ of attribute $g$ over the samples is obtained.

The rank sequences $R_f$ and $R_g$ of the two attributes are shown in Table 2.

Finally, according to (13), Spearman’s rank correlation coefficient is 0.9 for this set of data.
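Table 2 is not reproduced here, but as a plausibility check, any pair of rank sequences over five samples differing by a single adjacent swap gives $\sum d_i^2 = 2$ and hence $\rho = 1 - 12/120 = 0.9$. The sequences below are hypothetical, not the paper’s data.

```python
import numpy as np

# Hypothetical rank sequences consistent with rho = 0.9 (n = 5)
Rf = np.array([1, 2, 3, 4, 5])
Rg = np.array([1, 2, 3, 5, 4])
print(spearman_rho(Rf, Rg))   # -> 0.9
```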

2.4. Feature Genes Selection Using Supervised Locally Linear Embedding and Correlation Coefficient

Microarray data often contain redundant and noisy features. These features can lead to poor classification performance and overfitting. Meanwhile, gene expression data are high-dimensional while the number of samples is very small, which makes the calculation prone to falling into local optima and computationally expensive. The key is to find a new feature gene selection method that can provide understanding of and insight into tumor-related cellular processes.

SLLE (by taking class label information into account) finds an ideal low-dimensional manifold mapping that keeps intraclass samples close and interclass samples separated. However, a supervised algorithm mainly considers the relationship between the attributes and the categories; that is to say, supervised learning does not fully consider the relationships among the attributes. In practice, the relationships among attributes affect the reduction results and the classification accuracy, and it is inevitable that data redundancy will be caused by placing a large number of associated attributes in the reduction result. In general, the connection between attributes can be measured by a correlation coefficient, which reflects the coexpression relationship between genes: two genes are considered coexpressed when their correlation coefficient is greater than a certain threshold, and thus one of them can be removed during feature gene selection. Spearman’s rank correlation coefficient, a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables), is adopted for this purpose.

Therefore we propose an effective SLLE-SC2 method for the selection of feature genes. Firstly, SLLE is used for reduction, mapping the original data into a new feature space. Then, considering the relationships among the attributes in the new feature space, Spearman’s rank correlation coefficient is used for feature selection. Specifically, PCA is used to compute the contribution of each attribute in the new feature space, and Spearman’s rank correlation coefficient is computed between the attribute with the maximum contribution and each of the other attributes. If the correlation coefficient between two attributes is greater than or equal to a preset threshold, the redundant attribute is removed; the loop then continues over the remaining attributes. The SLLE method is described in Algorithm 1, Spearman’s rank correlation coefficient in Algorithm 2, and feature gene selection using SLLE-SC2 in Algorithm 3; a compact code sketch follows the three listings.

Input: Dataset $X = \{x_1, \ldots, x_N\}$, number of nearest neighbors $k$, parameter $\alpha$
Output: Reduction set $Y$
Step  1. For each data point $x_i$ in the high-dimensional space, find the $k$ nearest points in terms of the distance metric (12);
Step  2. Calculate the local reconstruction weight matrix $W$ for each sample point. The current sample point is expressed by its $k$ nearest
neighboring points, with the weights obtained by minimizing the error function $\varepsilon(W) = \sum_{i} \| x_i - \sum_{j} w_{ij} x_{ij} \|^2$;
Step  3. Keep the weights computed for each sample point and its neighboring points in the high-dimensional space fixed. Then the
embedding space in low dimension is calculated by solving the constrained optimization problem (11) with the weights fixed;
Step  4. Minimize the loss function to obtain the corresponding weight matrix and reconstructed coordinates $Y$. The retained
eigenvectors of $M$ form the output of the LLE algorithm;
Step  5. Return the reduction set $Y$.
Input: Attributes $f$ and $g$ with their values over samples $u_1, \ldots, u_n$
Output: Correlation coefficient $\rho_{fg}$
Step  1. Obtain the sequences of attribute values of $f$ and $g$ over the samples;
Step  2. Sort each sequence and assign a rank to each sample; the rank takes an average when attributes have
the same value;
Step  3. Obtain the new rank sequences $R_f$ and $R_g$ by restoring the original sample order;
Step  4. for $i = 1$ to $n$ do
   replace tied ranks in $R_f$ and $R_g$ with their average rank;
Step  5. for $i = 1$ to $n$ do
   calculate $d_i = r_i^{(f)} - r_i^{(g)}$;
Step  6. Calculate $\rho_{fg} = 1 - 6 \sum_{i=1}^{n} d_i^2 / \bigl(n(n^2 - 1)\bigr)$;
Step  7. Return correlation coefficient $\rho_{fg}$.
Input: Data set $X$, correlation threshold $\theta$
Output: Feature genes set red
Step  1. red = null; flag set flag = null; // the initial state is empty;
Step  2. $Y$ = SLLE($X$) // using Algorithm 1 for feature genes selection;
Step  3. for each attribute $a_i \in Y$ with $a_i \notin$ flag do
   calculate the contribution of the attributes, respectively, by PCA;
    if $a_i$ has the maximum contribution, output attribute $a_i$ and add it to red;
   end for
Step  4. for each attribute $a_j \in Y$ with $a_j \neq a_i$ and $a_j \notin$ flag do
calculate the correlation coefficient $\rho_{ij}$ for attributes $a_i$ and $a_j$ by Algorithm 2;
      if $\rho_{ij} \ge \theta$ then
     flag = flag $\cup \{a_j\}$; // mark $a_j$ as redundant
     go to Step  3;
   end if
      if red is unchanged then
      go to Step  5;
   end for
Step  5. Return red.
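For concreteness, the listing below sketches the whole pipeline end to end in Python, reusing `slle_distances`, `lle_embedding`, and `spearman_rho` from the earlier sketches. The paper does not fully specify the PCA contribution score or how embedded attributes map back to genes, so this sketch scores each original gene by its absolute Spearman correlation with the leading embedding coordinate; all names and defaults are our assumptions, not the paper’s exact implementation.

```python
import numpy as np

def slle_sc2(X, y, k=5, d=20, alpha=1.0, theta=0.3, reg=1e-3, n_genes=10):
    """Hedged sketch of SLLE-SC2: SLLE reduction followed by greedy gene
    selection with Spearman-based removal of coexpressed genes."""
    N = X.shape[0]
    # Algorithm 1: neighbors chosen under the supervised metric of Eq. (12)
    D = slle_distances(X, y, alpha)
    W = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(D[i])[1:k + 1]
        Z = X[nbrs] - X[i]
        G = Z @ Z.T
        G += reg * np.trace(G) * np.eye(k)
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()
    Y = lle_embedding(W, d)                      # reduction set (Step 2)
    # Step 3 (proxy): rank genes by association with the embedding
    contrib = np.array([abs(spearman_rho(X[:, j], Y[:, 0]))
                        for j in range(X.shape[1])])
    order = list(np.argsort(-contrib))
    # Step 4: keep top genes, dropping coexpressed ones (rho >= theta)
    selected = []
    while order and len(selected) < n_genes:
        g = order.pop(0)
        selected.append(g)
        order = [j for j in order
                 if abs(spearman_rho(X[:, g], X[:, j])) < theta]
    return selected
```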

3. Experiments and Results

3.1. Data Preparation

In order to verify the effectiveness of the proposed algorithm, four public tumor microarray datasets are used for simulation experiments. Notably, all of them represent binary classification tasks. Detailed information on the datasets is shown in Table 3.

All numerical experiments are performed on a personal computer with a 3.1 GHz AMD Athlon(tm) II processor and 4 GB of memory, running Windows 7 with Matlab R2010 and Weka 3.9.0.

3.2. Results and Analysis

In order to ensure the reliability and comparability of results on the tumor microarray datasets, we repeat each experiment many times and report the average. Experiments use 10-fold cross-validation. Specifically, based on a preliminary tuning experiment, we set the number of nearest neighbors for each data point to 5 for the SLLE-SC2 method.

Before testing the SLLE-SC2 method, PCA is used to analyze the four tumor microarray datasets, and a Pareto diagram of the explained variance of the principal components (i.e., the information content of the genomic datasets) is drawn for each dataset; the blue curve in Figure 1 shows the cumulative proportion of the total gene information. The results are shown in Figures 1(a), 1(b), 1(c), and 1(d).

The cumulative contribution rate of most datasets (except the lung dataset) exceeds 90 percent with the first 50 principal components (see Figure 1). This illustrates that gene expression profile datasets contain a large amount of redundancy (i.e., irrelevant and confounding factors) and that feature genes constitute only a small part, so it is necessary to remove the redundant genes.
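The Figure 1 analysis can be reproduced along the following lines; this is a sketch in which `X` is assumed to be a samples-by-genes expression matrix loaded beforehand.

```python
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=50).fit(X)                 # first 50 principal components
cum = np.cumsum(pca.explained_variance_ratio_)    # cumulative contribution rate
print(f"variance explained by 50 PCs: {cum[-1]:.1%}")   # > 90% for most datasets
```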

The classification accuracy varies with the correlation coefficient threshold $\theta$, which takes values from 0 to 1 in steps of 0.1. For each threshold value, SLLE-SC2 obtains a subset of genes that is evaluated by the average classification accuracy of an SVM classifier. Experiments use 10-fold cross-validation. Classification accuracy as a function of the threshold is shown in Figure 2.

All the results show a common pattern: the classification accuracy based on SVM increases with the threshold at first, reaches a peak, and then remains relatively stable. The leukemia data are easier to classify than the others. When $\theta$ is between 0 and 0.3, classification accuracy increases rapidly; when $\theta > 0.3$, classification accuracy is relatively stable. This conforms to the expected behavior: when $\theta$ is large, the requirement for removing redundant attributes is less strict, so the classification accuracy shows no obvious change; conversely, when $\theta$ is small, the requirement is very strict, and removing too many attributes causes a decline in classification accuracy. On balance, the correlation coefficient threshold is set to 0.3.
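The sweep itself is straightforward to script; the sketch below mirrors the Figure 2 procedure using the hypothetical `slle_sc2` from Section 2.4, with `X` and `y` assumed loaded.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Sweep the coexpression threshold and score each gene subset by
# 10-fold cross-validated SVM accuracy.
for theta in np.arange(0.0, 1.01, 0.1):
    genes = slle_sc2(X, y, theta=theta)
    acc = cross_val_score(SVC(), X[:, genes], y, cv=10).mean()
    print(f"theta = {theta:.1f}: mean accuracy = {acc:.3f}")
```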

For convenience of description, the datasets in Table 3 are divided into positive and negative classes: the positive ones are ALL and tumor, and the negative ones are AML and normal, respectively. TP and TN denote the numbers of correctly classified positive and negative examples; FN and FP denote the numbers of misclassified positive and negative examples, respectively. (Note: Acc: overall accuracy; TPR: true positive rate; TNR: true negative rate; FPR: false positive rate; AUC: area under the receiver operating characteristic curve, i.e., the area below the ROC curve that depicts the performance of a classifier using the FPR and TPR pairs [38].)
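For reference, these metrics can be computed from a confusion matrix as follows; `y_true`, `y_pred`, and `y_score` are hypothetical arrays of labels, predictions, and decision scores.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
acc = (tp + tn) / (tp + tn + fp + fn)   # overall accuracy (Acc)
tpr = tp / (tp + fn)                    # true positive rate (TPR)
tnr = tn / (tn + fp)                    # true negative rate (TNR)
fpr = fp / (fp + tn)                    # false positive rate (FPR)
auc = roc_auc_score(y_true, y_score)    # AUC from the FPR/TPR pairs
```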

To present the superiority of the SLLE-SC2 method, we evaluate it in combination with an SVM classifier and adopt 10-fold cross-validation. Table 4 reports the results of various performance metrics on the four biomedicine datasets.

From the results in Table 4, our method combined with SVM classification yields good performance. The lung data acquire the lowest Acc value among all datasets. In terms of the six performance metrics, the leukemia data obtain the largest Acc value and also take first place among the four datasets on the TNR, F-measure, and AUC criteria. In general, the SLLE-SC2 algorithm performs well on high-dimensional and imbalanced classification tasks.

(i) Classification Performance of Feature Genes. Laplacian eigenmaps (LE), locally linear embedding (LLE), supervised locally linear embedding (SLLE), and Spearman’s rank correlation coefficient (SC2) are implemented as competing methods for comparison with the proposed SLLE-SC2 method. The number of nearest neighbors is 5 for LE, LLE, SLLE, and SLLE-SC2. Four classifiers are used for classification: SVM, C4.5 (a decision tree classification algorithm), Naive Bayes (naive Bayesian classification), and $k$-nearest neighbors ($k$NN). Experiments use 10-fold cross-validation; the results are shown in Tables 5–8.

Each result in Tables 5–8 is the average classification accuracy of 20 independent runs. We see that SLLE-SC2 attains the highest average accuracy on the four datasets. Averaging across the four classifiers, SLLE-SC2 obtains the top accuracy, with 100% ($k$NN classifier), 94.8% (Naive Bayes classifier), and 97.9% (SVM classifier) on the leukemia, lung, and prostate datasets, respectively. SC2 achieves the worst performance, and its accuracy is much lower than that of SLLE-SC2. By taking class label information into account, SLLE attains much better classification performance.

(ii) Comparison of the Classification Effect with the Genes Selected by Different Methods. To verify the classification effect of the genes selected by different methods, IGA-FBFE and nine other feature selection methods are used for comparison on the gene expression profiles. The LibSVM classifier in the Weka tool is used for the simulation experiments. The number of feature genes and the classification results are shown in Table 9.

As shown in Table 9, the methods clearly differ in the number of selected genes. For some methods, the number is as high as 60 (e.g., the lung data with the IGA-FBFE method) or even more, while for others the number is less than 10 (such as the MAHP, SU, and SLLE-SC2 methods). However, it is hard to compare the selected genes of the listed methods further, as the genes selected by the other methods are not reported.

As for the classification accuracies, our method produces an accuracy of 99.7% with 5 selected genes for the leukemia data; these results are not inferior to most published works. The colon data yield a small number of selected genes with high accuracy. For the lung data, the ILasso and SU methods obtain better classification than our method, but at the cost of more feature genes. For the prostate data, although BQPSO and IG-SGA achieve higher accuracies of 99.25% and 100%, respectively, they require more feature genes than ours. Clearly, SLLE-SC2 cannot outperform all existing methods; however, it outperforms some published methods and obtains results comparable to most of the listed methods. Some methods produce high classification accuracy but use a large number of selected genes in the classification (e.g., for the prostate data, 26 genes are employed by the IG-SGA method); such results may be difficult to interpret biologically. All of this indicates that our method selects feature genes with high classification ability that reflect the actual structure of the data. The small number of feature genes not only improves the running efficiency of the algorithm but also enhances the understanding of the microarray data.

(iii) Biological Significance. In order to validate the selected genes, Tables 10–13 summarize the index, gene, and description of the selected genes.

We searched for the genes on the website of the National Center for Biotechnology Information (NCBI) (https://www.ncbi.nlm.nih.gov/) to further understand the selected genes. As seen in Tables 10–13, most of the genes are closely associated with cancer, and most of the selected genes are consistent with the results of previous research [22–30]; for example, gene M23197 has been validated as a target of antibody therapy that kills AML leukemia cells [22], and gene X95735 codes an LIM domain protein that is significant in the cell adhesion of fibroblasts [23]. Gene AL050224 plays a role in RNA polymerase activity and is overexpressed in lung tissues [26]. Gene AJ011497 shows low expression in MPM and high expression in ADCA [27]; it is considered a biomarker for lung cancer. Gene M84526 codes another serine protease, adipsin, which is secreted by adipocytes into the bloodstream and functions as part of the alternative complement pathway of the innate immune system [29].

4. Conclusions

In this work, we explore the effects and benefits of SLLE-SC2 in the context of feature selection from high-dimensional genomic data. Specifically, supervised LLE is used to remove redundant genes. Then, considering the relationships among attributes, coexpressed genes are removed using Spearman’s rank correlation coefficient. Our results on four microarray datasets are very promising and are supported by existing biological knowledge. The results of our experiments give insight into both the strengths and the limitations of the SLLE-SC2 method and could represent a useful starting point for better understanding the behavior of these techniques as well as the extent of their applicability to specific tumor problems. In more detail, we study genomic information to better understand the pathogenesis of tumors and to provide a reference for the clinical treatment of tumors.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (nos. 61370169, 61772176, and 61402153) and Key Project of Science and Technology Department of Henan Province (no. 162102210261).