Abstract

The development of DNA microarrays enables researchers to screen thousands of genes simultaneously and helps determine high- and low-expression genes in normal and diseased tissues. Selecting relevant genes for cancer classification is an important issue. Most gene selection methods use univariate ranking criteria and arbitrarily set a threshold to choose genes. However, the parameter setting may not be compatible with the selected classification algorithms. In this paper, we propose a new gene selection method (SVM-t) based on the use of t-statistics embedded in a support vector machine. We compared its performance to two similar SVM-based methods: SVM recursive feature elimination (SVMRFE) and recursive support vector machine (RSVM). The three methods were compared through extensive simulation experiments and analyses of two published microarray datasets. In the simulation experiments, we found that the proposed method is more robust in selecting informative genes than SVMRFE and RSVM and is capable of attaining good classification performance when the variations of informative and noninformative genes differ. In the analysis of the two microarray datasets, the proposed method identifies fewer genes with good prediction accuracy, compared to SVMRFE and RSVM.

1. Introduction

Feature selection is a critical issue for disease subtyping and cancer classification. There are two goals when optimizing classification procedures: attaining the highest accuracy and selecting the smallest set of features. With the development of microarray technology, experimenters can detect expression profiles of more than ten thousand genes at a time. Classification problems with such a huge number of genes lead to inefficiency, inconsistency, and bias. Many studies reduce the feature set by ranking genes univariately and selecting the top-ranked genes to build a classifier, for example, with t-statistics (Golub et al. [1], Furey et al. [2], and Li et al. [3]) or F-score (Chen and Lin [4]) related ranking methods. Feature selection is the most challenging task in pattern classification research, especially for high-dimensional data.

Depending on their relation to the classification algorithm, feature selection techniques can be classified into three main groups: filter, wrapper, and embedded approaches. Filter methods rank features according to criteria or indices of relevance, such as a univariate test statistic or p value, that are completely independent of the classification algorithm. The filter approach is a stand-alone prior step, regardless of which classification algorithm will be used. Afterwards, the selected feature subset is passed to the classification algorithm; the effects of the selected features on the performance of the algorithm are not taken into account. Classical classification algorithms, such as Fisher’s linear discriminant analysis and k-nearest neighbour, often use the filter approach to select relevant predictors prior to classification (Chen and Lin [4], Roepman et al. [5], Mark and Kung [6], Pavlidis et al. [7], and Yu and Liu [8]). In the wrapper approach, the model selection algorithm is wrapped in the search over feature subsets. This has the advantage that the feature selection process can take feature dependencies into account while building the classification model. However, these methods have high computational complexity because a predictor must be repeatedly trained and tested each time a new subset is defined. West et al. [9] used binary regression coefficients as importance scores to measure the contribution of genes to the classification. Díaz-Uriarte and Alvarez de Andrés [10] proposed the use of the Gini index as a variable importance measure for gene selection within the random forest classification algorithm. Sharma et al. [11] proposed a wrapper-like feature selection method based on null linear discriminant analysis (LDA). In embedded methods, feature selection is an integral part of a specific classification algorithm, for example, decision trees or neural networks with regularization. In addition, Sharma et al. [12] proposed a successive feature selection algorithm that iteratively eliminates redundant features carrying minimal information in terms of classification accuracy. Their method combines filters and wrappers to search for the best top-r feature subset. A recent review (Saeys et al. [13]) summarized many more feature selection techniques and discussed their use in bioinformatics applications.

Over recent years, the support vector machine (SVM; Cristianini and Shawe-Taylor [14]), a supervised machine learning method, has been widely used in classification problems, especially with high-dimensional data such as microarray gene expression profiles. SVM maps the input data points into a higher-dimensional space and constructs a maximal-margin hyperplane to separate data with different class labels. The hyperplane is constructed using only the support vectors (i.e., data points that lie on the margin). The general form of the decision function is represented as

$$f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b, \qquad (1)$$

where $n$ is the size of the training set, $\mathbf{x}_i$ is the input data of sample $i$, $\alpha_i$ is the Lagrange multiplier solved from the training set, $y_i \in \{+1, -1\}$ is the class label of sample $i$, $K(\cdot,\cdot)$ is the kernel function, $\mathbf{x}$ is the variable vector of a sample, and $b$ can be viewed as an intercept. The optimized weight vector equals $\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \Phi(\mathbf{x}_i)$, where $\alpha_i$ is nonzero if $\mathbf{x}_i$ is a support vector and zero otherwise, and $\Phi$ is a nonlinear mapping function from the input space to the feature space.
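
To make the quantities in (1) concrete, the following minimal sketch (ours, not from the original paper) fits a linear SVM with scikit-learn on randomly generated placeholder data and recovers the dual coefficients $\alpha_i y_i$, the support vectors, the intercept $b$, and the weight vector $\mathbf{w}$.

```python
# Minimal sketch (illustrative only): recover the ingredients of (1) from a
# fitted linear SVM using scikit-learn. Data and parameters are placeholders.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))          # 100 samples x 1000 genes
y = np.repeat([1, -1], 50)                # two class labels
X[y == 1, :300] += 0.5                    # shift the first 300 genes in class +1

clf = SVC(kernel="linear", C=1.0).fit(X, y)

dual = clf.dual_coef_.ravel()             # alpha_i * y_i, nonzero only for SVs
svs = clf.support_vectors_                # the support vectors x_i
b = clf.intercept_[0]                     # the intercept b

# For a linear kernel, w = sum_i alpha_i y_i x_i; this matches clf.coef_.
w = dual @ svs
assert np.allclose(w, clf.coef_.ravel())
```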

The support vector machine has many applications and performs very well on microarray-related classification problems (Furey et al. [2] and Brown et al. [15]). In addition, many gene selection methods based on SVM have been proposed. For example, Guyon et al. [16] proposed the support vector machine recursive feature elimination algorithm (SVMRFE), which uses the coefficients of the weight vector to compute the feature ranking score. However, much like a slope, a large weight coefficient tends to be associated with genes whose expression values are high compared with nonselected genes, so noisy but highly expressed genes have a high chance of being selected. On the other hand, Zhang et al. [17] proposed recursive SVM feature selection (RSVM), which combines the weights and the data information (i.e., class means) to formulate the selection criterion. This method takes all data points, including outliers, into consideration. However, its feature selection result is greatly affected by the class label assignment, and there is no general rule for assigning the positive and negative signs to the two classes; such an unstable criterion is difficult to use. Hence, we propose a new feature selection criterion, SVM-t, based on the use of t-statistics embedded in the support vector machine. We apply the univariate ranking method to the support vectors to identify significant genes, and the backward elimination procedure follows the workflow of RSVM with nested subsets of features. The aim of combining these procedures is to identify more significant genes among the nearest support vectors. In this paper, we compare the performance of the three SVM-based gene selection methods via extensive simulations and real microarray data analyses.

2. Materials and Methods

2.1. SVMRFE

In general, classification problems can be reduced to the two-class case. If there are more than two classes, the simplest and most widely used approach is the one-against-all scheme, so a multiclass problem can be decomposed into multiple simple two-class problems. Consider a binary decision function with a linear kernel; the function in (1) can be represented as

$$f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b, \qquad (2)$$

where $\mathbf{w}$ is the weight vector of the classifier. The optimized weight vector equals $\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i$, where $\alpha_i$ is nonzero if $\mathbf{x}_i$ is a support vector and zero otherwise. Hence, for the support vector machine, the binary decision function is determined by a linear combination of support vectors, and the class of a new sample $\mathbf{x}$ is easily determined by the sign of $f(\mathbf{x})$. To achieve better feature selection, Guyon et al. [16] proposed the SVMRFE method to find important gene subsets for cancer classification. To evaluate the importance of feature $j$, SVMRFE uses $w_j^2$ as the ranking criterion, and the features with the smallest ranking scores are eliminated. This criterion can also be interpreted as a weighted sum of support vectors [17]. In addition, based on SVMRFE, Duan et al. [18] used cross-validation to obtain multiple weight values and took their signal-to-noise ratio as the ranking criterion, which makes the multiple-weight SVMRFE robust to small perturbations of a single experiment.
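
As a reading of the SVMRFE procedure just described, the sketch below (our illustration, not the code of Guyon et al.) ranks features by $w_j^2$ from a linear SVM and recursively drops the lowest-ranked ones; the function name and the per-step elimination size are assumptions.

```python
# Illustrative sketch of SVMRFE-style backward elimination using w_j^2 as the
# ranking score; not the authors' implementation.
import numpy as np
from sklearn.svm import SVC

def svmrfe(X, y, n_keep=10, n_drop=1, C=1.0):
    """Recursively eliminate the features with the smallest squared weights
    until only n_keep features remain; returns the surviving column indices."""
    idx = np.arange(X.shape[1])
    while idx.size > n_keep:
        clf = SVC(kernel="linear", C=C).fit(X[:, idx], y)
        score = clf.coef_.ravel() ** 2                    # ranking criterion w_j^2
        drop = min(n_drop, idx.size - n_keep)
        keep = np.sort(np.argsort(score)[drop:])          # drop the lowest scores
        idx = idx[keep]
    return idx
```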

2.2. RSVM

To enhance robustness to noise and outliers, Zhang et al. [17] proposed the recursive support vector machine (RSVM). They developed a feature selection method that constructs stepwise classification rules and reduces the number of genes at the same time. From (2), one can obtain the weight of a specific feature. Unlike SVMRFE, RSVM selects important features by the product of the weight and the class-mean difference of the corresponding feature:

$$s_j = w_j \,(m_j^{+} - m_j^{-}), \qquad (3)$$

where $w_j$ denotes the weight of the $j$th feature, and $m_j^{+}$ and $m_j^{-}$ stand for the means of feature $j$ in the respective classes. This method takes both the classifier information (i.e., the weights) and the data (i.e., the class means of the two classes) into consideration.
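
A minimal sketch of the RSVM score in (3) is given below (our reading of the criterion, with an assumed function name): the linear-SVM weight of each feature is multiplied by the difference of the class means of that feature.

```python
# Illustrative sketch of the RSVM ranking score s_j = w_j * (m_j^+ - m_j^-).
import numpy as np
from sklearn.svm import SVC

def rsvm_scores(X, y):
    """X: samples x features; y: labels coded +1 / -1.
    Returns one score per feature; larger means more important."""
    clf = SVC(kernel="linear", C=1.0).fit(X, y)
    w = clf.coef_.ravel()                     # per-feature weights w_j
    m_pos = X[y == 1].mean(axis=0)            # class means m_j^+
    m_neg = X[y == -1].mean(axis=0)           # class means m_j^-
    return w * (m_pos - m_neg)
```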

With the ranking scores calculated from (3), we can set a threshold and filter out any features whose scores fall below it in the next step. The classification is then performed with the selected features step by step. After a series of iterations with smaller and smaller feature sets, the best rule is constructed from a prespecified number of features selected from the highest selection-frequency list, so the best rule and the number of selected features are decided simultaneously. However, this feature selection result is greatly affected by the class label assignment: we found that different assignments of the class labels (+ or −) result in different selected gene sets. Such an unstable criterion is therefore ambiguous to use.

2.3. Proposed SVM-t Method

The support vector machine uses only the information of the support vectors to construct the maximal separation hyperplane and to determine the classes of new samples. The support vectors, the set of closest points between the two classes, play an important role in SVMRFE and RSVM for feature selection. These two methods use the weights of the corresponding features to build their selection criteria; instead, we combine a univariate ranking method (i.e., the absolute t-statistic) with the support vector machine. The proposed method uses the most important subset of the data points (i.e., the support vectors) to construct the selection criterion (4). In other words, we use the support vector machine as a sampling technique that selects data points from the two classes simultaneously, and the standard two-sample t-statistic is used as a surrogate statistic to evaluate the significance of the difference between the two classes. Our proposed method is easy to implement and keeps the computational complexity comparable to that of SVMRFE and RSVM. Therefore, taking the sample variation into account, we can identify the genes with the most significant differences among the closest points. For feature $j$, the score is

$$t_j = \frac{m_j^{+} - m_j^{-}}{\sqrt{\frac{(s_j^{+})^2}{n^{+}} + \frac{(s_j^{-})^2}{n^{-}}}}, \qquad (4)$$

where $n^{+}$ (resp., $n^{-}$) denotes the number of support vectors of class +1 (resp., −1), and the mean $m_j^{+}$ (resp., $m_j^{-}$) and standard deviation $s_j^{+}$ (resp., $s_j^{-}$) of feature $j$ are computed using only the support vectors labelled class +1 (resp., −1). The features with the highest scores are those with the most significant differences between the two classes, so it is intuitive to select the features with the highest scores as our feature set.

However, there are some exceptions to this criterion. We further break down the selection criterion into four situations as follows (a sketch implementing all four cases follows the list):
(a) the data set can be well separated by two support vectors, so each class has exactly one SV; (4) then reduces to the difference of the feature values of the two SVs;
(b) class −1 has only one SV and class +1 has more than one SV; (4) becomes a single-mean t-test statistic,
$$t_j = \frac{m_j^{+} - x_j^{-}}{\sqrt{(s_j^{+})^2 / n^{+}}}; \qquad (5)$$
(c) similarly, if class +1 has only one SV, the selection criterion is
$$t_j = \frac{x_j^{+} - m_j^{-}}{\sqrt{(s_j^{-})^2 / n^{-}}}; \qquad (6)$$
(d) finally, in the most general condition, each class has more than one support vector. For practical use this condition is more complicated than the previous ones, and the selection criterion in (4) is used.
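
The sketch below (our illustration, under the assumption of a linear SVM fitted with scikit-learn) computes the SVM-t score of (4) from the support vectors only and falls back to the reduced forms (a)-(c) when a class contributes a single support vector.

```python
# Illustrative sketch of the proposed SVM-t score: a two-sample t-statistic
# computed over the support vectors of each class, with the special cases
# (a)-(c) for classes that contribute only one support vector.
import numpy as np
from sklearn.svm import SVC

def svm_t_scores(X, y):
    """X: samples x features; y: labels coded +1 / -1."""
    clf = SVC(kernel="linear", C=1.0).fit(X, y)
    sv, sv_y = X[clf.support_], y[clf.support_]       # support vectors only
    pos, neg = sv[sv_y == 1], sv[sv_y == -1]
    n_pos, n_neg = len(pos), len(neg)

    diff = pos.mean(axis=0) - neg.mean(axis=0)
    if n_pos == 1 and n_neg == 1:                     # case (a): two SVs in total
        return np.abs(diff)
    var = np.zeros(X.shape[1])
    if n_pos > 1:                                     # contributes (s_j^+)^2 / n^+
        var += pos.var(axis=0, ddof=1) / n_pos
    if n_neg > 1:                                     # contributes (s_j^-)^2 / n^-
        var += neg.var(axis=0, ddof=1) / n_neg
    return np.abs(diff) / np.sqrt(var)                # cases (b)-(d)
```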

We follow the workflow of feature selection suggested in [17], as shown in Figure 1. First, a monotonically decreasing sequence $d_1 > d_2 > \cdots > d_K$ for the number of selected features is specified, where $K$ is the total number of runs; the number of input features for each run is determined after executing the selection criterion. Second, cross-validation is adopted to carry out the gene selection procedure and to execute the SVM procedure with $d_k$ features; to compare the selection methods, we apply the three different criteria in this step. For each run, the features with the top $d_k$ ranking scores are selected, until the sequence ends. Finally, the gene set with the minimal CV error is selected, and the set with the fewest genes is chosen when ties occur.
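
The following simplified sketch (ours; the original workflow in [17] performs the selection inside each cross-validation fold) wires any of the three ranking criteria into the decreasing-sequence backward elimination, records the leave-one-out CV error at each feature level, and returns the smallest set attaining the minimal error.

```python
# Simplified illustration of the backward-elimination workflow of Figure 1.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

def recursive_selection(X, y, score_fn, sizes):
    """score_fn(X, y) returns one ranking score per feature (e.g. SVM-t);
    sizes is a decreasing sequence such as [500, 250, 100, 50, 20, 10, 5]."""
    idx = np.arange(X.shape[1])
    results = []
    for d in sizes:
        scores = score_fn(X[:, idx], y)
        idx = idx[np.argsort(scores)[::-1][:d]]       # keep the top-d features
        acc = cross_val_score(SVC(kernel="linear"), X[:, idx], y,
                              cv=LeaveOneOut()).mean()
        results.append((d, 1.0 - acc, idx.copy()))
    # minimal CV error first, then the fewest genes in case of ties
    return min(results, key=lambda r: (r[1], r[0]))
```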

3. Results

3.1. Simulation Experiments

First, we evaluate the performance of the three methods using simulated data sets. Our simulated data use different means and standard deviations for informative and noninformative genes. We first generate a training set with 100 samples (50 samples per class), each containing expression values of 1000 genes, of which 300 are informative genes and the remaining 700 are noninformative. To validate the selected gene sets, we also generate another 1000 independent samples (500 samples per class). We performed 100 simulations for each data set and used the leave-one-out cross-validation (CV) method. The following two simulation cases use the same scheme with varying parameter settings.
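
The sketch below generates data following the design just described; the distribution parameters (means, standard deviations) are placeholders we chose for illustration and are not the exact values used in the paper.

```python
# Illustrative data generator for the simulation design: 1000 genes, the first
# 300 informative (means swapped between the two halves), 700 noninformative.
# All distribution parameters here are placeholders, not the paper's values.
import numpy as np

def simulate(n_per_class=50, n_genes=1000, n_inform=300,
             mu=0.15, sd_inform=0.5, sd_noise=0.5, seed=0):
    rng = np.random.default_rng(seed)
    y = np.repeat([1, -1], n_per_class)
    X = rng.normal(0.0, sd_noise, size=(2 * n_per_class, n_genes))
    half = n_inform // 2
    # first half of informative genes: higher mean in class +1;
    # second half: higher mean in class -1
    X[y == 1, :half] = rng.normal(mu, sd_inform, size=(n_per_class, half))
    X[y == -1, half:n_inform] = rng.normal(mu, sd_inform,
                                           size=(n_per_class, n_inform - half))
    return X, y

X_train, y_train = simulate()                         # 100 training samples
X_test, y_test = simulate(n_per_class=500, seed=1)    # 1000 independent samples
```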

Simulation Case I
In this simulation, we separate the informative gene set into two parts. The first 150 genes independently follow a Gaussian distribution whose mean is higher for class 1 than for class 2; conversely, the last 150 genes independently follow a Gaussian distribution whose mean is higher for class 2 than for class 1. The remaining 700 "noninformative" genes independently follow a common Gaussian distribution. Table 1 summarizes the results of the 100 runs, reporting the average and standard deviation of the percentage of informative genes covered and the number of support vectors (SVs). With this parameter setting, the informative genes cannot be easily differentiated because of the small mean difference. However, taking the standard deviation into consideration, we found that the proposed method shows better results than RSVM and SVMRFE.
In addition, we increase the standard deviation of the informative genes to 1 so that the difference in variation between informative and noninformative genes is reduced. Table 2 shows that RSVM yields a better average test error rate and selects more informative genes than SVMRFE and the proposed method. However, the proposed method selects fewer support vectors in each recursive step.
Based on the previous setting, we further increase the variance of the noninformative genes, setting their standard deviation twice as large as that of the informative genes. Table 3 shows that all three methods give worse test error rates than before because of the more complex nature of the simulation scheme. The coverage percentages of informative genes by RSVM and SVMRFE decrease as the number of selected genes decreases, but the proposed method maintains high coverage of informative genes. Besides, the test error rate of the proposed method stays at around 25%, which is lower than the error rates of RSVM and SVMRFE.

Simulation Case II
Here, we increase the mean values of the informative genes from 0.15 to 0.25 and keep the distribution of the 700 "noninformative" genes unchanged. In Table 4, we find that this simulated data is well separated because of the larger difference between the two classes. The test error rates and percentages of selected informative genes are comparable among the three methods, and the prediction results are almost perfect (i.e., test errors are nearly zero) for all conditions in this simulation. The proposed method utilizes fewer support vectors than the other two methods, though.
We further set the standard deviation of the noninformative genes to 1 to eliminate the difference in variation between informative and noninformative genes. The result in Table 5 is very similar to that of Table 2, with lower test error rates for the three methods. With a decreasing gene number, RSVM gives a better average test error rate than SVMRFE and the proposed method. The percentages of selected informative genes are comparable among the three methods, which may be due to the larger difference of the informative genes between the two classes. In addition, our proposed method again selects fewer support vectors.
Finally, we set the standard deviation of the noninformative genes twice as large as that of the informative genes. All three methods yield higher test error rates than in the previous settings, but the proposed method performs better than RSVM and SVMRFE by significant margins on such complex simulated data.

Stability of Feature Selection
To investigate the stability of feature selection, we repeat simulation case I 200 times with switched class labels. The frequency plot of selected features in Figure 2 shows that switching the class labels in RSVM results in different sets of selected features, while our proposed method does not alter the selection frequencies when the class labels are switched. In addition, RSVM tends to select noninformative genes much more frequently than our approach. Figure 3 plots the Jaccard coefficient of RSVM, comparing the similarity of the feature selections under the two switched class labelings over the 200 simulations. We observe that the class labelling in RSVM is crucial for achieving better agreement of feature selection; the low Jaccard coefficients reveal heterogeneity between the two class labelings in some simulations. In contrast to RSVM, our proposed method is independent of the class labels and appears to be better than RSVM at selecting a small number of discriminative genes.
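
For reference, the Jaccard coefficient used in Figure 3 compares two selected gene sets as the size of their intersection over the size of their union; a minimal sketch with hypothetical gene indices follows.

```python
# Minimal sketch of the Jaccard coefficient between the gene sets selected
# under the two opposite class labelings; the gene indices are hypothetical.
def jaccard(set_a, set_b):
    a, b = set(set_a), set(set_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

genes_original = {12, 57, 301, 442, 918}   # selected with the original labels
genes_switched = {12, 57, 88, 442, 630}    # selected with the labels switched
print(jaccard(genes_original, genes_switched))   # 3 shared / 7 total = 0.4286
```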

3.2. Application on the Human Breast Cancer Data

We next evaluated the performance of the three methods using two microarray datasets. The first dataset is from an Affymetrix DNA microarray analysis of a series of primary breast cancer samples [9]. It contains gene expression profiles of 7129 detected probe sets for 49 samples (24 ER-positive and 25 ER-negative patients). Preprocessing steps including background correction, normalization, PM correction, and expression summarization were conducted, and the gene expression levels were log2-transformed. All 7129 genes were included in the gene selection process. The minimal gene number is set at 5 with a decreasing ratio of 0.85 for generating the decreasing gene selection sequence (Table 6).

Table 7 shows that RSVM selects fewer discriminant genes than SVMRFE and the proposed method. However, in earlier recursive steps, the proposed method attains the minimal CV error rate (2.04%) and may select more significant genes than SVMRFE and RSVM in the first few iterations. Taking the gene number into consideration as well, we may choose 10 genes with a CV error rate of 8.16% as our solution. SVMRFE can also attain a CV error rate of 8.16% with 20 selected genes, and RSVM attains its minimal error rate (4.08%) with 5 genes. For gene numbers below 20, the proposed method has a higher number of support vectors than SVMRFE and RSVM, and these two methods have almost identical numbers of support vectors. Table 8 shows that the proposed method selected 10 genes, each with a p value < 0.0004.

3.3. Application on the Lung Cancer Data

The second dataset is a lung cancer dataset used in CAMDA 2003 [19]. Gene expression profiles of 7129 genes for 86 lung cancer patients (67 stage I and 19 stage III tumors) are available for analysis. Preprocessing and parameter settings similar to those of the breast cancer analysis are adopted.

The error rates in Table 9 show that the lung cancer data are more difficult than the breast cancer data for predicting the disease subtypes. SVMRFE and RSVM only reach error rates above 20%, while the proposed method attains CV error rates below 10% at several feature levels. Table 10 lists the genes selected by the proposed method, most of which have p values less than 0.05.

4. Discussion

The proposed SVM-t method is a multivariate feature selection approach in which sample heterogeneity is accommodated during a sequential backward elimination process. The method benefits from the use of the support vectors defining the SVM hyperplane and selects a group of informative genes that helps improve classification performance. Using the support vectors for feature selection is intuitive because they play an essential role in building the decision function of the SVM. For the linear SVM, the feature dimensions corresponding to the support vectors are also critical for classification by the property of linearity. Therefore, we consider trimmed t-statistics that compare group means after removing the nonsupport vectors, and we show that they have higher power to identify relevant features than the other two SVM-based approaches in the case of high noise.

The results of the simulation studies show that all three feature selection methods (SVMRFE, RSVM, and the proposed method) perform well when the distribution of differentially expressed genes is very different from that of the noise genes. The influences can be attributed to two sources: the magnitude of the mean difference between the two classes and the variance of the gene expression levels. When the magnitude of the difference is fixed, the results are greatly affected by the variance. With identical variances for informative and noninformative genes, RSVM performs better because it gives more weight to informative genes (i.e., genes with a larger difference between the two classes become more important in this case). On the other hand, when the informative and noninformative genes have distinct variances, the proposed method outperforms RSVM and SVMRFE. Moreover, the proposed method not only selects more informative genes with a lower test error rate than RSVM and SVMRFE but also accommodates heterogeneity within complex clinical samples.

In the applications, we use two published microarray datasets to evaluate the performance of the three methods via leave-one-out cross-validated accuracy. We find that the proposed method consistently selects a smaller subset of informative genes with good prediction accuracy. Both the proposed method and RSVM outperform SVMRFE on the two datasets.

All SVM-based feature selection methods under comparison select features based on the support vectors of the linear SVM. Ideally, when the data points are not linearly separable, an SVM can use a proper kernel function to map the data into a feature space where a separating hyperplane exists. However, the correspondence between support vectors and feature dimensions in the nonlinear SVM is far from clear, and further work is required to fully understand this issue. In addition, it is well known that choosing proper kernel functions and parameters for a specific real database remains challenging, and the number of support vectors that compose the decision function increases dramatically when the decision manifold becomes complicated. Therefore, our approach is limited to the linear SVM here, which we consider an alternative approximation for the real datasets. Nevertheless, if the objective is to select relevant features for better classification performance, rather than to provide insights into the relative importance of features in the feature space, our approach provides an alternative solution. Further work will focus on the impact of our trimmed t-statistics on the use of nonlinear SVMs for the classification of nonlinear datasets.

Authors’ Contribution

C.-A. Tsai and C.-H. Huang contributed equally to this work.

Acknowledgments

The authors thank Dr. Chen-Hsin Chen for helpful discussion and edits to the paper. This project was initiated when all the authors were with the Genomic Statistics Unit (stationed at the Academia Sinica), Advanced Bioinformatics Core of the National Research Program for Genomic Medicine (NRPGM) in Taiwan. The work was partially supported by the NRPGM Grants (NSC94-3112-B-001-012-Y, NSC95-3112-B-001-018, and NSC 96-3112-B-001-017) from the National Science Council.