Research Article | Open Access
A Semisupervised Feature Selection with Support Vector Machine
Feature selection has proved to be a beneficial tool in learning problems with the main advantages of interpretation and generalization. Most existing feature selection methods do not achieve optimal classification performance, since they neglect the correlations among highly correlated features which all contribute to classification. In this paper, a novel semisupervised feature selection algorithm based on support vector machine (SVM) is proposed, termed SENFS. In order to solve SENFS, an efficient algorithm based on the alternating direction method of multipliers is then developed. One advantage of SENFS is that it encourages highly correlated features to be selected or removed together. Experimental results demonstrate the effectiveness of our feature selection method on simulation data and benchmark data sets.
Feature selection, which aims to select relevant feature subsets from among thousands of potentially irrelevant and redundant features, is a challenging topic in pattern recognition research that has attracted much attention over the last few years. A good feature selection method offers several advantages for a learning algorithm, such as reducing computational cost, increasing classification accuracy, and improving the comprehensibility of results.
Depending on how class label information is used, feature selection methods can be classified into supervised, unsupervised, and semisupervised methods. Supervised feature selection methods usually use only information from labeled data to find the relevant feature subsets [2–4]. However, in many real-world applications, labeled data are very expensive or difficult to obtain, which makes it hard to create a large training data set. This situation arises naturally in practice, where large amounts of data can be collected automatically and cheaply, while manual labeling of samples remains difficult, expensive, and time consuming. Unsupervised feature selection methods can be an alternative in this case, since they exploit the information conveyed by the large amount of unlabeled data [5, 6]. However, because these unsupervised algorithms ignore label information, important hints from labeled data are left out, which generally degrades their performance. Semisupervised approaches [7–10] combine supervised and unsupervised methods and exploit the information of both labeled and unlabeled data. A good survey about semisupervised feature selection approaches can be found in .
The performance of most existing semisupervised feature selection methods is insufficient when there are several highly correlated features that are all relevant to classification, even though the way these features interact can help with the interpretability of the problem at hand. Given these premises, this paper provides two main contributions as follows.
(i) We present a novel semisupervised feature selection scheme, termed SENFS, based on the support vector machine (SVM) and the elastic net penalty proposed by Zou and Hastie, which combines $\ell_1$ and $\ell_2$ regularization.
(ii) In order to solve SENFS despite the nondifferentiability of both the loss function and the $\ell_1$-norm regularization term, an efficient algorithm based on the alternating direction method of multipliers (ADMM) is developed.
Compared with other semisupervised feature selection algorithms, SENFS provides the following benefits.
(i) It permits highly correlated features to be selected or removed together.
(ii) It performs automatic feature selection as part of the training process and can achieve better classification performance using the selected features.
The effectiveness of SENFS is validated on simulated data and six benchmark semisupervised data sets. Our main finding is that SENFS can identify the features that are relevant to classification using the data set that consists of only a few labeled samples and many unlabeled samples.
This paper is organized as follows. Section 2 briefly introduces the methodology. In Section 3, we derive an iterative algorithm based on ADMM that yields the entire solution path of the proposed method. In Section 4, we evaluate the performance of the proposed method on both simulated and real-world data, followed by a summary in Section 5.
Assume that all samples sampled from the same population generated by target concept consist of features. Given a set of samples , in which is the number of samples, the th sample or input vector of original feature with features is denoted by . The set can be divided into two parts: labeled set for which labels are provided with for binary problem and unlabeled set whose labels are not given, where and are the number of labeled and unlabeled samples, respectively, and . Then, the generic goal of semisupervised feature selection is to find a feature subset with features which contains the most informative features using both data information of and . In other words, the samples represented in the -dimensional space can well preserve the information of the samples represented in the original -dimensional space.
We begin our discussion with binary supervised feature selection based on the elastic net penalty. Wang et al.  proposed a supervised feature selection method for binary classification problems based on SVM with the elastic net penalty term, named the doubly regularized support vector machine (DrSVM), which optimizes the following generic objective function over the hyperplane parameters : where the decision function is defined as and both and are tuning parameters, and is the regularization parameter. is the margin loss function; for example, hinge loss . The role of the $\ell_1$-norm penalty is to allow feature selection, and the role of the $\ell_2$-norm penalty is to help groups of highly correlated features get selected or removed together, which is known as the grouping effect .
As for semisupervised feature selection, considering and , inspired by the semisupervised learning algorithm TSVM , we apply the elastic net penalty to semisupervised feature selection (SENFS), which solves the following optimization task over both the hyperplane parameters and the unlabeled vector :
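For orientation, the elastic net SVM objective described above can be sketched in the following form. The symbols are assumptions, since the original notation is not preserved in this text: $w$ and $b$ stand for the hyperplane parameters, $\lambda_1$ and $\lambda_2$ for the two tuning parameters, $C$ for the regularization parameter, $l$ for the number of labeled samples, and $L$ for the margin loss. The placement of the constants is likewise an assumption.

```latex
\min_{w,\,b}\;\; \frac{\lambda_2}{2}\,\|w\|_2^2 \;+\; \lambda_1\,\|w\|_1 \;+\; C\sum_{i=1}^{l} L\bigl(y_i f(x_i)\bigr),
\qquad f(x) = w^{\top}x + b,
\qquad L(z) = \max(0,\,1 - z).
```

The $\ell_1$ term drives individual coefficients to exactly zero, while the strictly convex $\ell_2$ term is what produces the grouping effect.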
The constraint in (2) is called the balancing constraint and is necessary to avoid the trivial solutions where all unlabeled samples are assigned to the same class. This constraint enforces a manually chosen constant and an approximation of this constraint writes  with . So the constraint can be rewritten as . and employ the same loss; for example, hinge loss .
Obviously, the difficulty of the above optimization task lies in finding the optimal assignment for the unlabeled vector and the hyperplane parameters , which is a mixed-integer programming problem . As described in , for a fixed , . So problem (2) can be equivalently written as
On the other hand, one effective approximation of the loss function was a clipped variant  which can be expressed as where is the Ramp loss defined as with . In our experiments, the typical value of is 0.3. The main reason to use the clipped symmetric hinge loss is the gain of sparsity in the number of support vectors yielded by the optimizer . Then we can get
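The losses named above can be sketched in a few lines of Python. This is a minimal illustration, assuming the convention of Collobert et al. in which the Ramp loss is the difference of two shifted hinge losses and the clipping point `s` is nonpositive; the text above reports the value 0.3, so the default `s = -0.3` and its sign are assumptions, as are all function names.

```python
def hinge(z, s=1.0):
    """Shifted hinge H_s(z) = max(0, s - z); s = 1 gives the usual hinge loss."""
    return max(0.0, s - z)


def ramp(z, s=-0.3):
    """Ramp loss R_s(z) = H_1(z) - H_s(z): the hinge loss clipped at the value 1 - s."""
    return hinge(z, 1.0) - hinge(z, s)


def clipped_symmetric_hinge(z, s=-0.3):
    """Loss on the output z = f(x) of an unlabeled sample: R_s(z) + R_s(-z)."""
    return ramp(z, s) + ramp(-z, s)
```

The symmetric loss is largest for outputs close to the decision boundary and flat for large positive or negative outputs, which is what pushes unlabeled samples away from the boundary.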
From (5), it follows that solving the optimization problem (3) with the clipped symmetric hinge loss is equivalent to solving a classical SVM in which each unlabeled sample is counted twice, with when and when , which are artificial labels. Therefore problem (3) can be rewritten as
As can be seen from (6), when , SENFS reduces to a supervised feature selection algorithm, and when , it becomes an unsupervised model.
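The "counted twice" construction can be illustrated with a small helper that builds the augmented training set. This is only a sketch of the data layout, not of the optimization itself; the function name and the returned flag list are hypothetical.

```python
def duplicate_unlabeled(X_labeled, y_labeled, X_unlabeled):
    """Return (X, y, is_artificial) with every unlabeled sample added twice.

    The first copy of each unlabeled sample gets the artificial label +1,
    the second copy gets the artificial label -1.
    """
    X = list(X_labeled)
    y = list(y_labeled)
    is_artificial = [False] * len(y_labeled)
    for label in (+1, -1):
        X.extend(X_unlabeled)
        y.extend([label] * len(X_unlabeled))
        is_artificial.extend([True] * len(X_unlabeled))
    return X, y, is_artificial
```

With one labeled and two unlabeled samples, the augmented set contains five rows: the labeled sample followed by two copies of each unlabeled sample with opposite artificial labels.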
In the following, we will illustrate how SENFS has the grouping effect for correlated features. The following theorem describes this point.
Theorem 1. Denote the solution to (6) by and , the input th feature by , and the input th feature by . Then for any pair , one can have where and are positive finite constants. Furthermore, if the input features and are centered and normalized, then where and are the sample correlations between and , , , and , .
The term in (7) describes the difference between the coefficient paths of and . If both features are highly correlated, that is, , Theorem 1 says that the difference between their coefficient paths is almost 0, in which case both features will be selected or removed together. The upper bound in (7) or (8) provides a quantitative description of the grouping effect of SENFS.
3. Algorithm for SENFS
The alternating direction method of multipliers (ADMM) was developed in the 1970s and is well suited to distributed convex optimization and in particular to large-scale problems arising in statistics, machine learning, and related areas. The method is closely related to many other algorithms, such as the method of multipliers , Douglas-Rachford splitting , Bregman iterative algorithms  for $\ell_1$ problems, and others.
In this section, we first propose an efficient algorithm to solve SENFS based on ADMM by introducing auxiliary variables and reformulating the original problem. We then prove its convergence property and derive the adjustment principle for the penalty parameters. Finally, we describe the stopping criterion and computational cost.
3.1. Deriving ADMM for SENFS
It is hard to solve model (6) directly due to the nondifferentiability of the three loss functions and the $\ell_1$-norm term. In order to derive an ADMM algorithm, we introduce some auxiliary variables to handle these nondifferentiable terms.
Let denote labeled data, let denote unlabeled data, and let be diagonal matrices whose diagonal elements are given by the vectors and , respectively. The constrained problem in (6) can be reformulated into an equivalent form where , , and , is an -column vector of 1s and is an -column vector of 1s, and . The Lagrangian function of (9) is
In problem (10), , and , , and are dual variables corresponding to the constraints , , and , respectively, corresponds to the constraint , and is a scalar corresponding to the balancing constraint. As in the method of multipliers, we form the augmented Lagrangian where are parameters. Problem (11) is in the form required by ADMM, which consists of the following iterations:
The efficiency of the iterative algorithm (12) depends on whether the first equation of (12) can be solved quickly. According to the theory of ADMM, the variables , , and are updated in an alternating or sequential fashion, which accounts for the term alternating direction. So we can get
For the first equation in (13), it is equivalent to the following convex optimization:
The objective function in the above minimization problem is quadratic and differentiable, and since minimizes this function by definition, the optimal solution can be found by solving a set of linear equations
In (15), is the identity matrix and the coefficient matrix is a matrix that is independent of the optimization variables. For the large , small setting, the term in the coefficient matrix will be a positive low-rank matrix with rank at most , while the term in the coefficient matrix will be a positive low-rank matrix with rank at most . Therefore, the coefficient matrix is also a low-rank matrix with rank at most . If we use the conjugate gradient (CG) method to solve problem (15), it will converge in fewer than steps .
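The CG step can be sketched as follows. The coefficient matrix of (15) is not preserved in this text, so the routine below is a generic conjugate gradient solver for a symmetric positive definite system; the function name and defaults are assumptions, and NumPy is assumed to be available.

```python
import numpy as np


def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for a symmetric positive definite matrix A."""
    n = b.shape[0]
    max_iter = n if max_iter is None else max_iter
    x = np.zeros(n)
    r = b - A @ x              # residual of the current iterate
    p = r.copy()               # current search direction
    rs_old = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs_old) < tol:
            break
        Ap = A @ p
        alpha = rs_old / (p @ Ap)          # exact line search along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p      # next A-conjugate direction
        rs_old = rs_new
    return x
```

In exact arithmetic, CG terminates in at most as many steps as the matrix has distinct eigenvalues, so a matrix of the form identity plus a rank-k term needs at most k + 1 steps. This is the usual reason CG is attractive for systems with many features and few samples.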
For the second equation in (13), it is equivalent to solving
Proposition 2. Let where . Then
Combined with Proposition 2 and
we can update according to Corollary 3.
Corollary 3. The update of in (16) is equivalent to where .
Combined with Proposition 2 and
we can update according to Corollary 4.
Corollary 4. The update of in (20) is equivalent to where .
Proposition 5. Let , where . Then
Proof. The function is strongly convex, so it has a unique solution. Therefore, by the subdifferential calculus , is the unique minimizer of the following equation:
where is the subdifferential of the function . According to , can be expressed as
With (25) and (26), we can get the desired result.
Then, combined with Proposition 5 and we can update according to Corollary 6.
Corollary 6. The update of in (23) is equivalent to where .
For the fifth equation in (13), it is equivalent to solving
Corollary 7. The update of in (29) is where , and .
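In ADMM, the update associated with an $\ell_1$-norm term usually has a closed form known as soft thresholding. Whether Corollary 7 takes exactly this form cannot be recovered from the text, so the following is a sketch of the standard operator only; the names `v` and `kappa` are assumptions.

```python
def soft_threshold(v, kappa):
    """Return the minimizer over x of kappa * |x| + 0.5 * (x - v) ** 2, for kappa >= 0."""
    if v > kappa:
        return v - kappa
    if v < -kappa:
        return v + kappa
    return 0.0


def soft_threshold_vector(values, kappa):
    """Apply soft thresholding to each coordinate of a coefficient vector."""
    return [soft_threshold(v, kappa) for v in values]
```

Coordinates whose magnitude does not exceed `kappa` are set exactly to zero, which is how an $\ell_1$ penalty removes features.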
According to the theory of ADMM, , , , and must fall below a certain threshold . Considering the auxiliary variables in (11), we expect that , , and also must be small. Therefore, in our experiments, the algorithm for SENFS stops whenever
Finally, we can get the algorithm for SENFS. The detailed procedure of the algorithm ADMM for SENFS is summarized in Algorithm 8 as follows.
Algorithm 8. ADMM algorithm for SENFS.
Input. Labeled data set ; unlabeled data set ; tuning parameters and ; regularization parameter .
Output. Selected feature set.
Step 1. Initialize , and .
Step 2. If (31) is satisfied, go to Step 3; otherwise,
(1) update and according to (15);
(2) update , , , and according to Corollaries 3–7, respectively;
(3) update , and according to (12).
Step 3. Get the best feature subset according to . If , the corresponding th feature is abandoned; otherwise, it is selected as an important feature.
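Step 3 of the algorithm can be sketched as a simple filter on the learned coefficient vector. The name `w` and the numerical tolerance `tol` are assumptions; the condition stated in Step 3 is not preserved in the text and is taken here to mean a zero coefficient.

```python
def select_features(w, tol=1e-8):
    """Return the indices of features whose coefficient is not numerically zero."""
    return [j for j, w_j in enumerate(w) if abs(w_j) > tol]
```

A small tolerance is used instead of an exact comparison with zero because iterative solvers rarely return exact zeros in every auxiliary variable.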
3.2. Convergence Analysis and Computational Cost
The convergence property of Algorithm 8 can be derived from the theory of the alternating direction method of multipliers. According to the standard convergence theory of ADMM, Algorithm 8 satisfies the dual variable convergence . So Theorem 9 holds.
Theorem 9. Suppose that is a solution of (5). Then the following property holds:
As for the computational issue, it is hard to predict the computational cost because it depends on all the penalty parameters. In our experience, a few hundred iterations are enough to get a reasonable result. On the other hand, the efficiency of Algorithm 8 depends mainly on whether we can quickly solve the linear equations (15). The computational cost for solving (15) is .
3.3. Varying Penalty Parameter
To make performance less dependent on the initial choice of the penalty parameters, it is useful to vary them across iterations. In our experience, the penalty parameters , and have a huge influence on the performance and the number of iterations required, so they are selected adaptively.
The necessary optimality conditions for the problem (6) are dual feasibility as Since minimizes by definition, we have that
For , let = + , and the constraint conditions are and the constraint of (3). Through the same solving process as parameter , we can get the residual . Similarly, we can get the residual for parameter and for parameter . With these residuals, we can get a simple scheme to update , and , respectively, according to Corollary 10.
Corollary 10. The update of is where , and are parameters. Typical choices might be and .
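The formula of Corollary 10 is not preserved in this text. A common scheme of this kind is the residual balancing rule described by Boyd et al., sketched below under that assumption; the parameter names and the defaults (a factor of 10 for the imbalance test and a factor of 2 for the increase and decrease) follow that reference and may differ from the values used here.

```python
def update_penalty(rho, primal_res_norm, dual_res_norm,
                   mu=10.0, tau_incr=2.0, tau_decr=2.0):
    """Residual balancing: keep the primal and dual residual norms within a factor mu."""
    if primal_res_norm > mu * dual_res_norm:
        return rho * tau_incr      # primal residual too large: penalize infeasibility more
    if dual_res_norm > mu * primal_res_norm:
        return rho / tau_decr      # dual residual too large: relax the penalty
    return rho
```

The same rule would be applied separately to each penalty parameter, using the residual pair that belongs to its constraint.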
4. Experimental Evaluation
This section examines the performance of SENFS with respect to feature selection and test error on simulated data and six benchmark data sets. To evaluate the effectiveness of SENFS, we compare it with an existing semisupervised feature selection algorithm, Spectral , and a supervised feature selection algorithm, DrSVM , which also exhibits the grouping effect. To evaluate the quality of the selected features, an SVM was trained on them. The experiments were run on a desktop with a Pentium 2.0 GHz CPU and 1.99 GB of main memory. The programs were run on Windows with MATLAB R2009a.
The limited number of samples prevents us from having sufficient independent training and testing data for performance evaluation. It is very common to apply cross-validation (CV) in this scenario. We used 5-fold CV: we partitioned the data set into five complementary subsets of equal size. Four subsets were used as training data; the remaining subset served as test data. We repeated this process five times such that each of the five subsets was used exactly once as test data. To get a more reliable estimate, we performed the 5-fold CV 10 times, and the reported results are averages over the test data sets. Moreover, finding an appropriate value of the tuning parameter pair and is essential for the performance of SENFS. We employed 10-fold CV over a large grid.
We evaluate the performance of SENFS with respect to three factors: the correlation between relevant features, denoted by , the number of labeled samples, and the degree of overlap among classes, denoted by . Consider a 2-class problem in which the samples lie in a -dimensional space, with the first 10 dimensions being relevant to classification and the remaining features being noise, where the correlation between the first 10 features is . The number of samples is 300 with . The samples from the +1 class are drawn from a normal distribution with mean and covariance as follows: where the diagonal elements of are 1 and the off-diagonal elements are all equal to . The −1 class has a similar distribution except that its mean is:
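The evaluation protocol above (5-fold CV repeated 10 times) can be sketched as an index generator. The function name, the seed, and the interleaved fold assignment are assumptions made for illustration.

```python
import random


def repeated_kfold(n_samples, n_folds=5, n_repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)                                    # new random partition per repeat
        folds = [idx[k::n_folds] for k in range(n_folds)]   # near-equal complementary folds
        for k in range(n_folds):
            test_idx = folds[k]
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train_idx, test_idx
```

Averaging the test error over all 50 splits gives the repeated-CV estimate; within each repeat, every sample appears in exactly one test fold.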
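A data set of this kind can be generated along the following lines. Only the sample count (300) and the number of relevant features (10) come from the text; the dimension `p`, the correlation `rho`, the mean shift `mu`, the balanced class sizes, and the choice of a negated mean for the −1 class are hypothetical values chosen for illustration. NumPy is assumed to be available.

```python
import numpy as np


def simulate(n=300, p=50, n_relevant=10, rho=0.8, mu=0.5, seed=0):
    """Two Gaussian classes whose first n_relevant features are equicorrelated."""
    rng = np.random.default_rng(seed)
    cov = np.eye(p)
    cov[:n_relevant, :n_relevant] = rho       # correlation among the relevant features
    np.fill_diagonal(cov, 1.0)                # unit variance for every feature
    mean = np.zeros(p)
    mean[:n_relevant] = mu                    # only relevant features separate the classes
    n_pos = n // 2
    X_pos = rng.multivariate_normal(mean, cov, size=n_pos)
    X_neg = rng.multivariate_normal(-mean, cov, size=n - n_pos)
    X = np.vstack([X_pos, X_neg])
    y = np.concatenate([np.ones(n_pos, dtype=int), -np.ones(n - n_pos, dtype=int)])
    return X, y
```

A smaller `mu` makes the two classes overlap more, and a larger `rho` makes the relevant features harder to distinguish from one another, which is the regime where the grouping effect matters.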
To evaluate the effect of the correlation between relevant features, SENFS is compared with Spectral and DrSVM, measured by the number of selected features with two labeled samples and . The results are summarized in Table 1. As shown in Table 1, on this simulated data, when the relevant features are highly correlated (e.g., ), Spectral and DrSVM tend to keep only a small subset of the relevant correlated variables and overlook the others, while the SENFS tends to identify all of them, due to the grouping effect. These three methods seem to work well in removing irrelevant features.
The effects of the number of labeled samples on test error over the top 10 selected features are summarized in Figure 1 with and . As can be seen, the test errors of SENFS, Spectral, and DrSVM decrease as the number of labeled samples increases, but SENFS achieves the best classification performance across the range of labeled sample counts, which may imply that SENFS makes better use of the labeled samples than Spectral and DrSVM. The supervised feature selection method DrSVM achieves the worst results because it relies only on the few labeled samples and discards the large amount of unlabeled samples.
In Table 2, the effect of the degree of overlap among classes on test error over the top 5 selected features is evaluated with two labeled samples and and , together with the typical computational time in our experiments. As we can see, SENFS seems to have the best prediction performance. When is small, the two classes overlap largely, and in this case the other methods perform worse than SENFS. However, because SENFS solves problem (6) with an iterative procedure, it requires more computational time than the other methods, as shown in Table 2. Note that the absolute values are less important than the relative differences between the individual methods.
4.2. Application to Benchmark Data Sets
Several benchmark data sets, which are used in [7, 8] to test the performance of semisupervised algorithms, are selected to test the performance of SENFS. These benchmark data sets consist of 9 semisupervised learning data sets. We did not test the SSL6, SSL8, and SSL9 data sets, since SSL6 includes six classes, SSL8 contains too many samples ( is over one million), and SSL9 has too many dimensions ( is over ten thousand). The names and characteristics of the remaining six data sets are given in Table 3.
In this study, we evaluate performance through 5-fold cross-validation; that is, we randomly select four fifths of the unlabeled samples, plus all the labeled samples, for SENFS, Spectral, and DrSVM to select optimal feature subsets, while leaving the remaining one fifth for measuring the test error on the selected features using SVM, where all the labeled samples are used for training the SVM. The results measured by test error are reported in Table 4. As can be seen, SENFS outperforms the semisupervised and supervised feature selection methods on all six data sets when and . When , Spectral performs second best on the USPS, COIL2, and BCI data sets, while DrSVM performs second best on the Digit1, BCI, g241c, and g241n data sets when .
This paper has proposed a novel semisupervised feature selection algorithm based on SVM and the elastic net penalty. The whole methodology of SENFS and the solution path based on ADMM have been described in detail in this paper. The experimental results illustrate that SENFS can identify the relevant features and encourage highly correlated features to be selected or removed together.
Future work will address how the selected features relate semantically to the data they are selected from, which can be used for unknown data analysis, and will extend SENFS to the multiclass case.
Proof of Theorem 1. Consider another set of coefficients
Then we have
It is simple to verify that both the loss function and are Lipschitz continuous, so we have
Similarly, we can get where and are positive constants. As described in , we get and . Then combining (A.3), (A.4), and (A.2) implies that
Let , and (7) is obtained. For (8), we simply use the two inequalities , and .
This work was supported by the National Key Basic Research Program of China (973 Program) under Grant no. 613148 and the National Science and Technology Major Project of the Ministry of Science and Technology under Grant no. 2010ZX03006-002, 2011ZX03005-003-03. The authors are grateful to the anonymous reviewers for their helpful comments.
- Y. Hong, S. Kwong, Y. Chang, and Q. Ren, “Unsupervised feature selection using clustering ensembles and population based incremental learning algorithm,” Pattern Recognition, vol. 41, no. 9, pp. 2742–2756, 2008.
- J. L. Rodgers and W. A. Nicewander, “Thirteen ways to look at the correlation coefficient,” The American Statistician, vol. 42, no. 1, pp. 59–66, 1988.
- J. Lin, J. Ming, and D. Crookes, “Robust face recognition with partial occlusion, illumination variation and limited training data by optimal feature selection,” IET Computer Vision, vol. 5, no. 1, pp. 23–32, 2011.
- T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, Hoboken, NJ, USA, 2nd edition, 2006.
- L. Talavera, C. Nord, and J. Girona, “Dependency-Based Feature Selection for Clustering Symbolic Data,” Intelligent Data Analysis, vol. 4, no. 1, pp. 19–28, 2000.
- H. Elghazel and A. Aussem, “Feature selection for unsupervised learning using random cluster ensembles,” in Proceedings of the 10th IEEE International Conference on Data Mining (ICDM '10), pp. 168–175, Sydney, Australia, December 2010.
- O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, Mass, USA, 2006.
- M. Sugiyama, T. Idé, S. Nakajima, and J. Sese, “Semi-supervised local Fisher discriminant analysis for dimensionality reduction,” Machine Learning, vol. 78, no. 1-2, pp. 35–61, 2010.
- F. Bellal, H. Elghazel, and A. Aussem, “A semi-supervised feature ranking method with ensemble learning,” Pattern Recognition Letters, vol. 33, no. 10, pp. 1426–1432, 2012.
- Z. Zhao and H. Lu, “Semi-supervised feature selection via spectral analysis,” in Proceedings of the 7th SIAM International Conference on Data Mining, pp. 641–646, April 2007.
- L. Wang, J. Zhu, and H. Zou, “The doubly regularized support vector machine,” Statistica Sinica, vol. 16, no. 2, pp. 589–615, 2006.
- H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society B, vol. 67, no. 2, pp. 301–320, 2005.
- Y. Guibo, C. Yifei, and X. Xiaohui, “Efficient variable selection in support vector machines via the alternating direction method of multipliers,” Journal of Machine Learning Research, vol. 15, pp. 832–840, 2011.
- T. Joachims, “Transductive inference for text classification using support vector machines,” Proceedings of the 16th International Conference on Machine Learning (ICML '99), pp. 200–209, 1999.
- O. Chapelle, V. Sindhwani, and S. S. Keerthi, “Optimization techniques for semi-supervised support vector machines,” Journal of Machine Learning Research, vol. 9, pp. 203–233, 2008.
- C. A. Floudas, Nonlinear and Mixed-Integer Optimization: Fundamentals and Applications, Oxford University Press, New York, NY, USA, 1995.
- R. Collobert, F. Sinz, J. Weston, and L. Bottou, “Large scale transductive SVMs,” Journal of Machine Learning Research (JMLR), vol. 7, pp. 1687–1712, 2006.
- R. T. Rockafellar, “A dual approach to solving nonlinear programming problems by unconstrained optimization,” Mathematical Programming, vol. 5, pp. 354–373, 1973.
- C. Wu and X.-C. Tai, “Augmented lagrangian method, dual methods, and split bregman iteration for ROF, Vectorial TV, and high order models,” SIAM Journal on Imaging Sciences, vol. 3, no. 3, pp. 300–339, 2010.
- T. Goldstein and S. Osher, “The split Bregman method for L1-regularized problems,” SIAM Journal on Imaging Sciences, vol. 2, no. 2, pp. 323–343, 2009.
- Y. Saad, Iterative Methods For Sparse Linear Systems, Society for Industrial Mathematics, 2003.
- G.-B. Ye and X. Xie, “Split Bregman method for large scale fused Lasso,” Computational Statistics and Data Analysis, vol. 55, no. 4, pp. 1552–1569, 2011.
- J. B. Hiriart-Urruty and C. Lemarechal, Convex Analysis and Minimization Algorithms, Springer, Berlin, Germany, 1993.
- S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2010.
Copyright © 2013 Kun Dai et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.