Special Issue: Computational Data Mining in Cancer Bioinformatics and Cancer Epidemiology
Methodology Report | Open Access
Regularized F-Measure Maximization for Feature Selection and Classification
Receiver Operating Characteristic (ROC) analysis is a common tool for assessing the performance of classifiers. It has gained much popularity in medicine and other fields, including the evaluation of biological markers and diagnostic tests. This is partly because misclassification costs are usually unknown in real-world problems, so the ROC curve and related utility functions such as the F-measure can be more meaningful performance measures. The F-measure combines recall and precision into a single global measure. In this paper, we propose a novel method based on regularized F-measure maximization. The proposed method assigns different costs to positive and negative samples and performs simultaneous feature selection and prediction with an $L_1$ penalty. The method is especially useful when the data set is highly unbalanced or when the labels for negative (positive) samples are missing. Our experiments with benchmark, methylation, and high dimensional microarray data show that, in limited experiments, the proposed algorithm performs as well as or better than other popular classifiers.
1. Introduction
Receiver Operating Characteristic (ROC) analysis has received increasing attention in the recent statistics and machine learning literature (Pepe [1, 2]; Pepe and Janes [3]; Provost and Fawcett [4]; Lasko et al. [5]; Kun et al. [6]). ROC analysis originates in signal detection theory and is widely used in medical statistics for visualizing and comparing the performance of binary classifiers. Traditionally, a classifier is evaluated by an estimate of its generalization error or some related measure (Vapnik [7]). However, accuracy (the rate of correct classification) is not always an adequate measure. In fact, when the data are highly unbalanced, accuracy may be misleading, since an all-positive or all-negative classifier may achieve a very good classification rate. In real-life applications, unbalanced data sets arise frequently. Utility functions such as the F-measure or AUC provide a better way to evaluate classifiers, since they can assign different error costs to positive and negative samples.
When the goal is to achieve the best performance under a ROC-based utility function, it may be better to build the classifier by directly optimizing that utility function. In fact, optimizing the log-likelihood function or the mean-square error does not necessarily imply good ROC curve performance. Hence, several algorithms have recently been developed for optimizing the area under the ROC curve (AUC) (Freund et al. [8]; Cortes and Mohri [9]; Rakotomamonjy [10]), and they have been shown to work with varying degrees of success. However, few methods have been proposed for F-measure maximization. Most approaches that we know of use SVMs and vary the parameters of a standard SVM in an attempt to maximize the F-measure as much as possible (Musicant et al. [11]). While this may result in a “best possible” F-measure for a standard SVM, there is no evidence that this technique produces an F-measure comparable with that of a classifier designed specifically to optimize the F-measure. Jansche [12] proposed an approximation algorithm for F-measure maximization in the logistic regression framework. His method, however, gives extremely large values for the estimated parameters and creates too many steep gradients. It therefore either converges very slowly or fails to converge for large datasets.
Our aim in this paper is to propose a novel algorithm that directly optimizes an approximation of the regularized F-measure. The regularization term can be an $L_1$ penalty, an $L_2$ penalty, or a combination of the $L_1$ and $L_2$ penalties, based on different prior assumptions (Tibshirani [13, 14]; Wang et al. [15]). Due to the nature of the $L_1$ penalty, our algorithm provides simultaneous feature selection and classification when the $L_1$ penalty is used. The proposed algorithm can be easily applied to high dimensional microarray data. One advantage of this method is that it remains effective when the data are highly unbalanced, since it assigns different costs to the positive and negative samples.
The paper is organized as follows. In Section 2 we introduce the related concepts of ROC and F-measure. The algorithm and a brief proof of its generalization bounds are presented in Section 3. The computational experiments and performance evaluation are given in Section 4. Finally, conclusions and remarks are given in Section 5.
2. ROC Curves and F-Measure
In binary classification, a classifier attempts to map instances into two classes: positive (p) and negative (n). A given classifier has four possible outcomes: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Table 1 summarizes these outcomes with their associated terminology. The number of positive instances is $n_+ = TP + FN$. Similarly, $n_- = TN + FP$ is the number of negative instances.
From these counts the following statistics are derived:
$$\text{tpr} = \frac{TP}{TP + FN}, \qquad \text{tnr} = \frac{TN}{TN + FP}, \qquad \text{fpr} = \frac{FP}{FP + TN}, \qquad \text{fnr} = \frac{FN}{FN + TP},$$
where the true positive rate (also called recall or sensitivity) is denoted by tpr and the true negative rate (specificity) by tnr. The false positive rate and false negative rate are denoted by fpr and fnr, respectively. Note that $\text{tnr} = 1 - \text{fpr}$ and $\text{fnr} = 1 - \text{tpr}$. We also define the precision $\text{Pr} = TP/(TP + FP)$. ROC curves plot the true positive rate versus the false positive rate as a threshold is varied; the thresholded quantity is usually the probability of membership in a class, the distance to a decision surface, or a score produced by a decision function. In ROC space, the upper left corner represents perfect classification, while the diagonal line represents random classification. A point on the ROC curve that lies to the upper left of another point represents a better model.
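As a small illustration of the rates discussed above, the following Python sketch computes them from the four confusion counts. The function name `confusion_rates` and the example counts are hypothetical and chosen only for illustration; the sketch assumes no denominator is zero.

```python
def confusion_rates(tp, fp, tn, fn):
    """Derive the standard rates from the four confusion counts.

    Assumes at least one positive instance, one negative instance, and one
    predicted positive, so that no denominator is zero.
    """
    n_pos = tp + fn  # number of positive instances
    n_neg = tn + fp  # number of negative instances
    return {
        "tpr": tp / n_pos,            # recall / sensitivity
        "tnr": tn / n_neg,            # specificity
        "fpr": fp / n_neg,            # equals 1 - tnr
        "fnr": fn / n_pos,            # equals 1 - tpr
        "precision": tp / (tp + fp),
    }


# Illustrative counts only; they do not come from the paper's data.
rates = confusion_rates(tp=40, fp=10, tn=90, fn=20)
print(rates)
```

Each choice of threshold yields one such set of counts, and therefore one (fpr, tpr) point on the ROC curve.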
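The F-measure combines the true positive rate (recall) and the precision Pr into a single utility function, defined as the $\gamma$-weighted harmonic mean:
$$F_\gamma = \frac{1}{\gamma\,\dfrac{1}{\text{tpr}} + (1 - \gamma)\,\dfrac{1}{\text{Pr}}}.$$
$F_\gamma$ can be expressed with TP, FP, and FN as follows:
$$F_\gamma = \frac{TP}{TP + \gamma\,FN + (1 - \gamma)\,FP} = \frac{TP}{\gamma\,n_+ + (1 - \gamma)(TP + FP)},$$
where $n_+ = TP + FN$ is the number of positive samples, and $0 \le \gamma \le 1$. Clearly $0 \le F_\gamma \le 1$, and $F_\gamma = 1$ only when all the data are classified correctly. Maximizing the F-measure is equivalent to maximizing a weighted combination of sensitivity and specificity. Therefore, maximizing $F_\gamma$ will indirectly lead to maximizing the area under the ROC curve (AUC).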
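The relation between the weighted harmonic mean and its expression in terms of TP, FP, and FN can be checked numerically. The sketch below assumes the weight is written `gamma` and lies in [0, 1], with `gamma` attached to the recall term; the function names are hypothetical.

```python
def f_measure(tp, fp, fn, gamma=0.5):
    """Weighted F-measure written directly in terms of TP, FP, and FN."""
    denom = tp + gamma * fn + (1.0 - gamma) * fp
    return tp / denom if denom > 0 else 0.0


def f_measure_harmonic(tp, fp, fn, gamma=0.5):
    """Same quantity as a weighted harmonic mean of recall and precision.

    Assumes tp > 0 so that recall and precision are both nonzero.
    """
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return 1.0 / (gamma / recall + (1.0 - gamma) / precision)


# Illustrative counts only.
tp, fp, fn = 40, 10, 20
print(f_measure(tp, fp, fn, 0.5))           # balanced weighting
print(f_measure_harmonic(tp, fp, fn, 0.5))  # same value, harmonic-mean form
print(f_measure(tp, fp, fn, 1.0))           # reduces to recall
print(f_measure(tp, fp, fn, 0.0))           # reduces to precision
```

Under this convention, `gamma = 1` depends only on the positive samples (TP and FN), while `gamma = 0` depends only on the predicted positives (TP and FP).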
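To optimize $F_\gamma$, we have to define TP, FN, and FP mathematically. We first introduce an indicator function
$$I(z \in C) = \begin{cases} 1, & z \in C, \\ 0, & \text{otherwise}, \end{cases}$$
where $C$ is a set. Let $f(w, x)$ be a classifier with coefficients (weights) $w$ and input variable $x$, and let $\hat{y}$ be the predicted value. Given $n$ samples $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i$ is a multidimensional input vector with dimension $m$ and $y_i \in \{0, 1\}$ is its class label, TP, FN, and FP are given, respectively, by
$$TP = \sum_{i=1}^{n} I(\hat{y}_i = 1)\, I(y_i = 1), \qquad FN = \sum_{i=1}^{n} I(\hat{y}_i = 0)\, I(y_i = 1), \qquad FP = \sum_{i=1}^{n} I(\hat{y}_i = 1)\, I(y_i = 0).$$
It is clear that the F-measure is a utility function that applies to the whole data set.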
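The indicator-based counts can be sketched in a few lines of Python. This sketch assumes labels and predictions are coded as 1 (positive) and 0 (negative), which is a coding choice made here for illustration; the function name and the toy vectors are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FN, FP, TN as sums of products of indicator functions.

    Assumes both sequences use 1 for the positive class and 0 for the
    negative class.
    """
    tp = sum(1 for y, p in zip(y_true, y_pred) if p == 1 and y == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if p == 0 and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if p == 1 and y == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if p == 0 and y == 0)
    return tp, fn, fp, tn


# Toy labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print(tp, fn, fp, tn)
```

Because the counts are sums over all samples, the resulting F-measure cannot be decomposed into a per-sample loss, which is why it is described as a utility of the whole data set.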
3. The Algorithm
Usually, given a classifier with known parameters $w$, the F-measure can be calculated on test data to evaluate the performance of the model. The aim of this paper is, however, to learn a classifier and estimate the corresponding parameters $w$ from given training data by regularized F-measure maximization. Since $0 \le F_\gamma \le 1$, we have $\log F_\gamma \le 0$. Statistically, $F_\gamma$ is a probability that measures the proportion of samples correctly classified. Based on these observations, we can maximize $\log F_\gamma$ in the maximum log likelihood framework. Different assumptions for the prior distribution of $w$ lead to different penalty terms. Given the coefficient vector $w$ with dimension $m$, we have $\|w\|_2^2 = \sum_{j=1}^{m} w_j^2$ under the assumption of a Gaussian prior and $\|w\|_1 = \sum_{j=1}^{m} |w_j|$ under a Laplacian prior. In general, the $L_1$ penalty encourages sparse solutions, while classifiers with the $L_2$ penalty are more robust. We make TP, FN, and FP depend on $w$ explicitly and maximize the following penalized F-measure functions:
$$\max_{w}\ \log F_\gamma(w) - \lambda \|w\|_1, \qquad \max_{w}\ \log F_\gamma(w) - \lambda \|w\|_2^2.$$
We have
$$\log F_\gamma(w) = \log TP(w) - \log\bigl(\gamma\, n_+ + (1 - \gamma)\,(TP(w) + FP(w))\bigr).$$
Note that TP, FN, and FP are all integers, and the indicator function in (7) is not differentiable. We first define a smooth S-type function $h(z)$ to approximate the indicator function $I(z > 0)$. Let $z = w^T x$ be a linear score function. The decision rule that sets $\hat{y} = 1$ if $w^T x > 0$ can then be represented as $\hat{y} = I(w^T x > 0) \approx h(w^T x)$.
Figure 1 gives some insight into $h(z)$. It shows that $h(z)$ is a better approximation of $I(z > 0)$ than the sigmoid function $1/(1 + e^{-z})$. The first derivative of $h(z)$, denoted $h'(z)$, is continuous and given in (12). Based on (10) and (11), the approximated versions of TP and FP can be written as follows:
$$TP(w) \approx \sum_{i:\, y_i = 1} h(w^T x_i), \qquad FP(w) \approx \sum_{i:\, y_i = 0} h(w^T x_i).$$
We can find the first-order derivatives of TP and FP, respectively, as follows:
$$\frac{\partial TP}{\partial w} = \sum_{i:\, y_i = 1} h'(w^T x_i)\, x_i, \qquad \frac{\partial FP}{\partial w} = \sum_{i:\, y_i = 0} h'(w^T x_i)\, x_i.$$
Knowing TP and FP and their derivatives, we can maximize the penalized F-measure functions with a gradient-based algorithm such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton method (Broyden [16]). The algorithm for the maximization is straightforward, as shown in Algorithm 1. The step size in the algorithm can be found with line search.
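The smoothed maximization can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the paper's Algorithm 1: a steep sigmoid (`smooth_step`, with a hypothetical steepness `a`) stands in for the S-type function, the penalty is taken to be `lam` times the squared $L_2$ norm, labels are coded 0/1, and plain gradient ascent with a backtracking line search replaces the BFGS update. All names and the toy data are hypothetical.

```python
import math


def smooth_step(z, a=5.0):
    """Steep sigmoid standing in for the S-type function (an assumption)."""
    t = max(-30.0, min(30.0, a * z))  # clip to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-t))


def objective_and_grad(w, X, y, gamma=0.5, lam=0.01, a=5.0):
    """Smoothed log F_gamma(w) minus lam * ||w||_2^2, with its gradient."""
    d = len(w)
    n_pos = sum(y)
    tp = fp = 0.0
    g_tp = [0.0] * d
    g_fp = [0.0] * d
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi))
        h = smooth_step(z, a)
        dh = a * h * (1.0 - h)  # derivative of the sigmoid surrogate
        target = g_tp if yi == 1 else g_fp
        for j in range(d):
            target[j] += dh * xi[j]
        if yi == 1:
            tp += h
        else:
            fp += h
    denom = gamma * n_pos + (1.0 - gamma) * (tp + fp)
    value = math.log(tp) - math.log(denom) - lam * sum(wj * wj for wj in w)
    grad = [
        g_tp[j] / tp - (1.0 - gamma) * (g_tp[j] + g_fp[j]) / denom - 2.0 * lam * w[j]
        for j in range(d)
    ]
    return value, grad


def fit(X, y, gamma=0.5, lam=0.01, iters=100):
    """Gradient ascent with backtracking; accepts only improving steps."""
    w = [0.0] * len(X[0])
    val, g = objective_and_grad(w, X, y, gamma, lam)
    for _ in range(iters):
        step = 1.0
        while step > 1e-8:
            cand = [wj + step * gj for wj, gj in zip(w, g)]
            cand_val, cand_g = objective_and_grad(cand, X, y, gamma, lam)
            if cand_val > val:
                w, val, g = cand, cand_val, cand_g
                break
            step *= 0.5
        else:
            break  # no improving step found: treat as a local maximum
    return w, val


# Toy data: first column is a constant bias feature (illustration only).
X = [[1, 2.0], [1, 1.5], [1, 1.0], [1, 0.5], [1, -1.0], [1, -1.5], [1, -2.0], [1, -0.5]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
start_val, _ = objective_and_grad([0.0, 0.0], X, y)
w_hat, final_val = fit(X, y)
print(start_val, final_val, w_hat)
```

Because only improving steps are accepted, the objective never decreases from one iteration to the next; a quasi-Newton update would typically need fewer iterations on larger problems.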
The regularized F-measure maximization with the $L_1$ penalty ($\|w\|_1$) is of special interest because it favors sparse solutions and can select features automatically. However, maximizing the $L_1$-penalized objective is somewhat more complex, since the objective and $|w_j|$ are not differentiable at 0. For simplicity, let $\ell(w) = \log F_\gamma(w)$, so that the objective is $\ell(w) - \lambda \sum_{j=1}^{m} |w_j|$. The Karush-Kuhn-Tucker (KKT) conditions for optimality are given as follows:
$$\frac{\partial \ell}{\partial w_j} = \lambda\, \operatorname{sign}(w_j) \quad \text{if } w_j \neq 0, \qquad \left| \frac{\partial \ell}{\partial w_j} \right| \le \lambda \quad \text{if } w_j = 0.$$
The KKT conditions tell us that the nonzero coefficients correspond to the variables whose first-order derivative is maximal in absolute value and equal to $\lambda$, and that all variables with smaller derivatives have zero coefficients at the optimal penalized solution. Since $|w_j|$ is differentiable everywhere except at 0, we can design an algorithm that deals with the nonzero coefficients only. Algorithm 2 gives an algorithm that operates on the subspace of the nonzero coefficient set, denoted by $A$. The algorithm includes a procedure to add a variable to $A$ when its first-order derivative becomes large and to remove a variable from $A$ when its coefficient hits 0.
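The optimality conditions for the $L_1$-penalized problem can be checked mechanically. The sketch below assumes `grad` holds the partial derivatives of the unpenalized (smooth) part of the objective and `lam` is the penalty weight; it only tests the conditions and lists candidates for the active set, and is not the paper's Algorithm 2. The function names and example vectors are hypothetical.

```python
def l1_kkt_satisfied(grad, w, lam, tol=1e-6):
    """Check the KKT conditions for maximizing smooth(w) - lam * sum(|w_j|)."""
    for g, wj in zip(grad, w):
        if wj != 0.0:
            # Active coefficient: derivative must equal lam * sign(w_j).
            sign = 1.0 if wj > 0 else -1.0
            if abs(g - lam * sign) > tol:
                return False
        elif abs(g) > lam + tol:
            # Zero coefficient: derivative must not exceed lam in magnitude.
            return False
    return True


def candidates_to_add(grad, w, lam, tol=1e-6):
    """Indices of zero coefficients whose derivative is too large to stay at 0."""
    return [j for j, (g, wj) in enumerate(zip(grad, w))
            if wj == 0.0 and abs(g) > lam + tol]


# Illustrative vectors only.
print(l1_kkt_satisfied([0.5, -0.5, 0.2], [1.0, -2.0, 0.0], lam=0.5))
print(l1_kkt_satisfied([0.5, 0.0, 0.9], [1.0, 0.0, 0.0], lam=0.5))
print(candidates_to_add([0.5, 0.0, 0.9], [1.0, 0.0, 0.0], lam=0.5))
```

An active-set method would alternate between optimizing over the current nonzero coefficients and using a check like `candidates_to_add` to decide which variables enter the set.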
3.1. Computational Considerations
Both $\gamma$ and $\lambda$ are free parameters that need to be chosen. We choose the best values of $\gamma$ and $\lambda$ using the area under the ROC curve (AUC), which is another scalar measure for classifier comparison. Its value lies between 0 and 1. Larger AUC values indicate better classifier performance across the full range of possible thresholds. For datasets with a skewed class distribution or an unknown cost distribution, as in our applications, AUC is a better measure than prediction accuracy.
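Given a binary classification problem with $n_+$ positive class samples and $n_-$ negative class samples, let $f(x)$ be the score function used to rank a sample $x$. AUC is the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. Mathematically,
$$\text{AUC} = \frac{1}{n_+ n_-} \sum_{i=1}^{n_+} \sum_{j=1}^{n_-} I\bigl(f(x_i^+) > f(x_j^-)\bigr),$$
where $I(\cdot)$ is an index function with $I(\cdot) = 1$ if $f(x_i^+) > f(x_j^-)$ and $I(\cdot) = 0$ otherwise. AUC is also called the Wilcoxon-Mann-Whitney statistic (Rakotomamonjy [10]).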
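The pairwise definition of AUC translates directly into code. The sketch below counts a pair only when the positive score is strictly larger, following the index function described above; the function name and the example scores are hypothetical.

```python
def auc_pairwise(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A pair counts only if the positive score is strictly greater; ties
    contribute 0, following the index function in the text.
    """
    wins = sum(1 for p in pos_scores for q in neg_scores if p > q)
    return wins / (len(pos_scores) * len(neg_scores))


# Illustrative scores only.
pos = [0.9, 0.8, 0.4]
neg = [0.5, 0.3, 0.1]
print(auc_pairwise(pos, neg))
```

The double loop costs a number of comparisons equal to the product of the two class sizes, which is fine for small data sets; for large ones a rank-based computation is usually preferred.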
Note that the penalized F-measure objective is generally a nonconcave function with respect to $w$, so only a local maximum is guaranteed. One way to deal with this difficulty is to employ multiple-point initialization. Multiple random starting points are generated, and our proposed algorithms are used to find the maximum from each starting point. The result with the lowest test error is chosen as our best solution.
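The multiple-point initialization can be illustrated with a generic multi-start loop. This is a minimal sketch under several assumptions: a crude coordinate hill climb stands in for the proposed algorithms, a one-dimensional toy function stands in for the nonconcave objective, and the run with the highest objective value is selected, whereas the text selects by test error. All names are hypothetical.

```python
import math
import random


def hill_climb(f, w, step=0.5, iters=200):
    """Crude local search: try +/- step per coordinate, halve step when stuck."""
    best = f(w)
    for _ in range(iters):
        improved = False
        for j in range(len(w)):
            for delta in (step, -step):
                cand = list(w)
                cand[j] += delta
                val = f(cand)
                if val > best:
                    w, best, improved = cand, val, True
        if not improved:
            step *= 0.5
    return w, best


def multi_start(f, dim, n_starts=20, scale=3.0, seed=0):
    """Run the local search from random starting points and keep the best run."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_starts):
        w0 = [rng.uniform(-scale, scale) for _ in range(dim)]
        results.append(hill_climb(f, w0))
    best = max(results, key=lambda r: r[1])
    return best, results


def toy_objective(w):
    """Nonconcave toy function with several local maxima (illustration only)."""
    return math.sin(3.0 * w[0]) - 0.1 * w[0] ** 2


best, results = multi_start(toy_objective, dim=1)
print(best)
```

Swapping the selection key for a held-out error estimate would match the selection rule described in the text more closely.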
4. Computational Results
4.1. Benchmark Data
To evaluate the performance of the proposed method, experiments were performed on six benchmark datasets, which can be downloaded from http://ida.first.fraunhofer.de/projects/bench/benchmarks.htm. These benchmark datasets have been widely used in model comparison studies in machine learning. They are all binary classification problems, and each dataset was randomly divided into training and test data 100 times to prevent bias and overfitting. The data are normalized to zero mean and unit standard deviation. An overview of the datasets is given in Table 2. The computational results for our algorithms, logistic regression, and linear support vector machines are given in Figures 2 and 3.
Figures 2 and 3 show that, in limited experiments, F-measure maximization performs as well as or better than logistic regression and linear support vector machines (SVM). In fact, the test errors for all datasets except Thyroid are competitive with those of the nonlinear classification methods reported by Ratsch (http://ida.first.fraunhofer.de/projects/bench/benchmarks.htm). The inferior performance of F-measure maximization on the Thyroid data suggests strong nonlinear structure in that data.
4.2. Real Methylation Data
The methylation data are from 7 CpG regions and 87 lung cancer cell lines (Virmani et al. [17]; Siegmund et al. [18]). Of these, 41 lines are from small cell lung cancer and 46 lines are from nonsmall cell lung cancer. The proportion of positive values for the different regions ranges from 39% to 100% for small cell lung cancer and from 65% to 98% for nonsmall cell lung cancer. The data are available at http://www-rcf.usc.edu/kims/SupplementaryInfo.html. We use a twofold cross validation scheme to choose the best parameters and test our algorithms. Other cross-validation schemes, such as 10-fold cross validation, lead to similar results but are more computationally intensive. We randomly split the data into two roughly equal-sized subsets, build the classifier with one subset, and test it with the other. To avoid the bias arising from a particular partition, the procedure is repeated 100 times, each time splitting the data randomly into two folds and performing the cross validation. The average computational results for different parameter settings are given in Table 3. Table 3 shows the selected variables (1: selected; 0: not selected), sensitivity, specificity, test errors, and AUC values for different values of $\gamma$. We can see clearly that the sensitivity increases while the specificity decreases as $\gamma$ increases. When $\gamma = 1$, every example is classified as positive. The best $\gamma$ is 0.4 according to AUC but 0.2 according to test error, so there is again some inconsistency between the two measures. Figure 4 gives some insight into how to choose $\lambda$ and the number of features. At the best $\gamma$ and the corresponding optimal $\lambda$, 5 of the 7 CpG regions are selected by F-measure maximization, and these regions have been shown to be predictive of lung cancer subtype (Siegmund et al. [18]). The performance of the model improves in both AUC and test error with only 5 instead of 7 CpG regions.
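The repeated twofold cross validation described above can be sketched as an index generator. This is an illustration under assumptions: each random split is used in both directions (train on one half, test on the other, then swap), which the text does not state explicitly, and the function name, seed, and defaults are hypothetical.

```python
import random


def repeated_twofold_splits(n, repeats=100, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated random twofold CV.

    Each repeat shuffles the n sample indices, cuts them into two roughly
    equal halves, and yields both directions of the split.
    """
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        half = n // 2
        fold_a, fold_b = idx[:half], idx[half:]
        yield fold_a, fold_b  # train on A, test on B
        yield fold_b, fold_a  # train on B, test on A


# 87 cell lines, as in the methylation data; 100 repeats give 200 fits.
splits = list(repeated_twofold_splits(87, repeats=100))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Averaging the sensitivity, specificity, test error, and AUC over all yielded splits gives summary values of the kind reported per parameter setting.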
4.3. High Dimensional Microarray Data
The colon microarray data set (Alon et al. [19]) has 2000 features (genes) per sample and 62 samples, consisting of 22 normal and 40 cancer tissues. The task is to distinguish tumor from normal tissues. The data set was first normalized so that each gene has zero mean and unit variance. The transformed data were then used for all the experiments. We employed the same twofold cross validation scheme to evaluate the model. The computational experiments were repeated 100 times, and the AUC was calculated after each cross validation. The computational results for performance comparison are reported in Table 4.
Table 4 gives some insight into how the model performance changes with different values of $\gamma$. In general, the number of false negatives (FN) decreases and the number of false positives (FP) increases as $\gamma$ increases. The only exception is a single value of $\gamma$ at which both FN and FP are at their worst. The best performance, according to both AUC and the number of misclassified samples, is achieved at a single value of $\gamma$ reported in Table 4.
The 10 selected genes are given in Table 5. The selected genes allow the separation of cancer from normal samples in the gene expression map. Some genes were selected because their activities reflect differences in tissue composition between normal and cancer tissue. Other genes were selected because they play a role in cancer formation or cell proliferation. It is not surprising that some genes implicated in other types of cancer, such as breast and prostate cancers, were identified in the context of colon cancer, because these tissue types share similarities. Our method is supported by the meaningful biological interpretation of the selected genes. For instance, three muscle-related genes (H20709, T92451, and J02854) were selected from the colon cancer data, reflecting the fact that normal colon tissue has higher muscle content, whereas colon cancer tissue has lower muscle content (biased toward epithelial cells). The selection of the x12671 ribosomal protein agrees with the observation that ribosomal protein genes have lower expression in normal than in cancer colon tissue.
5. Conclusions and Remarks
We have presented a novel regularized F-measure maximization method for feature selection and classification. This technique directly maximizes the tradeoff between specificity and sensitivity. Regularization with $L_1$ and $L_2$ penalties allows the algorithm to converge quickly and to perform simultaneous feature selection and classification. We found that, in limited experiments, it performs as well as or better than other popular classifiers.
The proposed method can incorporate nonstandard tradeoffs between sensitivity and specificity through different values of $\gamma$. It is well suited for dealing with unbalanced data or data with missing negative (positive) samples. For instance, in the problem of gene function prediction, the available information is only about positive samples. In other words, we know which genes have the function of interest, while it is generally unclear which genes do not have the function. Most standard classification methods will fail in this setting, but our method can train the model with only positive labels by setting $\gamma = 1$.
One difficulty with regularized F-measure maximization is the nonconcavity of the objective function. We used random multiple-point initialization to find the optimal solutions. More efficient algorithms for nonconcave optimization will be considered to speed up the computations. Applications of the proposed method to gene function prediction and other problems will be explored in the future.
- M. S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford University Press, Oxford, UK, 2003.
- M. S. Pepe, “Evaluating technologies for classification and prediction in medicine,” Statistics in Medicine, vol. 24, no. 24, pp. 3687–3696, 2005.
- M. S. Pepe and H. Janes, “Insights into latent class analysis of diagnostic test performance,” Biostatistics, vol. 8, no. 2, pp. 474–484, 2007.
- F. Provost and T. Fawcett, “Robust classification for imprecise environments,” Machine Learning, vol. 42, no. 3, pp. 203–231, 2001.
- T. A. Lasko, J. G. Bhagwat, K. H. Zou, and L. Ohno-Machado, “The use of receiver operating characteristic curves in biomedical informatics,” Journal of Biomedical Informatics, vol. 38, no. 5, pp. 404–415, 2005.
- D. Kun, C. Bourke, S. Scott, and N. V. Vinodchandran, “New algorithms for optimizing multi-class classifiers via ROC surfaces,” in Proceedings of the 3rd Workshop on ROC Analysis in Machine Learning (ROCML '06), pp. 17–24, Pittsburgh, Pa, USA, June 2006.
- V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
- Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer, “An efficient boosting algorithm for combining preferences,” Journal of Machine Learning Research, vol. 4, no. 6, pp. 933–969, 2004.
- C. Cortes and M. Mohri, “AUC optimization vs. error rate minimization,” in Advances in Neural Information Processing Systems 16, pp. 313–320, MIT Press, Cambridge, Mass, USA, 2003.
- A. Rakotomamonjy, “Optimizing AUC with Support Vector Machine (SVM),” in Proceedings of European Conference on Artificial Intelligence Workshop on ROC Curve and AI, Valencia, Spain, 2004.
- D. R. Musicant, V. Kumar, and A. Ozgur, “Optimizing F-measure with support vector machines,” in Proceedings of the 16th International Florida Artificial Intelligence Research Society Conference (FLAIRS '03), pp. 356–360, St. Augustine, Fla, USA, May 2003.
- M. Jansche, “Maximum expected F-measure training of logistic regression models,” in Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT-EMNLP '05), pp. 692–699, Vancouver, Canada, October 2005.
- R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society. Series B, vol. 58, no. 1, pp. 267–288, 1996.
- R. Tibshirani, “The lasso method for variable selection in the Cox model,” Statistics in Medicine, vol. 16, no. 4, pp. 385–395, 1997.
- L. Wang, J. Zhu, and H. Zou, “The doubly regularized support vector machine,” Statistica Sinica, vol. 16, no. 2, pp. 589–615, 2006.
- C. G. Broyden, “Quasi-Newton methods and their application to function minimization,” Mathematics of Computation, vol. 21, no. 99, pp. 368–381, 1967.
- A. K. Virmani, J. A. Tsou, K. D. Siegmund et al., “Hierarchical clustering of lung cancer cell lines using DNA methylation markers,” Cancer Epidemiology Biomarkers & Prevention, vol. 11, no. 3, pp. 291–297, 2002.
- K. D. Siegmund, P. W. Laird, and I. A. Laird-Offringa, “A comparison of cluster analysis methods using DNA methylation data,” Bioinformatics, vol. 20, no. 12, pp. 1896–1904, 2004.
- U. Alon, N. Barkai, D. A. Notterman et al., “Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays,” Proceedings of the National Academy of Sciences of the United States of America, vol. 96, no. 12, pp. 6745–6750, 1999.
Copyright © 2009 Zhenqiu Liu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.