Abstract

The F-measure is one of the most commonly used performance metrics in classification, particularly when the classes are highly imbalanced. Direct optimization of this measure is challenging because no closed-form solution exists. Current algorithms design classifiers by optimizing approximations to the F-measure; these algorithms are not efficient and do not scale well to large datasets. To fill this gap, in this paper we propose a novel algorithm that efficiently optimizes the F-measure with a cost-sensitive SVM. First, we present an explicit transformation from the optimization of the F-measure to a cost-sensitive SVM. We then adopt the bundle method to solve the inner optimization. Since the existing bundle method can exhibit fluctuations in the primal objective during the iterations, an additional line search procedure is introduced, which alleviates the fluctuation problem and makes our algorithm more efficient. Empirical studies on large-scale datasets demonstrate that our algorithm provides significant speedups over current state-of-the-art F-measure based learners, while obtaining more precise (or comparable) solutions.

1. Introduction

SVM (Support Vector Machine) is a powerful classification tool, well known for its strong theoretical foundation and good generalization ability. In the binary setting, it is often evaluated by accuracy (the rate of correct classification). However, accuracy is not always a good measure and may be misleading when the class distribution is imbalanced. In this situation, a utility function such as the F-measure provides a better way to evaluate a classifier, since it trades off precision against recall [1]. As a popular performance metric, the F-measure has been widely used in diverse applications such as information retrieval [2, 3], biometrics [4, 5], and natural language processing [6, 7].

Owing to its importance, the F-measure has been well studied in the machine learning community, and many works have focused on designing F-measure based classifiers. However, directly optimizing the F-measure is difficult, as the resulting optimization problem is nonconvex and no closed-form solution exists. Therefore, various approximation algorithms have been proposed, which mainly fall into two paradigms [8]. The Empirical Utility Maximization (EUM) approach learns a classifier with optimal performance on the training data [9–16], while the decision-theoretic (DT) approach learns a probabilistic model and then predicts the labels with maximum expected F-measure [17–20]. Since our aim in this paper is to design an efficient classifier for maximizing the F-measure, and the DT approach may require high computational complexity in the prediction step [8], we focus on the Empirical Utility Maximization approach in the following.

As the F-measure is a nonconvex metric, the EUM approach often designs convex surrogates for it, and this has resulted in the development of two types of methods. The first type is the “direct method,” which directly defines surrogate objective functions for maximizing the F-measure [9–15]. One representative work is SVMperf, which adopts structural SVM as the surrogate framework and uses a cutting plane algorithm to solve the inner optimization [11]. This algorithm has many virtues (such as good generalization performance and rapid convergence) and is viewed as the most important and successful algorithm in the EUM approach. Suzuki et al., Cheng et al., and Chinta et al. extended SVMperf and applied it to different areas [12–15]. The second type is the “indirect method,” recently proposed by Parambath et al. [16]. It is a novel method that solves the problem by transforming it into cost-sensitive classification.

Although both methods are effective and known to work fairly well in many different areas, they share one disadvantage: neither is very efficient, which may prevent their use in large-scale applications. Moreover, the key contribution of the novel “indirect method” is theoretical: the authors present an analysis showing that the optimal F-measure classifier can be obtained by a reduction to cost-sensitive classification. However, that paper does not give an explicit solution for how to convert F-measure maximization into a cost-sensitive problem.

To fill this gap, in this paper we focus on binary classification and propose a novel algorithm that efficiently optimizes the F-measure with a cost-sensitive SVM. Our algorithm appears to belong to the “indirect method”; however, it uses an optimization technique similar to that of SVMperf, so it can be viewed as a combination of the “direct method” and the “indirect method.” More specifically, this paper makes the following contributions:
(1) Different from Parambath’s work, which only gives a theoretical analysis, we present an explicit transformation from maximizing the F-measure to a cost-sensitive SVM.
(2) For the new cost-sensitive problem, we propose to solve it with the bundle method, which is similar to the cutting plane algorithm used in SVMperf and has an O(1/ε) rate of convergence.
(3) Unlike SVMperf, which suffers from fluctuations in the primal objective, our method introduces an additional line search procedure into the bundle method, which avoids this undesirable effect and makes our algorithm more efficient.
(4) Empirical evaluations on large-scale imbalanced datasets demonstrate that, compared with existing F-measure based classifiers, our learner greatly reduces training time while obtaining better (or comparable) model accuracy.

The remainder of the paper is organized as follows. In Section 2, the related work is presented. Section 3 discusses the details of our proposed algorithm and the empirical results on the benchmark datasets are reported in Section 4. Section 5 concludes the paper and discusses the future work.

2. Preliminaries and Related Work

2.1. Problem Setup and Notations

As discussed in the introduction, we consider only the binary classification problem. We are given a training dataset S = {(x_i, y_i)}_{i=1}^{n}, where x_i ∈ R^d is the i-th example and y_i ∈ {+1, −1} is the corresponding class label. For simplicity, we assume that the positive instances come before the negative ones, so that 1, …, n_+ are the indexes of the positive instances and the rest are those of the negatives; n_+ and n_− denote the numbers of positive and negative instances, respectively. The binary classification problem is to construct a classifier function f(x) that gives good generalization performance. In this paper, we assume the classifier is of the linear form f(x) = w·x, and the decision function h(x) = sign(w·x) is used to find the label of an unseen example. Note that we have not included a bias term in the classifier function for notational convenience. However, it can be incorporated in a straightforward way.

In machine learning, a common way to find the linear parameter w is to minimize a regularized risk function:

min_w  J(w) = (λ/2)‖w‖² + (1/n) Σ_{i=1}^{n} L(y_i, f(x_i)),

where λ > 0 is a constant that controls the trade-off between training error minimization and margin maximization, and L(y, f(x)) is a suitable loss function which measures the discrepancy between a true label y and the value f(x) predicted using parameter w. Different loss functions yield different learners. One of the most famous loss functions is the hinge loss in SVM, which has the form L(y, f(x)) = max(0, 1 − y·(w·x)).
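To make the objective concrete, the regularized hinge risk can be sketched in a few lines of Python (a minimal illustration; the function names and the toy data are our own, not from the paper):

```python
def dot(w, x):
    # inner product <w, x> for the linear classifier f(x) = w . x
    return sum(wj * xj for wj, xj in zip(w, x))

def hinge_loss(y, score):
    # SVM hinge loss: max(0, 1 - y * f(x)), with y in {+1, -1}
    return max(0.0, 1.0 - y * score)

def regularized_risk(w, data, lam):
    # J(w) = (lam / 2) * ||w||^2 + (1 / n) * sum_i hinge_loss(y_i, w . x_i)
    reg = 0.5 * lam * dot(w, w)
    emp = sum(hinge_loss(y, dot(w, x)) for x, y in data) / len(data)
    return reg + emp
```

Minimizing `regularized_risk` over w recovers the standard (unbiased) linear SVM.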

2.2. Relevant Background

Given an SVM classifier h(x) = sign(w·x), its performance can be evaluated by the confusion matrix in Table 1.

TP, TN, FP, and FN in Table 1 denote the true positive, true negative, false positive, and false negative counts, respectively:

TP = |{i : y_i = +1, h(x_i) = +1}|,  FN = |{i : y_i = +1, h(x_i) = −1}|,
FP = |{i : y_i = −1, h(x_i) = +1}|,  TN = |{i : y_i = −1, h(x_i) = −1}|.

By using the confusion matrix in Table 1, precision and recall can be expressed as

Precision = TP / (TP + FP),  Recall = TP / (TP + FN).

In imbalanced learning, one often uses the weighted harmonic mean of precision and recall, named the F-measure, to evaluate the performance of a classifier [1]. It can be formally defined as

F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall),     (4)

where β ∈ (0, ∞). It is obvious that as β → 0 the F-measure tends to the precision, and as β → ∞ the F-measure turns to the recall. In practice, the most widely used F-measure is F₁, which means β = 1.
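The definition above can be checked numerically. The following sketch (with our own helper names) computes precision, recall, and F_β directly from the confusion counts:

```python
def precision_recall(tp, fp, fn):
    # precision = TP / (TP + FP), recall = TP / (TP + FN)
    return tp / (tp + fp), tp / (tp + fn)

def f_measure(tp, fp, fn, beta=1.0):
    # weighted harmonic mean of precision and recall:
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
    p, r = precision_recall(tp, fp, fn)
    b2 = beta * beta
    return (1.0 + b2) * p * r / (b2 * p + r)
```

With β close to 0 the value approaches the precision, and with a large β it approaches the recall, matching the limiting cases above.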

Because of its popular usage in imbalanced classification, various approaches have been proposed for maximizing the F-measure. One main paradigm is Empirical Utility Maximization, which learns a classifier with optimal F-measure on the training data. However, direct optimization of the F-measure is difficult, as the resulting optimization problem is nonconvex; thus approximation techniques are often used instead (since these algorithms directly design approximate objective functions oriented to the F-measure, they are termed the “direct method” [8]). For example, Musicant et al., Liu et al., and Joachims et al. have designed different surrogate algorithms for optimizing the F-measure [9–11]. Among them, the work of Joachims et al. (referred to as SVMperf) is the most important, since their work not only provides a general framework for optimizing any imbalanced measure, but also uses an efficient inner optimization technique. This algorithm makes use of a cutting plane solver along the lines of the structural SVM and has an O(1/ε) rate of convergence for any desired precision ε. Based on this work, Suzuki et al. and Cheng et al. applied SVMperf to CRFs and topical classification [12, 13], while Chinta et al. and Dembczynski et al. further extended it to sparse learning and multilabel learning [14, 15]. Recently, an “indirect method” for optimizing the F-measure using cost-sensitive technology has been proposed by Parambath et al. [16]. The authors took advantage of the pseudolinearity of the F-measure and presented a theoretical analysis showing that the optimal classifier for the F-measure can be obtained by solving a cost-sensitive problem. Both the “direct method” and the “indirect method” are effective and suitable for many applications. However, the two methods share a common limitation: they are not very efficient, which may prohibit their use on large-scale datasets.
For the “direct method,” we take SVMperf as an example, since it is one of the most efficient algorithms in the EUM approach. Although it has a rapid convergence rate, its inner optimization only guarantees that the dual objective increases monotonically; it does not guarantee that the primal objective decreases monotonically [21]. These fluctuations in the primal objective may slow down the practical convergence and make SVMperf inefficient. A similar problem occurs with the “indirect method,” because it must be implemented on top of other existing SVM solvers, which may exhibit the same undesirable fluctuations during the iterations. Furthermore, the main contribution of the recently proposed “indirect method” is theoretical: the authors only present an analysis showing that F-measure maximization can be reduced to cost-sensitive classification. They do not give an explicit transformation from the optimization of the F-measure to a cost-sensitive SVM.

So in the following, by giving an explicit transformation, we present a novel cost-sensitive SVM based algorithm that maximizes the F-measure. The algorithm uses the bundle method as the inner optimizer and avoids the fluctuations in the primal objective by adding a line search procedure, which makes our algorithm more efficient than existing algorithms such as [8, 11, 16].

3. Efficient Algorithm for Optimizing the F-Measure with Cost-Sensitive SVM

3.1. From Maximizing the F-Measure to Cost-Sensitive Classification

Based on the definition in formula (4), the F-measure can be further expressed in terms of the confusion matrix as

F_β = (1 + β²) · TP / ((1 + β²) · TP + β² · FN + FP).     (5)

Noting that TP = n_+ − FN and exploiting the pseudolinearity of the F-measure [16], we can find that maximizing F_β is equivalent to minimizing the following problem:

min δ · (β² · FN + FP),     (6)

where δ is a positive constant. Since δ and β are both constants, formula (6) can be simplified as

min C_+ · FN + C_− · FP,     (7)

where C_+ = δβ² and C_− = δ. Based on the definitions of FN and FP, formula (7) can be rewritten as

min Σ_{i=1}^{n_+} C_+ · 1[h(x_i) ≠ y_i] + Σ_{i=n_+ + 1}^{n} C_− · 1[h(x_i) ≠ y_i],     (8)

where C_+ and C_− are the misclassification cost parameters for the positive and negative classes, respectively. It is obvious that formula (8) is a cost-sensitive problem: the lower the total cost, the better the classification performance.
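The reduction can be illustrated on a toy problem: for each candidate cost pair we pick the decision rule minimizing the weighted cost C_+·FN + C_−·FP and keep the cost setting whose classifier has the best F₁. The score-thresholding “classifier,” the cost grid, and all names below are our own illustration of the idea, not the paper’s algorithm:

```python
def f1(tp, fp, fn):
    # F1 written directly from counts: 2*TP / (2*TP + FP + FN)
    return 2.0 * tp / (2.0 * tp + fp + fn) if tp else 0.0

def counts_at(scores, labels, thr):
    # confusion counts of the rule "predict +1 iff score >= thr"
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= thr)
    fp = sum(1 for s, y in zip(scores, labels) if y == -1 and s >= thr)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < thr)
    return tp, fp, fn

def min_cost_threshold(scores, labels, cost_pos, cost_neg):
    # cost-sensitive step: pick the cutoff minimizing cost_pos*FN + cost_neg*FP
    best_t, best_c = None, float("inf")
    for t in sorted(set(scores)) + [max(scores) + 1.0]:
        tp, fp, fn = counts_at(scores, labels, t)
        c = cost_pos * fn + cost_neg * fp
        if c < best_c:
            best_t, best_c = t, c
    return best_t

def best_f1_by_cost_sweep(scores, labels, grid):
    # outer loop of the indirect method: sweep the cost ratio, keep best F1
    best = (0.0, 0.0)
    for c in grid:
        thr = min_cost_threshold(scores, labels, c, 1.0 - c)
        best = max(best, (f1(*counts_at(scores, labels, thr)), thr))
    return best
```

With the right cost ratio, the cost-minimizing rule coincides with the F₁-optimal one, which is exactly the intuition behind the reduction.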

3.2. Efficient Algorithm for the Cost-Sensitive SVM

Based on formula (8), we can transform the F-measure maximization problem into a cost-sensitive SVM by replacing the 0/1 errors with hinge losses, which is demonstrated as

min_w  Σ_{i=1}^{n_+} C_+ · max(0, 1 − y_i·(w·x_i)) + Σ_{i=n_+ + 1}^{n} C_− · max(0, 1 − y_i·(w·x_i)).     (9)

Formula (9) is equivalent to

OP1:  min_w  J(w) = (λ/2)‖w‖² + R_emp(w),  R_emp(w) = (1/n) Σ_{i=1}^{n} c_i · max(0, 1 − y_i·(w·x_i)),     (10)

where c_i = C_+ for positive instances, c_i = C_− for negative ones, and λ > 0 is a constant that controls the trade-off between training error minimization and margin maximization. We regard OP1 as a regularized risk minimization problem and adopt the bundle method to solve it. The bundle method uses subgradients of the empirical risk function R_emp to build its piecewise linear lower bound, which is similar to the cutting plane algorithm (CPA) used in SVMperf. In contrast to CPA, the linear lower bound in the bundle method is augmented with a stabilization term, which guarantees a good quality solution [22]. By taking first-order Taylor approximations of the empirical risk, the lower bound is tightened iteratively until the gap between the approximate lower bound and the true risk is smaller than a predefined threshold ε. The whole procedure is described in Algorithm 1.

(1) Input: convergence threshold ε;
(2) Initialize: weight vector w_0 = 0 and iteration index t = 0;
(3) repeat
(4)  t = t + 1;
(5)  Compute subgradient a_t ∈ ∂R_emp(w_{t−1});
(6)  Compute bias b_t = R_emp(w_{t−1}) − a_t · w_{t−1};
(7)  Update the lower bound R_t(w) = max_{1≤j≤t} (a_j · w + b_j);
(8)  w_t = argmin_w J_t(w), where J_t(w) = (λ/2)‖w‖² + R_t(w);
(9)  Compute current gap ε_t = min_{0≤j≤t} J(w_j) − J_t(w_t);
(10) until ε_t ≤ ε;
(11) Output: w_t as the solution of OP1.
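In miniature, Algorithm 1 can be sketched for a one-dimensional problem as follows. This is a hypothetical illustration, not the paper’s implementation: in 1-D the inner subproblem of step (8) is convex and can be solved by ternary search on the stabilized lower bound instead of a QP solver, and all names are our own:

```python
def bundle_method(risk, subgrad, lam, w0=0.0, eps=1e-6, max_iter=50):
    # Bundle method sketch for the 1-D problem:
    #   min_w  J(w) = (lam / 2) * w**2 + risk(w)
    # Each iteration adds the cutting plane risk(w) >= a*w + b taken at the
    # current iterate, then minimizes the stabilized lower bound.
    planes = []
    w = w0
    best_primal = 0.5 * lam * w * w + risk(w)
    for _ in range(max_iter):
        a = subgrad(w)                  # subgradient of the risk at w
        b = risk(w) - a * w             # offset ("bias") of the cutting plane
        planes.append((a, b))

        def lower(u):
            # stabilized piecewise-linear lower bound of J
            return 0.5 * lam * u * u + max(ai * u + bi for ai, bi in planes)

        lo, hi = -1e3, 1e3              # bracket for the 1-D ternary search
        for _ in range(200):
            m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
            if lower(m1) < lower(m2):
                hi = m2
            else:
                lo = m1
        w = 0.5 * (lo + hi)
        best_primal = min(best_primal, 0.5 * lam * w * w + risk(w))
        if best_primal - lower(w) <= eps:   # gap test of steps (9)-(10)
            break
    return w, best_primal
```

On a single-example hinge risk max(0, 1 − w) with λ = 1, the method converges to w = 1 with objective 0.5 after a couple of cutting planes.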

It has been proved that the bundle method in Algorithm 1 has an O(1/ε) rate of convergence for any desired precision ε [22]. Although the bundle method is effective, it also suffers from fluctuations in the primal objective, just as the CPA of SVMperf does. More specifically, when solving the subproblem in step (8), it always selects a new cutting plane such that the dual objective monotonically increases. However, this selection does not guarantee that the primal objective monotonically decreases, which means the primal objective values can fluctuate heavily between iterations and slow down the practical convergence of the algorithm.

To solve this problem and speed up convergence, we now present an additional line search procedure for step (8), which guarantees that the primal objective monotonically decreases and makes our algorithm more efficient.

3.3. Efficient Line Search Algorithm

First of all, we introduce an intermediate variable w_b^t that maintains the best-so-far solution during the first t iterations, which means J(w_b^0), J(w_b^1), … is a monotonically decreasing sequence.

Secondly, the new w_b^t is found by searching along the line between the previous best w_b^{t−1} and the original bundle solution w_t, which gives the following:

OP2:  μ* = argmin_{μ ≥ 0} J(w_b^{t−1} + μ(w_t − w_b^{t−1})),  w_b^t = w_b^{t−1} + μ*(w_t − w_b^{t−1}).     (11)

Finally, the new cutting plane is computed to approximate the primal objective at a point w_c^t, which lies in a vicinity of w_b^t. More specifically, the point w_c^t is obtained by

w_c^t = (1 − θ)·w_b^t + θ·w_t,     (12)

where θ ∈ (0, 1] is a predefined parameter. With the point w_c^t, the new cutting plane is given by a_{t+1} ∈ ∂R_emp(w_c^t) and b_{t+1} = R_emp(w_c^t) − a_{t+1} · w_c^t.

Similar to step (9) of Algorithm 1, a natural stopping condition for our improved algorithm is

J(w_b^t) − J_t(w_t) ≤ ε.

With these changes to the bundle method, we can generate a monotonically decreasing sequence of primal objective values and achieve faster convergence. However, in practice there remains one problem with our improved algorithm: how to solve OP2 efficiently. In the following we give an efficient algorithm for this problem, which only needs O(n log n) time.

Firstly, combining formulas (10) and (11), we can obtain

J(w_b^{t−1} + μ(w_t − w_b^{t−1})) = (λ/2)‖w_b^{t−1} + μ(w_t − w_b^{t−1})‖² + (1/n) Σ_{i=1}^{n} c_i · max(0, 1 − y_i·((w_b^{t−1} + μ(w_t − w_b^{t−1}))·x_i)).

We abbreviate g(μ) = J(w_b^{t−1} + μ(w_t − w_b^{t−1})), expand the quadratic term, and get

That is,

g(μ) = (A/2)μ² + Bμ + g_0 + Σ_{i=1}^{n} max(0, C_i + D_i·μ),

where A = λ‖w_t − w_b^{t−1}‖², B = λ·w_b^{t−1}·(w_t − w_b^{t−1}), g_0 = (λ/2)‖w_b^{t−1}‖², C_i = (c_i/n)·(1 − y_i·(w_b^{t−1}·x_i)), and D_i = −(c_i/n)·y_i·((w_t − w_b^{t−1})·x_i).

OP2 is then equivalent to solving min_{μ ≥ 0} g(μ). Since g is a convex function, its minimum is attained at a point μ* whose subdifferential contains zero, which means 0 ∈ ∂g(μ*). The subdifferential can be expressed as

∂g(μ) = Aμ + B + Σ_{i=1}^{n} ∂max(0, C_i + D_i·μ),     (18)

where

∂max(0, C_i + D_i·μ) = {0} if C_i + D_i·μ < 0;  {D_i} if C_i + D_i·μ > 0;  [min(0, D_i), max(0, D_i)] if C_i + D_i·μ = 0.     (19)

For formula (18), the first two terms constitute an ascending linear function Aμ + B, since A > 0. Note that A = 0 means w_t = w_b^{t−1}, which indicates that our algorithm has already converged to the optimum. The latter terms in (18) are either constants or step functions, by the definitions in formula (19). Hence ∂g(μ) is a monotonically increasing (set-valued) function, which can be depicted as in Figure 1.

From Figure 1, we can begin with μ = 0 to find the best solution μ* of OP2. Let

h(0) = B + Σ_{i=1}^{n} h_i(0),     (20)

where

h_i(0) = D_i if C_i > 0, and h_i(0) = 0 otherwise.     (21)

Based on equalities (20) and (21), we can find that if h(0) ≥ 0, the minimum is attained at μ* = 0, which means w_b^t = w_b^{t−1}. If h(0) < 0, the optimum is obtained by finding the intersection between ∂g(μ) and the μ-axis (as Figure 1 shows). This can be done efficiently by sorting the step points z_i = −C_i/D_i. The whole procedure is described in Algorithm 2.

(1) Input: A, B, {(C_i, D_i)}_{i=1}^{n};
(2) Compute h(0) with formulas (20) and (21);
(3) if h(0) ≥ 0 then
(4)   μ* = 0;
(5) else
(6)   Z = {z_i = −C_i/D_i : D_i ≠ 0, z_i > 0};  // find the step point set
(7)   Sort Z in ascending order: z_(1) ≤ ⋯ ≤ z_(s);
(8)   z_(0) = 0;  // begin with the zero point
(9)   for k = 1 to s do
(10)    compute the left limit h_− of ∂g(μ) at z_(k);
(11)    if h_− ≥ 0 then
(12)      μ* = −(B + Σ_{i : C_i + D_i·μ > 0 on (z_(k−1), z_(k))} D_i)/A;  break;  // Case 1: μ* lies in the slash (sloped) area
(13)    else
(14)      compute the right limit h_+ of ∂g(μ) at z_(k);
(15)      if h_+ ≥ 0 then
(16)        μ* = z_(k);  break;  // Case 2: μ* lies in the vertical (jump) area
(17)      end if
(18)    end if
(19)  end for
(20) end if
(21) Output: μ*.
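The scanning idea behind Algorithm 2 can be sketched generically: minimize g(μ) = (A/2)μ² + Bμ + Σ_i max(0, C_i + D_i·μ) over μ ≥ 0 by sorting the step points −C_i/D_i of the nondecreasing subgradient and locating its zero crossing. The symbols and function names below are our own sketch, not the paper’s exact routine:

```python
def line_search(A, B, terms):
    # Minimize g(mu) = 0.5*A*mu**2 + B*mu + sum_i max(0, C_i + D_i*mu)
    # over mu >= 0, where A > 0 and terms = [(C_i, D_i), ...].
    # The subgradient h(mu) = A*mu + B + sum of D_i over active terms is
    # nondecreasing, so we scan its sorted step points for the zero crossing.
    def h(mu):
        return A * mu + B + sum(d for c, d in terms if c + d * mu > 0)

    if h(0.0) >= 0:              # minimum attained at the boundary mu = 0
        return 0.0
    steps = sorted(-c / d for c, d in terms if d != 0 and -c / d > 0)
    prev = 0.0
    for z in steps + [float("inf")]:
        mid = prev + 1.0 if z == float("inf") else 0.5 * (prev + z)
        const = B + sum(d for c, d in terms if c + d * mid > 0)
        root = -const / A        # zero of the sloped ("slash") segment
        if prev <= root <= z:
            return root          # Case 1: crossing inside this segment
        if z != float("inf") and h(z + 1e-9) >= 0:
            return z             # Case 2: crossing in the vertical jump
        prev = z
    return prev
```

Sorting dominates the cost, which is why the overall complexity is O(n log n) in the number of hinge terms.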

For Algorithm 2, the theorem below guarantees that it only requires O(n log n) time.

Theorem 1. The total running time of Algorithm 2 is O(n log n).

Proof. From the pseudocode of Algorithm 2, we can find that steps (1) to (5) take O(n) time, and step (6) needs at most O(n) time. Step (7) is sorting, which can be implemented in O(n log n) time. Steps (9) to (19) take O(n) time in total when the partial sums in step (12) are maintained incrementally, and the remaining steps all require O(1) time. Therefore, even in the worst case, the total running time of Algorithm 2 is O(n log n).

4. Experiments

In this section, we compare our classifier with other existing learners for maximizing the F-measure and give detailed results on the benchmark datasets.

4.1. Baselines and Datasets

We evaluate the performance of our algorithm (termed BM-ls-CS) against SVMperf [11], SVM-CS [16], BM-nls-CS, and LR- [8], which are all F-measure based learners. The first three baselines follow the EUM approach, while the last one falls into the DT approach. More specifically, the first one, SVMperf, uses the direct method and is the most popular imbalanced classifier in the EUM approach. It adopts the structured SVM to maximize the F-measure and applies the cutting plane algorithm for the inner optimization, which is similar to ours. The second one, SVM-CS, is a recently proposed classifier and, as mentioned before, belongs to the indirect method. Like our BM-ls-CS, it is a cost-based algorithm; the main difference between SVM-CS and ours is the inner solver. The third comparison algorithm, BM-nls-CS, uses the bundle method without line search as the optimizer. We include it in the evaluations to see whether the line search technique we propose improves the speed of convergence. In addition, we also compare our algorithm with a decision-theoretic method named LR-, recently proposed by Ye et al. We select it as the fourth baseline since it is an efficient algorithm for computing the optimal predictions. Finally, it should be noted that in our experiments we do not implement the F-measure based learners with an “approximative solver” (such as SGD). Although such learners may have a low per-iteration cost and low total training time, their approximations to the optimal solution are crude and often fail to achieve a precise solution.

All the comparison algorithms adopt the same experimental setup and are carried out on a Linux machine with a 3.4 GHz Intel Core CPU and 8 GB of RAM. The penalty parameter for SVMperf is selected by cross-validation, and the corresponding parameters for SVM-CS and LR- are determined from the matching candidate set, while for BM-ls-CS and BM-nls-CS, the regularization parameter λ (related to the SVMperf penalty in the same way) and the parameter θ in formula (12) are selected by cross-validation as well. For all the cost-sensitive algorithms (BM-ls-CS, SVM-CS, and BM-nls-CS), the misclassification costs C_+ and C_− are set according to the transformation in Section 3.1, with the proper cost ratio chosen as suggested by the proposition in Parambath’s paper. The approximation gap ε is set to the same value for each EUM algorithm.

With the algorithms above (SVMperf, SVM-CS, BM-nls-CS, LR-, and BM-ls-CS), we perform experiments on six datasets: a3a, acoustic, ijcnn1, letter, news20, and satimage. We choose them because they are all imbalanced datasets with large sample sizes. These datasets can be downloaded from the LIBSVM website (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/), and their characteristics are summarized in Table 2.

In Table 2, “#Examples” and “#Features” denote the numbers of examples and features, respectively, and “Min” represents the proportion of examples in the minority class. For each dataset, we use the same split as in the LIBSVM repository and report the results from the following two aspects.

4.2. Experimental Results
4.2.1. The Performance Behaviors of Different Learners (F-Measure Value and Running Time)

Since the main goal of this paper is to produce an efficient F-measure learner, the first part of the experiments examines the performance behaviors of the different algorithms; the comparison results are reported in Table 3.

Note that in Table 3 there are two values in each cell: the top one is the F-measure value of the learned model, and the bottom one is the running time in seconds. A higher value and a lower time are better.

From Table 3, we can find that, when measured by F-measure, the performances of the EUM algorithms (SVMperf, SVM-CS, BM-nls-CS, and BM-ls-CS) and the DT algorithm (LR-) vary from one dataset to another, and no single algorithm outperforms all the others on every dataset. For example, on the news20 set the EUM algorithms are almost always better than the DT algorithm, while on the satimage set the DT algorithm is superior to the EUM algorithms. This is consistent with the result of Ye et al., according to which both approaches are effective, and it is difficult to say which one is better on large datasets [8]. Meanwhile, the statistics show that, within the EUM approach, the three cost-sensitive algorithms (SVM-CS, BM-nls-CS, and BM-ls-CS) are better than (or comparable to) SVMperf, which once again indicates that a good F-measure based classifier can be produced by transforming the problem into a cost-sensitive one.

Moreover, Table 3 also shows that, when measured by running time, our BM-ls-CS consistently outperforms SVMperf, BM-nls-CS, and LR- on all benchmark datasets. For example, compared with SVMperf, BM-ls-CS performs better in terms of both F-measure value and CPU time. Especially for CPU time, BM-ls-CS gains speedups of hundreds of times over SVMperf on several experimental datasets. Similar comparison results hold for BM-nls-CS and LR-: the statistics show that our algorithm with line search is significantly faster than those two baselines, while obtaining better (or comparable) F-measure values. However, the comparison with SVM-CS, which is implemented with Liblinear, is a bit different. Experimental results show that our algorithm is faster than SVM-CS on five of the six datasets (a3a, acoustic, ijcnn1, letter, and satimage) and slower only on news20, where BM-ls-CS performs better in terms of F-measure value (98.70 versus 81.67), while SVM-CS achieves a speedup of nearly two orders of magnitude over BM-ls-CS (0.21 s versus 12.64 s). The reason may lie in their different inner optimizers: SVM-CS adopts Liblinear, which is specially designed for datasets with many features, while the bundle method we use does not give special consideration to this situation (note that the cutting plane algorithm used by SVMperf does not consider it either).

All the statistics above show that, as an F-measure based learner, our BM-ls-CS is both efficient and effective compared with the existing baselines.

Finally, from Table 3 we can observe that, although BM-nls-CS and SVMperf solve the same equivalent problem with similar optimization techniques, their performances are quite different. Experimental results show that BM-nls-CS is faster than SVMperf on all the datasets, largely due to their different implementations (e.g., the QP solver). Therefore, to see whether our line search technique itself enhances the convergence speed, in the following we compare BM-ls-CS only with BM-nls-CS, for a fair comparison.

4.2.2. The Comparison between BM-ls-CS and BM-nls-CS

In the second part of the experiments, we are interested in the convergence speed of our algorithm versus BM-nls-CS, which uses the same inner solver but without line search. Thus, we consider the number of iterations used in reducing the primal objective value; Figure 2 plots the objective value as a function of training iterations for the two algorithms on various datasets.

From Figure 2, we can find that even though BM-nls-CS ultimately converges to the minimum, its objective values fluctuate heavily during the iterations. The reason for these fluctuations is that the cutting plane selected by BM-nls-CS only guarantees that the dual value monotonically increases; there is no guarantee that such a cutting plane will make the primal value monotonically decrease, and the figure shows that this indeed occurs often. On the contrary, it is clear from the figure that our algorithm enjoys strictly decreasing objective values and achieves speedups of more than one order of magnitude over BM-nls-CS. This implies that our line search technique helps to avoid the “stalling” steps and accelerates the convergence of the algorithm.

5. Conclusion

In this paper, we have presented a novel cost-sensitive SVM algorithm that optimizes the F-measure efficiently. We began with an explicit transformation from F-measure maximization to cost-sensitive classification and then proposed to use the bundle method, which has an O(1/ε) rate of convergence, for the inner optimization. Since the existing bundle method only guarantees that the dual objective increases monotonically, and not that the primal objective decreases monotonically, an efficient line search algorithm has been proposed that avoids this undesirable effect and accelerates the practical convergence of our BM-ls-CS algorithm. Experiments on the benchmark datasets showed that, compared with other existing F-measure based learners, the proposed BM-ls-CS not only gives better generalization performance but also provides significant speedups during training. Two issues are worthy of further investigation in the future. The first is to extend our approach to other imbalanced measures such as PAUC [23] or SAUC [24] and design efficient algorithms for optimizing these metrics. The second is to study our problem from the viewpoint of Multiobjective Optimization, since recent works on MOO [25, 26] show that a cost-sensitive problem can be regarded as a multiobjective problem. In the future, we plan to produce an efficient F-measure classifier through Multiobjective Optimization.

Competing Interests

The authors declare that no competing interests exist.

Acknowledgments

This work is supported by the Humanities and Social Sciences Project of Chinese Ministry of Education (Grant no. 13YJC870003), the Natural Science Foundation of China (Grant no. 61402002), and Key Program of Natural Science Project of Educational Commission of Anhui Province (Grant no. KJ2015A070), Youth Foundation of Anhui University (Grant no. KJQN1119), the Doctor Foundation of Anhui University (Grant no. 01001902), and the Foundation for the Key Teacher by Anhui University. This work is also supported by Co-Innovation Center for Information Supply & Assurance Technology, Anhui University.