Abstract

In recent years, the imbalanced learning problem has attracted increasing attention from both academia and industry; it concerns the performance of learning algorithms in the presence of data with severe class distribution skews. In this paper, we apply the well-known statistical model of logistic discrimination to this problem and propose a novel method to improve its performance. To fully account for the class imbalance, we design a new cost function that takes into account the accuracies of both the positive and negative classes as well as the precision of the positive class. Unlike traditional logistic discrimination, the proposed method learns its parameters by maximizing this cost function. Experimental results show that, compared with other state-of-the-art methods, the proposed one performs significantly better on the measures of recall, G-mean, F-measure, AUC, and accuracy.

1. Introduction

Recently, the class imbalance problem, also called the skewed or rare class problem, has drawn significant interest in academia, industry, and government. For the two-class case, the problem is characterized by having many more examples of one class (the majority or negative class) than of the other (the minority or positive class) [1–3]. In many real-world applications, correctly predicting examples of the positive class is more important than the contrary case. For example, in cancer detection, most patients have common diseases and only rare patients have cancer, so effectively recognizing the cancer patients is what matters. However, conventional classification methods such as C4.5, naive Bayes, and neural networks pursue high overall accuracy under the assumption that all classes are of similar size, so rare-class examples are often overlooked and misclassified into the majority class [4, 5].

Many approaches have been proposed to tackle this problem; they can be roughly categorized into three levels: the data preprocessing level, the algorithm learning level, and the prediction postprocessing level. At the data preprocessing level, the algorithms focus more on positive-class examples through one of three approaches: (1) running on rebalanced data sets obtained by manipulating the data space [6, 7], such as undersampling and oversampling techniques; (2) actively selecting the more valuable examples for learning models and leaving aside the less informative ones to improve the models' performance [8, 9]; and (3) weighting the data space using information about misclassification costs to avoid costly errors [10]. Approaches at the algorithm learning level adjust existing classifier learning algorithms so that the learned models are biased towards correctly classifying positive-class examples, such as two-phase rule induction [11] and one-class learning. Existing approaches at the prediction postprocessing level focus more on the positive class by moving a decision threshold [12] or minimizing a cost function [13].

In this paper, we reconsider the imbalanced problem at the algorithm level and propose a novel method called ILLD (Imbalanced Learning based on Logistic Discrimination). The motivation stems from the following observation: very few studies have examined logistic discrimination on the class imbalance problem, although it has many merits, including understandability, a solid theoretical basis, and, most importantly, high generalization ability. Unlike traditional logistic discrimination, ILLD achieves high performance on imbalanced data by maximizing the proposed cost function APM (Accuracy-Precision based Metric), which takes into account the accuracies of both the positive and negative classes as well as the precision of the positive class. Experimental results show that ILLD substantially boosts the performance of logistic discrimination on the measures of recall, F-measure, G-mean, and AUC while keeping its high performance on accuracy. Compared with other state-of-the-art classification methods, ILLD performs much better.

The rest of this paper is organized as follows: Section 2 presents related work; Section 3 describes the proposed imbalanced learning method; Section 4 presents the experimental results; finally, Section 5 concludes this work.

2. Related Work

2.1. Imbalanced Learning

Technically speaking, any data set that exhibits an unequal distribution between its classes can be considered imbalanced or skewed. In the community, however, only data sets exhibiting extreme imbalance are treated as imbalanced data sets. There are two forms of imbalance, namely, within-class imbalance and between-class imbalance. In within-class imbalance, some subconcepts are represented by only a limited number of examples, which increases the difficulty of classifying examples correctly. In between-class imbalance, one class extremely out-represents another [1, 2]. Usually, it is the second form that is discussed in the community.

There are many factors that influence the modeling of a capable classifier in the face of rare events. Examples include the skewed data distribution, which is considered the most influential factor, small sample size, separability, and the existence of within-class subconcepts [14].

The skewed data distribution is often described by the imbalance degree, the ratio of the sample size of the positive class to that of the negative class. Reported studies indicate that a relatively balanced distribution usually attains better results. However, at what imbalance degree the class distribution begins to deteriorate classification performance cannot be stated explicitly, since other factors such as sample size and separability also affect performance [1, 2, 14].

Small sample size means that the available examples are limited, so uncovering the regularities inherent in the small class is unreliable. In [15], the authors suggest that an imbalanced class distribution may not be a hindrance to classification provided that the data set is large enough.

The difficulty of separating the rare class from the prevalent class is the key issue of the imbalanced problem. If highly discriminative patterns exist within each class, then only simple rules are required to distinguish the classes. However, if the patterns of the classes overlap, discriminative rules are hard to induce [1, 2, 14].

Within-class concepts mean that a single class is composed of various subclusters or subconcepts, so the instances of a class are collected from different subconcepts. These subconcepts do not always contain the same number of instances, and their presence worsens the imbalanced distribution problem [14]. In general, imbalanced learning considers only the imbalanced data distribution and holds the other factors fixed.

2.2. Logistic Discrimination

Logistic discrimination, also called logistic regression, is a classical probabilistic statistical classification model [16], which has been widely used in many fields such as the medical domain and social surveys because of its understandability, solid theoretical basis, and, most importantly, high generalization ability. For the two-class case, logistic discrimination is defined as

$$p(+\mid \mathbf{x};\boldsymbol{\theta}) = \sigma(\boldsymbol{\theta}^{\mathsf{T}}\mathbf{x}), \tag{1}$$

$$p(-\mid \mathbf{x};\boldsymbol{\theta}) = 1 - p(+\mid \mathbf{x};\boldsymbol{\theta}), \tag{2}$$

where $\sigma(\cdot)$ is the logistic sigmoid function defined as

$$\sigma(a) = \frac{1}{1 + e^{-a}}. \tag{3}$$

For a given data set $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $y_i$ is the label associated with example $\mathbf{x}_i$, the likelihood function of this model can be written as

$$L(\boldsymbol{\theta}) = \prod_{i=1}^{N} p_i^{\,t_i} (1 - p_i)^{1 - t_i}, \tag{4}$$

where $p_i = p(+\mid \mathbf{x}_i;\boldsymbol{\theta})$, and $t_i = 0$ if $y_i = -$ and 1 otherwise. Defining a cost function by taking the negative logarithm of the likelihood, we have the cross-entropy error function in the form

$$E(\boldsymbol{\theta}) = -\sum_{i=1}^{N} \bigl[ t_i \ln p_i + (1 - t_i) \ln(1 - p_i) \bigr]. \tag{5}$$

Logistic discrimination uses (5) as its cost function; however, this is not suitable for the class imbalance problem because the cross-entropy error function defined in (5) does not consider the importance of each class. To handle this problem, a novel cost function called APM (Accuracy-Precision based Metric) is proposed, which takes into account the accuracies of both the positive and negative classes as well as the precision of the positive class. For more details refer to Section 3.
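To make the notation concrete, the following is a minimal NumPy sketch of the model (1)–(3) and the cross-entropy error (5); the feature matrix is assumed to include a constant bias column, and all variable names are illustrative rather than taken from the original implementation.

import numpy as np

def sigmoid(a):
    # Logistic sigmoid (3): sigma(a) = 1 / (1 + exp(-a)).
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(theta, X, t):
    # Cross-entropy error (5) with t_i in {0, 1} (1 for the positive
    # class) and p_i = sigma(theta^T x_i) as in (1).
    p = sigmoid(X @ theta)
    eps = 1e-12  # guard against log(0)
    return -np.sum(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))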

2.3. Strategies to Handle Imbalanced Problem

The imbalanced problem arises from the scarce representation of the most important examples, so learned models tend to focus on normal examples and overlook the rare-class ones. Many approaches have been proposed to handle the problem; they can be grouped into the following three categories.

(i) Data preprocessing based strategy. These techniques preprocess the given imbalanced data set to change its distribution such that standard learning algorithms focus more on the cases that are relevant for the user. Reported preprocessing studies can be categorized into three types: resampling, active learning, and weighting the data space. The objective of resampling techniques is to rebalance the class distribution by resampling the data space. Commonly used resampling methods include randomly or informatively undersampling instances of the negative class [6], randomly oversampling examples of the positive class, oversampling based on clustering algorithms [17, 18], and oversampling the positive class by creating new synthetic instances [7]. Resampling is often used to deal with imbalanced learning problems, but the real class distribution is always unknown and differs from data set to data set. Active learning actively selects the more valuable examples for learning models, leaving aside the less informative ones, by interacting with the user. Several approaches based on active learning have been proposed. For example, Ertekin [9] presented an adaptive oversampling algorithm called VIRTUAL (Virtual Instances Resampling Technique Using Active Learning) to generate synthetic examples for the positive class during the training process, and Mi [19] developed a method that combines SMOTE and active learning with SVM. Strategies for weighting the data space modify the training set distribution using information about misclassification costs; for example, Wang and Japkowicz [10] combined an ensemble of SVMs with asymmetric misclassification costs.

(ii) Algorithm based strategy. This strategy modifies existing classifier learning algorithms such that the learned models are biased towards the cases the user cares about most. Many algorithm-level imbalanced learning approaches have been proposed; for example, Cao et al. [20] presented a framework for improving the performance of cost-sensitive neural networks that adopts Particle Swarm Optimization for optimizing the misclassification cost, the feature subset, and the intrinsic structure parameters, and Alejo et al. [21] proposed two strategies for dealing with imbalanced domains using RBF neural networks that include a cost function in the training phase.

(iii) Prediction postprocessing based strategy. Approaches of this strategy learn a standard model on the original data set and only modify the predictions of the learned model according to the user preferences and the imbalance of the data set. There are two main types of solutions: the threshold method and cost-sensitive postprocessing. In the former, each example is associated with a score expressing the degree to which it is a member of a class; based on these scores, different classifiers are generated by varying the threshold for assigning an example to a class [12]. In the latter, several methods exist for making models cost-sensitive in a post hoc manner; this type of strategy has mainly been explored for classification tasks and aims at changing only the model predictions [13].

In this paper, we propose a novel algorithm-level imbalanced learning method to improve the performance of logistic discrimination. In addition, we apply sampling techniques to logistic discrimination to enhance its performance, selecting two widely used ones: random undersampling and random oversampling. The corresponding experimental results are presented in Section 4.
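For reference, the two baseline samplers can be written in a few lines. The sketch below is ours and assumes binary labels y in {0, 1} with 1 denoting the (minority) positive class; it balances the two classes exactly, which is a common choice although the method does not fix the sampling ratio.

import numpy as np

def random_undersample(X, y, rng):
    # Keep every positive example and an equally sized random
    # subset of the negative examples.
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    keep = rng.choice(neg, size=len(pos), replace=False)
    idx = np.concatenate([pos, keep])
    return X[idx], y[idx]

def random_oversample(X, y, rng):
    # Duplicate randomly drawn positive examples until both
    # classes are the same size (assumes positives are the minority).
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]

# Usage: Xu, yu = random_undersample(X, y, np.random.default_rng(0))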

3. Imbalanced Learning Based on Logistic Discrimination

3.1. Accuracy-Precision Based Metric

Traditional logistic discrimination learns its parameters by minimizing the cross-entropy error function defined in (5). However, this approach ignores the differing costs of the classes, so the learned models perform poorly on the positive class. To tackle this problem, a novel cost function is proposed to guarantee that the learned models perform well on both the positive and negative classes. The relevant symbols are defined as follows.

Define $A_c$ and $M_c$ as follows:

$$A_c = \sum_{i:\,y_i = c} p(c \mid \mathbf{x}_i;\boldsymbol{\theta}), \qquad M_c = \sum_{i:\,y_i = c} \bigl(1 - p(c \mid \mathbf{x}_i;\boldsymbol{\theta})\bigr), \tag{6}$$

where $p(c \mid \mathbf{x}_i;\boldsymbol{\theta})$ is defined by (1) or by (2). From (6), $A_c$ is the estimate of the number of examples of class $c$ correctly classified as class $c$, and $M_c$ is the estimate of the number of examples of class $c$ incorrectly classified. For the two-class problem, we have

$$A_+ + M_+ = n_+, \qquad A_- + M_- = n_-, \tag{7}$$

where $n_+$ and $n_-$ are the numbers of positive and negative examples, respectively. Let class "+" be the positive class as used before; then the cost function APM is defined as

$$\mathrm{APM}(\boldsymbol{\theta}) = \frac{1}{2}\left(\frac{A_+}{A_+ + M_+} + \frac{A_-}{A_- + M_-}\right) + \lambda\,\frac{A_+}{A_+ + M_-}, \tag{8}$$

where $\lambda > 0$ is a user-specified parameter. Since $A_c$ is the estimated number of examples correctly classified as class $c$ and $M_c$ is that of the ones incorrectly classified, as aforementioned, APM is the estimate of the following equation:

$$\frac{1}{2}\bigl(\mathrm{accuracy}_+ + \mathrm{accuracy}_-\bigr) + \lambda \cdot \mathrm{precision}_+, \tag{9}$$

where $\mathrm{accuracy}_+$ is the accuracy (or recall) of the positive class (+). Similarly, $\mathrm{accuracy}_-$ is the accuracy (or recall) of the negative class (−), and $\mathrm{precision}_+$ is the precision of the positive class (+). More details about these measures are discussed in Section 4.2. In this way, APM considers all three factors: the precision of the minority class and the recalls of both the minority and majority classes.
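A direct NumPy sketch of (6)–(8) follows; weighting the precision term by the parameter λ (lam below, the "parameter greater than zero" of Algorithm 1) is our reading of (8), and labels are assumed to be 1 for the positive class and 0 otherwise.

import numpy as np

def apm(theta, X, y, lam):
    # APM (8): mean of the estimated per-class accuracies plus
    # lam times the estimated positive-class precision.
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))  # p(+|x), (1)
    pos, neg = (y == 1), (y == 0)
    A_pos = p[pos].sum()         # (6): positives scored as positive
    M_pos = (1 - p[pos]).sum()   # positives scored as negative
    A_neg = (1 - p[neg]).sum()   # negatives scored as negative
    M_neg = p[neg].sum()         # negatives scored as positive
    return (0.5 * (A_pos / (A_pos + M_pos) + A_neg / (A_neg + M_neg))
            + lam * A_pos / (A_pos + M_neg))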

Taking the gradient of APM (see (8)) with respect to $\boldsymbol{\theta}$ results in

$$\nabla_{\boldsymbol{\theta}}\,\mathrm{APM} = \frac{1}{2}\left(\frac{\nabla A_+}{n_+} + \frac{\nabla A_-}{n_-}\right) + \lambda\,\frac{M_-\,\nabla A_+ - A_+\,\nabla M_-}{(A_+ + M_-)^2}, \tag{10}$$

where, using $\nabla_{\boldsymbol{\theta}}\,\sigma(\boldsymbol{\theta}^{\mathsf{T}}\mathbf{x}_i) = p_i(1 - p_i)\,\mathbf{x}_i$,

$$\nabla A_+ = \sum_{i:\,y_i = +} p_i (1 - p_i)\,\mathbf{x}_i; \tag{11}$$

similarly,

$$\nabla A_- = -\sum_{i:\,y_i = -} p_i (1 - p_i)\,\mathbf{x}_i, \tag{12}$$

$$\nabla M_- = -\nabla A_- = \sum_{i:\,y_i = -} p_i (1 - p_i)\,\mathbf{x}_i. \tag{13}$$

Combining (11), (12), (13), and (10), we have that the gradient of APM defined by (8) is

$$\nabla_{\boldsymbol{\theta}}\,\mathrm{APM} = \frac{1}{2 n_+}\sum_{i:\,y_i = +} p_i(1 - p_i)\,\mathbf{x}_i - \frac{1}{2 n_-}\sum_{i:\,y_i = -} p_i(1 - p_i)\,\mathbf{x}_i + \lambda\,\frac{M_- \sum_{i:\,y_i = +} p_i(1 - p_i)\,\mathbf{x}_i - A_+ \sum_{i:\,y_i = -} p_i(1 - p_i)\,\mathbf{x}_i}{(A_+ + M_-)^2}. \tag{14}$$

The proposed method for the imbalanced problem uses the quasi-Newton method BFGS with (14) as the gradient of the objective for learning its parameters. For more details refer to Section 3.2.
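The gradient (14) translates directly into code. The sketch below mirrors the illustrative apm helper above; it is prudent to verify such a hand-derived gradient against a finite-difference approximation before handing it to an optimizer.

import numpy as np

def apm_grad(theta, X, y, lam):
    # Gradient of APM (14), assembled from the pieces (11)-(13).
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    w = p * (1 - p)                      # sigma'(a) = sigma (1 - sigma)
    pos, neg = (y == 1), (y == 0)
    gA_pos = X[pos].T @ w[pos]           # (11): gradient of A_+
    gM_neg = X[neg].T @ w[neg]           # (13): gradient of M_- (= -grad A_-)
    A_pos = p[pos].sum()
    M_neg = p[neg].sum()
    n_pos, n_neg = pos.sum(), neg.sum()
    prec_grad = (M_neg * gA_pos - A_pos * gM_neg) / (A_pos + M_neg) ** 2
    return 0.5 * (gA_pos / n_pos - gM_neg / n_neg) + lam * prec_grad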

3.2. Algorithm

Based on the cost function APM proposed in Section 3.1, a novel imbalanced learning approach called ILLD (Imbalanced Learning based on Logistic Discrimination) is proposed to tackle data imbalance. ILLD uses the quasi-Newton method BFGS [22–24] to maximize the cost function and learn the parameters, where BFGS is an iterative process. Formally, the iterative process is

$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \alpha_k B_k \mathbf{g}_k, \tag{15}$$

where $\alpha_k$ is the step length along the Newton direction of the $k$th iteration, $\mathbf{g}_k = \nabla_{\boldsymbol{\theta}}\,\mathrm{APM}(\boldsymbol{\theta}_k)$, and $B_k$ is the approximate inverse Hessian matrix calculated by

$$B_{k+1} = \left(I - \frac{\mathbf{s}_k \mathbf{y}_k^{\mathsf{T}}}{\mathbf{y}_k^{\mathsf{T}} \mathbf{s}_k}\right) B_k \left(I - \frac{\mathbf{y}_k \mathbf{s}_k^{\mathsf{T}}}{\mathbf{y}_k^{\mathsf{T}} \mathbf{s}_k}\right) + \frac{\mathbf{s}_k \mathbf{s}_k^{\mathsf{T}}}{\mathbf{y}_k^{\mathsf{T}} \mathbf{s}_k}, \tag{16}$$

where

$$\mathbf{s}_k = \boldsymbol{\theta}_{k+1} - \boldsymbol{\theta}_k, \qquad \mathbf{y}_k = \mathbf{g}_{k+1} - \mathbf{g}_k. \tag{17}$$

The details of the learning process of ILLD are shown in Algorithm 1. ILLD first initializes $\boldsymbol{\theta}_0$ randomly and sets $B_0$ to the unit matrix, in which each diagonal element equals 1 and all others equal 0 (lines 1~2), then calculates the initial gradient $\mathbf{g}_0$ by (14) and performs the first update of $\boldsymbol{\theta}$ by (15) (lines 3~4). Then ILLD optimizes the cost function APM to find the best parameter vector (lines 4~11). Specifically, in the $k$th iteration, ILLD calculates the gradient $\mathbf{g}_{k+1}$ of APM by (14) and, based on $\mathbf{g}_k$ and $\mathbf{g}_{k+1}$, updates $\mathbf{s}_k$ and $\mathbf{y}_k$ by (17) (lines 8~9). It then updates $B_{k+1}$ by (16) (line 10) and, finally, updates $\boldsymbol{\theta}$ by (15) (line 11). The convergence rate of BFGS, and hence of ILLD, is superlinear [22–24], and the stopping condition is that the norm of the difference between the parameter vectors calculated by (15) in two consecutive iterations is not larger than 0.001 (line 7).

Input:
$D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$: training data set
$\lambda$: parameter greater than zero
Output:
learned parameter vector $\boldsymbol{\theta}$
Process:
(1) randomly initialize $\boldsymbol{\theta}_0$;
(2) $B_0 \leftarrow I$ (unit matrix);
(3) $\mathbf{g}_0 \leftarrow \nabla\mathrm{APM}(\boldsymbol{\theta}_0)$ // calculate the gradient of the objective function by (14)
(4) $\boldsymbol{\theta}_1 \leftarrow \boldsymbol{\theta}_0 + \alpha_0 B_0 \mathbf{g}_0$ // update $\boldsymbol{\theta}$ using (15)
(5) $k \leftarrow 0$;
(7) while $\lVert \boldsymbol{\theta}_{k+1} - \boldsymbol{\theta}_k \rVert > 0.001$
(8)  calculate the gradient $\mathbf{g}_{k+1}$ of APM at $\boldsymbol{\theta}_{k+1}$ using (14);
(9)  update $\mathbf{s}_k$ and $\mathbf{y}_k$ using (17);
(10)  update $B_{k+1}$ using (16);
(11)  $\boldsymbol{\theta}_{k+2} \leftarrow \boldsymbol{\theta}_{k+1} + \alpha_{k+1} B_{k+1} \mathbf{g}_{k+1}$; // update $\boldsymbol{\theta}$ using (15)
(12)  $k \leftarrow k + 1$;
(13) return $\boldsymbol{\theta}_{k+1}$
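In practice, the update (15)–(17) need not be hand-rolled. The following minimal sketch delegates to SciPy's BFGS implementation, reusing the illustrative apm and apm_grad helpers above; note that SciPy stops on the gradient norm rather than on the change between consecutive iterates, a minor deviation from Algorithm 1.

import numpy as np
from scipy.optimize import minimize

def fit_illd(X, y, lam=1.0, seed=0):
    # Maximize APM by minimizing its negative with BFGS; SciPy
    # maintains the inverse-Hessian approximation (16)-(17) internally.
    rng = np.random.default_rng(seed)
    theta0 = rng.normal(scale=0.01, size=X.shape[1])
    res = minimize(lambda th: -apm(th, X, y, lam), theta0,
                   jac=lambda th: -apm_grad(th, X, y, lam),
                   method="BFGS")
    return res.x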
3.3. Discussion

Unlike traditional logistic discrimination, which considers only the overall performance, ILLD takes more factors into account through the accuracy-precision based metric. Indeed, this criterion involves the accuracies (or recalls) of both the positive and negative classes as well as the precision of the positive class, all of which derive from the prediction confusion matrix (discussed in Section 4.2). Thus ILLD considers not only the overall performance of logistic discrimination but also its performance on each class.

Considering only the former terms of APM defined by (8), we have

$$\frac{1}{2}\left(\frac{A_+}{A_+ + M_+} + \frac{A_-}{A_- + M_-}\right). \tag{18}$$

Similarly, considering only the former terms of (9), we have the evaluation measure of AUC [2] as shown in the following:

$$\mathrm{AUC} = \frac{\mathrm{accuracy}_+ + \mathrm{accuracy}_-}{2}. \tag{19}$$

Therefore, the proposed measure (without the last term) is an estimate of AUC. Besides, comparing (18), (19), and the evaluation measure G-mean defined as

$$G\text{-mean} = \sqrt{\mathrm{accuracy}_+ \times \mathrm{accuracy}_-}, \tag{20}$$

we conclude that the proposed measure (without the last term) uses the arithmetic mean of the accuracies (or recalls) of the positive and negative classes, instead of their geometric mean, as the cost function that supervises the learning process of logistic discrimination.
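A quick numeric illustration of the difference, with hypothetical per-class recalls: both means reward balanced per-class performance, the geometric mean penalizing imbalance slightly more.

acc_pos, acc_neg = 0.60, 0.95           # hypothetical per-class recalls
balanced = 0.5 * (acc_pos + acc_neg)    # arithmetic mean, (18)-(19)
g_mean = (acc_pos * acc_neg) ** 0.5     # geometric mean, (20)
print(balanced, g_mean)                 # 0.775 versus about 0.755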

Omitting the second term of both (8) and (9), we obtain

$$\frac{1}{2}\,\mathrm{accuracy}_+ + \lambda \cdot \mathrm{precision}_+. \tag{21}$$

We observe from (21) that the proposed cost function combines the accuracy (or recall) and the precision of the positive class, as the F-measure does.

4. Experiments

4.1. Data Sets and Experimental Setup

The 14 data sets used in this paper were randomly selected from the UCI repository [25]. Of these, breast-Wisconsin, hepatitis, horse-colic, and ionosphere are naturally imbalanced two-class data sets. The others are two-class imbalanced data sets derived from multiclass data sets by treating one class as the positive class and the union of all other classes as the negative class [26]. The imbalance degree of these data sets varies from 0.0376 (highly imbalanced) to 0.3696 (only slightly imbalanced), where the imbalance degree is defined as the ratio of the sample size of the positive class to that of the negative class. Details of the data sets are shown in Table 1, where #Degree is the imbalance degree, #Exs is the size of the data set, #Attrs is the number of attributes, and #Cls is the number of classes. For each data set, 5 × 2-fold cross-validation [27] is performed.
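As a sketch of the protocol, 5 × 2-fold cross-validation can be set up as below. scikit-learn and the illustrative fit_illd helper from Section 3.2 are assumed, the toy data merely stand in for a UCI data set, and the stratified splitter is our choice to keep the imbalance degree comparable across folds.

import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(300, 5)), np.ones((300, 1))])  # toy features + bias column
y = (rng.random(300) < 0.15).astype(int)                       # roughly 15% positive class

# Five repetitions of a 2-fold split, i.e., 5 x 2-fold cross-validation.
cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    theta = fit_illd(X[train_idx], y[train_idx], lam=1.0)
    p_test = 1.0 / (1.0 + np.exp(-(X[test_idx] @ theta)))
    y_pred = (p_test >= 0.5).astype(int)
    # ... accumulate the metrics of Section 4.2 on y[test_idx] versus y_pred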

To evaluate the performance of ILLD, we compare it with LD, LD-US, and LD-OS, where LD denotes traditional logistic discrimination (with the cross-entropy error function as cost function) applied directly to the imbalanced problem, and LD-US and LD-OS denote LD run on data sets obtained by undersampling and by oversampling the training data, respectively. Prediction postprocessing approaches such as the threshold method [12] are not used for comparison, since the study in [28] concluded that moving the decision threshold, applying a sampling strategy, and adjusting the cost matrix produce classifiers with the same performance.

4.2. Evaluation Metrics

An evaluation metric is essential for assessing the effectiveness of an algorithm, and, traditionally, accuracy is the most frequently used one. Consider a two-class classification problem and let "+" and "−" be the positive and negative classes, respectively, as before. After classification, examples fall into the four groups of the confusion matrix presented in Table 2, and accuracy is defined as

$$\mathrm{accuracy} = \frac{TP + TN}{TP + FP + TN + FN}. \tag{22}$$

However, evaluation metrics suitable for balanced problems differ greatly from those for imbalanced ones, and accuracy is inadequate for imbalanced learning. In lieu of accuracy, other assessment metrics, including recall, precision, F-measure, and G-mean, are frequently adopted in the research community to evaluate the performance of models on imbalanced learning problems. These metrics are built from the accuracies of both the positive and negative classes and the precision of the positive class, specifically:

$$\mathrm{accuracy}_+ = \mathrm{recall} = \frac{TP}{TP + FN}, \qquad \mathrm{accuracy}_- = \frac{TN}{TN + FP}, \qquad \mathrm{precision}_+ = \mathrm{precision} = \frac{TP}{TP + FP}. \tag{23}$$

Then F-measure and G-mean are defined as

$$F\text{-measure} = \frac{(1 + \beta^2) \cdot \mathrm{recall} \cdot \mathrm{precision}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}}, \qquad G\text{-mean} = \sqrt{\mathrm{accuracy}_+ \times \mathrm{accuracy}_-}, \tag{24}$$

where $\beta$ is a coefficient that adjusts the relative importance of precision versus recall (usually $\beta = 1$).

From (24), the F-measure combines recall and precision as a measure of classification effectiveness, with the relative weighting of recall (accuracy+) versus precision (precision+) determined by the user. Thus the F-measure represents a harmonic mean of recall and precision. Like the F-measure, G-mean evaluates a model's performance by combining two metrics; specifically, G-mean measures the balanced performance of a classifier as the geometric mean of the recalls of the positive and negative classes.

In the case of soft-type classifiers, that is, classifiers that output a continuous numeric value representing the confidence that an example belongs to the predicted class, AUC is a commonly used measure of model performance, which can be calculated by

$$\mathrm{AUC} = \frac{1 + \mathrm{accuracy}_+ - (1 - \mathrm{accuracy}_-)}{2}. \tag{25}$$

The AUC allows the evaluation of the best model on average.

In this paper, we employ accuracy, recall, F-measure, G-mean, and AUC to evaluate classification performance on imbalanced data sets. Although accuracy alone is inadequate for evaluating classification performance, poor accuracy still indicates a bad classifier. An effective classifier should improve recall, F-measure, G-mean, or AUC without decreasing accuracy.
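All of these measures follow directly from the confusion-matrix counts of Table 2; a small sketch of our own, with the positive class encoded as 1:

import numpy as np

def imbalance_metrics(y_true, y_pred, beta=1.0):
    # Confusion-matrix counts as in Table 2.
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    acc_pos = tp / (tp + fn)                              # recall, (23)
    acc_neg = tn / (tn + fp)
    prec = tp / (tp + fp)
    f_measure = ((1 + beta**2) * acc_pos * prec
                 / (beta**2 * prec + acc_pos))            # (24)
    g_mean = np.sqrt(acc_pos * acc_neg)                   # (24)
    auc = (1 + acc_pos - (1 - acc_neg)) / 2               # (25)
    accuracy = (tp + tn) / (tp + fn + tn + fp)            # (22)
    return dict(accuracy=accuracy, recall=acc_pos, precision=prec,
                f_measure=f_measure, g_mean=g_mean, auc=auc)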

4.3. Experimental Results

To evaluate the performance of ILLD (the proposed method), it is compared with LD, LD-US, and LD-OS (for details refer to Section 4.1). The results are reported in both tables and figures: Tables 3, 4, 5, 6, and 7 report the results of the four compared methods on the measures of accuracy, recall, F-measure, G-mean, and AUC, and Figures 1, 2, 3, and 4 report the ranks of the methods on recall, F-measure, G-mean, and AUC. In the tables, a bullet (an open circle) next to a result indicates that ILLD significantly outperforms (is outperformed by) the respective method (column) on the respective data set (row) in a pairwise t-test at the 95% significance level. The last row in each table gives the average results. The ranks shown in Figures 1, 2, 3, and 4 are calculated as follows [29, 30]: on a data set, the best performing algorithm gets the rank of 1.0, the second best gets the rank of 2.0, and so on; in case of ties, average ranks are assigned.

Table 3 reports the accuracies of ILLD, LD-US, LD-OS, and LD. As shown in Table 3, ILLD outperforms LD on three data sets and is outperformed by LD on four in the t-test at the 95% significance level. Moreover, the average accuracy of ILLD is 1.19 percentage points lower than that of LD. These results are acceptable even though LD is better than ILLD, since we focus on imbalanced learning, for which accuracy is not an ideal evaluation metric. Compared with LD-US and LD-OS, ILLD performs significantly better on 13 and 7 of the 14 data sets, respectively, and its average accuracy is 11.47 and 2.74 percentage points higher, respectively.

Table 4 and Figure 1 show the summarized results and the ranks of the four compared methods on the measure of recall, respectively. From Table 4, ILLD significantly outperforms LD on 10 of the 14 data sets, and the average recall of ILLD is 0.1454 higher than that of LD. These results indicate that the proposed cost function APM is appropriate for the imbalanced problem and that ILLD thus improves the performance of logistic discrimination on the positive class while keeping its high performance on accuracy. ILLD also performs comparably to LD-US and outperforms LD-OS: specifically, ILLD significantly outperforms LD-US and LD-OS on 2 and 3 data sets, respectively, and is outperformed by LD-US on only one data set. Besides, Figure 1 shows that the average ranks of ILLD, LD-US, LD-OS, and LD are 1.61, 1.61, 3.0, and 3.78, respectively. Combining this with the results in Table 3, we conclude that LD-US achieves high recall by sacrificing the high accuracy of logistic discrimination.

Table 5 and Figure 2 show the summarized results and the ranks of ILLD, LD-US, LD-OS, and LD on F-measure, respectively. From Table 5, ILLD performs much better than LD-US, LD-OS, and LD: specifically, it significantly outperforms them on 11, 7, and 6 of the 14 data sets, respectively, and is outperformed by LD-OS on only one data set. Moreover, Figure 2 shows that ILLD wins on 14, 12, and 11 of the 14 data sets against LD-US, LD-OS, and LD, respectively. Besides, the F-measure of ILLD ranks first on 10 data sets.

The G-mean summaries and the corresponding ranks of ILLD, LD-US, LD-OS, and LD are reported in Table 6 and Figure 3. Similar to the results shown in Tables 4 and 5, Table 6 shows that ILLD significantly outperforms LD-US, LD-OS, and LD on 7, 6, and 10 of the 14 data sets, respectively, and Figure 3 shows that ILLD wins on 13, 11, and 14 data sets against the three methods, respectively. Besides, ILLD ranks first with an average rank of 1.36, followed by LD-US (2.5), LD-OS (2.78), and LD (3.36).

Table 7 and Figure 4 show the AUCs and the ranks of the four compared methods, respectively. On the 14 data sets, ILLD wins on 11, 12, and 11 data sets against LD-OS, LD-US, and LD, respectively. The average AUCs of ILLD, LD-OS, LD-US, and LD are 0.8114, 0.7559, 0.7649, and 0.7562, respectively, and ILLD also attains the best average rank (1.29, versus 3.29 for LD-OS and 2.43 for LD-US).

5. Conclusion

In this paper, we first construct a novel cost function called APM (Accuracy-Precision based Metric), which considers the accuracies of both the positive and negative classes as well as the precision of the positive class, and then propose a method called ILLD (Imbalanced Learning based on Logistic Discrimination) to handle data imbalance. We also apply undersampling and oversampling to improve the performance of logistic discrimination on the imbalanced problem. Experimental results show that these methods can significantly improve the performance of logistic discrimination on the positive class and that ILLD performs significantly better than other advanced methods on the measures of recall, F-measure, and G-mean while keeping the high accuracy of logistic discrimination.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grants no. 61472370, no. 61202207, no. 61501393, and no. 61402393) and by Science and Technology Research Key Project of the Education Department of Henan Province (Grants no. 15A520026 and no. 14A520016).