Computational Intelligence and Neuroscience

Volume 2016 (2016), Article ID 5423204, 10 pages

http://dx.doi.org/10.1155/2016/5423204

## Imbalanced Learning Based on Logistic Discrimination

^{1}School of Computer and Information Technology, Xinyang Normal University, Xinyang 464000, China

^{2}School of Information Engineering, Zhengzhou University, Zhengzhou 450000, China

Received 10 July 2015; Revised 23 October 2015; Accepted 26 October 2015

Academic Editor: José David Martín-Guerrero

Copyright © 2016 Huaping Guo et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

In recent years, the imbalanced learning problem has attracted increasing attention from both academia and industry. The problem concerns the performance of learning algorithms in the presence of data with severe class distribution skews. In this paper, we apply the well-known statistical model logistic discrimination to this problem and propose a novel method to improve its performance. To fully account for the class imbalance, we design a new cost function that takes into account the accuracies of both the positive class and the negative class as well as the precision of the positive class. Unlike traditional logistic discrimination, the proposed method learns its parameters by maximizing the proposed cost function. Experimental results show that, compared with other state-of-the-art methods, the proposed one shows significantly better performance on measures of recall, G-mean, F-measure, AUC, and accuracy.

#### 1. Introduction

Recently, the class imbalance problem, also called the skewed or rare class problem, has drawn significant interest in academia, industry, and government. For the two-class case, this problem is characterized by having many more examples of one class (the majority or negative class) than of the other (the minority or positive class) [1–3]. In many real-world applications, correctly predicting examples of the positive class is more important than correctly predicting examples of the negative class. For example, in cancer detection, most patients have common diseases and only a few have cancer, so recognizing the cancer patients effectively is what matters most. However, conventional classification methods such as C4.5, naive Bayes, and neural networks pursue high overall accuracy by assuming that all classes have similar sizes, so rare class examples are often overlooked and misclassified as the majority class [4, 5].

Many approaches have been proposed to tackle this problem, which can be roughly categorized into three levels: the data preprocessing level, the algorithm learning level, and the prediction postprocessing level. At the data preprocessing level, algorithms focus more on positive class examples through one of three approaches: (1) running on rebalanced data sets obtained by manipulating the data space [6, 7], for example by undersampling or oversampling; (2) actively selecting the more valuable examples to learn models and leaving out the less informative ones to improve the models’ performance [8, 9]; and (3) weighting the data space using information about misclassification costs to avoid costly errors [10]. Approaches at the algorithm learning level adjust existing classifier learning algorithms so that the learned models are biased towards correctly classifying positive class examples, as in two-phase rule induction [11] and one-class learning. Existing approaches at the prediction postprocessing level focus more on the positive class by moving a decision threshold [12] or minimizing a cost function [13].

In this paper, we reconsider the imbalanced problem at the algorithm level and propose a novel method called ILLD (Imbalanced Learning Based on Logistic Discrimination) to tackle it. The motivation is the following observation: very few studies have examined logistic discrimination on the class imbalance problem, although it has many merits including understandability, a solid theoretical basis, and, most importantly, high generalization ability. Unlike traditional logistic discrimination, ILLD achieves high performance on imbalanced data by maximizing the proposed cost function APM (Accuracy-Precision Based Metric), which takes into account the accuracies of both the positive class and the negative class as well as the precision of the positive class. Experimental results show that ILLD substantially improves the performance of logistic discrimination on measures of recall, F-measure, G-mean, and AUC while keeping its high accuracy. Compared with other state-of-the-art classification methods, ILLD shows a much better performance.

The rest of this paper is organized as follows: after presenting related work in Section 2, Section 3 describes the proposed imbalanced learning method; Section 4 presents the experimental results; and, finally, Section 5 concludes this work.

#### 2. Related Work

##### 2.1. Imbalanced Learning

Technically speaking, any data set that exhibits an unequal distribution between its classes can be considered imbalanced or skewed. However, in the community, only data sets exhibiting extreme imbalances are treated as imbalanced. There are two forms of imbalance, namely, within-class imbalance and between-class imbalance. In within-class imbalance, some subconcepts are represented by only a limited number of examples, which increases the difficulty of classifying examples correctly. In between-class imbalance, one class greatly outnumbers another [1, 2]. The second form is the one usually discussed in the community.

There are many factors that influence the modeling of a capable classifier when facing rare events. Examples include the skewed data distribution which is considered to be the most influential factor, small sample size, separability, and existence of within-class subconcepts [14].

The skewed data distribution is often quantified by the imbalance degree, which is the ratio of the sample size of the positive class to that of the negative class. Reported studies indicate that a relatively balanced distribution usually attains a better result. However, the imbalance degree at which classification performance starts to deteriorate cannot be stated explicitly, since other factors such as sample size and separability also affect performance [1, 2, 14].
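As a small illustration of the ratio just described, the following sketch computes the imbalance degree from a list of labels. The function name `imbalance_degree` and the label encoding (`"+"` for the positive class, anything else negative) are assumptions made for the example only.

```python
from collections import Counter

def imbalance_degree(labels, positive="+"):
    """Ratio of the positive (minority) count to the negative (majority) count."""
    counts = Counter(labels)
    n_pos = counts[positive]
    n_neg = len(labels) - n_pos
    if n_neg == 0:
        raise ValueError("the data set contains no negative examples")
    return n_pos / n_neg

# 5 positive and 95 negative examples give a ratio of 5/95.
labels = ["+"] * 5 + ["-"] * 95
print(imbalance_degree(labels))
```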

Small sample size means that the number of available examples is limited, so uncovering the regularities inherent in the small class is unreliable. In [15], the authors suggest that an imbalanced class distribution may not hinder classification if a large enough data set is provided.

The difficulty in separating the rare class from the prevalent class is the key issue of the imbalanced problem. If each class has highly discriminative patterns, simple rules suffice to distinguish the classes. However, if the patterns of the classes overlap, discriminative rules are hard to induce [1, 2, 14].

Within-class subconcepts mean that a single class is composed of various subclusters or subconcepts. Instances of a class are collected from different subconcepts, and these subconcepts do not always contain the same number of instances. The presence of within-class subconcepts worsens the imbalanced distribution problem [14]. In general, we only consider imbalanced data distribution in imbalanced learning and fix the other factors.

##### 2.2. Logistic Discrimination

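Logistic discrimination, also called logistic regression, is a typical probabilistic statistical classification model [16], which has been widely used in many fields such as the medical domain and social surveys because of its understandability, solid theoretical basis, and, most importantly, high generalization ability. For the two-class case, logistic discrimination is defined as

$$p(+ \mid \mathbf{x}) = \sigma(\mathbf{w}^{T}\mathbf{x}), \tag{1}$$

$$p(- \mid \mathbf{x}) = 1 - \sigma(\mathbf{w}^{T}\mathbf{x}), \tag{2}$$

where $\sigma(\cdot)$ is the logistic sigmoid function defined as

$$\sigma(a) = \frac{1}{1 + \exp(-a)}. \tag{3}$$

For a given data set $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $y_i \in \{+, -\}$ is the label associated with example $\mathbf{x}_i$, the likelihood function of this model can be written as

$$p(\mathbf{t} \mid \mathbf{w}) = \prod_{i=1}^{N} p(+ \mid \mathbf{x}_i)^{t_i} \, \bigl(1 - p(+ \mid \mathbf{x}_i)\bigr)^{1 - t_i}, \tag{4}$$

where $t_i = 0$ if $y_i = -$ and 1 otherwise. Defining a cost function by taking the negative logarithm of the likelihood, we have the cross-entropy error function in the form

$$E(\mathbf{w}) = -\ln p(\mathbf{t} \mid \mathbf{w}) = -\sum_{i=1}^{N} \bigl\{ t_i \ln p(+ \mid \mathbf{x}_i) + (1 - t_i) \ln \bigl(1 - p(+ \mid \mathbf{x}_i)\bigr) \bigr\}. \tag{5}$$

Logistic discrimination uses (5) as the cost function; however, (5) is not suitable for the class imbalance problem because the cross-entropy error function does not consider the importance of each class. To handle this problem, a novel cost function called APM (Accuracy-Precision Based Metric) is proposed, which takes into account the accuracies of both the positive and negative classes as well as the precision of the positive class. For more details refer to Section 3.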
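The model and its cross-entropy error can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the 0/1 target encoding `t`, the clipping constant `eps`, and the toy data are all assumptions. It also makes the weakness visible: every example contributes one equally weighted term, so the negative class dominates the sum when it is much larger.

```python
import math

def sigmoid(a):
    """Numerically stable logistic sigmoid."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def p_positive(w, x):
    """Posterior of the positive class for one example: sigmoid(w . x)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def cross_entropy(w, X, t, eps=1e-12):
    """Negative log-likelihood; t[i] is 1 for a positive example, 0 otherwise."""
    total = 0.0
    for x, ti in zip(X, t):
        # Clip the probability so that log() never receives 0.
        p = min(max(p_positive(w, x), eps), 1.0 - eps)
        total -= ti * math.log(p) + (1 - ti) * math.log(1.0 - p)
    return total

# Toy data (hypothetical): one positive and four negatives, first feature is a bias.
X = [[1.0, 2.0], [1.0, -1.0], [1.0, -2.0], [1.0, -1.5], [1.0, -0.5]]
t = [1, 0, 0, 0, 0]
print(cross_entropy([0.0, 0.0], X, t))  # equals 5 * ln 2 at w = 0
print(cross_entropy([0.0, 1.0], X, t))  # a better separating w gives a lower error
```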

##### 2.3. Strategies to Handle Imbalanced Problem

Imbalanced problems arise from the scarce representation of the most important examples, so that the learned models tend to focus on normal examples and overlook the rare class examples. Many approaches have been proposed to handle the problem, which can be mainly grouped into the following three categories.

(i) *Data preprocessing based strategy*. These techniques preprocess the given imbalanced data set to change the data distribution such that standard learning algorithms focus more on the cases that are relevant for the user. Reported studies of preprocessing data sets can be categorized into three types: resampling, active learning, and weighting the data space. The objective of resampling techniques is to rebalance the class distribution by resampling the data space. Commonly used resampling methods include randomly or informatively undersampling instances of the negative class [6], randomly oversampling examples of the positive class, oversampling based on clustering algorithms [17, 18], and oversampling the positive class by creating new synthetic instances [7]. Resampling is often used to deal with imbalanced learning problems, but the real class distribution is always unknown and differs from data set to data set. Active learning actively selects the more valuable examples to learn models and leaves out the less informative ones, improving the models’ performance by interacting with the user. Several approaches based on active learning have been proposed. For example, Ertekin [9] presented an adaptive oversampling algorithm called VIRTUAL (Virtual Instances Resampling Technique Using Active Learning) to generate synthetic examples for the positive class during the training process, and Mi [19] developed a method that combines SMOTE and active learning with SVM. Strategies that weight the data space aim to modify the training data set distribution using information about misclassification costs; for example, Wang and Japkowicz [10] combined an ensemble of SVMs with asymmetric misclassification costs.

(ii) *Algorithm based strategy*. This strategy modifies existing classifier learning algorithms such that the learned models are biased towards the cases that matter most to the user. Many algorithm-based imbalanced learning approaches have been proposed; for example, Cao et al. [20] presented a framework for improving the performance of cost-sensitive neural networks that adopts Particle Swarm Optimization for optimizing misclassification cost, feature subset, and intrinsic structure parameters; Alejo et al. [21] proposed two strategies for dealing with imbalanced domains using RBF neural networks which include a cost function in the training phase.

(iii) *Prediction postprocessing based strategy*. Approaches of this strategy learn a standard model on the original data set and only modify the predictions of the learned model according to the user preferences and the imbalance of the data set. There exist two main types of solutions: the threshold method and cost-sensitive postprocessing. In the former, each example is associated with a score that expresses the degree to which the example is a member of a class, and different classifiers are generated by varying the threshold on this score [12]. In the latter, several methods exist for making models cost-sensitive in a post hoc manner. This type of strategy has mainly been explored for classification tasks and changes only the model predictions to make them cost-sensitive [13].
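Two of the strategies above, random resampling and threshold moving, are simple enough to sketch directly. The code below is illustrative only: the function names, the 1/0 label encoding, the choice to rebalance to a 1:1 ratio, and the fixed seed are assumptions, and the positive class is assumed to be the minority with at least one example.

```python
import random

def _split_indices(y, positive):
    pos = [i for i, yi in enumerate(y) if yi == positive]
    neg = [i for i, yi in enumerate(y) if yi != positive]
    return pos, neg

def random_undersample(X, y, positive=1, seed=0):
    """Drop random negatives until both classes have the same size."""
    rng = random.Random(seed)
    pos, neg = _split_indices(y, positive)
    keep = pos + rng.sample(neg, min(len(pos), len(neg)))
    return [X[i] for i in keep], [y[i] for i in keep]

def random_oversample(X, y, positive=1, seed=0):
    """Duplicate random positives until both classes have the same size."""
    rng = random.Random(seed)
    pos, neg = _split_indices(y, positive)
    extra = [rng.choice(pos) for _ in range(max(0, len(neg) - len(pos)))]
    keep = pos + neg + extra
    return [X[i] for i in keep], [y[i] for i in keep]

def predict_with_threshold(scores, threshold=0.5):
    """Threshold moving: predict 1 when the positive-class score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]

# Hypothetical toy data: 2 positives and 8 negatives.
X = [[float(i)] for i in range(10)]
y = [1, 1] + [0] * 8
Xu, yu = random_undersample(X, y)
Xo, yo = random_oversample(X, y)
print(len(yu), sum(yu))  # 4 examples, 2 of them positive
print(len(yo), sum(yo))  # 16 examples, 8 of them positive
print(predict_with_threshold([0.2, 0.4, 0.6], threshold=0.3))
```

Lowering the threshold below 0.5 trades negative-class accuracy for positive-class recall, which is the effect the threshold method exploits.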

In this paper, we propose a novel algorithm-based imbalanced learning method to improve the performance of logistic discrimination. In addition, we apply sampling techniques to logistic discrimination to enhance its performance. Two widely used sampling techniques are selected: random undersampling and random oversampling. The corresponding experimental results are presented in Section 4.

#### 3. Imbalanced Learning Based on Logistic Discrimination

##### 3.1. Accuracy-Precision Based Metric

The traditional logistic discrimination learns its parameters by minimizing the cross-entropy error function defined in (5). However, this approach ignores the different costs of the classes, so the learned models perform poorly on the positive class. To tackle this problem, a novel cost function is proposed to guarantee that the learned models perform well on both the positive class and the negative class. The relevant symbols are defined as follows.

For each class, two quantities are defined in (6) in terms of the class posterior given by (1) or by (2): the first is the estimated number of examples correctly classified as that class, and the second is the estimated number of examples of that class that are incorrectly classified. For the two-class problem, the corresponding relation is given in (7). Let class “+” be the positive class as used before; the cost function APM is then defined in (8) from these quantities. Since the first quantity estimates the number of examples correctly classified as a class and the second estimates the number incorrectly classified, APM is the estimation of the expression in (9), which is built from three measures: the accuracy (or recall) of the positive class (+), the accuracy (or recall) of the negative class (−), and the precision of the positive class (+). More details about these measures are discussed in Section 4.2. In this way, APM considers all three factors: the precision of the minority class and the recall of both the minority class and the majority class.
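The per-class estimates and the three measures that APM draws on can be sketched as follows. This is a hedged illustration under stated assumptions: the names `soft_counts` and `apm_components`, the variables `a_pos`, `e_pos`, `a_neg`, `e_neg`, and the 1/0 label encoding are hypothetical, and the counts are taken to be sums of predicted class probabilities, which is one natural reading of "estimated number of examples". How the three measures are combined into the single APM score is not recoverable from the text here, so the sketch returns them separately.

```python
import math

def sigmoid(a):
    """Numerically stable logistic sigmoid."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def soft_counts(w, X, y):
    """Probabilistic counts of correct/incorrect predictions per class.

    y[i] is 1 for the positive class and 0 for the negative class.
    Returns (a_pos, e_pos, a_neg, e_neg).
    """
    a_pos = e_pos = a_neg = e_neg = 0.0
    for x, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))  # p(+ | x)
        if yi == 1:
            a_pos += p          # positive example counted as correct
            e_pos += 1.0 - p    # positive example counted as wrong
        else:
            a_neg += 1.0 - p    # negative example counted as correct
            e_neg += p          # negative example counted as wrong
    return a_pos, e_pos, a_neg, e_neg

def apm_components(w, X, y):
    """Soft estimates of recall(+), recall(-), and precision(+).

    Assumes both classes are present and that the probabilities are not all
    saturated at 0, so the denominators are nonzero.
    """
    a_pos, e_pos, a_neg, e_neg = soft_counts(w, X, y)
    recall_pos = a_pos / (a_pos + e_pos)
    recall_neg = a_neg / (a_neg + e_neg)
    precision_pos = a_pos / (a_pos + e_neg)  # false positives are misclassified negatives
    return recall_pos, recall_neg, precision_pos

# Hypothetical toy data: one positive and four negatives, first feature is a bias.
X = [[1.0, 2.0], [1.0, -1.0], [1.0, -2.0], [1.0, -1.5], [1.0, -0.5]]
y = [1, 0, 0, 0, 0]
print(apm_components([0.0, 0.0], X, y))  # (0.5, 0.5, 0.2) at w = 0
```

At `w = 0` every probability is 0.5, so both recalls are 0.5 while the precision estimate drops to 0.2: the precision term is the one that reacts to the class ratio.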

Taking the gradient of APM (see (8)) with respect to the model parameters results in (10), whose terms are given by (11), (12), and (13). Combining (11), (12), (13), and (10) gives the gradient of APM defined by (8) in closed form (14). The proposed method for the imbalanced problem uses the quasi-Newton method BFGS, which takes (14) as the gradient when learning the parameters. For more details refer to Section 3.2.

##### 3.2. Algorithm

Based on the cost function APM proposed in Section 3.1, a novel imbalanced learning approach called ILLD (Imbalanced Learning Based on Logistic Discrimination) is proposed to tackle data imbalance. ILLD uses the quasi-Newton method BFGS [22–24] to maximize the cost function and thereby learn the parameters. BFGS is an iterative process: the parameter update is given by (15), which involves the step length along the Newton direction at each iteration and the approximate Hessian matrix; the approximate Hessian is calculated by (16), with its auxiliary quantities defined in (17).
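For orientation, the textbook BFGS iteration for minimizing a differentiable function $f(\mathbf{w})$ is shown below; to maximize a cost function such as APM one would take $f = -\mathrm{APM}$. The symbols $\mathbf{g}_k$, $\lambda_k$, $B_k$, $\mathbf{s}_k$, and $\mathbf{y}_k$ are standard notation chosen for this sketch and are assumptions; the paper's own equations (15)–(17) may differ in notation or sign convention.

```latex
\begin{align*}
\mathbf{w}_{k+1} &= \mathbf{w}_k - \lambda_k B_k^{-1}\,\mathbf{g}_k,
  \qquad \mathbf{g}_k = \nabla f(\mathbf{w}_k), \\
B_{k+1} &= B_k
  + \frac{\mathbf{y}_k \mathbf{y}_k^{T}}{\mathbf{y}_k^{T}\mathbf{s}_k}
  - \frac{B_k \mathbf{s}_k \mathbf{s}_k^{T} B_k}{\mathbf{s}_k^{T} B_k \mathbf{s}_k}, \\
\mathbf{s}_k &= \mathbf{w}_{k+1} - \mathbf{w}_k,
  \qquad \mathbf{y}_k = \mathbf{g}_{k+1} - \mathbf{g}_k .
\end{align*}
```

Here $\lambda_k$ is the step length chosen by a line search and $B_k$ is the running approximation of the Hessian, typically initialized to the identity matrix.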

The details of the learning process of ILLD are shown in Algorithm 1. ILLD first initializes the parameter vector randomly and the approximate Hessian to the identity matrix (lines 1~2), and performs an initial calculation using (11) (lines 3~4). Then ILLD optimizes the cost function APM to find the best parameter vector (lines 4~11). Specifically, at each iteration, ILLD calculates the gradient of APM using (14) and, based on it, updates the auxiliary quantities defined in (17) (lines 8~9). Then, it updates the approximate Hessian using (16) (line 10) and, finally, updates the parameter vector using (15) (line 11). The convergence rate of ILLD is that of BFGS [22–24], and the stopping condition is that the absolute value of the difference between the values obtained in two consecutive iterations is not larger than 0.001 (line 13).
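The loop described above can be sketched as a generic BFGS maximizer. This is not a reproduction of Algorithm 1: the function name `bfgs_maximize`, the backtracking (Armijo) line search, the use of the inverse-Hessian form of the update (which avoids solving a linear system), the deterministic starting point, and the concave toy objective are all assumptions. For ILLD, `f` would be APM and `grad` its gradient (14); the stopping rule mirrors the 0.001 threshold on consecutive objective values.

```python
def bfgs_maximize(f, grad, w0, tol=1e-3, max_iter=200):
    """Maximize f by running BFGS on -f with a backtracking line search.

    H approximates the inverse Hessian of -f and starts as the identity.
    Stops when f changes by at most tol between consecutive iterations.
    """
    n = len(w0)
    w = list(w0)
    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    fw = f(w)
    g = [-gi for gi in grad(w)]  # gradient of -f
    for _ in range(max_iter):
        # Quasi-Newton descent direction for -f.
        d = [-sum(H[i][j] * g[j] for j in range(n)) for i in range(n)]
        slope = sum(gi * di for gi, di in zip(g, d))
        step = 1.0
        while True:  # halve the step until the Armijo condition holds
            w_new = [wi + step * di for wi, di in zip(w, d)]
            f_new = f(w_new)
            if -f_new <= -fw + 1e-4 * step * slope or step < 1e-10:
                break
            step *= 0.5
        g_new = [-gi for gi in grad(w_new)]
        s = [a - b for a, b in zip(w_new, w)]
        y = [a - b for a, b in zip(g_new, g)]
        sy = sum(si * yi for si, yi in zip(s, y))
        if sy > 1e-12:  # curvature condition keeps H positive definite
            Hy = [sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]
            yHy = sum(yi * hi for yi, hi in zip(y, Hy))
            for i in range(n):
                for j in range(n):
                    H[i][j] += ((1.0 + yHy / sy) * s[i] * s[j]
                                - Hy[i] * s[j] - s[i] * Hy[j]) / sy
        converged = abs(f_new - fw) <= tol
        w, g, fw = w_new, g_new, f_new
        if converged:
            break
    return w

# Hypothetical concave objective with its maximum at (1, -3).
def f(w):
    return -(w[0] - 1.0) ** 2 - 2.0 * (w[1] + 3.0) ** 2

def grad_f(w):
    return [-2.0 * (w[0] - 1.0), -4.0 * (w[1] + 3.0)]

print(bfgs_maximize(f, grad_f, [0.0, 0.0]))
```

A tighter `tol` gives a more accurate maximizer at the cost of a few more iterations.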