Abstract

To improve the classification performance of imbalanced learning, a novel oversampling method, the immune centroids oversampling technique (ICOTE), based on an immune network, is proposed. ICOTE generates a set of immune centroids to broaden the decision regions of the minority class space. The representative immune centroids are regarded as synthetic examples in order to resolve the imbalance problem. We utilize an artificial immune network to generate synthetic examples on clusters with high data densities, which addresses a shortcoming of the synthetic minority oversampling technique (SMOTE): its lack of reflection on groups of training examples. Meanwhile, we further improve the performance of ICOTE by integrating it with ENN, that is, ICOTE + ENN. ENN disposes of the majority class examples that invade the minority class space, so ICOTE + ENN favors the separation of both classes. Our comprehensive experimental results show that the two proposed oversampling methods achieve better performance than renowned resampling methods.

1. Introduction

The class imbalance problem typically occurs in a binary classification task when there are many more training examples of one class than of the other [1]. Under these circumstances, most standard algorithms fail to properly represent the distributive characteristics of complex imbalanced datasets and thus provide unfavorable accuracies across the examples of the two classes [2]. Furthermore, it is worth pointing out that the minority class is usually the one of highest interest from a learning point of view, and misclassifying it also implies a great cost [3].

Standard classification learning algorithms are often biased towards the majority class (known as the negative class). Therefore, it is not unusual to see a higher misclassification rate for the minority class (i.e., the positive class) instances. A large number of approaches have been proposed to counter this sparsity in the distribution. Among them, the "synthetic minority oversampling technique" (SMOTE) [4] has become one of the most renowned approaches in this area. Rather than undersampling the majority class, SMOTE provides new, related information on the positive class to the learning algorithm. Batista et al. proposed an integrated method called SMOTE + ENN [5], which uses Wilson's edited nearest neighbor rule (denoted as ENN) [6] to remove examples whose classes differ from the classes of at least half of their nearest neighbors. Han et al. presented two further minority oversampling methods, borderline-SMOTE1 and borderline-SMOTE2 [7], in which only the minority class examples near the borderline are oversampled. Later, Bunkhumpornpat et al. published safe-level-SMOTE [8]; their approach samples minority instances along the safe level computed from nearest neighboring minority instances. Ramentol et al. came up with another oversampling method that applies an editing technique based on rough set theory and the lower approximation of a subset [9].

In this paper we present an immune centroids oversampling technique (ICOTE) based on immune network theory. We utilize the aiNet model [10] to generate immune centroids of clusters of high data density. Our work resamples the minority class by introducing immune centroids of clusters of minority class examples; the resampling creates larger but less specific decision regions. Meanwhile, we also integrate ICOTE with ENN, that is, ICOTE + ENN. ICOTE + ENN not only samples minority class examples but also disposes of majority class examples that invade the minority class space. We expect this hybrid method to excel ICOTE in terms of the separation of both classes. Our experimental results show that both ICOTE and ICOTE + ENN achieve better performance with three classification paradigms than the existing methods.

The rest of this paper is organized as follows. We review related work in Section 2. Section 3 presents our proposed oversampling methods ICOTE and ICOTE + ENN. Our experimental results and comparisons are shown in Section 4. Finally, we conclude this paper in Section 5.

2. Related Work

In order to deal with imbalanced issues, several articles have studied different resampling techniques, which change the class distribution. These articles empirically showed that applying a preprocessing step to balance the class distribution is usually a useful solution [5, 11–13]. Furthermore, the main advantage of these techniques is that they are independent of the underlying classifier. Resampling techniques can be categorized into three groups or families:
(1) undersampling methods, which create a subset of the original dataset by eliminating some instances (usually majority class instances),
(2) oversampling methods, which create a superset of the original dataset by replicating some instances or creating new instances from the existing ones,
(3) hybrid methods, which combine both sampling approaches from the above.
López et al. [14] evaluated various sampling methodologies on a variety of datasets with different class distributions. They selected a collection of methods belonging to the three categories. They concluded that both SMOTE [4] and SMOTE + ENN [5] are more applicable and give very good results for datasets with various imbalance rates. They also noted that the sophisticated sampling techniques did not give any clear advantage in the domains considered.

The SMOTE algorithm [4] oversamples the minority class. Specifically, it introduces synthetic examples along the line segments joining each minority class example and some or all of its k nearest minority class neighbors. Depending on the amount of oversampling required, neighbors from the k nearest neighbors are randomly chosen. Figures 1 and 2 illustrate the distribution change after the application of SMOTE [4].
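For concreteness, the following minimal Python sketch reproduces this interpolation step; the function and parameter names (smote, n_synthetic, k) are ours for illustration and are not taken from the original implementation [4].

import numpy as np

def smote(X_min, n_synthetic, k=5, rng=None):
    # Minimal sketch of SMOTE [4]: each synthetic example lies on the line
    # segment between a minority example and one of its k nearest minority
    # neighbors. Names and defaults here are illustrative assumptions.
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]   # skip the example itself
        j = rng.choice(neighbors)
        gap = rng.random()                        # random point on the segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)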

SMOTE + ENN [5] uses ENN to remove examples from both classes. Since some majority class examples might invade the minority class space and vice versa, SMOTE + ENN [5] reduces the possibility of overfitting introduced by synthetic examples. The cleaning result of ENN is illustrated in Figure 3.

Jo and Japkowicz [15] discussed whether class imbalance is truly responsible for the performance degradation of standard classifiers or whether it can be explained in some other way. Their experiments suggest that the problem is not directly caused by class imbalance, but rather that class imbalance may yield small groups which, in turn, cause this degradation. SMOTE [4] and its successors enrich the minority class space without considering data intrinsic characteristics such as small groups. The SMOTE-based methods might create synthetic examples that underrepresent actual clusters or are attributable to noisy data. We describe how our method overcomes this inherent drawback in the subsequent section.

3. Our Methods

In this section, we first briefly introduce the basic concepts and knowledge of immune systems. After that, we present our oversampling method ICOTE based on immune network theory and its improved version ICOTE + ENN.

3.1. Immune Systems

Before discussing our method, we sketch a few aspects of the human adaptive immune system. The immune system guards our bodies against infections due to the attacks of antigens. The surface receptors on B-cells (one kind of lymphocyte) are able to recognize specific antigens. The response of a receptor to an antigen can activate its hosting B-cell. An activated B-cell then proliferates and differentiates into memory cells. Memory cells secrete antibodies to neutralize pathogens through complementary pattern matching. During the proliferation of activated B-cells, a mutation mechanism is employed to create diverse antibodies by altering the gene segments. Some of the mutants may be a better match for the corresponding antigen. In order to be protective, the immune system must learn to distinguish between our own (self) cells and malefic external (nonself) invaders. This process is called self/nonself discrimination: those cells recognized as self do not promote an immune response, and the system is said to be tolerant to them, while those that are not provoke a reaction resulting in their elimination.

The immune network theory, as originally proposed in [16], hypothesizes a novel viewpoint of lymphocyte activities, natural antibody production, preimmune repertoire selection, tolerance and self/nonself discrimination, memory, and the evolution of the immune system. It was suggested that the immune system is composed of a regulated network of cells and molecules that recognize one another. The immune cells can respond either positively or negatively to the recognition signal (antigen or another immune cell or molecule). A positive response would result in cell proliferation, cell activation, and antibody secretion, while a negative response would lead to tolerance and suppression.

Learning in the immune system involves raising the population size and affinity of those lymphocytes that have proven themselves valuable by having recognized an antigen. Burnet [17] introduced the clonal selection theory by modifying Jerne's theory. The theory states that, in a preexisting group of lymphocytes (specifically B-cells), a specific antigen activates (i.e., selects) only its counter-specific cell, so that that particular cell is induced to multiply (producing its clones) for antibody production. With repeated exposures to the same antigen, the immune system produces antibodies of successively greater affinities; a secondary response elicits antibodies with greater affinity than a primary response. Based on the clonal selection principle, de Castro and von Zuben [18] proposed a computational implementation that explicitly takes into account the affinity maturation of the immune response. de Castro also defined aiNet, an artificial immune network model for data analysis [10]. The aiNet is an edge-weighted graph, not necessarily fully connected, composed of a set of nodes, called antibodies, and sets of node pairs, called edges, with a number called weight, or connection strength, assigned to each connected edge. The aiNet clusters serve as internal images (mirrors) that map the existing clusters in the dataset into network clusters. The shape of the spatial distribution of antibodies follows the shape of the antigenic spatial distribution.

3.2. Immune Centroids Resampling

In this paper we present a resampling method based on immune network theory. We use the aiNet model [10] to generate antibody-derived synthetic examples and extend a training set to balance sample distribution. The immune synthetic examples represent internal images of original minority class examples, so we call the resampling method immune centroids oversampling technique (ICOTE).

Before explaining ICOTE, we introduce some notation to describe the resampling method. Given a set of labeled examples $(\mathbf{x}_i, y_i)$ with input vectors $\mathbf{x}_i$ and class labels $y_i$, we measure the affinity (complementarity level) of an antigen-antibody match using the Euclidean distance. The Euclidean distance between two vectors $\mathbf{x}_i$ and $\mathbf{x}_j$ is

$$d(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{\sum_{l=1}^{L} (x_{il} - x_{jl})^2}, \tag{1}$$

where $L$ is the dimension of each vector. The antigen-antibody affinity is inversely proportional to the Euclidean distance: the smaller the distance, the higher the affinity, and vice versa.
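As a concrete reference for the affinity computation used throughout, here is a small Python helper. The specific monotone mapping from distance to affinity is our assumption, since the text fixes only the inverse relationship.

import numpy as np

def affinity(u, v):
    # Antigen-antibody affinity derived from formula (1): inversely related
    # to the Euclidean distance. The 1 / (1 + d) transform is an assumed
    # choice that keeps affinities in (0, 1]; the paper states only that
    # affinity and distance are inversely related.
    d = np.linalg.norm(np.asarray(u, dtype=float) - np.asarray(v, dtype=float))
    return 1.0 / (1.0 + d)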

Our ICOTE includes five major steps as follows.

3.2.1. Attribute Selection

In order to reduce the computational cost, we first remove the attributes whose values are constant:

$$\mathrm{Sample}_i' = f_{\mathrm{select}}(\mathrm{Sample}_i), \tag{2}$$

where $f_{\mathrm{select}}$ drops every attribute whose maximum and minimum values over the minority examples coincide.

3.2.2. Unit-Based Normalization

Then we adjust the values of attributes measured on different scales to a notionally common scale $[0, 1]$:

$$x_{ij}' = \frac{x_{ij} - \min_i x_{ij}}{\max_i x_{ij} - \min_i x_{ij}}. \tag{3}$$

3.2.3. Immune Centroids Generation

There are three steps for generating immune centroids. First, the selected antibodies proliferate (clone) proportionally to their antigenic affinity: the higher the affinity, the larger the clone size for each selected antibody,

$$N_k \propto \mathrm{aff}(\mathbf{c}_k, \mathbf{x}), \tag{4}$$

where $N_k$ is the clone size of antibody $\mathbf{c}_k$ for antigen $\mathbf{x}$. Next, each antibody $\mathbf{c}_k$ from the clone set suffers a directed mutation with a rate $\alpha_k$, which is inversely proportional to the antigenic affinity of its parent antibody:

$$\mathbf{c}_k' = \mathbf{c}_k + \alpha_k (\mathbf{x} - \mathbf{c}_k), \qquad \alpha_k \propto \frac{1}{\mathrm{aff}(\mathbf{c}_k, \mathbf{x})}. \tag{5}$$

Then we eliminate the memory antibodies (denoted as $M$) with a low antigen-antibody affinity (clonal suppression) and those with a high antibody-antibody affinity (network suppression):

$$\mathrm{aff}(\mathbf{m}_i, \mathbf{x}) < \sigma_d \;\Rightarrow\; \text{dispose } \mathbf{m}_i; \qquad d(\mathbf{m}_i, \mathbf{m}_j) < \sigma_s \;\Rightarrow\; \text{dispose } \mathbf{m}_j, \tag{6}$$

where $\sigma_d$ and $\sigma_s$ are the clonal and network suppression thresholds (both set to 0.05 in Algorithm 1).

3.2.4. Denormalization

Next, we denormalize the memory antibodies so that the synthetic examples match the original sample distribution, inverting formula (3):

$$x_{ij} = x_{ij}' \left(\max_i x_{ij} - \min_i x_{ij}\right) + \min_i x_{ij}. \tag{7}$$

3.2.5. Attribute Replacement

At the end, we put back the constant-valued attributes removed in Section 3.2.1:

$$\mathrm{Synthetic}_i = f_{\mathrm{select}}^{-1}(\mathrm{Synthetic}_i'), \tag{8}$$

where $f_{\mathrm{select}}^{-1}$ (de-fselect in Algorithm 1) reinserts each removed attribute with its constant value. Correspondingly, the algorithm is described in Algorithm 1.

Input:
  Sample: array of original minority examples, n: number of original minority examples, N: initial antibody number,
  gen: maximum number of generations, M: array of memory antibodies, Ab: array of network antibodies,
  Ag: array of antigens, Dist: array of Euclidean distances
Output: Synthetic: array of synthetic minority examples
for i := 1 to n
  Compute: Sample[i] = fselect(Sample[i]) // Use formula (2)
end for
for i := 1 to n
  Compute: Ag[i] = norm(Sample[i]) // Use formula (3)
end for
Generate N random antibodies Ab
while gen > 0 do
  Initialize memory antibodies M
  for i := 1 to n
    for j := 1 to N
      Compute: Dist[j] = dist(Ag[i], Ab[j]) // Use formula (1)
    end for
    Clone antibodies in proportion to antigen-antibody affinities // Use formula (4)
    Select a portion of antibodies to perform mutation // Use formula (5)
    Dispose antibodies with antigen-antibody affinity < 0.05 // Use formula (6)
    Append surviving antibodies to M
  end for
  Compute: mnum = number of memory antibodies in M
  for i := 1 to mnum
    Compute: Dist[i] = dist(M[i], M) // pairwise antibody-antibody distances, formula (1)
  end for
  Dispose memory antibodies with high antibody-antibody affinity (distance < 0.05) // Use formula (6)
  Fill Ab with M and new random antibodies
  gen := gen - 1
end while
Compute: mnum = number of memory antibodies in M
for i := 1 to mnum
  Compute: Synthetic[i] = de-norm(M[i]) // Use formula (7)
end for
for i := 1 to mnum
  Compute: Synthetic[i] = de-fselect(Synthetic[i]) // Use formula (8)
end for
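For readers who prefer running code, the following Python sketch compresses Algorithm 1. Where the original formulas leave latitude, the clone-size factor, the mutation operator, and the distance-to-affinity mapping below are assumptions in the spirit of aiNet [10], not the authors' exact implementation.

import numpy as np

def icote(X_min, n_antibodies=20, generations=10, thresh=0.05, rng=None):
    # Sketch of ICOTE (Algorithm 1), assuming an aiNet-style [10] clone,
    # mutate, suppress loop; thresholds follow the 0.05 used in Algorithm 1.
    rng = np.random.default_rng(rng)
    X = np.asarray(X_min, dtype=float)
    varying = X.max(axis=0) > X.min(axis=0)     # formula (2): drop constants
    Xv = X[:, varying]
    lo, hi = Xv.min(axis=0), Xv.max(axis=0)
    Ag = (Xv - lo) / (hi - lo)                  # formula (3): scale to [0, 1]
    Ab = rng.random((n_antibodies, Ag.shape[1]))
    M = np.empty((0, Ag.shape[1]))
    for _ in range(generations):
        memory = []
        for ag in Ag:
            aff = 1.0 / (1.0 + np.linalg.norm(Ab - ag, axis=1))  # formula (1)
            # formula (4): clone in proportion to affinity (factor 10 assumed)
            C = np.vstack([np.repeat(ab[None], max(1, int(10 * a)), axis=0)
                           for ab, a in zip(Ab, aff)])
            # formula (5): directed mutation, rate decreasing with affinity
            a_c = 1.0 / (1.0 + np.linalg.norm(C - ag, axis=1))
            C = C + (1.0 - a_c)[:, None] * (ag - C)
            # formula (6), clonal suppression: drop low antigen affinity
            a_c = 1.0 / (1.0 + np.linalg.norm(C - ag, axis=1))
            memory.extend(C[a_c >= thresh])
        M = np.array(memory)
        # formula (6), network suppression: drop near-duplicate antibodies
        keep = []
        for idx in range(len(M)):
            if all(np.linalg.norm(M[idx] - M[j]) >= thresh for j in keep):
                keep.append(idx)
        M = M[keep]
        # refill the network with memory plus fresh random antibodies
        Ab = np.vstack([M, rng.random((n_antibodies, Ag.shape[1]))])
    Xs = M * (hi - lo) + lo                     # formula (7): denormalize
    out = np.tile(X[0], (len(Xs), 1))           # formula (8): put back the
    out[:, varying] = Xs                        # constant-valued attributes
    return out

The returned synthetic examples can simply be appended to the training set with the minority class label.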

ICOTE samples minority class examples to generate memory antibodies (immune centroids). The shape of the spatial distribution of the immune centroids follows that of the minority class examples. Therefore, it avoids the additional small groups or outliers that oversampling can otherwise introduce. For instance, we depict the immune centroids in Figure 4. Intuitively, each immune centroid (drawn as a star) shares a group with one or several neighboring minority class examples. Introducing immune centroids for learning not only creates larger and less specific decision regions but also decreases the likelihood of overfitting, which is the major drawback of SMOTE [4].

We also propose an integrated method called ICOTE + ENN, which combines ICOTE with Wilson's edited nearest neighbor rule (i.e., ENN) [6]. In this integrated method, ICOTE oversamples minority class examples, and ENN discards "dirty" examples, notably majority class examples that invade the minority class space. When the class of an example differs from the class of more than half of its nearest neighbors, the example is removed from the training set. The result of the integrated method is illustrated in Figure 5, which shows that the hybrid method separates the two class spaces. In the next section, we present our empirical results for the two methods.
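A minimal Python sketch of the ENN cleaning step follows; k = 3 is the conventional neighborhood size for Wilson's rule and is assumed here, since the text fixes only the majority-vote criterion.

import numpy as np

def enn(X, y, k=3):
    # Wilson's edited nearest neighbor rule [6]: remove every example whose
    # class differs from the class of more than half of its k nearest
    # neighbors. k = 3 is an assumed, conventional choice.
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    keep = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]      # exclude the example itself
        if np.sum(y[neighbors] == y[i]) * 2 >= k:
            keep.append(i)
    return X[keep], y[keep]

In ICOTE + ENN, this cleaning step runs after ICOTE's oversampling, over both classes of the augmented training set.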

4. Experiments

In this section, we investigate the performance of our proposed oversampling methods ICOTE and ICOTE + ENN and compare them with existing well-known oversampling methods.

4.1. Experimental Settings

Our experiments are conducted with three base classifiers: kNN, C4.5, and SVM. We use these algorithms because they are available within the KEEL software tool [19]. In the experiments, the parameter values are set based on the recommendations of the corresponding authors. The specific settings are as follows.
(1) Instance-based learning (kNN) [20]: in this algorithm, we set k = 1 and use the Euclidean distance metric.
(2) C4.5 decision tree [21]: for C4.5, we set the confidence level to 0.25 and the minimum number of item sets per leaf to 2 and use pruning.
(3) Support vector machines (SVM) [22]: for SVM, we choose Polykernel reference functions, setting the exponent of each kernel function to 1.0 and the penalty parameter of the error term to 1.0.
We conduct experiments on 38 datasets from the KEEL dataset repository [23], whose characteristics are summarized in Table 1, namely, the number of examples (#Ex.), the number of attributes (#Atts.), and the imbalance ratio (IR). The experiments are evaluated in terms of one of the popular metrics, the area under the ROC curve (AUC) [24, 25]. The experimental results are obtained based on 5-fold cross-validation. We choose 5-fold cross-validation because it keeps sufficient positive class instances in the different folds. Thus, we avoid the additional problems in the data distribution discussed in [26, 27], especially for highly imbalanced datasets.
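Although our experiments use the KEEL tool [19], the evaluation protocol can be sketched in Python as below. scikit-learn's StratifiedKFold and 1NN classifier stand in for the KEEL implementations, and icote and enn refer to the sketches given earlier; resampling is applied only to the training folds.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

def evaluate_icote_enn(X, y, pos_label=1, seed=0):
    # 5-fold stratified CV: oversample the minority class of each training
    # fold with ICOTE, clean with ENN, then score AUC on the untouched test
    # fold. The library stand-ins are ours; the paper itself uses KEEL [19].
    aucs = []
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        X_syn = icote(X_tr[y_tr == pos_label])          # oversample minority
        X_aug = np.vstack([X_tr, X_syn])
        y_aug = np.concatenate([y_tr, np.full(len(X_syn), pos_label)])
        X_aug, y_aug = enn(X_aug, y_aug)                # ENN cleaning
        clf = KNeighborsClassifier(n_neighbors=1).fit(X_aug, y_aug)
        scores = clf.predict(X[test_idx]) == pos_label  # hard 1NN decisions
        aucs.append(roc_auc_score(y[test_idx] == pos_label, scores))
    return float(np.mean(aucs))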

We must point out that the dataset partitions employed in this paper are available from the KEEL dataset repository [23], so that researchers can use the same data partitions for comparisons.

4.2. Evaluation in Imbalanced Domains

In imbalanced domains, a well-known approach to unify these measures and produce an evaluation criterion is the receiver operating characteristic (ROC) graphic [24]. This graphic visualizes the trade-off between benefits ($TP_{\mathrm{rate}}$) and costs ($FP_{\mathrm{rate}}$), as it evidences that a classifier cannot increase the number of true positives without also increasing the false positives. The area under the ROC curve (AUC) [25] corresponds to the probability of correctly identifying which of two stimuli is noise and which is signal plus noise. The AUC provides a single measure of a classifier's performance for evaluating which model is better on average. The AUC measure is computed as the area of the graphic:

$$\mathrm{AUC} = \frac{1 + TP_{\mathrm{rate}} - FP_{\mathrm{rate}}}{2}. \tag{9}$$

AUC combines the individual measures of both the positive and negative classes, so we can utilize it to measure the quality of results of different paradigms for imbalanced data.
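The computation reduces to one line; the helper name below is ours.

def auc_from_rates(tp_rate, fp_rate):
    # Area under the two-segment ROC curve through (FP_rate, TP_rate),
    # i.e., formula (9): AUC = (1 + TP_rate - FP_rate) / 2.
    return (1.0 + tp_rate - fp_rate) / 2.0

# For instance, tp_rate = 0.8 and fp_rate = 0.2 give an AUC of 0.8.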

4.3. Experimental Results

In this section, we investigate the performance of the resampling methods on the imbalanced datasets listed in Table 1.

As shown in the previous work [14, Table 4] on the KEEL datasets, SMOTE [4] and SMOTE + ENN [5] have the highest rank for the three classification algorithms (kNN, C4.5, and SVM) used in their study, and both ADASYN [28] and SL-SMOTE [29] achieve the 2nd highest AUC values. So we select these four resampling algorithms and compare our ICOTE and ICOTE + ENN with them. The average AUC results of the different resampling methods with the three base learners kNN, C4.5, and SVM over all 38 datasets are shown in Table 2. Besides, we also report the experimental results obtained with the three base learners directly, without using resampling techniques, denoted as "none" in Table 2. Please note that our experimental results on each dataset are shown in the Appendix.

From Table 2, we can see that our methods ICOTE and ICOTE + ENN perform much better than the other four resampling methods on all three base learners, and ICOTE + ENN further improves on ICOTE. Our experimental results also show that SMOTE and SMOTE + ENN perform better than SL-SMOTE and ADASYN, with SMOTE + ENN slightly improving on SMOTE for all three base learners. Between SL-SMOTE and ADASYN, SL-SMOTE performs better; that is, ADASYN is the worst among the six resampling methods.

Besides the average results shown in Table 2, we also rank the resampling methods on each dataset with each base learner. The average ranks of each method with each base learner are shown in Figure 6. From Figure 6, we can see that ICOTE + ENN has the best average rank under each of the three base learners. ICOTE consistently ranks second, SMOTE + ENN third, and SMOTE fourth. "None" (without using resampling techniques) clearly performs the worst when either C4.5 or SVM is used as the base learner. Between SL-SMOTE and ADASYN, SL-SMOTE always performs better. These conclusions are consistent with those drawn from the average AUC values in Table 2.

To find out which algorithms are distinctive among the pairwise comparisons of these methods, we carry out a Shaffer post hoc test [30], shown in Tables 3–5. In these tables, a "+" symbol implies that the algorithm in the row is statistically better than the one in the column, "−" implies the contrary, and "=" means that the two algorithms compared show no significant difference. In brackets, the unadjusted p value associated with each comparison is also presented. Shaffer's procedure rejects those hypotheses whose unadjusted p values fall below the adjusted significance threshold.

In order to explain why ICOTE and ICOTE + ENN obtain the highest performance, we emphasize two plausible reasons. The first is the addition of significant information within the minority class examples by including immune centroids of clusters. These immune centroids allow the formation of larger clusters that help the classifiers to separate both classes, and, in ICOTE + ENN, the cleaning procedure further benefits the generalization ability during learning. The second reason is that the immune centroids represent inherent clusters of the minority class examples and overcome the limitation that synthetic examples may form new clusters or outliers.

5. Conclusions

In this paper we present two oversampling methods based on immune network theory. We draw the following conclusions from our experimental results and analyses.
(1) ICOTE samples minority class examples to generate immune centroids as synthetic examples, which differs fundamentally from renowned resampling methods that do not consider the cluster structure of the data.
(2) ICOTE introduces minority class examples, and ENN disposes of majority class examples in the minority data space; thus ICOTE + ENN favors the separation of both classes.
(3) We compare our proposed methods ICOTE and ICOTE + ENN with representative resampling methods. Our experimental results show that our approaches make significant improvements.

Appendix

See Tables 6, 7, and 8.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was partially supported by the Natural Science Foundation of China under Grants nos. 61170020, 61402311, and 61440053, Jiangsu Province Colleges and Universities Natural Science Research Project under Grant no. 13KJB520021, Jiangsu Province Technology Innovation Fund Project for Science and Technology Enterprises under Grant no. BC2013124, 2013 Suzhou Municipal Special Fund Project for Speeding up the Information Construction, the US National Science Foundation (IIS-1115417), and Jiangsu Province Postgraduate Cultivation and Innovation Project under Grant no. ZY32001814.