Abstract

Rough set theory is a powerful mathematical tool introduced by Pawlak to deal with imprecise, uncertain, and vague information. The neighborhood-based rough set model extends rough set theory and divides a dataset into three regions; the boundary region indicates where the majority class samples and the minority class samples overlap. Based on the distribution of the original dataset, we oversample only those minority class samples in the boundary region, that is, the ones that overlap with the majority class samples. Thus, NRSBoundary-SMOTE can expand the decision space for the minority class; meanwhile, it shrinks the decision space for the majority class. Experiments with four kinds of classifiers show that NRSBoundary-SMOTE achieves higher accuracy than other methods when C4.5, CART, and KNN are used, but it performs worse than SMOTE with the SVM classifier.

1. Introduction

The imbalanced dataset problem in classification domains occurs when the number of instances that represent one class is much larger than that of the other classes. The minority class is usually more interesting from the point of view of the learning task. There are many situations in which imbalance occurs between classes, such as satellite image classification [1], risk management [2], and medical diagnosis [3, 4]. When studying problems with imbalanced data, using the classifiers produced by standard machine learning algorithms without adjusting the output threshold may well be a critical mistake [5].

At present, the solutions for the problem of imbalanced dataset classification are developed at both the data and algorithmic levels [6]. At the data level, the objective is to rebalance the class distribution by resampling the data space, such as oversampling the minority class and undersampling the prevalent class. At the algorithm level, solutions try to adapt existing classifier learning algorithms to strengthen learning with regard to the minority class, such as cost-sensitive learning and ensemble learning. Resampling is convenient and effective; therefore, it is an often-used method in dealing with the class imbalance problem.

Previous research has improved resampling in many respects and proposed several effective resampling algorithms. SMOTE is an intelligent oversampling algorithm proposed by Chawla et al. [7]. Its main idea is to form new minority class samples by interpolating between minority class samples that lie close together. Thus, the overfitting problem is avoided and the decision space for the minority class is expanded; meanwhile, it reduces the decision space for the majority class, so many researchers have proposed improved variants. Dong and Wang [8] proposed Random-SMOTE, which, unlike SMOTE, obtains new minority class samples by interpolating among three minority class samples. Yang et al. [9] proposed the ASMOTE algorithm, which chooses not only the minority class samples but also the majority class samples that are near a minority class sample, so that synthetic samples avoid overlapping the majority class samples. Han et al. [10] proposed Borderline-SMOTE. Their study considered the borderline minority samples, which are the most easily misclassified; they identified these borderline minority samples and generated synthetic samples from them. Compared with SMOTE, Borderline-SMOTE maintains the decision space for the majority class and enlarges the decision space for the minority class. However, when the number of minority class samples is much smaller than that of the majority class, most of the minority class samples are regarded as noise. Thus, few synthetic samples are generated, and the method improves accuracy only slightly. For these reasons, it is urgent to study more effective oversampling methods that generate high-quality synthetic samples, in particular methods that provide a better way to distinguish the borderline minority class samples.

It is important to find an effective mathematical theory to express and process the uncertainty of the minority class samples. Rough set theory is a powerful mathematical tool introduced by Pawlak [11–14] to deal with imprecise, uncertain, and vague information. It has been successfully applied to fields such as machine learning, data mining, intelligent data analysis, and control algorithm acquisition. Basically, the idea is to approximate a concept by three description sets, namely, the lower approximation, the upper approximation, and the boundary region. Rough set theory puts the uncertain samples in the boundary region, which is computed as the upper approximation minus the lower approximation, and all three sets are computable. Up to now, many researchers have applied rough set theory to imbalanced data. Liu et al. [15] proposed a weighted rough set model to process imbalanced data, which gives the minority class samples a higher weight so that the classifier focuses on them. Ramentol et al. [16] introduced a hybrid preprocessing approach that combines SMOTE with the upper approximation. This method filters the generated synthetic samples by comparing them with the majority class samples in the upper approximation: once a synthetic sample is similar to the majority class samples in the upper approximation, it is removed, to ensure that the synthetic samples approximate the minority class samples. Grzymala-Busse et al. [17] modified the LEM2 algorithm by strengthening the rules to improve the classification of minority class samples.

The remainder of this paper is organized as follows. The basic concepts on neighborhood rough set models are shown in Section 2. By using the oversampling strategy of minority class samples in boundary region, the NRSBoundary-SMOTE algorithm is developed in Section 3. Section 4 presents the experimental evaluation on 15 imbalanced UCI datasets [18] by 10-fold cross validation, which shows the validity of the proposed method. The paper is concluded in Section 5.

2. Neighborhood-Based Rough Set Model

Neighborhoods and neighborhood relations are a class of important concepts in topology. Lin [19] pointed out that neighborhood spaces are more general topological spaces than equivalence spaces and introduced the neighborhood relation into rough set methodology. Hu et al. [20] discussed the properties of neighborhood approximation spaces and proposed the neighborhood-based rough set model. They then used the model to build a uniform theoretic framework for neighborhood-based classifiers.

For the convenience of description, some basic concepts of the neighborhood rough set model are introduced here at first.

Definition 1 (see [20]). Given arbitrary $x_i \in U$ and $B \subseteq A$, the neighborhood $\delta_B(x_i)$ of $x_i$ in the subspace $B$ is defined as $$\delta_B(x_i) = \{ x_j \mid x_j \in U,\ \Delta_B(x_i, x_j) \le \delta \},$$ where $\Delta$ is a metric function. For arbitrary $x_1, x_2, x_3 \in U$, it satisfies (1) $\Delta(x_1, x_2) \ge 0$; (2) $\Delta(x_1, x_2) = 0$ if and only if $x_1 = x_2$; (3) $\Delta(x_1, x_2) = \Delta(x_2, x_1)$; (4) $\Delta(x_1, x_3) \le \Delta(x_1, x_2) + \Delta(x_2, x_3)$.

Consider that $x_1$ and $x_2$ are two objects in an $m$-dimensional space $A = \{a_1, a_2, \ldots, a_m\}$, where $f(x, a_i)$ denotes the value of sample $x$ on the $i$th dimension $a_i$. Then, a general metric, named the Minkowsky distance, is defined as $$\Delta_P(x_1, x_2) = \Bigl( \sum_{i=1}^{m} \bigl| f(x_1, a_i) - f(x_2, a_i) \bigr|^{P} \Bigr)^{1/P}. \quad (2)$$ When $P = 2$, it is the Euclidean distance.

However, the Euclidean distance can only be computed on continuous features; it is invalid for nominal features. Here, we handle nominal features by using the Value Difference Metric (VDM) proposed by Stanfill and Waltz [21] in 1986. The distance between two corresponding feature values is defined as follows: $$\mathrm{VDM}(v_1, v_2) = \sum_{c=1}^{C} \left| \frac{N_{1,c}}{N_1} - \frac{N_{2,c}}{N_2} \right|^{k}. \quad (3)$$

In the previous equation, $v_1$ and $v_2$ are the two corresponding feature values, $N_1$ is the total number of occurrences of feature value $v_1$, and $N_{1,c}$ is the number of occurrences of feature value $v_1$ for class $c$. A similar convention also applies to $N_2$ and $N_{2,c}$. $C$ is the number of decision classes, and $k$ is a constant, which is usually set to 1.
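To make the metric concrete, the following is a minimal Python sketch that combines the Euclidean distance on continuous features with the VDM on nominal features. The function names (vdm, mixed_distance) and the way the two parts are combined into one squared sum are our own illustrative choices, not taken from the paper.

import math

def vdm(values, labels, v1, v2, k=1):
    """Value Difference Metric between two nominal values of one feature."""
    n1 = sum(1 for v in values if v == v1)   # total occurrences of v1
    n2 = sum(1 for v in values if v == v2)   # total occurrences of v2
    d = 0.0
    for c in set(labels):
        n1c = sum(1 for v, y in zip(values, labels) if v == v1 and y == c)
        n2c = sum(1 for v, y in zip(values, labels) if v == v2 and y == c)
        d += abs(n1c / n1 - n2c / n2) ** k
    return d

def mixed_distance(x1, x2, data, labels, nominal_idx):
    """Euclidean on continuous features, VDM on nominal features."""
    total = 0.0
    for i, (a, b) in enumerate(zip(x1, x2)):
        if i in nominal_idx:
            column = [row[i] for row in data]
            total += vdm(column, labels, a, b) ** 2   # nominal feature
        else:
            total += (a - b) ** 2                     # continuous feature
    return math.sqrt(total)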

Definition 2 (see [20]). Given a set of samples $U = \{x_1, x_2, \ldots, x_n\}$, $N$ is a neighborhood relation on $U$, and $\{\delta(x_i) \mid x_i \in U\}$ is the family of neighborhood granules. Then, we call $\langle U, N \rangle$ a neighborhood approximation space.

Definition 3 (see [20]). Given $\langle U, N \rangle$, for arbitrary $X \subseteq U$, two subsets of objects, called the lower approximation and the upper approximation of $X$ in terms of relation $N$, are defined as $$\underline{N}X = \{ x_i \mid \delta(x_i) \subseteq X,\ x_i \in U \}, \qquad \overline{N}X = \{ x_i \mid \delta(x_i) \cap X \ne \emptyset,\ x_i \in U \}.$$ The boundary region of $X$ in the approximation space is formulated as $$BN(X) = \overline{N}X - \underline{N}X.$$

Definition 4 (see [20]). Given a neighborhood decision table $\mathrm{NDT} = \langle U, A \cup D \rangle$, $X_1, X_2, \ldots, X_N$ are the object subsets with decisions 1 to $N$, and $\delta_B(x_i)$ is the neighborhood information granule including $x_i$ and generated by attributes $B \subseteq A$. Then the lower and upper approximations of the decision $D$ with respect to attributes $B$ are defined as $$\underline{N}_B D = \bigcup_{i=1}^{N} \underline{N}_B X_i, \qquad \overline{N}_B D = \bigcup_{i=1}^{N} \overline{N}_B X_i,$$ where $$\underline{N}_B X_i = \{ x_j \mid \delta_B(x_j) \subseteq X_i,\ x_j \in U \}, \qquad \overline{N}_B X_i = \{ x_j \mid \delta_B(x_j) \cap X_i \ne \emptyset,\ x_j \in U \}.$$ The decision boundary region of $D$ with respect to attributes $B$ is defined as $$BN(D) = \overline{N}_B D - \underline{N}_B D.$$

The decision boundary region is the subset of objects whose neighborhoods contain samples from more than one decision class. In contrast, the lower approximation of the decision, also called the positive region of the decision and denoted by $POS_B(D)$, is the subset of objects whose neighborhoods belong entirely to one of the decision classes.
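The following short Python sketch illustrates this split. It assumes neighborhoods[i] is the precomputed set of indices inside the neighborhood of sample i and labels[i] is its decision class; the function name split_regions is illustrative.

def split_regions(neighborhoods, labels):
    """Split sample indices into the positive region and the boundary region."""
    positive, boundary = set(), set()
    for i, neigh in enumerate(neighborhoods):
        if all(labels[j] == labels[i] for j in neigh):
            positive.add(i)   # pure neighborhood: lower approximation of the decision
        else:
            boundary.add(i)   # mixed neighborhood: decision boundary region
    return positive, boundary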

To illustrate the samples in the lower approximation of the decision and in the boundary region, we give an example in Figure 1.

Example 5. Figure 1 shows an example of binary classification in 2D space, where the majority class samples are labeled by boxes and the minority class samples are labeled by circles. Consider three of the samples and assign a circular neighborhood to each of them. According to the aforementioned definitions, a sample whose neighborhood contains only samples of its own class falls into the lower approximation of the corresponding decision class, whereas a sample whose neighborhood contains samples from both classes falls into the boundary region.

3. Neighborhood Rough Set Boundary SMOTE Algorithm

3.1. SMOTE Algorithm

SMOTE, proposed by Chawla et al., is a popular oversampling method. Its main idea is to construct new minority class samples by randomly selecting a nearby minority class neighbor and interpolating toward it. The method can be described as follows. Firstly, for each minority class sample $x$, one gets its $k$-nearest neighbors from the other minority class samples. Secondly, one chooses one minority class sample $\hat{x}$ among the neighbors. Finally, one generates the synthetic sample $x_{\mathrm{new}}$ by interpolating between $x$ and $\hat{x}$ as follows: $$x_{\mathrm{new}} = x + \mathrm{rand}(0, 1) \times (\hat{x} - x),$$ where $\mathrm{rand}(0, 1)$ refers to a random number between 0 and 1.
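As an illustration, the following Python sketch generates one synthetic sample with the interpolation above; the brute-force neighbor search and the name smote_sample are simplifications for readability, not the reference implementation.

import random

def smote_sample(x, minority, k=5):
    """Generate one synthetic sample for the minority class sample x."""
    # k nearest minority neighbors of x (brute-force squared Euclidean distance)
    neighbors = sorted(
        (m for m in minority if m is not x),
        key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)),
    )[:k]
    x_hat = random.choice(neighbors)   # pick one neighbor at random
    gap = random.random()              # rand(0, 1)
    return [a + gap * (b - a) for a, b in zip(x, x_hat)]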

From a geometric point of view, SMOTE can be regarded as interpolating between two minority class samples. The decision space for the minority class is expanded, which allows the classifier to give better predictions on unknown minority class samples.

The SMOTE algorithm is simple and effective at generating synthetic samples, and the overfitting problem is avoided. It expands the decision space for the minority class, but it is very likely to shrink the decision space for the majority class in the meanwhile, which leads to poor predictions on unknown majority class samples. Now, we give an example to illustrate this drawback of SMOTE (see Figure 2).

In Figure 2, we apply SMOTE to generate synthetic samples for a minority class sample. We randomly select its five nearest minority class neighbors. According to Definition 3, some of these neighbors belong to the lower approximation of the decision, whereas the others lie farther away, close to the majority class region. If one generates synthetic samples between the chosen sample and these farther neighbors, the synthetic samples will overlap with (or lie very close to) the majority class samples, so misclassification will occur easily. Therefore, it is important to find rational neighbors of minority class samples while oversampling.

3.2. Neighborhood Rough Set Boundary SMOTE Algorithm

In order to solve the aforementioned problem, we propose a new oversampling method, namely, Neighborhood Rough Set Boundary SMOTE (NRSBoundary-SMOTE). The proposed method consists of three steps. First, we compute the minority class samples in the boundary region and the majority class samples in the lower approximation of the decision. Second, for every such minority class sample, we generate synthetic samples by calling the SMOTE algorithm. Third, we keep only the rational synthetic samples, that is, those that do not affect the decision space of the majority class samples in the lower approximation of the decision.

In Figure 3, an example is given to further explain NRSBoundary-SMOTE. The samples in the ellipse all belong to the boundary region, while the ones outside belong to the lower approximation of the decision. Now, we choose a minority sample in the boundary region for oversampling and find its nearest minority class neighbors. Assume that five synthetic samples are generated between the chosen sample and these neighbors. Obviously, some of them may fall into the neighborhood of a majority sample in the lower approximation; that is, there is a risk that this majority sample will be classified into the minority class. Therefore, some effective method should be adopted to avoid this risk.

An effective way is to require that a synthetic sample cannot fall in the neighborhood of any majority sample while oversampling. How to measure the neighborhood radius of a sample is then the primary issue. According to Definition 1, we should obtain the threshold $\delta$ first. Here we compute $\delta$ for a sample $x_i$ as follows [20]: $$\delta_{x_i} = \min \Delta(x_i) + w \times \mathrm{range}\,\Delta(x_i), \quad (10)$$ where $x_i$ is a training sample, $\min \Delta(x_i)$ denotes the minimal distance between $x_i$ and the remaining samples excluding $x_i$, and $\mathrm{range}\,\Delta(x_i)$ denotes the value range of those distances. In this way, $\delta$ is dynamically generated in terms of the whole training set. In Section 4, we give a recommended value domain of $w$, combined with the experimental analysis.
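A minimal Python sketch of formula (10), assuming distance(a, b) is any metric on two samples (for example, the mixed metric of Section 2 with the dataset fixed); the name neighborhood_radius is illustrative.

def neighborhood_radius(i, data, w, distance):
    """Threshold of sample i: minimal distance plus w times the distance range."""
    dists = [distance(data[i], data[j]) for j in range(len(data)) if j != i]
    return min(dists) + w * (max(dists) - min(dists))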

Here we give the NRSBoundary-SMOTE procedure (see Algorithm 1).

Input: the training sample set U and the neighborhood radius parameter w.
Output: the new training sample set U'.
Step 1: (Initialization)
  SampleSet = ∅; // SampleSet is the generated synthetic sample set.
  BoundSet = ∅;  // BoundSet is the minority class sample set in the boundary
                    region which needs over-sampling.
  PosSet = ∅;    // PosSet is the majority class sample set in the lower
                    approximation of the decision.
Step 2: (Compute the majority class sample set and minority class sample set)
  According to the decision values 1 to N, divide U into N subsets: X_1, X_2, ..., X_N;
  Compute the minority class sample set MinSet;
  Compute the majority class sample set MajSet;
Step 3: (Compute the boundary region and the lower approximation of the decision)
  FOR each x_i in U DO
   According to formulas (2) and (3), compute the distances between
    x_i and the other samples in U;
    minDis(x_i) = the minimal distance from x_i to the other samples;
    rangeDis(x_i) = the range of the distances from x_i;
   According to formula (10), compute the threshold δ of x_i;
   Compute the neighborhood of x_i: δ(x_i);
   IF x_i ∈ MinSet AND δ(x_i) ⊄ MinSet //minority class sample which
                 belongs to the boundary region.
    THEN BoundSet = BoundSet ∪ {x_i};
   ELSE IF x_i ∈ MajSet AND δ(x_i) ⊆ MajSet //majority class sample which
                     belongs to the lower approximation of the decision.
    THEN PosSet = PosSet ∪ {x_i};
   END IF
  END FOR
Step 4: (Generate synthetic samples from BoundSet)
  FOR each x in BoundSet DO
   BOOL conflict = FALSE;
   Compute x's k nearest neighbors with the same classification: x_1, x_2, ..., x_k;
   num = the number of synthetic samples to be generated for x;
   WHILE num > 0 DO
    Choose one sample x_j among the k neighbors randomly;
    num = num - 1;
    //Generate a synthetic sample.
    x_new = x + rand(0, 1) × (x_j - x);
    //Judge whether x_new affects the lower approximation of the decision.
    conflict = FALSE;
    FOR each y in PosSet DO
     IF Δ(x_new, y) ≤ δ_y THEN
        conflict = TRUE;
       BREAK;
     END IF
    END FOR
    //Add x_new to SampleSet.
    IF conflict == FALSE THEN
      SampleSet = SampleSet ∪ {x_new};
    END IF
   END WHILE
  END FOR
Step 5: (Return)
  U' = U ∪ SampleSet;
  RETURN U'.

Time Complexity Analysis of Algorithm 1. Assume that $|U| = n$ and the number of features is $m$. The time complexity of Step 1 is $O(1)$. The time complexity of Step 2 is $O(n)$. The time complexity of Step 3 is $O(n^2 m)$, since the distances between each sample and all the others must be computed. The time complexity of Step 4 is $O(n^2 m)$ in the worst case, because every synthetic sample is compared against the majority class samples in the lower approximation of the decision. The time complexity of Step 5 is $O(n)$. So the time complexity of Algorithm 1 is $O(n^2 m)$.

Space Complexity Analysis of Algorithm 1. The space complexity of Algorithm 1 is $O(nm)$, which is dominated by storing the training samples and the generated synthetic samples.

4. Experimental Designing and Analysis

In this section, we first present the experimental setup, including the UCI datasets and the evaluation in imbalanced domains. Then we introduce the experimental analysis, which is divided into two parts: first we carry out an analysis of the parameters for our method, and then we develop the comparative analysis with other oversampling methods and some classifiers.

4.1. Datasets

In order to test the proposed algorithm, 15 UCI datasets with imbalance rates ranging from 0.20 to 0.804 are downloaded from the machine learning data repository of the University of California at Irvine. There are four multiclass datasets and eleven two-class datasets. The multiclass datasets are modified to obtain two-class imbalance problems by taking the union of one or more classes as the minority class and the union of the remaining classes as the majority class. For missing values, continuous features are filled with the average value and nominal features are filled with the most frequently occurring value. The datasets are outlined in Table 1 and sorted by imbalance rate from low to high.
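For reference, the missing-value handling described above can be sketched in Python as follows; the representation of missing values as None and the helper name fill_missing are assumptions for illustration.

from collections import Counter

def fill_missing(data, nominal_idx):
    """Fill missing entries: column mean for continuous, mode for nominal features."""
    n_features = len(data[0])
    for i in range(n_features):
        observed = [row[i] for row in data if row[i] is not None]
        if i in nominal_idx:
            fill = Counter(observed).most_common(1)[0][0]   # most frequent value
        else:
            fill = sum(observed) / len(observed)            # average value
        for row in data:
            if row[i] is None:
                row[i] = fill
    return data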

4.2. Experimental Evaluation in Imbalanced Domains

The traditional evaluation usually uses the confusion matrix, shown in Table 2, where TP is the number of positive samples classified as positive, TN is the number of negative samples classified as negative, FN is the number of positive samples that are misclassified, and FP is the number of negative samples that are misclassified.

From Table 2, one can derive some useful evaluation measures as follows.

$\mathrm{Precision} = TP/(TP + FP)$; $\mathrm{Recall} = TP/(TP + FN)$; $F\text{-value} = 2 \times P \times R / (P + R)$, where $P$ and $R$ refer to Precision and Recall, respectively.

The previous formulas define three evaluation measures called Precision, Recall, and F-value. Precision (also called positive predictive value) is the fraction of retrieved instances that are relevant, while Recall (also known as sensitivity) is the fraction of relevant instances that are retrieved. From the formulas above, we can decrease FP to increase Precision and increase TP to increase Recall, but in practice the two goals conflict. So we use the F-value to consider them comprehensively: only when Precision and Recall are both high will the F-value be high.
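The three measures can be computed directly from the confusion-matrix counts, as in this short sketch.

def precision_recall_f(tp, fp, fn):
    """Precision, Recall, and F-value from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value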

Another appropriate metric for measuring classification performance on imbalanced datasets is the Receiver Operating Characteristic (ROC) graphic [22]. In these graphics, the tradeoff between the benefits (TP) and the costs (FP) can be visualized, acknowledging the fact that no classifier can increase the number of true positives without also increasing the false positives. The area under the ROC curve (AUC) [23] corresponds to the probability of correctly identifying which of two stimuli is noise and which is signal plus noise. AUC provides a single-number summary of the performance of learning algorithms.
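For illustration, AUC can be computed from the scores a classifier assigns to the positive (minority) class, for example with scikit-learn; this library is used here only as a sketch and is not part of the Weka-based experimental setup described below.

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]                  # 1 = minority (positive) class
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]    # classifier scores for class 1
print(roc_auc_score(y_true, y_score))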

4.3. The Experimental Results and Analysis

In this paper, we use Recall, F-value, and AUC to evaluate our algorithm. The oversampling method SMOTE [7] and the classifiers C4.5, KNN, CART, and SVM [24] are used in our experiments; their implementations are provided by the Weka software [25]. We also use the Java programming language to implement some other oversampling methods, namely, ASMOTE [9], Borderline-SMOTE [10], and SMOTE-RSB* [16]. For an objective comparison, the minority class is over-sampled at 100% and the value of k is set to 5, as in SMOTE. All results are computed by 10-fold cross validation.

(1) NRSBoundary-SMOTE: Parameter Analysis. In the NRSBoundary-SMOTE algorithm, it is important to set a proper value of $w$. Here, we conduct a series of experiments to find the optimal parameter $w$, which controls the radius of the neighborhood. We try $w$ from 0 to 0.2 with step 0.01 and compute the F-value by 10-fold cross validation. Figure 4 presents the F-value curves varying with $w$ for some datasets: Pima, VC, Haberman, Transfusion, Colic, and CMC. From Figure 4, we find a similar trend in these curves: the F-value increases at first and decreases after a threshold. So we recommend that $w$ take values in the range [0.01, 0.05].
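The parameter study can be sketched as follows, assuming X and y are NumPy arrays and that an oversampler nrs_boundary_smote and an evaluation routine f_value_of are supplied as callables; both names are placeholders for routines not shown here.

import numpy as np
from sklearn.model_selection import StratifiedKFold

def sweep_w(X, y, nrs_boundary_smote, f_value_of):
    """Mean 10-fold F-value for each w in {0.00, 0.01, ..., 0.20}."""
    results = {}
    for w in np.arange(0.0, 0.21, 0.01):
        skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
        scores = []
        for train_idx, test_idx in skf.split(X, y):
            X_res, y_res = nrs_boundary_smote(X[train_idx], y[train_idx], w)
            scores.append(f_value_of(X_res, y_res, X[test_idx], y[test_idx]))
        results[round(float(w), 2)] = float(np.mean(scores))
    return results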

(2) Comparative Analysis on C4.5. Tables 3, 4, and 5 give the comparative results of Recall, F-value, and AUC, respectively, computed for the different oversampling methods. Furthermore, None represents the original dataset without resampling.

From Tables 3, 4, and 5, we can see that NRSBoundary-SMOTE has higher accuracy for most datasets. The average Recall increases to 0.7182 while the values of the other methods are between 0.6130 and 0.6886. The average F-value increases to 0.6978 while the values of the others are between 0.6505 and 0.6638. The average AUC reaches 0.7882 while the values of the others are between 0.7615 and 0.7695. NRSBoundary-SMOTE therefore has higher accuracy on all three evaluation measures than the other methods when the classifier is the decision tree C4.5. This shows that oversampling and strengthening the minority class samples in the boundary region is a feasible strategy.

SMOTE over-samples all the minority class samples. It can expand the decision space of the minority class, but it also decreases the decision space of the majority class. Although it improves the Recall of the minority class, many majority class samples are misclassified as the minority class, which decreases Precision. Thus, the F-value is not improved much.

ASMOTE, similar to SMOTE, considers the near neighbors of the majority class. It can reduce the conflict between synthetic samples and majority class samples and also expand the coverage space of the minority class samples. However, some of the synthetic samples are still similar to majority class samples, so the decision space of the majority class decreases as well.

Both Borderline-SMOTE and SMOTE-RSB* sift the synthetic samples more strictly than SMOTE, so few synthetic samples are generated when the datasets are highly imbalanced. Thus, compared with SMOTE, their improvement is not obvious.

NRSBoundary-SMOTE uses the neighborhood rough set model and emphasizes oversampling the minority class samples in the boundary region, thereby expanding the coverage space of the minority class samples in the boundary region. Furthermore, it can improve the confidence degree of the decision rules induced by the minority class samples in the boundary region (the uncertain area). What is more, it has little influence on the majority class samples in the lower approximation of the decision; in other words, it hardly changes the decision space of the majority class. Thus, the F-value is improved.

(3) Comparative Analysis on KNN, CART, and SVM. In addition, in order to test the validity of the proposed method on different classifiers, KNN, CART, and SVM are adopted on the 15 UCI datasets. The experimental results are shown in Figure 5, where the reported F-value is the average of the F-values on the 15 UCI datasets.

From Figure 5, we find that NRSBoundary-SMOTE has higher accuracy than the other methods when C4.5, CART, and KNN are used. In contrast, it is worse than SMOTE on SVM. C4.5, CART, and KNN all classify by measuring distances between an unknown sample and the training samples or rules, and the process of computing neighborhoods in NRSBoundary-SMOTE is similar to the way these classifiers work, so NRSBoundary-SMOTE performs better with them. But SVM works by constructing a separating hyperplane with the maximal margin, which is not taken into consideration by the NRSBoundary-SMOTE algorithm, so it brings no benefit to SVM.

In the NRSBoundary-SMOTE algorithm, one expands the decision space of the minority samples in the boundary region by oversampling them. The boundary region of the training sample set is computed based on the neighborhood-based rough set model. In this model, the distance between two samples is the key factor for classification, which suits C4.5, CART, and KNN. However, SVM works by constructing a separating hyperplane, and the main factor for classification is the distance from a sample to the hyperplane, not the distance between two samples. That is, the computation of the boundary region of the sample set has nothing to do with the hyperplane of SVM. Two samples are classified into the same category by SVM only when they lie on the same side of the hyperplane, whereas two close samples with different classes are regarded as belonging to the same class in the neighborhood-based rough set model. For example, for a minority sample $x$ near the hyperplane of SVM, let $x_{\mathrm{new}}$ be a synthetic sample generated around $x$. Since $x_{\mathrm{new}}$ is near $x$, it is labeled as the minority class by our algorithm. In fact, $x_{\mathrm{new}}$ may fall on the other side of the hyperplane, where it behaves like a majority sample. Obviously, the synthetic sample is then assigned a wrong class. Therefore, the proposed algorithm is not suitable for SVM, because the hyperplane is not considered during oversampling.

5. Conclusions

In this paper, we present a new oversampling method, called NRSBoundary-SMOTE, to process imbalanced datasets. In this method, only the minority class samples in the boundary region are over-sampled. It can expand the decision space of the minority class samples while having little influence on the decision space of the majority class samples. The experimental evaluation on 15 UCI datasets with different imbalance rates shows that the proposed method performs better than SMOTE when combined with C4.5, CART, and KNN, but SMOTE is better than NRSBoundary-SMOTE when using SVM. The proposed method is an effective method for oversampling. However, it spends more time filtering the synthetic samples, so it is difficult to process large datasets due to the long running time. Studying and developing new fast algorithms for oversampling will be our future work.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (NSFC) under Grant nos. 61073146, 61272060, 61203308, 61309014, and 61379114, Natural Science Foundation Project of CQ CSTC under Grant nos. cstc2012jjA40032, cstc2012jjA40047, cstc2013jcyjA40063, and cstc2013jcyjA40009, and Doctor Foundation of Chongqing University of Posts and Telecommunications under Grant no. A2012-08.