Special Issue: Complex System Modelling in Engineering Under Industry 4.0
HSDP: A Hybrid Sampling Method for Imbalanced Big Data Based on Data Partition
The classical classifiers are ineffective in dealing with the problem of imbalanced big dataset classification. Resampling the datasets and balancing samples distribution before training the classifier is one of the most popular approaches to resolve this problem. An effective and simple hybrid sampling method based on data partition (HSDP) is proposed in this paper. First, all the data samples are partitioned into different data regions. Then, the data samples in the noise minority samples region are removed and the samples in the boundary minority samples region are selected as oversampling seeds to generate the synthetic samples. Finally, a weighted oversampling process is conducted considering the generation of synthetic samples in the same cluster of the oversampling seed. The weight of each selected minority class sample is computed by the ratio between the proportion of majority class in the neighbors of this selected sample and the sum of all these proportions. Generation of synthetic samples in the same cluster of the oversampling seed guarantees new synthetic samples located inside the minority class area. Experiments conducted on eight datasets show that the proposed method, HSDP, is better than or comparable with the typical sampling methods for F-measure and G-mean.
1. Introduction
In the era of big data, the tremendous amount of data generated by various real-world applications brings new challenges to data mining. Among these challenges, the classification of imbalanced datasets has drawn interest in various application areas. A dataset is imbalanced when the number of samples in one category is much smaller than the number of samples in the other categories. If the samples come from two classes, the class with more samples is called the majority class, and the class with fewer samples is called the minority class. Our research focuses on the binary (two-class) classification problem, in which the prediction of minority samples is more important because the cost of misclassifying a minority sample is greater than the cost of misclassifying a majority sample. The binary classification of imbalanced data arises in various applications, such as medical diagnosis.
Most classifiers aim at maximizing the overall classification accuracy on a dataset. Therefore, when classifying imbalanced data, the classifier is biased toward the majority samples, which causes low classification accuracy on the minority class. In addition, when an imbalanced dataset is combined with other difficulty factors such as class overlapping, the presence of outliers, and small disjuncts, it becomes even more difficult for the classifier to predict the minority class. Figure 1(a) shows the skewed distribution between classes, Figure 1(b) shows class overlapping, and Figure 1(c) shows the small disjuncts of the minority class. Therefore, how to improve the classification accuracy of minority samples while ensuring the overall classification performance of the classifier on imbalanced data is an urgent problem to be solved.
The remainder of this paper is organized as follows. Section 2 presents related works. Section 3 describes the proposed HSDP method. Section 4 introduces the experimental settings. Section 5 presents the experimental results and compares our approach with some typical techniques. Finally, the conclusion is drawn in Section 6.
2. Related Works
The techniques proposed to improve classification for imbalanced data can be categorized into two major groups: data-level methods and algorithm-level methods. The algorithm-level methods modify the classifier in order to improve its accuracy on imbalanced data, and they mainly include cost-sensitive methods and ensemble learning methods. The data-level methods mainly include undersampling of the majority class and oversampling of the minority class. By comparison, the data-level methods are conducive to enhancing the generalization ability of the model, and the oversampling methods have a further advantage because they do not discard any sample information [3, 4]. SMOTE is one of the most popular oversampling algorithms. SMOTE first selects a random seed x from the minority samples and then randomly selects a sample y among its k nearest neighbors in the same class. Finally, a new synthetic sample s is generated by linear interpolation between x and y. This can be expressed as $s = x + \mathrm{gap} \cdot (y - x)$, where gap is a random number between 0 and 1.
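The SMOTE interpolation step can be illustrated with a minimal sketch. It uses only the Python standard library; the function names `k_nearest` and `smote_sample` and the toy data are hypothetical and chosen for illustration, and the sketch is not a reference implementation of SMOTE.

```python
import math
import random

def k_nearest(seed, samples, k):
    """Return the k samples closest to `seed` (Euclidean distance), excluding the seed object itself."""
    others = [s for s in samples if s is not seed]
    return sorted(others, key=lambda s: math.dist(seed, s))[:k]

def smote_sample(minority, k=5, rng=random):
    """Generate one synthetic sample s = x + gap * (y - x)."""
    x = rng.choice(minority)                   # random minority seed
    y = rng.choice(k_nearest(x, minority, k))  # random neighbor of the seed in the same class
    gap = rng.random()                         # random number in [0, 1)
    return tuple(xi + gap * (yi - xi) for xi, yi in zip(x, y))

# Toy minority class with four two-dimensional samples (illustrative values only).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
synthetic = smote_sample(minority, k=2, rng=random.Random(0))
```

Because the synthetic sample lies on the line segment between the seed and one of its neighbors, it always falls inside the bounding box of the minority samples, but not necessarily inside a region free of majority samples.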
Although the SMOTE algorithm has performed well in various classification scenarios, it also has some weaknesses: (1) If a noisy sample is selected as the seed, the oversampling process may generate more noisy samples. (2) It does not consider the data distribution when generating synthetic samples, thereby increasing the overlap between different classes. (3) It oversamples uninformative minority samples because it chooses the minority seed sample with uniform probability, although the minority samples in the boundary area contain more information than those far from the boundary. Therefore, researchers have proposed improved versions of SMOTE. The Borderline-SMOTE algorithm oversamples the borderline minority samples. However, Borderline-SMOTE sometimes generates new synthetic samples in unsuitable areas, such as noise regions and overlapped areas. The ADASYN algorithm pays more attention to the minority samples that are difficult to learn. It adaptively generates minority samples according to the ratio of majority class samples among the neighbors of each minority sample. The K-means-SMOTE algorithm combats both between-class imbalance and within-class imbalance, but it does not provide a strategy for determining the optimal number of clusters k, which has a great impact on the performance of oversampling. The MWMOTE technique analyzes the hard-to-learn minority samples and assigns them weights according to their importance in learning.
In summary, the methods above have mitigated some of the problems of SMOTE, but none of them has effectively solved all three problems. The proposed hybrid sampling method based on data partition therefore attempts to overcome all three. It selects proper minority samples for oversampling and improves the synthetic sample generation scheme, which covers both the number of synthetic samples generated for each selected minority sample and the control of the location of the generated samples in the data space.
3. Proposed Method
The data samples present different distribution characteristics in the data space, and this distribution can be taken into account when undersampling or oversampling. Because using different sampling methods in different regions may improve classification performance, we propose the hybrid sampling method for imbalanced data based on data partition (HSDP). The method consists of four stages: (1) partitioning the space of the input imbalanced data into five regions; (2) removing the samples in the noise minority samples region; (3) clustering the minority samples with an agglomerative hierarchical clustering method; and (4) oversampling. In the first stage, the data space is divided into five regions: the boundary minority samples region, the noise minority samples region, the safe minority samples region, the boundary majority samples region, and the safe majority samples region. The first two stages are performed because our aim is to oversample the borderline minority samples while ignoring the noisy minority samples; the basic idea is that the borderline samples are apt to be misclassified. In the third stage, the minority samples are clustered so that the generated samples fall inside the minority class regions. In the fourth stage, the oversampling process adaptively generates synthetic samples for each borderline minority sample within the cluster of that oversampling seed, which ensures that the generated samples are located inside the minority class regions.
3.2. Data Partition
According to the proportion of minority samples in the neighborhood of each minority sample, the data space is divided into five regions: the boundary minority samples region, the noise minority samples region, the safe minority samples region, the boundary majority samples region, and the safe majority samples region, as shown in Figure 2.
Given an imbalanced training dataset S and the minority class label, the training dataset is first divided into a majority class set and a minority class set. Then, for each sample xi in the minority class set, we find the k neighbors of xi with the K-nearest neighbor algorithm. Next, the number of minority class samples among these k neighbors is counted, and the majority class samples among them are put into the boundary majority samples region. Finally, according to this count, each minority class sample is added to the corresponding region: the noise minority samples region, the safe minority samples region, or the boundary minority samples region. The safe majority samples region and the training set with the noise minority samples removed are determined at the end. The DP algorithm for the data partitioning is described in Algorithm 1.
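The partitioning procedure can be sketched as follows. This is an illustration under stated assumptions and not a transcription of Algorithm 1: the function name `partition`, the region keys, and the toy data are hypothetical. The thresholds are also assumed, in the style of Borderline-SMOTE: a minority sample with no minority neighbors is treated as noise, one with more than k/2 minority neighbors as safe, and the rest as boundary.

```python
import math

def partition(samples, labels, minority_label, k=5):
    """Split sample indices into five regions (the thresholds below are assumptions)."""
    regions = {"noise_min": [], "safe_min": [], "boundary_min": [],
               "boundary_maj": set(), "safe_maj": set()}
    for i, x in enumerate(samples):
        if labels[i] != minority_label:
            continue
        # Indices of the k nearest neighbors of x among all other samples.
        neighbors = sorted((j for j in range(len(samples)) if j != i),
                           key=lambda j: math.dist(x, samples[j]))[:k]
        n_min = sum(labels[j] == minority_label for j in neighbors)
        # Majority neighbors of a minority sample go to the boundary majority region.
        regions["boundary_maj"].update(j for j in neighbors
                                       if labels[j] != minority_label)
        if n_min == 0:            # assumed rule: no minority neighbors -> noise
            regions["noise_min"].append(i)
        elif n_min > k / 2:       # assumed rule: mostly minority neighbors -> safe
            regions["safe_min"].append(i)
        else:                     # otherwise -> boundary
            regions["boundary_min"].append(i)
    all_maj = {j for j, label in enumerate(labels) if label != minority_label}
    regions["safe_maj"] = all_maj - regions["boundary_maj"]
    return regions

# Toy data: label 1 is the minority class, label 0 the majority class.
samples = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),  # compact minority group
           (1.0, 0.0), (1.1, 0.0), (1.0, 0.1),              # minority point next to majority points
           (5.0, 5.0), (4.9, 5.0), (5.1, 5.0), (5.0, 4.9),  # isolated minority point among majority
           (9.0, 9.0)]                                      # majority point far from any minority
labels = [1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
regions = partition(samples, labels, minority_label=1, k=3)
```

With these toy data and k = 3, the isolated minority point is flagged as noise, the minority point adjacent to the majority points is the only boundary sample, and the remote majority point is the only safe majority sample.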
3.3. Clustering Minority Class Samples Based on Hierarchical Clustering
Most of the existing oversampling methods are K-NN based approaches. When a synthetic sample is generated from the minority class sample B with k = 5, sample A may be chosen as the neighbor (as shown in Figure 3). In this way, the synthetic sample (shown by a square) may be located in the majority class region.
Our proposed method instead chooses the second sample from the same cluster (Cluster1) as B. This ensures that A will not be chosen, because B and A are not in the same cluster. Thus, the oversampling process is performed in a safe range, and the generated minority samples are located inside the minority class region.
The hierarchical clustering algorithm is used to cluster the minority class samples in this work. The key steps of the agglomerative hierarchical clustering algorithm are as follows:
(1) Initially, assign each data sample to its own cluster.
(2) Find the two closest clusters and merge them into a single cluster, which reduces the total number of clusters by one.
(3) Compute the distance between the newly generated cluster and all the other clusters.
(4) Repeat steps 2-3 until a termination condition is reached. The termination condition is either a preset number of clusters or a distance threshold.
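The basic agglomerative procedure with a distance threshold as the termination condition can be sketched as below. The linkage criterion is not specified in the text, so average linkage is assumed here; the function names and the toy points are hypothetical. The sketch covers only the plain algorithm, not the modified version that also examines the majority clusters.

```python
import math
from itertools import combinations

def cluster_distance(a, b):
    """Average pairwise Euclidean distance between two clusters (assumed linkage)."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def agglomerative(points, threshold):
    """Merge the closest clusters until their distance exceeds `threshold`."""
    clusters = [[p] for p in points]  # step 1: one cluster per sample
    while len(clusters) > 1:
        # Steps 2-3: recompute the cluster distances and pick the closest pair (i < j).
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]))
        if cluster_distance(clusters[i], clusters[j]) > threshold:
            break                     # step 4: stop at the distance threshold
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]               # safe because j > i
    return clusters

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
clusters = agglomerative(points, threshold=3.0)
```

This naive version recomputes all pairwise cluster distances in every iteration, which keeps the code short but is slower than implementations that cache the distance matrix.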
However, when the agglomerative hierarchical clustering algorithm is used to cluster the minority class samples, the decision whether to merge two minority clusters should examine not only the distance between the minority clusters but also the distribution of the majority class samples. Suppose that the distances from a certain majority class cluster to the two minority clusters are d1 and d2, respectively. When d1 and d2 meet the corresponding conditions relative to the distance between the two minority class clusters, these two minority class clusters cannot be merged.
Therefore, modifications to the agglomerative hierarchical clustering algorithm have been made. First, the agglomerative hierarchical clustering algorithm is used to cluster the majority class samples to obtain the majority cluster set. Then, the minority class samples are clustered. The minority class cluster algorithm based on the hierarchical clustering algorithm (MDH) is described in Algorithm 2.
Because the distance threshold is the termination condition of the clustering process, its setting is particularly critical. In this work, the threshold is computed from two parameters.
The first parameter is the average distance from each minority class sample to the other minority samples in the set S. The second parameter, r, is used to tune the output distance of the clustering algorithm; the choice of r is analyzed in Section 5.
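One possible way to compute such a threshold is sketched below. Two points are assumptions and are not confirmed by the text: the average distance is taken as the mean of all pairwise distances between distinct minority samples, and the threshold is formed as the product of r and this average. The function names are hypothetical.

```python
import math

def average_minority_distance(minority):
    """Mean Euclidean distance over all ordered pairs of distinct minority samples (assumed definition)."""
    n = len(minority)
    total = sum(math.dist(p, q)
                for i, p in enumerate(minority)
                for j, q in enumerate(minority) if i != j)
    return total / (n * (n - 1))

def distance_threshold(minority, r=1.0):
    """Assumed form of the threshold: r times the average minority distance."""
    return r * average_minority_distance(minority)

minority = [(0.0, 0.0), (3.0, 4.0)]
threshold = distance_threshold(minority, r=0.8)
```

Under this assumed form, a smaller r gives a smaller threshold, so merging stops earlier and more clusters remain, which agrees with the role of r described in the parameter analysis.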
3.4. Description of Hybrid Sampling Algorithm Based on Data Partition
We propose a hybrid sampling algorithm based on data partition. First, the boundary region is obtained by the DP algorithm. Then, the total number of synthetic samples to be generated in the boundary region is calculated. Next, the weight of each sample in the boundary minority samples region is computed as the ratio between the proportion of majority class samples among the neighbors of this sample and the sum of these proportions over all boundary minority samples. Finally, for each sample xi in the boundary minority samples region, synthetic samples are generated in the same cluster as xi. The hybrid sampling algorithm based on data partition (HSDP) is given as Algorithm 3.
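The weighting and the within-cluster generation can be sketched as follows. The weight follows the description above, w_i = (m_i / k) / Σ_j (m_j / k), where m_i is the number of majority samples among the k neighbors of boundary sample i. The rest is assumed for illustration: the rounding used to turn weights into sample counts, the handling of single-sample clusters, and all function names are hypothetical. The total number of synthetic samples is passed in as a parameter because its formula is not given here.

```python
import random

def seed_weights(majority_counts, k):
    """Weight of each boundary seed: its majority proportion divided by the sum of all proportions."""
    proportions = [m / k for m in majority_counts]
    total = sum(proportions)
    return [p / total for p in proportions]

def allocate(weights, n_synthetic):
    """Number of synthetic samples per seed (rounding is an assumption)."""
    return [round(w * n_synthetic) for w in weights]

def generate_in_cluster(seed, cluster, n, rng=random):
    """Interpolate between the seed and random members of its own cluster."""
    partners = [p for p in cluster if p != seed]
    if not partners:                  # assumed fallback: a single-sample cluster
        return [seed] * n
    synthetic = []
    for _ in range(n):
        y = rng.choice(partners)
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(seed, y)))
    return synthetic

# Three boundary seeds with 2, 3, and 5 majority samples among their k = 5 neighbors.
weights = seed_weights([2, 3, 5], k=5)
counts = allocate(weights, n_synthetic=10)
cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_samples = generate_in_cluster((0.0, 0.0), cluster, counts[2], rng=random.Random(1))
```

Seeds surrounded by more majority samples receive larger weights and therefore more synthetic samples, while restricting the interpolation partners to the seed's own cluster keeps the new samples between members of that cluster.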
3.5. The Time Complexity Analysis of HSDP Algorithm
In the DP algorithm, each minority sample needs to calculate its distance to all the other samples in order to find its neighbors. Therefore, the time complexity of calculating the distances is proportional to the product of the number of minority samples and the total number of samples.
In the MDH algorithm, the distance between each minority cluster and the majority clusters needs to be calculated. Suppose that the current number of minority clusters is and the number of majority clusters is . The time complexity of calculating the distance between minority clusters is , and find the two minority clusters with the smallest distance. Then, calculate the distance between the majority cluster and the two minority clusters with the smallest distance. So, the calculation time complexity to get the distance between the majority cluster and the minority cluster is , and then determine whether to merge the two minority clusters according to the distance between the clusters, and the possible number of merger time is . Therefore, the time complexity of the MDH algorithm is .
In the HSDP algorithm, assuming that the number of boundary minority samples is , the number of minority class samples is , and the number of majority class samples is , and the time complexity the step of determining the k neighbors of boundary minority samples is . In the step of sample generation, the computational time complexity is .
According to the analysis of the above steps, the time complexity of the HSDP algorithm should be .
4. Experiment Setup
4.1. Dataset Description
We test our algorithm on 8 imbalanced datasets from various fields. All these datasets are available from the KEEL Repository and the UCI Repository. Table 1 describes these datasets.
In this study, we address binary classification. In the two-class problem, the majority samples are usually marked as negative samples, and the minority samples are marked as positive samples.
4.2. Evaluation Metrics
For the classification of imbalanced data, the overall classification accuracy is not suitable for evaluating classifier performance, because a classification algorithm may achieve a better overall accuracy at the expense of a large prediction error on the minority class. Therefore, F-measure and G-mean are usually used to evaluate the performance of imbalanced classification algorithms.
F-measure and G-mean are calculated based on the confusion matrix, as shown in Table 2.
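Based on the confusion matrix, where TP and FN are the numbers of correctly and incorrectly classified minority (positive) samples and TN and FP are the numbers of correctly and incorrectly classified majority (negative) samples, the following equations are derived:
$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP},$$
$$F\text{-measure} = \frac{(1 + \beta^{2}) \cdot \mathrm{Recall} \cdot \mathrm{Precision}}{\beta^{2} \cdot \mathrm{Precision} + \mathrm{Recall}}, \qquad (6)$$
$$G\text{-mean} = \sqrt{\mathrm{Recall} \cdot \mathrm{Specificity}}. \qquad (7)$$
F-measure is computed as shown in formula (6). It is the harmonic mean of Recall and Precision, so a high F-measure indicates that both Recall and Precision are high; β is a coefficient that indicates the relative importance of Recall and Precision (usually β = 1). G-mean is calculated as shown in formula (7). It is the geometric mean of the minority class accuracy and the majority class accuracy, and it assigns equal importance to the performance of the classifier on the minority class and on the majority class.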
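Both metrics can be computed directly from the four confusion-matrix counts, as the following minimal sketch shows. It uses the standard definitions Recall = TP / (TP + FN), Precision = TP / (TP + FP), and majority class accuracy TN / (TN + FP). The function names and the example counts are hypothetical, and the sketch assumes that no denominator is zero.

```python
import math

def f_measure(tp, fn, fp, beta=1.0):
    """Weighted harmonic mean of Recall and Precision (beta = 1 gives the usual F1)."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def g_mean(tp, fn, fp, tn):
    """Geometric mean of the minority class accuracy and the majority class accuracy."""
    minority_accuracy = tp / (tp + fn)
    majority_accuracy = tn / (tn + fp)
    return math.sqrt(minority_accuracy * majority_accuracy)

# Illustrative counts: 10 minority samples and 90 majority samples.
f1 = f_measure(tp=8, fn=2, fp=4)
g = g_mean(tp=8, fn=2, fp=4, tn=86)
accuracy = (8 + 86) / 100
```

In this example the overall accuracy is 0.94 even though two of the ten minority samples are missed, which illustrates why accuracy alone can hide errors on the minority class.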
5. Experiment and Result Analysis
The experiment platform is Anaconda. Since the purpose is evaluating the proposed sampling method, we do not choose any special classifier; rather, we apply several of them such as KNN and RF. In order to compare the performance of our proposed hybrid sampling method (HSDP) against the other techniques, comparative experiments were carried out, including SMOTE, ADASYN, and Borderline-SMOTE.
5.1. Analysis of Experimental Results
The best F-measure and G-mean results on each dataset are shown in boldface in the above tables. It is evident that KNN and RF combined with a sampling method perform better than the same classifiers without sampling in most cases. On F-measure, the HSDP algorithm obtained the best result on 5 of the 8 datasets, and on G-mean it obtained the best result on 6. This shows that the HSDP algorithm proposed in this paper can improve the classification of the minority class.
Compared with the SMOTE method and the ADASYN method, (1) the HSDP method does not oversample all minority class samples but focuses on the minority samples in the boundary area, which are more important for classification, and (2) the HSDP method removes the noise data, thus avoiding the generation of noisy samples.
In contrast to Borderline-SMOTE, our proposed HSDP method considers not only the importance of the minority class samples in the boundary area but also the distribution characteristics of the data samples, which avoids generating synthetic samples in wrong locations.
5.2. Analyzing the Influence of the Parameter Value Used in HSDP Algorithm
The parameters involved in the proposed method (HSDP) include the number of neighbor samples k and the distance adjustment factor r.
The value of k cannot be too small, because the boundary minority class samples would then be treated as noisy data and deleted by mistake.
The value of r is used to control the number of clusters. With a smaller r value, the number of clusters increases and the number of samples in each cluster decreases, which reduces the diversity of the synthesized samples.
In order to determine the optimal value ranges of r and k, we use Pima, Glass5, and Yeast3 as the test datasets. For k (k = 3, 5, 7, 9, and 11), the G-mean values are shown in Table 7. Pima obtains the maximum G-mean when k is 5, Glass5 when k is 9, and Yeast3 when k is 7. It is evident that a value of k in the range of 5–9 is appropriate. For r (r = 0.6, 0.8, 1.0, 1.2, and 1.6), the G-mean values are shown in Table 8. Pima achieves the maximum G-mean when r is 0.8, Glass5 when r is 1.0, and Yeast3 when r is 1.2. It can be seen that a value of r in the range of 0.8–1.2 is appropriate.
6. Conclusion
Data resampling is one of the effective approaches to imbalanced data classification. Aiming at the problems of undersampling and oversampling methods, this paper proposes a hybrid sampling method, HSDP, based on data partition. The method applies an appropriate sampling strategy to the samples in each region and assigns a reasonable weight to each boundary minority sample. Furthermore, it generates the synthetic samples for the selected seeds inside the minority class area of the data space. Experiments confirmed the effectiveness of the proposed method for imbalanced data classification; however, the values of the parameters used in the algorithm were selected through repeated experiments. Future research will address how to determine the parameter values of HSDP adaptively for different datasets.
We use datasets from KEEL Repository and UCI Repository; our method and related parameters are provided in our paper.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
This work was supported by the Natural Science Foundation of Department of Education of Anhui Province, China (no. KJ2017A452).