Abstract
To address the low classification accuracy on imbalanced datasets, an oversampling algorithm, AGNES-SMOTE (Agglomerative Nesting-Synthetic Minority Oversampling Technique), based on hierarchical clustering and an improved SMOTE, is proposed. Its key procedures are as follows: hierarchically cluster the majority samples and the minority samples separately; divide the minority subclusters on the basis of the obtained majority subclusters; select “seed samples” based on the sampling weight and probability distribution of each minority subcluster; and restrict the generation of new samples to a certain area by the centroid method during sampling. AGNES-SMOTE is combined with an SVM (Support Vector Machine) to classify imbalanced datasets. Experiments on UCI datasets compare its performance with that of other algorithms from the literature. The results indicate that AGNES-SMOTE performs well in synthesizing new samples and improves SVM classification performance on imbalanced datasets.
1. Introduction
An imbalanced dataset is one in which some classes have far fewer instances than others. In the two-class case, the class with fewer samples is referred to as the minority class, and the class with more samples is the majority class [1]. Imbalanced data classification arises in many real scenarios, such as credit card fraud detection, information retrieval and filtering, and market analysis [2]. Conventional classifiers typically favor the majority class, which gives rise to classification errors. The imbalance of sample sizes between two classes is regarded as between-class imbalance, and an uneven data distribution density within one class is within-class imbalance. Within-class imbalance forms multiple subclasses that belong to the same class but have different data distributions [3, 4]. Both kinds of imbalance cause classification errors. In addition, oversampling algorithms often cause problems such as overlapping synthetic samples [5] and samples distributed “marginally” [6], which reduce classification performance. Therefore, how to improve conventional algorithms so as to classify imbalanced datasets and raise classification performance has become a research focus in data mining and machine learning.
Research on imbalanced datasets mainly works at two levels: data processing and classification algorithms [7, 8]. Cost-sensitive learning [9] and ensemble learning [10] are representative algorithm-level approaches. The most frequently used data-level methods are oversampling and undersampling, which balance the two classes by increasing minority samples and decreasing majority samples, respectively. Data-level sampling methods are usually simple and intuitive. Undersampling usually causes information loss, whereas oversampling balances the dataset while keeping all the original samples. Thus, oversampling is often adopted in data classification.
At present, the most frequently used oversampling method is the SMOTE algorithm proposed by Chawla et al. [11] in 2002. It created new synthetic samples by linear interpolation between sample x and sample y, where x was an existing minority sample and y was another minority sample picked randomly from the nearest neighbors of x. This algorithm neglected the uneven data distribution within the minority class and the possibility that synthesized samples overlap. Han et al. [12] proposed the Borderline-SMOTE algorithm in 2005, which divided minority samples into a boundary area, a safe area, and a dangerous area. This algorithm synthesized samples only from the boundary area, which avoided both the indiscriminate selection of minority samples and the many redundant new samples produced by SMOTE. In the ADASYN algorithm, proposed by He et al. [13], the number of samples to be generated from each minority sample was determined automatically from the data distribution: minority samples with more neighboring majority samples generated more samples. Compared with SMOTE, it distinguished the sample distribution in finer detail. Cluster-SMOTE [14] adopted the K-means algorithm to cluster minority samples, found the minority subclusters, and then applied SMOTE to each of them. However, this algorithm did not determine the optimal size of the subclusters and did not calculate the number of samples to be generated by each subcluster. K-means-SMOTE [15] combined the K-means clustering algorithm with SMOTE. Unlike Cluster-SMOTE, K-means-SMOTE clustered the entire dataset, found the overlap and avoided oversampling in unsafe areas, restricted the synthetic samples to the target area, and reduced within-class and between-class imbalance. It also avoided noise samples and attained good results.
CBSO [16] combined clustering with the data generation mechanism of existing oversampling techniques to ensure that the generated synthetic samples always lay in the minority class area and to avoid generating erroneous samples. Although the above-mentioned oversampling methods improve classification accuracy to a certain extent, they have the following deficiencies: (1) When oversampling, much attention has been paid to between-class imbalance, while less attention has been paid to within-class imbalance. (2) Clustering can address both between-class and within-class imbalance, but it exacerbates the aliasing between the two classes, which leads to new overlapping synthetic samples. Moreover, the conventional k-means clustering algorithm requires the value of k to be set in advance, is effective mainly for spherical datasets, and is relatively complex. (3) The minority class boundary is not maximized, which affects the quality of synthetic samples. (4) The destination area of synthetic samples is not restricted, so synthetic samples are distributed marginally. (5) Noise samples interfere with sampling.
Based on the above discussion, this paper proposes an improved oversampling method, AGNES-SMOTE. Its procedures are as follows. First, filter noise samples, adopt the AGNES algorithm to cluster the minority samples, form the minority subclusters iteratively, and consider the distribution of the majority samples during merging to avoid generating overlapping synthetic samples. Repeat this operation until the distance between the two closest minority subclusters exceeds the set threshold. Then, determine the sampling weights according to the sample size of each minority subcluster, calculate the probability distribution of each minority subcluster according to the distances between the minority samples and their neighboring majority samples, and combine the two to select “seed samples” for oversampling. Finally, restrict the generation of new samples to a certain area by the centroid method during synthesis: select a sample from all “seed samples,” randomly select two neighboring minority samples from the subcluster in which the selected seed sample is located, form a triangle with the three selected samples, and synthesize new samples on the lines from the three samples to the centroid. Compared with other algorithms, AGNES-SMOTE attains better results in the experiments.
2. Preliminary Theory
2.1. SMOTE Algorithm
The SMOTE algorithm alleviates data imbalance by artificially synthesizing new minority samples, and it measures the distance between two samples by the Euclidean distance. For samples X = (x_{1}, x_{2}, …, x_{n}) and Y = (y_{1}, y_{2}, …, y_{n}), where the subscripts index the sample dimensions, the Euclidean distance D between X and Y is
$$D(X, Y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2} \tag{1}$$
For each sample X in the minority class, find its K nearest neighbors and randomly select N samples from them. Then interpolate between the original sample and each of its selected neighbors. The formula is as follows:
$$X_{new} = X + rand(0, 1) \times (Y_i - X), \quad i = 1, 2, \ldots, N \tag{2}$$
where X_{new} is a new synthetic sample; X is the selected original sample; rand(0, 1) is a random number between 0 and 1; and Y_{i} is the i^{th} of the N samples selected from the K nearest neighbors of the original sample X.
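The interpolation step can be illustrated with a minimal Python sketch. It assumes the standard SMOTE rule X_{new} = X + rand(0, 1) × (Y_{i} − X); the function names `euclidean` and `smote` and the parameters `k`, `n`, and `rng` are illustrative choices, not names taken from the paper.

```python
import math
import random

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def smote(minority, k=5, n=1, rng=None):
    """Create up to n synthetic samples per minority sample by interpolation."""
    rng = rng or random.Random()
    synthetic = []
    for i, x in enumerate(minority):
        # K nearest minority neighbours of x, excluding x itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: euclidean(x, p))[:k]
        # Randomly pick N of the K neighbours (fewer if not enough exist).
        for y in rng.sample(neighbours, min(n, len(neighbours))):
            gap = rng.random()  # random number in [0, 1)
            synthetic.append([a + gap * (b - a) for a, b in zip(x, y)])
    return synthetic
```

Every synthetic point lies on the segment between a minority sample and one of its neighbours, which is why plain SMOTE can place points near or beyond the class margin when a neighbour is itself an outlier.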
2.2. AGNES Algorithm
The conventional AGNES algorithm is a bottom-up (agglomerative) hierarchical clustering algorithm. It initially treats every data object as a cluster and gradually merges the clusters according to a given criterion. For example, if the distance between two data objects in different clusters is the smallest, the two clusters may be merged. The merging is repeated until a termination condition is met. In AGNES, the distance between two clusters is the distance between the closest data objects in the two clusters, so a cluster is represented by all the objects in it.
Compared with conventional centroid-based clustering, the AGNES algorithm is simpler to apply, does not depend on selected initial values, and is not restricted by the shape of the sample distribution. It can also aggregate all samples together. Considering the influence of between-class and within-class imbalance on model performance, the AGNES algorithm is better suited to the uneven data distribution of within-class imbalance.
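As a rough illustration of the merging loop described above, the following Python sketch implements agglomerative clustering with the closest-pair (single-link) distance and a distance threshold as the termination condition. The names `single_link`, `agnes`, and `threshold` are illustrative assumptions; the brute-force search is chosen for clarity rather than speed.

```python
import math

def single_link(c1, c2):
    """Distance between the closest pair of points from two clusters."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agnes(points, threshold):
    """Merge the two closest clusters until their distance exceeds threshold."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:  # termination condition
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

No number of clusters has to be fixed in advance; the threshold alone decides where merging stops.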
3. Improved SMOTE Algorithm
3.1. Divide Minority Clusters
The AGNES-SMOTE algorithm is proposed in this paper to refine SMOTE and its improved variants. The algorithm first filters noise samples and then uses AGNES to cluster the samples and divide the dataset into subclusters. In the clustering process, this paper uses the average distance method to calculate the distance between two subclusters. The two closest subclusters are merged to form a new subcluster, which reduces the number of subclusters by one. Merging then continues with the next two closest subclusters and stops when the distance between the subclusters exceeds the set threshold. To avoid generating overlapping samples, the distribution of the majority samples needs to be considered.
Before clustering the minority samples with AGNES, cluster the majority samples first to obtain a set of majority subclusters. The subclusters in this set represent the majority class. Then, compare the distances between the majority class and the minority class. If the distances between a majority subcluster and each of two minority subclusters are both less than the distance between the two minority subclusters, merging them would produce overlapping samples, so the two minority subclusters should not be merged. The specific steps are as follows:
Step 1. Given the original dataset I, use K-nearest neighbors to filter the noise samples in dataset I. Set K = 5 and traverse the samples in dataset I. If more than 4/5 of the K nearest neighbors of a sample belong to the opposite class, judge the sample to be a noise sample and eliminate it. The remaining samples constitute the cleaned sample set.
Step 2. Cluster the majority samples in the cleaned sample set: treat each sample as an independent subcluster, use formula (3) to calculate the distance between subclusters, merge the two closest subclusters, and repeat until the distance reaches the preset threshold T_{h}. This yields the majority subcluster set C^{maj}:
$$D(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i} \sum_{q \in C_j} \lVert p - q \rVert \tag{3}$$
In this formula, p and q are samples in subclusters C_{i} and C_{j}, respectively, and |C_{i}| and |C_{j}| are their sample sizes.
Step 3. Divide the minority samples according to the obtained majority subcluster set C^{maj}: treat each minority sample as a separate subcluster and obtain the minority subcluster set C^{min}.
Step 4. Calculate the distance between every two minority subclusters with formula (3), and record the minimum distance D_{min} and the corresponding subcluster numbers i and j.
Step 5. Traverse the majority subclusters in the set C^{maj}. If there is a majority subcluster whose distances to minority subclusters C_{i} and C_{j} are both less than the distance between the two minority subclusters, C_{i} and C_{j} are not merged, and the minimum distance D_{min} is set to a large value so that this pair is not considered in later merging. Otherwise, C_{i} and C_{j} are merged into a new minority subcluster, and the number of minority subclusters is reduced by one.
Step 6. After a new minority subcluster has been formed, recalculate the distances between it and the remaining minority subclusters in C^{min} with formula (3). Repeat Step 3 to Step 5 until the distance between the nearest minority subclusters exceeds the set threshold T_{h}; then, stop merging minority subclusters and obtain the final minority subcluster set.
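The noise filter of Step 1 and the merge check of Step 5 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names, the `ratio` parameter, and the brute-force neighbour search are illustrative, and "more than 4/5" is read literally as a strict inequality, so with K = 5 a sample is dropped only when all five neighbours belong to the other class.

```python
import math

def average_link(c1, c2):
    """Average pairwise Euclidean distance between two subclusters."""
    total = sum(math.dist(p, q) for p in c1 for q in c2)
    return total / (len(c1) * len(c2))

def filter_noise(samples, labels, k=5, ratio=0.8):
    """Return indices of samples kept after K-nearest-neighbour noise filtering."""
    keep = []
    for i, (x, lab) in enumerate(zip(samples, labels)):
        order = sorted((j for j in range(len(samples)) if j != i),
                       key=lambda j: math.dist(x, samples[j]))[:k]
        opposite = sum(1 for j in order if labels[j] != lab)
        # Keep the sample unless strictly more than `ratio` of its
        # neighbours belong to the opposite class.
        if not order or opposite / len(order) <= ratio:
            keep.append(i)
    return keep

def blocked_by_majority(ci, cj, majority_clusters):
    """True if a majority subcluster is closer to both ci and cj than they are to each other."""
    d_ij = average_link(ci, cj)
    return any(average_link(m, ci) < d_ij and average_link(m, cj) < d_ij
               for m in majority_clusters)
```

In a full implementation, `blocked_by_majority` would be called on the closest pair of minority subclusters before each merge, and a blocked pair would be skipped in later iterations.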
The threshold is the key condition for merging subclusters. To estimate the threshold T_{h}, first define a value d_{avg}:
$$d_{avg} = \frac{1}{n} \sum_{a=1}^{n} \operatorname{median}_{b \neq a} \lVert x_a - x_b \rVert \tag{4}$$
In the formula, x_{a} and x_{b} are samples in a minority subcluster and n is the sample size of this subcluster. The median term is the median distance d between a sample in the minority subcluster and the rest of its samples, and d_{avg} is the average of these median distances. Taking the average of the median distances as the reference value reduces the interference of noise samples. The threshold T_{h} is then defined as follows:
$$T_h = f \times d_{avg} \tag{5}$$
Parameter f is the distance adjustment factor, which adjusts the threshold T_{h}. The value range of parameter f is discussed in Section 4.2.2.
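A minimal Python sketch of the threshold estimate follows. It assumes T_{h} = f × d_{avg}, with d_{avg} the average over samples of each sample's median distance to the other samples, as described in the text; the function name `merge_threshold` is illustrative.

```python
import math
import statistics

def merge_threshold(samples, f=1.0):
    """T_h = f * d_avg, where d_avg averages each sample's median distance to the rest."""
    if len(samples) < 2:
        return 0.0
    medians = []
    for a, xa in enumerate(samples):
        dists = [math.dist(xa, xb) for b, xb in enumerate(samples) if b != a]
        medians.append(statistics.median(dists))
    return f * statistics.mean(medians)
```

For the one-dimensional points 0, 1, and 3, the per-sample medians are 2, 1.5, and 2.5, so d_{avg} = 2 and the threshold with f = 0.5 is 1.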
3.2. Determine Sampling Weight and Probability Distribution
In classification tasks, both within-class and between-class imbalance affect model performance. The density of each subcluster varies with its sample size, and the sampling weight of each minority subcluster is determined by its density. A small weight is set for a dense subcluster and a large weight for a sparse subcluster to avoid overfitting. Thus, the sampling weights assigned to the minority subclusters vary with their sizes and are denoted as W_{i}:
N is the number of minority subclusters, and num_{i} is the sample size of the i^{th} minority subcluster. Formula (6) shows that the larger a minority subcluster is, the larger its share of all minority samples and the smaller W_{i}; that is, both the assigned weight and the number of synthetic samples become smaller, which eventually balances the sample distribution within the class.
As shown in formula (7), the number of samples to be synthesized for each minority subcluster is determined by W_{i} (the sampling weight of the subcluster) and the difference between the numbers of majority and minority samples after noise samples have been excluded:
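The weighting idea can be sketched in Python as follows. The exact form of formulas (6) and (7) is not reproduced in this text, so the sketch assumes one form that matches the description: each weight is inversely proportional to the subcluster's share of the minority samples, the weights are normalised to sum to 1, and the sampling size is the weight times the majority-minority gap. The function names and the rounding are illustrative assumptions.

```python
def sampling_weights(sizes):
    """Weights inversely related to subcluster size, normalised to sum to 1 (assumed form)."""
    total = sum(sizes)
    inverse = [total / s for s in sizes]  # inverse of each subcluster's share
    norm = sum(inverse)
    return [v / norm for v in inverse]

def samples_to_generate(sizes, n_majority):
    """Split the majority-minority gap across subclusters according to the weights."""
    gap = max(n_majority - sum(sizes), 0)
    return [round(w * gap) for w in sampling_weights(sizes)]
```

With subcluster sizes 10 and 30 and 100 majority samples, the gap of 60 is split 45 to 15 in favour of the sparser subcluster under this assumed form.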
In addition, when classifying samples, the minority samples closer to the decision boundary are more prone to be misclassified, increasing the learning difficulty of minority samples. Therefore, it is necessary to select samples for oversampling. To ensure the quality of synthetic samples, the probability distribution of minority subclusters is introduced to select “seed samples” from minority samples with important but difficult information. The probability of each sample being selected is set as D_{i}:
The probability distribution of minority subclusters is
In this equation, y_{b} is the b^{th} nearest majority neighbor of x, and ‖x − y_{b}‖ denotes the Euclidean distance between sample x in the minority subcluster and majority sample y_{b}; i indexes a sample in the minority subcluster, n is the sample size of the subcluster, and k is the number of neighbors. The formula shows that the probability of a sample being selected is determined by its distance to the majority class boundary: minority samples closer to the boundary are selected with a higher probability than samples far from it. The selection probabilities of all samples constitute the probability distribution of the minority subcluster. In this way, the distribution characteristics of the samples are considered and the minority class decision boundary is extended effectively.
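Seed selection can be sketched in Python as follows. Formulas (8) and (9) are not reproduced in this text, so the sketch assumes one form consistent with the description: each minority sample is scored by the inverse of its mean distance to its k nearest majority samples, and the scores are normalised into a probability distribution over the subcluster. The function names and the small constant guarding against division by zero are illustrative assumptions.

```python
import math
import random

def seed_probabilities(subcluster, majority, k=5):
    """Selection probability per minority sample; closer to the majority means higher (assumed form)."""
    scores = []
    for x in subcluster:
        nearest = sorted(math.dist(x, y) for y in majority)[:k]
        mean_d = sum(nearest) / len(nearest)
        scores.append(1.0 / (mean_d + 1e-12))  # guard against zero distance
    total = sum(scores)
    return [s / total for s in scores]

def pick_seeds(subcluster, probs, count, rng=None):
    """Draw `count` seed samples (with replacement) according to probs."""
    rng = rng or random.Random()
    return rng.choices(subcluster, weights=probs, k=count)
```

Here `count` would be the number of samples to synthesize for the subcluster, so boundary samples are drawn as seeds more often than interior ones.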
3.3. Restrict the Generation of New Samples in a Certain Area
After the number of synthetic samples for each minority subcluster has been determined, select the “seed samples” according to the probability distribution of each minority subcluster. To improve classifier performance and prevent synthetic samples from being distributed marginally, the distribution of the newly generated samples needs to be taken into account during synthesis, so new samples are generated only within a certain area. Select a sample from the “seed samples,” randomly select two neighboring minority samples from the subcluster in which the selected seed sample is located, and form a triangle with the three selected samples as vertexes. Then synthesize a new sample on the line from each of the three vertexes to the centroid, so that one triangle generates three new synthetic samples. The centroid method thus restricts the generation of new samples to a certain area. Let the three vertexes be (X_{1}, Y_{1}), (X_{2}, Y_{2}), and (X_{3}, Y_{3}); their centroid X_{T} is calculated by the following formula:
$$X_T = \left( \frac{X_1 + X_2 + X_3}{3},\ \frac{Y_1 + Y_2 + Y_3}{3} \right)$$
where X_{i} and Y_{i} are the horizontal and vertical coordinates of the three vertexes, respectively. This method moves the new samples closer to the centroid, which addresses the marginal distribution of new samples caused by SMOTE.
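The triangle construction can be sketched in Python as follows. The sketch assumes that each new sample is placed at a uniformly random position on the segment from a vertex to the centroid; the text does not state how the position on the line is chosen, so the random factor `t` and the function names are illustrative assumptions. The code works for any number of feature dimensions.

```python
import random

def centroid(a, b, c):
    """Centroid of a triangle: the coordinate-wise mean of its three vertexes."""
    return [(p + q + r) / 3.0 for p, q, r in zip(a, b, c)]

def triangle_samples(seed, n1, n2, rng=None):
    """One synthetic sample on each vertex-to-centroid segment (three in total)."""
    rng = rng or random.Random()
    centre = centroid(seed, n1, n2)
    new = []
    for vertex in (seed, n1, n2):
        t = rng.random()  # assumed: uniform position along the segment
        new.append([v + t * (c - v) for v, c in zip(vertex, centre)])
    return new
```

Because every generated point is a convex combination of a vertex and the centroid, it cannot fall outside the triangle spanned by the three minority samples.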
3.4. AGNES-SMOTE Algorithm
The procedures of the AGNES-SMOTE algorithm are as follows. Use K-nearest neighbors to filter noise samples in the original dataset. Adopt the AGNES algorithm to cluster the majority samples and divide them into several majority subclusters. Cluster the minority samples, merging the two closest subclusters on the basis of the obtained majority subclusters, and keep clustering until the distance between two minority subclusters exceeds the set threshold; this yields the minority subclusters. Assign a weight to each minority subcluster, calculate the probability distribution of each minority subcluster, and combine the two to oversample the samples in each minority subcluster. Restrict the generation area of synthetic samples by the centroid method. The details of Algorithm 1 are as follows:
(1) Delete noise samples from the original dataset to obtain the ClearData dataset, and then split ClearData into a majority sample group and a minority sample group. Use AGNES to cluster the majority sample group to obtain the majority subclusters. Then, cluster the minority sample group. When clustering, determine whether there are majority samples between the two nearest minority subclusters. If there are none, merge the minority subclusters (line 1 to line 10).
(2) Calculate the sample size of each obtained minority subcluster, assign a sampling weight to each minority subcluster, calculate the number of samples to be synthesized, and then calculate the probability distribution of each minority subcluster (lines 15 to 23).
(3) Finally, in each minority subcluster, select “seed samples” based on the number of samples to be synthesized and the probability distribution. Select a sample from all “seed samples,” randomly select two neighboring minority samples from the subcluster in which the selected seed sample is located, form a triangle with the three selected samples as vertexes, and synthesize a new sample on the line from each of the three samples to the centroid. Then, add the new synthetic samples to the synthetic sample group (lines 24 to 36).

4. Experimental Design and Result Analysis
4.1. Evaluation Index
Conventional classification algorithms are evaluated with the confusion matrix, as shown in Table 1 [17]. In this paper, the minority class is defined as the positive class, and the majority class is the negative class. In the confusion matrix, TN (True Negatives) is the number of negative examples correctly classified, FP (False Positives) is the number of negative examples wrongly classified as positive, FN (False Negatives) is the number of positive examples wrongly classified as negative, and TP (True Positives) is the number of positive examples correctly classified [11].
Precision and recall [18] are the two basic indicators for classification, defined as follows:
$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}$$
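For imbalanced data, three indicators, F-measure, G-mean, and AUC, are commonly used to evaluate the performance of classification algorithms. F-measure is the weighted harmonic mean of precision and recall, and the weighting parameter β is set to 1 in the experiment. G-mean combines the accuracy of the classifier on the majority samples and on the minority samples. AUC is the area under the ROC curve. N and M are the numbers of minority samples and majority samples in the dataset, respectively, and rank_{i} is the rank of the i^{th} minority sample when all samples are sorted by predicted score in ascending order. F-measure, G-mean, and AUC are defined as follows [19]:
$$F\text{-}measure = \frac{(1 + \beta^2) \times Precision \times Recall}{\beta^2 \times Precision + Recall}$$
$$G\text{-}mean = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}}$$
$$AUC = \frac{\sum_{i \in minority} rank_i - \frac{N(N + 1)}{2}}{N \times M}$$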
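The evaluation indices can be computed from confusion-matrix counts as in the following Python sketch. It uses the standard definitions of precision, recall, F-measure, and G-mean, and a rank-based AUC, which is assumed here because the text defines AUC in terms of the minority and majority sample counts N and M. The function names are illustrative; the minority class is labelled 1.

```python
import math

def precision_recall(tp, fp, fn):
    """Precision and recall of the positive (minority) class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def f_measure(tp, fp, fn, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta = 1 in the paper)."""
    p, r = precision_recall(tp, fp, fn)
    denom = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denom if denom else 0.0

def g_mean(tp, fp, fn, tn):
    """Geometric mean of the accuracies on the minority and majority classes."""
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(tpr * tnr)

def auc_by_rank(scores, labels):
    """Rank-based AUC; labels use 1 for minority (positive) and 0 for majority."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # tied scores share the average 1-based rank
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, lab in zip(ranks, labels) if lab == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)
```

For example, the scores 0.1, 0.4, 0.35, 0.8 with labels 0, 0, 1, 1 give minority ranks 2 and 4, so AUC = (6 − 3) / (2 × 2) = 0.75.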
4.2. Experimental Analysis
4.2.1. Datasets
In this paper, nine UCI datasets [20] are selected for the experiment; their structures are listed in Table 2.
Stratified random splitting is adopted in this paper to keep the imbalance ratio consistent between the training set and the test set. 10-fold cross-validation is used as the evaluation method: each dataset is divided into 10 parts, each part is selected in turn as the validation set while the remaining nine parts form the training set, and the average of the 10 results is reported. The parameters of the SVM classifier are set as follows: the kernel function is a Gaussian radial basis function and the penalty factor C is 10.
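The splitting scheme can be sketched in Python as follows. It is a minimal sketch of stratified 10-fold splitting in which each class is shuffled and dealt round-robin into the folds so that every fold keeps roughly the original imbalance ratio; the function name `stratified_folds` and the fixed `seed` are illustrative assumptions.

```python
import random

def stratified_folds(labels, n_folds=10, seed=0):
    """Split sample indices into n_folds parts with similar class ratios."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of each class round-robin into the folds.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds
```

Each fold then serves once as the validation set while the other nine folds form the training set, and the ten scores are averaged.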
4.2.2. Determine Parameters f
The performance of the AGNES-SMOTE algorithm is affected by its parameters to some extent. The distance adjustment factor f controls subcluster merging during clustering. If f is too small, the number of minority subclusters will be too large while the number of samples in each subcluster will be too small, which reduces the diversity of synthetic samples and causes overfitting. If f is too large, merged clusters will contain majority samples, which results in overlapping during synthesis.
As shown in Table 3, five datasets are used as test datasets to determine the range of parameter f. f = 1.0 means that the threshold is not adjusted, so candidate values of f are selected around f = 1.0. The results show that three datasets obtain their maximum F-measure value when f = 1.0, one dataset when f = 0.6, and one dataset when f = 1.5. Therefore, the reference range of parameter f should be between 0.3 and 1.5. When f > 2.5, the F-measure values become similar, because as f grows all subclusters eventually merge into one.
4.2.3. Experimental Results and Analysis
(1) Analysis of synthetic data distribution results: this paper uses artificial datasets to compare the distribution of the samples synthesized by the newly proposed algorithm with that of SMOTE. The results are shown in the following figures, in which the red dots represent majority samples and the black crosses represent minority samples and their synthetic samples. Compared with Figure 1, the samples synthesized by the SMOTE algorithm are distributed more in the edge area and are even mixed into the majority samples, which causes overlapping. As the new synthetic samples are highly similar and repeated, the within-class imbalance of the original dataset is not improved. To address the shortcomings visible in Figure 2, AGNES-SMOTE filters noise samples and, when clustering, divides the minority subclusters with the distribution of the majority samples taken into account, which keeps new synthetic samples out of the majority sample area and reduces the impact of noise. It assigns sampling weights to the minority subclusters to achieve within-class balance, and it samples more of the marginal samples that are susceptible to misclassification, on the basis of the probability distribution, to form a clear boundary between the two classes. For samples distributed marginally, the centroid method restricts the generation of new samples to a certain area, which further guarantees the quality and diversity of the synthetic samples. The resulting data distribution is shown in Figure 3.
(2) Analysis of actual dataset results: AGNES-SMOTE is compared with SMOTE, K-means-SMOTE, and Cluster-SMOTE in the experiments. The AUC values of these sampling algorithms on the datasets are shown in Table 4.
The experimental results in Table 4 indicate that AGNES-SMOTE has better AUC values than the other sampling algorithms on datasets Ecoli, Libra, Yeast1, Optical_digits, and Abalone. In particular, AGNES-SMOTE has large AUC values on datasets Libra and Optical_digits, because their large imbalance ratios and rich features mean that more samples need to be synthesized. AGNES-SMOTE considers within-class imbalance, selects samples, restricts the generation area of synthetic samples, and reduces the overlap of synthetic samples, which ensures the quality of the synthetic samples and provides varied information for the classifier. AGNES-SMOTE has low AUC values on datasets Haberman, Yeast1, and Liver owing to their smaller imbalance ratios and fewer features.
The F-measure values and G-mean values of SMOTE, K-means-SMOTE, Cluster-SMOTE, and AGNES-SMOTE on each dataset are listed in Tables 5 and 6.
Tables 5 and 6 indicate that the AGNES-SMOTE algorithm attains good F-measure values and G-mean values on most datasets. It greatly improves the F-measure and G-mean values on datasets Ecoli, Yeast1, Haberman, Optical_digits, Abalone, and LEV, among which the highest F-measure value reaches 96.70% and the highest G-mean value reaches 97.53%. On dataset Libra, the G-mean value of AGNES-SMOTE improves greatly but is still slightly lower than that of Cluster-SMOTE; however, the F-measure value of AGNES-SMOTE on Libra increases by 14.25%. On dataset Satimage, the F-measure and G-mean values of AGNES-SMOTE are slightly lower than those of SMOTE, since this dataset has many overlapping data and interference samples that affect classification performance. On dataset Liver, the F-measure and G-mean values of AGNES-SMOTE are similar to those of SMOTE because the data distribution in the original dataset is already relatively concentrated. Generally speaking, in dealing with imbalanced data, the AGNES-SMOTE algorithm improves classification performance by reducing noise interference, reducing the overlap of synthetic samples, selecting marginal samples that are susceptible to misclassification, and considering both within-class imbalance and the distribution of the generated samples.
5. Conclusion
For the classification of imbalanced datasets, the existing oversampling algorithms mainly deal with between-class imbalance and neglect within-class imbalance. Several problems are also ignored: the samples to be oversampled are not selected, noise is not removed, synthetic samples overlap, and samples are distributed “marginally.” To solve these problems, this paper presents an oversampling algorithm, AGNES-SMOTE, which is based on hierarchical clustering and an improved SMOTE. The algorithm proceeds as follows: filter noise samples in the dataset; cluster the majority samples and the minority samples separately with the AGNES algorithm; divide the minority subclusters in the light of the obtained majority subclusters; select samples for oversampling based on the sampling weight and the probability distribution of the minority subclusters; and restrict the generation of new samples to a certain area by the centroid method. Comparative experiments with different algorithms have been conducted. The experimental results indicate that AGNES-SMOTE improves the classification accuracy of minority samples and the overall classification performance. However, the oversampling algorithm proposed in this paper applies only to two-class cases. In practice, most data fall into multiple categories, so oversampling algorithms optimized for multiclass data classification are left for future work.
Data Availability
The data used to support the results of this study are available on the website: https://archive.ics.uci.edu/ml/index.php.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
The work was supported by the Natural Science Foundation of Guangxi (2019GXNSFAA245053), the Guangxi Science and Technology Major Project (AA19254016), and the Major Research Plan Integration Project of NSFC (91836301).