Novel Advances in the Development of Machine Learning Solutions for Scientific Programming
Research Article | Open Access
Aytuğ Onan, "Consensus Clustering-Based Undersampling Approach to Imbalanced Learning", Scientific Programming, vol. 2019, Article ID 5901087, 14 pages, 2019. https://doi.org/10.1155/2019/5901087
Consensus Clustering-Based Undersampling Approach to Imbalanced Learning
Abstract
Class imbalance is an important problem encountered in machine learning applications, where one class (referred to as the minority class) has an extremely small number of instances and the other class (referred to as the majority class) has an immense number of instances. Imbalanced datasets arise in several real-world applications, including medical diagnosis, malware detection, anomaly identification, bankruptcy prediction, and spam filtering. In this paper, we present a consensus clustering-based undersampling approach to imbalanced learning. In this scheme, the number of instances in the majority class is reduced by utilizing a consensus clustering-based scheme. In the empirical analysis, 44 small-scale and 2 large-scale imbalanced classification benchmarks were utilized. In the consensus clustering schemes, five clustering algorithms (namely, k-means, k-modes, k-means++, self-organizing maps, and the DIANA algorithm) and their combinations were taken into consideration. In the classification phase, five supervised learning methods (namely, naïve Bayes, logistic regression, support vector machines, random forests, and the k-nearest neighbor algorithm) and three ensemble methods (namely, AdaBoost, bagging, and the random subspace algorithm) were utilized. The empirical results indicate that the proposed heterogeneous consensus clustering-based undersampling scheme yields better predictive performance.
1. Introduction
Class imbalance is an important research problem in machine learning, where the proportion of instances belonging to one class (referred to as the minority class) is extremely small, whereas the proportion of instances of the other class or classes (referred to as the majority class) is extremely high. Imbalanced datasets pose several challenges to conventional supervised learning methods. Conventional supervised learning methods (such as support vector machines and decision trees) can build viable classification models for balanced datasets. However, since imbalanced datasets overrepresent the majority class and underrepresent the minority class, the skewed distribution may degrade predictive performance [1, 2]. The supervised learning process is typically guided by global evaluation measures (such as classification accuracy). Hence, learning from imbalanced datasets can be biased towards the majority class, and classification models may tend to misclassify the instances of the minority class [3]. Supervised learning algorithms may regard the instances of the minority class as noise or outliers, and, conversely, noisy data and outliers may be regarded as instances of the minority class [4]. In addition, classification models for datasets with skewed sample distributions may be challenging to learn due to the overlapping nature of the instances of the minority class with the instances of other classes [5].
Imbalanced datasets are encountered in several real-world problems and applications, including software fault identification [6], medical diagnosis [7], malware detection [8], anomaly identification [9], bankruptcy prediction [10], and spam filtering [11]. For the data mining problems mentioned above, the number of instances of the minority class is scarce; however, the identification of the instances of the minority class may be more critical. For instance, the misclassification of cancerous (malignant) tumors as noncancerous (benign) in medical diagnosis can have severe consequences. Similarly, the number of instances of fraudulent transactions may be scarce, yet it is critical in finance to build prediction models that can identify fraudulent transactions. Hence, handling imbalanced datasets properly is an important research problem in machine learning.
To deal efficiently with datasets with imbalanced distributions and to build robust and efficient classification schemes, data preprocessing methods have been utilized in conjunction with machine learning algorithms. The methods utilized to tackle the class imbalance problem can be mainly divided into four categories: algorithm-level approaches, data-level approaches, cost-sensitive approaches, and ensemble learning-based approaches [12]. Algorithm-level approaches seek to adapt supervised learning algorithms to bias learning towards the instances of the minority class [13]. Data-level approaches seek to rebalance the instances of the imbalanced dataset so that the effects of skewed distributions can be eliminated in the learning process [14]; to do so, they resample the training datasets. Cost-sensitive approaches aim at minimizing the total cost of errors for the minority and majority classes by defining misclassification costs [15]. In addition, ensemble learning-based approaches have also been utilized for class imbalance. Ensemble classifiers aim at enhancing the predictive performance of a single learning algorithm by combining the predictions of several learning algorithms. In ensemble approaches to imbalanced learning, several strategies (such as bagging and undersampling, undersampling and cost-sensitive learning, and boosting and resampling) have been combined [12]. In data-level approaches, data preprocessing and the learning process of the supervised learning algorithm are handled independently. In addition, compared with cost-sensitive approaches, which involve setting a cost matrix for imbalanced datasets, data-level preprocessing (resampling) is a viable tool for researchers who are not experts in the field [1]. Hence, among the different approaches to imbalanced learning, data-level approaches, which are based on resampling the imbalanced datasets, are frequently employed.
The two main directions in data-level approaches are undersampling and oversampling. To obtain a dataset with a balanced class distribution, the original imbalanced dataset can be resampled by oversampling the minority class or undersampling the majority class [16, 17]. In addition, there are several hybrid approaches that combine undersampling and oversampling methods, such as SMOTEBoost, OverBagging, and UnderBagging [18–20]. Compared with oversampling, undersampling tends to yield better predictive performance [21]. However, undersampling may eliminate some useful representative instances of the majority class [22]. Hence, the identification of useful representative instances in undersampling is of great importance to the predictive performance of supervised learning algorithms on imbalanced data. In response, clustering methods can be utilized to identify useful representative instances of the majority class in undersampling for imbalanced learning [23–25].
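As a concrete illustration of the data-level direction, random undersampling (the simplest member of this family, not the clustering-based variant proposed later in this paper) can be sketched in a few lines; the function name and seed handling are our own illustrative choices:

```python
import numpy as np

def random_undersample(X, y, minority_label, seed=0):
    """Balance a binary dataset by randomly discarding majority-class
    rows until both classes have the same number of instances."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    kept = rng.choice(majority, size=minority.size, replace=False)
    idx = np.sort(np.concatenate([minority, kept]))
    return X[idx], y[idx]
```

Because the discarded rows are chosen blindly, useful representative majority instances may be lost, which is precisely the weakness that clustering-based selection is meant to address.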
In this paper, we present a consensus clustering-based undersampling approach to imbalanced learning. In this scheme, the number of instances in the majority class is reduced by utilizing a consensus clustering-based scheme. There are a large number of clustering algorithms in the literature; however, there is no single clustering algorithm that yields the best clustering results under all scenarios, as the no free lunch theorem suggests [26]. In this regard, the presented scheme aims at combining the decisions of different clustering algorithms to overcome the limitations of individual clustering algorithms and to achieve more robust and efficient clustering results. In this way, the presented scheme aims at identifying better representative instances of the majority class in undersampling for imbalanced learning. In the empirical analysis, 44 small-scale and 2 large-scale imbalanced classification benchmarks (with imbalance ratios ranging between 1.8 and 163.19) were utilized. The predictive performances of two consensus clustering-based frameworks (namely, homogeneous and heterogeneous consensus clustering schemes) were compared with three data-level methods (namely, the SMOTEBoost algorithm [16], RUSBoost [27], and the UnderBagging algorithm [28, 29]). In the consensus clustering schemes, five clustering algorithms (namely, k-means, k-modes [30], k-means++ [31], self-organizing maps [32], and the DIANA algorithm [33]) and their combinations were taken into consideration. In the classification phase, five supervised learning methods (namely, naïve Bayes, logistic regression, support vector machines, random forests, and the k-nearest neighbor algorithm) and three ensemble methods (namely, AdaBoost, bagging, and the random subspace algorithm) were utilized. The empirical results indicate that the proposed heterogeneous consensus clustering-based undersampling scheme yields better predictive performance.
To the best of our knowledge, the presented scheme is the first to use the paradigm of consensus clustering for imbalanced learning. The remainder of this paper is organized as follows. Section 2 briefly reviews the state of the art in imbalanced learning. Section 3 presents the proposed consensus clustering-based undersampling schemes. Section 4 presents the empirical analysis results, and Section 5 presents the concluding remarks.
2. Related Works
Imbalanced learning has attracted great research interest. As mentioned previously, the methods for dealing with imbalanced datasets can be broadly categorized as data-level methods, algorithm-level methods, cost-sensitive methods, and ensemble learning-based methods. Compared with the other approaches, data-level approaches have greater potential for imbalanced learning since they seek to improve the distribution of the dataset rather than relying on supervised learning-based enhancements [34]. This section briefly reviews the related work on imbalanced learning with an emphasis on data-level approaches. Data-level approaches (sampling methods) can be mainly divided into two categories, undersampling and oversampling, and both can be employed effectively for class imbalance.
Oversampling approaches aim at obtaining a balanced dataset by generating synthetic instances for the minority class. In contrast, undersampling approaches aim at obtaining a balanced dataset by removing instances of the majority class from the training set. For instance, Anand et al. [35] introduced a distance-based undersampling approach for class imbalance. Supervised learning methods can easily construct learning models for instances that are far from the decision boundaries. In response, their scheme eliminates the instances of the majority class that are far from the decision boundaries, while preserving the instances near the decision boundaries in the training set. In this way, a balanced training set was constructed and utilized in conjunction with weighted support vector machines. Similarly, Li et al. [36] utilized a vector quantization algorithm to reduce the instances of the majority class; their scheme employed support vector machines for imbalanced learning. In another study, Kumar et al. [37] empirically examined the effect of undersampling on the performance of clustering algorithms. Sun et al. [22] presented an ensemble classification scheme based on undersampling for imbalanced learning. In their scheme, the instances of the majority class were first divided into several partitions, each with a number of instances similar to that of the minority class; in this way, balanced datasets were generated. The balanced datasets were used to train binary classifiers, and the predictions of the binary classifiers were combined by an ensemble scheme to identify the final outcome. In another study, D’Addabbo and Maglietta [38] presented a selective sampling-based approach for imbalanced learning. Based on the observation that the instances near the decision boundaries are relevant and critical, the instances of the majority class near the decision boundaries are preserved.
In another study, Ha and Lee [39] presented an evolutionary undersampling scheme for class imbalance. In this scheme, a genetic algorithm was utilized to select the informative instances of the majority class by minimizing the loss between the distributions of the original and balanced datasets. Lin et al. [24] introduced two clustering-based undersampling schemes for imbalanced learning. In their scheme, the number of clusters was determined based on the number of instances of the minority class, and the k-means algorithm was employed to undersample the instances of the majority class. More recently, Shobana and Battula [40] presented an undersampling scheme based on diversified distribution and clustering for imbalanced learning, in which the k-means algorithm was employed to identify and remove rare instances and outliers.
In a recent study, Guo and Wei [41] presented a hybrid scheme based on clustering and logistic regression for imbalanced learning, in which clustering was utilized to partition the instances of the majority class into clusters. Similarly, Douzas et al. [42] integrated the k-means clustering algorithm and the synthetic minority oversampling technique to eliminate noisy data and to effectively obtain a balanced dataset within classes. Recently, Han et al. [43] presented a distribution-based approach for imbalanced learning, in which the instances of the minority class were divided into groups of noisy, unstable, boundary, and stable instances based on their location information; the scheme was utilized to improve predictive performance in medical diagnosis. In another study, Tsai et al. [44] introduced an undersampling approach for imbalanced learning that integrates cluster analysis and instance selection.
As mentioned previously, undersampling is a simple resampling strategy for dealing with the class imbalance problem. However, undersampling may remove potentially useful and informative instances of the majority class, which may degrade the predictive performance of classification schemes. In this paper, a consensus clustering-based framework is presented to identify the informative instances of the majority class through the use of a cluster ensemble method.
3. Proposed Consensus Clustering-Based Undersampling Framework
Undersampling and oversampling methods can be successfully employed for class imbalance. To obtain a robust classification scheme with high predictive performance, undersampling methods should retain useful and informative representative instances of the majority class in the training set. Clustering (cluster analysis) is an unsupervised technique that assigns similar instances (objects) to the same cluster in terms of their proximity or similarity. Hence, clustering algorithms can be employed to identify useful instances of the majority class in undersampling. With the use of clustering in undersampling, the majority class yields a distribution of instances into clusters such that similar instances are grouped together within the same cluster. One of the main problems encountered in applying clustering algorithms is the selection of an appropriate algorithm for a given problem. Each clustering algorithm has strong and weak characteristics, and the results obtained by a clustering algorithm are greatly influenced by the characteristics of the dataset, the parameters of the algorithm, and other factors. Clustering algorithms also suffer from instability: the same clustering algorithm can yield considerably different partitions for different parameter settings. One possible solution to this problem is to apply multiple clustering algorithms to the same dataset and to combine the outputs of the individual clustering algorithms. This process is referred to as consensus clustering (or cluster ensembles). Consensus clustering aims at combining the clustering results of different clustering algorithms so that a final clustering with better clustering quality can be obtained [45]. In this paper, two ensemble generation schemes, namely, homogeneous and heterogeneous ensemble schemes, are presented to undersample the instances of the majority class based on consensus clustering.
3.1. Consensus Function
Consensus clustering involves a two-stage procedure: in Stage 1, the cluster ensemble is generated, and in Stage 2, a consensus function is utilized to obtain the final partition from the individual clusterings. Consensus functions include direct approaches (such as simple voting, incremental voting, and label correspondence search), feature-based approaches (such as iterative voting consensus, mixture models, clustering aggregation, and quadratic mutual information), pairwise similarity-based approaches (such as agglomerative hierarchical models), and graph-based approaches (such as the cluster-based similarity partitioning algorithm and the shared nearest neighbors-based combiner) [45]. Motivated by the success of clustering algorithms on imbalanced learning [24] and the enhanced clustering quality obtained by consensus clustering schemes [46], we seek an efficient consensus clustering-based scheme for imbalanced learning. In this regard, we conducted an experimental analysis with several different consensus functions. Since the highest predictive performance was obtained by direct approaches, three consensus functions of this family were chosen for the study out of the wide range of consensus functions available.
3.1.1. Simple Voting Function (SV)
Let π_{r} denote the reference partition and let π_{q} (q = 1, …, M, q ≠ r) denote the partitions to be relabelled. A contingency matrix Ω ∈ R^{K × K} is obtained, in which K corresponds to the number of clusters. The contingency matrix entries Ω(l, l′) are filled with co-occurrence statistics computed based on the following equation [45]:

Ω(l, l′) = ∑_{i=1}^{N} δ(π_{r}(x_{i}), l) · δ(π_{q}(x_{i}), l′), (1)

where δ(a, b) = 1 if a = b and δ(a, b) = 0 otherwise. Based on the label correspondence obtained from equation (1), the aim of the simple voting consensus is to maximize the objective function given by

∑_{l=1}^{K} ∑_{l′=1}^{K} W(l, l′) · Ω(l, l′), (2)

where W is a label correspondence (permutation) matrix between the labels of partitions π_{r} and π_{q}. First, the reference partition π_{r} is randomly selected among the partitions of the cluster ensemble. Then, the remaining partitions are relabelled against the reference partition by following the procedure outlined above. Finally, a majority voting scheme is employed to identify the consensus label of each instance.
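A compact sketch may make the simple voting procedure concrete. The helper names below are ours, and the label correspondence is solved exactly with the Hungarian algorithm (SciPy's `linear_sum_assignment`); the original formulation may use a different search strategy:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def relabel(reference, partition, k):
    """Relabel `partition` so that its cluster ids agree as much as
    possible with `reference`, by maximising the co-occurrence counts
    in the K x K contingency matrix."""
    omega = np.zeros((k, k), dtype=int)
    for l in range(k):
        for lp in range(k):
            omega[l, lp] = int(np.sum((reference == l) & (partition == lp)))
    rows, cols = linear_sum_assignment(-omega)  # maximise total agreement
    mapping = dict(zip(cols, rows))
    return np.array([mapping[c] for c in partition])

def simple_voting(partitions, k):
    """Simple voting consensus: pick the first partition as reference,
    relabel the others against it, then majority-vote per instance."""
    ref = partitions[0]
    aligned = [ref] + [relabel(ref, p, k) for p in partitions[1:]]
    votes = np.stack(aligned)                 # shape (M, N)
    return np.array([np.bincount(votes[:, i], minlength=k).argmax()
                     for i in range(votes.shape[1])])
```

For example, the partitions [0, 0, 1, 1], [1, 1, 0, 0], and [0, 0, 1, 1] describe the same grouping up to label names, and the consensus recovers it after relabelling.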
3.1.2. Incremental Voting Function (IV)
In the incremental voting scheme (IV), data partitions are repeatedly added to the cluster ensemble. Let U_{g} ∈ R^{N × K} denote the membership matrix of partition π_{g}, where U_{g}(i, k) takes the value of 1 if data point x_{i} belongs to cluster k and 0 otherwise. Let V_{g} denote the matrix of intermediate partitions, where V_{g}(i, k) denotes the number of partitions in which label k corresponds to data point x_{i}. The process of incremental voting-based consensus is initialized with the construction of a contingency matrix Ω ∈ R^{K × K}, whose entries are filled by the following equation [48]:

Ω(l, l′) = ∑_{i=1}^{N} V_{g}(i, l) · U_{g+1}(i, l′). (3)

After the labels of the (g + 1)-th partition are aligned according to the correspondence obtained from equation (3), the entries of the matrix for the next intermediate partition (denoted by V_{g+1}) are computed as given by

V_{g+1}(i, l) = V_{g}(i, l) + U_{g+1}(i, σ(l)), (4)

where σ denotes the label correspondence obtained from equation (3). Based on the incremental combination of the M data partitions, the consensus label of each data point is determined based on the following equation [45]:

λ(x_{i}) = arg max_{k ∈ {1, …, K}} V_{M}(i, k). (5)
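Under the same illustrative assumptions (binary membership matrices, Hungarian matching for the label correspondence), the incremental voting procedure can be sketched as follows; the names are ours:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def incremental_voting(partitions, k):
    """Incremental voting consensus: partitions are merged one at a
    time into a vote matrix V, where V[i, l] counts how many partitions
    assigned label l to data point i after relabelling."""
    n = partitions[0].size
    V = np.zeros((n, k), dtype=int)
    V[np.arange(n), partitions[0]] = 1
    for p in partitions[1:]:
        U = np.zeros((n, k), dtype=int)
        U[np.arange(n), p] = 1
        omega = V.T @ U                        # contingency between labels
        _, cols = linear_sum_assignment(-omega)
        V += U[:, cols]                        # add the relabelled votes
    return V.argmax(axis=1)
```

Unlike simple voting, no single reference partition is fixed in advance; each new partition is aligned against the accumulated votes.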
3.1.3. Label Correspondence Search
In label correspondence search (LCS), the correspondence problem is modelled as an optimization problem [49]. The aim of the method is to obtain a consensus partition such that the overall agreement among the different partitions is maximized. Let R_{c,s} denote the vector representation of cluster c of system s, where the i-th element of R_{c,s} represents the posterior probability of cluster c for data point x_{i}. The agreement between clusters {c, s} and {c′, s′} can be defined as the inner product of their vector representations, as given by the following equation:

A({c, s}, {c′, s′}) = R_{c,s}^{T} R_{c′,s′}. (6)

Let x_{c,s,m} take the value of 1 if cluster c of system s is assigned to metacluster m, and the value of 0 otherwise. Let w_{c,s,m} denote the reward of assigning cluster c of system s to metacluster m, which can be defined as the total agreement with the clusters of the other systems assigned to the same metacluster, as given by the following equation:

w_{c,s,m} = ∑_{s′ ≠ s} ∑_{c′} A({c, s}, {c′, s′}) · x_{c′,s′,m}. (7)

Based on equations (6) and (7), the objective of label correspondence search is to maximize the argument defined in the following equation [49]:

∑_{c} ∑_{s} ∑_{m} w_{c,s,m} · x_{c,s,m}, (8)

subject to ∑_{m} x_{c,s,m} = 1 for each cluster c of each system s, so that every cluster is assigned to exactly one metacluster.
3.2. Homogeneous Consensus Clustering-Based Undersampling Framework
Let D denote an imbalanced dataset with two classes, where one class (referred to as the minority class) contains a small number of instances and the other class (referred to as the majority class) contains an extremely large number of instances. Let us denote the number of instances of the majority and minority classes as n and m, respectively. Initially, a k-fold cross-validation scheme is utilized to divide the imbalanced dataset into training and test sets. Then, the majority class (of n instances) is undersampled so that it contains a number of instances equal to that of the minority class (m). In the undersampling, a homogeneous consensus clustering scheme is utilized. Clustering algorithms require the number of clusters as an input parameter; we adopted the clustering framework presented in [24], so the number of instances in the minority class (m) is taken as the number of clusters (k). In the homogeneous consensus clustering scheme, the same clustering algorithm is utilized as the base clustering algorithm, with different parameter settings. In this scheme, five clustering algorithms (namely, k-means, k-modes, k-means++, self-organizing maps, and the DIANA algorithm) are utilized as the base clustering algorithms.
In this way, diversified partitions are obtained by the base clustering algorithms. The partitions obtained by the base clustering algorithms are combined by a consensus function to obtain the final partition. For obtaining the final partition, three consensus functions (namely, the simple voting function, the incremental voting function, and the label correspondence search algorithm) are utilized. The center of each cluster of the final partition is selected as an instance of the majority class. In this way, a balanced training set is obtained. The balanced training set is utilized to train supervised learning algorithms (namely, naïve Bayes, logistic regression, support vector machines, random forests, and the k-nearest neighbor algorithm) and ensemble learning methods (namely, AdaBoost, bagging, and the random subspace algorithm). The general stages of this scheme are depicted in Figure 1. In Figure 2, the general steps of the homogeneous consensus clustering-based undersampling scheme (CONS1) are outlined.
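Setting the consensus step aside for brevity, the core undersampling step (cluster the majority class into m clusters and keep one representative per cluster) can be sketched with a minimal k-means. The function names, the fixed iteration count, and the choice of the instance closest to each center are illustrative assumptions:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal Lloyd's k-means, standing in for one base clusterer."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):   # keep the old center if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

def undersample_by_centers(X_majority, m, seed=0):
    """Cluster the majority class into m clusters (m = minority size)
    and keep, for each cluster, the instance closest to its center."""
    labels, centers = kmeans(X_majority, m, seed=seed)
    representatives = []
    for j in range(m):
        members = np.flatnonzero(labels == j)
        if members.size:
            d = np.linalg.norm(X_majority[members] - centers[j], axis=1)
            representatives.append(members[d.argmin()])
    return X_majority[representatives]
```

In the full framework, the single `kmeans` call would be replaced by the consensus of several base clusterings.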
3.3. Heterogeneous Consensus Clustering-Based Undersampling Framework
In the heterogeneous consensus clustering scheme (CONS2), diversity among the clusterings is achieved with the use of different clustering algorithms as the base clustering algorithms. As stated previously, each clustering algorithm has its own strengths and weaknesses and can yield promising results on different datasets. The partitions obtained by different clustering algorithms may complement each other and can yield higher clustering quality. The heterogeneous consensus clustering-based undersampling framework follows the same stages as outlined in Figure 1. The only difference is that the heterogeneous consensus clustering framework utilizes five different clustering algorithms as the base clustering algorithms, whereas the homogeneous consensus clustering framework utilizes the same clustering algorithm with different parameter settings. The general structure of the heterogeneous consensus clustering-based undersampling scheme is summarized in Figure 3. In this scheme, k-fold cross-validation is employed to divide the imbalanced dataset into training and test sets. Then, the majority class is undersampled with the use of the heterogeneous consensus clustering scheme, in which different clustering algorithms are utilized as the base clustering algorithms. The presented scheme can be configured with different clustering algorithms, yet we have combined five base clustering algorithms (namely, k-means, k-modes, k-means++, self-organizing maps, and the DIANA algorithm). The partitions obtained by the different clustering algorithms are combined by the consensus function. The center of each cluster of the final partition is selected as an instance of the majority class. In this way, a balanced training set is obtained.
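A minimal heterogeneous ensemble might combine partitions from different algorithms as sketched below, with scikit-learn's `KMeans` (two initializations) and `AgglomerativeClustering` standing in for the five base algorithms and simple voting as the consensus function; the three-member setup and all names are illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import AgglomerativeClustering, KMeans

def heterogeneous_consensus(X, k, seed=0):
    """Three-member heterogeneous ensemble: partitions from different
    base configurations are aligned to the first partition via the
    contingency matrix, then combined by per-instance majority voting."""
    members = [
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X),
        KMeans(n_clusters=k, init="random", n_init=1,
               random_state=seed + 1).fit_predict(X),
        AgglomerativeClustering(n_clusters=k).fit_predict(X),
    ]
    ref, aligned = members[0], [members[0]]
    for p in members[1:]:
        omega = np.zeros((k, k), dtype=int)
        for l in range(k):
            for lp in range(k):
                omega[l, lp] = int(np.sum((ref == l) & (p == lp)))
        _, cols = linear_sum_assignment(-omega)   # match labels to ref
        inv = np.empty(k, dtype=int)
        inv[cols] = np.arange(k)                  # invert the matching
        aligned.append(inv[p])
    votes = np.stack(aligned)
    return np.array([np.bincount(votes[:, i], minlength=k).argmax()
                     for i in range(X.shape[0])])
```

On well-separated data, the members agree up to label names and the consensus reproduces the common grouping; on harder data, the vote arbitrates their disagreements.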
The predictive performance of the undersampling schemes is examined with the use of supervised learning methods and ensemble learning methods.
4. Experimental Analysis and Results
This section presents the empirical analysis of the proposed consensus clustering-based undersampling schemes.
4.1. Datasets
To examine the effectiveness of the proposed undersampling approaches, we utilized 44 small-scale and 2 large-scale imbalanced classification benchmarks, previously used in Galar et al. [12]. The imbalance ratios of the small-scale benchmarks range from 1.8 to 129, and their numbers of instances range from 130 to 5500. The imbalance ratios of the large-scale benchmarks range from 111.46 to 163.19, and their numbers of instances range from 102294 to 145751. To obtain training and test sets for the supervised learning methods, we employed 5-fold cross-validation, which partitions each dataset into 80% training and 20% test sets in each fold. Basic descriptive information regarding the imbalanced classification benchmarks is presented in Table 1.

4.2. Experimental Procedure
In the empirical analysis, the presented consensus clustering-based undersampling schemes were compared with seven state-of-the-art methods: UnderBagging4 (UB4), UnderBagging24 (UB24), RusBoost1 (Rus1), SMOTEBagging4 (SBAG4), UnderBagging1 (UB1), clustering-based undersampling based on cluster centers (Centers), and clustering-based undersampling based on the nearest neighbors of cluster centers (Centers_NN) [12, 24]. In order to examine the predictive performance changes obtained by the data balancing strategies, the results obtained by the C4.5 algorithm without data balancing are also presented as baseline results. In the consensus clustering schemes, five clustering algorithms (namely, k-means, k-modes, k-means++, self-organizing maps, and the DIANA algorithm) and their combinations were taken into consideration. In the classification phase, five supervised learning methods (namely, naïve Bayes, logistic regression, support vector machines, random forests, and the k-nearest neighbor algorithm) and three ensemble methods (namely, AdaBoost, bagging, and the random subspace algorithm) were utilized. The area under the ROC curve (AUC) was utilized as the evaluation metric. For the supervised learning methods and the state-of-the-art data preprocessing methods, the default parameters were employed. For the homogeneous consensus clustering-based undersampling scheme, the number of base clustering algorithms is taken as five.
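The evaluation metric can be computed directly from classifier scores: the AUC equals the probability that a randomly chosen positive instance is ranked above a randomly chosen negative one (the Mann-Whitney statistic). A minimal sketch:

```python
import numpy as np

def auc(y_true, scores):
    """AUC via the Mann-Whitney identity: the fraction of
    (positive, negative) pairs in which the positive is scored
    higher, counting ties as one half."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return (greater + 0.5 * ties) / (pos.size * neg.size)
```

Unlike accuracy, this metric is insensitive to the class ratio, which is why it is the standard choice for imbalanced benchmarks.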
4.3. Experimental Results and Discussions
In Table 2, the average AUC values of the state-of-the-art methods and the conventional clustering algorithms (namely, k-means, k-means++, k-modes, self-organizing maps, and the DIANA algorithm) are presented. As can be observed from the results presented in Table 2, the application of data balancing strategies enhances the predictive performance in terms of AUC values. The lowest average AUC values are obtained by the C4.5 algorithm when no data balancing is applied. The highest average AUC values are generally obtained by the UnderBagging4 algorithm, and the second highest by the UnderBagging24 algorithm. Among the five base clustering algorithms considered in the empirical analysis, the highest average AUC values are obtained by the DIANA clustering algorithm.

The homogeneous consensus clustering scheme utilizes a single clustering algorithm (of the same type) as the base clustering method. In the empirical analysis, five clustering algorithms (namely, k-means, k-modes, k-means++, self-organizing maps, and the DIANA algorithm) are considered as the base clustering methods. For aggregating the results of the individual clusterings, we considered three consensus functions (namely, the simple voting function, the incremental voting function, and the label correspondence search algorithm). In this way, 15 different homogeneous consensus clustering-based schemes are evaluated for imbalanced learning. In Table 3, the average AUC values obtained by the homogeneous consensus clustering schemes are presented. Compared with the results presented in Table 2 for the conventional data-level methods and the conventional clustering-based schemes, the homogeneous consensus clustering schemes yield better predictive performance in terms of AUC values. Among the compared homogeneous consensus clustering schemes, the highest predictive performance is obtained by utilizing the self-organizing map algorithm as the base clustering algorithm, with the simple voting function as the consensus function.

For the heterogeneous consensus clustering scheme, the k-means, k-modes, k-means++, self-organizing maps, and DIANA algorithms were utilized to obtain the individual partitions. As in the homogeneous scheme, we considered three consensus functions (namely, the simple voting function, the incremental voting function, and the label correspondence search algorithm). In this way, 3 different heterogeneous consensus clustering-based schemes are taken into consideration. In Table 4, the average AUC values obtained by the heterogeneous consensus clustering schemes are presented. As can be observed from the results listed in Table 4, the heterogeneous consensus clustering schemes outperform the homogeneous consensus clustering schemes, the conventional data-level methods, and the conventional clustering-based schemes. Regarding the average AUC values, the highest predictive performance is obtained by the heterogeneous clustering scheme with the label correspondence search-based consensus function, and the second highest by the heterogeneous clustering scheme with the simple voting-based consensus function.

In the classification phase, five supervised learning methods (namely, naïve Bayes, logistic regression, support vector machines, random forests, and the k-nearest neighbor algorithm) and three ensemble methods (namely, AdaBoost, bagging, and the random subspace algorithm) were utilized. In order to summarize the main findings of the empirical analysis, boxplots for the undersampling methods and the supervised learning methods are presented in Figures 4 and 5, respectively.
As can be observed from Figure 4, the average AUC values obtained by the presented heterogeneous clustering scheme are higher than those of the conventional data-level methods. In Figure 5, the predictive performance of the conventional supervised learning methods and their ensembles is analyzed. As can be observed, the ensemble learning methods yield higher predictive performance in terms of AUC values than the conventional supervised learning methods. The highest predictive performance among the supervised learning methods is achieved by the random subspace ensemble of random forests, and the second highest by the random subspace ensemble of support vector machines. Among the conventional supervised learning algorithms, naïve Bayes demonstrated the lowest predictive performance, whereas the random forest algorithm demonstrated the highest.
In Figure 6, the confidence intervals for the mean values of the average AUC values obtained by the compared algorithms are presented for a confidence level of 95%. Based on the statistical significance of the differences between the compared results, Figure 6 is divided into two regions separated by a red dashed line. As can be observed from Figure 6, the predictive performance gains obtained by the proposed consensus clustering-based schemes are statistically significant.
5. Conclusion
Class imbalance is an important problem in machine learning. Imbalanced datasets arise in a wide variety of applications, including medical diagnosis, malware detection, anomaly identification, bankruptcy prediction, and spam filtering. In order to build efficient and robust classification schemes, data preprocessing methods can be utilized in conjunction with supervised learning methods. Undersampling- and oversampling-based methods can be successfully applied to class imbalance. However, identifying the informative instances to be included in the training set is a critical issue for undersampling. In this regard, this paper empirically examines the predictive performance of two consensus clustering-based undersampling schemes for imbalanced learning. In the empirical analysis, 44 small-scale and 2 large-scale imbalanced classification benchmarks (with imbalance ratios ranging from 1.8 to 163.19) were utilized. The experimental analysis indicates that clustering-based undersampling schemes can outperform conventional data-level preprocessing methods for class imbalance. In addition, consensus clustering, which aggregates the partitions of individual clustering algorithms, can further enhance the predictive performance of clustering-based undersampling schemes.
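The clustering-based undersampling idea summarized above can be sketched with a single clusterer: cluster the majority class into as many clusters as there are minority instances, then keep only the majority instance nearest each cluster center. This is a minimal single-algorithm sketch (the full scheme aggregates five clusterers through a consensus function); the function name and the one-instance-per-cluster selection rule are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_undersample(X, y, majority_label, random_state=0):
    """Cluster the majority class and retain, per cluster, the instance
    closest to the centroid, yielding a balanced binary training set."""
    maj_idx = np.where(y == majority_label)[0]
    min_idx = np.where(y != majority_label)[0]  # assumes binary labels
    n_keep = len(min_idx)
    km = KMeans(n_clusters=n_keep, n_init=10, random_state=random_state)
    km.fit(X[maj_idx])
    keep = []
    for c in range(n_keep):
        members = maj_idx[km.labels_ == c]
        # Distance of every cluster member to its centroid.
        d = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        keep.append(members[np.argmin(d)])
    sel = np.concatenate([min_idx, np.array(keep)])
    return X[sel], y[sel]
```

The retained majority instances act as cluster representatives, so the reduced training set preserves the spread of the majority class rather than discarding instances at random.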
There are a number of directions in which this work can be extended. The presented consensus clustering-based undersampling scheme utilizes five clustering algorithms (namely, k-means, k-modes, k-means++, self-organizing maps, and the DIANA algorithm). The clustering algorithms have been integrated through three consensus functions, namely, the simple voting-based consensus function, the incremental voting function, and the label correspondence search. The predictive performance of other conventional and swarm-based clustering algorithms (such as ant clustering, particle swarm-based clustering, and firefly clustering) can hence be examined for imbalanced learning. In addition, recent proposals in the field indicate that schemes which integrate instance selection and clustering may yield higher predictive performance. Hence, the consensus clustering-based undersampling scheme can also be examined in conjunction with conventional instance selection methods.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Disclosure
The study was performed as part of the employment of the author at Izmir Katip Celebi University.
Conflicts of Interest
The author declares that there are no conflicts of interest.
References
[1] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing, "Learning from class-imbalanced data: review of methods and applications," Expert Systems with Applications, vol. 73, pp. 220–239, 2017.
[2] V. López, A. Fernández, S. García, V. Palade, and F. Herrera, "An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics," Information Sciences, vol. 250, pp. 113–141, 2013.
[3] G. M. Weiss, "Mining with rarity," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 7–19, 2004.
[4] C. Beyan and R. Fisher, "Classifying imbalanced data sets using similarity based hierarchical decomposition," Pattern Recognition, vol. 48, no. 5, pp. 1653–1672, 2015.
[5] M. Denil and T. Trappenberg, "Overlap versus imbalance," in Proceedings of the Canadian Conference on Artificial Intelligence, pp. 220–231, Springer, Ottawa, Canada, May 2010.
[6] D. Rodriguez, I. Herraiz, R. Harrison, J. Dolado, and J. C. Riquelme, "Preliminary comparison of techniques for dealing with imbalance in software defect prediction," in Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering, p. 43, ACM, London, UK, May 2014.
[7] R. Akbani, S. Kwek, and N. Japkowicz, "Applying support vector machines to imbalanced datasets," in Proceedings of the European Conference on Machine Learning (ECML 2004), pp. 39–50, Prague, Czech Republic, September 2004.
[8] N. Peiravian and X. Zhu, "Machine learning for Android malware detection using permission and API calls," in Proceedings of the IEEE 25th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 300–305, IEEE, Herndon, VA, USA, November 2013.
[9] W. Khreich, E. Granger, A. Miri, and R. Sabourin, "Iterative Boolean combination of classifiers in the ROC space: an application to anomaly detection with HMMs," Pattern Recognition, vol. 43, no. 8, pp. 2732–2752, 2010.
[10] M.-J. Kim, D.-K. Kang, and H. B. Kim, "Geometric mean based boosting algorithm with over-sampling to resolve data imbalance problem for bankruptcy prediction," Expert Systems with Applications, vol. 42, no. 3, pp. 1074–1082, 2015.
[11] T. R. Hoens, R. Polikar, and N. V. Chawla, "Learning from streaming data with concept drift and imbalance: an overview," Progress in Artificial Intelligence, vol. 1, no. 1, pp. 89–101, 2012.
[12] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, no. 4, pp. 463–484, 2012.
[13] B. Liu, Y. Ma, and C. K. Wong, "Improving an association rule based classifier," in Proceedings of the European Conference on Principles of Data Mining and Knowledge Discovery, pp. 504–509, Springer, Lyon, France, September 2000.
[14] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard, "A study of the behavior of several methods for balancing machine learning training data," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 20–29, 2004.
[15] N. V. Chawla, D. A. Cieslak, L. O. Hall, and A. Joshi, "Automatically countering imbalance and its empirical relationship to cost," Data Mining and Knowledge Discovery, vol. 17, no. 2, pp. 225–252, 2008.
[16] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[17] N. Japkowicz, "The class imbalance problem: significance and strategies," in Proceedings of the International Conference on Artificial Intelligence, Las Vegas, NV, USA, June 2000.
[18] R. Barandela, R. M. Valdovinos, and J. S. Sánchez, "New applications of ensembles of classifiers," Pattern Analysis and Applications, vol. 6, no. 3, pp. 245–256, 2003.
[19] N. V. Chawla, N. Japkowicz, and A. Kolcz, "Workshop learning from imbalanced data sets II," in Proceedings of the International Conference on Machine Learning, Washington, DC, USA, August 2003.
[20] S. Wang and X. Yao, "Diversity analysis on imbalanced data sets by using ensemble models," in Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM '09), pp. 324–331, IEEE, Nashville, TN, USA, March 2009.
[21] J. Błaszczyński and J. Stefanowski, "Neighbourhood sampling in bagging for imbalanced data," Neurocomputing, vol. 150, pp. 529–542, 2015.
[22] Z. Sun, Q. Song, X. Zhu, H. Sun, B. Xu, and Y. Zhou, "A novel ensemble method for classifying imbalanced data," Pattern Recognition, vol. 48, no. 5, pp. 1623–1637, 2015.
[23] J. Kwak, T. Lee, and C. O. Kim, "An incremental clustering-based fault detection algorithm for class-imbalanced process data," IEEE Transactions on Semiconductor Manufacturing, vol. 28, no. 3, pp. 318–328, 2015.
[24] W.-C. Lin, C.-F. Tsai, Y.-H. Hu, and J.-S. Jhang, "Clustering-based undersampling in class-imbalanced data," Information Sciences, vol. 409–410, pp. 17–26, 2017.
[25] V. Vigneron and H. Chen, "A multi-scale seriation algorithm for clustering sparse imbalanced data: application to spike sorting," Pattern Analysis and Applications, vol. 19, no. 4, pp. 885–903, 2016.
[26] D. H. Wolpert and W. G. Macready, "No free lunch theorems for search," Technical Report SFI-TR-95-02-010, Santa Fe Institute, Santa Fe, NM, USA, 1995.
[27] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, "RUSBoost: a hybrid approach to alleviating class imbalance," IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, vol. 40, no. 1, pp. 185–197, 2010.
[28] S. Wang, K. Tang, and X. Yao, "Diversity exploration and negative correlation learning on imbalanced data sets," in Proceedings of the International Joint Conference on Neural Networks (IJCNN 2009), pp. 3259–3266, IEEE, Atlanta, GA, USA, June 2009.
[29] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, "SMOTEBoost: improving prediction of the minority class in boosting," in Proceedings of the European Conference on Principles of Data Mining and Knowledge Discovery, pp. 107–119, Springer, Cavtat-Dubrovnik, Croatia, September 2003.
[30] Z. Huang, "A fast clustering algorithm to cluster very large categorical data sets in data mining," DMKD, vol. 3, no. 8, pp. 34–39, 1997.
[31] D. Arthur and S. Vassilvitskii, "k-means++: the advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, January 2007.
[32] T. Kohonen, Self-Organizing Maps, Springer, Berlin, Germany, 2001.
[33] H. Chipman and R. Tibshirani, "Hybrid hierarchical clustering with applications to microarray data," Biostatistics, vol. 7, no. 2, pp. 286–301, 2005.
[34] S. Barua, M. M. Islam, X. Yao, and K. Murase, "MWMOTE: majority weighted minority oversampling technique for imbalanced data set learning," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 2, pp. 405–425, 2014.
[35] A. Anand, G. Pugalenthi, G. B. Fogel, and P. N. Suganthan, "An approach for classification of highly imbalanced data using weighting and undersampling," Amino Acids, vol. 39, no. 5, pp. 1385–1391, 2010.
[36] Q. Li, B. Yang, Y. Li, N. Deng, and L. Jing, "Constructing support vector machine ensemble with segmentation for imbalanced datasets," Neural Computing and Applications, vol. 22, no. S1, pp. 249–256, 2013.
[37] N. S. Kumar, K. N. Rao, A. Govardhan, K. S. Reddy, and A. M. Mahmood, "Undersampled K-means approach for handling imbalanced distributed data," Progress in Artificial Intelligence, vol. 3, no. 1, pp. 29–38, 2014.
[38] A. D'Addabbo and R. Maglietta, "Parallel selective sampling method for imbalanced and large data classification," Pattern Recognition Letters, vol. 62, pp. 61–67, 2015.
[39] J. Ha and J. S. Lee, "A new under-sampling method using genetic algorithm for imbalanced data classification," in Proceedings of the 10th International Conference on Ubiquitous Information Management and Communication, p. 95, ACM, Danang, Vietnam, January 2016.
[40] G. Shobana and B. P. Battula, "An under sampled k-means approach for handling imbalanced data using diversified distribution," International Journal of Engineering and Technology (UAE), vol. 7, no. 1.8, pp. 113–117, 2018.
[41] H. Guo and T. Wei, "Logistic regression for imbalanced learning based on clustering," International Journal of Computational Science and Engineering, vol. 18, no. 1, pp. 54–64, 2019.
[42] G. Douzas, F. Bacao, and F. Last, "Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE," Information Sciences, vol. 465, pp. 1–20, 2018.
[43] W. Han, Z. Huang, S. Li, and Y. Jia, "Distribution-sensitive unbalanced data oversampling method for medical diagnosis," Journal of Medical Systems, vol. 43, no. 2, p. 39, 2019.
[44] C.-F. Tsai, W.-C. Lin, Y.-H. Hu, and G.-T. Yao, "Under-sampling class imbalanced datasets by combining clustering analysis and instance selection," Information Sciences, vol. 477, pp. 47–54, 2019.
[45] T. Boongoen and N. Iam-On, "Cluster ensembles: a survey of approaches with recent extensions and applications," Computer Science Review, vol. 28, pp. 1–25, 2018.
[46] N. Nguyen and R. Caruana, "Consensus clusterings," in Proceedings of the Seventh IEEE International Conference on Data Mining (ICDM 2007), pp. 607–612, IEEE, Omaha, NE, USA, October 2007.
[47] A. P. Topchy, M. H. Law, A. K. Jain, and A. L. Fred, "Analysis of consensus partition in cluster ensemble," in Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM '04), pp. 225–232, IEEE, Brighton, UK, November 2004.
[48] H. G. Ayad and M. S. Kamel, "Cumulative voting consensus method for partitions with variable number of clusters," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 1, pp. 160–173, 2008.
[49] C. Boulis and M. Ostendorf, "Combining multiple clustering systems," in Proceedings of the European Conference on Principles of Data Mining and Knowledge Discovery, pp. 63–74, Springer, Cavtat-Dubrovnik, Croatia, September 2003.
Copyright
Copyright © 2019 Aytuğ Onan. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.