Scientific Programming
Volume 2019, Article ID 5901087, 14 pages
https://doi.org/10.1155/2019/5901087
Research Article

Consensus Clustering-Based Undersampling Approach to Imbalanced Learning

İzmir Katip Çelebi University, Faculty of Engineering and Architecture, Department of Computer Engineering, 35620 İzmir, Turkey

Correspondence should be addressed to Aytuğ Onan; aytugonan@gmail.com

Received 11 November 2018; Revised 15 January 2019; Accepted 10 February 2019; Published 3 March 2019

Guest Editor: Vicente García-Díaz

Copyright © 2019 Aytuğ Onan. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Class imbalance is an important problem encountered in machine learning applications, where one class (the minority class) has an extremely small number of instances and the other class (the majority class) has a very large number of instances. Imbalanced datasets arise in several real-world applications, including medical diagnosis, malware detection, anomaly identification, bankruptcy prediction, and spam filtering. In this paper, we present a consensus clustering-based undersampling approach to imbalanced learning. In this scheme, the majority class is undersampled by utilizing a consensus clustering-based scheme. In the empirical analysis, 44 small-scale and 2 large-scale imbalanced classification benchmarks have been utilized. In the consensus clustering schemes, five clustering algorithms (namely, k-means, k-modes, k-means++, self-organizing maps, and the DIANA algorithm) and their combinations were taken into consideration. In the classification phase, five supervised learning methods (namely, naïve Bayes, logistic regression, support vector machines, random forests, and the k-nearest neighbor algorithm) and three ensemble learning methods (namely, AdaBoost, bagging, and the random subspace algorithm) were utilized. The empirical results indicate that the proposed heterogeneous consensus clustering-based undersampling scheme yields better predictive performance than the homogeneous scheme and the conventional data-level methods.

1. Introduction

Class imbalance is an important research problem in machine learning, where the proportion of instances belonging to one class (referred to as the minority class) is extremely small, whereas the proportion of instances of the other class or classes (referred to as the majority class) is extremely high. Imbalanced datasets pose several challenges to conventional supervised learning methods. Conventional supervised learning methods (such as support vector machines and decision trees) can build viable classification models for balanced datasets. Since the majority class is overrepresented and the minority class is underrepresented in imbalanced datasets, skewed distributions may degrade predictive performance [1, 2]. The supervised learning process is based on global evaluation measures (such as classification accuracy). Hence, models learned from imbalanced datasets can be biased towards the majority class and may tend to misclassify the instances of the minority class [3]. Supervised learning algorithms may regard the instances of the minority class as noise or outliers, and, conversely, noisy data and outliers may be regarded as instances of the minority class [4]. In addition, classification models for datasets with skewed sample distributions may be challenging to learn because the instances of the minority class overlap with the instances of other classes [5].

Imbalanced datasets can be encountered in several real-world problems and applications, including software fault identification [6], medical diagnosis [7], malware detection [8], anomaly identification [9], bankruptcy prediction [10], and spam filtering [11]. In the data mining problems mentioned above, instances of the minority class are scarce, yet identifying them correctly is often the more critical task. For instance, the misclassification of cancerous (malignant) tumors as noncancerous (benign) in medical diagnosis can have severe effects. Similarly, fraudulent transactions are scarce in finance, but it is critical to build prediction models that can identify them. Hence, handling imbalanced datasets properly is an important research problem in machine learning.

To deal efficiently with datasets with imbalanced distributions and to build robust and efficient classification schemes, data preprocessing methods have been utilized in conjunction with machine learning algorithms. The methods utilized to tackle the class imbalance problem can be mainly divided into four categories: algorithm-level approaches, data-level approaches, cost-sensitive approaches, and ensemble learning-based approaches [12]. Algorithm-level approaches seek to adapt supervised learning algorithms to bias learning towards the instances of the minority class [13]. Data-level approaches seek to rebalance the instances of the imbalanced dataset so that the effects of skewed distributions can be eliminated in the learning process [14]. In order to do so, data-level approaches apply resampling to the training datasets. Cost-sensitive approaches aim at minimizing the total cost of errors for the minority and majority classes by defining misclassification costs [15]. In addition, ensemble learning-based approaches have also been utilized for class imbalance. Ensemble classifiers aim at enhancing the predictive performance of a single learning algorithm by combining the predictions of several learning algorithms. In ensemble approaches to imbalanced learning, several strategies (such as bagging and undersampling, undersampling and cost-sensitive learning, and boosting and resampling) have been combined [12]. In data-level approaches, data preprocessing and the learning process of the supervised learning algorithm are handled independently. In addition, compared to cost-sensitive approaches, which require a cost matrix to be set for the imbalanced dataset, data-level preprocessing (resampling) is a viable tool for researchers who are not experts in the field [1]. Hence, among the different approaches to imbalanced learning, data-level approaches, which are based on resampling the imbalanced datasets, are frequently employed.
The two main directions in data-level approaches are undersampling and oversampling. In order to obtain a dataset with a balanced class distribution, the original imbalanced dataset can be resampled by oversampling the minority class or undersampling the majority class [16, 17]. In addition, there are several hybrid approaches, which combine undersampling and oversampling methods, such as SMOTEBoost, OverBagging, and UnderBagging [18–20]. Compared to oversampling, undersampling has been reported to yield better predictive performance [21]. However, undersampling may eliminate some useful representative instances of the majority class [22]. Hence, the identification of useful representative instances in undersampling is of great importance to the predictive performance of supervised learning algorithms on imbalanced learning. In response, clustering methods can be utilized to identify useful representative instances of the majority class in undersampling for imbalanced learning [23–25].
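
As a point of reference for the two directions described above, the following minimal sketch shows the simplest random variants of each. It is an illustration only: the function names and the use of plain duplication for oversampling are assumptions made here, and it is not one of the methods evaluated in this paper (SMOTE-style methods, for example, generate synthetic instances rather than duplicates).

```python
import random
from typing import List, Sequence, Tuple


def random_undersample(majority: Sequence, minority: Sequence,
                       seed: int = 0) -> Tuple[List, List]:
    """Keep a random subset of the majority class, equal in size to the minority class."""
    rng = random.Random(seed)
    kept = rng.sample(list(majority), k=len(minority))
    return kept, list(minority)


def random_oversample(majority: Sequence, minority: Sequence,
                      seed: int = 0) -> Tuple[List, List]:
    """Duplicate random minority instances until both classes have the same size.

    Assumption: plain duplication is used here for brevity.
    """
    rng = random.Random(seed)
    pool = list(minority)
    extra = [rng.choice(pool) for _ in range(len(majority) - len(minority))]
    return list(majority), pool + extra
```

Random undersampling discards majority instances without regard to how representative they are, which is the weakness that clustering-based selection is meant to address.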

In this paper, we present a consensus clustering-based undersampling approach to imbalanced learning. In this scheme, the majority class is undersampled by utilizing a consensus clustering-based scheme. There are a large number of clustering algorithms in the literature. However, there is no single clustering algorithm that can yield the best clustering results under all scenarios, as the no free lunch theorem suggests [26]. In this regard, the presented scheme combines the decisions of different clustering algorithms to overcome the limitations of individual clustering algorithms and to achieve more robust and efficient clustering results. In this way, the presented scheme aims at identifying better representative instances of the majority class in undersampling for imbalanced learning. In the empirical analysis, 44 small-scale and 2 large-scale imbalanced classification benchmarks (with imbalance ratios ranging between 1.8 and 163.19) were utilized. The predictive performances of two clustering-based frameworks (namely, homogeneous and heterogeneous consensus clustering schemes) were compared with three data-level methods (namely, the SMOTEBoost algorithm [16], RUSBoost [27], and the UnderBagging algorithm [28, 29]). In the consensus clustering schemes, five clustering algorithms (namely, k-means, k-modes [30], k-means++ [31], self-organizing maps [32], and the DIANA algorithm [33]) and their combinations were taken into consideration. In the classification phase, five supervised learning methods (namely, naïve Bayes, logistic regression, support vector machines, random forests, and the k-nearest neighbor algorithm) and three ensemble learning methods (namely, AdaBoost, bagging, and the random subspace algorithm) were utilized. The empirical results indicate that the proposed heterogeneous consensus clustering-based undersampling scheme yields better predictive performance than the homogeneous scheme and the conventional data-level methods.
To the best of our knowledge, the presented scheme is the first to use the paradigm of consensus clustering for imbalanced learning. The remainder of this paper is organized as follows. Section 2 briefly reviews the state of the art in imbalanced learning. Section 3 presents the proposed consensus clustering-based undersampling schemes. Section 4 presents the empirical analysis results, and Section 5 presents the concluding remarks.

2. Related Works

Imbalanced learning has attracted great research interest. As mentioned above, the methods to deal with imbalanced datasets can be broadly categorized as data-level methods, algorithm-level methods, cost-sensitive methods, and ensemble learning-based methods. Compared to the other approaches, data-level approaches have greater potential use in imbalanced learning since they seek to improve the distribution of datasets, rather than relying on enhancements to the supervised learning algorithm [34]. This section briefly reviews the related work on imbalanced learning with emphasis on data-level approaches. Data-level approaches (sampling methods) can be mainly divided into two categories, undersampling and oversampling, both of which can be employed effectively for class imbalance.

Oversampling approaches aim at obtaining a balanced dataset by generating synthetic instances for the minority class. In contrast, undersampling approaches aim at obtaining a balanced dataset by removing instances of the majority class from the training set. For instance, Anand et al. [35] introduced a distance-based undersampling approach for class imbalance. Supervised learning methods can easily construct learning models for instances that are far from the decision boundaries. Accordingly, their scheme eliminates the instances of the majority class that are far from the decision boundaries, while preserving the instances near the decision boundaries in the training set. In this way, a balanced training set was constructed and utilized in conjunction with weighted support vector machines. Similarly, Li et al. [36] utilized a vector quantization algorithm to reduce the number of instances of the majority class and employed support vector machines for imbalanced learning. In another study, Kumar et al. [37] empirically examined the effect of undersampling on the performance of clustering algorithms. Sun et al. [22] presented an ensemble classification scheme based on undersampling for imbalanced learning. In their scheme, the instances of the majority class were first divided into several partitions, each containing a similar number of instances to the minority class. In this way, balanced datasets were generated. Binary classifiers were trained on the balanced datasets to build classification models. Finally, the predictions of the binary classifiers were combined by an ensemble scheme to identify the final outcome. D’Addabbo and Maglietta [38] presented a selective sampling-based approach for imbalanced learning. Based on the observation that the instances near the decision boundaries are the most relevant, the instances of the majority class near the decision boundaries are preserved.
In another study, Ha and Lee [39] presented an evolutionary undersampling scheme for class imbalance. In this scheme, a genetic algorithm was utilized to select the informative instances of the majority class by minimizing the loss between the distributions of the original and balanced datasets. Lin et al. [24] introduced two clustering-based undersampling schemes for imbalanced learning. In these schemes, the number of clusters was determined based on the number of instances of the minority class, and the k-means algorithm was employed to undersample the instances of the majority class. More recently, Shobana and Battula [40] presented an undersampling scheme based on diversified distribution and clustering for imbalanced learning. In this scheme, the k-means algorithm was employed to identify and remove rare instances and outliers.

In a recent study, Guo and Wei [41] presented a hybrid scheme based on clustering and logistic regression for imbalanced learning. In the presented scheme, clustering was utilized to partition instances of the majority class into clusters. Similarly, Douzas et al. [42] integrated k-means clustering algorithm and synthetic minority oversampling technique to eliminate noisy data and to effectively obtain a balanced dataset within classes. Recently, Han et al. [43] presented a distribution-based approach for imbalanced learning. In the presented scheme, the instances of minority class were divided into groups as noisy instances, unstable instances, boundary instances, and stable instances based on the location information for the instances. The presented scheme has been utilized to improve the predictive performance on medical diagnosis. In another study, Tsai et al. [44] introduced an undersampling approach for imbalanced learning, which integrates clustering analysis and instance selection.

As mentioned above, undersampling is a simple resampling strategy to deal with the class imbalance problem. However, undersampling may remove potentially useful or informative instances of the majority class, which may degrade the predictive performance of classification schemes. In this paper, a consensus clustering-based framework is presented to identify the informative instances of the majority class through the use of a cluster ensemble method.

3. Proposed Consensus Clustering-Based Undersampling Framework

Undersampling and oversampling methods can be successfully employed for class imbalance. In order to obtain a robust classification scheme with high predictive performance, undersampling methods should retain useful and informative representative instances of the majority class in the training set. Clustering (cluster analysis) is an unsupervised technique which assigns similar instances (objects) to the same cluster in terms of their proximity or similarity. Hence, clustering algorithms can be employed to identify useful instances of the majority class in undersampling. When clustering is used for undersampling, the instances of the majority class are distributed into clusters such that similar instances are grouped together within the same cluster. One of the main problems encountered in applying clustering algorithms is the selection of an appropriate algorithm for a given problem. Each clustering algorithm has strengths and weaknesses, and the results obtained by clustering algorithms are greatly influenced by the characteristics of the dataset, the parameters of the algorithm, and similar factors. Clustering algorithms also suffer from instability: the same clustering algorithm can yield a considerably different partition for different parameter settings. One possible solution to this problem is to use multiple clustering algorithms on the same dataset and to combine the outputs of the individual clustering algorithms. This process is referred to as consensus clustering (or cluster ensembles). Consensus clustering aims at combining the clustering results of different clustering algorithms so that a final clustering with better clustering quality can be obtained [45]. In this paper, two ensemble generation schemes, namely, homogeneous and heterogeneous ensemble schemes, are presented to undersample the instances of the majority class based on consensus clustering.

3.1. Consensus Function

Consensus clustering involves a two-stage procedure: in Stage 1, the cluster ensemble is generated, and in Stage 2, a consensus function is utilized to obtain the final partition from the partitions of the individual clustering algorithms. Consensus functions include direct approaches (such as simple voting, incremental voting, and label correspondence search), feature-based approaches (such as iterative voting consensus, mixture model, clustering aggregation, and quadratic mutual information), pairwise similarity-based approaches (such as agglomerative hierarchical models), and graph-based approaches (such as the cluster-based similarity partitioning algorithm and the shared nearest neighbors-based combiner) [45]. Motivated by the success of clustering algorithms on imbalanced learning [24] and the enhanced clustering quality obtained by consensus clustering schemes [46], we seek to find an efficient consensus clustering-based scheme for imbalanced learning. In this regard, we have conducted an experimental analysis with several different consensus functions. Since the highest predictive performance was obtained by the direct approaches, three direct consensus functions were chosen for the study from the wide range of consensus functions available.

3.1.1. Simple Voting Function (SV)

Let $\pi_r$ denote the reference partition and let $\pi_g$ denote the partition to be relabelled. A contingency matrix $\Omega \in \mathbb{R}^{K \times K}$ is obtained, in which $K$ corresponds to the number of clusters. The contingency matrix entries $\Omega(l, l')$ are filled with co-occurrence statistics computed based on the following equation [45,43]:

$$\Omega(l, l') = \sum_{i=1}^{N} \omega_{l l'}(x_i), \tag{1}$$

where $N$ is the number of data points, $\omega_{l l'}(x_i) = 1$ if $\pi_r(x_i) = l$ and $\pi_g(x_i) = l'$, and $\omega_{l l'}(x_i) = 0$ otherwise. Based on the co-occurrence statistics obtained from equation (1), the aim of the simple voting consensus is to maximize the objective function given by

$$\sum_{l=1}^{K} \sum_{l'=1}^{K} \Omega(l, l')\, \Lambda(l, l'), \tag{2}$$

where $\Lambda$ is a label correspondence matrix amongst the labels of partitions $\pi_r$ and $\pi_g$. First, the reference partition ($\pi_r$) is randomly selected among the partitions of the cluster ensemble. Then, the remaining partitions are relabelled based on the reference partition by following the procedure outlined above. Finally, a majority voting scheme is employed to identify the consensus label of each instance.
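
The relabelling and majority voting steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: partitions are lists of integer labels in `0..k-1`, the reference partition is chosen by index rather than at random, and the best label correspondence is found by brute force over all permutations, which is only practical for small `k`. All function names are hypothetical.

```python
from collections import Counter
from itertools import permutations
from typing import List, Sequence


def contingency(ref: Sequence[int], part: Sequence[int], k: int) -> List[List[int]]:
    """omega[l][l2] = number of points labelled l in `ref` and l2 in `part`."""
    omega = [[0] * k for _ in range(k)]
    for a, b in zip(ref, part):
        omega[a][b] += 1
    return omega


def relabel(ref: Sequence[int], part: Sequence[int], k: int) -> List[int]:
    """Rename the labels of `part` so that they agree with `ref` as much as possible."""
    omega = contingency(ref, part, k)
    # Assumption: exhaustive search over one-to-one label correspondences.
    # best[l] is the label of `part` matched to reference label l.
    best = max(permutations(range(k)),
               key=lambda p: sum(omega[l][p[l]] for l in range(k)))
    mapping = {best[l]: l for l in range(k)}
    return [mapping[b] for b in part]


def simple_voting(partitions: Sequence[Sequence[int]], k: int,
                  ref_index: int = 0) -> List[int]:
    """Align every partition to the reference, then take a per-point majority vote."""
    ref = partitions[ref_index]
    aligned = [relabel(ref, p, k) for p in partitions]
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*aligned)]


# Three partitions of six points; the second is the first with permuted labels.
print(simple_voting([[0, 0, 1, 1, 2, 2],
                     [1, 1, 2, 2, 0, 0],
                     [0, 0, 1, 2, 2, 2]], k=3))  # prints [0, 0, 1, 1, 2, 2]
```

For larger numbers of clusters, an assignment solver such as `scipy.optimize.linear_sum_assignment` would be a more realistic choice than enumerating permutations.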

3.1.2. Incremental Voting Function (IV)

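In the incremental voting scheme (IV), data partitions are added to the cluster ensemble one at a time. Let $U^{g} \in \mathbb{R}^{N \times K}$ denote partition $\pi_g$, where $U^{g}(i, l)$ takes the value of 1 if data point $x_i$ belongs to cluster $l$. Otherwise, it takes the value of 0. Let $V^{g}$ denote the matrix of intermediate partitions and $V^{g}(i, l)$ denote the number of partitions in which label $l$ corresponds to data point $x_i$. The process of incremental voting-based consensus is initialized with the construction of the contingency matrix $\Omega \in \mathbb{R}^{K \times K}$. The contingency matrix entries are filled by the following equation [48]:

$$\Omega(l, l') = \sum_{i=1}^{N} V^{g}(i, l)\, U^{g+1}(i, l'), \tag{3}$$

where $U^{g+1}(i, l') = 1$ if data point $x_i$ belongs to cluster $l'$ of partition $\pi_{g+1}$. Otherwise, it takes the value of 0. After obtaining the contingency matrix, the entries of the matrix for partition $g+1$ (denoted by $V^{g+1}$) are computed as given by

$$V^{g+1}(i, l) = V^{g}(i, l) + \sum_{l'=1}^{K} \Lambda(l, l')\, U^{g+1}(i, l'), \tag{4}$$

where $\Lambda(l, l') = 1$ if label $l'$ of partition $\pi_{g+1}$ is matched to label $l$ and $\Lambda(l, l') = 0$ otherwise.

Based on the incremental combination of $M$ data partitions, the consensus label of each data point is determined based on the following equation [45]:

$$\pi^{*}(x_i) = \arg\max_{l} V^{M}(i, l). \tag{5}$$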
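
The incremental procedure can be illustrated with the short sketch below. It is written under several assumptions that are not taken from the paper: labels are integers in `0..k-1`, the first partition initializes the vote counts, each new partition is matched to the accumulated votes by exhaustive search over label permutations, and ties in the final vote go to the smallest label. The function name is hypothetical.

```python
from itertools import permutations
from typing import List, Sequence


def incremental_voting(partitions: Sequence[Sequence[int]], k: int) -> List[int]:
    """Add partitions one at a time, accumulating per-point label votes."""
    n = len(partitions[0])
    # votes[i][l] = number of partitions seen so far that give label l to point i.
    votes = [[0] * k for _ in range(n)]
    for i, lab in enumerate(partitions[0]):
        votes[i][lab] += 1

    for part in partitions[1:]:
        # Contingency between accumulated votes and the labels of the new partition.
        omega = [[0] * k for _ in range(k)]
        for i, lab in enumerate(part):
            for l in range(k):
                omega[l][lab] += votes[i][l]
        # Assumption: brute-force search for the best one-to-one label matching.
        best = max(permutations(range(k)),
                   key=lambda p: sum(omega[l][p[l]] for l in range(k)))
        mapping = {best[l]: l for l in range(k)}
        # Update the vote counts with the relabelled partition.
        for i, lab in enumerate(part):
            votes[i][mapping[lab]] += 1

    # Consensus label = label with the most accumulated votes.
    return [max(range(k), key=lambda l: row[l]) for row in votes]
```

Unlike simple voting, the reference here is the running vote matrix rather than a single fixed partition, so the result can depend on the order in which partitions are added.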

3.1.3. Label Correspondence Search

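In label correspondence search (LCS), the problem of correspondence is modelled as an optimization problem [49]. The aim of the method is to obtain a consensus partition such that the overall agreement among the different partitions is maximized. Let $R_{\{c,s\}}$ denote the vector representation of cluster $c$ of system $s$. The $n$th element of $R_{\{c,s\}}$ represents the posterior probability of cluster $c$ for data point $n$. The agreement between clusters $\{c, s\}$ and $\{c', s'\}$ can be defined as given by the following equation:

$$g_{\{c,s\},\{c',s'\}} = R_{\{c,s\}}^{\top} R_{\{c',s'\}}. \tag{6}$$

If cluster $c$ of system $s$ is assigned to metacluster $m$, $\lambda^{m}_{\{c,s\}}$ takes the value of 1, and it takes the value of 0 otherwise. $r^{m}_{\{c,s\}}$ denotes the reward of assigning cluster $c$ to metacluster $m$, and it can be defined as given by the following equation:

$$r^{m}_{\{c,s\}} = \frac{1}{|I(m)|} \sum_{\{c',s'\} \in I(m)} g_{\{c,s\},\{c',s'\}}, \qquad I(m) = \left\{ \{c', s'\} : \lambda^{m}_{\{c',s'\}} = 1 \right\}. \tag{7}$$

Based on equations (6) and (7), the objective of label correspondence is to maximize the argument defined in the following equation [49]:

$$\max_{\lambda} \sum_{m} \sum_{s} \sum_{c} \lambda^{m}_{\{c,s\}}\, r^{m}_{\{c,s\}}, \tag{8}$$

subject to

$$\sum_{m} \lambda^{m}_{\{c,s\}} = 1 \quad \text{for all } c, s. \tag{9}$$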
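
To make the quantities in this subsection more concrete, the sketch below computes an agreement score and a reward for hard (non-probabilistic) partitions. It rests on assumptions that should be read as illustrative: each cluster is represented by a 0/1 indicator vector in place of posterior probabilities, agreement is taken to be the dot product of two such vectors, and the reward is the mean agreement with the clusters already assigned to a metacluster. The optimization over assignments is not shown, and all names are hypothetical.

```python
from typing import List, Sequence


def indicator_vectors(partition: Sequence[int], k: int) -> List[List[float]]:
    """One 0/1 vector per cluster; entry n is 1.0 if point n belongs to that cluster."""
    return [[1.0 if lab == c else 0.0 for lab in partition] for c in range(k)]


def agreement(r_a: Sequence[float], r_b: Sequence[float]) -> float:
    """Assumed agreement measure: dot product of two cluster vectors.

    For hard partitions this equals the number of points the clusters share.
    """
    return sum(x * y for x, y in zip(r_a, r_b))


def reward(cluster_vec: Sequence[float],
           metacluster_members: Sequence[Sequence[float]]) -> float:
    """Assumed reward: mean agreement with the clusters already in the metacluster."""
    if not metacluster_members:
        return 0.0
    total = sum(agreement(cluster_vec, member) for member in metacluster_members)
    return total / len(metacluster_members)
```

With soft clusterings, the indicator vectors would be replaced by per-point membership probabilities and the same two functions would apply unchanged.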

3.2. Homogeneous Consensus Clustering-Based Undersampling Framework

Let D denote an imbalanced dataset with two classes, where one class (referred to as the minority class) contains a small number of instances and the other class (referred to as the majority class) contains a very large number of instances. Let us denote the number of instances corresponding to the majority and minority classes as n and m, respectively. Initially, a k-fold cross-validation scheme is utilized for dividing the imbalanced dataset into training and test sets. Then, the majority class (n instances) is undersampled so that it contains the same number of instances as the minority class (m). The homogeneous consensus clustering scheme is utilized to undersample the majority class. Clustering algorithms require the number of clusters as an input parameter. We adopted the clustering framework presented in [24]. Hence, the number of instances in the minority class (m) is taken as the number of clusters (k). In the homogeneous consensus clustering scheme, the same clustering algorithm is utilized as the base clustering algorithm, with different parameter settings. In this scheme, each of five clustering algorithms (namely, k-means, k-modes, k-means++, self-organizing maps, and the DIANA algorithm) is considered in turn as the base clustering algorithm.

In this way, diversified partitions are obtained by the base clustering algorithms. The partitions obtained by the base clustering algorithms are combined by a consensus function to obtain the final partition. Three consensus functions (namely, the simple voting function, the incremental voting function, and the label correspondence search algorithm) are utilized for this purpose. The center of each cluster of the final partition is selected as an instance of the majority class. In this way, a balanced training set is obtained. The balanced training set is utilized to train supervised learning algorithms (namely, naïve Bayes, logistic regression, support vector machines, random forests, and the k-nearest neighbor algorithm) and ensemble learning methods (namely, AdaBoost, bagging, and the random subspace algorithm). The general stages of this scheme are depicted in Figure 1. In Figure 2, the general steps of the homogeneous consensus clustering-based undersampling scheme (CONS1) are outlined.
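
The final step, replacing the majority class by the centers of the consensus clusters, can be sketched as follows. The sketch assumes numeric feature vectors, a consensus partition that has already been computed, and cluster centers taken as feature-wise means; empty clusters are skipped, so the result can contain fewer than m majority representatives. Function names and the 0/1 class encoding are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def cluster_centers(points: Sequence[Sequence[float]],
                    labels: Sequence[int], k: int) -> List[List[float]]:
    """Feature-wise mean of each non-empty cluster."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point, lab in zip(points, labels):
        counts[lab] += 1
        for j, value in enumerate(point):
            sums[lab][j] += value
    return [[s / counts[c] for s in sums[c]] for c in range(k) if counts[c] > 0]


def balanced_training_set(majority: Sequence[Sequence[float]],
                          consensus_labels: Sequence[int],
                          minority: Sequence[Sequence[float]]
                          ) -> Tuple[List[List[float]], List[int]]:
    """Replace the majority class by its cluster centers; k = number of minority instances."""
    k = len(minority)
    centers = cluster_centers(majority, consensus_labels, k)
    features = centers + [list(p) for p in minority]
    # Assumed encoding: 0 = majority class, 1 = minority class.
    targets = [0] * len(centers) + [1] * len(minority)
    return features, targets
```

Because the centers are means, they are synthetic points rather than original instances of the majority class.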

Figure 1: Homogeneous consensus clustering-based undersampling scheme (CONS1).
Figure 2: The general structure of the homogeneous consensus clustering-based undersampling scheme (CONS1).
3.3. Heterogeneous Consensus Clustering-Based Undersampling Framework

In the heterogeneous consensus clustering scheme (CONS2), diversity is achieved by using different clustering algorithms as the base clustering algorithms. As stated above, each clustering algorithm has its own strengths and weaknesses and can yield promising results on different datasets. The partitions obtained by different clustering algorithms may complement each other and can yield higher clustering quality. The heterogeneous consensus clustering-based undersampling framework follows the same stages as outlined in Figure 1. The only difference is that the heterogeneous framework utilizes five different clustering algorithms as the base clustering algorithms, whereas the homogeneous framework utilizes the same clustering algorithm with different parameter settings. The general structure of the heterogeneous consensus clustering-based undersampling scheme is summarized in Figure 3. In this scheme, k-fold cross-validation is employed for dividing the imbalanced dataset into a training set and a test set. Then, the majority class is undersampled with the use of the heterogeneous consensus clustering scheme. The presented scheme can be configured with different clustering algorithms; in this study, we have combined five base clustering algorithms (namely, k-means, k-modes, k-means++, self-organizing maps, and the DIANA algorithm). The partitions obtained by the different clustering algorithms are combined by the consensus function. The center of each cluster of the final partition is selected as an instance of the majority class. In this way, a balanced training set is obtained. The predictive performance of the undersampling scheme is examined with the use of supervised learning methods and ensemble learning methods.

Figure 3: The general structure of the heterogeneous consensus clustering-based undersampling scheme (CONS2).

4. Experimental Analysis and Results

This section presents the empirical analysis of the proposed consensus clustering-based undersampling schemes.

4.1. Datasets

To examine the effectiveness of the proposed undersampling approaches, we have utilized 44 small-scale and 2 large-scale imbalanced classification benchmarks. The imbalanced classification benchmarks were utilized in Galar et al. [12]. The imbalance ratios of the small-scale benchmarks range from 1.8 to 129, and the number of instances ranges from 130 to 5500. The imbalance ratios of the large-scale benchmarks range from 111.46 to 163.19, and the number of instances ranges from 102294 to 145751. To obtain training and test sets for the supervised learning methods, we utilized a 5-fold cross-validation scheme, in which each split assigns 80% of the instances to the training set and 20% to the test set. The basic descriptive information regarding the imbalanced classification benchmarks is presented in Table 1.
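
The two quantities used to describe the benchmarks and the data split can be illustrated with the following sketch. It assumes the usual definition of the imbalance ratio as the number of majority instances divided by the number of minority instances, and it assumes that folds are stratified by class, which the text above does not state explicitly. Function names and the label encoding are hypothetical.

```python
import random
from typing import Dict, Hashable, List, Sequence


def imbalance_ratio(labels: Sequence[Hashable], minority_label: Hashable = 1) -> float:
    """Number of majority instances divided by the number of minority instances."""
    n_minority = sum(1 for y in labels if y == minority_label)
    return (len(labels) - n_minority) / n_minority


def stratified_folds(labels: Sequence[Hashable], n_folds: int = 5,
                     seed: int = 0) -> List[List[int]]:
    """Split instance indices into folds that preserve the class proportions.

    Assumption: stratification is used so that every fold contains minority instances.
    """
    rng = random.Random(seed)
    by_class: Dict[Hashable, List[int]] = {}
    for index, y in enumerate(labels):
        by_class.setdefault(y, []).append(index)
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds
```

Each fold serves once as the 20% test set while the remaining four folds form the 80% training set; undersampling is then applied to the training portion only.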

Table 1: Descriptive information for the datasets [12, 24].
4.2. Experimental Procedure

In the empirical analysis, the presented consensus clustering-based undersampling schemes have been compared with seven state-of-the-art methods. The methods utilized in the analysis include UnderBagging4 (UB4), UnderBagging24 (UB24), RusBoost1 (Rus1), SMOTEBagging4 (SBAG4), UnderBagging1 (UB1), clustering-based undersampling based on cluster centers (Centers), and clustering-based undersampling based on the nearest neighbors of cluster centers (Centers_NN) [12, 24]. In order to examine the changes in predictive performance obtained by the data balancing strategies, the results obtained by the C4.5 algorithm without data balancing have also been presented as the baseline results. In the consensus clustering schemes, five clustering algorithms (namely, k-means, k-modes, k-means++, self-organizing maps, and the DIANA algorithm) and their combinations were taken into consideration. In the classification phase, five supervised learning methods (namely, naïve Bayes, logistic regression, support vector machines, random forests, and the k-nearest neighbor algorithm) and three ensemble learning methods (namely, AdaBoost, bagging, and the random subspace algorithm) were utilized. The area under the ROC curve (AUC) was utilized as the evaluation metric. For the supervised learning methods and the state-of-the-art data preprocessing methods, the default parameters were employed. For the homogeneous consensus clustering-based undersampling scheme, the parameter i (the number of base clusterings) is set to five.
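
For readers less familiar with the evaluation metric, the sketch below computes the AUC through its rank interpretation: the probability that a randomly chosen minority (positive) instance receives a higher score than a randomly chosen majority (negative) instance, with ties counted as one half. The function name and the 0/1 label encoding are assumptions, and the quadratic loop is meant for illustration rather than for the large-scale benchmarks.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of positive and negative scores.

    Assumes labels are 1 (positive/minority) or 0 (negative/majority) and that
    both classes are present.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # prints 0.75
```

Unlike classification accuracy, this measure does not reward a model for simply predicting the majority class, which is why it is commonly preferred for imbalanced data.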

4.3. Experimental Results and Discussions

In Table 2, the average AUC values of the state-of-the-art methods and the conventional clustering algorithms (namely, k-means, k-means++, k-modes, self-organizing maps, and the DIANA algorithm) are presented. As can be observed from the results presented in Table 2, the application of data balancing strategies enhances the predictive performance in terms of AUC values. The lowest average AUC values are obtained by the C4.5 algorithm when no data balancing is applied. The highest average AUC values are generally obtained by the UnderBagging4 algorithm, and the second highest average AUC values are generally obtained by the UnderBagging24 algorithm. Among the five base clustering algorithms considered in the empirical analysis, the highest average AUC values are obtained by the DIANA clustering algorithm.

Table 2: Average AUC values of state-of-the-art methods with C4.5 classifier.

The homogeneous consensus clustering scheme utilizes a single clustering algorithm (of the same type) as the base clustering method. In the empirical analysis, five clustering algorithms (namely, k-means, k-modes, k-means++, self-organizing maps, and the DIANA algorithm) are considered as the base clustering methods. For aggregating the results of the individual clusterings, we considered three consensus functions (namely, the simple voting function, the incremental voting function, and the label correspondence search algorithm). In this way, 15 different homogeneous consensus clustering-based schemes are evaluated for imbalanced learning. In Table 3, the average AUC values obtained by the homogeneous consensus clustering schemes are presented. Compared to the results presented in Table 2 for conventional data-level methods and conventional clustering-based schemes, the homogeneous consensus clustering schemes yield better predictive performance in terms of AUC values. Among the compared homogeneous consensus clustering schemes, the highest predictive performance is obtained by utilizing the self-organizing map algorithm as the base clustering algorithm with the simple voting function as the consensus function.

Table 3: Average AUC values of homogeneous clustering schemes with C4.5 classifier.

For the heterogeneous consensus clustering scheme, the k-means, k-modes, k-means++, self-organizing maps, and DIANA algorithms were utilized to identify the individual partitions. Similar to the homogeneous scheme, we considered three consensus functions (namely, the simple voting function, the incremental voting function, and the label correspondence search algorithm). In this way, 3 different heterogeneous consensus clustering-based schemes are taken into consideration. In Table 4, the average AUC values obtained by the heterogeneous consensus clustering schemes are presented. As can be observed from the results listed in Table 4, the heterogeneous consensus clustering schemes outperform the homogeneous consensus clustering schemes, the conventional data-level methods, and the conventional clustering-based schemes. Regarding the average AUC values, the highest predictive performance is obtained by the heterogeneous clustering scheme with the label correspondence search-based consensus function. The second highest predictive performance is obtained by the heterogeneous clustering scheme with the simple voting-based consensus function.

Table 4: Average AUC values of heterogeneous clustering schemes with C4.5 classifier.

In the classification phase, five supervised learning methods (namely, naïve Bayes, logistic regression, support vector machines, random forests, and k-nearest neighbor algorithm) and three ensemble learner methods (namely, AdaBoost, bagging, and random subspace algorithm) were utilized. In order to summarize the main findings of the empirical analysis, boxplots for undersampling methods and supervised learning methods are presented in Figures 4 and 5, respectively.

Figure 4: Boxplot distributions of AUC values for conventional data balancing methods and the proposed scheme.
Figure 5: Boxplot distributions of AUC values for supervised learning methods and ensemble methods.

As can be observed from Figure 4, the average AUC values obtained by the presented heterogeneous clustering scheme are higher than those of the conventional data-level methods. Figure 5 presents the predictive performance of the conventional supervised learning methods and their ensembles. As can be observed, the ensemble learning methods yield higher predictive performance in terms of AUC values than the conventional supervised learning methods. The highest predictive performance is achieved by the random subspace ensemble of random forest, and the second highest predictive performance is obtained by the random subspace ensemble of support vector machines. Among the conventional supervised learning algorithms, naïve Bayes demonstrated the lowest predictive performance, whereas the random forest algorithm demonstrated the highest.
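
Since all comparisons in this section are expressed as AUC values, a small sketch of how AUC can be computed may be helpful. The Python function below uses the standard pairwise (rank-based) interpretation of AUC, namely the probability that a randomly chosen minority (positive) instance receives a higher score than a randomly chosen majority (negative) instance, with ties counted as one half. The function name and the example scores are assumptions for illustration only and are not taken from the experiments.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    `labels` holds 1 for the positive (minority) class and 0 for the
    negative (majority) class; `scores` holds the classifier outputs."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one instance of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical scores for two positive and two negative instances.
    print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

The quadratic pairwise loop is chosen for clarity; for large datasets a sort-based computation would be preferable.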

Figure 6 presents the 95% confidence intervals for the mean of the average AUC values obtained by the compared algorithms. Based on the statistical significance of the differences between the compared results, Figure 6 is divided into two regions, separated by a red dashed line. As can be observed from Figure 6, the differences in predictive performance obtained by the proposed consensus clustering-based schemes are statistically significant.

Figure 6: Interval plots for the compared algorithms.
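
An interval plot of this kind can be approximated as follows. The Python sketch computes a confidence interval for a mean using a normal approximation from the standard library; this is an assumption made for simplicity, since the procedure used to produce Figure 6 is not specified here and a Student's t interval would usually be preferred for small samples. The function names and the AUC values in the usage example are hypothetical placeholders, not results from the study.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(values, confidence=0.95):
    """Return (lower, upper) bounds of a normal-approximation confidence
    interval for the mean of `values`."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are required")
    center = mean(values)
    # Two-sided critical value, e.g. about 1.96 for a 95% level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * stdev(values) / sqrt(n)
    return center - half_width, center + half_width


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    # Placeholder AUC values for two hypothetical methods.
    method_a = [0.80, 0.82, 0.84, 0.86, 0.88]
    method_b = [0.70, 0.71, 0.73, 0.74, 0.72]
    ci_a = mean_confidence_interval(method_a)
    ci_b = mean_confidence_interval(method_b)
    print(ci_a, ci_b, intervals_overlap(ci_a, ci_b))
```

Non-overlapping intervals are only an informal indication of a difference between means; they do not replace a formal significance test.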

5. Conclusion

Class imbalance is an important problem in machine learning. Imbalanced datasets are encountered in a wide variety of applications, including medical diagnosis, malware detection, anomaly identification, bankruptcy prediction, and spam filtering. In order to build efficient and robust classification schemes, data preprocessing methods can be utilized in conjunction with supervised learning methods. Undersampling- and oversampling-based methods can be successfully utilized to handle class imbalance. However, the identification of informative instances to be included in the training set is a critical issue for undersampling. In this regard, this paper empirically examines the predictive performance of two consensus clustering-based undersampling schemes for imbalanced learning. In the empirical analysis, 44 small-scale and 2 large-scale imbalanced classification benchmarks (with imbalance ratios ranging between 1.8 and 163.19) were utilized. The experimental analysis indicates that clustering-based undersampling schemes can outperform conventional data-level preprocessing methods for class imbalance. In addition, consensus clustering, which aggregates the partitions of individual clustering algorithms, can further enhance the predictive performance of clustering-based undersampling schemes.

Several directions remain for future work. The presented consensus clustering-based undersampling scheme utilizes five clustering algorithms (namely, k-means, k-modes, k-means++, self-organizing maps, and the DIANA algorithm). The clustering algorithms have been integrated by means of three consensus functions, namely, the simple voting-based consensus function, the incremental voting function, and label correspondence search. Hence, the predictive performance of other conventional and swarm-based clustering algorithms (such as ant clustering, particle swarm-based clustering, and firefly clustering) can be examined for imbalanced learning. In addition, recent proposals in the field indicate that schemes for imbalanced learning which integrate instance selection and clustering may yield higher predictive performance. Hence, the performance of the consensus clustering-based undersampling scheme should also be examined in conjunction with conventional instance selection methods.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Disclosure

The study was performed as part of the employment of the author at Izmir Katip Celebi University.

Conflicts of Interest

The author declares that there are no conflicts of interest.

References

  1. G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing, “Learning from class-imbalanced data: review of methods and applications,” Expert Systems with Applications, vol. 73, pp. 220–239, 2017. View at Publisher · View at Google Scholar · View at Scopus
  2. V. López, A. Fernández, S. García, V. Palade, and F. Herrera, “An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics,” Information Sciences, vol. 250, pp. 113–141, 2013. View at Publisher · View at Google Scholar · View at Scopus
  3. G. M. Weiss, “Mining with rarity,” ACM Sigkdd Explorations Newsletter, vol. 6, no. 1, pp. 7–19, 2004. View at Publisher · View at Google Scholar
  4. C. Beyan and R. Fisher, “Classifying imbalanced data sets using similarity based hierarchical decomposition,” Pattern Recognition, vol. 48, no. 5, pp. 1653–1672, 2015. View at Publisher · View at Google Scholar · View at Scopus
  5. M. Denil and T. Trappenberg, “Overlap versus imbalance,” in Proceedings of Canadian Conference on Artificial Intelligence, pp. 220–231, Springer, Ottawa, Canada, May 2010.
  6. D. Rodriguez, I. Herraiz, R. Harrison, J. Dolado, and J. C. Riquelme, “Preliminary comparison of techniques for dealing with imbalance in software defect prediction,” in Proceedings of 18th International Conference on Evaluation and Assessment in Software Engineering, p. 43, ACM, London, UK, May 2014.
  7. R. Akbani, S. Kwek, and N. Japkowicz, “Applying support vector machines to imbalanced datasets,” in Proceedings of European Conference on Machine Learning ECML 2004, pp. 39–50, Prague, Czech Republic, September 2004. View at Publisher · View at Google Scholar
  8. N. Peiravian and X. Zhu, “Machine learning for android malware detection using permission and api calls,” in Proceedings of IEEE 25th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 300–305, IEEE, Herndon, VA, USA, November 2013.
  9. W. Khreich, E. Granger, A. Miri, and R. Sabourin, “Iterative Boolean combination of classifiers in the ROC space: an application to anomaly detection with HMMs,” Pattern Recognition, vol. 43, no. 8, pp. 2732–2752, 2010. View at Publisher · View at Google Scholar · View at Scopus
  10. M.-J. Kim, D.-K. Kang, and H. B. Kim, “Geometric mean based boosting algorithm with over-sampling to resolve data imbalance problem for bankruptcy prediction,” Expert Systems with Applications, vol. 42, no. 3, pp. 1074–1082, 2015. View at Publisher · View at Google Scholar · View at Scopus
  11. T. R. Hoens, R. Polikar, and N. V. Chawla, “Learning from streaming data with concept drift and imbalance: an overview,” Progress in Artificial Intelligence, vol. 1, no. 1, pp. 89–101, 2012. View at Publisher · View at Google Scholar
  12. M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, “A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, no. 4, pp. 463–484, 2012. View at Publisher · View at Google Scholar · View at Scopus
  13. B. Liu, Y. Ma, and C. K. Wong, “Improving an association rule based classifier,” in Proceedings of European Conference on Principles of Data Mining and Knowledge Discovery, pp. 504–509, Springer, Lyon, France, September 2000.
  14. G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard, “A study of the behavior of several methods for balancing machine learning training data,” ACM SIGKDD explorations newsletter, vol. 6, no. 1, pp. 20–29, 2004. View at Publisher · View at Google Scholar
  15. N. V. Chawla, D. A. Cieslak, L. O. Hall, and A. Joshi, “Automatically countering imbalance and its empirical relationship to cost,” Data Mining and Knowledge Discovery, vol. 17, no. 2, pp. 225–252, 2008. View at Publisher · View at Google Scholar · View at Scopus
  16. N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of artificial intelligence research, vol. 16, pp. 321–357, 2002. View at Publisher · View at Google Scholar
  17. N. Japkowicz, “The class imbalance problem: significance and strategies,” in Proceedings of International Conference on Artificial Intelligence, Las Vegas, NV, USA, June 2000.
  18. R. Barandela, R. M. Valdovinos, and J. S. Sánchez, “New applications of ensembles of classifiers,” Pattern Analysis and Applications, vol. 6, no. 3, pp. 245–256, 2003. View at Publisher · View at Google Scholar · View at Scopus
  19. N. V. Chawla, N. Japkowicz, and A. Kolcz, “Workshop learning from imbalanced data sets II,” in Proceedings of International Conference on Machine Learning, Washington, DC, USA, August 2003.
  20. S. Wang and X. Yao, “Diversity analysis on imbalanced data sets by using ensemble models,” in Proceedings of IEEE Symposium on Computational Intelligence and Data Mining CIDM'09, pp. 324–331, IEEE, Nashville, TN, USA, March 2009.
  21. J. Błaszczyński and J. Stefanowski, “Neighbourhood sampling in bagging for imbalanced data,” Neurocomputing, vol. 150, pp. 529–542, 2015. View at Publisher · View at Google Scholar · View at Scopus
  22. Z. Sun, Q. Song, X. Zhu, H. Sun, B. Xu, and Y. Zhou, “A novel ensemble method for classifying imbalanced data,” Pattern Recognition, vol. 48, no. 5, pp. 1623–1637, 2015. View at Publisher · View at Google Scholar · View at Scopus
  23. J. Kwak, T. Lee, and C. O. Kim, “An incremental clustering-based fault detection algorithm for class-imbalanced process data,” IEEE Transactions on Semiconductor Manufacturing, vol. 28, no. 3, pp. 318–328, 2015. View at Google Scholar
  24. W.-C. Lin, C.-F. Tsai, Y.-H. Hu, and J.-S. Jhang, “Clustering-based undersampling in class-imbalanced data,” Information Sciences, vol. 409-410, pp. 17–26, 2017. View at Publisher · View at Google Scholar · View at Scopus
  25. V. Vigneron and H. Chen, “A multi-scale seriation algorithm for clustering sparse imbalanced data: application to spike sorting,” Pattern Analysis and Applications, vol. 19, no. 4, pp. 885–903, 2016. View at Publisher · View at Google Scholar · View at Scopus
  26. D. H. Wolpert and W. G. Macready, “No free lunch theorems for search,” vol. 10, Santa Fe Institute, Santa Fe, NM, USA, 1995, Technical Report SFI-TR-95-02-010. View at Google Scholar
  27. C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, “RUSBoost: a hybrid approach to alleviating class imbalance,” IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, vol. 40, no. 1, pp. 185–197, 2010. View at Publisher · View at Google Scholar · View at Scopus
  28. S. Wang, K. Tang, and X. Yao, “Diversity exploration and negative correlation learning on imbalanced data sets,” in Proceedings of International Joint Conference on Neural Networks, IJCNN 2009, pp. 3259–3266, IEEE, Atlanta, GA, USA, June 2009.
  29. N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, “SMOTEBoost: improving prediction of the minority class in boosting,” in Proceedings of European Conference on Principles of Data Mining and Knowledge Discovery, pp. 107–119, Springer, Cavtat-Dubrovnik, Croatia, September 2003.
  30. Z. Huang, “A fast clustering algorithm to cluster very large categorical data sets in data mining,” DMKD, vol. 3, no. 8, pp. 34–39, 1997. View at Google Scholar
  31. D. Arthur and S. Vassilvitskii, “k-means++: the advantages of careful seeding,” in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, January 2007.
  32. T. Kohonen, Self-Organising Maps Berlin, Springer, Berlin, Germany, 2001.
  33. H. Chipman and R. Tibshirani, “Hybrid hierarchical clustering with applications to microarray data,” Biostatistics, vol. 7, no. 2, pp. 286–301, 2005. View at Publisher · View at Google Scholar · View at Scopus
  34. S. Barua, M. M. Islam, X. Yao, and K. Murase, “MWMOTE—majority weighted minority oversampling technique for imbalanced data set learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 2, pp. 405–425, 2014. View at Publisher · View at Google Scholar · View at Scopus
  35. A. Anand, G. Pugalenthi, G. B. Fogel, and P. N. Suganthan, “An approach for classification of highly imbalanced data using weighting and undersampling,” Amino Acids, vol. 39, no. 5, pp. 1385–1391, 2010. View at Publisher · View at Google Scholar · View at Scopus
  36. Q. Li, B. Yang, Y. Li, N. Deng, and L. Jing, “Constructing support vector machine ensemble with segmentation for imbalanced datasets,” Neural Computing and Applications, vol. 22, no. S1, pp. 249–256, 2013. View at Publisher · View at Google Scholar · View at Scopus
  37. N. S. Kumar, K. N. Rao, A. Govardhan, K. S. Reddy, and A. M. Mahmood, “Undersampled K-means approach for handling imbalanced distributed data,” Progress in Artificial Intelligence, vol. 3, no. 1, pp. 29–38, 2014. View at Publisher · View at Google Scholar
  38. A. D’Addabbo and R. Maglietta, “Parallel selective sampling method for imbalanced and large data classification,” Pattern Recognition Letters, vol. 62, pp. 61–67, 2015. View at Publisher · View at Google Scholar · View at Scopus
  39. J. Ha and J. S. Lee, “A new under-sampling method using genetic algorithm for imbalanced data classification,” in Proceedings of the 10th International Conference on Ubiquitous Information Management and Communication, p. 95, ACM, Danang, Vietnam, January 2016.
  40. G. Shobana and B. P. Battula, “An under sampled k-means approach for handlingimbalanced data using diversified distribution,” International Journal of Engineering and Technology (UAE), vol. 7, no. 1.8, pp. 113–117, 2018. View at Publisher · View at Google Scholar
  41. H. Guo and T. Wei, “Logistic regression for imbalanced learning based on clustering,” International Journal of Computational Science and Engineering, vol. 18, no. 1, pp. 54–64, 2019. View at Publisher · View at Google Scholar
  42. G. Douzas, F. Bacao, and F. Last, “Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE,” Information Sciences, vol. 465, pp. 1–20, 2018. View at Publisher · View at Google Scholar · View at Scopus
  43. W. Han, Z. Huang, S. Li, and Y. Jia, “Distribution-sensitive unbalanced data oversampling method for medical diagnosis,” Journal of medical Systems, vol. 43, no. 2, p. 39, 2019. View at Publisher · View at Google Scholar
  44. C.-F. Tsai, W.-C. Lin, Y.-H. Hu, and G.-T. Yao, “Under-sampling class imbalanced datasets by combining clustering analysis and instance selection,” Information Sciences, vol. 477, pp. 47–54, 2019. View at Publisher · View at Google Scholar
  45. T. Boongoen and N. Iam-On, “Cluster ensembles: a survey of approaches with recent extensions and applications,” Computer Science Review, vol. 28, pp. 1–25, 2018. View at Publisher · View at Google Scholar · View at Scopus
  46. N. Nguyen and R. Caruana, “Consensus clusterings,” in Proceedings of Seventh IEEE International Conference on Data Mining ICDM 2007, pp. 607–612, IEEE, Omaha, NE, USA, October 2007.
  47. A. P. Topchy, M. H. Law, A. K. Jain, and A. L. Fred, “Analysis of consensus partition in cluster ensemble,” in Proceedings of Fourth IEEE International Conference on Data Mining ICDM’04, pp. 225–232, IEEE, Brighton, UK, November 2004.
  48. H. G. Ayad and M. S. Kamel, “Cumulative voting consensus method for partitions with variable number of clusters,” IEEE Transactions on Pattern Analysis and Machine Intellzigence, vol. 30, no. 1, pp. 160–173, 2008. View at Publisher · View at Google Scholar · View at Scopus
  49. C. Boulis and M. Ostendorf, “Combining multiple clustering systems,” in Proceedings of European Conference on Principles of Data Mining and Knowledge Discovery, pp. 63–74, Springer, Cavtat-Dubrovnik, Croatia, September 2003.