Abstract

We present a classification framework that combines multiple heterogeneous classifiers in the presence of class label noise. An extension of m-Mediods based modeling is presented that generates models of the various classes whilst identifying and filtering noisy training data. This noise-free data is then used to learn models for other classifiers such as GMM and SVM. A weight learning method is introduced to learn per-class weights for the different classifiers and construct an ensemble. For this purpose, we apply a genetic algorithm to search for an optimal weight vector under which the classifier ensemble is expected to give the best accuracy. The proposed approach is evaluated on a variety of real life datasets and compared with standard ensemble techniques such as Adaboost, Bagging, and Random Subspace Methods. Experimental results show the superiority of the proposed ensemble method over its competitors, especially in the presence of class label noise and imbalanced classes.

1. Introduction

In recent years, there has been growing research interest in the development of sophisticated data classification techniques, which have practical applications in a variety of fields such as object recognition, surveillance, medical imaging, and financial data analysis. These techniques are widely used in industry in the form of tools and commercial applications.

Classification of unseen data requires building models of normality for the commonly followed patterns that exist in a given domain. Once these models have been learnt for all the known patterns, they are used to classify unseen samples to one of the modelled patterns. A variety of machine learning approaches have been proposed that model the normal patterns and classify newly arriving data using the generated models of normality. Statistical approaches for classification [1–3] model a pattern by approximating the density of the training samples belonging to the given pattern. Khalid and Naftel [3] model each pattern by approximating a single multivariate Gaussian per class; classification of unseen samples is then performed using a Mahalanobis classifier. Classification approaches based on the Gaussian Mixture Model (GMM) have also been proposed [4–7]. Various techniques based on support vector machines (SVM) have also been presented [8–11]. The underlying idea behind SVM-based approaches is to identify an optimal hyperplane separating training data belonging to different patterns and then to classify new data based on these identified decision boundaries. One class classifiers based on SVM (OCC-SVM) have also been employed for classification and anomaly detection [12, 13]. Neural network based classifiers have also been reported in the literature [14, 15]. Owens and Hunter [14] employ self-organizing maps to learn the model of normality of a given set of patterns. Various m-Mediods based approaches have been proposed for data classification and anomaly detection [16–19].

There are many possible scenarios for how samples belonging to different classes are distributed with respect to each other. There may be a high level of overlap between the distributions of samples belonging to different classes. In other cases, samples belonging to various classes may exhibit complex-shaped, nonoverlapping distributions with tight decision boundaries between them. Similarly, we may have classes that exhibit multivariate distributions of samples within individual patterns. Another commonly occurring phenomenon is that the classes in a dataset have widely varying numbers of training samples, which is normally referred to as the class imbalance problem. This phenomenon is frequent in the medical domain, where the sensitive, dangerous, and later stages of a disease are not very frequent, resulting in less data being available for training classifiers. On the other hand, nonmalignant forms of these diseases are often common, and hence a large number of samples is available for training machine learning algorithms. As general classifiers normally optimize overall classification accuracy, they focus more attention on correctly classifying samples from heavily populated classes than on classes which may be sensitive but have fewer members. This is in total contrast to our actual expectations: misclassification of less serious diseases may be tolerable, but misclassifying serious disease classes may result in loss of lives.

The above-mentioned and various other scenarios pose challenges for classifiers. An individual classifier may be well suited to handling some of these problems but may perform poorly in the presence of others. For instance, GMM is known to perform well when the distributions of samples belonging to different classes overlap, but it does not yield good results for nonoverlapping distributions with tight and complex decision boundaries. On the other hand, SVM performs well in the presence of complex decision surfaces between nonoverlapping distributions, but its performance degrades in the presence of the class imbalance problem. Similarly, m-Mediods [19] can handle multivariate distributions of samples within a pattern and can handle overlapping distributions belonging to different classes to a reasonable extent.

To overcome this problem, combining classifiers in an ensemble has recently gained considerable interest in the literature [20–43]. The underlying idea behind a classifier ensemble is that many classifiers are combined within a framework to produce a final strong classifier whose decisions are expected to be more precise and effective than those of its individual components. Ensemble-based classification approaches can be broadly categorized into two types: homogeneous and heterogeneous classifier ensembles. Homogeneous approaches combine classification algorithms of the same type, whereas heterogeneous ensembles combine classification algorithms of different types. The most popular standard ensemble methods include Bagging [30], Boosting (Adaboost) [31], and the Random Subspace Method (RSM) [44] (a generalization of the Random Forest method [32]). These are the dominant methods for diversifying and combining classification results.

Most existing classification techniques assume that the training data is free of problems such as class label noise and class imbalance. Some ensemble methods [45–49] handle the class imbalance problem, and very few methods cater for class label noise in the datasets. In this paper, we present a novel classification framework that can handle both class label noise and imbalanced classes. An extension of the m-Mediods based classification approach is proposed that learns the model of normality of classes whilst identifying and filtering training samples representing class label noise. The m-Mediods classifier is then combined with other well-known classifiers using the proposed framework. We selected GMM and a one class classifier based on SVM (OCC-SVM), chosen carefully for their ability to handle different types of class distributions. The models learned using the selected classifiers are then combined by introducing a genetic algorithm based weight learning method that learns weights at the class level individually for all heterogeneous classifiers. The probabilistic output of each classifier is then combined in a class-level weighted combination to achieve better classification performance. The proposed framework is robust to the class imbalance problem, class label noise, and the presence of multivariate distributions of samples within classes.

The remainder of the paper is organized as follows. In Section 2, we present a brief review of existing classifier ensemble techniques. Section 3 presents an overview of the learning of the proposed ensemble framework and the classification of unseen data using the learned ensemble. Section 4 presents the different classifiers that are used for the construction of the proposed ensemble. In Section 5, an extension of the m-Mediods based approach to filter class label noise is presented. The detailed description of the proposed ensemble framework is given in Section 6. Experiments conducted to show the effectiveness of the proposed approach as compared to its competitors are reported in Section 7. The last section concludes the paper.

2. Related Work

Classifier ensembles are known to be very useful for improving classification accuracy as well as diversity. They combine multiple classifiers into a single stronger one whose performance is more precise and accurate than that of its individual members. A variety of factors have been considered in the literature to improve the accuracy of an ensemble. These include classifier selection [20, 21, 28, 50], feature selection [21, 27, 32], diversity creation in the ensemble of classifiers [34, 39, 40], combination methods [20–23, 25, 30–33, 39], and combining more than one ensemble [24, 41, 42]. Certain ensemble approaches integrate statistical procedures such as Bagging with Principal Component Analysis (PCA) [21], dynamic classifier weighting based on cross validation [22], the Dempster-Shafer theory of evidence [25], and supervised projection [26].

Earlier work has concentrated heavily on assigning weights to instances as well as classifiers to construct an ensemble, using weight assignment methods [51] and voting algorithms [29–32, 35, 52, 53]. Bagging [30] and Boosting [31] are among the most well-known voting and weighting based standard ensemble methods. Breiman [30] introduced the Bagging algorithm, an independent ensembling method in which the output produced by one classifier does not depend on the output of previous classifiers. Bootstrap samples are generated from the training set by sampling with replacement, and a classifier is learned on a different bootstrap sample in each iteration. Bagging follows a voting approach to combine the predictions of the classifiers. C.-X. Zhang and J. Zhang [21] combined the concept of Bagging with Principal Component Analysis (PCA) to construct an ensemble. Bauer and Kohavi [52] provided a variant of Bagging, namely, wagging, which makes a stochastic assignment of weights to each instance.
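To make the Bagging procedure described above concrete, the following Python sketch implements bootstrap resampling with majority voting. It is a minimal illustration rather than the exact formulation of [30]; the base learner and the parameter names (base_estimator, n_estimators) are placeholders, and integer-coded class labels are assumed.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, base_estimator=DecisionTreeClassifier(), n_estimators=50, seed=0):
    """Train n_estimators copies of base_estimator on bootstrap resamples of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)                  # sampling with replacement
        models.append(clone(base_estimator).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Combine member predictions by simple majority voting (integer class labels assumed)."""
    votes = np.stack([m.predict(X) for m in models]).astype(int)   # (n_estimators, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```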

The Random Forest method proposed by Breiman [32] uses unpruned decision trees. It consists of a collection of tree-structured classifiers and uses a number of input variables to make the decision at each node of a tree. To classify a new pattern, the algorithm collects votes from every tree in the forest and then uses majority voting to decide the class label. A generalization of the Random Forest approach, referred to as the Random Subspace Method (RSM), has also been presented [44]. Instead of working only with decision trees, as in the case of Random Forest, RSM can employ any classifier, such as the nearest neighbor classifier or support vector machine. A subspace of the feature space is identified by randomly selecting a subset of features, and each member classifier is trained on such a subspace.
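A similarly minimal sketch of the Random Subspace idea follows: each member is trained on a randomly chosen subset of the features and predictions are combined by majority vote. The base learner and the subspace_size parameter are illustrative assumptions, not values taken from [44].

```python
import numpy as np
from sklearn.base import clone
from sklearn.neighbors import KNeighborsClassifier

def rsm_fit(X, y, base_estimator=KNeighborsClassifier(), n_estimators=50, subspace_size=0.5, seed=0):
    """Train each ensemble member on a randomly chosen subset of the features."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    k = max(1, int(subspace_size * d))
    members = []
    for _ in range(n_estimators):
        feats = rng.choice(d, size=k, replace=False)      # random feature subspace
        members.append((feats, clone(base_estimator).fit(X[:, feats], y)))
    return members

def rsm_predict(members, X):
    """Majority vote over members, each applied to its own feature subspace."""
    votes = np.stack([m.predict(X[:, feats]) for feats, m in members]).astype(int)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```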

Freund and Schapire proposed the Boosting [31] algorithm, which enhances the performance of a weak learner by iteratively running it on the training data. García-Pedrajas and García-Osorio [26] combine the concept of Boosting with a supervised projection method. The technique focuses on misclassified instances, which are used to find a supervised projection of the data; the next classifier is then trained on the projected data. This approach does not employ a majority voting or weighting scheme to combine the classifiers. Merler et al. [53] improve the simple Boosting algorithm with an iterative process which focuses on the inaccuracies produced by the previous classifiers and concentrates entirely on those samples which are hard to classify. Boosting is a homogeneous and dependent ensemble method. It takes the whole training set in each of its iterations and does not create bootstrap samples. Equal weights are initially assigned to every sample in the training dataset. After every iteration, the weights of misclassified instances are increased while the weights of correctly classified ones are decreased. It further assigns a weight to each individual classifier to measure its overall accuracy, giving higher weights to classifiers that perform accurately. New samples are classified using these weights.
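The reweighting loop described above can be sketched as follows. This is a simplified AdaBoost.M1-style update written purely for illustration; the stopping rule and weight formulas are the textbook ones rather than the exact formulation of [31].

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, base_estimator=DecisionTreeClassifier(max_depth=1), n_estimators=50):
    """AdaBoost.M1-style loop: boost the weights of misclassified training samples."""
    n = len(X)
    w = np.full(n, 1.0 / n)                               # equal initial sample weights
    members, alphas = [], []
    for _ in range(n_estimators):
        clf = clone(base_estimator).fit(X, y, sample_weight=w)
        pred = clf.predict(X)
        err = max(np.sum(w[pred != y]) / np.sum(w), 1e-10)
        if err >= 0.5:                                    # weak learner no better than chance
            break
        alpha = 0.5 * np.log((1 - err) / err)             # classifier weight
        members.append(clf)
        alphas.append(alpha)
        # increase weights of misclassified samples, decrease the rest, renormalise
        w *= np.where(pred != y, np.exp(alpha), np.exp(-alpha))
        w /= w.sum()
    return members, alphas

def adaboost_predict(members, alphas, X, classes):
    """Weighted vote: each member adds its alpha to the score of the class it predicts."""
    scores = np.zeros((len(X), len(classes)))
    for clf, a in zip(members, alphas):
        pred = clf.predict(X)
        for j, c in enumerate(classes):
            scores[pred == c, j] += a
    return np.asarray(classes)[scores.argmax(axis=1)]
```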

Approaches using dynamic ensemble learning have also been proposed in the literature [22, 27]. Zhu et al. [22] provided a new ensemble model named the dynamic weighting ensemble classifier based on cross validation (DWEC-CV). In this method, different classifiers are used for different samples. To train and construct the classifiers, the Random Subspace Method is used: feature subspaces are selected randomly from the original feature set, and the number of selected subspaces determines the number of classifiers produced. A weight-adjusted voting algorithm [29] has been introduced which uses weight vectors for instances as well as for classifiers. The instance weight vector gives higher weights to those instances which are difficult to classify, whereas the classifier weight vector gives the highest weights to those classifiers that perform best on these difficult instances; the classifier weights are thus determined by the samples having higher weights. DECORATE [54] is a diversity creation method for constructing a classifier ensemble; it creates additional artificial training data in order to create a diverse classifier ensemble. Methods for dynamic ensemble selection [28] that use the majority voting rule to combine the classifiers have also been introduced in the literature. The methods in [24, 41, 42] combine more than one ensemble in order to achieve improved accuracy and diversity. Al-Ani and Deriche [25] combine classifiers using the Dempster-Shafer theory of evidence.

The problem with the majority of existing ensembles is that they do not cater for the presence of noise in the training data, including feature space noise and class label noise, even though such problems are to be expected in real life datasets. The development of approaches that are robust to feature space noise and class label noise has received scant attention. Dietterich [55] performed a comparative analysis of standard ensembling techniques in the presence of class label noise and showed that the accuracies of techniques such as Bagging, Adaboost, and the Random Subspace Method degrade in the presence of noise. We will also show through our experimental evaluation that Adaboost, Bagging, and RSM do not perform well on imbalanced and overlapping classes.

There exist a few approaches which cater for the presence of noise in the training data. One of them is a method based on Boosting with supervised projection [26], which demonstrates the noise tolerance of its proposed ensemble. It compares only one standard ensemble method, Adaboost, with the proposed ensemble using different base learners to show the sensitivity of Adaboost toward different levels of class label noise. Several studies have shown that Boosting methods such as Adaboost degrade in performance when some level of noise is present in the dataset [56]. Arjun and Arora introduced TRandom Adaboost [45], which performs better than simple Adaboost on both low and high noise data. Classifier methods constructed from imbalanced datasets also exist in the literature [46]; a dataset is said to be imbalanced if the classes are not equally represented by their instances. Ensemble methods [47–49, 57, 58] have been proposed to improve performance on imbalanced datasets. Zhou et al. [57] studied the relationship between an ensemble and its neural network components for regression and classification. The EasyEnsemble and BalanceCascade methods have been introduced [47] to handle the class imbalance problem: EasyEnsemble produces subsets from the majority class, trains a base learner on each of these subsets, and combines the outputs of those learners. Ryan and Nitesh [48] gave an extension of the Random Subspace Method [44] that incorporates SMOTE [46] to overcome the class imbalance problem. There is also a cost sensitive ensemble method for imbalanced datasets [49]; it divides the instances of the majority class into several subsets on the basis of the imbalanced sample proportions and then trains subclassifiers using the Adaboost method. An ensemble-based wrapper approach for feature selection from data with highly imbalanced class distributions has also been proposed [58]. It creates multiple balanced datasets from the original imbalanced one by sampling and then evaluates feature subsets using ensemble classifiers, each trained on a balanced dataset.

The contribution of this paper is a robust classifier ensemble that combines existing state-of-the-art classifiers, namely, GMM and OCC-SVM, with our proposed m-Mediods classifier in a weighted ensemble. A genetic algorithm based weight learning approach is presented to optimize the accuracy of the proposed classifier ensemble. The proposed ensemble can handle real life issues of labelled datasets such as class label noise and class imbalance.

3. Overview of Proposed Ensemble Framework

Classification in the presence of class label noise and imbalance in the distribution of samples across different classes is a challenging task. Figure 1 presents an overview of our proposed framework for combining classifiers in an ensemble to handle this challenge. The proposed framework is composed of three main modules: filtering of class label noise from the training data to mitigate its effect on the model learning process, learning of weights on classes with respect to the different classifiers, and classification by combining their probabilistic outputs using the learned weights. The module for filtering class label noise is based on an extension of our previously proposed m-Mediods classifier and is composed of two steps. In step 1, we generate the mediods based model to represent the different classes. In step 2, we prune those mediods that are isolated or surrounded by mediods from different classes and that represent few samples. This module generates the m-Mediods based model whilst filtering the samples representing class label noise. The filtered training data is then used to generate the models of normality of the other classifiers.

The filtered training data is also used by the second module to learn weights on the probabilistic output of the classifiers with respect to the different classes. We employ k-fold cross validation to learn the weights: the models of normality of the various classifiers are learned on k−1 folds, whereas the left-out fold is used for cross validation. We speed up the learning process by identifying those samples from the cross validation set for which there is confusion, that is, samples for which the classifiers do not all predict the same class. Weight learning using the genetic algorithm is then carried out using only these confused samples. Learning weights at the class level with respect to the different classifiers enables our proposed framework to exploit each classifier's speciality at the class level. A classifier may perform very well when predicting a subset of classes due to the particular type of distribution among those classes, while the same classifier may perform poorly for other classes whose samples exhibit different types of distributions. For instance, a classifier performing well on highly overlapping distributions may not work well for nonoverlapping classes with complex and tight boundaries between them. The models of normality learned using the filtered training data and the weights learned in the second module are used by the third module for classifying test data. The probabilistic output of each classifier with respect to the different classes is combined using the learned weights to yield improved classification performance.

4. Classifiers

In this section, we describe the different classifiers that are used for the construction of the proposed classifier ensemble. Given the feature vector representation of training samples from any domain, we propose to generate models of normality using state-of-the-art classifiers including the Gaussian Mixture Model (GMM), the one class classifier based on a support vector machine (OCC-SVM), and an extension of the multivariate m-Mediods based classifier proposed in [19] that incorporates the capability of handling class label noise. We have selected these classifiers intentionally because of their abilities to handle different types of class distributions. GMM [4] is good at handling overlapping classes and diverse distributions, but it gives poor results for nonoverlapping classes with complex and tight boundaries. On the other hand, SVM [8, 12] handles complex and tight decision boundaries very well by looking for optimal hyperdimensional decision surfaces separating the samples belonging to different classes. However, the effectiveness of OCC-SVM decreases as the amount of overlap among the classes increases. Another disadvantage of OCC-SVM is that it overlooks classes with small membership counts in order to correctly classify samples belonging to classes with large membership counts. The multivariate m-Mediods classifier [19] has the capacity to cater for multivariate distributions of samples within a modeled class. It has the strength of modeling complex patterns without imposing any limitation on the shape of the distribution of samples within a given pattern. Once the multivariate m-Mediods models for all the classes have been learnt, the classification of test samples is achieved using a soft classification technique that can handle multimodal and overlapping distributions of samples among the different patterns within a dataset. It also caters for patterns with small membership counts. Combining classifiers with different skills and specialties enables the ensemble to cover these complementary aspects of the classification problem.
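The paper does not prescribe an implementation for the GMM and OCC-SVM members, so the following sketch simply assumes that one GMM and one one-class SVM are fitted per class on that class's training samples, using scikit-learn with placeholder hyperparameters (n_components, nu); the per-class scores can later be mapped to the probabilistic outputs consumed by the ensemble.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import OneClassSVM

def fit_per_class_models(X, y, n_components=2, nu=0.1):
    """Fit one GMM and one OCC-SVM per class on that class's training samples."""
    models = {}
    for c in np.unique(y):
        Xc = X[y == c]
        gmm = GaussianMixture(n_components=n_components).fit(Xc)
        occ = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(Xc)
        models[c] = (gmm, occ)
    return models

def class_scores(models, x):
    """Per-class scores for one sample: GMM log-likelihood and signed OCC-SVM margin."""
    x = np.asarray(x).reshape(1, -1)
    return {c: (gmm.score_samples(x)[0], occ.decision_function(x)[0])
            for c, (gmm, occ) in models.items()}
```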

One of the major drawbacks of all these classifiers is that their effectiveness degrades significantly with an increasing amount of class label noise in the training data. In the next section, we present an extension of our previously proposed m-Mediods based classifier to mitigate the effect of class label noise in the training data.

5. Modified m-Mediods Based Classifier to Filter Class Label Noise

In this section, we present an extension of the m-Mediods based classifier of [19] to filter class label noise whilst generating the models of normality of the different classes. The proposed approach comprises three major steps: modelling of the known classes using the m-Mediods model, filtering of training samples representing class label noise using the m-Mediods based models of all the classes, and classification of unseen samples using the proposed classifier.

5.1. Multivariate m-Mediods Based Modeling

Given labelled training data, we propose to generate an m-Mediods model that represents the normal class distribution of each known class using a model composed of m mediods. Let T be our training dataset containing n samples, each sample in T being represented by a d-dimensional feature vector x. Let T_c represent the samples belonging to class c. The algorithm to generate the modified m-Mediods based model, whilst filtering samples representing class label noise, comprises the following steps.

(1) If the number of samples in the training data is less than a threshold η, go to step 8. The value of η is set large enough that the application of agglomerative merging remains feasible with respect to time.

(2) Initialize the semifuzzy self-organizing map (SFSOM) with a number of output nodes Q which is much greater than m; the value of Q is chosen empirically. Setting much higher values of Q increases the computational complexity without any considerable impact on modeling quality, whereas much lower values fail to mitigate the problem of local minima typically associated with quantization based approaches.

(3) Initialize the weight vector representations of the output nodes, referred to as w_j (where j = 1, …, Q), using the probability density function approximated from the training samples in T_c.

(4) Determine the nearest neighbors of an input training sample x from the set of weight vectors of the output nodes, using the Euclidean distance and the membership count function, restricted to the neighborhood size θ(t) with respect to the current training iteration t.

(5) Train the proposed SFSOM network by updating the weights of the selected neighbors, where w_j is the weight vector associated with output neuron j, the update magnitude depends on the order of closeness of w_j to x, α(t) is the learning rate of the SFSOM with respect to the current training cycle t, and the membership function has an initial value of 1 and decreases with increasing t.

(6) Decrease the neighborhood size θ(t) and the learning rate α(t) with time, where θ(0) is the number of neighbors to be affected at the start of training and t_max is the maximum number of training iterations.

(7) Iterate through steps 4 to 6 for all the training iterations.

(8) Compute the membership of the training samples by assigning them to the nearest output nodes.

(9) Identify and remove output nodes with zero membership count.

(10) Merge the closest pair of weight vectors into a single weight vector.

(11) Iterate through step 10 till the number of output nodes is equal to m. Append the remaining weight vectors to the list of mediods M_c modeling the pattern c.

(12) Approximate the density of the local distribution around each mediod by computing the mean of the distances of the mediod from its nearest mediods. Append this average distance, in correspondence with the given mediod, to the mediods list M_c.
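The role of this modeling stage can be illustrated with the Python sketch below. For brevity, the SFSOM quantization and agglomerative merging of steps 2–11 are replaced by k-means; the helper name build_m_mediods_model and the parameters m and k_density are illustrative, not taken from [19]. The output plays the same role as the model above: m prototype vectors per class plus a local density estimate for each prototype (step 12).

```python
import numpy as np
from sklearn.cluster import KMeans

def build_m_mediods_model(X_c, m=10, k_density=3):
    """Approximate the m-Mediods model of a single class.

    k-means is used here as a stand-in for SFSOM quantisation plus
    agglomerative merging.  Assumes the class has more than m samples.
    """
    mediods = KMeans(n_clusters=m, n_init=10).fit(X_c).cluster_centers_
    # local density estimate: mean distance of each mediod to its nearest fellow mediods
    dists = np.linalg.norm(mediods[:, None, :] - mediods[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    k = min(k_density, m - 1)
    local_density = np.sort(dists, axis=1)[:, :k].mean(axis=1)
    return mediods, local_density
```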

5.2. Filtering of Class Label Noise

Once the m-Mediods based models of all the classes have been learnt, we apply a filtering process on the complete set of mediods to identify a subset of mediods tentatively representing samples with class label noise. This filtration process is based on the observation that a sample with class label noise will not generally be surrounded by samples belonging to the same class. Consequently, the mediod representing such samples will be surrounded by mediods representing other classes, with little or no presence of mediods from its own class. The filtration algorithm to remove samples representing class label noise is composed of the following steps.

(1) Merge the sets of mediods modeling the different classes into a superset M = M_1 ∪ M_2 ∪ ⋯ ∪ M_C, where C is the total number of classes in the given dataset.

(2) Sequentially select a mediod μ_i from M (i = 1, …, |M|). Identify the subset of mediods from M, referred to as N_i, that are members of the nearest neighbors of μ_i.

(3) Identify the subset S_i of mediods from N_i that belong to the same class as μ_i, where the class is given by the labeling function that returns the label of a given sample or mediod.

(4) Prune the mediod μ_i if there are no mediods in S_i.

(5) Filter the mediods representing samples with class label noise by removing all pruned mediods from the class models.

(6) Filter the samples representing class label noise from the training data by removing the training samples associated with the pruned mediods.
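A minimal sketch of this filtering stage is given below, assuming the mediods and their class labels from Section 5.1 are available as arrays; the neighborhood size k and the rule that a training sample is owned by its nearest mediod are assumptions of this sketch rather than details fixed by the paper.

```python
import numpy as np

def filter_label_noise(mediods, mediod_labels, X, y, k=5):
    """Prune mediods whose k nearest fellow mediods all carry a different class label,
    then drop the training samples owned by a pruned mediod (suspected label noise)."""
    mediods = np.asarray(mediods)
    mediod_labels = np.asarray(mediod_labels)
    d_mm = np.linalg.norm(mediods[:, None, :] - mediods[None, :, :], axis=2)
    np.fill_diagonal(d_mm, np.inf)
    keep = np.ones(len(mediods), dtype=bool)
    for i in range(len(mediods)):
        nn = np.argsort(d_mm[i])[:k]                      # k nearest mediods of mediod i
        keep[i] = np.any(mediod_labels[nn] == mediod_labels[i])
    # each training sample is owned by its nearest mediod; samples owned by a
    # pruned mediod are filtered out as suspected class label noise
    d_xm = np.linalg.norm(X[:, None, :] - mediods[None, :, :], axis=2)
    owner = d_xm.argmin(axis=1)
    clean = keep[owner]
    return mediods[keep], mediod_labels[keep], X[clean], y[clean]
```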
5.3. Multivariate m-Mediods Based Classification

Once we have generated the multivariate m-Mediods based models of the normal classes after mitigating the effect of class label noise, the classification of an unseen sample is performed by measuring the closeness of its feature vector representation to the m-Mediods models of the different classes; the sample is then classified to the class with the minimum distance. The proposed algorithm for multivariate m-Mediods based classification of unseen samples using the learned m-Mediods models comprises the following steps.

(1) Identify the nearest neighbors of the test sample from the m-Mediods model of each class separately.

(2) Compute the membership of the unseen sample with respect to class c using the average of the mean distances corresponding to the mediods identified in step 1; the mean distance corresponding to a given mediod is precomputed and stored in the mediods list M_c, as specified in step 12 of the modeling algorithm presented in Section 5.1. The test sample is classified to the class with the highest membership probability.
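This classification step can be sketched as follows. The exact membership expression of [19] is not reproduced here, so the exponential form used below is only an assumed stand-in: it decays with the distance of the sample from a class's nearest mediods, normalised by their precomputed local densities.

```python
import numpy as np

def m_mediods_classify(x, class_models, k=3):
    """Soft m-Mediods classification of a single test sample.

    class_models maps a class label to (mediods, local_density) as produced by
    build_m_mediods_model.  The membership score is an assumed form, not the
    exact expression of [19].
    """
    scores = {}
    for c, (mediods, local_density) in class_models.items():
        d = np.linalg.norm(mediods - x, axis=1)
        nn = np.argsort(d)[:k]
        scores[c] = np.exp(-d[nn].mean() / (local_density[nn].mean() + 1e-12))
    z = sum(scores.values())
    probs = {c: s / z for c, s in scores.items()}         # normalised pseudo-probabilities
    return max(probs, key=probs.get), probs
```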

6. Proposed Classifier Ensemble

In this section, we present a framework for learning and classification using the proposed ensemble of classifiers. Although we intend to use the three classifiers specified in Section 4, we present a generic ensemble framework that can accommodate any number of classifiers. Once the models of the different classifiers have been generated, we propose a genetic algorithm based approach to learn weights on each class for the different classifiers using the set of identified confused samples. An instance in the population is a weight vector W that contains the weights used to scale the probability/confidence of each classifier on the prediction of each class. The weight vector can be represented as

W = [w_{1,1}, …, w_{1,C}, w_{2,1}, …, w_{2,C}, …, w_{L,1}, …, w_{L,C}],

where L is the number of classifiers integrated in the proposed framework, C is the number of classes, and w_{l,c} is the weight associated with the probabilistic output of classifier l regarding class c. The algorithm for learning the optimal weight values on the classes to improve the ensemble accuracy comprises the following steps.

(1) Initialize the population for the genetic algorithm, represented as P_W, by randomly generating weight vectors. Pad P_W with some predefined weight vectors in which a particular classifier is given maximum confidence for one class and no confidence for the other classes.

(2) Normalize the weight vectors so that the sum of the weights in each vector is equal to 1.

(3) Set the accuracy counter of each weight vector to zero.

(4) Divide the noise free dataset D into K folds. Let D_k represent the kth fold of D, where k = 1, …, K.

(5) Select the kth fold and treat it as the cross validation set V_k. The training set used to learn the classifier models is then obtained as the union of the remaining folds, D \ V_k.

(6) Learn the individual classifier models, that is, the multivariate m-Mediods, GMM, and OCC-SVM models, using this training set.

(7) Select a sample x from V_k and compute the probability p_l(c | x) of the sample being classified to class c according to each classifier l.

(8) Identify those samples for which at least two classifiers give different class predictions. This is achieved by pruning from V_k those samples for which all the classifiers predict the same class, as these do not contribute to the weight learning process; removing them significantly speeds up the learning of class-level weights. Let the pruned cross validation set containing the confused samples corresponding to the kth fold be referred to as V'_k.

(9) Repeat steps 5–8 for all the folds.

(10) Select a sample x from the set of confused samples corresponding to the kth fold (V'_k) and compute its probabilities of belonging to the various classes using the classifiers learnt in step 6. Combine the probabilities given by the different classifiers in an ensemble using a weight vector W as

P_W(c | x) = Σ_{l=1}^{L} w_{l,c} p_l(c | x),

where w_{l,c} is the weight assigned to the probabilistic output of classifier l for class c and P_W(c | x) is the probability of sample x belonging to class c according to the classifier ensemble created using weight vector W. Classify the sample to the class with the maximum P_W(c | x).

(11) Increment the accuracy counter of W by 1 if the predicted class is equal to the true label of sample x.

(12) Iterate through steps 10 and 11 for all the confused samples in V'_k.

(13) Iterate through steps 10–12 for all the folds in the dataset.

(14) Repeat steps 10–13 for all weight vectors in P_W and compute their corresponding accuracies. The classification accuracies of the ensemble computed using the different weight vectors in P_W constitute the objective function that we want to optimize using the genetic algorithm.

(15) Generate new weight vectors by selecting the 10 best weight vectors from P_W with respect to their objective function values and applying the genetic operators, that is, crossover and mutation. The objective function value of each new weight vector is computed as specified in steps 10–14. The evolved population of the best 20 weight vectors is obtained by selecting the top weight vectors from the combined old and new populations. This step prevents us from losing track of any weight vector that gives optimal results during the weight learning procedure whilst filtering out newly evolved but poor weight vectors.

(16) Iterate through steps 10–15 till there is no improvement in the optimal classification accuracy of the ensemble for 5 consecutive iterations or the number of iterations of the genetic algorithm exceeds a certain threshold. Select the weight vector that yields the best accuracy for the learned ensemble.

Once the weight vector W* used to combine the classifiers in the ensemble has been learned, a test sample x is classified by the proposed classifier ensemble to the class that maximizes P_{W*}(c | x).
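The weight learning and combination rules above can be sketched in Python as follows, with the GA reduced to elitist selection, uniform crossover, and Gaussian mutation for brevity. The per-sample classifier probabilities are assumed to be precomputed as (L, C) matrices, labels are assumed to be class indices, and the operator settings (population size, mutation scale, fixed iteration count in place of the 5-iteration stagnation rule) are illustrative.

```python
import numpy as np

def ensemble_predict(P, W):
    """P[l, c]: probability of class c from classifier l; W[l, c]: class-level weight.
    Returns the index of the class with the highest weighted combined score."""
    return int(np.argmax((W * P).sum(axis=0)))

def ensemble_accuracy(W, probs, labels):
    """Fitness of one weight matrix: accuracy over the confused samples."""
    preds = np.array([ensemble_predict(P, W) for P in probs])
    return float(np.mean(preds == labels))

def ga_learn_weights(probs, labels, L, C, pop_size=20, n_iter=100, seed=0):
    """Minimal GA over class-level weight matrices (elitism, uniform crossover, mutation)."""
    rng = np.random.default_rng(seed)
    pop = rng.random((pop_size, L, C))
    pop /= pop.sum(axis=(1, 2), keepdims=True)            # normalise each weight vector
    best_W, best_fit = None, -1.0
    for _ in range(n_iter):
        fit = np.array([ensemble_accuracy(W, probs, labels) for W in pop])
        order = np.argsort(fit)[::-1]
        if fit[order[0]] > best_fit:
            best_fit, best_W = fit[order[0]], pop[order[0]].copy()
        parents = pop[order[:10]]                          # keep the 10 best weight vectors
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(10)], parents[rng.integers(10)]
            mask = rng.random((L, C)) < 0.5                # uniform crossover
            child = np.where(mask, a, b) + rng.normal(0.0, 0.05, (L, C))  # Gaussian mutation
            child = np.clip(child, 0.0, None)
            children.append(child / child.sum())
        pop = np.concatenate([parents, np.array(children)])
    return best_W, best_fit
```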

7. Experiments

In this section, we present the experiments performed to show the effectiveness of the proposed approach as compared to its competitors.

7.1. Datasets

To evaluate the performance of our proposed classifier ensemble methodology, six real life datasets have been used: IRIS, Satimage, DiaretDB, German, Pima Indian Diabetes, and Heart. A brief overview of these datasets is given in Table 1. The distribution of instances among the classes within each dataset is presented in Table 2 to highlight the variation in the distribution of samples among the classes.

7.2. Experiment 1: Evaluation of Proposed Classifier Ensemble Approach

The purpose of this experiment is to assess the effectiveness of our proposed ensemble approach as compared to the individual classifiers in the ensemble. The experiment is conducted on the DiaretDB dataset. We extracted 70% of the instances separately from each class and treated them as training data; the remaining 30% of the samples from each class are treated as test data. We employ k-fold cross validation to learn the individual classifier models in the ensemble, as specified in Section 6. We extracted only 159 confused samples out of the 2156 samples present in the DiaretDB training set; selecting only the confused samples from the training data speeds up the weight learning process on all the individual classes present in the dataset. Learning of the optimal weights on the classes for the proposed ensemble is done as specified in Section 6. The classification accuracies of the classifier ensemble approach and its individual member classifiers on the test dataset are presented in Table 3. The proposed ensemble yields the best accuracy, 99.675%, as compared to its individual member classifiers, which shows the effectiveness of the proposed classifier ensemble over the individual classifiers.

7.3. Experiment 2: Comparison of Proposed Classifier Ensemble Approach with Competitors

This experiment compares the performance of the proposed ensemble method with that of its competitors, namely, Adaboost, Bagging, and the Random Subspace Method (RSM). We performed the experiment using the DiaretDB, IRIS, Satimage, Heart, Pima Indian Diabetes, and German datasets. The experimental setup is similar to the one specified in Experiment 1. The competing approaches require manual specification of the number of iterations used for training the ensemble; we therefore trained them with 50, 100, and 150 iterations. The classification accuracies of the competitors for the various real life datasets, using 50, 100, and 150 iterations, are reported in Table 4. We can observe from the table that the performance of the competitors is affected by changing the number of iterations, with Adaboost showing more sensitivity to the selection of the number of iterations, followed by Bagging and RSM. We selected the best classification accuracies of the different competitors over the different numbers of iterations (shown in bold in Table 4) and compared them with the classification accuracies obtained using our proposed ensembling approach. The results are presented in Table 5. As is obvious from Table 5, the proposed approach gives the best classification accuracy as compared to the competitors. Our approach is also independent of the manual specification of the number of iterations.

7.4. Experiment 3: Comparing Proposed Ensemble Approach with Competitors in the Presence of Noise

The purpose of this experiment is to study the sensitivity of the proposed ensemble approach and its competitors to class label noise. Class label noise is induced by randomly changing the label information of a certain number of samples in the training data to wrong labels. To simulate different noise levels, we injected noise into 5% and 10% of the samples in the training data. The remaining setup of the experiment is the same as in Experiment 2. The comparison of the proposed approach with the competitors using the DiaretDB, IRIS, Satimage, Pima Indian Diabetes, and German datasets with class label noise is presented in Table 6. Based on these results and their comparison with the performance of the various approaches in the absence of class label noise, as presented in Table 5, we can see that the performance of Adaboost is affected by the presence of class label noise, especially for the Satimage and German data. On DiaretDB, the performance of Adaboost goes down at the 10% noise level. Bagging is relatively less sensitive to noise than Adaboost, and RSM also shows sensitivity to the various noise levels on the different datasets. However, it can be observed from Table 6 that our proposed ensemble approach shows the least sensitivity to class label noise as compared to its competitors. Filtering class label noise using the proposed extension of m-Mediods based modeling significantly mitigates the effect of class label noise; as a result, the proposed approach gives more accurate results than the competitors in the presence of class label noise.
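The label-noise injection used in this experiment can be reproduced with a few lines of Python. The sampling scheme below (uniform choice of the affected samples and of the wrong labels) is an assumption, since the paper only states that the labels of a chosen fraction of training samples are randomly replaced with wrong ones.

```python
import numpy as np

def inject_label_noise(y, noise_level=0.05, seed=0):
    """Randomly replace the labels of a fraction of training samples with wrong labels."""
    rng = np.random.default_rng(seed)
    y_noisy = np.asarray(y).copy()
    classes = np.unique(y_noisy)
    n_noisy = int(round(noise_level * len(y_noisy)))
    idx = rng.choice(len(y_noisy), size=n_noisy, replace=False)
    for i in idx:
        y_noisy[i] = rng.choice(classes[classes != y_noisy[i]])   # force a different class
    return y_noisy
```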

7.5. Experiment 4: Analyzing Sensitivity of Ensembling Approaches to Class Imbalance in Datasets

The performance of the proposed ensemble approach in the presence of the class imbalance problem is evaluated and compared in this experiment. The experimental setup is similar to the one specified in Experiment 2. The confusion matrices of the competitors and the proposed approach using the DiaretDB, Satimage, Heart, Pima Indian Diabetes, and German datasets are presented in Tables 7, 8, 9, 10, and 11, respectively. Looking at the confusion matrices obtained using the different ensembling approaches across the various datasets, it is obvious that Adaboost performs poorly on the class imbalance problem. Adaboost attempts to increase overall classification accuracy by focusing on classes with large membership counts and ignoring those with fewer members. This is obvious from the results for class 3 and class 4 of the DiaretDB dataset, as presented in Table 7(a). Similarly, we can observe the same phenomenon with classes 3, 4, and 6 of the Satimage dataset, as presented in Table 8(a): classes 3 and 6 have a large number of samples and hence overshadow class 4, resulting in the misclassification of many class 4 samples to class 3 and class 6. The same situation is observed with the Pima Indian Diabetes and German datasets, as both of these binary class datasets have imbalanced numbers of instances in their two classes, with a large difference between the two classes in each dataset. From Table 10(a) it can be observed that Adaboost performs well on the class with a large membership count, class 1, but its performance goes down on the class comprising a small number of instances. It also shows poor performance on class 1 of the German dataset, which has a very small membership count, as presented in Table 11(a). On the other hand, Bagging and RSM perform better than Adaboost but are still affected by class imbalance. The reason for this behavior of the competitors is that they focus on improving their overall classification accuracy without considering the individual classes and their membership counts. In comparison, the proposed ensemble approach based on class-level weight learning performs well on the DiaretDB, Satimage, Heart, Pima Indian Diabetes, and German datasets, as shown in Tables 7(d), 8(d), 9(d), 10(d), and 11(d), respectively.

7.6. Discussion

In this section, we presented a variety of experiments to demonstrate the superiority of the proposed approach over its competitors in the presence of real world problems, namely, different levels of class label noise and the class imbalance problem. Experimental results on datasets drawn from a variety of domains have been presented. It has been observed that the proposed approach performs better than existing state-of-the-art approaches such as Adaboost, RSM, and Bagging. The results presented in Table 5 demonstrate that the proposed approach gives the best classification accuracy as compared to the competitors without requiring manual specification of the number of iterations. The proposed framework for combining classifiers is also robust to the presence of different levels of class label noise; the proposed approach is the least affected by the presence of noise, as highlighted in Table 6. The robustness of the proposed approach to the class imbalance problem is highlighted in Experiment 4. This superiority of the proposed approach over its competitors is attributed to the selection of classifiers with different specialties to handle the various types of distributions that are expected within a dataset. Learning weights on these classifiers at the class level results in the major contribution to decision making coming from the classifier that best suits the localized probability distribution, which yields a significant reduction of false positives and false negatives as compared to the competitors.

8. Conclusion

In this paper, we have addressed the issue of combining classifiers in an ensemble to enhance the overall classification accuracy. A novel classifier ensemble framework is presented that combines heterogeneous classifiers while handling the problems of class label noise and class imbalance. An extension of the m-Mediods based approach is presented to identify and filter samples in the training data that are expected to have incorrect labels. The filtered data is then used to learn the remaining classifier models. These individual models are further combined by introducing a weight learning method that learns class-level weights for each individual classifier to construct an ensemble. A genetic algorithm based approach is presented that quickly searches for an optimal set of class-level weights for the different classifiers. The weighted classifier ensemble is then used to classify unseen samples to one of the known classes.

Experiments were conducted to compare the performance of the proposed classifier ensemble framework with existing state-of-the-art approaches such as Adaboost, Bagging, and RSM. It has been shown that the proposed classifier ensemble gives better classification accuracies than the competitors on a variety of datasets selected from different domains. Experimental results demonstrating the robustness of the proposed approach to class label noise are presented in Section 7.4; the proposed approach gives better classification results than the competitors in the presence of various levels of class label noise, as is obvious from Table 6. The sensitivity of the proposed approach and its competitors to the class imbalance problem has also been analyzed. The proposed approach shows the least sensitivity to the class imbalance issue, as demonstrated in Tables 7–11.

Conflict of Interests

The authors do not have a direct financial relation that might lead to a conflict of interests for any of the authors.

Acknowledgment

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2013R1A1A2061978).