Abstract

Extracting knowledge from data with noise or outliers is a complex problem in the data mining area. Normally, it is not easy to eliminate those problematic instances. To obtain information from this type of data, robust classifiers are the best option. One of them is the application of a bagging scheme on weak single classifiers. The Credal C4.5 (CC4.5) model is a new classification tree procedure based on the classical C4.5 algorithm and imprecise probabilities. It represents a type of the so-called credal trees. It has been proven that CC4.5 is more robust to noise than the C4.5 method and even than other previous credal tree models. In this paper, the performance of the CC4.5 model in bagging schemes on noisy domains is shown. An experimental study on data sets with added noise is carried out in order to compare results where bagging schemes are applied on credal trees and on the C4.5 procedure. As a benchmark, the well-known Random Forest (RF) classification method is also used. It will be shown that the bagging ensemble using pruned credal trees outperforms the successful bagging C4.5 and RF when data sets with medium-to-high noise levels are classified.

1. Introduction

Supervised classification [1] is an important task in data mining, where a set of observations or cases, described by a set of attributes (also called features or predictive variables), has an assigned value or label of the variable to be classified, also called the class variable. This variable must be discrete; otherwise, the learning process is called a regression task. A classifier can be considered as a method that learns from data a set of rules to predict the class value of each new observation. In order to build a classifier from data, different approaches can be used, such as classical statistical methods [2], decision trees [3], and artificial neural networks or Bayesian networks [4].

Decision trees (DTs), also known as classification trees or hierarchical classifiers, are a type of classifier with a simple structure where the knowledge representation is relatively simple to interpret. A decision tree can be seen as a set of compact rules in a tree format where, in each node, an attribute variable is introduced, and in the leaves (or end nodes) we have a label of the class variable or a set of probabilities for each class label. Hunt et al.’s work [5] was the origin of decision trees, although they began to gain importance with the publication of the ID3 algorithm proposed by Quinlan [6]. Afterwards, Quinlan proposed the C4.5 algorithm [3], which is an improvement of the previous ID3 one and obtains better results. This classifier has the characteristic of instability, that is, small variations in the data can produce important differences in the model.

The fusion of information obtained via ensembles, or combinations of several classifiers, can improve the final classification task; this can be reflected in an improvement in terms of accuracy and robustness. Some of the most popular schemes are bagging [7], boosting [8], and Random Forest [9]. The inherent instability of decision trees [7] makes these classifiers very suitable to be employed in ensembles.

Class noise, also known as label noise or classification noise, refers to situations where data sets contain incorrect class labels. This situation is principally caused by deficiencies in the capture process of the learning and/or test data, such as wrong disease diagnosis methods and human errors in the class label assignment (see [10–12]). One of the most important procedures for succeeding in a classification task on noisy domains is the use of ensembles of classifiers. In the literature about classification on noisy domains, the bagging scheme stands out as the most successful one; it reduces the variance and avoids overfitting. A complete and recent revision of machine learning methods to handle label noise can be found in [13].

On the other hand, until a few years ago, the classical theory of probability (PT) was the fundamental tool for constructing classification methods. Many theories for representing information have arisen as generalizations of PT, such as the theory of evidence, possibility measures, probability intervals, and capacities of order 2. Each one represents a model of imprecise probabilities (see Walley [14]).

The Credal Decision Tree (CDT) model of Abellán and Moral [15] uses imprecise probabilities and general uncertainty measures (see Klir [16]) to build a decision tree. The CDT model represents an extension of the classical ID3 model of Quinlan [6], replacing precise probabilities and entropy with imprecise probabilities and maximum entropy. This last measure is a well-accepted measure of total uncertainty for some special types of imprecise probabilities (Abellán et al. [17]). In recent years, it has been shown that the CDT model presents good experimental results in standard classification tasks (see Abellán and Moral [18] and Abellán and Masegosa [19]). The bagging scheme, using CDT as base classifier, has been used for the particular task of classifying credit scoring data sets (see Abellán and Castellano [20]). A bagging scheme that uses a type of credal tree different from the CDT presented in [15] will be described in this work. This new model achieves better results than the bagging of CDTs shown in [20] when data sets with added noise are classified.

In Mantas and Abellán [21], the classical C4.5 method of Quinlan [3] was modified using tools similar to the ones used for the CDT method. The new algorithm is called the Credal C4.5 algorithm (CC4.5). It is shown that the use of imprecise probabilities has some practical advantages in data mining: the manipulation of total ignorance is coherently solved and indeterminacy or inconsistency is adequately represented. Hence, on noisy domains, these classifiers have an excellent performance. This assertion can be checked in Mantas and Abellán [21] and Mantas et al. [22]. In [21], the new CC4.5 presents better results than the classic C4.5 when they are applied on a large number of data sets with different levels of class noise. In [22], the performance of CC4.5 with different values of its parameter $s$ is analyzed when data sets with distinct noise levels are classified, and information about the best value of $s$ is obtained in terms of the noise level of a data set. In this work, the bagging scheme using CC4.5 as base classifier will be presented, which obtains very good results when data sets with added noise are classified.

DTs are models with low bias and high variance. Normally, the variance and overfitting are reduced by using postpruning techniques. As we said, ensemble methods like bagging are also used to decrease the variance and overfitting. The procedures of the CDT and CC4.5 also represent other ways to reduce these two characteristics in a classification procedure. Hence, we have three methods to reduce variance and overfitting in a classification task, which can be especially important when they are applied on noisy domains. We show here that the combination of these three techniques (bagging + pruning + credal trees) represents a fusion of tools that is successful on noisy domains. This assertion is supported in this work via a set of experiments where the bagging ensemble procedure is executed using different models of trees (C4.5, CDT, and Credal C4.5), with and without a postpruning process.

Experimentally, we show the performance of the CC4.5 model when it is inserted in the well-known bagging ensemble scheme (called bagging CC4.5) and applied on data sets with different levels of label noise. This model obtains improvements with respect to other known ensembles of classifiers used in this type of setting: the bagging scheme with the C4.5 model and the well-known Random Forest (RF) classifier. It is shown in the literature that the bagging scheme with the C4.5 model is normally the winning model in many studies about classification noise [23, 24].

A bagging scheme procedure, using CC4.5 as base classifier, has three important characteristics to be successful under noisy domains: (a) the different treatment of the imprecision, (b) the use of the bagging scheme, and (c) the production of medium-size trees (it is inherent to the model and related to (a)).

To reinforce the analysis of results, we will use a recent measure to quantify the degree of robustness of a classifier when it is applied on noisy data sets. This measure is the Equalized Loss of Accuracy (ELA) of Sáez et al. [25]. We will see that the bagging scheme using the CC4.5 attains the best values with this measure when the level of added noise is increased.

The rest of the paper is organized as follows. In Section 2, we begin with the necessary previous knowledge about decision trees, Credal Decision Trees, the Credal C4.5 algorithm, and the ensemble schemes used. Section 3 presents the bagging scheme with Credal C4.5 as base classifier and illustrates its behavior under noise. Section 4 describes the experimentation carried out on a wide range of data sets, varying the percentage of added noise. Section 5 comments on the results obtained. Finally, Section 6 is devoted to the conclusions.

2. Classic DTs versus DTs Based on Imprecise Probabilities

Decision trees are simple models that can be used as classifiers. In situations where elements are described by one or more attribute variables (also called predictive attributes or features) and by a single class variable, which is the variable under study, classification trees can be used to predict the class value of an element by considering its attribute values. In such a structure, each nonleaf node represents an attribute variable, the edges or branches between that node and its child nodes represent the values of that attribute variable, and each leaf node normally specifies an exact value of the class variable.

The process for inferring a decision tree is mainly determined by the following aspects:
(1) The split criterion, that is, the method used to select the attribute to be inserted in a node and the branching
(2) The criterion to stop the branching
(3) The method for assigning a class label or a probability distribution at the leaf nodes

An optional final step in the procedure to build DTs, which is used to reduce the overfitting of the model to the training set, is the following one:
(4) The postpruning process used to simplify the tree structure

In classic procedures for building DTs, where a measure of information based on PT is used, the criterion to stop the branching (point (2) above) normally is the following one: when the measure of information is not improved or when a threshold of gain in that measure is attained. With respect to point (3) above, the value of the class variable inserted in a leaf node is the one with the highest frequency in the partition of the data associated with that leaf node; its associated probability distribution can also be inserted. Then the principal difference among all the procedures to build DTs is point (1), that is, the split criterion used to select the attribute variable to be inserted in a node.

Considering classic split criteria and split criteria based on imprecise probabilities, a basic point to differentiate them is how they obtain probabilities from data. We will compare a classical procedure using precise probabilities with one based on the Imprecise Dirichlet Model (IDM) of Walley [14], based on imprecise probabilities:
(i) In classical split criteria, the probability associated with a state of the class variable, for a partition of the data, is the classical frequency of this state in that partition. Formally, let $C$ be the class variable with states $\{c_1, \ldots, c_k\}$ and let $\mathcal{D}$ be a partition of the data set. The probability of $c_j$ associated with the partition $\mathcal{D}$ is

$p(c_j) = \dfrac{n_{c_j}}{N}, \quad (1)$

where $n_{c_j}$ is the number of pieces of data with the state $c_j$ in the partition set $\mathcal{D}$; and $N$ is the total number of pieces of data of that partition, $N = \sum_{j} n_{c_j}$.
(ii) When we use the IDM, a model of imprecise probabilities (see Walley [14]), the probability of a state of the class variable is obtained in a different way. Using the same notation, now the probability is obtained via an interval of probabilities:

$p(c_j) \in \left[ \dfrac{n_{c_j}}{N + s}, \dfrac{n_{c_j} + s}{N + s} \right], \quad (2)$

where the parameter $s$ is a hyperparameter belonging to the IDM. The value of the parameter $s$ regulates the convergence speed of the upper and lower probability when the sample size increases. Higher values of $s$ produce a more cautious inference. Walley [14] does not give a decisive recommendation for the value of the parameter $s$, but he proposed two candidates: $s = 1$ and $s = 2$; nevertheless, he recommended the value $s = 1$. It is easy to check that the size of the intervals increases when the value of $s$ increases.
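As a concrete illustration of the two estimation schemes above, the following Python sketch computes the precise frequencies of (1) and the IDM intervals of (2) from the class counts of a partition; the function names and the example counts are ours and purely illustrative.

```python
from fractions import Fraction

def precise_probabilities(counts):
    """Classical frequency estimate p(c_j) = n_{c_j} / N of equation (1)."""
    N = sum(counts.values())
    return {c: Fraction(n, N) for c, n in counts.items()}

def idm_intervals(counts, s=1):
    """IDM interval [n_{c_j}/(N+s), (n_{c_j}+s)/(N+s)] of equation (2)."""
    N = sum(counts.values())
    return {c: (Fraction(n, N + s), Fraction(n + s, N + s)) for c, n in counts.items()}

# hypothetical class counts in a partition: 9 instances of c1 and 6 of c2
counts = {"c1": 9, "c2": 6}
print(precise_probabilities(counts))   # point estimates 9/15 and 6/15
print(idm_intervals(counts, s=1))      # intervals [9/16, 10/16] and [6/16, 7/16]
```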

In the following sections, we will explain the differences among the classic split criteria and the ones based on imprecise probabilities in a parallel way. We will compare the classic Info-Gain of Quinlan [6] with the Imprecise Info-Gain of Abellán and Moral [15] and the Info-Gain Ratio of Quinlan [3] with the Imprecise Info-Gain Ratio of Mantas and Abellán [21]. The final procedure to select the variable to be inserted in a node by each split criterion can be seen in Table 1.

The classical criteria normally use Shannon’s measure as the base measure of information, and the ones based on imprecise probabilities use the maximum entropy measure. This measure is based on the principle of maximum uncertainty [16], which is widely used in classic information theory, where it is known as the maximum entropy principle [26]. This principle indicates that the probability distribution with the maximum entropy, compatible with the available restrictions, must be chosen. The maximum entropy measure verifies an important set of properties in theories based on imprecise probabilities that are generalizations of probability theory (see Klir [16]).

2.1. Info-Gain versus Imprecise Info-Gain

Following the above notation, let $X$ be a general feature whose values belong to $\{x_1, \ldots, x_t\}$. Let $\mathcal{D}$ be a general partition of the data set. The Info-Gain (IG) criterion was introduced by Quinlan as the basis for his ID3 model [6], and it is explained as follows:
(i) The entropy of the class variable $C$ for the data set $\mathcal{D}$ is Shannon’s entropy [27] and it is defined as

$H(C, \mathcal{D}) = -\sum_{j} p(c_j) \log_2 p(c_j), \quad (3)$

where $p(c_j)$ represents the probability of the class $c_j$ in $\mathcal{D}$.
(ii) The average entropy generated by the attribute $X$ is

$H(C \mid X, \mathcal{D}) = \sum_{i} p(x_i)\, H(C, \mathcal{D}^{x_i}), \quad (4)$

where $p(x_i)$ represents the probability that $X = x_i$ in $\mathcal{D}$, and $\mathcal{D}^{x_i}$ is the subset of $\mathcal{D}$ where $X = x_i$.

Finally, we can define the Info-Gain as follows:

$IG(C, X, \mathcal{D}) = H(C, \mathcal{D}) - H(C \mid X, \mathcal{D}). \quad (5)$
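For reference, a minimal sketch of (3)-(5) computed from raw counts; the data layout (a list of tuples with the class in the last position) is an assumption of ours, not part of the original algorithms.

```python
from collections import Counter
from math import log2

def entropy(class_counts):
    """Shannon entropy H(C, D) of equation (3), from class counts in a partition."""
    N = sum(class_counts.values())
    return -sum((n / N) * log2(n / N) for n in class_counts.values() if n > 0)

def info_gain(rows, attribute, class_index=-1):
    """Info-Gain IG(C, X, D) of equation (5); `rows` is a list of tuples and
    `attribute` is the column index of the feature X."""
    N = len(rows)
    total = Counter(r[class_index] for r in rows)
    by_value = {}
    for r in rows:
        by_value.setdefault(r[attribute], []).append(r[class_index])
    conditional = sum((len(v) / N) * entropy(Counter(v)) for v in by_value.values())  # eq. (4)
    return entropy(total) - conditional

# hypothetical toy data: (value of X, class)
data = [("a", "+"), ("a", "+"), ("b", "-"), ("b", "+")]
print(info_gain(data, attribute=0))
```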

The Imprecise Info-Gain (IIG) [15] is based on imprecise probabilities and the utilization of uncertainty measures on credal sets (closed and convex sets of probability distributions). It was introduced to build the so-called Credal Decision Tree (CDT) model. Probability intervals are obtained from the data set using Walley’s Imprecise Dirichlet Model (IDM) [14] (a special type of credal sets [28]). The mathematical basis applied is described below.

With the above notation, the probability interval for each value $c_j$ of the variable $C$ is obtained via the IDM:

$p(c_j) \in \left[ \dfrac{n_{c_j}}{N + s}, \dfrac{n_{c_j} + s}{N + s} \right], \quad j = 1, \ldots, k, \quad (6)$

where $n_{c_j}$ is the frequency of the case $(C = c_j)$ in the data set, $N$ is the sample size, and $s$ is the given hyperparameter belonging to the IDM.

That representation gives rise to a specific kind of credal set on the variable $C$, denoted $K(C)$ [28]. This set is defined as follows:

$K(C) = \left\{ p \mid p(c_j) \in \left[ \dfrac{n_{c_j}}{N + s}, \dfrac{n_{c_j} + s}{N + s} \right],\ j = 1, \ldots, k \right\}. \quad (7)$

On this type of set (really a credal set [28]), uncertainty measures can be applied. The procedure to build CDTs uses the maximum entropy function on the above defined credal set. This function, denoted as $H^*$, is defined in the following way:

$H^*(K(C)) = \max \{ H(p) \mid p \in K(C) \}, \quad (8)$

where $H$ is Shannon’s entropy.

The procedure to obtain $H^*$ for the special case of the IDM reaches its lowest computational cost for $s = 1$ (see Abellán [28] for more details).

The scheme to induce CDTs is like the one used by the classical ID3 algorithm [6], replacing its Info-Gain split criterion with the Imprecise Info-Gain (IIG) split criterion, which can be defined in the following way:

$IIG(C, X, \mathcal{D}) = H^*(K(C, \mathcal{D})) - H^*(K(C \mid X, \mathcal{D})), \quad (9)$

where $H^*(K(C \mid X, \mathcal{D}))$ is calculated in a similar way to $H(C \mid X, \mathcal{D})$ in the IG criterion (for a more extended explanation, see Mantas and Abellán [21]).

It should be taken into account that, for a variable $X$ and a data set $\mathcal{D}$, $IIG(C, X, \mathcal{D})$ can be negative. This situation does not occur with the Info-Gain criterion. This important characteristic implies that the IIG criterion can discard variables that worsen the information on the class variable. This is an important feature of the model which can be considered as an additional criterion to stop the branching of the tree, reducing the overfitting of the model.

In both IG and IIG, the first part of each criterion is a constant value for every attribute variable. Both criteria select the variable with the lowest value of uncertainty about the class variable when the attribute variable is known, which is expressed by the second parts of (5) and (9). This can be seen as a scheme in Table 1.
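A sketch of how (9) can be computed is given below. It assumes the closed-form procedure for $s = 1$ referenced in [28], in which, as we read it, the extra mass $s$ is shared uniformly among the least frequent class values to obtain the maximum entropy distribution; the helper names are ours.

```python
from collections import Counter
from math import log2

def max_entropy_idm(counts, s=1):
    """Maximum entropy H*(K(C)) of equation (8), using the closed form for s = 1:
    the extra mass s is shared uniformly among the least frequent class values
    (our reading of the procedure referenced in [28]); valid for integer counts."""
    N = sum(counts.values())
    m = min(counts.values())
    A = [c for c, n in counts.items() if n == m]
    p = {c: (n + (s / len(A) if c in A else 0.0)) / (N + s) for c, n in counts.items()}
    return -sum(q * log2(q) for q in p.values() if q > 0)

def imprecise_info_gain(rows, attribute, class_index=-1, s=1):
    """IIG of equation (9): H*(K(C, D)) minus the weighted average of the H* values
    of the partitions D^{x_i}. Unlike IG, the result can be negative, which acts as
    an extra stopping criterion when building credal trees."""
    N = len(rows)
    total = Counter(r[class_index] for r in rows)
    by_value = {}
    for r in rows:
        by_value.setdefault(r[attribute], []).append(r[class_index])
    conditional = sum((len(v) / N) * max_entropy_idm(Counter(v), s) for v in by_value.values())
    return max_entropy_idm(total, s) - conditional
```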

2.2. Info-Gain Ratio versus Imprecise Info-Gain Ratio

The Info-Gain Ratio (IGR) criterion was introduced for the C4.5 model [3] in order to improve the ID3 model. IGR penalizes variables with many states. It is defined as follows:

$IGR(C, X, \mathcal{D}) = \dfrac{IG(C, X, \mathcal{D})}{SplitInfo(X, \mathcal{D})}, \quad (10)$

where

$SplitInfo(X, \mathcal{D}) = -\sum_{i} p(x_i) \log_2 p(x_i). \quad (11)$

The method for building Credal C4.5 trees [21] is similar to Quinlan’s C4.5 algorithm [3]. Credal C4.5 is created by replacing the Info-Gain Ratio split criterion from C4.5 with the Imprecise Info-Gain Ratio (IIGR) split criterion. The main difference is that Credal C4.5 estimates the values of the features and class variable by using imprecise probabilities. This criterion can be defined as follows:

$IIGR(C, X, \mathcal{D}) = \dfrac{IIG(C, X, \mathcal{D})}{SplitInfo(X, \mathcal{D})}, \quad (12)$

where SplitInfo is defined in (11) and the Imprecise Info-Gain (IIG) is

$IIG(C, X, \mathcal{D}) = H^*(K(C)) - \sum_{i} p(x_i)\, H^*(K(C \mid (X = x_i))), \quad (13)$

where $K(C)$ and $K(X)$ are the credal sets obtained via the IDM for the $C$ and $X$ variables, respectively, for a partition $\mathcal{D}$ of the data set [15]; and $p(X) = (p(x_1), \ldots, p(x_t))$ is a probability distribution that belongs to the credal set $K(X)$.

We choose the probability distribution $p(X)$ from $K(X)$ which maximizes the following expression:

$\sum_{i} p(x_i)\, H^*(K(C \mid (X = x_i))). \quad (14)$

It is simple to calculate this probability distribution. For more details, see Mantas and Abellán [21].
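Since expression (14) is linear in $p(X)$, its maximum over the IDM intervals of $K(X)$ is reached by assigning the whole extra mass $s/(N+s)$ to the value of $X$ whose conditional credal set has the largest maximum entropy. The following sketch reflects this reading of [21]; the function name and arguments are ours.

```python
def argmax_distribution(x_counts, cond_max_entropies, s=1):
    """Distribution p(X) in K(X) that maximizes expression (14).

    Expression (14) is linear in p(X), so the maximum over the IDM intervals is
    attained by giving the whole extra mass s/(N+s) to the value of X with the
    largest conditional maximum entropy (our reading of [21])."""
    N = sum(x_counts.values())
    best = max(cond_max_entropies, key=cond_max_entropies.get)
    return {x: (n + (s if x == best else 0)) / (N + s) for x, n in x_counts.items()}
```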

2.3. Bagging Decision Trees

In machine learning, the idea of taking into account several points of view before making a decision is applied when several classifiers are combined. This goes by different names, such as multiple classifier systems, committees of classifiers, mixtures of experts, or ensemble-based systems. Normally, an ensemble of decision trees achieves a better performance than an individual classifier [10].

The usual strategy for the combination of decision trees is based on the creation of several decision trees aggregated with a majority vote criterion. When an unclassified instance appears, each single classifier makes a prediction, and the class value with the highest number of votes is assigned to the instance.

Breiman’s bagging [7] (Bootstrap Aggregating) is an intuitive and simple method that shows a good performance, reduces the variance, and avoids overfitting. Normally it is implemented with decision trees, but it can be applied to any type of classifier. Diversity in bagging is obtained by generating replicated bootstrap data sets from the original training data set: “different training data sets are randomly drawn with replacement from the original training set and, in consequence, the replicated training data sets have the same size as the original data, but some instances may not appear in them or may appear more than once.” Afterwards, a single decision tree is built on each replica of the training data set using the standard approach [29]. Thus, building each tree from a different data set, several decision trees are obtained, which are defined by different sets of variables, nodes, and leaves. Finally, the predictions of these trees are combined by a majority vote criterion.
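A compact sketch of the procedure just described is shown below; scikit-learn's DecisionTreeClassifier is used only as a stand-in base learner (the experiments in this paper use C4.5 and Credal C4.5 trees in Weka), and the class name is ours.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

class SimpleBagging:
    """Bagging: bootstrap replicates of the training set plus majority vote."""

    def __init__(self, n_trees=100, random_state=0):
        self.n_trees = n_trees
        self.rng = np.random.default_rng(random_state)
        self.trees = []

    def fit(self, X, y):
        """X and y are numpy arrays; each tree sees a bootstrap replica of them."""
        n = len(y)
        for _ in range(self.n_trees):
            idx = self.rng.integers(0, n, size=n)   # sampling with replacement, same size
            tree = DecisionTreeClassifier()          # stand-in for C4.5 / Credal C4.5
            tree.fit(X[idx], y[idx])
            self.trees.append(tree)
        return self

    def predict(self, X):
        """Majority vote over the predictions of the individual trees."""
        votes = np.array([t.predict(X) for t in self.trees])
        return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```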

3. Bagging Credal C4.5 and the Noise

Bagging Credal C4.5 consists of using the bagging scheme, presented in the previous section, with the Credal C4.5 algorithm as base classifier. The difference between CC4.5 and the classic C4.5 is the split criterion: CC4.5 uses the IIGR measure and C4.5 uses IGR. It can be shown that the IIGR measure is less sensitive to noise than the IGR measure. Hence, CC4.5 can perform a classification task on noisy data sets better than the classic C4.5, as was experimentally demonstrated in [21].

The following example illustrates a case where the measure IIGR is more robust to noise than the measure IGR.

Example 1. Let us suppose a data set $\mathcal{D}'$ altered by noise and composed of 15 instances: 9 instances of class $c_1$ and 6 instances of class $c_2$. It can be considered that there are two binary feature variables, $X_1$ and $X_2$, and that the instances are organized according to the values of these variables. If this data set appears in the node of a tree, then the C4.5 algorithm chooses the variable $X_2$ for splitting the node because

$IGR(C, X_2, \mathcal{D}') > IGR(C, X_1, \mathcal{D}'),$

where $\mathcal{D}'$ is the noisy data set composed of the 15 instances.
It can be supposed that the data set is noisy because one instance whose true class is $c_1$ appears labeled as $c_2$. In this way, the clean data set $\mathcal{D}$ is composed of 10 instances of class $c_1$ and 5 instances of class $c_2$, organized in the same way with respect to $X_1$ and $X_2$. When this data set appears in the node of a tree, then the C4.5 algorithm chooses the variable $X_1$ for splitting the node because

$IGR(C, X_1, \mathcal{D}) > IGR(C, X_2, \mathcal{D}),$

where $\mathcal{D}$ is the clean data set composed of the 15 instances.
It can be observed that the C4.5 algorithm, by means of the IGR criterion, creates an incorrect subtree when the noisy data are processed. However, a tree built with the IIGR criterion (and $s = 1$) selects the variable $X_1$ for splitting the node in both cases (noisy data set and clean data set). That is,

$IIGR(C, X_1, \mathcal{D}') > IIGR(C, X_2, \mathcal{D}')$

for the data set with noise, and

$IIGR(C, X_1, \mathcal{D}) > IIGR(C, X_2, \mathcal{D})$

for the clean data set.

This example shows the difference with respect to robustness: the CC4.5 algorithm is more robust to noise than C4.5. For this reason, bagging Credal C4.5 is also more robust to noise than bagging C4.5. This fact will be shown with the experiments of this paper.

4. Experimentation

In this section, we shall describe the experiments carried out and comment on the results obtained. We have selected 50 well-known data sets in the field of machine learning, obtained from the UCI repository of machine learning [30]. The data sets chosen are very different in terms of their sample size, number and type of attribute variables, number of states of the class variable, and so forth. Table 2 gives a brief description of the characteristics of the data sets used.

We have performed a study where the bagging of Credal C4.5 on data with added noise is compared with the Random Forest algorithm [9] and the bagging of other tree-based models: C4.5 [10] and CDT [23]. We have used each model with and without a postpruning process; the pruning process of each model is the default one for that model. Hence, the algorithms considered are the following ones:
(i) Bagging C4.5 with unpruned trees (BA-C4.5-U)
(ii) Bagging CDTs with unpruned trees (BA-CDT-U)
(iii) Bagging Credal C4.5 with unpruned trees (BA-CC4.5-U)
(iv) Bagging C4.5 (BA-C4.5)
(v) Bagging CDTs (BA-CDT)
(vi) Bagging Credal C4.5 (BA-CC4.5)
(vii) Random Forest (RF)

The Weka software [31] has been used for the experimentation. The methods BA-CDT and BA-CC4.5 and their versions with unpruned trees were implemented using data structures of Weka. The implementation of the C4.5 algorithm provided by the Weka software, called J48, was employed with its default configuration. We added the necessary methods to build Credal C4.5 trees under the same experimental conditions. In CDTs and Credal C4.5, the parameter $s$ of the IDM was set to $s = 1$, that is, the value used in the original methods of [18, 21], respectively. The reasons for using this value were principally that it is the value recommended by Walley [14] and that the procedure to obtain the maximum entropy value reaches its lowest computational cost for this value (see [28]).

The implementations of bagging ensembles and Random Forest provided by Weka were used with their default configurations, except that the number of trees used for those methods was set to 100 decision trees. Although the number of trees can strongly affect the ensemble performance, this is a reasonable number for the low-to-medium size of the data sets used in this study, and moreover it was the number of trees used in related research, such as [8].

Using Weka’s filters, we have added the following percentages of random noise to the class variable: 0%, 10%, 20%, 30%, and 40%, only in the training data set. The procedure to introduce noise was the following: a given percentage of instances of the training data set was randomly selected and, then, their current class values were randomly changed to other possible values. The instances belonging to the test data set were left unmodified.
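This noise-injection step can be reproduced with a few lines; the sketch below mirrors the procedure just described (random selection of a fraction of training instances and random reassignment of their class to a different value) and uses names of our own.

```python
import numpy as np

def add_class_noise(y_train, noise_rate, rng=None):
    """Flip the class label of a random fraction of training instances.

    Each selected instance receives a class value chosen at random among the
    *other* possible values; the test set is left untouched, as in the paper.
    Assumes at least two distinct class values."""
    rng = np.random.default_rng() if rng is None else rng
    y_noisy = np.array(y_train, copy=True)
    labels = np.unique(y_noisy)
    n_noisy = int(round(noise_rate * len(y_noisy)))
    chosen = rng.choice(len(y_noisy), size=n_noisy, replace=False)
    for i in chosen:
        others = labels[labels != y_noisy[i]]
        y_noisy[i] = rng.choice(others)
    return y_noisy
```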

We repeated a 10-fold cross validation procedure 10 times for each data set; this is a well-known and widely used validation procedure. Tables 3, 4, 5, 6, and 7 show the accuracy of the methods with the different percentages of added noise. Table 8 presents a summary of the average accuracy results, where the best algorithm for each added noise level is emphasized in bold font and the second best is marked in italic font.

Following the recommendation of Demšar [32], we used a series of tests to compare the methods using the software [33]. We used the following tests to compare multiple classifiers on multiple data sets.

Friedman Test (Friedman [34, 35]). It is a nonparametric test that ranks the algorithms separately for each data set, with the best performing algorithm being assigned the rank of 1, the second best the rank of 2, and so forth. The null hypothesis is that all the algorithms are equivalent. If the null hypothesis is rejected, we can compare all the algorithms with each other using the Nemenyi test [36].
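As an illustration, the sketch below runs the Friedman test on an accuracy matrix (rows are data sets, columns are algorithms) with SciPy and computes the Nemenyi critical difference following Demšar's formula; the constant 2.949 is the usual $q_{0.05}$ value for seven algorithms and should be checked against the tables in [32].

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def friedman_and_cd(acc, q_alpha=2.949):
    """Friedman test and Nemenyi critical difference on an accuracy matrix.

    `acc` has shape (n_datasets, k_algorithms). Ranks are assigned per data set
    (rank 1 = best). The critical difference follows Demsar's formula
    CD = q_alpha * sqrt(k(k+1)/(6N)); 2.949 is the usual q_0.05 constant for
    k = 7 algorithms (to be checked against the published tables)."""
    n, k = acc.shape
    ranks = np.array([rankdata(-row) for row in acc])   # higher accuracy -> lower rank
    avg_ranks = ranks.mean(axis=0)
    _, p_value = friedmanchisquare(*[acc[:, j] for j in range(k)])
    cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * n))
    return avg_ranks, p_value, cd
```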

All the tests were carried out with a level of significance $\alpha = 0.05$. Hence, Table 9 shows Friedman’s ranks for the accuracy of the methods when they are applied on data sets with different levels of added noise. The best algorithm for each noise level is emphasized in bold font and the second best one is marked in italic font. Tables 10, 11, 12, 13, and 14 show the p values of the Nemenyi test for the pairwise comparisons when the methods are applied on data sets with different percentages of added noise. In all cases, the Nemenyi test rejects the hypothesis that two algorithms are equivalent if the corresponding p value is ≤0.002381. When there is a significant difference, the best algorithm is distinguished with bold font.

For the sake of clarity, the results of the Nemenyi test can be seen graphically in Figure 1. In this graph, the columns express the values of Friedman’s ranks and the critical difference is expressed as a vertical segment. When the top of the segment on one column is below the top of another column, the differences are statistically significant in favor of the algorithm represented with the lower rank (lower column).

To present the results of the average tree size (number of nodes) obtained by each method, we use Figure 2. In this figure, we can quickly see the average size of the trees built by each bagging method when applied on data sets with different levels of added noise.

We have extended the study of the results using a recent measure to quantify the degree of robustness of a classifier when it is applied on noisy data sets. This measure is the Equalized Loss of Accuracy (ELA) of Sáez et al. [25].

The Equalized Loss of Accuracy (ELA) measure is a new behavior-against-noise measure that allows us to characterize the behavior of a method with noisy data, considering both performance and robustness. The $ELA_{x\%}$ measure is expressed as follows:

$ELA_{x\%} = \dfrac{100 - A_{x\%}}{A_{0\%}},$

where $A_{0\%}$ is the accuracy of the classifier when it is applied on a data set without added noise and $A_{x\%}$ is the accuracy of the classifier when it is applied on a data set with a level of added noise of x%.

The ELA measure (there exists another similar measure, the Relative Loss of Accuracy (RLA) of Sáez et al. [37]; we find the ELA measure more informative than the RLA measure because ELA also takes into account higher levels of accuracy on data sets with added noise) considers the performance without noise as a value to normalize the degree of success. This characteristic makes it particularly useful when comparing two different classifiers over the same data set. The classifier with the lowest value of ELA will be the most robust classifier.
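Under the formula given above, and with the RLA as it is usually defined in [37], both measures reduce to one line each; accuracies are assumed to be expressed on a 0-100 scale.

```python
def ela(acc_noise, acc_clean):
    """Equalized Loss of Accuracy: ELA_x% = (100 - A_x%) / A_0%, accuracies in %."""
    return (100.0 - acc_noise) / acc_clean

def rla(acc_noise, acc_clean):
    """Relative Loss of Accuracy (as usually defined): RLA_x% = (A_0% - A_x%) / A_0%."""
    return (acc_clean - acc_noise) / acc_clean

# hypothetical example: 85% accuracy without added noise, 78% with 20% added noise
print(ela(78.0, 85.0), rla(78.0, 85.0))
```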

Table 15 shows the values of the Equalized Loss of Accuracy (ELA) measure. The best algorithm for each level of added noise is identified in bold font and the second best one is represented in italic font.

5. Comments on the Results

From a general point of view, we can state that the bagging of credal trees (BA-CC4.5 and BA-CDT) has a better performance than the models used as reference (BA-C4.5 and RF) when the level of added noise is increased. This improvement holds not only with respect to accuracy, via the Friedman and Nemenyi tests carried out, but also in terms of the measures of robustness.

An important characteristic of the results is that the bagging ensembles using credal trees build less complex models than the ones built by the bagging of the classic C4.5, as can be seen in Figure 2. When the level of added noise is increased, the complexity of the bagging models using credal trees is notably smaller than that of the ones using C4.5. This complexity is an important aspect of a classifier when it is applied on data sets with noise, because the larger the model, the larger the overfitting on data with errors. Hence, the model can produce a worse performance. This is the case for RF according to Figure 2: the complexity of the random trees of RF is very large; therefore RF has a bad performance when it is applied on noisy data sets.

Next, we are going to analyze the results, on each level of added noise, taking into account principally the accuracy and measures of robustness. The following aspects must be remarked.

0% of Added Noise. According to accuracy and the Friedman test, without added noise, RF is the best model. We can observe in Table 9 (Friedman’s ranking) that all the bagging models without pruning are better in accuracy than the same bagging models with pruning. Besides, BA-C4.5-U is the best model compared with the other bagging models. These results are coherent with the original bagging algorithm proposed in [7], where the trees are built without pruning for each bootstrap sample. In this way, the trees tend to be more different from each other than if they were pruned. This is a good characteristic of a model, for reducing variance, when it is used as base classifier in a bagging scheme. When we use unpruned trees, we increase the risk of overfitting; however, the aggregation of trees carried out by bagging offsets this risk. We remark that this assertion holds for data sets without added noise.

10% of Added Noise. With this low level of added noise, BA-C4.5 is now the best model, but it suffers a notable deterioration in its accuracy. Also, BA-C4.5-U, which was excellent without added noise, is now the worst method; it must be remarked that it builds the largest trees. Here BA-CC4.5 begins to have excellent results in accuracy, being the second best classifier for this level of added noise. The ELA measure indicates that the best value is for BA-C4.5, followed by BA-CC4.5. According to Friedman’s ranking on accuracy, we can observe that each bagging model with pruned trees is better than the same model with unpruned trees for this added noise level. With these results, we can conclude that the bagging algorithm needs to aggregate pruned trees in order to handle data sets with a low level of added noise. That is, using only a bagging scheme is insufficient to classify data sets with this level of added noise; pruning the trees is also necessary here.

20% of Added Noise. With this medium-to-high level of added noise, the situation is notably different from the one with the lowest level of added noise. Here BA-CC4.5 is the best procedure in terms of accuracy, followed by BA-CDT. BA-C4.5 still has a good performance, but it is worse than the baggings of credal trees. We cannot say the same for RF, which has a very bad performance that gets worse when the level of noise increases. The Nemenyi test carried out presents significant differences in favor of the baggings of credal trees when they are compared with RF and with some versions of the methods without pruning. The ELA measure gives the best results for BA-CC4.5. BA-C4.5-U is again the worst method considering all the aspects analyzed; the size of its trees seriously impairs its performance. Hence, to obtain better results, the bagging scheme needs to use pruned credal trees when it is applied on data sets with a level of added noise greater than or equal to 20% (we will see a similar conclusion for higher levels of added noise).

30% and 40% of Added Noise. As the results with these levels of added noise are very similar, we comment on them together. For these levels of added noise, BA-CC4.5 is always the best procedure in terms of accuracy. The other model based on credal trees, BA-CDT, obtains the second best results. These comments are reinforced by the Friedman and Nemenyi tests carried out. Here, even BA-C4.5 is significantly worse than the two bagging schemes of credal trees, via the tests carried out. RF is now even worse than with the medium level of added noise. It is remarkable that the method BA-CC4.5-U (without pruning) has better results than the pruned method BA-C4.5, although they have similar average tree sizes. The ELA robustness measure also confirms these assertions: again, BA-CC4.5 is the best model for the ELA measure. In all cases, BA-C4.5 has medium results, but the same model without pruning, BA-C4.5-U, now has very bad results, being the worst method for these high levels of added noise. The second worst results are obtained by RF, which is not a good procedure for high levels of added noise either, when it is compared with the bagging schemes of credal trees. With these results, and considering the ones for 20% of added noise, we can say that the combination of bagging, pruning, and credal trees is necessary to obtain the best significant results when we want to apply the methods on data sets with levels of added noise greater than or equal to 20%.

With respect to the average tree size, we have the following comments. It can be observed that the model BA-CDT always builds the smallest trees. Perhaps this is one of the reasons why it works well with high levels of added noise but not without added noise, when it is compared with the rest of the models. When the level of added noise is increased, the percentage increase of the average tree size is the smallest for BA-CDT. BA-CC4.5 has a medium tree size compared with all the models with pruning; recall that it has decent results in accuracy on data sets without added noise, and it is the best model in accuracy on data sets with medium and high levels of added noise. The following methods in tree size, with very similar sizes, are BA-C4.5 and BA-CC4.5-U, that is, a pruned method and an unpruned one; the second one is better in accuracy for levels of added noise of 20–40%. At this point, we can argue that the size is not as important as the split criterion used; CC4.5 has a different treatment of the imprecision than C4.5, as was explained in previous sections. The rest of the unpruned methods build larger trees, with BA-C4.5-U being the one with the largest trees but the worst results in the rest of the aspects, when it is compared with the other methods.

We can conclude that the method with a moderate or medium tree size, BA-CC4.5, has the best results in accuracy and in the measures of robustness when the level of added noise is increased. Hence, the tree size does not seem to be a fundamental aspect of the performance of a model on noisy domains.

6. Conclusion

A very recent model called Credal C4.5 (CC4.5) is based on the classical C4.5 algorithm and imprecise probabilities. In a previous work, its excellent performance on noisy domains was shown. In this paper, we have used it in a bagging scheme in a large experimental study. We have compared it with other models that can be considered very appropriate in this type of domain: bagging C4.5 and bagging Credal Decision Trees (CDTs). This last model, the CDT, represents another procedure based on imprecise probabilities, which was presented some years ago and shown to be very suitable under noise.

With the results obtained in this paper, we show that bagging CC4.5 obtains excellent results when it is applied on data sets with label noise. Its performance is better than that of the other models used as benchmarks here in two respects: accuracy and measures of robustness under noise. This improvement is even greater when the level of label noise increases.

Real data commonly contain noise. This leads us to believe that the bagging of Credal C4.5 trees is an ideal candidate for use on data from real applications. It combines several resources for the successful treatment of noisy data: imprecise probabilities, bagging, and pruning. Hence, it can be considered a powerful tool to apply on noisy domains.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work has been supported by the Spanish “Ministerio de Economía y Competitividad” and by “Fondo Europeo de Desarrollo Regional” (FEDER) under Project TEC2015-69496-R.

Supplementary Materials

Table 1. Accuracy results of the methods when they are used on data sets without added noise. Table 2. Accuracy results of the methods when they are used on data sets with a percentage of added noise equal to 10%. Table 3. Accuracy results of the methods when they are used on data sets with a percentage of added noise equal to 20%. Table 4. Accuracy results of the methods when they are used on data sets with a percentage of added noise equal to 30%. Table 5. Accuracy results of the methods when they are used on data sets with a percentage of added noise equal to 40%. Table 6. Average results of the accuracy of the different algorithms when they are built from data sets with added noise. Table 7. Friedman’s ranks for the accuracy of the algorithms when they are applied on data sets with different percentages of added noise. Table 8. p values of the Nemenyi test about the accuracy on data sets without added noise. Nemenyi’s procedure rejects those hypotheses that have an unadjusted p value ≤ 0.002381. Table 9. p values of the Nemenyi test about the accuracy on data sets with 10% of added noise. Nemenyi’s procedure rejects those hypotheses that have an unadjusted p value ≤ 0.002381. Table 10. p values of the Nemenyi test about the accuracy on data sets with 20% of added noise. Nemenyi’s procedure rejects those hypotheses that have an unadjusted p value ≤ 0.002381. Table 11. p values of the Nemenyi test about the accuracy on data sets with 30% of added noise. Nemenyi’s procedure rejects those hypotheses that have an unadjusted p value ≤ 0.002381. Table 12. p values of the Nemenyi test about the accuracy on data sets with 40% of added noise. Nemenyi’s procedure rejects those hypotheses that have an unadjusted p value ≤ 0.002381. Table 13. p values of the Bonferroni-Dunn test about the accuracy on data sets without added noise, where Random Forest is the best method in Friedman’s rank. Table 14. p values of the Bonferroni-Dunn test about the accuracy on data sets with 10% of added noise, where bagging of C4.5 is the best method in Friedman’s rank. Table 15. p values of the Bonferroni-Dunn test about the accuracy on data sets with 20% of added noise, where bagging of Credal C4.5 is the best method in Friedman’s rank. Table 16. p values of the Bonferroni-Dunn test about the accuracy on data sets with 30% of added noise, where bagging of Credal C4.5 is the best method in Friedman’s rank. Table 17. p values of the Bonferroni-Dunn test about the accuracy on data sets with 40% of added noise, where bagging of Credal C4.5 is the best method in Friedman’s rank. (Supplementary Materials)