Complexity

Volume 2017, Article ID 9023970, 17 pages

https://doi.org/10.1155/2017/9023970

## A New Robust Classifier on Noise Domains: Bagging of Credal C4.5 Trees

Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain

Correspondence should be addressed to Joaquín Abellán; jabellan@decsai.ugr.es

Received 9 June 2017; Revised 10 October 2017; Accepted 2 November 2017; Published 3 December 2017

Academic Editor: Roberto Natella

Copyright © 2017 Joaquín Abellán et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Extracting knowledge from data with noise or outliers is a complex problem in data mining. Normally, it is not easy to eliminate those problematic instances. To obtain information from this type of data, robust classifiers are the best option. One such option is the application of a bagging scheme to weak single classifiers. The Credal C4.5 (CC4.5) model is a new classification tree procedure based on the classical C4.5 algorithm and imprecise probabilities. It is one of the so-called *credal trees*. It has been shown that CC4.5 is more robust to noise than the C4.5 method and even than other previous credal tree models. In this paper, the performance of the CC4.5 model in bagging schemes on noisy domains is shown. An experimental study on data sets with added noise is carried out to compare the results of bagging schemes applied to credal trees and to the C4.5 procedure. As a benchmark, the well-known Random Forest (RF) classification method is also used. It will be shown that the bagging ensemble using pruned credal trees outperforms the successful bagging C4.5 and RF when data sets with medium-to-high noise levels are classified.

#### 1. Introduction

Supervised classification [1] is an important task in data mining, where each observation or case in a set, described by a set of *attributes* (also called *features* or *predictive variables*), has been assigned a value or label of the variable to be classified, also called the *class variable*. This variable must be discrete; otherwise, the learning process is called a regression task. A classifier can be considered as a method that learns from data a set of rules to predict the class variable value for each new observation. In order to build a classifier from data, different approaches can be used, such as classical statistical methods [2], decision trees [3], and artificial neural networks or Bayesian networks [4].

Decision trees (DTs), also known as classification trees or hierarchical classifiers, are a type of classifier with a simple structure whose knowledge representation is relatively easy to interpret. A decision tree can be seen as a set of compact rules in a tree format, where each node introduces an attribute variable and each leaf (or end node) holds a label of the class variable or a set of probabilities for each class label. Hunt et al.’s work [5] was the origin of decision trees, although they began to gain importance with the publication of the ID3 algorithm proposed by Quinlan [6]. Afterwards, Quinlan proposed the C4.5 algorithm [3], which improves on ID3 and obtains better results. This classifier is characterized by its *instability*; that is, small variations in the data can produce important differences in the model.

Fusing the information obtained from an ensemble, or combination, of several classifiers can improve the outcome of a classification task in terms of both accuracy and robustness. Some of the most popular schemes are bagging [7], boosting [8], and Random Forest [9]. The inherent instability of decision trees [7] makes these classifiers very suitable for use in ensembles.
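As a rough illustration of the bagging idea mentioned above, the following sketch trains each base classifier on its own bootstrap sample and combines the predictions by majority vote. The function names and the trivial base learner `fit_majority` are hypothetical choices made for this example only; they do not reproduce the implementation used in the experiments.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data, with replacement."""
    return [rng.choice(data) for _ in data]

def bagging_fit(data, fit_base, n_estimators=10, seed=0):
    """Train n_estimators base classifiers, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [fit_base(bootstrap_sample(data, rng)) for _ in range(n_estimators)]

def bagging_predict(models, x):
    """Combine the base predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

def fit_majority(sample):
    """Hypothetical weak base learner: always predicts its sample's majority class."""
    label = Counter(y for _, y in sample).most_common(1)[0][0]
    return lambda x: label

# Toy data set: pairs (feature tuple, class label).
data = [((0,), "a")] * 8 + [((1,), "b")] * 2
models = bagging_fit(data, fit_majority, n_estimators=25, seed=1)
prediction = bagging_predict(models, (0,))
```

In practice `fit_base` would build a decision tree; an unstable base learner is what makes the averaging over bootstrap samples worthwhile.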

*Class noise*, also known as *label noise* or *classification noise*, refers to situations in which data sets contain incorrect class labels. It is mainly caused by deficiencies in the process of capturing the learning and/or test data, such as an incorrect disease diagnosis method or human errors in the assignment of class labels (see [10–12]). One of the most important procedures for succeeding in a classification task on noisy domains is the use of ensembles of classifiers. In the literature on classification in noisy domains, bagging stands out as the most successful scheme. This ensemble scheme reduces the variance and avoids overfitting. A complete and recent review of machine learning methods for handling label noise can be found in [13].
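To make the notion of class noise concrete, the sketch below corrupts a given fraction of the labels by replacing each selected label with a different class chosen at random. This is only one plausible way of simulating label noise; the function name `add_class_noise` and its parameters are assumptions for illustration, not the exact protocol of any cited study.

```python
import random

def add_class_noise(labels, classes, noise_rate, seed=0):
    """Return a copy of labels where a fraction noise_rate has been changed.

    Each selected label is replaced by a different class drawn uniformly
    at random, so every selected instance ends up mislabeled.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = round(noise_rate * len(noisy))
    for i in rng.sample(range(len(noisy)), n_flip):
        alternatives = [c for c in classes if c != noisy[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy

labels = ["pos"] * 6 + ["neg"] * 4
noisy = add_class_noise(labels, ["pos", "neg"], noise_rate=0.2, seed=42)
changed = sum(a != b for a, b in zip(labels, noisy))
```

Here `changed` counts the corrupted labels, which equals the requested fraction of the data set (2 of 10 in this toy case).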

On the other hand, until a few years ago, the classical theory of probability (PT) was the fundamental tool for constructing classification methods. Many theories for representing information have arisen as generalizations of PT, such as the theory of evidence, measures of possibility, intervals of probability, and capacities of order 2. Each one represents a model of imprecise probabilities (see Walley [14]).

The Credal Decision Tree (CDT) model of Abellán and Moral [15] uses imprecise probabilities and general uncertainty measures (see Klir [16]) to build a decision tree. The CDT model is an extension of the classical ID3 model of Quinlan [6] that replaces precise probabilities and entropy with imprecise probabilities and maximum entropy. The latter is a well-accepted measure of total uncertainty for some special types of imprecise probabilities (Abellán et al. [17]). In recent years, the CDT model has been shown to give good experimental results in standard classification tasks (see Abellán and Moral [18] and Abellán and Masegosa [19]). The bagging scheme, using CDT as base classifier, has been used for the particular task of classifying credit scoring data sets (see Abellán and Castellano [20]). A bagging scheme that uses a type of credal tree different from the CDT presented in [15] will be described in this work. This new model achieves better results than the bagging of CDT shown in [20] when data sets with added noise are classified.
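The maximum entropy measure can be illustrated with a small sketch. It assumes that the imprecise probabilities come from the Imprecise Dirichlet Model, so that a total mass `s` may be added to the observed class counts in any proportion; under that assumption, entropy is maximized by sharing `s` among the least frequent classes, levelling them upward ("water-filling"). The function names are hypothetical, and this is an illustrative procedure rather than the authors' published algorithm.

```python
import math

def idm_max_entropy_distribution(counts, s=1.0):
    """Maximum-entropy distribution when mass s is added freely to the counts."""
    total = sum(counts) + s
    mass = [float(c) for c in counts]
    remaining = float(s)
    while remaining > 1e-12:
        low = min(mass)
        lowest = [i for i, m in enumerate(mass) if m - low < 1e-12]
        higher = [m for m in mass if m - low >= 1e-12]
        # Raise the lowest classes to the next level, or as far as the mass allows.
        gap = (min(higher) - low) if higher else float("inf")
        step = min(gap, remaining / len(lowest))
        for i in lowest:
            mass[i] += step
        remaining -= step * len(lowest)
    return [m / total for m in mass]

def entropy(p):
    """Shannon entropy in bits."""
    return -sum(q * math.log2(q) for q in p if q > 0)

counts = [3, 1, 0]
p_star = idm_max_entropy_distribution(counts, s=1.0)
freq = [c / sum(counts) for c in counts]
```

For these counts the whole mass goes to the unobserved class, and `entropy(p_star)` is larger than the entropy of the relative frequencies `freq`, which reflects the more cautious estimate.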

In Mantas and Abellán [21], the classical C4.5 method of Quinlan [3] was modified using tools similar to those used for the CDT method. The new algorithm is called the Credal C4.5 algorithm (CC4.5). It is shown there that the use of imprecise probabilities has some practical advantages in data mining: total ignorance is handled coherently, and indeterminacy or inconsistency is adequately represented. Hence, on noisy domains, these classifiers perform very well. This assertion can be checked in Mantas and Abellán [21] and Mantas et al. [22]. In [21], the new CC4.5 presents better results than the classic C4.5 when both are applied to a large number of data sets with different levels of class noise. In [22], the performance of CC4.5 with different values for its parameter $s$ is analyzed when data sets with distinct noise levels are classified, and information about the best value for $s$ is obtained in terms of the noise level of a data set. In this work, the bagging scheme using CC4.5 as base classifier will be presented, which obtains very good results when data sets with added noise are classified.

DTs are models with low bias and high variance. Normally, the variance and overfitting are reduced by using postpruning techniques. As we said, ensemble methods like bagging are also used to decrease the variance and overfitting. The CDT and CC4.5 procedures represent a further way to reduce these two characteristics in a classification procedure. Hence, we have three methods to reduce variance and overfitting in a classification task, which can be especially important when they are applied on noisy domains. We show here that the combination of these three techniques (bagging + pruning + credal trees) is a successful fusion of tools for noisy domains. This assertion is supported in this work by a set of experiments in which the bagging ensemble procedure is executed with different tree models (C4.5, CDT, and Credal C4.5), with and without a postpruning process.

Experimentally, we show the performance of the CC4.5 model when it is inserted in the well-known bagging ensemble scheme (called bagging CC4.5) and applied to data sets with different levels of label noise. This model improves on other known ensembles of classifiers used in this type of setting: the bagging scheme with the C4.5 model and the well-known Random Forest (RF) classifier. The literature shows that the bagging scheme with the C4.5 model is normally the winning model in many studies about classification noise [23, 24].

A bagging scheme procedure, using CC4.5 as base classifier, has three important characteristics to be successful under noisy domains: (a) the different treatment of the imprecision, (b) the use of the bagging scheme, and (c) the production of medium-size trees (it is inherent to the model and related to (a)).

To reinforce the analysis of results, we will use a recent measure to quantify the degree of robustness of a classifier when it is applied on noisy data sets. This measure is the Equalized Loss of Accuracy (ELA) of Sáez et al. [25]. We will see that the bagging scheme using the CC4.5 attains the best values with this measure when the level of added noise is increased.

The rest of the paper is organized as follows. In Section 2, we begin with the necessary previous knowledge about decision trees, Credal Decision Trees, the Credal-C4.5 algorithm, and the ensemble schemes used. Section 4 contains the experimental results of the evaluation of the ensemble methods studied on a wide range of data sets varying the percentage of added noise. Section 5 describes and comments on the experimentation carried out. Finally, Section 6 is devoted to the conclusions.

#### 2. Classic DTs versus DTs Based on Imprecise Probabilities

Decision trees are simple models that can be used as classifiers. In situations where elements are described by one or more* attribute variables* (also called* predictive attributes* or* features*) and by a single* class variable*, which is the variable under study, classification trees can be used to predict the class value of an element by considering its attribute values. In such a structure, each nonleaf node represents an attribute variable, the edges or branches between that node and its child nodes represent the values of that attribute variable, and each leaf node normally specifies an exact value of the class variable.

The process for inferring a decision tree is mainly determined by the following aspects:

(1) The *split criterion*, that is, the method used to select the attribute to be inserted in a node and branching

(2) The criterion to stop the branching

(3) The method for assigning a class label or a probability distribution at the leaf nodes

An optional final step in the procedure to build DTs, which is used to reduce the overfitting of the model to the training set, is the following one:

(4) The postpruning process used to simplify the tree structure

In classic procedures for building DTs, where a measure of information based on PT is used, the criterion to stop the branching (point (2) above) is normally the following: branching stops when the measure of information is not improved or when a threshold of gain in that measure is attained. With respect to point (3) above, the value of the class variable inserted in a leaf node is the most frequent one in the partition of the data associated with that leaf node; its associated probability distribution can also be inserted. The principal difference among the procedures to build DTs is therefore point (1), that is, the split criterion used to select the attribute variable to be inserted in a node.
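The interplay of the split criterion, the stopping criterion, and the leaf labeling can be sketched as a short recursive procedure. In this sketch the split criterion is passed in as a function `score`, and `error_reduction` is a hypothetical placeholder for it, not one of the criteria studied in this paper; the dictionary-based tree representation is likewise an assumption made for brevity, and postpruning is omitted.

```python
from collections import Counter

def majority_label(rows):
    """Leaf labeling: the most frequent class label in the partition."""
    return Counter(label for _, label in rows).most_common(1)[0][0]

def error_reduction(rows, attribute):
    """Hypothetical stand-in for a split criterion: drop in majority-vote error rate."""
    def errors(part):
        return len(part) - max(Counter(y for _, y in part).values())
    groups = {}
    for features, label in rows:
        groups.setdefault(features[attribute], []).append((features, label))
    return (errors(rows) - sum(errors(g) for g in groups.values())) / len(rows)

def build_tree(rows, attributes, score, min_gain=0.0):
    """rows: list of (dict attribute -> value, label); score(rows, attr) -> float."""
    if len({label for _, label in rows}) == 1 or not attributes:
        return {"leaf": majority_label(rows)}
    # Split criterion: choose the attribute with the best score.
    best = max(attributes, key=lambda a: score(rows, a))
    # Stopping criterion: stop when the measure is not improved.
    if score(rows, best) <= min_gain:
        return {"leaf": majority_label(rows)}
    remaining = [a for a in attributes if a != best]
    children = {}
    for value in {features[best] for features, _ in rows}:
        subset = [(f, y) for f, y in rows if f[best] == value]
        children[value] = build_tree(subset, remaining, score, min_gain)
    return {"attribute": best, "children": children, "default": majority_label(rows)}

def predict(tree, features):
    """Follow the branches; fall back to the node's majority class for unseen values."""
    while "leaf" not in tree:
        child = tree["children"].get(features[tree["attribute"]])
        if child is None:
            return tree["default"]
        tree = child
    return tree["leaf"]

rows = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "rainy", "windy": "no"}, "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
]
tree = build_tree(rows, ["outlook", "windy"], error_reduction)
```

Swapping a different function in for `score` changes only the split criterion, which mirrors the observation that this is where the tree-building procedures mainly differ.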

Classic split criteria and split criteria based on imprecise probabilities differ fundamentally in how they obtain probabilities from data. We will compare a classical procedure using precise probabilities with one based on the Imprecise Dirichlet Model (IDM) of Walley [14], which uses imprecise probabilities:

(i) In classical split criteria, the probability associated with a state of the class variable, for a partition of the data, is the relative frequency of this state in that partition. Formally, let $C$ be the class variable with states $\{c_1, \ldots, c_k\}$ and let $\mathcal{D}$ be a partition of the data set. The probability of $c_j$ associated with the partition is

$$p(c_j) = \frac{n_{c_j}}{N},$$

where $n_{c_j}$ is the number of pieces of data with the state $C = c_j$ in the partition set $\mathcal{D}$, and $N$ is the total number of pieces of data of that partition, $N = |\mathcal{D}|$.

(ii) When we use the IDM, a model of imprecise probabilities (see Walley [14]), the probability of a state of the class variable is obtained in a different way. Using the same notation, the probability is now obtained via an interval of probabilities:

$$p(c_j) \in \left[\frac{n_{c_j}}{N+s}, \frac{n_{c_j}+s}{N+s}\right],$$

where $s$ is a hyperparameter belonging to the IDM. The value of $s$ regulates the speed at which the upper and lower probabilities converge as the sample size increases. Higher values of $s$ produce a more cautious inference. Walley [14] does not give a decisive recommendation for the value of $s$, but he proposed two candidates, $s = 1$ and $s = 2$; nevertheless, he recommends the value $s = 1$. It is easy to check that the size of the intervals increases when the value of $s$ increases.
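The two ways of obtaining probabilities from counts can be compared numerically with a minimal sketch. The function names and the example counts below are illustrative assumptions; the formulas are the relative frequency and the IDM interval described above.

```python
def relative_frequencies(counts):
    """Precise probabilities: n_cj / N for each class state."""
    n = sum(counts)
    return [c / n for c in counts]

def idm_intervals(counts, s=1.0):
    """IDM intervals: [n_cj / (N + s), (n_cj + s) / (N + s)] for each class state."""
    n = sum(counts)
    return [(c / (n + s), (c + s) / (n + s)) for c in counts]

def width(interval):
    """Length of a probability interval."""
    lower, upper = interval
    return upper - lower

counts = [6, 3, 1]                       # hypothetical class counts in one partition
precise = relative_frequencies(counts)
narrow = idm_intervals(counts, s=1.0)
wide = idm_intervals(counts, s=2.0)
large_sample = idm_intervals([60, 30, 10], s=1.0)
```

Every interval has width $s/(N+s)$, so the intervals in `wide` are larger than those in `narrow`, while the intervals in `large_sample` are tighter because the sample is ten times larger.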

In the following sections, we will explain, in parallel, the differences between the classic split criteria and the ones based on imprecise probabilities. We will compare the classic *Info-Gain* of Quinlan [6] with the *Imprecise Info-Gain* of Abellán and Moral [15] and the *Info-Gain Ratio* of Quinlan [3] with the *Imprecise Info-Gain Ratio* of Mantas and Abellán [21]. The final procedure to select the variable to be inserted in a node by each split criterion can be seen in Table 1.
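For orientation, the two classic criteria can be sketched in a few lines, using their standard textbook definitions: Info-Gain is the reduction in the entropy of the class variable after splitting on an attribute, and Info-Gain Ratio divides that gain by the entropy of the attribute itself. The function names and toy data are assumptions for illustration; the imprecise variants are not implemented here.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def info_gain(xs, ys):
    """Entropy of the class ys minus its expected entropy given the attribute xs."""
    n = len(ys)
    conditional = 0.0
    for value, count in Counter(xs).items():
        subset = [y for x, y in zip(xs, ys) if x == value]
        conditional += (count / n) * entropy(subset)
    return entropy(ys) - conditional

def info_gain_ratio(xs, ys):
    """Info-Gain divided by the entropy of the attribute (the split information)."""
    split_info = entropy(xs)
    return info_gain(xs, ys) / split_info if split_info > 0 else 0.0

ys = [0, 0, 1, 1]
binary_attr = ["a", "a", "b", "b"]       # two values, separates the classes
many_valued = ["a", "b", "c", "d"]       # one distinct value per instance
```

Both attributes reach the same Info-Gain on this toy data, but the ratio of `many_valued` is half that of `binary_attr`, which shows how the ratio penalizes attributes with many values.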