Abstract

This article describes how misclassification costs given with the individual training objects for classification learning can be used in the construction of decision trees for minimal-cost instead of minimal-error class decisions. This is demonstrated by defining modified, cost-dependent probabilities and a new, cost-dependent information measure, and by using a cost-sensitive extension of the CAL5 algorithm for learning decision trees. The cost-dependent information measure ensures the selection of the (locally) next best, that is, cost-minimizing, discriminating attribute in the sequential construction of the classification trees. It is shown to be a cost-dependent generalization of the classical information measure introduced by Shannon, which depends only on classical probabilities. It is therefore of general importance and extends classical information theory, knowledge processing, and cognitive science, since subjective evaluations of decision alternatives can be included in entropy and the transferred information. Decision trees can then be viewed as cost-minimizing decoders for class symbols emitted by a source and coded by feature vectors. Experiments with two artificial datasets and one application example show that this approach is more accurate than a method that uses class-dependent costs given a priori by experts.

1. Introduction

The inductive construction of classifiers from training sets is one of the most common research areas in machine learning (ML) and therefore in human-computer interaction. The traditional task is to find a hypothesis (a classifier) that minimizes the mean classification error (see, e.g., [1] for an overview). However, as already stated in the technological roadmap of the MLnetII project (Network of Excellence in Machine Learning [2]), the consideration of costs in learning and classification is one of the most relevant fields in machine learning research and many applications. Turney [3] gives an overview of the possible types of costs which may appear in inductive concept learning in general, and Ling and Sheng [4] present an overview of cost-sensitive learning methods. In this paper, we will consider the case of classification learning through the construction of decision trees, which minimize the mean cost of false classifications (with error minimization as a special case). Ling et al. [5], for example, describe the construction of decision trees guided by a minimization of total costs (including costs for misclassification and attribute measurement). Drummond and Holte [6] investigate the cost-sensitivity of four commonly used attribute selection criteria in tree learning.

One way to incorporate costs in classification learning is to use a cost function that specifies the mean misclassification costs in a class-dependent manner a priori [1, 7–10]. In general, the class-dependent costs $c_k$ for not recognizing class $k$ have to be provided by an expert. Using this type of cost implies that the misclassification costs are assumed to be the same for each example of the respective class $k$. A more natural approach is to let the cost depend on the individual training example, for which it is objectively measured or determined, and not just on its class. The mean cost for a class or subclass can then be computed ("learned") on an objective basis. We hold that directly using the individual costs may also produce more accurate classifiers compared with using cost estimates given by experts.

One example of application is the classification of a bank's credit applicants as either "good customers" (who will pay back their credit) or "bad customers" (who are not likely to pay back their loans in full). A classifier for this task can be constructed from the bank's records of past credit applications containing personal information on customers, information on the actual loans (amount, duration, etc.), back payments on loans, and the bank's actual profit or loss. The loss occurring in a single case can be seen in a natural way as the misclassification cost for that example. In the case of a good customer, the cost is the bank's loss (the foregone profit) if that customer has been rejected. Where bad customers are concerned, the cost is simply the actual loss if the loan is not paid back in full. There are many other possible applications where major costs resulting from false decisions have to be avoided. For example, a medical diagnosis must not overlook a dangerous disease like cancer: the disease might not be very likely, yet failing to detect it could lead to very high costs (e.g., the death of the patient in an extreme case). Another example is searching for texts in a text database such as the Internet, where the importance of a text depends on the goals of specific user groups. Yet another application is the modeling of cognitive and general behavioral processes where, for example, emotional evaluation plays a major role.

One approach for using example-dependent costs has already been discussed and applied in the context of rough classifiers [11], the perceptron, piecewise linear classifiers, and support vector machines [12–14], in the examination of concept drift [15], and in reinforcement learning and process control [16, 17]. In this paper, it is applied to decision tree learning (see also [18, 19]). From the perspective of information theory, which will be the dominating aspect of this paper, this kind of classifier can be considered a decoder that tries to detect (reconstruct) a class symbol coded by a feature vector, which in turn is transmitted through a (generally noisy) channel. We will also introduce a new, cost-dependent information measure, discuss its properties, and use it in tree construction. We feel that it might be of general importance for information theory.

On the other hand, decision trees have the advantage of being able to be broken up into a set of rules which can be interpreted in a human-like fashion. Therefore, decision tree learning can be used as a tool for automatic knowledge acquisition in human-machine interaction, for example, in expert systems. In the cost-dependent case introduced here, this process can also be controlled by a factor of subjective importance defined by the cost of false decisions.

This article is structured as follows. Section 2 introduces the new, cost-dependent information measure. In Section 3, the CAL5 algorithm for the induction of decision trees [20–22] is introduced with a short overview. Section 4 describes the modification of CAL5 in the case of object-dependent costs. Experiments with two artificial domains and the above-mentioned credit problem can be found in Section 5. In Section 6, a comparison with the results of three other algorithms is presented, and Section 7 concludes the article.

2. Computation of a Cost-Dependent Information Measure

We start with the introduction of cost-dependent probabilities and a cost-dependent information measure as a generalization of the information measure introduced by Shannon, which only depends on classical probabilities [23]. The term “information measure” refers to the quantity of information produced by a source, which is coded and then transferred through a (generally noisy) channel to a receiver where it is decoded. This is sometimes (especially in ML) called “transinformation,” a term we also use in the following. As an example of application, we describe its use in the construction of decision trees as classifiers from training sets consisting of objects/situations, each of which is described by a feature vector, its class, and observed or measured cost for misclassification. This is a relevant task in machine learning, decision, and information theory.

In this construction (learning) of decision trees, which can be viewed as sequential decoders of class symbols coded by feature vectors, the next attribute for branching has to be selected if no unique class decision is possible yet. The attribute $x$ giving the (local) maximum information, that is, the maximum reduction of uncertainty about the classification, is usually used. In the case of attributes with continuous (real) values, an appropriate discretization has to be performed (see Section 3 for details). The information about the classes carried (coded) by a single attribute $x$ is measured by the transinformation [23], which is defined as the difference of $H(C)$, the entropy of the classes before measuring $x$, and $H(C \mid x)$, the expectation (mean) of the entropy taken over the set of measured values of $x$. $H(C)$ measures the uncertainty of a class decision before measuring $x$, and $H(C \mid x)$ the remaining mean uncertainty (equivocation) after measuring and decoding $x$. In this context, we regard the class $C$ and the attribute $x$ as stochastic variables with the values $k$ and $x_j$, respectively. Transinformation is then defined by

$$T(C; x) = H(C) - H(C \mid x)$$

with

$$H(C) = -\sum_k p(k) \log p(k), \qquad H(C \mid x) = -\sum_j p(x_j) \sum_k p(k \mid x_j) \log p(k \mid x_j).$$

Here $p(k)$ is the a priori probability of class $k$, $p(x_j)$ is the probability of observing the value $x_j$ of attribute $x$, and $p(k \mid x_j)$ denotes the probability of class $k$ when the attribute value $x_j$ has been observed.

Now we introduce a cost $c(k, x_j)$ for not deciding the true class $k$ if the value $x_j$ is measured. We therefore assume that the costs are independent of the incorrectly predicted class and that the costs for correct classification are zero. To define a cost-dependent transinformation measure as a generalization of the Shannon information, we introduce cost-dependent probabilities to replace the classical ones, following [12, 18], with the definitions

$$p_c(k) = \frac{c_k}{b}\, p(k),$$

where

$$b = \sum_k c_k\, p(k)$$

and $c_k$ denotes the mean cost arrived at by averaging over all objects of class $k$. The value $p_c(k)$ is the new, cost-sensitive probability of class $k$, arrived at by multiplying the "classical" probability $p(k)$ with the mean cost $c_k$ for not recognizing class $k$, normalized by the mean cost $b$ that takes all classes into account.

Furthermore, we define the cost-transformed conditional probabilities

$$p_c(k \mid x_j) = \frac{c(k, x_j)}{b(x_j)}\, p(k \mid x_j) \quad \text{and} \quad p_c(x_j) = \frac{b(x_j)}{b}\, p(x_j),$$

where $b(x_j) = \sum_k c(k, x_j)\, p(k \mid x_j)$. The probabilities $p_c(x_j)$ and the conditional probabilities $p_c(k \mid x_j)$ are defined according to the same principle used above, that is, by multiplying the classical probabilities with the mean normalized costs; $b(x_j)$ is the mean cost for misclassification if the feature value $x_j$ emerges. It can be proved that for the normalization constant $b$ the relation

$$b = \sum_j b(x_j)\, p(x_j)$$

holds.

It can also be proved that $p_c(k)$, $p_c(x_j)$, and $p_c(k \mid x_j)$ are indeed probabilities, which satisfy the axioms and rules that hold for classical probabilities.
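
For illustration, the following minimal Python sketch (with assumed toy numbers, not part of the algorithm described here) computes the cost-dependent probabilities defined above from a small joint distribution $p(k, x_j)$ and a cost table $c(k, x_j)$, and checks the normalization relation $b = \sum_j b(x_j)\, p(x_j)$:

import numpy as np

# Toy joint distribution p(k, x_j): rows = classes k, columns = attribute values x_j.
p_kx = np.array([[0.30, 0.20],    # class 0
                 [0.10, 0.40]])   # class 1
# Assumed costs c(k, x_j) for not recognizing class k when x_j is observed.
c_kx = np.array([[1.0, 1.0],
                 [5.0, 2.0]])

p_k = p_kx.sum(axis=1)                   # class priors p(k)
p_x = p_kx.sum(axis=0)                   # value probabilities p(x_j)
p_k_given_x = p_kx / p_x                 # conditional probabilities p(k | x_j)

c_k = (c_kx * p_kx).sum(axis=1) / p_k    # mean cost per class, averaged over objects of class k
b = (c_k * p_k).sum()                    # overall mean cost (normalization constant)
b_x = (c_kx * p_k_given_x).sum(axis=0)   # mean cost b(x_j) given value x_j

pc_k = c_k * p_k / b                     # cost-dependent class probabilities p_c(k)
pc_x = b_x * p_x / b                     # cost-dependent value probabilities p_c(x_j)
pc_k_given_x = c_kx * p_k_given_x / b_x  # cost-dependent conditionals p_c(k | x_j)

print(pc_k, pc_k.sum())                  # sums to 1
print(pc_x, pc_x.sum())                  # sums to 1
print(np.isclose(b, (b_x * p_x).sum()))  # the normalization identity holds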

The well-known Bayes decision rule decides, in the cost-free case, for the class with the highest probability [24]. Defining new probabilities by multiplying the original ones with the normalized costs is consistent with the cost-sensitive Bayes decision rule including costs for misclassification:

$$\text{decide class } k \quad \text{if} \quad c_k\, p(k) \ge c_i\, p(i) \quad \text{for all classes } i,$$

which gives identical decision results if both sides are divided by the normalization constant $b$. Then the rule can be written in the form

$$\text{decide class } k \quad \text{if} \quad p_c(k) \ge p_c(i) \quad \text{for all classes } i,$$

that is, with probabilities again, but cost-dependent ones.
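
As a small numerical illustration (with assumed values, not taken from the article's datasets): let there be two classes with priors $p(1) = 0.7$ and $p(2) = 0.3$ and mean misclassification costs $c_1 = 1$ and $c_2 = 5$. Then

$$b = 1 \cdot 0.7 + 5 \cdot 0.3 = 2.2, \qquad p_c(1) = \frac{1 \cdot 0.7}{2.2} \approx 0.32, \qquad p_c(2) = \frac{5 \cdot 0.3}{2.2} \approx 0.68,$$

so the error-minimizing rule would decide for class 1 (the more probable class), whereas the cost-minimizing rule decides for class 2, because $c_2\, p(2) > c_1\, p(1)$, or equivalently $p_c(2) > p_c(1)$.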

The class decision is usually based on the observed attribute values, not only on the prior class probabilities and costs. In the following, we will restrict our considerations to a single attribute $x$, but this can be generalized to arbitrary sequences of test attributes. Using conditional probabilities, we have the corresponding decision rule restricted to objects with feature value $x_j$:

$$\text{decide class } k \quad \text{if} \quad c(k, x_j)\, p(k \mid x_j) \ge c(i, x_j)\, p(i \mid x_j) \quad \text{for all classes } i.$$

When using joint probabilities, this is equivalent to $c(k, x_j)\, p(k, x_j) \ge c(i, x_j)\, p(i, x_j)$. A special case is $c(k, x_j) = c_k$, that is, if only class-dependent costs are considered.

Note that the cost of misclassification also measures the (perhaps subjective or problem-dependent) importance of the class in question: the cost $c(k, x_j)$ corresponds to the relevance of class $k$ in a situation (or for an object) characterized by the feature value $x_j$. Similarly, a cost $c(k, \mathbf{x})$ depends on a feature vector $\mathbf{x}$ describing a situation or object in more detail. This is the type of cost associated with training vectors ("example-dependent costs") taken into consideration in this paper.

We have simplified the definitions of cost-dependent probabilities here by regarding only attributes with discrete values. For tree construction in the case of attributes with continuous values, these must be discretized to transform them into attributes with discrete values. Each constructed discrete value represents an interval $I$ on the attribute, whose cost is computed as the mean of the individual costs of the objects falling into that interval (see Section 4).

Now we can define a cost-dependent entropy and a cost-dependent transinformation by simply replacing the appropriate probabilities in the classical expressions with the cost-dependent ones defined above:

$$H_c(C) = -\sum_k p_c(k) \log p_c(k), \qquad H_c(C \mid x) = -\sum_j p_c(x_j) \sum_k p_c(k \mid x_j) \log p_c(k \mid x_j),$$

$$T_c(C; x) = H_c(C) - H_c(C \mid x).$$

$T_c(C; x)$ is the cost-dependent transinformation, which can be regarded as a generalization of the Shannon information and will be used in the construction of cost-dependent decision trees. These decision trees are (locally) optimized during sequential construction to minimize the mean cost of misclassification instead of the mean classification error.
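
The following short Python sketch (an illustration under the notation used above, not the authors' implementation) computes this cost-dependent transinformation from the cost-dependent probabilities; it can be applied, for instance, to the arrays produced in the previous sketch:

import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector (0 log 0 treated as 0)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def transinformation(pc_k, pc_x, pc_k_given_x):
    """Cost-dependent transinformation T_c(C; x) = H_c(C) - H_c(C | x).

    pc_k:         cost-dependent class probabilities p_c(k)
    pc_x:         cost-dependent value probabilities p_c(x_j)
    pc_k_given_x: matrix of p_c(k | x_j), rows = classes, columns = values
    """
    h_c = entropy(pc_k)
    h_c_given_x = sum(pc_x[j] * entropy(pc_k_given_x[:, j])
                      for j in range(len(pc_x)))
    return h_c - h_c_given_x

For attribute selection during tree construction, the attribute yielding the largest value of this quantity would be chosen for the next branching.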

The following proposition states two important properties of the cost-sensitive transinformation and relates it to its classical, cost-independent counterpart.

Proposition 1. (a) If the costs for all classes and values are equal, one arrives at the cost-independent, classical definitions of $H$ (Shannon entropy) and $T$ (Shannon information) as special cases.
(b) If the mean cost $c_k$ for one class $k$ exceeds the mean costs for all other classes, then the cost-dependent probability $p_c(k)$ will be greater than the classical probability $p(k)$. If, in the special case of two classes, there are different mean costs for those two classes, then there will be a larger mean difference of the cost-dependent probabilities compared to the classical probabilities, and the entropy will thus become smaller; that is, it holds that $H_c < H$. This means that there is a reduction of (subjective) uncertainty. $H_c < H$ is also reached in the general case of more than two classes if the cost for a class exceeds some threshold depending on the (classical) probability of that class.

Proof. (a) If $c(k, x_j) = c$ for all values $x_j$ and all classes $k$, then $p_c(k) = p(k)$ and $p_c(k \mid x_j) = p(k \mid x_j)$, because $c_k = b = b(x_j) = c$; that is, the classical case will emerge.
(b) We consider the case of two classes with $p = p(1) \ge 1/2$ and set $q = p_c(1)$. If $c_1 > c_2$, that is, $c_1 > b > c_2$, then $q > p$ and, subsequently, $p_c(2) < p(2)$, because $q + p_c(2) = 1$. Then we have $H_c = -q \log q - (1 - q) \log(1 - q)$.
The entropy function $H(q)$ monotonously decreases in the interval $[1/2, 1]$, with its maximum at $q = 1/2$. From $q > p \ge 1/2$, it then follows that $H_c < H$. Because of the symmetry of the entropy function, $H(q) = H(1 - q)$, there is also $H_c < H$ if $c_2 > c_1$ and $p(2) \ge 1/2$, since then $p_c(2) > p(2)$ holds. The generalization to more than two classes is shown in the following. If the relative cost $c_2 / c_1$ is very small (e.g., $c_1$ is very large as compared with $c_2$), then in the limit $p_c(1) \to 1$ and $p_c(2) \to 0$; that is, $H_c \to 0$, and we get a total overgeneralization of class 1. This also holds for more than two classes if the cost of one class dominates all others. In this extreme situation there is no uncertainty; it is then subjectively optimal to decide for that class every time despite the high error rate. It also follows that $H_c < H$ will hold if the maximum cost exceeds some threshold. This is a generalization of the case of two classes treated above and means that the decision uncertainty given by the entropy is reduced if there are high costs for not recognizing special classes.
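
As a numerical illustration of part (b) (with assumed values): for two equally probable classes, $p(1) = p(2) = 0.5$, and mean costs $c_1 = 4$ and $c_2 = 1$, we get

$$b = 2.5, \qquad p_c(1) = \frac{4 \cdot 0.5}{2.5} = 0.8, \qquad p_c(2) = 0.2, \qquad H = 1 \text{ bit}, \qquad H_c \approx 0.72 \text{ bit},$$

so the high cost of not recognizing class 1 reduces the (subjective) decision uncertainty.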

Remark 1. An attribute x which is irrelevant, that is, does not discriminate between the classes (gives no information about the classes), can become relevant in cost-dependent classification because its values might be associated with different costs. An example of this will be presented in Section 5.2.

Remark 2. As mentioned in the introduction, the results obtained here are also applied in cognitive science as a theoretical foundation of cost-controlled human behavior. One application, the explanation of the generation and possible control of psychopathological behavior, is described in a paper written together with a well-known German psychotherapist and researcher in psychoanalysis [25].

3. Decision Tree Learning with CAL5

The following section provides an overview of how a decision tree is constructed with CAL5. A comparison with other decision tree algorithms can be found in Section 3.2.

3.1. Overview of CAL5

The CAL5 algorithm [1, 2022, 26] for learning decision trees for classification and prediction converts real-valued attributes into discrete-valued ones by defining intervals on the real dimension through the use of statistical considerations. The intervals (corresponding to discrete or “linguistic” values) are automatically constructed and adapted to establish an optimal discrimination between the classes through axis-parallel hyperplanes in the feature space. The trees are built top-down in the usual manner through stepwise branching with new attributes to improve the discrimination. An interval in a new dimension is formed if the hypothesis “one class dominates in the interval” or the alternative hypothesis “no class dominates” can be decided on using a user-defined confidence level by means of the estimated conditional class probabilities. “Dominates” means that the class probability exceeds some threshold given by the user.

In the following, we will give a more detailed description of CAL5. If, during tree construction, a terminal node (leaf) representing an interval in which no class dominates is reached, it will be refined by using the next attribute for branching. The attribute is selected using the transinformation measure (see Section 2) to decide which is the (locally) best discriminating attribute, that is, which one transfers maximum information on the classes at this node. To determine this locally best discriminating attribute, the following procedure of discretization and computation of transinformation is applied to all attributes occurring in the feature vector (apart from the attribute immediately preceding the actual node on the path leading to it).

We assume that one (real-valued) attribute is selected. In order to use it to construct a branching and compute its transinformation, intervals defining the discrete values must be constructed first.

In the first step of the interval construction, all values of the training objects reaching the terminal node are ordered along the new dimension $x$. In the second step, the values are collected one by one from left to right, tentatively forming intervals. Now a confidence interval for each class probability is defined for a user-specified confidence level $1 - \alpha$ in order to estimate the class probabilities in each current tentative interval $I$ on $x$. This confidence interval depends on the frequency $n_k$ of class $k$ in $I$, the total number $n$ of objects in $I$, and on $\alpha$. (Note that $I$ defines a tentative discrete value of the originally numerical attribute $x$.) A Bernoulli distribution is assumed for the occurrence of class symbols in $I$. The confidence interval is derived using the Chebyshev inequality.
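
The exact bound used by CAL5 is not reproduced here; one standard Chebyshev-derived interval for a Bernoulli proportion, obtained by bounding the variance $p(1-p)/n$ of the relative frequency $n_k/n$ by $1/(4n)$, is

$$\Pr\!\left(\left|\frac{n_k}{n} - p(k \mid I)\right| \ge \varepsilon\right) \le \frac{1}{4 n \varepsilon^2} \le \alpha \quad\Longrightarrow\quad p(k \mid I) \in \left[\frac{n_k}{n} - \frac{1}{2\sqrt{n\alpha}},\ \frac{n_k}{n} + \frac{1}{2\sqrt{n\alpha}}\right]$$

with confidence $1 - \alpha$.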

Applying a confidence interval for probabilities in the statistical sense makes it easier for us to introduce the confidence of class decisions in the cost-sensitive setting as compared with the use of a geometrical method such as the consideration of the distance of class boundaries as used by, for example, Tóth and Pataki [27, 28].

With the confidence interval, the following "metadecision" is made for each tentative $x$-interval $I$.

(1) If, for a class $k$ in $I$, it holds that

$$\frac{n_k}{n} - \delta(n, \alpha) \ge S,$$

where $\delta(n, \alpha)$ denotes the half-width of the confidence interval, that is, if the entire confidence interval is above $S$, then decide that "class $k$ dominates in interval $I$" and close $I$. This means that $p(k \mid I) \ge S$ holds for the probability of the occurrence of class $k$ in $I$ with a probability (confidence) of $1 - \alpha$. $S$ with $0.5 \le S \le 1.0$ is a user-given threshold, defining the maximum admissible error $1 - S$ in a class decision using the tree. If class $k$ dominates in $I$, the path is closed and $k$ is attached as a final class label to this newly created terminal node if $x$ becomes the attribute with maximum transinformation for the current decision node (see below). The interval $I$ then corresponds to a newly defined discrete value of $x$.

(2) If, for all classes $k$ in $I$, it holds that

$$\frac{n_k}{n} + \delta(n, \alpha) < S,$$

that is, if no class dominates, then also close interval $I$ and begin the construction of the next adjacent interval.

(3) If neither (1) nor (2) holds, a class decision is rejected and the interval $I$ will be extended by the next attribute value in the ordering on $x$ in order to enlarge the statistics. Special heuristics (decision for a majority class) are applied in the case where an interval remains in dimension $x$ which satisfies neither hypothesis (1) nor hypothesis (2) and cannot be extended further due to the finite training set. Finally, adjacent intervals with the same class label are merged.
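
This three-way metadecision can be sketched in Python as follows. The sketch is illustrative, not the original implementation; the Chebyshev-style half-width delta is the assumed bound discussed above, and counts is a hypothetical mapping from class labels to their frequencies in the tentative interval.

from math import sqrt

def meta_decision(counts, alpha, S):
    """Return ('dominates', k), ('no_dominant_class', None), or ('no_decision', None)
    for a tentative interval, given class frequencies `counts` (dict: class -> count),
    the confidence parameter alpha, and the dominance threshold S."""
    n = sum(counts.values())
    delta = 1.0 / (2.0 * sqrt(n * alpha))   # assumed Chebyshev-style half-width
    for k, n_k in counts.items():
        if n_k / n - delta >= S:            # whole confidence interval above S: class k dominates
            return ('dominates', k)
    if all(n_k / n + delta < S for n_k in counts.values()):
        return ('no_dominant_class', None)  # no class can reach the threshold
    return ('no_decision', None)            # extend the interval with the next value (rule (3))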

This procedure is repeated recursively until all intervals of $x$ are constructed, after which the transinformation for the discretized attribute $x$ is computed (see Section 2). Note that the statistics for a constructed interval $I$, that is, the set of class probabilities estimated from the training objects it includes, must be retained for the sake of computing the transinformation, even in the case where a unique class decision is possible.

This computation of the transinformation is done for all available attributes apart from the one immediately preceding the actual node. The attribute that delivers the maximum transinformation is used for branching. Branching stops if either the terminal nodes of the branch contain a unique class label or the statistics in an interval are too poor to make either decision (1) or (2). We will give a more detailed and precise description of the algorithm, modified to include example-dependent costs, in Section 4.

Note also that the intervals and the transinformation both depend on the path from the root of the tree to the actual node (not indicated here for easier reading). Also, because an attribute might be used several times on the same path in the tree, a larger interval of attribute $x$, which was already constructed at a higher level, may become split up into smaller ones. For example, "large" might become "medium large" and "very large" if the attribute $x$ means "size."

If the costs for not recognizing some classes are given, the original CAL5 uses class-dependent thresholds $S_k$. From decision theory, it follows [20] that these thresholds have to be chosen inversely proportional to the class-dependent costs, that is, such that

$$S_i\, c_i = S_j\, c_j$$

holds for all pairs of classes $i$ and $j$, where $c_k$ is the cost of the misclassification of an object belonging to class $k$. The cost must be provided by the user and depends solely on the class. The main aim of this paper is to introduce object-dependent costs given with the training objects. This allows these costs to be used locally for the construction of an interval $I$ (defining a region in the feature space together with the intervals of the other attributes on the path) and independently of experts' subjective estimates. This is explained in more detail in the following sections. In Section 5 it will be shown that high costs of misclassification lead to an enlargement of the decision regions for the corresponding classes.

3.2. Comparison with Other Decision Tree Algorithms

CAL5 has been compared with other algorithms for classification learning, particularly within the scope of the famous European STATLOG project described in detail in the book [1]. STATLOG compared 24 algorithms for classification learning (classical statistical methods, modern statistical techniques, machine learning of rules and trees, neural networks) by means of 24 datasets from different types of practical applications (image and letter recognition, bank loans, medical diagnoses, shuttle control, chromosomes and DNA, technical problems, tsetse fly distribution). The tree learning algorithms most similar to CAL5 are Quinlan's famous C4.5 [29] and CART [30]. A comparison is given hereinafter. C4.5 uses its information measure for subtree splitting in the same sense as our CAL5, which is why our new, cost-dependent version introduced above could be applied in cases where costs are given. CART uses the Gini index as a splitting criterion, which also measures the class impurity at a (temporary) node. The main difference of both algorithms compared with CAL5 is their application of "backward pruning," motivated by "overfitting," since the tree construction has a priori no stopping criterion. (Overfitting is the process of inferring more structure in the tree than is justified by the population from which it is drawn; see [1, Sections 5.1.5–5.1.7] for details.) In contrast to this, CAL5 uses the confidence intervals for the confident estimation of class probabilities at a leaf to decide on (immediate) pruning, as explained above. The confidence bounds used in these decisions are based on the parameter $\alpha$ defining the confidence interval, which must be optimized. Another difference is that CAL5 is able to construct more than two values (intervals) of a numerical attribute used for splitting, whereas C4.5 and CART only allow binary splits.

Considering the experimental results of the STATLOG project, CAL5 showed good results on average. Its performance averaged on the 24 datasets used in STATLOG was similar to that of C4.5 (even a little better [26]). In one case (Australian credit dataset), it even achieved the best performance (see the tables of results in [1]). Of course, different applications call for different algorithms. The main advantage of CAL5 is the construction of smaller trees as compared to algorithms using backward pruning [1]. This results in better interpretability and rules that are easier to understand. The reason for this might lie in the use of confident probability estimates and the possibility of constructing discrete-valued attributes with a potentially arbitrary number of values.

An algorithm for the construction of decision trees using "total costs" (defined as the sum of the misclassification costs and the costs of measuring attributes), where the costs are to be given a priori, is described by Ling et al. [5]. The additional use of attribute costs is a possible extension of CAL5 (if those costs are known a priori) and of CAL5_OC (if example-dependent attribute costs are given, too).

4. The Algorithm CAL5_OC

In this section, we describe an extended version of CAL5 capable of handling example-dependent costs. In Section 4.1 we describe a modified version of the decision rules (1)–(3) defined in the last section. In Section 4.2, a detailed description of the algorithm CAL5_OC is provided based on the ideas presented in Section 4.1.

4.1. Using Cost-Dependent Decision Thresholds in the Case of Object-Dependent Costs

Now we extend the metadecision rules (1)–(3) and the (locally applied) optimal branching procedure introduced in Section 3 to the case in which the training objects, originally described by a feature vector and its class, are also presented with costs for misclassification. This means that we have a training set

$$M = \{(\mathbf{x}_i, k_i, c_i) \mid i = 1, \ldots, N\}$$

with feature vector $\mathbf{x}_i$, class $k_i$, and cost of misclassification $c_i$, the latter recorded as a former experience in a situation where $\mathbf{x}_i$ had been observed and a false decision was made.

The algorithm performs an interval formation if a new branch has to be constructed with a selected attribute $x$. We assume that $I$ is the (perhaps temporary) current interval constructed. It represents a discrete value of $x$. In the following, we use $x$ to denote both the original attribute $x$ and its discretized version, which is to be constructed.

Based on decision theory, the following rule for the class decision can be derived (see also the appendix and [18, 20]):

Decide class $k$ in interval $I$ if

(a) $c_k(I)\, p(k \mid I) \ge c_j(I)\, p(j \mid I)$ for all classes $j$ (Bayes' decision rule) and simultaneously
(b) $p(k \mid I) \ge S_k(I)$.

$p(k \mid I)$ is the conditional probability of class $k$ (to be approximated by the relative frequency) in interval $I$, where $I$ is considered a stochastic variable. $c_k(I)$ is the mean cost for the misclassification of class $k$ in interval $I$ and is estimated by

$$c_k(I) = \frac{1}{n_k(I)} \sum_i c_i,$$

where the sum ranges over all training examples $(\mathbf{x}_i, k_i, c_i)$ for which $k_i = k$ holds and whose value of attribute $x$ falls into $I$. The value $n_k(I)$ is the number of training vectors of class $k$ falling into $I$. $c_k(I)$ can now be interpreted (locally) as the cost $c(k, x_j)$ explained in Section 2, since $I$ represents a discrete value $x_j$ of the discretized attribute $x$. Note that $I$, and thus the mean cost $c_k(I)$, also depends on the path from the root of the current decision tree to the root node of the new branch (labeled with $x$) where it was constructed.

$S_k(I)$ defined in (b) is the decision threshold, which now depends on the mean costs in $I$. It involves $c_r$, the cost of rejecting a dominance decision and of the subsequent further branching. Since $c_r$ is unknown in advance, we eliminate it through the division of $S_i(I)$ by $S_j(I)$. For all classes with indices $i$ and $j$ this gives us

$$\frac{S_i(I)}{S_j(I)} = \frac{c_j(I)}{c_i(I)}.$$

Let $c_{\min}(I)$ be the cost of the class with minimal cost in $I$, and $S_{\max}$ its threshold. For this class we choose a constant threshold $S_{\max}$, which is a parameter of the algorithm and, in particular, is independent of $I$. Note, however, that $S_{\max}$ is related to the unknown cost of rejection (see the appendix). The value of $S_{\max}$ determines all other thresholds by

$$S_k(I) = S_{\max}\, \frac{c_{\min}(I)}{c_k(I)}.$$

$S_{\max}$ controls the complexity of the resulting tree. It does not change the relations of the other thresholds (see the appendix). The optimal value of $S_{\max}$ must be defined by the user or optimized through a systematic search, as we have done in our experiments described in Section 5.
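
A minimal Python sketch of this threshold computation, assuming the relation derived above (the function name and data layout are illustrative):

def thresholds(mean_costs, S_max):
    """Compute class-dependent decision thresholds S_k(I) for an interval I.

    mean_costs: dict mapping class k -> mean misclassification cost c_k(I) in I
    S_max:      user-given threshold for the class with minimal cost in I
    """
    c_min = min(mean_costs.values())
    return {k: S_max * c_min / c_k for k, c_k in mean_costs.items()}

# Example: classes with mean interval costs 1.0 and 5.0 and S_max = 0.9
# yield thresholds {0: 0.9, 1: 0.18}; the expensive class needs much less
# estimated probability in order to be decided on.
print(thresholds({0: 1.0, 1: 5.0}, 0.9))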

It can now be seen that, as a measure of importance, high costs for the misclassification of a class $k$ (e.g., the failed detection of cancer in a medical diagnosis) lead to a lower threshold for the class decision; that is, $k$ has a higher chance of being decided on, in contrast to other classes with lower importance (yet perhaps with a higher probability).

Using the class-dependent thresholds, we arrive at modified metadecision rules (see Section 3):

(1) If, for a class $k$ in $I$, it holds that (a) $c_k(I)\, p(k \mid I) \ge c_j(I)\, p(j \mid I)$, or equivalently $p_c(k \mid I) \ge p_c(j \mid I)$, for all classes $j$ (Bayes' decision rule) and (b) the entire confidence interval for $p(k \mid I)$ lies above the threshold $S_k(I)$, then decide that "$k$ dominates in $I$." The formula for the estimation of $c_k(I)\, p(k \mid I)$ used in our algorithm for learning cost-dependent trees will be given below.

(2) If, for all classes $k$ in $I$, it holds that the entire confidence interval for $p(k \mid I)$ lies below $S_k(I)$, that is, no class dominates, then a branch with the next attribute is constructed.

(3) Case (3) of the description given in Section 3 holds without modification.

For applying Bayes' decision rule (1)(a), the value of $c_k(I)\, p(k \mid I)$ can be estimated by

$$c_k(I)\, p(k \mid I) \approx \frac{1}{n(I)} \sum_{i:\ k_i = k,\ \mathbf{x}_i \in I} c_i,$$

where $n(I)$ is the total number of training examples falling into $I$. Because we would like to find the maximum over all classes, we can skip the denominator $n(I)$, since it only depends on the interval.

Note that the transinformation computed for each attribute occurring in the feature vector (apart from the one immediately preceding on the path in the tree) for finding the optimal attribute for the new branch is now cost-dependent, since the intervals I and the thresholds are cost-dependent (and also class-dependent). A high cost for misclassification leads to a low threshold for the corresponding class. This is also interesting from the perspective of modeling human decision-making, controlling attention and consciousness where the subjective importance of the class of the actual object/situation and its context (defined by the path to the leaf where the class to be decided on is located) plays a major role.

4.2. Description of the CAL5_OC Algorithm

The following offers a concise summary and description of the cost-dependent construction of decision trees with CAL5_OC. The algorithm has the following input and parameters:

(i) a training set $M = \{(\mathbf{x}_i, k_i, c_i) \mid i = 1, \ldots, N\}$, where $\mathbf{x}_i$ is a feature vector representing the $i$-th example, $c_i$ is the misclassification cost associated with it, and $k_i$ is its given class,
(ii) the confidence value $\alpha$ for the statistical decision about the dominance of a class in a given interval for any attribute,
(iii) the decision threshold $S_{\max}$ used to determine whether the "least expensive" class dominates in a given interval.

The output of the algorithm is a decision tree with decision nodes and leaves. A decision node is labeled with an attribute $x$, and each branch originating from it is labeled with an interval $I$ resulting from the interval construction, that is, the discretization process described in Section 4.1. As usual, the leaves are labeled with classes $k$.

The learning algorithm constructs intervals for an attribute x in order to determine its cost-sensitive transinformation as explained in Section 2 and for the splitting procedure. The best attribute for a given training set has the highest transinformation and is used to construct the next branch, followed by a recursive application of the algorithm.

We implement the rules (1)–(3) defined in Section 4.1 using the following Boolean functions that return either true or false:

(i) bayes_optimal$(k, I)$: true if $c_k(I)\, p(k \mid I)$ is maximal over all classes,
(ii) dominates$(k, I)$: true if bayes_optimal$(k, I)$ is true and the entire confidence interval for $p(k \mid I)$ lies above $S_k(I)$,
(iii) no_dominant_class$(I)$: true if, for all classes $k$, the entire confidence interval for $p(k \mid I)$ lies below $S_k(I)$.

Note that, in addition to dominates$(k, I)$ for some $k$ and no_dominant_class$(I)$, there is the third possibility of rule (3) that we cannot make a statistically confident decision because there might not be enough examples in $I$. In the interval construction algorithm, this triggers the extension of the tentative interval for $x$ so that more examples can be included.

Now we focus on a single attribute x and describe the cost-sensitive interval construction for it. The interval construction is based on the current training set which is either the one that was originally given or the one that occurs in one of the tree algorithm’s recursive calls (see below). In other words, it corresponds with the objects “arriving” at a certain node in the tree for which the next attribute is to be selected.

The function intervals$(x)$ returns a sequence of intervals $I_1, \ldots, I_m$ constructed for $x$ based on the given examples. The number $m$ of intervals is also determined by the algorithm. The intervals $I_j$ should not be confused with the cost-dependent transinformation, denoted $T_c$ in Section 2.

Let $I_j = [l_j, u_j]$ be an interval of this sequence. Its boundaries $l_j$ and $u_j$ correspond to specific values in the ordered sequence of all $x$-values in the training set $M$, extended with the values $-\infty$ and $+\infty$. These boundaries, or rather the indices of the lower and upper bounds, will be adjusted by the following algorithm. The result of this algorithm is a sequence of intervals $I_1, \ldots, I_m$ such that $l_1 = -\infty$ holds, that is, the lower bound of the first interval is $-\infty$, as well as $u_m = +\infty$, that is, the upper bound of the last interval is $+\infty$; this means that the intervals cover the entire range of possible values. For two neighboring intervals it holds that $u_j = l_{j+1}$.

Function intervals(x):
(1) Sort the training examples according to their $x$-values, resulting in a sequence of attribute values $v_1 \le v_2 \le \cdots \le v_n$.
(2) Set $v_0 = -\infty$ and $v_{n+1} = +\infty$.
(3) Set $j = 1$, $l_1 = v_0$, and $u_1 = v_1$; that is, the first interval considered is $I_1 = [v_0, v_1]$, and $j$ is the index of the interval currently constructed.
(4) Repeat until $u_j = v_n$ holds, that is, until the last and largest value is reached:
(a) If dominates$(k, I_j)$ holds for some class $k$, or if no_dominant_class$(I_j)$ holds, then:
(i) Set $j = j + 1$ and $l_j = u_{j-1}$; that is, close the current interval and start working on the next interval.
(ii) Set $u_j$ to the next largest attribute value. This means that the next largest attribute value is added to the next interval.
(b) Else (when no decision is possible in $I_j$):
(i) Extend interval $I_j$ to the right by setting $u_j$ to the next largest attribute value; that is, the next largest attribute value from the training set is included.
(5) Set $u_j = v_{n+1} = +\infty$ (extend the last interval to $+\infty$). The total number of intervals is set to $m = j$.
(6) Return the sequence of intervals $I_1, \ldots, I_m$.
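
The following Python sketch mirrors this interval-construction loop. It is an illustrative reimplementation under the assumptions made above (in particular the Chebyshev-style half-width and the threshold relation $S_k(I) = S_{\max}\, c_{\min}(I)/c_k(I)$); examples is a hypothetical list of (x_value, class_label, cost) triples for a single attribute, and merging of adjacent intervals with identical characteristics is omitted.

from math import sqrt, inf
from collections import defaultdict

def class_stats(examples):
    """Frequencies and mean misclassification costs per class for one interval."""
    counts, costs = defaultdict(int), defaultdict(float)
    for _, k, c in examples:
        counts[k] += 1
        costs[k] += c
    mean_costs = {k: costs[k] / counts[k] for k in counts}
    return counts, mean_costs

def decision(examples, alpha, S_max):
    """Return ('dominates', k), ('no_dominant_class', None), or ('no_decision', None)."""
    counts, mean_costs = class_stats(examples)
    n = sum(counts.values())
    delta = 1.0 / (2.0 * sqrt(n * alpha))                    # assumed Chebyshev-style half-width
    c_min = min(mean_costs.values())
    S = {k: S_max * c_min / mean_costs[k] for k in counts}   # cost-dependent thresholds S_k(I)
    # Bayes-optimal class: maximal summed cost of its examples in the interval.
    bayes_k = max(counts, key=lambda k: counts[k] * mean_costs[k])
    if counts[bayes_k] / n - delta >= S[bayes_k]:
        return ('dominates', bayes_k)
    if all(counts[k] / n + delta < S[k] for k in counts):
        return ('no_dominant_class', None)
    return ('no_decision', None)

def intervals(examples, alpha, S_max):
    """Greedy left-to-right interval construction for one attribute."""
    examples = sorted(examples, key=lambda e: e[0])
    bounds, current = [-inf], []
    for i, e in enumerate(examples):
        current.append(e)
        label, _ = decision(current, alpha, S_max)
        if label != 'no_decision' and i < len(examples) - 1:
            bounds.append(e[0])                              # close the current interval
            current = []                                     # start the next interval
    bounds.append(inf)                                       # last interval extends to +infinity
    return list(zip(bounds[:-1], bounds[1:]))                # sequence of (lower, upper) pairs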

In a postprocessing stage, it is possible to merge two adjacent intervals $I_j$ and $I_{j+1}$ with identical characteristics; that is, either the same class dominates in both, no class dominates in both, or no decision was possible in either interval. It is also possible, and reasonable, to change the interval boundaries so that they lie in the middle of two consecutive attribute values. This is achieved for $j = 1, \ldots, m-1$ by changing the common boundary $u_j = l_{j+1}$ to the mean of the largest attribute value falling into $I_j$ and the smallest attribute value falling into $I_{j+1}$.

Note that this transformation does not change the statistical values that were computed. We are now ready to define the algorithm for constructing the tree.

Function CAL5_OC($M$):
(1) Evaluate intervals$(x)$ for every attribute $x$ and determine its cost-dependent transinformation $T_c(C; x)$.
(2) Let $x$ be the attribute that has the highest transinformation $T_c(C; x)$. Let the associated intervals be $I_1, \ldots, I_m$.
(3) Create a decision node labeled with $x$.
(4) For $j = 1, \ldots, m$:
(a) Let $M_j$ denote the set of examples in $M$ whose value of $x$ lies in $I_j$.
(b) If dominates$(k, I_j)$ holds for some class $k$, then create a leaf labeled with $k$.
(c) Else, if no_dominant_class$(I_j)$ holds, then create a subtree recursively by evaluating CAL5_OC($M_j$).
(d) Else (if no decision is possible), create a leaf labeled with a class $k$ such that bayes_optimal$(k, I_j)$ holds.
(e) Connect the decision node and the created leaf or subtree using a branch labeled with the interval $I_j$.
(5) Return the tree.
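
The control flow of this recursion can be sketched in Python as follows. The sketch builds on the hypothetical intervals, decision, and class_stats helpers from the previous sketch and on a transinformation_of function assumed to implement the cost-dependent measure of Section 2; it illustrates the structure of the algorithm rather than reproducing the authors' implementation. Examples are assumed to be (features, class_label, cost) triples, with features a dictionary mapping attribute names to numerical values.

def cal5_oc(M, attributes, alpha, S_max, transinformation_of):
    """Recursively build a cost-sensitive decision tree (illustrative sketch).

    M:                   list of (features, class_label, cost) triples
    attributes:          attribute names available for branching
    transinformation_of: callable(attribute, intervals, M) -> cost-dependent transinformation
    """
    # (1)-(2): choose the attribute with the highest cost-dependent transinformation.
    best_x, best_T, best_ivals = None, float('-inf'), None
    for x in attributes:
        one_dim = [(f[x], k, c) for f, k, c in M]
        ivals = intervals(one_dim, alpha, S_max)
        T = transinformation_of(x, ivals, M)
        if T > best_T:
            best_x, best_T, best_ivals = x, T, ivals
    node = {'attribute': best_x, 'branches': {}}          # (3): decision node labeled with x
    # The attribute just used is excluded from the immediately following branching step.
    next_attrs = [a for a in attributes if a != best_x] or attributes
    for lo, hi in best_ivals:                             # (4): one branch per interval
        M_j = [(f, k, c) for f, k, c in M if lo < f[best_x] <= hi]
        label, k = decision([(f[best_x], kk, c) for f, kk, c in M_j], alpha, S_max)
        if label == 'dominates':                          # (b): leaf with the dominant class
            node['branches'][(lo, hi)] = k
        elif label == 'no_dominant_class':                # (c): recurse on the subset
            node['branches'][(lo, hi)] = cal5_oc(M_j, next_attrs, alpha, S_max,
                                                 transinformation_of)
        else:                                             # (d): leaf with the Bayes-optimal class
            counts, mean_costs = class_stats([(f[best_x], kk, c) for f, kk, c in M_j])
            node['branches'][(lo, hi)] = max(counts, key=lambda kk: counts[kk] * mean_costs[kk])
    return node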

5. Experiments

The theoretical considerations from the previous sections are demonstrated in the following by experiments using two artificial datasets and one example of application [18, 19].

5.1. Experiments without Costs

Figure 1 shows a training set (NSEP) consisting of two overlapping classes, each taken from a Gaussian distribution in the $x$-$y$ plane. The attribute $x$ is irrelevant; that is, the classification does not depend on it and can be performed with $y$ alone. The original CAL5 algorithm, run without costs, constructs two discriminating straight lines parallel to the $x$-axis. The region between them indicates that no statistically safe decision is possible there due to the nonseparability and equal probabilities of both classes; as attribute $x$ is irrelevant, it cannot deliver any improvement of the classification.

The dataset (MULTI) shown in Figure 2 consists of two classes containing two subclasses (clusters) each. All four clusters are derived from Gaussian distributions. CAL5 constructs two piecewise linear functions which discriminate between the classes. Subdivisions within the same class built by CAL5 are omitted here.

5.2. Experiments with Object-Dependent Costs

From the NSEP dataset in Section 5.1, a new dataset, NSEP_OC, is constructed using object-dependent costs. These are computed from two class-specific cost functions (Figure 3), both of which depend only on the irrelevant feature $x$.

Figure 4 shows the training dataset and class discrimination function computed by using the modified CAL5 (CAL5_OC), which includes the cost functions “learned” as mean costs from the object-dependent costs given in the NSEP_OC training set as described in Section 4. The value of the appropriate cost function for a training object is indicated by the size of the rectangles representing the training objects.

Now the discrimination function is piecewise linear: the decision region for class 1 is enlarged in the positive half-space of attribute $x$ and the region for class 0 in the negative half-space of $x$, respectively, that is, where the respective misclassification costs are high. This means that the attribute that was originally irrelevant, $x$, now becomes relevant for class discrimination due to the cost dependence; that is, the decision tree is no longer optimized with respect to the minimum classification error as before, but rather with respect to the minimum cost of misclassification. The costs defining the (local) decision thresholds are "learned" from the individual costs specified by the training objects. Note also that the area in the middle of Figure 1, in which no class decision is possible due to the overlapping of both classes with equal probabilities, is reduced in size in Figure 4.

For the sake of comparison, we ran both the original CAL5 (constructed without costs) and CAL5_OC (constructed with object-dependent costs as described in Section 4) on the NSEP_OC dataset, using optimized parameters in the second case. For the optimization, a systematic search and an evaluation with 10-fold cross-validation were conducted. As a result, we arrived at a mean cost of 0.165 in the first case and 0.150 in the second. This means that the mean cost declined by about 9% in the object-dependent, that is, cost-optimized case where CAL5_OC was used. The classification error increased from 16.45% to 16.85%. Note that the classification error must increase when the costs of misclassification differ, here even within each class.

For the MULTI dataset, object-dependent costs are likewise generated from class-specific cost functions. Figure 5 shows the resulting dataset (MULTI_OC) and the discrimination functions that define the decision regions for both classes.

Comparing this with Figure 2 (our cost-independent case), one can see the enlargement of the decision region containing the right cluster of class 0, which consists of objects with high costs for misclassification. This means that there is some overgeneralization of class 0 increasing the error, yet optimizing the mean risk (cost).

5.3. Application in a Real-Life Domain: German Credit Dataset

We conducted experiments with the German credit dataset from the STATLOG project [1]. This dataset comprises 700 examples of the "good customer" class (class +1) and 300 examples of the "bad customer" class (class −1). Each example is described by 24 attributes. The dataset comes with class-dependent costs estimated by experts, that is, a cost matrix. For the German credit dataset, human experts estimated a relative cost of 5 for not recognizing a bad customer (class −1) as compared to a cost of 1 for not recognizing a good customer (class +1) and a cost of 0 in the case of correct classification. The original version of CAL5 had already achieved good results for this dataset in the past and learned decision trees with an average of two nodes ([1, page 154]).

In order to evaluate CAL5_OC, we designed the following cost model, which features example-dependent costs as opposed to only class-dependent costs. If a good customer is incorrectly classified as a bad customer, we assumed a cost equal to the profit the bank loses on the rejected loan, computed from the attributes duration (the duration of the loan in months) and amount (the credit amount). Duration and amount are just two of the 24 attributes used for the sample description in the German credit dataset mentioned above. Here we assumed an effective yearly interest rate of 0.1, that is, 10% for every loan, because the actual interest rates are not given in the dataset itself.

If a bad customer is incorrectly classified as a good customer, we assumed that 75% of the entire credit amount will be lost (normally a customer will pay back at least part of the loan). When averaging the example-dependent costs for each class, we arrived at a ratio close to the one originally given by the experts, 1:5, which underpins the plausibility of our model. Note that when applying our approach to data from a real bank, we would not have to design a cost function based on the attributes; instead, the cost values would be naturally specified for the individual customers. In the case of the German credit dataset, however, we did not have access to these values. In the following, we regard the example-dependent costs defined by the cost model above as the real costs of the individual cases.

In our experiments we aimed at a comparison of the results obtained with example-dependent costs with the results obtained when only class-dependent costs are given. This means that learning with CAL5 was based on a cost matrix, while learning with CAL5_OC was based on the example-dependent costs. However, the evaluation was performed with regard to the example-dependent costs, since we viewed them as the real costs of the examples. For the class-dependent case, we did not use the given cost matrix estimated by the experts, but rather a new matrix estimated from the example-dependent costs. We constructed this new cost matrix by computing the average of the costs of class +1 and class −1 examples from the training set, which resulted in 6.27 and 29.51, respectively (the credit amounts were normalized to lie in a fixed interval). The ratio of the misclassification costs in this new cost matrix corresponds roughly with the ratio in the original matrix given by the experts.

We ran CAL5 with the new matrix and CAL5_OC with object-dependent costs on the German credit dataset, using the optimized parameters $\alpha$ and $S_{\max}$. As a result, we arrived at a mean cost of 3.34 in the first case and 2.97 in the second. This means that the mean cost declined by about 11% in the object-dependent, that is, cost-optimized case. The classification error increased from 38.2% to 41.5%, which is due to the fact that we optimized the misclassification costs instead of the error rate.

6. Comparison with Other Algorithms of Classification Learning

We compared the modified decision tree algorithm CAL5_OC with an extended perceptron algorithm (DIPOL, a piecewise linear classifier [1, 26, 31]) modified to include object-dependent costs. We also performed experiments with the support vector machine. In contrast to the experiments with an extended Matlab implementation of the 2-norm SVM described in [14], we used the SVMLight [32] here. This particular implementation of a 1-norm SVM allows the use of individual costs as example weights given as real values. Since these weights are only used for learning, not for evaluating classifiers, we added the required functionality and embedded the learning algorithms into a 10-fold cross-validation. For the experiments, an RBF kernel was utilized and we performed an extensive parameter optimization.

We also took a cost-proportional resampling method into consideration, as described in [13, 14]. The resampling method we developed consists of building a new, cost-free dataset based on the original dataset with costs (see also [33, 34]). This is achieved through a stochastic sampling method that includes an example in the new dataset with a probability proportional to the cost given for that example (as an example-dependent cost or through a cost matrix). This in turn means that the new dataset can be seen as being sampled from the cost-transformed probabilities described in Section 2 of this paper. This resampling method allows the application of any cost-insensitive learning method to the derived dataset, while still achieving cost minimization with respect to the original dataset. In the experiments described in [13], we used DIPOL together with the resampling method.
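
A minimal Python sketch of such cost-proportional rejection sampling (an illustration of the idea described above, not the implementation used in [13, 14]; the function name and signature are ours):

import random

def cost_proportional_resample(examples, size, seed=0):
    """Build a cost-free dataset by including examples with probability proportional to their cost.

    examples: list of (features, class_label, cost) triples
    size:     number of examples to draw for the new dataset
    """
    rng = random.Random(seed)
    max_cost = max(c for _, _, c in examples)
    resampled = []
    while len(resampled) < size:
        f, k, c = rng.choice(examples)
        if rng.random() < c / max_cost:      # accept with probability proportional to the cost
            resampled.append((f, k))         # costs are dropped; any cost-insensitive learner applies
    return resampled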

We applied all four approaches to the modified version of the German credit dataset described in Section 5.3. Table 1 shows the costs for the different classifiers estimated by 10-fold cross-validation. It should be noted that CAL5_OC performs better than DIPOL_OC and the resampling method, and slightly better than the SVMLight. Because it is well known that neural networks and kernel methods usually perform better than decision tree algorithms in terms of error, we do not want to draw any general conclusions from this result. However, decision trees allow symbolic rules to be derived and are therefore important in data mining applications. This cannot be achieved in an easy manner with either neural networks or kernel methods. Moreover, it has been shown in the STATLOG project that the performance of CAL5 is comparable to that of C4.5 and even somewhat better in terms of the average rank achieved [26].

The datasets NSEP_OC and MULTI_OC were mainly designed to demonstrate the qualitative behaviour of the cost-sensitive learning methods. However, for the cost-sensitive versions of DIPOL and the SVM, diagrams corresponding to the ones for CAL5_OC (Figures 4 and 5) can be found in [13, 14]. The specific shape of the class borders differs: CAL5_OC computes axis-parallel borders and DIPOL_OC finds separating hyperplanes in general position, while the shape of the borders computed by the SVM is determined by the kernel used. The RBF kernel used in the experiments results in a curvy class border and, in the case of NSEP_OC, possibly also in disconnected class regions. However, all three approaches locally move the borders towards the classes containing the examples that are less expensive to misclassify.

7. Conclusions

In this article we introduced a new, cost-dependent information measure and described how object-dependent costs can be used to learn decision trees (decoders) for cost-optimal instead of error-minimal decisions. This was demonstrated through the use of decision theory and by defining the cost-minimizing extension CAL5_OC of the CAL5 algorithm, which automatically converts real-valued attributes into discrete-valued ones by constructing intervals. The cost-dependent information measure was used for the selection of the (locally) best next attribute for tree building. It can be used in other algorithms for decision tree learning as well, and it is of general importance for information theory, for modeling in cognitive science, and for human-computer interaction, because control of behavior by error is replaced by control through the costs of false decisions. There are many practical applications of classification learning where minimizing the costs of decisions plays a role, such as medical diagnosis and financial applications.

Experiments with two artificial datasets and one example of application show the feasibility of our approach and that it is more adequate than a method using cost matrices given by experts whenever cost-labeled training objects are available. Since decision trees constructed with CAL5_OC also separate the classes in the feature space by axis-parallel hyperplanes, they can be used to attain symbolic representations of classes and rules depending on, and ordered by, their importance. In the future, it would be interesting to introduce misclassification costs into methods for constructing decision trees with hyperplanes in general position, using distances to the hyperplanes as a measure of confidence for class decisions [27, 28].

In contrast, for instance, to the cost-sensitive extension of DIPOL, CAL5_OC in its current form is not able to handle misclassification costs that depend not only on the original class of the example but also on the class into which it might be classified incorrectly. This means that each example in the training set must come with a whole vector of cost values corresponding to the different possible classes. We think that, in practice, these cost vectors per example might be difficult to obtain, whereas a single cost value per example (as it is used by CAL5_OC) could be given as the cost that occurred for the respective example in the past.

We also did not consider costs for measuring attributes (e.g., [27, 28]) although it might be possible to incorporate them in the presented framework. We leave this to future work.

Appendix

The Decision Rule for a Class in a Completed Interval Including the Possibility of Rejection of the Class Decision

In Section 4 we used a decision rule for a class $k$ in a completed interval $I$ (labeling a leaf):

Decide for class $k$ in interval $I$ if

(a) $c_k(I)\, p(k \mid I) \ge c_j(I)\, p(j \mid I)$ for all classes $j$ (Bayes' decision rule) and simultaneously
(b) $p(k \mid I) \ge S_k(I) = \dfrac{b(I) - c_r}{c_k(I)}$,

where $c_r$ is the cost of the rejection of a class decision, that is, in general the cost for further branching. Note that the value of $c_r$ is generally unknown prior to learning; it will be approximately, yet implicitly, optimized by choosing the value for $S_{\max}$ (see Section 4 and below), which leads to a minimization of the expected costs for misclassification.

For the proof of the necessity of also using the second rule, we start with the formulation of a condition for the rejection of a class decision. This is given by the following.

Decide for the rejection of a class decision if, for all classes $k = 1, \ldots, n$ occurring in $I$, it holds that

$$\sum_{j \ne k} c_j(I)\, p(j \mid I) > c_r.$$

The sum on the left-hand side of the inequation is the expected value of the costs of misclassification that would arise during tree application for all other classes if $k$ is ultimately decided on in the process of tree construction, that is, if $I$ is replaced by the label $k$. This expected value is estimated from the set of training objects occurring in $I$.

The inequation can be transformed to

$$b(I) - c_k(I)\, p(k \mid I) > c_r \quad \text{with} \quad b(I) = \sum_j c_j(I)\, p(j \mid I),$$

or its equivalent

$$p(k \mid I) < \frac{b(I) - c_r}{c_k(I)},$$

which must hold for all classes $k$ as a condition for the rejection of a class decision in $I$.

The value of $b(I)$ corresponds with the mean cost in interval $I$; that is, it corresponds with the value $b(x_j)$ defined in Section 2, where $x_j$ is one of the values for attribute $x$. In general, the value of the rejection cost $c_r$ is not known in advance. Considering the class with the minimum cost $c_{\min}(I)$ in $I$ as a special case, we therefore get for its threshold

$$S_{\max}(I) = \frac{b(I) - c_r}{c_{\min}(I)},$$

or its equivalent

$$S_k(I) = \frac{c_{\min}(I)}{c_k(I)}\, S_{\max}(I).$$

The value of $S_{\max}(I)$ is replaced by $S_{\max}$, that is, by a general parameter which is independent of $I$. The parameter value has to be optimized or specified by the user. $S_{\max}$ is a factor contained in all values $S_k(I)$, which can be enlarged or reduced without changing the relations between them. Therefore, $S_{\max}$ also controls the complexity of the resulting tree.

The rule for a decision for a single class $k$ in $I$, part (b) stated above, is the negation of the rejection rule:

Do not reject a class decision if there is at least one class $k$ with

$$p(k \mid I) \ge S_k(I) = \frac{b(I) - c_r}{c_k(I)}.$$

If condition (a) also holds for class $k$, that is, if the value of $c_k(I)\, p(k \mid I)$ is the maximum value, then a decision can be made for $k$ at this leaf.