Abstract

Despite the success of ILP systems in learning first-order rules from a small number of examples and complexly structured data in various domains, they struggle with multiclass problems. In most cases they reduce a multiclass problem to multiple black-box binary problems, following the one-versus-one or one-versus-rest binarisation techniques, and learn a theory for each one. When the learned theories are evaluated in the one-versus-rest paradigm in particular, the default rule introduces a bias toward the negative classes, which leads to unrealistically high performance estimates, and the theories lack prediction integrity among themselves. Here we discuss the problem of using the one-versus-rest binarisation technique to evaluate multiclass data and propose several methods to remedy it. We also illustrate the methods and highlight their link to binary trees and Formal Concept Analysis (FCA). Our methods allow learning of a simple, consistent, and reliable multiclass theory by combining the rules of the multiple one-versus-rest theories into one rule list or rule set theory. Empirical evaluation over a number of data sets shows that our proposed methods produce coherent and accurate rule models from the rules learned by the ILP system Aleph.

1. Introduction

Inductive logic programming (ILP) is a branch of machine learning that is concerned with learning logic programs inductively from examples in structural domains [1, 2]. ILP algorithms, such as FOIL [3], PROGOL [4], and Aleph [5], induce theories from a small number of examples, to be generalised over the entire population of examples, by learning first-order clauses (or rules) that mostly take the form of Horn clauses. A Horn clause is a disjunction of literals (atomic formulae) with at most one positive literal. Most ILP algorithms use a programming language called PROLOG (originally created by Alain Colmerauer and Robert Kowalski in 1972), which stands for PROgramming in LOGic. In PROLOG, theories (or programs) are expressed using a collection of Horn clauses [6]. In fact a theory, whether it is an input or an output, is understood and interpreted as a conjunction of Horn clauses. We often refer to a Horn clause as a rule. Unlike the propositional logic framework, the first-order logic (FOL) framework allows the use of variables and structural literals in addition to functional literals.

While first-order decision tree learners, such as TILDE [7], can learn from examples of multiple classes, first-order rule learners in ILP typically learn rules from two classes only (positive and negative examples). Despite their ability to learn from complex structured data and build effective classification models in a range of domains, they struggle with multiclass problems. In most situations they reduce a multiclass problem to multiple binary problems, following the pairwise one-versus-one or the one-versus-rest binarisation technique.

Aleph, as a case in point, can learn a multiclass theory in the one-versus-rest paradigm where the outcome of its induction can be seen as a combination of several black-box models. Each model induces rules for one specific (positive) class, and a default rule is added to predict the remaining classes.

As discussed earlier, many ILP rule learning systems, including Aleph, PROGOL, and FOIL, can only induce binary theories, and multiclass theories are obtained by converting a multiclass problem into several binary problems. The rules of the final model are, in practice, a combination of multiple independent binary theories. Inductive Constraint Logic (ICL) [8] upgraded the propositional CN2 [9] to handle multiclass first-order theories. While most ILP systems implement the covering (separate-and-conquer) approach, TILDE implements a divide-and-conquer approach and induces a single first-order logic multiclass theory that takes the form of a decision tree. Tree models handle multiple classes naturally.

Several papers suggested different approaches to dealing with multiple binary models [10–16]. A comparison of many such approaches was made in [15], which not only suggested a superiority of the one-versus-rest approach in general but also pointed out that the choice of the binarisation technique makes little difference once we learn good binary models.

Nevertheless, we challenge the suitability of the one-versus-rest approach for first-order rule learners. This is because there is a strong bias towards the negative classes leading to unrealistic estimates of predictive power. Moreover, the lack of integrity between the different binary models results in inconsistent predictions.

The remainder of the article is organised as follows. Section 2 investigates the reliability and consistency of one-versus-rest binary models and illustrates the difference with a proper multiclass model. Reliability reflects how much one can rely on the quality of a one-versus-rest binary model, while consistency reflects how consistent the predictions of multiple one-versus-rest binary models are. In Section 3 we investigate several methods to overcome the problems of the current application of the one-versus-rest technique in ILP rule learners. Additionally, we study and illustrate a simple method of representing the rules in a concept lattice in Section 4. We experimentally demonstrate the performance of our suggested methods in Section 5 and compare them to the standard binary method of Aleph. In the final section we summarise the work and discussion presented in this paper and draw conclusions.

2. Multiclass versus Multimodel Predictions

In machine learning accuracy is widely used for comparing and assessing the classification performance. Hence many researchers report their results in terms of accuracy and compare their results with accuracies of other algorithms. The accuracy of a model can be interpreted as the expectation of correctly classifying a randomly selected example.

With respect to the notation explained in Figure 1, let us introduce the following definitions.

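Definition 1 (recall). The recall of a given class $c_i$, denoted as $\mathrm{Rec}_i$ or $\mathrm{Rec}_i^{+}$, is the proportion of examples of class $c_i$ that is correctly classified by a model ($\mathrm{Rec}_i = TP_i / E_i$, where $TP_i$ is the number of correctly classified examples of class $c_i$ and $E_i$ is the number of examples of that class). The negative recall of class $c_i$, denoted as $\mathrm{Rec}_i^{-}$, is the proportion of examples of class $c_i$ incorrectly classified ($\mathrm{Rec}_i^{-} = 1 - \mathrm{Rec}_i^{+}$). In case of two classes, positive and negative, we denote the recall of the positive class as $\mathrm{Rec}^{+}$ and of the negative class as $\mathrm{Rec}^{-}$.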

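Definition 2 (accuracy). Given two classes, $c_{+}$ and $c_{-}$, with $Pos$ and $Neg$ examples, respectively, out of $E = Pos + Neg$ examples in total, the binary accuracy of a model is defined as
\[ \mathrm{Acc}_{\mathrm{bin}} = \frac{Pos}{E}\,\mathrm{Rec}^{+} + \frac{Neg}{E}\,\mathrm{Rec}^{-}; \]
that is, binary accuracy is a weighted average of the positive and negative recall, weighted by the class prior. This extends to multiple classes, where $E_i$ is the number of examples of class $c_i$ and $\mathrm{Rec}_i$ its recall:
\[ \mathrm{Acc} = \sum_{i=1}^{n} \frac{E_i}{E}\,\mathrm{Rec}_i. \]
For this reason we sometimes refer to accuracy as (weighted) average positive recall.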
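As an illustration of Definitions 1 and 2, the following minimal sketch computes per-class recall and accuracy from a multiclass confusion matrix. The matrix layout (rows are true classes, columns are predicted classes), the function names, and the example counts are assumptions made for this sketch only.

```python
from typing import Sequence


def recalls(confusion: Sequence[Sequence[int]]) -> list[float]:
    """Positive recall per class: diagonal count divided by the row total."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]


def accuracy(confusion: Sequence[Sequence[int]]) -> float:
    """Accuracy as the class-prior-weighted average of positive recalls."""
    total = sum(sum(row) for row in confusion)
    priors = [sum(row) / total for row in confusion]
    return sum(p * r for p, r in zip(priors, recalls(confusion)))


# Hypothetical three-class confusion matrix with 5 examples per class.
example_confusion = [
    [4, 1, 0],
    [1, 3, 1],
    [0, 2, 3],
]
print(recalls(example_confusion))
print(accuracy(example_confusion))
```

With equal class sizes the weighted average reduces to the plain mean of the recalls, which is also the fraction of examples on the diagonal.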

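Definition 3 (multimodel accuracy). Given $n$ classes and $n$ one-versus-rest models, one for each class, the multimodel accuracy is defined as the average binary accuracy of the $n$ models:
\[ \mathrm{Acc}_{n} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Acc}_{i}, \]
where $\mathrm{Acc}_i$ is the binary accuracy of the $i$th one-versus-rest model.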
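To make Definition 3 concrete, the sketch below derives the one-versus-rest contingency counts from a single multiclass confusion matrix and averages the resulting binary accuracies. It assumes rows are true classes and columns are predicted classes; the function names and the example matrix are hypothetical.

```python
def one_vs_rest_accuracy(confusion, i):
    """Binary accuracy of the model that treats class i as positive."""
    total = sum(sum(row) for row in confusion)
    tp = confusion[i][i]
    fp = sum(row[i] for row in confusion) - tp   # other classes predicted as i
    fn = sum(confusion[i]) - tp                  # class i predicted as another class
    tn = total - tp - fp - fn
    return (tp + tn) / total


def multimodel_accuracy(confusion):
    """Average binary accuracy over the n derived one-versus-rest models."""
    n = len(confusion)
    return sum(one_vs_rest_accuracy(confusion, i) for i in range(n)) / n


example_confusion = [
    [4, 1, 0],
    [1, 3, 1],
    [0, 2, 3],
]
print(multimodel_accuracy(example_confusion))
```

On this matrix the multiclass accuracy is 10/15, whereas the multimodel accuracy is 7/9, which shows how true negatives inflate the averaged figure.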

The following result is worth noting.

Lemma 4. The accuracy of a single multiclass model is not equivalent to the multimodel accuracy of the one-versus-rest models derived from the multiclass model.

Proof. One has
In going from (5) to (6) in the above equations, we rely on the fact that the one-versus-rest models are derived from a single multiclass model. If this is not the case (as with the independently learned binary theories of Aleph, for instance), then weighted average positive recall is not the same as accuracy, which compounds the issue.
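A short derivation, consistent with Definitions 2 and 3 and offered as a sketch rather than as the original proof steps, makes the gap between the two accuracies explicit. It assumes that the one-versus-rest models are read off a single multiclass model that assigns exactly one class to every example. The symbols $TP_i$, $TN_i$, $FP_i$, and $FN_i$ (the contingency counts of the $i$th derived model) and $\mathrm{Acc} = \sum_i TP_i / E$ are notation introduced for this sketch.

```latex
\begin{align*}
\mathrm{Acc}_n
  &= \frac{1}{n}\sum_{i=1}^{n}\frac{TP_i + TN_i}{E}
   = 1 - \frac{1}{nE}\sum_{i=1}^{n}\left(FP_i + FN_i\right) \\
  &= 1 - \frac{2}{nE}\Bigl(E - \sum_{i=1}^{n} TP_i\Bigr)
   = \frac{2}{n}\,\mathrm{Acc} + \frac{n-2}{n}.
\end{align*}
```

The second line uses the fact that every misclassified example is counted exactly twice: once as a false negative in the model of its true class and once as a false positive in the model of its predicted class. Under these assumptions the two accuracies coincide for $n = 2$, and for $n > 2$ the multimodel accuracy exceeds the multiclass accuracy whenever the latter is below 1.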

It can be seen from Lemma 4 that the two accuracies are not the same. The accuracy of a multiclass model relies on the positive recalls weighted by the class priors, while the average accuracy of multiple binary models relies on the recalls of both classes, with the importance of the positive recalls reduced by a factor that depends on the number of classes $n$. Hence, the importance of classifying a negative example increases accordingly. It is clear that the average accuracy of the binary models is 1.5 times more than the accuracy of the multiclass model because the weight of the negative class is twice the weight of the positive class. With a proper multiclass model, credit is given only for classifying examples correctly. Averaging the positive and negative recalls for multiple one-versus-one theories could be misleading, but it is even more harmful when it comes to one-versus-rest theories, as the problem is propagated.

Another problem arising when inducing multiple independent binary theories is the lack of integrity between the predictions of the different binary theories. This may cause an example to have different possible predictions in several contingency tables because each model produces predictions independently of the others. The predictions of the models on each example should be consistent. For instance, considering $n$ one-versus-rest models where each model is trained to predict one class as positive, the prediction for an example $e$ on the $i$th model should be consistent with its prediction on the $j$th model: if $p_i(e)$ is positive then $p_j(e)$ is negative, where $p_i(e)$ and $p_j(e)$ express the prediction of the $i$th and the $j$th binary model, respectively, for example $e$, with $i \neq j$.
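The consistency requirement can be checked mechanically. The sketch below assumes each example comes with a list of booleans, one per one-versus-rest model, where `True` means the model predicts its own positive class. Treating an example that no model claims as a separate case is a reading adopted for this sketch, and all names are hypothetical.

```python
def find_inconsistencies(predictions_per_example):
    """Split example indices into conflicts (several models claim the example)
    and unclaimed cases (every model falls back to its negative default)."""
    conflicts, unclaimed = [], []
    for index, predictions in enumerate(predictions_per_example):
        positives = sum(predictions)
        if positives > 1:
            conflicts.append(index)
        elif positives == 0:
            unclaimed.append(index)
    return conflicts, unclaimed


# Three examples scored by three one-versus-rest models.
example_predictions = [
    [True, False, False],   # consistent: only model 0 claims the example
    [True, True, False],    # conflict: models 0 and 1 both claim it
    [False, False, False],  # unclaimed: covered by default rules only
]
print(find_inconsistencies(example_predictions))
```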

If the predictions are inconsistent, then such conflicts need to be resolved to ensure consistent predictions for each example across all models. Some classification methods use all one-versus-rest models but resolve these collisions by obtaining $n$ scores, one from each of the $n$ models, and letting the model with the maximum score win the prediction [15, 16]. A rule learner such as CN2 learns ordered rule lists in one of its settings to avoid such conflicts. In pairwise techniques, voting methods [10, 11, 13, 14] can be considered to integrate the predictions.

The discussion about unreliability and inconsistency holds generally when employing the one-versus-rest technique in any learning system, but we emphasise the importance of this issue particularly in ILP binary rule learning systems such as Aleph. This is due to the fact that we only induce rules for the positive class in each one-versus-rest model, while a default rule that always predicts the negative class is added in case an example cannot be classified by any induced rule. The default rule gets credit for not classifying negative examples, which makes it easy to obtain high negative recalls without inducing any rules at all, simply by predicting the negative class as the majority class. This is what happens with an empty theory, that is, a theory for which the binary rule learner fails to induce any rule for the positive examples. Hence, there is a need to integrate the different binary models of such rule learning systems in order to ensure high reliability and consistency of their predictions.
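The effect of the default rule can be illustrated with a small calculation. The sketch below assumes that every one-versus-rest theory is empty, so each model predicts the negative class for all examples, and compares the resulting multimodel accuracy with the accuracy of a single multiclass model that consists of a majority-class default rule only. The class sizes and function names are hypothetical.

```python
def empty_theory_binary_accuracy(class_sizes, positive):
    """Binary accuracy of an empty theory: all negatives are 'correct'."""
    total = sum(class_sizes)
    return (total - class_sizes[positive]) / total


def multimodel_accuracy_of_empty_theories(class_sizes):
    """Average binary accuracy when no model induces any rule."""
    n = len(class_sizes)
    return sum(empty_theory_binary_accuracy(class_sizes, i) for i in range(n)) / n


def default_only_multiclass_accuracy(class_sizes):
    """Accuracy of a single multiclass model that always predicts the majority class."""
    return max(class_sizes) / sum(class_sizes)


balanced = [5, 5, 5]
print(multimodel_accuracy_of_empty_theories(balanced))  # credit for negatives only
print(default_only_multiclass_accuracy(balanced))
```

For three balanced classes the empty theories reach a multimodel accuracy of 2/3 without a single rule, while the equivalent single multiclass model scores 1/3.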

3. Improved Learning of Multiclass Theories

In this section we investigate how one could improve the reliability of all the one-versus-rest theories in ILP by combining their binary models into a single rule list model (Multiclass Rule List) or rule set model (Multiclass Rule Set Intersection and Multiclass Rule Set Union). Our approach differs from other first-order rule learning approaches in several respects. First, it does not treat the $n$ models as independent black-box models but instead combines the rules of all the models into a single model. Secondly, there is only one default rule, and the class of the default rule is determined probabilistically according to the distribution of the uncovered training examples of all the classes. Finally, a single prediction is obtained for each example in one multiclass contingency table.

3.1. Multiclass Rule List Theories

In any rule list model, the rules are ordered in the final theory according to a certain criterion. When an unseen example is encountered, the rules are tried one by one in the order of the list, and the first rule that fires determines the class of the example. So the key idea is to have a sensible criterion to determine the order of the rules in the list. This can be achieved simply by evaluating the rules and assigning them numerical scores that reflect their significance. If we have rules induced by one-versus-rest models for each one of the $n$ classes, we need a multiclass scoring function to achieve this goal. Several multiclass evaluation measures have been proposed earlier in [17]. They can be used to evaluate all rules over the multiple classes. We can then prioritise the rules obtained from the $n$ one-versus-rest models based on their multiclass scores to build a Multiclass Rule List (MRL) model. This is similar to prioritising the subgroup rules before building a subgroup tree in [18]. We adopted Chi-Squared ($\chi^2$) from the work of [17] in our experiments.

MRL. In this method, after learning rules for all classes, the rules are reordered on decreasing $\chi^2$. Ties are broken randomly. Whenever a rule is added to the rule list, all examples it covers are removed from the training set and the rest of the rules are reevaluated based on the remaining examples, until no further rule is left. At the end, a single default rule is assigned, predicting the majority class from the distribution of the uncovered examples.
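The MRL procedure can be sketched as follows. This is a minimal illustration under several assumptions: each rule is a hypothetical tuple of name, predicted class, and the set of training example identifiers it covers; the score is a plain Pearson chi-squared over the covered/uncovered by class table, which may differ in detail from the multiclass measure of [17]; ties are resolved by list order rather than randomly so that the output is reproducible; and the fallback to the overall majority class when no example remains uncovered is a choice made here.

```python
from collections import Counter


def chi_squared(covered, labels):
    """Pearson chi-squared of the 2 x n table (covered / uncovered by class)."""
    total = len(labels)
    class_totals = Counter(labels.values())
    covered_counts = Counter(labels[e] for e in covered)
    n_cov = len(covered)
    score = 0.0
    for cls, col in class_totals.items():
        cells = ((n_cov, covered_counts[cls]),
                 (total - n_cov, col - covered_counts[cls]))
        for row_total, observed in cells:
            expected = row_total * col / total
            if expected > 0:
                score += (observed - expected) ** 2 / expected
    return score


def build_mrl(rules, labels):
    """Order rules greedily by chi-squared, re-scoring on the uncovered examples."""
    remaining = dict(labels)
    pending = list(rules)
    ordered = []
    while pending:
        best = max(pending,
                   key=lambda r: chi_squared(r[2] & remaining.keys(), remaining))
        ordered.append(best)
        pending.remove(best)
        for example in best[2]:
            remaining.pop(example, None)
    pool = remaining or labels  # assumption: overall majority if nothing is left
    default = Counter(pool.values()).most_common(1)[0][0]
    return ordered, default


def predict_mrl(ordered, default, fired):
    """Return the class of the first rule in the list that fires."""
    for name, cls, _ in ordered:
        if name in fired:
            return cls
    return default


labels = {"a1": "a", "a2": "a", "a3": "a", "b1": "b", "b2": "b", "b3": "b",
          "c1": "c", "c2": "c", "c3": "c"}
rules = [("r1", "a", {"a1", "a2", "a3", "b1"}),
         ("r2", "b", {"b1", "b2", "b3"}),
         ("r3", "c", {"c1", "c2", "a3"})]
ordered, default = build_mrl(rules, labels)
print([name for name, _, _ in ordered], default)
```

Here `fired` is the set of rule names that cover the new example; how coverage is tested for a first-order rule is left to the underlying ILP system.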

3.2. Multiclass Rule Set Theories

In a rule set model, the rules are unordered and the class of a new example is determined based on the training statistics of all rules that fire for that particular example. For instance, the CN2 propositional rule learner learns a rule set model, in one of its two settings, and tags the rules with their coverage distribution on all the classes. If a new example is to be classified, CN2 sums up the coverage of all rules that fire over each class, and the class with the highest coverage wins. This approach has been adopted by the ICL first-order rule learner [8]. We propose two methods to handle multiclass rule set theories, the Multiclass Rule Set Intersection (MRSI) method and the Multiclass Rule Set Union (MRSU) method. The two methods are described below. Later we will compare our approaches to our upgraded version of Aleph that handles probabilities similarly to CN2 and ICL.

MRSI. In MRSI every rule from the multiple one-versus-rest models is evaluated over the entire training set once, and the identifiers of the examples it covers are stored. A default rule is formed based on the majority class of the uncovered training examples. If a new example is to be classified, all the rules are tried. For those rules that fire, we determine the intersection of their training set coverage using the example identifiers such that the examples in the set are not covered by rules that do not fire. The class distribution of this set gives us the empirical (training) probability on each class. The probability of a test example $e$ belonging to class $c_i$ with respect to the MRSI method can be formalised as follows:
\[ P_{\mathrm{MRSI}}(c_i \mid e) = \frac{\lvert S(e) \cap E_i \rvert}{\lvert S(e) \rvert}, \qquad S(e) = \bigcap_{j:\, f_j(e)} \mathrm{cov}(r_j) \setminus \bigcup_{k:\, \neg f_k(e)} \mathrm{cov}(r_k), \]
where $E_i$ is the set of training examples of class $c_i$, $f_j(e)$ is a boolean function that is activated if the $j$th rule $r_j$ fires for the test example $e$, and $\mathrm{cov}(r_j)$ is a function that returns the subset of training examples covered by the $j$th rule. The class with the maximum probability is predicted for the example. Again, ties are broken randomly. In the case of an empty intersection, the majority class is assigned to the example.
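The MRSI estimate can be sketched with plain set operations on the stored example identifiers. The sketch assumes a hypothetical `coverage` mapping from rule names to sets of training identifiers and a `fired` set naming the rules that cover the new example. Returning `None` signals that the caller should fall back to the majority class, which is how the empty-intersection and no-rule-fires cases are handled here.

```python
from collections import Counter


def mrsi_distribution(coverage, labels, fired):
    """Class distribution of training examples covered by exactly the fired rules."""
    if not fired:
        return None
    selected = set.intersection(*(coverage[r] for r in fired))
    for rule, covered in coverage.items():
        if rule not in fired:
            selected -= covered  # drop examples covered by rules that do not fire
    if not selected:
        return None
    counts = Counter(labels[e] for e in selected)
    return {cls: k / len(selected) for cls, k in counts.items()}


labels = {"a1": "a", "a2": "a", "a3": "a", "b1": "b", "b2": "b", "b3": "b",
          "c1": "c", "c2": "c", "c3": "c"}
coverage = {"r1": {"a1", "a2", "a3", "b1", "c1"},
            "r2": {"b1", "b2"},
            "r3": {"c1", "c2", "a3", "b1"}}
print(mrsi_distribution(coverage, labels, {"r1", "r3"}))
print(mrsi_distribution(coverage, labels, {"r2", "r3"}))
```

In the first call `b1` is shared by `r1` and `r3` but is also covered by the non-firing `r2`, so it is excluded and the estimate is based on `a3` and `c1` only.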

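MRSU. The MRSU method differs from the MRSI method in that it determines the class of a new example based on the union of the training coverage of all rules that cover the new example, instead of the intersection. The probability of a test example $e$ belonging to class $c_i$ with respect to the MRSU method can be formalised as follows:
\[ P_{\mathrm{MRSU}}(c_i \mid e) = \frac{\lvert U(e) \cap E_i \rvert}{\lvert U(e) \rvert}, \qquad U(e) = \bigcup_{j:\, f_j(e)} \mathrm{cov}(r_j), \]
where $E_i$ is the set of training examples of class $c_i$, $f_j(e)$ is true if the $j$th rule $r_j$ fires for $e$, and $\mathrm{cov}(r_j)$ is the subset of training examples covered by the $j$th rule.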

The MRSU method is closer in spirit to the CN2 method, which adds up the coverage of all rules that fire. However, by using example identifiers we avoid double counting of examples that are covered by several rules, which means that we obtain proper empirical probabilities rather than CN2's estimates.
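The difference between taking the union of example identifiers and adding up per-rule coverage counts can be seen in a small sketch. The data layout (a hypothetical `coverage` mapping from rule names to sets of training identifiers) and the function names are assumptions; `summed_coverage_distribution` only imitates the counting scheme described above and is not CN2 itself.

```python
from collections import Counter


def mrsu_distribution(coverage, labels, fired):
    """Class distribution over the union of the fired rules' coverage."""
    selected = set().union(*(coverage[r] for r in fired))
    if not selected:
        return None
    counts = Counter(labels[e] for e in selected)
    return {cls: k / len(selected) for cls, k in counts.items()}


def summed_coverage_distribution(coverage, labels, fired):
    """Add per-rule class counts, so shared examples are counted repeatedly."""
    counts = Counter()
    for rule in fired:
        counts.update(labels[e] for e in coverage[rule])
    total = sum(counts.values())
    if total == 0:
        return None
    return {cls: k / total for cls, k in counts.items()}


labels = {"a1": "a", "a2": "a", "a3": "a", "b1": "b", "b2": "b", "b3": "b",
          "c1": "c", "c2": "c", "c3": "c"}
coverage = {"r1": {"a1", "a2", "a3", "b1", "c1"},
            "r3": {"c1", "c2", "a3", "b1"}}
print(mrsu_distribution(coverage, labels, {"r1", "r3"}))
print(summed_coverage_distribution(coverage, labels, {"r1", "r3"}))
```

The two rules share three examples, so the summed counts are based on nine covered instances while the union contains only six distinct examples, and the resulting class estimates differ.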

To illustrate those two methods let us consider Example 5. If a new testing example is found to be covered by several of the rules, then MRSI and MRSU each yield a probability distribution over the three classes, computed from the intersection and the union, respectively, of the training coverage of the rules that fire. With regard to the MRL method, the class of the highest-scoring rule that fires is predicted, because that rule precedes the other firing rules in the list. Alternatively, we can predict the class probabilistically for the MRL method based on the coverage distribution of the first rule that fires, but this is always going to be the majority class originally predicted by the rule.

Example 5. A simple example illustrating the MRL, MRSI, and MRSU methods (this example is borrowed from Abudawood's Ph.D. thesis [19]). We consider a hypothetical three-class problem with 5 examples per class and a model of three rules induced on them. The predicted class and the coverage information as well as the evaluation scores of the rules are shown in Table 1. Figure 2 illustrates their coverage and their overlaps.

In fact all three methods can be illustrated by drawing a rule list or a rule tree. A rule list corresponds to the MRL method, which is very similar to the conventional decision list (ordered set of rules) model, while a rule tree can be seen as an unordered rule set model and hence is suitable to demonstrate our proposed rule set-based methods, MRSI and MRSU. Figure 3 illustrates the use of the MRL method in building a predictive model for Example 5. Figures 4 and 5 illustrate the use of the MRSI and MRSU methods, respectively, to create predictive models on the same example.

It is worth mentioning that a rule list can be seen as a special type of rule tree in which branching is restricted to one side only. The construction of a rule tree involves placing a single rule at each level. In MRL and MRSI we start building the rule list or rule tree with all training examples at the root node, and adding a new rule causes the examples at each node to be split into two new nodes, reflecting the subsets of the parent's examples that are covered and uncovered by the new rule. In MRSU, however, we start with the empty set of examples at the root and, instead of splitting, we merge the examples covered by multiple rules such that a leaf contains all examples covered by a chain of rules.

4. Multiclass Theories and Formal Concept Analysis

In this section we introduce the notion of formal concept lattice in rule learning context and use it to visualise and explore rules, examples, and their binary coverage relationship. We also draw the link between binary trees and formal concept lattice at the end of this section.

4.1. Formal Concept Analysis in Rule Learning Context

Formal Concept Analysis (FCA) is gaining increasing attention in the field of artificial intelligence, and several authors [20–22] have employed it in machine learning. It is based on order theory in mathematics, where hypotheses (concepts) and their relationships can be represented in a lattice, called a concept lattice. FCA can structure a lattice in a simple way, showing how a set of rules are related to each other based on their coverage. It has been used for structuring, exploring, and analysing complex knowledge. A thorough investigation and discussion of FCA is beyond the scope of this work, and the reader is referred to the survey of [23] for more details on FCA and its applications.

FCA only allows boolean features, and we will take advantage of this powerful technique in the rule learning context by regarding classification rules as binary features. The overall idea is that once we have a set of rules obtained by learning the multiple one-versus-rest models in ILP, we can represent them by a concept lattice with the help of Formal Concept Analysis.

The aim is to explore the rules and their partial relations with respect to their coverage over the examples in a simple and compact graph. Such a graph may not be useful for making predictions in a straightforward manner, but it could give insight into how the combination of multimodel rules may perform before we even use the multiclass methods discussed above. Let us introduce our basic definitions of FCA in the rule learning context. We would like to draw the reader's attention to the fact that the following definitions are adaptations of the classical FCA definitions found in the literature, where the attributes are simply replaced with rules.

Definition 6 (formal context). Let $R$ be a set of first-order rules (rules can be induced using a first-order or a propositional rule learner), $E$ a set of examples, and $I \subseteq E \times R$ a relation such that $(e, r) \in I$ if and only if example $e$ is covered by rule $r$. A formal context is then the triple $(E, R, I)$.

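Definition 7 (formal concept). Let $A' = \{ r \in R \mid (e, r) \in I \text{ for all } e \in A \}$ and $B' = \{ e \in E \mid (e, r) \in I \text{ for all } r \in B \}$; then a formal concept is defined to be the pair $(A, B)$ satisfying the following four conditions: $A \subseteq E$, $B \subseteq R$, $A' = B$, and $B' = A$. $A$ is called the extent of the formal concept $(A, B)$ and $B$ is called the intent of the formal concept.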

Definition 8 (concept lattice). The concepts are ordered according to the inclusion of their extents (equivalently, the reverse inclusion of their intents) in order to form the complete concept lattice of the formal context. At the bottom of the lattice we can see the concepts with the most general intents and thus the largest extents. At the top of the lattice we can see the concepts with the most specific intents and thus the smallest extents.
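The definitions above can be made concrete with a brute-force sketch that enumerates all formal concepts of a small context. It assumes the context is given as a hypothetical mapping from each example to the set of rules that cover it, and it closes every subset of rules, which is exponential in the number of rules and therefore only suitable for the small rule sets considered here.

```python
from itertools import combinations


def common_rules(examples, context, rules):
    """Rules that cover every example in the given set."""
    return frozenset(r for r in rules if all(r in context[e] for e in examples))


def common_examples(rule_set, context):
    """Examples that are covered by every rule in the given set."""
    return frozenset(e for e, covering in context.items() if rule_set <= covering)


def formal_concepts(context):
    """Enumerate (extent, intent) pairs by closing every subset of rules."""
    rules = set().union(*context.values())
    concepts = set()
    for size in range(len(rules) + 1):
        for subset in combinations(sorted(rules), size):
            extent = common_examples(frozenset(subset), context)
            intent = common_rules(extent, context, rules)
            concepts.add((extent, intent))
    return concepts


# Hypothetical context: four examples and two rules.
context = {"e1": {"r1"}, "e2": {"r1", "r2"}, "e3": {"r2"}, "e4": set()}
for extent, intent in sorted(formal_concepts(context), key=lambda c: len(c[1])):
    print(sorted(extent), sorted(intent))
```

Each printed pair satisfies both closure conditions: the intent lists exactly the rules shared by all examples in the extent, and the extent lists exactly the examples covered by all rules in the intent.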

There is a strong relationship between FCA and closed item-set mining, which aims at finding a set of nonredundant hypotheses, as investigated in the work of [24]. This is because a formal concept in FCA can be seen as a closed item-set in their formalism. The work of [25] also confirmed this relationship and explained that a concept in FCA expresses a maximal set of examples that shares all elements of a maximal set of features (rules) and vice versa. We will get back to the maximality property when discussing the relationship between FCA and trees at the end of the section.

4.2. Representing Rules with Multiple Concept Lattices

Assuming that we have a fixed set of first-order rules (or propositional rules) in the intents, we could extend the conventional FCA to a two-class problem by simply splitting the extent set $E$ into $E^{+}$, the set of examples belonging to the first class, and $E^{-}$, the set of examples belonging to the second class. However, if each class is known to have a separate intent as well as a separate extent, the problem can be reformulated as follows.

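Consider a set of first-order rules $R^{+}$ and examples $E^{+}$ of the positive class. Consider a set of first-order rules $R^{-}$ and examples $E^{-}$ of the negative class and a relation $I \subseteq E \times R$, where $E = E^{+} \cup E^{-}$ and $R = R^{+} \cup R^{-}$. We say that $(e, r) \in I$ if and only if example $e$ is covered by rule $r$, where $e \in E$ and $r \in R$. A formal context is then the triple $(E, R, I)$. The formal concept and concept lattice are defined as in Section 4.1.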

The above can be seen as merging two formal contexts, $(E^{+}, R^{+}, I^{+})$ and $(E^{-}, R^{-}, I^{-})$, to form the single formal context $(E, R, I)$. Notice that we could find a case where an example $e \in E^{+}$ is covered by a rule $r \in R^{-}$, or vice versa. This may indicate noise in the extents or underfitting (overgenerality) in the intents. Another form of noise might occur if the same example $e$ is also covered by a rule induced for its own class, which suggests a conflict in the original theories induced for the positive and negative classes.

So far we have explained how FCA can be used in a two-class scenario, but it is not hard to see how such a formalism can be generalised to a multiclass scenario (having multiple theories for multiple classes) by introducing further formal contexts.

4.3. Representing Rules with a Single Concept Lattice

Having multiple formal contexts and their corresponding concept lattices may seem appropriate, especially to visualise multiple one-versus-rest theories in ILP, but apart from the high complexity of building multiple lattices, we are more interested in combining the models, visualising them, and studying their collective performance. Therefore, a better solution is to build a single concept lattice by taking into account all rules and examples, as if the rules were generated from a single model learned over a multiclass problem. Consequently, the intent $R$ corresponds to the set of rules induced for all the classes and the extent $E$ corresponds to the set of all examples belonging to all the classes. Different colours can be used to distinguish the rules predicting different classes and, similarly, to distinguish examples belonging to various classes.

Figure 6 illustrates drawing a single formal concept lattice for Example 5 discussed above.

4.4. FCA and Rule Trees

Reference [26], amongst others, investigated inducing decision trees as selected paths from large concept lattices in a propositional domain. They regard the concept lattice as a collection of overlapping trees, and the task is to search for the most accurate one in a classification context. In our case we employ FCA as a postlearning phase and we have a limited number of rules to represent. As a matter of fact, the formal concepts correspond to the leaves of a complete binary tree or rule tree, as is the case in the MRSI method; this is stated in Theorem 9 and illustrated by Figures 6 and 4. This is because the maximality property is maintained in the leaves of an MRSI tree. Nevertheless, this is not the case when it comes to MRL or MRSU because the maximality property is broken.

Theorem 9. MRSI’s leaves are equivalent to formal concepts.

Proof. Let $A \subseteq E$, $B \subseteq R$, and $e \in A$; then $B = \mathrm{rules}(e)$ (the selected set of rules on the nodes of an MRSI tree) and $A = \bigcap_{r \in B} \mathrm{cov}(r)$ (the intersection set of examples covered by the selected set of rules, which represents an MRSI leaf) satisfy the requirements of a formal concept, where $\mathrm{rules}(e)$ is a function that returns all rules covering an example $e$ and $A$ is the set of examples found in a leaf of the tree described by the set of all rules in $B$ that apply to all examples in $A$.

Since we have established the link between MRSI's rule trees and concept lattices, it is possible to turn the concept lattice into a probabilistic classifier, similarly to the MRSI method, by associating each internal node in the lattice with a probability distribution instead of the actual coverage, yielding a probabilistic concept lattice as can be seen in Figure 7. The figure is useful in showing the probability distribution when one or multiple rules fire for a given example. Of course, having the complete description of the rules can be more useful, but this is just an illustrative example and we have limited space for drawing the concept lattice. Therefore, both MRSI and the probabilistic concept lattice can be used in the same way.

At this stage it is not obvious how we could take advantage of the formal concept lattice with the MRL and MRSU methods, but the good news is that MRSI experimentally outperforms the other two methods on multiclass domains in terms of predictive accuracy and AUC, as will be shown in the next section. (AUC is an abbreviation for the area under the ROC curve and is used as a measure of predictive performance; for more details about AUC and ROC please see [27].)

5. Empirical Evaluation

In this section we evaluate and compare our proposed single multiclass theory learning methods (MRL, MRSI, and MRSU) over 6 multiclass data sets and 5 binary data sets (Table 2). We use Aleph as our base learner, learning rules for each class in turn. We then turn the rules learned by Aleph into coherent multiclass models using the techniques proposed in Section 3. We compare the performance of our methods with that of the CN2 rule set method described above.

For each data set, cross-validated accuracies (Table 3) and AUCs (Table 4) were recorded. The MRL method does not produce class probabilities and hence produces a single point in a ROC plot; in this case, AUC boils down to the (unweighted) average of the true positive and true negative rates. MRSI, MRSU, and CN2 produce class probabilities, and hence AUC evaluates their ranking performance in the usual way. A multiclass AUC is obtained by averaging each one-versus-rest AUC weighted by the class prior.
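The three AUC notions used here can be sketched as follows. The code is an illustration under stated assumptions: the single-point AUC is the average of the true positive and true negative rates, the ranking AUC is computed as the fraction of correctly ordered positive/negative pairs (ties count one half), and the multiclass AUC weights each one-versus-rest AUC by the class prior. All function names and inputs are hypothetical, and at least two classes must be present in `labels`.

```python
def single_point_auc(tp, fn, tn, fp):
    """AUC of a crisp classifier: mean of true positive and true negative rate."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def ranking_auc(scores, is_positive):
    """Fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, flag in zip(scores, is_positive) if flag]
    neg = [s for s, flag in zip(scores, is_positive) if not flag]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def multiclass_auc(probabilities, labels):
    """Class-prior-weighted average of the one-versus-rest ranking AUCs."""
    total = len(labels)
    result = 0.0
    for cls in set(labels):
        flags = [y == cls for y in labels]
        scores = [p.get(cls, 0.0) for p in probabilities]
        result += (flags.count(True) / total) * ranking_auc(scores, flags)
    return result


print(single_point_auc(tp=8, fn=2, tn=6, fp=4))
print(ranking_auc([0.9, 0.8, 0.3, 0.1], [True, False, True, False]))
```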

We report the ranks (1 is best, 4 is worst) of the accuracies and AUCs on each data set. We use the Friedman significance test on these ranks with the Bonferroni-Dunn post hoc test on our three proposed methods. In the Friedman test we record wins and losses in the form of ranks and ignore the magnitude of these wins and losses. Graphical illustrations of the post hoc test results for the AUC and accuracy ranks are given in Figures 8, 9, 10, and 11 for the multiclass and the two-class data sets. The critical difference (CD) value is shown at the top of each figure. If the difference between two methods exceeds this value, then the methods are significantly different; otherwise, they are not. In the latter case a thick black line connects them to indicate the insignificant difference between them. Note that the lower the rank is, the higher the performance is. Looking at the average performance ranks and the critical difference of the post hoc test, on the multiclass data sets MRSI is significantly better than CN2 on both accuracy and AUC, while one of the other proposed methods performs significantly worse on AUC. On the binary data sets, two of the proposed methods significantly outperform CN2 with respect to AUC, while no statistical significance is reported regarding their accuracies. The conclusion seems warranted that MRSI is preferable for multiclass data sets, while MRL is preferable for binary data sets.
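The rank-based test can be sketched in a few lines. The sketch uses the standard Friedman statistic over average ranks and the Bonferroni-Dunn critical difference, $CD = q_{\alpha}\sqrt{k(k+1)/(6N)}$, for $k$ methods and $N$ data sets, with $q_{\alpha}$ taken from the normal distribution at the Bonferroni-corrected level. The significance level of 0.05 used in the example call is an assumption made for illustration only, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def average_ranks(rank_table):
    """Mean rank of each method; rows are data sets, columns are methods."""
    k = len(rank_table[0])
    return [sum(row[j] for row in rank_table) / len(rank_table) for j in range(k)]


def friedman_statistic(rank_table):
    """Friedman chi-squared statistic computed from the average ranks."""
    n, k = len(rank_table), len(rank_table[0])
    avg = average_ranks(rank_table)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in avg) - k * (k + 1) ** 2 / 4)


def bonferroni_dunn_cd(k, n, alpha):
    """Critical difference for comparing k - 1 methods against one control."""
    q_alpha = NormalDist().inv_cdf(1 - alpha / (2 * (k - 1)))
    return q_alpha * sqrt(k * (k + 1) / (6 * n))


# Four methods on six data sets, as in the multiclass experiments.
print(bonferroni_dunn_cd(k=4, n=6, alpha=0.05))
```

Two methods whose average ranks differ by more than the returned value would be reported as significantly different at the chosen level.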

For reference we also show the (multimodel) accuracy reported by Aleph, although this does not correspond to a coherent multiclass model and overemphasises the default rules. Also reported is the average positive recall, but this does not take proper account of rule overlaps.

6. Concluding Remarks

In this work we investigated the lack of reliability and consistency of the one-versus-rest technique on multiclass domains. We showed that we could build a simple and single multiclass model by combining the rules of all one-versus-rest models and turning them into a coherent multiclass classifier, and we proposed three methods for that: Multiclass Rule List (MRL), Multiclass Rule Set Union (MRSU), and Multiclass Rule Set Intersection (MRSI).

Moreover, we showed that we can adapt a graphical model with the help of Formal Concept Analysis (FCA) such that it can be used to explore the relationships and partial order between rules with respect to their coverage over the examples.

In Section 3 we illustrated our proposed multiclass methods in terms of rule lists and rule trees, and in Section 4 a connection between the MRSI method and a formal concept lattice was drawn. We pointed out that it is possible to use a formal concept lattice as a probabilistic classifier similarly to MRSI but with a simpler and more compact representation.

We showed that our proposed methods generate consistent and reliable multiclass predictions and experimentally produce significant results, with respect to accuracy and AUC, on both multiclass and binary domains when compared to the CN2 method. When classification is based on rule intersection (MRSI), the best accuracies and AUCs were achieved on the multiclass data sets. The rule list method seems to be suitable for two-class problems. The origin of this difference is the subject of ongoing investigations. The difference suggests that MRL benefits from having trees with larger leaves (see Figure 3) to best decide between two classes in two-class scenarios, while this becomes harder in multiclass scenarios, where the MRSI method, reflecting trees with smaller leaves (see Figure 4), tends to perform better.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.