A Comparison Study on Rule Extraction from Neural Network Ensembles, Boosted Shallow Trees, and SVMs
One way to make the knowledge stored in an artificial neural network more intelligible is to extract symbolic rules. However, producing rules from Multilayer Perceptrons (MLPs) is an NP-hard problem. Many techniques have been introduced to generate rules from single neural networks, but very few were proposed for ensembles. Moreover, experiments were rarely assessed by 10-fold cross-validation trials. In this work, based on the Discretized Interpretable Multilayer Perceptron (DIMLP), experiments were performed on 10 repetitions of stratified 10-fold cross-validation trials over 25 binary classification problems. The DIMLP architecture allowed us to produce rules from DIMLP ensembles, boosted shallow trees (BSTs), and Support Vector Machines (SVM). The complexity of rulesets was measured with the average number of generated rules and average number of antecedents per rule. From the 25 used classification problems, the most complex rulesets were generated from BSTs trained by “gentle boosting” and “real boosting.” Moreover, we clearly observed that the less complex the rules were, the better their fidelity was. In fact, rules generated from decision stumps trained by modest boosting were, for almost all the 25 datasets, the simplest with the highest fidelity. Finally, in terms of average predictive accuracy and average ruleset complexity, the comparison of some of our results to those reported in the literature proved to be competitive.
The explanation of neural network responses is essential for their acceptance. As an example, physicians cannot trust a model that offers no form of explanation. An intuitive way to give insight into the knowledge embedded within neural network connections and neuron activations is to extract symbolic rules. However, producing rules from Multilayer Perceptrons (MLPs) is an NP-hard problem.
In the context of classification, the format of a symbolic rule is given as follows: “if tests on antecedents are true then class $C$,” where “tests on antecedents” are in the form $x_i \leq t_i$ or $x_i \geq t_i$, with $x_i$ as an input variable and $t_i$ as a real number. Class $C$ designates a class among several possible classes. The complexity of the extracted rules is often described with two parameters: number of rules and number of antecedents per rule. Rulesets of low complexity are preferred to those of high complexity, since at first sight fewer rules and fewer antecedents are better understood. Another reason for this preference is that rule bases with lower complexity also reduce the risk of overfitting. Nevertheless, Freitas clarified that the comprehensibility of rules is not necessarily related to a small number of rules. He proposed a new measure denoted as prediction-explanation size, which strongly depends on the average number of antecedents per rule. Another measure of rule transparency is consistency. Specifically, an extracted ruleset is deemed to be consistent if, under different training sessions, the rule extraction algorithm produces rulesets which classify samples into the same classes. Finally, a rule is redundant if it conveys the same information or less general information than the information conveyed by another rule.
An important characteristic of rulesets is whether they are ordered or not. Ordered rules correspond to the following: if tests on antecedents are true then …, else if tests on antecedents are true then …, …, else …
In unordered rules “else if” is replaced again by “if tests on antecedents are true then conclusion.” Thus, a sample can activate more than one rule. Long ordered rulesets are difficult to understand since they potentially include many implicit antecedents; specifically, those negated by “else if.” Generally, unordered rulesets present more rules and antecedents than ordered ones, since all rule antecedents are explicitly provided, thus being more transparent than ordered rulesets. Each rule of an unordered ruleset represents a single piece of knowledge that can be examined in isolation, since all antecedents are explicitly given. With a great number of unordered rules, one would try to accurately understand the meaning of each rule with respect to the data domain. Getting the global picture could take a long time; nevertheless, one could be interested only in some parts of the whole knowledge, for instance, those rules with the highest number of covered samples.
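The difference between the two rule formats can be illustrated with a minimal sketch. The attribute name `x`, the thresholds, and the class labels below are hypothetical and serve only as an illustration; they do not come from the experiments.

```python
from typing import Callable, Dict, List, Tuple

# A rule is a pair (test on antecedents, class label).
Rule = Tuple[Callable[[Dict[str, float]], bool], str]


def classify_ordered(rules: List[Rule], default: str, sample: Dict[str, float]) -> str:
    """Ordered rules: 'if ... else if ... else'; only the first matching rule fires."""
    for test, label in rules:
        if test(sample):
            return label
    return default


def activated_unordered(rules: List[Rule], sample: Dict[str, float]) -> List[str]:
    """Unordered rules: each rule is checked in isolation, so zero, one,
    or several rules may be activated by the same sample."""
    return [label for test, label in rules if test(sample)]


# Hypothetical rules on a made-up attribute x.
rules: List[Rule] = [
    (lambda s: s["x"] <= 3.0, "A"),
    (lambda s: s["x"] >= 2.0, "B"),
]
sample = {"x": 2.5}
ordered_result = classify_ordered(rules, "B", sample)
unordered_result = activated_unordered(rules, sample)
```

In the ordered reading the second rule implicitly carries the negated antecedent of the first one, whereas in the unordered reading both rules are activated by the sample and the ambiguity has to be resolved separately.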
The Discretized Interpretable Multilayer Perceptron (DIMLP) represents a special feedforward neural network architecture from which crisp symbolic rules are extracted in polynomial time. This particular Multilayer Perceptron (MLP) model can be used to learn any classification problem, and rule extraction is also performed for DIMLP ensembles. Furthermore, special DIMLP architectures were also defined to produce fuzzy rules.
Decision trees are widely used in Machine Learning. They represent transparent models because symbolic rules are easily extracted. However, when they are combined in an ensemble, rule extraction becomes harder. Here, we propose generating rules from ensembles of shallow decision trees with the help of DIMLP ensembles. In practical terms, each rule extracted from a tree is inserted into a single DIMLP network; then, all the rules generated from a tree ensemble are represented by a DIMLP ensemble. Finally, rule extraction is performed to obtain a ruleset representing the knowledge embedded within the decision tree ensemble. Because of the No Free Lunch Theorem, no model is better than any other in general. Hence, if a connectionist model is more accurate than a direct rule learner such as RIPPER, then it is worth extracting rules to understand the classifications, even if this involves extra computing time.
Authors who generated rules from single neural networks or Support Vector Machines (SVMs) very rarely assessed their techniques by tenfold cross-validation. Our experiments are based on ten repetitions of stratified tenfold cross-validation trials over 25 binary classification problems. Note that the total number of training trials is equal to 42500. Moreover, we compare the complexity of the rules generated from DIMLP ensembles, boosted shallow trees (BSTs), and SVMs. For SVMs we define the Quantized Support Vector Machine (QSVM), which is a DIMLP architecture trained by an SVM learning algorithm. Our purpose is not to determine which model is the best for these classification problems, but to characterize the complexity of the rules produced by the models. Our results could serve as a basis for researchers who would like to compare their rule extraction techniques applied to connectionist models by 10-fold cross-validation. In the following sections we present the DIMLP model that allows us to produce rules from BSTs and SVMs and then the experiments, followed by the conclusion.
1.1. State of the Art
Since the earliest work of Gallant on rule extraction from neural networks , many techniques have been introduced. In the 1990s, Andrews et al. introduced a taxonomy aiming at characterizing rule extraction techniques . Essentially, rule extraction algorithms belong to three categories: decompositional; pedagogical; and eclectic. In decompositional techniques, rules are extracted at the level of hidden and output neurons by analyzing weight values. Here, a basic requirement is that the computed output from each hidden and output unit must be mapped into a binary outcome which corresponds to the notion of a rule consequent. The basic idea of the pedagogical approach is to view rule extraction as a learning task where the target concept is the function computed by the network and the input attributes are simply the network’s input neurons. Weight values are not taken into account in this category of techniques. Finally, the eclectic approach takes into account elements of both decompositional and pedagogical techniques. A few years later, Duch et al. published a survey article on this topic . More recently, Diederich published a book on techniques to extract symbolic rules from Support Vector Machines (SVMs)  and Barakat and Bradley reviewed a number of rule extraction techniques applied to SVMs .
1.1.1. Rule Extraction from Neural Network Ensembles
Many rule extraction techniques from single neural networks have been introduced, but only a few authors have started to extract rules from neural network ensembles. Bologna proposed the Discretized Interpretable Multilayer Perceptron (DIMLP) to generate unordered symbolic rules from both single networks and ensembles [21, 22]. With the DIMLP architecture, rule extraction is performed by determining the precise location of axis-parallel discriminative hyperplanes. Zhou et al. introduced the REFNE (Rule Extraction from Neural Network Ensemble) algorithm, which utilizes the trained ensembles to generate instances and then extracts symbolic rules from those instances. Attributes are discretized during rule extraction and it also uses particular fidelity evaluation mechanisms. Moreover, rules have been limited to only three antecedents. For Johansson, rule extraction from ensembles is an optimization problem in which a trade-off between accuracy and comprehensibility must be taken into account. He used a genetic programming technique to produce rules from ensembles of 20 neural networks. Ao and Palade extracted rules from ensembles of Elman networks and SVMs by means of a pedagogical approach to predict gene expression in microarray data. More recently, Hara and Hayashi proposed two-MLP ensembles using the “Recursive-Rule eXtraction” (Re-RX) algorithm for data with mixed attributes. Re-RX utilizes C4.5 decision trees and backpropagation to train MLPs recursively. Here, the rule antecedents for discrete attributes are disjoint from those for continuous attributes. Subsequently, Hayashi et al. presented the “three-MLP Ensemble” by the Re-RX algorithm.
1.1.2. Rule Extraction from Ensembles of Decision Trees
Basically, rule extraction techniques applied to ensembles of decision trees belong to two distinct groups. In the first, the purpose is to reduce the number of decision trees by increasing their diversity. Techniques for the optimization of diversity have been reported in the literature; as an example, Gashler et al. improved the ensemble diversity by combining different decision tree algorithms.
Techniques in the second group concentrate on the rules extracted during the ensemble construction. A well-known representative technique in this group is RuleFit . The base learners are rules extracted from a large number of CART decision trees . Specifically, these trees are trained on random subsets of the learning set, the main idea being to define a linear function including rules and features that approximates the whole ensemble of decision trees. At the end of the process this linear function represents a regularized regression of the ensemble responses with a large number of coefficients equal to zero. Node Harvest is another rule-based representative technique . Its purpose is to find suitable weights for rules by performing a minimization on a quadratic program with linear inequality constraints. Finally, in , the rule extraction problem is viewed as a regression problem using the sparse group lasso method , such that each rule is assumed to be a feature, where the aim is to predict the response. Subsequently, most of the rules are removed by trying to keep accuracy and fidelity as high as possible.
1.1.3. Rule Extraction from Support Vector Machines
To produce rules from SVMs, a number of techniques applied a pedagogical approach [35–38]. As a first step, training samples are relabeled according to the target class provided by the SVM. Then, the new dataset is learned by a transparent model, such as a decision tree, which approximately learns what the SVM has learned. As a variant, only a subset of the training samples is used as the new dataset: the support vectors. Before the training of a decision tree algorithm, Martens et al. generate additional learning examples close to randomly selected support vectors. In another technique, Barakat and Bradley generate rules from a subset of the support vectors using a modified covering algorithm, which refines a set of initial rules determined by the most discriminative features.
Fu et al. proposed a method aiming at determining hyperrectangles whose upper and lower corners are defined by determining the intersection of each of the support vectors with the separating hyperplane . This is achieved by solving an optimization problem depending on the Gaussian kernel. Núñez et al. determined prototype vectors for each class [15, 42]. With the use of the support vectors, these prototypes are translated into ellipsoids or hyperrectangles. An iterative process is defined in order to divide ellipsoids or hyperrectangles into more regions, depending on the presence of outliers and the SVM decision boundary. Similarly, Zhang et al. introduced a clustering algorithm to define prototypes from the support vectors . Then, small hyperrectangles are defined around these prototypes and progressively grown until a stopping criterion is met. Note that for these two last methods the comprehensibility of the rules is low, since all input features are present in the rule antecedents.
2. Material and Methods
In this section we present the models used in this work, which are DIMLP ensembles, Quantized Support Vector Machines, and shallow boosted trees. The rule extraction process of the last two models has been made possible by transforming them into particular DIMLP architectures.
2.1. The DIMLP Model
DIMLP differs from MLP in the connectivity between the input layer and the first hidden layer. Specifically, each neuron of the first hidden layer receives a connection from only one input neuron and from the bias neuron, as shown in Figure 1. After the first hidden layer, neurons are fully connected. Note that very often DIMLPs are defined with two hidden layers, the number of neurons in the first hidden layer being equal to the number of input neurons.
2.1.1. DIMLP Architecture
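The activation function in the output layer is a sigmoid function given as
$$\sigma(x) = \frac{1}{1 + \exp(-x)}. \tag{1}$$
In the first hidden layer the activation function is a staircase function $S(x)$ with $q$ stairs that approximates the sigmoid function:
$$S(x) = \sigma(R_{\min}) \quad \text{if } x \leq R_{\min}, \tag{2}$$
where $R_{\min}$ represents the abscissa of the first stair, and
$$S(x) = \sigma(R_{\max}) \quad \text{if } x \geq R_{\max}, \tag{3}$$
where $R_{\max}$ represents the abscissa of the last stair. Both abscissae have default values. Otherwise, if $R_{\min} < x < R_{\max}$, we have
$$S(x) = \sigma\left(R_{\min} + \left[ q \cdot \frac{x - R_{\min}}{R_{\max} - R_{\min}} \right] \cdot \frac{R_{\max} - R_{\min}}{q}\right). \tag{4}$$
Square brackets indicate the integer part function and $q$ is the number of stairs. The step function is a particular case of the staircase function with only one step:
$$t(x) = \begin{cases} 1 & \text{if } x > 0, \\ 0 & \text{otherwise.} \end{cases} \tag{5}$$
If we would like to obtain a better approximation of the sigmoid function we could change these values and increase the number of stairs. The activation function in the hidden layers above the first one is again a sigmoid. Note that the step/staircase activation function makes it possible to precisely locate possible discriminative hyperplanes.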
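As an illustration, a staircase approximation of the sigmoid can be sketched as follows. This is a minimal sketch under stated assumptions: the parameter names `q`, `r_min`, and `r_max` (number of stairs, abscissa of the first stair, abscissa of the last stair) and the integer-part quantization are our reading of the description above, and the numerical values in the usage lines are arbitrary, not the defaults of the DIMLP software.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def staircase(x: float, q: int, r_min: float, r_max: float) -> float:
    """Piecewise-constant approximation of the sigmoid with q stairs
    between r_min and r_max (assumed parameterization)."""
    if x <= r_min:
        return sigmoid(r_min)
    if x >= r_max:
        return sigmoid(r_max)
    width = (r_max - r_min) / q           # width of one stair
    k = math.floor((x - r_min) / width)   # integer part: index of the stair
    return sigmoid(r_min + k * width)


def step(x: float) -> float:
    """Step function: the limiting case with a single threshold."""
    return 1.0 if x > 0 else 0.0


# Arbitrary illustration values (assumptions, not documented defaults).
coarse = staircase(0.3, q=10, r_min=-5.0, r_max=5.0)
fine_error = max(
    abs(staircase(x / 10, q=100, r_min=-5.0, r_max=5.0) - sigmoid(x / 10))
    for x in range(-50, 51)
)
```

Since the slope of the sigmoid never exceeds 0.25, the approximation error inside the interval is bounded by a quarter of the stair width, which is why increasing the number of stairs tightens the approximation.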
As an example, in Figure 1 assuming two different classes, the first is being selected when (black circle) and the second with (white squares). Hence, two possible hyperplane splits are located in and , respectively. As a result, the extracted unordered rules are as follows:(i) square(ii) square(iii) and circle.
The training of a DIMLP network having step activation functions in the first hidden layer was performed by simulated annealing , since the gradient is undefined with step activation functions. When the number of stairs was allowed to approximate the sigmoid function sufficiently well, a modified backpropagation algorithm was used . The default number of stairs in the staircase activation function was equal to .
2.1.2. Rule Extraction
Each neuron of the first hidden layer creates a number of virtual parallel hyperplanes that is equal to the number of stairs of its staircase activation function. As a consequence, the rule extraction algorithm corresponds to a covering algorithm for which the goal is to determine whether a virtual hyperplane is virtual or effective. A distinctive feature of this rule extraction technique is that fidelity, which is the degree of matching between network classifications and rules’ classifications, is equal to 100% with respect to the training set.
Here we describe the general idea behind the rule extraction algorithm, since more details are described in . The relevance of a discriminative hyperplane corresponds to the number of points viewing this hyperplane as the transition to a different class. In the first step of the rule extraction algorithm the relevance of discriminative hyperplanes is estimated from all training examples and DIMLP responses.
Once the relevance of discriminative hyperplanes has been established a special decision tree is built according to the strongest relevant hyperplane criterion. In other terms, during tree induction in a given region of the input space the hyperplane having the largest number of points viewing this hyperplane as the transition to a different class is added to the tree.
Each path between the root and a leaf of the obtained decision tree corresponds to a rule. At this stage rules are disjoint and generally their number is large, as well as their number of antecedents. Therefore, a pruning strategy is applied to all rules according to the most enlarging pruned antecedent criterion. This heuristic means that at each step the pruning algorithm removes the rule antecedent whose removal most increases the number of covered examples without changing DIMLP classifications. Note that at the end of this stage rules are no longer disjoint and unnecessary rules are removed.
When it is no longer possible to prune any antecedent or any rule, all thresholds of the remaining antecedents are modified according to the most enlarging criterion, again in order to increase the number of examples covered by each rule. More precisely, for each attribute new threshold values are determined according to the list of discriminative hyperplanes. At each step, the new antecedent threshold that most increases the number of covered examples without altering DIMLP classifications is retained.
The general algorithm is summarized as follows:
(1) Determine relevance of discriminant hyperplanes using available examples.
(2) Build a decision tree according to the highest relevant hyperplane criterion.
(3) Prune rule antecedents according to the most enlarging pruned antecedent criterion.
(4) Prune unnecessary rules.
(5) Modify antecedent thresholds according to the most enlarging criterion.
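Step (3) can be sketched in a few lines. The following is only a simplified illustration of the most enlarging pruned antecedent heuristic for a single rule, not the actual DIMLP implementation: the antecedent encoding `(input index, operator, threshold)`, the function names, and the toy data are assumptions made for this example. An antecedent may be removed only if every training sample covered by the enlarged rule is still assigned the rule's class by the network.

```python
from typing import List, Sequence, Tuple

Antecedent = Tuple[int, str, float]  # (input index, "<=" or ">=", threshold)


def covers(antecedents: Sequence[Antecedent], x: Sequence[float]) -> bool:
    """True if sample x satisfies every antecedent of the rule."""
    return all(x[i] <= t if op == "<=" else x[i] >= t for i, op, t in antecedents)


def prune_rule(antecedents: Sequence[Antecedent], rule_class: int,
               samples: Sequence[Sequence[float]],
               model_classes: Sequence[int]) -> List[Antecedent]:
    """Greedily remove the antecedent whose removal most enlarges coverage
    while all covered samples keep the class given by the network."""
    current = list(antecedents)
    while current:
        best, best_coverage = None, -1
        for k in range(len(current)):
            candidate = current[:k] + current[k + 1:]
            covered = [c for x, c in zip(samples, model_classes)
                       if covers(candidate, x)]
            if all(c == rule_class for c in covered) and len(covered) > best_coverage:
                best, best_coverage = candidate, len(covered)
        if best is None:      # no antecedent can be removed safely
            break
        current = best
    return current


# Toy data: the (hypothetical) network answers class 1 iff x[0] >= 0.5.
samples = [(0.2, 0.1), (0.8, 0.3), (0.9, 0.9), (0.1, 0.8)]
model_classes = [0, 1, 1, 0]
rule = [(0, ">=", 0.5), (1, "<=", 0.5)]
pruned = prune_rule(rule, 1, samples, model_classes)
```

In this toy case the second antecedent is removed because the enlarged rule covers two samples instead of one while still agreeing with the network on all of them; the first antecedent cannot be removed without covering samples of the other class.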
2.1.3. DIMLP Ensembles
We implemented DIMLP ensemble learning by bagging and arcing. Bagging and arcing are based on resampling techniques. For the first training method, assuming a training set of size $p$, bagging selects for each classifier included in the ensemble $p$ samples drawn with replacement from the original training set. Hence, for each DIMLP network many of the generated samples may be repeated while others may be left out. In this way, the resulting diversity among the single networks proves to be beneficial with respect to the whole ensemble of combined classifiers.
Arcing associates a probability with each sample of the original training set. The samples of each classifier are chosen according to these probabilities. Before learning, all training samples have the same probability of belonging to a new training set ($= 1/p$). Then, after the first classifier has been trained, the probability of sample selection in a new training set is increased for all unlearned samples and decreased for the others.
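The two resampling schemes can be sketched as follows. This is a hedged illustration rather than the code used in the experiments: the function names are ours, and the arcing update shown is the arc-x4 weighting of Breiman (selection probability proportional to $1 + m_i^4$, with $m_i$ the number of previous classifiers that misclassified sample $i$), which is one known variant; the text above does not state which update rule was used.

```python
import random
from typing import List


def bagging_sample(n: int, rng: random.Random) -> List[int]:
    """Draw n indices uniformly with replacement (bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]


def arcing_probabilities(miss_counts: List[int]) -> List[float]:
    """arc-x4 style probabilities (assumed variant): samples misclassified
    more often by the previous classifiers become more likely to be drawn."""
    weights = [1 + m ** 4 for m in miss_counts]
    total = sum(weights)
    return [w / total for w in weights]


def arcing_sample(probabilities: List[float], rng: random.Random) -> List[int]:
    """Draw a new training set of the same size according to the probabilities."""
    n = len(probabilities)
    return rng.choices(range(n), weights=probabilities, k=n)


rng = random.Random(0)
bag = bagging_sample(8, rng)
initial = arcing_probabilities([0, 0, 0, 0])   # before learning: uniform
updated = arcing_probabilities([0, 1, 0, 2])   # after some misclassifications
arc = arcing_sample(updated, rng)
```

Before any classifier has been trained all miss counts are zero, so the arcing probabilities reduce to the uniform case described in the text.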
Rule extraction from ensembles can still be performed, since an ensemble of DIMLP networks can be viewed as a single DIMLP network with one more hidden layer. For this unique DIMLP network, weight values between subnetworks are equal to zero. Figure 2 illustrates three different kinds of DIMLP ensembles. Each “box” in this figure is transparent, since it can be translated into symbolic rules. The ensemble resulting from different types of combinations is again transparent, since it is still a DIMLP network with one more layer of weights.
2.1.4. Classification Strategy of the Rules
For the training set the degree of matching between DIMLP classifications and rules, also denoted as fidelity, is equal to 100%. With unordered rules, an unknown sample not belonging to the training set activates zero, one, or several rules. Thus, several activated rules of different classes lead to an ambiguous decision process. As a remedy, classifications provided by DIMLPs are taken into account to disambiguate the classification process. We summarize the possible situations for an unclassified sample not belonging to the training set:
(i) No activated rules: the classification is provided by the DIMLP network (thus, no explanation is provided).
(ii) One or several rules belonging to the same class corresponding to the one provided by the DIMLP network: thus, rule(s) and network agree.
(iii) Several rules belonging to different classes: if the class provided by DIMLP is represented in the rules, we only take into account this (these) rule(s) to explain the classification and discard the other(s).
(iv) One or several rules belong to one or several classes, but the class provided by DIMLP is not represented in the rule(s). Thus, rule(s) and network disagree and the classification provided by the rules is wrong.
Predictive accuracy is the proportion of correctly classified samples of an independent testing set. With respect to the rules it can be calculated by following three distinct strategies:
(i) Classifications are provided by the rules. If a sample does not activate any rule the class is provided by the model without explanation.
(ii) Classifications are provided by the rules, when rules and model agree. In case of disagreement, no classification is provided. Moreover, if a sample does not activate any rule the class is provided by the model.
(iii) Classifications are provided by the rules, when rules and model agree. In case of disagreement, the classification is provided by the model without any explanation. Moreover, if a sample does not activate any rule, the class is again provided by the model without explanation.
By following the first strategy, the unexplained samples are only those that do not activate any rule. For the second one, in case of disagreement between rules and models no classification response is provided; in other words the classification is undetermined. Finally, the predictive accuracy of rules and models is equal in the last strategy, but with respect to the first strategy we have a supplemental proportion of uncovered samples, those for which rules and models disagree.
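The three strategies can be made concrete with a short sketch. It rests on assumptions that are not stated in the text: each test sample is represented as a hypothetical triple (classes of the activated rules, class given by the model, true class); when rules and model disagree, the class of the rules is taken as the most frequent class among the activated rules; and the accuracy of the second strategy is computed over the samples that receive a classification.

```python
from collections import Counter
from typing import List, Optional, Tuple

Sample = Tuple[List[int], int, int]  # (activated rule classes, model class, true class)


def rules_class(activated: List[int], model_class: int) -> Optional[int]:
    """Class provided by the rules, or None when no rule is activated."""
    if not activated:
        return None
    if model_class in activated:          # rules and model agree
        return model_class
    return Counter(activated).most_common(1)[0][0]   # disagreement


def accuracies(samples: List[Sample]) -> Tuple[float, float, float]:
    """Predictive accuracy under strategies (i), (ii), and (iii)."""
    n = len(samples)
    s1 = s2 = s3 = determined = 0
    for activated, model_class, truth in samples:
        r = rules_class(activated, model_class)
        agree = r is None or r == model_class
        s1 += (model_class if r is None else r) == truth   # rules decide
        if agree:                                          # else: undetermined
            determined += 1
            s2 += model_class == truth
        s3 += model_class == truth                         # model decides on conflict
    return s1 / n, (s2 / determined if determined else 0.0), s3 / n


# Hypothetical test samples for a two-class problem.
samples = [
    ([], 0, 0),       # no rule activated: model decides, unexplained
    ([1], 1, 1),      # rule and model agree
    ([0, 1], 1, 0),   # conflicting rules: the model's class is retained
    ([0], 1, 1),      # rule and model disagree
]
acc1, acc2, acc3 = accuracies(samples)
```

With the third strategy the result necessarily equals the accuracy of the model itself, since the model's class is used whenever the rules are silent or disagree.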
2.2. Quantized Support Vector Machines (QSVMs)
Functionally, SVMs can be viewed as feedforward neural networks. Here, we focus on how an SVM is transformed into a QSVM, which is a DIMLP network with specific neuron activation functions. Since QSVM is also a DIMLP network, rules can be extracted by performing the DIMLP rule extraction algorithm. QSVM is trained by a standard SVM training algorithm, for which details are provided in the literature.
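The classification decision function of an SVM model is given by
$$C(x) = \operatorname{sign}\left(\sum_{i} \alpha_i y_i K(x_i, x) + b\right), \tag{6}$$
$\alpha_i$ and $b$ being real values, $y_i \in \{-1, 1\}$ corresponding to the target values of the support vectors, and $K(x_i, x)$ representing a kernel function with $x_i$ as the vector components of the support vectors. The sign function is
$$\operatorname{sign}(x) = \begin{cases} 1 & \text{if } x > 0, \\ -1 & \text{otherwise.} \end{cases} \tag{7}$$
The following kernels are used:
(i) Linear (dot product)
(ii) Polynomial
(iii) Gaussian.
Specifically, for the dot and polynomial cases we have
$$K(x_i, x) = (x_i \cdot x)^d, \tag{8}$$
with $d = 1$ for the dot kernel and $d = 3$ for the polynomial kernel. The Gaussian kernel is
$$K(x_i, x) = \exp\left(-\gamma \lVert x_i - x \rVert^2\right), \tag{9}$$
with $\gamma > 0$, a parameter.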
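A minimal sketch of the decision function with the three kernels is given below. It assumes the usual forms: a sign applied to a weighted sum of kernel values plus a bias, a kernel $(u \cdot v)^d$ with $d = 1$ (dot) or $d = 3$ (polynomial), and a Gaussian kernel $\exp(-\gamma \lVert u - v \rVert^2)$. The function names, the toy support vectors, and the coefficients are hypothetical; `coeffs` holds the product of each multiplier with the target value of its support vector.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(u: Vector, v: Vector, d: int) -> float:
    """(u . v)^d: d = 1 gives the dot kernel, d = 3 the cubic polynomial kernel."""
    return dot(u, v) ** d


def gaussian_kernel(u: Vector, v: Vector, gamma: float) -> float:
    """exp(-gamma * ||u - v||^2), with gamma > 0."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def svm_decide(x: Vector, support_vectors: Sequence[Vector],
               coeffs: Sequence[float], b: float,
               kernel: Callable[[Vector, Vector], float]) -> int:
    """Sign of the weighted sum of kernel values plus bias."""
    s = sum(c * kernel(sv, x) for sv, c in zip(support_vectors, coeffs)) + b
    return 1 if s > 0 else -1


# Toy support vectors and coefficients (illustrative values only).
svs = [(1.0, 0.0), (-1.0, 0.0)]
coeffs = [1.0, -1.0]
linear = lambda u, v: poly_kernel(u, v, 1)
positive = svm_decide((2.0, 1.0), svs, coeffs, 0.0, linear)
negative = svm_decide((-3.0, 0.0), svs, coeffs, 0.0, linear)
```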
We define a Quantized Support Vector Machine as a DIMLP network with two hidden layers. The activation function of the neurons in the second hidden layer is related to the SVM kernel. Figure 3 presents a QSVM with a Gaussian activation function in the second hidden layer.
Neurons in the first hidden layer have a staircase activation function. The role of neurons of the first hidden layer is to perform a normalization of the input variables. This normalization is carried out through weight values depending on the training data before the learning phase. Note that during training these weights remain unchanged. Let us assume that we have the same number of input neurons and hidden neurons in the first hidden layer. These weights are defined as
(i) $w_l = 1/\sigma_l$, with $\sigma_l$ as the standard deviation of input $l$,
(ii) $w_{l0} = -\mu_l/\sigma_l$, with $\mu_l$ as the average on the training set of input $l$.
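The effect of these fixed weights can be checked with a small sketch. It assumes that the weight attached to an input is the inverse of its standard deviation and that the bias weight is the negated mean divided by the standard deviation, so that the weighted input equals the Gaussian-normalized value; the use of the population standard deviation and the sample values are assumptions made for illustration.

```python
import statistics
from typing import List, Tuple


def normalization_weights(column: List[float]) -> Tuple[float, float]:
    """Return (weight, bias weight) so that weight * x + bias = (x - mean) / std.
    The population standard deviation is an assumption of this sketch."""
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column)
    return 1.0 / sigma, -mu / sigma


# Hypothetical training values of one input variable.
column = [1.0, 2.0, 3.0, 4.0, 5.0]
w, w0 = normalization_weights(column)
normalized = [w * x + w0 for x in column]
```

The normalized values have zero mean and unit standard deviation on the training set, and the same two weights are then applied unchanged to unseen samples.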
With a dot kernel, the activation function in the second hidden layer corresponds to the identity function, while it is a cubic polynomial with a polynomial kernel. The number of neurons in this layer is equal to the number of support vectors, with the incoming weight connections corresponding to the components of the support vectors. Specifically, a weight between the first and second hidden layers (see Figure 3) corresponds to the $j$th component of the $k$th support vector. Weights between the second hidden layer and the output neuron (see Figure 3) correspond to the coefficients of the kernel terms in (6). Finally, the activation function of the output neuron is a sign function.
2.3. Ensembles of Shallow Decision Trees
A binary decision tree is made of nodes and branches. At each node, a test on an attribute is performed; depending on its predicate value the path continues to the left or to the right branch (if any), until a terminal node, also denoted as a leaf, is reached. Shallow trees have a very limited number of nodes; they represent “weak” learners with limited power of expression. As an example, a tree with a unique node performs a test on a single attribute. Such a shallow tree is also called a decision stump. The key idea behind ensembles of shallow decision trees is to obtain strong classifiers by training weak learners by boosting. Three variants of boosting are used in this work to train boosted shallow trees (BSTs):
(i) Modest Adaboost
(ii) Gentle Adaboost
(iii) Real Adaboost.
A single decision tree is built according to a splitting criterion. Specifically, at each step the most informative attribute that splits the training set accurately is determined. Many possible criteria can be used to determine the best splitting attribute; for more details see [31, 52]. Once training is completed, BSTs are transformed into DIMLP ensembles. Specifically, for each BST, a path from a root to a leaf represents a symbolic rule. Then, each rule is inserted into a unique DIMLP network. Note also that all the rules extracted from a BST could be inserted into a DIMLP, but for simplicity we will show the former rule insertion technique. We assume here that DIMLPs have a unique hidden layer with an activation function which is a sigmoid (cf. (5)).
Figure 4 exhibits a shallow decision tree with two nodes. Following the paths between the root and the leaves, we obtain three rules.
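The correspondence between paths and rules can be sketched as follows. The tree below is hypothetical (its attributes, thresholds, and classes are not those of Figure 4); it only shares the property of having two nodes and therefore three root-to-leaf paths. Internal nodes are encoded as tuples and leaves as class labels, an encoding chosen for this example.

```python
from typing import List, Tuple, Union

Antecedent = Tuple[str, str, float]            # (attribute, operator, threshold)
Tree = Union[str, Tuple[str, float, "Tree", "Tree"]]


def tree_to_rules(node: Tree,
                  path: Tuple[Antecedent, ...] = ()) -> List[Tuple[List[Antecedent], str]]:
    """Enumerate every root-to-leaf path as (antecedents, class)."""
    if not isinstance(node, tuple):            # leaf: emit the accumulated path
        return [(list(path), node)]
    attribute, threshold, left, right = node
    return (tree_to_rules(left, path + ((attribute, "<=", threshold),))
            + tree_to_rules(right, path + ((attribute, ">", threshold),)))


# Hypothetical shallow tree with two nodes.
tree: Tree = ("x1", 0.5, "class 1", ("x2", 1.5, "class 2", "class 1"))
rules = tree_to_rules(tree)
```

Each returned pair corresponds to one rule of the form “if tests on antecedents are true then class,” which is the unit later inserted into a DIMLP network.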
Each rule is inserted into a single DIMLP. Note that rule antecedents are present in the weight values between the input layer and the hidden layer (see Figure 5).
Without loss of generality we formulate the rule insertion algorithm for classification problems of two classes, vector coding the first class and vector coding the second.
Rule Insertion Algorithm(1)For all BSTs generate the list of rules with their corresponding class by following all the paths between roots and leaves.(2)For each rule in , let be the number of antecedents of ; then let us define a network with inputs, neurons in the hidden layer, and two output neurons.(3)For each coding a unique rule in and for the th antecedent in , such as ( being a constant), and , with being the weight value between the bias neuron and hidden neuron and being the weight value between input neuron and hidden neuron .(4)For each coding a unique rule in and for each antecedent in , such as , and .(5)For each coding rule of class , , for ( designates weight values between the hidden layer and the first output neuron) and ( is the weight value between the bias neuron and the first output neuron); ( designates weight values between the hidden layer and the second output neuron) and ( is the weight value between the bias neuron and the second output neuron).(6)For each coding rule of class , , for and ; and .
Boosting algorithms provide for each weak learner coefficients that are inserted in the combination layer (cf. Figure 2). Note that for DIMLP ensembles trained with bagging or arcing these weights are equal to $1/N$, with $N$ being the number of networks in the ensemble.
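The principle of rule insertion can be illustrated with a small sketch. It does not reproduce the weight values of the Rule Insertion Algorithm above; it is one possible encoding under our own assumptions: each antecedent is coded by one hidden neuron connected to a single input, a large constant `K` (hypothetical value 50) makes the sigmoid behave almost like a step, and the output neuron of the rule's class fires only when all hidden neurons are active. Only this output is shown; the output of the other class would be its complement.

```python
import math
from typing import List, Sequence, Tuple

Antecedent = Tuple[int, str, float]   # (input index, "<=" or ">=", threshold)


def sigmoid(x: float) -> float:
    """Numerically safe logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def insert_rule(antecedents: Sequence[Antecedent], K: float = 50.0):
    """Build a one-hidden-layer network coding a single rule (assumed encoding).
    Hidden neuron j: sigmoid(w * x[i] + bias), about 1 when antecedent j holds.
    Output neuron: about 1 only when all hidden neurons are active (logical AND)."""
    hidden: List[Tuple[int, float, float]] = []
    for i, op, t in antecedents:
        w = K if op == ">=" else -K
        hidden.append((i, w, -w * t))
    n = len(antecedents)
    return hidden, ([K] * n, -K * (n - 0.5))


def rule_output(network, x: Sequence[float]) -> float:
    """Activation of the output neuron associated with the rule's class."""
    hidden, (v, v0) = network
    h = [sigmoid(w * x[i] + b) for i, w, b in hidden]
    return sigmoid(sum(vj * hj for vj, hj in zip(v, h)) + v0)


# Hypothetical rule: x[0] >= 0.5 and x[1] <= 2.0.
net = insert_rule([(0, ">=", 0.5), (1, "<=", 2.0)])
fires = rule_output(net, (1.0, 1.0))
silent = rule_output(net, (0.0, 1.0))
```

Because every hidden neuron depends on a single input variable, the rule antecedents remain readable directly from the weights between the input layer and the hidden layer.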
In the experiments we use 25 datasets representing classification problems of two classes. Table 1 presents their main characteristics in terms of number of samples, number of input features, type of features, and source. We have four types of inputs: Boolean; categorical; integer; and real. The public sources of the datasets are
(i) UCI: Machine Learning Repository at the University of California, Irvine: https://archive.ics.uci.edu/ml/datasets.html,
(ii) KEEL: http://sci2s.ugr.es/keel/datasets.php,
(iii) LIBSVM: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
3.1. Models and Learning Parameters
Our experiments are based on 10 repetitions of stratified 10-fold cross-validation trials. Training sets were normalized by Gaussian normalization. Specifically, the input variable averages and standard deviations calculated on a training set were used to normalize the input variables in a testing set. The following models were trained on the 25 datasets:
(i) Boosted shallow trees trained by modest boosting (BST-M)
(ii) Boosted shallow trees trained by gentle boosting (BST-G)
(iii) Boosted shallow trees trained by real boosting (BST-R)
(iv) DIMLP ensembles trained by bagging (DIMLP-B)
(v) DIMLP ensembles trained by arcing (DIMLP-A)
(vi) QSVM with dot kernel (QSVM-L)
(vii) QSVM with polynomial kernel of third degree (QSVM-P3)
(viii) QSVM with Gaussian kernel (QSVM-G).
The complexity of boosted shallow trees was controlled according to the parameter defining the number of splits for each shallow tree (cf. Section 2.3). This parameter varies from one to four. Note that when this value is equal to one we obtain decision stumps. The number of decision trees in each ensemble was fixed to 200, since very often after this value the improvement in accuracy is very small.
For DIMLP ensembles the learning parameters are(i)the learning parameter (),(ii)the momentum (),(iii)the Flat Spot Elimination (),(iv)the number of stairs in the staircase function ().
The default number of neurons in the first hidden layer is equal to the number of input neurons and the number of neurons in the second hidden layer is empirically defined in order to obtain a number of weight connections that is less than the number of training samples. Finally, the default number of DIMLPs in an ensemble is equal to 25, since it has been empirically observed that for bagging and arcing the most substantial improvement in accuracy is achieved with the first 25 networks .
For QSVMs, default learning parameters are those defined in the libSVM library (this software is available at https://www.csie.ntu.edu.tw/~cjlin/libsvm/). The number of stairs in the staircase function was set to , in order to guarantee a sufficient number of quantized levels in the input values. We used nu-SVM; note that our goal was not to optimize the predictive accuracy of the models but just to use default configurations in order to assess the accuracy and complexity of the models. With respect to all the defined models and datasets, the total number of training and rule extraction runs is equal to 42500 (= 17 model configurations × 25 datasets × 100 cross-validation trials).
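The evaluation protocol can be sketched with the standard library only. This is a minimal illustration, not the code used in the experiments: the function names, the seed, and the toy label vector are assumptions. It also recovers the total number of runs from 17 model configurations (three boosting variants with one to four splits, plus five other models), 25 datasets, and 10 × 10 trials.

```python
import random
from collections import defaultdict
from typing import Iterator, List, Sequence, Tuple


def stratified_folds(labels: Sequence[int], k: int, rng: random.Random) -> List[List[int]]:
    """Split sample indices into k folds that preserve class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)   # deal indices round-robin
    return folds


def repeated_cv(labels: Sequence[int], k: int = 10, repetitions: int = 10,
                seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training indices, testing indices) for every trial."""
    rng = random.Random(seed)
    for _ in range(repetitions):
        for test in stratified_folds(labels, k, rng):
            held_out = set(test)
            train = [i for i in range(len(labels)) if i not in held_out]
            yield train, test


labels = [0] * 60 + [1] * 40                 # toy two-class dataset
trials = list(repeated_cv(labels))
total_runs = 17 * 25 * len(trials)           # models x datasets x trials
```

In each trial, the normalization statistics would be computed on the training indices only and then applied to the testing indices, as described in Section 3.1.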
3.2. Overall Results
Figure 6 gives a general view of the logarithm of the complexity of the rulesets (y-axis) generated from the models (x-axis). Here, complexity corresponds to the total number of rule antecedents per ruleset. On the x-axis, indexes 1 to 4 indicate BST-M with the split parameter varying from 1 to 4, indexes 5 to 8 are related to BST-G, indexes 9 to 12 indicate BST-R, and finally indexes 13 to 17 correspond to DIMLP-B, DIMLP-A, QSVM-L, QSVM-P3, and QSVM-G, respectively. For each boxplot, the central mark is the median obtained by cross-validation trials and the edges of the box are the 25th and 75th percentiles.
Overall, with respect to the 25 datasets used in the experiments, the lowest median complexity is obtained by BST-M1, while the highest medians are given by BST-G3, BST-G4, BST-R3, and BST-R4. Moreover, it clearly appears that the median complexity increases as the number of splits in the shallow trees grows from one to three.
Figure 7 illustrates the average predictive accuracy of the extracted rulesets (y-axis) with respect to each model (x-axis). It is worth noting that BST-R4 and DIMLP-B reach the highest medians, with DIMLP-B obtaining a better 25th percentile.
Figure 8 shows boxplots of the average fidelity of the extracted rulesets. Qualitatively, BST-M obtains the best results with respect to median fidelity, while BST-G and BST-R give the lowest fidelity results. As a general trend in the obtained results, the lower the complexity of the extracted rulesets, the higher the fidelity, and vice versa. This observation is also illustrated in Figure 9. Specifically, with respect to the 25 classification problems used in the experiments, each point of this figure represents the average fidelity of the extracted rulesets versus the average number of antecedents per ruleset. It is worth noting that from left to right (with respect to the x-axis), the red “+” symbols indicate BST-M1, BST-M2, BST-M3, and BST-M4. Thus, ruleset complexity increases with the number of splits of the shallow trees. The same trend is visible for the triangles related to BST-Gs and BST-Rs. A linear regression based on the 17 models is also shown; it exhibits a clear trend in which fidelity decreases as the complexity of the rulesets increases.
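The regression line of Figure 9 can be thought of as an ordinary least-squares fit of average fidelity against average complexity, with one point per model. The following Python sketch shows such a fit. It assumes ordinary least squares (the fitting procedure is not detailed here), the function name `fit_line` is hypothetical, and the three points in the usage lines are placeholders, not values from this study.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Placeholder points (complexity, fidelity), NOT results from the paper.
complexity = [10.0, 20.0, 30.0]
fidelity = [99.0, 98.0, 97.0]
slope, intercept = fit_line(complexity, fidelity)
print(slope < 0)  # a negative slope means fidelity falls as complexity grows
```

A negative slope on the 17 model averages corresponds to the trend described above.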
3.3. Detailed Results
Table 2 gives for each dataset the average predictive accuracy obtained by the best model (column three), as well as the average predictive accuracy of the best extracted rulesets (column five). The difference of these average accuracies is reported in column six. The last three columns indicate the average fidelity, the average number of generated rules, and the average number of antecedents per rule, respectively. It is worth noting that the average predictive accuracy of rulesets is rarely better than the predictive accuracy provided by the best model, because the power of expression of rules is somewhat limited with respect to that of the original models. However, for many datasets, ruleset average predictive accuracy is quite close to that provided by the best model.
Results shown in Table 3 are similar to those provided by Table 2. The only difference resides in the way that the average predictive accuracy of the rulesets is measured. Specifically, here we only take into account the testing samples for which the rules and the model from which they were generated agree. In that case, the average predictive accuracy of the rules is always equal to or higher than that provided by the model. Intuitively, it means that if rules and models agree then results are more reliable.
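The measures reported in Tables 2 and 3 can be summarized with a short Python sketch. It assumes that fidelity is the fraction of testing samples on which rules and model give the same class, and that the Table 3 accuracy is computed only on those samples; the function name `evaluate` and the dictionary keys are hypothetical.

```python
def evaluate(model_preds, rule_preds, targets):
    """Compare a model and its extracted ruleset on the same testing samples."""
    n = len(targets)
    if n == 0 or len(model_preds) != n or len(rule_preds) != n:
        raise ValueError("inputs must be non-empty and of equal length")

    agree = [m == r for m, r in zip(model_preds, rule_preds)]
    # Rule predictions and targets restricted to samples where rules == model.
    agreed = [(r, t) for a, r, t in zip(agree, rule_preds, targets) if a]

    return {
        "model_accuracy": sum(m == t for m, t in zip(model_preds, targets)) / n,
        "rule_accuracy": sum(r == t for r, t in zip(rule_preds, targets)) / n,
        "fidelity": sum(agree) / n,
        # Undefined (None) when rules and model never agree.
        "accuracy_when_agree": (
            sum(r == t for r, t in agreed) / len(agreed) if agreed else None
        ),
    }
```

Samples on which rules and model disagree are excluded from `accuracy_when_agree`, so this measure covers fewer samples than `rule_accuracy`.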
The purpose of Figure 10 is to show the average difference in predictive accuracy between a model and its generated rulesets over the 25 classification problems. The lower part of this figure shows this average difference when rules and network agree.
Tables 4 and 5 present the detailed results of rulesets’ average predictive accuracy and standard deviations. Note that the classification decision was determined by the neural network model when a testing sample was not covered by any rule. Moreover, in the case of conflicting rules (i.e., rules of two different classes), the selected class is again the one determined by the model. Tables 6 and 7 show the average complexity in terms of average number of rules and average number of antecedents per ruleset. Finally, Tables 8 and 9 illustrate average fidelity results with their standard deviations.
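The decision procedure described above (fall back on the model when no rule covers a sample or when rules of different classes are activated) can be sketched in Python as follows. The rule encoding is an assumption made for illustration: a rule is a pair (antecedents, class), and each antecedent is a triple (input index, operator, threshold) with the operator being "<" or ">=". The names `rule_fires` and `classify` are hypothetical.

```python
import operator

# Assumed antecedent tests: input value below, or at least equal to, a threshold.
OPERATORS = {"<": operator.lt, ">=": operator.ge}


def rule_fires(rule, sample):
    """A rule is activated when all of its antecedents are true."""
    antecedents, _ = rule
    return all(OPERATORS[op](sample[i], thr) for i, op, thr in antecedents)


def classify(sample, rules, model_predict):
    """Return (class, source), where source is "rules" or "model"."""
    activated_classes = {cls for rule in rules
                         for _, cls in [rule] if rule_fires(rule, sample)}
    if len(activated_classes) == 1:
        # Covered by one or more rules that all predict the same class.
        return activated_classes.pop(), "rules"
    # Not covered by any rule, or conflicting rules: the model decides.
    return model_predict(sample), "model"


# Illustrative ruleset and a stand-in model (both hypothetical).
rules = [
    ([(0, "<", 0.5)], 0),
    ([(0, ">=", 0.5), (1, ">=", 2.0)], 1),
    ([(1, ">=", 5.0)], 0),
]
model = lambda sample: 1

print(classify([0.2, 1.0], rules, model))  # (0, 'rules')
print(classify([0.7, 1.0], rules, model))  # (1, 'model'), no rule covers it
print(classify([0.7, 6.0], rules, model))  # (1, 'model'), conflicting rules
```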
In Table 10 our purpose is to illustrate the impact of DIMLP ensembles with respect to single DIMLPs. We focus on average predictive accuracy and average complexity of the generated rulesets. Columns four and seven are related to single architectures. Complexity, which is given in terms of number of rules and average number of antecedents per rule, is in bold when the product of these two components is the lowest. Note that for single architectures, 10% of the samples are used to decide when to stop training (with 80% of samples used for training). With respect to single DIMLPs, bagging tends to reduce average complexity of the generated rulesets, since in 22 problems out of 25 it was lower. Conversely, for DIMLP ensembles trained by arcing, average complexity was higher in 20 problems. Finally, average predictive accuracy of rulesets produced by ensembles was higher than or equal to that provided by single DIMLPs in 22 problems out of 25.
3.4. Related Work
Among several published works on the knowledge extracted from ensembles, very few are based on cross-validation trials. Table 11 presents rule extraction results with respect to the Breast Cancer classification problem. Only the last two rows concern rule extraction from ensembles. Note that a fair comparison for the complexity of the extracted rules is difficult, since some techniques such as Re-RX generate ordered rules, while DIMLP-B extracts unordered rules. For the predictive accuracy, DIMLP-B obtains the highest average.
With the use of G-REX, a genetic programming technique, Johansson presented a number of results on the extraction of decision trees from ensembles of 20 neural networks, based on one repetition of 10-fold cross-validation. Table 12 presents these results, with columns three and four depicting the results provided by Trepan, which is a general technique for knowledge extraction. Our results with DIMLP-Bs (based on 10 repetitions of stratified 10-fold cross-validation) are shown in the last three columns. Average fidelity of DIMLP-Bs is always greater than that obtained by G-REX and Trepan (it is considerably higher in five of the classification problems). With the exception of one classification problem, the average predictive accuracy values of our models and rulesets are slightly greater than those of G-REX and Trepan.
In  rule extraction from SVMs is reported based on ten repetitions of stratified tenfold cross-validation. Table 13 illustrates the comparison with our results obtained by QSVMs. Note that the average number of antecedents is not reported, because their number in  is equal to the number of inputs. Thus, we generate less complex rulesets, on average, while our predictive accuracy is better or very close. Finally, we obtain better average fidelity.
SVMs are very often used as single models, because with boosting they tend to overfit the data. Shallow trees are weak learners; thus they have to be trained in ensembles. For DIMLPs, we observed that when they are trained by bagging, the complexity of the extracted rulesets tends to be a bit lower than that of rulesets produced by a single network 22 times out of 25. In contrast, ensembles trained by arcing show increased complexity in the extracted rulesets 20 times out of 25. Concerning the impact of model architecture, from this work it turned out that for boosted decision trees when the number of splits is increased, then the extracted rulesets tend to be more complex, on average (see Figure 9 with BST-M, BST-G, and BST-R with the number of splits in a decision tree varying from 1 to 4).
With respect to rulesets, the lower the fidelity, the higher the complexity. Conversely, the higher the fidelity, the lower the complexity. Since the highest average predictive accuracy is in some cases provided by the most complex rulesets, we also have a clear trade-off between accuracy and complexity. Another compromise to take into account is the proportion of covered samples with respect to predictive accuracy. Specifically, Table 2 showed that very often the average predictive accuracy of rulesets is lower than that of the models from which they are generated. In case of disagreement between rules and models, if rules are ignored, more samples are left without explanation, but the remaining rules will have better predictive accuracy, on average (cf. Table 3).
Let us suppose that a physician is in a realistic situation in which a patient diagnosis is provided by an ensemble of DIMLPs. If the patient symptoms (i.e., the inputs) are not covered by any rule, the physician cannot explain the response given by the neural ensemble. Hence, a first possibility would be to perform rule extraction again by including the new patient data. However, this solution has two drawbacks. The first is the duration of rule extraction, which is short for all the datasets used in this work but would be prohibitive with big data. The second drawback is that, after reextraction of the rules, the new ruleset could have changed considerably, and so it could take time for the physician to understand it.
To minimize the number of times a new sample remains unexplained, we can increase fidelity. The basic idea consists of aggregating the rules extracted from several models. With the use of unordered rules representing single pieces of knowledge, even if their number is greater than that obtained with a single model, their comprehension could be possible in a reasonable amount of time. In the next experiment we consider combinations of five models (out of 17) by majority voting, even if the number of extracted rules roughly increases by a factor of five. When rules of different classes are activated, we ignore the rules whose class differs from the majority voting response (this corresponds to the first strategy in Section 2.1.4). This approach was applied to 10 classification problems. Table 14 shows the obtained results for all the possible combinations of five aggregated rulesets, which number 6188. The second column gives, over the 6188 possible combinations, the mean of the rulesets’ average predictive accuracy (with the standard deviation). Columns three and four show the minimal and maximal rulesets’ predictive accuracy, and the last column is the mean of the average fidelity. It is worth noting that this last value is always above 99.6% and that the mean of the average ruleset accuracy is greater than the best corresponding values shown in Table 2 (fifth column).
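The number of combinations and the aggregation strategy can be illustrated with a short Python sketch. The figure 6188 is the number of ways to choose 5 models out of 17. The representation of an activated rule as a pair (rule identifier, class) and the names `majority_vote` and `explain` are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations
from math import comb


def majority_vote(votes):
    """Most frequent class among the votes (no ties with 5 voters, 2 classes)."""
    return Counter(votes).most_common(1)[0][0]


def explain(model_votes, activated_rules):
    """Keep only the activated rules that agree with the majority vote.

    activated_rules: (rule_id, class) pairs gathered from the five rulesets.
    """
    winner = majority_vote(model_votes)
    kept = [rule for rule in activated_rules if rule[1] == winner]
    return winner, kept


# All subsets of five models out of 17.
print(comb(17, 5))  # 6188
print(sum(1 for _ in combinations(range(17), 5)))  # 6188

# Hypothetical sample: three of the five models vote for class 1.
print(explain([1, 1, 0, 1, 0], [("r1", 1), ("r2", 0), ("r3", 1)]))
# (1, [('r1', 1), ('r3', 1)])
```

A sample remains unexplained only when none of the five rulesets contains an activated rule of the winning class, which is why aggregation tends to raise fidelity at the cost of a larger rule base.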
In this work, the DIMLP model was used to extract unordered rules from ensembles of DIMLPs, boosted shallow trees, and Support Vector Machines. Experiments were performed on 25 datasets by 10 repetitions of 10-fold cross-validation. We measured the predictive accuracy of the generated rulesets, their complexity, and their fidelity. For the 17 classifiers used in this study, we emphasized a strong relationship between average complexity and average fidelity of the extracted rulesets. As a result, we obtained a spectrum of models showing a clear trade-off between fidelity and complexity. At one end lie the decision stumps trained by modest Adaboost, for which the least complex rulesets are generated, with also the best fidelity, on average. At the other end lie models with the highest complexity and lowest fidelity, corresponding to BSTs trained by real Adaboost and gentle Adaboost. The average complexity of rulesets produced by BSTs increases with the number of splitting nodes.
Another trade-off is between the covering of testing samples by rules and predictive accuracy. We clearly pointed out that the average predictive accuracy is better when we ignore the test samples for which models and rules disagree. Intuitively, this can be explained by the fact that when models and rules disagree the classification is somewhat more uncertain. By aggregating the responses of several models it was possible to increase both fidelity and predictive accuracy. Nevertheless, this also increased complexity.
Very few works systematically assessed symbolic rules generated from connectionist models by cross-validation. Hence, our work could be useful in the future to researchers who would like to compare their results. So far, the comparison with a work in which rules were extracted from MLP ensembles was in our favour for both fidelity and predictive accuracy in eight out of nine classification problems. Moreover, with respect to two datasets from which rules were generated from SVMs we obtained better fidelity, with predictive accuracy being greater in one of the problems and slightly worse in the other. Lastly, we would like to encourage researchers to perform systematic experiments by 10-fold cross-validation to assess their rule extraction algorithms applied to neural networks.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
References
M. Golea, On the complexity of rule extraction from neural networks and network querying. In Rule Extraction From Trained Artificial Neural Networks Workshop, Society For the Study of Artificial Intelligence and Simulation of Behavior Workshop Series (AISB), pages 51–59, 1996.
G. Bologna, “FDIMLP: A new neuro-fuzzy model,” in Proceedings of the International Joint Conference on Neural Networks (IJCNN'01), vol. 2, pp. 1328–1333, USA, July 2001.
A. Van Assche and H. Blockeel, “Seeing the Forest Through the Trees: Learning a Comprehensible Model from an Ensemble,” in Machine Learning: ECML 2007, vol. 4701 of Lecture Notes in Computer Science, pp. 418–429, Springer Berlin Heidelberg, Berlin, Heidelberg, 2007.
W. W. Cohen, “Fast effective rule induction,” in Proceedings of the Twelfth International Conference on Machine Learning, pp. 115–123, 1995.
G. Bologna and C. Pellegrini, “Three medical examples in neural network rule extraction,” Physica Medica, vol. 13, no. 1, pp. 183–187, 1997.
M. Craven and J. W. Shavlik, “Extracting tree-structured representations of trained networks,” in Advances in Neural Information Processing Systems, pp. 24–30, 1996.
U. Johansson, Obtaining accurate and comprehensible data mining models: An evolutionary approach. Linköping University, Department of Computer and Information Science, 2007.
J. Diederich, Rule Extraction from Support Vector Machines, vol. 80, Springer Science & Business Media, 2008.
Z.-H. Zhou, Y. Jiang, and S.-F. Chen, “Extracting symbolic rules from trained neural network ensembles,” Artificial Intelligence Communications, vol. 16, no. 1, p. 16, 2003.
A. Hara and Y. Hayashi, “Ensemble neural network rule extraction using Re-RX algorithm,” in Proceedings of the 2012 Annual International Joint Conference on Neural Networks, IJCNN 2012, Part of the 2012 IEEE World Congress on Computational Intelligence, WCCI 2012, Australia, June 2012.
M. Gashler, C. Giraud-Carrier, and T. Martinez, “Decision Tree Ensemble: Small Heterogeneous Is Better Than Large Homogeneous,” in Proceedings of the 2008 Seventh International Conference on Machine Learning and Applications, pp. 900–905, San Diego, CA, USA, December 2008.
L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, CRC press, 1984.
N. Barakat and J. Diederich, “Learning-based rule-extraction from support vector machines,” in Proceedings of the 14th International Conference on Computer Theory and applications ICCTA'2004, 2004.
N. Barakat and J. Diederich, “Eclectic rule-extraction from support vector machines,” International Journal of Computational Intelligence, vol. 2, no. 1, pp. 59–62, 2005.
X. Fu, C. Ong, S. Keerthi, G. G. Hung, and L. Goh, “Extracting the knowledge embedded in support vector machines,” in Proceedings of the International Joint Conference on Neural Networks, pp. 291–296, IEEE, 2004.
H. Núñez, C. Angulo, and A. Català, “Rule extraction from support vector machines,” Esann, pp. 107–112, 2002.
L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
L. Breiman, Bias, variance, and arcing classifiers (technical report 460). Statistics Department, University of California, 1996.
V. N. Vapnik, Statistical Learning Theory, Adaptive and Learning Systems for Signal Processing, Communications, and Control, Wiley-Interscience, New York, NY, USA, 1998.
R. E. Schapire, “A brief introduction to boosting,” in Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI '99), pp. 1401–1406, Stockholm, Sweden, August 1999.
A. Vezhnevets and V. Vezhnevets, “Modest adaboost-teaching adaboost to generalize better,” in Proceedings of the 15th International Conference on Computer Graphics and Vision, GraphiCon 2005, vol. 12, pp. 987–997, Computer Graphics in Russia, June 2005.
Y. Freund and R. E. Schapire, “A desicion-theoretic generalization of on-line learning and an application to boosting,” in Proceedings of the European Conference on Computational Learning Theory, pp. 23–37, Springer, 1995.
M. Lichman, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, 2013.
J. Alcalá-Fdez, A. Fernández, J. Luengo et al., “Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework,” Journal of Multiple-Valued Logic and Soft Computing, vol. 17, pp. 255–287, 2011.