Applied Computational Intelligence and Soft Computing

Volume 2018, Article ID 4084850, 20 pages

https://doi.org/10.1155/2018/4084850

## A Comparison Study on Rule Extraction from Neural Network Ensembles, Boosted Shallow Trees, and SVMs

^{1}Department of Computer Science, University of Applied Sciences and Arts Western Switzerland, Rue de la Prairie 4, 1202 Geneva, Switzerland^{2}Department of Computer Science, University of Geneva, Route de Drize 7, 1227 Carouge, Switzerland^{3}Department of Computer Science, Meiji University, Tama-ku, Kawasaki, Kanagawa 214-8571, Japan

Correspondence should be addressed to Guido Bologna; hc.egseh@angolob.odiug

Received 27 July 2017; Revised 17 November 2017; Accepted 4 December 2017; Published 9 January 2018

Academic Editor: Erich Peter Klement

Copyright © 2018 Guido Bologna and Yoichi Hayashi. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

One way to make the knowledge stored in an artificial neural network more intelligible is to extract symbolic rules. However, producing rules from Multilayer Perceptrons (MLPs) is an NP-hard problem. Many techniques have been introduced to generate rules from single neural networks, but very few were proposed for ensembles. Moreover, experiments were rarely assessed by 10-fold cross-validation trials. In this work, based on the Discretized Interpretable Multilayer Perceptron (DIMLP), experiments were performed on 10 repetitions of stratified 10-fold cross-validation trials over 25 binary classification problems. The DIMLP architecture allowed us to produce rules from DIMLP ensembles, boosted shallow trees (BSTs), and Support Vector Machines (SVM). The complexity of rulesets was measured with the average number of generated rules and average number of antecedents per rule. From the 25 used classification problems, the most complex rulesets were generated from BSTs trained by “gentle boosting” and “real boosting.” Moreover, we clearly observed that the less complex the rules were, the better their fidelity was. In fact, rules generated from decision stumps trained by modest boosting were, for almost all the 25 datasets, the simplest with the highest fidelity. Finally, in terms of average predictive accuracy and average ruleset complexity, the comparison of some of our results to those reported in the literature proved to be competitive.

#### 1. Introduction

The explanation of neural network responses is essential for their acceptance. As an example, physicians cannot trust any model without any form of enlightenment. An intuitive way to give insight into the knowledge embedded within neural network connections and neuron activation is to extract symbolic rules. However, producing rules from Multilayer Perceptrons (MLPs) is an NP-hard problem [1].

In the context of classification, the format of a symbolic rule is given as follows: “if tests on antecedents are true then class ,” where “tests on antecedents” are in the form or , with as an input variable and as a real number. Class designates a class among several possible classes. The definition of the complexity of the extracted rules is often described with two parameters: number of rules and number of antecedents per rule. Rulesets of low complexity are preferred compared to those with high complexity, since at first sight fewer rules and fewer antecedents are better understood. Another reason of preference is that rule bases with lower complexity also reduce the risk of overfitting on new data. Nevertheless, Freitas clarified that the comprehensibility of rules is not necessarily related to a small number of rules [2]. He proposed a new measure denoted as* prediction-explanation size*, which strongly depends on the average number of antecedents per rule. Another measure of rule transparency is consistency. Specifically, an extracted ruleset is deemed to be consistent if, under different training sessions, the rule extraction algorithm produces rulesets which classify samples into the same classes. Finally, a rule is redundant if it conveys the same information or less general information than the information conveyed by another rule.

An important characteristic of rulesets is whether they are ordered or not. Ordered rules correspond to the following: *if tests on antecedents are true then …,* *else if tests on antecedents are true then …,* *…,* *else …*

In unordered rules “else if” is replaced again by “if tests on antecedents are true then conclusion.” Thus, a sample can activate more than a rule. Long ordered rulesets are difficult to understand since they potentially include many implicit antecedents; specifically, those negated by “else if.” Generally, unordered rulesets present more rules and antecedents than ordered ones, since all rule antecedents are explicitly provided, thus being more transparent than ordered rulesets. Each rule of an unordered ruleset represents a single piece of knowledge that can be examined in isolation, since all antecedents are explicitly given. With a great number of unordered rules, one would try to accurately understand the meaning of each rule with respect to the data domain. Getting the global picture could take a long time; nevertheless, one could be interested only in some parts of the whole knowledge, for instance, those rules with the highest number of covered samples.

The* Discretized Interpretable Multilayer Perceptron* (DIMLP) represents a special feedforward neural network architecture from which crisp symbolic rules are extracted in polynomial time [3]. This particular Multilayer Perceptron (MLP) model can be used to learn any classification problem, and rule extraction is also performed for DIMLP ensembles. Furthermore, special DIMLP architectures were also defined to produce fuzzy rules [4].

Decision trees are widely used in Machine Learning. They represent transparent models because symbolic rules are easily extracted. However, when they are combined in an ensemble rule, extraction becomes harder [5]. Here, we propose generating rules from ensembles of shallow decision trees with the help of DIMLP ensembles. In practical terms, each rule extracted from a tree is inserted into a single DIMLP network; then, all the rules generated from a tree ensemble are represented by a DIMLP ensemble. Finally, rule extraction is performed to obtain a ruleset representing the knowledge embedded within the decision tree ensemble. Because of the* No Free Lunch Theorem* no model is better than any other, in general [6]. Hence, if a connectionist model is more accurate than a direct rule learner such as RIPPER [7], then it is worth extracting rules to understand the classifications, even if this involves extra computing time.

Authors who generated rules from single neural networks or Support Vector Machines (SVMs), very rarely assessed their techniques by tenfold cross-validation. Our experiments are based on ten repetitions of stratified tenfold cross-validation trials over 25 binary classification problems. Note that the total number of training trials is equal to 42500. Moreover, we compare the complexity of the rules generated from DIMLP ensembles, boosted shallow trees (BST), and SVMs. For SVMs we define the Quantized Support Vector Machine (QSVM), which is a DIMLP architecture trained by an SVM learning algorithm [16]. Our purpose is not to determine which model is the best for these classification problems, but to characterize the complexity of the rules produced by the models. Our results could serve as a basis for researchers who would like to compare their rule extraction techniques applied to connectionist models by 10-fold cross-validation. In the following sections we present the DIMLP model that allows us to produce rules from BSTs and SVMs and then the experiments, followed by the conclusion.

##### 1.1. State of the Art

Since the earliest work of Gallant on rule extraction from neural networks [17], many techniques have been introduced. In the 1990s, Andrews et al. introduced a taxonomy aiming at characterizing rule extraction techniques [18]. Essentially, rule extraction algorithms belong to three categories:* decompositional*;* pedagogical*; and* eclectic*. In decompositional techniques, rules are extracted at the level of hidden and output neurons by analyzing weight values. Here, a basic requirement is that the computed output from each hidden and output unit must be mapped into a binary outcome which corresponds to the notion of a rule consequent. The basic idea of the pedagogical approach is to view rule extraction as a learning task where the target concept is the function computed by the network and the input attributes are simply the network’s input neurons. Weight values are not taken into account in this category of techniques. Finally, the eclectic approach takes into account elements of both decompositional and pedagogical techniques. A few years later, Duch et al. published a survey article on this topic [9]. More recently, Diederich published a book on techniques to extract symbolic rules from Support Vector Machines (SVMs) [19] and Barakat and Bradley reviewed a number of rule extraction techniques applied to SVMs [20].

###### 1.1.1. Rule Extraction from Neural Network Ensembles

Many rule extraction techniques from single neural networks have been introduced, but only a few authors have started to extract rules from neural network ensembles. Bologna proposed the Discretized Interpretable Multilayer Perceptron (DIMLP) to generate unordered symbolic rules from both single networks and ensembles [21, 22]. With the DIMLP architecture rule extraction is performed by determining the precise location of axis-parallel discriminative hyperplanes. Zhou et al. introduced the REFNE (Rule Extraction from Neural Network Ensemble) algorithm [23], which utilizes the trained ensembles to generate instances, and then extracted symbolic rules from those instances. Attributes are discretized during rule extraction and it also uses particular fidelity evaluation mechanisms. Moreover, rules have been limited to only three antecedents. For Johansson, rule extraction from ensembles is an optimization problem in which a trade-off between accuracy and comprehensibility must be taken into account [14]. He used a genetic programming technique to produce rules from ensembles of 20 neural networks. Ao and Palade extracted rules from ensembles of Elman networks and SVMs by means of a pedagogical approach to predict gene expression in microarray data [24]. More recently Hara and Hayashi proposed the two-MLP ensembles by using the “Recursive-Rule eXtraction” (Re-RX) algorithm [25] for data with mixed attributes [26]. Re-RX utilizes C4.5 decision trees and backpropagation to train MLPs recursively. Here, the rule antecedents for discrete attributes are disjointed from those for continuous attributes. Subsequently, Hayashi at al. presented the “three-MLP Ensemble” by the Re-RX algorithm [27].

###### 1.1.2. Rule Extraction from Ensembles of Decision Trees

Basically, rule extraction techniques applied to ensembles of decision trees belong to two distinguished groups. In the first, the purpose is to reduce the number of decision trees by increasing their diversity. Techniques for the optimization of diversity are reported in [28]; as an example Gashler et al. improved the ensemble diversity by combining different decision trees algorithms [29].

Techniques in the second group concentrate on the rules extracted during the ensemble construction. A well-known representative technique in this group is* RuleFit* [30]. The base learners are rules extracted from a large number of CART decision trees [31]. Specifically, these trees are trained on random subsets of the learning set, the main idea being to define a linear function including rules and features that approximates the whole ensemble of decision trees. At the end of the process this linear function represents a regularized regression of the ensemble responses with a large number of coefficients equal to zero.* Node Harvest* is another rule-based representative technique [32]. Its purpose is to find suitable weights for rules by performing a minimization on a quadratic program with linear inequality constraints. Finally, in [33], the rule extraction problem is viewed as a regression problem using the sparse group lasso method [34], such that each rule is assumed to be a feature, where the aim is to predict the response. Subsequently, most of the rules are removed by trying to keep accuracy and fidelity as high as possible.

###### 1.1.3. Rule Extraction from Support Vector Machines

To produce rules from SVMs, a number of techniques applied a pedagogical approach [35–38]. As a first step, training samples are relabeled according to the target class provided by the SVM. Then, the new dataset is learned by a transparent model, such as decision trees, which approximately learn what the SVM has learned. As a variant, only a subset of the training samples are used as the new dataset: the support vectors [39]. Before the training of a decision tree algorithm, Martens at al. generate additional learning examples close to randomly selected support vectors [38]. In another technique, Barakat and Bradley generate rules from a subset of the support vectors using a modified covering algorithm, which refines a set of initial rules determined by the most discriminative features [40].

Fu et al. proposed a method aiming at determining hyperrectangles whose upper and lower corners are defined by determining the intersection of each of the support vectors with the separating hyperplane [41]. This is achieved by solving an optimization problem depending on the Gaussian kernel. Núñez et al. determined prototype vectors for each class [15, 42]. With the use of the support vectors, these prototypes are translated into ellipsoids or hyperrectangles. An iterative process is defined in order to divide ellipsoids or hyperrectangles into more regions, depending on the presence of outliers and the SVM decision boundary. Similarly, Zhang et al. introduced a clustering algorithm to define prototypes from the support vectors [43]. Then, small hyperrectangles are defined around these prototypes and progressively grown until a stopping criterion is met. Note that for these two last methods the comprehensibility of the rules is low, since all input features are present in the rule antecedents.

#### 2. Material and Methods

In this section we present the models used in this work, which are DIMLP ensembles, Quantized Support Vector Machines, and shallow boosted trees. The rule extraction process of the last two models has been made possible by transforming them into particular DIMLP architectures.

##### 2.1. The DIMLP Model

DIMLP differs from MLP in the connectivity between the input layer and the first hidden layer. Specifically, any hidden neuron receives only a connection from an input neuron and the bias neuron, as shown in Figure 1. After the first hidden layer, neurons are fully connected. Note that very often DIMLPs are defined with two hidden layers, the number of neurons in the first hidden layer being equal to the number of input neurons.