Complexity

Volume 2018, Article ID 1591878, 12 pages

https://doi.org/10.1155/2018/1591878

## Learning the Structure of Bayesian Networks: A Quantitative Assessment of the Effect of Different Algorithmic Schemes

^{1}DISCo, Universitá degli Studi di Milano-Bicocca, 20126 Milano, Italy^{2}NOVA Information Management School (NOVA IMS), Universidade Nova de Lisboa, Campus de Campolide, 1070-312 Lisboa, Portugal^{3}INESC Coimbra, DEEC, University of Coimbra, Coimbra, Portugal^{4}Department of Pathology, Stanford University, Stanford, California, USA

Correspondence should be addressed to Daniele Ramazzotti; ude.drofnats@ittozzamar.eleinad

Received 7 June 2018; Revised 1 August 2018; Accepted 7 August 2018; Published 12 September 2018

Academic Editor: Eulalia Martínez

Copyright © 2018 Stefano Beretta et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

One of the most challenging tasks when adopting Bayesian networks (BNs) is the one of learning their structure from data. This task is complicated by the huge search space of possible solutions and by the fact that the problem is *NP*-hard. Hence, a full enumeration of all the possible solutions is not always feasible and approximations are often required. However, to the best of our knowledge, a quantitative analysis of the performance and characteristics of the different heuristics to solve this problem has never been done before. For this reason, in this work, we provide a detailed comparison of many different state-of-the-art methods for structural learning on simulated data considering both BNs with discrete and continuous variables and with different rates of noise in the data. In particular, we investigate the performance of different widespread scores and algorithmic approaches proposed for the inference and the statistical pitfalls within them.

#### 1. Introduction

Bayesian networks (BNs) have been applied to several different fields, ranging from the water resource management [1] to the discovery of gene regulatory networks [2, 3]. The task of learning a BN can be divided into two subtasks: (1) structural learning, i.e., identification of the topology of the BN, and (2) parametric learning, i.e., estimation of the numerical parameters (conditional probabilities) for a given network topology. In particular, the most challenging task of the two is the one of learning the structure of a BN. Different methods have been proposed to face this problem, and they can be classified into two categories [4–6]: (1) methods based on detecting conditional independences, also known as constraint-based methods, and (2) score + search methods, also known as score-based approaches. It must be noticed that hybrid methods have also been proposed in [7] but, for the sake of clarity, here, we limit our discussion to the two mainstream approaches to tackle the task.

As discussed in [8], the input of the former algorithms is a set of conditional independence relations between subsets of variables, which are used to build a BN that represents a large percentage (and, whenever possible, all) of these relations [9]. However, the number of conditional independence tests that such methods should perform is exponential and, thus, approximation techniques are required.

Although constraint-based learning is an interesting approach, as it is close to the semantic of BNs, most of the developed structure learning algorithms fall into the score-based method category, given the possibility of formulating such a task in terms of an optimization problem. As the name implies, these methods have two major components: (1) a scoring metric that measures the quality of every candidate BN with respect to a dataset and (2) a search procedure to intelligently move through the space of possible networks, as this space is enormous. More in detail, as shown in [10, 11], searching this space for the optimal structure is an *NP*-hard problem, even when the maximum number of parents for each node is constrained.

Hence, regardless of the strategy to learn the structure, one wishes to pursue, greedy local search techniques and heuristic search approaches need to be adopted to tackle the problem of inferring the structure of BNs. However, to the best of our knowledge, a quantitative analysis of the performance of these different techniques has never been done. For this reason, here, we provide a detailed assessment of the performance of different state-of-the-art methods for structural learning on simulated data, considering both BNs with discrete and continuous variables and with different rates of noise in the data. More precisely, we investigate the characteristics of different scores proposed for the inference and the statistical pitfalls within both the constraint-based and score-based techniques. Furthermore, we study the impact of various heuristic search approaches, namely, hill climbing, tabu search, and genetic algorithms.

We notice that, although we here aim to cover some of the main ideas in the area of structure learning of BNs, several interesting topics are beyond the scope of this work [12]. In particular, we refer to a more general formulation of the problem [6], and we do not consider, for example, contexts where it is possible to exploit prior knowledge in order to make the tasks computationally affordable [13]. At the same time, in this work, we do not investigate in detail performance related to causal interpretations of the BNs [14].

This work extends previous investigation performed in the same area [15, 16] and is structured as follows. In the next two sections, we provide a background on both Bayesian networks and heuristic search techniques (see Section 2 and Section 3). In Section 4, we describe the results of our study, and in Section 5, we conclude the paper.

#### 2. Bayesian Networks

BNs are graphical models representing the *joint distribution* over a set of random variables through a *direct acyclic graph* (DAG) , where the nodes in represent the variables, and the arcs encode any statistical dependence among them. Similarly, the lack of arcs among variables subsumes statistical independence. In this DAG, the set of variables with an arc toward a given node is its set of “parents.” Formally, a Bayesian network [6] is defined as a pair over the variable , with arcs and real-valued parameter . When the structure of a BN is known, it is possible to compute the joint distribution of all the variables as the product of the conditional distributions on each variable given its parents.
where is a *probability density function*.

However, even if we consider the simplest case of binary variables, the number of parameters in each conditional probability table is still exponential in size. For example, in the case of binary variables, the total number of parameters needed to compute the full joint distribution is of size , where is the cardinality of the parent set of each node. Notice that, if a node does not have parents, the total number of parameters to be computed is 1, which corresponds to its marginal probability.

Moreover, the usage of the symmetrical notion of conditional dependence poses further limitations to the task of learning the structure of a BN: two arcs and in a network can in fact equivalently denote dependence between variables and ; this leads to the fact that two DAGs having a different structure can sometimes model an identical set of independence and conditional independence relations (*I-equivalence*). This yields to the concept of *Markov equivalence class* as a *partially directed acyclic graph* where the arcs that can take either orientation are left undirected [6]. In this case, all the structures within the Markov equivalence class are equivalently “good” in representing the data, unless a causal interpretation of the BN is given [17].

In the literature, there are two families of methods to learn the structure of a BN from data. The idea behind the first group of methods is the one of *learning the conditional independence relations* of the BN from which, in turn, the network is learned. These methods are often referred to as *constraint-based approaches*. The second group of methods, the so-called *score-based approaches*, formulates the task of structure learning as an optimization problem, with scores aimed at *maximizing the likelihood of the data* given the model. However, both the approaches are known to lead to *NP-*hard formulations and, because of this, heuristic methods need to be used to find near optimal solutions with high probability, in a reasonably small number of iterations.

##### 2.1. Constraint-Based Approaches

We now briefly describe the main idea behind this class of approaches. For a detailed discussion of this topic, we refer to [9, 18].

This class of methods is aimed at building a graph structure to reflect the dependence and independence relations in the data that match the empirical distribution. Notwithstanding, the number of conditional independence tests that such algorithms would have to perform among any pair of nodes to test all possible relations is exponential and, because of this, the introduction of some approximations is required.

We now provide some details on two constraint-based algorithms that have been proven to be efficient and of widespread usage: the *PC algorithm* [9] and the *Incremental Association Markov Blanket* (IAMB) [18].

###### 2.1.1. The PC Algorithm

This algorithm [9] starts with a fully connected graph and, on the basis of pairwise independence tests, it iteratively removes all the extraneous edges. To avoid an exhaustive search of separating sets, the edges are ordered to consider the correct ones early in the search. Once a separating set is found, the search for that pair ends. The PC algorithm orders the separating sets by increasing values of size , starting from 0 (the empty set) until (where is the number of variables). The algorithm stops when every variable has less than neighbors since it can be proven that all valid sets must have already been chosen. During the computation, the bigger the value of is, the larger the number of separating sets must be considered. However, as gets big, the number of nodes with degree or higher must have dwindled considerably. Thus, in practice, we only need to consider a small subset of all the possible separating sets.

###### 2.1.2. Incremental Association Markov Blanket Algorithm

Another constraint-based learning algorithm uses the Markov blankets [6] to restrict the subset of variables to test for independence. Thus, when this knowledge is available in advance, we do not need to test a conditioning on all possible variables. A widely used and efficient algorithm for Markov blanket discovery is IAMB which, for each variable , keeps track of a hypothesis set , which is the set of nodes that may be parents of . The goal is, for a given , to obtain at the end of the algorithm, a Markov blanket of equals to . IAMB consists of a forward and a backward phase. During the forward phase, it adds all the possible variables into that could be in , while in the backward phase, it removes all the false positive variables from the hypothesis set, leaving the true . The forward phase begins with an empty for each . Then, iteratively, variables with a strong association with (conditioned on all the variables in ) are added to the hypothesis set. This association can be measured by a variety of nonnegative functions, such as *mutual information*. As grows large enough to include , the other variables in the network will have very little association with , conditioned on . At this point, the forward phase is complete. The backward phase starts with that contains and false positives, which will have small conditional associations, while true positives will associate strongly. Using this test, the backward phase is able to iteratively remove the false positives, until all but the true negatives are eliminated.

##### 2.2. Score-Based Approaches

These approaches is aimed at maximizing the likelihood of a set of observed data , which can be computed as the product of the probability of each observation. Since we want to infer a model that best explains the observed data, we define the likelihood of observing the data given a specific model as

Practically, however, for any arbitrary set of data, the most likely graph is always the fully connected one, since adding an edge can only increase the likelihood of the data, i.e., this approach overfits the data. To overcome this limitation, the likelihood score is almost always combined with a *regularization term* that penalizes the complexity of the model in favor of sparser solutions [6].

As already mentioned, such an optimization problem leads to intractability, due to the enormous search space of the valid solutions. Because of this, the optimization task is often solved with heuristic techniques. Before moving on to describe the main heuristic methods employed to face such complexity (see Section 3), we now give a short description of a particularly relevant and known score, called *Bayesian Information Criterion* (BIC) [19], as an example of scoring function adopted by several the score-based methods.

###### 2.2.1. The Bayesian Information Criterion

BIC uses a score that consists of a log-likelihood term and a regularization term that depend on a model and data : where denotes the data, denotes the number of samples, and denotes the number of parameters in the model. We recall that in this formulation, the BIC score should be maximized. Since, in general, depends on the number of parents of each node, it is a good metric for model complexity. Moreover, each edge added to increases the complexity of the model. Thus, the regularization term based on favors graphs with fewer edges and, more specifically, fewer parents for each node. The term log essentially weighs the regularization term. The effect is that the higher the weight, the more sparsity will be favored over “explaining” the data through maximum likelihood.

Notice that the likelihood is implicitly weighted by the number of data points since each point contributes to the score. As the sample size increases, both the weight of the regularization term and the “weight” of the likelihood increase. However, the weight of the likelihood increases faster than that of the regularization term. This means that, with more data, the likelihood will contribute more to the score, and we may trust our observations more and have less need for regularization. Statistically speaking, BIC is a *consistent score* [6]. Consequently, contains the same independence relations as those implied by the true structure.

#### 3. Heuristic Search Techniques

We now describe some of the main state-of-the-art search strategies that we took into account in this work. In particular, as stated in Section 1, we considered the following search methods: hill climbing, tabu search, and genetic algorithms.

##### 3.1. Hill Climbing

Hill climbing (HC) is one of the simplest iterative techniques that have been proposed for solving optimization problems. While HC consists of a simple and intuitive sequence of steps, it is a good search scheme to be used as a baseline for comparing the performance of more advanced optimization techniques.

Hill climbing shares with other techniques (like simulated annealing [20] and tabu search [21]) the concept of neighborhood. Search methods based on this latter concept are iterative procedures in which a neighborhood is defined for each feasible solution , and the next solution is searched among the solutions in . Hence, the neighborhood is a function that assigns to each solution in the search space a (nonempty) subset of .

The sequence of steps of the hill climbing algorithm, for a maximization problem w.r.t. a given objective function , are the following: (1)Choose an initial solution in (2)Find the best solution in (i.e., the solution such that for every in )(3)If , then stop; else, set and go to step 2

As it is clear from the aforementioned algorithm, hill climbing returns a solution that is a local maximum for the problem at hand. This local maximum does not generally correspond to the global minimum for the problem under exam, that is, hill climbing does not guarantee to return the best possible solution for a given problem. To counteract this limitation, more advanced neighborhood search methods have been defined. One of these methods is tabu search, an optimization technique that uses the concept of “memory.”

##### 3.2. Tabu Search

Tabu search (TS) is a metaheuristic that guides a local heuristic search procedure to explore the solution space beyond local optimality. One of the main components of this method is the use of an adaptive memory, which creates a more flexible search behavior. Memory-based strategies are therefore the main feature of TS approaches, founded on a quest for “integrating principles,” by which alternative forms of memory are appropriately combined with effective strategies for exploiting them.

*Tabus* are one of the distinctive elements of TS when compared to hill climbing or other local search methods. The main idea in considering *tabus* is to prevent cycling when moving away from local optima through nonimproving moves. When this situation occurs, something needs to be done to prevent the search from tracing back its steps to where it came from. This is achieved by declaring *tabu* (disallowing) moves that reverse the effect of recent moves. For instance, let us consider a problem where solutions are binary strings of a prefixed length, and the neighborhood of a solution consists of the solutions that can be obtained from by flipping only one of its bits. In this scenario, if a solution has been obtained from a solution by changing one bit , it is possible to declare a *tabu* to avoid to flip back the same bit of for a certain number of iterations (this number is called the tabu tenure of the move). *Tabus* are also useful to help in moving the search away from previously visited portions of the search space and, thus, perform more extensive exploration.

As reported in [22], tabus are stored in a short-term memory of the search (the tabu list) and usually, only a fixed limited quantity of information is recorded. It is possible to store complete solutions, but this has a negative impact on the computational time required to check whether a move is a tabu or not and, moreover, it requires a vast amount of space. The second option (which is the one commonly used) involves recording the last few transformations performed to obtain the current solution and prohibiting reverse transformations.

While tabus represent the main distinguished feature of TS, this feature can introduce other issues in the search process. In particular, the use of tabus can prohibit attractive moves, or it may lead to an overall stagnation of the search process, with a lot of moves that are not allowed. Hence, several methods for revoking a tabu have been defined [23], and they are commonly known as aspiration criteria. The simplest and most commonly used aspiration criterion consists of allowing a move (even if it is tabu) if it results in a solution with an objective value better than that of the current best-known solution (since the new solution has obviously not been previously visited).

The basic TS algorithm, considering the maximization of the objective function , works as follows:
(1)Randomly select an initial solution in the search space , and set and , where is the best solution so far, and the iteration counter(2)Set and generate the subset of the *admissible* neighborhood solutions of (i.e., nontabu or allowed by aspiration)(3)Choose the best in and set (4)If , then set (5)Update the *tabu* and the aspiration conditions(6)If a stopping condition is met, then stop; else, go to step 2.

The most commonly adopted conditions to end the algorithm are when the number of iterations () is larger than a maximum number of allowed iterations or if no changes to the best solution have been performed in the last iterations (as it is in our tests).

Specifically, in our experiments, we modeled for both HC and TS the possible valid solutions in the search space as a binary adjacency matrix describing acyclic directed graphs. The starting point of the search is the empty graph (i.e., without any edge), and the search is stopped when the current best solution cannot be improved with any move in its neighborhood.

The algorithms then consider a set of possible solutions (i.e., all the directed acyclic graphs) and navigate among them by means of 2 moves: insertion of a new edge or removal of an edge currently in the structure. We also recall that in the literature, many alternatives are proposed to navigate the search space when learning the structure of Bayesian network (see [24, 25]). But, for the purpose of this work, we preferred to stick with the classical (and simpler) ones.

##### 3.3. Genetic Algorithms

Genetic algorithms (GAs) [26] are a class of computational models that mimic the process of natural evolution. GAs are often viewed as function optimizers although the range of problems to which GAs have been applied is quite broad. Their peculiarity is that potential solutions that undergo evolution are represented as fixed length strings of characters or numbers. Generation by generation, GAs stochastically transform sets (also called *populations*) of candidate solutions into new, hopefully improved, populations of solutions, with the goal of finding one solution that suitably solves the problem at hand. The quality of each candidate solution is expressed by using a user-defined objective function called *fitness*. The search process of GAs is shown in Figure 1.