The Scientific World Journal
Volume 2014, Article ID 749150, 12 pages
http://dx.doi.org/10.1155/2014/749150
Research Article

Chain Graph Models to Elicit the Structure of a Bayesian Network

Federico M. Stefanini

Dipartimento di Statistica, Informatica, Applicazioni “G. Parenti”, Università degli Studi di Firenze, Viale Morgagni 59, 50134 Firenze, Italy

Received 31 August 2013; Accepted 5 November 2013; Published 5 February 2014

Academic Editors: R. M. Rodríguez-Dagnino and M. Saberi

Copyright © 2014 Federico M. Stefanini. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Bayesian networks are possibly the most successful graphical models to build decision support systems. Building the structure of large networks is still a challenging task, but Bayesian methods are particularly suited to exploit experts’ degree of belief in a quantitative way while learning the network structure from data. In this paper details are provided about how to build a prior distribution on the space of network structures by eliciting a chain graph model on structural reference features. Several structural features expected to be often useful during the elicitation are described. The statistical background needed to effectively use this approach is summarized, and some potential pitfalls are illustrated. Finally, a few seminal contributions from the literature are reformulated in terms of structural features.

1. Introduction

Bayesian networks (BNs) are possibly the most successful graphical models to represent probabilistic and causal relationships [1, 2]. BNs are used in very different fields including medical domains [3], engineering [4], ecology [5], bioinformatics [6], and many others.

The core of this class of models is a directed acyclic graph (DAG) G = (V, E), where the nodes of the graph are labels of the modeled variables (elements of a random vector X), and oriented edges (arrows) capture probabilistic and/or causal relationships. The joint distribution of X is represented by the product of conditional distribution functions following from the structure of G. If substantial prior information is available on a given problem domain, an expert may be able to define the structure of G and, to a reasonable extent, even the parameters of the conditional distribution functions. Otherwise, structure and parameters have to be estimated from data (structural and parameter learning), for example, from a collection of exchangeable observations.

It is often the case that an expert knows some features of the DAG G, but this knowledge is not enough to define a DAG because several of its aspects are affected by relevant uncertainty. Following the Bayesian paradigm [7, 8], the expert is invited to state his/her degree of belief about G by eliciting a prior distribution on the set of DAGs that can be considered given a fixed set of variables (nodes). All other unknown quantities, such as missing values and parameters, are also considered in the prior distribution [9].

Learning the structure of a BN remains challenging because of the combinatorial explosion of candidate structures as the number of considered nodes increases. Following Robinson [10], the number of possible DAGs on 6 nodes is 3781503; thus the enumeration of all structures while eliciting expert beliefs is unfeasible. Computational difficulties in the full enumeration of DAGs arise from about a dozen nodes on. For these reasons, several restrictions and simplifications in stating prior beliefs were considered in the past, with the aim of making structural learning tasks affordable in large networks. Widely adopted elicitation techniques are based on restrictions like a total ordering of nodes, the presence of sharp order constraints on nodes, the marginal independence of unknowns, or the existence of a prior network which is a good summary of expert beliefs.
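Robinson's count is easy to verify with his recurrence a(n) = Σ_{k=1}^{n} (−1)^{k+1} C(n, k) 2^{k(n−k)} a(n−k), with a(0) = 1; a short sketch (the function name is ours):

```python
from math import comb

def num_dags(n):
    """Number of labeled DAGs on n nodes via Robinson's recurrence:
    a(n) = sum_{k=1..n} (-1)^(k+1) * C(n, k) * 2^(k*(n-k)) * a(n-k), a(0) = 1."""
    a = [1]  # a(0) = 1
    for m in range(1, n + 1):
        total = 0
        for k in range(1, m + 1):
            total += (-1) ** (k + 1) * comb(m, k) * 2 ** (k * (m - k)) * a[m - k]
        a.append(total)
    return a[n]

print([num_dags(n) for n in range(7)])
# [1, 1, 3, 25, 543, 29281, 3781503]
```

The last entry reproduces the figure quoted above for 6 nodes.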

In a recent work [11], graphical models were proposed to elicit beliefs about the structure of a BN. The approach is characterized by the possibility of expressing beliefs about limited (but relevant) aspects of the problem domain, called structural features [12], that by their own nature are often only indirectly related to oriented edges in the unknown DAG.

Here the most general approach based on chain graph (CG) models is reconsidered with the aim of establishing connections with seminal contributions from the literature on structural learning, of detailing a general parameterization, and of describing one way to perform the refinement of elicited beliefs. Common useful structural features are defined and some issues related to their implementation and revision are examined.

2. Methods

Notation and some background information on graphs and Markov properties are provided below. A comprehensive account may be found in [13–15]. An approach to the elicitation of structural features by CG models is described, together with methods for their revision.

2.1. Graphs

A graph is a pair G = (V, E), where V is a finite set of nodes and E is the set of edges. The set E represents the structure of the graph because it defines which nodes are linked by an edge and whether such an edge is directed (an arrow) or not. If (u, v) ∈ E and (v, u) ∈ E then an undirected edge joins u and v, indicated as u − v, and u is a neighbor of v. If (u, v) ∈ E but (v, u) ∉ E, the ordered pair corresponds to the directed edge u → v; v is said to be the child of u and u is a parent of v. The set pa(v) includes all parents of node v, that is, all nodes originating arrows ending in v, while the set ch(u) collects all children of u, namely, all nodes in which arrows originating from u end.

A path is a sequence of vertices such that there is an edge for each pair of subsequent nodes in the sequence, that is, u − v, u → v, or u ← v. A directed path is a path in which all edges maintain the head-to-tail orientation, for example, u → w → v.

In a directed graph all edges are directed. The ancestors of a node v are the nodes on a directed path reaching v, while its descendants are the nodes on a directed path starting from v. The extension of these definitions to a set of nodes is obtained by taking the union of the ancestor sets of its elements; a similar extension holds for descendants.

In a directed graph without cycles it is not possible to visit the same node more than once by following a directed path; such a graph is called a directed acyclic graph. A moralized DAG is the undirected graph obtained by joining with an undirected edge all pairs of nodes sharing a child (if not already connected) and by removing the direction of all edges. A subgraph is obtained from G by removing a set of nodes together with all edges in which at least one endpoint belongs to that set.
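The moralization step can be sketched in a few lines; the dictionary-of-parent-sets representation and the function name below are our own choices:

```python
def moralize(parents):
    """Moral graph of a DAG given as {node: set of parents}.
    Returns the set of undirected edges (as frozensets):
    (1) marry parents sharing a child; (2) drop edge directions."""
    edges = set()
    for child, pa in parents.items():
        for p in pa:                      # drop direction of each arrow
            edges.add(frozenset({p, child}))
        pa = sorted(pa)
        for i in range(len(pa)):          # join co-parents pairwise
            for j in range(i + 1, len(pa)):
                edges.add(frozenset({pa[i], pa[j]}))
    return edges

# v-structure a -> c <- b: moralization adds the undirected edge a - b
print(sorted(tuple(sorted(e)) for e in moralize({"a": set(), "b": set(), "c": {"a", "b"}})))
# [('a', 'b'), ('a', 'c'), ('b', 'c')]
```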

A graph without directed edges is called an undirected graph (UG). An UG in which every pair of nodes is joined by an edge is said to be complete. A subgraph on a subset of nodes is obtained by removing the nodes not in the subset and all edges reaching or leaving such nodes. Note that subsets of nodes are indicated by capital letters. A clique is a maximal complete subgraph of an UG.

A chain graph, also called a partially directed acyclic graph (PDAG), is made by an ordered collection of chain components, which are undirected graphs, and by directed edges between nodes located in different chain components, so that an arrow from u to v is allowed only if u belongs to a chain component preceding the chain component in which v is located. Therefore directed edges are forbidden within a chain component and in the direction from later to earlier chain components.

The moralization of a chain graph mimics the moralization of a DAG. A moralized CG is obtained by the following steps: (1) consider the first chain component; (2) join with undirected edges all pairs of nodes in the union of the parent sets of the nodes in the current chain component (if not already joined); (3) iterate step (2) for each subsequent chain component; (4) remove the direction of all edges.

2.2. Some Markov Properties

Conditional independence [16] is fundamental to reason out highly structured stochastic systems and to simplify the representation of high dimensional distributions.

In this paper the random vector X refers to the random variables included in the BN. The notation used hereafter is based on the nodes in V; the sample space is the Cartesian product of the sample spaces of the considered variables.

The joint probability distribution of a random vector X is Markov with respect to an UG if

p(x) = Z^{-1} ∏_{C ∈ 𝒞} ψ_C(x_C), (1)

with 𝒞 the collection of cliques of the graph and with ψ_C nonnegative functions called clique potentials; note that x_C is the coordinate projection of vector x on the subset of coordinates defined by C; Z is the partition function that normalizes the product of potentials in (1).

Markov properties for positive distributions with respect to an undirected graph may be read using the separation theorem [14]. Let A, B, S be disjoint subsets of nodes. The separation theorem states that the subvectors X_A and X_B are conditionally independent given the subvector X_S if and only if all paths from a node in A to a node in B include nodes located in S; thus the nodes in S separate the nodes in A from the nodes in B.
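The separation criterion reduces to a reachability check after deleting the separator nodes from the undirected graph; a minimal sketch (names are illustrative):

```python
from collections import deque

def separates(edges, A, B, S):
    """True iff S separates A from B in the undirected graph:
    no path from A to B avoids S (breadth-first search skipping S)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, queue = set(A) - set(S), deque(set(A) - set(S))
    while queue:
        u = queue.popleft()
        if u in B:
            return False
        for w in adj.get(u, ()):
            if w not in seen and w not in S:
                seen.add(w)
                queue.append(w)
    return True

# chain a - b - c: {b} separates {a} from {c}, while the empty set does not
print(separates([("a", "b"), ("b", "c")], {"a"}, {"c"}, {"b"}))   # True
print(separates([("a", "b"), ("b", "c")], {"a"}, {"c"}, set()))   # False
```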

The joint probability distribution of the random variables indexed in V is Markov with respect to a DAG if the following factorization holds:

p(x) = ∏_{v ∈ V} p(x_v | x_{pa(v)}), (2)

where X_{pa(v)} is the random vector made by the variables whose labels belong to the parent set of v.

Markov properties on a DAG may be read using the separation theorem for directed graphs [14]. Given three disjoint sets of nodes A, B, S, consider the subgraph defined on the smallest ancestral set containing A ∪ B ∪ S after moralization, say G^m. The random subvectors X_A and X_B are conditionally independent given the subvector X_S if and only if the nodes in S separate the nodes in A from the nodes in B in G^m.

A joint probability distribution is Markov with respect to a CG if

p(x) = ∏_t p(x_{τ_t} | x_{pa(τ_t)}), (3)

with τ_1, τ_2, …, the chain components of the CG. Furthermore the factors on the r.h.s. of (3) may be factorized by considering the subgraph on the nodes defined by τ_t ∪ pa(τ_t):

p(x_{τ_t} | x_{pa(τ_t)}) = Z^{-1}(x_{pa(τ_t)}) ∏_{C} ψ_C(x_C), (4)

where the Z(x_{pa(τ_t)}) are normalization constants, one for each conditioning value of the random subvector X_{pa(τ_t)}. The working UG in (4) is obtained by removing the orientation of the edges from the parents of nodes in τ_t and by joining those parents into a complete undirected subgraph.

Conditional independence relationships in CGs are also obtained from an extension of the separation theorem in UGs and DAGs for positive distributions [17]. Let G be a chain graph and A, B, S be three disjoint subsets of V. The separation theorem states that the subvectors X_A and X_B are conditionally independent given the subvector X_S if and only if all paths from a node in A to a node in B in the moral graph of the smallest ancestral set containing A ∪ B ∪ S include nodes of S; thus the nodes in S separate the nodes in A from the nodes in B. That moral graph is a subgraph of G, moralized as described in Section 2.1.

2.3. Causal DAGs

A DAG may represent causal relations among variables. According to the causal semantics, an arrow from u to v indicates that u is a direct cause of v with respect to the nodes included in V, that is, at the considered model granularity. In principle an intervention on the variable X_u may bear an effect on X_v. An intervention on a subset of variables indicates the external setting of those variables to prescribed values; thus the system or process is perturbed, not merely observed.

Pearl [2, pp. 27–32] starts with the definition of functional causal models, which are deterministic in nature, and he demonstrates [2, theorem 1.4.1] that such formulation induces the Markov factorization in (2), the so-called Markov causal assumption. An equivalent representation embeds exogenous variables into the node of interest and transforms the deterministic relationships into probabilistic conditioning, thus leading to Bayesian networks.

A key property of a causal DAG is stability under external intervention: if a variable is manipulated, all the other variables maintain their relationships as represented by the DAG. In other terms the intervention is local to the manipulated variables and does not break the other relationships represented in a causal DAG. The intervention regime is in contrast with the plain observation of values taken by the random vector X.

The granularity of a causal DAG depends on the variables included in the model. A variable not included in a causal DAG may affect at most one modeled variable; otherwise, if several variables are affected, a more general class of models is needed, called semi-Markovian networks (not considered further in this work).

For an updated presentation of the approach see [18], while [19] warns against the blind definition of DAGs within the causal semantics in observational studies. The latter author also reconsiders the foundations of causal inference by anchoring them to extended conditional independence (C.I.), both at the algebraic level and with a graphical counterpart based on influence diagrams.

2.4. Inference about the Structure of Bayesian Networks

In all cases in which strong prior information is absent, structural learning is performed by means of a database of exchangeable observations of the random vector .

Several algorithms have been proposed to infer causal and probabilistic relations, but in Bayesian inference the key quantities enter into the joint probability distribution of the data D and of the network's unknowns given the context C:

p(D, θ, M | C) = p(D | θ, M, C) p(θ | M, C) p(M | C), (5)

where θ collects the vectors of parameters characterizing the conditional probability distributions of each variable given its parents; the variable M indicates the unknown DAG, and it is built as a bijection from the set of DAGs (for fixed V) to a subset of the natural numbers.

The likelihood function is typically expressed as a product of multinomials by using sufficient statistics. Whenever expert beliefs are reasonably captured by Dirichlet prior distributions for the elements of θ, and these are elicited as marginally independent, closed-form integration marginalizes out the thetas. The resulting marginal distribution has a reduced dimensionality and may be optimized with respect to M while looking for optimal structures characterized by the highest posterior probability values [9]:

p(M = m | D, C) ∝ p(M = m | C) ∏_{i=1}^{n} ∏_{j=1}^{q_i} [Γ(α_{ij}) / Γ(α_{ij} + N_{ij})] ∏_{k=1}^{r_i} [Γ(α_{ijk} + N_{ijk}) / Γ(α_{ijk})], (6)

with r_i the number of states taken by X_i and q_i the number of configurations of pa(X_i); the sufficient statistics N_{ijk} count the observations in which the i-th variable takes its k-th state while its parent configuration is in the j-th state; N_{ij} = Σ_k N_{ijk} and α_{ij} = Σ_k α_{ijk}. A clever choice of hyperparameters guarantees likelihood equivalence; that is, all BNs equivalent as regards the set of encoded conditional independence relationships are equally scored by (6).
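The closed-form marginalization behind the score can be sketched for a single family (one variable together with its parent configurations); the function name and the toy counts below are illustrative, not taken from the paper:

```python
from math import lgamma, exp

def family_score(counts, alpha):
    """Dirichlet-multinomial marginal likelihood for one variable:
    counts[j][k] = N_ijk (k-th state under the j-th parent configuration),
    alpha[j][k]  = corresponding Dirichlet hyperparameters."""
    log_score = 0.0
    for N_j, a_j in zip(counts, alpha):
        log_score += lgamma(sum(a_j)) - lgamma(sum(a_j) + sum(N_j))
        for N_jk, a_jk in zip(N_j, a_j):
            log_score += lgamma(a_jk + N_jk) - lgamma(a_jk)
    return exp(log_score)

# binary variable, no parents, counts (3, 1), uniform Dirichlet(1, 1):
# Gamma(2)/Gamma(6) * Gamma(4)/Gamma(1) * Gamma(2)/Gamma(1) = 6/120 = 0.05
print(family_score([[3, 1]], [[1, 1]]))  # ~0.05
```

Summing such log-scores over all variables, and adding the log prior over structures, gives the quantity maximized during a structural search.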

Equation (6) shows that the elicitation of the prior distribution over structures is a key step in performing Bayesian structural learning.

2.5. Plausible Network Features

A candidate structure on a fixed set of nodes is plausible if it has structural features (SFs) believed to be relevant by the expert. Following Stefanini [11, Definition 1], we recall the formal definition.

Definition 1 (SF). A structural feature (SF) in a reference set for the set of DAGs on V is a predicate describing a plausible probabilistic or causal characteristic of the unknown directed acyclic graph. Its argument takes values in a partition of a given numeric domain. An atomic structural feature (ASF) does not depend on any auxiliary argument.

A reference set is a collection of SFs indexed in a set, with m being the number of considered SFs. A proposition might be defined to carry disbelief to a candidate structure, but the feature-raises-belief direction is conveniently adopted here.

An example makes the previous definition operational. Let us define the feature “The number of immediate causes of a given variable is in λ” to capture the expert's belief about the number of parents of the corresponding node in the unknown DAG, with λ an element of a partition of the possible parent counts. Given a candidate structure, its plausibility will be determined by the element of the partition that makes the predicate true. A simple representation of the configurations taken by reference features is obtained by descriptors [11, Definition 2].

Definition 2 (descriptors). A descriptor for an SF is a map assigning a different integer to each element of the partition of the feature's domain; that is, distinct elements of the partition receive distinct integers. The vector of descriptors is defined on the Cartesian product of the descriptors' ranges. The descriptor of an ASF takes value 1 if the feature is present and 0 otherwise.

The j-th configuration of the descriptors is indicated as δ_j, while a generic configuration is indicated as δ. The j-th configuration of a subvector of descriptors defined by indexes in a subset A is indicated as δ_{j,A}.

The vector of descriptors induces equivalence classes on the set of DAGs; that is, the partition {C_1, C_2, …} contains sets of structures, with C_j made by all those DAGs sharing the same configuration δ_j.
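For small node sets the equivalence classes can be computed exactly by enumeration; the sketch below builds all 25 DAGs on 3 nodes and partitions them by a single atomic feature (the feature choice is illustrative):

```python
from itertools import product

def all_dags(n):
    """Enumerate all labeled DAGs on n nodes as parent-set dictionaries,
    by filtering the 2^(n(n-1)) directed graphs for acyclicity."""
    pairs = [(u, v) for u in range(n) for v in range(n) if u != v]
    dags = []
    for bits in product([0, 1], repeat=len(pairs)):
        pa = {v: set() for v in range(n)}
        for (u, v), b in zip(pairs, bits):
            if b:
                pa[v].add(u)
        # acyclicity check: repeatedly peel off nodes with no remaining parents
        remaining = {v: set(s) for v, s in pa.items()}
        while remaining:
            roots = [v for v, s in remaining.items() if not s]
            if not roots:
                break  # a cycle remains
            for r in roots:
                del remaining[r]
            for s in remaining.values():
                s.difference_update(roots)
        if not remaining:
            dags.append(pa)
    return dags

dags = all_dags(3)
print(len(dags))  # 25, matching Robinson's count for n = 3

# descriptor of the ASF "node 0 has no parents": 1 if present, 0 otherwise
classes = {0: 0, 1: 0}
for pa in dags:
    classes[1 if not pa[0] else 0] += 1
print(classes)  # {0: 13, 1: 12}: two equivalence classes, of sizes 13 and 12
```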

Note that members of the same equivalence class must be associated with the same degree of belief by the principle of insufficient reason: by construction they differ only in irrelevant, unconsidered ways.

Despite the generality of Definition 1 some features are expected to occur more often than others. Below, some of them are described without pretending to be exhaustive.

2.5.1. Indegree and Outdegree

In applications characterized by a small sample size, it is useful to impose a sharp constraint during the greedy search of top-scored candidate structures. The maximum number of arrows entering a node, the indegree, is set to a small integer to exclude from consideration structures with a large number of parents. A large number of parents implies a conditional probability table with a huge number of parameters, which remain affected by large uncertainty after conditioning on the observed data because of sampling zeros. A similar constraint may be set on the number of arrows leaving a node.

Two reference features naturally embed this kind of information: (i) “The maximum number of arrows reaching a node is in λ for all nodes in V,” where λ is a small set of integers elicited from the expert; (ii) “The maximum number of arrows leaving a node is in λ for all nodes in V,” where λ is a small set of integers elicited from the expert.

The above two features may be exploited to increase the plausibility of candidate structures which are sparsely connected. Higher control on connectivity is obtained by considering the fraction of nodes showing a given degree of connectivity, as described below.

2.5.2. Partitioned Connectivity

A different way of characterizing the connectivity is obtained by eliciting the minimum fraction of nodes in the DAG showing a given number of parents (children) and by iterating the elicitation from zero parents (children) up to a small integer k.

Let k be a small integer representing the maximum number of parents (children) to be considered. Let q = (q_0, …, q_k) be a vector of nondecreasing numbers in [0, 1] elicited from the expert, where q_s is the minimum cumulative fraction of nodes with at most s parents (children). The elicitation of this feature is based on the vector q and on the partition of the nodes induced by their number of parents (children).

Two reference features naturally embed the extended evaluation of connectivity: (i) “The q-partitioned inconnectivity of degree k holds,” with q and k defined above; a candidate structure shows this feature if, for each s = 0, …, k, the cumulative relative frequency of nodes with a number of parents equal to or less than s is greater than or equal to q_s. (ii) “The q-partitioned outconnectivity of degree k holds,” with q and k defined above; a candidate structure shows this feature if, for each s = 0, …, k, the cumulative relative frequency of nodes with a number of children equal to or less than s is greater than or equal to q_s.

Trained experts could prefer to work on a conventional total of 100 nodes, so as to elicit cumulative percentages instead of cumulative fractions of nodes.

In Table 1, an example is shown where k = 3 and the elicited vector q is defined on fractions. A candidate structure has the partitioned inconnectivity feature if the fraction of root nodes is equal to or above q_0, the proportion of nodes with at most one parent is equal to or above q_1, and so on.

Table 1: Elicited vector q, with k = 3. In the top row the number of parents is reported, while in the second row the corresponding minimum fraction of nodes is shown.
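The membership test implied by the definition can be sketched as follows; the function name and the toy degree sequence are ours:

```python
def has_partitioned_inconnectivity(parent_counts, q):
    """Check the q-partitioned inconnectivity feature: for each s = 0..k,
    the cumulative fraction of nodes with at most s parents must be >= q[s].
    `parent_counts` holds the number of parents of each node; q is the
    elicited vector of nondecreasing minimum fractions."""
    n = len(parent_counts)
    for s, q_s in enumerate(q):
        frac = sum(1 for c in parent_counts if c <= s) / n
        if frac < q_s:
            return False
    return True

# 5 nodes with 0, 0, 1, 2, 3 parents; k = 2 and elicited q = (0.2, 0.4, 0.6):
# the fractions of nodes with <=0, <=1, <=2 parents are 0.4, 0.6, 0.8
print(has_partitioned_inconnectivity([0, 0, 1, 2, 3], (0.2, 0.4, 0.6)))  # True
print(has_partitioned_inconnectivity([0, 0, 1, 2, 3], (0.5, 0.6, 0.8)))  # False
```

The outconnectivity variant is identical with children counts in place of parent counts.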
2.5.3. Direct Cause and Direct Effect

The reference feature “The variable X_u is an immediate cause of variable X_v” refers to u as a parent of v, so that by setting (intervening on) the variable X_u to a given value, the distribution of X_v is modified. This relation holds at the selected level of granularity; thus it may change if the collection of variables (nodes in V) is modified.

2.5.4. Causal Ancestors

The “direct cause” feature may be extended by considering a variable X_u which is on a causal (directed) path reaching node v. In this case the reference feature is “The variable X_u is an indirect cause of variable X_v”; thus the expert believes that one or more variables mediate the effect of X_u on X_v.

2.5.5. Causal Hubs

A hub node in a network is characterized by a high number of arrows leaving it. A reference feature which captures the local connectivity of node v is “Node v is a hub node of at least outdegree λ,” with the outdegree indicating the number of arrows originating in v and λ a set of integers. Note that the defined feature is a localized version of the outdegree feature.

The expert might believe that a hub node should be present, but without indicating a specific node. In this case the q-partitioned outconnectivity feature (with a suitably large maximum degree) can be exploited for this purpose.

2.5.6. Conditional Independence Relationships

A statement about C.I. among three disjoint subsets of random variables may take the following form: “The random vector X_A is conditionally independent of the random vector X_B given the vector X_S,” with A, B, S being disjoint subsets of nodes in V.

2.6. The Degree of Belief

The prior distribution on the space of structures on V is obtained by “extending the argument”; that is,

p(M = m) = Σ_j p(M = m | δ_j) p(δ_j). (9)

By recognizing that the vector of descriptors induces the partition {C_1, C_2, …}, with p(M = m | δ_j) equal to 1/|C_j| if m ∈ C_j and zero otherwise (10), it follows that

p(M = m) = p(δ_j) / |C_j|, (11)

where |C_j| is the cardinality of the equivalence class in which m is located and δ_j is the structural configuration of DAG m. In [12] the size of each equivalence class was estimated by Monte Carlo simulation to face the combinatorial explosion in the number of DAGs to be enumerated with the increase of the number of nodes:

|Ĉ_j| = a(n) N_j / N, (12)

where N is the total number of DAGs uniformly sampled from the space of all DAGs on V, a(n) is the size of such space (see [10]), and N_j is the number of sampled DAGs showing configuration δ_j.
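The Monte Carlo estimator of class sizes and the resulting per-structure prior can be sketched as follows; every number here (two configurations, class proportions 12/25 and 13/25, elicited beliefs 0.3 and 0.7) is hypothetical:

```python
import random
random.seed(1)

a_n = 25                                  # total number of DAGs on n = 3 nodes
true_share = {(0,): 12 / 25, (1,): 13 / 25}   # illustrative true class proportions

# Monte Carlo estimate of class sizes: |C_j| ~= a(n) * N_j / N
N = 100_000
counts = {(0,): 0, (1,): 0}
for _ in range(N):
    j = (0,) if random.random() < true_share[(0,)] else (1,)
    counts[j] += 1
est_size = {j: a_n * counts[j] / N for j in counts}

# elicited beliefs over configurations; each DAG in class j gets p(delta_j)/|C_j|
p_config = {(0,): 0.3, (1,): 0.7}
prior_per_dag = {j: p_config[j] / est_size[j] for j in p_config}
print(est_size)        # close to {(0,): 12.0, (1,): 13.0}
print(prior_per_dag)   # per-structure prior within each equivalence class
```

Multiplying each per-structure prior by its class size recovers the elicited configuration beliefs, so the prior over structures sums to one.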

The numerator on the right of (11) represents the joint belief on each configuration of descriptors. While the elicitation in full generality becomes cognitively and numerically overwhelming as the number of descriptors grows, some parsimony is achieved if a small number of descriptors may be considered at a time, so that conditional independence relationships among descriptors may be exploited. This choice is made available to the expert through the definition of an order relation on descriptors.

Definition 3 (ordered partition). The ordered partition of descriptors is defined by the expert to indicate disjoint subsets of SFs to be jointly considered during the elicitation, from the first subset to the last.

Otherwise stated, the expert decomposes the whole elicitation problem following an order taken from the substantive content of the specific problem domain: features are grouped according to their priority in the elicitation.

The elicitation of the ordered partition is performed by formulating questions in the language typical of a given problem domain. It is indeed difficult to define those questions in general terms, because they turn out to be quite abstract and are likely to be obscure to the domain expert (see [11]). We assume here that questions are properly phrased and that an ordered partition is defined.

If a strict order relation is elicited, each element of the partition contains one descriptor: this is a special case addressed in [20]. Another special case is represented by the trivial partition with just one subset containing all descriptors [21]. In the two special cases above it is possible to define, respectively, a Bayesian network and a Markov network on descriptors. The general case, made by several subsets each of cardinality two or more, was addressed by using CG models in [11] for ASFs. Below, details are provided for a reference set of nonatomic features.

The joint probability distribution of descriptors is assumed to be Markov with respect to the elicited ordered partition, where the descriptors within a group define a chain component of the CG model (see (3)). The elicitation continues by asking, in the language of the domain expert, which descriptors within each subset should be jointly considered while defining the degree of belief. Nodes corresponding to related descriptors are joined by undirected edges and the collection of cliques defining the chain component is found. The elicitation is iterated for all elements of the ordered partition.

The resulting CG may support the elicitation if model parameters are cognitively suited for the quantitative step, that is, if they are interpretable and easy to assess for the expert, at least after some training.

The first chain component is elicited as a marginal distribution which is not conditioned on other descriptors. An undirected graphical model on the descriptors in the first chain component is defined through the multiplicative model in (4), under empty conditioning. Nevertheless some care is needed because the potential functions on cliques are not uniquely defined. Here a log-linear parameterization is suggested, following [13]. For example, if the first chain component has just one clique made by three descriptors, say (D_1, D_2, D_3), we define the multiplicative model as below, after exponentiation and rearrangement of log-linear terms:

p(d_1, d_2, d_3) / p(0, 0, 0) = β_1(d_1) β_2(d_2) β_3(d_3) β_{12}(d_1, d_2) β_{13}(d_1, d_3) β_{23}(d_2, d_3) β_{123}(d_1, d_2, d_3). (14)

Thus the odds value with respect to the no-feature configuration is explained by a multiplicative model where each factor, for example β_{12}(d_1, d_2), is equal to one if one or more of its descriptors are null, and is positive otherwise (the so-called treatment parameterization).

The elicitation in (14) is performed on the odds scale by asking the domain expert how many times a given configuration is more plausible than the no-feature configuration. This question is iterated for all configurations in the Cartesian product of the descriptors' ranges after excluding the no-feature configuration. Questions are posed from single main effects towards higher-order interaction terms, so that one factor at a time is considered (see the algorithm below).

The procedure is iterated for all cliques in the first chain component, and factors already elicited in previously considered cliques are not reconsidered. For example, if the chain component is made by two cliques sharing some descriptors, each shared factor is elicited just once.

The general algorithm for the first chain component is summarized below.
(1) Consider the undirected graph on the descriptors elicited as the first chain component. Control questions include the following: “Are all the relevant features included in the elicitation?”, “Are all pairs of features jointly affecting the probability of a structure linked by an undirected edge?”. If needed, revise the order relation and (or) the links within the chain component.
(2) Check that each descriptor, given its neighbors, is independent of all other descriptors in the first chain component.
(3) Find the cliques of such UG (model generator) [13].
(4) For each clique, elicit parameter values by using odds.
(i) Elicit main effects, one at a time for all configurations, by assigning an odds value with respect to the baseline with no features. A control question for this step is: “How much above one is the odds value for the sole presence of this feature?”
(ii) Elicit first-order interactions, one at a time, by assigning a multiplicative term for each configuration under the question: “Which is the value of the multiplicative term needed to account for the interaction of one feature with another?” The helper expression for this step is the cross-product ratio of the two features divided by the already elicited main effects. If the interaction is absent the term equals one, while a term below (above) one means that the interaction reduces (raises) the plausibility of the considered configuration.
(iii) Iterate the step above with higher-order interaction terms (for all configurations) under two constraints: before moving to higher-order terms, all the terms of the same degree must have been elicited; moreover, the maximum degree of interaction among a subset of features is defined by the size of the clique they belong to (model generator). For example, with an interaction of order two, the control question might be: “Having already elicited the main effects and the first-order interaction terms, which is the value of the multiplicative term needed to adjust the odds value after considering the joint interaction among the three features?”; the helper expression divides the odds of the full configuration by the product of the already elicited factors. Similar expressions follow straightforwardly for higher-order interactions.
(5) Calculate the baseline probability of a structure without features by exploiting normalization, since the probabilities of all configurations must sum to one.
(6) Revise the elicited values following the guidelines of Section 2.7.
(7) Move to the next chain component.
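The steps above can be sketched for a single clique of three binary descriptors: multiply the treatment-parameterized factors and normalize. All names and elicited odds values below are hypothetical:

```python
from itertools import product

def joint_from_odds(main, pair, triple):
    """Joint distribution of three binary descriptors from elicited odds,
    treatment parameterization: a factor contributes only when all of its
    descriptors are non-null (equal to 1)."""
    psi = {}
    for d in product([0, 1], repeat=3):
        v = 1.0                              # odds of (0, 0, 0) against itself
        for i in range(3):
            if d[i]:
                v *= main[i]                 # main effects
        for (i, j), b in pair.items():
            if d[i] and d[j]:
                v *= b                       # first-order interactions
        if all(d):
            v *= triple                      # second-order interaction
        psi[d] = v
    Z = sum(psi.values())                    # step (5): normalization
    return {d: v / Z for d, v in psi.items()}, 1.0 / Z

# hypothetical elicitation: each feature alone doubles the plausibility,
# features 1 and 2 interact positively, the three-way term is neutral
p, baseline = joint_from_odds(
    main=[2.0, 2.0, 2.0],
    pair={(0, 1): 1.5, (0, 2): 1.0, (1, 2): 1.0},
    triple=1.0,
)
print(round(sum(p.values()), 10))   # 1.0
print(baseline)                     # probability of the no-feature configuration
```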

The algorithm presented above may also be applied to the remaining chain components, that is, those with a nonempty set of parents. The elicitation of parameters in a conditional Markov network is performed by iterating the algorithm for the first chain component over each configuration of the conditioning parents. It follows that the elicitation burden depends on the number of parents, more precisely on the cardinality of the Cartesian product whose factors are the sample spaces of the parents.

The algorithms defined above lead to a prior distribution on configurations, but the elicitation does not end before the revision of elicited values takes place.

2.7. Implementation and Revision Issues

The flexibility achieved by defining predicates entails potential pitfalls that should be considered during an elicitation run with structural features.

The revision of elicited beliefs is always needed because the expert is not expected to provide unbiased elicited values, especially with limited training or in complex problem domains. Revision and elaboration of the prior distribution [22, and references therein] are applied to control elicitation bias and other causes of poor elicitation.

Overelicitation is an important practice to check for the presence of elicited quantities that do not reflect the expert's degree of belief. Nevertheless, in large networks the elaboration of elicited probability values is the main resource for checking the quality of the elicitation. In the proposed framework, the elicitation through reference features produces proper probability distributions; therefore one elaboration consists of inspecting one or more margins of the elicited distribution, for example a bivariate margin, to look for configurations whose plausibility causes surprise or disbelief in the expert. This operation is particularly meaningful if the inspected margins do not straightforwardly relate to elicited odds, for example, margins over descriptors belonging to different cliques. Surprise and disbelief call for the revision of elicited values. If an association between selected margins and bias is suspected, random selection of margins is an option.
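Inspecting a margin is a mechanical summation over the remaining coordinates; a minimal sketch on a toy joint distribution (uniform, purely illustrative):

```python
from itertools import product

def margin(joint, keep):
    """Marginal distribution over the descriptor positions in `keep`,
    obtained by summing the joint over all other coordinates."""
    out = {}
    for d, p in joint.items():
        key = tuple(d[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

# toy joint over three binary descriptors (uniform, for illustration)
joint = {d: 1 / 8 for d in product([0, 1], repeat=3)}
biv = margin(joint, keep=(0, 1))     # bivariate margin of the first two descriptors
print(biv)                           # each of the 4 configurations gets 0.25
print(round(sum(biv.values()), 10))  # 1.0
```

Each margin presented back to the expert is itself a proper distribution, which makes surprise-driven revision well defined.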

The stability of elicited values against different order relations on descriptors (reference features) should not be assumed. Nevertheless, in complex problem domains overelicitation made by the repetition of the interview with other ordered partitions not only seems unacceptable as regards the work load, but it could even be cognitively unfeasible, for example, if the original ordered partition selected by the expert is induced by a scientific hypothesis.

The core of the proposed approach is based on predicates representing structural features. The expert might believe that some configurations of features are plausible although they are incompatible with DAGs, for example, because they imply the presence of cycles. A positive probability value would be assigned to such a configuration, but this is not an instance of elicitation bias if the elicited distribution properly matches expert beliefs. If this is the case no candidate structure is affected, because the corresponding equivalence class is empty.

The way a predicate is specified determines the granularity of the elicitation and the cardinality of the equivalence classes. For example, let us consider two nodes u and v and the reference feature “Nodes u, v are not descendants of other nodes in V”.

Reworking the original proposition we have two simpler predicates: (i) “Node u is not a descendant of other nodes in V”; (ii) “Node v is not a descendant of other nodes in V”; and they can be considered by conjunction. In Table 2, the relation between the original reference feature (right) and its conjoint components (left) is shown: the negation of the original feature collects the three configurations in which at least one of the two simpler predicates is false. It follows that the simpler predicates are needed if the expert's degree of belief changes over the configurations collapsed by the original feature. While nothing prevents the expert from defining rich predicates, care should be taken to select a granularity suited to properly represent expert beliefs.

Table 2: Relation between the original feature (right) and the conjunction of two simpler structural features (left).

There are indeed several different ways of formulating a predicate. If two reference sets of features induce the same set of equivalence classes, then they are operationally equivalent. Nevertheless, from a cognitive standpoint, substantial differences in the ease of elicitation might depend on the way propositions are formulated [22]. Let us consider the reference feature "Nodes v1, v2 precede all other nodes." Does the expert use "precede" as "come before all other nodes" in an order relation on the nodes? Given a DAG in which v1 and v2 are disconnected from the other nodes, how should the question be answered? Does the expert use "precede" to mean "are ancestors of all other nodes," or to indicate that those nodes do not receive arrows coming from other nodes? This example makes clear that some training is mandatory in order to make an expert effective in the elicitation: a trained expert is expected to choose the right granularity and to define meaningful predicates, that is, statements that are straightforwardly true or false when applied to any DAG on the given set of nodes.

Several reference features are jointly considered in actual applications, typically a number far beyond what the expert may successfully consider simultaneously. Thus there is the possibility that exclusion and implication relations go unrecognized. For example, let us consider two features, F1: "The indegree of each node is three or less" and F2: "Node v is a sink node." After reworking the last feature, the expert reformulates the statement as F2: "Node v has indegree ten or more." Clearly the joint plausibility of F1 and F2 should be null; otherwise the actual interpretation of F1 is "The indegree of all nodes but v is three or less."

Implication must be recognized to maintain probabilistic coherence; for example, let F1: "Variable X is a direct cause of Y" and F2: "Variable X is an ancestral cause of Y" be two reference features. Clearly F1 implies F2, so the joint plausibility of both features equals the plausibility of F1 and cannot exceed the plausibility of F2. In a normalized reference set, logical relationships among features are properly handled.

3. Results and Discussion

A few seminal approaches to the elicitation of beliefs on structures are reconsidered as special cases of the proposed framework based on structural features. A published case study on breast cancer (BC case study) [23] will be (partially) exploited for illustrative purposes. The whole set of nodes includes: age (AGE), the proliferative index-marker Ki67/MIB-1 (PROLN), oestrogen receptors (ER), progesterone receptors (PR), the receptor tyrosine kinase HER2/neu (NEU), and the P53 protein.

3.1. Buntine (1991)

In the seminal paper of Buntine [24], a prior distribution on the set of DAGs for a fixed set of nodes is defined by assuming, as part of the context, a total ordering of the nodes. The probability of the parent set of a node x_i is defined by the product of the probabilities of events like "There is an edge from x_j to x_i," extended to each x_j preceding x_i in the order relation. The subjective probability value elicited for a network structure is then calculated under marginal independence of parent sets. In summary, the original formalization defines a total ordering, so that x_j may belong to the parent set of x_i only if x_j precedes x_i; then a full specification of beliefs, measured in units of subjective probability, is needed for each admissible edge in the directed graph; finally, independence of parent sets is assumed, so that the distribution on the set of structures is the product over all admissible edges of the probability of the presence (or absence) of each edge ([24], modified). Given the order relation x1 ≺ x2 ≺ x3 ≺ x4 on four nodes, the candidate parent sets are, respectively, {}, {x1}, {x1, x2}, and {x1, x2, x3}.
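Under stated assumptions (hypothetical node labels, illustrative edge probabilities, and an indifference default of 1/2 where no belief is elicited), Buntine's construction can be sketched as a product over the admissible arrows:

```python
from itertools import combinations

def buntine_prior(order, edge_prob, dag_edges):
    """Prior probability of a DAG under a fixed total order of nodes [24]:
    the product, over each pair (a, b) with a preceding b, of the elicited
    probability of the arrow a -> b if present and its complement if absent.
    Parent sets are assumed marginally independent."""
    edges = set(dag_edges)
    prob = 1.0
    for a, b in combinations(order, 2):   # a precedes b in the order
        q = edge_prob.get((a, b), 0.5)    # indifference default (assumption)
        prob *= q if (a, b) in edges else (1.0 - q)
    return prob

# hypothetical example: order x1 < x2 < x3, one strong belief about x1 -> x2
order = ["x1", "x2", "x3"]
beliefs = {("x1", "x2"): 0.9}
p_with = buntine_prior(order, beliefs, {("x1", "x2")})    # 0.9 * 0.5 * 0.5
p_without = buntine_prior(order, beliefs, set())          # 0.1 * 0.5 * 0.5
```

Since beliefs about unmentioned arrows default to indifference, in sparse problems most factors equal 1/2, which is one of the limitations discussed next.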

Despite the huge importance of Buntine's seminal work [24], some limitations should be underlined. Under the causal semantics of BNs, the expert might fail in stating a node order defining the "causal flow" along nodes. In the large regulatory networks of systems biology, many edge probabilities are expected to be left at the indifference value because expert beliefs may involve only a small subset of arrows. We also expect some plausibility assignments to depend on what has already been assigned, for example due to substantive biological laws, but the need for such conditioning is not accounted for. Finally, we remark that several node orders are compatible with the same sparse DAG; therefore the specification of a strict order should not be enforced.

The above example may be straightforwardly cast in terms of reference features. The order relation is part of the context, and we define one reference feature "There is an arrow from x_j to x_i" for each pair in which x_j precedes x_i in the total ordering. In this case a trivial Bayesian network without arrows defined on these features is equivalent to the prior distribution in (19) and (20) if the conditional probability tables are defined by the same values specified in (20) for each feature; that is, features are marginally independent. To see this, note that the context puts a sharp constraint on the space of structures and that the cardinality of the equivalence class associated with each configuration of reference features is equal to one.

The formal approach proposed in this paper allows much more flexibility, for example by restricting the set of nodes to be ordered and by introducing dependence among arrows. For example, let σ be an ordered subset of nodes taken from the BC case study. The reference feature F: "σ is the order on the relevant subset of nodes" induces two equivalence classes: the first made by the DAGs that do not satisfy σ, the second made by the structures without arrows violating the left-to-right order defined in σ. Besides the sharp constraint obtained by assigning probability one to F, the expert might consider such a feature uncertain, thus preferring a degree of belief strictly between zero and one. Further features could be defined to build the reference set, and an ordered partition of descriptors may capture weaker causal relationships by features like "Variable X is a causal ancestor of Y." A CG model made by two components makes it possible to elicit conditional beliefs about such features given the presence and given the absence of an arrow, without assuming a strict order relation on the whole set of nodes.

3.2. Heckerman et al. (1995)

In Heckerman et al. [9], a prior network was elicited and compared to a candidate network by counting the number of edges in which they differ, say δ, so that a high degree of belief is assigned to structures closely resembling the prior network. The authors suggested eliciting a hyperparameter 0 < κ ≤ 1 and defining the prior distribution to be proportional to κ^δ.
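A minimal sketch of this prior, assuming δ counts the arrows in the symmetric difference between the two edge sets (so that a reversed arrow contributes two differences), could look like:

```python
def edge_difference(prior_edges, candidate_edges):
    """delta: number of arrows present in exactly one of the two DAGs."""
    return len(set(prior_edges) ^ set(candidate_edges))

def heckerman_weight(prior_edges, candidate_edges, kappa=0.5):
    """Unnormalized prior proportional to kappa ** delta, 0 < kappa <= 1 [9]."""
    return kappa ** edge_difference(prior_edges, candidate_edges)

# hypothetical prior network on labels A, B, C
prior_net = {("A", "B"), ("B", "C")}
same = heckerman_weight(prior_net, {("A", "B"), ("B", "C")})          # kappa**0
reversed_arc = heckerman_weight(prior_net, {("B", "A"), ("B", "C")})  # kappa**2
```

The normalizing constant follows by summing the weights over all DAGs, which is the expensive step in practice.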

Among the limitations penalizing the use of this prior we find the following:
(i) the impossibility of specifying the degree of belief when it depends not only on the number of differing edges but also on their position and type: the presence, absence, or direction of an arrow may have an impact on beliefs about other edges;
(ii) the elicitation about a subset of nodes is not addressed;
(iii) the causal semantics is natural for this approach because each arrow represents an immediate cause; it seems difficult to mix probabilistic and causal beliefs by counting differences in arrows, because under the probabilistic semantics a DAG is just a member of an equivalence class of DAGs representing the same collection of conditional independence relationships.

A simple reformulation is oriented to computation, and it involves three operators: arrow deletion, arrow insertion, and change of direction. The reference set of atomic features is made by features F_d: "The application of d operations produces the candidate network from the prior network," for d = 0, 1, ..., m - 1, together with F_m: "The application of m or more operations produces the candidate network." An undirected graph made by just one clique is associated with the potentials represented in Table 3; on the right of Table 3, the values of the potential function are shown, and the normalization constant follows by summing the potential over all configurations. It is obviously possible to make the two approaches as close as desired by setting the potentials in terms of powers of κ, with 0 < κ ≤ 1.

Table 3: Potential function for the Heckerman et al. [9] prior.

An even simpler reformulation exploits the incompatibility of the above features and is based on just one reference feature, F: "The application of d operations produces the candidate network," with d = 0, 1, ..., m and with arrow manipulations (insert, delete, change of direction) defined as above. Feature F essentially defines a plausible neighborhood of the prior network. Further reference features may be introduced to refine such plausibility, for example by also considering one causal relationship and one C.I. relationship. In other words, a candidate network "close" to the prior network could be associated with a high prior probability, which is then tuned according to the presence or absence of the two other relevant features. A natural ordered partition of descriptors puts the neighborhood feature in the first component and the two relationships in the second; thus a two-component chain graph model may support the elicitation.

A different kind of reformulation is based on the fully probabilistic semantics, in which a prior DAG is just a way to define a collection of C.I. relationships. In this case it is natural to define a structural feature for each conditional independence relationship, among those represented by the prior DAG, that is relevant according to the expert. The reference set in this case is a collection of plausible C.I. statements taken from the prior DAG. Note that, although it is not essential to draw such a prior DAG, it may be useful, because a general collection of C.I. statements does not necessarily imply the existence of a compatible DAG.

If none of the C.I. relations is preeminent, the ordered partition of descriptors contains just one element, and a CG model made by one component (an undirected graphical model) is suited for the elicitation.

3.3. Imoto et al. (2003)

In the seminal paper of Imoto et al. [25], the authors developed a framework for combining microarray data and biological knowledge while learning the structure of a BN representing relationships among genes. The proposed criterion has two components, and the second one is particularly interesting because it captures the a priori biological knowledge.

Following their original notation with minor modifications, π(G) is the prior distribution of network G. The interaction energy e_{ij} of the edge from gene i to gene j is defined on a sample space categorized into a few values; for example, if gene i is known to regulate gene j, then e_{ij} is set close to zero, while if not much is known about such potential regulation, then e_{ij} = 0.5. The total energy of a network is E(G) = Σ e_{ij}, with the sum taken over the edges existing in G; the total energy may be rewritten by collecting the parents of each node. The (prior) probability of network G is modeled by the Gibbs distribution π(G) = Z^{-1} exp(-ζ E(G)), where ζ is a hyperparameter and Z is the normalizing constant, also called the partition function, obtained by summing exp(-ζ E(G)) over the collection of all DAGs on the fixed set of nodes.
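The Gibbs prior can be sketched on a toy node set small enough to enumerate all DAGs exactly; the energy values and the value of ζ below are illustrative assumptions, not taken from [25].

```python
import itertools
import math

def is_acyclic(edges, n):
    """Kahn's algorithm: a directed graph is acyclic iff all nodes can be
    peeled off in topological order."""
    indeg = [0] * n
    out = [[] for _ in range(n)]
    for i, j in edges:
        out[i].append(j)
        indeg[j] += 1
    stack = [v for v in range(n) if indeg[v] == 0]
    seen = 0
    while stack:
        v = stack.pop()
        seen += 1
        for w in out[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                stack.append(w)
    return seen == n

def all_dags(n):
    """Enumerate every labeled DAG on n nodes (feasible only for tiny n)."""
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    for mask in itertools.product([0, 1], repeat=len(pairs)):
        edges = frozenset(p for p, b in zip(pairs, mask) if b)
        if is_acyclic(edges, n):
            yield edges

def energy(edges, e, default=0.5):
    """Total energy E(G): sum of interaction energies over edges present."""
    return sum(e.get(edge, default) for edge in edges)

n, zeta = 3, 1.0
e = {(0, 1): 0.1}       # illustrative: gene 0 believed to regulate gene 1
dags = list(all_dags(n))
Z = sum(math.exp(-zeta * energy(g, e)) for g in dags)   # partition function
prior = {g: math.exp(-zeta * energy(g, e)) / Z for g in dags}
```

The exhaustive enumeration in the partition function is exactly what makes the approach intractable for realistic node sets, as discussed below.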

Operationally, the prior information is coded into a square matrix whose size is the number of genes, with each entry set to a low or a high energy value according to the prior belief. Beliefs in protein-protein interactions are coded symmetrically in the two entries of each gene pair. Protein-DNA interactions between a transcription regulator and the controlled gene are accounted for by assigning a low energy to the arrow from the regulator to the controlled gene and a high energy to the reverse arrow. Some genes are controlled by a transcription regulator through a consensus motif in their DNA promoter region: if genes have the consensus motif and they are regulated by a given gene, the arrows leaving the regulator towards them receive low energies and the reverse arrows high energies.

The seminal approach of Imoto et al. [25] suffers from two main limitations. They stated that biological knowledge should suggest the partitioning of the underlying continuous energy function, but it is not clear how, even after invoking the metaphor of energy from physics. In their examples they tried several values of the hyperparameter ζ and optimized its selection, a procedure not much in line with pure preexperimental Bayesian elicitation. Moreover, the sum in the partition function is taken over the set of all DAGs on the given nodes, and it becomes quite intractable beyond a handful of nodes. It follows that their approach is substantially a way to build a score function, in a spirit similar to [26], but without providing a flexible support for the calibration of the score function.

The prior information considered by these authors can also be expressed in terms of reference features, for example by considering features like "Gene i regulates gene j" for all relevant pairs of genes. If preeminent features are absent, a UG model formally resembles the expression based on energy functions, but with some major differences:
(1) the parameterization is related to subjective probability through odds values;
(2) the normalization constant is easier to calculate, at least if the number of features is smaller than the total number of gene pairs (some genes omitted);
(3) the general calibration constant disappears, although a similar tuning has been considered elsewhere [21] to smooth raw elicited odds values while trying to compensate for the well-known human tendency to overstate odds.

3.4. Werhli and Husmeier (2007)

The authors [27], building on the work of Imoto et al. [25], defined a prior information matrix B whose element B_{ij} refers to the arrow from node i to node j. If no prior preference about the presence of such an arrow is elicited, then B_{ij} = 0.5; values below 0.5 put more plausibility on the absence of the arrow, while values above 0.5 favor its presence. Note that the elements of B are not probabilities.

The calculation of the prior probability of a candidate DAG G is straightforward if G is represented through an adjacency matrix in which the element in row i and column j is 1 if the DAG has an arrow from i to j, and 0 otherwise. First, the energy of the DAG is calculated as E(G) = Σ_{i≠j} |B_{ij} - G_{ij}|, with B_{ij} and G_{ij} being, respectively, the elements of the prior matrix and of the adjacency matrix. The probability elicited for G is P(G | β) = Z(β)^{-1} exp(-β E(G)), where β is a hyperparameter regulating the overall strength of the elicited degree of belief and Z(β) is the partition function, which depends on a sum of energy terms over the collection of all DAGs on the fixed set of nodes. It follows that an increase of energy corresponds to a larger mismatch with the prior matrix B. In the limit β → 0 a noninformative prior is obtained, while for β diverging to infinity peaked priors are defined.
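Under these definitions, the energy and the unnormalized prior of a candidate adjacency matrix can be sketched as follows; the prior matrix B and the values of β are illustrative assumptions.

```python
import math

def energy(B, G):
    """E(G) = sum over ordered pairs i != j of |B_ij - G_ij|."""
    n = len(B)
    return sum(abs(B[i][j] - G[i][j])
               for i in range(n) for j in range(n) if i != j)

def unnormalized_prior(B, G, beta):
    """exp(-beta * E(G)); dividing by the partition function Z(beta),
    a sum over all DAGs, would give P(G | beta)."""
    return math.exp(-beta * energy(B, G))

# illustrative prior matrix: arrow 0 -> 1 favored, no preference elsewhere
B = [[0.0, 0.9, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
G_match = [[0, 1, 0], [0, 0, 0], [0, 0, 0]]   # contains the favored arrow
G_empty = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
```

As β grows the ratio of the two priors diverges, while at β = 0 every structure receives the same weight, matching the noninformative limit described above.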

The relationship with the seminal approach of Imoto et al. [25] is clear, but despite the easier parameterization chosen to define the energy function, several limitations remain. First of all, the overall strength β is not straightforwardly related to (subjective) probability; therefore at some point the expert has to play numerically with fake examples to get a feeling for reasonable values of such a hyperparameter. The calibration of the whole approach is difficult because it depends on the calculation of the normalization constant, which is hard to obtain since it requires energy values for all DAGs on the fixed set of nodes. Despite the shortcut proposed by the authors, the calculation of the normalization constant remains a bottleneck of the approach.

The authors increased the flexibility of their approach in representing prior beliefs by using two matrices B1 and B2, with two energy functions depending, respectively, on B1 and B2, each weighted by its own hyperparameter. In an elicitation based on a widely adopted database of metabolic pathways, prior values were defined as the ratio between the number of times two genes are linked inside a pathway and the number of times they appear in a pathway. Results of a study on genes measured at several time points suggested that the procedures using prior information outperformed those without it. Nevertheless, the above-mentioned limitations get even worse under such a generalization; for example, the explosion in the number of terms entering the partition function must be kept under control by limiting the number of arrows entering a node, a technical constraint which may lead to biased elicited beliefs.

The expressivity obtained by these authors through the use of two matrices may be similarly obtained using reference features, with some advantages. Back to the BC case study, a major hypothesis under which the elicitation may be performed could be F1: "ER is a hub gene," so that given the configuration in which F1 holds, the expert has to express further conditional beliefs about the other markers, for example "PR regulates P53," "P53 acts on KI67," "P53 acts on NEU." In this context it is natural to consider an order relation on the features; thus a chain graph model is suited for the elicitation, with just one node in the first chain component and as many arrows leaving it as needed to capture changes of conditional beliefs due to a switch in the major hypothesis, from F1 true to F1 false. Note that, using a chain graph on features, it is possible to tune the flexibility exactly to the amount needed, without the unnecessary and blind consideration of all pairs of genes.

3.5. Discussion

The expressivity achieved through reference features is wide whether probabilistic or causal information is elicited. Many limitations found in other approaches depend on the adoption of single arrows as the key building block of the elicitation. Further restrictions are due to the use of marginal relationships among arrows, from which severe constraints on the expressivity of the approach follow. The elicitation based on reference features has maximum resolution because, for a suitable set of reference features, it is possible to define a prior distribution characterized by a distinct probability value for each DAG on the given set of nodes.

There are useful side effects in the approach based on structural features. First, the elicitation effort does not depend on the size of the space of DAGs but on the size of the collection of features. Another side effect is related to the cardinality of the equivalence classes. Two candidate DAGs characterized by two different configurations of structural features may receive very different prior probability values just because the cardinalities of the equivalence classes they belong to are very different (see (11)). Note that the cardinality of equivalence classes must be taken into account to preserve probabilistic coherence.

The above description of the elicitation process did not deal with computational issues, which are very important in applications. The proposed approach is suited to large networks if the number of DAGs within each equivalence class is available, for example as a Monte Carlo point estimate. Monte Carlo simulation makes it possible to explore the space of DAGs, and it provides evidence about the presence of features that are logically incompatible; thus it may also suggest predicates on which the expert should focus to improve the definition of reference features. Moreover, the elicitation defines a proper prior distribution, so that different MCMC algorithms could be developed to obtain the posterior distribution on structures. New greedy search algorithms to find plausible structures could be investigated to exploit the elicited reference features, for example by developing an algorithm that, at each iteration of the optimization, generates candidate DAGs belonging to different equivalence classes.
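A rough sketch of the Monte Carlo idea follows, with hypothetical structural features on three nodes. Note that the simple sampler below is not uniform over DAGs, so for unbiased estimates of class cardinalities a uniform sampler or an MCMC scheme targeting the uniform distribution would be needed; the sketch only illustrates how sampled DAGs are tallied by feature configuration.

```python
import random
from collections import Counter

def random_dag(nodes, p=0.5, rng=random):
    """Draw a random topological order, then include each forward arrow
    independently with probability p; the result is always acyclic."""
    order = list(nodes)
    rng.shuffle(order)
    edges = set()
    for i in range(len(order)):
        for j in range(i + 1, len(order)):
            if rng.random() < p:
                edges.add((order[i], order[j]))
    return edges

# hypothetical structural features on nodes {A, B, C}
def f1(edges):
    """Node A receives no arrow (it is not a descendant of any node)."""
    return not any(head == "A" for _, head in edges)

def f2(edges):
    """There is an arrow A -> B."""
    return ("A", "B") in edges

rng = random.Random(0)
counts = Counter()
for _ in range(10_000):
    g = random_dag(["A", "B", "C"], rng=rng)
    counts[(f1(g), f2(g))] += 1
# the tallies estimate the share of sampled DAGs in each feature configuration;
# a configuration that never occurs flags potentially incompatible features
```

An empty cell in such a table is exactly the kind of evidence of logical incompatibility mentioned above.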

The approach to elicitation based on graphical models does not require very esoteric software, although software libraries for platforms commonly adopted in scientific computing would offer the opportunity to perform extensive testing and to investigate the human heuristics specifically relevant in this elicitation framework. It is well known that graphical user interfaces facilitate the elicitation, especially if experts are not much trained in Bayesian statistics; such a resource is almost mandatory when facing the elicitation in problem domains involving a large number of structural features.

4. Conclusions

Graphical models may be exploited to elicit experts' beliefs about the structure of an unknown BN. The joint plausibility of configurations of structural features is decomposed according to the conditional independence relationships that are considered plausible by an expert. The expert may use CG models to elicit structural prior information in quite complex domains without leaving a full Bayesian framework. From the elicited CG model, possibly using an auxiliary Monte Carlo simulation to estimate the cardinality of equivalence classes, a (proper) subjective prior distribution on the space of DAGs is built, ready to be combined with the likelihood function in order to find BN structures supported both by expert beliefs and by collected observations.

It is no surprise that in complex domains the definition of a prior distribution may be costly. A trade-off should be found by considering the goal of the analysis, how much prior information is available, and the cost and importance of collected data. Systems biology and medicine are expected to be fields in which the proposed approach might be useful, because subjective prior information, besides providing the above-mentioned benefits, also tempers the curse of dimensionality caused by structures defined on a high number of variables.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work was partially funded by a grant from the University of Florence.

References

  1. J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Francisco, Calif, USA, 1988.
  2. J. Pearl, Causality, Cambridge University Press, 2000.
  3. D. E. Heckerman, E. J. Horvitz, and B. N. Nathwani, “Toward normative expert systems: part I. The Pathfinder project,” Methods of Information in Medicine, vol. 31, no. 2, pp. 90–105, 1992. View at Google Scholar · View at Scopus
  4. K. McNaught and A. Chan, “Bayesian networks in manufacturing,” Journal of Manufacturing Technology Management, vol. 22, no. 6, pp. 734–747, 2011. View at Publisher · View at Google Scholar · View at Scopus
  5. B. G. Marcot, R. S. Holthausen, M. G. Raphael, M. M. Rowland, and M. J. Wisdom, “Using Bayesian belief networks to evaluate fish and wildlife population viability under land management alternatives from an environmental impact statement,” Forest Ecology and Management, vol. 153, no. 1–3, pp. 29–42, 2001. View at Publisher · View at Google Scholar · View at Scopus
  6. N. Friedman, M. Linial, I. Nachman, and D. Pe'er, “Using Bayesian networks to analyze expression data,” Journal of Computational Biology, vol. 7, no. 3-4, pp. 601–620, 2000. View at Publisher · View at Google Scholar · View at Scopus
  7. A. O’Hagan and J. Foster, Bayesian Inference, vol. 2 of Kendall’s Advanced Theory of Statistics, John Wiley & Sons, Chichester, UK, 2nd edition, 2004.
  8. D. V. Lindley, “The philosophy of statistics,” Journal of the Royal Statistical Society D, vol. 49, no. 3, pp. 293–337, 2000. View at Google Scholar · View at Scopus
  9. D. Heckerman, D. Geiger, and D. M. Chickering, “Learning Bayesian networks: the combination of knowledge and statistical data,” Machine Learning, vol. 20, no. 3, pp. 197–243, 1995. View at Publisher · View at Google Scholar · View at Scopus
  10. R. W. Robinson, “Counting labeled acyclic digraphs,” in New Directions in the Theory of Graphs, pp. 239–273, Academic Press, New York, NY, USA, 1973. View at Google Scholar
  11. F. M. Stefanini, “Graphical models for eliciting structural information,” in Classification and Data Mining, A. Giusti, G. Ritter, and M. Vichi, Eds., Springer, Berlin, Germany, 2013. View at Google Scholar
  12. F. M. Stefanini, “Eliciting expert beliefs on the structure of a Bayesian network,” in Proceedings of the Probabilistic Graphical Models (PGM '08), Hirtshals, Denmark, 2008.
  13. J. Whittaker, Graphical Models in Applied Multivariate Statistics, John Wiley & Sons, Chichester, UK, 1990.
  14. R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter, Probabilistic Networks and Expert Systems, Springer, New York, NY, USA, 1999.
  15. D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques, MIT Press, Cambridge, Mass, USA, 2009.
  16. A. P. Dawid, “Conditional independence in statistical theory,” Journal of the Royal Statistical Society B, vol. 41, no. 1, pp. 1–31, 1979. View at Google Scholar
  17. M. Frydenberg, “The chain graph Markov property,” Scandinavian Journal of Statistics, vol. 17, no. 4, pp. 333–353, 1990. View at Google Scholar
  18. J. Pearl, “Causal inference in statistics: an overview,” Statistics Surveys, vol. 3, pp. 96–146, 2009. View at Publisher · View at Google Scholar · View at Scopus
  19. A. P. Dawid, “Beware of the DAG,” Journal of Machine Learning Research, vol. 6, pp. 59–86, 2008. View at Google Scholar
  20. F. M. Stefanini, “Prior beliefs about the structure of a probabilistic network,” in Proceedings of the Statistical Methods for the Analysis of Large Data-Sets (SIS '09), Pescara, Italy, 2009.
  21. F. M. Stefanini, “The revision of elicited beliefs on the structure of a Bayesian network,” in Proceedings of the Complex Data Modeling and Computationally Intensive Statistical Methods for Estimation and Prediction (SCo '09), Milano, Italy, 2009.
  22. P. H. Garthwaite, J. B. Kadane, and A. O'Hagan, “Statistical methods for eliciting probability distributions,” Journal of the American Statistical Association, vol. 100, no. 470, pp. 680–700, 2005. View at Publisher · View at Google Scholar · View at Scopus
  23. F. M. Stefanini, D. Coradini, and E. Biganzoli, “Conditional independence relations among biological markers may improve clinical decision as in the case of triple negative breast cancers,” BMC Bioinformatics, vol. 10, supplement 12, article S13, 2009. View at Publisher · View at Google Scholar · View at Scopus
  24. W. Buntine, “Theory of refinement on Bayesian networks,” in Proceedings of the 7th Conference on Uncertainty in Artificial Intelligence, pp. 52–60, 1991.
  25. S. Imoto, T. Higuchi, T. Goto, K. Tashiro, S. Kuhara, and S. Miyano, “Combining microarrays and biological knowledge for estimating gene networks via Bayesian networks,” in Proceedings of the IEEE Computational Systems Bioinformatics (CSB '03), pp. 104–113, IEEE Computer Society, 2003.
  26. M. Mascherini and F. M. Stefanini, “Using weak prior information on structures to learn Bayesian networks,” in Knowledge-Based Intelligent Information and Engineering Systems, B. Apolloni, R. J. Howlett, and L. Jain, Eds., vol. 4692 of Lecture Notes in Computer Science, pp. 413–420, 2007. View at Publisher · View at Google Scholar
  27. A. V. Werhli and D. Husmeier, “Reconstructing gene regulatory networks with Bayesian networks by combining expression data with multiple sources of prior knowledge,” Statistical Applications in Genetics and Molecular Biology, vol. 6, no. 1, article 15, 2007. View at Publisher · View at Google Scholar · View at Scopus