Abstract

Bayesian networks are graphical probabilistic models through which we can acquire, capitalize on, and exploit knowledge. Over the last decade they have become an important tool for research and applications in artificial intelligence and many other fields. This paper presents Bayesian networks and discusses the inference problem in such models. It states the problem and proposes a method to compute probability distributions. It also uses D-separation to simplify the computation of probabilities in Bayesian networks. Given a Bayesian network over a family of random variables, this paper presents a result on the computation of the probability distribution of a subset of these variables, using separately a computation algorithm and D-separation properties. It also shows the uniqueness of the obtained result.

1. Introduction

Bayesian networks are graphical models for probabilistic relationships among a set of variables. They have been used in many fields due to their simplicity and soundness. They are used to model, represent, and explain a domain, and they allow us to update our beliefs about some variables when some other variables are observed, which is known as inference in Bayesian networks.

Given a Bayesian network [1] relative to the set $X_I = (X_i)_{i \in I}$ of random variables, we are interested in computing the joint probability distribution of a nonempty subset $X_S = (X_i)_{i \in S}$ (called the target) of $X_I$.

The computation of the probability distribution of $X_S$ requires marginalizing out the non-target variables of $X_I$ from the joint distribution corresponding to the Bayesian network.

In large Bayesian networks, the computation of probability distributions and conditional probability distributions may require summations over very large subsets of the variables. Consequently, there is a need to order and, if possible, segment these computations into several computations that are less intensive and better suited to parallel treatment. These segmentations are related to the graphical properties of the Bayesian network.

This paper describes the computation of the target distribution using a specific order given by a proposed algorithm, and the segmentation of this computation.

We consider discrete random variables, but the results presented here can be generalized to continuous random variables whose joint distribution has a density relative to a finite measure (the summations are then replaced by integrations relative to that measure).

The paper is organized as follows. Section 2 introduces Bayesian networks and level two Bayesian networks. Section 3 presents the inference problem in Bayesian networks and the proposed computation algorithm. Section 4 outlines D-separation in Bayesian networks and describes graphical partitions that allow the computation of probability distributions to be segmented. Section 5 proves the uniqueness of the results obtained in Sections 3 and 4.

2. Bayesian Networks and Level Two Bayesian Networks

2.1. Bayesian Networks

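A Bayesian network (BN) is a family of random variables $(X_i)_{i \in I}$ such that:
(i) the set $I$ is finite and endowed with a structure of a directed acyclic graph (DAG), $\mathcal{G}$, where, for each $i \in I$:
(a) $p(i)$ is the set of parents of $i$ ($p(i) = \{j \in I : j \to i\}$),
(b) $c(i)$ is the set of children of $i$ ($c(i) = \{j \in I : i \to j\}$),
(c) $d(i)$ is the set of descendants of $i$ ($d(i) = \{j \in I : \text{there is a directed path from } i \text{ to } j\}$);
(ii) for each $i \in I$, $X_i$ is independent of $X_{I \setminus (\{i\} \cup d(i))}$ conditional on $X_{p(i)}$, where $X_J$ denotes $(X_j)_{j \in J}$ for $J \subseteq I$ (for more details see, e.g., [15]).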

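We know that this is equivalent to the equality
$$P(X_I) = \prod_{i \in I} P(X_i \mid X_{p(i)}),$$
where $P(X_I)$ is the joint probability distribution of $X_I = (X_i)_{i \in I}$ and $P(X_i \mid X_{p(i)})$ is the probability distribution of $X_i$ conditional on its parents $X_{p(i)}$.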
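As an illustration of this factorization, the following minimal sketch evaluates a joint distribution as a product of conditional probabilities. The three-node network, its variable names, and its probability values are hypothetical and are not taken from the figures of this paper.

```python
from itertools import product

# Hypothetical network A -> C <- B with binary variables (illustration only).
parents = {"A": (), "B": (), "C": ("A", "B")}

# cpt[node][parent_values] = P(node = 1 | parents = parent_values)
cpt = {
    "A": {(): 0.3},
    "B": {(): 0.6},
    "C": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.9},
}

def joint(assignment):
    """Joint probability of a full assignment, as the product over all
    nodes of P(node | its parents)."""
    prob = 1.0
    for node, pa in parents.items():
        p_one = cpt[node][tuple(assignment[p] for p in pa)]
        prob *= p_one if assignment[node] == 1 else 1.0 - p_one
    return prob

def all_assignments():
    """Enumerate every 0/1 assignment of the variables."""
    names = list(parents)
    for values in product((0, 1), repeat=len(names)):
        yield dict(zip(names, values))

# The factorized joint must sum to one over all assignments.
total = sum(joint(a) for a in all_assignments())
```

Storing one table per node, indexed by the values of its parents, is what keeps the representation compact compared with a full joint table.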

The joint probability distribution corresponding to the BN in Figure 1 can be written as

2.2. Level Two Bayesian Networks

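We consider the probability distribution $P$ of a family of random variables $(X_i)_{i \in I}$ taking values in a finite space. Let $\mathcal{I}$ be a partition of $I$ and let us consider a DAG $\mathcal{G}$ on $\mathcal{I}$.

We say that there is a link from $J'$ to $J$ (where $J'$ and $J$ are atoms of the partition $\mathcal{I}$) if $(J', J) \in \mathcal{G}$. If $J \in \mathcal{I}$, we denote by $p(J)$ the set of parents of $J$, that is, the set of $J' \in \mathcal{I}$ such that $(J', J) \in \mathcal{G}$.

The probability $P$ is defined by the level two Bayesian network (BN2) $(\mathcal{I}, \mathcal{G})$ on $I$ if, for each $J \in \mathcal{I}$, we have the conditional probability $P(X_J \mid X_{p(J)})$, that is, the probability of $X_J = (X_i)_{i \in J}$ conditioned by $X_{p(J)}$ (which, if $p(J) = \emptyset$, is the marginal probability $P(X_J)$), so that
$$P(X_I) = \prod_{J \in \mathcal{I}} P(X_J \mid X_{p(J)}).$$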

The probability distribution associated to the BN of level 2 in Figure 2 can be written as

2.3. Close Descendants

We define the set of close descendants of a node $i$ (denoted $cd(i)$) as the set of vertices containing the children of $i$ and the vertices located on a path between $i$ and one of its children.
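The definition of close descendants can be sketched as follows. This is a minimal illustration, assuming that "path" means a directed path in the DAG; the function names and the example graph are hypothetical and do not correspond to Figure 2.

```python
def descendants(children, node):
    """All nodes reachable from `node` by a directed path (node excluded)."""
    seen, stack = set(), list(children.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen

def close_descendants(children, node):
    """Children of `node`, plus every vertex lying on a directed path
    from `node` to one of its children."""
    kids = set(children.get(node, ()))
    below = descendants(children, node)
    # A descendant v lies on such a path exactly when some child of
    # `node` is reachable from v.
    on_path = {v for v in below if descendants(children, v) & kids}
    return kids | on_path

# Hypothetical DAG: 1 -> 2 -> 4 -> 3, 1 -> 3, 3 -> 5.
example_dag = {1: [2, 3], 2: [4], 4: [3], 3: [5]}
cd_of_1 = close_descendants(example_dag, 1)
```

In this example, node 4 is not a child of node 1, but it lies on the path from 1 to its child 3, so it is a close descendant of 1; node 5 is a descendant of 1 but not a close one.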

In the example below (Figure 2), we have

2.4. Initial Subset

For each subset $S \subseteq I$, the initial subset defined by $S$ is the set consisting of $S$ itself and of the nodes $i \in I$ such that there is a path in the DAG from $i$ to an element of $S$.

We can identify this subset with the union of $S$ and all nodes that are ancestors of an element of $S$.

For each $S$, the initial subset is a BN; in other words, the restriction of a BN to an initial subset is a BN.
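A minimal sketch of the initial subset, computed as a subset together with all of its ancestors, is given below. The function name and the example graph are hypothetical. The final check illustrates why the restriction is still a BN: every parent of a node of the initial subset also belongs to it.

```python
def initial_subset(parents, subset):
    """Return `subset` together with every ancestor of its elements."""
    result, stack = set(subset), list(subset)
    while stack:
        node = stack.pop()
        for pa in parents.get(node, ()):
            if pa not in result:
                result.add(pa)
                stack.append(pa)
    return result

# Hypothetical DAG given by parent lists: 1 -> 3 <- 2, 3 -> 4, 3 -> 5.
example_parents = {3: [1, 2], 4: [3], 5: [3]}
init = initial_subset(example_parents, {4})

# Every parent of a node in the initial subset is itself in the subset.
closed_under_parents = all(
    set(example_parents.get(n, ())) <= init for n in init
)
```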

In the example above (Figure 2), we have

3. Inference in Bayesian Networks

Consider the BN in Figure 1.

Suppose we are interested in computing the distribution of , in other words all variables are in the target except .

By marginalizing out the variables from the joint probability distribution , the target distribution can be written as

is a function that depends on , and but has nothing to do with joint probability distribution of .

By doing this, we lose the structure of the BN.

If we do the marginalization as follows, we obtain (according to Bayes' theorem):

In other words .

This provides the target with the structure of a level two Bayesian network, shown in Figure 3.

The variables grouped together in the marginalization above, so as to keep a BN2 structure, are exactly the close descendants defined above.

More generally, if we have to sum out more than one variable, the variables must first be ordered. The aim of inference is then to find a marginalization (or elimination) ordering for the arbitrary set of variables not in the target. This aim is shared by other node elimination algorithms such as “variable elimination” [6] and “bucket elimination” [7].

The main idea of all these algorithms is to find the best way to sum a set of variables out of a list of factors, one by one. An ordering of these variables is required as input. The computation depends on the elimination ordering: different elimination orderings produce different factors.
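The generic idea of summing variables out of a list of factors, one at a time, can be sketched as follows. This is a plain variable-elimination illustration on binary variables, not the SRA; the factor representation (a tuple of variable names plus a table), the function names, and the chain network A -> B -> C with its numbers are all assumptions made for the example.

```python
from itertools import product

def multiply(f, g):
    """Pointwise product of two factors over binary variables."""
    f_vars, f_tab = f
    g_vars, g_tab = g
    out_vars = tuple(dict.fromkeys(f_vars + g_vars))  # ordered union
    table = {}
    for values in product((0, 1), repeat=len(out_vars)):
        row = dict(zip(out_vars, values))
        table[values] = (f_tab[tuple(row[v] for v in f_vars)]
                         * g_tab[tuple(row[v] for v in g_vars)])
    return out_vars, table

def sum_out(f, var):
    """Marginalize `var` out of factor `f`."""
    f_vars, f_tab = f
    idx = f_vars.index(var)
    table = {}
    for values, p in f_tab.items():
        key = values[:idx] + values[idx + 1:]
        table[key] = table.get(key, 0.0) + p
    return f_vars[:idx] + f_vars[idx + 1:], table

def eliminate(factors, order):
    """Sum out the variables in `order`, one by one."""
    factors = list(factors)
    for var in order:
        touching = [f for f in factors if var in f[0]]
        if not touching:
            continue
        rest = [f for f in factors if var not in f[0]]
        combined = touching[0]
        for f in touching[1:]:
            combined = multiply(combined, f)
        factors = rest + [sum_out(combined, var)]
    return factors

# Hypothetical chain A -> B -> C: factors P(A), P(B | A), P(C | B).
p_a = (("A",), {(0,): 0.7, (1,): 0.3})
p_b_a = (("A", "B"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8})
p_c_b = (("B", "C"), {(0, 0): 0.6, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.5})

result_ab = eliminate([p_a, p_b_a, p_c_b], ["A", "B"])
result_ba = eliminate([p_a, p_b_a, p_c_b], ["B", "A"])
```

Both orderings give the same marginal on C, but the ordering B, A builds an intermediate factor over three variables, whereas A, B never needs more than two; this is the sense in which the cost depends on the ordering.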

The algorithm we proposed to solve this problem is called the “Successive Restrictions Algorithm” (SRA) [8]. The SRA is a goal-oriented algorithm that tries to find an efficient marginalization (elimination) ordering for an arbitrary joint distribution.

The general idea of the successive restrictions algorithm is to manage the succession of summations over all random variables outside the target so as to keep on the target a structure that is less constraining than a Bayesian network but still saves memory, namely the structure of a level two Bayesian network. This is made possible by the notion of close descendants.

The principle of the algorithm is presented in detail in [8].

4. D-Separation and Computations in a Bayesian Network

We have introduced an algorithm that makes it possible to compute the probability distribution of a subset of random variables of the initial graph. It is also possible to use the SRA to compute the probability distribution of any set of variables conditional on another subset.

This algorithm tries to reach the target distribution by finding a marginalization ordering that takes into account the computational constraints of the application. In certain simple cases the SRA may be less powerful than the traditional methods [6, 9–13], but it has the advantages of adapting to any subset of nodes of the initial graph and of producing at each stage a result that is interpretable in terms of conditional probabilities, and thus technically usable.

In addition to the SRA, we propose, especially for large Bayesian networks, to segment the computations into several lighter computations that can be carried out independently. These segmentations are made possible by D-separation.

4.1. D-Separations and Classical Results

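Consider a DAG $\mathcal{G}$ on $I$. A chain is a sequence $(i_1, \dots, i_n)$ of elements of $I$ such that, for all $k \in \{1, \dots, n-1\}$, $i_k \to i_{k+1}$ or $i_{k+1} \to i_k$.

The nodes $i_2, \dots, i_{n-1}$ are called intermediate nodes on this chain.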

At an intermediate node, a chain can have one of three types of connection (serial, diverging, or converging), as illustrated in Figure 4.

Let $\mathcal{G}$ be a DAG on $I$, let $Z$ be a subset of $I$, and let $i$ and $j$ be distinct nodes in $I \setminus Z$. A chain between $i$ and $j$ is d-separated by $Z$ if there is an intermediate node $k$ satisfying one of the two properties: (i) $k \in Z$ and the connection is, at $k$, serial or diverging; (ii) neither $k$ nor any of its descendants is in $Z$, and the connection is converging at $k$.

In other words, a chain is not d-separated by $Z$ if it is in a converging connection at each intermediate node belonging to $Z$, and in a serial or a diverging connection at each intermediate node that is not in $Z$ and has no descendants in $Z$.
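The two blocking conditions can be checked mechanically along a given chain. The sketch below assumes the standard reading of d-separation (a converging node blocks the chain when neither it nor any of its descendants is in the conditioning set); the function names and the example graph are hypothetical.

```python
def descendants(children, node):
    """All nodes reachable from `node` by a directed path (node excluded)."""
    seen, stack = set(), list(children.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen

def chain_is_blocked(children, chain, given):
    """True if `chain` (a list of nodes, consecutive ones joined by an
    edge in either direction) is d-separated by the set `given`."""
    def has_edge(a, b):
        return b in children.get(a, ())

    for prev, mid, nxt in zip(chain, chain[1:], chain[2:]):
        converging = has_edge(prev, mid) and has_edge(nxt, mid)
        if converging:
            # Blocks unless mid or one of its descendants is observed.
            if mid not in given and not (descendants(children, mid) & given):
                return True
        elif mid in given:
            # Serial or diverging connection at an observed node.
            return True
    return False

# Hypothetical DAG: A -> C <- B, C -> D.
dag = {"A": ["C"], "B": ["C"], "C": ["D"]}
```

With this graph, the chain A, C, B is blocked when nothing is observed and becomes active once C or its descendant D is observed, while the serial chain A, C, D is blocked exactly when C is observed.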

Classic Result 1
If $i$ and $j$ are d-separated by $Z$ (that is, if every chain between them is d-separated by $Z$), then the variables $X_i$ and $X_j$ are independent conditional on $X_Z$.

4.2. Notions

Given a subset $S$ of $I$, we use the following notions.

Markov Blanket. The Markov blanket of $S$ is made of the parents of $S$, the children of $S$, and the variables sharing a child with $S$.

Markov Boundary of $S$.

Close Descent of $S$. The close descent of $S$ is the set of all close descendants of the elements of $S$ other than those in $S$ itself.

Exterior Roots of $S$. The exterior roots of $S$ are the parents of the elements of $S$ other than those in $S$ itself.

As we can see on the following example:

, , , .

Classic Result 2
A subset $S$ is d-separated from the rest of the variables conditional on the Markov blanket of $S$.

The proof of these two results, and other results related to D-separation, can be found in [11].
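The Markov blanket of a subset (parents, children, and nodes sharing a child with the subset) can be computed directly from the parent lists. This is a minimal sketch; the function name and the example graph are hypothetical.

```python
def markov_blanket(parents, subset):
    """Parents, children, and co-parents of `subset`, excluding the
    elements of `subset` themselves."""
    subset = set(subset)
    # Children: nodes having at least one parent in the subset.
    children = {c for c, pas in parents.items() if subset & set(pas)}
    blanket = set(children)
    # Parents of the subset, and the other parents of its children.
    for node in subset | children:
        blanket |= set(parents.get(node, ()))
    return blanket - subset

# Hypothetical DAG given by parent lists: A -> C <- B, C -> D -> E.
blanket_parents = {"C": ["A", "B"], "D": ["C"], "E": ["D"]}
mb_of_a = markov_blanket(blanket_parents, {"A"})
mb_of_c = markov_blanket(blanket_parents, {"C"})
```

Here the blanket of A contains its child C and the co-parent B, and the blanket of C contains its parents A and B and its child D, but not E.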

4.3. Moral and Hypermoral Graphs

Another classic graphical property, used by some inference algorithms in the literature, is the notion of the moral graph.

Moral Graph
Given a DAG $\mathcal{G} = (I, E)$, its associated moral graph is the undirected graph $\mathcal{G}^m = (I, E^m)$, where $E^m$ is the set
(i) of pairs $\{i, j\}$ such that $(i, j) \in E$ or $(j, i) \in E$,
(ii) of pairs $\{i, j\}$ such that $i$ and $j$ have a child in common.
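The construction of the moral graph can be sketched in a few lines: keep every edge of the DAG without its direction and link every pair of parents of a common child. The function name and the example graph are hypothetical.

```python
from itertools import combinations

def moral_graph(parents):
    """Undirected edge set of the moral graph, as a set of frozensets."""
    edges = set()
    for child, pas in parents.items():
        # (i) edges of the DAG, with the direction dropped
        for pa in pas:
            edges.add(frozenset((pa, child)))
        # (ii) parents of a common child are linked together
        for a, b in combinations(pas, 2):
            edges.add(frozenset((a, b)))
    return edges

# Hypothetical DAG given by parent lists: A -> C <- B, C -> D.
moral_edges = moral_graph({"C": ["A", "B"], "D": ["C"]})
```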

In a similar way, we define what we call the hypermoral graph as follows.

Hypermoral Graph
Given a DAG $\mathcal{G} = (I, E)$, its associated hypermoral graph is the undirected graph $\mathcal{G}^h = (I, E^h)$, where $E^h$ is the set
(i) of pairs $\{i, j\}$ such that $(i, j) \in E$ or $(j, i) \in E$,
(ii) of pairs $\{i, j\}$ such that $i$ and $j$ have a close descendant in common.

In Figure 5, there is a link between 4 and 7 in the hypermoral graph because they share 9 as a close descendant.
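The hypermoral graph can be sketched along the same lines as the moral graph, replacing "common child" with "common close descendant". The code below assumes directed paths in the definition of close descendants; the function names and the five-node example are hypothetical and do not reproduce Figure 5.

```python
from itertools import combinations

def descendants(children, node):
    """All nodes reachable from `node` by a directed path (node excluded)."""
    seen, stack = set(), list(children.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen

def close_descendants(children, node):
    """Children of `node` plus vertices on a path to one of its children."""
    kids = set(children.get(node, ()))
    below = descendants(children, node)
    return kids | {v for v in below if descendants(children, v) & kids}

def hypermoral_graph(children):
    """Undirected edge set of the hypermoral graph, as frozensets."""
    nodes = set(children) | {c for cs in children.values() for c in cs}
    # (i) edges of the DAG, with the direction dropped
    edges = {frozenset((a, b)) for a in children for b in children[a]}
    # (ii) pairs of nodes sharing a close descendant
    cd = {n: close_descendants(children, n) for n in nodes}
    for a, b in combinations(nodes, 2):
        if cd[a] & cd[b]:
            edges.add(frozenset((a, b)))
    return edges

# Hypothetical DAG: 1 -> 2 -> 3 -> 4, 1 -> 4, 5 -> 3.
hyper_edges = hypermoral_graph({1: [2, 4], 2: [3], 3: [4], 5: [3]})
```

In this example, nodes 1 and 5 have no child in common, so the moral graph does not link them; they are linked in the hypermoral graph because node 3 is a close descendant of both.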

4.4. Moral and Hypermoral Partitions

The moral graph helps define the moral partition as follows.

(i) Given a subset $S$ of $I$, we call the $S$-moral partition the partition of $I \setminus S$ defined by the equivalence relation $\sim$, where $i \sim j$ means “there exists a chain from $i$ to $j$, in the DAG, not blocked by $S$.”

Equivalently, “there exists a chain, in the moral graph, connecting $i$ to $j$ without an intermediate node in $S$.”

In a similar way we define the hypermoral partition.

(ii) We call the $S$-hypermoral partition the partition of $I \setminus S$ defined by the equivalence relation $\sim_h$, where $i \sim_h j$ means “there exists a chain, in the hypermoral graph, connecting $i$ to $j$ without an intermediate node in $S$.” This is illustrated in Figure 6.
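Once the moral or hypermoral graph is available as an undirected edge set, the corresponding partition amounts to a connected-components computation. The sketch below assumes that the equivalence relation is applied to the nodes outside the chosen subset, so that two such nodes are equivalent when a chain avoiding the subset connects them; the function name and the example are hypothetical.

```python
def partition_outside(nodes, edges, subset):
    """Equivalence classes of the nodes outside `subset`, two nodes being
    equivalent when a chain with no node in `subset` connects them."""
    free = set(nodes) - set(subset)
    adjacency = {n: set() for n in free}
    for edge in edges:
        a, b = tuple(edge)
        if a in free and b in free:
            adjacency[a].add(b)
            adjacency[b].add(a)
    classes, seen = [], set()
    for start in free:
        if start in seen:
            continue
        # Depth-first search restricted to nodes outside the subset.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        classes.append(component)
    return classes

# Hypothetical undirected chain 1 - 2 - 3 - 4 - 5, with subset {3}.
chain_edges = [(1, 2), (2, 3), (3, 4), (4, 5)]
classes = partition_outside(range(1, 6), chain_edges, {3})
```

Each class can then be processed independently, which is what makes the segmentation of the computation possible.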

4.5. Results

The following results show that the computation of the target probability distribution can be segmented.

Theorem 4.1. Let be a BN, and let be a subset of . Let be the -hypermoral partition of and let be the set of elements of which are not close descendants of any element of , that is, . Then, where

The proof of the theorem can be found in [14].

Theorem 4.2. The set of singletons , where (if ), and of subsets , where , constitutes a partition of . As Illustrated in Figure 7.

5. Unique Partition

We have seen in the last two sections that applying the SRA to compute the target distribution provides a BN2 structure, and that using D-separation properties allows this computation to be segmented and also provides a level two Bayesian network structure on the target. In fact, the two structures obtained are the same; this result is given by the following theorem.

Theorem 5.1. The following two sets. (1)The subsets , where . (2)The set of singletons , where (if ) constitute a unique partition defining a BN2 on .

Interpretation
This theorem indicates that the level two Bayesian network characterizing the target probability distribution, obtained by application of the SRA, is unique, independently of the choices made while running the algorithm. This unique partition consists of sets of the two types 1 and 2 mentioned above.

Proof. Let us show that the partition of the target consists of the subsets of types 1 and 2 as mentioned above in the theorem, in other words consists of where , and for all .
As is a BN-containing , without a loss of generality, we can limit ourselves to the case where .
The application of the SRA for the computation of the target distribution requires marginalizing out the non-target variables in a specific order. Let us show that the partition obtained by application of the SRA is the same as the one mentioned above.
Let us prove this result by induction on the number of variables to be marginalized out.
Let us assume that Card. In this case has only one element, .
On the one hand, marginalizing out this single variable, by application of the SRA, creates a new node that contains its close descendants.
The BN2 resulting from this marginalization is formed of the new node along with all other remaining nodes, in other words all the such that , which is shown in Figure 8.
On the other side, since Card, is constituted of a unique equivalent class, , so, by definition, the partition of is constituted of
(1), (2)all other nodes, in other words the nodes such that This shows the result in this first case.
Let us suppose now that Card, and as an hierarchical order on .
We are going to sum out following the inverse order of the given hierarchical order.
Let us assume that the result holds up to a given step of the marginalization, and let us prove it for the next step.
The proof by induction is justified because marginalizing out these last elements does not interfere with the subsequent marginalization steps.
The result is right till step means that we dispose on , of a partition that contains the subsets for , and the nodes such that , (i.e., ), where is the -hypermoral partition associated to .
Showing the result for means to find a partition of constituted of subsets of type 1 et 2 mentioned above.
On one hand, let's first try to find the partition obtained by application of the SRA. We know that, the marginalization of creates a new node, , which contains the close descendants of in the BN2 on (), shown by Figure 9.
So if we write , the set of such that for all , and , the set of such that for all , .
In this case, the partition of is constituted of the new node , and all the other nodes that are left isolated, in other words all the nodes and all the where .
If we write so the partition of by application of the SRA is composed of the following two types of sets:
(1), (2)all other isolated nodes, that is, the nodes of .
On the other hand, let's now try to determine the partition of using the D-Separation properties.
We have on a partition composed of nodes where , and all nodes such that .
Since , and no descendant of is in , (by definition of the hierarchical order), so only one equivalent subset is associated, , it results by definition, the partition of is composed of the following types of subsets:
(1), (2)all other isolated nodes such that
So we have .
This shows that this partition is the same as the one obtained by application of the SRA.