Abstract

Learning Bayesian network (BN) structure from data is a typical NP-hard problem, and most existing algorithms become prohibitively expensive when the number of variables is large. To address this problem, we present an algorithm that integrates a decomposition-based approach with a scoring-function-based approach for learning BN structures. First, the proposed algorithm decomposes the moral graph of the BN into its maximal prime subgraphs. It then orients the local edges in each subgraph by greedy search guided by the K2 score. The last step combines the directed subgraphs to obtain the final BN structure. Theoretical and experimental results show that our algorithm can efficiently and accurately identify complex network structures from small data sets.

1. Introduction

Bayesian networks (BNs), also known as belief networks, have become a popular tool for representing uncertainty in artificial intelligence. They have been applied to a wide range of tasks such as natural spoken dialog systems, vision recognition, expert systems, medical diagnosis, and genetic regulatory network inference [1–5]. A BN consists of two components: a directed acyclic graph (DAG) representing the dependency structure among the variables and a conditional probability table (CPT) for each variable given its parent set. There has been a great deal of work in the last ten years on learning BNs, covering both the graph structure and the probability parameters. However, learning the structure is harder and, arguably, more critical [6–8]. Most of these algorithms can be grouped into two categories: constraint-based methods and search-and-score methods. Constraint-based algorithms generate a list of conditional independence (CI) relationships among the variables in the domain and attempt to find a network that represents these relationships as far as possible [9–11]. The number, complexity, and reliability of the required independence tests are the main concerns with this type of algorithm. Algorithms based on the search-and-score paradigm treat the learning task as a combinatorial optimization problem, in which a search method operates on a search space associated with BNs and evaluates the degree of fit between each candidate and the available data using a scoring function [12–14]. Among the search-and-score algorithms, the K2 algorithm [15] is commonly used; it learns the network structure from data given a prior ordering of the nodes as input. However, searching for the best structure is difficult because the search space grows exponentially with the number of variables [16].

In this paper, we propose a BN structure learning algorithm that combines the merits of the constraint-based method and the K2 algorithm. The proposed algorithm not only employs constraint knowledge to decompose the search space but also uses the K2 score as heuristic knowledge to guide the local greedy search. It exploits the property that decomposing the undirected independence graph does not destroy the local information of the variables in each maximal prime subgraph. At the same time, the K2 algorithm can be used to learn the local structure of each undirected subgraph and obtain a group of directed subgraphs. By combining these directed subgraphs, we obtain the final structure. Theoretical results and a large number of experiments show that the new algorithm is effective and efficient, especially on small data sets. Moreover, the proposed algorithm uses maximal prime decomposition to identify the whole graph structure, which reduces the search space significantly and greatly improves the learning speed.

The remainder of this paper is organized as follows. Section 2 introduces notation and definitions. We describe our algorithm and its theoretical justification in Section 3. Section 4 discusses how to construct the moral graph from observed data. Simulation studies comparing the performance of our algorithm with existing algorithms are presented in Section 5. Finally, in Section 6, we conclude and outline our future work. All proofs are given in the Appendix.

2. Notation and Definitions

In this section, we provide the basic technical terminology needed to understand this paper.

A BN is a pair (G, P), where G = (V, E) is a directed acyclic graph (DAG) whose nodes represent the random variables in V, and P is a joint probability distribution on V. A node u is called a parent of v if the directed edge u → v is in E. The set of all parents of v is denoted by pa(v). A path between u and v is a sequence of distinct nodes u = w_1, ..., w_k = v such that w_i → w_{i+1} or w_{i+1} → w_i is in E for i = 1, ..., k − 1. We say that u is an ancestor of v and v is a descendant of u if there is a directed path from u to v in G. In addition, G and P must satisfy the Markov condition: every variable is independent of any subset of its nondescendant variables conditioned on the set of its parents. We write X ⊥_P Y | Z for the conditional independence of the variable sets X and Y given Z in the distribution P. A path ρ is said to be d-separated by a set Z in a DAG G if and only if (1) ρ contains a "head-to-tail meeting" u → w → v or a "tail-to-tail meeting" u ← w → v such that the middle node w is in Z, or (2) ρ contains a "head-to-head meeting" u → w ← v such that the middle node w is not in Z and no descendant of w is in Z. Two distinct sets X and Y of nodes are said to be d-separated by a set Z in G if Z d-separates every path from any node in X to any node in Y. We use X ⊥_G Y | Z to denote the assertion that X is d-separated from Y given Z in G.

For a DAG G, its moral graph G^m is the undirected graph obtained by connecting every pair of nodes that have a common child and are not already adjacent in G, and then dropping the directionality of all directed edges. An undirected path ρ is said to be separated by a set Z in a moral graph G^m if and only if ρ passes through Z. A BN (G, P) satisfies the faithfulness condition if the d-separations in G identify all and only the conditional independencies in P, that is, X ⊥_G Y | Z if and only if X ⊥_P Y | Z. We drop the subscript G or P in the notation of conditional independence when the faithfulness condition holds. Let G = (V, E) denote a graph consisting of a finite set V of nodes.
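To make the moralization step concrete, the following Python sketch (our own helper built on networkx; the node names are only illustrative) constructs the moral graph of a DAG by "marrying" parents that share a child and dropping edge directions.

```python
import networkx as nx
from itertools import combinations

def moral_graph(dag: nx.DiGraph) -> nx.Graph:
    """Return the moral graph of a DAG: connect co-parents, drop directions."""
    moral = nx.Graph()
    moral.add_nodes_from(dag.nodes)
    moral.add_edges_from(dag.edges())                  # undirected copy of all edges
    for child in dag.nodes:
        for p1, p2 in combinations(dag.predecessors(child), 2):
            moral.add_edge(p1, p2)                     # "marry" parents of a common child
    return moral

# Example with the Asia network skeleton (node names abbreviated).
asia = nx.DiGraph([("A", "T"), ("S", "L"), ("S", "B"), ("T", "E"),
                   ("L", "E"), ("E", "X"), ("E", "D"), ("B", "D")])
print(sorted(moral_graph(asia).edges()))
```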

Definition 2.1 (see [17]). Suppose the triplet (A, B, C) denotes a partition of V, where A, B, and C are disjoint and A and B are nonempty. If every path in G between A and B contains a node in C and C is complete, then G is decomposable and (A, B, C) is a decomposition of G into the subgraphs G_{A∪C} and G_{B∪C}; otherwise G is prime. Moreover, C is called a complete separator of G with respect to (A, B).

Furthermore, an undirected graph G is said to be decomposed maximally if G and all its subgraphs are decomposed recursively until no subgraph is decomposable. The resulting subgraphs are the maximal prime subgraphs of G. Transforming a BN into its maximal prime subgraphs is equivalent to recursively decomposing the moral graph of its DAG. The most well-known algorithm is the MPD-JT algorithm [17], which first triangulates the moral graph by adding a fill-in edge to every cycle whose length is greater than three, then identifies all the cliques of the triangulated graph and arranges them as a junction tree (JT), and finally recursively aggregates cliques connected by incomplete separators (in the moral graph). Figure 2 schematically illustrates the process of the MPD-JT algorithm.
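As a rough illustration of the aggregation step only, the Python sketch below assumes a junction tree is already available as a networkx graph whose nodes are frozensets of variables (the cliques); it merges any two adjacent cliques whose separator is not complete in the moral graph. The function names are ours, not from [17].

```python
import networkx as nx
from itertools import combinations

def is_complete(graph: nx.Graph, nodes) -> bool:
    """True if every pair of the given nodes is adjacent in the graph."""
    return all(graph.has_edge(u, v) for u, v in combinations(nodes, 2))

def aggregate_incomplete_separators(junction_tree: nx.Graph, moral: nx.Graph):
    """Merge adjacent cliques whose separator is incomplete in the moral graph;
    the resulting node sets are the maximal prime subgraphs."""
    jt = junction_tree.copy()                    # nodes are frozensets of variables
    changed = True
    while changed:
        changed = False
        for c1, c2 in list(jt.edges()):
            sep = c1 & c2
            if not is_complete(moral, sep):
                merged = frozenset(c1 | c2)
                jt = nx.contracted_nodes(jt, c1, c2, self_loops=False)
                jt = nx.relabel_nodes(jt, {c1: merged})
                changed = True
                break
    return [set(c) for c in jt.nodes()]
```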

Example 2.2. Consider the Asia network [18] in Figure 1(a). Its moral graph G^m is shown in Figure 1(b). By our definition, G^m is decomposable. Figure 1(c) provides one of the triangulation graphs of G^m, and Figure 2(a) gives its corresponding junction tree. Since {B, L} is an incomplete separator in the moral graph G^m, we aggregate the cliques BLS and BEL according to the MPD-JT algorithm. The resulting graph is shown in Figure 2(b). From Figure 2(b), we can see that all the remaining separators are complete in G^m, so there are no further cliques that need to be aggregated. We obtain five maximal prime subgraphs (Figure 2(c)).

The K2 algorithm is a greedy search algorithm that learns the network structure from data. It attempts to select the network structure that maximizes the posterior probability of the network given the observed data. Its scoring function can be expressed as g(X_i, pa(X_i)) = ∏_{j=1}^{q_i} [(r_i − 1)! / (N_ij + r_i − 1)!] ∏_{k=1}^{r_i} N_ijk!, where N_ijk is the number of cases in D in which X_i takes its kth state and its parents are in their jth configuration, N_ij = Σ_{k=1}^{r_i} N_ijk denotes the number of cases in D in which the parents of X_i are in their jth configuration, r_i denotes the number of states of variable X_i, and q_i is the number of parent configurations of X_i. It is obvious that the size of the search space is the main factor influencing the performance of the K2 algorithm. In theory, the more constraint conditions there are, the smaller the search space of BNs and the higher the search efficiency. Hence, it is essential to reduce the search space. Considering the BN's own characteristics, we combine the constraint-based approach with maximal prime decomposition to learn a BN structure.
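For illustration, the following Python sketch (function and variable names are ours, not from the paper) computes the logarithm of this K2 (Cooper–Herskovits) score for one variable and a candidate parent set from a discrete data matrix; working in log space avoids factorial overflow.

```python
import numpy as np
from math import lgamma
from itertools import product

def log_k2_score(data: np.ndarray, i: int, parents: list, arities: list) -> float:
    """Log K2 score of variable i with the given parent set.

    data    : (m, n) array of integer states, one row per case
    arities : number of states r of each variable
    """
    r_i = arities[i]
    score = 0.0
    # Enumerate all parent configurations j (a single empty configuration if no parents).
    parent_states = [range(arities[p]) for p in parents]
    for config in product(*parent_states):
        mask = np.ones(len(data), dtype=bool)
        for p, s in zip(parents, config):
            mask &= data[:, p] == s
        n_ij = int(mask.sum())
        # log[(r_i - 1)! / (N_ij + r_i - 1)!] + sum_k log(N_ijk!)
        score += lgamma(r_i) - lgamma(n_ij + r_i)
        for k in range(r_i):
            n_ijk = int((mask & (data[:, i] == k)).sum())
            score += lgamma(n_ijk + 1)
    return score

# Tiny usage example with 3 binary variables and hypothetical data.
rng = np.random.default_rng(0)
D = rng.integers(0, 2, size=(100, 3))
print(log_k2_score(D, i=2, parents=[0, 1], arities=[2, 2, 2]))
```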

3. The Structure Learning Algorithm and Its Correctness

In this section, we present theoretical results for learning the structure of a BN. We show how the problem of learning the structure over the full set of variables can be split into subproblems. Below we first give two theorems, on which the proposed decomposition algorithm for structural learning of BNs is based.

Theorem 3.1. Let G^m be the moral graph of a DAG G. Then a subset S of variables d-separates u from v in G if and only if S separates u from v in G^m.

By Theorem 3.1, the condition of d-separation in a DAG is equivalent to separation in its moral graph.

Example 3.2. Consider the Asia network in Figure 1(a). By the definition of d-separation, we can see that and are d-separated by , because the path is blocked at . Since there is a path and is a head-to-head node, and are not d-separated by . Thus, we can conclude that, in the moral graph of the Asia network in Figure 1(b), and are separated by and does not separate from . In fact, this obviously holds in Figure 1(b).
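As one concrete instance of this correspondence (the choice of nodes here is our own illustration and need not match Figure 1), the sketch below tests d-separation in the Asia DAG via the standard ancestral-moralization criterion, reusing the moral_graph helper defined earlier. It shows that A and S are d-separated by T but not by the collider E, matching plain separation in the relevant moral graph.

```python
import networkx as nx

def d_separated(dag: nx.DiGraph, xs: set, ys: set, zs: set) -> bool:
    """X and Y are d-separated by Z iff Z separates X and Y in the moral graph
    of the subgraph induced by X, Y, Z and their ancestors."""
    relevant = set(xs) | set(ys) | set(zs)
    for node in list(relevant):
        relevant |= nx.ancestors(dag, node)
    moral = moral_graph(dag.subgraph(relevant))   # helper from the earlier sketch
    moral.remove_nodes_from(zs)
    return all(not nx.has_path(moral, x, y) for x in xs for y in ys)

asia = nx.DiGraph([("A", "T"), ("S", "L"), ("S", "B"), ("T", "E"),
                   ("L", "E"), ("E", "X"), ("E", "D"), ("B", "D")])
print(d_separated(asia, {"A"}, {"S"}, {"T"}))   # True: T blocks every path
print(d_separated(asia, {"A"}, {"S"}, {"E"}))   # False: conditioning on the collider E opens A-T-E-L-S
```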

Theorem 3.3. Let G^m be the moral graph of a DAG G, and let G_1, ..., G_N be the maximal prime subgraphs of G^m. For any variable v ∈ V, there exists at least one maximal prime subgraph of G^m that contains both v and pa(v).

A consequence of Theorem 3.3 is that the problem of structural learning from a data set can be decomposed into problems of local structural learning from subdata sets. The guarantee of such an approach is that each subdata set contains sufficient information about a variable and its parent set.

Based on the previous analysis, we propose an improved algorithm that combines the merits of the two basic methods for BN structure learning. First, according to the observed data or domain knowledge, we construct the undirected independence graph (the moral graph) of the target DAG using a constraint-based approach and then decompose the independence graph. Each maximal prime subgraph contains sufficient information about its local variables, so the search space of the scoring function can be effectively reduced. Second, the local structure of each subgraph is learned using the scoring function, and a directed acyclic graph is obtained for each subgraph. Finally, we combine all these directed acyclic subgraphs. Theoretical and experimental results show that the new algorithm is effective and reasonable. We now formalize our algorithm in Algorithm 1; a schematic sketch follows the listing.

(1) Input: Data set D; Variable set V; Node ordering ρ.
(2) Construct the moral graph G^m from the data set D.
(3) Decompose the moral graph G^m into its maximal prime subgraphs G_1, ..., G_N.
(4) For each subgraph G_i of G^m, call the K2 algorithm with the local node
 ordering ρ_i of G_i to construct a directed acyclic graph Ĝ_i over the variable set V_i.
(5) Combine Ĝ_1, ..., Ĝ_N into a directed graph Ĝ = (V, Ê), where V = V_1 ∪ ... ∪ V_N
and Ê = Ê_1 ∪ ... ∪ Ê_N.
(6) Output: A directed acyclic graph Ĝ.
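A minimal end-to-end sketch of this pipeline in Python is given below. It is not a full implementation: the helper functions passed in (construct_moral_graph, maximal_prime_subgraphs, k2_search) are assumptions, standing for the steps sketched elsewhere in this paper.

```python
import networkx as nx

def learn_structure(data, variables, ordering,
                    construct_moral_graph, maximal_prime_subgraphs, k2_search):
    """Algorithm 1, schematically: decompose the moral graph, run K2 locally, combine."""
    # Step 2: undirected independence graph (moral graph) from data or domain knowledge.
    moral = construct_moral_graph(data, variables)

    # Step 3: maximal prime decomposition of the moral graph.
    subgraph_node_sets = maximal_prime_subgraphs(moral)

    # Steps 4-5: learn each local DAG with K2 under the restricted ordering, then take the union.
    combined = nx.DiGraph()
    combined.add_nodes_from(variables)
    for nodes in subgraph_node_sets:
        local_order = [v for v in ordering if v in nodes]   # local node ordering
        local_dag = k2_search(data, local_order)            # assumed to return a DAG over `nodes`
        combined.add_edges_from(local_dag.edges())

    # Step 6: by Theorem 3.4 the union is acyclic when a global ordering is supplied.
    assert nx.is_directed_acyclic_graph(combined)
    return combined
```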

As shown in Algorithm 1, the proposed algorithm first decomposes the entire variable set into subsets. Then the local directed graph of each subset is recovered by the K2 algorithm. Unlike the X-G-Z algorithm [19], the final result of our algorithm is a DAG rather than a partially directed graph, and the procedure of finding minimal d-separators is avoided. Furthermore, the computational complexity of Algorithm 1 is less than that of the K2 algorithm. In fact, the triangulation used to construct a junction tree from an undirected independence graph is the main cost of the MPD-JT algorithm. Although optimally triangulating an undirected graph is NP-hard, suboptimal triangulation methods may be used, provided that the obtained tree does not contain overly large cliques. The two most well-known algorithms are the lexicographic search [20] and the maximum cardinality search [21], whose complexities are O(ne) and O(n + e), respectively, where n is the number of nodes and e is the number of edges in the graph. Thus, decomposition of G^m is a computationally simple task compared to an exhaustive search using the K2 score. Let n denote the number of variables in V and m the number of cases in the data set D. In the worst case, the complexity of the K2 algorithm is O(m n^4 r) [15], where r is the largest number of states of a variable in V. Suppose that G^m is decomposed into N subgraphs G_1, ..., G_N, where N ≤ n. Let n_max denote the number of variables in the largest subgraph, that is, n_max = max_i |V_i|, where |V_i| denotes the number of variables in G_i. The complexity for constructing a directed acyclic subgraph Ĝ_i in Algorithm 1 is then O(m n_max^4 r), and thus that of all directed subgraphs is O(N m n_max^4 r). Since n_max is usually much less than n, our algorithm is less computationally complex than the K2 algorithm. We now establish the correctness of our algorithm by showing that the final result Ĝ is a DAG.

Theorem 3.4. Given a data set D and a variable (node) set V, the graph Ĝ returned by Algorithm 1 is a DAG.

4. Construction of the Moral Graph from Observed Data

In general, the moral graph G^m of an underlying BN can be obtained from observed data by means of conditional independence tests. An edge (u, v) is included in the moral graph if and only if u and v are not conditionally independent given all the other variables. However, this requires sufficient data for estimating the parameters and for ensuring the power of the tests. To avoid testing high-order conditional independence relations, we propose a Markov boundary discovery algorithm that is based on the subroutine MMPC [22]. The Markov boundary of a variable T, denoted MB(T), is a minimal set of variables conditioned on which all other variables are probabilistically independent of the target T. Furthermore, the set of parents, children, and spouses of T is its unique Markov boundary under the faithfulness condition [23]. The MMPC algorithm is sketched in the Appendix.

From Algorithm 2, we can see that our algorithm follows a two-phase approach. A shrinking phase attempts to remove the variables that are most irrelevant to T, followed by a growing phase that attempts to add as many dependent variables as possible. The growing phase is interleaved with the shrinking phase; interleaving the two phases allows the algorithm to eliminate some of the false positives in the current boundary as the growing phase progresses. Theoretically, the more constraint knowledge is obtained by CI tests, the smaller the search space and the higher the search efficiency. However, the results of higher-order CI tests may be unreliable. Steps 3 and 4 of Algorithm 2 use only order-0 and order-1 CI tests to reduce the search space, whose number of CI tests is bounded by . In Steps 6 and 7 of Algorithm 2, we condition only on subsets of size up to one instead of conditioning on all subsets of the . The order of the complexity is , where is the largest set of parents and children over all variables in and is the number of variables in , . Thus, the total complexity of Algorithm 2 is .

(1) Input: Data set ; Variable set ; Target variable .
(2) Initialization: .
(3) Order-0 CI test: for each variable , if is hold, then .
(4) Order-1 CI test: for each variable , if there is a variable such that
, then .
(5) Find superset of spouses: for each variable , if there is a variable
, such that , then .
(6) Find parents and children of : call the MMPC algorithm to get
. For each , if ,
 then .
(7) Find spouses of : for each variable , if there is a variable
and a subset , such that
and , then .
(8) Return .
(9) Output: A Markov boundary of .
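The following Python sketch is a simplified variant of this Markov boundary discovery scheme, not a line-by-line transcription of Algorithm 2: it prunes candidates with order-0 and order-1 independence tests, then uses an MMPC-style routine to obtain parents and children, and finally adds spouses through common children. The helper names (ci_test, mmpc) are placeholders for a conditional independence test and the MMPC subroutine.

```python
def markov_boundary(target, variables, ci_test, mmpc):
    """Sketch of a Markov boundary search for `target`.

    ci_test(x, y, z) -> True if x is independent of y given the set z
    mmpc(t, candidates) -> estimated parents/children of t within `candidates`
    """
    candidates = [v for v in variables if v != target]

    # Shrinking: order-0 and order-1 CI tests remove clearly irrelevant variables.
    candidates = [x for x in candidates if not ci_test(target, x, set())]
    candidates = [x for x in candidates
                  if not any(ci_test(target, x, {y}) for y in candidates if y != x)]

    # Growing: parents and children of the target within the reduced candidate set.
    pc = set(mmpc(target, candidates))

    # Spouses: variables that share a child with the target and become dependent
    # on the target once that child is conditioned on.
    boundary = set(pc)
    for x in candidates:
        if x in pc:
            continue
        for child in pc:
            if child in set(mmpc(x, candidates)) and not ci_test(target, x, {child}):
                boundary.add(x)
                break
    return boundary
```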

Theorem 4.1. Suppose (G, P) satisfies the faithfulness condition, where G is a DAG and P is a joint probability distribution of the variables in V. For each target variable T ∈ V, Algorithm 2 returns the Markov boundary MB(T) of T.

By Theorem 4.1, a moral graph can be constructed from observed data.

On the other hand, G^m can be constructed from prior or domain knowledge rather than conditional independence tests. The domain knowledge may be experts' prior knowledge of dependencies among variables, such as Markov chains, chain graphical models, and dynamic or temporal models. Based on this knowledge of dependencies, the data patterns of a database can be represented as a collection of variable sets, in which variables contained in the same set may be directly associated with each other, while variables contained in different sets are associated only through other variables. This means that two variables not contained in the same set are conditionally independent given all other variables. From the data patterns, we obtain separate undirected subgraphs; combining them together, we obtain the undirected graph G^m.
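For example, if such domain-knowledge variable sets are available, the moral graph can be assembled as the union of complete graphs over each set. The sketch below uses hypothetical variable names.

```python
import networkx as nx
from itertools import combinations

def moral_graph_from_patterns(variable_sets) -> nx.Graph:
    """Union of complete undirected graphs, one per domain-knowledge variable set."""
    g = nx.Graph()
    for var_set in variable_sets:
        g.add_nodes_from(var_set)
        g.add_edges_from(combinations(var_set, 2))   # make each set a clique
    return g

# Hypothetical data patterns: X2 links the two groups, so X1 and X3 are
# conditionally independent given the remaining variables.
patterns = [{"X1", "X2"}, {"X2", "X3", "X4"}]
print(sorted(moral_graph_from_patterns(patterns).edges()))
```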

5. Experimental Results

In this section, we present the experimental results obtained with our algorithm on two standard networks (Alarm [24] and Insurance [25]). The first is a medical diagnostic system for patient monitoring; it consists of 37 nodes and 46 edges, and its random variables are discrete, taking two, three, or four states (Figure 3). The second is a network for estimating the expected claim costs of a car insurance policyholder; it consists of 27 nodes and 52 edges. Our implementation is based on the Bayesian network toolbox written by Murphy [26] and the Causal Explorer System developed by Aliferis et al. [27]. The experimental platform was a personal computer with a Pentium 4 3.06 GHz CPU, 0.99 GB of memory, and Windows XP.

5.1. Learning Networks with Node Ordering

In this subsection, we report simulations on the Alarm network when the node ordering is known. Although our algorithm combines a constraint-based method and a search-and-score method, the network returned by Algorithm 1 is a DAG, which makes it directly comparable with search-and-score methods, so it is natural to include one of those methods in the comparison; we selected the well-known K2 algorithm. We begin with a BN that is completely specified in terms of structure and parameters, and we obtain a data set of a given size by sampling from it. Then, using our algorithm and the K2 algorithm, respectively, we obtain two learned networks, which are compared with the original network. More precisely, the experiments are carried out on different data sets randomly selected from the database (100000 data points). The size of the data set is varied from 500 to 10000, and 10 replicates are done for each combination of network parameters and sample size. The average numbers of missing edges, extra edges, and reversed edges in the learned networks with respect to the original one are computed.
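The sketch below shows one way to compute these three error counts for a learned DAG against the true DAG (our own helper, using networkx; the toy networks are hypothetical).

```python
import networkx as nx

def structural_errors(true_dag: nx.DiGraph, learned_dag: nx.DiGraph):
    """Count missing, extra, and reversed edges of a learned DAG w.r.t. the true DAG."""
    true_edges, learned_edges = set(true_dag.edges()), set(learned_dag.edges())
    reversed_edges = {(u, v) for (u, v) in true_edges if (v, u) in learned_edges}
    missing = {e for e in true_edges
               if e not in learned_edges and (e[1], e[0]) not in learned_edges}
    extra = {e for e in learned_edges
             if e not in true_edges and (e[1], e[0]) not in true_edges}
    return len(missing), len(extra), len(reversed_edges)

# Toy usage with hypothetical three-node networks.
truth   = nx.DiGraph([("A", "B"), ("B", "C")])
learned = nx.DiGraph([("B", "A"), ("A", "C")])
print(structural_errors(truth, learned))   # (1, 1, 1): B->C missing, A->C extra, A->B reversed
```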

We compare our method under two different significance levels on the Alarm network. The results are shown in Table 1 for sample sizes 500, 1000, 2000, 5000, and 10000. In the second row of the table, the four values in each bracket denote the number of missing edges, extra edges, reversed edges, and the computation time, respectively; the other rows give values relative to the second row, obtained by dividing the real values by those in the second row. A relative value larger than 1 denotes that the real value is larger than the corresponding value in the second row. From Table 1, the first observation is that these results confirm our intuition that Algorithm 1 should be used with a smaller significance level than those typically used for independence tests, since Algorithm 1 with the smaller level offers better results than with the larger one. The table also shows that the structure obtained by our algorithm (with the smaller significance level) has the fewest missing edges, extra edges, and reversed edges. Furthermore, our algorithm requires the least computation time.

Table 2 displays the K2 scores obtained for the original Alarm network, the network returned by Algorithm 1, and the network returned by the K2 algorithm. It is easy to see that the larger the data set, the closer the learned score is to that of the original network. However, when the data set size is smaller than 5000, the score returned by our algorithm is closer to the original one. At the same time, our algorithm performs better than the K2 algorithm on all data sets. Moreover, the advantage is very obvious when the data set is small; namely, the smaller the data set, the more obvious the improvement.

5.2. Learning Networks without Node Ordering

From the previous experimental results, we can see that our algorithm is capable of identifying structures that are close to the optimal ones, given a prior ordering of the variables. However, this ordering information may not be available in real-life applications. Thus, in this subsection, Algorithm 1 is extended to handle the case in which the node ordering is not available. In fact, we only need to delete the directed cycles after combining all directed subgraphs, at the expense of a little extra complexity.
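One simple way to realize this extension is sketched below: after taking the union of the locally learned edges, repeatedly break any remaining directed cycle by removing one of its edges. The choice of which edge to drop is arbitrary in this sketch; a score-based choice would be a natural refinement.

```python
import networkx as nx

def remove_cycles(graph: nx.DiGraph) -> nx.DiGraph:
    """Delete edges until the combined directed graph is acyclic."""
    dag = graph.copy()
    while not nx.is_directed_acyclic_graph(dag):
        cycle = nx.find_cycle(dag)        # list of edges forming one directed cycle
        dag.remove_edge(*cycle[-1][:2])   # break the cycle by dropping one of its edges
        # Refinement: drop the edge whose removal loses the least local K2 score.
    return dag

combined = nx.DiGraph([("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")])
print(sorted(remove_cycles(combined).edges()))
```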

We now compare our extended method with the X-G-Z [19], RAI [28], and MMHC [22] algorithms. As in Section 5.1, the size of the data set is varied from 1000 to 10000, and 10 replicates are done for each combination of network parameters and sample size. We use two significance levels in the simulations. Unlike the situation in Section 5.1, algorithms that return a DAG are converted to the corresponding partially directed acyclic graph (PDAG) before comparing the quality of reconstruction; this PDAG represents the equivalence class of the DAG. The average numbers of missing edges, extra edges, and reversed edges in the learned PDAG with respect to the underlying PDAG are computed.

We summarize the simulation results in Tables 3 and 4. In the second row of each table, the five numbers in each bracket denote the number of missing edges, extra edges, reversed edges, the sum of the first three values, and the computation time, respectively. The other rows give values relative to the second row, obtained by dividing the real values by those in the second row; a relative value larger than 1 denotes that the real value is larger than the corresponding value in the second row. In each column, the best of the eight results is displayed in bold. From a general point of view, we can see that the X-G-Z algorithm obtains the fewest missing edges, MMHC obtains the fewest extra edges, and Algorithm 1 has the fewest reversed edges. In terms of the sum of the three kinds of edges, the RAI and MMHC algorithms perform better than the X-G-Z algorithm, and Algorithm 1 performs best. Although Algorithm 1 has a performance similar to RAI in most cases, its running time is the smallest on all data sets. Moreover, the advantage is very obvious when the data set is large; namely, the bigger the sample size, the more obvious the improvement. The main reason is that Algorithm 1 uses lower-order CI tests and employs the maximal prime decomposition technique to effectively decompose the search space, which cuts down many computations of statistical factors, scorings of structures, and comparisons of solutions, and thus greatly enhances the time performance. Contrasting Table 4 with Table 3, since the Insurance network is more complex than the Alarm network, it can be seen that the running time of X-G-Z increases quickly with the size of the problem; however, Algorithm 1 is not sensitive to this increase. In conclusion, our algorithm has a better overall performance compared to the other state-of-the-art algorithms.

6. Conclusions

In this paper, we have given a more accurate characterization of the moral graph and proposed a new algorithm for structural learning that substantially improves on the K2 algorithm. We have also extended our algorithm to handle networks without node ordering. Although the new algorithm depends on the quality of the constructed moral graph, simulation studies illustrate that our method yields good results in a variety of situations, especially when the underlying data set is small.

The results in this paper also raise a number of interesting questions for future research, on which we briefly comment here. First, maximal prime decomposition plays a key role in Algorithm 1; although the decomposition of an undirected graph into its maximal prime subgraphs has been studied extensively, we believe there is room for further improvement. Second, we have applied the K2 algorithm for learning the local structures in Algorithm 1; it would be interesting to see whether an alternative approach could serve the same purpose. Finally, although we assume in this paper that the data are completely observed, missing data or data with latent variables may arise in practice; generalizing the proposed algorithm to such settings is of great research interest.

Appendix

Proofs of Theorems

In this appendix, we give the proofs of all the theorems.

Proof of Theorem 3.1. The proof of Theorem 3.1 can be found in [29].

Proof of Theorem 3.3. If pa(v) is empty, the claim is trivial.
If v has only one parent, then, since no set can separate a node from its parent, there must be a maximal prime subgraph of G^m that contains both v and that parent. Thus we obtain the theorem in this case.
If v has two or more parents, we suppose, by reduction to absurdity, that v has two parents u and w that are not contained in a single clique but are contained in two different cliques, say C_1 and C_2, respectively; by the argument above, both cliques also contain v. On the path from C_1 to C_2 in the junction tree, all separators must contain v; otherwise they could not separate C_1 from C_2. By Theorem 3.1, such a separator d-separates u from w in G. But u and w are parents of the common child v, so no set containing v can d-separate them. Thus we obtain a contradiction.

Proof of Theorem 3.4. It is easy to see that Ĝ is a directed graph. We need only show the absence of cycles in Ĝ.
Without loss of generality, suppose that Ĝ contains a cycle passing through two nodes u and v and that u precedes v in the global node ordering ρ. Because each subgraph Ĝ_i returned by Step 4 is a directed acyclic graph, there exist at least two directed paths, one from u to v and one from v to u, contained in two different subgraphs, say Ĝ_s and Ĝ_t, respectively. By the definition of the global node ordering ρ, u also precedes v in the local node ordering ρ_t of Ĝ_t. Furthermore, according to the way the K2 algorithm constructs Ĝ_t from the local node ordering ρ_t, the only edges pointing toward u are those from variables preceding u in ρ_t. Hence Ĝ_t cannot contain a directed path from v to u, which contradicts the supposition that it does.

Proof of Theorem 4.1. Before proving Theorem 4.1, we give the definition of the embedded faithfulness condition and three lemmas.

Definition A.1 (see [23]). Let P be a joint probability distribution of the variables in V′, where V′ ⊆ V, and let G = (V, E) be a DAG. (G, P) satisfies the embedded faithfulness condition if G entails all and only the conditional independencies in P for subsets containing only elements of V′.

Lemma A.2. Let P be a joint probability distribution of the variables in V, let V′ ⊆ V, and let G = (V, E) be a DAG. If (G, P) satisfies the faithfulness condition and P′ is the marginal distribution of the variables in V′, then (G, P′) satisfies the embedded faithfulness condition.

The proof of Lemma A.2 can be found in [23].

Lemma A.3. Suppose (G, P) satisfies the faithfulness condition, where G is a DAG and P is a joint probability distribution of the variables in V. For each target variable T, the candidate set CanMB(T) computed by Algorithm 2 is a superset of MB(T).

Proof. By the faithfulness condition, MB(T) = PC(T) ∪ SP(T), where PC(T) is the set of parents and children of T and SP(T) is the set of spouses of T. We only need to show that PC(T) ⊆ CanMB(T) and SP(T) ⊆ CanMB(T). If X ∈ PC(T), then, because of the faithfulness condition, T ⊥̸ X | Z for any subset Z ⊆ V \ {T, X}. Thus, X will not be removed by Steps 3 and 4 of Algorithm 2. Similarly, if X ∈ SP(T), then conditioning on the set consisting of the d-separation set between X and T together with their common child renders X and T dependent. By Step 5 of Algorithm 2, X ∈ CanMB(T). Thus, SP(T) ⊆ CanMB(T).

Remark A.4. X ⊥̸ Y | Z denotes that X is not independent of Y conditioned on Z.

Lemma A.5. Suppose (G, P) satisfies the faithfulness condition, where G is a DAG and P is a joint probability distribution of the variables in V, with V′ ⊆ V. For each target variable T ∈ V′, if MB(T) ⊆ V′, then there is a unique Markov boundary MB′(T) of T over V′ and MB′(T) = MB(T).

Proof. By the faithfulness condition and Lemma A.2, it is obvious that the marginal distribution over V′ admits a unique Markov boundary MB′(T) of T over V′. We only need to show that MB′(T) = MB(T) holds under the condition MB(T) ⊆ V′. Clearly, MB′(T) ⊆ MB(T), as MB(T) ⊆ V′ is itself a Markov blanket of T over V′. Next, we show MB(T) ⊆ MB′(T).
Without loss of generality, we suppose that there is a variable X ∈ MB(T) with X ∉ MB′(T). Then, on the one hand, we would have T ⊥ X | MB′(T), because MB′(T) is a Markov boundary of T over V′. On the other hand, since X ∈ MB(T), we consider the problem from two aspects.
If X is a parent or child of T in G, that is, X ∈ PC(T), then we would not have T ⊥ X | MB′(T) in P, since adjacent variables cannot be d-separated. This contradicts the faithfulness condition.
If X is a spouse of T in G, that is, X ∈ SP(T), let C be their common child in G. If C ∉ MB′(T), we again would not have T ⊥ C | MB′(T) in P, since C is adjacent to T; this contradicts the fact that MB′(T) is a Markov boundary of T over V′. If C ∈ MB′(T), we would have T ⊥ X | MB′(T) in P, but we would not have T ⊥_G X | MB′(T) in G, because X is a parent of C and conditioning on C opens the path T → C ← X. So again we would get a contradiction.

Now we are ready to prove Theorem 4.1.

Proof. Suppose Step 6 of Algorithm 2 returns the parents and children of . According to the definition of spouse, it is easy to see that Step 7 of Algorithm 2 identifies the spouse of in . We only need to show that Step 6 of Algorithm 2 returns all and only the parents and children of in .
From Lemmas A.2 and A.3, we have that satisfies the embedded faithfulness condition. We set , . Since MMPC is correct under the faithfulness condition, Step 6 of Algorithm 2 returns the parents and children of in , denoted by . We next show .
By Lemma A.5, we know that , thus, . Similarly to the proof of Lemma A.5, we show below that . Without loss of generality, we suppose that there is a variable , . Because any two nonadjacent nodes can be d-separated in by a subset of a Markov boundary, there exists such that . As owing to Lemma A.5, could be d-separated in . Therefore, cannot be adjacent to in , that is, . We thus obtain a contradiction.

The MMPC algorithm is sketched in Algorithm 3.

(1) Input: Data set D; Variable set V; Target variable T.
(2) CPC = ∅;
(3) repeat
(4)  for each X ∈ V \ (CPC ∪ {T}) do
(5)           minAssoc(X) = min_{S ⊆ CPC} assoc(X; T | S);
(6)  end for
(7)   Y = argmax_{X ∈ V \ (CPC ∪ {T})} minAssoc(X);
(8)  if minAssoc(Y) > 0 then
(9)           CPC = CPC ∪ {Y};
(10)        end if
(11) until  CPC has not changed;
(12) for each X ∈ CPC do
(13)        if T ⊥ X | S for some S ⊆ CPC \ {X} then
(14)          CPC = CPC \ {X};
(15)        end if
(16) end for
(17) return  CPC
(18) Output: The parents and children of T.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (nos. 60974082 and 61075055), the National Funds of China for Young Scientists (no. 11001214), and the Fundamental Research Funds for the Central Universities (no. K5051270013).