Abstract

The Bayesian network is an important theoretical model in the field of artificial intelligence and a powerful tool for processing uncertainty. Considering the slow convergence speed of current Bayesian network structure learning algorithms, a fast hybrid learning method is proposed in this paper. We start with a further analysis of the information provided by low-order conditional independence tests; two methods are then given for constructing graph models of the network, which are theoretically proved to be upper and lower bounds of the structure space of the target network, so that candidate sets are obtained as a result. After that, a search-and-score algorithm is run on the candidate sets to find the final structure of the network. Simulation results show that the algorithm proposed in this paper is more efficient than similar algorithms with the same learning precision.

1. Introduction

The Bayesian network (BN), as a graphical model for handling uncertainty, has been discussed by many researchers over the years. It has been applied successfully in many areas such as fault detection, medical diagnosis, and traffic management [1–3]. For years, research focused on finding a data structure that compresses the storage of the joint probability distribution and on developing inference algorithms based on that data structure, which is how BN came about. Later, once BN had become a successful tool in this area, researchers turned their attention to algorithms for learning BN structure from sample data. Essentially, the problem of BN structure learning is a combinatorial optimization problem, and it has been proved that learning structure from data is NP-hard [4]. Nonetheless, some heuristic methods have been proposed and perform well in several areas [5, 6].

Currently, there are two approaches to BN structure learning. One is the CI-test method [7, 8] and the other is the score-and-search method [9, 10]. The first uses conditional independence tests (CI tests) to determine the conditional independence relationships among all the variables and builds networks based on these relationships. Score-and-search methods attempt to find the network by maximizing a scoring function that indicates how well the network fits the data.

Both methods have their own advantages and disadvantages. CI-test algorithms are simple and easy to operate. Because low-order CI tests are computationally efficient and highly precise, they are very helpful for building a hypergraph of the target network (discussed in the following sections). The main drawback of these methods lies in performing high-order CI tests, which require large sample sizes and lose accuracy as the order gets higher [11, 12]. Score-and-search methods may achieve higher precision in structure learning than CI-test methods, but they are relatively slow, especially when networks become large, as the structure space grows super-exponentially with the number of nodes.

Obviously, if the learning efficiency of CI tests could be combined with the prediction accuracy of score-and-search algorithms, a better algorithm for BN learning would result. In view of this, some hybrid methods have been proposed [13–17]. These methods typically use a CI-test algorithm to learn a network structure pattern first and then use a score-and-search algorithm to find the final BN structure based on that pattern. Such hybrid methods may perform better in some applications, but some problems remain unsolved, as fusion at the algorithm level does not always mean improvement in performance. Take MMHC (max-min hill climbing) as an example. It includes two steps: the first, called MMPC (max-min parents and children), constructs the parent and children set of each node via CI tests to provide a partial skeleton; in the second step, a hill-climbing algorithm refines every edge in the network. To ensure the precision of the partial skeleton given by MMPC, high-order CI tests must be involved, which unfortunately are unstable [11, 12]. So the search phase is not strictly based on the prior structure given by MMPC but operates in a relatively open space, which seems somewhat wasteful of computational resources.

The upper-lower bounds candidate sets searching algorithm (UBCS) proposed in this paper provides a more instructive set of candidate networks by constructing upper and lower bounds of the structure space with low-order CI tests. Within this framework, the final network structure is obtained by greedy search. Simulation shows that UBCS guarantees precision and reduces time complexity at the same time.

2. Background

2.1. Definitions

Definition 1 (Bayesian network). Given a joint distribution $P$ of random variables $X = \{X_1, \ldots, X_n\}$ and a DAG (directed acyclic graph) $G = (V, E)$, the pair $(G, P)$ is a Bayesian network if the vertex set $V$ is in one-to-one correspondence with the random variables in $X$ and $(G, P)$ satisfies the Markov condition [18].

Because nodes in a BN are identified with random variables, they will not be distinguished in this paper, and both will be called nodes. In addition, let $X_i \to X_j$ denote the directed edge from $X_i$ to $X_j$, and let $X_i - X_j$ denote the undirected edge between them.

Definition 2 (V-structure). Three nodes $X_i, X_j, X_k \in V$, where $i \neq j \neq k$, construct a V-structure $X_i \to X_k \leftarrow X_j$ if $X_i \to X_k \in E$, $X_j \to X_k \in E$, and there is no edge between $X_i$ and $X_j$.

Definition 3 (conditional mutual information). The conditional mutual information of random variable sets $X$ and $Y$ given $Z$ is defined as
$$I(X; Y \mid Z) = \sum_{x, y, z} P(x, y, z) \log \frac{P(x, y \mid z)}{P(x \mid z) P(y \mid z)}.$$

$I(X; Y \mid Z) = 0$ means that the random variable sets $X$ and $Y$ are conditionally independent given $Z$, which can also be expressed as $X \perp Y \mid Z$. Therefore $I(X; Y \mid Z)$ is usually used as the CI test among random variables, and the cardinality of $Z$ is called the order of the CI test. In particular, it is a 0-order CI test if $Z = \emptyset$.
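In practice $I(X; Y \mid Z)$ is estimated from finite samples and compared against a small threshold rather than against exact zero. The following Python sketch (our illustration, not code from the paper) estimates (conditional) mutual information from discrete samples; `eps` below is a hypothetical tolerance:

```python
import numpy as np
from collections import Counter

def mutual_information(x, y, z=None):
    """Empirical (conditional) mutual information I(X;Y|Z) in nats,
    estimated from parallel arrays of discrete samples."""
    n = len(x)
    if z is None:
        z = np.zeros(n, dtype=int)  # constant Z reduces I(X;Y|Z) to I(X;Y)
    pxyz = Counter(zip(x, y, z))
    pxz, pyz, pz = Counter(zip(x, z)), Counter(zip(y, z)), Counter(z)
    mi = 0.0
    for (xi, yi, zi), c in pxyz.items():
        # sum over p(x,y,z) * log[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ]
        mi += (c / n) * np.log(pz[zi] * c / (pxz[(xi, zi)] * pyz[(yi, zi)]))
    return mi

# A 0-order CI test then reads: mutual_information(x, y) < eps.
```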

Definition 4 (Markov equivalence). Two DAGs $G_1$ and $G_2$ are graph equivalent if and only if (1) they have the same skeleton and (2) they have the same V-structures.
Two BNs $(G_1, P_1)$ and $(G_2, P_2)$ are Markov equivalent if $G_1$ and $G_2$ are graph equivalent.

The characterization of Markov equivalence was given by Frydenberg [19], and Verma and Pearl extended it to DAGs [20]. Based on Markov equivalence, all the DAGs composed of the same node set can be divided into equivalence classes, called Markov equivalence classes. Each equivalence class indicates a unique statistical model, and it can be represented by a PDAG (partially directed acyclic graph), which is called the complete PDAG.
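As an illustration (ours, using a simple parent-set dictionary representation), the two conditions of Definition 4 can be checked directly:

```python
def same_skeleton(dag1, dag2):
    """dag: dict mapping node -> set of parents."""
    undirected = lambda d: {frozenset((u, v)) for v in d for u in d[v]}
    return undirected(dag1) == undirected(dag2)

def v_structures(dag):
    adj = {frozenset((u, v)) for v in dag for u in dag[v]}
    vs = set()
    for child, parents in dag.items():
        for a in parents:
            for b in parents:
                if a < b and frozenset((a, b)) not in adj:
                    vs.add((a, child, b))  # collider a -> child <- b
    return vs

def markov_equivalent(dag1, dag2):
    return same_skeleton(dag1, dag2) and v_structures(dag1) == v_structures(dag2)

# A -> B -> C and A <- B <- C share a skeleton and have no V-structures,
# hence they are Markov equivalent; the collider A -> B <- C is not.
chain1 = {'A': set(), 'B': {'A'}, 'C': {'B'}}
chain2 = {'A': {'B'}, 'B': {'C'}, 'C': set()}
collider = {'A': set(), 'B': {'A', 'C'}, 'C': set()}
assert markov_equivalent(chain1, chain2)
assert not markov_equivalent(chain1, collider)
```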

3. Method

Given a data set $D$, structure learning methods are devoted to finding the best network structure for $D$. Reference [21] proved that the number of possible structures for a BN containing $n$ nodes is
$$f(n) = \sum_{k=1}^{n} (-1)^{k+1} \binom{n}{k} 2^{k(n-k)} f(n-k), \qquad f(0) = 1.$$

From the formula above it can be seen that the potential network structure space grows super-exponentially as nodes are added, so searching for candidate sets of network structures is an effective way to reduce the dimensionality. Based on this, we provide an algorithm named the upper-lower bounds candidate sets searching algorithm (UBCS for short), which obtains the final network model by constructing upper and lower bounds of the target network pattern to find candidate sets and then applying a score-and-search method. In the following section we give the first part of UBCS, called the upper-bound graph learning algorithm (UGLA), and prove that its output is an upper bound of the moral graph of the target network; we then introduce the principle of nonincreasing 0-order mutual information to reach the second part of UBCS, called the lower-bound graph learning algorithm (LGLA). After that, the search algorithm is discussed.
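A few lines of Python (our illustration) make this growth concrete by evaluating the recursion:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_dags(n):
    """Number of labeled DAGs on n nodes via the recursion above."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

print([num_dags(n) for n in range(1, 6)])  # [1, 3, 25, 543, 29281]
# num_dags(10) already exceeds 4.1e18 structures.
```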

3.1. UGLA and LGLA

We will first give the algorithm description in Algorithm 1.

(1) Input: data set $D$; variable set $V = \{X_1, \ldots, X_n\}$;
(2) Initialization: complete undirected graph $G = (V, E)$, where $E = \{X_i - X_j : i \neq j\}$;
(3) Order-0 CI test: for each pair of variables $(X_i, X_j)$, if $I(X_i; X_j) = 0$ holds, then $E = E \setminus \{X_i - X_j\}$, where
   $X_i - X_j$ is an undirected edge;
(4) For each triplet of variables $(X_i, X_j, X_k)$
(5)     if $X_i - X_k \in E$, $X_j - X_k \in E$, $X_i - X_j \notin E$, and $I(X_i; X_j \mid X_k) \neq 0$ holds,
   then $E = E \cup \{X_i - X_j\}$;
(6) End for;
(7) Triangulate the undirected graph $G$; denote the resulting triangulated graph by $G_u = (V, E_u)$;
(8) Call the junction tree algorithm to obtain the maximal prime subgraphs $\{G_i\}$ from $G_u$;
(9) Output $G_u$ and $\{G_i\}$ as the final result.
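The following Python sketch mirrors this pseudocode (our illustration: it reuses the `mutual_information` estimator from Section 2.1, treats `eps` as a hypothetical CI threshold, triangulates by simple min-degree elimination, and omits the maximal prime subgraph decomposition of step (8)):

```python
import itertools

def ugla(data, variables, eps=0.01):
    """Sketch of UGLA steps (1)-(7). `data` maps each variable name to
    its column of discrete samples."""
    edges = {frozenset(p) for p in itertools.combinations(variables, 2)}
    # Step (3): 0-order CI test removes marginally independent pairs.
    for xi, xj in itertools.combinations(variables, 2):
        if mutual_information(data[xi], data[xj]) < eps:
            edges.discard(frozenset((xi, xj)))
    # Steps (4)-(6): add moral edges between conditionally dependent
    # non-neighbours that share a common neighbour xk.
    base = set(edges)
    for xi, xj, xk in itertools.permutations(variables, 3):
        if (frozenset((xi, xk)) in base and frozenset((xj, xk)) in base
                and frozenset((xi, xj)) not in base
                and mutual_information(data[xi], data[xj], data[xk]) >= eps):
            edges.add(frozenset((xi, xj)))
    # Step (7): triangulate via min-degree elimination (fill-in edges).
    adj = {v: {u for e in edges if v in e for u in e if u != v} for v in variables}
    remaining = set(variables)
    while remaining:
        v = min(remaining, key=lambda u: len(adj[u] & remaining))
        nbrs = adj[v] & remaining
        for a, b in itertools.combinations(nbrs, 2):
            if frozenset((a, b)) not in edges:
                edges.add(frozenset((a, b)))
                adj[a].add(b)
                adj[b].add(a)
        remaining.discard(v)
    return edges  # E_u: edge set of the triangulated upper bound G_u
```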

The processing of UGLA ensures that $G_u$ is a triangulated graph, and for triangulated graphs the following theorem holds.

Theorem 5. An undirected graph $G$ is a complete PDAG if and only if $G$ is a triangulated graph.

This theorem, which has been proved in [22], shows that $G_u$ is a complete PDAG; that is, in the best case, the $G_u$ obtained by UGLA is the PDAG of the target BN. Certainly, this condition is too strong, and we give a more general theorem below.

Theorem 6. Given a sample data set $D$, let the optimal structure to be learned from $D$ be $G^*$ and the moral graph of this BN be $G_m = (V, E_m)$; then the undirected graph $G_u = (V, E_u)$ obtained by UGLA is the upper bound of $G_m$ in the partially ordered set of undirected graphs over $V$ ordered by edge inclusion.

Proof. It only needs to be proved that $e \in E_u$ holds for each $e \in E_m$, where $G_m = (V, E_m)$ and $G_u = (V, E_u)$. Since the triangulation in step (7) only adds edges, it suffices to consider the edge set produced by steps (3)–(6). As all the graphs have the same node set, it only needs to be shown that no undirected edge of $E_m$ is missing. Clearly the undirected edges in $E_m$ can be divided into two classes: one, denoted $E_1$, is composed of the undirected versions of the directed edges in $G^*$; the other, denoted $E_2$, is constructed by the moral edges added between nodes that share a common child. For $E_1$, the 0-order CI test ensures that $E_1 \subseteq E_u$, since directly connected nodes are marginally dependent; for $E_2$, the fifth step of UGLA is the assurance of $E_2 \subseteq E_u$. Proof is completed.

Theorem 7. Any V-structure in $G^*$ exists within one of the subgraphs decomposed from $G_u$ by the method MPD (maximal prime subgraph decomposition) [23].

Theorem 7 was proved in [17]. It guarantees that the subgraph set $\{G_i\}$ obtained by UGLA covers all the V-structures in the target graph.

The section above discussed the upper bound of $G^*$, from which we can get the candidate sets for searching the structure. In the following part, the lower bound of the BN structure space is discussed, which serves to choose a relatively precise initial value.

We start with the lemma below.

Lemma 8. For any two random variables $X, Y$ and any subset $Z \subseteq V \setminus \{X, Y\}$, $I(X; Y \mid Z) \geq 0$ is tenable, and equality holds if and only if $X \perp Y \mid Z$.

The proof is omitted.
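Although the proof is omitted in the source, the standard argument writes the conditional mutual information as an expected Kullback-Leibler divergence, which is nonnegative:
$$I(X; Y \mid Z) = \sum_{z} P(z)\, D_{\mathrm{KL}}\big(P(x, y \mid z) \,\|\, P(x \mid z) P(y \mid z)\big) \geq 0,$$
with equality if and only if $P(x, y \mid z) = P(x \mid z) P(y \mid z)$ for every $z$ with $P(z) > 0$, that is, $X \perp Y \mid Z$.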

Theorem 9. Given nodes $X_i, X_j, X_k \in V$, where $i \neq j \neq k$, $I(X_i; X_j) \leq \min\{I(X_i; X_k), I(X_k; X_j)\}$ holds if the connecting relationship among them is the chain $X_i - X_k - X_j$, that is, $X_i - X_k \in E$, $X_k - X_j \in E$, $X_i - X_j \notin E$, and no V-structure is formed at $X_k$.

Proof. As the chain is symmetric in $X_i$ and $X_j$, without loss of generality it is only necessary to prove that $I(X_i; X_j) \leq I(X_i; X_k)$. According to the chain rule of mutual information, $I(X_i; X_j, X_k) = I(X_i; X_k) + I(X_i; X_j \mid X_k)$ and $I(X_i; X_j, X_k) = I(X_i; X_j) + I(X_i; X_k \mid X_j)$.
Then
$$I(X_i; X_k) + I(X_i; X_j \mid X_k) = I(X_i; X_j) + I(X_i; X_k \mid X_j).$$
It can be seen from the relationship among $X_i, X_k, X_j$, where the chain satisfies $X_i \perp X_j \mid X_k$, that $I(X_i; X_j \mid X_k) = 0$, so the equation above can be expressed as
$$I(X_i; X_k) = I(X_i; X_j) + I(X_i; X_k \mid X_j).$$
Lemma 8 shows $I(X_i; X_k \mid X_j) \geq 0$, with equality holding for $X_i \perp X_k \mid X_j$; hence $I(X_i; X_j) \leq I(X_i; X_k)$.
Proof is completed.

We name Theorem 9 the principle of nonincreasing 0-order mutual information (principle NZMI). The condition of the theorem indicates that it is not applicable when a V-structure is present. For the BN structure shown in Figure 1, Theorem 9 alone cannot tell which of the pairwise mutual information values is the largest. But if all V-structures are eliminated first and we then consider the connecting relationship between a node and its candidate neighbors, the neighbor with the largest 0-order mutual information must be directly connected to it. Thus only a 1-order CI test is needed to rule out spurious connections, after which the remaining pairs can be declared connected. As a matter of fact, principle NZMI provides a new approach for ascertaining whether there is an undirected edge between two nodes without resorting to V-structure methods.
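A quick numerical check of principle NZMI on a synthetic three-node chain (our illustration, reusing the `mutual_information` estimator above; the 0.8 channel fidelity is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# Chain Xi -> Xk -> Xj, hence no V-structure at Xk.
xi = rng.integers(0, 2, n)
xk = np.where(rng.random(n) < 0.8, xi, 1 - xi)  # Xk copies Xi w.p. 0.8
xj = np.where(rng.random(n) < 0.8, xk, 1 - xk)  # Xj copies Xk w.p. 0.8

# Endpoint MI never exceeds MI across a single link of the chain.
assert mutual_information(xi, xj) <= mutual_information(xi, xk)
assert mutual_information(xi, xj) <= mutual_information(xk, xj)
# The 1-order test certifying the chain: I(Xi; Xj | Xk) is near zero.
print(mutual_information(xi, xj, xk))
```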

Algorithm 2 gives the learning algorithm (LGLA) based on the discussion above.

(1) Input: data set $D$; variable set $V = \{X_1, \ldots, X_n\}$; $G_u = (V, E_u)$;
(2) Initialization: $G_l = (V, E_l)$, where $E_l = \emptyset$;
(3) For each triplet $(X_i, X_j, X_k)$ in $G_u$: if the triplet forms a cycle, then call VSTA;
(4) For each $X_i \in V$: find the $X_j$ that makes $I(X_i; X_j)$ the highest, $E_l = E_l \cup \{X_i - X_j\}$;
(5) If there is another $X_k$ that makes $I(X_i; X_k)$ the highest among the remaining variables
(6)    If $I(X_i; X_k \mid X_j) \neq 0$ holds, then $E_l = E_l \cup \{X_i - X_k\}$;
(7) End if;
(8) Output $G_l$ as the final result.

The VSTA (V-structure test algorithm) invoked by LGLA is listed as Algorithm 3.

(1) Input: data set $D$; variable set $V$; $G_l = (V, E_l)$; triplet $(X_i, X_j, X_k)$, where $X_i - X_k, X_j - X_k, X_i - X_j \in E_u$;
(2) If $I(X_i; X_j) = 0$ and $I(X_i; X_j \mid X_k) \neq 0$
(3)  Orient $X_i \to X_k \leftarrow X_j$ and delete $X_i - X_j$ in $G_l$;
(4) Else if $I(X_i; X_k) = 0$ and $I(X_i; X_k \mid X_j) \neq 0$, then
(5)  Orient $X_i \to X_j \leftarrow X_k$ and delete $X_i - X_k$ in $G_l$;
(6) End if;
(7) Output $G_l$.


VSTA is a testing method that only provides “best effort” service. It involves only 0-order and 1-order CI tests, whose high accuracy guarantees that any V-structure it detects really exists. For situations where there is more than one connection between the two parent nodes of a V-structure, the detection is not performed. This approach avoids high computational cost and the introduction of spurious interference edges.
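A minimal Python sketch of this test (our reading of Algorithm 3, again reusing `mutual_information` and a hypothetical threshold `eps`):

```python
def vsta(data, xi, xj, xk, eps=0.01):
    """Best-effort collider detection on a triplet using only 0- and
    1-order CI tests. Returns (parent, child, parent) or None."""
    for a, b, c in ((xi, xj, xk), (xi, xk, xj), (xj, xk, xi)):
        # a, b marginally independent but dependent given c
        # indicates the V-structure a -> c <- b.
        if (mutual_information(data[a], data[b]) < eps and
                mutual_information(data[a], data[b], data[c]) >= eps):
            return (a, c, b)
    return None  # no reliable V-structure found
```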

There is a theorem that holds for the output of LGLA.

Theorem 10. Let all the complete PDAGs over $V$ together with the inclusion relationship compose a partially ordered set; then the $G_l$ generated by LGLA is a PDAG.

Proof. For the undirected edges included in $E_l$, principle NZMI guarantees that each of them corresponds to an edge of the target structure. VSTA gives the directed edges of $G_l$, and according to Theorem 7, any V-structure in $G^*$ is surely contained in some subgraph $G_i$. The characteristic of VSTA makes sure that a V-structure must exist in $G^*$ if it is contained in $G_l$, although the converse certainly does not hold. It is obvious that an acyclic graph remains acyclic no matter which directed edges are deleted. So $G_l$ is a PDAG.
Proof is completed.

Theorem 11. $G_l$ is the lower bound of the complete PDAG of $G^*$ if the output of VSTA is entirely accurate.

Proof is omitted.

The condition of Theorem 11 is relatively strong. As a matter of fact, $G_l$ can be considered the lower bound of the complete PDAG of $G^*$ in many cases. Take the network Asia as an example; Figure 2 shows that all the edges in $G_l$ exist in the PDAG of the original network.

3.2. Searching Method

The hill-climbing algorithm, based on search and scoring, is one of the greedy search algorithms for BN structure learning. It contains three search operators: edge addition, edge deletion, and edge reversal. The hill-climbing method is also involved in the UBCS algorithm, but the search process is restricted by the upper and lower bounds given by UGLA and LGLA: any new structure produced by the search operators is abandoned if it goes beyond the bounds given by UGLA or LGLA. To avoid getting trapped in a local optimum too fast, a suboptimal competitive mechanism is brought in, retaining the $k$ top-scoring structures of each round for the next iteration. In principle $k$ is decided by the scale of the network: the greater the scale of the network, the bigger $k$ should be. But it should be noticed that oversized candidate sets lead to an increase in the time complexity of the algorithm; for a BN as large as Alarm, the value of $k$ is set empirically. To allow a comparison with the BNEA algorithm of [17], the BDeu score function is used as the objective function of the search.
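A compact Python sketch of this bounded search (our illustration: a DAG is represented as a frozenset of directed edge tuples, the bounds as sets of frozenset pairs, `score` stands in for the BDeu function, and `k = 5` is an assumed default rather than a value from the paper):

```python
import itertools

def is_acyclic(dag, nodes):
    """Kahn-style cycle check on a set of directed edges."""
    indeg = {v: 0 for v in nodes}
    for _, v in dag:
        indeg[v] += 1
    queue = [v for v in nodes if indeg[v] == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for a, b in dag:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return seen == len(nodes)

def neighbours(dag, nodes):
    """All structures one operator away: add, delete, or reverse an edge."""
    out = []
    for u, v in itertools.permutations(nodes, 2):
        if (u, v) in dag:
            out.append(dag - {(u, v)})             # edge deletion
            out.append(dag - {(u, v)} | {(v, u)})  # edge reversal
        elif (v, u) not in dag:
            out.append(dag | {(u, v)})             # edge addition
    return out

def bounded_hill_climbing(score, nodes, start, upper, lower, k=5, rounds=100):
    """Hill climbing restricted to the band between G_l (lower) and
    G_u (upper), keeping the k best structures in each round."""
    def within_bounds(dag):
        skeleton = {frozenset(e) for e in dag}
        return lower <= skeleton <= upper

    frontier = [frozenset(start)]
    for _ in range(rounds):
        candidates = {d for dag in frontier for d in neighbours(dag, nodes)
                      if within_bounds(d) and is_acyclic(d, nodes)}
        if not candidates:
            break
        best = sorted(candidates, key=score, reverse=True)[:k]
        # Stop when even the best candidate no longer improves the score.
        if score(best[0]) <= max(score(d) for d in frontier):
            break
        frontier = best
    return max(frontier, key=score)
```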

4. Experiment

We test the performance of UBCS against BNEA and MMHC on the Alarm network. The comparison of scores is shown in Table 1. For ease of observation, the results are normalized.

Table 1 shows the results averaged over 10 runs, where SS represents sample size. As can be seen from Table 1, the performance of UBCS is the best among the three methods when SS is small, and the scores of all three methods become very close as SS increases. Although the VSTA involved in LGLA cannot fully guarantee the correctness of its detections, this has little impact on the learning performance according to the simulation result. There are two reasons: on one side, the upper bound given by UGLA is very stable, and on the other side, the effect is diluted by the search-and-score process. It should be noticed that BNEA shows “over learning” (overfitting) when SS becomes larger. Overfitting is typically considered a phenomenon that occurs only with small sample sizes; however, for combinatorial optimization in high dimensions (such as BN learning), it is hard to obtain sufficient samples, and the time cost is unacceptable when current algorithms operate on extremely large datasets. So it is reasonable to look for algorithms that strike a balance between precision and generalization on datasets of moderate size, which is the intention of UBCS with its upper-lower bound restriction.

Figure 3 shows that UBCS has an obvious advantage over the other two algorithms in time complexity. The experiment was run on a typical desktop computer. Compared with MMHC, both BNEA and UBCS perform better in time complexity because they use MPD to reduce the dimensionality of the search space. On the other hand, because BNEA uses the MMPC procedure that MMHC relies on, BNEA should have the same worst-case time complexity as MMHC, whereas UBCS involves only 0-order and 1-order CI tests and therefore performs better in time complexity.

5. Conclusion

We propose a hybrid method for Bayesian network structure learning (UBCS). In this method, two constructive algorithms are given to build the upper and lower bounds of the BN structure, and theoretical proofs are provided as well. UGLA, the first part of UBCS, outputs an upper bound of the moral graph of the target structure, while the second part, LGLA, offers a lower bound of the target structure's PDAG. Principle NZMI is also proved in this paper; it reveals hidden information in 0-order CI tests that can be used to reduce the search space. As it involves only low-order CI tests, UBCS has an advantage in time complexity compared with other hybrid learning methods, which is also supported by the simulation results.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.