Abstract

Constructing a probabilistic network from existing data has long been a primary research focus for mobile Bayesian networks, particularly when the dataset is large. In this study, we improve the K2 algorithm. First, we relax the K2 algorithm's requirement for a given node order: node orders are generated randomly, and the best result over multiple random node orders is kept. Second, a genetic incremental K2 learning method is used to learn the Bayesian network structure. The training dataset is divided into two groups. The standard K2 algorithm is used to find the optimal value for the first group of training data; simultaneously, three similar suboptimal values are recorded. To avoid falling into a local optimum, these four values are mutated into a new genetic value. When the second group of training data is used, only the best Bayesian network structure among these five values is identified. The experimental results indicate that the genetic incremental K2 algorithm based on random attribute order achieves higher computational efficiency and accuracy than the standard algorithm. The new algorithm is especially suitable for building Bayesian network structures when the dataset and the number of nodes are large.

1. Introduction

The effective representation of uncertain knowledge is an important topic in intelligent knowledge learning, and the Bayesian network has long been a focus of attention in this field. A Bayesian network is a probabilistic graphical model. It represents the dependency relationships among a group of random variables through a directed acyclic graph, and the conditional probability table (CPT) attached to each variable represents the probabilistic relationships between variables [1]. It has a strong capability for reasoning under uncertainty and supports both top-down predictive analysis and bottom-up diagnostic inference [2]. Machine learning technologies have become increasingly important in many applications, such as medicine [3], e-commerce [4, 5], transportation [6], and image denoising [7]. As one of these technologies, the Bayesian network is widely used in many fields, such as machine vision, biomedicine, classification, fault diagnosis, prediction, natural language processing, and data mining [8].

Building a Bayesian network relies on Bayesian network learning, which is divided into two steps: structure learning and parameter learning. Structure learning obtains a directed acyclic graph that represents the attribute dependencies, based on training data and a priori knowledge. Parameter learning obtains the conditional probability of each node given the directed acyclic graph; the result is usually called the conditional probability table. Of the two, structure learning is more difficult and is also a research hotspot. Work in this area mainly focuses on how to avoid falling into local optima and find the best structure when there are many attributes and few sample data [9, 10].

Bayesian network structure learning algorithms can be divided into methods based on scoring and searching, methods based on conditional independence tests, and hybrids of the two.

(1) Method based on scoring and searching: a scoring function measures how well a Bayesian network structure matches the training sample set. Once the scoring function is defined, a search strategy is applied to find the network structure with the highest score. The K2 algorithm is a common example [11]. Owing to the constraint on node order, the K2 algorithm effectively avoids the problem of likelihood equivalence and is better than most classical algorithms in running speed and accuracy. However, in most cases the node order is unknown and usually has to be determined from expert knowledge. Expert knowledge varies widely, so objectivity and accuracy cannot be ensured, and the order is difficult to obtain when there are a large number of nodes. Researchers have therefore put forward many solutions. One proposal uses conditional frequency to determine the node order for the K2 algorithm; it needs no complex search strategy and effectively reduces the time complexity, but it places high demands on the quality of the dataset and does not readily yield an accurate model [10]. The MUST-ACO-K2 (MAK) algorithm combines the maximum spanning tree and the ant colony algorithm to search for the node order, but every candidate node order must be passed to the K2 algorithm to obtain a network structure before it can be scored, so its running time is too long [10].

(2) Method based on conditional independence test: this method treats the learning of the Bayesian network structure as the process of discovering the sets of variables hidden in the network structure that satisfy conditional independence tests. Spirtes proposed the SGS algorithm in 1989 [12], which uses conditional independence tests to determine the existence and direction of edges and eliminates the K2 algorithm's prior constraint of a given node order, but the cost of the tests is exponential. The following year, he proposed the PC algorithm [13], which improved upon the search strategy of the SGS algorithm. It requires less computation when learning a sparse network structure, and it was used by Tsagris [14]. Cheng combined ideas from information theory with independence testing [15], and the proposed learning method exhibited good performance in structure learning.

(3) Hybrid algorithm: because score-and-search methods and constraint-based methods each have their own advantages and disadvantages, hybrid optimization algorithms that combine the two have gradually become the mainstream of research. An improved whale optimization algorithm has been used to optimize the structure of the Bayesian network; its optimization efficiency and accuracy are good, but its complexity is very high [16]. An improved particle swarm optimization (PSO) algorithm has been proposed to learn the Bayesian network structure: after the initial network is constrained by mutual information, the improved PSO algorithm searches for the optimal Bayesian network, which improves the optimization efficiency. However, because the algorithm is unstable, the accuracy of the structure cannot be guaranteed [17]. The bird swarm algorithm has also been used as the search strategy, which strengthens the search ability and further improves convergence [18].

As stated by Cooper, the K2 algorithm can reconstruct a moderately complex belief network rapidly, but it is sensitive to the ordering of the nodes [11]. Further information can be found in [19, 20].

This paper presents a new Bayesian network structure learning method based on random node order and a genetic incremental search for an optimal path, and compares this method with the K2 algorithm. Experiments demonstrate that the random node order method can yield a better Bayesian network structure without expert knowledge, and that the genetic incremental structure learning method can greatly improve the computational efficiency on big datasets, especially when the numbers of samples and nodes are large. In all experiments reported here, its runtime was shorter than that of the K2 algorithm.

2. K2 Algorithm

The K2 algorithm [21] effectively integrates prior information in the search process and exhibits good time performance. It is a classic structure learning algorithm based on scoring search.

2.1. Scoring Function

The node order is given in advance. Each node greedily searches for its parent set among its predecessor nodes according to the Bayesian scoring function, and the search finally yields the network structure with the best score.

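Structure learning can be stated as follows: given a dataset $D$, find the network structure $B_S$ with the largest a posteriori probability, that is, choose $B_S$ to maximise $P(B_S \mid D)$. Since $P(B_S \mid D) = P(B_S, D)/P(D)$ and the denominator $P(D)$ is unrelated to $B_S$, the ultimate objective is to find the maximum of $P(B_S, D)$. Through a series of derivations (for the specific derivation process, see [6]), we obtain

$$P(B_S, D) = P(B_S) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}! \tag{1}$$

where $P(B_S)$ is the prior probability of $B_S$, that is, the probability assigned to each structure before any data are provided. Let $\pi_i$ be the set of parents of variable $x_i$, and let $w_{ij}$ denote the $j$th unique instantiation of $\pi_i$ relative to $D$. Suppose there are $q_i$ such unique instantiations of $\pi_i$. Define $N_{ijk}$ to be the number of cases in $D$ in which variable $x_i$ has the value $v_{ik}$ and $\pi_i$ is instantiated as $w_{ij}$; the value of $N_{ij}$ can be obtained using

$$N_{ij} = \sum_{k=1}^{r_i} N_{ijk}.$$

In equation (1), the first product traverses each random variable $x_i$ by $i$, and $n$ denotes the number of random variables.

The second product traverses all instantiations of the parent variables of $x_i$ by $j$, and $q_i$ indicates the number of distinct parent instantiations.

The last product traverses all possible values of the current variable $x_i$ by $k$, where $r_i$ denotes the number of possible values.

It can be assumed that the probability of each structure obeys a uniform distribution, that is, $P(B_S)$ is the same constant $c$ for every structure. Using the constant $c$ to replace $P(B_S)$, equation (1) changes to

$$P(B_S, D) = c \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

The objective is to obtain the $B_S$ that maximises the posterior probability, as follows:

$$\max_{B_S} P(B_S, D) = c \prod_{i=1}^{n} \max_{\pi_i} \left[ \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}! \right]$$

As can be seen from the above equation, as long as the local maximum for each variable is found, the overall maximum is obtained. The component for each variable is taken as a new scoring function, as follows:

$$g(i, \pi_i) = \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!$$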
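
The per-variable score is a product of factorials that overflows quickly, so implementations usually work with its logarithm, using $\ln m! = \ln\Gamma(m+1)$. The following is a minimal sketch of that idea rather than the implementation used in the experiments; the function name `log_k2_score`, the row-based data layout, and the `arity` list are assumptions made for illustration.

```python
from collections import Counter
from math import lgamma


def log_k2_score(data, child, parents, arity):
    """Natural log of the per-variable K2 score g(i, pi_i).

    data    : list of rows, each a sequence of integer-coded values (assumed layout)
    child   : column index of the variable x_i
    parents : column indices forming the parent set pi_i
    arity   : arity[v] is r_v, the number of possible values of column v
    """
    r = arity[child]
    n_ijk = Counter()  # (parent instantiation, child value) -> N_ijk
    n_ij = Counter()   # parent instantiation -> N_ij
    for row in data:
        w = tuple(row[p] for p in parents)
        n_ijk[(w, row[child])] += 1
        n_ij[w] += 1

    # Parent instantiations that never occur have N_ij = 0 and contribute
    # a factor of 1, so only the observed ones need to be visited.
    score = 0.0
    for total in n_ij.values():
        score += lgamma(r) - lgamma(total + r)  # ln[(r_i - 1)! / (N_ij + r_i - 1)!]
    for count in n_ijk.values():
        score += lgamma(count + 1)              # ln[N_ijk!]
    return score


# Toy data in which column 1 always copies column 0.
data = [(0, 0), (0, 0), (1, 1), (1, 1)]
print(log_k2_score(data, child=1, parents=[], arity=[2, 2]))
print(log_k2_score(data, child=1, parents=[0], arity=[2, 2]))
```

On this toy data the two scores are $\ln(1/30)$ without a parent and $\ln(1/9)$ with column 0 as the parent, so adding the parent raises the score.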

2.2. Search Strategy

Once the scoring function is determined, the core of Bayesian network structure optimization is to narrow the search scope through a search strategy. Greedy search is the most commonly used method, but it easily falls into local optima. In 2017, the authors of [1] proposed adding a disturbance factor to local greedy search and, borrowing the idea of the genetic algorithm, used a metaheuristic method to improve the performance of local greedy search. In 2020, the authors of [22] introduced the microbial genetic algorithm into Bayesian network structure learning. The undirected graph with the most correct edges is calculated using the maximum information coefficient and is used as the initial population; the excellent individuals in the initial population are then retained by the operators of the microbial genetic algorithm. Combining the two makes it possible to learn a structure close to the real network from a small amount of data.

The K2 search strategy assumes that the nodes are ordered: if $x_i$ precedes $x_j$, there can be no edge from $x_j$ to $x_i$. It also assumes that each variable has at most $u$ parent variables. At each step, the predecessor whose addition gives the largest value of the scoring function is inserted into the parent set $\pi_i$; the loop terminates when the scoring function cannot be increased further or the parent limit is reached.
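
The greedy search can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the toolbox code used in the experiments: the names `log_score` and `k2_parents`, the row-based data layout, and the log-space scoring via `lgamma` are choices made here for the example.

```python
from collections import Counter
from math import lgamma


def log_score(data, child, parents, arity):
    """Log of the per-variable K2 score for `child` given `parents`."""
    r = arity[child]
    n_ijk, n_ij = Counter(), Counter()
    for row in data:
        w = tuple(row[p] for p in parents)
        n_ijk[(w, row[child])] += 1
        n_ij[w] += 1
    score = sum(lgamma(r) - lgamma(total + r) for total in n_ij.values())
    return score + sum(lgamma(count + 1) for count in n_ijk.values())


def k2_parents(data, order, arity, max_parents):
    """Greedy K2 search: returns {node: list of chosen parents}."""
    parents = {}
    for pos, child in enumerate(order):
        chosen = []
        best = log_score(data, child, chosen, arity)
        candidates = list(order[:pos])  # only predecessors may become parents
        while candidates and len(chosen) < max_parents:
            # Score every remaining predecessor as an additional parent.
            new_best, z = max(
                (log_score(data, child, chosen + [c], arity), c)
                for c in candidates
            )
            if new_best <= best:        # no improvement: stop growing the set
                break
            best = new_best
            chosen.append(z)
            candidates.remove(z)
        parents[child] = chosen
    return parents


# Toy data: column 1 copies column 0, column 2 is independent of both.
rows = [(a, a, c) for a in (0, 1) for c in (0, 1)] * 5
print(k2_parents(rows, order=[0, 1, 2], arity=[2, 2, 2], max_parents=2))
```

With the node order `[0, 1, 2]`, the sketch selects column 0 as the only parent of column 1 and leaves column 2 without parents, which matches how the toy data were generated.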

3. Incremental K2 Algorithm Based on Random Attribute Order

3.1. Random Generation of the Node Order

The K2 algorithm must initialise the order of nodes such that only nodes in front of node $x_i$ can be parents of node $x_i$. This is defined as

$$\pi_i \subseteq \{x_1, x_2, \ldots, x_{i-1}\}.$$

The disadvantages are as follows. First, the order of nodes is not easy to obtain for most real network structures, and prior knowledge in this form is not easy for experts to supply. Second, the fault tolerance for the order of nodes is poor. If the node order that is input into the K2 algorithm differs from that of the real structure, the accuracy of the K2 algorithm is greatly reduced; this follows from the way the algorithm restricts parents to predecessors. In this study, our first objective is to reduce the dependence on the order of nodes. For $n$ nodes, each iteration randomly generates an array of $n$ nodes, and the generation procedure (see Algorithm 1) is as follows.

Procedure randomarray(n) {
  {Input: n, the number of nodes.}
  {Output: an array that contains each value between 1 and n exactly once.}
  Define the array order[1..n], initially empty
  For i = 1 to n {
    Randomly generate a number between 1 and n
    If (the array is empty) { insert the number into the array }
    Else {
      While (the number exists in the array) { regenerate a random number }
      Insert the number into the array
    End else }
  End for }
End randomarray }

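
The procedure above can be sketched in Python as follows. The first function mirrors the regenerate-on-duplicate logic of Algorithm 1; the second is a swapped-in alternative based on the standard library's `random.shuffle`, which produces a uniformly random permutation without retries. Both function names are assumptions made for this example.

```python
import random


def random_node_order(n, rng=random):
    """Rejection sampling as in Algorithm 1: redraw until an unused number appears."""
    order = []
    while len(order) < n:
        candidate = rng.randint(1, n)   # inclusive on both ends
        if candidate not in order:      # duplicate: draw again
            order.append(candidate)
    return order


def random_node_order_shuffle(n, rng=random):
    """Alternative: shuffle the list 1..n in place."""
    order = list(range(1, n + 1))
    rng.shuffle(order)
    return order


print(random_node_order(8))
print(random_node_order_shuffle(8))
```

Both return a permutation of 1 to n. The rejection version wastes more and more draws as the array fills up, so the shuffle version is the cheaper choice when the number of nodes is large.
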
3.2. Genetic Incremental K2 Algorithm

The basic idea is to divide the training data into two groups and to learn a basic Bayesian network structure from the first group using the K2 algorithm. During learning, the algorithm saves not only the current optimal value, i.e., the decision the algorithm makes at each step, but also several suboptimal values [23]. A genetic algorithm (GA) is then applied, and the current four optimal values are mutated into a new value. The number of suboptimal values can be adjusted, for example, to three or four; this study uses three. When the second group of incremental data is used, the algorithm does not search again; instead, it takes the four optimal values and the new genetic value as the next search space. In this way, the algorithm eliminates low-scoring models, reduces the search space, and improves efficiency.

In addition, the optimal score function value is preserved in each iteration. After the iteration, the node order of the Bayesian network with the optimal score function value is considered to be the best node order.
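
Keeping the best score over several random node orders can be sketched as a simple outer loop. This is a hedged illustration only: `best_random_order` and the `learn_structure` callback (which stands in for one run of the structure learner and is assumed to return a score and a structure) are hypothetical names, and the toy scoring function exists only to make the example runnable.

```python
import random


def best_random_order(nodes, learn_structure, n_orders, seed=None):
    """Try `n_orders` random node orders and keep the best-scoring result.

    learn_structure(order) is assumed to return (score, structure),
    with larger scores being better.
    """
    rng = random.Random(seed)
    best = None
    for _ in range(n_orders):
        order = list(nodes)
        rng.shuffle(order)
        score, structure = learn_structure(order)
        if best is None or score > best[0]:
            best = (score, order, structure)  # preserve the optimal score so far
    return best


# Toy stand-in for a structure learner: orders closer to 0, 1, 2 score higher.
def toy_learner(order):
    return -sum(abs(v - i) for i, v in enumerate(order)), None


score, order, _ = best_random_order([0, 1, 2], toy_learner, n_orders=20, seed=0)
print(score, order)
```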

The algorithm is divided into two parts. The first part (see Algorithm 2) uses the first set of data to generate one optimal value and three suboptimal values, mutates them into a new value, and stores all of these values in the candidate matrix. The following pseudocode expresses the first part.

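Procedure GCDK2 {
  {Input: the parameters required by the K2 algorithm; mutation rate pm = 0.5}
  {Output: a Bayesian network and a candidate matrix containing the optimal values and their locations}
  For i = 1 to n {
    π_i = ∅; P_old = g(i, π_i); OKToProceed = true
    While (OKToProceed and |π_i| < u) {
      pp = Pred(x_i) − π_i   {potential parents}
      For j = 1 to length(pp) {
        score(j) = g(i, π_i ∪ {pp(j)})
      End for }
      Get P_1 and loc_1   {loc_1 is a location, P_1 is the maximum (first) score}
      Get P_2 and loc_2
      Get P_3 and loc_3
      Get P_4 and loc_4
      Mutate the optimal values to get a new genetic value P_g with location loc_g
      If (P_1 > P_old) {
        P_old = P_1; π_i = π_i ∪ {pp(loc_1)}
        Input (P_1, loc_1), (P_2, loc_2), (P_3, loc_3),
          (P_4, loc_4), and (P_g, loc_g) into the candidate matrix
      End if }
      Else { OKToProceed = false }
    End while }
  End for }
End GCDK2 }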
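
One step of the candidate bookkeeping can be sketched as follows. The text does not specify the mutation operator, so the one used here (with probability pm, jump to a randomly chosen location outside the four best) is purely a placeholder assumption, as are the function name `top_candidates_with_mutation` and the dictionary layout of the scores.

```python
import heapq
import random


def top_candidates_with_mutation(scores, pm=0.5, rng=random):
    """Keep the four best (score, location) pairs and add one mutated candidate.

    scores : dict mapping a candidate parent location to the score obtained
             by adding that parent (assumed layout)
    pm     : mutation rate
    """
    # Best score first, then the three suboptimal ones.
    top = heapq.nlargest(4, ((s, loc) for loc, s in scores.items()))
    kept = {loc for _, loc in top}
    others = [loc for loc in scores if loc not in kept]

    # Placeholder mutation (an assumption, not the paper's operator):
    # with probability pm, add a location from outside the top four.
    if others and rng.random() < pm:
        loc = rng.choice(others)
        top.append((scores[loc], loc))
    return top


scores = {1: -5.0, 2: -3.0, 3: -9.0, 4: -4.0, 5: -7.0, 6: -8.0}
print(top_candidates_with_mutation(scores, pm=0.5))
```

The returned list plays the role of one row block of the candidate matrix: at most five (score, location) pairs per step, which is what bounds the search space of the second phase.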

The second part (see Algorithm 3) of the algorithm uses the second set of data to optimize the suboptimal value. The core content is as follows.

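Procedure GIMK2(part) {
  For i = 1 to n {
    For j = k to k+4 {   {from k to k+4 because the candidate matrix stores 5 values}
      Rescore the jth stored candidate for node i with g on the second part of the data
      Get the best score and its location from the rescored candidates
    End for }
    Take the candidate with the best score as the choice for node i
  End for }
End GIMK2 }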
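
The efficiency argument of the second part is that each node is rescored against a handful of stored candidates instead of being searched again. A minimal sketch of that selection step follows; `select_on_second_batch`, the `candidates` dictionary, and the `score_on_batch` callback are hypothetical names, and the toy scores are invented only to make the example runnable.

```python
def select_on_second_batch(candidates, score_on_batch):
    """Pick, for every node, the stored candidate that scores best on new data.

    candidates     : dict node -> list of candidate parent sets
                     (five per node in the scheme described above)
    score_on_batch : callable (node, parent_set) -> score computed on the
                     second group of training data
    """
    structure = {}
    for node, parent_sets in candidates.items():
        # Only the stored candidates are evaluated; no new search is run.
        structure[node] = max(
            parent_sets, key=lambda ps: score_on_batch(node, ps)
        )
    return structure


# Toy example with invented scores for three stored candidates of node "B".
stored = {"B": [("A",), (), ("C",)]}
toy_scores = {("A",): -1.0, (): -3.0, ("C",): -2.0}
print(select_on_second_batch(stored, lambda node, ps: toy_scores[ps]))
```

The number of score evaluations per node is fixed by the size of the candidate list, independent of how many predecessors the node has.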

4. Experimental Results

To test the algorithm, the standard ALARM, Asia, and CANCER networks were selected for the experiment. For different sample sizes, the running time and the structural Hamming distance (SHD) [24] were used to evaluate the different algorithms.
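
For readers unfamiliar with the metric, the following sketch computes a simple variant of the structural Hamming distance between two directed graphs: the number of edges that must be added, deleted, or reversed to turn one into the other. It is an assumption that this DAG-level variant matches the definition in [24]; the function name and the edge-set representation are also choices made for this example.

```python
def structural_hamming_distance(edges_a, edges_b):
    """Count edge additions, deletions, and reversals between two DAGs.

    Each graph is given as an iterable of (parent, child) pairs.
    A reversed edge counts as a single difference.
    """
    a, b = set(edges_a), set(edges_b)
    only_a, only_b = a - b, b - a
    # A reversed edge shows up once in each difference set; count it once.
    reversed_edges = {(u, v) for (u, v) in only_a if (v, u) in only_b}
    return len(only_a) + len(only_b) - len(reversed_edges)


true_edges = {("A", "B"), ("B", "C")}
learned_edges = {("B", "A"), ("B", "C"), ("A", "C")}
print(structural_hamming_distance(true_edges, learned_edges))
```

In the toy example the learned graph has one reversed edge and one extra edge, so the distance is 2; identical graphs give 0.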

The experiment adopted the Bayesian network toolbox in the MATLAB platform. The operating environment was Windows 7, Intel (R) Core(TM) i3-4170, 3.70 GHz CPU, 8.00 GB RAM. The results of the experiment are listed in Tables 1 and 2.

This algorithm relaxes the strict requirement of the K2 algorithm on node order and improves the efficiency of learning the Bayesian network structure.

(1) In the ALARM network (comprising 37 nodes), the experiment began with a sample size of 4000, at which the running time of K2 was 7.928, that of GAK2 was 5.675, and that of GIMK2 was 0.861. At a sample size of 50000, the times were 29.877, 18.429, and 3.128; at 100000, the times were 55.659, 37.839, and 5.987. The SHD is large at first; however, it eventually reduces to zero.

(2) In the Asia network (comprising 8 nodes), the experiment began with a sample size of 200. At a sample size of 5000, the running time of K2 was 0.232, that of GAK2 was 0.236, and that of GIMK2 was 0.078. At 50000, the times were 0.593, 0.279, and 0.235; at 100000, the times were 1.084, 0.884, and 0.423.

(3) In the CANCER network (comprising 5 nodes), the experiment began with a sample size of 200. At a sample size of 5000, the running time of K2 was 0.158, that of GAK2 was 0.139, and that of GIMK2 was 0.053. At 50000, the times were 0.248, 0.218, and 0.188; at 100000, the times were 0.353, 0.245, and 0.191.
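
To make the comparison easier to read, the reported ALARM running times can be turned into speedup ratios. The snippet below only rearranges the numbers given above; the variable and function names are assumptions made for this example.

```python
# Running times reported above for ALARM, as (K2, GAK2, GIMK2),
# keyed by sample size and in the units used in the text.
alarm_times = {
    4000: (7.928, 5.675, 0.861),
    50000: (29.877, 18.429, 3.128),
    100000: (55.659, 37.839, 5.987),
}


def speedups(times):
    """Ratios of the K2 and GAK2 running times to the GIMK2 running time."""
    return {
        size: (k2 / gimk2, gak2 / gimk2)
        for size, (k2, gak2, gimk2) in times.items()
    }


for size, (vs_k2, vs_gak2) in speedups(alarm_times).items():
    print(f"samples={size}: K2/GIMK2 = {vs_k2:.2f}, GAK2/GIMK2 = {vs_gak2:.2f}")
```

For ALARM, the K2-to-GIMK2 ratio stays between 9 and 10 at all three sample sizes.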

The experimental results indicate that the genetic incremental K2 algorithm has a shorter running time than K2 and GAK2 for the same sample size. For mobile Bayesian networks with different numbers of nodes, and particularly when the dataset size increases together with the number of nodes, the running time of K2 and GAK2 becomes very long.

5. Conclusion

Data analysis is now routinely conducted on big data. When big data with uncertain knowledge must be analysed, especially when there are numerous attributes, the genetic incremental K2 algorithm can reduce the search space and considerably improve efficiency. The improved algorithm in this paper is effective; however, it has a disadvantage: the search space at each step depends on the current optimal path, so the algorithm can easily fall into a local optimum. Future work should therefore combine it with particle swarm optimization, ant colony optimization, or another optimization algorithm to avoid falling into a local optimum.

Data Availability

The labeled dataset used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the training object of “Blue Project” in Jiangsu Universities in 2019.