Abstract
The naive Bayes classifier is a simple and effective classification method, but its attribute independence assumption prevents it from expressing dependence among attributes and degrades its classification performance. In this paper, we summarize the existing improved algorithms and propose a Bayesian classifier learning algorithm based on an optimization model (BCOM). BCOM uses the chi-squared statistic to estimate the dependence coefficients among attributes, with which it constructs the objective function as an overall measure of the dependence for a classifier structure. The problem of searching for an optimal classifier can therefore be turned into finding the maximum value of the objective function over the feasible region. In addition, we prove the existence and uniqueness of the numerical solution. BCOM offers a new perspective for research on extended Bayesian classifiers. Theoretical and experimental results show that the new algorithm is correct and effective.
1. Introduction
With the development of information technology, in particular the progress of network, multimedia, and communication technology, the analysis and processing of massive data have become more and more important. Since the Bayesian network classifier has a solid mathematical basis and takes the prior information of samples into consideration, it is now one of the most active topics in machine learning and data mining. Moreover, it has been applied to a wide range of tasks such as natural spoken dialog systems, vision recognition, medical diagnosis, genetic regulatory network inference, and so forth [1–8]. Naive Bayes (NB) [9–11] is a simple and effective classification model. Although its performance can be comparable with that of other classification methods, such as decision trees and neural networks, its attribute independence assumption limits its real application. Extending its structure is a direct way to overcome this limitation [12–14], since attribute dependencies can be explicitly represented by arcs. Tree-augmented naive Bayes (TAN) [9] is an extended tree-like naive Bayes in which the class node directly points to all attribute nodes and an attribute node can have only one parent from among the other attribute nodes. On this basis, Cheng et al. presented the Bayesian network-augmented naive Bayes (BAN) [15, 16], which further expands the tree-like structure of the TAN classifier and allows a dependency relation between any two attribute nodes. In constructing BAN, they use a scoring function based on the minimum description length principle. Unfortunately, the search for the best network is performed in the space of all possible networks, the number of elements in this space increases exponentially with the number of nodes, and finding the best structure is NP-hard [17, 18].
Based on the above analysis, this paper presents, for the first time, a Bayesian classifier learning algorithm based on an optimization model (BCOM), inspired by the constraint-based Bayesian network structure learning method [19–22]. We discuss the classification principles of Bayesian classifiers from a new viewpoint. Because chi-squared tests are a standard tool for measuring the dependency between pairs of variables [23], BCOM first introduces the chi-squared statistic to define the dependence coefficients of variables. Then, it uses the dependence coefficients to construct an overall measure of the dependence in a classifier structure, from which the objective function for our optimization model can be derived. The problem of searching for an optimal classifier can therefore be turned into finding the maximum value of the objective function over the feasible region; the maximizer corresponds to the best classifier. Finally, BCOM improves the efficiency of classification and deletes irrelevant or redundant attributes by using the d-separation rule of Bayesian networks. Theoretical and experimental results show that the proposed algorithm is not only effective in improving accuracy but also has a high learning speed and a simple solving procedure.
The remainder of this paper is organized as follows. Section 2 reviews the existing Bayesian network classifiers. We describe our algorithm and its theoretical proofs in Section 3. Section 4 details the experimental procedures and results of the proposed algorithm. Finally, in Section 5, we conclude and outline our future work.
2. Background
In this section, we discuss previous work that is relevant to this paper and first describe some of the notation used. We use boldface capital letters such as $\mathbf{V}$, $\mathbf{X}$, $\mathbf{Z}$ for sets of variables. General variables are denoted by italic capital letters or indexed italic capital letters $C$, $X$, $X_i$; specific values taken by these variables are denoted $c$, $x$, $x_i$. In particular, we use the same italic letters $C$, $X$, $X_i$ for the graph nodes that correspond to the random variables.
Classification is a basic task in data analysis and pattern recognition that requires the construction of a classifier, that is, a function that assigns a class label to instances described by a set of attributes. The induction of classifiers from data sets of preclassified instances is a central problem in machine learning. Let $\mathbf{V} = \{C, X_1, \dots, X_n\}$ represent the variable set that corresponds to the training data set $D$. We assume that $C$ is the class variable and $\mathbf{X} = \{X_1, \dots, X_n\}$ is the set of attribute variables. Bayesian networks are often used for classification problems, in which the main task is to construct the classifier structure from a given set of training data with class labels and then compute the posterior probability $P(c \mid x_1, \dots, x_n)$, where $c$ is the value that $C$ takes. Thus, it only needs to predict the class with the highest value of probability $P(c \mid x_1, \dots, x_n)$, that is,
$$c^{*} = \arg\max_{c} P(c \mid x_1, \dots, x_n).$$
According to Bayes' theorem, maximizing $P(c \mid x_1, \dots, x_n)$ is equivalent to maximizing $P(c)\,P(x_1, \dots, x_n \mid c)$. The difference between the existing Bayesian classifiers is the way they compute $P(x_1, \dots, x_n \mid c)$.
Figure 1 schematically illustrates the structures of the Bayesian classifiers considered in this paper. In naive Bayes, each attribute node has the class node as its parent but has no parent among the attribute nodes. Computing $P(x_1, \dots, x_n \mid c)$ reduces to computing $\prod_{i=1}^{n} P(x_i \mid c)$. Because the values of $P(c)$ and $P(x_i \mid c)$ can be easily estimated from training examples, naive Bayes is easy to construct. However, its reliance on the conditional independence assumption and on complete data limits its real application. TAN takes the naive Bayes structure and adds edges to it: the class node directly points to all attribute nodes, and an attribute node can have only one parent from among the other attribute nodes. Computing $P(x_1, \dots, x_n \mid c)$ is equivalent to computing $\prod_{i=1}^{n} P(x_i \mid pa(x_i), c)$, where $pa(x_i)$ is the value of the attribute parent of $X_i$, if it has one. It is an efficient extension of naive Bayes. BAN is a specific case of the general Bayesian network classifier, in which the class node also directly points to all attribute nodes, but there is no limitation on the arcs among attribute nodes (except that they do not form any directed cycle). It is clear that TAN and BAN are useful to model correlations among attribute nodes that cannot be captured by naive Bayes. They embody a good tradeoff between the quality of the approximation of correlations among attributes and the computational complexity in the learning stage. In addition, existing algorithms use the same idea to construct the structure of a Bayesian classifier: they first learn the dependence relationships among attribute variables using a Bayesian network structure learning algorithm and then add the class variable as the root node of the network. This is equivalent to learning the best Bayesian network among those in which $C$ is a root. Thus, even if we could improve the performance of a naive Bayes classifier in this way, the computational effort required may not be worthwhile.
Figure 1: Structures of the Bayesian classifiers considered in this paper: (a) NB, (b) TAN, (c) BAN.
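As a minimal sketch of the naive Bayes decision rule discussed above (predict the class that maximizes the prior times the product of per-attribute conditional probabilities), the following Python code estimates the probabilities from counts. The function names `fit_counts` and `nb_predict`, the toy data, and the use of Laplace smoothing are assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter, defaultdict

def fit_counts(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    attr_counts = defaultdict(Counter)  # attr_counts[(i, c)][value] -> count
    for row, c in zip(rows, labels):
        for i, value in enumerate(row):
            attr_counts[(i, c)][value] += 1
    return class_counts, attr_counts

def nb_predict(row, class_counts, attr_counts, n_values):
    """Return argmax_c P(c) * prod_i P(x_i | c).

    n_values[i] is the number of distinct values of attribute i.
    Laplace (add-one) smoothing is an assumption of this sketch.
    """
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / total
        for i, value in enumerate(row):
            score *= (attr_counts[(i, c)][value] + 1) / (n_c + n_values[i])
        if score > best_score:
            best_class, best_score = c, score
    return best_class

rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
class_counts, attr_counts = fit_counts(rows, labels)
print(nb_predict(("rain", "cool"), class_counts, attr_counts, n_values=[2, 3]))
```

A TAN or BAN classifier would replace the inner factor by a conditional probability that also depends on the attribute parents of each node.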
Based on the above analysis, this paper presents an optimization model to learn the structure of a Bayesian classifier, inspired by the constraint-based Bayesian network structure learning method. This is the first time that the problem of structure learning for a Bayesian classifier is transformed into a related mathematical programming problem by defining an objective function and a feasible region. We also propose a new method to measure the dependence relationships between attributes. The theoretical basis of this method is established by Theorem 1 [24].
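Theorem 1. Given a data set $D$ and a variable set $\mathbf{V}$, if the hypothesis that $X_i$ and $X_j$ are conditionally independent given $X_k$ is true, then the statistic
$$U^2_{ij|k} = 2 \sum_{a,b,c} N^{abc}_{ijk} \log \frac{N^{abc}_{ijk}\, N^{c}_{k}}{N^{ac}_{ik}\, N^{bc}_{jk}}$$
approximately follows a $\chi^2$ distribution with $(r_i - 1)(r_j - 1)\, r_k$ degrees of freedom, where $r_i$, $r_j$, and $r_k$ represent the number of configurations for the variables $X_i$, $X_j$, and $X_k$, respectively. $N^{abc}_{ijk}$ is the number of cases in $D$ where $X_i = x_i^a$, $X_j = x_j^b$, and $X_k = x_k^c$. $N^{ac}_{ik}$ is the number of cases in $D$ where $X_i = x_i^a$ and $X_k = x_k^c$ ($N^{bc}_{jk}$ is defined analogously), and $N^{c}_{k}$ is the number of cases in $D$ where $X_k = x_k^c$.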
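The conditional independence statistic of Theorem 1 can be computed from four count tables. The sketch below is one way to do so in plain Python; it assumes the likelihood-ratio ($G^2$) form of the chi-squared statistic (a Pearson form would use the same counts), and the function name `conditional_g2` and the row-tuple data layout are hypothetical choices for illustration.

```python
import math
from collections import Counter

def conditional_g2(data, i, j, k):
    """Return (statistic, degrees_of_freedom) for testing X_i indep. X_j given X_k.

    `data` is a list of tuples of discrete values; i, j, k are column indices.
    The likelihood-ratio form of the statistic is an assumption of this sketch.
    """
    n_abc = Counter((row[i], row[j], row[k]) for row in data)
    n_ac = Counter((row[i], row[k]) for row in data)
    n_bc = Counter((row[j], row[k]) for row in data)
    n_c = Counter(row[k] for row in data)

    stat = 0.0
    for (a, b, c), n in n_abc.items():  # empty cells contribute nothing
        stat += 2.0 * n * math.log(n * n_c[c] / (n_ac[(a, c)] * n_bc[(b, c)]))

    # Number of observed configurations of each variable.
    r_i = len({row[i] for row in data})
    r_j = len({row[j] for row in data})
    r_k = len({row[k] for row in data})
    return stat, (r_i - 1) * (r_j - 1) * r_k
```

The returned statistic would then be compared with the critical value of the chi-squared distribution for the returned degrees of freedom at the chosen significance level.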
3. A Bayesian Classifier Learning Algorithm Based on Optimization Model
3.1. Optimization Model Design
In this subsection, we give some basic concepts and theorems that form the foundation of the method proposed in this paper.
A Bayesian classifier is a graphical representation of a joint probability distribution that includes two components. One is a directed acyclic graph $G = (\mathbf{V}, E)$, where the node set $\mathbf{V}$ represents the class and attribute variables, and the edge set $E$ represents direct dependency relationships between variables. The other is a joint probability distribution $P$ that quantifies the effect that the parent set $Pa(X_i)$ has on each variable $X_i$ in $G$, where $Pa(X_i)$ denotes the set of parents of $X_i$ in $G$. We assume that $C$ is the class node and $\mathbf{X} = \{X_1, \dots, X_n\}$ is the set of attribute nodes. The structure of $G$ reflects the underlying probabilistic dependence relations among the nodes and a set of assertions about conditional independencies. The problem of data classification can be stated as follows: the learning goal is first to find the classifier structure $G$ that best matches the training data set $D$ and to estimate the parameters from $D$, and then to assign class labels to test instances. Since $G$ is a directed acyclic graph, it can be represented by a binary node-node adjacency matrix $M = (m_{ij})$. Entry $m_{ij}$ is 1 if there is a directed arc from node $i$ to node $j$, and 0 otherwise. That is,
$$m_{ij} = \begin{cases} 1, & \text{if there is a directed arc from node } i \text{ to node } j, \\ 0, & \text{otherwise.} \end{cases}$$
Let $A = (a_{ij})$ be the sum of powers of the adjacency matrix $M$. Entry $a_{ij}$ is equal to the number of directed paths from node $i$ to node $j$ in the graph [25].
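To make the path-counting property concrete, here is a small sketch that sums the powers of a 0/1 adjacency matrix and uses the diagonal of the result to test for cycles. The function names and the choice to sum powers 1 through the number of nodes are assumptions of this sketch, not details confirmed by the paper.

```python
def mat_mul(p, q):
    """Multiply two square matrices given as lists of lists."""
    size = len(p)
    return [[sum(p[i][t] * q[t][j] for t in range(size)) for j in range(size)]
            for i in range(size)]

def path_count_matrix(adj):
    """Return adj^1 + adj^2 + ... + adj^m for an m-node graph.

    For an acyclic graph, entry [i][j] is the number of directed paths i -> j.
    """
    size = len(adj)
    power = [row[:] for row in adj]
    total = [row[:] for row in adj]
    for _ in range(size - 1):
        power = mat_mul(power, adj)
        total = [[total[i][j] + power[i][j] for j in range(size)]
                 for i in range(size)]
    return total

def is_acyclic(adj):
    """A graph is acyclic exactly when no node can reach itself."""
    counts = path_count_matrix(adj)
    return all(counts[i][i] == 0 for i in range(len(adj)))

# Arcs 0 -> 1, 0 -> 2, 1 -> 2: two directed paths lead from node 0 to node 2.
adj = [[0, 1, 1],
       [0, 0, 1],
       [0, 0, 0]]
print(path_count_matrix(adj)[0][2], is_acyclic(adj))
```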
We wish to be able to use a mathematical programming formulation, and this formulation requires that we are able to measure the impact of adding or removing a single arc from the network. In order to approximate the impact of adding such an arc, we define the dependence coefficient.
Definition 2. Given a data set $D$ and a variable set $\mathbf{V}$, we define the dependence coefficient between variables $X_i$ and $X_j$ as $c_\alpha(X_i, X_j) = \min_{k \neq i, j} \{ U^2_{ij|k} - \chi^2_{ij|k,\alpha} \}$, where $U^2_{ij|k}$ is the statistic of $X_i$ and $X_j$ given $X_k$, and $\chi^2_{ij|k,\alpha}$ is the critical value at the significance level $\alpha$ of a $\chi^2$ distribution with $(r_i - 1)(r_j - 1)\, r_k$ degrees of freedom.
Obviously, $c_\alpha(X_i, X_j)$ is a conservative estimate of the degree of dependence between two nodes. If $c_\alpha(X_i, X_j) > 0$, then, regardless of the other variable involved, there is statistically significant dependence between $X_i$ and $X_j$, so there should be an arc between them. If $c_\alpha(X_i, X_j) < 0$, then there is at least one way of conditioning the relationship so that significant dependence is not present. We define the dependence coefficient matrix corresponding to the variable set $\mathbf{V}$ as $C_\alpha = \big(c_\alpha(X_i, X_j)\big)_{X_i, X_j \in \mathbf{V}}$.
Lemma 3. Given a data set $D$ and a variable set $\mathbf{V}$, $X_i$ and $X_j$ are locally conditionally independent at the significance level $\alpha$ if and only if there is a node $X_k \in \mathbf{V} \setminus \{X_i, X_j\}$ such that $U^2_{ij|k} < \chi^2_{ij|k,\alpha}$.
The proof of Lemma 3 follows directly from Definition 2 and the chi-squared hypothesis test. According to Lemma 3, $X_i$ and $X_j$ are locally conditionally independent at the significance level $\alpha$ if and only if there is a node $X_k$ such that $U^2_{ij|k} < \chi^2_{ij|k,\alpha}$. Further, $X_i$ and $X_j$ are globally conditionally independent at the significance level $\alpha$ if and only if $U^2_{ij|k} < \chi^2_{ij|k,\alpha}$ for any $X_k \in \mathbf{V} \setminus \{X_i, X_j\}$. Based on this, we use the dependence coefficients to construct an overall measure of the dependence, which will be treated as the objective function for our mathematical program.
Definition 4. For a Bayesian classifier with adjacency matrix $M = (m_{ij})$, the global dependence measure of the network is given by
$$F(M, \alpha) = \sum_{i} \sum_{j} c_\alpha(X_i, X_j)\, m_{ij},$$
where the sums run over all pairs of nodes.
According to the measure of Definition 4, if $X_i$ and $X_j$ are conditionally independent, then by Lemma 3 $c_\alpha(X_i, X_j) < 0$, and hence adding an arc between $X_i$ and $X_j$ decreases the value of $F$. Thus, we wish to find the feasible solution that maximizes $F$. The optimal solution corresponds to the best classifier structure. We next explain what constitutes a feasible network.
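The dependence coefficient can be sketched in a few lines of Python. This sketch assumes that the coefficient is the minimum, over all conditioning variables, of the test statistic minus the corresponding critical value, which is how the "conservative estimate" remark reads. The callables `statistic` and `critical` are hypothetical placeholders for a conditional chi-squared statistic and a chi-squared quantile lookup; at least three nodes are required.

```python
def dependence_coefficient(i, j, nodes, statistic, critical):
    """Minimum over k (k != i, j) of statistic(i, j, k) - critical(i, j, k).

    A negative result means some conditioning variable removes the
    significant dependence between i and j.
    """
    return min(statistic(i, j, k) - critical(i, j, k)
               for k in nodes if k not in (i, j))

def dependence_matrix(nodes, statistic, critical):
    """Dependence coefficients for every ordered pair; the diagonal stays 0."""
    size = len(nodes)
    coeff = [[0.0] * size for _ in range(size)]
    for p, i in enumerate(nodes):
        for q, j in enumerate(nodes):
            if i != j:
                coeff[p][q] = dependence_coefficient(i, j, nodes,
                                                     statistic, critical)
    return coeff
```

With a constant critical value of 3.841 (the 0.05 level for one degree of freedom), a pair whose statistics given two different conditioning variables are 10.0 and 1.0 receives the coefficient 1.0 - 3.841, which is negative and so signals local conditional independence.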
Given the variable set $\mathbf{V} = \{C, X_1, \dots, X_n\}$, $C$ is the class node and $\mathbf{X} = \{X_1, \dots, X_n\}$ is the set of attribute nodes. A directed network is a feasible classifier structure if and only if the following conditions are satisfied:
(1) for any attribute node $X_i$, there is no directed edge from $X_i$ to $C$;
(2) for any node $X_i$, there is no directed path from $X_i$ to $X_i$, namely, the graph is acyclic;
(3) there exists at least one attribute node that depends on the class node $C$, namely, there is an attribute node $X_i$ such that $X_i$ can be reached from $C$ by a directed path.
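In order to incorporate the requirements of the above three conditions into a mathematical programming formulation, we index the class node $C$ as node 0 and the attribute node $X_i$ as node $i$, and express the conditions in terms of the adjacency entries $m_{ij}$ and the path counts $a_{ij}$ by the following constraints:
(1) $\sum_{i=1}^{n} m_{i0} = 0$;
(2) $\sum_{i=0}^{n} a_{ii} = 0$;
(3) $\sum_{j=1}^{n} a_{0j} \geq 1$.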
The feasible classifiers are those that satisfy constraints (1)–(3). Thus, learning the best Bayesian classifier can be transformed into the following related mathematical programming problem, where the objective function is the global dependence measure of the network and the feasible region is the set of classifiers with reachability constraints (1)–(3), that is,
$$\max_{M} \; F(M, \alpha) \quad \text{subject to constraints (1)–(3)}, \quad m_{ij} \in \{0, 1\}.$$
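For very small networks, the optimization model can be illustrated by exhaustive enumeration. The sketch below is not the solution procedure of BCOM; it is a brute-force illustration that assumes a linear objective (the sum of dependence coefficients over the selected arcs), places the class node at index 0, and checks the three feasibility conditions directly. All function names are hypothetical, and the running time is exponential in the number of node pairs.

```python
from itertools import product

def power_sum(adj):
    """Return adj + adj^2 + ... + adj^m for an m-node graph (walk counts)."""
    size = len(adj)
    power = [row[:] for row in adj]
    total = [row[:] for row in adj]
    for _ in range(size - 1):
        power = [[sum(power[i][t] * adj[t][j] for t in range(size))
                  for j in range(size)] for i in range(size)]
        total = [[total[i][j] + power[i][j] for j in range(size)]
                 for i in range(size)]
    return total

def is_feasible(adj):
    """Check conditions (1)-(3) with the class node at index 0."""
    size = len(adj)
    if any(adj[i][0] for i in range(size)):          # (1) no arc into the class
        return False
    paths = power_sum(adj)
    if any(paths[i][i] for i in range(size)):        # (2) no directed cycle
        return False
    return any(paths[0][j] for j in range(1, size))  # (3) class reaches an attribute

def best_structure(coeff):
    """Enumerate all 0/1 adjacency matrices and keep the best feasible one."""
    size = len(coeff)
    cells = [(i, j) for i in range(size) for j in range(size) if i != j]
    best_adj, best_value = None, float("-inf")
    for bits in product((0, 1), repeat=len(cells)):
        adj = [[0] * size for _ in range(size)]
        for (i, j), bit in zip(cells, bits):
            adj[i][j] = bit
        if not is_feasible(adj):
            continue
        value = sum(coeff[i][j] * adj[i][j] for i, j in cells)
        if value > best_value:
            best_adj, best_value = adj, value
    return best_adj, best_value

# Toy symmetric coefficient matrix for a class node and two attributes.
coeff = [[0, 5, 4],
         [5, 0, 3],
         [4, 3, 0]]
structure, value = best_structure(coeff)
print(structure, value)
```

In this toy case the two attribute-to-attribute arcs have equal coefficients, so two structures share the optimal value; the enumeration simply returns the first one it encounters.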
3.2. BCOM Algorithm and Its Correctness
In this subsection, we present the main algorithm of this paper. First, our method finds the best Bayesian classifier by solving the above optimization model. Second, we use the d-separation rule of Bayesian networks to delete irrelevant or redundant attributes in the network, that is, attributes that have a low degree of dependence with the class variable. The parameters of the modified network can then be estimated. Third, classification is done by applying the obtained classifier to predict the class labels of the test data. We prove the correctness of the proposed method under the faithfulness assumption for the data distribution.
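Consider a directed acyclic graph $G = (\mathbf{V}, E)$, where $\mathbf{V}$ is the node set and $E$ is the set of directed edges. A path $l$ between two distinct nodes $X_s$ and $X_t$ is a sequence of distinct nodes $(Y_0, Y_1, \dots, Y_m)$ in which the first node is $X_s$, the last one is $X_t$, and every two consecutive nodes are connected by an edge, that is, $Y_{r-1} \to Y_r$ or $Y_{r-1} \leftarrow Y_r$ for $r = 1, \dots, m$.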
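Definition 5. A path $l$ is said to be d-separated by a set $\mathbf{Z}$ in a directed acyclic graph $G$ if and only if (1) $l$ contains a "head-to-tail meeting" $X_i \to X_m \to X_j$ or a "tail-to-tail meeting" $X_i \leftarrow X_m \to X_j$ such that the middle node $X_m$ is in $\mathbf{Z}$, or (2) $l$ contains a "head-to-head meeting" $X_i \to X_m \leftarrow X_j$ such that the middle node $X_m$ is not in $\mathbf{Z}$ and no descendant of $X_m$ is in $\mathbf{Z}$. In particular, two distinct sets of nodes $\mathbf{X}$ and $\mathbf{Y}$ are said to be d-separated by a set $\mathbf{Z}$ in $G$ if $\mathbf{Z}$ d-separates every path from any node in $\mathbf{X}$ to any node in $\mathbf{Y}$ [26].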
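Checking d-separation path by path is expensive, so the sketch below uses a standard equivalent criterion instead: two node sets are d-separated by a third set exactly when they are disconnected in the moralized ancestral graph after the separating nodes are removed. The function name `d_separated` and the parent-dictionary representation are assumptions for illustration; the three sets are assumed to be disjoint.

```python
def d_separated(parents, xs, ys, zs):
    """Return True if node sets xs and ys are d-separated by zs.

    `parents` maps every node to the set of its parents in the DAG.
    """
    xs, ys, zs = set(xs), set(ys), set(zs)

    # 1. Restrict to the ancestral set of xs, ys and zs.
    relevant, stack = set(), list(xs | ys | zs)
    while stack:
        node = stack.pop()
        if node not in relevant:
            relevant.add(node)
            stack.extend(parents[node])

    # 2. Moralize: connect each node to its parents, marry co-parents,
    #    and drop edge directions.
    neighbours = {node: set() for node in relevant}
    for node in relevant:
        pa = list(parents[node])
        for p in pa:
            neighbours[node].add(p)
            neighbours[p].add(node)
        for a in pa:
            for b in pa:
                if a != b:
                    neighbours[a].add(b)

    # 3. Remove zs and search for an undirected route from xs to ys.
    seen, stack = set(), [x for x in xs if x not in zs]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in ys:
            return False
        stack.extend(n for n in neighbours[node]
                     if n not in zs and n not in seen)
    return True

chain = {"A": set(), "B": {"A"}, "C": {"B"}}          # A -> B -> C
collider = {"A": set(), "B": set(), "C": {"A", "B"}}  # A -> C <- B
print(d_separated(chain, {"A"}, {"C"}, {"B"}))        # head-to-tail, middle node given
print(d_separated(collider, {"A"}, {"B"}, {"C"}))     # head-to-head, middle node given
```

The chain example corresponds to case (1) of Definition 5, and the collider example shows that conditioning on a head-to-head node opens the path.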
In this paper, we assume that all the distributions are compatible with $G$ [27]. We also assume that all independencies of a probability distribution of the variables in $\mathbf{V}$ can be checked by d-separations of $G$; this is called the faithfulness assumption [26]. The faithfulness assumption means that all independencies and conditional independencies among variables can be represented by $G$. We now formally describe our method in Algorithm 1.

From the detailed steps of BCOM, we can see that the BCOM classifier relaxes the restrictions on the conditioning variables and thus better meets the needs of practical applications. Although its network structure is similar to that of BAN, BCOM does not need to build all possible networks in which the class node is a root, and it removes irrelevant or redundant nodes from the network before estimating the network parameters, which greatly reduces the computation of the posterior probability of the class variable. In fact, the training process of BCOM is different from that of other BN classifiers. Its main task is to solve the mathematical programming problem of Section 3.1. To create the dependence coefficient matrix, BCOM needs to compute the conditional statistics. Moreover, just as in other constraint-based algorithms, the main cost of BCOM is the number of conditional independence tests for computing the dependence coefficients of any two variables in step 2. The number of conditional independence tests is and the computing complexity is . The total complexity of BCOM is bound by , where is the number of variables in the network and is the number of cases in data set . In principle, BCOM is a structure-extension-based algorithm. In BCOM, we essentially extend the structure of TAN by relaxing the parent set of each attribute node. Thus, the resulting structure is more complex than TAN but simpler than BAN. Therefore, BCOM is a good tradeoff between model complexity and accuracy compared with TAN and BAN. Next, we prove the correctness of the BCOM algorithm under the faithfulness assumption.
The next two results establish the existence and uniqueness of the solution to this optimization problem.
Theorem 6. Let . There always exists an such that is a feasible point of .
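Proof. Let $\mathbf{V} = \{C, X_1, \dots, X_n\}$ be the set of variables, where $C$ is the class variable (indexed as node 0) and $X_1, \dots, X_n$ are the attribute variables. We define a matrix $M^{0} = (m^{0}_{ij})$ as follows:
$$m^{0}_{ij} = \begin{cases} 1, & \text{if } i = 0 \text{ and } j \in \{1, \dots, n\}, \\ 0, & \text{otherwise.} \end{cases}$$
Obviously, the adjacency matrix $M^{0}$ always satisfies constraints (1)–(3). In fact, the graph represented by $M^{0}$ is the naive Bayes classifier. Thus, $M^{0}$ is a feasible solution of the optimization problem.
According to Theorem 6, there exists a feasible classifier that satisfies constraints (1)–(3). Theorem 7 further shows that the optimal classifier is unique under a certain condition.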
Theorem 7. Let be the optimal solution of , and be the coefficient sets where is the element of . is the unique solution of if and only if any element in cannot be expressed as the sum of any number of elements in .
Proof. Suppose, for the sake of contradiction, that and are two optimal solutions of . The value of the objective function is the same in both solutions, that is, Let , . According to the assumption of and , there must exist such that , namely, . Then, by (7), Since then, by (8), there must exist such that and , namely, where . This contradicts the known condition that any element in cannot be expressed as the sum of any number of elements in .
Theorem 8. Let be the classifier structure obtained by step 4 of BCOM, where is the class variable and are attribute variables. denotes the final output of BCOM, then the classification results obtained by and are consistent.
Proof. Without loss of generality, suppose is an example to be classified. The classifier represented by is given as follows: We write the right side of (11) as in short. We can suppose that redundant variables were deleted in step 5 of BCOM, say the last variables , . Then, . According to step 5, d-separates and . Thus, is conditionally independent of given . Equation (11) can be reduced as follows: This proves the result.
Theorem 8 shows that removing redundant or irrelevant attributes using the d-separation rule is correct and leaves the classification results unchanged, so the efficiency of the Bayesian classifier can be improved.
4. Experimental Results
We run our experiments on 20 data sets from the UCI repository of machine learning data sets [28], which represent a wide range of domains and data characteristics. Table 1 describes the 20 data sets, which are ordered by ascending number of samples. In our experiments, missing values are replaced with the modes and means of the corresponding attribute values from the available data. For example, if the sex of someone is missing, it can be replaced by the mode (the value with the highest frequency) of the sexes of all the others. In addition, we manually delete three useless attributes: the attribute "ID number" in the data set "Glass", the attribute "name" in the data set "Hayes-roth", and the attribute "animal name" in the data set "Zoo".
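The missing-value treatment described above can be sketched as follows. This is a minimal illustration that assumes the mode is used for nominal attributes and the mean for numeric ones; the function name `impute_column` and the use of `None` as the missing marker are hypothetical.

```python
from collections import Counter
from statistics import mean

def impute_column(values, numeric, missing=None):
    """Replace missing entries by the mean (numeric) or the mode (nominal)."""
    observed = [v for v in values if v is not missing]
    if numeric:
        fill = mean(observed)
    else:
        fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is missing else v for v in values]

print(impute_column(["m", "f", None, "f"], numeric=False))
print(impute_column([1.0, None, 3.0], numeric=True))
```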
The experimental platform is a personal computer with a Pentium 4 3.06 GHz CPU, 0.99 GB of memory, and Windows XP. Our implementation is based on the BayesNet Toolbox for Matlab [29], which provides source code to perform several operations on Bayesian networks. The purpose of these experiments is to compare the performance of the proposed BCOM with naive Bayes, TAN, and BAN in terms of classifier accuracy. The accuracy of each model is the percentage of successful predictions on the test sets of each data set. In all experiments, the accuracy of each model on each data set is obtained via 10 runs of 5-fold cross-validation. Runs with the various algorithms are carried out on the same training sets and evaluated on the same test sets. In particular, the cross-validation folds are the same for all the experiments on each data set. Finally, we compare the algorithms via a two-tailed $t$-test with a 95 percent confidence level. According to statistical theory, we speak of two results for a data set as being "significantly different" only if the probability of significant difference is at least 95 percent [30].
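One way to guarantee that every classifier sees the same folds is to generate the splits from a fixed random seed. The sketch below illustrates this for 10 runs of 5-fold cross-validation; the function name `repeated_kfold`, the seeding scheme, and the interleaved fold assignment are assumptions for illustration and are not taken from the paper's Matlab implementation.

```python
import random

def repeated_kfold(n_samples, n_folds=5, n_runs=10, seed=0):
    """Yield (run, fold, train_idx, test_idx); the same seed gives the same folds."""
    rng = random.Random(seed)
    for run in range(n_runs):
        order = list(range(n_samples))
        rng.shuffle(order)
        for fold in range(n_folds):
            test_idx = order[fold::n_folds]       # every n_folds-th shuffled index
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            yield run, fold, train_idx, test_idx

# Materialize the splits once and reuse them for every classifier.
splits = list(repeated_kfold(n_samples=20))
print(len(splits))
```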
Table 2 shows the accuracy (and standard deviation of accuracy) of each model on each data set, and the average values and standard deviations over all data sets are summarized at the bottom of the table. In each row, the best of the four classifier results is displayed in bold. If another classifier's performance is not significantly different from the best, it is also highlighted, but if the differences between all four classifiers are not statistically significant, then none of them is highlighted. From our experiments, we can see that BCOM is best in 6 cases. NB, TAN, and BAN are best in 6, 5, and 3 cases, respectively. When the number of samples is larger than 400, the performance of TAN and BAN is better than that of NB, and BCOM is best. Although the performance of BCOM and TAN becomes similar as the sample size increases, BCOM has a higher accuracy on average. From a general point of view, the highlighted numbers in the sixth column of Table 2 become more frequent from the first data set to the last one. This means that the advantage of BCOM becomes more evident as the data size increases.
Table 3 shows the results of the two-tailed $t$-test, in which each entry $w/t/l$ means that the model in the corresponding row wins in $w$ data sets, ties in $t$ data sets, and loses in $l$ data sets, compared to the model in the corresponding column. From Table 3, we can see that BCOM significantly outperforms NB (9 wins and 4 losses), TAN (12 wins and 5 losses), and BAN (11 wins and 5 losses) in accuracy. Figures 2, 3, and 4 show scatter plots comparing BCOM with NB, TAN, and BAN, respectively. In each scatter plot, each point represents a data set, where the $x$ coordinate of a point is the percentage of misclassifications according to NB, TAN, or BAN, and the $y$ coordinate is the percentage of misclassifications according to BCOM. Thus, points below the diagonal line correspond to data sets on which BCOM performs better. From Figures 2 and 3, we can see that BCOM generally outperforms NB and TAN, as is also demonstrated in Table 3. This provides strong evidence that BCOM performs well against these two classifiers in terms of both accuracy and the percentage of misclassifications. Figure 4 also shows BCOM outperforming BAN, though the difference in performance is not as marked as in Figures 2 and 3. In other words, the performance of BCOM and BAN is similar in terms of the percentage of misclassifications. However, BCOM has a higher accuracy and a simpler graph structure, which suggests that BCOM is able to handle very large data sets and is a more promising classifier.
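A win/tie/loss entry of this kind could be tallied as sketched below. The sketch assumes a paired two-tailed $t$-test on the per-run accuracies of two classifiers for each data set; whether the paper pairs the results in exactly this way is not stated. The critical value must be supplied by the caller for the appropriate degrees of freedom (for example, 2.262 for 10 paired values at the 95 percent level). The function names are hypothetical.

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic for two equally long lists of accuracies."""
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0:
        # All differences are identical: no evidence if they are zero,
        # otherwise an unbounded statistic with the sign of the difference.
        return 0.0 if mean(diffs) == 0 else math.copysign(math.inf, mean(diffs))
    return mean(diffs) / (sd / math.sqrt(len(diffs)))

def win_tie_loss(results_a, results_b, critical_t):
    """Count data sets where classifier A significantly wins, ties, or loses."""
    wins = ties = losses = 0
    for a, b in zip(results_a, results_b):
        t = paired_t(a, b)
        if t > critical_t:
            wins += 1
        elif t < -critical_t:
            losses += 1
        else:
            ties += 1
    return wins, ties, losses
```

Each element of `results_a` and `results_b` holds the accuracies of one classifier on one data set, measured on identical folds so that the pairing is meaningful.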
5. Conclusions
In many real-world applications, classification is often required to make optimal decisions. In this paper, we summarize the existing improved algorithms for naive Bayes and propose a novel Bayesian classifier model: BCOM. We conducted a systematic experimental study on a number of UCI data sets. The experimental results show that BCOM performs better than the other state-of-the-art models for augmenting naive Bayes. It is clear that in some situations, it would be useful to model correlations among attributes. BCOM is a good tradeoff between the quality of the approximation of correlations among attributes and the computational complexity in the learning stage. Considering its simplicity, BCOM is a promising model that could be used in many fields.
In addition, we use the chi-squared statistic to estimate the dependence coefficients among attributes from the data set. We believe that the use of more sophisticated methods could improve the performance of the current BCOM and strengthen its advantage. This is the main research direction for our future work.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (nos. 60974082 and 61075055), the National Funds of China for Young Scientists (no. 11001214), and the Fundamental Research Funds for the Central Universities (no. K5051270013).