Mathematical Problems in Engineering

Volume 2014 (2014), Article ID 470821, 11 pages

http://dx.doi.org/10.1155/2014/470821

## Extracting Credible Dependencies for Averaged One-Dependence Estimator Analysis

^{1}Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
^{2}State Key Laboratory of Computer Science, Beijing 100080, China
^{3}School of Mathematics and Information, Shanghai Lixin University of Commerce, Shanghai 210620, China
^{4}Medical College, Jilin University, Changchun 130021, China

Received 8 April 2014; Revised 25 May 2014; Accepted 26 May 2014; Published 17 June 2014

Academic Editor: Yang Xu

Copyright © 2014 LiMin Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Of the numerous proposals to improve the accuracy of naive Bayes (NB) by weakening the conditional independence assumption, averaged one-dependence estimator (AODE) demonstrates remarkable zero-one loss performance. However, indiscriminately chosen superparent attributes bring both considerable computational cost and a negative effect on classification accuracy. In this paper, to extract the most credible dependencies, we present a new type of seminaive Bayesian operation, which selects superparent attributes by building a maximum weighted spanning tree and removes highly correlated children attributes by functional dependency and canonical cover analysis. Our extensive experimental comparison on UCI data sets shows that this operation efficiently identifies possible superparent attributes at training time and eliminates redundant children attributes at classification time.

#### 1. Introduction

Bayesian networks (BNs) are a key research area of knowledge discovery and machine learning. A BN consists of two parts: a qualitative part and a quantitative part. The qualitative part denotes the graphical structure of the network, while the quantitative part consists of the conditional probability tables (CPTs) in the network. Although BNs support efficient inference, learning the quantitative part is complex, and learning an optimal BN structure from existing data has been proven to be an NP-hard problem. The graphical structure of naive Bayes (NB) is simple and definite because of the conditional independence assumption between attributes, making NB efficient and effective [1, 2]. However, violations of this conditional independence assumption can make the classification of NB suboptimal. Numerous algorithms have been proposed to retain the desirable simplicity and efficiency of NB while alleviating the problems of the independence assumption. Averaged one-dependence estimator (AODE) [3, 4] utilizes a restricted class of one-dependence estimators (ODEs) and aggregates the predictions of all qualified estimators within this class. A superparent attribute is indiscriminately selected from the attribute set as the parent of all the other attributes in each ODE. By averaging the estimates of all of the three-dimensional estimators, AODE makes a weaker conditional independence assumption than NB. Previous studies that compared different variations of NB techniques show that AODE is significantly better than other NB techniques in terms of zero-one loss reduction [5]. Since its introduction in 2005, AODE has enjoyed considerable popularity because of its capability to improve the accuracy of NB [5].

Another strategy to remedy violations of the attribute independence assumption is to eliminate highly correlated attributes. Backward sequential elimination (BSE) [6] uses a simple heuristic wrapper approach that selects a subset of the available attributes to minimize zero-one loss on the training set. BSE is effective especially for data sets with highly correlated attributes. Forward sequential selection (FSS) [7] uses the reverse search direction to BSE. However, both FSS and BSE have high computational overheads, especially on learning algorithms with high classification time complexity, because they apply the algorithms repeatedly until no accuracy improvement occurs. Subsumption resolution (SR) [8] identifies pairs of attribute values such that one appears to subsume (be a generalization of) the other and deletes the generalization. Near-subsumption resolution (NSR) [8] is a variant of SR; it extends SR by deleting not only generalizations but also near-generalizations. For different instances, SR and NSR may find different attributes to remove, making them much more flexible than BSE and FSS. However, since generalization mainly deals with pairs of attributes, SR offers no solution for more complicated situations such as loop relationships. In this paper, we present a new type of seminaive Bayesian operation, which selects parent (SP) attributes by building a maximum weighted spanning tree and removes children (RC) attributes by functional dependency and canonical cover analysis. Thus this algorithm combines the advantages of BSE, FSS, and SR.

The remainder of the paper is organized as follows. Section 2 introduces the basic ideas of NB, AODE, and related background theory. Section 3 introduces the SP and RC techniques for attribute selection and elimination with AODE and presents the theoretical justification. Section 4 shows the experimental results on UCI data sets and a detailed analysis of different attribute selection techniques. The final section concludes the paper.

#### 2. Related Research Work

##### 2.1. NB and AODE

The aim of supervised learning is to predict from a training set the class of a testing instance $x = \langle x_1, \ldots, x_n \rangle$, where $x_i$ is the value of the $i$th attribute $X_i$. We estimate the conditional probability $P(y \mid x)$ and classify by selecting $\arg\max_y P(y \mid x)$, where $y \in \{c_1, \ldots, c_k\}$ ranges over the classes. From Bayes theorem, we have
$$P(y \mid x) = \frac{P(y)\,P(x \mid y)}{P(x)}. \tag{1}$$

NB simplifies the estimation of $P(x \mid y)$ by the conditional independence assumption
$$P(x \mid y) = \prod_{i=1}^{n} P(x_i \mid y). \tag{2}$$

Then, the following equation is often calculated in practice rather than (1):
$$\operatorname{classify}(x) = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y). \tag{3}$$

The corresponding network structure is depicted in Figure 1(a). One advantage of NB is that it avoids model selection, because selecting between alternative models can be expected to increase variance and allow a learning system to overfit the training data [3]. In consequence, changes in the training data will not lead to any change in the structure of NB, which leads in turn to lower variance [4]. For approaches with a definite model form such as NB, only the underlying conditional probability tables change when the training data changes, resulting in relatively gradual changes in the pattern of classification.
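As a concrete illustration of equations (2) and (3), the following sketch implements NB over discrete attributes with add-one (Laplace) smoothing. The class name, data layout, and smoothing constants here are illustrative assumptions, not taken from the paper:

```python
import math
from collections import defaultdict

class NaiveBayes:
    """Sketch of classify(x) = argmax_y P(y) * prod_i P(x_i | y)."""

    def fit(self, X, y):
        self.n = len(y)
        self.classes = sorted(set(y))
        self.values = defaultdict(set)   # attribute index -> observed values
        self.cy = defaultdict(int)       # class -> count
        self.cxy = defaultdict(int)      # (attr index, value, class) -> count
        for xs, c in zip(X, y):
            self.cy[c] += 1
            for i, v in enumerate(xs):
                self.values[i].add(v)
                self.cxy[(i, v, c)] += 1
        return self

    def predict(self, xs):
        def score(c):
            # log P(y) + sum_i log P(x_i | y), with add-one smoothing
            s = math.log((self.cy[c] + 1) / (self.n + len(self.classes)))
            for i, v in enumerate(xs):
                s += math.log((self.cxy[(i, v, c)] + 1)
                              / (self.cy[c] + len(self.values[i])))
            return s
        return max(self.classes, key=score)
```

Trained on a small discrete table, the classifier simply picks the class maximizing the smoothed product of prior and per-attribute likelihoods.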

Numerous techniques have sought to enhance the accuracy of NB by relaxing the conditional independence assumption while attaining the efficiency and efficacy of one-dependence classifiers. Among them, averaged one-dependence estimator (AODE) [3, 4] utilizes a restricted class of one-dependence estimators (ODEs) and aggregates the predictions of all qualified estimators within this class. A superparent attribute (e.g., $X_i$) is selected as the parent of all the other attributes in each ODE, since
$$P(y, x) = P(y, x_i)\,P(x \mid y, x_i). \tag{4}$$

Independence is assumed among the remaining attributes given $y$ and $x_i$. Consider
$$P(x \mid y, x_i) = \prod_{j=1,\, j \neq i}^{n} P(x_j \mid y, x_i). \tag{5}$$

Hence, $x$ can be classified by selecting
$$\operatorname{classify}(x) = \arg\max_y \sum_{i:\, F(x_i) \geq m} P(y, x_i) \prod_{j=1,\, j \neq i}^{n} P(x_j \mid y, x_i), \tag{6}$$
where $F(x_i)$ is the frequency of value $x_i$ in the training data and $m$ is a minimum frequency threshold for admitting a superparent.

The corresponding network structure of AODE is depicted in Figure 1(b). AODE maintains the robustness and much of the efficiency of NB and at the same time exhibits significantly higher classification accuracy for many data sets. Therefore, it has the potential to be a valuable substitute for NB over a considerable range of classification tasks.
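The averaging scheme can be sketched as follows. The add-one smoothing and the frequency threshold `m` gating superparents are assumptions of this sketch (the literature commonly uses `m = 1`), and the NB fallback used when no superparent qualifies is omitted for brevity:

```python
from collections import defaultdict

class AODE:
    """Sketch: sum over superparents x_i of P(y, x_i) * prod_j P(x_j | y, x_i)."""

    def __init__(self, m=1):
        self.m = m  # minimum superparent-value frequency

    def fit(self, X, y):
        self.n = len(y)
        self.classes = sorted(set(y))
        self.values = defaultdict(set)  # attribute index -> observed values
        self.cx = defaultdict(int)      # (i, x_i) -> count
        self.cyx = defaultdict(int)     # (i, x_i, y) -> count
        self.cyxx = defaultdict(int)    # (i, x_i, j, x_j, y) -> count
        for xs, c in zip(X, y):
            for i, vi in enumerate(xs):
                self.values[i].add(vi)
                self.cx[(i, vi)] += 1
                self.cyx[(i, vi, c)] += 1
                for j, vj in enumerate(xs):
                    self.cyxx[(i, vi, j, vj, c)] += 1
        return self

    def predict(self, xs):
        def score(c):
            total = 0.0
            for i, vi in enumerate(xs):
                if self.cx[(i, vi)] < self.m:
                    continue  # skip infrequent superparent values
                p = (self.cyx[(i, vi, c)] + 1) / (
                    self.n + len(self.classes) * len(self.values[i]))
                for j, vj in enumerate(xs):
                    if j != i:
                        p *= (self.cyxx[(i, vi, j, vj, c)] + 1) / (
                            self.cyx[(i, vi, c)] + len(self.values[j]))
                total += p
            return total
        return max(self.classes, key=score)
```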

##### 2.2. Related Background Theory

In the following discussion, Greek letters are used to denote sets of attributes. Lowercase letters represent the specific values taken by the corresponding attributes (e.g., $x_i$ represents $X_i = x_i$). $P(\cdot)$ denotes a probability and $\hat{P}(\cdot)$ denotes the probability estimate of $P(\cdot)$. Given a relation $R$ (in a relational database), attribute set $\beta$ of $R$ is functionally dependent on attribute set $\alpha$ of $R$, and $\alpha$ of $R$ functionally determines $\beta$ of $R$ (in symbols $\alpha \to \beta$), if each $\alpha$ value is associated with precisely one $\beta$ value. Armstrong (1974) proposed in [9] a set of axioms (or, more precisely, inference rules) to infer all the functional dependencies (FDs) on a relational database, which represent the expert knowledge of the organizational data and their interrelationships. The axioms mainly include the following rules.
(i) Augmentation rule: if $\alpha \to \beta$ holds and $\gamma$ is a set of attributes, then $\alpha\gamma \to \beta\gamma$.
(ii) Transitivity rule: if $\alpha \to \beta$ holds and $\beta \to \gamma$ holds, then $\alpha \to \gamma$.
(iii) Union rule: if $\alpha \to \beta$ holds and $\alpha \to \gamma$ holds, then $\alpha \to \beta\gamma$.
(iv) Decomposition rule: if $\alpha \to \beta\gamma$ holds, then $\alpha \to \beta$ and $\alpha \to \gamma$ hold.
(v) Pseudotransitivity rule: if $\alpha \to \beta$ holds and $\gamma\beta \to \delta$ holds, then $\alpha\gamma \to \delta$ holds.
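A functional dependency of this kind is straightforward to test empirically on a relation: any two rows that agree on the left-hand side must also agree on the right-hand side. A minimal sketch (the relation and attribute names below are hypothetical examples):

```python
def fd_holds(rows, lhs, rhs):
    """Check whether the FD lhs -> rhs holds in a relation.

    rows: list of dicts mapping attribute name -> value.
    lhs, rhs: tuples of attribute names.
    """
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # setdefault records the first rhs value seen for this lhs value;
        # a later disagreement refutes the dependency.
        if seen.setdefault(key, val) != val:
            return False
    return True
```

The augmentation rule can be observed empirically on the same data: if `("zip",) -> ("city",)` holds on a relation, then `("zip", "name") -> ("city", "name")` holds on it as well.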

Based on the aforementioned rules, we use the FD rules of probability in [10, 11] to link FD and probability theory. The following rules are included in the FD-probability theory link.
(i) Representation equivalence rule of probability: suppose data set $D$ consists of two attribute sets $\alpha$ and $\beta$, and $\beta$ can be inferred from $\alpha$; that is, the FD $\alpha \to \beta$ holds; then the following joint probability distribution holds:
$$P(\alpha, \beta) = P(\alpha).$$
(ii) Augmentation rule of probability: if $\alpha \to \beta$ holds and $\gamma$ is a set of attributes, then the following joint probability distribution holds:
$$P(\alpha, \gamma, \beta) = P(\alpha, \gamma).$$
(iii) Transitivity rule of probability: if $\alpha \to \beta$ and $\beta \to \gamma$ hold, then the following joint probability distribution holds:
$$P(\alpha, \beta, \gamma) = P(\alpha).$$
(iv) Pseudotransitivity rule of probability: if $\alpha \to \beta$ and $\gamma\beta \to \delta$ hold, then the following joint probability distribution holds:
$$P(\alpha, \gamma, \delta) = P(\alpha, \gamma).$$

In the 1940s, Claude E. Shannon introduced information theory, the theoretical foundation of modern digital communication. Although Shannon was principally concerned with the problem of electronic communications, the theory has much broader applicability. Two commonly used definitions of information theory are described as follows.

*Definition 1. *Mutual information measures the information quantity that is transferred between attributes $X$ and $Y$:
$$I(X; Y) = \sum_{x, y} P(x, y) \log \frac{P(x, y)}{P(x)\,P(y)},$$
where $P(x, y)$ is the joint probability distribution function of $X$ and $Y$, and $P(x)$ and $P(y)$ are the marginal probability distribution functions of $X$ and $Y$, respectively. High mutual information indicates a strong relationship between $X$ and $Y$; zero mutual information between two random variables means they are independent.

*Definition 2. *Conditional mutual information (CMI) measures the dependence between each pair of attributes $X_i$ and $X_j$ given the class $C$, which is shown as follows:
$$I(X_i; X_j \mid C) = \sum_{x_i, x_j, c} P(x_i, x_j, c) \log \frac{P(x_i, x_j \mid c)}{P(x_i \mid c)\,P(x_j \mid c)}.$$
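Definition 2 can be computed directly from counts over a discrete data set. A minimal sketch (the natural logarithm is assumed here; the paper does not state a log base):

```python
import math
from collections import Counter

def conditional_mutual_information(X, y, i, j):
    """I(X_i; X_j | C): sum over (x_i, x_j, c) of
    P(x_i, x_j, c) * log( P(x_i, x_j | c) / (P(x_i | c) * P(x_j | c)) )."""
    n = len(y)
    cijc = Counter((xs[i], xs[j], c) for xs, c in zip(X, y))
    cic = Counter((xs[i], c) for xs, c in zip(X, y))
    cjc = Counter((xs[j], c) for xs, c in zip(X, y))
    cc = Counter(y)
    cmi = 0.0
    for (vi, vj, c), nijc in cijc.items():
        # the ratio P(x_i,x_j|c) / (P(x_i|c) P(x_j|c)) simplifies to
        # (count(x_i,x_j,c) * count(c)) / (count(x_i,c) * count(x_j,c))
        ratio = (nijc * cc[c]) / (cic[(vi, c)] * cjc[(vj, c)])
        cmi += (nijc / n) * math.log(ratio)
    return cmi
```

Two perfectly correlated attributes yield CMI equal to their shared conditional entropy, while attributes independent given the class yield zero.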

Theorem 3. *Given instance $x = \langle x_1, \ldots, x_n \rangle$ and class label $y$, if there exists an FD $\alpha \to x_j$ with $\alpha \subseteq x \setminus \{x_j\}$, then $x_j$ is extraneous for classification. That means $P(y \mid x) = P(y \mid x \setminus \{x_j\})$, where “$\setminus$” represents the set difference.*

*Proof. *By applying the augmentation rule and the decomposition rule, from $\alpha \to x_j$ we can obtain
$$x \setminus \{x_j\} \to x.$$

By applying the representation equivalence rule of probability and the augmentation rule of probability, we can obtain
$$P(x) = P(x \setminus \{x_j\}), \qquad P(x, y) = P(x \setminus \{x_j\}, y).$$

Then,
$$P(y \mid x) = \frac{P(x, y)}{P(x)} = \frac{P(x \setminus \{x_j\}, y)}{P(x \setminus \{x_j\})} = P(y \mid x \setminus \{x_j\}).$$

We can also prove Theorem 3 from the viewpoint of information theory: the FD implies $H(x_j \mid x \setminus \{x_j\}) = 0$, so $x_j$ provides no additional information about $y$ once the remaining attribute values are known; that is, $I(x_j; y \mid x \setminus \{x_j\}) = 0$.

End of the proof.

As for AODE, if an FD $\alpha \to x_j$ holds but is neglected, the factor involving $x_j$ appears in every ODE of the averaged sum. Thus, the contribution of $x_j$ to classification will be calculated repeatedly for each ODE and the classification result may be wrong.

#### 3. Attribute Selection and Elimination

AODE makes a weaker attribute conditional independence assumption than that of NB. It selects one attribute as superparent in turn for each ODE submodel, and the other attributes are supposed to be conditionally independent. Previous studies have demonstrated that AODE has a considerably lower bias than that of NB with moderate increases in variance and time complexity [5]. The same attribute may play different roles (either parent or child) in different ODE submodels. In the following discussion, we will repair harmful interdependencies from two viewpoints: select parent attributes (SP) by building a maximum weighted spanning tree; remove children attributes (RC) by functional dependency analysis.

##### 3.1. How to Select Parent Attributes

SP selects the branch nodes of a maximum weighted spanning tree (MST) as the superparent attributes. The learning procedure of the MST can be summarized as follows.
(1) Use CMI to measure the weights of edges between each pair of attributes. Sort the edges into descending order by CMI. Let $T$ be the set of edges comprising the MST. Set $T = \emptyset$.
(2) Find the remaining edge with the greatest weight and add this edge to $T$ if and only if it does not form a cycle in $T$. If no remaining edges exist, exit and report the MST to be disconnected.
(3) If $T$ has $n - 1$ edges (where $n$ is the number of vertices in the MST), stop and output $T$. Otherwise go to step (2).

The attributes selected must satisfy the criterion that they either appear as branch nodes in the MST or appear as leaf nodes but with a strong relationship to other attributes. Figures 2(a), 2(b), and 2(c) show the original spanning tree, the procedure of selecting edges, and the final MST, respectively. As shown in Figure 2(c), attributes *②*, *③*, and *⑥* are branch nodes and can be used as superparent attributes. In addition, *①*, *④*, and *⑤* are leaf nodes with corresponding CMIs of 7, 6, and 2, respectively. The CMIs are then sorted into descending order. In this paper, if the sum of CMIs of the first $m$ leaf nodes is greater than 85% of the sum of CMIs of all leaf nodes, we suppose that they represent the most important marginal relationships and can also be selected as superparent attributes. For example, since
$$\frac{7 + 6}{7 + 6 + 2} \approx 86.7\% > 85\%,$$
then *①* and *④* can also be used as superparent attributes. This criterion helps to ensure that strong, and only strong, relationships among attributes will be retained. By contrast, AODE [4] indiscriminately uses each attribute as a superparent even if some attributes may be independent of others. Besides, SP supports incremental learning because it may reselect the subset of superparent attributes when a new training instance becomes available.
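The edge-selection procedure above is essentially Kruskal's algorithm run in descending weight order, with a union-find structure standing in for the cycle check. A minimal sketch (the edge weights below are hypothetical, not those of Figure 2):

```python
def maximum_spanning_tree(n, edges):
    """Kruskal-style sketch of the SP procedure.

    n: number of vertices (attributes); edges: list of (weight, u, v).
    Returns the chosen edges (n - 1 of them if the graph is connected).
    """
    parent = list(range(n))

    def find(x):  # union-find root lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges, reverse=True):  # descending CMI weight
        ru, rv = find(u), find(v)
        if ru != rv:            # joins two components: no cycle, accept
            parent[ru] = rv
            tree.append((w, u, v))
        if len(tree) == n - 1:  # spanning tree complete
            break
    return tree
```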

At training time SP needs only to form the tables of joint attribute-value and class frequencies needed to estimate the probabilities $P(y)$, $P(y, x_i)$, and $P(y, x_i, x_j)$, which are required for estimating $P(y)$, $P(x_i \mid y)$, and $P(x_j \mid y, x_i)$ in turn. Calculating the estimates requires a simple scan through the data, an operation of time complexity $O(tn^2)$, where $t$ is the number of training instances and $n$ is the number of attributes. To build the maximum weighted spanning tree, SP must first calculate the CMI, requiring consideration of each pair of attributes and every pairwise combination of their respective values in conjunction with each class value, of time complexity $O(kn^2v^2)$, where $k$ is the number of classes and $v$ is the average number of values per attribute. The time complexity of building the MST from the sorted edges is $O(n^2 \log n)$. The resulting training time complexity is $O(tn^2 + kn^2v^2 + n^2 \log n)$ and the space complexity is $O(kn^2v^2)$. At classification time SP needs only to store the probability tables, of space complexity $O(knv^2)$. This compression over the table required at training time is achieved by storing probability estimates for each attribute value conditioned by the parent selected for that attribute and the class. The time complexity of classifying a single instance is $O(kn^2)$.

##### 3.2. How to Eliminate Children Attributes

Kohavi and Wolpert [12] presented a bias-variance decomposition of expected misclassification rate, which is a powerful tool from sampling theory statistics for analyzing supervised learning scenarios. Suppose $Y_F$ and $Y_H$ are the true class label and the class generated by a learning algorithm, respectively; the zero-one loss function is defined as
$$L(Y_F, Y_H) = 1 - \delta(Y_F, Y_H),$$
where $\delta(Y_F, Y_H) = 1$ if $Y_F = Y_H$ and 0 otherwise. The bias term measures the squared difference between the average outputs of the target and the algorithm. This term is defined as follows:
$$\text{bias}^2 = \frac{1}{2} \sum_{x} P(x) \sum_{y} \left[ P(Y_F = y \mid x) - P(Y_H = y \mid x) \right]^2,$$
where $x$ is the combination of any attribute values. The variance term is a real-valued nonnegative quantity and equals zero for an algorithm that always makes the same guess regardless of the training set. The variance increases as the algorithm becomes more sensitive to changes in the training set. It is defined as follows:
$$\text{variance} = \frac{1}{2} \sum_{x} P(x) \left[ 1 - \sum_{y} P(Y_H = y \mid x)^2 \right].$$
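Writing $Y_F$ for the target's label and $Y_H$ for the learner's prediction, the two terms can be estimated for a single test instance from the predictions of classifiers trained on repeated resamples. The sketch below assumes a deterministic target, so $P(Y_F = y \mid x)$ is 1 for the true label and 0 otherwise:

```python
from collections import Counter

def kohavi_wolpert(y_true, predictions):
    """Estimate the Kohavi-Wolpert bias^2 and variance terms for one
    test instance, given predictions from classifiers trained on many
    resampled training sets. Assumes a deterministic target."""
    t = len(predictions)
    p_h = Counter(predictions)            # counts -> P(Y_H = y | x)
    labels = set(predictions) | {y_true}
    bias2 = 0.5 * sum(((y == y_true) - p_h[y] / t) ** 2 for y in labels)
    variance = 0.5 * (1 - sum((c / t) ** 2 for c in p_h.values()))
    return bias2, variance
```

A learner that always predicts the true label has zero bias and zero variance; one that consistently predicts a single wrong label has maximal bias but still zero variance, matching the archery picture below.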

Moore and McCabe [13] illustrated bias and variance through shooting arrows at a target, as described in Figure 3. The perfect model can be regarded as the bull’s eye on a target and the learned classifier as an arrow fired at the bull’s eye. Bias and variance describe what happens when an archer fires many arrows at the target. Bias means that the aim is off and the arrows land consistently off the bull’s eye in the same direction. Variance means that the arrows are scattered. Large variance means that repeated shots are widely scattered on the target. They do not give similar results but differ widely among themselves.

It is reported that removing redundant children attributes from within ODEs can help to decrease both bias and zero-one loss [3, 14]. Subsumption resolution (SR) [8] identifies pairs of attribute values such that one can replace the other. Deleting $x_j$ from a Bayesian classifier should not be harmful when $x_j$ is a generalization of $x_i$; that is, $P(x_j \mid x_i) = 1$. Only the attribute value $x_i$ is then necessary for classification; that is, $P(y \mid x_i, x_j) = P(y \mid x_i)$. Such deletion may improve a classifier's estimates if the classifier makes unwarranted assumptions about the relationship of $x_j$ to the other attributes when estimating intermediate probability values, such as NB's independence assumption. Since $P(x_j \mid x_i) = 1$ can be represented as the FD $x_i \to x_j$, SR and FD have the same meaning but from different viewpoints.
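Detecting candidate subsumptions from data amounts to finding value pairs with $P(x_j \mid x_i) = 1$ in the training sample. A sketch with a minimum-frequency guard against accidental subsumptions on rare values (the default threshold here is illustrative, not SR's published setting):

```python
from collections import defaultdict

def subsumption_pairs(X, min_count=30):
    """Find pairs ((i, x_i), (j, x_j)) where every training instance
    having value x_i for attribute i also has value x_j for attribute j,
    i.e. the empirical P(x_j | x_i) equals 1."""
    count = defaultdict(int)  # (attr index, value) -> frequency
    joint = defaultdict(int)  # ((i, x_i), (j, x_j)) -> co-occurrence count
    for xs in X:
        for i, vi in enumerate(xs):
            count[(i, vi)] += 1
            for j, vj in enumerate(xs):
                if i != j:
                    joint[((i, vi), (j, vj))] += 1
    # (i, x_i) is subsumed by (j, x_j) when they always co-occur
    return [(a, b) for (a, b), c in joint.items()
            if c == count[a] and count[a] >= min_count]
```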

SR mainly considers the one-one pair relationship. However, four basic relationships exist in the real world: one-one, one-many, many-many, and many-one. These four relationships can be grouped into two sets: one-one and many-one. Thus, SR cannot resolve interdependencies when a loop appears in the many-one relationship. The data presented in Table 1 show a loop example with four attributes and a class label. For the first instance, suppose three of its attribute values are generalizations of the others, forming the loop relationship described in Figure 4(a), where "$\leftrightarrow$" represents the one-one relationship and "$\to$" represents the many-one relationship. After SR, too few attributes remain for classification and NB will misclassify the first instance as "$-$", even though it occurs in the training data. For different testing instances, different correlated attributes will be deleted. These situations can be illustrated from the viewpoint of FD: by the decomposition rule, the loop relationship can be replaced by three FDs.

The following results can be generated. We can obtain a combined FD from two of these FDs by applying the union rule. As shown in Figure 4(b), one attribute node disappears and the arc that once connected it to its neighbor now extends to connect the two remaining attributes.

We can obtain a further FD by applying the augmentation rule. As Figure 4(c) shows, one arc is removed to avoid a loop relationship. Thus, from two attribute values of the instance we can infer the other two attribute values. Correspondingly, the first instance will be correctly classified.

It should be noted that SP selects attributes from the probabilistic viewpoint by calculating CMI, while RC selects attributes from the logical viewpoint by inferring FDs from the training data. That is, the learning procedure of SP + RC is divided into two parts: SP roughly describes the basic structure of each submodel, which uses the selected attributes as superparents and the other attributes as children; then, for each testing instance, RC further refines the model by deleting redundant children attributes, making the final model much more flexible and robust. Consider an extreme case in which the CMIs of all attributes are small and equal; then all attributes will be selected as superparent attributes, and after applying SP the structure will be just the same as AODE. But for different testing instances, different FDs can help each submodel express the key dependencies. For example, suppose one FD holds for instance-1 and a different FD holds for instance-2; Figures 5, 6, and 7 show the original AODE structure after applying SP and the corresponding structures for instance-1 and instance-2, respectively.

Discovering FD from existing databases is an important issue. This issue has long been investigated and has been recently addressed with a data mining viewpoint in a novel and efficient way. Rather than exhibiting the set of all functional dependencies which hold in a relation, related work aims to discover a smaller cover equivalent to this set. This problem is known as FD inference. Association rules can be used to discover the relationships and potential associations of items or attributes among huge data. These rules can be effective in uncovering unknown relationships, thereby providing results that can be the basis of forecast and decision. They have proven to be useful tools for an enterprise as they strive to improve their competitiveness and profitability.

#### 4. Experimental Study

We expect AODE with SP and RC to exhibit low zero-one loss and low bias. Thus, we compare the performance of the system with the following attribute selection methods. First is parent attribute addition (PAA), which starts with the parent set and the child set initialized to the empty and full sets, respectively, and adds one parent attribute to each ODE at each step. Second is child attribute addition (CAA), which begins with the parent set and the child set initialized to the full and empty sets, respectively, and adds one child attribute to every ODE at each step. Third is parent attribute elimination (PAE), which starts with the parent set and the child set initialized to the full set and deletes one parent attribute from every ODE at each step. Fourth is child attribute elimination (CAE), which deletes one child attribute from every ODE at each step. Fifth and sixth are SR and NSR, respectively.

Table 2 summarizes the characteristics of each data set, including the numbers of instances, attributes, and classes. Missing values for qualitative attributes are replaced with modes, and those for quantitative attributes are replaced with means, from the training data. We estimate the base probabilities $P(y)$, $P(y, x_i)$, and $P(y, x_i, x_j)$ using the Laplace estimate as follows [15]:
$$\hat{P}(y) = \frac{F(y) + 1}{K + |C|}, \qquad \hat{P}(y, x_i) = \frac{F(y, x_i) + 1}{K_i + |C\,X_i|}, \qquad \hat{P}(y, x_i, x_j) = \frac{F(y, x_i, x_j) + 1}{K_{ij} + |C\,X_i\,X_j|},$$
where $F(\cdot)$ is the frequency with which a combination of terms appears in the training data, $K$ is the number of training instances for which the class value is known, $K_i$ is the number of training instances for which both the class and attribute $X_i$ are known, and $K_{ij}$ is the number of training instances for which all of the class and attributes $X_i$ and $X_j$ are known. $|C|$ is the number of values of class $C$, $|C\,X_i|$ is the number of value combinations of $C$ and $X_i$, and $|C\,X_i\,X_j|$ is the number of value combinations of $C$, $X_i$, and $X_j$. As NB and AODE require discrete-valued data, all data were discretized using minimum description length (MDL) discretization [16]. Classifiers are formed from each data set, and bias, variance, and zero-one loss are estimated from the performance of those classifiers on the same data set. Experiments are performed on a dual-processor 3.1 GHz Windows XP computer with 3.16 GB RAM. All algorithms are applied to the 22 data sets described in Table 2.
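The Laplace estimates all share one form, observed count plus one over total plus cardinality; a one-line helper makes the pattern explicit (a sketch, not the paper's code, and the argument names are ours):

```python
def laplace(count, total, cardinality):
    """Laplace (add-one) estimate: (F + 1) / (K + |V|), where F is the
    observed frequency, K the number of instances with the relevant
    values known, and |V| the number of value combinations estimated over."""
    return (count + 1) / (total + cardinality)
```

With no data at all the estimate falls back to the uniform distribution, e.g. `laplace(0, 0, 2)` gives 0.5 for a binary class.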

##### 4.1. Zero-One Loss, Bias, and Variance Results

Table 3 presents the zero-one loss for each data set, estimated by 50 runs of twofold cross-validation to give an accurate estimate of the average performance of each algorithm. The advantage of this technique is that it uses the full data both as training set and as testing set; moreover, every case is used the same number of times in each of the roles of training and testing data. Tables 4 and 5 provide the bias and variance results, respectively. The zero-one loss, bias, or variance across multiple data sets provides a gross measure of relative performance.

The basic relationships among attributes can be clearly observed by building the MST. If one attribute is connected with several other attributes, the attribute is supposed to cross functional zones and will be selected as a superparent attribute to retain complementarity. If one attribute is connected with only one other attribute, its independence may be pronounced, and it will be reconsidered according to the weight of its CMI. Besides, RC helps to detect situations in which the relationships that hold in the MST need to be refined. Table 3 shows that the advantage of SP + RC over SR and NSR is significant in terms of zero-one loss, while SR and NSR have a significant advantage over CAA, PAA, CAE, and PAE. The disappointing performances of CAA, PAA, CAE, and PAE can be ascribed to their susceptibility to getting trapped in poor selections by local minima during the first several additions or deletions.

The records in Table 4 show that all the attribute selection algorithms applying SR, NSR, or FDs have a significant advantage in bias over CAE and PAE. In addition, CAE and PAE outperform CAA and PAA. However, comparing SP + RC with SP alone again shows no obvious difference. This result indicates that SP plays the main role in classification and that its effect differs greatly across data sets. The same conclusion can also be inferred by comparing SP + RC with SR. The training sets, containing only 25% of each data set for bias-variance evaluation, are small because the data sets are primarily small. The bias of SP + RC decreases as training set size increases because more data leads to more accurate probability distribution estimates and hence to more appropriate attribute selection. Of these algorithms, RC, SR, and NSR have the weakest sensitivity to changes in the training data because they can utilize the testing set to infer rules for attribute elimination. By contrast, PAA, PAE, CAA, and CAE perform model selection, and their biases differ greatly with different training data.

With respect to variance, as Table 5 shows, SP + RC does not show an obvious advantage over the other algorithms. Low-variance algorithms tend to enjoy an advantage with small data sets, whereas low-bias algorithms tend to enjoy an advantage with large data sets; cross-data-set experimental studies of the traditional form presented above also support this hypothesis. The main reason may be that the relationships inferred from the MST can overfit the training data: SP needs to calculate CMI to construct the MST, which requires enough instances to achieve precise probability estimation.

In the following discussion, canonical cover analysis [17], which can use a limited number (e.g., 100) of instances to infer a credible FD set, is applied in tandem with functional dependency analysis. Let $F_c$ be a canonical cover for a set $F$ of simple FDs; the procedure for computing $F_c$ is described in Algorithm 1.

The chosen canonical cover is the set of minimal dependencies. Such a cover is information lossless and is considerably smaller than the set of all valid dependencies. These qualities are particularly important for decreasing variance, while zero-one loss and bias are not affected negatively, because they provide relevant knowledge in which redundancy is minimized and extraneous information is discarded. Five large data sets with more than 10,000 instances are selected for variance comparison. The experimental results are shown in Table 6; canonical cover analysis helps to make AODE competitive with the other attribute selection strategies from the viewpoint of variance.
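Since Algorithm 1 is not reproduced here, the sketch below shows one standard way to compute such a cover for simple FDs (a single attribute on the right-hand side): repeatedly drop any FD already implied by the rest, using attribute closure. Minimization of left-hand sides is omitted for brevity, so this is a redundancy-removal sketch rather than a full canonical-cover algorithm:

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds (list of (lhs frozenset, rhs attr))."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and rhs not in result:
                result.add(rhs)
                changed = True
    return result

def canonical_cover(fds):
    """Drop each FD whose right-hand side is already derivable from its
    left-hand side using the remaining FDs."""
    fds = [(frozenset(lhs), rhs) for lhs, rhs in fds]
    cover = list(fds)
    for fd in fds:
        rest = [f for f in cover if f != fd]
        if fd[1] in closure(fd[0], rest):  # fd is redundant
            cover = rest
    return cover
```

For example, given A -> B, B -> C, and A -> C, the last dependency is implied by the first two and is removed.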

##### 4.2. Elimination Ratio

Statistically, a win/draw/loss record (W/D/L) is calculated for each pair of competitors $A$ and $B$ with regard to a performance measure $M$. The record represents the number of data sets in which $A$, respectively, beats, loses to, or ties with $B$ on $M$. Small improvements in leave-one-out error may be attributable to chance. Consequently, it may be beneficial to use a statistical test to assess whether an improvement is significant. A standard binomial sign test, assuming that wins and losses are equiprobable, is applied to these records. A difference is considered to be significant when the outcome of a two-tailed binomial sign test is less than 0.05. To observe the effect of attribute elimination of SP on zero-one loss, we used the following criterion:
$$\mathit{ratio} = \frac{m}{n} \times 100\%,$$
where $m$ is the number of attributes eliminated and $n$ is the number of attributes. Table 7 shows the comparison results. The $\mathit{ratio}$ of SP is much higher than that of the other two attribute elimination algorithms, PAA and PAE. The statistical records in Table 3 show that the corresponding zero-one loss of SP is generally lower. We suggest that the reason for SP's outstanding performance on zero-one loss reduction is that it greatly utilizes the probabilistic dependency relationships in the training data.
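The two-tailed binomial sign test on a W/D/L record (draws excluded; wins and losses equiprobable under the null hypothesis) can be computed exactly from binomial coefficients:

```python
from math import comb

def sign_test_p(wins, losses):
    """Exact two-tailed binomial sign test p-value for a W/D/L record."""
    n = wins + losses
    k = min(wins, losses)
    # one-tailed probability of an outcome at least this extreme, then doubled
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For instance, 15 wins against 3 losses is significant at the 0.05 level, while 10 wins against 8 losses is not.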

For example, the elimination ratio of SP is as high as 61.5% for data set “anneal.” However, the corresponding zero-one loss is lower than that of PAA and PAE. The main reason may be that, for as many as 38 attributes and only 894 instances, some attributes may have cross-functional zones, and only a few attributes may play the decisive role. After calculating and comparing the sum of CMI between one attribute and all the other attributes, most of the eliminated attributes have weak relationships to other attributes or are even nearly independent of them.

SP selects attributes based on the MST; attributes with strong relationships among them will be selected first, so if any single attribute is removed by mistake, the classification results will not be affected greatly. However, for different training sets, especially very small ones, the conditional distribution estimates may differ greatly, different MST structures may be obtained, and thus different attributes will be selected for classification. The number of FDs extracted by RC is smaller than that extracted by SR because numerous attributes are eliminated during SP; this is especially evident for data set "audio," which retains fewer attributes and requires much more complicated FDs to remove attributes. In the W/D/L records, the advantage in zero-one loss is significant with respect to SP versus PAE or SP versus PAA, but not SP + RC versus SP. This result shows that the advantage of SP + RC comes from SP rather than from RC.

With an increasing number of attributes, more RAM is needed to store the joint probability distributions. An important restriction of our algorithm is that the number of attributes on the left side of an FD should be no more than 2. To observe the effect of SP + RC and SR on each data set, we calculate the attribute elimination ratios by the following criterion:
$$\mathit{ratio}' = \frac{1}{t} \sum_{i=1}^{t} \frac{m_i}{n} \times 100\%,$$
where $m_i$ is the number of attributes eliminated for the $i$th instance and $t$ is the size of the data set. Table 8 shows the comparison results of $\mathit{ratio}'$ of SP + RC with the other three attribute elimination algorithms, CAA, CAE, and SR. Table 3 shows that SP + RC has a significant advantage in zero-one loss over SR and NSR, while SR and NSR outperform CAA and CAE. Comparing Table 3 with Table 8 reveals that both RC and SR can help to decrease zero-one loss. However, the effectiveness of RC relies greatly on SP, while SR can always improve the performance of AODE. If SP removes attributes by mistake, some valuable FDs will not be extracted by RC. However, if only redundant nodes are eliminated, RC can extract more reliable FDs than SR because RC considers all the situations covered by SR.

For example, on data set "hypothyroid" the $\mathit{ratio}'$ of RC is 32%, which indicates that RC eliminates approximately one-third of all attributes as redundant children. The reason for this high ratio is that SP has already eliminated 21% of the attributes from the data set. For data set "anneal," the $\mathit{ratio}'$ is also as high as 34%, but the zero-one loss is much higher than that of the other three algorithms. This result means that SP has removed some attributes by mistake, and RC cannot extract FDs that depend on those deleted attributes. Hence, the experimental results of SP + RC could be improved further if other methods were found to keep more valuable attributes for classification.

#### 5. Conclusion and Future Work

AODE provides an attractive framework by averaging all models from a restricted class of one-dependence classifiers: the class of classifiers in which all other attributes depend on a common superparent attribute and the class attribute. The current work aims to improve accuracy by using MST and FD analysis to weaken the attribute independence assumption without high computational overheads.

Overall, this study developed a classification learning technique that retains the simplicity and direct theoretical foundation of AODE while reducing computational overhead without incurring a negative effect on classification performance. The superparent attribute of AODE can also be considered the parent of the class attribute. Therefore, we hypothesize that the success of AODE and its variations may be attributed to the fact that AODE not only aggregates a restricted class of models but also extends NB to handle the parent of the class attribute. If this hypothesis can be proven, we may be able to design a novel and perhaps more effective Bayesian classifier than AODE by constructing the Markov blanket of the class attribute.

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant nos. 61272209 and 61300145) and the Postdoctoral Science Foundation of China (Grant nos. 20100481053 and 2013M530980).

#### References

- D. Dash and G. F. Cooper, “Exact model averaging with naive Bayesian classifiers,” in *Proceedings of the 19th International Conference on Machine Learning*, pp. 91–98, Sydney, Australia, July 2002.
- E. Frank, M. Hall, and B. Pfahringer, “Locally weighted naive Bayes,” in *Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence*, pp. 249–256, Acapulco, Mexico, August 2003.
- F. Zheng and G. I. Webb, “Finding the right family: parent and child selection for averaged one-dependence estimators,” in *Proceedings of the 18th European Conference on Machine Learning*, pp. 490–501, Warsaw, Poland, September 2007.
- F. Zheng and G. I. Webb, “Efficient lazy elimination for averaged one-dependence estimators,” in *Proceedings of the 23rd International Conference on Machine Learning*, pp. 1113–1120, Pittsburgh, Pa, USA, June 2006.
- A. Z. Nayyar, C. Jesus, and G. I. Webb, “Alleviating naive Bayes attribute independence assumption by attribute weighting,” *The Journal of Machine Learning Research*, vol. 14, no. 6, pp. 1113–1120, 2013.
- P. Langley and S. Sage, “Induction of selective Bayesian classifiers,” in *Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence*, pp. 399–406, Seattle, Wash, USA, July 1994.
- M. J. Pazzani, “Constructive induction of Cartesian product attributes,” in *Proceedings of the Information, Statistics and Induction in Science Conference*, pp. 66–77, July 1996.
- F. Zheng, G. I. Webb, P. Suraweera, and L. Zhu, “Subsumption resolution: an efficient and effective technique for semi-naive Bayesian learning,” *Machine Learning*, vol. 87, no. 1, pp. 93–125, 2012.
- W. W. Armstrong, “Dependency structures of data base relationships,” in *Proceedings of the IFIP Congress*, pp. 580–583, 1974.
- L. M. Wang, G. F. Yao, and X. Li, “Extracting logical rules and attribute subset from confidence domain,” *Information*, vol. 15, no. 1, pp. 173–180, 2012.
- L. M. Wang and G. F. Yao, “Bayesian network inference based on functional dependency mining of relational database,” *Information*, vol. 15, no. 6, pp. 2441–2446, 2012.
- R. Kohavi and D. Wolpert, “Bias plus variance decomposition for zero-one loss functions,” in *Proceedings of the 13th International Conference on Machine Learning*, pp. 275–283, July 1996.
- D. S. Moore and G. P. McCabe, *Introduction to the Practice of Statistics*, W. H. Freeman, New York, NY, USA, 4th edition, 2002.
- A. Z. Nayyar and G. I. Webb, “Fast and effective single pass Bayesian learning,” in *Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining*, vol. 7818 of *Lecture Notes in Computer Science*, pp. 149–160, Gold Coast, Australia, April 2013.
- B. Cestnik, “Estimating probabilities: a crucial task in machine learning,” in *Proceedings of the 9th European Conference on Artificial Intelligence*, pp. 147–149, Stockholm, Sweden, August 1990.
- U. M. Fayyad and K. B. Irani, “Multi-interval discretization of continuous-valued attributes for classification learning,” in *Proceedings of the 13th International Joint Conference on Artificial Intelligence*, pp. 1022–1029, August 1993.
- L. M. Wang and G. F. Yao, “Learning NT Bayesian classifier based on canonical cover analysis of relational database,” *Information*, vol. 15, no. 1, pp. 165–172, 2012.