Abstract

The problem of extracting knowledge from a relational database for probabilistic reasoning remains unsolved. On the basis of a three-phase learning framework, we propose integrating a Bayesian network (BN) with the functional dependency (FD) discovery technique. Association rule analysis is employed to discover FDs, which serve as expert knowledge encoded within the BN; that is, key relationships between attributes are emphasized. Moreover, the BN can be updated by an annotation process driven by this expert knowledge, wherein redundant nodes and edges are removed. Experimental results show the effectiveness and efficiency of the proposed approach.

1. Introduction

The challenge of extracting knowledge for uncertain inference is closely related to the study of Bayesian networks (BNs), one of the key research areas of knowledge discovery and machine learning [1, 2]. The ability of Bayesian statistical methods to exploit prior knowledge, together with their clear semantics, has broadened the application of BNs in medical diagnosis, manufacturing, and pattern recognition.

A BN consists of two parts: a qualitative part and a quantitative part. The qualitative part is the graphical structure of the network, whereas the quantitative part consists of the conditional probability tables (CPTs) attached to the nodes. Although BNs support efficient inference algorithms, learning a BN from data is complex, and learning an optimal BN structure from existing data has been proven to be an NP-hard problem [3, 4]. Researchers have used qualitative knowledge to improve the efficiency of probabilistic inference by introducing domain knowledge; for example, domain experts can state qualitative relations that should be present in the model, such as requiring an arc between two nodes. However, domain knowledge may have a negative effect when attributes have similar properties and overlapping functional zones. Thus, some attributes are relevant but redundant. Although such attributes do not bias the network structure, because they contribute no misleading information, they do increase the computational complexity of the network. Given the limited number of instances available for parameter estimation, these redundant attributes should be removed to build a robust structure. Furthermore, how expert knowledge should be defined and applied in the BN learning procedure is still an open problem.

In the real world, the widespread use of databases has created a considerable need for knowledge discovery methodologies. Database systems are designed to manage large quantities of information, define structures for storage, and provide mechanisms for mass data manipulation. Functional dependencies (FDs) are a key concept in relational theory and the foundation of data organization in relational database design. An FD is treated as a constraint that the database system enforces to preserve data integrity. An FD can also serve as domain knowledge for knowledge representation and reasoning.

Researchers have recently suggested linking relational databases and probabilistic reasoning models to construct a BN from a new perspective. FDs are important data dependencies that provide conditional independence information in relational databases. FDs can be generated from an entity-relationship diagram instead of being mined from data. Therefore, constructing a BN from FDs is interesting and useful, particularly when data are incomplete or inaccurate. Researchers have also pointed out the similarities between BNs and the relational model. Jaehui and Sang-goo [5] proposed a probabilistic ranking model that exploits the statistical relationships existing in relational data with categorical attributes; to quantify this information, the extent of the dependency between correlated attribute values is computed on a Bayesian network. Thimm and Kern-Isberner [6] lifted maximum entropy methods to the relational case by employing a relational version of probabilistic conditional logic; they addressed the ambiguity raised by the difference between subjective and statistical views and developed a comprehensive list of desirable properties for inductive model-based probabilistic inference in relational frameworks. Liu et al. [7] discussed the relationship between fuzzy FDs and BNs.

For 0/1 data analysis, association rule mining is popular and has been studied extensively from computational and objective perspectives by using measures such as frequency and confidence, and many algorithms have been designed to compute frequent and valid association rules. The algorithm proposed in this paper, namely, the tree-augmented Naive Bayes (TAN) classifier with FDs (TAN-FDA), applies FDs extracted by association rule analysis. TAN-FDA has two focuses: first, association rules are used to infer belief FDs, and the redundancy of attributes deduced from these FDs is proven from the viewpoints of information theory and probability theory; second, the BN structure and its probability distributions are effectively simplified and estimated, respectively. The feasibility and accuracy of the proposed method are also demonstrated to explain the necessity of finding and eliminating redundant attributes.

This paper is organized as follows. Section 2 introduces related background theories wherein the FD rules of probability are proposed to link FD and probability distribution. Sections 3 and 4 introduce the theoretical foundation used in this paper and the learning procedure of TAN-FDA, respectively. Section 5 compares TAN-FDA with other algorithms. Section 6 concludes.

2. Background Theories

2.1. Functional Dependency Rules of Probability

In the following discussion, we use Greek letters to denote attribute sets. Lower-case letters represent the specific values taken by the corresponding attributes (e.g., a represents a value of α). P(·) denotes probability. Given a relation R (in a relational database), attribute set β of R is functionally dependent on attribute set α of R, and α of R functionally determines β of R (in symbols, α → β). Armstrong [8] proposed a set of axioms (inference rules) to infer all FDs on a relational database, which represent the expert knowledge of organizational data and their inherent relationships. The axioms mainly include the following rules.
(i) Augmentation rule: if α → β is true and γ is a set of attributes, then αγ → βγ.
(ii) Transitivity rule: if α → β and β → γ are true, then α → γ.
(iii) Union rule: if α → β and α → γ are true, then α → βγ.
(iv) Decomposition rule: if α → βγ is true, then α → β and α → γ.
(v) Pseudotransitivity rule: if α → β and γβ → δ are true, then γα → δ.

On the basis of the aforementioned rules, we use the FD rules of probability in [9, 10] to link FD and probability theory. The following rules constitute the FD-probability theory link.
(i) Representation equivalence of probability: assume that data set T consists of two attribute sets α and β and that β can be inferred from α; that is, FD α → β is true; then the following joint probability distribution is true: P(α, β) = P(α).
(ii) Augmentation rule of probability: if α → β is true and γ is an attribute set, then the following joint probability distribution is true: P(αγ, βγ) = P(αγ).
(iii) Transitivity rule of probability: if α → β and β → γ are true, then the following joint probability distribution is true: P(α, γ) = P(α).
(iv) Pseudotransitivity rule of probability: if α → β and γβ → δ are true, then the following joint probability distribution is true: P(γα, δ) = P(γα).
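For instance (the attributes here are chosen purely for illustration), suppose ZipCode functionally determines City in a relation. The representation equivalence rule then gives

P(ZipCode = z, City = c) = P(ZipCode = z) whenever c is the city determined by z,

so H(City | ZipCode) = 0, and the pair (ZipCode, City) carries no more probabilistic information than ZipCode alone.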

2.2. Unrestrictive BN of Naive Bayes and TAN

A BN is a directed acyclic graph that represents a joint probability distribution: the nodes denote the domain variables, and the edges encode conditional dependencies that form the network structure among these variables. A conditional dependency connects a child variable with a set of parent variables and is represented by a table of conditional distributions of the child variable for each combination of the parent values. Dependency-analysis-based algorithms construct a BN by using dependency information. Mutual information is used to determine the BN structure, which explains the causal relationships between random variables. Hence, dependency-analysis-based algorithms construct a BN by testing the validity of independence assertions, which leads to an NP-hard computational problem.

Definition 1. Entropy is a measure of the uncertainty of a random variable X: H(X) = −Σ_x P(x) log P(x), where P(x) is the probability distribution of X.

Definition 2. Mutual information is the reduction of entropy about variable X after observing Y: I(X; Y) = H(X) − H(X | Y) = Σ_{x,y} P(x, y) log [P(x, y) / (P(x)P(y))].

The mutual information between X and Y measures the expected information gained about X after observing the value of Y. If two nodes in a BN are dependent, knowledge of the value of one node provides information about the value of the other. Hence, the mutual information between two nodes can reveal whether the nodes are dependent and the degree of their relationship. To sidestep the NP-hard computational complexity, Naive Bayes (NB) [11] assumes that the attributes are independent given the class. NB has a simple structure that contains arcs from the class node to each other node and no arcs between the other nodes (Figure 1). Although the independence assumption is unrealistic in many practical scenarios, NB has exhibited accuracy competitive with other learning algorithms. Researchers have tried to relax the naive strategy to allow violations of the independence assumption and improve the prediction accuracy of NB. One straightforward approach to overcoming NB limitations is to extend the NB structure to represent dependencies among attributes explicitly. Friedman et al. [12] presented a compromise representation, TAN, which allows arcs between the children of the class attribute, thereby relaxing the conditional independence assumption (Figure 2). Because of these structural restrictions, NB and TAN are both considered restrictive BNs.
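Conditional mutual information can be estimated directly from frequency counts. The sketch below (the function name and the plain plug-in estimator are our own choices, not necessarily the original implementation) computes I(X; Y | C) for discrete attributes:

import math
from collections import Counter

def conditional_mutual_information(x, y, c):
    """Estimate I(X; Y | C) from three aligned sequences of discrete values."""
    n = len(c)
    xyc = Counter(zip(x, y, c))        # counts of (x, y, c)
    xc = Counter(zip(x, c))            # counts of (x, c)
    yc = Counter(zip(y, c))            # counts of (y, c)
    cc = Counter(c)                    # counts of c
    cmi = 0.0
    for (xv, yv, cv), n_xyc in xyc.items():
        # I(X;Y|C) = sum P(x,y,c) * log[ P(x,y|c) / (P(x|c) P(y|c)) ]
        cmi += (n_xyc / n) * math.log2(n_xyc * cc[cv] / (xc[(xv, cv)] * yc[(yv, cv)]))
    return cmi

# Toy usage on three aligned columns:
x = [0, 0, 1, 1, 1, 0]
y = [0, 0, 1, 1, 0, 1]
c = [0, 0, 0, 1, 1, 1]
print(conditional_mutual_information(x, y, c))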

2.3. Association Rule to Belief Functional Dependency

Discovering FDs from existing databases is an important issue and has been investigated for many years [13]. This issue has recently been addressed in a novel and efficient manner from a data-mining viewpoint. Rather than exhibiting the set of all FDs in a relation, related works have aimed to discover a small cover equivalent to this set. This problem is known as the FD inference problem. Association rules are used to discover the relationships and potential associations of items or attributes in large quantities of data. These rules can be effective in uncovering unknown relationships and providing results that can serve as the basis of forecasts and decisions. Association rules have also proven to be useful tools for enterprises in improving competitiveness and profitability.

Agrawal and Srikant [14] proposed the Apriori association rule algorithm, which can discover meaningful item sets and construct association rules within large databases. However, this algorithm generates a large number of candidate item sets from single item sets and requires level-by-level comparisons against the whole database when creating association rules. Given a data set D, an association rule is a rule of the form α ⇒ β that satisfies the following:
(i) α ⊂ I and β ⊂ I, where I is the set of items (attribute-value pairs);
(ii) α ∩ β = ∅;
(iii) support(α ⇒ β) = |D_{α∪β}| / |D| ≥ min_sup and confidence(α ⇒ β) = |D_{α∪β}| / |D_α| ≥ min_conf,
where α and β are attribute-value vectors, |D| is the number of samples in D, |D_α| is the number of samples that contain the item set α, and |D_{α∪β}| is the number of samples that contain both α and β. support and confidence are the support and confidence of the rule, respectively. The pseudocode of the Apriori association rule algorithm is shown in Algorithm 1.

C_k: candidate item set of size k
L_k: frequent item set of size k
L_1 = {frequent items};
For (k = 1; L_k ≠ ∅; k++) do begin
 C_{k+1} = candidates generated from L_k;
 for each transaction t in database do
  increment the count of all candidates in C_{k+1} that are contained in t
 L_{k+1} = candidates in C_{k+1} with min support
end
return ∪_k L_k;

Association rules can then be extracted from frequent item sets. When an association rule has nonzero support and a confidence of 100%, it can be interpreted as an FD; this interpretation is used in this paper for preprocessing. Because probability theory is one of the bases of BN, a BN requires ample data to ensure precise parameter estimation. For a restrictive BN, the structure can only be learned from the training data because the testing data is incomplete, thus making the confidence in the learned structure relatively low. Some researchers have applied semisupervised learning or active learning to exploit the useful information in the testing data and increase the robustness of the final model, but these approaches may propagate noise. In contrast, FDs transformed from association rules naturally keep such noise to a minimum. Given a data set D with attribute variables and class label C, suppose an association rule α ⇒ β with 100% confidence has been deduced from the whole data set D = D_train ∪ D_test; this rule is transformed into the FD α → β. From the FD rules of probability, we obtain P_D(α, β) = P_D(α).

By applying the augmentation rule of probability, P(αγ, βγ) = P(αγ) for any additional attribute set γ, so the equality is preserved within the training subset: P_{D_train}(α, β) = P_{D_train}(α).

Thus, the FD α → β still holds in the training data, and we can use the whole data set to extract FDs with high confidence.
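As a sketch of this preprocessing step, the code below declares an FD whenever every left-hand-side value combination maps to a single right-hand-side value, which is exactly the nonzero-support, 100%-confidence condition; the brute-force search over small left-hand sides and all helper names are our own simplifications:

from itertools import combinations
from collections import defaultdict

def extract_fds(rows, attributes, max_lhs=2):
    """Return FDs (lhs tuple, rhs attribute) whose 'confidence' is 100%,
    i.e. every distinct lhs value combination maps to exactly one rhs value."""
    fds = []
    for r in range(1, max_lhs + 1):
        for lhs in combinations(attributes, r):
            for rhs in attributes:
                if rhs in lhs:
                    continue
                mapping = defaultdict(set)
                for row in rows:
                    key = tuple(row[a] for a in lhs)
                    mapping[key].add(row[rhs])
                # confidence 100%: each lhs value determines a single rhs value
                if all(len(v) == 1 for v in mapping.values()):
                    fds.append((lhs, rhs))
    return fds

# Toy usage: ZipCode -> City should be reported as an FD.
# Note: with so few rows, coincidental FDs (e.g. Temp -> City) are also reported,
# which echoes the paper's caution about rules that are not real domain knowledge.
rows = [
    {"ZipCode": "10001", "City": "NY", "Temp": "hot"},
    {"ZipCode": "10001", "City": "NY", "Temp": "cold"},
    {"ZipCode": "94103", "City": "SF", "Temp": "mild"},
]
print(extract_fds(rows, ["ZipCode", "City", "Temp"], max_lhs=1))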

3. BN Learning Method

3.1. Redundant Attributes for Restrictive and Unrestrictive BN

Given that a BN is a complete model for the variables and their relationships, it can be used to answer probabilistic queries about them. For example, the network can be used to obtain updated knowledge about the state of a subset of variables when other variables (the evidence variables) are observed. This process of computing the posterior distribution of variables given evidence is called probabilistic inference. If FD α → β exists, then by the augmentation rule of probability, P(αγ, βγ) = P(αγ) for any attribute set γ.

Thereafter, the conditional entropy H(β | α) = 0, and the conditional mutual information I(α; β | C) reaches its maximum value. Thus, a definite arc exists between α and β in the restrictive structure of a BN.

Furthermore, from the viewpoint of information theory, the information quantity supplied by α and β together to any other attribute set γ is expressed as follows: I(αβ; γ) = I(α; γ).

Hence, from the viewpoints of probability theory and information theory, β is a redundant attribute set that does not need to be considered during structure learning.

If the information supplied by X1 is contained in the information supplied by X2, then X1 and X2 have a strong relationship, but X1 is redundant for modeling. By contrast, if the information boundaries of X1 and X2 overlap, a relationship exists to some extent between X1 and X2. If the boundaries are disjoint, X1 and X2 are independent. These cases are clearly displayed in Figure 3.
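In practice, the redundancy test amounts to checking whether the conditional entropy of one attribute given another is (near) zero; a small illustrative sketch, with names of our own choosing:

import math
from collections import Counter

def conditional_entropy(y, x):
    """H(Y | X) estimated from two aligned sequences of discrete values."""
    n = len(x)
    xy = Counter(zip(x, y))
    xc = Counter(x)
    h = 0.0
    for (xv, yv), n_xy in xy.items():
        h -= (n_xy / n) * math.log2(n_xy / xc[xv])   # -P(x,y) log P(y | x)
    return h

def is_redundant(y, x, tol=1e-12):
    """Flag Y as redundant when X functionally determines it, i.e. H(Y|X) ~ 0."""
    return conditional_entropy(y, x) < tol

zip_code = ["10001", "10001", "94103", "94103"]
city     = ["NY",    "NY",    "SF",    "SF"]
print(is_redundant(city, zip_code))   # True: ZipCode -> City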

3.2. Structure Simplification for BN Learning

In relational database theory, the minimal FD set can be inferred by canonical cover analysis. Let F_c be a canonical cover for a set F of simple FDs over attribute set U, and let the closure of an attribute set α be denoted α+, which represents all attributes functionally determined by α. Redundant attributes can be found from the viewpoints of the chain formula and the FD rules of probability.

A joint probability distribution can be expressed by a chain formula that includes construction information for a BN. However, constructing the structure solely from a joint probability distribution without using any conditional independencies is impractical because such an approach will require an exponentially large number of variable combinations. We can use FDs to simplify the structure of a BN which is represented by a set of probability distributions [15] (see Algorithm 2).

Input: A set F of FDs and the probability distribution sets P(X_i | π_i), i = 1, ..., n,
where π_i = {X_1, ..., X_{i−1}} is the conditioning set of X_i in the chain formula.
Output: A chain formula with the independence conditions implied by the FDs.
Begin
 For every P(X_i | π_i) Do
  If an FD α → X_i of F satisfies α ⊆ π_i, substitute P(X_i | π_i) with P(X_i | α),
  where α is a minimal subset of π_i.
 End For
End.

We use the following example to elaborate the algorithm.

Example 3. Let R be a probabilistic scheme over an attribute set, and let F be a set of simple FDs over R.
According to a chosen attribute order, the joint probability distribution is first expressed by the chain formula.
Based on Algorithm 2, every conditional probability whose conditioning set contains the left-hand side of an FD in F is then substituted by the conditional probability given that left-hand side alone.
The joint probability distribution is thereby simplified, and the corresponding BN structure of the simplified joint probability distribution is shown in Figure 4(a). From the FD rule of probability, the expression is reduced further.
The corresponding structure is shown in Figure 4(b).
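To make the substitution concrete, consider a small scheme whose attributes and FDs are chosen purely for illustration: attributes X1, ..., X5 with FDs X1 → X3 and X2 → X4. The chain formula and its simplification are

P(X1, X2, X3, X4, X5) = P(X1) P(X2 | X1) P(X3 | X1, X2) P(X4 | X1, X2, X3) P(X5 | X1, X2, X3, X4),
by X1 → X3: P(X3 | X1, X2) = P(X3 | X1),
by X2 → X4: P(X4 | X1, X2, X3) = P(X4 | X2),
so P(X1, X2, X3, X4, X5) = P(X1) P(X2 | X1) P(X3 | X1) P(X4 | X2) P(X5 | X1, X2, X3, X4).

Applying the FD rules of probability once more, X3 and X4 add no information beyond X1 and X2, so P(X5 | X1, X2, X3, X4) = P(X5 | X1, X2).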

Theorem 4 (see [9]). Given two attribute sets α and β and class label C, if α → β is true, then the following conditional probability distribution is true: P(C | α, β) = P(C | α).

Theorem 5 (see [16]). Let U be a set of attributes and F an acyclic set of simple FDs over U. Only one canonical cover exists for F; that is, if F_1 and F_2 are canonical covers for F, then F_1 = F_2.

The chosen canonical cover is the set of minimal dependencies. Such a cover is information lossless and is considerably smaller than the set of all valid dependencies. These qualities are particularly important for the end user because they provide relevant knowledge in which redundancy is minimized and extraneous information is discarded.

Lemma 6. Let F_c be a canonical cover for F, which is a set of simple FDs, and let α and α_c be the unions of the left-hand sides of all dependencies in F and F_c, respectively. The following conditional probability distribution is true: P(C | α) = P(C | α_c).

Lemma 7. Let U be a set of attributes and C the class label. Let F_c be a canonical cover for F, which is a set of simple FDs over U, and let α be the union of the left-hand sides of all dependencies in F_c. The following conditional probability distribution is true: P(C | U) = P(C | U − (α+ − α)), where α+ − α represents the difference set between the closure α+ and α. Therefore, the difference set is redundant for classification. From Lemma 7 and the chain formula rule, the joint distribution of C and U can be factored over the reduced attribute set U − (α+ − α).

The corresponding structure is shown in Figure 5(a). By applying the FD rule of probability, the expression is reduced further.

The corresponding structure is shown in Figure 5(b).

4. TAN-FDA

TAN allows tree-like structures to be used in representing dependencies among attributes. The class node directly points to all attribute nodes, and an attribute node can have only one parent from another attribute node (in addition to the class node).

The architecture of the TAN model can be depicted as shown in Algorithm 3.

Input: Training set D, attribute set U = {X_1, ..., X_n}, and class C.
Output: TAN constructed by the conditional mutual information metric.
Step 1. Compute the conditional mutual information I(X_i; X_j | C) between
each pair of attributes X_i and X_j, i ≠ j.
Step 2. Build a complete undirected graph wherein the vertices are the attributes X_1, ..., X_n.
Annotate the weight of the edge connecting X_i to X_j by I(X_i; X_j | C).
Step 3. Build a maximum weighted spanning tree.
Step 4. Transform the resulting undirected tree into a directed tree by choosing a root node
and setting the direction of all edges outward from it.
Step 5. Build the TAN model by adding a vertex labeled C and adding an arc from C to each X_i.
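A compact sketch of these steps for discrete data is given below; all identifiers are illustrative, the conditional mutual information is the frequency-count estimator sketched earlier, and Prim's algorithm stands in for any maximum weighted spanning tree routine:

import math
from collections import Counter
from itertools import combinations

def cmi(x, y, c):
    """I(X; Y | C) from aligned sequences of discrete values."""
    n = len(c)
    xyc, xc, yc, cc = Counter(zip(x, y, c)), Counter(zip(x, c)), Counter(zip(y, c)), Counter(c)
    return sum((nxyc / n) * math.log2(nxyc * cc[cv] / (xc[(xv, cv)] * yc[(yv, cv)]))
               for (xv, yv, cv), nxyc in xyc.items())

def learn_tan_structure(data, attributes, class_attr):
    """Return {attribute: parent attribute or None}; the C -> X_i arcs are implicit."""
    c = [row[class_attr] for row in data]
    cols = {a: [row[a] for row in data] for a in attributes}
    # Steps 1-2: edge weights of the complete undirected graph.
    w = {(a, b): cmi(cols[a], cols[b], c) for a, b in combinations(attributes, 2)}
    # Step 3: maximum weighted spanning tree (Prim's algorithm).
    root = attributes[0]
    in_tree, parent = {root}, {root: None}
    while len(in_tree) < len(attributes):
        u, v = max(((p, q) for p in in_tree for q in attributes if q not in in_tree),
                   key=lambda e: w.get(e, w.get((e[1], e[0]), 0.0)))
        parent[v] = u            # Step 4: edges directed away from the root.
        in_tree.add(v)
    return parent                # Step 5: add C -> X_i arcs when estimating the CPTs.

data = [
    {"A": 0, "B": 0, "C": 0, "class": 0},
    {"A": 1, "B": 1, "C": 0, "class": 1},
    {"A": 1, "B": 1, "C": 1, "class": 1},
    {"A": 0, "B": 0, "C": 1, "class": 0},
]
print(learn_tan_structure(data, ["A", "B", "C"], "class"))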

The proposed TAN-FDA has three phases: drafting, thinning, and thickening. Given that TAN performs well in experimental studies, the drafting phase does not start from an empty graph but from the TAN structure, which is expected to be close to the correct graph. Moreover, we propose a new method for the thinning phase. This method uses a data-mining technique to infer a set of FDs, which describe the correlation of events and can be viewed as probabilistic rules; two events are correlated if they are frequently observed together. Each FD can be used to remove redundant attributes, thus simplifying the network structure. With a limited number of training samples, some relationships may be missed by TAN; however, after the thinning phase, the attribute space shrinks, and the training samples can provide more useful information for the remaining attributes. In the thickening phase, the conditional mutual information is computed again to obtain a new TAN structure over the remaining attributes. We then compare the TAN structures before and after the thinning phase; if a new edge appears, it is added to the graph in the thickening phase. The graph produced by this phase will contain all the edges of the underlying dependency model.

The learning procedure of the TAN-FDA algorithm is described as shown in Algorithm 4.

Input: Training set D_train, testing set D_test, and attribute set U.
Output: Restrictive TAN model.
Step 1. In the drafting phase, the TAN model is used as the basic structure.
Step 2. Mine association rules from the whole data set D = D_train ∪ D_test, and transform these rules into FDs.
Thereafter, obtain the closure of FDs.
Step 3. In the thinning phase, remove redundant attributes and corresponding edges that
originate from these attributes. Thereafter, obtain the simplified structure.
Step 4. Learn the TAN model as the mapping structure with the rest of the attributes.
Step 5. In the thickening phase, compare the mapping structure and simplified structure.
If a new edge exists in the mapping structure, add the edge to the simplified structure.
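The thinning and thickening phases themselves are simple set manipulations once the FDs and the two TAN structures are available. The sketch below uses our own minimal data structures (edge sets and FD pairs) and is meant only to show how the phases fit together:

def thin_structure(tan_edges, attributes, fds):
    """Thinning: drop attributes determined by an FD, plus their incident edges.

    tan_edges  : set of directed edges (parent, child) among attributes
    attributes : list of attribute names
    fds        : list of (lhs tuple, rhs attribute) with 100% confidence
    """
    redundant = {rhs for _, rhs in fds}                  # determined attributes
    kept = [a for a in attributes if a not in redundant]
    kept_edges = {(p, c) for (p, c) in tan_edges
                  if p not in redundant and c not in redundant}
    return kept, kept_edges

def thicken_structure(simplified_edges, mapping_edges):
    """Thickening: add any edge of the re-learned (mapping) TAN that the
    simplified structure is missing."""
    return simplified_edges | mapping_edges

# Toy usage with a hypothetical drafted TAN over A, B, C, D and FD A -> D.
tan_edges = {("A", "B"), ("B", "C"), ("C", "D")}
kept, simplified = thin_structure(tan_edges, ["A", "B", "C", "D"], [(("A",), "D")])
mapping = {("A", "B"), ("A", "C")}       # TAN re-learned on the kept attributes
print(kept, thicken_structure(simplified, mapping))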

5. Experiments

To verify the efficiency and effectiveness of the proposed TAN-FDA, we conduct experiments on nine data sets from the UCI machine-learning repository (Table 1). For each benchmark data set, we compare TAN-FDA against two Bayesian models that also use attribute selection. The following abbreviations are used for the different classification approaches: SNB-FSS [17], the selective Naive Bayes classifier with forward sequential selection; and TAN-CFS [18], the tree-augmented Naive Bayes classifier with classical floating search.

SNB-FSS selects a subset of attributes by using leave-one-out cross validation as the selection criterion and builds an NB classifier with these attributes. It searches in the forward direction; that is, starting from the empty set of attributes, it iteratively adds the attribute that most improves accuracy. Independence is assumed among the selected attributes given the class.

TAN-CFS adds, by means of the sequential forward selection procedure, the new feature that maximizes the criterion, starting from the current feature set. Thereafter, conditional exclusions from the previously updated subset are attempted. If no feature can be excluded, the algorithm proceeds with sequential forward selection. The floating methods are thus allowed to correct wrong decisions made in previous steps, and they approximate the optimal solution better than plain sequential feature selection methods.

The current implementation of TAN-FDA is limited to categorical data. Hence, we assess only the relative capacities of these algorithms on categorical data, and all numeric attributes are discretized. When MDL discretization [19], a common discretization method for NB, is used to discretize quantitative attributes within each cross-validation fold, many attributes end up with only one value. In these experiments, we therefore discretize quantitative attributes by using three-bin equal-frequency discretization prior to classification.
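Three-bin equal-frequency discretization assigns roughly the same number of training values to each bin. A minimal sketch (not the authors' Matlab code) is:

import numpy as np

def equal_frequency_bins(values, n_bins=3):
    """Cut points so that each bin receives roughly the same number of values."""
    return np.quantile(values, [i / n_bins for i in range(1, n_bins)])

def discretize(values, cut_points):
    return np.searchsorted(cut_points, values, side="right")

x = np.array([1.2, 3.4, 2.2, 8.0, 5.5, 0.3, 9.1, 4.4, 7.7])
cuts = equal_frequency_bins(x, n_bins=3)
print(discretize(x, cuts))   # bin index 0, 1, or 2 for each value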

The base probabilities are estimated by using m-estimation [20], which often yields more accurate probabilities than the Laplace estimate for NB and TAN. The experiments are coded in Matlab 7.0 on an Intel Pentium 2.93 GHz computer with 1 GB of RAM.
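The m-estimate smooths a conditional probability toward a prior with an equivalent sample size m; the value of m and the prior in the sketch below are assumptions for illustration only:

def m_estimate(count_xy, count_y, prior, m=1.0):
    """P(x | y) smoothed toward 'prior' with equivalent sample size m."""
    return (count_xy + m * prior) / (count_y + m)

# e.g. 3 of 10 class-c instances have attribute value v, uniform prior over 4 values
print(m_estimate(3, 10, prior=0.25, m=1.0))   # about 0.295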

The main advantage of the proposed method is that it uses FDs to simplify the learning procedure and build a robust BN structure. For different testing samples, different FDs may be applied, and the final BN structure may differ greatly, which makes TAN-FDA much more flexible. In our experiments, we try to use the most general FDs as inductive rules, restricting the number of attributes on the left-hand side of each FD (referred to below as the left-hand-side size) to less than three. We first study the performance of the state-of-the-art classifiers NB and TAN to reveal how performance varies as the left-hand-side size changes. To explore how the classification performance of TAN-FDA compares with SNB-FSS and TAN-CFS, we estimate Pcost, the mean posterior probabilistic cost of the submodels. The probabilistic costing is equivalent to the accumulated code length of the total test data encoded by the inferred model. Given that an optimal code length can only be achieved by encoding the real probability distribution of the test data, a smaller probabilistic costing indicates a better overall probabilistic prediction. For each benchmark data set, classification performance is evaluated by fourfold and tenfold cross-validation.
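Under this coding interpretation, the probabilistic costing accumulated over N test instances (x_i, c_i) can be written in one common form as

Pcost = − Σ_{i=1}^{N} log2 P̂(c_i | x_i) bits,

so a model that assigns high probability to the true class of each test instance attains a small bit costing; dividing by N gives the mean costing per instance.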

5.1. Comparison with State-of-the-Art Algorithms

We choose NB and TAN as comparator algorithms because they are relatively parameter-free and readily produce clearly understood performance outcomes. We first consider the relative performance when the left-hand-side size of the induced FDs takes different values, that is, one or two. To assess these predictions, we calculated the mean error, bias, and variance for NB and TAN over the nine data sets. The experimental results are presented in Tables 2 and 3. We observe that increasing the left-hand-side size from one to two consistently decreases bias at the cost of an increase in variance. This trade-off delivers low errors for NB and TAN. If our reasoning on the expected bias profiles of these algorithms is accepted, the performance of NB and TAN should increase with increasing data quantity, and unrestricted BN classifiers should also achieve low errors.

5.2. Comparison with SNB-FSS and TAN-CFS

Tables 4 and 5 show the mean error, bias, and variance of TAN-FDA, SNB-FSS, and TAN-CFS. TAN-FDA has higher bias, lower variance, and lower mean error than SNB-FSS and TAN-CFS. When the left-hand-side size increases from one to two, fewer instances may satisfy the stopping criterion of the association rule mining, and the extracted rules or FDs may be merely coincidental rather than real domain knowledge. The resulting reduction in variance is less pronounced than the rise in mean bias. This phenomenon remains an interesting unexplained topic worthy of further investigation.

To obtain Pcost for a BN, the probabilistic prediction for each test instance is calculated by arithmetically averaging the probabilistic predictions submitted at each iteration. We observe that TAN-FDA achieves better (lower) probabilistic costing than SNB-FSS and TAN-CFS on almost all data sets in terms of the logarithmic Pcost bit costing (Figure 6). The superior performance of TAN-FDA on probabilistic prediction can be attributed to the facts that FDs are extracted from the whole data set, irrelevant attributes are excluded, and classifiers are built on subsets of selected attributes. This restricted network structure maximizes classification performance by removing irrelevant attributes and relaxing the independence assumption between correlated attributes. The computational demands for determining the network structure are low, particularly when a large number of attributes are available.

The size of the CPTs of the nodes increases exponentially with the number of parents, thus possibly resulting in an unreliable probability estimate of the nodes that have a large number of parents. However, the introduction of FDs reduces this negative effect significantly.

6. Conclusion

For high-dimensional data, only very low-dimensional forms of BN are robust. FDs provide a novel way of addressing this common problem in machine-learning techniques. Moreover, we have established that higher-dimensional variants are likely to deliver greater accuracy than lower-dimensional alternatives when provided with reliable domain knowledge. Thus, a promising direction for future research is the development of computationally efficient techniques for approximating TAN-FDA with large FD left-hand sides.

A further unresolved issue is how to select an appropriate FD left-hand-side size for a specific data set. Furthermore, a number of techniques have been developed for extending TAN to handle numeric data [21]; extending the current study to this more general TAN-FDA framework is needed.

We have presented a strategy for deriving FDs and removing redundant nodes. However, mining the relationships between different FDs is also important. Reducing the errors of BNs has been proven possible through appropriate feature selection and submodel weighting [22]. Therefore, exploring efficient methods for removing redundant FD parts and handling larger FD left-hand sides is significant. If fast classification is required and training time is not constrained, approaches that use search methods to select small numbers of FDs from a TAN-FDA model are likely to be appropriate. If sufficient training time is available, searching for appropriate left-hand-side sizes will be useful.

For forward sequential selection (FSS) to consider joining two attributes, it must first construct a classifier in which one of the attributes is used independently. However, on exclusive-or problems, a Bayesian classifier with only one relevant attribute performs no better than simply guessing the most frequent class, so the relevant attributes will rarely be joined. By contrast, classical floating search (CFS) starts with the TAN classifier over all attributes and considers all pairs of joins. CFS therefore always considers joining two relevant attributes (particularly on larger data sets) and achieves high accuracy. A potential weakness of join-based algorithms is that, to join more than two attributes, two attributes must be joined first before others can be added; this will not occur unless forming the first pair already increases accuracy. This problem is common to many algorithms. In this paper, we have developed a generative learning algorithm that generalizes the principles underlying TAN-FDA. We have shown that searching for dependencies among attributes while learning Bayesian classifiers yields significant increases in accuracy. All redundant attributes are considered only once, which improves the accuracy of the Bayesian classifier when more than two interacting attributes are present.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant no. 61272209) and Postdoctoral Science Foundation of China (Grant no. 20100481053).