Abstract
Although the naïve Bayes learner has been proven to show reasonable performance in machine learning, it often suffers from a few problems when handling real-world data. The first problem is the conditional independence assumption; the second is the use of frequency estimators. We therefore propose methods to solve these two problems for naïve Bayes algorithms. By using an attribute weighting method, we handle the conditional independence assumption issue, whereas, in the case of the frequency estimators, we weaken their negative effects through our proposed smooth kernel method. In this paper, we propose a compact Bayes model in which a smooth kernel augments weights on the likelihood estimation. We also choose an attribute weighting method that employs a mutual information metric to cooperate with the framework. Experiments have been conducted on UCI benchmark data sets, and the accuracy of our proposed learner has been compared with that of standard naïve Bayes. The experimental results demonstrate the effectiveness and efficiency of our proposed learning algorithm.
1. Introduction
The naïve Bayes classifier is a supervised learning method based on the Bayes rule of probability theory. It runs on labeled training examples and is driven by the strong assumption that all attributes in the training examples are independent of one another given the class, known as the naïve Bayes assumption or the naïve Bayes conditional independence assumption. The naïve Bayes classifier has high performance and rapid classification speed and has exhibited its effectiveness especially on huge sets of training instances with plenty of attributes, mainly because of its independence assumption [1].
In practice, classification performance is affected by the attribute independence assumption, which is usually violated in the real world. However, due to the attractive advantages of efficiency and simplicity, both stemming from the attribute independence assumption, many researchers have proposed effective methods to further improve the performance of the naïve Bayes classifier by weakening the attribute independence assumption without sacrificing its advantages. We categorize some typical previous methods of relaxing the naïve Bayes assumption and give brief reviews in Section 2. However, we have found that the attribute weighting method has drawn relatively little attention among those previous methods for improving the naïve Bayes classifier, especially when the attribute weighting method is combined with a kernel method in a reasonable way.
Although Chen and Wang [2] proposed an attribute weighting method with a kernel, their weighting scheme generates a series of parameters from least-squares cross-validation, which is less meaningful in terms of interpretation than our proposed method. In contrast, we propose an attribute weighting algorithm based on an attribute weighting framework with a kernel method. Our method gives the weights embedded in the kernel a relatively interpretable meaning; thus we can flexibly choose different metrics and methods to measure the weights based on our attribute weighting framework.
Contributions of this paper are threefold:

(i) We briefly survey ways to improve naïve Bayes, focusing especially on naïve Bayes weighting methods.

(ii) We propose a novel attribute weighting framework called Attribute Weighting with Smooth Kernel Density Estimation, simply AWSKDE. The AWSKDE framework employs a smooth kernel that makes the probabilistic estimation of the likelihood dominated by the weights, which enables the combination of kernel methods and weighting methods. After setting up the kernel, we can generate a set of weights directly by using various methods cooperating with the kernel.

(iii) On top of the AWSKDE framework, we propose a learner called AWSKDE^{MI}, in which we choose the mutual information criterion to measure the dependency between an attribute and its class label.
Our experimental results show that the mutual information criterion based on the AWSKDE framework exhibits superior performance compared to the standard naïve Bayes classifier.
The paper is organized as follows: we briefly survey ways to improve naïve Bayes in Section 2. In Section 3, we introduce the background of our study. In Section 4, we first propose our attribute weighting framework based on kernel density estimation; after that, we propose a method employing the mutual information criterion for attribute weighting based on the proposed framework. In Section 5, we describe the experiments and results in detail. Lastly, we draw conclusions and describe future research in Section 6.
2. Related Work
A number of methods that weaken the attribute independence assumption of naïve Bayes have been proposed in recent years. Jiang et al. [3] made a survey of methods for improving naïve Bayes. Those methods are broadly divided into five main categories: structure extension, feature selection, data expansion, local learning, and attribute weighting. We give a brief review following this categorization.
For data expansion, Kang and Sohn [4] have presented an algorithm called the propositionalized attribute taxonomy learner, simply PATlearner. In PATlearner, the training data set is first disassembled into small pieces with attribute values; then, PATlearner rebuilds a new data set called PATTable by using the divergence between the distributions of the class labels associated with the corresponding attributes in the disassembled data set. Kang and Kim [5] also proposed a Bayes learner based on PATlearner, called the propositionalized attribute taxonomy guided naïve Bayes learner (PATNBL). It utilizes the propositionalized data set and the PATTable generated by PATlearner to build naïve Bayes classifiers.
Wong [6] has focused on the discretization method of attributes to improve naïve Bayes. Wong has proposed a hybrid method for continuous attributes and mentioned that discretizing continuous attributes in a data set using different methods can improve the performance of naïve Bayes learner. Also, Wong provides a nonparametric measure to evaluate the dependence level between a continuous attribute and the class.
In structure extension, Webb et al. [7] have proposed a method called aggregating one-dependence estimators, simply AODE. In AODE, the conditional probability of a test instance given the class is conditioned on one attribute value that occurs in the test instance. After the training stage, AODE outputs an averaged one-dependence estimator. AODE is a lazy method of structure extension of the Bayesian network. Jiang et al. [3] have proposed hidden naïve Bayes, simply HNB, which is also a kind of structure extension method.
As for attribute weighting methods, there are two ways to obtain attribute weights. The first is to construct a function parameterized by the attribute weights and to fit this function to the training data by estimating the weights. Zaidi et al. [8] have proposed a weighted naïve Bayes algorithm, called weighting to alleviate the naïve Bayes independence assumption, simply WANBIA. Based on the WANBIA framework, the authors have described two methods to obtain the attribute weights: one maximizes the conditional log-likelihood function and the other minimizes the mean squared error function.
Chen and Wang [2] have also proposed an algorithm that minimizes the mean squared error function in order to obtain the attribute weights. In another paper, Chen and Wang [9] have proposed a method called subspace weighting naïve Bayes (simply SWNB), which is a naïve Bayes weighting method for high-dimensional data. Using a local feature-weighting technique, SWNB can describe the different contributions of attributes in the training data set and outputs an optimal set of attribute weights fitting a logit-normal prior distribution.
The second way is to generate the attribute weights directly. Lee et al. [10] have calculated attribute weights via the Kullback-Leibler divergence between the attribute and the class label. Wu and Cai [11] have proposed decision-tree-based attribute weighted AODE, simply DTWAODE. DTWAODE generates a set of attribute weights directly, with the weight value decreasing according to the attribute's depth in the decision tree. Omura et al. [12] have proposed a weighting method, called confidence weights for naïve Bayes, where the confidence weights are derived from the probabilities of the majority class in the training data set.
3. Background
In this section, we explain the concepts of the machine learning methods used in this paper, including the naïve Bayes classifier, naïve Bayes attribute weighting, and kernel density estimation for categorical attributes in naïve Bayes. The symbols used in this paper are summarized in the Notations section.
3.1. Naïve Bayes Classifier
In supervised learning, consider a training data set D composed of n instances, where each instance x (a d-dimensional vector) is labeled with a class label c. For the posterior probability of c given x, we have

P(c | x) = P(c) P(x | c) / P(x). (1)
But the likelihood P(x | c) cannot be directly estimated from D because of insufficient data in practice. Naïve Bayes uses the attribute independence assumption to alleviate this problem; under the assumption, P(x | c) factorizes as follows:

P(x | c) = ∏_{i=1}^{d} P(x_i | c). (2)
In the training phase, only P(c) and P(x_i | c) need to be estimated for each class c and each attribute value x_i. The estimation method uses the frequency of x_i given c and the frequency of c, respectively.
In the classification phase, given a test instance t = (t_1, ..., t_d), where t_i is the value of the i-th attribute in the test instance, the naïve Bayes classifier outputs a class label prediction for t based on the frequency estimations of P(c) and P(t_i | c) generated in the training phase. The naïve Bayes classifier is shown as follows:

c(t) = argmax_c P(c) ∏_{i=1}^{d} P(t_i | c). (3)
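To make the training and classification steps above concrete, the following minimal Python sketch (not the paper's implementation; the names train_nb and classify_nb are illustrative) estimates P(c) and P(x_i | c) by frequency counting and applies the argmax rule:

```python
from collections import Counter

def train_nb(data, labels):
    """Estimate P(c) and P(x_i | c) by simple frequency counting."""
    class_counts = Counter(labels)
    n = len(labels)
    # cond_counts[(i, value, c)]: instances with attribute i == value and class c
    cond_counts = Counter()
    for x, c in zip(data, labels):
        for i, v in enumerate(x):
            cond_counts[(i, v, c)] += 1
    prior = {c: class_counts[c] / n for c in class_counts}
    def likelihood(i, v, c):
        return cond_counts[(i, v, c)] / class_counts[c]
    return prior, likelihood

def classify_nb(x, prior, likelihood):
    """argmax_c P(c) * prod_i P(x_i | c)."""
    best, best_score = None, -1.0
    for c, p in prior.items():
        score = p
        for i, v in enumerate(x):
            score *= likelihood(i, v, c)
        if score > best_score:
            best, best_score = c, score
    return best
```

Note that unseen attribute values yield zero probabilities here; Section 5 describes the Laplacian smoothing that the paper applies to avoid this.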
As mentioned above, the naïve Bayes assumption conflicts with most real-world applications (note that it is rare that attributes in the same data set have no relationships with each other). Therefore, many researchers have proposed ways to relax the naïve Bayes assumption effectively, which were reviewed in Section 2.
In this paper, we focus on attribute weighting methods combined with a kernel density estimation technique, applied to the naïve Bayes learner in order to relax the conditional independence assumption.
3.2. Naïve Bayes Attribute Weighting
Generally, the naïve Bayes attribute weighting scheme can be formulated in several forms. First, a weight w_i assigned to each attribute A_i yields the following:

c(t) = argmax_c P(c) ∏_{i=1}^{d} P(t_i | c)^{w_i}. (4)
If the weight depends on both attribute and class, the corresponding formula is as follows:

c(t) = argmax_c P(c) ∏_{i=1}^{d} P(t_i | c)^{w_{i,c}}. (5)
The following formula is used when the weight depends on the attribute value:

c(t) = argmax_c P(c) ∏_{i=1}^{d} P(t_i | c)^{w_{i,t_i}}. (6)
Referring back to (4), when w_i = 1 for every attribute, the formula is shown as follows:

c(t) = argmax_c P(c) ∏_{i=1}^{d} P(t_i | c). (7)
It is worthwhile to mention that (7) is a special case of the naïve Bayes classifier in which every attribute has the same weight 1. In other words, the naïve Bayes classifier ignores the importance of attributes. From an information-theoretic perspective, the naïve Bayes classifier abandons the chance of extracting more information from the attributes to reduce the entropy of the class. This is one of the reasons why attribute weighting methods achieve higher classification accuracy than the naïve Bayes classifier.
In our approach, we follow (4) in assigning a weight w_i corresponding to the attribute A_i. But instead of using the weight as an exponential parameter, we incorporate it into the kernel so that it works in a more generalized form. The weight in our paper works inside the kernel, as shown in (13), described in Section 4.1.
From an information-theoretic perspective, the attribute weighting method tries to find out which attributes give more information for classification than others. If an attribute in the data set provides more information to reduce the entropy of the class label than other attributes, then it will be assigned a higher weight.
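As an illustration of the exponent-style weighting in (4), the following hedged sketch (the function name and the dictionary-based probability estimates are assumptions, not the paper's code) shows how down-weighting a noisy attribute can change the prediction, while w_i = 1 for all attributes recovers plain naïve Bayes:

```python
def classify_weighted_nb(x, prior, likelihood, weights):
    """argmax_c P(c) * prod_i P(x_i | c) ** w_i  -- form (4)."""
    best, best_score = None, float("-inf")
    for c, p in prior.items():
        score = p
        for i, v in enumerate(x):
            # Raising the likelihood to w_i < 1 flattens the attribute's
            # influence; w_i = 0 removes the attribute entirely.
            score *= likelihood(i, v, c) ** weights[i]
        if score > best_score:
            best, best_score = c, score
    return best
```

With equal weights the noisy second attribute below dominates; shrinking its weight lets the informative first attribute decide the class.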
3.3. Kernel Density Estimation for Naïve Bayes Categorical Attributes
In the naïve Bayes learner discussed in Section 3.1, the likelihood P(x_i | c) is often estimated by the frequency of x_i given c. From a statistical perspective, a non-smooth estimator has the least sample bias but, at the same time, a large estimation variance [2, 13]. Aitchison and Aitken [14] have proposed a kernel function, and Chen and Wang [2] have proposed a variant smooth kernel function replacing the frequency. The definition of the kernel function in [2] is as follows.
Given a test instance t = (t_1, ..., t_d), where t_i is the value of the i-th attribute in the test instance,

K(t_i, a; λ_i) = 1 − λ_i if t_i = a, and λ_i / (|A_i| − 1) otherwise, (8)

where a is an attribute value of A_i observed in the training data. Note that K is a kernel function for t_i given a, which becomes an indicator function if λ_i = 0. λ_i is the bandwidth, bounded between 0 and (|A_i| − 1)/|A_i|, and n_c is the number of instances in D belonging to class c.
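The sketch below illustrates a common parameterization of such a categorical kernel, consistent with the description above: a matching value receives mass 1 − λ, the remainder is spread evenly over the other values, and λ = 0 recovers the indicator (i.e., the plain frequency estimate). Function names are illustrative, not the paper's code:

```python
def aa_kernel(v, a, lam, card):
    """Aitchison-Aitken-style kernel for a categorical attribute.
    v: test-instance value; a: value observed in training;
    lam: bandwidth in [0, (card - 1) / card]; card: number of distinct values.
    lam = 0 reduces the kernel to the indicator 1[v == a]."""
    return 1.0 - lam if v == a else lam / (card - 1)

def kernel_likelihood(v, i, c, data, labels, lam, card):
    """Smooth estimate of P(v | c): average kernel value over class-c instances."""
    matches = [x for x, y in zip(data, labels) if y == c]
    return sum(aa_kernel(v, x[i], lam, card) for x in matches) / len(matches)
```

Because the kernel values over the |A_i| attribute values sum to one, the smoothed estimates still form a valid probability distribution for any bandwidth in range.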
In [2], (8) is used to estimate the likelihood P(t_i | c): the frequency count is replaced by the average of the kernel values over the instances of class c. (Note that P(c) is still estimated by frequency.) They minimize a cost function to obtain a bandwidth λ_i for each attribute A_i in class c. The cost function is defined as follows:
Hence, the classifier is formulated as follows:
4. AWSKDE Framework and AWSKDE^{MI} Learner
As mentioned earlier, in this section we propose an attribute weighting framework working on categorical attributes, called Attribute Weighting with Smooth Kernel Density Estimation, simply AWSKDE. Based on the AWSKDE framework, a learner named AWSKDE^{MI} is proposed, in which mutual information attribute weighting is applied.
4.1. AWSKDE Framework
In (8), we pose an assumption: if a certain attribute A_i has more importance for classification given the class label, in other words, if it can provide more information to reduce the indeterminacy of the class c, then the bandwidth λ_i should be closer to 0; otherwise, if A_i is less meaningful for classification, λ_i should be closer to its upper bound. We therefore let the bandwidth λ_i be determined by the attribute weight w_i, where n_c is the number of instances labeled c. The variation of (8) according to our proposal is as follows:
The estimation of probability of is described as follows:
Hence, AWSKDE framework is defined as follows:
The AWSKDE framework incorporates a smooth kernel to make the probabilistic estimation of the likelihood dominated by the weights. This enables a natural combination of kernel methods and weighting methods. After setting up the kernel, we can generate a set of weights estimated by various methods cooperating with the kernel.
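The paper's exact bandwidth formula is not reproduced above, so the sketch below only illustrates the framework's intent under an assumed bandwidth choice, λ_i = (1 − w_i)(|A_i| − 1)/|A_i|: a fully weighted attribute (w_i = 1) keeps the sharp frequency estimate, while a zero-weight attribute degenerates to the uniform distribution and is effectively ignored:

```python
def awskde_likelihood(v, i, c, data, labels, w_i, card):
    """Hedged sketch of a weight-driven smooth likelihood estimate.
    The bandwidth shrinks as the attribute weight w_i grows, so important
    attributes fall back to the sharp frequency estimate.  The bandwidth
    form lam = (1 - w_i) * (card - 1) / card is an assumption, not the
    paper's equation."""
    lam = (1.0 - w_i) * (card - 1) / card
    matches = [x for x, y in zip(data, labels) if y == c]
    total = 0.0
    for x in matches:
        total += (1.0 - lam) if x[i] == v else lam / (card - 1)
    return total / len(matches)
```

Under this assumption, w_i = 1 gives λ_i = 0 (the indicator kernel, i.e., frequency), and w_i = 0 gives the maximal bandwidth, whose estimate is uniform over the attribute's values.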
4.2. AWSKDE^{MI} Learner
Our approach generates a set of attribute weights by employing the mutual information between each attribute and the class label. It makes sense that if one attribute has more mutual information with the class label, it provides more classification ability than other attributes and therefore should be assigned a larger weight.
The average weight of each attribute is defined in terms of the mutual information I(A_i; C) between the attribute and the class label, where mutual information is defined as follows:

I(A_i; C) = Σ_{a,c} P(a, c) log ( P(a, c) / ( P(a) P(c) ) ).
We also incorporate the split information used in C4.5 [15] into our weighting scheme to avoid favoring attributes with many values. The split information for each attribute A_i is defined as follows, where a_i^j is the value of attribute A_i at the j-th instance (as described in the Notations section):

SplitInfo(A_i) = − Σ_j P(a_i^j) log P(a_i^j).
Now, the weight w_i of each attribute A_i is defined as follows:
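A hedged sketch of this weighting step, assuming the weight is the mutual information normalized by the split information in a gain-ratio style (the paper's exact combination formula is not reproduced above; function names are illustrative):

```python
from math import log2
from collections import Counter

def mutual_information(col, labels):
    """I(A_i; C) = sum over (a, c) of P(a, c) * log2(P(a, c) / (P(a) P(c)))."""
    n = len(col)
    cnt_a, cnt_c = Counter(col), Counter(labels)
    cnt_ac = Counter(zip(col, labels))
    mi = 0.0
    for (a, c), n_ac in cnt_ac.items():
        p_ac = n_ac / n
        # P(a,c) / (P(a) P(c)) = p_ac * n^2 / (count(a) * count(c))
        mi += p_ac * log2(p_ac * n * n / (cnt_a[a] * cnt_c[c]))
    return mi

def split_info(col):
    """Split information, as in C4.5, penalizing many-valued attributes."""
    n = len(col)
    return -sum((k / n) * log2(k / n) for k in Counter(col).values())

def attribute_weight(col, labels):
    """Assumed gain-ratio-style weight: MI divided by split information."""
    return mutual_information(col, labels) / split_info(col)
```

An attribute that perfectly predicts a balanced binary class gets weight 1, while an attribute independent of the class gets weight 0; a single-valued attribute has zero split information and would need a guard in practice.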
We feed AWSKDE^{MI} with a training data set. In the training stage, we generate the estimates of the class priors, the conditional probabilities, and the attribute weights. In the classification phase, given a test instance, the AWSKDE^{MI} classifier is formed and a prediction of the class is finally output. The learning algorithm of AWSKDE^{MI} is described in Algorithm 1.

During the training phase, AWSKDE^{MI} only needs to construct conditional probability tables (CPTs), which contain the joint probabilities of the attributes and the class label. In terms of time complexity, the calculations of the class priors, the conditional probabilities, the mutual information, and the split information each contribute their own cost to the training phase; the resulting total time complexities of the training and classification phases of AWSKDE^{MI} and naïve Bayes are summarized in Table 1.
Here, we also present a framework named Attribute Weighting with Light Smooth Kernel Density Estimation, simply AWLSKDE, which does not consider the bandwidth. AWLSKDE can be regarded as a simplified version of AWSKDE. According to (8), we directly drop the bandwidth term, so the kernel is changed to a lighter form, defined as follows:
The estimation is described as follows:
We also build an attribute weighting naïve Bayes learner with the mutual information metric based on this AWLSKDE framework, called AWLSKDE^{MI}. The method of obtaining the weight of each attribute is the same as that of the AWSKDE^{MI} learner. Unfortunately, the AWLSKDE framework does not give us encouraging results. The experimental results of the AWLSKDE^{MI} learner can be found in Table 3, together with an analysis of the results.
5. Experimental Results
In order to compare AWSKDE^{MI}, AWLSKDE^{MI}, and naïve Bayes in terms of classification accuracy, we have conducted experiments on UCI Machine Learning Repository benchmark data sets [16]. The UCI benchmark data sets that we have used are shown in Table 2. Note that we have preprocessed each data set by removing missing values and discretizing numerical attribute values.
In the implementation of our algorithm, all the probabilities, including P(c) and P(x_i | c), are estimated via Laplacian smoothing, which is shown as follows:

P(c) = (n_c + 1) / (n + |C|),  P(x_i | c) = (n_{x_i,c} + 1) / (n_c + |A_i|),

where n is the number of training examples for which the class value is known, n_c is the number of those with class c, and n_{x_i,c} is the number of training examples for which both the attribute value x_i and the class c are known. The quotient of the smoothed joint count as the dividend and the smoothed class count as the divisor results in the conditional probability P(x_i | c).
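A small sketch of the standard Laplacian-smoothed estimators described above (function names are illustrative; n_c, n_ac, k, and v denote the class count, joint count, number of classes, and number of distinct attribute values, respectively):

```python
def laplace_prior(n_c, n, k):
    """P(c) = (n_c + 1) / (n + k), where k is the number of classes."""
    return (n_c + 1) / (n + k)

def laplace_likelihood(n_ac, n_c, v):
    """P(a | c) = (n_ac + 1) / (n_c + v), where v is the number of
    distinct values of the attribute.  An unseen value (n_ac = 0)
    still receives a small nonzero probability."""
    return (n_ac + 1) / (n_c + v)
```

The add-one correction prevents a single unseen attribute-value/class pair from zeroing out the whole product in the classifier.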
To compare the performance of the algorithms, we have adopted the t-test with 10-fold cross-validation. We have conducted the experiments by applying our algorithms and standard naïve Bayes to the same training data sets as well as the same test data sets. The performance of each algorithm is evaluated through classification accuracy.
Table 3 shows the comparison of accuracies among standard naïve Bayes, the AWSKDE^{MI} learner, and the AWLSKDE^{MI} learner.
It can be seen that the AWSKDE^{MI} learner shows four better, six even, and seven worse results than naïve Bayes on the seventeen UCI data sets, whereas the AWLSKDE^{MI} learner has only one better result. Note that accuracies are estimated using 10-fold cross-validation with a 95% confidence interval. AWSKDE^{MI} performs significantly well on the anneal data set, and the mean accuracy of the AWSKDE^{MI} learner is 84.81, which is better than naïve Bayes' 84.78. These experimental results demonstrate that our new attribute weighting model AWSKDE^{MI} is efficient and effective. The AWLSKDE^{MI} learner has performed poorly because ignoring the bandwidth parameter of the kernel method results in a relatively larger bias.
6. Conclusions and Future Work
In this paper, a novel attribute weighting framework called Attribute Weighting with Smooth Kernel Density Estimation, simply the AWSKDE framework, has been proposed. The AWSKDE framework enables the estimation of the likelihood to be dominated by the attribute weights. Based on AWSKDE, the AWSKDE^{MI} learner has been proposed to exploit mutual information. We have conducted experiments on seventeen UCI benchmark data sets and compared the accuracy of standard naïve Bayes, AWSKDE^{MI}, and AWLSKDE^{MI}. The experimental results show that our new learner, AWSKDE^{MI}, is efficient and effective, while AWLSKDE^{MI} has underperformed due to the relatively larger bias in its algorithm.
Even though AWSKDE^{MI} shows comparable results, as shown in Table 3, it does not decisively outperform naïve Bayes. In future work, we plan to improve the AWSKDE framework and investigate more effective attribute weighting methods beyond the mutual-information-based weight measurement between attributes and the class label.
Notations
A_i:  The i-th attribute in data set D
|A_i|:  The cardinality of attribute A_i
a_i^j:  The value of A_i at the j-th instance
D:  Training data set consisting of n instances
x:  An instance, a d-dimensional vector, x = (x_1, x_2, ..., x_d)
c:  Class label, c ∈ C
x_i:  An element of x, 1 ≤ i ≤ d
t:  A test instance, a d-dimensional vector
P(e):  The unconditioned probability of event e
P(e_1 | e_2):  The conditional probability of e_1 given e_2
P̂(·):  An estimation of P(·)
F(x_i | c):  The frequency of x_i given c
w_i:  The weight value of attribute A_i
I(A_i; C):  The mutual information between A_i and C.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (no. NRF2013R1A1A2013401).