Mathematical Problems in Engineering

Volume 2015 (2015), Article ID 170324, 7 pages

http://dx.doi.org/10.1155/2015/170324

## Bayesian Prediction Model Based on Attribute Weighting and Kernel Density Estimations

^{1}Weifang University of Science & Technology, Shouguang, Shandong 262-700, China^{2}Department of Computer and Information Engineering, Dongseo University, 47, Churye-Ro, Sasang-Gu, Busan 617-716, Republic of Korea

Received 26 March 2015; Accepted 29 July 2015

Academic Editor: Fons J. Verbeek

Copyright © 2015 Zhong-Liang Xiang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Although naïve Bayes learner has been proven to show reasonable performance in machine learning, it often suffers from a few problems with handling real world data. First problem is conditional independence; the second problem is the usage of frequency estimator. Therefore, we have proposed methods to solve these two problems revolving around naïve Bayes algorithms. By using an attribute weighting method, we have been able to handle conditional independence assumption issue, whereas, for the case of the frequency estimators, we have found a way to weaken the negative effects through our proposed smooth kernel method. In this paper, we have proposed a compact Bayes model, in which a smooth kernel augments weights on likelihood estimation. We have also chosen an attribute weighting method which employs mutual information metric to cooperate with the framework. Experiments have been conducted on UCI benchmark datasets and the accuracy of our proposed learner has been compared with that of standard naïve Bayes. The experimental results have demonstrated the effectiveness and efficiency of our proposed learning algorithm.

#### 1. Introduction

Naïve Bayes classifier is a supervised learning method based on Bayes rule of probability theory, running on labeled training examples and driven by a strong assumption that all attributes in the training examples are independent from one another on the given training examples known as naïve Bayes assumption or naïve Bayes conditional independence assumption. Naïve Bayes classifier has high performance and rapid classification speed and has exhibited its effectiveness especially in huge training instances with plenty of attributes mainly because of its independence assumption [1].

In practice, classification performance is affected by the attribute independence assumption which is usually violated in real world. However, due to the attractive advantages of efficiency and simplicity, both stemming from the attribute independence assumption, many researchers have proposed effective methods to further improve the performance of naïve Bayes classifier by weakening the attribute independence without neglecting its advantages. We categorize some typical previous methods of relaxing naïve Bayes assumption and give brief reviews in Section 3. However, we have found out that attribute weighting method has drawn relatively little attention among those previous methods in improving naïve Bayes classifier, especially in the case when attribute weighting method is combined with kernel method in a reasonable way.

Although Chen and Wang [2] proposed attribute weighting method with the kernel, their weighting scheme generates a series of parameters from least squares cross-validation which is less meaningful in terms of interpretation than our proposed method. In contrast, we propose an attribute weighting algorithm based on attribute weighting framework with kernel method. Our method makes the weights embedded in kernel have relatively interpretable meaning; thus we can flexibly choose different metrics and methods to measure the weights based on our attribute weighting framework.

Contributions of this paper are threefold:(i)We briefly make a survey of ways to improve naïve Bayes, especially focusing on those naïve Bayes weighting methods.(ii)We propose a novel attribute weighting framework called Attribute Weighting with Smooth Kernel Density Estimation, simply AW-SKDE. The AW-SKDE framework employs a smooth kernel that makes the probabilistic estimation of likelihood to be dominated by the weights, which enables the combination of kernel methods and weighting methods. After setting up the kernel, we can generate a set of weights directly by using various methods cooperating with the kernel.(iii)On the AW-SKDE framework, we propose a learner called AW- in which we choose the mutual information criterion to measure the dependency between an attribute and its class label.

Our experimental results show that mutual information criterion based on AW-SKDE framework exhibits superior performance compared to standard naïve Bayes classifier.

The paper is organized as follows: we briefly make a survey of ways to improve naïve Bayes in Section 2. In Section 3, we introduce the background of our study. In Section 4, we first propose our attribute weighting framework based on kernel density estimation. After that, we propose a method employing the mutual information criterion for attribute weighting based on our proposed framework. In Section 5, we describe the experiment and results in detail. Lastly, we draw conclusions for our study and describe the future research in Section 6.

#### 2. Related Work

A number of methods that weaken attribute independent assumption for naïve Bayes have been proposed in the recent years. Jiang et al. [3] made a survey about improving naïve Bayes method. Those methods are broadly divided into five main categories: structure extension, feature selection, data expansion, local learning, and attribute weighting. We make a brief review by following this categorization.

For data expansion, Kang and Sohn [4] have presented an algorithm called propositionalized attribute taxonomy learner, simply PAT-learner. In PAT-learner, the training data set is first disassembled into small pieces with attributes values; then, PAT-learner rebuilds a new data set called PAT-Table by using divergence between the distribution of the class labels associated with the corresponding attributes at the disassembled date set. Kang and Kim [5] also proposed a Bayes learner based on PAT-learner, called propositionalized attribute taxonomy guided naïve Bayes learner (PAT-NBL). They utilize propositionalized data set and PAT-Table that is generated from PAT-learner to build naïve Bayes classifiers.

Wong [6] has focused on the discretization method of attributes to improve naïve Bayes. Wong has proposed a hybrid method for continuous attributes and mentioned that discretizing continuous attributes in a data set using different methods can improve the performance of naïve Bayes learner. Also, Wong provides a nonparametric measure to evaluate the dependence level between a continuous attribute and the class.

In structure extension, Webb et al. [7] have proposed a method called aggregating one-dependence estimators, simply AODE. In AODE, the conditional probability of test instances given class is tuned by one attribute value which occurs in the test instance. After the training stage, AODE outputs an average one-dependence estimator. AODE is a lazy method of structure extension of Bayesian network. Jiang et al. [3] have proposed hidden naïve Bayes, simply HNB, which is also a kind of structure extension method.

As for attribute weighting methods, we have two ways to get attribute weights. The first one is to construct a function with the parameters of attribute weight and to let this function fit itself with the training data by estimating the weights. Zaidi et al. [8] have proposed a weighted naïve Bayes algorithm, called weighting to alleviate the naïve Bayes independence assumption, simply WANBIA. Based on WANBIA framework, the authors have described two methods to obtain the attribute weights: , which maximizes the conditional log likelihood function and , which minimizes mean squared error function.

Chen and Wang [2] have also proposed an algorithm to minimize mean squared error function in order to obtain the attribute weights. In another paper, Chen and Wang [9] have proposed a method called subspace weighting naïve Bayes (simply SWNB) that is a naïve Bayes weighting method to deal with high-dimensional data. Using the local feature-weighting technique, SWNB has the ability to describe different contributions of attributes in the training data set and outputs an optimal set of attribute weights fitting a Logit normal priori distribution.

There are many other methods that can be categorized into attribute weighting. Lee et al. [10] have calculated attributes weight via Kullback-Leibler divergence between the attribute and class label. Wu and Cai [11] have proposed decision tree-based attribute weighted AODE, simply DTWAODE. DTWAODE generates a set of attribute weights directly, and the weight value decreases according to attribute depth in the decision tree. Omura et al. [12] have proposed a weighting method, called confidence weight for naïve Bayes, and that confidence weight is derived from the probabilities of the majority class in the training data set.

#### 3. Background

In this section, we explain the concepts of machine learning methods used in this paper, including naïve Bayes classifier, naïve Bayes attribute weighting, and kernel density estimation for naïve Bayes categorical attributes. The symbols used in this paper are summarized in Notations section.

##### 3.1. Naïve Bayes Classifier

In supervised learning, consider a training data set composed of instances, where each instance (-dimensional vector) is labeled with class label . For the posterior probability of given , we have

But likelihood cannot be directly estimated from because of insufficient data in practice. Naïve Bayes uses attributes independence assumption to alleviate this problem; from the assumption, is shown as follows:

In the training phase, only and need to be estimated for each class and each attribute value . The estimation method uses the frequency of given and the frequency of for and , respectively.

In the classification phase, if we have a test instance where is an attribute value of the attribute in the test instance, naïve Bayes classifier outputs a class label prediction of based on the frequency estimation of and which have been generated in the training phase. The classifier of naïve Bayes is shown as follows:

As it was aforementioned, naïve Bayes assumption conflicts with most real world applications (note that it is rare that attributes in the same data set do not have any relationships between each other). Therefore, many researchers provide proposals to relax naïve Bayes assumption effectively, which have been reviewed in Section 2.

In this paper, we focus on attribute weighting methods combined with kernel density estimation technique which is applied to naïve Bayes learner in order to relax conditional independence assumption.

##### 3.2. Naïve Bayes Attribute Weighting

Generally, naïve Bayes attribute weighting scheme can be formulated in several forms. Firstly, the weight to each attribute is defined as follows:

If the weight depends on attribute and class, the corresponding formula is as follows:

The following formula is used for the case when the weight depends on attribute value:

Referring back to (4), when , the formula is shown as follows:

It is worthwhile to mention that (7) is considered as a special case of naïve Bayes classifier, where each attribute has the same weight . In other words, naïve Bayes classifier ignores the importance of attributes. From information theoretic perspective, naïve Bayes classifier abandons the chance of digging more information from to reduce the entropy of class. This is one of the reasons why attribute weighting method provides more accuracy of classification result than naïve Bayes classifier.

In our approach, we follow (4) that assigns which corresponds to the attribute . But instead of using as an exponential parameter, we incorporate into so that it works in a more generalized form. The weight in our paper works in the kernel, as is shown in (13), described in Section 4.1.

Based on information theoretic perspective, attribute weighting method tries to find out which attribute will give more information for classification than other attributes. If an attribute in data set provides more information to reduce the entropy of class label than other attributes, then will be assigned with a higher weight.

##### 3.3. Kernel Density Estimation for Naïve Bayes Categorical Attributes

In naïve Bayes learner, which has been discussed in Section 3.1, the likelihood is often estimated by , the frequency of given ; note that is the value of attribute at the instance in a data set . From a statistical perspective, a nonsmooth estimator has the least sample bias, but it also has a large estimation variance [2, 13] at the same time. Aitchison and Aitken [14] have proposed a kernel function and Chen and Wang [2] have proposed a variant of smooth kernel function alternating the frequency. The definition of their kernel function in [2] is as follows.

Given a test instance where is an attribute value of the attribute in the test instance,

Note that is a kernel function for given , which may become an indicator if . is the bandwidth such that , , and is a number of instances in given .

In [2], they have used (8) to estimate as follows:where we use instead of . (Note that is still estimated by frequency.) They minimize the cost function to take out a series for each in class . The cost function is defined as follows:

Hence, the classifier is formulated as follows:

#### 4. AW-SKDE Framework and AW-SKDE^{MI} Learner

As mentioned earlier, in this section, we propose an attribute weighting framework working on the categorical attribute called* Attribute Weighting with Smooth Kernel Density Estimations*, simply AW-SKDE. Based on the AW-SKDE framework, a learner named AW- is proposed, in which mutual information attribute weighting is applied.

##### 4.1. AW-SKDE Framework

In (8), we pose an assumption that if a certain attribute has more importance for classification given class label, in other words, can provide more information to reduce the indeterminacy of class , then the value of should be more close to ; otherwise, if is less meaningful for classification, then should be more close to . We let the bandwidth , where , and is the number of instances labeled . The variation of (8) according to our proposal is as follows:

The estimation of probability of is described as follows:

Hence, AW-SKDE framework is defined as follows:

The AW-SKDE framework incorporates a smooth kernel to make the probabilistic estimation of likelihood dominated by the weights. This enables natural combination of kernel methods and weighting methods. After setting up the kernel, we can generate a set of weights estimated by various methods cooperating with the kernel.

##### 4.2. AW-SKDE^{MI} Learner

Our approach generates a set of attribute weights by employing mutual information between and . It makes sense that if one attribute has more mutual information with class label, the attribute will provide more classification ability than other attributes and therefore should be assigned a larger weight.

The average weight of each attribute is defined as follows:where the definition of is as follows:

We also incorporate split information used in C4.5 [15] with into our weighting scheme to avoid choosing the attributes with lots of values. The split information for each is defined as follows where is the value of attribute at instance (as described in Notations section):

Now, the weight of is defined as follows:

We feed AW- with a training data set . In the training stage, we generate , , and out for each . In the classification phase, we give a test instance ; then AW- classifier is formed; a prediction of class is outputted finally. The learning algorithm of AW- is described in Algorithm 1.