Abstract

In recent years, cloud computing has attracted increasing attention. Users need to deal with massive amounts of data in the cloud computing environment. Classification can predict the needs of users from such large data. Traditional classification methods frequently adopt one of two strategies: one removes an instance after it is covered by a rule; the other decreases the tuple weight of an instance after it is covered by a rule. The rule sets produced by these traditional classifiers may be of low quality, so they cannot achieve high classification accuracy on some datasets. In this paper, we present a new classification approach, called classification based on both attribute value weight and tuple weight (CATW). CATW is distinguished from traditional classifiers in two aspects. First, CATW uses both attribute value weight and tuple weight. Second, CATW proposes a new measure to select the best attribute values and generate a high quality classification rule set. Our experimental results indicate that CATW can achieve higher classification accuracy than some traditional classifiers.

1. Introduction

Cloud computing has become a hot topic in recent years. With the rapid development of information technology and the popularity of cloud computing, it is necessary to mine useful information from massive data [19]. Classification is one of the most important tasks in data mining and machine learning. Classification can predict the needs of users from large data. First, it builds classification rules from a training dataset. Second, it uses these rules to predict the class labels of new instances.

Traditional classifiers [10–19] frequently adopt one of two strategies. Some remove an instance after it is covered by a rule, such as FOIL [20] and ELEM2 [21]. Others decrease the tuple weight of an instance after it is covered by a rule, such as PRM and CPAR [22]. We now introduce the features of these classifiers. In the process of extracting rules, FOIL uses the gain measure to select the best attribute value and generates one classification rule at a time. It removes an instance after it is covered by a rule. As a result, this method is ineffective: it generates a small rule set and cannot achieve high accuracy on some datasets. ELEM2 uses another measure to generate classification rules and also removes an instance after it is covered by a rule. ELEM2 considers the degree of relevance of an attribute-value pair and selects the most relevant pairs to generate rules. PRM modifies FOIL to achieve higher accuracy. PRM does not remove an instance when it is covered by a rule; instead, it assigns the instance a tuple weight. Thus, PRM can ensure that each instance is covered more than once. PRM selects only the attribute value with the best gain to generate a rule. CPAR stands in the middle between exhaustive and greedy algorithms and combines the advantages of both. CPAR selects several of the best attribute values and builds several rules at one time. It does not remove an instance immediately when it is covered by a rule, and it also uses tuple weight to guarantee that each instance can be covered more than once. However, none of these methods employs attribute value weight, so they cannot obtain a high quality classification rule set. As a result, they cannot achieve high classification accuracy on some datasets.

In this paper, we propose a new algorithm, named classification based on both attribute value weight and tuple weight (CATW). CATW uses both attribute value weight and tuple weight. Moreover, CATW uses a new measure to improve the quality of the classification rule set. Our method has the following advantages. (1) After an instance is covered by a rule, instead of removing it, its weight is decreased by multiplying it by a factor. Thus, we can guarantee that each instance can be covered more than once. (2) If we only use tuple weight, we cannot change the importance of an attribute-value pair in the dataset. Therefore, CATW uses attribute value weight to reduce the importance of attribute-value pairs after a rule is generated. In this way, CATW increases the chances of attaining other optimal attribute-value pairs, so we can generate more high quality rules. (3) CATW presents a new measure to select the best attribute value. CATW uses two component measures: support and correlation confidence. If two different attribute-value pairs have the same correlation confidence, CATW considers their support.

Experimental results indicate the following: if an instance is removed immediately after it is covered by a rule, the classifier generates a very small number of rules; if the classifier uses only tuple weight, the quality of the classification rule set is not good. Since CATW uses both attribute value weight and tuple weight, it achieves high classification accuracy.

The outline of this paper is as follows. Section 2 presents the details of CATW and describes the process of rule generation in CATW. Section 3 discusses how to predict class label using the rules. The experimental results are presented in Section 4. Finally, we conclude the study in Section 5.

2. Rule Generation of CATW

The CATW algorithm has three distinctive points: the attribute value weight, the tuple weight, and the improved measure. First, we describe how to use tuple weight. Second, we introduce the use of attribute value weight. Third, we propose a new measure to generate a high quality classification rule set. Finally, we show the whole process of rule set generation.

Let $D$ be a set of tuples. Each tuple has $m$ attributes $A_1, A_2, \ldots, A_m$. Let $C = \{c_1, c_2, \ldots, c_k\}$ be the set of class labels, where $k$ is the number of class labels.

Definition 1 (a literal). A literal $p$ is an attribute-value pair of the form $(A_i, v)$, where $A_i$ is an attribute and $v$ is a value of attribute $A_i$.

Definition 2 (a classification rule). $R: P \rightarrow c$ is called a classification rule if $P$ consists of a conjunction of literals of the form $p_1 \wedge p_2 \wedge \cdots \wedge p_n$, where $c$ is a class label.

A tuple $t$ satisfies the antecedent of $R$ if and only if it has all the literals in $P$. If $t$ satisfies the antecedent of $R$, then $R$ predicts that $t$ has class label $c$.

2.1. The Tuple Weight

In traditional classification, all rules are generated from the training database. If a tuple $t$ is covered by a rule $R$, these methods cannot ensure that $R$ is the best rule for $t$: if $R$ is generated from the remaining dataset instead of the whole dataset [22], $R$ may not be the best rule. In order to improve classification accuracy and increase the number of rules, some traditional classifiers use tuple weight. By relying on tuple weight, these classifiers can delay removing an instance after it is covered by a rule. In our algorithm, after a tuple is covered by a rule, instead of removing it, its weight is decreased by multiplying it by a factor. We set a threshold for the tuple weight; when the weight of a tuple is less than the threshold, we remove the tuple from the training data. In this way CATW produces more rules, and each tuple can be covered by classification rules more than once.

In our approach, we can set an initial threshold and an end threshold, which allows us to limit the number of rules generated according to the actual situation. If we set a small end threshold, a large number of rules are generated; on the contrary, if we set a large end threshold, fewer rules are generated. In our experiments, we set an initial threshold and a weight factor, and the end threshold is the third power of the weight factor. This ensures that each instance can be covered three times.
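As an illustration of this mechanism, the sketch below decays the weight of every covered tuple and drops tuples whose weight falls below the end threshold. The concrete values of the initial weight, weight factor, and end threshold are illustrative assumptions, since the paper does not prescribe them here.

```python
# Minimal sketch of the tuple weight mechanism. The parameter values
# are illustrative assumptions, not values prescribed by the paper.
INITIAL_WEIGHT = 1.0
WEIGHT_FACTOR = 2.0 / 3.0               # applied each time a tuple is covered
END_THRESHOLD = WEIGHT_FACTOR ** 3      # third power: roughly three covers

def decay_tuple_weights(examples, weights, covers):
    """Decay the weight of every example covered by the new rule and
    drop examples whose weight has fallen below the end threshold.
    `covers(example)` tests whether the rule covers the example."""
    kept_examples, kept_weights = [], []
    for example, weight in zip(examples, weights):
        if covers(example):
            weight *= WEIGHT_FACTOR
        if weight >= END_THRESHOLD:
            kept_examples.append(example)
            kept_weights.append(weight)
    return kept_examples, kept_weights
```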

2.2. The Attribute Value Weight

Some traditional classifiers use only tuple weight. They do not change the importance of an attribute-value pair in the training data, so after a rule is generated, these classifiers may select a duplicate attribute-value pair. Thus, they may miss some high quality rules, which affects classification accuracy. CATW uses attribute value weight to reduce the importance of attribute-value pairs after a rule is generated. When a tuple is covered by a rule, our algorithm reduces the importance of the attribute-value pairs contained in it. In this way, we increase the chances of attaining other optimal attribute-value pairs.

Example 3. A training dataset with two classes is shown in Table 1. We now demonstrate how to use attribute value weight.

Suppose a rule $R$ has just been generated. We set a weight factor and an initial attribute value weight for the positive examples. After a rule is generated, CATW uses the weight factor to reduce the importance of all attribute values that are contained in the antecedent of the rule in the positive examples. The result is shown in Table 2.
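A minimal sketch of this update is shown below; the factor name `beta` and the dictionary representation of attribute value weights are assumptions made for illustration.

```python
# Sketch of the attribute value weight update after a rule is generated.
# `beta` and the dict layout are illustrative assumptions.
def decay_attribute_value_weights(av_weights, antecedent, beta=0.5):
    """Reduce the weight of every attribute-value pair that appears in
    the antecedent of the newly generated rule (positive examples).
    A literal is represented as an (attribute, value) pair."""
    for literal in antecedent:
        av_weights[literal] = av_weights.get(literal, 1.0) * beta
    return av_weights
```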

The results of our experiments indicate that classification accuracy is influenced by attribute value weight. Compared with classifiers that do not use attribute value weight, CATW can achieve higher classification accuracy on some datasets. Thus, attribute value weight helps improve the quality of the classification rules.

2.3. The Measure of CATW

Some classifiers use FOIL gain to select literals. FOIL gain measures the information gained by adding a literal to the current rule. Suppose $P$ is the number of positive examples and $N$ is the number of negative examples that satisfy the antecedent of the current rule $R$. After literal $p$ is added to $R$, let $P'$ be the number of positive examples and $N'$ the number of negative examples that satisfy the antecedent of the new rule [22]. The FOIL gain of $p$ is defined as
$$\mathrm{gain}(p) = P'\left(\log\frac{P'}{P'+N'} - \log\frac{P}{P+N}\right).$$
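For concreteness, the gain above can be computed as follows; this is a direct transcription of the formula, with variable names mirroring $P$, $N$, $P'$, $N'$.

```python
from math import log

def foil_gain(p, n, p_new, n_new):
    """FOIL gain of adding a literal: p/n are the positive/negative
    counts covered by the current rule R, and p_new/n_new are the
    counts after the literal is added."""
    if p_new == 0:
        return float("-inf")            # the literal covers no positives
    return p_new * (log(p_new / (p_new + n_new)) - log(p / (p + n)))
```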

In our experiments, we employ two different improved measures.

2.3.1. Improved FOIL Measure

In our experiments, $P$ denotes the total tuple weight of the positive examples that satisfy the antecedent of the current rule $R$, and $N$ denotes the total tuple weight of the negative examples that satisfy the antecedent of $R$. After literal $p$ is added to $R$, $P'$ denotes the total attribute value weight of literal $p$ in the positive examples, and $N'$ denotes the total attribute value weight of literal $p$ in the negative examples. Therefore, CATW uses both tuple weight and attribute value weight when it measures a literal $p$. We call this measure the improved FOIL measure.
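Under one plausible reading of this definition, the improved FOIL measure substitutes the weighted totals into the FOIL gain formula. The sketch below illustrates that reading; it is an assumption, not the authors' exact implementation.

```python
from math import log

# Sketch of the improved FOIL measure (an assumed reading: weighted
# totals replace the raw counts of the classic FOIL gain).
def improved_foil_gain(P, N, P_new, N_new):
    """P/N: total tuple weight of positive/negative examples satisfying
    the current rule R. P_new/N_new: total attribute value weight of
    the candidate literal in positive/negative examples."""
    if P_new <= 0:
        return float("-inf")
    return P_new * (log(P_new / (P_new + N_new)) - log(P / (P + N)))
```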

2.3.2. Improved Correlation Measure

In the traditional FOIL gain, $P'$ has a strong influence on the selection of the best attribute value. For example, if the correlation part of the gain is small but $P'$ is large, the literal with the highest gain may not be the best one for the rule. We therefore use two component measures, support and correlation confidence, and divide the traditional FOIL measure into two parts: (1) support: $P'$; (2) correlation confidence: $\log\frac{P'}{P'+N'} - \log\frac{P}{P+N}$.

When we select a literal, a global order over literals is imposed. Given two literals $p_1$ and $p_2$, $p_1$ is better than $p_2$, denoted $p_1 \succ p_2$, if and only if (1) the correlation confidence of $p_1$ is greater than that of $p_2$, or (2) their correlation confidences are equal and the support of $p_1$ is greater than that of $p_2$. We call this measure the improved correlation measure.
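In code, this order can be realised by scoring each literal with a (correlation confidence, support) pair and comparing the pairs lexicographically; the sketch below assumes the weighted quantities of Section 2.3.1.

```python
from math import log

def literal_key(P, N, P_new, N_new):
    """Score a literal for the improved correlation measure. Python
    compares the returned pairs lexicographically: correlation
    confidence first, support as the tie-breaker."""
    support = P_new
    corr_confidence = log(P_new / (P_new + N_new)) - log(P / (P + N))
    return (corr_confidence, support)

# Best literal under the global order described above, assuming a
# helper `stats(lit)` that returns (P, N, P_new, N_new) for a literal:
# best = max(candidates, key=lambda lit: literal_key(*stats(lit)))
```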

2.4. Algorithm of CATW

In this section, we introduce our algorithm in detail. The CATW algorithm is presented in Algorithm 1.

Input: Training set D = (P, N) (P and N are the sets of all positive and negative examples, respectively)
Output: A set of rules for predicting class labels for examples
Procedure CATW
 rules ← ∅
 while P ≠ ∅
   P′ ← P, N′ ← N
   R ← empty rule
   while N′ ≠ ∅ and length of the antecedent of R < max rule length
    find the best attribute value p using the improved correlation measure, combining tuple weight with attribute value weight
    add p to the antecedent of R
    remove from P′ all examples not satisfying p
    remove from N′ all examples not satisfying p
   end
   add R to rules
   for each attribute value p that is included in the antecedent of R
    avweight(p) ← avweight(p) × attribute value weight factor
   end
   for each example t in P satisfying R's body
    weight(t) ← weight(t) × tuple weight factor
    if weight(t) < end threshold then remove t from P
   end
 end
return rules
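The following self-contained Python sketch follows the structure of Algorithm 1. The weight factors, the rule-length cap, and the candidate literal enumeration are illustrative assumptions, and the literal scoring follows the improved correlation measure sketched above.

```python
from math import log

def satisfies(t, lit):
    """t: a tuple of attribute values; lit: an (attribute index, value) pair."""
    return t[lit[0]] == lit[1]

def best_literal(p_set, n_set, tuple_w, av_w, exclude=()):
    """Pick the literal with the best (correlation confidence, support)
    pair, weighting counts by tuple weights and attribute value weights."""
    P = sum(tuple_w[id(t)] for t in p_set)
    N = float(len(n_set))
    best, best_key = None, None
    for lit in {(i, v) for t in p_set for i, v in enumerate(t)} - set(exclude):
        w = av_w.get(lit, 1.0)
        P_new = w * sum(tuple_w[id(t)] for t in p_set if satisfies(t, lit))
        N_new = w * sum(1.0 for t in n_set if satisfies(t, lit))
        if P_new == 0:
            continue
        key = (log(P_new / (P_new + N_new)) - log(P / (P + N)), P_new)
        if best_key is None or key > best_key:
            best, best_key = lit, key
    return best

def catw(pos, neg, tuple_factor=2/3, av_factor=0.5, max_literals=5):
    """Generate a rule set for the positive class (factors are illustrative)."""
    rules, av_w = [], {}
    tuple_w = {id(t): 1.0 for t in pos}
    end_threshold = tuple_factor ** 3
    remaining = list(pos)
    while remaining:
        p_set, n_set, antecedent = list(remaining), list(neg), []
        while n_set and len(antecedent) < max_literals:
            lit = best_literal(p_set, n_set, tuple_w, av_w, antecedent)
            if lit is None:
                break
            antecedent.append(lit)
            p_set = [t for t in p_set if satisfies(t, lit)]
            n_set = [t for t in n_set if satisfies(t, lit)]
        if not antecedent:
            break
        rules.append(antecedent)
        for lit in antecedent:                # decay attribute value weights
            av_w[lit] = av_w.get(lit, 1.0) * av_factor
        for t in list(remaining):             # decay covered tuple weights
            if all(satisfies(t, l) for l in antecedent):
                tuple_w[id(t)] *= tuple_factor
                if tuple_w[id(t)] < end_threshold:
                    remaining.remove(t)
    return rules
```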

3. Classification of CATW

Before making any prediction, we use the Laplace expected error estimate [23] to evaluate the quality of rules. It is defined as
$$\text{Laplace accuracy} = \frac{n_c + 1}{n_{\text{tot}} + k},$$
where $k$ is the number of classes and $n_{\text{tot}}$ is the total number of examples satisfying the antecedent of the rule, among which $n_c$ examples belong to $c$, the predicted class of the rule.
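In code, the estimate is a direct transcription of the formula above.

```python
def laplace_accuracy(n_c, n_total, k):
    """Laplace expected accuracy of a rule: n_c examples of the rule's
    predicted class among the n_total examples that satisfy its
    antecedent, with k classes in the dataset."""
    return (n_c + 1) / (n_total + k)
```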

When using the rules to predict the class label of an unknown instance, we consider all rules whose antecedents the instance satisfies. If all these rules have the same consequent, we assign that class label to the instance. If the best rules cover several classes, we calculate the average Laplace accuracy of each class, select the class label with the highest average value, and assign it to the instance.
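A minimal sketch of this prediction procedure follows; the rule representation (antecedent, class label, precomputed Laplace accuracy) is an assumption carried over from the earlier sketches.

```python
def predict(instance, rules):
    """Predict a class label from the rules the instance satisfies.
    Each rule is (antecedent, class_label, laplace_acc), where the
    antecedent is a list of (attribute index, value) literals."""
    matched = [(label, acc) for antecedent, label, acc in rules
               if all(instance[a] == v for a, v in antecedent)]
    if not matched:
        return None                     # no rule fires on this instance
    by_class = {}
    for label, acc in matched:
        by_class.setdefault(label, []).append(acc)
    # the class with the highest average Laplace accuracy wins
    return max(by_class, key=lambda c: sum(by_class[c]) / len(by_class[c]))
```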

4. Experimental Results

All experiments are performed on datasets from the UCI data collection and were conducted using stratified tenfold cross-validation. In cross-validation, the dataset is divided into 10 blocks; each block is held out once while the classifier is trained on the remaining blocks. The characteristics of each dataset are shown in Table 3. We perform our experiments on a 2.2 GHz PC with 2 GB of memory, running Microsoft Windows XP.
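For reproducibility, stratified tenfold cross-validation can be set up as below; scikit-learn is used only for the splitting, and `train_and_score` is an assumed callback standing in for training and evaluating the classifier on one fold.

```python
from sklearn.model_selection import StratifiedKFold

def cross_validate(X, y, train_and_score, n_splits=10):
    """Average accuracy over stratified folds. `train_and_score` trains
    on the train split and returns the accuracy on the test split."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        scores.append(train_and_score(
            [X[i] for i in train_idx], [y[i] for i in train_idx],
            [X[i] for i in test_idx], [y[i] for i in test_idx]))
    return sum(scores) / len(scores)
```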

In Tables 4 and 5, Column 1 shows the accuracy of FOIL, Column 2 the accuracy of CMAR, and Column 3 the accuracy of CPAR. Column 4 shows the accuracy of CATW without attribute value weight, using tuple weight only. Columns 5 and 6 show the accuracy of CATW with both attribute value weight and tuple weight, under two different attribute value weight settings.

In Table 4, we use the improved FOIL measure. Figure 1 plots the accuracy of FOIL, CMAR, CPAR, and CATW based on Table 4. Here CATW uses both attribute value weight and tuple weight and employs the improved FOIL measure. From Figure 1 and Table 4, we can see that CATW achieves higher accuracy than FOIL, CMAR, and CPAR.

In Table 5, we use the improved correlation measure. Figure 2 plots the accuracy of FOIL, CMAR, CPAR, and CATW based on Table 5. Here CATW uses both attribute value weight and tuple weight and employs the improved correlation measure. From Figure 2 and Table 5, we can see that CATW with the improved correlation measure also achieves higher accuracy than FOIL, CMAR, and CPAR.

By comparison, the accuracy of CATW with the improved correlation measure is higher than that of CATW with the improved FOIL measure. From Tables 4 and 5, we can see that it is worthwhile to use the improved correlation measure.

Tables 6 and 7 display the accuracy of CATW under different attribute value weights, using the improved FOIL measure and the improved correlation measure, respectively. The results of the two tables indicate that the improved correlation measure yields higher accuracy than the improved FOIL measure and that different attribute value weights have different influences on classification accuracy.

From all the above experimental results, we can conclude that it is necessary to use both attribute value weight and tuple weight, that the improved correlation measure should be used, and that different attribute value weights have different influences on classification accuracy.

5. Conclusions and Future Work

With the rapid development of information technology and the popularity of cloud computing, it is necessary to mine useful information from massive data. Traditional classification methods frequently adopt one of two strategies: one removes an instance after it is covered by a rule, without using tuple weight; the other only decreases the tuple weight of an instance after it is covered by a rule. As a result, they cannot achieve high classification accuracy on some datasets. In this paper, we present a novel approach, CATW. First, CATW uses both attribute value weight and tuple weight. Second, CATW proposes a new measure, the improved correlation measure, and employs it to select the best attribute values and generate a high quality classification rule set. The results of our experiments indicate that CATW generates a reasonable number of classification rules and achieves high classification accuracy. Our experiments also show that different attribute value weights have different influences on classification accuracy. At present, we have not found a regular pattern for selecting an optimal attribute value weight, and future research will focus on this question. We will also combine distributed data mining with a cloud computing platform in order to improve the efficiency of CATW.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work is funded by the China NSF Program (no. 61170129) and by the Fujian Province NSF Program (no. 2013J01259).