Mathematical Problems in Engineering
Volume 2013 (2013), Article ID 436368, 7 pages
http://dx.doi.org/10.1155/2013/436368
Research Article

Classification Based on both Attribute Value Weight and Tuple Weight under the Cloud Computing

Department of Computer Science and Engineering, Minnan Normal University, Zhangzhou 363000, China

Received 17 July 2013; Accepted 3 September 2013

Academic Editor: Yuxin Mao

Copyright © 2013 Yifeng Zheng et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

In recent years, cloud computing has attracted growing attention. Users need to process massive amounts of data in the cloud computing environment, and classification can predict user needs from such large datasets. Traditional classification methods usually adopt one of two strategies: either an instance is removed after it is covered by a rule, or its tuple weight is decreased after it is covered by a rule. The resulting rule sets may be of limited quality, so these classifiers cannot achieve high classification accuracy on some datasets. In this paper, we present a new classification approach, called classification based on both attribute value weight and tuple weight (CATW). CATW differs from traditional classifiers in two respects. First, CATW uses both attribute value weight and tuple weight. Second, CATW proposes a new measure to select the best attribute values and generate a high-quality classification rule set. Our experimental results indicate that CATW can achieve higher classification accuracy than some traditional classifiers.

1. Introduction

Cloud computing has become a hot topic in recent years. With the rapid development of information technology and the popularity of cloud computing, it is necessary to mine useful information from massive data [1–9]. Classification is one of the most important tasks in data mining and machine learning. It can predict user needs from large datasets in two steps. First, it builds classification rules from a training dataset. Second, it uses these rules to predict the class labels of new instances.

Traditional classifiers [10–19] usually adopt one of two strategies. Some remove an instance after it is covered by a rule, such as FOIL [20] and ELEM2 [21]. Others decrease the tuple weight of an instance after it is covered by a rule, such as PRM and CPAR [22]. We briefly review these classifiers. When extracting rules, FOIL uses a gain measure to select the best attribute value and generates one classification rule at a time. It removes an instance after it is covered by a rule. As a result, it generates a small rule set and cannot achieve high accuracy on some datasets. ELEM2 uses a different measure to generate classification rules: it considers the degree of relevance of each attribute-value pair and selects the most relevant pairs. It also removes an instance after it is covered by a rule. PRM modifies FOIL to achieve higher accuracy. Instead of removing an instance when it is covered by a rule, PRM gives the instance a tuple weight, which allows each instance to be covered more than once. PRM selects only the attribute value with the best gain when generating a rule. CPAR stands between exhaustive and greedy algorithms and combines the advantages of both. It selects several of the best attribute values and builds several rules at one time. It does not remove an instance immediately when it is covered by a rule; like PRM, it uses tuple weight so that each instance can be covered more than once. None of these methods employs attribute value weight, so they cannot obtain a high-quality classification rule set. As a result, they cannot achieve high classification accuracy on some datasets.

In this paper, we propose a new algorithm, named classification based on both attribute value weight and tuple weight (CATW). CATW uses both attribute value weight and tuple weight. Moreover, CATW uses a new measure to improve the quality of the classification rule set. Our method has the following advantages.
(1) After an instance is covered by a rule, instead of removing it, its weight is decreased by multiplying it by a factor. Thus, each instance can be covered more than once.
(2) Tuple weight alone cannot change the importance of an attribute-value pair in the dataset. Therefore, CATW uses attribute value weight to reduce the importance of an attribute-value pair after a rule is generated. This increases the chance of selecting other good attribute-value pairs, so more high-quality rules can be generated.
(3) CATW presents a new measure to select the best attribute value. It uses two different measures: support and correlation confidence. If two attribute-value pairs have the same correlation confidence, CATW compares their support.

Experimental results indicate the following: if an instance is removed immediately after it is covered by a rule, the classifier generates a very small number of rules; if the classifier uses only tuple weight, the quality of the classification rule set is poor. Because CATW uses both attribute value weight and tuple weight, it achieves high classification accuracy.

The outline of this paper is as follows. Section 2 presents the details of CATW and describes the process of rule generation in CATW. Section 3 discusses how to predict class label using the rules. The experimental results are presented in Section 4. Finally, we conclude the study in Section 5.

2. Rule Generation of CATW

The CATW algorithm has three distinctive components: the attribute value weight, the tuple weight, and the improved measure. First, we describe how tuple weight is used. Second, we introduce the use of attribute value weight. Third, we propose a new measure to generate a high-quality classification rule set. Finally, we show the whole process of rule set generation.

Let $T$ be a set of tuples. Each tuple has $m$ attributes $A_1, A_2, \ldots, A_m$. Let $C = \{c_1, c_2, \ldots, c_k\}$ be a set of class labels, where $k$ is the number of class labels.

Definition 1 (a literal). A literal $p$ is an attribute-value pair of the form $(A_i, v)$, where $A_i$ is an attribute and $v$ is a value of attribute $A_i$.

Definition 2 (a classification rule). $r$ is called a classification rule if it consists of a conjunction of literals $p_1, p_2, \ldots, p_l$ with the form $p_1 \wedge p_2 \wedge \cdots \wedge p_l \rightarrow c$, where $c$ is a class label.

A tuple $t$ satisfies the antecedent of $r$ if and only if it has all literals in $r$. If $t$ satisfies the antecedent of $r$, $r$ predicts that $t$ has class label $c$.
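As an illustration of these definitions, the following minimal Python sketch encodes a literal as an (attribute, value) pair and a rule as a set of literals plus a class label. The `Rule` class, the dictionary encoding of tuples, and the attribute names are assumptions made for this example only; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

# A literal is an (attribute, value) pair; a tuple is a dict attribute -> value.
Literal = Tuple[str, str]


@dataclass(frozen=True)
class Rule:
    """Hypothetical container for a classification rule: literals -> label."""
    antecedent: FrozenSet[Literal]
    label: str

    def covers(self, row: Dict[str, str]) -> bool:
        # A tuple satisfies the antecedent iff it has every literal in it.
        return all(row.get(attr) == value for attr, value in self.antecedent)


# Made-up attribute names and values, for illustration only.
rule = Rule(frozenset({("outlook", "sunny"), ("windy", "no")}), "play")
row_a = {"outlook": "sunny", "windy": "no", "humidity": "high"}
row_b = {"outlook": "rainy", "windy": "no", "humidity": "high"}

print(rule.covers(row_a), rule.covers(row_b))
```

Here `row_a` contains both literals of the rule, so the rule predicts its label for that tuple, whereas `row_b` does not satisfy the antecedent.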

2.1. The Tuple Weight

In traditional classification, all rules are generated from the training database. If a tuple $t$ is covered by a rule $r$, these classifiers cannot ensure that $r$ is the best rule for $t$. If $r$ is generated from the remaining dataset instead of the whole dataset [22], $r$ may not be the best rule. To improve the classification accuracy and increase the number of rules, some traditional classifiers use tuple weight, which lets them delay the removal of an instance after it is covered by a rule. In our algorithm, after a tuple is covered by a rule, instead of removing it, its weight is decreased by multiplying it by a factor. We set a threshold for tuple weight: when the weight of a tuple falls below the threshold, we remove the tuple from the training data. In this way, CATW produces more rules, and each tuple can be covered by classification rules more than once.

In our approach, we set an initial threshold and an end threshold, which lets us limit the number of generated rules according to the actual situation. A small end threshold yields a large number of rules; conversely, a large end threshold yields fewer rules. In our experiment, we set an initial threshold and a weight factor, and we set the end threshold to the third power of the weight factor, so that each instance can be covered three times.
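The weight update described in this subsection can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the dictionary layout, and the numeric values (a factor of 0.5 and a starting weight of 1.0) are assumptions, and the removal test is taken literally from the text as "weight less than the threshold".

```python
def apply_rule_coverage(weights, covered, factor, end_threshold):
    """Decay the weight of every covered tuple and drop tuples below the threshold.

    weights: dict mapping tuple id -> current tuple weight
    covered: ids of the tuples covered by the rule that was just generated
    """
    updated = {}
    for tid, w in weights.items():
        if tid in covered:
            w *= factor  # decrease the weight instead of removing the tuple
        if w >= end_threshold:  # a tuple is dropped only once its weight < threshold
            updated[tid] = w
    return updated


# Illustrative values only; the settings used in the paper are not reproduced here.
factor = 0.5
end_threshold = factor ** 3
weights = {"t1": 1.0, "t2": 1.0}
weights = apply_rule_coverage(weights, {"t1"}, factor, end_threshold)
print(weights)
```

After one covering rule, `t1` stays in the training data with a reduced weight, so later rules can still cover it; it disappears only after enough covering rules push its weight below the end threshold.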

2.2. The Attribute Value Weight

Some traditional classifiers use only tuple weight and do not change the importance of an attribute-value pair in the training data. After a rule is generated, these classifiers may select the same attribute-value pair again. Thus, they may miss some high-quality rules that could improve the classification accuracy. CATW uses attribute value weight to reduce the importance of an attribute-value pair after a rule is generated. When a tuple is covered by a rule, our algorithm reduces the importance of the attribute-value pairs that are contained in it. In this way, we increase the chance of selecting another good attribute-value pair.

Example 3. The training dataset in Table 1 has two classes. We use it to demonstrate how attribute value weight is applied.

Table 1: The training dataset.

Suppose that a rule has just been generated. We then set a weight factor for the positive examples. After a rule is generated, CATW uses the weight factor to reduce the importance of all attribute values that are contained in the antecedent of the rule in the positive examples. The result is shown in Table 2.

Table 2: Attribute value weight in positive examples.

The results of our experiment indicate that classification accuracy is influenced by attribute value weight. Compared with classifiers that do not use attribute value weight, CATW achieves higher classification accuracy on some datasets. Thus, attribute value weight helps to improve the quality of classification rules.
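The attribute value weight update described in this subsection might be sketched as below. The function name, the per-tuple weight layout, the toy data, and the factor of 0.5 are all assumptions for illustration; in particular, the sketch assumes that the weight is reduced in every positive example that contains a literal of the rule, which is one possible reading of the text.

```python
def decay_attribute_value_weights(av_weights, rows, labels,
                                  rule_literals, rule_label, factor):
    """Scale down the weights of a rule's attribute values in positive examples.

    av_weights[i][attr] holds the weight of attribute `attr` in tuple i.
    Assumption: only tuples labelled `rule_label` that contain the literal
    are updated; negative examples are left unchanged.
    """
    for i, (row, label) in enumerate(zip(rows, labels)):
        if label != rule_label:
            continue  # negative example: keep all weights
        for attr, value in rule_literals:
            if row.get(attr) == value:
                av_weights[i][attr] *= factor
    return av_weights


# Hypothetical toy data (not Table 1 of the paper).
rows = [{"a": "x", "b": "y"}, {"a": "x", "b": "z"}, {"a": "x", "b": "y"}]
labels = ["pos", "pos", "neg"]
av_weights = [{"a": 1.0, "b": 1.0} for _ in rows]
decay_attribute_value_weights(av_weights, rows, labels,
                              [("a", "x"), ("b", "y")], "pos", 0.5)
print(av_weights)
```

Because the weights of the values used by the rule shrink in the positive examples, a later search for the best literal is less likely to pick the same attribute-value pair again.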

2.3. The Measure of CATW

Some classifiers use FOIL gain to select literals. FOIL gain measures the information gained from adding a literal $p$ to the current rule $r$. Suppose that $|P|$ is the number of positive examples that satisfy the antecedent of the current rule $r$ and $|N|$ is the number of negative examples that satisfy it. After literal $p$ is added to $r$, $|P^*|$ is the number of positive examples that satisfy the antecedent of the new rule, and $|N^*|$ is the number of negative examples that satisfy it [22]. The FOIL gain of $p$ is defined as

$$\mathrm{gain}(p) = |P^*| \left( \log \frac{|P^*|}{|P^*| + |N^*|} - \log \frac{|P|}{|P| + |N|} \right).$$
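For concreteness, the FOIL gain as it is commonly stated for CPAR [22] can be computed with the short sketch below. The function and parameter names are assumptions, and the base-2 logarithm is a choice made here; the base only rescales the gain and does not change which literal ranks highest.

```python
import math


def foil_gain(p, n, p_star, n_star):
    """FOIL gain of a literal.

    p, n: positive/negative examples covered by the current rule
    p_star, n_star: positive/negative examples covered after adding the literal
    """
    if p_star == 0:
        return 0.0  # the literal covers no positive example
    before = math.log2(p / (p + n))
    after = math.log2(p_star / (p_star + n_star))
    return p_star * (after - before)


# Toy counts: the rule covers 4 positives and 4 negatives; the literal keeps
# 2 positives and excludes every negative.
print(foil_gain(4, 4, 2, 0))  # → 2.0
```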

In our experiment, we employ two different improved measures.

2.3.1. Improved FOIL Measure

In our experiment, $|P|$ is the total tuple weight of the positive examples that satisfy the antecedent of the current rule $r$, and $|N|$ is the total tuple weight of the negative examples that satisfy the antecedent of the current rule $r$. After literal $p$ is added to $r$, $|P^*|$ is the total attribute value weight of literal $p$ in the positive examples, and $|N^*|$ is the total attribute value weight of literal $p$ in the negative examples. Therefore, CATW uses both tuple weight and attribute value weight when it measures literal $p$. We call this measure an improved FOIL measure.
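A minimal sketch of this weighted variant is given below. It assumes that the counts in the FOIL gain are simply replaced by sums of weights, as the paragraph above describes; the function name, the list-based inputs, and the numeric weights are hypothetical.

```python
import math


def improved_foil_gain(tuple_w_pos, tuple_w_neg, av_w_pos, av_w_neg):
    """FOIL-style gain computed from weight totals instead of example counts.

    tuple_w_pos / tuple_w_neg: tuple weights of the positive / negative
        examples that satisfy the current rule
    av_w_pos / av_w_neg: attribute value weights of the candidate literal
        in the positive / negative examples
    """
    p, n = sum(tuple_w_pos), sum(tuple_w_neg)
    p_star, n_star = sum(av_w_pos), sum(av_w_neg)
    if p == 0 or p_star == 0:
        return 0.0  # nothing positive left to gain from
    before = math.log2(p / (p + n))
    after = math.log2(p_star / (p_star + n_star))
    return p_star * (after - before)


# Illustrative weights only.
gain = improved_foil_gain([1.0, 0.5, 0.5], [1.0, 1.0], [0.5, 0.5], [])
print(gain)  # → 1.0
```

A literal whose attribute value weights have already been reduced contributes a smaller total, so its gain drops relative to literals that have not yet been used.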

2.3.2. Improved Correlation Measure

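In the traditional FOIL gain, $|P^*|$ has a large influence on the selection of the best attribute value. For example, if the logarithmic term is small but $|P^*|$ is large, the literal with the highest gain is not necessarily the best one for the rule. We therefore use two separate measures, support and correlation confidence, obtained by dividing the traditional FOIL measure into two parts.
(1) PART I (support): $|P^*|$.
(2) PART II (correlation confidence): $\log \frac{|P^*|}{|P^*| + |N^*|} - \log \frac{|P|}{|P| + |N|}$.

When we select a literal, a global order over literals is defined. Given two literals $p_1$ and $p_2$, $p_1$ is better than $p_2$, denoted as $p_1 \succ p_2$, if and only if (1) PART II of $p_1$ is greater than PART II of $p_2$, or (2) PART II of $p_1$ equals PART II of $p_2$ and PART I of $p_1$ is greater than PART I of $p_2$. We call this measure an improved correlation measure.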
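The ordering used by the improved correlation measure, correlation confidence first and support as the tie-breaker, maps naturally onto lexicographic tuple comparison. The sketch below is an illustration under assumptions: the function names, the candidate labels, and the counts are made up, and base-2 logarithms are used.

```python
import math


def correlation_confidence(p, n, p_star, n_star):
    """Logarithmic part of the FOIL gain (without the support factor)."""
    if p_star == 0:
        return float("-inf")  # a literal with no positive coverage ranks last
    return math.log2(p_star / (p_star + n_star)) - math.log2(p / (p + n))


def literal_key(p, n, p_star, n_star):
    # Compare correlation confidence first; break ties by support.
    return (correlation_confidence(p, n, p_star, n_star), p_star)


# Hypothetical candidates: literal -> (p_star, n_star); current rule has p = n = 8.
candidates = {"A": (3, 1), "B": (6, 2), "C": (2, 2)}
best = max(candidates, key=lambda name: literal_key(8, 8, *candidates[name]))
print(best)  # → B
```

Literals A and B have the same correlation confidence, so the larger support of B decides the comparison, while C ranks below both.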

2.4. Algorithm of CATW

In this part, we will introduce our algorithm in detail. The CATW algorithm is presented in Algorithm 1.

Algorithm 1: Classification based on both attribute value weight and tuple weight (CATW).

3. Classification of CATW

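Before making any prediction, we use the Laplace expected error estimate [23] to evaluate the quality of rules. The Laplace accuracy of a rule is defined as

$$\mathrm{LaplaceAccuracy} = \frac{n_c + 1}{n_{tot} + k},$$

where $k$ is the number of classes and $n_{tot}$ is the total number of examples satisfying the antecedent of the rule, among which $n_c$ examples belong to $c$, the class predicted by the rule.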
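As a small worked example, the Laplace accuracy in the form commonly used with CPAR-style rule classifiers, (n_c + 1) / (n_tot + k), can be computed as follows. The function and argument names are assumptions chosen for readability.

```python
def laplace_accuracy(n_c, n_tot, k):
    """Laplace-corrected accuracy of a rule.

    n_c: examples covered by the rule that belong to the predicted class
    n_tot: all examples covered by the rule
    k: number of classes
    """
    return (n_c + 1) / (n_tot + k)


# A rule covering 10 examples, 8 of them in the predicted class, 2 classes.
print(laplace_accuracy(8, 10, 2))  # → 0.75
```

The correction pulls the estimate of rules with very small coverage toward 1/k, so a rule that covers a single example is not rated as perfectly accurate.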

When using rules to predict the class label of an unknown instance, we use several of the best rules matched by the instance. If all of these rules have the same consequent, we assign that label to the instance. If the rules predict different classes, we calculate the average Laplace accuracy of the rules for each class. Then, we select the class label with the highest average value and assign it to the instance.
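The prediction step can be sketched as below. The input format, a list of (class label, Laplace accuracy) pairs for the rules already selected for an instance, and the function name are assumptions; how the best matching rules are chosen is outside this sketch.

```python
from collections import defaultdict


def predict(matched_rules):
    """Pick the class whose matched rules have the highest mean Laplace accuracy.

    matched_rules: list of (class_label, laplace_accuracy) pairs for the
    rules whose antecedent the instance satisfies.
    """
    if not matched_rules:
        return None  # no rule matches; a default class would be needed
    by_class = defaultdict(list)
    for label, accuracy in matched_rules:
        by_class[label].append(accuracy)
    return max(by_class, key=lambda c: sum(by_class[c]) / len(by_class[c]))


# Hypothetical rule accuracies for one instance.
print(predict([("yes", 0.9), ("yes", 0.5), ("no", 0.8)]))  # → no
```

When every matched rule has the same consequent, only one class is present and it is returned directly, which matches the first case described above.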

4. Experimental Results

All experiments are performed on datasets from the UCI data collection. All results were obtained using stratified tenfold cross-validation: the dataset is divided into 10 blocks, each block is held out once, and the classifier is trained on the remaining blocks. The characteristics of each dataset are shown in Table 3. We performed our experiments on a 2.2 GHz PC with 2 GB of memory, running Microsoft Windows XP.

Table 3: Characteristics of UCI datasets.
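For readers who want to reproduce the evaluation protocol, the following is a minimal sketch of stratified tenfold splitting in plain Python. It is not the authors' experimental code: the function name, the seed, and the toy label list are assumptions, and each class is simply dealt round-robin into the folds after shuffling.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split example indices into k folds with similar class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal each class round-robin
    return folds


# Toy dataset: 30 examples of class "a" and 20 of class "b".
labels = ["a"] * 30 + ["b"] * 20
folds = stratified_folds(labels)
splits = []
for held_out in folds:
    test_idx = set(held_out)
    train_idx = [i for i in range(len(labels)) if i not in test_idx]
    splits.append((train_idx, sorted(test_idx)))  # train on 9 blocks, test on 1
```

Each of the ten blocks is held out exactly once, and the reported accuracy would be the average over the ten train/test pairs in `splits`.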

In Tables 4 and 5, Column 1 shows the accuracy of FOIL, Column 2 the accuracy of CMAR, and Column 3 the accuracy of CPAR. Column 4 shows the accuracy of CATW with tuple weight but without attribute value weight. Columns 5 and 6 show the accuracy of CATW with both attribute value weight and tuple weight, under two different settings of these weights.

Table 4: The accuracy of CATW with improved FOIL gain measure.
Table 5: The accuracy of CATW with improved correlation measure.

Table 4 reports the results obtained with the improved FOIL measure, and Figure 1 plots the accuracy of FOIL, CMAR, CPAR, and CATW based on Table 4. Here CATW uses both attribute value weight and tuple weight and employs the improved FOIL measure. Figure 1 and Table 4 show that CATW achieves higher accuracy than FOIL, CMAR, and CPAR.

Figure 1: The accuracy of CATW with improved FOIL gain measure.

Table 5 reports the results obtained with the improved correlation measure, and Figure 2 plots the accuracy of FOIL, CMAR, CPAR, and CATW based on Table 5. Here CATW uses both attribute value weight and tuple weight and employs the improved correlation measure. Figure 2 and Table 5 show that CATW with the improved correlation measure also achieves higher accuracy than FOIL, CMAR, and CPAR.

Figure 2: The accuracy of CATW with improved correlation measure.

By comparison, the accuracy of CATW with the improved correlation measure is higher than that of CATW with the improved FOIL measure. Tables 4 and 5 therefore indicate that it is necessary to use the improved correlation measure.

Tables 6 and 7 display the accuracy of CATW under different attribute value weights. In Table 6, CATW employs the improved FOIL measure; in Table 7, it employs the improved correlation measure. The two tables indicate that the improved correlation measure yields higher accuracy than the improved FOIL measure and that different attribute value weights affect classification accuracy differently.

Table 6: Comparing different attribute value weights with improved FOIL gain measure.
Table 7: Comparing different attribute value weights with improved correlation measure.

From all of the above results, we conclude that it is necessary to use both attribute value weight and tuple weight, that it is necessary to use the improved correlation measure, and that different attribute value weights affect classification accuracy differently.

5. Conclusions and Future Work

With the rapid development of information technology and the popularity of cloud computing, it is necessary to mine useful information from massive data. Traditional classification methods usually adopt one of two strategies. One removes an instance after it is covered by a rule, without using tuple weight. The other only decreases the tuple weight of an instance after it is covered by a rule. As a result, they cannot achieve high classification accuracy on some datasets. In this paper, we present a novel approach, CATW. First, CATW uses both attribute value weight and tuple weight. Second, CATW proposes a new measure, the improved correlation measure, which it employs to select the best attribute values and generate a high-quality classification rule set. The results of our experiment indicate that CATW can generate a reasonable number of classification rules and achieve high classification accuracy. Our experiment also shows that different attribute value weights have different effects on classification accuracy. At present, we have not found a regular pattern for selecting an optimal attribute value weight; we will focus on this in future research. We also plan to combine distributed data mining with a cloud computing platform in order to improve the efficiency of CATW.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work is funded by the China NSF Program (no. 61170129) and by the Fujian Province NSF Program (no. 2013J01259).

References

  1. K. Lal and N. C. Mahanti, “A novel data mining algorithm for semantic web based data cloud,” International Journal of Computer Science and Security, vol. 4, no. 2, pp. 160–175, 2010.
  2. S. Adapa, M. Kalyan Srinivas, and A. V. R. K. Harsha Vardhan Varma, “A study on cloud computing data mining,” International Journal of Innovative Research in Computer and Communication Engineering, vol. 1, no. 5, pp. 1232–1237, 2013.
  3. Z. Qureshi, J. Bansal, and S. Bansal, “A survey on association rule mining in cloud computing,” International Journal of Emerging Technology and Advanced Engineering, vol. 3, no. 4, pp. 318–321, 2013.
  4. J. Ding and S. Yang, “Classification rules mining model with genetic algorithm in cloud computing,” International Journal of Computer Applications, vol. 48, no. 18, pp. 24–32, 2012.
  5. L. Hu, Z. Zhang, F. Wang, and K. Zhao, “Optimization of the deployment of temperature nodes based on linear programing in the internet of things,” Tsinghua Science and Technology, vol. 18, no. 3, pp. 250–258, 2013.
  6. S. Gond, A. Patil, and V. B. Nikam, “A survey on parallelization of data mining techniques,” International Journal of Engineering Research and Applications, vol. 3, no. 4, pp. 520–526, 2013.
  7. N. Mishra, S. Sharma, and A. Pandey, “High performance cloud data mining algorithm and data mining in clouds,” IOSR Journal of Computer Engineering, vol. 8, no. 4, pp. 54–61, 2013.
  8. A. Pareek and M. Gupta, “Review of data mining techniques in cloud computing database,” International Journal of Advanced Computer Research, vol. 2, no. 2, pp. 52–55, 2012.
  9. R.-Ş. Petre, “Data mining in cloud computing,” Database Systems Journal, vol. 3, no. 3, pp. 67–71, 2012.
  10. Y. Jiao, “Research of an improved apriori algorithm in data mining association rules,” in Proceedings of the IEEE International Conference on Information Theory and Information Security (ICITIS '11), November 2011.
  11. F. Thabtah, P. Cowling, and Y. Peng, “MCAR: multi-class classification based on association rule,” in Proceedings of the 3rd ACS/IEEE International Conference on Computer Systems and Applications, pp. 127–133, January 2005.
  12. G. Dong, X. Zhang, L. Wong, and J. Li, “CAEP: classification by aggregating emerging patterns,” Discovery Science, vol. 1721, pp. 30–42, 1999.
  13. J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation,” in Proceedings of the ACM SIGMOD international Conference on Management of Data (SIGMOD '00), pp. 1–12, 2000.
  14. W. Li, J. Han, and J. Pei, “CMAR: accurate and efficient classification based on multiple class-association rules,” in Proceedings of the 1st IEEE International Conference on Data Mining (ICDM '01), pp. 369–376, San Jose, Calif, USA, November 2001.
  15. F. A. Thabtah and P. I. Cowling, “A greedy classification algorithm based on association rule,” Applied Soft Computing Journal, vol. 7, no. 3, pp. 1102–1111, 2007.
  16. B. Liu, W. Hsu, and Y. Ma, “Integrating classification and association rule mining,” in Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD '98), pp. 80–86, New York, NY, USA, August 1998.
  17. X. Wang, Z. Zhou, and G. Pan, “CMER: classification based on multiple excellent rules,” Journal of Theoretical and Applied Information Technology, vol. 48, pp. 661–665, 2013.
  18. G. Chen, H. Liu, L. Yu, Q. Wei, and X. Zhang, “A new approach to classification based on association rule mining,” Journal of Decision Support Systems, vol. 42, no. 2, pp. 674–689, 2006.
  19. P. Leng and F. Coenen, “The effect of threshold values on association rule based classification accuracy,” Journal of Data and Knowledge Engineering, vol. 60, no. 2, pp. 345–360, 2007.
  20. J. Ross Quinlan and R. Mike Cameron-Jones, “FOIL: a midterm report,” in Proceedings of the European Conference Machine Learning, pp. 3–20, Vienna, Austria, 1993.
  21. A. An, “Learning classification rules from data,” Computers & Mathematics with Applications, vol. 45, no. 4-5, pp. 737–748, 2003.
  22. X. Yin and J. Han, “CPAR: classification based on predictive association rules,” in Proceedings of the Society for Industrial and Applied Mathematics (SIAM) International Conference on Data Mining, May 2003.
  23. P. Clark and R. Boswell, “Rule induction with CN2: some recent improvements,” in Proceedings of European Working Session on Learning (EWSL '91), pp. 151–163, Porto, Portugal, March 1991.