Abstract

In recent years, cloud computing has attracted increasing attention. Users need to deal with massive amounts of data in the cloud computing environment. Classification can predict the needs of users from such large data. Traditional classification methods frequently adopt one of two strategies: one removes an instance after it is covered by a rule; the other decreases the tuple weight of an instance after it is covered by a rule. The rule sets produced by these traditional classifiers may be of low quality, so they cannot achieve high classification accuracy on some datasets. In this paper, we present a new classification approach, called classification based on both attribute value weight and tuple weight (CATW). CATW is distinguished from traditional classifiers in two aspects. First, CATW uses both attribute value weight and tuple weight. Second, CATW proposes a new measure to select the best attribute values and generate a high quality classification rule set. Our experimental results indicate that CATW can achieve higher classification accuracy than some traditional classifiers.

1. Introduction

Cloud computing has become a hot topic in recent years. With the rapid development of information technology and the popularity of cloud computing, it is necessary to mine useful information from massive data [19]. Classification is one of the most important tasks in data mining and machine learning. Classification can predict the needs of users from large data. First, it builds classification rules from a training dataset. Second, it uses these rules to predict the class labels of new instances.

Traditional classifiers [10–19] frequently adopt one of two strategies. Some remove an instance after it is covered by a rule, such as FOIL [20] and ELEM2 [21]. Others decrease the tuple weight of an instance after it is covered by a rule, such as PRM and CPAR [22]. We now introduce the features of these classifiers. In the process of extracting rules, FOIL uses the gain measure to select the best attribute value and generates one classification rule at a time. It removes an instance after it is covered by a rule. As a result, this method is ineffective: it generates a small rule set and cannot achieve high accuracy on some datasets. ELEM2 uses another measure to generate classification rules and also removes an instance after it is covered by a rule. ELEM2 considers the degree of relevance of an attribute-value pair and selects the most relevant pairs to generate rules. PRM modifies FOIL to achieve higher accuracy. PRM does not remove an instance when it is covered by a rule; instead, it assigns the instance a tuple weight. Thus, PRM can ensure that each instance is covered more than once. PRM selects only the attribute value with the best gain to generate a rule. CPAR stands in the middle between exhaustive and greedy algorithms and combines the advantages of both. CPAR selects several of the best attribute values and builds several rules at one time. It does not remove an instance immediately when it is covered by a rule, and it also uses tuple weight to guarantee that each instance can be covered more than once. However, none of these methods employs attribute value weight, so they cannot obtain a high quality classification rule set. As a result, they cannot achieve high classification accuracy on some datasets.

In this paper, we propose a new algorithm, named classification based on both attribute value weight and tuple weight (CATW). CATW uses both attribute value weight and tuple weight. Moreover, CATW uses a new measure to improve the quality of the classification rule set. Our method has the following advantages. (1) After an instance is covered by a rule, instead of removing it, its weight is decreased by multiplying it by a factor. Thus, we can guarantee that each instance can be covered more than once. (2) If we only use tuple weight, we cannot change the importance of an attribute-value pair in the dataset. Therefore, CATW uses attribute value weight to reduce the importance of attribute-value pairs after a rule is generated. In this way, CATW increases the chances of attaining other optimal attribute-value pairs, so we can generate more high quality rules. (3) CATW presents a new measure to select the best attribute value. CATW uses two component measures: support and correlation confidence. If two different attribute-value pairs have the same correlation confidence, CATW considers their support.

Experimental results indicate the following: if an instance is removed immediately after it is covered by a rule, the classifier generates a very small number of rules; if the classifier uses only tuple weight, the quality of the classification rule set is not good. Since CATW uses both attribute value weight and tuple weight, it achieves high classification accuracy.

The outline of this paper is as follows. Section 2 presents the details of CATW and describes the process of rule generation in CATW. Section 3 discusses how to predict class label using the rules. The experimental results are presented in Section 4. Finally, we conclude the study in Section 5.

2. Rule Generation of CATW

The CATW algorithm has three distinctive points: the attribute value weight, the tuple weight, and the improved measure. First, we describe how to use tuple weight. Second, we introduce the use of attribute value weight. Third, we propose a new measure to generate a high quality classification rule set. Finally, we show the whole process of rule set generation.

Let $D$ be a set of tuples. Each tuple has $m$ attributes $A_1, A_2, \ldots, A_m$. Let $C = \{c_1, c_2, \ldots, c_k\}$ be the set of class labels, where $k$ is the number of class labels.

Definition 1 (a literal). A literal $p$ is an attribute-value pair of the form $(A_i, v)$, where $A_i$ is an attribute and $v$ is a value of attribute $A_i$.

Definition 2 (a classification rule). $R: P \rightarrow c$ is called a classification rule if $P$ consists of a conjunction of literals of the form $p_1 \wedge p_2 \wedge \cdots \wedge p_n$, where $c$ is a class label.

A tuple $t$ satisfies the antecedent of $R$ if and only if it has all the literals in $P$. If $t$ satisfies the antecedent of $R$, then $R$ predicts that $t$ has class label $c$.

2.1. The Tuple Weight

In traditional classification, all rules are generated from the training database. If a tuple $t$ is covered by a rule $R$, these methods cannot ensure that $R$ is the best rule for $t$: if $R$ is generated from the remaining dataset instead of the whole dataset [22], $R$ may not be the best rule. In order to improve classification accuracy and increase the number of rules, some traditional classifiers use tuple weight. By relying on tuple weight, these classifiers can delay removing an instance after it is covered by a rule. In our algorithm, after a tuple is covered by a rule, instead of removing it, its weight is decreased by multiplying it by a factor. We set a threshold for the tuple weight; when the weight of a tuple is less than the threshold, we remove the tuple from the training data. In this way CATW produces more rules, and each tuple can be covered by classification rules more than once.

In our approach, we can set an initial threshold and an end threshold, which allows us to limit the number of rules generated according to the actual situation. If we set a small end threshold, a large number of rules are generated; on the contrary, if we set a large end threshold, fewer rules are generated. In our experiments, we set an initial threshold and a weight factor, and the end threshold is the third power of the weight factor. This ensures that each instance can be covered three times.
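As an illustration of this mechanism, the sketch below decays the weight of every covered tuple and drops tuples whose weight falls below the end threshold. The concrete values of the initial weight, weight factor, and end threshold are illustrative assumptions, since the paper does not prescribe them here.

```python
# Minimal sketch of the tuple weight mechanism. The parameter values
# are illustrative assumptions, not values prescribed by the paper.
INITIAL_WEIGHT = 1.0
WEIGHT_FACTOR = 2.0 / 3.0               # applied each time a tuple is covered
END_THRESHOLD = WEIGHT_FACTOR ** 3      # third power: roughly three covers

def decay_tuple_weights(examples, weights, covers):
    """Decay the weight of every example covered by the new rule and
    drop examples whose weight has fallen below the end threshold.
    `covers(example)` tests whether the rule covers the example."""
    kept_examples, kept_weights = [], []
    for example, weight in zip(examples, weights):
        if covers(example):
            weight *= WEIGHT_FACTOR
        if weight >= END_THRESHOLD:
            kept_examples.append(example)
            kept_weights.append(weight)
    return kept_examples, kept_weights
```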

2.2. The Attribute Value Weight

Some traditional classifiers use only tuple weight. They do not change the importance of an attribute-value pair in the training data, so after a rule is generated, these classifiers may select a duplicate attribute-value pair. Thus, they may miss some high quality rules, which affects classification accuracy. CATW uses attribute value weight to reduce the importance of attribute-value pairs after a rule is generated. When a tuple is covered by a rule, our algorithm reduces the importance of the attribute-value pairs contained in it. In this way, we increase the chances of attaining other optimal attribute-value pairs.

Example 3. A training dataset with two classes is shown in Table 1. We now demonstrate how to use attribute value weight.

Suppose a rule $R$ has just been generated. We set a weight factor and an initial attribute value weight for the positive examples. After a rule is generated, CATW uses the weight factor to reduce the importance of all attribute values that are contained in the antecedent of the rule in the positive examples. The result is shown in Table 2.
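A minimal sketch of this update is shown below; the factor name `beta` and the dictionary representation of attribute value weights are assumptions made for illustration.

```python
# Sketch of the attribute value weight update after a rule is generated.
# `beta` and the dict layout are illustrative assumptions.
def decay_attribute_value_weights(av_weights, antecedent, beta=0.5):
    """Reduce the weight of every attribute-value pair that appears in
    the antecedent of the newly generated rule (positive examples).
    A literal is represented as an (attribute, value) pair."""
    for literal in antecedent:
        av_weights[literal] = av_weights.get(literal, 1.0) * beta
    return av_weights
```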

The results of our experiments indicate that classification accuracy is influenced by attribute value weight. Compared with classifiers that do not use attribute value weight, CATW can achieve higher classification accuracy on some datasets. Thus, attribute value weight helps improve the quality of the classification rules.

2.3. The Measure of CATW

Some classifiers use FOIL gain to select literals. FOIL gain measures the information gained by adding a literal to the current rule. Suppose $P$ is the number of positive examples and $N$ is the number of negative examples that satisfy the antecedent of the current rule $R$. After literal $p$ is added to $R$, let $P'$ be the number of positive examples and $N'$ the number of negative examples that satisfy the antecedent of the new rule [22]. The FOIL gain of $p$ is defined as
$$\mathrm{gain}(p) = P'\left(\log\frac{P'}{P'+N'} - \log\frac{P}{P+N}\right).$$
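For concreteness, the gain above can be computed as follows; this is a direct transcription of the formula, with variable names mirroring $P$, $N$, $P'$, $N'$.

```python
from math import log

def foil_gain(p, n, p_new, n_new):
    """FOIL gain of adding a literal: p/n are the positive/negative
    counts covered by the current rule R, and p_new/n_new are the
    counts after the literal is added."""
    if p_new == 0:
        return float("-inf")            # the literal covers no positives
    return p_new * (log(p_new / (p_new + n_new)) - log(p / (p + n)))
```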

In our experiments, we employ two different improved measures.

2.3.1. Improved FOIL Measure

In our experiments, $P$ denotes the total tuple weight of the positive examples that satisfy the antecedent of the current rule $R$, and $N$ denotes the total tuple weight of the negative examples that satisfy the antecedent of $R$. After literal $p$ is added to $R$, $P'$ denotes the total attribute value weight of literal $p$ in the positive examples, and $N'$ denotes the total attribute value weight of literal $p$ in the negative examples. Therefore, CATW uses both tuple weight and attribute value weight when it measures a literal $p$. We call this measure the improved FOIL measure.
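Under one plausible reading of this definition, the improved FOIL measure substitutes the weighted totals into the FOIL gain formula. The sketch below illustrates that reading; it is an assumption, not the authors' exact implementation.

```python
from math import log

# Sketch of the improved FOIL measure (an assumed reading: weighted
# totals replace the raw counts of the classic FOIL gain).
def improved_foil_gain(P, N, P_new, N_new):
    """P/N: total tuple weight of positive/negative examples satisfying
    the current rule R. P_new/N_new: total attribute value weight of
    the candidate literal in positive/negative examples."""
    if P_new <= 0:
        return float("-inf")
    return P_new * (log(P_new / (P_new + N_new)) - log(P / (P + N)))
```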

2.3.2. Improved Correlation Measure

In the traditional FOIL gain, $P'$ has a strong influence on the selection of the best attribute value. For example, if the correlation part of the gain is small but $P'$ is large, the literal with the highest gain may not be the best one for the rule. We therefore use two component measures, support and correlation confidence, and divide the traditional FOIL measure into two parts: (1) support: $P'$; (2) correlation confidence: $\log\frac{P'}{P'+N'} - \log\frac{P}{P+N}$.

When we select a literal, a global order over literals is imposed. Given two literals $p_1$ and $p_2$, $p_1$ is better than $p_2$, denoted $p_1 \succ p_2$, if and only if (1) the correlation confidence of $p_1$ is greater than that of $p_2$, or (2) their correlation confidences are equal and the support of $p_1$ is greater than that of $p_2$. We call this measure the improved correlation measure.
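In code, this order can be realised by scoring each literal with a (correlation confidence, support) pair and comparing the pairs lexicographically; the sketch below assumes the weighted quantities of Section 2.3.1.

```python
from math import log

def literal_key(P, N, P_new, N_new):
    """Score a literal for the improved correlation measure. Python
    compares the returned pairs lexicographically: correlation
    confidence first, support as the tie-breaker."""
    support = P_new
    corr_confidence = log(P_new / (P_new + N_new)) - log(P / (P + N))
    return (corr_confidence, support)

# Best literal under the global order described above, assuming a
# helper `stats(lit)` that returns (P, N, P_new, N_new) for a literal:
# best = max(candidates, key=lambda lit: literal_key(*stats(lit)))
```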

2.4. Algorithm of CATW

In this section, we introduce our algorithm in detail. The CATW algorithm is presented in Algorithm 1.

Input: Training set D = (P, N) (P and N are the sets of all positive and negative examples, respectively)
Output: A set of rules for predicting class labels for examples
Procedure CATW
 rules ← ∅
 while P ≠ ∅
   P′ ← P, N′ ← N
   R ← empty rule
   while N′ ≠ ∅ and length of the antecedent of R < max rule length
    find the best attribute value p using the improved correlation measure, combining tuple weight with attribute value weight
    add p to the antecedent of R
    remove from P′ all examples not satisfying p
    remove from N′ all examples not satisfying p
   end
   add R to rules
   for each attribute value p that is included in the antecedent of R
    avweight(p) ← avweight(p) × attribute value weight factor
   end
   for each example t in P satisfying R's body
    weight(t) ← weight(t) × tuple weight factor
    if weight(t) < end threshold then remove t from P
   end
 end
return rules
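The following self-contained Python sketch follows the structure of Algorithm 1. The weight factors, the rule-length cap, and the candidate literal enumeration are illustrative assumptions, and the literal scoring follows the improved correlation measure sketched above.

```python
from math import log

def satisfies(t, lit):
    """t: a tuple of attribute values; lit: an (attribute index, value) pair."""
    return t[lit[0]] == lit[1]

def best_literal(p_set, n_set, tuple_w, av_w, exclude=()):
    """Pick the literal with the best (correlation confidence, support)
    pair, weighting counts by tuple weights and attribute value weights."""
    P = sum(tuple_w[id(t)] for t in p_set)
    N = float(len(n_set))
    best, best_key = None, None
    for lit in {(i, v) for t in p_set for i, v in enumerate(t)} - set(exclude):
        w = av_w.get(lit, 1.0)
        P_new = w * sum(tuple_w[id(t)] for t in p_set if satisfies(t, lit))
        N_new = w * sum(1.0 for t in n_set if satisfies(t, lit))
        if P_new == 0:
            continue
        key = (log(P_new / (P_new + N_new)) - log(P / (P + N)), P_new)
        if best_key is None or key > best_key:
            best, best_key = lit, key
    return best

def catw(pos, neg, tuple_factor=2/3, av_factor=0.5, max_literals=5):
    """Generate a rule set for the positive class (factors are illustrative)."""
    rules, av_w = [], {}
    tuple_w = {id(t): 1.0 for t in pos}
    end_threshold = tuple_factor ** 3
    remaining = list(pos)
    while remaining:
        p_set, n_set, antecedent = list(remaining), list(neg), []
        while n_set and len(antecedent) < max_literals:
            lit = best_literal(p_set, n_set, tuple_w, av_w, antecedent)
            if lit is None:
                break
            antecedent.append(lit)
            p_set = [t for t in p_set if satisfies(t, lit)]
            n_set = [t for t in n_set if satisfies(t, lit)]
        if not antecedent:
            break
        rules.append(antecedent)
        for lit in antecedent:                # decay attribute value weights
            av_w[lit] = av_w.get(lit, 1.0) * av_factor
        for t in list(remaining):             # decay covered tuple weights
            if all(satisfies(t, l) for l in antecedent):
                tuple_w[id(t)] *= tuple_factor
                if tuple_w[id(t)] < end_threshold:
                    remaining.remove(t)
    return rules
```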

3. Classification of CATW

Before making any prediction, we use the Laplace expected error estimate [23] to evaluate the quality of rules. It is defined as
$$\text{Laplace accuracy} = \frac{n_c + 1}{n_{\text{tot}} + k},$$
where $k$ is the number of classes and $n_{\text{tot}}$ is the total number of examples satisfying the antecedent of the rule, among which $n_c$ examples belong to $c$, the predicted class of the rule.
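In code, the estimate is a direct transcription of the formula above.

```python
def laplace_accuracy(n_c, n_total, k):
    """Laplace expected accuracy of a rule: n_c examples of the rule's
    predicted class among the n_total examples that satisfy its
    antecedent, with k classes in the dataset."""
    return (n_c + 1) / (n_total + k)
```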

When using the rules to predict the class label of an unknown instance, we consider all rules whose antecedents the instance satisfies. If all these rules have the same consequent, we assign that class label to the instance. If the best rules cover several classes, we calculate the average Laplace accuracy of each class, select the class label with the highest average value, and assign it to the instance.
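A minimal sketch of this prediction procedure follows; the rule representation (antecedent, class label, precomputed Laplace accuracy) is an assumption carried over from the earlier sketches.

```python
def predict(instance, rules):
    """Predict a class label from the rules the instance satisfies.
    Each rule is (antecedent, class_label, laplace_acc), where the
    antecedent is a list of (attribute index, value) literals."""
    matched = [(label, acc) for antecedent, label, acc in rules
               if all(instance[a] == v for a, v in antecedent)]
    if not matched:
        return None                     # no rule fires on this instance
    by_class = {}
    for label, acc in matched:
        by_class.setdefault(label, []).append(acc)
    # the class with the highest average Laplace accuracy wins
    return max(by_class, key=lambda c: sum(by_class[c]) / len(by_class[c]))
```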

4. Experimental Results

All experiments are performed on datasets from the UCI data collection and were conducted using stratified tenfold cross-validation. In cross-validation, the dataset is divided into 10 blocks; each block is held out once while the classifier is trained on the remaining blocks. The characteristics of each dataset are shown in Table 3. We perform our experiments on a 2.2 GHz PC with 2 GB of memory, running Microsoft Windows XP.
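For reproducibility, stratified tenfold cross-validation can be set up as below; scikit-learn is used only for the splitting, and `train_and_score` is an assumed callback standing in for training and evaluating the classifier on one fold.

```python
from sklearn.model_selection import StratifiedKFold

def cross_validate(X, y, train_and_score, n_splits=10):
    """Average accuracy over stratified folds. `train_and_score` trains
    on the train split and returns the accuracy on the test split."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        scores.append(train_and_score(
            [X[i] for i in train_idx], [y[i] for i in train_idx],
            [X[i] for i in test_idx], [y[i] for i in test_idx]))
    return sum(scores) / len(scores)
```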

In Tables 4 and 5, Column 1 shows the accuracy of FOIL, Column 2 the accuracy of CMAR, and Column 3 the accuracy of CPAR. Column 4 shows the accuracy of CATW without attribute value weight, using tuple weight only. Columns 5 and 6 show the accuracy of CATW with both attribute value weight and tuple weight, under two different attribute value weight settings.

In Table 4, we use the improved FOIL measure. Figure 1 plots the accuracy of FOIL, CMAR, CPAR, and CATW based on Table 4. Here CATW uses both attribute value weight and tuple weight and employs the improved FOIL measure. From Figure 1 and Table 4, we can see that CATW achieves higher accuracy than FOIL, CMAR, and CPAR.

In Table 5, we use the improved correlation measure. Figure 2 plots the accuracy of FOIL, CMAR, CPAR, and CATW based on Table 5. Here CATW uses both attribute value weight and tuple weight and employs the improved correlation measure. From Figure 2 and Table 5, we can see that CATW with the improved correlation measure also achieves higher accuracy than FOIL, CMAR, and CPAR.

By comparison, the accuracy of CATW with the improved correlation measure is higher than that of CATW with the improved FOIL measure. From Tables 4 and 5, we can see that it is worthwhile to use the improved correlation measure.

Tables 6 and 7 display the accuracy of CATW under different attribute value weights, using the improved FOIL measure and the improved correlation measure, respectively. The results of the two tables indicate that the improved correlation measure yields higher accuracy than the improved FOIL measure and that different attribute value weights have different influences on classification accuracy.

From all the above experimental results, we can conclude that it is necessary to use both attribute value weight and tuple weight, that the improved correlation measure should be used, and that different attribute value weights have different influences on classification accuracy.

5. Conclusions and Future Work

With the rapid development of information technology and the popularity of cloud computing, it is necessary to mine useful information from massive data. Traditional classification methods frequently adopt one of two strategies: one removes an instance after it is covered by a rule, without using tuple weight; the other only decreases the tuple weight of an instance after it is covered by a rule. As a result, they cannot achieve high classification accuracy on some datasets. In this paper, we present a novel approach, CATW. First, CATW uses both attribute value weight and tuple weight. Second, CATW proposes a new measure, the improved correlation measure, and employs it to select the best attribute values and generate a high quality classification rule set. The results of our experiments indicate that CATW generates a reasonable number of classification rules and achieves high classification accuracy. Our experiments also show that different attribute value weights have different influences on classification accuracy. At present, we have not found a regular pattern for selecting an optimal attribute value weight, and future research will focus on this question. We will also combine distributed data mining with a cloud computing platform in order to improve the efficiency of CATW.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work is funded by the China NSF Program (no. 61170129) and by the Fujian Province NSF Program (no. 2013J01259).