Abstract

The Internet of Things (IoT) has been a hot topic in recent years. IoT applications accumulate large amounts of data from their users, and mining useful knowledge from these data is a great challenge. Classification is an effective strategy for predicting the needs of users in IoT. However, many traditional rule-based classifiers cannot guarantee that every instance is covered by at least two classification rules, so these algorithms cannot achieve high accuracy on some datasets. In this paper, we propose a new rule-based classifier, CDCR-P (Classification based on the Pruning and Double Covered Rule sets). CDCR-P induces two different rule sets, R1 and R2. Every instance in the training set is covered by at least one rule in rule set R1 and by at least one rule in rule set R2. In order to improve the quality of rule set R2, we prune the length of the rules in rule set R2. Our experimental results indicate that CDCR-P is not only feasible but also achieves high accuracy.

1. Introduction

The Internet of Things (IoT) is one of the hot topics of recent years. It integrates many kinds of modern technology, and these technologies produce large-scale data in IoT. Handling such large data requires techniques and methods from data mining and machine learning [16].

As one of the most important tasks of data mining, classification has been widely applied in IoT. The main idea of classification is to build classification rules; according to these rules, we can predict the class labels of unknown objects.

Traditional rule-based classifiers usually use a greedy approach, such as FOIL [7], CPAR [8], and CMER [9]. These methods repeatedly search for the current best rule (or the best-k rules) and remove the examples covered by those rules. They cannot guarantee that every instance is covered by at least two classification rules. As a result, some traditional classifiers produce few classification rules, and their accuracy may not be high. Decision tree classifiers, such as ID3 [10], C4.5 [11], and TASC [12], produce classification rules by constructing classification trees. Building a decision tree does not require deleting any examples, but every example finds exactly one matching rule in the resulting rule set. That is why decision trees often generate small rule sets and cannot achieve high accuracy on some data. A minimal sketch of the greedy covering loop that these rule-based methods share is given below.
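The following Python sketch illustrates the separate-and-conquer loop underlying FOIL-style learners; the learn_one_rule callback and the covers method are illustrative assumptions, not the paper's notation. Because every example is deleted as soon as one rule covers it, no training example ends up covered by two rules.

# Greedy "separate and conquer" covering (FOIL-style sketch): examples
# are removed as soon as a single rule covers them, which is why such
# learners cannot guarantee double coverage.
def greedy_covering(examples, learn_one_rule):
    rules = []
    remaining = list(examples)
    while remaining:
        rule = learn_one_rule(remaining)  # current best rule on the rest
        if rule is None:                  # no useful rule can be found
            break
        rules.append(rule)
        # covered examples are deleted and never contribute again
        remaining = [e for e in remaining if not rule.covers(e)]
    return rules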

Aiming at these weaknesses, we propose a novel Double Covered Rule sets classifier called CDCR-P (Classification based on the Pruning and Double Covered Rule sets). CDCR-P generates two different rule sets, R1 and R2, and then prunes rule set R2. Each instance is covered by at least one rule from rule set R1 and, at the same time, by at least one rule from rule set R2. CDCR-P has four aspects. First, CDCR-P generates rule set R1: we select the several best values which can just cover the training set to construct a candidate set, and CDCR-P employs this candidate set to produce rule set R1. Second, in order to induce rule set R2, we remove the values of the candidate set from the training data and select several other best values; rule set R2 is therefore fully different from rule set R1. Third, each instance can find at least two matching rules, one from rule set R1 and another from rule set R2. Fourth, we prune the length of the rules in rule set R2 so as to improve its quality. Our method has the following advantages.
(1) CDCR-P produces two rule sets and can therefore generate a large number of classification rules.
(2) All instances in the training set are matched by at least two classification rules.
(3) CDCR-P can achieve high accuracy by combining rule set R1 with rule set R2.

The paper is organized as follows. In Section 2, we introduce the method of CVCR (Classification based on Value Covered Rules). In Section 3, we propose a new classifier CDCR-P and discuss how to use CDCR-P to classify new objects. We report our experimental results in Section 4. We finally conclude our study in Section 5.

2. Classification Based on Value Covered Rules

In this section, we introduce the value covered classifier, called CVCR (Classification based on Value Covered Rules).

Suppose D is a set of tuples. Each tuple t has m attributes A1, A2, ..., Am. Let C = {c1, c2, ..., ck} be a finite set of class labels, and let V be the set consisting of the data samples (attribute values). A rule r consists of several samples and a class label c and takes the form v1 ∧ v2 ∧ ... ∧ vn → c. A rule set is formed by the rules extracted by one classifier. If a tuple t satisfies the antecedent v1 ∧ v2 ∧ ... ∧ vn of rule r, then t is matched by r, and r predicts that t belongs to class c.
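To make the notation concrete, here is a small illustrative encoding in Python; the names (Rule, matches, the dict-based tuple format, and the support and confidence fields used in later sections) are our own assumptions, not the paper's.

from dataclasses import dataclass

# A tuple is a dict mapping attribute names to values; a rule stores
# its antecedent as a frozenset of (attribute, value) pairs.
@dataclass(frozen=True)
class Rule:
    body: frozenset           # (attribute, value) pairs v1 ... vn
    label: str                # predicted class c
    support: float = 0.0      # filled in later (Definition 3)
    confidence: float = 1.0   # filled in later (Definition 2)

    def matches(self, tuple_):
        # t satisfies r if t contains every pair of the antecedent
        return all(tuple_.get(a) == v for a, v in self.body)

t = {"outlook": "sunny", "windy": "false", "class": "play"}
r = Rule(frozenset({("outlook", "sunny")}), "play")
assert r.matches(t)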

Definition 1 (information gain). Let $s_i$ be the number of samples of $S$ that belong to class $C_i$, and let $s$ be the total number of samples in $S$. The information gain of an attribute value $v$ is denoted by $I(v)$ and is defined as follows:
$$I(v) = -\sum_{i=1}^{k} p_i \log_2 p_i,$$
where $p_i$ is the probability that a literal belongs to class label $C_i$ and is estimated by $p_i = s_i/s$.
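A short Python rendering of this measure, under the assumption (consistent with the stopping condition used later) that a value of 0 indicates a pure pattern; the dict-based tuple format and the reserved "class" key are our assumptions.

import math
from collections import Counter

# Information gain of an attribute value v (Definition 1): the class
# entropy of the tuples that contain v; 0 means they all agree.
def information_gain(dataset, attribute, value):
    labels = [t["class"] for t in dataset if t.get(attribute) == value]
    s = len(labels)                    # number of samples containing v
    if s == 0:
        return 0.0
    counts = Counter(labels)           # s_i for each class C_i
    return -sum((si / s) * math.log2(si / s) for si in counts.values())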
CVCR finds a set of values which can cover the whole training set. The process of constructing CVCR is as follows.
First, CVCR sorts all literals according to their information gain in descending order and selects the several best attribute values which can just cover the training set D; these values construct a candidate set. These values split D into subdatasets D1, D2, ..., Dn, respectively. Second, CVCR connects each candidate value with the attribute values which can just cover its subdataset to produce patterns. Finally, the above steps are repeated until the information gain of every pattern is equal to 0. A sketch of the cover-value selection step follows.
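A hedged Python sketch of the cover-value selection step: repeatedly pick the best-ranked remaining attribute value and discard the tuples it covers until the dataset is exhausted. The rank_key callback stands in for the information-gain ordering (with support as the tie-breaker) and is our assumption.

# Select a set of attribute values that together "just cover" every
# tuple, as in CVCR's candidate-set construction.
def select_cover_values(dataset, rank_key):
    cover_values, remaining = [], list(dataset)
    while remaining:
        # all (attribute, value) pairs still occurring in the data
        candidates = {(a, v) for t in remaining
                      for a, v in t.items() if a != "class"}
        best = min(candidates, key=lambda av: rank_key(remaining, av))
        cover_values.append(best)
        a, v = best
        # tuples containing the chosen value are now covered
        remaining = [t for t in remaining if t.get(a) != v]
    return cover_values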
The experimental results of CVCR are shown in Table 2. They show that CVCR can achieve higher accuracy than ID3 and FOIL. Because CVCR contains the globally optimal attribute values, CVCR is more feasible than ID3. However, CVCR still produces few classification rules and cannot guarantee that each instance is matched by at least two rules.

3. Classification Based on Pruning and Double Covered Rule Sets

In this section, we present a new method, CDCR-P. First, we show how to induce rule sets R1 and R2. Second, we describe how to prune rule set R2. Finally, we give the way to use the two rule sets R1 and R2 to classify new objects.

3.1. Constructing Rule Sets R1 and R2

Based on the idea of CVCR, we continue mining knowledge in depth. This approach divides the training set D into three small datasets D1, D2, and D3 according to the candidate set. The method contains four steps. In Step 1, we select from the candidate set the several best attribute values CV1 which can just cover one-third of the training set; the remaining candidate values have less information gain. The tuples which contain one of CV1 form the small dataset D1. The process is shown as Algorithm 1. We form D2 and D3 in the same way as D1. In Step 2, according to CV1, D1 is split into subdatasets. We find CV (a set of cover values) in each subdataset on the basis of information gain; the measure of CV is the same as in CVCR, shown as Algorithm 2. CDCR connects each cover value cvi with the candidate values to produce patterns. If the information gain of a pattern is equal to 0, the pattern belongs to rule set R1, shown as Algorithm 3. In Step 3, CDCR recalculates the cover values CV in D1 excluding the candidate set. CV splits D1 into some datasets, and CDCR connects the cover values in each dataset to produce new rules. These rules belong to rule set R2, shown as Algorithm 4. Finally, we remove D1 from D and iterate the process until D2 and D3 are trained. Rule set R1 is the same as the rule set of CVCR. Both rule sets R1 and R2 belong to CDCR.

Input: Training data D
Output: Small dataset D1, candidate values CV1
Method:
(1)   D1 = ∅;
(2)   compute the information gain of each sample v in D;
(3)   sort the samples into a list L according to the information gain in descending order. If two samples have the
      same information gain, sort the two samples according to their support;
(4)   CV1 = ∅;
(5)   while |D1| < |D|/3
(6)     CV1 = CV1 ∪ {v}, where v is the first element of L; L = L − {v};
(7)     for each tuple t in D
(8)       while t contains v && |D1| < |D|/3
(9)         D1 = D1 ∪ {t};
(10)        D = D − {t};
(11)        t = the next tuple of D;
(12)      end while
(13)    end for
(14)  end while
(15)  return D1 and CV1
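A rough Python rendering of Algorithm 1, under the stated reading that D1 should "just cover" one-third of the training set; ranked_values is the candidate set sorted by the Definition 1 measure, and all helper names are ours.

# Split off the small dataset D1: take top-ranked candidate values
# until the tuples containing them first reach one-third of the data.
def split_one_third(dataset, ranked_values):
    target = len(dataset) / 3
    d1, cv1, remaining = [], [], list(dataset)
    for a, v in ranked_values:            # best candidate values first
        if len(d1) >= target:             # D1 now "just covers" 1/3
            break
        cv1.append((a, v))
        d1 += [t for t in remaining if t.get(a) == v]
        remaining = [t for t in remaining if t.get(a) != v]
    return d1, cv1, remaining             # remaining yields D2 and D3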

Input: Dataset S
Output: A set of cover values CV
Method: find a set of cover values CV that can cover S
(1)  data set S' = S; cover values CV = ∅;
(2)  compute the information gain of each sample v in S';
(3)  sort the samples of S' into a list L according to the information gain in descending order. If two samples have the
     same information gain, sort the two samples according to their support;
(4)  while S' ≠ ∅
(5)    CV = CV ∪ {v}, where v is the first element of L;
(6)    L = L − {v};
(7)    remove from S' all examples that contain v;
(8)  end while
(9)  return CV

Input: Training data D1, candidate values CV1
Output: Rule set R1
Method:
(1)   Rule set R1 = ∅, pattern p = ∅, cover values CV = CV1, initialize queue Q = ∅;
(2)   while CV ≠ ∅
(3)     p = {cv}, where cv is the first element of CV;
(4)     Q.enqueue(p); CV = CV − {cv};
(5)   end while
(6)   while Q ≠ ∅
(7)     pattern p = Q.dequeue();
(8)     if p has already been visited, continue;
(9)     compute the information gain I(p) of p;
(10)  if I(p) == 0
(11)    R1 = R1 ∪ {p → c};
(12)  else
(13)    Dp = the tuples of D1 which contain p;
(14)    find cover values CV′ in dataset Dp;
(15)    connect p with each cv′ in CV′;
(16)    Q.enqueue(p ∪ {cv′});
(17)    mark p as visited, p = ∅;
(18)  end if
(19)  end while
(20)  return R1
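A hedged Python rendering of this queue-based pattern growth (Algorithms 3 and 4 differ only in which cover values may extend a pattern): grow each seed value into longer patterns until the pattern is pure, then emit a rule. entropy_of and cover_values_of correspond to the sketches given earlier; everything else is our own naming.

from collections import deque

# Queue-based pattern growth: pure patterns (entropy 0) become rules;
# impure patterns are extended with cover values of the tuples they
# cover and re-enqueued.
def grow_rules(dataset, seeds, entropy_of, cover_values_of):
    rules, seen = [], set()
    queue = deque(frozenset([s]) for s in seeds)
    while queue:
        pattern = queue.popleft()
        if pattern in seen:
            continue
        seen.add(pattern)
        covered = [t for t in dataset
                   if all(t.get(a) == v for a, v in pattern)]
        if not covered:
            continue
        if entropy_of(covered) == 0.0:         # pure: emit p -> c
            rules.append((pattern, covered[0]["class"]))
        else:                                  # impure: extend p
            for av in cover_values_of(covered):
                queue.append(pattern | {av})
    return rules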

Input: Training data D1, candidate values CV1
Output: Rule set R2
Method:
(1)   Rule set R2 = ∅, pattern p = ∅, cover values CV = ∅, candidate set CS = CV1, initialize queue Q = ∅;
(2)   add the cover values which can cover D1 to CV, with CV ∩ CS = ∅;
(3)   while CV ≠ ∅
(4)     p = {cv}, where cv is the first element of CV;
(5)     Q.enqueue(p); CV = CV − {cv};
(6)   end while
(7)   while Q ≠ ∅
(8)     p = Q.dequeue();
(9)     if p has already been visited, continue;
(10)    compute the information gain I(p) of p;
(11)    if I(p) == 0
(12)      R2 = R2 ∪ {p → c};
(13)    else
(14)      Dp = the tuples of D1 which contain p;
(15)      find cover values CV′ in dataset Dp, with CV′ ∩ CS = ∅;
(16)      connect p with each cv′ in CV′;
(17)      Q.enqueue(p ∪ {cv′});
(18)      mark p as visited, p = ∅;
(19)    end if
(20)  end while
(21)  return R2

3.2. Pruning Rule Set R2

In order to improve the quality of rule set R2, we introduce a new method, CDCR-P (Classification based on the Pruning and Double Covered Rule sets).

Definition 2 (confidence). The confidence of a sample $v$ is defined as follows:
$$\mathrm{conf}(v) = \frac{\mathrm{count}(v, C_i)}{\mathrm{count}(v)},$$
where $\mathrm{count}(v, C_i)$ means the number of tuples which contain sample $v$ in class $C_i$ and $\mathrm{count}(v)$ is the total number of tuples which contain $v$.
The confidence of the rules that CDCR generates is equal to 100%. We modify the length of rule set R2: the rules are generated when the confidence reaches 100% in the small dataset Di instead of in the whole training set D, and each rule is marked with its confidence in D. Thus, rule set R2 in CDCR-P is shorter than rule set R2 in CDCR. A sketch of this pruning step follows.
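A minimal sketch of the pruning idea, assuming the Rule encoding introduced in Section 2; the helpers confidence_in and label_of are our own names for "confidence of the pattern measured on a dataset" and "class shared by the covered tuples".

# Emit a rule as soon as its pattern is pure inside the small dataset
# di; the rule is shorter than one grown until purity in all of d, and
# it carries its (possibly lower) confidence on the whole training set.
def finalize_rule(pattern, di, d, label_of, confidence_in):
    if confidence_in(pattern, di) == 1.0:     # pure within di: stop here
        return Rule(frozenset(pattern), label_of(pattern, di),
                    confidence=confidence_in(pattern, d))
    return None                               # otherwise keep growing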

3.3. Classifying Unknown Examples

In this part, we describe how CDCR and CDCR-P classify unknown instances.

Definition 3 (support). The support of a sample $v$ is denoted by $\mathrm{supp}(v)$ and is defined as
$$\mathrm{supp}(v) = \frac{\mathrm{count}(v)}{|D|},$$
where $\mathrm{count}(v)$ means the number of tuples which contain sample $v$ and $|D|$ is the number of tuples in the training data.
When testing unknown examples, CDCR selects the matched rule with the highest support. If several rules have the same support, we select the class matched by the maximum number of rules.
CDCR-P first considers the rule with the highest confidence. If two rules have the same confidence, CDCR-P sorts the two rules according to their support. Both procedures are sketched below.
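The two decision procedures, rendered in Python against the illustrative Rule encoding above; tie-breaking beyond what the text states is our assumption.

from collections import Counter

# CDCR: highest support wins; ties are resolved by the class matched
# by the most rules among the tied ones.
def classify_cdcr(rules, tuple_):
    matched = [r for r in rules if r.matches(tuple_)]
    if not matched:
        return None                    # a missing match (Definition 4)
    best = max(r.support for r in matched)
    tied = [r for r in matched if r.support == best]
    return Counter(r.label for r in tied).most_common(1)[0][0]

# CDCR-P: highest confidence wins; support breaks ties.
def classify_cdcr_p(rules, tuple_):
    matched = [r for r in rules if r.matches(tuple_)]
    if not matched:
        return None
    return max(matched, key=lambda r: (r.confidence, r.support)).label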

Definition 4 (missing match rate). If a test instance cannot find any matching rule, this unclassified instance is considered a mismatch. The missing match rate is defined as
$$\text{missing match rate} = \frac{\mathrm{count}(\text{unclassified instances})}{\mathrm{count}(\text{test instances})},$$
where count(unclassified instances) means the number of test tuples which cannot be matched by any rule.
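For completeness, the measure in code (same assumptions as above):

# Missing match rate: the share of test tuples matched by no rule.
def missing_match_rate(rules, test_set):
    unmatched = sum(1 for t in test_set
                    if not any(r.matches(t) for r in rules))
    return unmatched / len(test_set)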

4. Experiments

We show the experimental results on 14 UCI datasets. The characteristics of each dataset are shown in Table 1. All the experiments were performed on a 2.2 GHz PC with 2.84 GB of main memory, running Microsoft Windows XP. Each experiment uses tenfold cross-validation on each dataset.

In Table 2, we give the accuracy of ID3, FOIL, CVCR, CDCR, and CDCR-P. Figure 1 gives the accuracy of ID3, FOIL, and CVCR. CVCR employs the idea of cover values, and these cover values are the globally optimal attribute values in the training data. From Figure 1 and Table 2 we can see that CVCR achieves higher accuracy than ID3 and FOIL. Figure 2 gives the accuracy of CVCR, CDCR, and CDCR-P. CDCR not only uses the method of cover values but also produces two rule sets, R1 and R2, so that each instance is matched by at least one rule from rule set R1 and one from rule set R2. From Figure 2 and Table 2 we can see that CDCR achieves higher accuracy than CVCR. On top of all the advantages of CVCR and CDCR, CDCR-P prunes the length of the rules in rule set R2. The experimental results show that CDCR-P has the highest accuracy.

Table 3 displays the missing match rate of ID3, FOIL, CVCR, CDCR, and CDCR-P. CVCR produces more rules than ID3 and FOIL, and from Table 3 we can see that CVCR decreases the missing match rate markedly. CDCR produces two rule sets and therefore more rules than CVCR; from Table 3 we can see that the missing match rate of CDCR is lower than that of CVCR. CDCR-P modifies the length of rule set R2, so the quality of the rules in CDCR-P is higher than in CDCR. The experiments indicate that the missing match rate of CDCR-P is the lowest.

From all the above experimental results, we can conclude the following: it is necessary to construct two rule sets; it is necessary to prune rule set R2; and CDCR-P achieves high accuracy with an excellent missing match rate.

5. Conclusions

Classification has been widely applied in IoT, and accuracy is an important factor in the classification task. Traditional rule-based classifiers cannot guarantee that all test cases are matched by two rules, and they usually generate few classification rules; thus, the accuracy of these algorithms may be low on some data. In this paper, a novel approach, CDCR-P, is proposed. CDCR-P generates two rule sets: rule set R1 and rule set R2. All instances are matched by at least one rule in rule set R1 and by at least one rule in rule set R2. This method greatly increases the number of extracted rules and thus obtains more information from the training data. Our experimental results show that CDCR-P produces more rules and achieves high accuracy. In future research, we will perform an in-depth study on combining distributed data mining with IoT in order to improve the efficiency of CDCR-P.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is funded by the China NSF Program (no. 61170129), the Fujian Province NSF Program (no. 2013J01259), and the Minnan Normal University Postgraduate Education Project (no. 1300-1314).