Abstract

The aim of this study is to address the problem that traditional clustering algorithms are sensitive to the distribution of the data space. To this end, a novel clustering algorithm that combines rough set theory with conventional clustering is proposed. The proposed rough clustering algorithm takes the consistency between the condition attributes and the decision attributes displayed in the information table as its guiding principle, and it uses data hypercubes and information entropy to reduce and discretize the data attributes. On this basis, by applying the addition rule for set characteristic vectors, the data objects can be clustered with only one scan of the information table. Experiments reveal that the proposed algorithm is efficient and feasible.

1. Introduction

With the fast development and widespread application of computer and network technology, more and more service data are available, and these data contain a huge mass of valuable information that is hard to detect. Therefore, many researchers have focused on this issue and carried out related work. Clustering was proposed with the goal of grouping similar objects into one cluster and dissimilar objects into different clusters [1–7]. At present, many clustering algorithms have been presented. Perhaps the most widely employed is the classic k-means algorithm, which is applied almost everywhere. However, because these algorithms are very sensitive to the distribution of the data space, or because the data are compressed with a possible loss of quality in order to improve efficiency, the clustering result is sometimes poor. Several control-theoretic approaches [8–11] have also been discussed for this issue. Rough set theory was presented by Professor Pawlak of the Warsaw University of Technology in the 1980s [12–16]. It is a theory for simplifying data, especially suited to dealing with uncertain and incomplete data. Its main characteristic is that it uses only the information provided by the data themselves and does not need any additional information or prior knowledge to discretize data or to reduce data attributes [17–19]. Therefore, a new clustering algorithm based on rough sets is presented in this paper.

Definition 1 (information system). Let an information system be $S = (U, A, V, f)$, where $U = \{x_1, x_2, \ldots, x_n\}$ is a nonempty finite set of objects and $x_i$ is one object; $A = C \cup D$ is a set of object attributes, divided into two disjoint sets, the condition attribute set $C$ and the decision attribute set $D$, with $C \cap D = \emptyset$; $V = \bigcup_{a \in A} V_a$ is the set of attribute values, where $V_a$ is the domain of attribute $a$; $f: U \times A \to V$ is a mapping function that assigns an attribute value to each attribute of every object, that is, for every $x \in U$ and every $a \in A$, $f(x, a) \in V_a$.
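For illustration only, such a decision table can be held directly in code. The sketch below (attribute names and values are hypothetical, not taken from the paper's data) stores the object set, the disjoint condition/decision attribute sets, and the value function as plain Python structures.

```python
# A minimal, illustrative decision table S = (U, A, V, f).
# Attribute names and values are hypothetical examples.
objects = ["x1", "x2", "x3", "x4"]                 # U: finite set of objects
condition_attrs = ["a1", "a2", "a3"]               # C: condition attributes
decision_attrs = ["d"]                             # D: decision attributes (disjoint from C)

# f: U x A -> V, one value per (object, attribute) pair
f = {
    "x1": {"a1": 0.2, "a2": 1.5, "a3": 3.0, "d": 1},
    "x2": {"a1": 0.4, "a2": 1.1, "a3": 2.8, "d": 1},
    "x3": {"a1": 0.9, "a2": 0.3, "a3": 1.2, "d": 0},
    "x4": {"a1": 0.8, "a2": 0.4, "a3": 1.0, "d": 0},
}

# V_a: the domain of attribute a, collected from the table
V = {a: {f[x][a] for x in objects} for a in condition_attrs + decision_attrs}
```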

Definition 2 (interval partition). Let an information system be $S = (U, A, V, f)$ and let $r$ be the number of decision classes; a breakpoint on the domain $V_a = [l_a, r_a)$ of attribute $a$ is marked $(a, c)$ with $c \in V_a$. Any breakpoint set $\{(a, c_1^a), (a, c_2^a), \ldots, (a, c_{k_a}^a)\}$ on $V_a$ with $l_a = c_0^a < c_1^a < \cdots < c_{k_a}^a < c_{k_a+1}^a = r_a$ divides $V_a$ into the intervals $[c_0^a, c_1^a), [c_1^a, c_2^a), \ldots, [c_{k_a}^a, c_{k_a+1}^a)$ and is defined as an interval partition of $V_a$.

Definition 3 (comentropy). Let an information table be $S = (U, A, V, f)$ with $U = X_1 \cup X_2 \cup \cdots \cup X_t$ and $X_i \cap X_j = \emptyset$ for $i \neq j$; for each subset $X$, let $k_j$ be the number of samples of class $j$ in subset $X$ and let $p_j = k_j / |X|$; the comentropy (information entropy) of $X$ is $H(X) = -\sum_{j} p_j \log_2 p_j$.
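As a concrete reference, the comentropy of a subset is the Shannon entropy of its class distribution. A minimal sketch, assuming each object of the subset is represented only by its class label and using base-2 logarithms (function and variable names are illustrative):

```python
import math
from collections import Counter

def comentropy(labels):
    """Information entropy H(X) of a subset, given the class label of each object."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)                      # k_j: number of samples of class j
    probs = [k / total for k in counts.values()]  # p_j = k_j / |X|
    return -sum(p * math.log2(p) for p in probs) if len(counts) > 1 else 0.0

# Example: a pure subset has entropy 0, a balanced two-class subset has entropy 1.
print(comentropy([1, 1, 1, 1]))   # 0.0
print(comentropy([1, 1, 0, 0]))   # 1.0
```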

Definition 4 (similarity of a set). Suppose the number of objects is $n$, the number of attributes describing each object is $m$, and the attribute values are discrete. Let $X$ be one object subset among them, let the number of its objects be $|X|$, and, over the discrete intervals of all the objects in this subset, let the number of attributes that take the same value be $k$; the similarity of the set is defined as $s(X) = k/m$.
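Under this definition, the similarity of a subset is the fraction of attribute positions on which all of its objects agree after discretization. A small sketch, assuming each object is a tuple of m discrete values:

```python
def set_similarity(subset, m):
    """Similarity of an object subset: k / m, where k is the number of attribute
    positions on which every object in the subset takes the same value."""
    if not subset:
        return 0.0
    k = sum(1 for j in range(m) if len({obj[j] for obj in subset}) == 1)
    return k / m

# Example with m = 4 discrete attributes per object.
print(set_similarity([(1, 2, 0, 3), (1, 2, 1, 3)], 4))  # 0.75: the objects agree on 3 of 4 attributes
```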

Definition 5 (characteristic vector of a set). Suppose the number of objects is $n$ and the number of attributes describing each object is $m$. Let $X$ be one object subset among them, let the number of its objects be $|X|$, and, over the discrete intervals of all the objects in this subset, let the number of attributes that take the same value be $k$; correspondingly, the ordinal numbers of these attributes are $i_1, i_2, \ldots, i_k$. The characteristic vector of the object set $X$ is then $T(X) = (|X|, I_X)$, where $I_X = \{i_1, i_2, \ldots, i_k\}$ and $|I_X| = k$.
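One convenient way to hold such a characteristic vector, assumed here purely for illustration, is the pair (object count, set of ordinal numbers of the agreeing attributes):

```python
def characteristic_vector(subset, m):
    """Return (|X|, I_X): the object count and the set of attribute positions
    on which all objects in the subset take the same (discrete) value."""
    agreeing = {j for j in range(m) if len({obj[j] for obj in subset}) == 1}
    return (len(subset), agreeing)

print(characteristic_vector([(1, 2, 0, 3), (1, 2, 1, 3)], 4))  # (2, {0, 1, 3})
```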

Definition 6 (addition rule of set characteristic vectors). Suppose the number of objects is $n$ and the number of attributes describing each object is $m$, and let $X_1$ and $X_2$ be two disjoint object subsets among them with set characteristic vectors $T(X_1) = (|X_1|, I_1)$ and $T(X_2) = (|X_2|, I_2)$. The addition rule of set characteristic vectors is then defined as $T(X_1) + T(X_2) = (|X_3|, I_3)$, where $X_3 = X_1 \cup X_2$, $|X_3| = |X_1| + |X_2|$, and $I_3 = I_1 \cap I_2$.
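With that assumed representation, the addition rule adds the object counts and intersects the agreeing-attribute index sets, which is what later allows two sets to be merged without rescanning their members. A hedged sketch:

```python
def add_characteristic_vectors(t1, t2):
    """Addition rule for set characteristic vectors of two disjoint subsets:
    object counts add, agreeing-attribute index sets intersect."""
    n1, i1 = t1
    n2, i2 = t2
    return (n1 + n2, i1 & i2)

t1 = (2, {0, 1, 3})
t2 = (3, {1, 2, 3})
print(add_characteristic_vectors(t1, t2))  # (5, {1, 3})
```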

Because of the needs of the algorithm, and based on the relation between the condition attributes and the decision attributes as well as the related concepts of rough set theory, the following theorems are introduced and proved in this paper.

Theorem 7. Let a decision table be $S = (U, A, V, f)$, where $U$ is a nonempty finite set of objects, $A$ is a set of object attributes, $A = C \cup D$, and $C \cap D = \emptyset$; let $\mathrm{NEG}_C(D)$ be the rough negative region. If $U/D = \{Y_1, Y_2, \ldots, Y_r\}$ and $\mathrm{POS}_C(D) = \bigcup_{i=1}^{r} \underline{C}Y_i$, then for this rough negative region the following holds: $\mathrm{NEG}_C(D) = U - \mathrm{POS}_C(D)$.

Proof. Consider the following: the rough negative region consists exactly of the objects that do not belong to the lower approximation of any decision class; hence $\mathrm{NEG}_C(D) = U - \bigcup_{i=1}^{r} \underline{C}Y_i = U - \mathrm{POS}_C(D)$.

Theorem 8. Let $S = (U, C \cup D, V, f)$ be a decision table, let $a \in C$, and let $\mathrm{NEG}_C(D)$ and $\mathrm{NEG}_{C-\{a\}}(D)$ be the corresponding rough negative regions; then the following holds: $\mathrm{NEG}_C(D) \subseteq \mathrm{NEG}_{C-\{a\}}(D)$.

Proof. For $a \in C$, there are two cases: (1) if $a$ is a redundant attribute, then $\mathrm{POS}_{C-\{a\}}(D) = \mathrm{POS}_C(D)$, so $\mathrm{NEG}_{C-\{a\}}(D) = \mathrm{NEG}_C(D)$; (2) if $a$ is an important attribute, then $\mathrm{POS}_{C-\{a\}}(D) \subset \mathrm{POS}_C(D)$, so $U - \mathrm{POS}_{C-\{a\}}(D) \supset U - \mathrm{POS}_C(D)$, that is, $\mathrm{NEG}_{C-\{a\}}(D) \supset \mathrm{NEG}_C(D)$. In both cases, $\mathrm{NEG}_C(D) \subseteq \mathrm{NEG}_{C-\{a\}}(D)$.

Theorem 9. Suppose the number of objects is $n$, the number of attributes describing each object is $m$, $X_1$ and $X_2$ are two disjoint object subsets among them, and $X_1$ combines with $X_2$ to form the set $X_3 = X_1 \cup X_2$; correspondingly, their set characteristic vectors are $T(X_1) = (|X_1|, I_1)$, $T(X_2) = (|X_2|, I_2)$, and $T(X_3) = (|X_3|, I_3)$.
Therefore, $T(X_3) = T(X_1) + T(X_2)$; that is, $|X_3| = |X_1| + |X_2|$ and $I_3 = I_1 \cap I_2$.

Proof. Because there is no intersection between sets $X_1$ and $X_2$, and the numbers of their elements are $|X_1|$ and $|X_2|$, the number of elements of set $X_3$ is $|X_1| + |X_2|$; that is, $|X_3| = |X_1| + |X_2|$.
First, let us prove that $I_3 \subseteq I_1 \cap I_2$. For any $i \in I_3$, in set $X_3$ all objects have the same attribute value at the position whose ordinal number is $i$, and because $X_1 \subseteq X_3$, all objects in set $X_1$ have the same attribute value at the position whose ordinal number is $i$ too. So $i \in I_1$; reasoning in the same way for $X_2$, we have $i \in I_2$ too; hence, $I_3 \subseteq I_1 \cap I_2$.
On the other hand, it can be proved that $I_1 \cap I_2 \subseteq I_3$: for any $i \in I_1 \cap I_2$, all objects in set $X_1$ have the same attribute value at the position whose ordinal number is $i$, and all objects in set $X_2$ have the same attribute value at that position too. Then in set $X_3$ all objects must have the same attribute value at the position whose ordinal number is $i$; that is, $i \in I_3$; so $I_1 \cap I_2 \subseteq I_3$.
Based on the definition of set similarity and $I_3 = I_1 \cap I_2$, we can come to the conclusion that $s(X_3) = |I_1 \cap I_2| / m$. And based on the definition of the characteristic vector, it is clear that $T(X_3) = (|X_1| + |X_2|, I_1 \cap I_2) = T(X_1) + T(X_2)$. To sum up, the theorem has been proved.
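Continuing the illustrative helpers sketched above (so this inherits their assumed representation), the additivity stated by Theorem 9 can be checked numerically on a small example in which the agreeing attributes also agree across the two subsets:

```python
x1 = [(1, 2, 0, 3), (1, 2, 1, 3)]
x2 = [(1, 2, 2, 3)]
m = 4

direct = characteristic_vector(x1 + x2, m)
added = add_characteristic_vectors(characteristic_vector(x1, m),
                                   characteristic_vector(x2, m))
print(direct, added)  # (3, {0, 1, 3}) (3, {0, 1, 3}) in this example
```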

4. Algorithm Description

Before performing the rough clustering, the discretization breakpoints must be initialized. Initialize the breakpoint set and calculate the relative comentropy of the source information table.

Step 1. Apply the attribute significance formula to find the key attributes of the information table, and eliminate the redundant attributes.
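The paper's attribute significance formula is not reproduced above; a common rough set choice, assumed here as a stand-in, is the drop in the size of the positive region when an attribute is removed (a drop of zero marks the attribute as redundant). A minimal sketch over a discretized table stored as rows of values:

```python
from collections import defaultdict

# rows: list of value tuples; cond_idx: indices of the condition attributes;
# dec_idx: index of the decision attribute.
def positive_region_size(rows, cond_idx, dec_idx):
    """Number of objects whose condition-attribute values determine the decision uniquely."""
    decisions = defaultdict(set)
    for row in rows:
        decisions[tuple(row[j] for j in cond_idx)].add(row[dec_idx])
    return sum(1 for row in rows
               if len(decisions[tuple(row[j] for j in cond_idx)]) == 1)

def significance(rows, cond_idx, dec_idx, a):
    """Drop in positive-region size when attribute a is removed; 0 marks a as redundant."""
    reduced = [j for j in cond_idx if j != a]
    return (positive_region_size(rows, cond_idx, dec_idx)
            - positive_region_size(rows, reduced, dec_idx))
```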

Step 2. Based on the concept of the hypercube, generalize every attribute as follows:
(1) according to the decision attribute, cluster the instances of the information table;
(2) generalize the instances that belong to the same class.
Then calculate the breakpoint set.
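One plausible reading of this step, assumed here since only the outline is given, is that the instances of each decision class are enclosed in an axis-aligned hypercube spanned by the per-class minimum and maximum of every attribute, and the hypercube boundaries serve as candidate breakpoints:

```python
from collections import defaultdict

def candidate_breakpoints(rows, cond_idx, dec_idx):
    """Collect, per condition attribute, the class-wise minima and maxima as candidate breakpoints."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[dec_idx]].append(row)
    breakpoints = {j: set() for j in cond_idx}
    for members in by_class.values():
        for j in cond_idx:
            values = [row[j] for row in members]
            breakpoints[j].update((min(values), max(values)))   # hypercube boundaries of the class
    return {j: sorted(points) for j, points in breakpoints.items()}

rows = [(0.2, 1), (0.4, 1), (0.9, 0)]
print(candidate_breakpoints(rows, cond_idx=[0], dec_idx=1))     # {0: [0.2, 0.4, 0.9]}
```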

Step 3. According to the breakpoint set, calculate the relative comentropy of the information table. If the relative comentropy satisfies the consistency requirement, then turn to Step 6; otherwise turn to Step 4.

Step 4. Based on the overall discretization of the information table, partially discretize the newly divided regions as follows.
Let the two adjacent discrete sets be given; if the new class formed by clustering the instance subsets of these two sets does not contain any inconsistent instance, then the two sets should be merged into one class, which then yields a new breakpoint set.
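Step 4 can be read as merging adjacent discretization intervals whenever the merge does not mix instances of different decision classes. The sketch below makes that assumption and represents each interval only by the set of decision labels of the instances falling into it:

```python
def merge_consistent_intervals(interval_labels):
    """interval_labels: list of sets of decision labels, one set per adjacent interval.
    Merge neighbouring intervals whose union still contains a single decision class."""
    merged = [set(interval_labels[0])]
    for labels in interval_labels[1:]:
        if len(merged[-1] | labels) <= 1:      # still pure after merging
            merged[-1] |= labels
        else:
            merged.append(set(labels))
    return merged

print(merge_consistent_intervals([{0}, {0}, {1}, {1}, {0}]))  # [{0}, {1}, {0}]
```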

Step 5. According to the new breakpoint set, discretize the information table and calculate the relative comentropy; if the consistency requirement is still not satisfied, turn to Step 3; otherwise turn to Step 6.

Step 6. According to the final breakpoint set, map the attribute values of the information table onto appropriate integers.
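Step 6 amounts to replacing each continuous value by the index of the interval it falls into under the final breakpoint set; a short sketch using Python's bisect module:

```python
from bisect import bisect_right

def to_interval_index(value, breakpoints):
    """Map a continuous value to the integer index of its interval."""
    return bisect_right(sorted(breakpoints), value)

print(to_interval_index(0.35, [0.3, 0.7]))  # 1: falls into the middle interval
```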

Step 7. In the new, discretized information table, each object forms a new set, and these sets are respectively marked as the initial object sets. Based on the additive property theorem, calculate the characteristic vector of the combination of the first two object sets.
After combination, if the similarity of the combined set is not less than the lower similarity limit of a class's objects, then the two object sets are merged into one set, which is marked as the first initial class; if the internal similarity of the combined set is less than the lower similarity limit, then each of the two object sets becomes a separate new initial class, and the number of classes is updated accordingly.

Step 8. For each remaining object set, use the addition rule to calculate the characteristic vector of its combination with every existing class, and find the class that yields the maximum similarity.
If this maximum similarity is not less than the lower similarity limit of a class's objects, then the object set is merged into that class, which keeps its original mark after updating; if it is less than the lower similarity limit, then the object set becomes a new initial class and the number of classes is increased by one.
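Taken together, Steps 7 and 8 form a single scan over the discretized objects: each object is tentatively merged with every existing class, the best resulting similarity is compared with the lower similarity limit, and the object either joins that class or starts a new one. The sketch below is a loose illustration of that scan; it additionally records, for every class, the common value of each agreeing attribute (an assumption beyond Definition 5) so that a tentative merge can be evaluated without rescanning the members, and the threshold 0.5 mirrors the value used in the simulation.

```python
def rough_cluster(objects, m, limit=0.5):
    """One-scan clustering. Each class keeps (member list, pattern), where the pattern
    maps attribute positions to the value shared by all current members."""
    classes = []
    for obj in objects:
        best_idx, best_pattern, best_sim = None, None, -1.0
        for idx, (members, pattern) in enumerate(classes):
            merged = {j: v for j, v in pattern.items() if obj[j] == v}
            sim = len(merged) / m                    # similarity of the tentative merge
            if sim > best_sim:
                best_idx, best_pattern, best_sim = idx, merged, sim
        if best_idx is not None and best_sim >= limit:
            members, _ = classes[best_idx]
            classes[best_idx] = (members + [obj], best_pattern)
        else:
            classes.append(([obj], {j: obj[j] for j in range(m)}))
    return [members for members, _ in classes]

# Hypothetical discretized objects with m = 4 attributes each.
data = [(1, 2, 0, 3), (1, 2, 1, 3), (0, 0, 2, 1), (0, 0, 2, 2), (3, 3, 3, 3)]
print(rough_cluster(data, m=4, limit=0.5))   # three classes; the last object stays alone
```

In this illustrative run the last object ends up alone, which is exactly the kind of isolated-object class that Step 9 removes.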

Step 9. Among the finally established classes, the ones that contain only a few objects are isolated-object classes; they can be removed according to the actual demands, and the remaining classes are the final result of the clustering.
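Continuing the sketch above, Step 9 is then only a filter on the result: classes whose member count stays below an application-dependent threshold (2 in this hypothetical example) are treated as isolated-object classes and dropped.

```python
clusters = rough_cluster(data, m=4, limit=0.5)
final_classes = [c for c in clusters if len(c) >= 2]   # drop isolated-object classes
print(final_classes)                                    # the two classes with 2 objects each
```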

For convenience of illustrating the rough clustering algorithm, the flow chart is shown in Figure 1.

5. Simulation

The source information table is a decision table of concrete freezing resistance. In this table, the five condition attributes are all continuous attributes, and their values are five test results that describe the condensability of the concrete; the information table has one decision attribute, the class: if its value is 1, it means that the concrete freezing resistance is good; if its value is 0, it means that the concrete freezing resistance is bad. The lower similarity limit of a class is set to 0.5.

Based on the attribute significance formula, we can determine that one of the condition attributes is redundant; after deleting it from the information table, we obtain an attribute reduction of the source information table.

By using the continuous attribute discretization presented in this algorithm, after discretizing the information table we obtain the discretized decision table shown in Table 1.

Set up one set for each object, and mark them respectively.

Combine the first two object sets; in the new set, the two data objects take the same value on a certain set of attributes; from this, we can work out the similarity of the combined set.

After combination, since the internal similarity of the set is not less than the lower similarity limit 0.5 of a class's objects, the two object sets are combined into one set, which is marked as the first initial class; the number of initial classes is then 1.

Again, the next object set is combined with this class to form a new set; in this new set, the data objects take the same value on a certain set of attributes; the set similarity is then computed from it.

After combination, since the internal similarity of the set is not less than the lower similarity limit 0.5 of a class's objects, these object sets are combined into one set, which is still marked as the first initial class; the number of initial classes remains 1. Incorporating the next object set in the same way yields a new combined set, and the set of attributes on which its data objects take the same value gives the corresponding set similarity. After this combination, since the internal similarity of the set is less than the lower similarity limit 0.5 of a class's objects, this object set becomes a new initial class, and the number of classes becomes 2. Calculating the similarities of the following object set with the existing classes and seeking the maximum, we obtain the best-matching class.

If the maximum similarity is not less than the lower similarity limit 0.5 of a class's objects, the object set is merged into that class, which keeps its mark as an initial class after updating; if it is less than the lower similarity limit 0.5, the object set becomes a new initial class and the number of classes is increased by one. The same operations are carried out in turn for the remaining object sets, until the final initial classes are obtained.

6. Conclusions

From the clustering result, we can see that only one data object was wrongly clustered into a different class; the clustering results of the other data objects completely accord with the class labels known beforehand. From the operation of the numerical example, we can find that the clustering algorithm presented in this paper has the following advantages:
(1) because the data are pretreated with rough set theory in this clustering algorithm, the data structure is simplified, the clustering algorithm is simple to implement, and the cluster quality is also improved;
(2) this clustering algorithm is not affected by the distributional characteristics of the data space, and, based on the set characteristic values, isolated objects can be eliminated; in the example presented, one data object is an isolated object;
(3) because set characteristic vectors are the operands in this clustering algorithm, the clustering and dividing of the data objects can be finished by scanning the information table only once, so this algorithm is efficient.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The present work was supported partially by the Polish-Norwegian Research Programme (Project no. Pol-Nor/200957/47/2013). The authors highly appreciate the above financial support.