Research Article | Open Access
A Novel Research on Rough Clustering Algorithm
This study addresses the problem that traditional clustering algorithms are sensitive to the distribution of the data space by combining rough set theory with conventional clustering. The proposed rough clustering algorithm uses the consistency between the condition attributes and the decision attributes in the information table as its guiding principle, and it uses data hypercubes and information entropy to reduce and discretize the data attributes. On this basis, by applying the addition rule of set characteristic vectors, the data objects can be clustered with only one scan of the information table. Experiments show that the proposed algorithm is efficient and feasible.
1. Introduction
With the rapid development and widespread application of computer and network technology, more and more service data have become available, and these data contain a large amount of valuable information that is hard to detect. Many researchers have therefore worked on this problem. Clustering aims to group similar objects into one cluster and dissimilar objects into different clusters [1–7]. Many clustering algorithms have been proposed, and perhaps the most widely used is the classic k-means. However, these algorithms are very sensitive to the distribution of the data space, and when the data are compressed to improve efficiency, possibly with a loss of quality, the clustering result can be poor. Many control theories [8–11] have also been discussed in relation to this issue. Rough set theory was proposed by Professor Pawlak of Warsaw University of Technology in the 1980s [12–16]. It is a theory for simplifying data and is especially suited to uncertain and incomplete data. Its main characteristic is that it uses only the information provided by the data themselves and needs no additional information or prior knowledge to compress or discretize data or to reduce data attributes [17–19]. A new clustering algorithm based on rough sets is therefore presented in this paper.
2. Related Definitions
Definition 1 (information system). Let an information system be $S = (U, A, V, f)$, where $U = \{x_1, x_2, \ldots, x_n\}$ is a nonempty finite set of objects and $x_i$ is one object; $A$ is the set of the objects' attributes, divided into two disjoint sets, the conditional attribute set $C$ and the decision attribute set $D$, with $A = C \cup D$ and $C \cap D = \emptyset$; $V = \bigcup_{a \in A} V_a$ is the set of attribute values, where $V_a$ is the domain of attribute $a$; and $f : U \times A \to V$ is a mapping that assigns a value to each attribute of every object, that is, $f(x, a) \in V_a$ for all $x \in U$ and $a \in A$.
Definition 2 (interval partition). Let one set communication systems as ; is the number of decisive kinds; a breakpoint from domain which is formed by attribute is marked . If , ; at any breakpoint set of domain is defining as interval partition of .
Definition 3 (comentropy). Let one set communication table as , , , , for each subset ; is class ’s samples number in subset , if ; the comentropy is .
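As a rough illustration of Definition 3, the sketch below computes the entropy of a subset from the class counts of its samples. It assumes that "comentropy" means the standard Shannon information entropy over the decision classes; the function name `subset_entropy` and the base-2 logarithm are choices made for this example, not taken from the paper.

```python
from collections import Counter
from math import log2


def subset_entropy(labels):
    """Shannon entropy (in bits) of the decision-class labels in a subset.

    Assumption: each class j contributes -(k_j / n) * log2(k_j / n),
    where k_j is the number of samples of class j and n is the subset size.
    """
    n = len(labels)
    if n == 0:
        return 0.0  # convention for an empty subset
    counts = Counter(labels)  # k_j for every class j present in the subset
    return -sum((k / n) * log2(k / n) for k in counts.values())
```

A subset whose samples all share one class has entropy 0, while a subset split evenly between two classes has entropy 1 bit, so lower values indicate a purer interval.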
Definition 4 (similarity of set). If the objects number is , and the number of attributes which describe each object is , is discrete value, is one object subset among them, and the objects number of it is marked , and in all the objects’ discrete intervals of this subset, the number of attributes which have the same value is , the similarity of set is defined as .
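The idea behind Definition 4 can be sketched as follows, under the assumption that the similarity of a set is the number of attributes on which all its objects take the same discrete value, divided by the total number of attributes. The names `same_valued_attributes` and `set_similarity` are hypothetical, and objects are represented as equal-length tuples of discretized values.

```python
def same_valued_attributes(objects):
    """Indices of attributes on which every object in the subset agrees."""
    if not objects:
        return []
    first = objects[0]
    return [
        j
        for j in range(len(first))
        if all(obj[j] == first[j] for obj in objects)
    ]


def set_similarity(objects):
    """Assumed similarity: (number of same-valued attributes) / (number of attributes)."""
    m = len(objects[0])  # number of attributes describing each object
    return len(same_valued_attributes(objects)) / m
```

For example, the two objects `(1, 2, 0, 3)` and `(1, 2, 1, 3)` agree on three of their four attributes, so under this reading their set similarity is 0.75.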
Definition 5 (characteristic vector of set). If the objects number is , and the number of attributes which describe each object is , and is one object subset among them, the objects number of it is marked , and in all the objects’ discrete intervals of this subset, the number of attributes which have the same value is , correspondingly, the sequence numbers of attributes are ; so the characteristic vector of the object set is , where , .
Definition 6 (addition rule of set characteristic vector). If the objects number is , and the number of attributes which describe each object is , and are two disjoint object subsets among them, correspondingly, their set characteristic vectors are , ; so the addition rule of set characteristic vector is defined as where , , , .
3. Related Theorems
To support the algorithm, and based on the relation between conditional attributes and decision attributes as well as related concepts of rough set theory, the following theorems are introduced and proved in this paper.
Theorem 7. Let one set decision table as , where is a nonempty finite set of objects, A is a set of object’s properties, , and , ; let one set as a rough negative domain; if , ; in this rough negative domain, there exists the following: .
Proof. Consider the following:
Theorem 8. Let one set , and and as a rough negative domain; there exists the following: .
Proof. For and , there are two conditions: (1)if is a redundant attribute, there exists the following: ; so ;(2)if is an important attribute, there exists the following: ; so, , that is ; so .
Theorem 9. If the objects number is , and the number of attributes which describe each object is , and are two disjoint object subsets among them, and combines with to form set , correspondingly, their set characteristic vectors are
Proof. Because there is no intersection between set and , and the numbers of their elements are and , so the elements number of set is ; that is, .
First, let us prove that . For any , in set , all objects have the same attribute property in the place whose ordinal number is , and because , all objects in set have the same attribute property in the place whose ordinal number is too. So ; by managing together, we can have too; hence, .
On the other hand, it can be proved that , actually, for any , because all objects in set have the same attribute property in the place whose ordinal number is . And all objects in set have the same attribute property in the place whose ordinal number is too. Then in set , all objects must have the same attribute property in the place whose ordinal number is ; that is, , ; so .
Based on the definition of set similarity and , we can come to the conclusion that And based on the definition of characteristic vector, it is clear that To sum up, the theorem has been proved.
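The additive property can be illustrated with a small sketch. It rests on assumptions that go beyond the text: the characteristic vector is modelled as the object count plus a map from each same-valued attribute index to its shared value, and two vectors are added by summing the counts and keeping only the attributes that both subsets share with the same value. Storing the shared values (not just the indices) is a choice made here so that the merge can be computed without rescanning the objects; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SetFeature:
    count: int    # number of objects in the subset
    shared: dict  # attribute index -> value common to all objects
    n_attrs: int  # total number of attributes describing each object

    @property
    def similarity(self):
        # Assumed set similarity: shared attributes / all attributes.
        return len(self.shared) / self.n_attrs


def feature_of(obj):
    """Characteristic vector of a single-object set: every attribute is shared."""
    return SetFeature(1, dict(enumerate(obj)), len(obj))


def add_features(a, b):
    """Characteristic vector of the union of two disjoint subsets."""
    shared = {
        j: v
        for j, v in a.shared.items()
        if j in b.shared and b.shared[j] == v
    }
    return SetFeature(a.count + b.count, shared, a.n_attrs)
```

Because the result depends only on the two input vectors, subsets can be merged repeatedly without revisiting their member objects, which is what makes a single scan of the information table sufficient.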
4. Algorithm Description
Before realizing the rough clustering, the discrete breakpoint should be initialized first…. Set , and calculate the relative comentropy of source information table.
Step 1. Apply the attribute significance formula to identify the key attributes of the information table, and eliminate the redundant attributes.
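The attribute significance formula itself is not reproduced here, so the following is only a sketch using a common rough-set choice: the dependency degree of the decision on a set of conditional attributes (the fraction of objects whose decision is uniquely determined by their attribute values), with the significance of an attribute taken as the drop in dependency when it is removed. Under this assumption, an attribute with significance 0 is redundant. The function names and the row/column data layout are hypothetical.

```python
from collections import defaultdict


def dependency(rows, decisions, attrs):
    """Fraction of objects whose decision is determined by their values on attrs.

    rows must be nonempty; attrs is a list of column indices.
    """
    seen = defaultdict(set)   # attribute-value tuple -> decisions observed
    sizes = defaultdict(int)  # attribute-value tuple -> number of objects
    for row, d in zip(rows, decisions):
        key = tuple(row[a] for a in attrs)
        seen[key].add(d)
        sizes[key] += 1
    consistent = sum(sizes[k] for k, ds in seen.items() if len(ds) == 1)
    return consistent / len(rows)


def significance(rows, decisions, attrs, a):
    """Drop in dependency when attribute a is removed (0 means redundant)."""
    rest = [b for b in attrs if b != a]
    return dependency(rows, decisions, attrs) - dependency(rows, decisions, rest)
```

In a table where one column is constant across all objects, removing that column leaves the dependency unchanged, so its significance is 0 and it would be eliminated in this step.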
Step 2. Based on the concept of hypercube, generalize every attribute as follows:
(1) according to the decisive attribute, cluster the instances of information table;
(2) generalize the instances which belong to the same class.
Calculate the breakpoint set .
Step 4. Based on the integral discretization of information table, partially discretize the newly divided regions as follows.
Let’s set the two discrete sets as , ; if the new class which has been clustered by the two set’s instances subsets does not contain any different instance, then these two sets should be clustered to one class, which then forms a new breakpoint set .
Step 6. According to the breakpoint set , map the attribute values of the information table to the corresponding integers.
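The integer mapping of Step 6 might look like the sketch below. It assumes that each attribute has a sorted list of breakpoints and that a value is coded by the index of the interval it falls into, with a value equal to a breakpoint assigned to the interval on its right; these conventions and the function names are assumptions for illustration.

```python
from bisect import bisect_right


def discretize_value(value, breakpoints):
    """Index of the interval containing value; breakpoints must be sorted ascending."""
    return bisect_right(breakpoints, value)


def discretize_table(rows, breakpoints_per_attr):
    """Replace every continuous value by the integer code of its interval."""
    return [
        tuple(
            discretize_value(v, breakpoints_per_attr[j])
            for j, v in enumerate(row)
        )
        for row in rows
    ]
```

With breakpoints `[2.0, 5.0]`, values below 2.0 map to 0, values from 2.0 up to but not including 5.0 map to 1, and values of 5.0 or more map to 2.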
Step 7. In the new information table which has been discretized, each object sets up a new set, and they are, respectively, marked , . Based on the additive property theorem, let us calculate
After combination, if the set’s similarity is greater than any class’ object lower similarity limit . Then and combine to form a set, as initial class marks ; if the set’s internal similarity is less than any class’ object lower similarity limit . Then and each will be a respective new initial class, marks and ; furthermore, the classes number marks .
Step 8. According to the set , , calculate ; seek , thus we have .
If is greater than any class’ object lower similarity limit . Then and combine to form a set, as a class after updating still marks ; if is less than any class’ object lower similarity limit . Then let be a new initial class, marks , and the classes number .
Step 9. Among the finally established classes , , those that contain few objects are isolated-object classes; they can be removed according to actual demands, and the remaining classes are the final clustering result.
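Steps 7 to 9 can be sketched as a single pass over the discretized table. This is an illustrative reading, not the authors' implementation: it assumes the set similarity is the fraction of attributes on which all members agree, that an object joins the existing class whose merged similarity is highest when that similarity is not less than the threshold, and that classes smaller than a hypothetical `min_size` parameter are treated as isolated and dropped.

```python
def rough_cluster(objects, threshold, min_size=1):
    """Single-scan clustering of discretized objects (tuples of equal length)."""
    if not objects:
        return []
    m = len(objects[0])  # number of attributes
    clusters = []        # each: {"members": [...], "shared": {index: value}}
    for obj in objects:
        best, best_sim, best_idx = None, -1.0, None
        for i, c in enumerate(clusters):
            # Attributes still shared by the class after adding obj.
            shared = {j: v for j, v in c["shared"].items() if obj[j] == v}
            sim = len(shared) / m
            if sim > best_sim:
                best = {"members": c["members"] + [obj], "shared": shared}
                best_sim, best_idx = sim, i
        if best is not None and best_sim >= threshold:
            clusters[best_idx] = best  # merge into the most similar class
        else:
            # Start a new class in which every attribute is shared.
            clusters.append({"members": [obj], "shared": dict(enumerate(obj))})
    # Step 9: drop classes regarded as isolated objects.
    return [c["members"] for c in clusters if len(c["members"]) >= min_size]
```

Each object is compared only with the stored shared-attribute maps of the current classes, never with their individual members, so the information table is scanned once.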
To illustrate the rough clustering algorithm, its flowchart is shown in Figure 1.
The source information table is a decision table of concrete freezing resistance. In this table, the conditional attributes , , , , are all continuous attributes whose values are five test results describing the condensability of the concrete. The information table has one decision attribute, the class: a value of 1 means that the freezing resistance of the concrete is good, and a value of 0 means that it is bad. The similarity threshold of a class is set to 0.5.
Based on the attribute significance formula, the attribute is found to be redundant, so it is deleted from the information table, which gives an attribute reduction of the source information table.
Applying the continuous attribute discretization of this algorithm to the information table gives the discretized decision table shown in Table 1.
Set up one set for each data object, marked, respectively, , .
Combine and , in the new set ; the same attributes set of data objects and is ; from this, we can work out the similarity of set as
After combination, since the set’s internal similarity is not less than any class’ object lower similarity limit 0.5, then and combine to form a set, as initial class marks ; then the number of initial class is 1.
Again, sets , , and combine to form a set; in this new set, the same attributes set of data objects , , and is ; then the set similarity is .
After combination, since the set’s internal similarity is not less than any class’ object lower similarity limit 0.5, then , , , and combine to form a set, as initial class marks ; then the number of initial class is still 1. Incorporating , sets and , and form a new set; in this new set, the same attributes set of data objects , , , and is ; then the set similarity is . After Combination, since the set’s internal similarity is less than any class’ object lower similarity limit 0.5, then let be a new initial class, marks , and the classes number turn to 2. Calculating , and seek , thus we have .
If is greater than any class’ object lower similarity limit 0.5. Then and combine to form a set, as an initial class after updating, still marks ; if is less than any class’ object lower similarity limit 0.5. Then let be a new initial class, marks , and . For , carry out the similar operations in turn. Until we get the final initial classes , , , .
From the clustering result, we can see that only data object was wrongly clustered into a different class; the clustering results of the other data objects fully agree with the known class labels. The numerical example shows that the clustering algorithm presented in this paper has the following advantages:
(1) because the data are preprocessed with rough set theory, the data structure is simplified, the clustering algorithm is simple to implement, and cluster quality is improved;
(2) the algorithm is not affected by the distributional characteristics of the data space, and isolated objects can be eliminated based on the set characteristic values; in the example presented, data object is an isolated object;
(3) because the algorithm operates on set characteristic vectors, and the clustering and partitioning of data objects can be completed with only one scan of the information table, the algorithm is efficient.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
This work was partially supported by the Polish-Norwegian Research Programme (Project no. Pol-Nor/200957/47/2013). The authors highly appreciate this financial support.
References
- C. Bean and C. Kambhampati, “Autonomous clustering using rough set theory,” International Journal of Automation and Computing, vol. 5, no. 1, pp. 90–102, 2008.
- T. B. Ho and N. B. Nguyen, “Nonhierarchical document clustering based on a tolerance rough set model,” International Journal of Intelligent Systems, vol. 17, no. 2, pp. 199–212, 2002.
- P. Lingras, A. Elagamy, A. Ammar, and Z. Elouedi, “Iterative meta-clustering through granular hierarchy of supermarket customers and products,” Information Sciences, vol. 257, pp. 14–31, 2014.
- C. L. Ngo and H. S. Nguyen, “A tolerance rough set approach to clustering web search results,” in Knowledge Discovery in Databases: PKDD 2004, pp. 515–517, Springer, 2004.
- A. Singh, “Grid fuzzy clustering: tongue diagnosis,” The International Journal of Big Data, vol. 1, 2014.
- W. Song and S. C. Park, “Latent semantic analysis for vector space expansion and fuzzy logic-based genetic clustering,” Knowledge and Information Systems, vol. 22, no. 3, pp. 347–369, 2010.
- D.-R. Yu, Q.-H. Hu, and W. Bao, “Combining rough set methodology and fuzzy clustering for knowledge discovery from quantitative data,” Proceedings of the Chinese Society of Electrical Engineering, vol. 24, no. 6, pp. 205–210, 2004.
- S. Yin, S. X. Ding, A. Haghani, H. Hao, and P. Zhang, “A comparison study of basic data-driven fault diagnosis and process monitoring methods on the benchmark Tennessee Eastman process,” Journal of Process Control, vol. 22, no. 9, pp. 1567–1581, 2012.
- S. Yin, H. Luo, and S. Ding, “Real-time implementation of fault-tolerant control systems with performance optimization,” IEEE Transactions on Industrial Electronics, vol. 64, no. 5, pp. 2402–2411, 2014.
- X. Zhao, L. Zhang, P. Shi, and H. Karimi, “Novel stability criteria for TS fuzzy systems,” IEEE Transactions on Fuzzy Systems, 2013.
- X. Zhao, L. Zhang, P. Shi, and H. Karimi, “Robust control of continuous-time systems with state-dependent uncertainties and its application to electronic circuits,” IEEE Transactions on Industrial Electronics, vol. 61, no. 8, pp. 4161–4170, 2013.
- F. Li, M. Ye, and X. Chen, “An extension to Rough c-means clustering based on decision-theoretic Rough Sets model,” International Journal of Approximate Reasoning, vol. 55, pp. 116–129, 2014.
- P. Lingras, “Rough set clustering for web mining,” in Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ '02), pp. 1039–1044, May 2002.
- P. Lingras, “Applications of rough set based k-means, Kohonen SOM, GA clustering,” in Transactions on Rough Sets VII, pp. 120–139, Springer, 2007.
- S. K. Pal and P. Mitra, “Multispectral image segmentation using the rough-set-initialized EM algorithm,” IEEE Transactions on Geoscience and Remote Sensing, vol. 40, no. 11, pp. 2495–2501, 2002.
- F. Questier, I. Arnaut-Rollier, B. Walczak, and D. L. Massart, “Application of rough set theory to feature selection for unsupervised clustering,” Chemometrics and Intelligent Laboratory Systems, vol. 63, no. 2, pp. 155–167, 2002.
- S. Yin, G. Wang, and H. R. Karimi, “Data-driven design of robust fault detection system for wind turbines,” Mechatronics, 2013.
- S. Yin, X. Yang, and H. R. Karimi, “Data-driven adaptive observer for fault diagnosis,” Mathematical Problems in Engineering, vol. 2012, Article ID 832836, 21 pages, 2012.
- S. Yin, S. X. Ding, A. H. A. Sari, and H. Hao, “Data-driven monitoring for stochastic systems and its application on batch process,” International Journal of Systems Science, vol. 44, no. 7, pp. 1366–1376, 2013.
Copyright © 2014 Tao Qu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.