Research Article  Open Access
A Global-Relationship Dissimilarity Measure for the k-Modes Clustering Algorithm
Abstract
The k-modes clustering algorithm has been widely used to cluster categorical data. In this paper, we first analyze the k-modes algorithm and its dissimilarity measure. Based on this analysis, we propose a novel dissimilarity measure named GRD. GRD considers not only the relationships between an object and all cluster modes but also the differences among attributes. Experiments were conducted on four real data sets from UCI. The results show that GRD achieves better performance than the two existing dissimilarity measures used in the k-modes and Cao’s algorithms.
1. Introduction
Clustering is an important technique in data mining, and its main task is to group the given data based on some similarity/dissimilarity measure [1]. Most clustering techniques rely largely on distances to measure the dissimilarity between objects [2–4]. However, these methods work only on data sets with numeric attributes, which limits their use in categorical data clustering problems [5].
Some researchers have made great efforts to quantify the relationships among different categorical attributes. Guha et al. [6] proposed a hierarchical clustering method termed ROCK, which measures the similarity between a pair of objects [7]. In ROCK, the number of links between two objects is computed as the number of their common neighbors [8]. However, two deficiencies remain: (1) two parameters must be assigned in advance and (2) a large amount of computation is involved [9]. For these reasons, new algorithms such as QROCK [10], DNNS [11], and GEROCK [12] have been developed to modify or improve the ROCK algorithm. To remove the numeric-only limitation of the k-means algorithm, Huang [13, 14] proposed the k-modes algorithm, which extends the k-means algorithm by using (1) a simple matching dissimilarity measure for categorical attributes; (2) modes in place of means for clustering; and (3) a frequency-related strategy to update modes to minimize the clustering cost [15]. In fact, the idea of simple matching has been used in many clustering algorithms, such as the fuzzy k-modes algorithm [16], the fuzzy k-modes algorithm with fuzzy centroids [17], and the k-prototypes algorithm [14]. However, simple matching often results in clusters with weak intrasimilarity [18] and disregards the dissimilarity hidden between categorical values [19].
In this paper, a Global-Relationship Dissimilarity (GRD) measure for the k-modes clustering algorithm is proposed. Instead of simple matching, this dissimilarity measure considers not only the relationships between an object and all cluster modes but also the differences among attributes. The clustering effectiveness of k-modes based on GRD (KBGRD) is demonstrated on four standard data sets from the UCI Machine Learning Repository [20].
The remainder of this paper is organized as follows. Section 2 reviews and analyzes the dissimilarity measure used in k-modes. Section 3 proposes the new dissimilarity measure GRD. Section 4 describes the details of the KBGRD algorithm. Section 5 illustrates the performance and stability of KBGRD. Finally, concluding remarks are given in Section 6.
2. Related Works
2.1. Categorical Data
Structured data can be stored in a table in which each row represents the facts about one object, and practical data usually contain categorical attributes [21]. We first define the term “data set” [22].
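Definition 1 (data set). A data set information system can be expressed as a quadruple $IS = (U, A, V, f)$, which satisfies the following:
(1) $U = \{x_1, x_2, \ldots, x_n\}$ is a nonempty set of $n$ data objects, called the universe;
(2) $A = \{a_1, a_2, \ldots, a_m\}$ is a nonempty set of $m$ categorical attributes;
(3) $V$ is the union of all attribute domains, that is, $V = \bigcup_{j=1}^{m} V_{a_j}$, where $V_{a_j} = \{a_j^{(1)}, a_j^{(2)}, \ldots, a_j^{(n_j)}\}$ is the value domain of attribute $a_j$, which is finite and unordered; $n_j$ is the number of categories of attribute $a_j$ for $1 \le j \le m$;
(4) $f$ is a mapping function which can be formally expressed as $f: U \times A \to V$.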
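As a rough illustration of Definition 1, the sketch below stores a small table of categorical objects and derives the attribute domains from it. The data, the attribute names, and the helper `attribute_domains` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy data: three objects (the universe U) described by
# two categorical attributes (the attribute set A).
A: List[str] = ["color", "size"]
U: List[Tuple[str, ...]] = [("red", "small"), ("blue", "small"), ("red", "large")]


def attribute_domains(objects: Sequence[Sequence[str]],
                      attributes: Sequence[str]) -> Dict[str, List[str]]:
    """Collect the value domain of every attribute.

    Domains are unordered in the definition; they are sorted here only
    to make the output deterministic.
    """
    return {name: sorted({row[j] for row in objects})
            for j, name in enumerate(attributes)}


def f(i: int, attribute: str) -> str:
    """Mapping function: the value of object i on the given attribute."""
    return U[i][A.index(attribute)]


V = attribute_domains(U, A)
# Number of categories of each attribute.
n_categories = {name: len(values) for name, values in V.items()}
```

Here `V` plays the role of the union of attribute domains, kept per attribute for convenience, and `f` is the lookup into the table.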
2.2. k-Modes Dissimilarity Measure
The k-modes clustering algorithm improves on the k-means algorithm [4] by using a simple dissimilarity measure for categorical data and a frequency-related strategy to update the modes so as to minimize the clustering cost. These extensions remove the numeric-only limitation of the k-means algorithm and allow the clustering process to be applied to large categorical data sets from real-world databases [22].
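Definition 2. Let $IS = (U, A, V, f)$ be a categorical data set information system as defined in Definition 1. For any object $x_i \in U$ and cluster mode $z_l$ for $1 \le l \le k$, $d(x_i, z_l)$ is the simple matching dissimilarity measure between object $x_i$ and the mode of the $l$th cluster, which is defined as follows:
$$d(x_i, z_l) = \sum_{j=1}^{m} \delta(x_{ij}, z_{lj}). \tag{1}$$
In (1), $x_{ij}$ and $z_{lj}$ denote the values of $x_i$ and $z_l$ on attribute $a_j$, and $\delta$ can be expressed as
$$\delta(x_{ij}, z_{lj}) = \begin{cases} 0, & x_{ij} = z_{lj}, \\ 1, & x_{ij} \ne z_{lj}. \end{cases}$$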
Table 1 shows nine objects with four attributes and three initial cluster modes. To determine the appropriate cluster for the object under consideration, its dissimilarity to each of the three cluster modes must be computed. According to (1), the three dissimilarities are equal. Therefore, the simple matching measure cannot determine to which cluster the object should be assigned.
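The tie described above is easy to reproduce. The following minimal sketch implements simple matching as a count of mismatched attributes; the object and the three modes are hypothetical values chosen to produce a tie and are not the entries of Table 1.

```python
def simple_matching(x, z):
    """Simple matching dissimilarity: number of attributes on which x and z differ."""
    return sum(1 for xj, zj in zip(x, z) if xj != zj)


# Hypothetical object and modes with four categorical attributes each.
x = ("A", "B", "C", "E")
modes = [
    ("A", "D", "D", "E"),
    ("B", "B", "D", "E"),
    ("D", "D", "C", "E"),
]

# One dissimilarity per mode; all three coincide, so the assignment is ambiguous.
distances = [simple_matching(x, z) for z in modes]
```

Because every mode differs from the object on exactly two attributes, the measure gives no basis for preferring one cluster over another.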

The dissimilarity between an object and a cluster mode should take into account the relationships between the object and all cluster modes as well as the differences among attributes. When the k-modes dissimilarity measure handles a given attribute, it simply matches the object against the mode and ignores the differences among attributes. For example, for attribute “A4” in Table 1, almost all objects and cluster modes take the value “E”, so “A4” should contribute more to the dissimilarity than the other attributes. However, the k-modes dissimilarity treats all attributes equally.
3. Global-Relationship Dissimilarity Measure
Definition 3. Let be a categorical data set information system which is defined in Definition 1 and . For any object and cluster mode for , is the new dissimilarity measure between object and the mode of the th cluster which is defined as In (2), is the dimension number of data set and the similarity function is defined as follows:subject towhere is the number of cluster modes, andhere is satisfied with
For the example in Table 1, the dissimilarity between the object and each of the three cluster modes is computed to determine the cluster to which the object should be assigned. According to (2)–(6), the dissimilarity to the second mode is strictly the smallest of the three. Hence, the object can be assigned to cluster “2” without ambiguity.
4. KBGRD Algorithm
In this section, we give the concrete procedure of the k-modes algorithm based on GRD (KBGRD) and analyze its computational complexity.
4.1. KBGRD Algorithm Description
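Definition 4. Let $IS = (U, A, V, f)$ be a categorical data set information system as defined in Definition 1. The k-modes algorithm uses the k-means paradigm to cluster categorical data. The objective function of the k-modes algorithm is defined as follows:
$$F(W, Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{li}\, d(x_i, z_l). \tag{7}$$
In (7), $x_i \in U$ and $d(x_i, z_l)$ is the dissimilarity between object $x_i$ and mode $z_l$. Here $k$ is a known cluster number; $W = [w_{li}]$ is a $k$-by-$n$ $\{0, 1\}$ matrix; $w_{li}$ is a binary variable that indicates whether object $x_i$ belongs to the $l$th cluster, with $w_{li} = 1$ if $x_i$ belongs to the $l$th cluster and $0$ otherwise; $\sum_{l=1}^{k} w_{li} = 1$ for $1 \le i \le n$; $Z = [z_1, z_2, \ldots, z_k]$; and $z_l = (z_{l1}, z_{l2}, \ldots, z_{lm})$ is the $l$th cluster mode with categorical attributes $a_1, a_2, \ldots, a_m$.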
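The objective in Definition 4 is a membership-weighted sum of object-to-mode dissimilarities. The sketch below evaluates it for a hypothetical toy partition. Simple matching is used only as a stand-in dissimilarity, because the GRD formulas are not reproduced here; the function and variable names are illustrative.

```python
def simple_matching(x, z):
    """Stand-in dissimilarity: number of mismatched attributes."""
    return sum(1 for xj, zj in zip(x, z) if xj != zj)


def objective(W, X, Z, dissimilarity=simple_matching):
    """Sum of W[l][i] * dissimilarity(X[i], Z[l]) over all modes l and objects i.

    W is a k-by-n 0/1 membership matrix, X the list of n objects,
    and Z the list of k cluster modes.
    """
    return sum(
        W[l][i] * dissimilarity(X[i], Z[l])
        for l in range(len(Z))
        for i in range(len(X))
    )


# Hypothetical data: three objects, two modes, each object in exactly one cluster.
X = [("a", "b"), ("a", "c"), ("d", "c")]
Z = [("a", "b"), ("d", "c")]
W = [
    [1, 1, 0],  # objects 0 and 1 belong to cluster 0
    [0, 0, 1],  # object 2 belongs to cluster 1
]

value = objective(W, X, Z)
```

Only the second object differs from its mode, on one attribute, so the objective evaluates to 1 for this partition.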
4.2. Update and Convergence Analysis
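The steps of the KBGRD algorithm are presented below. Here $Z^{(t)}$ and $W^{(t)}$ denote the cluster modes and the membership matrix at the $t$th iteration, respectively.
(1) Randomly select $k$ distinct objects from $U$ as the initial modes $Z^{(1)}$. Determine $W^{(1)}$ such that $F(W, Z^{(1)})$ is minimized according to (8). Set $t = 1$.
(2) Determine $Z^{(t+1)}$ such that $F(W^{(t)}, Z^{(t+1)})$ is minimized according to (9). If $F(W^{(t)}, Z^{(t+1)}) = F(W^{(t)}, Z^{(t)})$, then stop; otherwise, go to step (3).
(3) Determine $W^{(t+1)}$ such that $F(W^{(t+1)}, Z^{(t+1)})$ is minimized according to (8). If $F(W^{(t+1)}, Z^{(t+1)}) = F(W^{(t)}, Z^{(t+1)})$, then stop; otherwise, set $t = t + 1$ and go to step (2).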
In each iteration, $W$ and $Z$ are updated by the following formulae.
When $Z$ is given, $W$ is updated by (8) for $1 \le i \le n$ and $1 \le l \le k$.
And when is given, is updated as follows:where , . Here, ; is the number of categorical of attribute for .
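The two alternating updates can be sketched as follows. This is a hedged illustration rather than the paper's formulae: the assignment step follows the minimum-dissimilarity rule described for (8), while the mode update uses the classical frequency-based rule of k-modes [13, 14], which may differ from (9). Simple matching stands in for GRD, ties go to the lowest cluster index, and an empty cluster keeps its previous mode; all of these are assumptions.

```python
from collections import Counter


def simple_matching(x, z):
    """Stand-in dissimilarity: number of mismatched attributes."""
    return sum(1 for xj, zj in zip(x, z) if xj != zj)


def assign(X, modes, dissimilarity=simple_matching):
    """Membership update: index of the least-dissimilar mode for every object.

    Ties are broken in favor of the lowest cluster index (an assumption).
    """
    return [
        min(range(len(modes)), key=lambda l: dissimilarity(x, modes[l]))
        for x in X
    ]


def update_modes(X, labels, old_modes):
    """Mode update: most frequent category per attribute within each cluster.

    An empty cluster keeps its previous mode (an assumption).
    """
    new_modes = []
    for l, old in enumerate(old_modes):
        members = [x for x, lab in zip(X, labels) if lab == l]
        if not members:
            new_modes.append(old)
            continue
        new_modes.append(tuple(
            Counter(x[j] for x in members).most_common(1)[0][0]
            for j in range(len(old))
        ))
    return new_modes


# Hypothetical data: four objects with two categorical attributes.
X = [("a", "x"), ("a", "y"), ("b", "y"), ("c", "z")]
modes = update_modes(X, [0, 0, 0, 1], [("a", "x"), ("c", "z")])
labels = assign(X, modes)
```

In this toy run the first cluster's mode moves from `("a", "x")` to `("a", "y")`, after which the assignments no longer change.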
Now we consider the convergence of the KBGRD algorithm.
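Theorem 5. For a fixed $Z = \hat{Z}$, $F(W, \hat{Z})$ is minimized when $W$ is updated by (8).
Proof. For a given $\hat{Z}$, we have $F(W, \hat{Z}) = \sum_{i=1}^{n} \sum_{l=1}^{k} w_{li}\, d(x_i, \hat{z}_l)$. Updating $W$ by (8) assigns each object to the mode with the minimum dissimilarity, and the dissimilarities between different objects and the modes are independent of one another. So $W$ is updated by (8) such that $F(W, \hat{Z})$ is minimized.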
Theorem 6. For a fixed $W = \hat{W}$, $F(\hat{W}, Z)$ is minimized when $Z$ is updated by (9).
Proof. For a given , we havewhere . Note that all inner sums are nonnegative and independent. Then minimizing is equivalent to maximizing each inner sum. When , according to (9), is maximized. So is updated by (9) such that is minimized.
Theorem 7. The KBGRD algorithm converges in a finite number of iterations.
Proof. Firstly, we note that there are only finitely many possible cluster modes, because every attribute domain is finite. Consequently, the number of possible sets of cluster modes is finite too.
Secondly, each possible mode appears at most once in the iteration process of KBGRD algorithm. If not, there exist () such that . According to Theorem 6, a given can obtain a certain , that is, , . When , we have , that is, in the iteration of algorithm, occurring at . However, if or , algorithm is stopped according to steps (2) and (3) of the KBGRD algorithm, that is, never occurs.
So the KBGRD algorithm converges in a finite number of iterations.
4.3. Pseudocodes and Complexity Analysis
The pseudocode of the KBGRD algorithm is presented in Pseudocode 1.

The subfunction Cluster() computes the dissimilarity between each object and each cluster mode and assigns every object to the cluster with the minimum dissimilarity. The subfunction Fun() computes the value of the objective function.
The main function acts as a controller for the iterations of the algorithm. We first choose $k$ distinct objects as initial modes. Line 2 initializes the clusters; Line 3 computes the initial clustering result and “new Dissimilarity.” Lines 4–9 iteratively update the modes and clusters. The iteration stops when “new Dissimilarity” no longer changes.
Referring to Pseudocode 1, the computational complexity of the KBGRD algorithm is analyzed as follows, considering only the major computational steps.
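The control flow described above can be sketched in Python as follows. This is not Pseudocode 1 itself: `cluster` and `fun` are illustrative analogs of Cluster() and Fun(), the mode update is the classical frequency-based rule, and simple matching stands in for GRD. The dissimilarity receives the full list of modes so that a measure depending on all modes, as GRD is described to do, could be plugged in. The `seed` parameter and the handling of empty clusters are assumptions.

```python
import random
from collections import Counter


def simple_matching(x, l, modes):
    """Stand-in dissimilarity between object x and mode l (mismatch count)."""
    return sum(1 for xj, zj in zip(x, modes[l]) if xj != zj)


def cluster(X, modes, dissimilarity):
    """Analog of Cluster(): assign every object to its least-dissimilar mode."""
    return [min(range(len(modes)), key=lambda l: dissimilarity(x, l, modes))
            for x in X]


def fun(X, modes, labels, dissimilarity):
    """Analog of Fun(): value of the objective for the current partition."""
    return sum(dissimilarity(x, lab, modes) for x, lab in zip(X, labels))


def update_modes(X, labels, modes):
    """Most frequent category per attribute; empty clusters keep their mode."""
    new_modes = []
    for l, old in enumerate(modes):
        members = [x for x, lab in zip(X, labels) if lab == l]
        if not members:
            new_modes.append(old)
            continue
        new_modes.append(tuple(
            Counter(x[j] for x in members).most_common(1)[0][0]
            for j in range(len(old))
        ))
    return new_modes


def k_modes(X, k, dissimilarity=simple_matching, max_iter=500, seed=0):
    """Controller: iterate until the objective value stops changing."""
    rng = random.Random(seed)
    distinct = list(dict.fromkeys(X))   # distinct objects, original order
    modes = rng.sample(distinct, k)     # k distinct objects as initial modes
    labels = cluster(X, modes, dissimilarity)
    value = fun(X, modes, labels, dissimilarity)
    for _ in range(max_iter):
        modes = update_modes(X, labels, modes)
        labels = cluster(X, modes, dissimilarity)
        new_value = fun(X, modes, labels, dissimilarity)
        if new_value == value:          # objective unchanged: stop
            break
        value = new_value
    return labels, modes, value


# Hypothetical data: two well-separated groups of identical objects.
X = [("a", "x", "p")] * 3 + [("b", "y", "q")] * 3
labels, modes, value = k_modes(X, k=2)
```

Objects must be hashable tuples so that duplicates can be removed before sampling the initial modes. The cap of 500 iterations mirrors the limit stated in the experiments.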
We firstly consider the computational complexity of two subfunctions. The computational complexity for computing the dissimilarity is , where is the number of modes, n is the number of objects in data set, and is the dimension of data set. The computational complexity for assigning the th object into the lth cluster is . So the computational complexity for updating all clusters is , that is, . The computational complexity of computing objective function is .
Suppose that the iteration time is and the whole computational cost of KBGRD algorithm is , that is, . This shows that the computational cost is linearly scalable with the number of objects, the number of attributes, and the number of clusters.
5. Experimental Analysis
5.1. Experimental Environment and Evaluation Indexes
The experiments are conducted on a PC with an Intel i3 processor and 4 GB of memory running the Windows 7 operating system. All algorithms are implemented in Java using Eclipse.
To evaluate the clustering quality of the algorithms, the evaluation indexes Accuracy (AC) and RandIndex are employed in the experiments.
Let $C$ be the set of classes in the data set and $P$ be the set of clusters generated by the clustering algorithm. Given a pair of objects in the data set, we refer to it as
(1) a if both objects belong to the same class in $C$ and the same cluster in $P$;
(2) b if the two objects belong to the same class in $C$ and two different clusters in $P$;
(3) c if the two objects belong to two different classes in $C$ and to the same cluster in $P$;
(4) d if the two objects belong to two different classes in $C$ and two different clusters in $P$.
Let $n_a$, $n_b$, $n_c$, and $n_d$ be the numbers of pairs of type a, b, c, and d, respectively. RandIndex [23] is defined as follows:
$$\mathrm{RandIndex} = \frac{n_a + n_d}{n_a + n_b + n_c + n_d}.$$
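A direct way to compute this index is to enumerate all pairs of objects and count the four pair types. The sketch below does so for two label lists; the function name and the example labels are illustrative, and at least two objects are assumed.

```python
from itertools import combinations


def rand_index(classes, clusters):
    """Rand index from true class labels and predicted cluster labels.

    Counts pair types a (same class, same cluster), b (same class,
    different clusters), c (different classes, same cluster) and
    d (different classes, different clusters).
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_cluster = clusters[i] == clusters[j]
        if same_class and same_cluster:
            a += 1
        elif same_class:
            b += 1
        elif same_cluster:
            c += 1
        else:
            d += 1
    return (a + d) / (a + b + c + d)


# Hypothetical labels for four objects.
perfect = rand_index([0, 0, 1, 1], [1, 1, 0, 0])   # same partition, renamed clusters
partial = rand_index([0, 0, 1, 1], [0, 1, 1, 1])
```

The pairwise loop is quadratic in the number of objects, which is adequate for a sketch; a contingency-table formulation would scale better on large data sets.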
Accuracy (AC) is defined as follows:
$$AC = \frac{1}{n} \sum_{i=1}^{k} a_i,$$
where $k$ is the number of clusters, $n$ is the number of objects, and $a_i$ is the number of objects that are correctly assigned to the $i$th cluster ($1 \le i \le k$).
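One common reading of "correctly assigned" takes, for each cluster, the number of its objects that carry the cluster's majority class label. The sketch below follows that reading, which is an assumption and not a detail confirmed by the text; the function name and labels are illustrative.

```python
from collections import Counter


def clustering_accuracy(classes, clusters):
    """AC as the fraction of objects carrying the majority class of their cluster."""
    counts_by_cluster = {}
    for cls, clu in zip(classes, clusters):
        counts_by_cluster.setdefault(clu, Counter())[cls] += 1
    # For each cluster, count the objects of its most frequent class.
    correct = sum(counts.most_common(1)[0][1]
                  for counts in counts_by_cluster.values())
    return correct / len(classes)


# Hypothetical labels for six objects in two clusters.
classes = ["p", "p", "p", "e", "e", "e"]
clusters = [0, 0, 1, 1, 1, 1]
ac = clustering_accuracy(classes, clusters)
```

In this example cluster 0 contributes two objects and cluster 1 contributes three, so five of the six objects count as correctly assigned.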
Four categorical data sets from the UCI Machine Learning Repository are used to evaluate the clustering performance: QSAR Biodegradation (QSAR), Chess, Mushroom, and Nursery. The relevant information about the data sets is tabulated in Table 2.

5.2. Experimental Results and Analysis
In the experiments, we compare the KBGRD algorithm with the original k-modes algorithm and Cao’s algorithm [24]. The three algorithms are run sequentially on all data sets. Each algorithm requires the number of modes (ClusterNum) as an input parameter. We randomly select ClusterNum distinct objects as the initial cluster modes. The number of iterations of each algorithm is limited to 500.
The Mushroom data set contains a small number of missing values, which we handle with the optimal completion strategy. In this strategy, the missing values in the data set are viewed as additional variables [25, 26].
Firstly, we set ClusterNum to the number of classes of the data set. The average RandIndex over ten runs on the four data sets for the three algorithms is summarized in Table 3, and the corresponding average AC is summarized in Table 4. As shown in Tables 3 and 4, KBGRD achieves the highest RandIndex and AC. That is, it performs better than the other algorithms under the same conditions.


In real-world applications, the number of clusters is usually unknown. We evaluated clustering stability by setting different values of ClusterNum (10, 15, 20, 25, 30, and 35) for each data set and used RandIndex to evaluate the clustering results. The average RandIndex over ten runs on the four data sets for the three algorithms is summarized in Tables 5–8. The last column shows the average RandIndex of each algorithm over the six values of ClusterNum. As shown in Tables 5–8, KBGRD achieves the highest RandIndex; that is, it performs better than the other algorithms on the four data sets. Additionally, KBGRD shows the highest stability among the compared algorithms.




6. Conclusion
This paper analyzes the advantages and disadvantages of the k-modes algorithm for categorical data. Based on this analysis, we propose a novel dissimilarity measure (GRD) for clustering categorical data and use it to improve the performance of the existing k-modes algorithm. The computational complexity of the KBGRD algorithm has been analyzed and is linear in the number of data objects, attributes, and clusters. We have tested the KBGRD algorithm on four real data sets from UCI. The experimental results show that the KBGRD algorithm is effective and stable in clustering categorical data sets.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This research was supported by the National Science Foundation of China under the Grants of 61402363 and 61472319, Education Department of Shaanxi Province Key Laboratory Project under the Grant of 15JS079, Xi’an Science Program Project under the Grant of CXY1509(7), Beilin district of Xi’an Science and Technology Project under the Grant of GX1625, and CERNET Innovation Project under the Grant of NGLL20150707.
References
[1] A. Saha and S. Das, “Categorical fuzzy k-modes clustering with automated feature weight learning,” Neurocomputing, vol. 166, pp. 422–435, 2015.
[2] H. Zhou, J. Guo, and Y. Wang, “A feature selection approach based on term distributions,” SpringerPlus, vol. 5, no. 1, pp. 1–14, 2016.
[3] M. Ester, H. P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 226–231, Las Vegas, Nev, USA, August 2008.
[4] J. Macqueen, “Some methods for classification and analysis of multivariate observations,” in Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, Berkeley, Calif, USA, 1967.
[5] F. Jiang, G. Liu, J. Du, and Y. Sui, “Initialization of K-modes clustering using outlier detection techniques,” Information Sciences, vol. 332, pp. 167–183, 2016.
[6] S. Guha, R. Rastogi, and K. Shim, “Rock: a robust clustering algorithm for categorical attributes,” Information Systems, vol. 25, no. 5, pp. 345–366, 2000.
[7] H. Zhou, J. Guo, Y. Wang, and M. Zhao, “A feature selection approach based on interclass and intraclass relative contributions of terms,” Computational Intelligence and Neuroscience, vol. 2016, Article ID 1715780, 8 pages, 2016.
[8] I.K. Park and G.S. Choi, “Rough set approach for clustering categorical data using information-theoretic dependency measure,” Information Systems, vol. 48, pp. 289–295, 2015.
[9] H. Zhou, X. Zhao, and X. Wang, “An effective ensemble pruning algorithm based on frequent patterns,” Knowledge-Based Systems, vol. 56, no. 3, pp. 79–85, 2014.
[10] M. Dutta, A. K. Mahanta, and A. K. Pujari, “QROCK: a quick version of the ROCK algorithm for clustering of categorical data,” Pattern Recognition Letters, vol. 26, no. 15, pp. 2364–2373, 2005.
[11] J. Yang, “A clustering algorithm using dynamic nearest neighbors selection model,” Chinese Journal of Computers, vol. 30, no. 5, pp. 756–762, 2007.
[12] Q. Zhang, L. Ding, and S. Zhang, “A genetic evolutionary ROCK algorithm,” in Proceedings of the International Conference on Computer Application and System Modeling (ICCASM '10), pp. V12347–V12351, IEEE, Taiyuan, China, October 2010.
[13] Z. Huang, “A fast clustering algorithm to cluster very large categorical data sets in data mining,” in Proceedings of the SIGMOD Workshop Research Issues on Data Mining & Knowledge Discovery, pp. 1–8, 1998.
[14] Z. Huang, “Extensions to the k-means algorithm for clustering large data sets with categorical values,” Data Mining and Knowledge Discovery, vol. 2, no. 3, pp. 283–304, 1998.
[15] Z. He, X. Xu, and S. Deng, “Attribute value weighting in k-modes clustering,” Expert Systems with Applications, vol. 38, no. 12, pp. 15365–15369, 2011.
[16] Z. Huang and M. K. Ng, “A fuzzy k-modes algorithm for clustering categorical data,” IEEE Transactions on Fuzzy Systems, vol. 7, no. 4, pp. 446–452, 1999.
[17] D.W. Kim, K. H. Lee, and D. Lee, “Fuzzy clustering of categorical data using fuzzy centroids,” Pattern Recognition Letters, vol. 25, no. 11, pp. 1263–1271, 2004.
[18] M. K. Ng, M. J. Li, J. Z. Huang, and Z. He, “On the impact of dissimilarity measure in k-modes clustering algorithm,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 3, pp. 503–507, 2007.
[19] C.C. Hsu, C.L. Chen, and Y.W. Su, “Hierarchical clustering of mixed data based on distance hierarchy,” Information Sciences, vol. 177, no. 20, pp. 4474–4492, 2007.
[20] UCI Machine Learning Repository, 2016, https://archive.ics.uci.edu/ml/datasets.html.
[21] K. Chidananda Gowda and E. Diday, “Symbolic clustering using a new dissimilarity measure,” Pattern Recognition, vol. 24, no. 6, pp. 567–578, 1991.
[22] L. Bai and J. Liang, “The k-modes type clustering plus between-cluster information for categorical data,” Neurocomputing, vol. 133, pp. 111–121, 2014.
[23] J. Z. Huang, M. K. Ng, H. Rong, and Z. Li, “Automated variable weighting in k-means type clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 657–668, 2005.
[24] F. Cao, J. Liang, D. Li, L. Bai, and C. Dang, “A dissimilarity measure for the k-modes clustering algorithm,” Knowledge-Based Systems, vol. 26, no. 9, pp. 120–127, 2012.
[25] L. Zhang, W. Lu, X. Liu, W. Pedrycz, and C. Zhong, “Fuzzy c-means clustering of incomplete data based on probabilistic information granules of missing values,” Knowledge-Based Systems, vol. 99, pp. 51–70, 2016.
[26] H. Zhou, J. Li, J. Li, F. Zhang, and Y. Cui, “A graph clustering method for community detection in complex networks,” Physica A: Statistical Mechanics and Its Applications, vol. 469, pp. 551–562, 2017.
Copyright
Copyright © 2017 Hongfang Zhou et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.