Computational Intelligence and Neuroscience

Volume 2017 (2017), Article ID 3691316, 7 pages

https://doi.org/10.1155/2017/3691316

## A Global-Relationship Dissimilarity Measure for the *k*-Modes Clustering Algorithm

School of Computer Science and Engineering, Xi’an University of Technology, Xi’an 710048, China

Correspondence should be addressed to Hongfang Zhou

Received 19 January 2017; Revised 4 March 2017; Accepted 19 March 2017; Published 28 March 2017

Academic Editor: Elio Masciari

Copyright © 2017 Hongfang Zhou et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The *k*-modes clustering algorithm has been widely used to cluster categorical data. In this paper, we first analyze the *k*-modes algorithm and its dissimilarity measure. Based on this analysis, we then propose a novel dissimilarity measure named GRD. GRD considers not only the relationships between the object and all cluster modes but also the differences between attributes. Finally, experiments were conducted on four real data sets from UCI. The results show that GRD achieves better performance than the two existing dissimilarity measures used in the *k*-modes and Cao’s algorithms.

#### 1. Introduction

Clustering is an important technique in data mining, and its main task is to group the given data based on some similarity/dissimilarity measures [1]. Most clustering techniques rely largely on distances to measure the dissimilarity between different objects [2–4]. However, these methods work only on data sets with numeric attributes, which limits their use in solving categorical data clustering problems [5].

Some researchers have made great efforts to quantify the relationships among different categorical attributes. Guha et al. [6] proposed a hierarchical clustering method termed ROCK, which can measure the similarity between a pair of objects [7]. In ROCK, the *Link* between two objects is computed as the number of their common neighbors [8]. However, two deficiencies remain: (1) two parameters must be assigned in advance and (2) a large amount of computation is involved [9]. For these reasons, some researchers have developed new algorithms such as QROCK [10], DNNS [11], and GE-ROCK [12] to modify or improve the ROCK algorithm. To remove the numeric-only limitation of the *k*-means algorithm, Huang et al. [13, 14] proposed the *k*-modes algorithm, which extends the *k*-means algorithm by using (1) a simple matching dissimilarity measure for categorical attributes; (2) modes in place of means for clustering; and (3) a frequency-related strategy to update modes to minimize the clustering costs [15]. In fact, the idea of simple matching has been used in many clustering algorithms, such as the fuzzy *k*-modes algorithm [16], the fuzzy *k*-modes algorithm with fuzzy centroid [17], and the *k*-prototype algorithm [14]. However, simple matching often results in clusters with weak intracluster similarity [18] and disregards the dissimilarity hidden between the categorical values [19].

In this paper, a Global-Relationship Dissimilarity (GRD) measure for the *k*-modes clustering algorithm is proposed. This dissimilarity measure considers not only the relationships between the object and all cluster modes but also the differences of various attributes instead of simple matching. The clustering effectiveness of *k*-modes based on GRD (KBGRD) is demonstrated on four standard data sets from the UCI Machine Learning Repository [20].

The remainder of this paper is organized as follows. Section 2 reviews and analyzes the dissimilarity measure used in *k*-modes. Section 3 proposes the new dissimilarity measure GRD. Section 4 describes the details of the KBGRD algorithm. Section 5 illustrates the performance and stability of KBGRD. Finally, concluding remarks are given in Section 6.

#### 2. Related Works

##### 2.1. Categorical Data

Structured data can be stored in a table, where each row represents a fact about an object. In practice, such data usually contain categorical attributes [21]. We first define the term “data set” [22].

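*Definition 1 (data set).* A data set information system can be expressed as a quadruple $IS = (U, A, V, f)$, which satisfies the following: (1) $U = \{x_1, x_2, \ldots, x_n\}$ is a nonempty set of $n$ data objects, which is named as a universe; (2) $A = \{a_1, a_2, \ldots, a_m\}$ is a nonempty set of $m$ categorical attributes; (3) $V$ is the union of all attribute domains, that is, $V = \bigcup_{j=1}^{m} V_{a_j}$, where $V_{a_j} = \{a_j^{(1)}, a_j^{(2)}, \ldots, a_j^{(n_j)}\}$ is the value domain of attribute $a_j$, and it is finite and unordered; $n_j$ is the number of categories of attribute $a_j$ for $1 \le j \le m$; (4) $f: U \times A \to V$ is a mapping function which can be formally expressed as $f(x_i, a_j) \in V_{a_j}$ for all $x_i \in U$ and $a_j \in A$.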
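As a minimal sketch of Definition 1, the snippet below stores a small categorical table and derives the value domain and the number of categories of each attribute. The toy records and the names `objects`, `attributes`, `domains`, and `f` are illustrative assumptions and do not come from the paper.

```python
# Hypothetical toy data: each row is one object, each column one categorical attribute.
attributes = ["color", "shape", "size"]
objects = [
    ("red", "round", "small"),
    ("blue", "round", "large"),
    ("red", "square", "small"),
]


def f(i, j):
    """Mapping function: the value of object i on attribute j (0-based indices)."""
    return objects[i][j]


# Value domain of each attribute: the set of categories observed for it.
domains = {
    name: {f(i, j) for i in range(len(objects))}
    for j, name in enumerate(attributes)
}

# Number of categories of each attribute.
category_counts = {name: len(values) for name, values in domains.items()}

# Union of all attribute domains.
all_values = set().union(*domains.values())
```

Here the domains are inferred from the observed rows; a real data set may declare categories that never occur in the sample.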

##### 2.2. *k*-Modes Dissimilarity Measure

The *k*-modes clustering algorithm improves on the *k*-means algorithm [4] by using a simple dissimilarity measure for categorical data. It also adopts a frequency-related strategy to update the modes during clustering so as to minimize the clustering costs. These extensions remove the numeric-only limitation of the *k*-means algorithm and enable the clustering process to be applied to large categorical data sets from real-world databases [22].
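The frequency-related mode update can be sketched as follows, under the common reading that the mode of a cluster takes, for each attribute, the most frequent category among the cluster's members. The function name `update_mode` and the sample cluster are hypothetical, and the sketch assumes a nonempty cluster.

```python
from collections import Counter


def update_mode(cluster):
    """Return the mode of a cluster: the most frequent category per attribute."""
    columns = zip(*cluster)  # transpose rows into per-attribute columns
    return tuple(Counter(col).most_common(1)[0][0] for col in columns)


# Hypothetical cluster of three objects with three categorical attributes.
cluster = [
    ("a", "x", "p"),
    ("a", "y", "p"),
    ("b", "y", "p"),
]
mode = update_mode(cluster)
```

For this cluster the per-attribute majorities are `a`, `y`, and `p`. How ties between equally frequent categories are broken is an implementation choice that this sketch leaves to `Counter.most_common`.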

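*Definition 2.* Let $IS = (U, A, V, f)$ be a categorical data set information system as defined in Definition 1 and $x_i \in U$. For any object $x_i$ and cluster mode $z_l$ with $1 \le l \le k$, $d(x_i, z_l)$ is the simple matching dissimilarity measure between object $x_i$ and the mode of the $l$th cluster, which is defined as follows:

$$d(x_i, z_l) = \sum_{j=1}^{m} \delta_{a_j}(x_i, z_l), \tag{1}$$

where the sum runs over the $m$ categorical attributes $a_j$. In (1), $\delta_{a_j}(x_i, z_l)$ can be expressed as

$$\delta_{a_j}(x_i, z_l) = \begin{cases} 1, & f(x_i, a_j) \neq f(z_l, a_j), \\ 0, & f(x_i, a_j) = f(z_l, a_j). \end{cases}$$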
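The simple matching measure of Definition 2 amounts to counting the attributes on which an object and a mode disagree. The following sketch illustrates this; the function name `simple_matching` and the sample tuples are assumptions made for illustration.

```python
def simple_matching(obj, mode):
    """Count the attributes on which obj and mode take different categories."""
    if len(obj) != len(mode):
        raise ValueError("object and mode must have the same number of attributes")
    return sum(1 for a, b in zip(obj, mode) if a != b)


# Hypothetical object and mode with four categorical attributes.
x = ("a", "x", "p", "u")
z = ("a", "y", "p", "v")
d = simple_matching(x, z)  # the tuples differ on the 2nd and 4th attributes
```

Every mismatch contributes exactly 1 regardless of which attribute or which categories are involved, which is why the measure cannot express graded differences between categorical values.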

Table 1 shows nine objects with four attributes and three initial cluster modes. To determine the appropriate cluster for an object, its dissimilarity to each of the three cluster modes must be computed. For the object considered in this example, (1) does not single out a nearest mode. Therefore, it is impossible to determine exactly to which cluster the object should be assigned.
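The ambiguity described above can be reproduced with a small sketch. The object and modes below are hypothetical values chosen to produce a tie; they are not the entries of Table 1, and `nearest_modes` is an illustrative helper name.

```python
def nearest_modes(obj, modes):
    """Return the indices of all modes at minimum simple matching distance,
    together with the full list of distances."""
    dists = [sum(a != b for a, b in zip(obj, mode)) for mode in modes]
    best = min(dists)
    return [l for l, d in enumerate(dists) if d == best], dists


# Hypothetical object and three hypothetical modes (four attributes each).
x = ("a", "b", "c", "d")
modes = [
    ("a", "b", "c", "e"),  # differs from x on attribute 4 only
    ("a", "b", "f", "d"),  # differs from x on attribute 3 only
    ("g", "b", "c", "d"),  # differs from x on attribute 1 only
]
tied, dists = nearest_modes(x, modes)
```

All three modes lie at the same distance from `x`, so an assignment rule based only on (1) has to fall back on an arbitrary choice such as the lowest index.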