Mathematical Problems in Engineering

Volume 2017, Article ID 6393652, 15 pages

https://doi.org/10.1155/2017/6393652

## Fast Density Clustering Algorithm for Numerical Data and Categorical Data

^{1}Zhejiang University of Technology, Zhejiang 310023, China

^{2}Electrical Engineering Department, Ningbo Wanli University, Ningbo 310023, China

Correspondence should be addressed to Chen Jinyin; chenjinyin@zjut.edu.cn

Received 20 August 2016; Revised 2 January 2017; Accepted 15 January 2017; Published 26 March 2017

Academic Editor: Erik Cuevas

Copyright © 2017 Chen Jinyin et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Data objects with mixed numerical and categorical attributes are common in the real world. Most existing algorithms have limitations such as low clustering quality, difficulty in determining cluster centers, and sensitivity to initial parameters. A fast density clustering algorithm (FDCA) is put forward based on a one-time scan, with cluster centers determined automatically by a center set algorithm (CSA). A novel similarity metric is designed for clustering data with both numerical and categorical attributes. CSA chooses cluster centers from the data objects automatically, which overcomes the difficulty of setting cluster centers that affects most clustering algorithms. The performance of the proposed method is verified through a series of experiments on ten mixed data sets, in comparison with several other clustering algorithms, in terms of clustering purity, efficiency, and time complexity.

#### 1. Introduction

As one of the most important techniques in data mining, clustering partitions a set of unlabeled objects into clusters, where objects in the same cluster are more similar to one another than to objects in other clusters [1]. Clustering algorithms have been developed and applied in various fields, including text analysis, customer segmentation, and image recognition. They are also useful in daily life, as massive data with mixed attributes are now emerging. Typically, these data contain both numeric and categorical attributes [2, 3]. For example, the analysis of an applicant for a credit card would involve data on age (integer), income (float), marital status (categorical), and so forth, forming a typical example of data with mixed attributes.

Up to now, most research on data clustering has focused on either numeric or categorical data rather than both types of attributes. k-means [4], BIRCH [5], DBSCAN [6], k-modes [7], fuzzy k-modes [8], BFCM [9], COOLCAT [10], TCGA [11], AS′ fuzzy k-modes [12], and a k-means based method [13] are classic clustering algorithms. The k-means clustering algorithm [4] is based on partitioning, where cluster centers must be initialized by users or from experience; the number of initialized cluster centers strongly influences clustering purity and efficiency. BIRCH [5] is short for balanced iterative reducing and clustering using hierarchies. Clustering features and clustering feature trees are adopted to describe clusters compactly, and BIRCH is implemented in two stages: scanning the database to build a clustering feature tree, then global clustering to improve purity and efficiency. DBSCAN [6] (Density-Based Spatial Clustering of Applications with Noise) is a classic density-based clustering algorithm capable of handling noisy data. Compared with k-means, DBSCAN does not require the number of clusters to be set a priori. However, it depends on two sensitive parameters, eps and minPts. Various revised versions of DBSCAN have been proposed to improve its performance, but parameter sensitivity remains a challenge for its further application. k-modes [7] is an upgraded version of k-means that introduces the capability to cluster categorical attributes. Fuzzy k-modes [8] is a modified k-modes algorithm with a fuzzy mechanism that improves robustness across various types of data sets. BFCM [9] is short for bias-correction fuzzy clustering, an extension of hard clustering based on fuzzy membership partitions. COOLCAT [10] is an entropy-based algorithm for categorical clustering, which introduced the idea of clustering on the basis of entropy: data clusters are generated according to their entropy values.
TCGA [11] is a two-stage genetic algorithm for automatic clustering. This bioinspired algorithm formulates clustering as an optimization problem and adopts a genetic algorithm to converge to the global optimum. The above-mentioned methods face difficulties when dealing with data with mixed attributes, which are emerging very quickly [14–23]. A fast density clustering algorithm was put forward to solve the cluster center determination problem [24]. However, its mixed similarity calculation is based on the relationships among all attributes, which has high computational complexity, and its cluster center determination depends mainly on a parameter that is difficult to set a priori.

For example, distance measure functions for numerical values cannot capture the similarity among data with mixed attributes. Moreover, the representative of a cluster with numerical values is often defined as the mean of all data objects in the cluster, which, however, is illogical for categorical attributes. Algorithms have been proposed [14, 15, 17, 21, 22] to cluster hybrid data, most of which are partition-based: first, a set of disjoint clusters is obtained and then refined to minimize a predefined criterion function. The objective is to maximize intracluster connectivity or compactness while minimizing intercluster connectivity [25]. However, most partition clustering algorithms are sensitive to the initial cluster centers, which are difficult to determine. They are also suitable only for spherically distributed data and lack the capacity to handle outliers.

The main contributions of our work include four aspects. A novel similarity metric is proposed for mixed data clustering. A cluster center self-set algorithm (CSA) is applied to determine centers automatically. A bisection method is adopted to calculate the clustering parameter, overcoming the parameter sensitivity problem. A fast one-time-scan density clustering algorithm (FDCA) is proposed to implement fast and efficient clustering for mixed data.
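The contributions above mention a bisection method for setting the clustering parameter. As a hedged illustration only (the criterion below, targeting an average fraction of neighbor pairs within the cutoff distance, is a common density-clustering heuristic and not necessarily the authors' exact rule; the function name `choose_dc` is hypothetical), a bisection search over a cutoff distance might look like this:

```python
import numpy as np

def choose_dc(dist, target=0.02, tol=1e-4):
    """Bisection sketch: find a cutoff distance d_c such that the fraction
    of pairwise distances below d_c is close to `target` (e.g., 1-2% of
    all pairs, a common heuristic).  `dist` is an n x n distance matrix."""
    n = len(dist)
    lo, hi = 0.0, float(dist.max())
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        # fraction of ordered pairs (i != j) closer than mid;
        # subtract n to exclude the zero diagonal entries
        frac = ((dist < mid).sum() - n) / (n * (n - 1))
        if frac < target:
            lo = mid   # cutoff too small: too few neighbor pairs
        else:
            hi = mid   # cutoff large enough: shrink from above
    return (lo + hi) / 2.0
```

Because the neighbor fraction is monotone in the cutoff, bisection converges without any sensitivity to an initial guess, which is the point of replacing a hand-tuned parameter.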

The rest of this paper is organized as follows. Section 2 introduces related work on mixed data clustering. In Section 3, the similarity metric for data with mixed attributes and the workings of FDCA are presented. In Section 4, extensive simulations are carried out to verify FDCA's performance against other classic algorithms. Section 5 presents a practical application to handwritten digit image recognition based on FDCA. Finally, Section 6 concludes the paper.

#### 2. Related Works

##### 2.1. Mixed Data Clustering Algorithms Overview

As stated above, mixed data clustering algorithms are designed for data sets with mixed attributes, including numerical and categorical ones. Numerical attributes of mixed data are evaluated by real values, while categorical attributes take values from a discrete domain. It is still a challenge to cluster data with both numerical and categorical attributes, and many novel clustering algorithms have been put forward to deal with mixed data. Huang proposed the k-prototypes [14] algorithm, which combines the k-means and k-modes algorithms. k-prototypes is an updated version of k-means and k-modes designed specifically for mixed data; it is a very early mixed data clustering algorithm, and when the data set is uncertain, it cannot achieve the expected purity and efficiency. The KL-FCM-GM [15] algorithm, proposed by Chatzis, extends k-prototypes: it is a fuzzy c-means-type algorithm for clustering data with mixed numeric and categorical attributes that employs a probabilistic dissimilarity functional, and it is designed for Gauss-multinomial distributed data. When the data set is large, its similarity computation costs much more time than expected, so it is not well suited to big data. Zheng et al. developed a new algorithm called EKP [17], an improved k-prototypes algorithm that gains global search capability by introducing an evolutionary algorithm. Later, Li and Biswas proposed the Similarity-Based Agglomerative Clustering (SBAC) algorithm [18], which adopts the similarity measure defined by Goodall [19]. It is an unsupervised method for identifying critical samples in large populations, but the efficiency of its similarity metric is not stable. Hsu and Chen proposed a clustering algorithm based on variance and entropy (CAVE) [20] for clustering mixed data. However, the CAVE algorithm needs to build a distance hierarchy for every categorical attribute, and determining the distance hierarchy requires domain expertise.
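To make the k-prototypes idea concrete, the sketch below shows its per-object dissimilarity as usually described in the literature: squared Euclidean distance on numeric attributes plus a weight γ times the number of categorical mismatches. This is a minimal illustration, not the implementation of any algorithm in this paper; the function name and argument layout are our own.

```python
def mixed_dissimilarity(x, prototype, numeric_idx, categorical_idx, gamma=1.0):
    """Sketch of the k-prototypes dissimilarity between one object and a
    cluster prototype: squared Euclidean distance over the numeric
    attributes plus gamma times the count of categorical mismatches."""
    numeric = sum((x[i] - prototype[i]) ** 2 for i in numeric_idx)
    categorical = sum(1 for i in categorical_idx if x[i] != prototype[i])
    return numeric + gamma * categorical
```

The weight γ balances the two attribute types; a poor choice of γ lets one type dominate, which is one motivation for the significance-weighted metrics discussed next.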

Besides the above-mentioned unsupervised similarity metrics for clustering, further mixed data similarity calculation methods have been proposed. Ahmad and Dey proposed a k-means-type algorithm [21] to deal with mixed data, in which the cooccurrence of categorical attribute values is used to evaluate the significance of each attribute. For mixed data attributes, Ji et al. proposed the IWKM algorithm [22], in which a distribution centroid is applied to represent the cluster prototypes, and the significance of different attributes is taken into account during the clustering process. Ji et al. also proposed WFK-prototypes [23], introducing a fuzzy centroid to represent the cluster prototypes; the significance concepts proposed by Ahmad and Dey [21] are adopted to extend the k-prototypes algorithm, and WFK-prototypes remains a classic mixed data clustering algorithm. David and Averbuch proposed a spectral clustering algorithm for numerical and nominal data, called SpectralCAT [26]. Cheung and Jia [27] proposed a mixed data clustering algorithm based on a unified similarity metric that does not require the number of clusters to be known; embedded competition and penalization mechanisms determine the number of clusters automatically by gradually eliminating redundant clusters.

In summary, many mixed data similarity metrics and clustering algorithms have been designed for different applications. We still aim to develop a universal similarity metric and clustering algorithm for numerical and categorical data that can be applied to most practical data sets.

##### 2.2. Fast Data Clustering Algorithm

Rodriguez and Laio published their novel paper "Clustering by Fast Search and Find of Density Peaks" in Science in June 2014 [28]. In their algorithm, cluster centers can be identified from a density-distance graph. Their method can be summarized as follows: cluster centers are surrounded by neighbors with lower density and lie at a relatively large distance from any point with a higher density, while noise points have comparatively large distance but small density.

The density of data point $x_i$ is defined as follows:
$$\rho_i = \sum_j \chi(d_{ij} - d_c), \tag{1}$$
$$\chi(x) = \begin{cases} 1, & x < 0, \\ 0, & x \ge 0, \end{cases} \tag{2}$$
where $\rho_i$ denotes data $x_i$'s density, $d_{ij}$ represents the distance between data $x_i$ and data $x_j$, and $d_c$ is the threshold (cutoff) distance defined a priori. According to (2), if the distance between $x_i$ and $x_j$ is less than $d_c$, then $\chi(d_{ij} - d_c) = 1$. In other words, $\rho_i$ is equal to the number of points that are closer than $d_c$ to point $x_i$.

$\delta_i$ is measured by computing the minimum distance between point $x_i$ and any other point with higher density:
$$\delta_i = \min_{j:\,\rho_j > \rho_i} d_{ij}. \tag{3}$$

For the point with the highest density, we conventionally take $\delta_i = \max_j d_{ij}$. Note that $\delta_i$ is much larger than the typical nearest-neighbor distance only for points that are local or global maxima of the density. Thus, cluster centers are recognized as points for which the value of $\delta$ is anomalously large.

This observation, which is the core of the algorithm, is illustrated by the simple example in Figure 1(a). The density $\rho$ and distance $\delta$ of every point are then computed, and the $\rho$-$\delta$ distribution is shown in Figure 1(b).
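The computation of $\rho$ and $\delta$ from Eqs. (1)-(3) can be sketched as follows. This is a generic density-peaks sketch for numeric points with Euclidean distance, not the paper's FDCA with its mixed-data metric; the function name is our own, and ties in density are broken by sort order, a common implementation convention.

```python
import numpy as np

def density_peaks_stats(points, d_c):
    """Compute rho (Eq. (1): neighbors within d_c) and delta (Eq. (3):
    distance to the nearest denser point) for each point."""
    n = len(points)
    # full pairwise Euclidean distance matrix
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    rho = (d < d_c).sum(axis=1) - 1          # exclude the point itself
    order = np.argsort(-rho, kind="stable")  # indices by decreasing density
    delta = np.empty(n)
    delta[order[0]] = d[order[0]].max()      # convention for the densest point
    for k in range(1, n):
        i = order[k]
        # nearest point that is denser (or equally dense but earlier in order)
        delta[i] = d[i, order[:k]].min()
    return rho, delta
```

Points with both large $\rho$ and large $\delta$ in the returned arrays are the cluster-center candidates visible in the $\rho$-$\delta$ graph; isolated points show large $\delta$ but small $\rho$, matching the noise-point description above.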