Abstract

For mixed data composed of numerical and categorical attributes, a new unified dissimilarity metric is proposed, and a new clustering algorithm based on this metric is developed. Experimental results show that this new method of clustering mixed data by fast search and find of density peaks is feasible and effective on the UCI datasets.

1. Introduction

Cluster analysis has long been one of the research hotspots in data mining and machine learning. In the big data era, new kinds of data emerge continually, and most of them contain multiple attribute types, such as numerical and categorical attributes. Clustering algorithms such as k-means are mainly designed for numerical attribute data. In order to handle data with mixed attributes, researchers have proposed many solutions, which can be divided into Attribute Conversion methods, Clustering Ensemble methods, prototype-based methods, density-based methods, hierarchical methods, and so forth [1].

Attribute Conversion methods first convert all attributes into a single type and then cluster the converted data. For example, the SpectralCAT algorithm proposed by David and Averbuch [2] transforms the numerical attributes into categorical ones and then applies spectral clustering to the transformed data.

The idea of Clustering Ensemble methods is to partition a group of objects with several algorithms and then combine the results of these algorithms by a consensus function to obtain the final clustering. It was first proposed by Strehl and Ghosh in 2002 [3] and has since become one of the main approaches to mixed data clustering. Zhao et al. proposed a mixed data clustering algorithm named CEMC based on the Clustering Ensemble method [4]; He et al. proposed a clustering algorithm named CEBMDC [5] based on Clustering Ensemble and Squeezer [6]. The latter uses the Squeezer algorithm both to cluster the categorical part and as the consensus function.

The k-Prototypes algorithm [7] was proposed by Huang in 1997 and is mainly based on the idea of the k-means algorithm. It combines the cluster centers of the numerical attributes and the modes of the categorical attributes to construct a new mixed center. This data center is a prototype, and a distance metric formula and a cost function are constructed on the basis of the prototype. Then the k-means clustering process is used to cluster the mixed data directly. This prototype-based algorithm is simple and efficient. The most important factors for this method are the definition of the prototype and the distance metric between the prototype and the data tuples. Cheung and Jia proposed a unified similarity metric method [8], which normalizes the distance metric of the numerical attribute part so that the value of the similarity measure is bounded in $[0,1]$. The similarity measure of each categorical attribute is then weighted and normalized, respectively, and finally a unified distance measure formula is obtained. Based on this formula, an iterative algorithm, OCIL, is proposed to cluster mixed attribute data. They further improved OCIL by introducing a competition and penalty mechanism, yielding the PCL-OC algorithm, which can determine the number of clusters automatically. Compared with the k-Prototypes algorithm, the clustering accuracy of OCIL is greatly improved, but its computational complexity is higher. The prototype-based methods mentioned above still require the number of clusters to be specified; they are sensitive to the selection of the initial cluster centers and to outliers, and they cannot find clusters of arbitrary shape.

Li and Biswas proposed the SBAC (Similarity Based Agglomerative Clustering) algorithm [9], an agglomerative hierarchical clustering algorithm based on the Goodall similarity measure. This method works well, but its computational complexity is higher than $O(n^2)$. Hierarchical clustering algorithms in general have high time and space complexity, and their merge decisions are not reversible.

The RDBC_M algorithm proposed by Huang and Li [10] uses a dimension-oriented distance formula to calculate the distance of each dimension. It applies the Euclidean distance to the numerical attributes and uses an expert scoring method for the similarity calculation between the different values of the categorical attributes; defining the distance matrix that measures each categorical dimension therefore requires manual scoring. The MDCDen algorithm [11] and the DC-MDACC algorithm [12] proposed by Chen and He first classify the dataset into three categories, numerical dominance, categorical dominance, and balanced, and then define a different distance metric function for each class, which requires a priori analysis of the dataset. In summary, the similarity measure of the categorical attributes in the RDBC_M algorithm must be evaluated by a domain expert, and the MDCDen algorithm needs three parameters to be tuned to obtain good results.

In 2014, Rodriguez and Laio published a clustering method by fast search and find of density peaks (abbreviated as the "DPC algorithm") in Science [13]. The algorithm has the advantages of good clustering quality, high efficiency, and few parameters. It can not only find the number of clusters and identify outliers automatically, but also cluster data of different shapes. The input of the DPC algorithm is the distance matrix between data points, so as long as the problem of distance measurement between mixed data points is solved, the algorithm can be applied directly to cluster analysis. However, to date there have been no reports on clustering mixed data with the DPC algorithm.

In this paper, a new unified distance metric for mixed data points is first proposed and used to construct the distance matrix between the data points of a mixed dataset. Based on that, a new method of clustering mixed data by fast search and find of density peaks, abbreviated as the "DPC_M algorithm", is put forward. Finally, the DPC_M algorithm is used to cluster common UCI mixed datasets. The results show that the DPC_M algorithm not only achieves better clustering performance than the traditional k-Prototypes algorithm, but can also find the number of clusters automatically. Moreover, it is not sensitive to outliers.

This paper is organized as follows: Section 2 introduces the principle of the DPC algorithm. Section 3 presents a unified formula for the distance metric between mixed data points and describes the details of the DPC_M algorithm. Section 4 describes the experimental analysis of the DPC_M algorithm. The final section summarizes our work.

2. Principle of the DPC Algorithm

The DPC algorithm is based on two fundamental assumptions: a cluster center has a high local density and is surrounded by points with lower local density, and a cluster center is relatively far away from any point with a higher density. Therefore, the DPC algorithm constructs a Decision Graph by computing a local density and a relative distance to discover the cluster centers in a dataset. The remaining data points are then assigned in a single pass to the cluster of their nearest cluster center.

Suppose $X = \{x_1, x_2, \ldots, x_n\}$ is a dataset for clustering and $d_{ij}$ is the distance between data points $x_i$ and $x_j$. The DPC algorithm defines a cutoff distance $d_c$ and, for each data point $x_i$, computes a local density $\rho_i$ as in formula (1) and a distance $\delta_i$ as in formula (2):

$$\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c), \quad (1)$$

where $\chi(x) = 1$ if $x < 0$ and $\chi(x) = 0$ otherwise, and

$$\delta_i = \begin{cases} \min_{j:\, \rho_j > \rho_i} d_{ij}, & \text{if } \exists\, j \text{ such that } \rho_j > \rho_i, \\ \max_{j} d_{ij}, & \text{otherwise.} \end{cases} \quad (2)$$

That is, for a point whose local density is not the maximum, $\delta_i$ is the minimum distance from $x_i$ to any point with a higher local density; for the point with the globally maximum density, $\delta_i$ takes the maximum distance from $x_i$ to all other points.

When the number of data points in the dataset is small, the effect of calculating the local density by formula (1) is not ideal. Therefore, in [13], a Gaussian kernel function is given for datasets with fewer data points as follows:

$$\rho_i = \sum_{j \neq i} \exp\left( - \frac{d_{ij}^2}{d_c^2} \right). \quad (3)$$

Based on the local density $\rho_i$ and the distance $\delta_i$ of each data point, users can explicitly choose the number of clusters and the cluster centers on the Decision Graph. Once the center points are determined, each remaining data point is classified into the same cluster as its nearest neighbor of higher density.
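As an illustration, the two decision-graph quantities can be computed from a distance matrix as in the following minimal Python sketch (the experiments in this paper were run in Matlab; the function and variable names here are illustrative):

```python
import numpy as np

def dpc_quantities(D, dc):
    """Decision-graph quantities of DPC: local density rho via the Gaussian
    kernel (3) and distance delta via formula (2), from a distance matrix D."""
    n = D.shape[0]
    # Gaussian-kernel density; subtract 1 to drop each point's self-term exp(0).
    rho = np.exp(-(D / dc) ** 2).sum(axis=1) - 1.0
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]        # points of higher density
        if higher.size > 0:
            delta[i] = D[i, higher].min()         # nearest higher-density point
        else:
            delta[i] = D[i].max()                 # the global density peak
    return rho, delta
```

Points with large $\rho$ and large $\delta$ stand out in the upper-right corner of the Decision Graph and are the candidate cluster centers.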

3. Clustering Mixed Data Based on Density Peaks (DPCM)

3.1. Definition of Unified Distance Metric

Let $X$ be a mixed dataset with $d$ dimensions and $n$ instances, where the numerical attributes occupy $p$ dimensions and the categorical attributes occupy $q = d - p$ dimensions. For two data points $x_i$ and $x_j$ in the dataset, their distance is defined as shown in the following formula:

$$d(x_i, x_j) = d_r(x_i, x_j) + d_c(x_i, x_j). \quad (4)$$

Formula (5) shows the computation of the numerical part $d_r$ and the categorical part $d_c$, respectively:

$$d_r(x_i, x_j) = \frac{1}{\sqrt{p}} \sqrt{\sum_{t=1}^{p} (x_{it} - x_{jt})^2}, \qquad d_c(x_i, x_j) = \sum_{t=1}^{q} w_t\, \theta(x_{it}, x_{jt}), \quad (5)$$

where $d_r(x_i, x_j)$ denotes the normalized Euclidean distance over the numerical attributes of the two data points. Since each numerical attribute is scaled to $[0,1]$ beforehand and the Euclidean distance is nonnegative, the value of the numerical part is guaranteed to lie in the interval $[0,1]$. As for the distance of the categorical attributes, the matching method with entropy weights is used. The matching distance $\theta(x_{it}, x_{jt})$ of the data points $x_i$, $x_j$ in the $t$-th categorical attribute is calculated by the following formula:

$$\theta(x_{it}, x_{jt}) = \begin{cases} 0, & x_{it} = x_{jt}, \\ 1, & x_{it} \neq x_{jt}. \end{cases} \quad (6)$$

The importance of a categorical attribute is quantified by its average entropy over its attribute values. The weight of each attribute is then computed by normalizing these entropies, as in the following formula:

$$w_t = \frac{H_t}{\sum_{l=1}^{q} H_l}. \quad (7)$$

Assuming that the total number of distinct values on the $t$-th categorical attribute is $m_t$ and the probability of occurrence of its $s$-th value is $p_{ts}$, the entropy $H_t$ used in the weight (7) can be calculated as follows:

$$H_t = - \frac{1}{m_t} \sum_{s=1}^{m_t} p_{ts} \log p_{ts}.$$
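A minimal Python sketch of formulas (4)–(7) follows. The division by $\sqrt{p}$ in the numerical part reflects the reconstruction above, keeping that part in $[0,1]$ together with max-min scaling; the function names are illustrative:

```python
import numpy as np

def entropy_weights(Xc):
    """Entropy weight w_t of each categorical attribute (formula (7)).
    Xc: (n, q) array of categorical codes."""
    q = Xc.shape[1]
    H = np.empty(q)
    for t in range(q):
        _, counts = np.unique(Xc[:, t], return_counts=True)
        p = counts / counts.sum()
        H[t] = -(p * np.log(p)).mean()  # average entropy over the m_t values
    return H / H.sum()

def unified_distance_matrix(Xr, Xc):
    """Pairwise unified distances (4): normalized Euclidean numerical part
    plus entropy-weighted matching categorical part.
    Xr: (n, p) max-min scaled numerical part; Xc: (n, q) categorical part."""
    n, p = Xr.shape
    w = entropy_weights(Xc)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d_r = np.linalg.norm(Xr[i] - Xr[j]) / np.sqrt(p)  # in [0, 1]
            d_c = np.sum(w * (Xc[i] != Xc[j]))                # matching distance (6)
            D[i, j] = D[j, i] = d_r + d_c
    return D
```

The resulting matrix D can then be fed directly into the DPC procedure of Section 2.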

3.2. The DPC_M Algorithm

The DPC_M algorithm first calculates the distances between the data points in the dataset by using the unified distance metric (4), and then calculates the local density $\rho_i$ of each data point by formula (3) and the distance $\delta_i$ by formula (2). In order to determine the clustering centers automatically, we define $\gamma_i = \rho_i \delta_i$ and arrange the $\gamma$ values in descending order; the cluster centers are then obtained by computing the inflection point of the sorted $\gamma$ curve. According to the definition of the distance in formula (2), the point with the largest local density is necessarily a cluster center. After the cluster centers have been found, each remaining point is assigned to the same cluster as its nearest neighbor of higher density. The cluster assignment is performed in a single pass rather than through iterative steps.

The input of the algorithm is the mixed dataset and the neighborhood ratio (used to determine the cutoff distance $d_c$); the output is the vector of cluster labels. The specific process is as follows; a code sketch follows the steps.

Step 1. Formula (4) is used to calculate the distance between each pair of points in the dataset.

Step 2. The local density $\rho_i$ of each data point is calculated by formula (3) with the cutoff distance $d_c$; then the distance $\delta_i$ is calculated by formula (2), and finally the product $\gamma_i = \rho_i \delta_i$ is computed.

Step 3. Sort the $\gamma_i$ values in descending order, compute the inflection point of the sorted curve to determine the cluster centers, and set the class labels of the centers.

Step 4. The remaining points are assigned, one by one, the same class label as their nearest neighbor of higher density, which yields the final clustering result.
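The following is a minimal Python sketch of Steps 2–4, given a precomputed distance matrix D and the quantities rho and delta from Section 2. The largest-gap rule used here to locate the inflection point of the sorted $\gamma$ curve is one simple illustrative realization of the idea, not necessarily the exact rule of the original implementation:

```python
import numpy as np

def dpc_m_cluster(D, rho, delta):
    """Steps 2-4: pick centers from the sorted gamma = rho * delta curve,
    then assign each remaining point to its nearest higher-density neighbor."""
    n = len(rho)
    gamma = rho * delta
    order = np.argsort(-gamma)                    # descending gamma (Step 3)
    gaps = gamma[order][:-1] - gamma[order][1:]
    k = int(np.argmax(gaps[: n // 2])) + 1        # largest gap as the inflection point
    centers = order[:k]

    labels = -np.ones(n, dtype=int)
    labels[centers] = np.arange(k)
    # Step 4: visit points by decreasing density, so the nearest neighbor of
    # higher density is already labeled when each point is reached.  The
    # densest point has the maximal delta and is therefore among the centers.
    for i in np.argsort(-rho):
        if labels[i] == -1:
            higher = np.where(rho > rho[i])[0]
            labels[i] = labels[higher[np.argmin(D[i, higher])]]
    return labels
```

Because each point looks only at already-labeled, denser neighbors, the assignment indeed completes in a single pass over the data.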

3.3. Complexity Analysis

For a dataset with $n$ data points, the space complexity of the algorithm is dominated by the storage of the distance matrix, which requires $O(n^2)$ space. The distance matrix is stored in three columns, in which columns 1 and 2 hold the indices of the two data points and column 3 holds the distance between them. In addition, the algorithm requires three arrays of length $n$ to store the local density $\rho$, the distance $\delta$, and their product $\gamma$, so the overall space complexity is $O(n^2)$.

The time complexity of the DPC_M algorithm derives mainly from the distance calculation in Step 1 and the local density computation in Step 2; the time complexity of the distance computation and of the product calculation is $O(n^2)$. The sorting time in Step 3 depends on the sorting algorithm, at best $O(n \log n)$ and at worst $O(n^2)$, so the total is no more than $O(n^2)$; the time complexity of the data point allocation in Step 4 is $O(n)$. Therefore, the overall time complexity of the algorithm is $O(n^2)$, the same as that of the DPC algorithm.

4. Experimental Analysis

In order to verify the effectiveness of the DPC_M algorithm proposed in this paper, common UCI mixed datasets are used for the experimental study, and the algorithm is compared with the k-Prototypes algorithm. Both algorithms were implemented in Matlab 2015a, and the experiments were run on a Windows 10 computer with an Intel Core i5-5200U CPU and 4 GB of DDR3 memory.

4.1. Experiment Datasets

In this study, five mixed datasets from the UCI machine learning repository were investigated: Statlog Heart, Cleveland Heart Disease (Cleveland), Statlog Credit Approval (Credit), Acute Inflammations (Acute), and Adult. Brief information about these datasets is shown in Table 1.

Records with missing data are eliminated before the experiment. In addition, the numerical attributes are normalized using the maximum-minimum normalization method as follows:

$$x'_{it} = \frac{x_{it} - \min_t}{\max_t - \min_t},$$

where $[\min_t, \max_t]$ denotes the value range of the $t$-th numerical attribute and $x_{it}$ denotes the value of that attribute for the $i$-th data point in the dataset.
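For instance, the column-wise max-min scaling can be written as in this short sketch (the guard for constant columns is an implementation detail added here, not part of the formula):

```python
import numpy as np

def max_min_normalize(Xr):
    """Scale each numerical column to [0, 1] via (x - min) / (max - min)."""
    lo, hi = Xr.min(axis=0), Xr.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return (Xr - lo) / span
```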

4.2. Evaluation Index

Since the UCI datasets used have real class labels, the clustering accuracy can be used as the validity index. The clustering accuracy ACC measures how well the labels produced by the algorithm match the real class labels and is defined as follows:

$$\mathrm{ACC} = \frac{\sum_{i=1}^{k} a_i}{n}, \quad (8)$$

where $a_i$ denotes the number of correctly classified samples in the $i$-th cluster, $k$ denotes the number of clusters, and $n$ denotes the number of instances in the dataset. The higher the clustering accuracy, the better the clustering effect.
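Assuming each cluster is mapped to its majority true class, which is one common way to obtain the counts $a_i$, formula (8) can be computed as in this sketch:

```python
import numpy as np

def clustering_accuracy(labels, truth):
    """ACC (8): sum of correctly classified samples a_i over all clusters,
    divided by n.  truth must be integer-coded class labels (0, 1, ...)."""
    correct = 0
    for c in np.unique(labels):
        members = truth[labels == c]
        correct += np.bincount(members).max()  # a_i: majority-class count
    return correct / len(truth)
```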

4.3. Effectiveness Experiment

The k-Prototypes algorithm and the DPC_M algorithm are used to cluster each dataset described in Section 4.1, and ACC is calculated by formula (8). Following the literature [7], the weight parameter of the k-Prototypes algorithm is set in proportion to $\sigma$, where $\sigma$ represents the mean standard deviation of the numerical attributes. The k-Prototypes algorithm is run 100 times, and the average clustering accuracy over the 100 runs is taken as the comparison index. The neighborhood ratio of the DPC_M algorithm is chosen according to the literature [13], which recommends a value of about 1–2%.

The Decision Graphs and inflection diagrams (Ordered Gamma) of the clustering process of the DPC_M algorithm are shown in Figures 1–5.

In the Decision Graph and Ordered Gamma diagrams of Figures 1, 2, 3, and 5, the two colored points are the cluster centers; in Figure 4 there are 4 cluster centers.

The clustering results are shown in Table 2, which indicates that the accuracy of the DPC_M algorithm is higher than that of the k-Prototypes algorithm on all four datasets. It can be seen that the DPC_M algorithm proposed in this paper can cluster real mixed-attribute datasets with good results, and it can also determine the number of clusters automatically.

4.4. Influence of the Neighborhood Ratio Parameter

The DPC_M algorithm proposed in this paper has a single parameter, the neighborhood ratio. In order to study its influence on the clustering effect, we use the Credit dataset for a simulation experiment, letting the ratio be 0.5%, 1%, 1.5%, 2%, 3%, 6%, 8%, 10%, 15%, and 20%, respectively; the DPC_M algorithm is used to cluster the Credit dataset at each setting, and the clustering accuracy ACC is calculated in each case.

The clustering results are shown in Figure 6, where the horizontal axis is the value of the neighborhood ratio and the vertical axis is the clustering accuracy ACC. It can be seen from the figure that when the ratio is less than 6%, the fluctuation of ACC is very small, which means the algorithm is stable and little affected by the parameter value. When the ratio is greater than 10%, the clustering accuracy drops noticeably, which supports the conclusion in [13] that a suitable ratio is about 1–2%.

5. Conclusion

The key issue in applying the DPC algorithm proposed in [13] to mixed data is how to define the distance measure between the data points of a mixed dataset. Therefore, the DPC_M algorithm designed in this paper for clustering mixed data is built on a new unified dissimilarity metric between mixed data points. The clustering experiments on the UCI datasets show that the proposed DPC_M algorithm has better clustering performance and higher clustering stability than the traditional k-Prototypes algorithm, and it is not sensitive to the initial selection of prototypes.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This paper is supported by the Special Fund for Public Welfare Industry Research of the Ministry of Water Resources, China (201401044), the Subproject under National Science and Technology Support Program (2012BAD10B0101), and the General Research Project of the Foundation of Zhejiang Province Educational Committee (no. Y201636767).