Special Issue: Bio-Inspired Learning and Adaptation for Optimization and Control of Complex Systems
Research Article | Open Access
Rong Zhou, Yong Zhang, Shengzhong Feng, Nurbol Luktarhan, "A Novel Hierarchical Clustering Algorithm Based on Density Peaks for Complex Datasets", Complexity, vol. 2018, Article ID 2032461, 8 pages, 2018. https://doi.org/10.1155/2018/2032461
A Novel Hierarchical Clustering Algorithm Based on Density Peaks for Complex Datasets
Clustering aims to differentiate objects from different groups (clusters) by similarities or distances between pairs of objects. Numerous clustering algorithms have been proposed to investigate what factors constitute a cluster and how to find clusters efficiently. The clustering by fast search and find of density peaks algorithm was proposed to intuitively determine cluster centers and assign points to corresponding partitions for complex datasets. This method has a simple structure owing to its noniterative logic and small number of parameters; however, the guidelines for parameter selection and center determination are not explicit. To tackle these problems, we propose an improved hierarchical clustering method, HCDP, which aims to represent the complex structure of the dataset. A $k$-nearest neighbor strategy is integrated to compute the local density of each point, which avoids selecting the unnecessary global parameter $d_c$ and enables cluster smoothing and condensing. In addition, a new clustering evaluation approach is introduced to extract a “flat” and “optimal” partition from the structure by adaptively computing the clustering stability. The proposed approach is applied to several complex datasets, and the results demonstrate that it outperforms its counterparts to a large extent.
1. Introduction
Clustering is a process of partitioning data objects into subsets. Each subset is a cluster whose objects are similar to each other and dissimilar to objects in other clusters. For decades, numerous clustering algorithms have been proposed and widely used in many fields, including business intelligence, image pattern recognition, web search, and computational biology. The clustering by fast search and find of density peaks (CDP) algorithm was proposed by Rodriguez and Laio in Science [1]. It is based on the assumption that density peaks are candidates for cluster centers and that cluster centers are far away from each other. After the cluster centers are determined, each remaining object is directly assigned to the same cluster as its nearest neighbor of higher density. Compared to most clustering algorithms, the CDP algorithm does not require an objective function to be designed for iterative optimization, and it can find clusters regardless of their shapes. However, the CDP algorithm has the following shortcomings:
(i) There is no explicit criterion for selecting the key parameter $d_c$, the cutoff radius used for density calculation, which greatly affects the clustering results. The authors claimed that good results are obtained when the average number of neighbors within $d_c$ is around 1-2% of the dataset size, but they gave no explicit method for determining the optimal $d_c$.
(ii) Initial cluster centers are selected interactively rather than automatically, and a correct selection is quite difficult for some datasets.
To overcome the above issues, this paper proposes a new algorithm in which
(1) the $k$-nearest neighbors method is introduced to estimate local density, so that the deficiencies of CDP in computing the local density of an object are avoided;
(2) instead of clustering directly, a hierarchical clustering method is used to generate a complete clustering structure;
(3) the task of extracting a set of significant clusters is formulated as an optimization problem, and a control algorithm that finds the globally optimal solution to this problem is proposed.
The rest of this paper is organized as follows. Related works are introduced in Section 2. Section 3 describes the proposed method in detail. In Section 4, experimental results are presented and discussed. Conclusions and future work are stated in Section 5.
2. Related Works
As a novel and efficient algorithm, clustering using density peaks has attracted considerable attention. However, it still has shortcomings that cannot be ignored. In this section, we first review CDP briefly and then introduce hierarchical clustering methods, which represent the clustering structures of datasets.
2.1. Clustering Using Density Peaks
The clustering using density peaks (CDP) algorithm is based on the assumption that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher density. For each data object $i$, CDP computes two quantities, the local density $\rho_i$ and the distance $\delta_i$ from points of higher density, and builds a decision graph from which users pick cluster centers and exclude outliers:

$$\rho_i = \sum_{j} \chi(d_{ij} - d_c), \tag{1}$$

$$\delta_i = \min_{j:\, \rho_j > \rho_i} d_{ij}, \tag{2}$$

where $d_{ij}$ is the Euclidean distance between points $i$ and $j$, $d_c$ is a cutoff distance specified by the users, and $\chi(x) = 1$ if $x < 0$ and $\chi(x) = 0$ otherwise. For the point with the highest density, it takes $\delta_i = \max_j d_{ij}$.
For small datasets, the algorithm turns to the exponential kernel for density calculation, as follows:

$$\rho_i = \sum_{j} \exp\left(-\frac{d_{ij}^2}{d_c^2}\right). \tag{3}$$

The cluster centers are then recognized as points for which the values of both $\rho_i$ and $\delta_i$ are anomalously large. After the cluster centers are determined, the algorithm assigns each remaining object to the same cluster as its nearest neighbor of higher density.
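The computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation of CDP: the function names `cdp_quantities` and `assign_clusters` are hypothetical, the density counts only other points (which shifts every density by the same constant), ties in density are broken by index order, and the centers are picked by the largest product of density and distance, which is one common heuristic rather than a rule stated in the text.

```python
import math

def cdp_quantities(points, d_c):
    """Cutoff-kernel density, distance to the nearest denser point, and that point's index."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # rho[i]: number of other points strictly closer than d_c
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c) for i in range(n)]
    # Visit points from densest to sparsest; sorted() is stable, so ties keep
    # index order (this tie-breaking is an assumption of the sketch).
    order = sorted(range(n), key=lambda i: -rho[i])
    delta = [0.0] * n
    parent = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])  # densest point: largest distance to any point
        else:
            j = min(order[:rank], key=lambda c: dist[i][c])
            delta[i], parent[i] = dist[i][j], j
    return rho, delta, parent, order

def assign_clusters(centers, parent, order):
    """Give every non-center the label of its nearest denser neighbor.

    Assumes the densest point (order[0]) is one of the centers.
    """
    label = {c: idx for idx, c in enumerate(centers)}
    for i in order:  # denser points are labeled first
        if i not in label:
            label[i] = label[parent[i]]
    return [label[i] for i in range(len(order))]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
rho, delta, parent, order = cdp_quantities(points, d_c=1.0)
# Heuristic center choice (an assumption): the two largest rho * delta products.
centers = sorted(range(len(points)), key=lambda i: -rho[i] * delta[i])[:2]
labels = assign_clusters(centers, parent, order)
```

In this toy example the two square-center points have both a high density and a large distance to any denser point, so they stand out as centers, and every corner inherits the label of its own square's center.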
The problem is that there is no objective criterion for deciding whether a dataset is large or small, so users face a dilemma in selecting the method for computing density. Moreover, clustering results may vary greatly with the choice of $d_c$. In [1], the authors suggested choosing $d_c$ so that the average number of neighbors is around 1-2% of the total number of points in the dataset. However, this suggested choice is not always applicable when the size of the dataset changes.
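One common way to apply this rule of thumb is to take the cutoff as a low quantile of all pairwise distances: if a fraction $p$ of the pairs lies within the cutoff, each point has on average about $p(n-1)$ neighbors. The sketch below illustrates that idea under stated assumptions; the function names `cutoff_distance` and `average_neighbors` are hypothetical, and the rounding rule is a choice made for this example.

```python
import math

def cutoff_distance(points, fraction=0.02):
    """Pick the cutoff as the `fraction` quantile of all pairwise distances."""
    n = len(points)
    pair_d = sorted(math.dist(points[i], points[j])
                    for i in range(n) for j in range(i + 1, n))
    pos = max(1, round(fraction * len(pair_d)))  # 1-based rank in the sorted list
    return pair_d[pos - 1]

def average_neighbors(points, d_c):
    """Average number of other points strictly closer than d_c."""
    n = len(points)
    total = sum(1 for i in range(n) for j in range(n)
                if i != j and math.dist(points[i], points[j]) < d_c)
    return total / n

line = [(float(i),) for i in range(11)]  # 11 equally spaced points on a line
d_small = cutoff_distance(line, fraction=0.02)
d_large = cutoff_distance(line, fraction=0.2)
```

Because the quantile is taken over the pairs actually present, the resulting cutoff depends on both the size and the geometry of the dataset, which is one reason a fixed percentage does not transfer well between datasets.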
Some subsequent researchers have attempted to solve this problem. Mehmood et al. [2] introduced the heat diffusion method [3] to estimate point density and used the time parameter of heat diffusion to efficiently create clusters. This method is similar to kernel density estimation, and the bandwidth parameter is determined according to . Chen and He [4] calculated the field intensity and distance of every data point and fit them by a regression analysis. The cluster centers are determined by a residual analysis. Reference [5] computed the local density of each point using its $k$-nearest neighbors instead of $d_c$ in a kernel density estimation. Reference [6] also adopted a new local density metric using $k$-nearest neighbors in a kernel density estimation. Reference [7] analyzed local density metrics using $k$-nearest neighbors and tried to discriminate points belonging to different clusters more accurately.
Besides considering a single cutoff distance (or bandwidth), some studies also analyzed the case of multiple densities. In [8], a cover map procedure was applied iteratively with a decreasing locally adaptive window to build a multidimensional density map, which allows cluster center selection. Wang and Xu [9] proposed an adaptive peak detection with nonparametric multivariate kernel density estimation. The algorithm treats the dataset as a multivariate normal distribution, and the bandwidth matrix of the kernel density estimation is correlated with the dataset’s dimensionality. By narrowing down the possible ranges of the bandwidth and the number of clusters, the optimal values are chosen from all possible combinations of the two. Mehmood et al. [10] considered density regions instead of cluster centers. They used CDP to find local clusters and merged them using the concept of shared density regions.
The other flaw of the algorithm is the ambiguous choice of cluster centers. Some works [4, 11, 12] tried to identify the number of clusters by finding singular points on the indicator curves; others focus on merging microclusters [8, 10, 13] or splitting clusters [14] until some stopping conditions are satisfied. In this paper, we extract the optimal result from the hierarchical clustering structure, which will be introduced in the following section.
2.2. Hierarchical Clustering
The cluster centers are determined interactively in [1] rather than by explicit criteria. Some existing works [4, 11, 12] determined the number of clusters first, then sorted the data points by both indicators $\rho$ and $\delta$ in descending order and chose the top-ranked points, one per cluster, as cluster centers. Instead of directly calculating the number of clusters, hierarchical methods try to depict the clustering structure of the dataset. For easier interpretation of the structure, some automatic techniques are provided to extract a “flat” solution.
Hierarchical clustering (HC) methods represent data objects in a hierarchy or “tree” structure of clusters [15]. They are effective in detecting the true clustering structures of datasets. Many works on HC have been published in recent years. They can be categorized into two classes: distance-based methods and density-based methods.
For distance-based HC methods, the core idea is to agglomerate or divide clusters according to the distance between two clusters, where each cluster contains a set of data points. The most common measure of cluster distance is the distance between the closest pair of points belonging to different clusters, which yields a nearest-neighbor clustering algorithm. Moreover, if a threshold is used to terminate the clustering process, the algorithm is called the single-link method [16]. Without such a threshold, the merging process repeats until all the points eventually form one cluster. Similar approaches such as average-link and complete-link are also widely used [17]. If we take data objects as nodes of a graph whose edges are weighted by distance, the clustering algorithm that uses the minimum distance measure is called a minimum spanning tree (MST) algorithm [15, 18]. Usually, a single, global threshold cutting through the hierarchical cluster representation gives the “flat” partition of the data that users are most interested in.
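The connection between single-link merging, the minimum spanning tree, and a global threshold can be illustrated with a short sketch. It uses Kruskal's algorithm with a union-find structure and stops merging once the next shortest edge exceeds the threshold; the function name `single_link_flat` and the brute-force edge list are choices made for this example only.

```python
import math

def single_link_flat(points, threshold):
    """Merge closest clusters first; stop when the next edge is longer than threshold."""
    n = len(points)
    root = list(range(n))

    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]  # path halving keeps the trees shallow
            i = root[i]
        return i

    # All pairwise edges, shortest first (brute force, quadratic in n).
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    mst_edges = []
    for w, i, j in edges:
        if w > threshold:
            break  # every remaining edge is at least this long
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two different clusters: it is an MST edge
            root[ri] = rj
            mst_edges.append((i, j, w))
    ids = {}
    labels = [ids.setdefault(find(i), len(ids)) for i in range(n)]
    return labels, mst_edges

pts = [(0,), (1,), (2,), (10,), (11,)]
flat_labels, kept_edges = single_link_flat(pts, threshold=2.0)
```

With a threshold of 2.0 the long edge between the two groups is never added, so the result is a “flat” partition into two clusters; raising the threshold to 8.0 or more merges everything into one cluster.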
In contrast, density-based HC methods have received less attention. The main idea of these methods is to investigate the reachable distance of all data points and form hierarchical structures using the concept of “density connectivity.” The well-known OPTICS algorithm [19] is able to represent a density-based clustering structure of the dataset. Although a postprocessing procedure to extract a simplified clustering result was proposed, the procedure did not become as popular as OPTICS itself, since it relies heavily on the reachability plot and is sensitive to the choice of a critical parameter that cannot be determined easily. Gupta et al. [20] proposed a density-shaving strategy, applied to the hierarchical structure with reference to the work in [21], to achieve cluster extraction. Campello et al. [22] presented HDBSCAN as an improvement over OPTICS. By defining cluster stability, they turned the cluster extraction problem into an optimization problem of maximizing the overall stability of the set of clusters extracted from the HDBSCAN hierarchy.
3. The Proposed Method
In this section, we describe the proposed method in detail. The proposed method consists of three steps: local density calculation, hierarchy representation, and optimal cluster extraction. Local density calculation is based on the $k$-nearest neighbors. A hierarchical clustering method is then applied to depict the cluster structure. Finally, by introducing a concept of cluster stability, we propose an algorithm to solve the problem of extracting clusters from a cluster hierarchy.
3.1. Local Density Estimation Using $k$-Nearest Neighbors
Most of the extended works on CDP attempt to improve density estimation from the perspective of kernel methods, assuming that the data points follow the same or different Gaussian distribution(s). These works did not solve the problem of choosing $d_c$; on the contrary, they merely turned the problem of selecting a suitable $d_c$ into the problem of determining a suitable bandwidth of the Gaussian kernel. Unlike these works, we assess an object’s local density using the information of its neighbors. The $k$-nearest neighbor ($k$NN) approach has been shown to be a powerful technique for density estimation [23] and clustering [24–26]. The goal of this approach is to find the $k$ nearest neighbors of each data object in the dataset. With this approach, no assumptions on the distribution of the dataset are required, which means that the dataset can have arbitrary shapes and different density peaks.
To determine the density of a data object, we consider the $k$-nearest neighbor distance of the object. We call this the $k$-nearest density, and it is formally defined as follows:
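Definition 1 ($k$-nearest density). Given a dataset $X$ and a distance function $d$ on points in $X$, for $k \in \mathbb{N}^+$, the $k$-nearest density of a point $x_i$ is defined as

$$\rho_i = \frac{1}{d\left(x_i, x_i^{(k)}\right)}, \tag{4}$$

where $x_i \in X$, and $x_i^{(k)}$ is the $k$th nearest point to $x_i$ according to $d$.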
In general, $d$ can be any distance measure. The most common choice is the standard Euclidean distance. The most naive implementation of $k$NN search involves the brute-force computation of distances between all pairs of data points in the dataset, which scales as $O(n^2)$, where $n$ is the size of the dataset. To avoid this computational inefficiency, spatial indexing structures such as the KD tree [27] and the ball tree [28] can be applied, leading to $O(n \log n)$ computation cost.
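A brute-force version of this density estimate can be sketched as follows. The sketch assumes that the density is the reciprocal of the distance to the $k$th nearest neighbor, which is one natural reading of the definition rather than a confirmed formula; the function name `knn_density` is hypothetical, and a point that coincides with its $k$th neighbor is given infinite density by convention.

```python
import math

def knn_density(points, k):
    """Density of each point as 1 / (distance to its k-th nearest neighbor).

    Brute force: every pairwise distance is computed, so the cost is quadratic.
    The reciprocal form is an assumption of this sketch.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    rho = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        kth = dists[k - 1]  # distance to the k-th nearest neighbor
        rho.append(1.0 / kth if kth > 0 else math.inf)
    return rho

sample = [(0.0,), (1.0,), (2.0,), (10.0,)]
sample_rho = knn_density(sample, k=2)
```

The isolated point at 10.0 receives a much lower density than the three points near the origin, and a larger $k$ smooths the estimate because it looks further out before measuring the distance.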
3.2. Hierarchical Clustering Based on Density Peaks
In this section, we construct a hierarchical structure to represent the original dataset. In the cluster hierarchy based on density peaks, each level corresponds to the distance of objects from their nearest neighbor with higher density.
First of all, for each data object $x_i$ in the dataset, we compute its local density $\rho_i$ according to (4) and its distance $\delta_i$ from points with higher density:

$$\delta_i = \min_{j:\, \rho_j > \rho_i} d(x_i, x_j). \tag{5}$$
Then, we treat the data objects as nodes in a graph. Each node (except the node of maximal density) is connected with its nearest neighbor of higher density by an edge of weight $\delta_i$. The result is a tree whose root is the point with maximal density, and in which the density of every other node is lower than that of its ancestors and higher than that of its descendants. Compared with the MST method, this tree integrates information on density as well as distance.
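The tree construction just described can be sketched with parent pointers. The example takes the densities as given and links each point to its nearest neighbor among the denser points; the function name `density_peak_tree` is hypothetical, and ties in density are broken by index order, which is an assumption of this sketch.

```python
import math

def density_peak_tree(points, rho):
    """Link every point to its nearest denser neighbor.

    Returns (parent, weight, root): parent[i] is the index of the nearest
    denser point (None for the root) and weight[i] is the length of that edge.
    """
    n = len(points)
    order = sorted(range(n), key=lambda i: -rho[i])  # densest first, stable on ties
    parent = [None] * n
    weight = [math.inf] * n  # the root has no outgoing edge
    for rank in range(1, n):
        i = order[rank]
        # Candidates are the points that come earlier in the density order.
        j = min(order[:rank], key=lambda c: math.dist(points[i], points[c]))
        parent[i] = j
        weight[i] = math.dist(points[i], points[j])
    return parent, weight, order[0]

tree_points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
tree_rho = [1.0, 3.0, 2.0, 1.5, 0.5]
tree_parent, tree_weight, tree_root = density_peak_tree(tree_points, tree_rho)
```

In this example the point at 10.0 is a local density peak: its nearest denser neighbor is far away, so its edge is much heavier than the others. Heavy edges of this kind are the ones whose removal separates clusters.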
We can sort the edges and iteratively remove them from the tree in decreasing order of weight. After each edge cut, the tree may split, shrink, or even disappear, as defined below:
Definition 2 (tree split, shrink, and disappear). For a tree $T$, remove the edge(s) with weight $w$.
(1) If the number of children in a resulting subtree is less than a given threshold, all child nodes in the subtree are regarded as “noise” and this subtree disappears.
(2) If there is only one subtree with “nonnoise” nodes, we say that $T$ is shrunk.
(3) If there is more than one subtree with “nonnoise” nodes, we say that $T$ is split.
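The effect of cutting edges can be sketched for a single threshold. The example below removes every edge heavier than the threshold, which is the state reached after iteratively removing edges in decreasing order of weight, and marks small components as noise. The function name `cut_tree` is hypothetical, and measuring a subtree by its total number of nodes is an assumption about how the size threshold is applied.

```python
def cut_tree(parent, weight, threshold, min_size):
    """Remove edges heavier than `threshold` and label the remaining subtrees.

    parent[i] is the parent index of node i (None for the root) and weight[i]
    the weight of the edge to that parent. Subtrees with fewer than `min_size`
    nodes are labeled -1 ("noise"); the others get labels 0, 1, 2, ...
    """
    n = len(parent)

    def subtree_root(i):
        # Climb while the edge to the parent survives the cut.
        while parent[i] is not None and weight[i] <= threshold:
            i = parent[i]
        return i

    roots = [subtree_root(i) for i in range(n)]
    sizes = {}
    for r in roots:
        sizes[r] = sizes.get(r, 0) + 1
    ids = {}
    labels = []
    for r in roots:
        if sizes[r] < min_size:
            labels.append(-1)
        else:
            labels.append(ids.setdefault(r, len(ids)))
    return labels

toy_parent = [1, None, 1, 2, 3]
toy_weight = [1.0, float("inf"), 1.0, 8.0, 1.0]
split_labels = cut_tree(toy_parent, toy_weight, threshold=5.0, min_size=2)
shrunk_labels = cut_tree(toy_parent, toy_weight, threshold=5.0, min_size=3)
```

With a minimum size of 2 the cut leaves two “nonnoise” subtrees, so the tree is split; with a minimum size of 3 the smaller subtree becomes noise and only one subtree survives, so the tree is shrunk.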
Algorithm 1 shows the main steps of our HCDP algorithm, which requires two input parameters: $k$ and the “noise” threshold of Definition 2. It produces a clustering tree that contains all partitions obtainable by CDP in a nested way. This “HCDP hierarchy” can be implemented in time. Applying this algorithm transforms the clustering problem into a subtree partition problem.
3.3. Clustering Extraction and Evaluation
As mentioned above, the “HCDP hierarchy” can present the cluster structure of the dataset; however, interpreting the structure as a more understandable result, that is, extracting a “suitable” partition from the hierarchy to show the “focal” clusters, remains a problem. HCDP contains all possible CDP solutions with respect to the given parameters, $k$ and the “noise” threshold. As the edge-weight threshold decreases, more and more edges are cut, and clusters split or shrink until they disappear. The significant clusters “last” longer than the insignificant ones. To capture these significant clusters, we evaluate the quality of the generated clusters instead of simply applying a single, global threshold.
For the sake of simplicity, we consider a modified version of the cluster stability from [22]. It is based on Hartigan’s model [29] and tries to construct a tree of nested clusters by varying the threshold of the density level. The power of Hartigan’s model relies mainly on the following aspects:
(1) It allows the concept of noise to be modeled as those objects lying in sparse regions of the data space.
(2) It allows clusters of varied shapes to be modeled as the connected components of the density level sets. Such components are not restricted to the region of a single density peak; they can represent the union of multiple density peaks.
(3) It allows one to model the presence of nested clusters of varied densities in data, through the hierarchical relationships described by the density-contour tree.
Although the original definition applies to density-based clustering algorithms such as DBSCAN, clustering by density peaks and clustering by density share a similar property: both search for dense regions based on density connectivity. The original definition can therefore also be employed, after modification, in clustering algorithms based on density peaks.
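Given a cluster $C_i$, we define its stability as

$$S(C_i) = \sum_{x_j \in C_i} \left( \frac{1}{\delta(x_j, C_i)} - \frac{1}{\delta(C_i)} \right), \tag{6}$$

where $\delta(x_j, C_i)$ is the maximal edge weight whose removal excludes point $x_j$ from the cluster, and $\delta(C_i)$ is the maximal edge weight whose removal makes cluster $C_i$ emerge (get separated from the previous cluster).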
Let $\{C_1, \ldots, C_m\}$ be the collection of nonoverlapping clusters extracted from the hierarchy, and let $S(C_i)$ denote the stability value of each cluster. We can treat the extraction problem as an optimization problem with the objective of maximizing the sum of the stabilities of the clusters:

$$\max_{C_1, \ldots, C_m} \sum_{i=1}^{m} S(C_i). \tag{7}$$
To solve (7), we start from the edge of the highest weight. Every time we cut an edge from the tree, we determine the current number of clusters and calculate their stability. Algorithm 2 gives the pseudocode for finding the optimal solution to (7). Here, we use $k$ for both the neighborhood calculation and the “noise” threshold, as a classic smoothing factor whose effect is well understood from [19, 20, 30, 31].
4. Experiments, Results, and Discussion
In this section, we conduct experiments to assess the effectiveness of the proposed method. To demonstrate that HCDP is effective for clusters with both convex and nonconvex shapes, we benchmarked the algorithm on several 2-dimensional datasets, which are easy to visualize. Artificial datasets 1, 2, and 3 are selected. Dataset 1 is from [32] and consists of clusters with both convex and nonconvex shapes in a hierarchical structure; dataset 2 [33] consists of 3 clusters with nonconvex shapes; dataset 3 is a synthetic dataset consisting of 2 isotropic Gaussian blobs and 2 interleaving half-circles. With an appropriate choice of $k$, the hierarchical structures and clustering results are visualized in Figure 1.
Figure 1: (a) Dataset 1: hierarchical structure; (b) Dataset 1: clustering result; (c) Dataset 2: hierarchical structure; (d) Dataset 2: clustering result; (e) Dataset 3: hierarchical structure; (f) Dataset 3: clustering result.
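We also considered datasets from the UCI Machine Learning Repository [34–37] and compared HCDP with the DBSCAN and CDP algorithms. We evaluated the clustering results using the normalized mutual information (NMI) score, which has an information-theoretic interpretation. It is defined as

$$\mathrm{NMI}(\Omega, Y) = \frac{I(\Omega; Y)}{\left[H(\Omega) + H(Y)\right]/2}, \tag{8}$$

where $\Omega = \{\omega_1, \omega_2, \ldots\}$ is the set of clusters and $Y = \{y_1, y_2, \ldots\}$ is the set of known labels.
$I(\Omega; Y)$ is the mutual information:

$$I(\Omega; Y) = \sum_{i} \sum_{j} P(\omega_i \cap y_j) \log \frac{P(\omega_i \cap y_j)}{P(\omega_i)\, P(y_j)}, \tag{9}$$

where $P(\omega_i)$, $P(y_j)$, and $P(\omega_i \cap y_j)$ are the probabilities of an object being in cluster $\omega_i$, labeled $y_j$, and in the intersection of $\omega_i$ and $y_j$.
$H$ is the entropy, defined as

$$H(\Omega) = -\sum_{i} P(\omega_i) \log P(\omega_i) \quad \text{and} \quad H(Y) = -\sum_{j} P(y_j) \log P(y_j). \tag{10}$$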
The NMI score ranges from 0 (no mutual information) to 1 (perfect correlation). The NMI scores of the clustering results in the experiment are shown in Table 1.
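The NMI score can be computed directly from label counts. The sketch below estimates the probabilities by relative frequencies and normalizes the mutual information by the arithmetic mean of the two entropies; that normalization is an assumption of this example (other conventions, such as the geometric mean, also exist), and the function name `nmi` is hypothetical.

```python
import math
from collections import Counter

def nmi(clusters, labels):
    """Normalized mutual information between a clustering and known labels.

    Probabilities are relative frequencies; the mutual information is divided
    by the arithmetic mean of the two entropies (an assumed convention).
    """
    n = len(clusters)
    p_cluster = Counter(clusters)
    p_label = Counter(labels)
    joint = Counter(zip(clusters, labels))

    mutual_info = sum(
        (c / n) * math.log((c / n) / ((p_cluster[w] / n) * (p_label[y] / n)))
        for (w, y), c in joint.items()
    )

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    mean_entropy = (entropy(p_cluster) + entropy(p_label)) / 2
    # Both partitions consist of a single group: treat them as identical.
    return mutual_info / mean_entropy if mean_entropy > 0 else 1.0

score_same = nmi([0, 0, 1, 1], [1, 1, 0, 0])    # same partition, renamed labels
score_indep = nmi([0, 0, 1, 1], [0, 1, 0, 1])   # statistically independent partitions
```

The score depends only on how the objects are grouped, not on the names of the groups, so a clustering that matches the known labels up to renaming reaches the maximum value.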
5. Conclusions
This paper presented a hierarchical clustering method, HCDP, and introduced a clustering stability measure that enables HCDP to extract an optimal clustering result. We used $k$-nearest neighbors to calculate the local density of data objects and constructed the clustering hierarchy according to the concept of density peaks. Clustering stability was computed to evaluate and extract “suitable” partitions from the hierarchy. Our experiments have shown that the proposed method is robust and accurate compared with the original density peak clustering algorithm and the DBSCAN algorithm.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This research is supported by the National Natural Science Foundation of China (NSFC) (nos. 61433012 and U1435215) and Shenzhen Basic Research Grant JCYJ20160229195940462.
References
[1] A. Rodriguez and A. Laio, “Clustering by fast search and find of density peaks,” Science, vol. 344, no. 6191, pp. 1492–1496, 2014.
[2] R. Mehmood, G. Zhang, R. Bie, H. Dawood, and H. Ahmad, “Clustering by fast search and find of density peaks via heat diffusion,” Neurocomputing, vol. 208, pp. 210–217, 2016.
[3] Z. I. Botev, J. F. Grotowski, and D. P. Kroese, “Kernel density estimation via diffusion,” The Annals of Statistics, vol. 38, no. 5, pp. 2916–2957, 2010.
[4] J.-Y. Chen and H.-H. He, “A fast density-based data stream clustering algorithm with cluster centers self-determined for mixed data,” Information Sciences, vol. 345, pp. 271–293, 2016.
[5] J. Xie, H. Gao, W. Xie, X. Liu, and P. W. Grant, “Robust clustering by detecting density peaks and assigning points based on fuzzy weighted k-nearest neighbors,” Information Sciences, vol. 354, pp. 19–40, 2016.
[6] M. Du, S. Ding, and H. Jia, “Study on density peaks clustering based on k-nearest neighbors and principal component analysis,” Knowledge-Based Systems, vol. 99, pp. 135–145, 2016.
[7] L. Yaohui, M. Zhengming, and Y. Fang, “Adaptive density peak clustering based on K-nearest neighbors with aggregating strategy,” Knowledge-Based Systems, vol. 133, pp. 208–220, 2017.
[8] V. Courjault-Radé, L. D’Estampes, and S. Puechmorel, Improved Density Peak Clustering for Large Datasets, 2016, working paper or preprint.
[9] X.-F. Wang and Y. Xu, “Fast clustering using adaptive density peak detection,” Statistical Methods in Medical Research, vol. 26, no. 6, pp. 2800–2811, 2015.
[10] R. Mehmood, S. El-Ashram, R. Bie, H. Dawood, and A. Kos, “Clustering by fast search and merge of local density peaks for gene expression microarray data,” Scientific Reports, vol. 7, article 45602, 2017.
[11] C. Jinyin, L. Xiang, Z. Haibing, and B. Xintong, “A novel cluster center fast determination clustering algorithm,” Applied Soft Computing, vol. 57, pp. 539–555, 2017.
[12] R. Zhou, S. Zhang, C. Chen et al., “A distance and density-based clustering algorithm using automatic peak detection,” in 2016 IEEE International Conference on Smart Cloud (SmartCloud), pp. 176–183, New York, NY, USA, November 2016.
[13] Z. Liang and P. Chen, “Delta-density based clustering with a divide-and-conquer strategy: 3DC clustering,” Pattern Recognition Letters, vol. 73, pp. 52–59, 2016.
[14] J. Xu, G. Wang, and W. Deng, “DenPEHC: density peak based efficient hierarchical clustering,” Information Sciences, vol. 373, pp. 200–218, 2016.
[15] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Elsevier Inc., 3rd edition, 2012.
[16] R. Sibson, “SLINK: an optimally efficient algorithm for the single-link cluster method,” The Computer Journal, vol. 16, no. 1, pp. 30–34, 1973.
[17] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1988.
[18] C. T. Zahn, “Graph-theoretical methods for detecting and describing gestalt clusters,” IEEE Transactions on Computers, vol. C-20, no. 1, pp. 68–86, 1971.
[19] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, “OPTICS: ordering points to identify the clustering structure,” in Proceedings of the 1999 ACM SIGMOD international conference on Management of data - SIGMOD '99, pp. 49–60, New York, NY, USA, June 1999.
[20] G. Gupta, A. Liu, and J. Ghosh, “Automated hierarchical density shaving: a robust automated clustering and visualization framework for large biological data sets,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 2, pp. 223–237, 2010.
[21] M. Herbin, N. Bonnet, and P. Vautrot, “Estimation of the number of clusters and influence zones,” Pattern Recognition Letters, vol. 22, no. 14, pp. 1557–1568, 2001.
[22] R. J. G. B. Campello, D. Moulavi, and J. Sander, “Density-based clustering based on hierarchical density estimates,” in Advances in Knowledge Discovery and Data Mining, J. Pei, V. S. Tseng, L. Cao, H. Motoda, and G. Xu, Eds., Lecture Notes in Computer Science, pp. 160–172, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.
[23] D. O. Loftsgaarden and C. P. Quesenberry, “A nonparametric estimate of a multivariate density function,” The Annals of Mathematical Statistics, vol. 36, no. 3, pp. 1049–1051, 1965.
[24] E. Aksehirli, B. Goethals, E. Müller, and J. Vreeken, “Cartification: a neighborhood preserving transformation for mining high dimensional data,” in 2013 IEEE 13th International Conference on Data Mining, pp. 937–942, Dallas, TX, USA, December 2013.
[25] R. A. Jarvis and E. A. Patrick, “Clustering using a similarity measure based on shared near neighbors,” IEEE Transactions on Computers, vol. C-22, no. 11, pp. 1025–1034, 1973.
[26] J. Schneider and M. Vlachos, “Fast parameterless density-based clustering via random projections,” in Proceedings of the 22nd ACM international conference on Conference on information & knowledge management - CIKM '13, pp. 861–866, New York, NY, USA, October-November 2013.
[27] J. L. Bentley, “Multidimensional binary search trees used for associative searching,” Communications of the ACM, vol. 18, no. 9, pp. 509–517, 1975.
[28] S. M. Omohundro, Five Balltree Construction Algorithms, International Computer Science Institute Berkeley, 1989.
[29] J. A. Hartigan, “Estimation of a convex density contour in two dimensions,” Journal of the American Statistical Association, vol. 82, no. 397, pp. 267–270, 1987.
[30] T. Pei, A. Jasra, D. J. Hand, A.-X. Zhu, and C. Zhou, “Decode: a new method for discovering clusters of different densities in spatial data,” Data Mining and Knowledge Discovery, vol. 18, no. 3, pp. 337–369, 2009.
[31] W. Stuetzle and R. Nugent, “A generalized single linkage method for estimating the cluster tree of a density,” Journal of Computational and Graphical Statistics, vol. 19, no. 2, pp. 397–418, 2010.
[32] A. Gionis, H. Mannila, and P. Tsaparas, “Clustering aggregation,” ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 1, p. 4, 2007.
[33] H. Chang and D.-Y. Yeung, “Robust path-based spectral clustering,” Pattern Recognition, vol. 41, no. 1, pp. 191–203, 2008.
[34] M. Charytanowicz, J. Niewczas, P. Kulczycki, P. A. Kowalski, S. Łukasik, and S. Żak, “Complete gradient clustering algorithm for features analysis of X-ray images,” in Information Technologies in Biomedicine, E. Piȩtka and J. Kawa, Eds., vol. 69 of Advances in Intelligent and Soft Computing, pp. 15–24, Springer, Berlin, Heidelberg, 2010.
[35] G. Gates, “The reduced nearest neighbor rule (corresp.),” IEEE Transactions on Information Theory, vol. 18, no. 3, pp. 431–433, 1972.
[36] B. Vandeginste, “PARVUS: an extendable package of programs for data exploration, classification and correlation, M. Forina, R. Leardi, C. Armanino and S. Lanteri, Elsevier, Amsterdam, 1988, Price: US $645 ISBN 0-444-43012-1,” Journal of Chemometrics, vol. 4, no. 2, pp. 191–193, 1990.
[37] W. H. Wolberg, W. N. Street, and O. L. Mangasarian, “Machine learning techniques to diagnose breast cancer from image-processed nuclear features of fine needle aspirates,” Cancer Letters, vol. 77, no. 2-3, pp. 163–171, 1994.
Copyright © 2018 Rong Zhou et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.