Complexity

Volume 2018, Article ID 2032461, 8 pages

https://doi.org/10.1155/2018/2032461

## A Novel Hierarchical Clustering Algorithm Based on Density Peaks for Complex Datasets

^{1}Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; ^{2}University of Chinese Academy of Sciences, China; ^{3}Information Science and Engineering College, Xinjiang University, Urumqi, China

Correspondence should be addressed to Yong Zhang; zhangyong@siat.ac.cn

Received 13 March 2018; Accepted 5 June 2018; Published 18 July 2018

Academic Editor: Shyam Kamal

Copyright © 2018 Rong Zhou et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Clustering aims to differentiate objects from different groups (clusters) by similarities or distances between pairs of objects. Numerous clustering algorithms have been proposed to investigate what factors constitute a cluster and how to efficiently find them. The clustering by fast search and find of density peak algorithm was proposed to intuitively determine cluster centers and assign points to corresponding partitions for complex datasets. This method has a simple structure owing to its noniterative logic and few parameters; however, the guidelines for parameter selection and center determination are not explicit. To tackle these problems, we propose an improved hierarchical clustering method, HCDP, which aims to represent the complex structure of the dataset. A $k$-nearest neighbor strategy is integrated to compute the local density of each point, which avoids selecting an unnecessary global parameter and enables cluster smoothing and condensing. In addition, a new clustering evaluation approach is introduced to extract a “flat” and “optimal” partition solution from the structure by adaptively computing the clustering stability. The proposed approach is evaluated on several applications with complex datasets, where the results demonstrate that the novel method outperforms its counterparts to a large extent.

#### 1. Introduction

Clustering is a process of partitioning data objects into subsets. Each subset is a cluster whose objects are similar to each other while dissimilar to objects in other clusters. For decades, numerous clustering algorithms have been proposed and widely used in many fields, including business intelligence, image pattern recognition, web search, and computational biology. The clustering by fast search and find of density peak (CDP) algorithm was proposed by Rodriguez and Laio in Science [1]. It is based on the assumption that the density peaks are candidates for cluster centers and that cluster centers are far away from each other. After the cluster centers are determined, each remaining object is assigned to the same cluster as its nearest neighbor of higher density. Compared to most clustering algorithms, the CDP algorithm does not require an objective function to be designed for iterative optimization, and it can find clusters regardless of their shapes. However, the CDP algorithm has the following shortcomings:

(i) There is no explicit criterion for selecting the key parameter $d_c$, the cutoff radius for density calculation, which greatly affects the clustering results. The authors claimed that good results are obtained when the average number of neighbors is around 1-2% of the total number of points in the dataset, without giving an explicit method for determining the optimal $d_c$.

(ii) Initial cluster centers are selected interactively rather than automatically, and for some datasets it is quite difficult to make a correct selection.

To overcome the above issues, this paper proposes a new algorithm in which

(1) a $k$-nearest neighbors method is introduced to estimate local density, so that the deficiencies of CDP in computing the local density of an object can be avoided;

(2) instead of clustering directly, a hierarchical clustering method is used to generate a complete clustering structure;

(3) the task of extracting a set of significant clusters is formulated as an optimization problem, and a control algorithm that finds the globally optimal solution to this problem is proposed.

The rest of this paper is organized as follows. Related works are introduced in Section 2. Section 3 describes the proposed method in detail. In Section 4, experimental results are presented and discussed. Conclusions and future work are stated in Section 5.

#### 2. Related Works

As a novel and efficient algorithm, clustering using density peaks has attracted considerable attention. However, it still has some shortcomings that cannot be ignored. In this section, we first review CDP briefly and then introduce hierarchical clustering methods for representing the clustering structures of datasets.

##### 2.1. Cluster Using Density Peaks

The clustering using density peak (CDP) algorithm is based on the assumption that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher density. By calculating these two quantities for each data object, CDP builds a decision graph from which users pick cluster centers and exclude outliers.

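Formally, let $X = \{x_1, \dots, x_n\}$ denote a dataset of $n$ objects. For each data object $x_i$, its local density $\rho_i$ is defined by (1), and its distance $\delta_i$ from the nearest point of higher density is defined by (2):

$$\rho_i = \sum_{j} \chi(d_{ij} - d_c) \tag{1}$$

$$\delta_i = \min_{j:\, \rho_j > \rho_i} d_{ij} \tag{2}$$

where $d_{ij}$ is the Euclidean distance between points $x_i$ and $x_j$, $d_c$ is a cutoff distance specified by the user, and $\chi(x) = 1$ if $x < 0$ and $\chi(x) = 0$ otherwise. For the point with the highest density, it takes $\delta_i = \max_{j} d_{ij}$.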

For small datasets, the algorithm turns to the exponential kernel for density calculation:

$$\rho_i = \sum_{j} \exp\left(-\frac{d_{ij}^2}{d_c^2}\right) \tag{3}$$

The cluster centers are then recognized as points for which the values of both $\rho_i$ and $\delta_i$ are anomalously large. After the cluster centers are determined, the algorithm assigns each remaining object to the same cluster as its nearest neighbor of higher density.
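The procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the reference implementation of [1]: the function names are hypothetical, ties between equal densities are broken by index (an assumption made here so that exactly one point has no denser neighbor), and the two centers are picked programmatically by largest distance value only to keep the example self-contained, whereas [1] selects them interactively from the decision graph.

```python
import math

def cdp_quantities(points, d_c):
    """Cutoff-kernel density, distance to the nearest denser point, and that point."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Density: number of other points closer than d_c (counting the point
    # itself as well would only shift every density by one).
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    delta = [0.0] * n
    nearest_higher = [-1] * n
    for i in range(n):
        # ASSUMPTION: equal densities are ordered by index.
        higher = [j for j in range(n)
                  if rho[j] > rho[i] or (rho[j] == rho[i] and j < i)]
        if higher:
            j_best = min(higher, key=lambda j: dist[i][j])
            delta[i], nearest_higher[i] = dist[i][j_best], j_best
        else:
            delta[i] = max(dist[i])  # densest point takes the largest distance
    return rho, delta, nearest_higher

def assign_clusters(rho, nearest_higher, centers):
    """Give every point the label of its nearest neighbor of higher density."""
    labels = [-1] * len(rho)
    for cluster_id, center in enumerate(centers):
        labels[center] = cluster_id
    # Visit points from dense to sparse so the denser neighbor is labeled first.
    for i in sorted(range(len(rho)), key=lambda i: (-rho[i], i)):
        if labels[i] == -1 and nearest_higher[i] != -1:
            labels[i] = labels[nearest_higher[i]]
    return labels

# Two well-separated square blobs, each with one central point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
rho, delta, nearest_higher = cdp_quantities(points, d_c=0.9)
# Illustration only: take the two points with the largest delta as centers.
centers = sorted(range(len(points)), key=lambda i: delta[i], reverse=True)[:2]
labels = assign_clusters(rho, nearest_higher, centers)
```

On this toy input the two central points are the only ones with both a high density and a large distance value, so they stand out as centers and every corner point inherits the label of its own blob.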

The problem is that there is no objective metric to decide whether a dataset is large or small, so users may face a dilemma in selecting the method for computing density. Moreover, clustering results may vary greatly with the choice of $d_c$. In [1], the authors suggested choosing $d_c$ so that the average number of neighbors is around 1-2% of the total number of points in the dataset. However, the suggested choice is not always applicable when the size of the dataset changes.
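One way to read the 1-2% rule of thumb is as a quantile of the pairwise distances: if a given fraction of all point pairs lies within the cutoff, each point has on average that fraction of the other points as neighbors. The sketch below follows this reading; the function names and the rounding of the quantile position are assumptions made here for illustration, not details taken from [1].

```python
import math
from itertools import combinations

def cutoff_from_fraction(points, fraction=0.02):
    """Pick the cutoff as the `fraction` quantile of all pairwise distances."""
    pair_dists = sorted(math.dist(p, q) for p, q in combinations(points, 2))
    position = round(len(pair_dists) * fraction) - 1
    position = max(0, min(len(pair_dists) - 1, position))
    return pair_dists[position]

def average_neighbors(points, d_c):
    """Average number of other points within distance d_c (boundary included)."""
    close_pairs = sum(1 for p, q in combinations(points, 2)
                      if math.dist(p, q) <= d_c)
    return 2 * close_pairs / len(points)

# Six 1-D points whose pairwise distances are all distinct.
points = [(1,), (2,), (4,), (8,), (16,), (32,)]
d_c = cutoff_from_fraction(points, fraction=0.2)
avg = average_neighbors(points, d_c)
```

Because the quantile is taken over a fixed set of distances, the resulting cutoff changes with the size and spread of the dataset, which is one reason a single recommended percentage does not transfer well between datasets.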

Some subsequent researchers have attempted to solve this problem. Mehmood et al. [2] introduced a heat diffusion method [3] to estimate point density and used the time parameter of heat diffusion to efficiently create clusters. This method is similar to kernel density estimation, and the bandwidth parameter is determined according to [3]. Chen and He [4] calculated the field intensity and distance of every data point and fit them by a regression analysis; the cluster centers are then determined by a residual analysis. Reference [5] computed the local density of each point using its $k$-nearest neighbors instead of $d_c$ in a kernel density estimation. Reference [6] also adopted a new local density metric using $k$-nearest neighbors in a kernel density estimation. Reference [7] analyzed local density metrics based on $k$-nearest neighbors and tried to discriminate points belonging to different clusters more accurately.

Besides considering a single cutoff distance (or bandwidth), some studies also analyzed the case of multiple densities. In [8], a cover map procedure was applied iteratively with a decreasing locally adaptive window to build a multidimensional density map, which allows cluster center selection. Wang and Xu [9] proposed an adaptive peak detection with nonparametric multivariate kernel density estimation. The algorithm treats the dataset as a multivariate normal distribution, and the bandwidth matrix of the kernel density estimation is correlated with the dataset’s dimensionality. By narrowing down the possible ranges of the bandwidth and the number of clusters, the optimal values are chosen from all possible combinations of the two. Mehmood et al. [10] considered density regions instead of cluster centers. They used CDP to find local clusters and merged them using the concept of shared density regions.

The other flaw of the algorithm is the ambiguous choice of cluster centers. Some works [4, 11, 12] tried to identify the number of clusters by finding singular points on the indicator curves; others focus on merging microclusters [8, 10, 13] or splitting clusters [14] until some stopping conditions are satisfied. In this paper, we extract the optimal result from the hierarchical clustering structure, which is introduced in the following section.

##### 2.2. Hierarchical Clustering

The cluster centers are determined interactively in [1] rather than by explicit criteria. Some existing works [4, 11, 12] determined the number of clusters first, then sorted the data points by both indicators $\rho$ and $\delta$ in descending order and chose that many top-ranked points as cluster centers. Instead of directly calculating the number of clusters, hierarchical methods try to depict the clustering structure of the dataset. For easier interpretation of the structure, some automatic techniques are provided to extract a “flat” solution.

Hierarchical clustering (HC) methods represent data objects in a hierarchy or “tree” structure of clusters [15]. They are effective in detecting the true clustering structures of datasets. Many works on HC have been published in recent years. They can be categorized into two classes: distance-based methods and density-based methods.

For distance-based HC methods, the core idea is to agglomerate or divide clusters according to the distance between two clusters, where each cluster contains a set of data points. The most common measure of cluster distance is the distance between the closest pair of points belonging to different clusters. The resulting method can be regarded as a nearest-neighbor clustering algorithm. Moreover, if there is a threshold to terminate the clustering process, it is called the single-link method [16]. Otherwise, the merging process repeats until all the points eventually form one cluster. Similar approaches such as average-link or complete-link are also widely used [17]. If we take data objects as nodes of a graph with distance-weighted edges, the clustering algorithm that uses the minimum distance measure is called a minimal spanning tree (MST) algorithm [15, 18]. Usually, a single global threshold cutting through the hierarchical cluster representation gives the “flat” partition of the data that users are most interested in.
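The link between single-link merging, the MST, and a global threshold can be made concrete with a short sketch. It processes point pairs in increasing order of distance, as Kruskal's MST algorithm does, and stops at the threshold; the function name and the union-find helper are illustrative choices made here, not taken from the cited works.

```python
import math
from itertools import combinations

def single_link_flat(points, threshold):
    """Merge clusters by closest pair until the next distance exceeds threshold."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    tree_edges = []  # accepted edges form (part of) the minimum spanning tree
    for w, i, j in edges:
        if w > threshold:
            break  # global threshold: no merge beyond this distance
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            tree_edges.append((w, i, j))
    # Relabel cluster roots as 0, 1, 2, ... in order of first appearance.
    ids = {}
    labels = [ids.setdefault(find(i), len(ids)) for i in range(len(points))]
    return labels, tree_edges

points = [(0, 0), (0, 1), (0, 2), (5, 0), (5, 1)]
flat_labels, flat_edges = single_link_flat(points, threshold=1.5)
one_labels, full_tree = single_link_flat(points, threshold=10.0)
```

With a threshold of 1.5 the two columns of points stay separate; with a threshold larger than every pairwise distance all points merge and the accepted edges form a full spanning tree.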

In contrast, density-based HC methods have received less attention. The main idea of these methods is to investigate the reachable distance of all data points and form hierarchical structures using the concept of “density connectivity.” The well-known OPTICS algorithm [19] is able to represent a density-based clustering structure of the dataset. Although a postprocessing procedure to extract a simplified clustering result was proposed, the procedure did not become as popular as OPTICS itself, since it relies heavily on the reachability plot and is sensitive to the choice of a critical parameter that cannot be determined easily. Gupta et al. [20] proposed a density-shaving strategy applied to the hierarchical structure, building on the work in [21], to achieve cluster extraction. Campello et al. [22] presented HDBSCAN as an improvement over OPTICS. By defining cluster stability, they turned the cluster extraction problem into an optimization problem of maximizing the overall stability of the set of clusters extracted from the HDBSCAN hierarchy.

#### 3. The Proposed Method

In this section, we describe the proposed method in detail. The proposed method consists of three steps: local density calculation, hierarchy representation, and optimal cluster extraction. Local density calculation is based on the $k$-nearest neighbors. A hierarchical clustering method is then applied to depict the cluster structure. Finally, by introducing a concept of cluster stability, we propose an algorithm to solve the problem of extracting clusters from a cluster hierarchy.

##### 3.1. Local Density Estimation Using $k$-Nearest Neighbors

Most extensions of CDP attempt to improve density estimation with kernel methods, assuming that the data points follow one or several Gaussian distributions. These works do not solve the problem of choosing $d_c$; instead, they merely turn the problem of selecting a suitable $d_c$ into the problem of determining a suitable bandwidth for the Gaussian kernel. Unlike these works, we assess an object’s local density using the information of its neighbors. The $k$-nearest neighbor ($k$NN) approach has been shown to be a powerful technique for density estimation [23] and clustering [24–26]. The goal of this approach is to find the $k$ nearest neighbors of each data object in the dataset. With this approach, no assumptions on the distribution of the dataset are required, which means that the dataset can have arbitrary shapes and different density peaks.

To determine the density of a data object, we consider the $k$-nearest neighbor distance of the object. We call this the $k$-nearest density, and it is formally defined as follows:

*Definition 1* ($k$-nearest density). Given a dataset $X$ and a distance function $d$ on points in $X$, for $x \in X$, the $k$-nearest density of point $x$ is defined as in (4), where $x_k$ is the $k$th nearest point to $x$ according to $d$.

In general, $d$ can be any distance measure; the most common choice is the standard Euclidean distance. The most naive implementation of $k$NN search involves the brute-force computation of distances between all pairs of data points in the dataset, which scales as $O(n^2)$, where $n$ is the size of the dataset. To avoid this computational inefficiency, spatial indexing structures such as the KD tree [27] and the ball tree [28] can be applied, leading to $O(n \log n)$ computation cost.
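As a rough illustration of the brute-force search, the sketch below computes each point's distance to its $k$th nearest neighbor and turns it into a density. The density formula is an assumption made here for illustration only (the reciprocal of the $k$th neighbor distance, so that tightly packed points score higher); it is not claimed to be the exact definition used in this paper, and the function names are hypothetical.

```python
import math

def knn_distance(points, k):
    """Distance from each point to its k-th nearest other point (brute force)."""
    result = []
    for i, p in enumerate(points):
        # All n - 1 distances are computed for every point: O(n^2) in total.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result

def knn_density(points, k):
    """ASSUMED density: reciprocal of the k-th nearest neighbor distance."""
    return [1.0 / d if d > 0 else math.inf for d in knn_distance(points, k)]

# Three close points and one distant point on a line.
points = [(0,), (1,), (2,), (10,)]
kth = knn_distance(points, k=2)
density = knn_density(points, k=2)
densest = max(range(len(points)), key=lambda i: density[i])
```

A spatial index such as a KD tree or ball tree would replace the inner scan over all other points with a neighbor query, which is where the saving over the quadratic brute-force cost comes from.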

##### 3.2. Hierarchical Clustering Based on Density Peaks

In this section, we construct a hierarchical structure to represent the original dataset. In the cluster hierarchy based on density peaks, each level corresponds to the objects’ distance from their nearest neighbor with higher density.

First, for each data object $x$ in the dataset, we compute its local density $\rho(x)$ according to (4) and its distance $\delta(x)$ from the nearest point with higher density:

$$\delta(x) = \min_{y \in X:\, \rho(y) > \rho(x)} d(x, y) \tag{5}$$

Then, we treat the data objects as nodes in a graph. Each node (except the node of maximal density) is connected to its nearest neighbor of higher density by an edge whose weight is the distance $\delta$ between them. The result is a tree whose root is the point with maximal density, in which each node’s density is lower than that of its ancestors and higher than that of its descendants. Compared with the MST method, this tree integrates information on not only distance but also density.
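The tree construction described above can be sketched as a parent-pointer array. The sketch takes the densities as given, so any density estimate can be plugged in; the function name and the index-based tie-break for equal densities are assumptions introduced here so that exactly one root exists.

```python
import math

def density_peak_tree(points, rho):
    """Link every point to its nearest neighbor of higher density."""
    n = len(points)
    parent = [-1] * n    # -1 marks the root (the point of maximal density)
    weight = [0.0] * n   # weight[i] is the length of the edge (i, parent[i])
    for i in range(n):
        best = None
        for j in range(n):
            # ASSUMPTION: equal densities are ordered by index.
            is_higher = rho[j] > rho[i] or (rho[j] == rho[i] and j < i)
            if is_higher:
                d = math.dist(points[i], points[j])
                if best is None or d < best[0]:
                    best = (d, j)
        if best is not None:
            weight[i], parent[i] = best
    return parent, weight

# Two groups on a line; point 1 is the global peak, point 4 a local peak.
points = [(0,), (1,), (2,), (10,), (11,), (12,)]
rho = [1, 3, 1, 1, 2, 1]
parent, weight = density_peak_tree(points, rho)
```

In this toy case the only long edge is the one joining the local peak (point 4) to the global peak (point 1), which is the edge a distance-ordered cut would remove first.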

We then sort the edges and iteratively remove them from the tree in decreasing order of weight. After each edge cut, the tree may split, shrink, or even disappear, as defined below:

*Definition 2* (tree split, shrink, and disappear). For a tree $T$, remove the edge(s) with weight $w$.

(1) If the number of child nodes in a resulting subtree is less than a given threshold, all child nodes in that subtree are regarded as “noise” and the subtree disappears.

(2) If there is only one subtree with “nonnoise” nodes, we say that $T$ is shrunk.

(3) If there is more than one subtree with “nonnoise” nodes, we say that $T$ is split.
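A single cutting step under this definition might look as follows. This is a sketch under assumptions: the tree is given as a parent-pointer array with per-node edge weights, the size of a subtree is taken to be its number of nodes, and the names `cut_heaviest_edges` and `min_size` are hypothetical.

```python
from collections import Counter

def cut_heaviest_edges(parent, weight, min_size):
    """Remove the heaviest edge(s) of one tree and classify the outcome.

    parent[i] is the parent of node i (-1 for the root) and weight[i] is the
    weight of the edge (i, parent[i]). The tree must have at least one edge.
    """
    n = len(parent)
    w_max = max(weight[i] for i in range(n) if parent[i] != -1)
    # Keep only the edges lighter than the heaviest one.
    kept = [parent[i] if parent[i] != -1 and weight[i] < w_max else -1
            for i in range(n)]

    def root(i):
        while kept[i] != -1:
            i = kept[i]
        return i

    component = [root(i) for i in range(n)]
    sizes = Counter(component)
    # ASSUMPTION: a subtree with fewer than min_size nodes is noise.
    survivors = [r for r, size in sizes.items() if size >= min_size]
    labels = [c if sizes[c] >= min_size else None for c in component]
    if len(survivors) > 1:
        outcome = "split"
    elif len(survivors) == 1:
        outcome = "shrink"
    else:
        outcome = "disappear"
    return outcome, labels

# A six-node tree whose heaviest edge joins two three-node groups.
parent = [1, -1, 1, 4, 1, 4]
weight = [1.0, 0.0, 1.0, 1.0, 10.0, 1.0]
split_case = cut_heaviest_edges(parent, weight, min_size=2)
gone_case = cut_heaviest_edges(parent, weight, min_size=4)
# A four-node tree whose heaviest edge detaches a single leaf.
shrink_case = cut_heaviest_edges([1, -1, 1, 1], [1.0, 0.0, 1.0, 5.0], min_size=2)
```

Nodes labeled `None` are the ones regarded as noise; the remaining labels are the root indices of the surviving subtrees.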

Algorithm 1 shows the main steps of our HCDP algorithm, which requires two input parameters: the number of nearest neighbors $k$ used for density estimation and the size threshold below which a subtree is regarded as noise. It produces a clustering tree that contains all partitions obtainable by CDP in a nested way. This “HCDP hierarchy” can be implemented in time. Applying this algorithm transforms the clustering problem into a subtree partition problem.