Abstract

Clustering aims to differentiate objects from different groups (clusters) by similarities or distances between pairs of objects. Numerous clustering algorithms have been proposed to investigate what factors constitute a cluster and how to efficiently find them. The clustering by fast search and find of density peaks algorithm was proposed to intuitively determine cluster centers and assign points to corresponding partitions for complex datasets. This method has a simple structure owing to its noniterative logic and few parameters; however, the guidelines for parameter selection and center determination are not explicit. To tackle these problems, we propose an improved hierarchical clustering method, HCDP, which aims to represent the complex structure of a dataset. A k-nearest neighbor strategy is integrated to compute the local density of each point, which avoids selecting an unnecessary global parameter and enables cluster smoothing and condensing. In addition, a new clustering evaluation approach is introduced to extract a “flat” and “optimal” partition solution from the structure by adaptively computing the clustering stability. The proposed approach is applied to several complex datasets, and the results demonstrate that the new method outperforms its counterparts to a large extent.

1. Introduction

Clustering is a process of partitioning data objects into subsets. Each subset is a cluster whose objects are similar to each other and dissimilar to objects in other clusters. For decades, numerous clustering algorithms have been proposed and widely used in many fields, including business intelligence, image pattern recognition, web search, and computational biology. The clustering by fast search and find of density peaks (CDP) algorithm was proposed by Rodriguez and Laio in Science [1]. It is based on the assumption that density peaks are candidates for cluster centers and that cluster centers are far away from each other. After the cluster centers are determined, the remaining objects are directly assigned to the nearest cluster center. Compared to most clustering algorithms, CDP does not require an objective function to be optimized iteratively, and it can find clusters regardless of their shapes. However, the CDP algorithm has the following shortcomings: (i) There is no explicit criterion for selecting the key parameter d_c, the cutoff radius used for density calculation, even though it greatly affects the clustering results. The authors claimed that good results are obtained when the average number of neighbors is around 1-2% of the points in the dataset, without giving an explicit method for determining the optimal d_c that achieves the best clustering effect. (ii) Initial cluster centers are selected interactively rather than automatically, and it is quite difficult to make a correct selection for some datasets.

To overcome the above issues, this paper proposes a new algorithm in which (1) a k-nearest neighbor method is introduced to estimate local density, so that the deficiencies of CDP in computing the local density of an object can be avoided; (2) instead of clustering directly, a hierarchical clustering method is used to generate a complete clustering structure; and (3) the task of extracting a set of significant clusters is formulated as an optimization problem, and an algorithm that finds the globally optimal solution to this problem is proposed.

The rest of this paper is organized as follows. Related works are introduced in Section 2. Section 3 describes the proposed method in detail. In Section 4, experimental results are presented and discussed. Conclusions and future work are stated in Section 5.

2. Related Works

As a novel and efficient algorithm, clustering using density peaks has attracted considerable attention. However, there are still some shortcomings that cannot be ignored. In this section, we first review CDP briefly and then introduce hierarchical clustering methods that represent the clustering structures of datasets.

2.1. Clustering Using Density Peaks

The clustering using density peak (CDP) algorithm is based on the assumption that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points of higher density. By calculating these two quantities for each data object, CDP builds a decision graph from which users pick out cluster centers and exclude outliers.

Formally speaking, let X = {x_1, x_2, …, x_n} denote a dataset of n objects and d_ij denote the distance between objects x_i and x_j. For each data object x_i, its local density ρ_i is defined by (1), and its distance δ_i from points of higher density is defined by (2):

ρ_i = Σ_{j ≠ i} χ(d_ij − d_c),   (1)

δ_i = min_{j: ρ_j > ρ_i} d_ij,   (2)

where d_ij is the Euclidean distance between points x_i and x_j, d_c is a cutoff distance specified by the user, and χ(x) = 1 if x < 0 and χ(x) = 0 otherwise. For the point with the highest density, it takes δ_i = max_j d_ij.

For small datasets, the algorithm turns to an exponential (Gaussian) kernel for the density calculation:

ρ_i = Σ_{j ≠ i} exp(−(d_ij / d_c)²).   (3)

The cluster centers are then recognized as points for which both ρ and δ are anomalously large. After the cluster centers are determined, the algorithm assigns each remaining object to the same cluster as its nearest neighbor of higher density.
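To make the two indicators concrete, the following is a minimal NumPy sketch of the computations in (1)-(3); the function name cdp_indicators and the brute-force distance matrix are our own illustrative choices, not part of the original description in [1].

import numpy as np

def cdp_indicators(X, d_c, gaussian=False):
    # Pairwise Euclidean distances (brute force).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    if gaussian:
        # Exponential (Gaussian) kernel density, cf. (3), suggested for small datasets.
        rho = np.exp(-(d / d_c) ** 2).sum(axis=1) - 1.0
    else:
        # Cutoff kernel density, cf. (1): number of neighbors closer than d_c.
        rho = (d < d_c).sum(axis=1) - 1
    n = len(X)
    delta = np.empty(n)
    nearest_higher = np.full(n, -1)
    order = np.argsort(-rho)                 # indices sorted by decreasing density
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = d[i].max()            # point of maximal density, cf. (2)
        else:
            higher = order[:rank]            # points of higher density
            j = higher[np.argmin(d[i, higher])]
            delta[i] = d[i, j]
            nearest_higher[i] = j
    return rho, delta, nearest_higher

Cluster centers are then the points where both rho and delta are large, and every other point follows its nearest_higher link down to the center of its cluster.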

The problem is that there is no objective metric to decide whether a dataset is large or small, so users may face a dilemma when selecting the method for computing the density indicator. Moreover, the clustering results may vary greatly with the choice of d_c. In [1], the authors suggested choosing d_c so that the average number of neighbors is around 1-2% of the total number of points in the dataset. However, it is easy to see that this suggested choice is not always applicable when the size of the dataset changes.

Some subsequent researchers have attempted to solve this problem. Mehmood et al. [2] introduced a heat diffusion method [3] to estimate point density and used the time parameter of heat diffusion to efficiently create clusters. This method is similar to kernel density estimation, with the bandwidth parameter determined according to [3]. Chen and He [4] calculated the field intensity and distance of every data point and fit them by regression analysis; the cluster centers are then determined by a residual analysis. Reference [5] computed the local density of each point from its k-nearest neighbors instead of using d_c in a kernel density estimation. Reference [6] also adopted a new local density metric based on k-nearest neighbors in a kernel density estimation. Reference [7] analyzed local density metrics based on k-nearest neighbors and tried to discriminate points belonging to different clusters more accurately.

Besides considering a single cutoff distance (or bandwidth), some studies have also analyzed the case of multiple densities. In [8], a cover map procedure was applied iteratively with a decreasing, locally adaptive window to build a multidimensional density map, which allows cluster center selection. Wang and Xu [9] proposed adaptive peak detection with nonparametric multivariate kernel density estimation. The algorithm treats the dataset as a multivariate normal distribution, and the bandwidth matrix of the kernel density estimation is correlated with the dimensionality of the dataset. By narrowing down the possible ranges of the bandwidth and of the number of clusters, the optimal values are chosen from all possible combinations of the two. Mehmood et al. [10] considered density regions instead of cluster centers. They used CDP to find local clusters and merged them using the concept of shared density regions.

The other flaw of the algorithm is that the choice of cluster centers is vague and subjective. Some works [4, 11, 12] tried to identify the number of clusters by finding singular points on the indicator curves; others focus on merging microclusters [8, 10, 13] or splitting clusters [14] until some stopping condition is satisfied. In this paper, we extract the optimal result from a hierarchical clustering structure, which is introduced in the following section.

2.2. Hierarchical Clustering

The cluster centers are determined interactively in [1] rather than by explicit criteria. Some existing works [4, 11, 12] first determine the number of clusters k, then sort the data points by the indicators ρ and δ in descending order and choose the k largest as cluster centers. Instead of directly determining the number of clusters, hierarchical methods try to depict the clustering structure of the dataset. For an easier interpretation of the structure, some automatic techniques are provided to extract a “flat” solution.

Hierarchical clustering (HC) methods represent data objects in a hierarchy or “tree” structure of clusters [15]. They are effective in detecting true clustering structures of datasets. Many works of HC have been published in recent years. They can be categorized into two classes: one uses distance-based methods and the other adopts density-based methods.

For distance-based HC methods, the core idea is to agglomerate or divide clusters according to the distance between two clusters, where each cluster contains a set of data points. The most common measure of cluster distance uses the closest pair of points belonging to different clusters, so the algorithm can be regarded as a nearest-neighbor clustering algorithm. If a threshold is used to terminate the clustering process, it is called the single-link method [16]; otherwise, the merging process repeats until all points eventually form one cluster. Similar approaches such as average-link and complete-link are also widely used [17]. If we take the data objects as nodes of a graph with distance-weighted edges, the clustering algorithm that uses the minimum distance measure is called a minimal spanning tree (MST) algorithm [15, 18]. Usually, a single global threshold cutting through the hierarchical cluster representation gives a “flat” partition of the data, which users are most interested in, as sketched below.
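As an illustration of the distance-based case, the short sketch below (assuming SciPy is available; the two-blob data and the threshold value are arbitrary choices of ours) builds a single-link hierarchy and cuts it with one global distance threshold to obtain a flat partition.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),      # two well-separated blobs
               rng.normal(3.0, 0.3, (50, 2))])

Z = linkage(X, method="single")                    # nearest-neighbor (single-link) merges
labels = fcluster(Z, t=1.0, criterion="distance")  # one global threshold -> "flat" clusters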

In contrast, density-based HC methods have received less attention. The main idea of these methods is to investigate the reachable distance of all data points and form hierarchical structures using the concept of “density connectivity.” The well-known OPTICS algorithm [19] can represent a density-based clustering structure of a dataset. Although a postprocessing procedure to extract a simplified clustering result was proposed, it did not become as popular as OPTICS itself, since it relies heavily on the reachability plot and is sensitive to the choice of a critical parameter that cannot be determined easily. Gupta et al. [20] proposed a density-shaving strategy applied to the hierarchical structure, following the work in [21], to achieve cluster extraction. Campello et al. [22] presented HDBSCAN as an improvement over OPTICS. By defining cluster stability, they turned cluster extraction into an optimization problem of maximizing the overall stability of the set of clusters extracted from the HDBSCAN hierarchy.

3. The Proposed Method

In this section, we describe the proposed method in detail. Generally speaking, it consists of three steps: local density calculation, hierarchy representation, and optimal cluster extraction. Local density calculation is based on the k-nearest neighbors. A hierarchical clustering method is then applied to depict the cluster structure. Finally, by introducing the concept of cluster stability, we propose an algorithm that solves the extraction problem on the cluster hierarchy.

3.1. Local Density Estimation Using k-Nearest Neighbors

Most of the extensions of CDP attempt to improve density estimation from the perspective of kernel methods, assuming that the data points follow the same or different Gaussian distribution(s). These works do not solve the problem of choosing d_c; rather, they merely turn the problem of selecting a suitable d_c into the problem of determining a suitable bandwidth for the Gaussian kernel. Unlike these works, we prefer to assess an object’s local density using information from its neighbors. The k-nearest neighbor (kNN) approach has been shown to be a powerful technique for density estimation [23] and clustering [24–26]. The goal of this approach is to find the k-nearest neighbors of each data object in the dataset. With this approach, no assumptions on the distribution of the dataset are required, which means that the dataset can have arbitrary shapes and different density peaks.

To determine the density of a data object, we consider the k-nearest neighbor distance of the object. We call this the k-nearest density, and it is formally defined as follows:

Definition 1 (k-nearest density). Given a dataset X and a distance function d on points in X, for x ∈ X, point x’s k-nearest density is defined as

den_k(x) = 1 / d_k(x),   (4)

where d_k(x) = d(x, x^(k)), and x^(k) is the kth nearest point to x according to d.

In general, d can be any distance measure; the most common choice is the standard Euclidean distance. The most naive implementation of kNN search involves the brute-force computation of distances between all pairs of data points, which scales as O(n²), where n is the size of the dataset. To avoid this computational inefficiency, spatial indexing structures such as the KD tree [27] and the ball tree [28] can be applied, reducing the computation cost to approximately O(n log n).
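The following sketch (our own, under the assumption that the density is the reciprocal of the k-th nearest-neighbor distance, the form given in Definition 1) uses a KD tree from SciPy to compute it without the quadratic brute-force cost; the function name k_nearest_density is hypothetical.

import numpy as np
from scipy.spatial import cKDTree

def k_nearest_density(X, k):
    # Query k + 1 neighbors because each point is returned as its own nearest neighbor.
    tree = cKDTree(X)
    dist, _ = tree.query(X, k=k + 1)
    d_k = dist[:, -1]                        # distance to the k-th nearest neighbor
    return 1.0 / np.maximum(d_k, 1e-12)      # guard against duplicate points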

3.2. Hierarchical Clustering Based on Density Peaks

In this section, we construct a hierarchical structure to represent the original dataset. In this cluster hierarchy based on density peaks, each level corresponds to the objects’ distances from their nearest neighbors of higher density.

First of all, for each data object x_i in the dataset, we compute its local density according to (4) and its distance δ_i from points with higher density, which is shown below:

δ_i = min_{j: den_k(x_j) > den_k(x_i)} d(x_i, x_j).   (5)

Then, we treat the data objects as nodes in a graph. For each node x_i (except the node of maximal density), we connect it to its nearest neighbor of higher density by an edge with weight w_i = δ_i. We thus obtain a tree whose root is the point of maximal density, and every other node’s density is lower than that of its ancestors and higher than that of its descendants. Compared to the MST method, this tree efficiently integrates information on both distance and density.
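A minimal sketch of this construction (brute-force distances; the array names parent and weight are ours): each point is linked to its nearest higher-density neighbor, and the edge weight is the corresponding distance δ.

import numpy as np

def build_density_peak_tree(X, rho):
    # rho: precomputed local densities, e.g., from a kNN-based estimate.
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    parent = np.full(n, -1)                  # parent[i] = nearest neighbor of higher density
    weight = np.zeros(n)                     # weight[i] = delta_i of the edge (i, parent[i])
    order = np.argsort(-rho)
    for rank, i in enumerate(order):
        if rank == 0:
            continue                         # the point of maximal density is the root
        higher = order[:rank]
        j = higher[np.argmin(d[i, higher])]
        parent[i] = j
        weight[i] = d[i, j]
    return parent, weight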

We can sort the edges and iteratively remove them from the tree in decreasing order of weight. After each edge cut, the tree might be split, shrunk, or even disappear, as defined below:

Definition 2 (tree split, shrink, and disappear). For a tree T, remove the edge(s) with weight w. (1) If the number of children in a resulting subtree is less than a given threshold, all nodes in that subtree are regarded as “noise” and the subtree disappears. (2) If there is only one subtree with “nonnoise” nodes, we say that T is shrunk. (3) If there is more than one subtree with “nonnoise” nodes, we say that T is split.

Algorithm 1 shows the main steps of our HCDP algorithm, which requires two input parameters: the neighborhood size k and the child threshold τ. It produces a clustering tree, the “HCDP hierarchy,” that contains all partitions obtainable by CDP in a nested way and can be built efficiently once the density-peak tree is available. Applying this algorithm transforms the clustering problem into a subtree partition problem.

Require: dataset X, neighborhood size k, child threshold τ.
1: Compute the k-nearest density and the distance δ for each point
2: Generate a tree by connecting each point to its nearest point of higher density, and assign the whole tree as a single cluster.
3: Sort all the edges of the tree with respect to their weights in descending order.
4: repeat
5: Remove the edge(s) with the highest weight (in case of equal weights, the edges must be cut simultaneously) to obtain subtrees T_1, …, T_m
6: for each subtree T_i do
7:  if the number of children of T_i < τ then
8:   All nodes in this subtree are assigned as “noise”.
9:  else
10:   assign a new cluster to subtree T_i
11:  end if
12: end for
13: until some stopping condition is satisfied.
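
For illustration only, the plain-Python sketch below mirrors the edge-cutting loop of Algorithm 1. The parent/weight representation follows the tree sketch given earlier, and the stopping condition here is simply exhausting all edges, which is one possible choice rather than the authors’ prescribed one.

def cut_hierarchy(parent, weight, tau):
    # Cut edges in decreasing weight order; subtrees with fewer than tau nodes become noise.
    n = len(parent)
    cut = [p < 0 for p in parent]            # the root has no incoming edge
    order = sorted((i for i in range(n) if parent[i] >= 0), key=lambda i: -weight[i])

    def top(i):                              # topmost reachable node = subtree identifier
        while not cut[i]:
            i = parent[i]
        return i

    levels = []
    k = 0
    while k < len(order):
        w = weight[order[k]]
        while k < len(order) and weight[order[k]] == w:
            cut[order[k]] = True             # edges of equal weight are cut together
            k += 1
        members = {}
        for i in range(n):
            members.setdefault(top(i), []).append(i)
        labels = {i: ("noise" if len(members[top(i)]) < tau else top(i)) for i in range(n)}
        levels.append((w, labels))           # one hierarchy level per distinct weight
    return levels

Each entry of levels corresponds to one density level of the hierarchy; nodes labeled “noise” belong to subtrees that have disappeared.
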
3.3. Clustering Extraction and Evaluation

As mentioned above, the “HCDP hierarchy” represents the cluster structure of the dataset; however, interpreting the structure into a more understandable result, that is, extracting a “suitable” partition from the hierarchy to highlight the “focal” clusters, remains a problem. HCDP contains all possible CDP solutions with respect to the given parameters k and τ. As the edge-weight threshold decreases, more and more edges are cut, and clusters are split or shrunk until they disappear. Obviously, the significant clusters “last” longer than the insignificant ones. To capture these significant clusters, we evaluate the quality of the generated clusters instead of simply applying a single, global threshold.

For the sake of simplicity, we consider a modified version of the cluster stability from [22]. It is based on Hartigan’s model [29] and constructs a tree of nested clusters by varying the threshold of the density level. The power of Hartigan’s model mainly lies in the following aspects: (1) It allows the concept of noise to be modeled as those objects lying in sparse regions of the data space. (2) It allows clusters of varied shapes to be modeled as the connected components of the density level sets; such components are not restricted to the region of a single density peak and can represent the union of multiple density peaks. (3) It allows one to model the presence of nested clusters of varied densities through the hierarchical relationships described by the density-contour tree.

Although the original definition applies to density-based clustering algorithms such as DBSCAN, clustering by density peaks and clustering by density share a similar property: both search for dense regions based on density connectivity. Hence, the original definition can, after modification, also be employed in clustering algorithms based on density peaks.

Given a cluster C_i, we define its stability as

S(C_i) = Σ_{x_j ∈ C_i} ( 1 / w_max(x_j, C_i) − 1 / w_max(C_i) ),   (6)

where w_max(x_j, C_i) is the maximal weight at which removing the corresponding edges excludes point x_j from the cluster, and w_max(C_i) is the maximal weight at which removing the corresponding edges makes cluster C_i emerge (get separated from its parent cluster).

Let 𝒞 = {C_1, C_2, …} be a collection of nonoverlapping clusters extracted from the hierarchy, and let S(C_i) denote the stability value of each cluster. We treat the extraction problem as an optimization problem whose objective is to maximize the sum of the stabilities of the extracted clusters:

max_𝒞 Σ_{C_i ∈ 𝒞} S(C_i), subject to the clusters in 𝒞 being pairwise disjoint.   (7)

To solve (7), we start from the edge with the highest weight. Every time we cut an edge from the tree, we determine the current number of clusters and calculate their stability. Algorithm 2 gives the pseudocode for finding the optimal solution to (7). Here, we use k for both the neighborhood calculation and the “noise” threshold (i.e., τ = k), treating it as a classic smoothing factor whose effect is well understood [19, 20, 30, 31].

Require: dataset X, parameter k.
1: Generate a tree of X using Algorithm 1 (with τ = k) and assign the whole tree as a single cluster.
2: Sort all the edges of the tree with respect to their weights in descending order w_1 ≥ w_2 ≥ ⋯
3: Initialize the current set of clusters with the whole tree.
4: Initialize the stability of every cluster to zero.
5: for each weight w_i in descending order do
6: Let E_i be the edge(s) with weight w_i
7: Remove E_i and split the previous subtrees into new subtrees T_1, …, T_m, where T_j denotes a subtree split from a previous tree
8: for each new subtree T_j do
9:  if the number of children of T_j < k then
10:   Assign all nodes of T_j as “noise” and update the stability of their previous cluster according to (6).
11:  else
12:   Assign a new cluster to subtree T_j and record the weight at which it emerges.
13:  end if
14: end for
15: Update the optimal selection: keep a cluster if its stability is no less than the summed stabilities of its descendants; otherwise keep the descendants, so that the total stability in (7) is maximized.
16: end for
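
For reference, the following sketch shows one standard way, in the spirit of [22], to maximize an objective like (7) over a cluster tree: compare each cluster’s stability with the summed stabilities of its children and keep whichever is larger. The dictionaries stability and children are assumed inputs; this is an illustration, not a line-by-line rendering of Algorithm 2.

def select_optimal_clusters(root, stability, children):
    # stability: {cluster_id: S(C_i)}; children: {cluster_id: [child ids]} (empty for leaves).
    def visit(c):
        kids = children.get(c, [])
        if not kids:
            return stability[c], [c]
        total, selected = 0.0, []
        for ch in kids:
            v, s = visit(ch)
            total += v
            selected += s
        if stability[c] >= total:
            return stability[c], [c]         # keep the parent cluster
        return total, selected               # keep the more stable children instead
    return visit(root)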

4. Experiments, Results, and Discussion

In this section, we conduct experiments to assess the effectiveness of the proposed method. To demonstrate that HCDP is effective for clusters with both convex and nonconvex shapes, we benchmarked the algorithm on several 2-dimensional datasets that are easy to visualize. Artificial datasets 1, 2, and 3 were selected. Dataset 1 is from [32] and consists of clusters with both convex and nonconvex shapes arranged in a hierarchical structure; dataset 2 [33] consists of 3 nonconvex clusters; dataset 3 is a synthetic dataset consisting of 2 isotropic Gaussian blobs and 2 interleaving half-circles. With an appropriate choice of k, the intuitive visualization of the hierarchical structure and the clustering results is shown in Figure 1.

We also considered datasets from the UCI Machine Learning Repository [34–37] and compared HCDP with the DBSCAN and CDP algorithms. The clustering results are evaluated by the normalized mutual information (NMI) score, which has an information-theoretic interpretation. It is defined as

NMI(Ω, C) = I(Ω; C) / ( (H(Ω) + H(C)) / 2 ),

where Ω = {ω_1, ω_2, …} is the set of clusters and C = {c_1, c_2, …} is the set of known labels.

I(Ω; C) is the mutual information:

I(Ω; C) = Σ_k Σ_j P(ω_k ∩ c_j) log ( P(ω_k ∩ c_j) / (P(ω_k) P(c_j)) ),

where P(ω_k), P(c_j), and P(ω_k ∩ c_j) are the probabilities of an object being in cluster ω_k, being labeled c_j, and being in the intersection of ω_k and c_j, respectively.

H is the entropy, defined as

H(Ω) = −Σ_k P(ω_k) log P(ω_k)  and  H(C) = −Σ_j P(c_j) log P(c_j).

The NMI score ranges from 0 (no mutual information) to 1 (perfect correlation). The NMI scores of the clustering results in the experiment are shown in Table 1.
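In practice, the NMI score can be computed with scikit-learn; note that its default normalization (the arithmetic mean of the two entropies) may differ from other common variants, so comparisons should use the same setting. A minimal usage example with made-up labels:

from sklearn.metrics import normalized_mutual_info_score

labels_true = [0, 0, 1, 1, 2, 2]     # known class labels
labels_pred = [1, 1, 0, 0, 2, 2]     # cluster assignments (only the grouping matters)
print(normalized_mutual_info_score(labels_true, labels_pred))   # 1.0: a perfect match up to relabeling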

5. Conclusion

This paper presented a hierarchical clustering method and introduced a clustering stability measure that enables HCDP to extract an optimal clustering result. We used k-nearest neighbors to calculate the local density of data objects and constructed the clustering hierarchy according to the concept of density peaks. Clustering stability was computed to evaluate and extract “suitable” partitions from the hierarchy. Our experiments have shown that the proposed method is robust and accurate compared with the original density peak clustering algorithm and the DBSCAN algorithm.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research is supported by the National Natural Science Foundation of China (NSFC) (nos. 61433012 and U1435215) and Shenzhen Basic Research Grant JCYJ20160229195940462.