Abstract

Density peaks clustering (DPC) is an advanced clustering technique with several advantages: it determines cluster centers efficiently, requires few arguments, needs no iterations, and produces no noise around cluster borders. However, it suffers from the following defects: (1) it is difficult to determine a suitable value of its crucial cutoff distance parameter, (2) the local density metric is too simple to find the proper center(s) of sparse cluster(s), and (3) it is not robust, in that some prominent density peaks are assigned to remote superordinates. This paper proposes an improved density peaks clustering based on natural neighbor expanded groups (DPC-NNEG). The core of the proposed algorithm has two parts: (1) define the natural neighbor expanded (NNE) and the natural neighbor expanded group (NNEG) and (2) divide all NNEGs into a goal number of sets as the final clustering result, according to the closeness degree of NNEGs. The paper also provides a measurement of this closeness degree. We compared the state of the art with our proposal on public datasets, including several complex and real datasets. Experiments show the effectiveness and robustness of the proposed algorithm.

1. Introduction

Clustering, usually performed as unsupervised learning, is a fundamental technique of machine learning [1]. It aims to divide a dataset into several subsets, also called categories, clusters, or groups, according to the similarity, dissimilarity, or distance of samples. Hence, unlike supervised learning [2–18], clustering methods implement classification tasks without any prior knowledge and have been applied to image processing, pattern recognition, bioinformatics, data mining, the Internet of Things, and other fields.

Due to their flexibility and validity, various clustering algorithms have been proposed one after another. Jain classified these methods into partitioning-based, model-based, hierarchical-based, grid-based, and density-based approaches [19]. Partitioning methods aim to group the dataset into a preset number of clusters via an iterative process. K-means [20, 21] and Fuzzy c-means [22, 23] are two famous partitioning-based clustering algorithms. Although they are simple to understand and easy to implement, K-means is extremely sensitive to outliers and to the selection of the initial cluster centers, and Fuzzy c-means approaches suffer from dependence on the initial partition [1]. Model-based clustering methods require one or more appropriate probability models to represent the dataset and often use the expectation-maximization approach to maximize the likelihood function [24]. Hierarchical-based approaches [25–28] partition the dataset into several categories in one of two opposite ways: top-down or bottom-up [23]. The former considers the whole dataset as one cluster and splits it into a suitable number of subclusters; the latter regards each sample as a cluster and then merges these atomic clusters into larger and larger ones. However, the effectiveness of hierarchical clustering algorithms depends on the type of distance measurement chosen for the clusters. Grid-based [29] and density-based [30, 31] approaches automatically determine the number of categories using suitable preset parameters such as epsilon, min-pts, or others. However, they require extensive parameter adjustment to obtain optimal clustering results, and both types of algorithms generate noise at the cluster borders.

To overcome the above shortcomings, density peaks clustering [32] was recently proposed, based on the assumption that cluster centers are relatively denser and are far from each other. Using a suitable value of the cutoff distance (namely, dc, the only parameter of DPC), this approach manually selects the appropriate center of each cluster from a decision graph. It then assigns each of the remaining elements to its nearest denser point (NDP), i.e., the nearest of its neighbors possessing a bigger density than the assigned sample. It has many advantages, including higher efficiency in finding cluster centers, fewer parameters, no iterations, and no noise around the cluster borders. However, the algorithm still suffers from the following defects: (1) It is challenging to determine a suitable dc. It must also be mentioned that the original DPC algorithm does not provide a reliable and specific method to ensure a proper dc. Besides, it was demonstrated in several studies [33, 34] that DPC is sensitive to its parameter, and even when the data are normalized or the relative percentage method is used, a small change in dc will still cause a conspicuous fluctuation in the result. (2) The formula of the local density is too simple to find suitable center(s) of sparse cluster(s) and is only useful on datasets with balanced density [33]. As shown in Figure 1(a), the Jain dataset has two clusters: the upper one is sparse and the lower one is denser. However, DPC overlooks the center of the upper cluster and instead selects a prominent density peak of the lower cluster. (3) Its assignment strategy is not robust [35]. Each point is assigned to its NDP, so some prominent density peaks (PDPs), points with relatively large density and δ values that are not cluster centers, are mistakenly attributed to denser superordinates that are far away from them. Accordingly, the subordinates of an incorrectly assigned PDP are partitioned into an incorrect group. In Figure 1(b), we manually set the center to the densest point of the upper cluster; however, the prominent local peak of the top cluster is assigned to its NDP belonging to the lower cluster, which leads to the incorrect assignment of its subordinates, and there is a distinct gap along the assignment path.

To improve the performance of DPC, and inspired by the idea of the natural neighbor (NN) [36], we propose an improved density peaks clustering based on natural neighbor expanded groups. The main innovations and improvements of our algorithm are as follows: (1) Define the natural neighbor expanded and the natural neighbor expanded group based on the well-known K-nearest neighbor method and its adaptive variant named natural neighbor. The concept of the natural neighbor expanded is to absorb those close neighbors overlooked by the NN method, and the NNEG is able to overcome the shortcoming of the remote assignment of PDPs and mine the potential structure of the data. (2) Provide a density metric formula based on NNE. With the aid of NNE, the new measurement adaptively calculates the local density of each sample without any arguments, unlike that of the original DPC. (3) Propose a measurement of the closeness degree of NNEGs that is based on the mutual and pairwise neighbors belonging to different NNEGs. With it, all NNEGs are divided into the goal number of sets as the final clustering result. (4) The time complexity of DPC-NNEG is O(kn log n), where k is a constant, while the time complexity of DPC and all of the above optimization algorithms is O(n^2) [34].

The remainder of this paper comprises four sections. Section 2 describes the related works. Section 3 reviews DPC and the NN method and details our algorithm. Section 4 presents the clustering results of our proposal and the related works. Section 5 summarizes the contributions and features of this paper.

2. Related Works

To improve the performance of the DPC algorithm, scholars have proposed many optimization methods, as shown in Figure 2. Xie et al. modified the density metric formula using the K-nearest neighbors (KNN), using the number of nearest neighbors to replace dc; besides, they devised an entirely new assignment scheme based on fuzzy weighted K-nearest neighbors (FKNN-DPC) [33]. Furthermore, a suitable parameter value is easier to determine for this method. Lotfi et al. proposed a technique called IDPC [37]. The algorithm sorts samples by local density and then apportions the labels of the centers to their KNN to develop cluster cores; finally, IDPC implements a specific propagation strategy to attach labels to the remaining points. Guo et al. capitalized on the linear regression method to fit the decision values of DPC, with a preset proper dc required (DPC-LRA), and then chose the instances above the fitting function as the centers [38]. Ding et al. proposed an algorithm based on the generalized extreme value (GEV) distribution to fit the DPC decision values in descending order (DPC-GVE); to reduce the time complexity, they also presented a substitute method using the Chebyshev inequality (DPC-CI) [39]. Ni et al. presented the definitions of the density gap and the density path, as well as a new threshold [35]. Instead of the decision graph of DPC, the proper value of dc is determined by manually observing a summary graph incorporating the density gaps calculated with different dc values. The method, named PPC, obviously reduces the difficulty of threshold determination. Jiang et al. provided a novel density peaks clustering algorithm based on K-nearest neighbors (DPC-KNN) to overcome the assignment issue [40]. In this method, two sets are defined for each sample xi: the first is composed of sample xi and its KNN, while the second covers the data points possessing higher densities than sample xi in the whole dataset. After the cluster centers are determined via the decision graph of DPC, DPC-KNN assigns each remaining sample to the element of its second set that yields the smallest distance from any member of the first set to any member of the second set. Lotfi et al. improved DPC using a density backbone and fuzzy neighborhood (DPC-DBFN) [34]. They use a fuzzy kernel to improve the separability of clusters. DPC-DBFN uses a density-based KNN graph for labeling backbones and assigns correct category labels to samples around the group borders, effectively clustering data with various shapes and densities.

However, FKNN-DPC, IDPC, DPC-KNN, PPC, and DPC-DBFN require manual operations, and a preset dc is necessary for DPC-LRA, DPC-GVE, and DPC-CI. Moreover, DPC and these algorithms all require a time complexity of O(n^2) [34].

3. Methods

This section briefly reviews the original DPC algorithm and the NN method and then gives a detailed description of our method.

3.1. The Original DPC Algorithm

DPC is based on the assumption that cluster centers are relatively denser and are distant from each other. For a given dataset X = {x1, x2, ..., xn}, cluster centers are manually picked from the decision graph, which is two-dimensional with δ as the ordinate and the local density ρ as the abscissa. The local density, a crucial concept of DPC, measures the number of neighbors of each sample and their distances within its neighborhood. The ordinate δ is the distance between a sample and its nearest denser point. Since the centers have relatively larger densities, each of them must be far away from its NDP, namely, it has a large value of δ. In the two-dimensional coordinate system, cluster centers therefore simultaneously possess large values of δ and local density and appear in the upper right corner of the graph. To measure the local density of each element, the author provides two formulae, expressed as equations (1) and (2); δi is calculated by equation (3):

ρi = Σ_{j≠i} χ(dij − dc), with χ(x) = 1 if x < 0 and χ(x) = 0 otherwise, (1)

ρi = Σ_{j≠i} exp(−(dij/dc)²), (2)

δi = min_{j: ρj > ρi} dij, (3)

where dij is the distance between the pairwise elements xi and xj and dc is the cutoff distance, the only argument of DPC. Therefore, the DPC algorithm inherits a defect: the Gaussian kernel of equation (2) is sensitive to its bandwidth dc.

As shown in equation (3), δi is the minimum distance between xi and the elements xj whose densities are higher than ρi. For the element with the highest density, its δ is the maximum distance between it and any other element. After the cluster centers have been found, each remaining point is assigned to the same cluster as its nearest neighbor of higher density.
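For readers who prefer code to formulas, the following minimal sketch (in Python with NumPy; a brute-force illustration rather than the authors' implementation) computes the DPC quantities described above:

import numpy as np

def dpc_decision_values(X, dc, gaussian=True):
    """Compute the DPC local density (rho) and delta for each sample.

    A sketch of equations (1)-(3): brute-force O(n^2) distances, a cutoff or
    Gaussian kernel density, and delta as the distance to the nearest denser
    point (NDP).
    """
    n = X.shape[0]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    if gaussian:
        rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0          # exclude self (d_ii = 0)
    else:
        rho = (d < dc).sum(axis=1) - 1                          # cutoff kernel, exclude self
    delta = np.zeros(n)
    ndp = np.full(n, -1)                                        # nearest denser point index
    order = np.argsort(-rho)                                    # samples by decreasing density
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = d[i].max()                               # densest point: maximum distance
        else:
            denser = order[:rank]
            j = denser[np.argmin(d[i, denser])]
            delta[i], ndp[i] = d[i, j], j
    return rho, delta, ndp

Cluster centers would then be picked manually as the points with simultaneously large rho and delta, and every remaining point inherits the label of its NDP.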

3.2. Natural Neighbor Method

K-nearest neighbor is a popular method in machine learning for classification and clustering tasks. However, its crucial argument K must be preset manually. Natural neighbor, in contrast, is an adaptive method to find the relatively near neighbors of each sample. The basic idea of NN is that samples in dense regions have more neighbors, data points in sparse areas have relatively fewer neighbors, and outliers have only a few or no natural neighbors.

In the dataset X, the authors assume that the similarity between two points xi and xj is measured by their distance. By comparing similarities, let KNN_r(xi) denote the KNN searching function in round r, which returns the r nearest neighbors of the point xi; KNN_r(xi) is a subset of X, consisting of the r samples of X (other than xi) with the smallest distances to xi.

Definition 1. (natural neighbor). The natural neighbor relation is mutual: at the Stable Searching State, xj is a natural neighbor of xi if and only if xi ∈ KNN_λ(xj) and xj ∈ KNN_λ(xi), i.e., NN(xi) = {xj : xi ∈ KNN_λ(xj) and xj ∈ KNN_λ(xi)}.

Definition 2. (natural neighbor eigenvalue). When the algorithm reaches the Stable Searching State, the Natural Neighbor Eigenvalue (NaNE) λ is equal to the searching round r at that moment, i.e., λ = r.
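The adaptive search can be sketched as follows; this is our reading of the procedure in [36] (in particular, the stopping rule that the number of samples without any mutual neighbor stops changing), not the authors' code:

import numpy as np
from scipy.spatial import cKDTree

def natural_neighbor_search(X, max_rounds=None):
    """Adaptive natural-neighbor search, sketching Definitions 1 and 2.

    Round r adds each sample's r-th nearest neighbor; two samples become
    natural neighbors once each appears in the other's current KNN list.
    The search stops when the number of samples without any natural
    neighbor stops changing, and lambda is set to the current round r.
    """
    n = X.shape[0]
    tree = cKDTree(X)
    max_rounds = max_rounds or n - 1
    nn = [set() for _ in range(n)]        # natural neighbors of each sample
    knn = [set() for _ in range(n)]       # current KNN lists
    prev_no_nn = -1
    for r in range(1, max_rounds + 1):
        _, idx = tree.query(X, k=r + 1)   # column 0 is the sample itself
        for i in range(n):
            j = int(idx[i, r])
            knn[i].add(j)
            if i in knn[j]:               # mutual neighbors -> natural neighbors
                nn[i].add(j)
                nn[j].add(i)
        no_nn = sum(1 for s in nn if not s)
        if no_nn == prev_no_nn:           # Stable Searching State reached
            return nn, r                  # r is the NaN eigenvalue (lambda)
        prev_no_nn = no_nn
    return nn, max_rounds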

3.3. The Proposed Method

In this section, the improved density peaks clustering based on natural neighbor expanded groups is presented. Our method includes three major steps: (1) calculating the local density of each sample according to the proposed formula, (2) determining the natural neighbor expanded groups, and (3) grouping the NNEGs into several sets as the final clustering result. To realize this processing, we define the concept of the natural neighbor expanded and then provide a straightforward but useful formula for the local density. Besides, the natural neighbor expanded group is defined to reveal the structure of the dataset and divide it into several local groups. To ensure the accuracy of the grouping of NNEGs, we propose a measurement of the closeness degree. More details are presented in the rest of this section.

3.3.1. Basic Concepts

The NN method only considers the relationship of mutual neighbors and overlooks the impact of the distance between samples. Hence, to fit the density metric and the search for density peaks, we propose the concept of the Natural Neighbor Expanded.

Definition 3. (natural neighbor expanded). The Natural Neighbor Expanded of a sample xi, denoted NNE(xi), is defined by equation (7): it enlarges the natural neighborhood NN(xi) so that it also covers the close neighbors overlooked by the NN method; hence, NN(xi) ⊆ NNE(xi). As shown in Figure 3, sample 1 is not an NN of sample 8, since it does not belong to NN(8). However, sample 1 is closer to sample 8 than sample 14 is. Hence, to calculate the density more completely and accurately, we expand the natural neighborhood of sample 8 to include samples 1, 2, and 7.
The Natural Neighbor is the set of close neighbors. Still, as shown in equation (2), the local density formula of DPC measures not only the close neighbors whose distances to sample xi are smaller than dc but also the remaining samples of the whole dataset; among the latter, the samples whose distances to xi are close to dc also impact the density of xi. Therefore, NNE(xi) in equation (7) also covers these secondary-adjacent samples besides the close neighbors. The new local density formula based on NNE is shown as equation (8), in which the density of xi is computed from the set of distances from xi to all of the elements in NNE(xi). Inspired by the famous K-means method, equation (8) considers each point as a core and calculates the sum of the distances from it to its NNE; the smaller the distance sum is, the more likely the point is to be a local center.
Equation (2) maps the distances to similarities using the Gaussian kernel and calculates the accumulated sum of the similarities linked to xi as its density ρi. Hence, equation (2), based on the Gaussian kernel, can resist the interference of outliers that are very distant from xi. However, the equation covers too many negligible samples whose distances to sample xi are much bigger than dc, because their contribution to the density is tiny after the mapping of the Gaussian kernel. Moreover, it brings the original DPC to a time complexity of O(n^2).
In contrast, our formula only considers the NNE. It therefore also gets rid of the negative impact of outliers, since an outlier is usually distant from its nearest point and does not appear in the NNE of any other sample, while at the same time reducing the computational complexity. Unlike the Gaussian kernel mapping, equation (8) retains the original information of the data, does not require any arguments, and avoids the sensitivity caused by dc.
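Since the exact display forms of equations (7) and (8) are given by the paper's original equations, the following sketch only illustrates the idea: it assumes that NNE(xi) is NN(xi) enlarged with the λ nearest neighbors of xi and takes the density to be the reciprocal of the summed distances to the NNE, so that a smaller distance sum yields a larger density. Both choices are our assumptions, not the paper's exact definitions:

import numpy as np
from scipy.spatial import cKDTree

def nne_local_density(X, nn, lam):
    """Hypothetical NNE-based local density (not the paper's exact equations).

    Assumptions: NNE(x_i) is the union of NN(x_i) and the lam nearest
    neighbors of x_i, so close points overlooked by NN are absorbed; the
    density is the reciprocal of the summed distances from x_i to its NNE.
    `nn` and `lam` are the outputs of the natural-neighbor search above.
    """
    n = X.shape[0]
    tree = cKDTree(X)
    _, idx = tree.query(X, k=lam + 1)                 # lam nearest neighbors (plus self)
    nne = [set(idx[i, 1:]) | nn[i] for i in range(n)]
    rho = np.empty(n)
    for i in range(n):
        members = np.fromiter(nne[i], dtype=int)
        dist_sum = np.linalg.norm(X[members] - X[i], axis=1).sum()
        rho[i] = 1.0 / dist_sum if dist_sum > 0 else np.inf
    return rho, nne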

Definition 4. (natural neighbor expanded group). Natural Neighbor Expanded Group consists of a prominent density peak and its subordinates.
In our method, each point is assigned to the nearest denser point within its NNE. The assignment process is stored in a list: the index numbers represent the samples in the given dataset, and each unit stores the index number of the sample's superordinate; if the density of a sample is bigger than the densities of all members of its NNE, so that it has no NDP, the related unit stores 0. Namely, the samples whose units store 0 are prominent density peaks. This assignment adaptively divides the dataset into several NNEGs.
Essentially, NNEGs reveal the potential structure of the analyzed dataset and are relatively tight subclusters and local groups within the clusters of the Ground Truth. Owing to the NNEG, each sample only points to one of its close neighbors, so our method avoids the long-distance assignment of PDPs.
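A sketch of the NNEG construction just described, reusing the densities and NNE sets from the previous sketches; the parent list mirrors the paper's assignment list, with -1 (instead of 0) marking prominent density peaks because of Python's 0-based indexing:

import numpy as np

def build_nnegs(X, rho, nne):
    """Assign each sample to the nearest denser point (NDP) within its NNE
    and split the dataset into NNEGs (Definition 4).

    parent[i] holds the index of sample i's NDP; -1 marks a prominent
    density peak. Each group is labeled by the index of its density peak.
    """
    n = len(rho)
    parent = np.full(n, -1)
    for i in range(n):
        denser = [j for j in nne[i] if rho[j] > rho[i]]
        if denser:
            parent[i] = min(denser, key=lambda j: np.linalg.norm(X[i] - X[j]))
    group = np.full(n, -1)
    for i in range(n):                    # follow the parent chain up to the peak
        j = i
        while parent[j] != -1:
            j = parent[j]
        group[i] = j
    return parent, group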
As shown in Figure 4, after the NNEGs are determined, our method only needs to merge these local groups into the goal number of clusters and hence removes the operation of center selection from the decision graph, which overcomes the aforementioned issue of the density metric of DPC. To quantify the close relationship between NNEGs, we propose the concept of the adjacent group graph.

Definition 5. (adjacent group graph). AGG = (V, E), where V is the set of NNEGs and E is the set of edges linking pairs of NNEGs, subject to equation (9): an edge is created between two NNEGs whenever the NNE of a sample in one NNEG contains a member of the other NNEG. The Adjacent Group Graph usually is a multigraph, since there can be several edges between two NNEGs, and the more edges there are, the closer the two groups are. Obviously, in Figure 4, there are no edges between the upper and the lower clusters. Moreover, the degree of closeness (DC) of neighboring pairwise NNEGs is calculated by equation (10). As shown in equation (10), the formula of the closeness degree consists of two parts: a weight and a normalized similarity. It is based on the assumption that the more compact the two endpoints of an edge and their respective NNEGs are, the more reliable the edge is. The weight represents the compactness between a sample and the group at the other end of the edge, viz., the bigger the number of elements intersected between the sample's NNE and that group, the more intense the relationship between them. To normalize this weight, the number of intersected elements is divided by the size of the NNE.
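The adjacent group graph can be accumulated directly from the NNE sets and the group labels; because equation (10) is not reproduced here, the closeness degree used below (edge counts weighted by the normalized overlap between a sample's NNE and the opposite group) is only an assumption in the spirit of Definition 5:

import numpy as np

def adjacent_group_graph(nne, group):
    """Build the adjacent group graph and a closeness score between NNEGs.

    An edge is added whenever a sample and one of its NNE members fall into
    different NNEGs (as in lines 7-14 of Algorithm 1). The weight used here,
    the overlap between the sample's NNE and the opposite group divided by
    the NNE size, is an assumption; the paper's equation (10) may differ.
    """
    labels = sorted(set(group))
    pos = {g: k for k, g in enumerate(labels)}          # group label -> matrix index
    members = {g: {i for i in range(len(group)) if group[i] == g} for g in labels}
    agg = np.zeros((len(labels), len(labels)))
    for i, neigh in enumerate(nne):
        for j in neigh:
            if group[i] != group[j]:
                w = len(nne[i] & members[group[j]]) / len(nne[i])
                agg[pos[group[i]], pos[group[j]]] += w
                agg[pos[group[j]], pos[group[i]]] += w
    return agg, labels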

3.3.2. The Specific Processing

Inputs: dataset X, the goal number of clusters.
Output: the clustering result.
Step 1: Create a k-d tree. Search the NNE of each sample using the k-d tree.
Step 2: Calculate the local density according to equation (8).
Step 3: Determine the NNEGs according to Definition 4.
Step 4: Generate the Adjacent Group Graph as in Definition 5, and find all edges of each pair of NNEGs as in equation (9).
Step 5: Calculate the degree of closeness according to equation (10).
Step 6: Break up the original cluster containing all NNEGs into the goal number of sets, according to the closeness degree.

To clarify Step 6 in detail, we present an example in Table 1. As shown in Table 1(A), there are five NNEGs in a dataset, and the closeness degrees of adjacent pairwise NNEGs are recorded. Assume the goal number is 2. Our method first considers the whole dataset as one cluster, since all five NNEGs are connected by edges. We then force the minimum nonzero closeness degree to zero, as shown in Table 1(B), which means those NNEGs are split into two parts; i.e., the split is a loop operation that sets the minimum nonzero closeness degree to zero until the number of clusters equals the goal number.

More details are shown in the pseudocode of Algorithm 1. In the 6th line, AGG is a matrix where each row and each column correspond to one of the NNEGs. In the 16th line, inspired by top-down hierarchical clustering, we consider the whole dataset as one cluster containing all NNEGs and break the weakest edge in the AGG until the number of clusters equals the goal, which corresponds to the process from Table 1(A) to Table 1(B).
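The splitting itself can be sketched as below; counting clusters as connected components of the AGG (via SciPy) is our implementation choice and is not stated explicitly in the paper:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def split_into_clusters(agg, goal):
    """Step 6 / lines 15-18 of Algorithm 1: zero out the weakest AGG entries
    until the NNEGs fall apart into `goal` connected components.
    Returns one cluster label per NNEG.
    """
    agg = agg.copy()
    while True:
        k, comp = connected_components(csr_matrix(agg), directed=False)
        if k >= goal:
            return comp
        nz = agg[agg > 0]
        if nz.size == 0:                  # nothing left to cut
            return comp
        agg[agg == nz.min()] = 0.0        # break the weakest edge(s)

Each sample then inherits the cluster label of the NNEG it belongs to.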

3.3.3. Time Complexity Analyses

This section aims to analyze the computational complexity of our method. Suppose that the number of total samples in a dataset is n, the number of NNEGs is g, the goal number of clusters is G, the NDP of a sample xi is its ki-th nearest neighbor, and the biggest ki equals k (Algorithm 1).

Require: Dataset , the goal number of clusters G
Ensure: The result of clustering:
(1) Create a k-d tree;
(2) Search the k-d tree;
(3) Determine NN according to [36], and record the intermediate search results, so that NNE can be determined;
(4) Calculate local density according to equation (8);
(5) Assign each point to its NDP of its NNE to generate several NNEGs;
(6) Create a g × g matrix AGG, where g is the number of NNEGs and each row and each column correspond to one NNEG;
(7)for i = 1 : n do
(8)  for t = 1 : |NNE(i)| do
(9)   if the tth element of NNE(i) and sample i belong to different NNEGs then
(10)    Calculate the closeness degree of this edge, referring to equation (10);
(11)    Add the DC of this edge to the corresponding unit of AGG;
(12)   end if
(13)  end for
(14)end for
(15)while the number of clusters does not equal G do
(16)  Store zero in the unit with the minimum value greater than zero;
(17)  Count the number of clusters;
(18)end while

The time complexity of creating a k-d tree is O(n log n) [41]. It has been demonstrated that determining NN for all samples also requires a cost of O(n log n) [36]. For finding NNE, we can record the neighbors visited during the NN search; hence, searching the NNE of a sample needs only a few additional search operations, and the whole complexity for all samples is less than O(n log n). Our local density metric is based on NNE, so it is not necessary to generate a distance matrix, and only |NNE(xi)| addition operations are needed for each sample; therefore, calculating the local densities of all instances costs at most O(kn). For each sample, the method takes at most k search operations on the k-d tree to find its NDP, each costing O(log n). In the process of generating each NNEG, we store the labels along the path to its prominent density peak in a list, where the first unit is any unallocated instance and the end is an assigned one or a prominent density peak; storing the labels of all samples needs only a time cost of O(n), so the cost required to divide a dataset into NNEGs is O(n). In equations (9) and (10), each edge is determined by searching the NNE of each sample to find the neighbors having different labels; thus, for all edges, the number of search operations performed is of magnitude O(kn). Furthermore, the time complexity of the grouping in the last step must be less than O(kn). Overall, we can conclude that the time complexity of the entire algorithm is O(kn log n).
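The neighbor searches that dominate this cost can all be served by a single k-d tree; the snippet below uses SciPy's cKDTree (a tooling assumption, as the paper only specifies a k-d tree) to illustrate the O(n log n) construction and the per-query logarithmic lookups referred to above:

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.random((373, 2))            # e.g., a Jain-sized two-dimensional dataset

tree = cKDTree(X)                   # construction costs O(n log n) on average
dist, idx = tree.query(X, k=11)     # ten nearest neighbors per sample (plus self);
                                    # each query costs roughly O(log n) in low dimensions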

4. Results

In this section, several datasets are used to evaluate the performance of our method in comparison with some state-of-the-art techniques such as DPC-DBFN [34], DPC-KNN [40], IDPC [37], and FKNN-DPC [33]. The experiments are performed on a computer with Windows 10, an Intel(R) Core(TM) i7-8750H CPU, 16 GB of memory, and Matlab 2016b. The results are measured by several performance metrics, including Normalized Mutual Information (NMI) [42], the Rand Index (RI) [43], and the Adjusted Rand Index (ARI) [44]. In this section, the similarity between points is measured using the Euclidean distance metric.

4.1. Datasets

All tested datasets in this paper include three low-dimensional datasets and five high-dimensional datasets, which are public and available from UCI. The two-dimensional datasets have different numbers of samples and different objective distributions. The DIM512 dataset, containing 1024 elements with 512-dimensional features that belong to 16 clusters sampled from Gaussian distributions, is often used to test algorithm performance in high-dimensional space. The experiments on the four datasets Statlog (Shuttle), Abalone, Wine Quality, and Libras Movement are applications of our method to physics (the positioning of radiators in the Space Shuttle), population biology, wine preference modeling, and hand movement recognition, respectively. More details are presented in Table 2.

To reduce the influence of dimension weights and ensure the validity of the experimental comparison, we processed each dataset and normalized all tested datasets. The normalization formula is as follows:

x' = (x − xmin) / (xmax − xmin),

where x is the feature value of a sample, while xmax and xmin represent the maximum and minimum values of that feature, respectively.
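Assuming the formula is the usual min-max scaling described above, a column-wise implementation is:

import numpy as np

def min_max_normalize(X):
    """Column-wise min-max scaling to [0, 1]; constant features are mapped to 0."""
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    span = np.where(xmax > xmin, xmax - xmin, 1.0)   # avoid division by zero
    return (X - xmin) / span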

4.2. Evaluation Measures

We tested our algorithm and several related works on the above datasets. For intuitive comparison, we chose RI, ARI, and NMI to measure the clustering results.

The RI formula is as follows:

RI = (TP + TN) / (n(n − 1)/2),

where TP indicates the number of true positive pairs, TN indicates the number of true negative pairs, and the denominator is the total number of sample pairs in a dataset consisting of n samples.

The ARI formula is as follows:

ARI = (RI − E[RI]) / (max(RI) − E[RI]),

where E[RI] represents the expectation of RI.

The NMI formula is as follows:

NMI(U, V) = I(U, V) / sqrt(H(U) H(V)),

where H(U) and H(V) are the entropies of U and V and I(U, V) is the mutual information between them, expressed as

I(U, V) = Σi Σj P(Ui ∩ Vj) log(P(Ui ∩ Vj) / (P(Ui) P(Vj))), H(U) = −Σi P(Ui) log P(Ui),

where P(Ui) = |Ui|/n and P(Ui ∩ Vj) = |Ui ∩ Vj|/n. U and V represent two allocation methods for a dataset containing n elements, and Ui and Vj are clusters. In the experimental verification, let U and V be the original labels and the clustering results of an algorithm, respectively. If the clustering results are the same as the real labels, the three metrics take the value of 1, and if the clustering results are entirely different from the labels, the values will be equal to 0.
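All three measures are available off the shelf; the snippet below uses scikit-learn (a tooling assumption, since the paper's experiments were run in Matlab; rand_score requires scikit-learn 0.24 or later):

from sklearn.metrics import (adjusted_rand_score, normalized_mutual_info_score,
                             rand_score)

def evaluate(true_labels, pred_labels):
    """Return the three external validity measures used in Tables 3-5."""
    return {
        "NMI": normalized_mutual_info_score(true_labels, pred_labels),
        "RI": rand_score(true_labels, pred_labels),
        "ARI": adjusted_rand_score(true_labels, pred_labels),
    }

print(evaluate([0, 0, 1, 1], [1, 1, 0, 0]))   # identical partitions up to relabeling -> all 1.0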

4.3. Results

This section aims to show the detailed clustering results and evaluate the performance of the different clustering algorithms on the various datasets. Tables 3–5 compare the performance of our method with DPC-DBFN, DPC-KNN, IDPC, and FKNN-DPC in terms of the NMI, RI, and ARI measures, respectively. All these methods use the KNN method, and the number of nearest neighbors (K) can be set from 1 to n. In these tables, the numbers in parentheses are the values of K at which the corresponding algorithm obtains the reported results, and boldface marks the best results.

The Jain dataset has 373 points and two clusters: the upper one and the lower one. As shown in Figure 5, DPC-NNEG divides the dataset into nineteen NNEGs and then successfully and efficiently groups them into two sets, since there are no edges between the two clusters. Similarly, as shown in Figure 6, our algorithm divides the Spiral dataset into several local groups and subsequently merges all NNEGs accurately into the goal number of clusters.

Unlike Jain and Spiral, as shown in Figure 7, the Flame dataset, containing 240 data points, has no clear gap between its two adjacent clusters. Hence, it is more sensitive to the value of dc in the DPC algorithm, because a tiny change in dc will cause border points to be assigned to the other cluster. However, our method not only partitions all samples into eight NNEGs but also measures the tightness between different groups accurately, which realizes the correct grouping of these local groups. Figure 7 shows that the clustering result of DPC-NNEG on Flame is consistent with the Ground Truth.

As shown in Tables 3–5, there is no difference in performance among our algorithm, DPC-DBFN, DPC-KNN, IDPC, and FKNN-DPC on the three two-dimensional datasets. However, as shown in Table 3, the clustering results on the more complex high-dimensional datasets show the outperformance of our method: DPC-NNEG gains the best marks measured by NMI on all datasets. For example, the results of DPC-NNEG on the Statlog (Shuttle), Abalone, Wine Quality, DIM512, and Libras Movement datasets are 0.6101, 0.1852, 0.0935, 1.0000, and 0.5855, respectively. Moreover, its improvements over the second-best method (in %) for the Statlog (Shuttle), Abalone, Wine Quality, and Libras Movement datasets are 11.13, 0.32, 33.38, and 0.12, respectively.

Tables 4 and 5 show similar results measured by RI and ARI, respectively. These results also demonstrate that the proposed method, in most cases, obtains the biggest values of these measures, except on the Wine Quality dataset. Hence, based on these results, it can be concluded that DPC-NNEG gives an overall excellent clustering performance.

5. Conclusions and Future Works

This paper proposed an efficient clustering algorithm called DPC-NNEG, which can easily split a dataset into local groups and then merge those groups into the goal number of clusters with various densities, shapes, and sizes. The proposed method clusters the data in three major steps: calculating the local density of each sample, identifying natural neighbor expanded groups, and merging those groups into clusters. The first step utilizes the natural neighbor method in the local density calculation; it is entirely different from the formula of the original DPC and avoids the impact of outliers while reducing the sensitivity to dc. In the second step, the defined NNE is used to mine the potential structure of the data, which helps divide the dataset into several relatively compact local groups called NNEGs. The last step groups all NNEGs into the goal number of clusters using the proposed formula for the closeness degree of local groups. The application of the second and third steps not only overcomes the issue of the remote assignment of prominent density peaks but also removes the step of center selection in the original DPC. The effectiveness of the proposed method was verified on several datasets. The results show that our approach is more effective than the related DPC improvement algorithms. In future work, we shall develop the concept of NNE further to find a more suitable method for the secondary-adjacent samples, instead of the given and fixed parameter in equation (7). Fuzzy theory is a suitable technique for mining relatively adjacent samples, in which the NNE is used to construct a membership function of closeness, from which the functions of the secondary-adjacent samples and remote samples can then be deduced.

Data Availability

All datasets in this paper are available in UCI.

Conflicts of Interest

The authors declare that they have no conflicts of interest to report regarding the present study.

Acknowledgments

This study was funded by the National Natural Science Foundation of China (61972056, 61772454, 61402053, and 61981340416), Hunan Provincial Natural Science Foundation of China (2020JJ4623), Scientific Research Fund of Hunan Provincial Education Department (17A007, 19C0028, and 19B005), Changsha Science and Technology Planning (KQ1703018, KQ1706064, KQ1703018-01, and KQ1703018-04), Junior Faculty Development Program Project of Changsha University of Science and Technology (2019QJCZ011), “Double First-class” International Cooperation and Development Scientific Research Project of Changsha University of Science and Technology (2019IC34), Practical Innovation and Entrepreneurship Ability Improvement Plan for Professional Degree Postgraduate of Changsha University of Science and Technology (SJCX202072), Postgraduate Training Innovation Base Construction Project of Hunan Province (2019-248-51 and 2020-172-48), and Beidou Micro Project of Hunan Provincial Education Department (XJT(2020) No.149).