Abstract
This paper presents an improved clustering algorithm for categorizing data with arbitrary shapes. Most conventional clustering approaches work only with round-shaped clusters. Clustering by fast search and find of density peaks (DPC) can handle arbitrarily shaped clusters, but in some cases it is limited by its density-peak selection and its allocation strategy. To overcome these limitations, two improvements are proposed in this paper. To describe the cluster centers more comprehensively, the definitions of local density and relative distance are fused with multiple distances, including K-nearest neighbors (KNN) and shared-nearest neighbors (SNN). A similarity-first search algorithm is designed to find the most suitable cluster centers for non-center points in a weighted KNN graph. Extensive comparisons with several existing methods, e.g., the traditional DPC algorithm, density-based spatial clustering of applications with noise (DBSCAN), affinity propagation (AP), FKNN-DPC, and K-means, have been carried out. Experiments on synthetic and real data show that the proposed clustering algorithm outperforms DPC, DBSCAN, AP, and K-means in terms of clustering accuracy (ACC), adjusted mutual information (AMI), and adjusted Rand index (ARI).
1. Introduction
The natural ecosystem has the characteristics of diversity, complexity, and intelligence, which provide infinite space for data-driven technology. As a new research focus, data-driven prediction methods have been widely used in energy, transportation, finance, and automobiles [1–7]. Clustering algorithms are an important branch of data-driven technology, providing important information for further data analysis by mining the internal associations of data [8, 9].
Due to the different definitions of clustering, different clustering strategies have been reported. Among them, the K-means algorithm is a simple and effective clustering algorithm. It preselects K initial cluster centers and then iteratively assigns each data point to the nearest cluster center [10]. Since the initial cluster centers have a certain impact on the clustering results of K-means, the works [11, 12] provided several methods for selecting the initial cluster centers and improving the accuracy of clustering. Since K-means and its variants are based on the idea that data points are assigned to the nearest cluster center, these methods cannot handle non-spherical clustering tasks well. Unlike the K-means algorithm, affinity propagation (AP) [8] is based on the similarity between data points and completes clustering by exchanging messages between them. Hence, the AP algorithm does not need the number of clusters to be determined in advance, and it has a time advantage in clustering large-scale datasets [13]. However, for complex datasets, the AP method may suffer performance degradation similar to K-means [14].
To address the aforementioned problems, density-based clustering methods have been proposed, which can find clusters of various shapes and sizes in noisy data, where high-density regions are considered as clusters separated by low-density regions [15–19]. In this line, density-based spatial clustering of applications with noise (DBSCAN) [15, 16] was proposed as an effective density-based clustering method. It needs two parameters describing the density of points, Eps and MinPts, to achieve clustering of arbitrary shapes, where Eps is the neighborhood radius and MinPts is the minimum number of points contained within that radius [15]. However, choosing suitable thresholds is a challenging task for these methods [15, 17]. Subsequently, Rodriguez and Laio [20] proposed a novel density-based algorithm, clustering by fast search and find of density peaks (DPC). The DPC algorithm uses the local density and the relative distance of each point to establish a decision graph, finds the cluster centers according to the decision graph, and then assigns each non-center point to the cluster of its nearest higher-density neighbor. Although the DPC algorithm is simple and effective for detecting arbitrarily shaped clusters, several issues limit its practical application. Firstly, DPC is sensitive to the cutoff distance d_c, meaning that this parameter must be set suitably to retain satisfactory performance, which is not a trivial task. Secondly, the cluster centers must be selected manually, which may not be feasible or convenient for some datasets. Moreover, the allocation error of a high-density point directly affects the allocation of the lower-density points around it, and such errors may propagate continuously through the subsequent allocation process.
To overcome these issues, several improved DPC algorithms have recently been studied. To avoid the influence of the cutoff distance d_c, the concept of K-nearest neighbors (KNN) has been introduced into the DPC algorithm, leading to two different density measures, e.g., DPC-KNN [19] and FKNN-DPC [9]. Although both algorithms are based on K-nearest-neighbor information, they were developed separately. Moreover, to solve the problem of manual selection of cluster centers, Li et al. [21] proposed a density peak clustering method that determines the cluster centers automatically. In this algorithm, the potential cluster centers are determined by a ranking graph, and the true cluster centers are then filtered out using the cutoff distance d_c. To remedy the transmission of allocation errors, FKNN-DPC [9] and SNN-DPC [22] both adopted a two-step strategy to allocate non-center points. In the first step, they use breadth-first search to assign non-outlier points. In the second step, FKNN-DPC uses fuzzy weighted K-nearest-neighbor technology to allocate the remaining points, while SNN-DPC determines the cluster of each remaining point according to whether its number of shared neighbors reaches a threshold.
This paper proposes an improved clustering algorithm based on density peaks (named DPC-SFSKNN). It has the following new features: (1) the local density and the relative distance are redefined by fusing the distance attributes of two neighbor relationships (KNN and SNN), which enables the detection of low-density cluster centers; (2) a new allocation strategy is proposed, in which a similarity-first search algorithm based on a weighted KNN graph is designed to allocate non-center points, making the allocation strategy fault-tolerant.
In general, this paper is organized as follows: Section 2 briefly introduces the DPC algorithm and its developments and analyzes the DPC algorithm in detail. Section 3 introduces the DPC-SFSKNN algorithm and gives a detailed analysis. Section 4 tests the proposed algorithm on several synthetic and real-world datasets and compares its performance with DPC, DBSCAN, AP, FKNN-DPC, and K-means in terms of several popular criteria for evaluating a clustering algorithm, namely, clustering accuracy (ACC), adjusted mutual information (AMI), and adjusted Rand index (ARI). Section 5 draws some conclusions.
2. Related Work
The density peak clustering algorithm (DPC) was proposed by Rodriguez and Laio in 2014. The core idea of the DPC algorithm lies in the characterization of the cluster center, which has the following two characteristics: a cluster center has a higher local density and is surrounded by neighbor points with lower local density; a cluster center is relatively far from other points of higher density. These characteristics are captured by two quantities: the local density ρ_i of each point and its relative distance δ_i, which represents the distance from the point to the closest point of higher density.
2.1. DPC Algorithm and Improvements
Suppose X = {x_1, x_2, …, x_n} is a dataset for clustering and d_ij represents the Euclidean distance between data points x_i and x_j. The calculation of local density and relative distance depends on the distance d_ij. The DPC algorithm introduces two methods for calculating local density: the "cutoff" kernel method and the Gaussian kernel method. For a data point x_i, its local density ρ_i is defined in (1) with the "cutoff" kernel method and in (2) with the Gaussian kernel method:

$$\rho_i = \sum_{j \ne i} \chi(d_{ij} - d_c), \qquad \chi(x) = \begin{cases} 1, & x < 0, \\ 0, & x \ge 0, \end{cases} \tag{1}$$

$$\rho_i = \sum_{j \ne i} \exp\!\left(-\frac{d_{ij}^2}{d_c^2}\right), \tag{2}$$

where d_c is defined as a cutoff distance, which represents the neighborhood radius of the data point. The most significant difference between the two methods is that ρ_i calculated by the "cutoff" kernel is a discrete value, while ρ_i calculated by the Gaussian kernel is a continuous value. Therefore, the probability of conflict (different data points having the same local density) is relatively smaller for the latter.
Moreover, d_c is an adjustable parameter in (1) and (2), which is defined as

$$d_c = d_{\lceil M \cdot p \rceil}, \tag{3}$$

where the distances d_ij between all pairs of points are sorted in ascending order, M is the index of the last element in this ordering (i.e., the total number of pairwise distances), and p is chosen so that the average number of neighbors of each point is between 1% and 2% of the total number of points [20]; the value 2% in formula (3) is the empirical parameter provided in reference [20], which can be adjusted for different datasets.
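As a concrete illustration, the two kernels and the percentage rule of equation (3) can be written in a few lines of NumPy. This is a minimal sketch under the conventions above (an (n, n) Euclidean distance matrix); the function names are ours, not code from [20].

```python
import numpy as np

def cutoff_distance(D, p=0.02):
    # Equation (3): sort all pairwise distances in ascending order and take
    # the one at position ceil(p * M), where M is the number of pairs.
    iu = np.triu_indices_from(D, k=1)                # each pair counted once
    d_sorted = np.sort(D[iu])
    return d_sorted[max(int(np.ceil(p * d_sorted.size)) - 1, 0)]

def dpc_density(D, dc, kernel="gaussian"):
    # Equations (1)-(2): local density of every point.
    if kernel == "cutoff":
        return (D < dc).sum(axis=1) - 1              # discrete count, self excluded
    return np.exp(-(D / dc) ** 2).sum(axis=1) - 1.0  # continuous, fewer ties
```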
The relative distance δ_i represents the minimum distance between the point x_i and any other point of higher density and is mathematically expressed as

$$\delta_i = \begin{cases} \min\limits_{j:\,\rho_j > \rho_i} d_{ij}, & \text{if } \exists\, j \text{ with } \rho_j > \rho_i, \\[4pt] \max\limits_{j} d_{ij}, & \text{otherwise}, \end{cases} \tag{4}$$

where d_ij is the distance between points x_i and x_j. When the local density ρ_i is not the maximum density, the relative distance δ_i is defined as the minimum distance between the point x_i and any other point of higher density; when ρ_i is the maximum density, δ_i takes the maximum distance to all other points.
After calculating the local density and relative distance of all data points, the DPC algorithm establishes a decision graph by plotting each point's (ρ_i, δ_i) pair. Points with high values of both ρ and δ are called peaks, and the cluster centers are selected from the peaks. Then, the DPC algorithm directly assigns each remaining point to the same cluster as its nearest higher-density neighbor.
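The rest of the pipeline also fits in a short sketch: equation (4) yields δ together with each point's nearest higher-density neighbor, and the one-step allocation then walks down the density ordering. It reuses the densities from the sketch above and assumes the densest point is among the selected centers.

```python
def dpc_delta(D, rho):
    # Equation (4): delta and the nearest higher-density neighbor.
    n = D.shape[0]
    order = np.argsort(-rho)                   # indices from densest to sparsest
    delta = np.zeros(n)
    nearest_higher = np.full(n, -1)
    delta[order[0]] = D[order[0]].max()        # densest point: maximum distance
    for pos in range(1, n):
        i, higher = order[pos], order[:pos]    # all points denser than i
        j = higher[np.argmin(D[i, higher])]
        delta[i], nearest_higher[i] = D[i, j], j
    return delta, nearest_higher

def dpc_assign(rho, nearest_higher, centers):
    # One-step DPC allocation; processing points from dense to sparse
    # guarantees that a point's higher-density neighbor is labeled first.
    labels = np.full(rho.size, -1)
    for c, i in enumerate(centers):
        labels[i] = c
    for i in np.argsort(-rho):
        if labels[i] < 0:
            labels[i] = labels[nearest_higher[i]]
    return labels
```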
For the DPC algorithm, the selection of d_c has a great influence on the correctness of the clustering results. Both the DPC-KNN and FKNN-DPC schemes introduce the concept of K-nearest neighbors to eliminate the influence of d_c. Hence, two different local density calculations are provided.
The local densities proposed by DPC-KNN [19] and FKNN-DPC [9] are given in (5) and (6), respectively:

$$\rho_i = \exp\!\left(-\frac{1}{K} \sum_{x_j \in \mathrm{KNN}(x_i)} d_{ij}^2\right), \tag{5}$$

$$\rho_i = \sum_{x_j \in \mathrm{KNN}(x_i)} \exp(-d_{ij}), \tag{6}$$

where K is the total number of nearest neighbors and KNN(x_i) represents the set of K-nearest neighbors of point x_i. These two methods provide a unified density metric for datasets of any size through the idea of K-nearest neighbors and solve the problem of the nonuniformity of DPC's density metric across different datasets.
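Under the same distance-matrix convention, the two KNN-based densities of equations (5) and (6) can be sketched as follows; `knn_sets`, `rho_dpc_knn`, and `rho_fknn_dpc` are illustrative names of ours.

```python
def knn_sets(D, k):
    # Indices of the k nearest neighbors of each point (self excluded).
    return np.argsort(D, axis=1)[:, 1:k + 1]

def rho_dpc_knn(D, k):
    # Equation (5): exponential of the negated mean squared distance
    # to the k nearest neighbors (DPC-KNN).
    d = np.take_along_axis(D, knn_sets(D, k), axis=1)
    return np.exp(-(d ** 2).mean(axis=1))

def rho_fknn_dpc(D, k):
    # Equation (6): sum of exp(-d) over the k nearest neighbors (FKNN-DPC).
    d = np.take_along_axis(D, knn_sets(D, k), axis=1)
    return np.exp(-d).sum(axis=1)
```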
Based on K-nearest neighbors, SNN-DPC [22] proposes the concept of shared-nearest neighbors (SNN), which is used to redefine the local density ρ and the relative distance δ. The idea of SNN is that if two points have more common members among their K-nearest neighbors, their similarity is higher, and the shared-neighbor set is given by

$$\mathrm{SNN}(x_i, x_j) = \mathrm{KNN}(x_i) \cap \mathrm{KNN}(x_j). \tag{7}$$
Based on the SNN concept, the expression of SNN similarity is as follows:

$$\mathrm{Sim}(x_i, x_j) = \begin{cases} \dfrac{|\mathrm{SNN}(x_i, x_j)|^2}{\sum_{x_p \in \mathrm{SNN}(x_i, x_j)} \left( d_{ip} + d_{jp} \right)}, & \text{if } x_i \in \mathrm{KNN}(x_j) \text{ and } x_j \in \mathrm{KNN}(x_i), \\[6pt] 0, & \text{otherwise}, \end{cases} \tag{8}$$

where d_ip is the distance between points x_i and x_p and d_jp is the distance between points x_j and x_p. The condition for calculating SNN similarity is that points x_i and x_j appear in each other's K-nearest-neighbor sets. Otherwise, the SNN similarity between the two points is 0.
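A direct sketch of equations (7) and (8) is shown below; the double loop costs O(kn²) and is written for clarity rather than speed.

```python
def snn_similarity(D, k):
    # Equations (7)-(8): dense (n, n) matrix of SNN similarities.
    n = D.shape[0]
    knn = [set(row) for row in np.argsort(D, axis=1)[:, 1:k + 1]]
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if j in knn[i] and i in knn[j]:     # mutual k-neighborhood required
                shared = list(knn[i] & knn[j])  # SNN(i, j)
                if shared:
                    denom = D[i, shared].sum() + D[j, shared].sum()
                    sim[i, j] = sim[j, i] = len(shared) ** 2 / denom
    return sim
```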
Next, the local density ρ_i of point x_i is expressed by SNN similarity. Suppose point x_i is any point in the dataset X and L(x_i) represents the set of k points with the highest similarity to x_i. The expression of local density is

$$\rho_i = \sum_{x_j \in L(x_i)} \mathrm{Sim}(x_i, x_j). \tag{9}$$
At the same time, the equation for the relative distance δ_i of point x_i is as follows:

$$\delta_i = \begin{cases} \min\limits_{j:\,\rho_j > \rho_i} \left[ d_{ij} \left( \sum_{x_p \in \mathrm{KNN}(x_i)} d_{ip} + \sum_{x_q \in \mathrm{KNN}(x_j)} d_{jq} \right) \right], & \text{if } \exists\, j \text{ with } \rho_j > \rho_i, \\[6pt] \max\limits_{j} \delta_j, & \text{otherwise}. \end{cases} \tag{10}$$
The SNN-DPC algorithm not only redefines the local density and relative distance, but also changes the data point allocation strategy. The allocation strategy divides the data points into two categories, "unavoidable subordinate points" and "probable subordinate points," each with its own allocation algorithm. Compared with the DPC algorithm, this allocation strategy performs better when clustering clusters of different shapes.
2.2. DPC Algorithm Analysis
The DPC algorithm is a very simple and elegant clustering algorithm. However, due to its simplicity, DPC has the following two potential problems to be further addressed in practice.
2.2.1. DPC Ignores Low-Density Points
When the density difference between clusters is large, the performance of the DPC algorithm can be significantly degraded. To show this issue, we take the dataset Jain [23] as an example; the clustering results calculated using the cutoff kernel of DPC are shown in Figure 1. It can be seen that the cluster in the upper left is relatively sparse, while the cluster in the lower right is relatively dense. The red stars in the figure mark the selected cluster centers. Owing to the disparity in density, the cluster centers selected by DPC all lie on the tightly distributed cluster below. Due to this incorrect selection of cluster centers, the subsequent allocations are also incorrect.
Analyzing the local density ρ and the relative distance δ separately, it can be seen from Figures 2(a) and 2(b) that both the ρ value and the δ value of the false cluster center A are much higher than those of the true cluster center C. The results calculated with the Gaussian kernel are the same: the correct cluster center cannot be selected on the Jain dataset. Therefore, how to increase the ρ and δ values of low-density centers so that they stand out in the decision graph is a problem that needs to be considered.
2.2.2. The Allocation Strategy of DPC Has Low Fault Tolerance
The fault tolerance of the allocation strategy of the DPC algorithm is not satisfactory, mainly because the allocation of a point depends heavily on the allocation of its higher-density neighbors. Hence, if a high-density point is allocated incorrectly, the error directly affects the subsequent allocation of the lower-density points around it. Taking the Path-based dataset [24] as an example, Figure 3 shows the clustering result calculated by the DPC algorithm using the cutoff kernel. It can be seen from the figure that the DPC algorithm finds suitable cluster centers, but the allocation results of most points are incorrect. The same is true of the results calculated with the Gaussian kernel, whose point assignments on the Path-based dataset are similar to those of the cutoff kernel. Therefore, the fault tolerance of the point allocation strategy should be further improved. Moreover, points are greatly affected by other points during allocation, which is also an issue to be further addressed.
3. Proposed Method
In this section, the DPC-SFSKNN algorithm is introduced in detail: its five main definitions are given, the entire algorithm process is described, and its complexity is analyzed.
3.1. The Main Idea of DPC-SFSKNN
The DPC algorithm relies on the distances between points to calculate the local density and the relative distance and is also very sensitive to the choice of the cutoff distance d_c. Hence, the DPC algorithm may fail to process some complex datasets correctly. The probability that a point and its neighbors belong to the same cluster is high, so adding neighbor-related attributes to the clustering process can help make correct judgments. Therefore, we introduce the concept of shared-nearest neighbors (SNN) proposed in [22] when defining the local density and the relative distance. Its basic idea is that two points are considered more similar if they have more common neighbors, as stated above (see equation (7)).
Based on the above ideas, we define the average distance of the shared-nearest neighbors between points x_i and x_j and the similarity between the two points.
Definition 1 (average distance of SNN). For any points x_i and x_j in the dataset X, the shared-nearest-neighbor set of the two points is S(x_i, x_j) = SNN(x_i, x_j), and the average distance of SNN is expressed as

$$\mathrm{adist}_{\mathrm{SNN}}(x_i, x_j) = \frac{1}{|S(x_i, x_j)|} \sum_{x_p \in S(x_i, x_j)} \left( d_{ip} + d_{jp} \right), \tag{11}$$

where point x_p is any point of S(x_i, x_j) and |S(x_i, x_j)| is the number of members in the set. The SNN average distance shows the spatial relationship between points x_i and x_j more comprehensively by measuring the distances from both points to their shared-nearest neighbors.
Definition 2 (similarity). For any points x_i and x_j in the dataset X, the similarity between x_i and x_j can be expressed as

$$\mathrm{Sim}(x_i, x_j) = \frac{|S(x_i, x_j)|}{K}, \tag{12}$$

where K is the number of nearest neighbors. K is selected from 4 to 40 until the optimal parameter appears. The lower bound is 4 because a smaller K may prevent the algorithm from terminating. For the upper bound, experiments show that a large K does not significantly affect the results of the algorithm. The similarity is defined according to the aforementioned basic idea, "two points are considered more similar if they have more common neighbors," and is described as the ratio of the number of shared-nearest neighbors to the number of nearest neighbors K.
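Given precomputed K-nearest-neighbor sets, Definitions 1 and 2 reduce to a few lines. The sketch below follows the reconstructed equations (11) and (12); `knn` is assumed to be a list of neighbor-index sets, and returning infinity when two points share no neighbors is our own convention for illustration.

```python
import numpy as np

def snn_average_distance(D, knn, i, j):
    # Definition 1 / eq. (11): average over the shared neighbors of the
    # distances from x_i and from x_j to each shared neighbor.
    shared = list(knn[i] & knn[j])   # S(x_i, x_j)
    if not shared:
        return np.inf                # no shared neighbors (our convention)
    return (D[i, shared] + D[j, shared]).sum() / len(shared)

def similarity(knn, i, j, k):
    # Definition 2 / eq. (12): shared-neighbor count divided by K.
    return len(knn[i] & knn[j]) / k
```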
Definition 3 (K-nearest neighbor average distance). For any point x_i in the dataset X, its K-nearest-neighbor set is KNN(x_i), and the K-nearest neighbor average distance is expressed as

$$\mathrm{adist}_K(x_i) = \frac{1}{K} \sum_{x_j \in \mathrm{KNN}(x_i)} d_{ij}, \tag{13}$$

where point x_j is any point in KNN(x_i) and K is the number of nearest neighbors. The K-nearest neighbor average distance describes the surrounding environment of a point to some extent. Next, we use it to describe the local density.
Definition 4 (local density). For any point x_i in the dataset X, the local density is expressed as

$$\rho_i = \sum_{x_j \in \mathrm{KNN}(x_i)} \frac{|S(x_i, x_j)|}{\mathrm{adist}_K(x_i) + \mathrm{adist}_K(x_j)}, \tag{14}$$

where point x_j is a point in the set KNN(x_i), and adist_K(x_i) and adist_K(x_j) are the K-nearest neighbor average distances of points x_i and x_j, respectively. In formula (14), the numerator (the number of shared-nearest neighbors |S(x_i, x_j)|) represents the similarity between the two points, and the denominator (the sum of the average distances) describes the environment around them. When |S(x_i, x_j)| is constant, a small sum of average distances yields a large local density for point x_i. Point x_j is one of the K-nearest neighbors of point x_i. When the values of adist_K(x_i) and adist_K(x_j) are small, x_i and x_j are closely surrounded by their neighbors. If adist_K(x_i) has a larger value (the neighbors of x_i are far away from it) or adist_K(x_j) has a larger value (the neighbors of x_j are far away from it), the local density of point x_i becomes smaller. Therefore, the local density of point x_i can be large only when the average distances of both points are small. Moreover, when the sum of the average distances of the two points is constant, a larger number of shared-nearest neighbors implies a larger local density. A large number of shared neighbors indicates that the two points have a high similarity and a high probability of belonging to the same cluster. The more high-similarity points surround a point, the greater its local density and the greater its probability of becoming a cluster center. This is beneficial to low-density cluster centers: a large number of shared neighbors can compensate for the loss caused by their large distances to other points, so their local density is not determined by distance alone. Next, we define the relative distance of the points.
Definition 5 (relative distance). For any point x_i in the dataset X, the relative distance can be expressed as

$$\delta_i = \begin{cases} \min\limits_{j:\,\rho_j > \rho_i} \left[ d_{ij} + \mathrm{adist}_K(x_i) + \mathrm{adist}_K(x_j) \right], & \text{if } \exists\, j \text{ with } \rho_j > \rho_i, \\[4pt] \max\limits_{j} \left[ d_{ij} + \mathrm{adist}_K(x_i) + \mathrm{adist}_K(x_j) \right], & \text{otherwise}, \end{cases} \tag{15}$$

where d_ij is the distance between points x_i and x_j, and adist_K(x_i) and adist_K(x_j) are the K-nearest neighbor average distances of points x_i and x_j. We use the sum of the three distances to represent the relative distance. Compared with the DPC algorithm, which uses only d_ij to represent the relative distance, the new definition combines d_ij with the K-nearest neighbor average distances of the two points. It can not only express the relative distance but is also friendlier to low-density cluster centers. With d_ij unchanged, the nearest-neighbor average distances of low-density points are relatively large, so their relative distances also increase, which raises the probability of low-density points being selected.
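Putting Definitions 3-5 together, the decision-graph quantities of DPC-SFSKNN can be sketched as below. The code follows the equations as reconstructed in (13)-(15) and should be read as our interpretation, not the authors' reference implementation.

```python
def sfsknn_rho_delta(D, k):
    # Definitions 3-5 / eqs. (13)-(15): local density and relative distance.
    n = D.shape[0]
    knn_idx = np.argsort(D, axis=1)[:, 1:k + 1]
    knn = [set(row) for row in knn_idx]
    adist = np.take_along_axis(D, knn_idx, axis=1).mean(axis=1)  # eq. (13)
    rho = np.array([                                             # eq. (14)
        sum(len(knn[i] & knn[j]) / (adist[i] + adist[j]) for j in knn_idx[i])
        for i in range(n)
    ])
    delta = np.zeros(n)                                          # eq. (15)
    order = np.argsort(-rho)
    # densest point: maximum of the three-distance sum, taken over all points
    delta[order[0]] = (D[order[0]] + adist[order[0]] + adist).max()
    for pos in range(1, n):
        i, higher = order[pos], order[:pos]
        delta[i] = (D[i, higher] + adist[i] + adist[higher]).min()
    return rho, delta
```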
The cluster centers of DPC-SFSKNN are selected in the same way as in the traditional DPC algorithm: the local density ρ and the relative distance δ form a decision graph, and the points with the largest local density and relative distance are selected as the cluster centers.
For DPC-SFSKNN, the sum of the distances from the points of a low-density cluster to their K neighbors may be large; thus, these points receive greater compensation in their δ values. Figures 4(a) and 4(b) show the results of DPC-SFSKNN on the Jain dataset [23]. Compared with Figure 2(b), the δ values of points in the upper branch are generally larger than those of the lower branch. This is because the density of the upper branch is significantly smaller and the distances from its points to their respective K-nearest neighbors are larger; thus, they receive greater compensation. Even though its density is at a disadvantage, the higher δ value still makes the center of the upper branch stand out in the decision graph. This shows that the DPC-SFSKNN algorithm can correctly select low-density cluster centers.
3.2. Processes
The entire process of the algorithm is divided into two parts: the selection of cluster centers and the allocation of non-center points. The main steps of DPC-SFSKNN and a detailed introduction of the proposed allocation strategy are given in Algorithm 1.

Line 9 of the DPC-SFSKNN algorithm establishes a weighted K-nearest-neighbor graph, and Line 11 is the similarity-first search allocation strategy. To assign the non-center points in the dataset, we designed a similarity-first search algorithm based on the weighted K-nearest-neighbor graph. The algorithm uses the breadth-first search idea to find the cluster center with the highest similarity for each non-center point. The similarities between a non-center point and its K-nearest neighbors are sorted, the neighbor with the highest similarity is selected as the next visited node, and it is pushed into the path queue. If the highest-similarity point is not unique, the point with the smallest SNN average distance is selected as the next visited node. Each visited node likewise sorts the similarities of its K-nearest neighbors and selects the next node to visit. The search stops when the visited node is a cluster center. Algorithm 2 describes the entire search process. Finally, every data point except the cluster centers is traversed to complete the assignment.

The similarity-first search algorithm is an optimization of breadth-first search tailored to the allocation requirements of non-center points. Similarity is an important concept in clustering: points in the same cluster are similar to each other, and two points with higher similarity have more common neighbors. Based on this idea, the definition of similarity was given in (12). During the search, if similarity were the only criterion, the highest-similarity point would often not be unique. Therefore, the algorithm uses the SNN average distance as a second criterion, where a smaller value means the two points are closer in space. A sketch of this search is given below.
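The following sketch shows the search loop of Algorithm 2 under these two criteria; `sim` and `snn_adist` are assumed to be precomputed (n, n) arrays built from Definitions 2 and 1, and the dead-end fallback is our own hypothetical addition.

```python
def similarity_first_search(start, centers, knn_idx, sim, snn_adist):
    # Walk from a non-center point toward a cluster center: at each step
    # visit the most similar unvisited neighbor, breaking ties by the
    # smaller SNN average distance.
    path, visited, cur = [start], {start}, start
    while cur not in centers:
        candidates = [j for j in knn_idx[cur] if j not in visited]
        if not candidates:
            return None                        # dead end (hypothetical fallback)
        cur = min(candidates,
                  key=lambda j: (-sim[cur, j], snn_adist[cur, j]))
        visited.add(cur)
        path.append(cur)
    return path
```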
The clustering results of the DPC-SFSKNN algorithm on the Path-based dataset are shown in Figure 5. Figure 3 clearly shows that although the traditional DPC algorithm can find cluster centers in each of the three clusters, there is a serious bias in the allocation of non-center points. From Figure 5, we can see the effectiveness of the non-center-point allocation algorithm of DPC-SFSKNN. The allocation strategy uses similarity-first search to ensure that similarity along the search path is highest, searching gradually toward the cluster center and avoiding the use of low-similarity points as references. Besides, the similarity-first search allocation strategy based on the weighted K-nearest-neighbor graph considers neighbor information: when the highest-similarity point is not unique, the point with the smallest average distance of the shared neighbors is selected as the next visited point. A high-level sketch combining the pieces above is given below.
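For completeness, a driver tying the sketches together might look as follows. It reuses `sfsknn_rho_delta` and `similarity_first_search` from above; the ρ·δ-based automatic center pick and the helper `similarity_matrices` are our stand-ins for the manual decision-graph selection and for Definitions 1-2, respectively.

```python
def similarity_matrices(D, k):
    # Dense Sim and SNN-average-distance matrices from Definitions 1-2.
    n = D.shape[0]
    knn_idx = np.argsort(D, axis=1)[:, 1:k + 1]
    knn = [set(row) for row in knn_idx]
    sim, adist_snn = np.zeros((n, n)), np.full((n, n), np.inf)
    for i in range(n):
        for j in knn_idx[i]:
            shared = list(knn[i] & knn[j])
            sim[i, j] = len(shared) / k
            if shared:
                adist_snn[i, j] = (D[i, shared] + D[j, shared]).sum() / len(shared)
    return sim, adist_snn, knn_idx

def dpc_sfsknn(D, k, n_centers):
    # End-to-end flow of Algorithm 1 as we read it.
    rho, delta = sfsknn_rho_delta(D, k)
    centers = set(np.argsort(-(rho * delta))[:n_centers])  # stand-in for manual pick
    sim, adist_snn, knn_idx = similarity_matrices(D, k)
    labels = np.full(D.shape[0], -1)
    for c, i in enumerate(sorted(centers)):
        labels[i] = c
    for i in range(D.shape[0]):
        if labels[i] < 0:
            path = similarity_first_search(i, centers, knn_idx, sim, adist_snn)
            labels[i] = labels[path[-1]] if path else -1   # -1 marks unassigned
    return labels
```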
3.3. Complexity Analysis
In this section, the complexity of the DPC-SFSKNN algorithm is analyzed, including time complexity and space complexity. Suppose the size of the dataset is n, the number of cluster centers is m, and the number of neighbors is k.
3.3.1. Time Complexity
The time complexity analysis of DPC-SFSKNN is as follows.
Normalization requires a processing complexity of approximately O(n); the complexities of calculating the Euclidean distances and the similarities between points are O(n²) and O(kn²), respectively; the complexity of computing the K-nearest neighbor average distance is O(kn); similarly, the complexity of computing the average distance between a point and its shared-nearest neighbors does not exceed O(kn); calculating the local density and relative distance of each point requires acquiring the KNN information of that point, with complexity O(k), so the complexities of the local density and the relative distance are O(kn); the point allocation part requires, in the worst case, O(n) search time for one point, and since there are n points in the dataset, the total time does not exceed O(n²). In summary, the total approximate time complexity of DPC-SFSKNN is O(kn²).
The time complexity of the DPC algorithm depends on the following three aspects: (a) the time to calculate the distances between points, (b) the time to calculate the local density ρ_i for each point, and (c) the time to calculate the relative distance δ_i for each point. The time complexity of each part is O(n²), so the total approximate time complexity of DPC is O(n²).
The time complexity of the DPC-SFSKNN algorithm is thus about k times that of the traditional DPC algorithm. However, k is relatively small compared with n, so the extra factor does not significantly affect the run time. In Section 4, it is demonstrated that the actual running time of DPC-SFSKNN does not exceed k times the running time of the traditional DPC algorithm.
3.3.2. Space Complexity
DPC-SFSKNN needs to calculate the distances and similarities between points, with complexity O(n²). The other data structures (such as the ρ and δ arrays and the various average-distance arrays) require O(n). For the allocation strategy, the worst-case complexity does not exceed O(n²). The space complexity of DPC is O(n²), mainly due to the stored distance matrix.
The space complexity of our DPC-SFSKNN is therefore the same as that of traditional DPC, namely O(n²).
4. Experiments and Results
In this section, experiments are performed on several public datasets commonly used to test the performance of clustering algorithms, including synthetic datasets [23–27] and real datasets [28–34]. To visually observe the clustering ability of DPC-SFSKNN, the DPC [20], DBSCAN [15], AP [8], FKNN-DPC [9], and K-means [10] methods are all tested for comparison. Three popular benchmarks are used to evaluate the performance of the above clustering algorithms: clustering accuracy (ACC), adjusted mutual information (AMI), and adjusted Rand index (ARI) [35]. The upper bound of each benchmark is 1, and a larger value indicates a better clustering result. The codes for DPC, DBSCAN, and AP were provided based on the corresponding references.
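AMI and ARI are available directly in scikit-learn, while ACC additionally needs the best one-to-one mapping between predicted and true labels, usually computed with the Hungarian algorithm. A minimal sketch, assuming integer labels starting at 0:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    # ACC: fraction of points correctly labeled under the best one-to-one
    # matching between predicted clusters and true classes.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    m = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((m, m), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1
    rows, cols = linear_sum_assignment(-cost)      # maximize matched pairs
    return cost[rows, cols].sum() / y_true.size

# Usage: clustering_accuracy(y, y_hat),
#        adjusted_mutual_info_score(y, y_hat), adjusted_rand_score(y, y_hat)
```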
Table 1 lists the synthetic datasets used in the experiments, which were published in [23–27]. Table 2 lists the real datasets used in the experiments, including the real-world datasets from [29–34] and the Olivetti face dataset [28].
To eliminate the influence of missing values and of differences in dimension ranges, the datasets need to be preprocessed before the experiments. We replace missing values with the mean of all valid values in the same dimension and normalize the data using the min-max normalization method shown in the following equation:

$$x'_{ij} = \frac{x_{ij} - \min_i(x_{ij})}{\max_i(x_{ij}) - \min_i(x_{ij})}, \tag{16}$$

where x_ij represents the original data located in row i and column j, x'_ij represents the rescaled value of x_ij, and the minimum and maximum are taken over the original data in column j.
The min-max normalization method processes each dimension of the data and preserves the relationships among the original data values [36], thereby decreasing the influence of differences in dimension ranges and increasing the efficiency of the calculation.
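A minimal sketch of this preprocessing (column-mean imputation followed by min-max rescaling), assuming the data arrive as a NumPy array with NaNs marking missing values:

```python
import numpy as np

def impute_column_mean(X):
    # Replace each NaN with the mean of the valid values in its column.
    X = np.array(X, dtype=float)
    col_mean = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_mean[cols]
    return X

def min_max_normalize(X):
    # Equation (16), applied column-wise; constant columns are mapped to
    # zero to avoid division by zero.
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span
```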
To fairly reflect the clustering results of the six algorithms, the parameters of each algorithm are tuned to retain satisfactory clustering performance. For the DPC-SFSKNN algorithm, the parameter K needs to be specified in advance, and the initial cluster centers are manually selected from a decision graph composed of the local density ρ and the relative distance δ. The experimental results in Tables 3 and 4 show that the value of K is around 6, and for datasets with dense sample distributions the value of K is larger than 6. In addition to manually selecting the initial cluster centers, the traditional DPC algorithm also needs d_c to be determined. Based on the selection range provided in [20], d_c is chosen so that the average number of neighbors is between 1% and 2% of the total number of data points. The two parameters that DBSCAN needs are Eps and MinPts, as in [15]; the optimal parameters are determined using a circular search method. The AP algorithm only needs a preference, and the larger the preference, the more center points are allowed to be selected [8]; since no general selection method is effective, the optimal preference can be found only by repeated experiments. The only parameter of K-means is the number of clusters, for which the true number of clusters in each dataset is used. Similarly, FKNN-DPC needs the number of nearest neighbors K.
4.1. Analysis of the Experimental Results on Synthetic Datasets
In this section, the performance of DPC-SFSKNN, DPC [20], DBSCAN [15], AP [8], FKNN-DPC [9], and K-means [10] is tested on the six synthetic datasets given in Table 1. These synthetic datasets differ in distribution and size, simulating different data situations for comparing the performance of the six algorithms. Table 3 shows the AMI, ARI, ACC, and EC/AC of the six clustering algorithms on the six synthetic datasets, where the best results are shown in bold and "—" means no value. Figures 6–9 show the clustering results of DPC-SFSKNN, DPC, DBSCAN, AP, FKNN-DPC, and K-means on the Path-based, Flame, Aggregation, and Jain datasets, respectively. The six algorithms all achieve optimal clustering on the DIM512 and DIM1024 datasets, so the clustering of these two datasets is not shown. Since the cluster centers of DBSCAN are relatively random, the positions of the cluster centers are marked only for the other algorithms.
Figure 6 shows the results on the Path-based dataset. DPC-SFSKNN and FKNN-DPC can cluster the Path-based dataset correctly. From Figures 6(b), 6(d), and 6(f), it can be seen that the clustering results of DPC, AP, and K-means are similar. The cluster centers selected by DPC, AP, DPC-SFSKNN, and FKNN-DPC are highly similar, but the clustering results of DPC and AP are not satisfactory. For the DPC algorithm, the low fault tolerance of its allocation strategy is the cause of this result: a high-density point allocation error is transferred to low-density points, and the error propagation seriously affects the clustering result. The AP and K-means algorithms are not good at dealing with irregular clusters; the two clusters in the middle attract the points on both sides of the semicircular cluster too strongly, which leads to clustering errors. DBSCAN can completely detect the semicircular cluster, but it incorrectly merges the semicircular cluster with the middle-left cluster into one category and divides the middle-right cluster into two clusters. The similarities between points and the manually prespecified parameters may severely affect the clustering. The DPC-SFSKNN and FKNN-DPC algorithms perform well on the Path-based dataset; these improved algorithms, which consider neighbor relationships, have a great advantage in handling such complex distributed datasets.
Figure 7 shows the results of the six algorithms on the Flame dataset. As shown in the figure, DPC-SFSKNN, DPC, FKNN-DPC, and DBSCAN can correctly detect the two clusters, while AP and K-means cannot cluster completely correctly. Although AP can correctly identify the upper cluster and select an appropriate cluster center, it divides the lower cluster into two clusters. Both clusters found by K-means are wrong. The clustering results in Figure 8 show that the DPC-SFSKNN, DPC, FKNN-DPC, and DBSCAN algorithms can detect the 7 clusters in the Aggregation dataset, but AP and K-means still cannot cluster correctly. DPC-SFSKNN, DPC, and FKNN-DPC can identify the clusters and their centers. Although the cluster centers are not marked for DBSCAN, the number of clusters and the overall shape of each cluster are correct. The AP algorithm finds the correct number of clusters, but it chooses two centers for one cluster, thereby dividing that cluster into two. The clustering result of K-means is similar to that of AP.
The Jain dataset, shown in Figure 9, consists of two semicircular clusters of different densities. As shown in the figure, the DPC-SFSKNN algorithm can completely cluster the two clusters of different densities. However, DPC, AP, FKNN-DPC, and K-means incorrectly assign the left end of the lower cluster to the upper cluster, and the cluster centers of DPC both lie in the lower cluster. In comparison, the distribution of the cluster centers of AP is more reasonable. The DBSCAN algorithm can accurately identify the lower cluster, but it incorrectly divides the left end of the upper cluster into a new cluster, splitting the upper cluster in two.
According to the benchmark data shown in Table 3, the performance of DPC-SFSKNN is clearly effective among the six clustering algorithms, especially on the Jain dataset. Although DPC and FKNN-DPC perform better than DPC-SFSKNN on the Aggregation and Flame datasets, DPC-SFSKNN can still find the correct cluster centers of Aggregation and complete the clustering task correctly.
4.2. Analysis of Experimental Results on Real-World Datasets
In this section, the performance of the six algorithms is again benchmarked with AMI, ARI, ACC, and EC/AC, and the clustering results are summarized in Table 4. Twelve real-world datasets are selected to test DPC-SFSKNN's ability to identify clusters on different datasets. The DBSCAN and AP algorithms cannot obtain effective clustering results on Waveform and Waveform (noise); the symbol "—" represents no result.
As shown in Table 4, in terms of the benchmarks AMI, ARI, and ACC, DPC-SFSKNN outperforms the five other algorithms on the Wine, Segmentation, and Libras movement datasets. At the same time, FKNN-DPC performs better than the other five algorithms on the Iris, Seeds, Parkinsons, and WDBC datasets. The overall performance of DPC-SFSKNN is slightly better than that of DPC on the 11 datasets other than Parkinsons; on Parkinsons, DPC-SFSKNN is slightly worse than DPC in AMI but better in ARI and ACC. Similarly, DPC-SFSKNN performs slightly better than FKNN-DPC on eight of the datasets, while on Iris, Parkinsons, WDBC, and Seeds it is slightly worse than FKNN-DPC in AMI, ARI, and ACC. DBSCAN obtains the best results on Ionosphere. K-means is the best on Pima-Indians-diabetes and obtains the best AMI on the Waveform and Waveform (noise) datasets. In general, the clustering results of DPC-SFSKNN on real-world datasets are satisfactory.
4.3. Experimental Analysis of Olivetti Face Dataset
The Olivetti face dataset [28] is an image dataset widely used by machine learning algorithms. Its purpose is to test the clustering ability of an algorithm without supervision, including determining the number of clusters in the database and the members of each cluster. The dataset contains 40 clusters, each of which has 10 different images. Because the number of clusters (40) is large while each cluster contains only a few elements (10 images), the reliability of the local density becomes smaller, which is a great challenge for density-based clustering algorithms. To further test the clustering performance of DPC-SFSKNN, experiments were performed on the Olivetti face dataset, and DPC-SFSKNN was compared with DPC, AP, DBSCAN, FKNN-DPC, and K-means.
The clustering results achieved by DPC-SFSKNN and DPC on the Olivetti face dataset are shown in Figure 10, where white squares represent the cluster centers. The 32 clusters found by DPC-SFSKNN in Figure 10(a) and the 20 clusters found by DPC in Figure 10(b) are displayed in different colors, and gray images indicate images not assigned to any cluster. It can be seen from Figure 10(a) that the 32 cluster centers found by DPC-SFSKNN cover 29 ground-truth clusters, while Figure 10(b) shows that the 20 cluster centers found by DPC are scattered over 19 clusters. Like DPC, DPC-SFSKNN may divide one cluster into two. Because DPC-SFSKNN finds many more density peaks than DPC, it is more likely to identify one cluster as two different clusters; the same situation occurs with the FKNN-DPC algorithm. The performance of FKNN-DPC is better than that of DPC-SFSKNN in AMI, ARI, ACC, and EC/AC. Comparing the clustering results of these algorithms based on AMI, ARI, ACC, and EC/AC in Table 5, the performance of DPC-SFSKNN is slightly superior to that of the other four algorithms except FKNN-DPC.
4.4. Running Time
This section compares the time performance of DPC-SFSKNN with DPC, DBSCAN, AP, FKNN-DPC, and K-means on the real-world datasets. The time complexities of DPC-SFSKNN and DPC were analyzed in Section 3.3.1: the time complexity of DPC is O(n²) and that of DPC-SFSKNN is O(kn²), where n is the size of the dataset and k is the number of neighbors. However, the time consumed by DPC mainly comes from calculating the local density and the relative distance of each point, while the time consumed by DPC-SFSKNN comes mainly from the calculation of the K-nearest neighbors and the allocation strategy for non-center points. Table 6 lists the running times (in seconds) of the six algorithms on the real datasets. It can be seen that although the time complexity of DPC-SFSKNN is approximately k times that of DPC, their execution times on actual datasets do not differ by a factor of k.
In Table 6, it can be found that on relatively small datasets, the running time of DPC-SFSKNN is about twice that of DPC or more, and the difference mainly comes from DPC-SFSKNN's allocation strategy. Although the computational load of the local densities grows very quickly with the size of a dataset, the time consumed by the allocation strategy in DPC-SFSKNN varies with the distribution of the dataset. This leads to an irregular gap between the running times of DPC and DPC-SFSKNN.
FKNN-DPC has the same time and space complexity as DPC, but its running time is almost the same as that of DPC-SFSKNN: computing the K-nearest-neighbor relationships takes a large share of the running time. The time complexities of DBSCAN and AP are approximately O(n²), and the parameters of both cannot be determined by simple methods. When the dataset is relatively large, it is difficult to find their optimal parameters, which may be the reason why the two algorithms produce no results on the Waveform datasets. The approximate time complexity of K-means is O(nkt), where t is the number of iterations, and Table 6 confirms its efficiency. K-means loses almost no accuracy while being fast, which makes it a very popular clustering algorithm, but it cannot handle irregularly shaped data well.
5. Conclusions and Future Work
A new clustering algorithm based on the traditional DPC algorithm is proposed in this paper. It comprises a density peak search that takes the surrounding neighbor information into account and a new allocation strategy that detects the true distribution of the dataset. The proposed clustering algorithm performs a fast search, finds density peaks, that is, the cluster centers, of a dataset of any size, and recognizes clusters of arbitrary shape and dimensionality. The algorithm is called DPC-SFSKNN: it calculates the local density and the relative distance from distance information between points and their neighbors to find the cluster centers, and then assigns the remaining points with a similarity-first search algorithm based on the weighted KNN graph that finds the owner (cluster center) of each point. DPC-SFSKNN successfully addresses several issues arising from the clustering algorithm of Alex Rodriguez and Alessandro Laio [20], including its density metric and the potential issue hidden in its assignment strategy. The performance of DPC-SFSKNN was tested on several synthetic datasets, real-world datasets from the UCI machine learning repository, and the well-known Olivetti face dataset. The experimental results demonstrate that DPC-SFSKNN is powerful in finding cluster centers and in recognizing clusters regardless of their shape, of the dimensionality of the space in which they are embedded, and of the size of the datasets, and it is robust to outliers. It performs much better than the original DPC algorithm. However, the proposed algorithm has some limitations: the parameter K needs to be adjusted manually for different datasets; the cluster centers still need to be selected manually by analyzing the decision graph (as in the DPC algorithm); and the allocation strategy improves the clustering accuracy at additional time cost. How to improve the degree of automation and the allocation efficiency of the algorithm needs further research.
Data Availability
The synthetic datasets are cited at relevant places within the text as references [23–27]. The real-world datasets are cited at relevant places within the text as references [29–34]. The Olivetti face dataset is cited at relevant places within the text as reference [28].
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (6160303040 and 61433003), in part by the Yunnan Applied Basic Research Project of China (201701CF00037), and in part by the Yunnan Provincial Science and Technology Department Key Research Program (Engineering) (2018BA070).
Supplementary Materials
It includes the datasets used in the experiments in this paper. (Supplementary Materials)