Abstract

Among numerous clustering algorithms, clustering by fast search and find of density peaks (DPC) is favoured because it is little affected by the shapes and density structures of the data set. However, DPC still shows limitations when clustering data sets with heterogeneous clusters and easily makes mistakes in the assignment of the remaining points. A new algorithm, density peak clustering based on relative density optimization (RDO-DPC), is proposed to settle these problems and obtain better results. With the help of the neighborhood information of sample points, the proposed algorithm defines a relative density of the sample data and searches for and recognizes the density peaks of nonhomogeneous distributions as cluster centers. A new assignment strategy is proposed to solve the overclassification problem. Experiments on synthetic and real data sets show the good performance of the proposed algorithm.

1. Introduction

As an unsupervised machine learning technique, clustering groups sample data into reasonable classes based on the similarity between sample points. The process tries to make the similarity between samples inside the same cluster as high as possible and the similarity between samples in different clusters as low as possible. Many types of clustering algorithms have been proposed for different applications. In general, clustering can be divided into divisive clustering [1–3], hierarchical clustering [4, 5], grid-based algorithms [6, 7], model-based algorithms [8, 9], and density-based algorithms [10, 11]. In practical applications, data sets are varied and complex, often with high dimensions, which brings huge challenges to clustering. Some scholars have put forward the idea of combining multiple clustering algorithms, that is, ensemble clustering [12, 13], which effectively improves clustering accuracy. With the development of cluster analysis theory and technology, clustering plays an increasingly important role in image processing, machine learning, artificial intelligence, natural language processing, pattern recognition, information retrieval, and bioinformatics [14].

Clustering by fast search and find of density peaks (DPC) [15] proposes a totally new clustering framework and a new way of defining cluster centers. The structure of the data is mapped into a two-dimensional space (local density and the nearest distance to a denser point), in which centers are recognized and clusters are grouped. With DPC, the density peaks of sample data are found easily and quickly, and DPC also shows high efficiency in point assignment and noise elimination. However, there are still limitations in clustering with DPC. (1) There is no unified density measurement, and the cut-off distance parameter $d_c$ is difficult to set because it depends on the specific problem. (2) Cluster centers need to be selected manually, which is a qualitative analysis with subjective factors; as a result, objective and reasonable centers are difficult to find in decision graphs. (3) Each sample point is assigned to the nearest cluster of higher density, so one wrong assignment easily propagates along a chain of subsequent assignments. (4) According to the definition of the distance $\delta$, two points would both be selected as cluster centers if they share the highest density and belong to the same cluster, which means one cluster is mistakenly divided into two. (5) DPC shows limits in clustering data sets with high dimensions, unevenly distributed density, and noise.

To improve DPC, a new algorithm is proposed from two aspects: density measurement and assignment of the remaining points. The classical DPC algorithm uses a global density, which cannot effectively identify the density peaks in low-density areas. In this paper, the nearest-neighbor information of samples is employed to calculate a local relative density, in an attempt to recognize the centers of data sets with nonhomogeneous distributions. To solve the overclassification problem in DPC, a new assignment strategy is proposed that sorts the local densities and defines the corresponding distances of data samples accordingly. Based on the two improvements, the density peak clustering algorithm based on relative density optimization (RDO-DPC) achieves satisfactory clustering results on synthetic and real data sets with various density types and irregular shapes.

The remainder of the paper is organized as follows: Section 2 introduces the definition and process of classical DPC and related work; the density peak clustering algorithm based on relative density optimization (RDO-DPC) is proposed in Section 3; experiments on synthetic and real data sets are presented in Section 4; and Section 5 concludes the paper and discusses future work.

2. Classical DPC and Related Work

2.1. DPC Algorithm

Clustering by fast search and find of density peaks (DPC) [15] can find clusters of various densities and shapes with a simple strategy. The fundamental principle of DPC is that an ideal density peak possesses two essential features: (1) its local density is higher than the densities of its neighbors; (2) the distances between different peaks are relatively large. To find density peaks meeting these two conditions, DPC introduces the local density $\rho_i$ of sample $x_i$ and the corresponding distance $\delta_i$, which is the distance from $x_i$ to the nearest sample whose local density is higher than $\rho_i$.

The local density depends on distance, which means it can be regarded as a function of the distance, for example, through a kernel function. One local density is defined by the cut-off kernel:

$$\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c), \qquad \chi(x) = \begin{cases} 1, & x < 0, \\ 0, & x \geq 0, \end{cases} \tag{1}$$

where $d_{ij}$ represents the distance between points $x_i$ and $x_j$ and the positive number $d_c$ is an appointed cut-off parameter. The other local density can be defined by the Gaussian kernel:

$$\rho_i = \sum_{j \neq i} \exp\left(-\frac{d_{ij}^2}{d_c^2}\right). \tag{2}$$

The cut-off distance $d_c$ in equations (1) and (2) controls the influence of neighbors on a sample point, playing the role of a neighborhood radius. When the data set is of large scale (in number of points), the clustering result of DPC is only slightly influenced by the cut-off distance, and the influence grows as the data scale shrinks. To reduce the influence of the cut-off distance on the local density, and further on the clustering results, DPC employs the Gaussian kernel in equation (2) when clustering small-scale data.
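
As a concrete illustration, both kernels can be computed from the pairwise distance matrix. The following Python/NumPy sketch (function and parameter names are this example's, not the paper's) also derives $d_c$ from the position $p\%$ of the ascending pairwise-distance sequence, the convention used in Section 4:

    import numpy as np

    def local_density(X, p=2.0, kernel="gaussian"):
        # Illustrative sketch of equations (1) and (2).
        n = X.shape[0]
        diff = X[:, None, :] - X[None, :, :]
        D = np.sqrt((diff ** 2).sum(axis=-1))          # pairwise Euclidean distances
        # d_c = distance at position p% of the ascending distance sequence.
        d_c = np.percentile(D[np.triu_indices(n, k=1)], p)
        if kernel == "cutoff":                         # cut-off kernel, equation (1)
            rho = (D < d_c).sum(axis=1) - 1.0          # "-1" excludes the point itself
        else:                                          # Gaussian kernel, equation (2)
            rho = np.exp(-(D / d_c) ** 2).sum(axis=1) - 1.0
        return rho, D, d_c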

Another feature of an ideal cluster center is that the distances between different centers should be as large as possible. As a result, $\delta_i$, the distance from sample $x_i$ to the nearest sample whose local density is larger than $\rho_i$, is defined as

$$\delta_i = \begin{cases} \min\limits_{j:\, \rho_j > \rho_i} d_{ij}, & \text{if } \exists j:\ \rho_j > \rho_i, \\ \max\limits_{j} d_{ij}, & \text{otherwise.} \end{cases} \tag{3}$$

The definition in equation (3) shows that if sample $x_i$ has the largest local (or overall) density, its distance $\delta_i$ is far larger than the $\delta$ values of its neighbors. Therefore, cluster centers are typically points with extremely large $\delta$, and the density of these center points is also very large. By constructing the decision graph of the distance $\delta$ against the density $\rho$, DPC selects sample points with relatively large $\rho$ and $\delta$ as cluster centers. Each remaining point is then assigned to the cluster of its nearest neighbor with higher density, thus completing the distribution of the remaining points with high efficiency.
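
A direct transcription of equation (3), together with the one-pass assignment, might look as follows (a minimal sketch reusing np from above; center_idx must contain the global density peak, since that point has no denser neighbor to inherit a label from):

    def dpc_delta_and_labels(rho, D, center_idx):
        # Equation (3) plus DPC's one-pass assignment of the remaining points.
        n = len(rho)
        delta = np.zeros(n)
        parent = np.full(n, -1)
        for i in range(n):
            higher = np.where(rho > rho[i])[0]         # points denser than x_i
            if higher.size == 0:                       # x_i is the global peak
                delta[i] = D[i].max()
            else:
                j = higher[np.argmin(D[i, higher])]    # nearest denser point
                delta[i], parent[i] = D[i, j], j
        labels = np.full(n, -1)
        labels[np.asarray(center_idx)] = np.arange(len(center_idx))
        for i in np.argsort(-rho):                     # denser points first, so a
            if labels[i] == -1:                        # parent is always labeled
                labels[i] = labels[parent[i]]          # before its children
        return delta, labels

Choosing center_idx as the points with both large $\rho$ and large $\delta$ in the decision graph completes the classical algorithm.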

2.2. Related Work

Researchers have improved DPC [15] in many ways to adapt it to different applications, mainly focusing on the definition of cluster centers and the assignment strategy.

In terms of the definition of cluster centers, some scholars try to expand the differentiation between cluster centers and other sample points, so as to select cluster centers in the decision graph, such as the normalization of local density and distance [16], the gravitational-analogy minimum distance [17, 18], and the parameter-free Laplacian centrality [19]. Although this kind of method expands the differentiation between density peak points and other points to a certain extent, it is still difficult to determine the cluster centers directly and effectively in some complex decision graphs, and manual selection is still needed. Therefore, other scholars have proposed methods to quantitatively select the class centers based on the decision graph, among which the most prominent are fuzzy theory [20], the normal distribution criterion [21], the inflexion point of the data distribution in the decision graph [22], the linear fitting of the distribution curve of the density-distance product [23], and the Chebyshev inequality [24] or the upper bound of the generalized extreme value [25]. This kind of method can automatically determine the potential class centers of the data set without human intervention. However, due to the influence of multiple density extrema, it is often necessary to merge subclusters to further optimize the sample allocation.

The remaining-points assignment strategy of classical DPC is prone to chains of mistaken assignments. Many improvements have been proposed to modify it, such as distributing the remaining points based on the k-nearest neighbors [26, 27], measuring sample similarity based on shared nearest neighbors [28], combining initial clusters via boundary samples [29] or density reachability [30], and assigning the remaining points in combination with other algorithms [11, 31]. The assignment strategies based on nearest and shared nearest neighbors take full account of the neighbor information of the samples, which helps obtain reasonable cluster assignments. However, mere consideration of the distances between samples cannot reflect the impact of the true cluster attribution on the similarities between samples. The assignment strategy based on combining initial clusters works well with multiple density peaks, but it shows high time complexity. Moreover, some algorithms use DPC as an initial cluster center selection strategy, which can mitigate the impact of initial center selection on the clustering results, but these algorithms all show high time complexity and are not suitable for clustering large-scale, high-dimensional data.

For high-dimensional data with noise, a noise-filtering standard is constructed based on the k nearest neighbors, and the recognition of cluster centers and assignment of the remaining points are conducted after noise filtering [26, 27]. DenPEHC [23] treats sample points with a high ratio of $\delta$ to $\rho$ as noise, but errors and manual factors remain. Furthermore, dimension reduction is used to reduce the dimensionality of high-dimensional data [32], after which sample points are assigned using the nearest-neighborhood parameter k. In addition, the geodesic distance [33, 34] is used to compute the manifold distance between data points, and isometric mapping is introduced to reduce the dimension of high-dimensional data sets. The above analysis shows that many improvements and optimizations have been proposed to solve the problems of DPC, with satisfying results. However, many problems still exist in the clustering of complex data sets, for example, uneven density across clusters, high dimensionality, parameter optimization, center recognition, noise treatment, and high time complexity.

3. RDO-DPC Algorithm

The proposed RDO-DPC improves classical DPC in two aspects: the definition of the local density and the assignment strategy for cluster members. Taking advantage of neighbor information, RDO-DPC defines a new measurement of relative density. Cluster centers are then selected with the help of the decision graph, so as to obtain satisfactory results on data sets with uneven density between clusters. The remaining points are allocated according to the structural information of the data set, which effectively avoids the disadvantage of the one-step distribution strategy in DPC.

Recognizing the cluster centers of areas with different densities is the guarantee of effective clustering results. With the local density definition in equation (2), the peaks of low-density areas are buried by high-density peaks because the local density of a dense area is much higher than that of a sparse area. In order to give prominence to the peaks of sparse areas, the relative local density is defined as

$$\rho_i^{r} = \frac{n_i}{\dfrac{1}{n_i} \sum_{x_j \in N_r(x_i)} n_j}, \tag{4}$$

where the radius of influence $r$ is the $p$ quantile of the pairwise distances sorted from the smallest to the biggest, $N_r(x_i) = \{x_j : d_{ij} \leq r\}$ is the spherical neighborhood of sample $x_i$, and $n_i = |N_r(x_i)|$ is the number of samples in that neighborhood. The revised local density, which replaces the raw neighbor counts with Gaussian weights, is defined as

$$\rho_i^{r} = \frac{\hat{\rho}_i}{\dfrac{1}{n_i} \sum_{x_j \in N_r(x_i)} \hat{\rho}_j}, \qquad \hat{\rho}_i = \sum_{j \neq i,\ d_{ij} \leq r} \exp\left(-\frac{d_{ij}^2}{d_c^2}\right), \tag{5}$$

where the strict condition $d_{ij} \leq r$ in equation (5) is equivalent to a truncated Gaussian kernel function and eliminates the interference from samples far away. Compared with classical DPC, the relative local densities (4) and (5) can recognize the cluster centers of regions with different densities by employing a relative index rather than an absolute one.
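
Under the reading of equations (4) and (5) given above, that is, a truncated-Gaussian density normalized by the mean density of the neighborhood, the relative density can be sketched as follows (the helper reuses np from Section 2 and is this example's code, not the paper's):

    def relative_density(D, d_c, r):
        # Truncated Gaussian density (equation (5)): only samples inside the
        # spherical neighborhood of radius r contribute.
        inside = D <= r
        np.fill_diagonal(inside, False)
        rho_hat = np.where(inside, np.exp(-(D / d_c) ** 2), 0.0).sum(axis=1)
        # Normalize by the mean density over each point's neighborhood, so
        # peaks of sparse regions become comparable to peaks of dense ones.
        n_i = np.maximum(inside.sum(axis=1), 1)
        mean_neigh = (inside * rho_hat[None, :]).sum(axis=1) / n_i
        return rho_hat / np.maximum(mean_neigh, 1e-12)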

The ideal cluster centers of DPC possess two features: their local density is higher than the density of the surrounding samples, and they are far away from each other. This shows that the distance $\delta$ is also important in the selection of cluster centers; consequently, cluster centers are usually samples with both a higher density and a larger distance. If the two largest density peaks lie in one cluster, both points will be selected as cluster centers according to equation (3), so that one cluster is mistakenly divided into two, which eventually leads to unsatisfactory clustering results. Therefore, the relative density is ranked before calculating, for each sample, the shortest distance to a sample of higher density, which helps distinguish two near-equal density peaks. The corresponding distance is defined as

$$\delta_{q_i} = \begin{cases} \min\limits_{j < i} d_{q_i q_j}, & i \geq 2, \\ \max\limits_{j \geq 2} \delta_{q_j}, & i = 1, \end{cases} \tag{6}$$

where $\{q_i\}_{i=1}^{n}$ represents the subscript sequence of a descending order of $\rho$, satisfying $\rho_{q_1} \geq \rho_{q_2} \geq \cdots \geq \rho_{q_n}$. If the two biggest local density peaks $\rho_a$ and $\rho_b$ of a data set, computed according to equation (2) or (4), are very close, it is hard to identify the real peak in the decision graph, and $x_a$ and $x_b$ may each be recognized as the center of its own cluster. After ranking the two peaks, if $\rho_a \geq \rho_b$, the distance $\delta_a$ corresponding to $x_a$ is set to the largest corresponding distance of the other density peaks by equation (6), while the distance corresponding to $x_b$ becomes the distance between $x_a$ and $x_b$, which weakens the $\delta$ value of $x_b$. As a result, $x_b$ is no longer selected as a cluster center.
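
Equation (6) amounts to computing $\delta$ along the descending-density rank, which is exactly what separates two near-equal peaks; a minimal sketch:

    def delta_ranked(rho, D):
        # delta via the descending-density rank q (equation (6)): the second
        # of two near-equal peaks gets delta equal to its distance to the
        # first, so it no longer stands out in the decision graph.
        q = np.argsort(-rho, kind="stable")          # q[0] is the top peak
        delta = np.empty(len(rho))
        for k in range(1, len(q)):
            delta[q[k]] = D[q[k], q[:k]].min()       # nearest higher-ranked point
        delta[q[0]] = delta[q[1:]].max()             # i = 1 case of equation (6)
        return delta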

Combining equations (5) and (6), the peaks of areas with large density differences are easy to recognize in the decision graph, and their discriminability is strengthened by the corresponding decision distances, yielding a stronger generalization ability. The resulting RDO-DPC algorithm is shown in Algorithm 1.

Input:
  Sample matrix X and cut-off ratio parameter p
Output:
  Clustering label y
(1)   Calculate the distance matrix D
(2)   Calculate the relative local density ρ according to equations (4) and (5)
(3)   Calculate the distance δ with equation (6)
(4)   Draw the decision graph and select the cluster centers
(5)   Assign the remaining points to the centers according to the nearest-distance principle
(6)   Return the clustering result y
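
Putting the helpers from the previous sections together, Algorithm 1 can be sketched end to end. Selecting the centers as the n_clusters largest $\rho \cdot \delta$ products automates step (4); the paper instead reads the centers off the decision graph, so n_clusters is an assumption of this sketch:

    def rdo_dpc(X, p=15.0, n_clusters=2):
        # End-to-end sketch of Algorithm 1; the top relative-density peak is
        # assumed to fall among the selected centers (it usually carries the
        # largest delta, hence the largest rho * delta product).
        _, D, d_c = local_density(X, p)                         # step (1): distances, d_c
        r = np.percentile(D[np.triu_indices(len(X), k=1)], p)   # radius of influence r
        rho = relative_density(D, d_c, r)                       # step (2): equations (4)-(5)
        delta = delta_ranked(rho, D)                            # step (3): equation (6)
        centers = np.argsort(-(rho * delta))[:n_clusters]       # step (4), automated here
        _, labels = dpc_delta_and_labels(rho, D, centers)       # step (5)
        return labels, centers

For instance, labels, centers = rdo_dpc(X, p=15, n_clusters=2) clusters a two-dimensional sample matrix X.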

RDO-DPC takes the relative density as its density measurement. With the relative density, the density calculation of each point is restricted in scope, and the values are related only to the points inside the neighborhood. The relative closeness between a sample and the samples inside its neighborhood is revealed more clearly, and the local information of each point and of the samples inside its neighborhood is also shown clearly. Therefore, RDO-DPC suits not only data sets with relatively even density between clusters but also data sets with obvious density differences between clusters.

The time complexity of RDO-DPC is $O(n^2)$, which consists of the measurement of the relative local density and the assignment of the remaining points based on the nearest distance. The computation of the relative local density lies in the Euclidean distances between sample points and the determination of the neighborhoods, whose complexity is $O(n^2)$. The assignment strategy of the remaining points based on the nearest distance employs a classical sorting algorithm, whose complexity is $O(n \log n)$.

4. Experiments

In this section, 8 synthetic and 7 real data sets are employed to test the proposed algorithm. The data sets differ greatly from each other in density distribution, scale, shape, and so on. Among them, DS1–DS5, Aggregation, Compound, and Flame are synthetic two-dimensional data sets, which are shown in Figure 1. The 7 real data sets are from the UCI machine learning repository.

In the experiments, the clustering results of RDO-DPC are compared with those of classical DPC. Both algorithms need the cut-off distance $d_c$, which is defined as the distance at position $p\%$ of the ascending sequence of all pairwise distances among samples.

The clustering results are measured with AMI (adjusted mutual information) and ARI (adjusted Rand index) [35]. The value range of the two indexes is $[-1, 1]$, and the larger the value, the better the clustering result. Besides, the clustering results on the two-dimensional synthetic data sets are labelled with different colors, and the centers are marked with red stars to give a clear view of the results. The results shown in this section are the best results of both RDO-DPC and DPC under their respective best parameters. In this way, the algorithms can be better judged concerning their adaptability to data sets of different types and their clustering effectiveness.
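
The paper does not name an implementation for the two indexes; scikit-learn provides both, assuming ground-truth labels y_true and predicted labels y_pred:

    from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

    ari = adjusted_rand_score(y_true, y_pred)           # adjusted Rand index
    ami = adjusted_mutual_info_score(y_true, y_pred)    # adjusted mutual information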

Eight two-dimensional synthetic data sets are employed to test the clustering performance of RDO-DPC and DPC. Both algorithms find the centers quickly and assign the remaining samples efficiently. Some comparative visualization results on the synthetic data sets are shown in Figure 2, in which sparse clusters are recognized and excessive clustering is avoided.

The validation of the comparative clustering results on the 8 synthetic data sets is shown in Table 1, which includes ARI, AMI, and their variances, expressed as “ARI.Var” and “AMI.Var” in the table. With different values of the parameter $p$, AMI and ARI vary more or less; the variance of the accuracy under different $p$ is given in the table to show the stability of the RDO-DPC algorithm. The ARI and AMI listed in the table are the best values obtained with the proper parameter $p$. Compared with the DPC algorithm, RDO-DPC exhibits superior performance in clustering data sets with extremely large density differences among clusters and with various shapes.

The comparison of the quantitative indexes between RDO-DPC and DPC shows the obvious superiority of RDO-DPC. RDO-DPC is slightly lower than DPC in the clustering indexes of DS1 but apparently higher on the other data sets. The superior performance of RDO-DPC stems from its employment of the relative density when clustering data sets with uneven density among clusters. Therefore, RDO-DPC recognizes cluster centers more effectively and assigns the remaining points more correctly, thus achieving better clustering results than DPC.

Seven real data sets from the UCI machine learning repository are employed to test the performance of RDO-DPC and classical DPC. These benchmark data sets include data of high dimension, complicated structure, and various shapes. With different values of the parameter $p$, the performance of the two algorithms varies slightly. AMI and ARI are employed to measure the clustering results, and the variance of the accuracy and the best parameters are listed in Table 2.

From Table 2, the comparative results of the two algorithms on the real data sets show the superior performance of the proposed RDO-DPC, which can find the centers and meaningful groups of real data sets. Especially on the Wdbc data set, DPC could not find the cluster centers or recognize meaningful groups because of its deficiency in clustering high-dimensional data. The ARI and AMI of the proposed RDO-DPC show that it performs well on high-dimensional data.

The robustness of the new algorithm is also considered. In RDO-DPC, the radius of influence $r$ is important because it determines the relative density of each sample, which affects many critical steps of the clustering. The value of $r$ is closely related to the parameter $p$, which means $p$ determines the performance of RDO-DPC.

Figure 3 shows the influence of different values of $p$ on the ARI and AMI of some synthetic and real data sets. A robust interval of $p$ from 10 to 20 is suggested for the proposed algorithm in the experiments. As shown in Figure 3, the accuracy of the new algorithm remains stable overall with respect to $p$.
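
A robustness check in the spirit of Figure 3 can be scripted by sweeping $p$ over the suggested interval, assuming X, y_true, and the number of clusters k come from one of the benchmark data sets:

    # Sweep the cut-off ratio p over the suggested robust interval 10-20
    # and record both indexes for each value.
    results = {}
    for p in range(10, 21):
        labels, _ = rdo_dpc(X, p=float(p), n_clusters=k)
        results[p] = (adjusted_rand_score(y_true, labels),
                      adjusted_mutual_info_score(y_true, labels))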

The above comparative results on synthetic and real data sets show that the proposed RDO-DPC algorithm is effective in clustering data sets with extremely large density differences among clusters and with various shapes, and that the algorithm is robust overall. On data sets with few records and a huge number of features, the new algorithm also shows a certain effectiveness, although clustering such data sets is difficult.

5. Conclusions

Based on the neighborhood information of samples, a relative density is introduced in this paper. It describes the density of each sample relative to the samples around it and takes full advantage of the information of adjacent samples, thus facilitating the effective identification of centers and the distinction of clusters of different densities. In addition, the assignment strategy of the original DPC is also improved. Experiments on different types of data sets show that the proposed algorithm performs effectively on data sets with arbitrary shapes, uneven density, and high dimensions, avoiding the mistaken assignments of the original DPC. Compared with classical DPC, the proposed RDO-DPC considers not only the local density of the samples but also their relative density, which enables RDO-DPC to cluster data sets with uneven density more effectively. For further research, the reduction of computational complexity remains an important problem.

Data Availability

The 7 real data sets used in this paper are from the UCI machine learning repository. The other data sets used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The study presented in this article was supported by the National Natural Science Foundation of China (Grant nos. 61305070 and 61703001).