Special Issue: Explainable and Reliable Machine Learning by Exploiting Large-Scale and Heterogeneous Data
Density Peaks Clustering by Zero-Pointed Samples of Regional Group Borders
The density peaks clustering (DPC) algorithm has attracted wide attention because of its advantages, including efficient determination of cluster centers, few parameters, no iterations, and no border noise. However, DPC provides neither a reliable and specific method for selecting its threshold (the cutoff distance, dc) nor an automatic strategy for selecting cluster centers. In this paper, we propose density peaks clustering by zero-pointed samples (DPC-ZPS) of regional group borders. DPC-ZPS finds the subclusters and the cluster borders by means of zero-pointed samples (ZPSs). Subclusters are then merged into individual clusters by comparing the density of edge samples. Iterating the merger yields a suitable dc and suitable cluster centers. Finally, we compare our proposal with state-of-the-art methods on public datasets. Experiments show that our algorithm determines the cutoff distance and the cluster centers automatically and accurately.
1. Introduction
Clustering, as an unsupervised learning method, divides objects (also called elements, samples, or items) into several groups according to their similarity. Unlike supervised learning [2–16], it can carry out the grouping task even when category labels are unavailable. Hence, it is widely used in image segmentation, bioinformatics, pattern recognition, data mining, and other fields [21, 22]. Representative clustering algorithms include K-means [23, 24] and fuzzy c-means [25, 26], based on partitioning; AGNES, BIRCH [28, 29], and CURE [30, 31], based on hierarchy; DBSCAN and OPTICS, based on density; STING, based on grids; the statistical clustering method CMM; and spectral clustering, based on graph theory. K-means is extremely sensitive to noise and to the selection of the initial cluster centers, and the number of clusters must be set a priori. Similarly, fuzzy c-means suffers from dependence on the initial partition, noise, and outliers. Hierarchical clustering requires the number of clusters to be determined a priori, and its effect depends on the choice of the distance measure between groups. The density-based DBSCAN and OPTICS and the grid-based clustering algorithms determine the number of clusters without manual intervention. Still, all require the preset parameters epsilon and minpts, and extensive parameter tuning is needed to obtain optimal clustering results. These two types of algorithms also generate noise around the cluster boundaries. The statistics-based CMM needs one or more suitable probability models to be selected to fit a dataset.
Clustering by fast search and find of density peaks (DPC) was published in Science. Given a preset threshold (the cutoff distance, dc), the cluster centers are selected manually from the decision graph proposed by DPC. Compared with traditional clustering algorithms, DPC has many advantages, such as higher efficiency in finding cluster centers, fewer parameters, no iteration, and no noise around the cluster border. However, the algorithm still has the following defects:
(1) The original DPC does not provide a reliable and specific method for selecting dc. Hence, the cutoff distance is computed in different ways depending on the size of the dataset, and an inappropriate dc leads to performance degradation. Moreover, dc is generally difficult to determine since the range of each attribute is unknown in most cases.
(2) It is hard to manually select the cluster centers from a dataset with a large number of clusters. Moreover, manual selection of cluster centers cannot meet the needs of systems with strict timeliness requirements.
To overcome the above defects, many scholars have proposed improvements to the original DPC algorithm. Xie et al. proposed a local density metric based on fuzzy weighted k-nearest neighbors to address the difficulty of determining dc in the DPC algorithm. Liu et al. proposed shared-nearest-neighbor-based clustering by fast search and find of density peaks (SNN-DPC), which converts the cutoff distance into a number of nearest neighbors. Mehmood presented a nonparametric method for DPC that uses heat diffusion to estimate the probability distribution of a given dataset. Guo et al. used linear regression to fit the decision values for a given dc and selected the elements above the fitted function as the central elements. Ding et al. proposed an algorithm based on the generalized extreme value (GEV) distribution to fit the decision values in descending order. To reduce the time complexity, an alternative method based on density peaks detection using the Chebyshev inequality (DPC-CI) was also given. Ni et al. presented the concepts of density path and density gap, as well as a new threshold called dc percentage. The density gaps calculated for several dc percentages are drawn in a summary graph of density gaps. Instead of using the decision graph, the appropriate threshold value is determined by manually observing the summary graph. The algorithm is able to reduce the negative impact of an inappropriate dc on the clustering result.
However, in [39–41, 44–47], it is necessary to select the centers or to observe the summary graph of density gaps manually. Guo et al. and Ding et al. proposed strategies of automatic center selection for the original DPC, but these depend on an appropriate dc being given. Xie et al. and Liu et al., in turn, showed that it is challenging to select a proper dc.
In this paper, we propose density peaks clustering by zero-pointed samples (DPC-ZPS) of regional group borders. Our method not only determines the suitable range of dc and the center of each cluster but also reduces the negative impact caused by manual participation in the clustering process. The main innovations and contributions of our algorithm are as follows:
(1) To merge the local clusters into individual clusters, we present a cluster merging strategy based on comparing the density of the elements of two cluster borders.
(2) To find the border of each cluster, we propose two concepts: the neighboring cluster border (NCB) and the pure cluster border (PCB).
(3) To determine the correct number of clusters, we provide an iterative procedure that converges dc to a suitable value.
The remainder of this paper is organized as follows: Section 2 describes the details of the original DPC and of our proposal; Section 3 presents the clustering results of our method and of related works and discusses the impact and value range of the parameter of DPC-ZPS; the final section summarizes the contributions and features of this paper and puts forward future work.
2. Materials and Methods
2.1. The Original DPC Algorithm
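Consider a given dataset $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i$ is the $i$-th element and $n$ is the number of elements.
DPC is based on the assumption that each cluster center has a higher local density than the surrounding elements and that cluster centers are far from each other. Centers are manually selected using a decision graph with the local density $\rho$ as the abscissa and $\delta$ as the ordinate. The DPC algorithm provides two methods for calculating the local density $\rho_i$ of each element $x_i$ of the given dataset, expressed in equations (1) and (2); $\delta_i$ is calculated by equation (3):
$$\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c), \qquad \chi(x) = \begin{cases} 1, & x < 0, \\ 0, & x \geq 0, \end{cases} \tag{1}$$
$$\rho_i = \sum_{j \neq i} \exp\left(-\left(\frac{d_{ij}}{d_c}\right)^2\right), \tag{2}$$
$$\delta_i = \begin{cases} \min\limits_{j:\, \rho_j > \rho_i} d_{ij}, & \text{if some } \rho_j > \rho_i, \\ \max\limits_{j} d_{ij}, & \text{otherwise}, \end{cases} \tag{3}$$
where $d_{ij}$ is the Euclidean distance between elements $x_i$ and $x_j$ and $d_c$ is the cutoff distance. As shown in equation (3), $\delta_i$ is the minimum distance between element $x_i$ and the elements $x_j$ whose density is higher than $\rho_i$. Moreover, for the element $x_i$ with the highest density, $\delta_i$ is the maximum distance between $x_i$ and any other element $x_j$.
Meanwhile, to simplify the selection of centers, DPC provides the decision value $\gamma_i$ as follows:
$$\gamma_i = \rho_i \delta_i. \tag{4}$$
After the cluster centers are determined, each of the remaining samples is assigned to the same cluster as its nearest denser sample. This assignment is recorded in the process of calculating $\delta$.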
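The quantities of the original DPC can be illustrated with a minimal sketch. It assumes the Gaussian-kernel density of equation (2) and uses hypothetical names (`dpc_quantities`, `nearest_denser`) that do not come from the paper; the value -1 is used here, as an assumption, to mark the element that has no denser neighbor.

```python
import math

def dpc_quantities(points, dc):
    """Return (rho, delta, nearest_denser, gamma) for a list of coordinate tuples."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Local density with a Gaussian kernel, following equation (2).
    rho = [sum(math.exp(-(dist[i][j] / dc) ** 2) for j in range(n) if j != i)
           for i in range(n)]
    delta = [0.0] * n
    nearest_denser = [-1] * n  # -1 (assumed marker): no denser element exists
    for i in range(n):
        denser = [j for j in range(n) if rho[j] > rho[i]]
        if denser:
            # Equation (3): distance to the nearest element of higher density.
            j_best = min(denser, key=lambda j: dist[i][j])
            delta[i] = dist[i][j_best]
            nearest_denser[i] = j_best
        else:
            # Densest element: the maximum distance to any other element.
            delta[i] = max(dist[i])
    # Decision value: product of local density and delta.
    gamma = [r * d for r, d in zip(rho, delta)]
    return rho, delta, nearest_denser, gamma

points = [(0, 0), (0, 1), (1, 0), (0.5, 0.5), (10, 10), (10, 11), (11, 10)]
rho, delta, nearest_denser, gamma = dpc_quantities(points, dc=2.0)
top_two = sorted(range(len(points)), key=lambda i: gamma[i], reverse=True)[:2]
```

In this toy dataset the two elements with the largest decision values are the dense middle point of the first group and the corner point of the second group, which is the behavior the decision graph relies on.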
2.2. Our Method
The main process of DPC-ZPS is to select multiple distances as dc at equal intervals and to calculate the corresponding decision values. Then, within each group of decision values, the elements whose decision value is greater than the sum of the mean and the standard deviation of the decision values are selected as potential centers. Over the multiple groups of dc, the iterative merging process gradually brings the number of clusters close to the real value.
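The selection of potential centers described above can be sketched as follows. This is only an illustration under stated assumptions: the population standard deviation is used (the text does not say which variant is meant), dc candidates are taken at equal index intervals from an ascending list of distances, and the function names are hypothetical.

```python
import statistics

def potential_centers(gamma):
    """Indices whose decision value exceeds mean + standard deviation."""
    # Assumption: population standard deviation (pstdev).
    threshold = statistics.mean(gamma) + statistics.pstdev(gamma)
    return [i for i, g in enumerate(gamma) if g > threshold]

def dc_candidates(sorted_distances, groups):
    """Pick `groups` cutoff distances at equal index intervals (assumed reading)."""
    step = max(1, len(sorted_distances) // groups)
    return sorted_distances[step - 1::step][:groups]

centers = potential_centers([10, 9, 1, 1, 1, 1, 1, 1, 1, 1])
candidates = dc_candidates([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], groups=5)
```

Each candidate dc yields its own list of decision values and therefore its own set of potential centers, which is what the later iteration over the dc range compares.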
2.2.1. Related Concepts
Definition 1 (zero-pointed sample). In the assignment, each sample is assigned to its nearest denser sample. A zero-pointed sample (ZPS) is a sample without any subordinates, that is, one to which no other sample is assigned.
When dc is fixed, we use an array of n units, initialized to zero, to store the assignment process. The indexes of the array represent the sequence numbers of the samples. The unit of each sample stores the sequence number of its nearest sample with higher density. Cluster centers and potential cluster centers are not assigned, so their units remain zero. Subsequently, the array is broken at the zero units; trees are thereby obtained, and each tree is a cluster.
Definition 2 (initial border). In a cluster tree, the initial border (IB) consists of all leaf nodes and their parent nodes.
As shown in Figure 1, elements 1, 7, and 8 are zero-pointed leaf nodes because they are less dense than their neighboring elements. Elements 3 and 32 are inner elements, but they are still zero-pointed since they have no adjacent samples. There are also the assignment paths 10 ⟶ 11 ⟶ 13 and 12 ⟶ 11 ⟶ 13.
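Definitions 1 and 2 can be illustrated with a small sketch. It is written under assumptions that differ from the text in one detail: indexes are 0-based, and -1 (rather than a zero unit) marks an unassigned center. The function names are hypothetical.

```python
def zero_pointed_samples(parent):
    """Samples that no other sample is assigned to (no subordinates)."""
    pointed = {p for p in parent if p != -1}
    return [i for i in range(len(parent)) if i not in pointed]

def initial_border(parent):
    """Leaf nodes of the cluster trees together with their parent nodes."""
    leaves = zero_pointed_samples(parent)
    fathers = {parent[i] for i in leaves if parent[i] != -1}
    return sorted(set(leaves) | fathers)

def cluster_roots(parent):
    """Follow each assignment path up to its unassigned root (the center)."""
    roots = []
    for i in range(len(parent)):
        node = i
        while parent[node] != -1:
            node = parent[node]
        roots.append(node)
    return roots

# Two paths that join, analogous to 10 -> 11 -> 13 and 12 -> 11 -> 13:
# sample 0 -> 1 -> 3 and sample 2 -> 1 -> 3, with sample 3 as the center.
parent = [1, 3, 1, -1]
zps = zero_pointed_samples(parent)
border = initial_border(parent)
roots = cluster_roots(parent)
```

Here samples 0 and 2 are zero-pointed, the initial border additionally contains their common parent 1, and every sample belongs to the tree rooted at sample 3.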
Definition 3. (neighboring cluster border). clusters in a dataset are denoted as , where is the number of clusters in and , where , satisfies the following equation, and then :where is the distance between and , is an array storing all of cluster pair and in descending order, represents the distance, DF is the depth factor of the neighboring cluster-border, its range is , and is the integer part of .
Neighboring cluster border (NCB) consists of all , and it is expressed as follows, where is to delete the symmetrical cluster pairs:Two clusters that are far from each other attain a nonempty NCB only with a large DF, and the farther apart the two clusters are, the larger the DF required for a nonempty NCB. For neighboring subclusters, in contrast, a relatively small DF suffices. In Section 3, DF is compared with the parameters of DPC, and its impact on the clustering result is discussed.
As shown in Figure 1, there are two clusters, A and B, in a dataset, and cluster B is mistakenly divided into B1, B2, and B3. The elements in region I (7 and 8) and in region II (16, 17, 18, 19, 20, and 21) are marked with red wireframes; they belong to the NCB.
Definition 4 (pure cluster border). In a cluster, the pure cluster border (PCB) is defined by the following equation:Correspondingly, elements 1, 2, 4, 5, 6, 9, 10, 11, 12, 22, 23, 24, 29, 30, and 31 belong to the pure cluster border (PCB) of their respective clusters. However, as shown in Figure 2, elements 3 and 32 are zero-pointed because they are relatively isolated, yet their density is much larger than that of the other ZPSs.
To filter out interior and isolated ZPS, we use the three-point method in fuzzy math to measure the three memberships of the elements in the , including “low density,” “medium density,” and “high density.” In order to prevent the extreme value of elements density from affecting the membership value, we select the normal distribution function as the membership function, and three functions are expressed as follows:where is the standard deviation of the density values of all elements in .
In Figure 3, when , the membership of the element is smaller acute-angle border element than a higher density. For example, element 1 is an acute-angular border element, and elements 2, 12, and 23 belong to obtuse-angular border elements. When , the degrees of two memberships are equal. When , the higher the element density is, the smaller the membership degree of the element is, which is an obtuse-border element, and the higher the membership degree of the independent objective within the cluster. When, the two memberships are equal.
2.2.2. Merger Strategy
If a real cluster is mistakenly divided into several subclusters, there are some zero-pointed elements in the NCB, since the NCB is not only an inner part of the actual group but also the border of the subclusters. Because zero-pointed samples aggregate in the NCB, the density of NCB elements is smaller than that of other inner parts, which corresponds to in Figure 3. Meanwhile, the density of PCB is in . We therefore propose a merging strategy based on comparing the density values of the elements of the NCB and the PCB.
If ∃ satisfies and , where and are equal to respective , then are merged; namely, if the density of the elements of the NCB is not more prominent than but more significant than , they must be the inner elements of the real cluster.
2.2.3. The Iteration Strategy
The $\delta$ value of each center depends on the minimum distance between the center and the elements with higher density. However, when dc is small and far from its suitable range, the algorithm does not measure the density of each sample accurately. As a result, in some clusters, local center elements that have a high local density but lie far from the proper center of their group are selected, and their values are much larger than those of noncenter elements. As dc increases, the density measurement gradually becomes more reliable. The DPC-ZPS algorithm sequentially filters out the fake centers with the weakest central attributes until . When dc is larger than the largest value of the suitable range, the clusters with smaller distribution areas are filtered out; that is, the threshold selects no center for them. When dc continues to increase, fake centers appear again in the groups with larger distribution areas. Essentially, increasing dc gradually shifts the density metric from measuring the local density of elements to measuring their global density. This process is illustrated in Figure 4.
Based on the above analysis, we propose an automatic iteration strategy as follows:
Step 1: as shown in Figure 4, after counting the cluster center combination and the number of centers for each dc, the algorithm determines the min-range and divides the rest into the L-range and the R-range. If there is more than one min-range, DPC-ZPS chooses the biggest one to separate the dc range.
Step 2: the algorithm finds the max L-num and records its center combination as well as the sequence number of its dc.
Step 3: according to the center combination and dc, each noncenter element is assigned to the closest element among the denser elements.
Step 4: merge() is executed on the clusters of the clustering result from step 3.
Step 5: if the number of clusters does not change after merge(), the clustering result and the number of clusters are stored; if the number of groups is reduced from merged num(r) to merged num(r+1), the third to fifth steps are repeated with the center combination corresponding to merged num(r+1).
Step 6: the second to fifth steps are performed in the R-range after finding the max R-num.
Step 7: the final result is the larger of the final numbers of clusters in the two subranges, together with its clustering result stored in step 5.
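Steps 1 and 2 can be sketched under one possible reading of the text, which is an assumption rather than the authors' stated procedure: the min-range is taken to be a run of consecutive dc indexes whose center count equals the global minimum, "the biggest one" is taken to mean the longest such run, and ties for the maximum count are resolved by the first occurrence. All names are hypothetical.

```python
def split_dc_range(center_counts):
    """Split dc indexes into (L-range, min-range, R-range) around the longest minimum run."""
    counts = list(center_counts)
    lowest = min(counts)
    runs, start = [], None
    # A trailing sentinel closes a run that reaches the end of the list.
    for i, c in enumerate(counts + [None]):
        if c == lowest and start is None:
            start = i
        elif c != lowest and start is not None:
            runs.append((start, i - 1))
            start = None
    first, last = max(runs, key=lambda r: r[1] - r[0])
    left = list(range(0, first))
    right = list(range(last + 1, len(counts)))
    return left, (first, last), right

def max_center_num(center_counts, index_range):
    """Return (dc index, center count) of the largest count inside a subrange."""
    if not index_range:
        return None
    best = max(index_range, key=lambda i: center_counts[i])
    return best, center_counts[best]

counts = [6, 9, 9, 5, 2, 2, 2, 4, 3, 2]  # illustrative center counts per dc
left, min_range, right = split_dc_range(counts)
max_l = max_center_num(counts, left)
max_r = max_center_num(counts, right)
```

The merging and reassignment of steps 3 to 5 would then start from the center combination recorded for `max_l`, and step 6 would repeat the procedure from `max_r`.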
2.2.4. Time Complexity Analysis
Suppose that the number of samples in a dataset is , the max center-num is , the number of pairwise points in SNB is , the max center-num in dc domain is , and the number of zero-pointed samples is . Just like DPC, our method needs time complexity to calculate the distance matrix D. We search the nearest denser neighbor for each sample via a K-D tree. And the complexity of building the K-D tree is . Searching nearest neighbor queries has an average running time of , and hence, for n groups of dc, the complexity of searching nearest neighbor of each sample queries is . For the determination of NCB, we need a matrix M, and the rows and columns represent the samples of two clusters. In the matrix M, each cell stores the distance from matrix D, and then, all distances in the M are sort in ascending order to find the NCB by equation (5). Therefore, the time complexity of NCB depends on the assignment to M, the times of assignment of the matrix M are, the average cost is , and the total time complexity is . How many times the operation for PCB is to be done depends on the number of zero-pointed samples, so the time complexity is less than . In the merger process, the density of each pairwise points is compared, and hence, the complexity of the merger depends on the number of pairwise points in SNB and is , where , and only when DF = 1, . However, the reasonable range of DF is (0, 0.05], which will be discussed in Section 3.3. Therefore, the time complexity of the merger is far less than . And iteration is based on the max center-num, and . We can conclude that the time complexity of the entire algorithm is .
3. Results and Discussion
We tested our algorithm and several related works, including PPC, DPC, DBSCAN, OPTICS, and AP, on several datasets. These datasets have different numbers of samples and simulate different element distributions. The detailed information is shown in Table 1. Like DPC, AP (affinity propagation) is an advanced clustering algorithm published in Science. The basic idea of the AP algorithm is to treat all data points as potential cluster centers (called exemplars), then connect the data points in pairs to form a network (the similarity matrix), and finally transmit the information (responsibility and availability) along each edge of the network to calculate the cluster center of each sample.
3.1. Evaluation Criteria, Parameters of Each Algorithm, and Code Sources and Preprocessing
3.1.1. Evaluation Criteria
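The ARI formula is as follows:
$$\mathrm{ARI} = \frac{\mathrm{RI} - E[\mathrm{RI}]}{\max(\mathrm{RI}) - E[\mathrm{RI}]},$$
where $E[\mathrm{RI}]$ represents the expectation of RI. RI is calculated as follows:
$$\mathrm{RI} = \frac{\mathrm{TP} + \mathrm{TN}}{C_n^2},$$
where TP is the number of true positives, TN is the number of true negatives, and $C_n^2 = n(n-1)/2$ is the total number of sample pairs in a dataset containing n samples.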
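As an illustration of these two criteria, the following sketch computes RI and ARI by pair counting. It uses the standard contingency-table form of the adjusted Rand index, which is equivalent to adjusting the pair-counting index for chance; the function names are hypothetical, and returning 1.0 in the degenerate case is an assumption.

```python
from collections import Counter
from math import comb

def pair_counts(labels_true, labels_pred):
    """Pairs grouped together in both labelings, in the truth, and in the prediction."""
    both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    return both, true, pred, comb(len(labels_true), 2)

def rand_index(labels_true, labels_pred):
    tp, true, pred, pairs = pair_counts(labels_true, labels_pred)
    tn = pairs - true - pred + tp  # pairs separated in both labelings
    return (tp + tn) / pairs

def adjusted_rand_index(labels_true, labels_pred):
    tp, true, pred, pairs = pair_counts(labels_true, labels_pred)
    expected = true * pred / pairs   # chance-level agreement
    maximum = (true + pred) / 2      # largest attainable agreement
    if maximum == expected:          # degenerate case (assumed convention)
        return 1.0
    return (tp - expected) / (maximum - expected)

ri = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
ari = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Both functions depend only on which samples are grouped together, so renaming the cluster labels does not change the scores.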
The AMI formula is shown as follows:where , , and represents the expectations of ; is expressed as follows:where , , , , and . and represent two allocation methods for a dataset containing n elements, and and are clusters. In experimental verification, let and be the original labels and the clustering results of an algorithm, respectively. The value ranges of the two evaluation criteria are , and “1” denotes the best experimental result.
3.1.2. Parameters of Each Algorithm
DF, the parameter of our proposal, was set from 0.01 to 0.05 at intervals of 0.005. At equal intervals, we choose dc from all in ascending order, where is the number of samples of a given dataset. When performing the DBSCAN and OPTICS experiments, we took “” as the step and as the initial value to obtain 100 epsilons, let minpts range from 1 to 50, and chose the best result among the five thousand clustering results. In the AP experiment, we set the initial value of the only parameter of the AP algorithm, “preference,” to 1.5 times the maximum value of the similarity matrix and reduced it by 0.03% in each cycle; the optimal result was selected. The specific settings are shown in Table 2, where the DPC algorithm parameter is a suitable dc, and the PPC algorithm parameter is dc_percent. The results and parameters of DPC and PPC are obtained from .
3.1.3. Code Sources and Preprocessing
To ensure that the experimental comparison is valid, we processed each dataset according to the method described in  and normalized the low-dimensional datasets and the DIM512 dataset. To prepare the Olivetti faces dataset, we first scaled each image (originally 92 × 112) to a smaller size of 15 × 15 and then performed principal component analysis (PCA), keeping the components up to a cumulative contribution rate of 90%. The normalization formula is as follows:
$$x' = \frac{x - \min}{\max - \min},$$
where $x$ represents the value of a feature of a sample in the dataset and $\max$ and $\min$ represent the maximum and minimum values of that feature in the dataset, respectively.
The DBSCAN code is the built-in function of Matlab 2019a. The OPTICS code is from the pyclustering library, the AP code is from the sklearn library, and we provide the DPC-ZPS code. We executed all methods on a personal computer with Windows 10, an Intel(R) Core(TM) i7-8750H, 16 GB of memory, and Matlab 2019a or Python 3.0.
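The per-feature min-max normalization can be sketched as below. The handling of a constant feature (mapped to 0.0) is an assumption made here to avoid division by zero, and the function name is hypothetical.

```python
def min_max_normalize(rows):
    """Scale every feature (column) of a list of samples into [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        # Assumption: a constant feature is mapped to 0.0.
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

normalized = min_max_normalize([[1, 10, 7], [2, 20, 7], [3, 30, 7]])
```

After this step all features share the same range, so no single attribute dominates the Euclidean distances used by the density estimate.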
3.2. Experimental Results and Analyses
As shown in Table 3, the performance of DPC-ZPS is better than that of the other control groups. Next, we analyze the specific iterative process of our proposal with the help of Figures 5–9. Each of Figures 5–8 consists of three subgraphs. The left subgraph shows the cutoff distance and the number of cluster centers determined by the DPC-ZPS algorithm, and the red line marks the suitable range of dc. The middle subgraph shows the clustering results of DPC-ZPS, and the right subgraph shows the category labels. Figure 9 shows the clustering results of our method and of the original DPC on the Olivetti faces dataset.
As shown in Figure 5, our algorithm selects seven appropriate centers and successfully converges dc to the appropriate value interval through iteration. In the iterative process, the change of center-num in the L-range is “14-8-7-7.” The number of centers then remains unchanged, which means that the seven clusters are relatively independent. The final center-num of the R-range is “4,” so the clustering result of the L-range is selected as the final result.
In Figure 10(a), there is one min-range, and its center-num is one. In the L-range, the iteration process is “6-2-2,” and that of the R-range is “2-1-1.” Therefore, the final clustering result lies in the L-range.
In the spiral dataset, the three spiral clusters are far from each other. Hence, in Figure 6(a), there are three suitable cluster centers over most of the dc range, and there is no R-range. Our method successfully merges all subclusters into three correct groups, which is consistent with Figure 6(c).
In the L-range of R15, the biggest center-num is 15, and no merge happens, while the last center-num of the R-range is 14. Hence, the actual clustering result is determined and is shown in Figure 7(b). For D31, the center-num of the L-range changes from 33 to 31. The final center-num of the R-range is close to the minimum in Figure 8(a). Hence, the final number of clusters is thirty-one.
The Olivetti faces dataset contains 40 (persons) × 10 (photos) images and is widely used in machine learning to test various algorithms. As shown in Table 3, the ARI of DPC-ZPS on this dataset is better than that of the other algorithms. Figure 9 shows the clustering results of DPC-ZPS and DPC. An image marked with a white dot in the upper right corner is a cluster center, and gray photos indicate that there are fewer than three elements in the cluster.
In Figure 9(b), there are no centers in the , , , , , and group photos, which suggests that the traditional DPC algorithm may also incorrectly merge multiple clusters into one cluster. However, as shown in Figure 9, there are only the and group photos without centers. This demonstrates that DPC-ZPS is less likely to merge clusters incorrectly.
Xie et al. [39, 40, 44] showed that the previously provided selection rule for dc cannot suit various datasets. Table 2 shows that the values of dc and dc_percentage differ from dataset to dataset, which increases the cost and difficulty of tuning, whereas our parameter is equal to 0.02 in six of the seven tested datasets.
The depth factor, the only parameter of the DPC-ZPS algorithm, is used in equation (6) to control the depth of the border between two adjacent clusters. When DF = 1, the neighboring cluster borders contain all the elements of the two clusters. However, the border should be composed of elements with a shallow depth, so the parameter takes small values in the different datasets. Therefore, [0.005, 0.05] is a reasonable range for all of the tested datasets. As shown in Figure 11, the results for most datasets fluctuate severely before DF = 0.015, which is only a small part of the whole range; after that, our algorithm is not sensitive to changes in the parameter. In addition, compared with the DPC and PPC algorithms, the DPC-ZPS algorithm does not require human intervention at any point in the clustering process, which avoids many defects caused by manual operation.
4. Conclusions
In this paper, to overcome the defects of manual operation and the difficulty of determining a suitable dc, we proposed density peaks clustering by zero-pointed samples (DPC-ZPS) of regional group borders. DPC-ZPS is based on in-depth analyses of both the rule by which the centers change with dc and the relationship between the densities of the NCB and the PCB. Our proposal covers two main parts: the merger strategy for subclusters based on the cluster borders and the iteration strategy. The merger strategy adaptively determines the merging threshold for each pair of local clusters, and the iterative process automatically finds a suitable range of dc. Experimental results indicate that our method is more accurate without manual operation and has a more reasonable and less sensitive range of threshold values. In future work, we will use natural nearest neighbors to optimize the local density measurement and the assignment process.
Data Availability
All datasets in this paper are from UCI, where they are accessible to all readers.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (61972056, 61772454, 61402053, and 61981340416), the Natural Science Foundation of Hunan Province of China (2020JJ4623), the Scientific Research Fund of Hunan Provincial Education Department (17A007, 19C0028, and 19B005), the Changsha Science and Technology Planning (KQ1703018, KQ1706064, KQ1703018-01, and KQ1703018-04), the Junior Faculty Development Program Project of Changsha University of Science and Technology (2019QJCZ011), the “Double First-class” International Cooperation and Development Scientific Research Project of Changsha University of Science and Technology (2019IC34), the Practical Innovation and Entrepreneurship Ability Improvement Plan for Professional Degree Postgraduate of Changsha University of Science and Technology (SJCX202072), the Postgraduate Training Innovation Base Construction Project of Hunan Province (2019-248-51), and the Beidou Micro Project of Hunan Provincial Education Department (XJT No.149).
J. Wang, Y. Yang, T. Wang, R. S. Sherratt, and J. Zhang, “Big data service architecture: a survey,” Journal of Internet Technology, vol. 21, no. 2, pp. 393–405, 2020.
J. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, Oakland, CA, USA, 1967.
M. Ester, H. P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 226–231, Portland, OR, USA, 1996.
M. Ankerst, M. M. Breunig, H. P. Kriegel, and J. Sander, “Optics: ordering points to identify the clustering structure,” in Proceedings of the ACM Sigmod Record, pp. 49–60, Philadelphia, PA, USA, 1999.
W. Wang, J. Yang, and R. Muntz, “Sting: a statistical information grid approach to spatial data mining,” in Proceedings of the 23rd International Conference on Very Large Data Bases, pp. 186–195, Athens, Greece, August 1997.
G. McLachlan and D. Peel, “Finite mixture models,” in Encyclopedia of Autism Spectrum Disorders, F. R. Volkmar, Ed., p. 1296, Springer, New York, NY, USA, 1st edition, 2013.
I. Anderson and R. Diestel, “Graph-theory,” The Mathematical Gazette, vol. 85, no. 502, p. 176, 2001.
P. Guo, X. Wang, Y. Wang, Y. Chen, and Y. Zhang, “Research on automatic determining clustering centers algorithm based on linear regression analysis,” in Proceedings of the 2017 2nd International Conference on Image, Vision and Computing (ICIVC), pp. 1016–1023, Chengdu, China, June 2017.
N. X. Vinh, J. Epps, and J. Bailey, “Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance,” Journal of Machine Learning Research, vol. 11, pp. 2837–2854, 2010.