Abstract

Density peaks clustering requires the manual selection of cluster centers. To address this, this paper proposes a fast new clustering method that selects cluster centers automatically. Firstly, our method groups the data and marks each group as a core or boundary group according to its density. Secondly, it determines clusters by iteratively merging pairs of core groups whose distance is less than a threshold and selects the cluster centers at the densest position in each cluster. Finally, it assigns boundary groups to the cluster of the nearest cluster center. Experimental results show that our method eliminates the need for manual selection of cluster centers and improves clustering efficiency.

1. Introduction

Clustering [1–4] is an unsupervised or semisupervised learning method. It aims to divide samples into different clusters according to the similarity between samples, so that samples in the same cluster are as similar as possible and samples in different clusters are as dissimilar as possible. Clustering has a wide range of applications, such as image analysis [5], pattern recognition [6], data analysis [7], and wireless sensor networks [8]. Many clustering methods have emerged for these applications, such as K-means [9] and fuzzy c-means (FCM) [10], which are only effective for spherical data and perform poorly on nonspherical data. Density-based clustering methods [11] such as density peaks clustering (DPC) [12] do not have this problem. However, DPC still has some drawbacks, so improving density-based clustering methods is of great significance.

To address the problem that DPC requires manual participation in selecting cluster centers, Flores et al. [13] proposed a density peaks clustering method with gap-based automatic center detection. This method calculates a threshold to distinguish cluster center samples from noncenter samples. Lv et al. [14] proposed a density peak clustering algorithm based on shared nearest neighbors and adaptive cluster centers, which selects the cluster centers by narrowing the search range. However, the cluster centers obtained by these methods are the same as those of DPC. In other words, if DPC cannot achieve good results on some data, these methods cannot either.

To address the high time complexity of DPC, Lu et al. [15] proposed a fast distributed density peaks clustering method based on the Z-value index. This method effectively reduces the $O(n^2)$ time complexity of DPC, but the clustering effect is significantly reduced because the data need to be reduced from multiple dimensions to one dimension. Xu et al. [16] proposed a fast density peaks clustering method based on sparse search. This method preserves the clustering effect and effectively reduces the time complexity below $O(n^2)$, but it still needs the cluster centers to be selected manually. Although these methods improve the efficiency of DPC, they also have shortcomings.

Recently, many methods have been proposed to improve the precision of DPC [17–20]. These methods are more advantageous for data with considerable noise and complex structures. In [17], dense cores are introduced as representatives of the original data to reduce the runtime, and a density threshold is used to eliminate interference from noise samples; however, it has difficulty processing high-dimensional data. In [18], local density peaks are used to construct a minimum spanning tree to avoid noise and reduce the runtime; however, it cannot automatically determine the number of clusters. In [19], a local-cores representation of the dataset is introduced, which avoids noise interference and reduces the runtime by calculating only the graph distances between local cores; nevertheless, it needs to build a decision diagram. In [20], natural neighbors are introduced to find local representatives, and calculating an adaptive distance between local representatives effectively reduces the runtime; nevertheless, it cannot automatically determine the cluster centers. Du et al. [21] proposed a k-nearest-neighbor DPC method based on principal component analysis, which uses k-nearest neighbors to calculate the sample density and principal component analysis to process high-dimensional data; nevertheless, it is not effective on manifold data.

This paper presents a new density peaks clustering method (GDPC) that can quickly and automatically determine the cluster centers by grouping. Our method groups the dataset so that the samples in each group belong to the same category while keeping the number of groups as small as possible. In this way, we no longer need to calculate the similarity between all samples but only between groups. At the same time, we observe that cluster centers usually appear at the densest position of each cluster. Because the density decreases gradually from the cluster center to the cluster edge, we can divide each cluster into a core region and a boundary region. The distance between two core regions is considerable, which allows us to find the cluster centers.

The rest of this paper is organized as follows: Section 2 reviews DPC. Section 3 introduces our proposed method. Section 4 presents experiments on the effectiveness and efficiency of our method on synthetic and real-world datasets. Finally, we summarize the paper and discuss remaining challenges.

2. DPC

2.1. Quantities

DPC uses the density of each sample ($\rho$) and the distance from each sample to its nearest higher-density sample ($\delta$) when finding the cluster centers and determining the sample labels.

The density of sample $i$ is the number of other samples within its cutoff distance ($d_c$). It is defined as

$\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c)$,

where $d_{ij}$ is the distance between sample $i$ and sample $j$, and $\chi(x) = 1$ if $x < 0$; otherwise, $\chi(x) = 0$.

The distance from sample $i$ to its nearest higher-density sample is defined as

$\delta_i = \min_{j:\, \rho_j > \rho_i} d_{ij}$.

For the sample with the highest density, $\delta_i = \max_{j} d_{ij}$.

2.2. Similarity Matrix

When calculating $\rho$ and $\delta$ and determining the sample labels, the distances between samples are needed many times. To improve efficiency, DPC calculates the distance between each pair of samples only once and stores it, so it does not need to be recalculated when it is used again. The similarity matrix is defined as

$D = [d_{ij}]_{n \times n}$,

where $d_{ij}$ is the distance between sample $i$ and sample $j$, and $n$ is the number of samples.
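A minimal sketch of these quantities, assuming a NumPy array X of shape (n, d), Euclidean distances, and a user-chosen cutoff dc; the function name and the returned layout are our own, not the paper's.

import numpy as np

def dpc_quantities(X, dc):
    """Return the distance matrix D, densities rho, distances delta, and
    each sample's nearest higher-density neighbor."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))      # similarity matrix d_ij

    # rho_i: number of other samples within the cutoff distance dc.
    rho = (D < dc).sum(axis=1) - 1             # subtract the sample itself

    # delta_i: distance to the nearest sample with higher density;
    # for the densest sample, the maximum distance to any other sample.
    n = len(X)
    delta = np.empty(n)
    nearest_higher = np.arange(n)              # filled below, used later for label assignment
    order = np.argsort(-rho)                   # sample indices by decreasing density
    for pos, i in enumerate(order):
        if pos == 0:
            delta[i] = D[i].max()
        else:
            higher = order[:pos]               # samples ranked denser than i
            j = higher[np.argmin(D[i, higher])]
            delta[i], nearest_higher[i] = D[i, j], j
    return D, rho, delta, nearest_higher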

2.3. Process

After calculating $\rho$ and $\delta$ with the above method, we use $\rho$ as the abscissa and $\delta$ as the ordinate to draw a decision diagram, as shown in Figure 1. Figure 1(a) displays the distribution of samples, and Figure 1(b) is the corresponding decision diagram. The samples in the upper right corner of the decision diagram are the cluster centers.
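A tiny sketch of drawing the decision diagram, assuming rho and delta from the hypothetical dpc_quantities() sketch above.

import matplotlib.pyplot as plt

def decision_diagram(rho, delta):
    plt.scatter(rho, delta, s=10)
    plt.xlabel("rho (local density)")
    plt.ylabel("delta (distance to nearest higher-density sample)")
    plt.title("Decision diagram: candidate centers sit in the upper right corner")
    plt.show()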

After the cluster centers are selected, each remaining sample receives the same label as its nearest higher-density sample.
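A short sketch of this assignment rule, reusing the hypothetical dpc_quantities() output; centers is a list of sample indices chosen from the decision diagram (the densest sample is expected to be one of them).

import numpy as np

def assign_labels(rho, nearest_higher, centers):
    labels = np.full(len(rho), -1)
    for k, c in enumerate(centers):
        labels[c] = k                          # each center starts its own cluster
    # Visit samples by decreasing density so the nearest higher-density
    # neighbor is always labeled before the current sample.
    for i in np.argsort(-rho):
        if labels[i] == -1:
            labels[i] = labels[nearest_higher[i]]
    return labels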

2.4. Time Complexity

The time complexity of DPC has three main parts: (a) calculating $\rho$ requires $O(n^2)$ time; (b) calculating $\delta$ requires $O(n^2)$ time; and (c) determining the labels of the noncenter samples requires $O(n)$ time, where $n$ is the size of the dataset. Based on the above three parts, the time complexity of DPC is $O(n^2)$.

3. The Proposed Method

To eliminate the manual intervention required by DPC to select cluster centers and to reduce its high time complexity, we propose GDPC. In GDPC, we isolate the core regions of different clusters to find the cluster centers, and we use a grouping strategy to reduce the amount of computation.

3.1. Grouping

Provided that all samples in each group have the same label, the fewer the groups, the more efficient our method becomes.

We define the distance between an ungrouped sample and a group as the distance between that sample and the first sample of the group. We assign the sample to a group if this distance is less than the grouping radius; if no group satisfies this condition, we create a new group with the sample as its first sample. Algorithm 1 explains the grouping method in detail; a Python sketch is given after the pseudocode.

These groups have the following characteristics:
(1) The distance between the first sample of a group and every other sample in that group is less than the grouping radius.
(2) The same sample does not appear in two different groups.
(3) All samples in a group belong to the same cluster as the group's first sample.

To distinguish whether a group is a core group or a boundary group, we need the group density, which is defined by equation (4).

(1) Input: data set, grouping radius
(2) Output: group set
(3) Create an empty group set
(4) For each sample in the data set do
(5)  If there exists a group whose first sample is within the grouping radius of the current sample then
(6)   Add the current sample to that group
(7)  Else
(8)   Create a new group with the current sample as its first sample
(9)  End if
(10) End for
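Below is a minimal Python sketch of the grouping step in Algorithm 1, assuming a NumPy array X and a grouping radius r; representing a group as a list of sample indices whose first element is its key sample is our own choice.

import numpy as np

def group_data(X, r):
    """Return a list of groups (lists of sample indices, key sample first)."""
    groups = []
    for i, x in enumerate(X):
        for g in groups:
            if np.linalg.norm(x - X[g[0]]) < r:   # within the radius of the key sample
                g.append(i)
                break
        else:
            groups.append([i])                    # no suitable group: i becomes a key sample
    return groups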

Definition 1. (zero density group). A group that contains only one sample. If a group is a zero density group, its density is 0.
Based on experience, we first remove the zero density groups and then sort the remaining groups in descending order of density. Finally, if the density of a group is greater than or equal to the density threshold, the group is defined as a core group; otherwise, it is defined as a boundary group. The density threshold is defined by equation (5), which depends on the number of nonzero density groups and on the density of each group after descending sorting.
As shown in Table 1, there are 7 nonzero density groups, and the density threshold is calculated by equation (5). Because the densities of groups 0, 1, 2, 3, 4, and 5 are greater than or equal to the threshold, they are core groups. The densities of groups 6 and 7 are less than the threshold, so they are boundary groups. The density of group 7 is 0, so it is a zero density group.
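The split itself is a simple filter once the group densities and the threshold from equation (5) are known. A short sketch, treating the threshold theta as a precomputed value and the group densities as given; all names are our own.

def split_groups(densities, theta):
    """Return lists of core, boundary, and zero-density group indices."""
    core, boundary, zero = [], [], []
    for gid, rho_g in enumerate(densities):
        if rho_g == 0:
            zero.append(gid)                   # single-sample groups (Definition 1)
        elif rho_g >= theta:
            core.append(gid)                   # density at or above the threshold
        else:
            boundary.append(gid)               # nonzero density below the threshold
    return core, boundary, zero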

3.2. Select Cluster Centers and Determine Core Group Sample Labels

After determining the core group set, we find the core region of each cluster through the spatial relationships among core groups. Then, we select the cluster center of each cluster at its densest position.

We propose the following definitions.

Definition 2. (key sample). The first sample in each group.

Definition 3. (transitivity). If the distance between any sample of one group and the key sample of another group is less than the grouping radius, the first group can reach the second group through transitivity.
Two groups with transitivity satisfy equation (6), which states that the distance between some sample of one group and the key sample of the other group is less than the grouping radius. We can merge core groups into the same cluster through equation (6).
In Figure 2, group 1 can reach group 2 through transitivity because the corresponding distance is less than the grouping radius. The same relation also holds between group 2 and a third group. By transitivity, we can therefore connect the three groups, so they belong to the same cluster.
Assume that among three such groups, the first has the highest density and already belongs to a cluster, whereas the other two do not belong to any cluster. Because there is transitivity between the first group and the second group, and the first group is in the cluster, this can be regarded as transitivity between the second group and the cluster, so the second group is added to the cluster. After that, the third group and the cluster also have transitivity, so the third group can be added to the cluster as well.
Algorithm 2 shows the specific steps. The transitivity judgment uses the distance between a sample in the core set and a sample already assigned to a cluster, and each sample in the core set represents the group whose first (key) sample it is. Firstly, we select the first sample in the core set as the initial cluster center. Secondly, we connect its corresponding group with other core groups through transitivity. When no more core groups can be connected, the next unconnected sample in the core set is selected as the next cluster center. These steps are repeated until all core groups have been assigned to clusters. A Python sketch of this procedure is given after the pseudocode.

(1) Input: core set, group set, grouping radius
(2) Output: center set, cluster set
(3) Create an empty cluster set
(4) While unassigned samples remain in the core set do
(5)  /* merge all core groups */
(6)  Take the first unassigned sample in the core set and its corresponding group
(7)  For each sample already assigned to a cluster do
(8)   /* when a core group is added to a cluster or a cluster is created, the transitivity between the remaining core groups and the cluster is judged */
(9)   If the distance between that sample and the current core-set sample is less than the grouping radius then
(10)    /* transitivity judgment */
(11)    Add all samples of the current group to that sample's cluster
(12)    Remove the current sample from the core set
(13)    Break
(14)   End if
(15)  End for
(16)  If the current group was not added to any cluster then
(17)   /* the group is not transitive to any existing cluster */
(18)   Create a new cluster containing the current group
(19)   Add the current sample to the center set
(20)   Remove the current sample from the core set
(21)  End if
(22) End while
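Below is a hedged Python sketch of the center selection and merging in Algorithm 2, reusing the group layout of the earlier grouping sketch; core is assumed to be a list of core group indices sorted by decreasing group density, and all names are our own.

import numpy as np

def merge_core_groups(X, groups, core, r):
    """Return cluster centers (sample indices), a group-to-cluster map,
    and the core samples of each cluster."""
    centers, cluster_of, members = [], {}, []
    remaining = list(core)                              # densest core group first
    while remaining:
        # The densest unassigned core group starts a new cluster;
        # its key sample becomes a cluster center.
        seed = remaining.pop(0)
        centers.append(groups[seed][0])
        members.append(list(groups[seed]))
        cluster_of[seed] = len(centers) - 1
        # Repeatedly attach remaining core groups whose key sample lies within
        # r of a sample already assigned to some cluster (transitivity).
        changed = True
        while changed:
            changed = False
            for gid in list(remaining):
                key = X[groups[gid][0]]
                for cid, samples in enumerate(members):
                    if any(np.linalg.norm(X[j] - key) < r for j in samples):
                        cluster_of[gid] = cid
                        members[cid].extend(groups[gid])
                        remaining.remove(gid)
                        changed = True
                        break
    return centers, cluster_of, members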
3.3. Determine Boundary Group Sample Labels

When selecting the cluster centers, we mark two transitive core groups as belonging to the same cluster. In this way, the labels of the core group samples are determined simultaneously with the selection of the cluster centers.

After finding the cluster centers, we calculate the distances between the boundary groups and the core group samples, where the distance between a group and a sample is the distance between the group's key sample and that sample. If the distance between a boundary group and a core group sample is less than the grouping radius, all samples in that boundary group receive the same label as that core sample. After that, we determine the labels of the remaining boundary group samples: we calculate the distance between the key sample of each remaining nonzero density boundary group and each cluster center, assign each such group to the cluster of the nearest cluster center, and mark the samples of the remaining zero density boundary groups as noise samples.
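A sketch of this boundary handling, under the same assumed names as the previous sketches: members[cid] holds the core samples of cluster cid, centers holds one sample index per cluster, and boundary/zero hold boundary and zero-density group indices.

import numpy as np

def assign_boundary(X, groups, members, centers, boundary, zero, r):
    labels = {}                                # group id -> cluster id; noise groups stay absent
    for gid in boundary + zero:
        key = X[groups[gid][0]]
        attached = False
        # Step 1: attach the group if its key sample lies within the grouping
        # radius of a core sample already assigned to a cluster.
        for cid, samples in enumerate(members):
            if any(np.linalg.norm(X[j] - key) < r for j in samples):
                labels[gid] = cid
                attached = True
                break
        # Step 2: remaining nonzero-density boundary groups go to the nearest
        # cluster center; remaining zero-density groups are left as noise.
        if not attached and gid not in zero:
            labels[gid] = int(np.argmin([np.linalg.norm(X[c] - key) for c in centers]))
    return labels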

3.4. Process

Our method (Algorithm 3) consists of the following five main steps: (1) group the dataset using the grouping radius; (2) select the cluster centers and determine the core group sample labels through transitivity; (3) calculate the distances between the boundary groups and the core group samples and determine the labels of the boundary groups whose distances are less than the grouping radius; (4) calculate the distances between the remaining nonzero density boundary groups and the cluster centers and determine their labels; and (5) mark the remaining zero density group samples as noise samples. A hypothetical end-to-end sketch is given after the pseudocode of Algorithm 3.

(1) Input: data set, grouping radius
(2) Output: center set, cluster set
(3) Group the data using Algorithm 1
(4) Select the cluster centers and determine the core group sample labels using Algorithm 2
(5) Calculate the distances between the boundary groups and the core group samples, and determine the labels of the boundary groups whose distances are less than the grouping radius
(6) Determine the labels of the remaining nonzero density boundary group samples
(7) Mark the remaining zero density group samples as noise samples
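A hypothetical end-to-end driver combining the sketches above. The group density used here (group size minus one) and the externally supplied threshold theta are placeholders for equations (4) and (5); all function names are our own.

import numpy as np

def gdpc(X, r, theta):
    groups = group_data(X, r)                                    # Algorithm 1
    densities = [len(g) - 1 for g in groups]                     # placeholder group density
    core, boundary, zero = split_groups(densities, theta)
    core.sort(key=lambda gid: -densities[gid])                   # densest core group first
    centers, cluster_of, members = merge_core_groups(X, groups, core, r)   # Algorithm 2
    cluster_of.update(assign_boundary(X, groups, members, centers, boundary, zero, r))
    labels = np.full(len(X), -1)                                 # -1 marks noise samples
    for gid, cid in cluster_of.items():
        labels[groups[gid]] = cid
    return centers, labels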
3.5. Example

In this subsection, we walk through our method on an example, as shown in Figure 3 and Table 2.
(1) Group the data: for the first sample there is no group yet, so we create a new group and add the sample to it. For each subsequent sample, we calculate its distance to the first sample of every existing group; if the smallest such distance is less than the grouping radius, we add the sample to that group, and otherwise we create a new group with the sample as its first sample. We repeat these steps until all samples have been added to a group and then calculate the density of each group by equation (4). The grouping results and densities are shown in Table 2.
(2) Distinguish core and boundary groups: firstly, we sort the groups in descending order of density. Secondly, we remove the zero density groups and calculate the density threshold from equation (5); it equals 1, the density of the 4th group after sorting. The groups whose density is greater than or equal to 1 are the core groups, and the remaining groups are the boundary groups.
(3) Determine core group labels: we select the key sample of the densest core group as the first cluster center and create a new cluster containing that group. Every core group that is transitive to the cluster (its distance is less than the grouping radius) is added to the cluster. When the distances between the remaining core groups and the cluster are all greater than the grouping radius, the key sample of the next unconnected core group is taken as the second cluster center, a second cluster is created, and the remaining core groups are attached to it in the same way.
(4) Determine boundary group labels: boundary groups whose distances to the core group samples are not less than the grouping radius are not processed temporarily; a boundary group whose distance to a core group sample is less than the grouping radius is added to that sample's cluster. Next, the remaining nonzero density boundary groups would be assigned to the cluster of the nearest cluster center; in this example, there is no nonzero density boundary group left, so this step ends here.
(5) Determine noise samples: the zero density boundary groups that were not assigned to a cluster in the previous step are considered noise samples. In this example, the two samples in the two remaining zero density groups are marked as noise samples.

After completing the above steps, the clustering process is finished. The clustering results are shown in Table 3. This example produces two clusters, with 10 samples in the first cluster, 10 samples in the second cluster, and 2 noise samples.

3.6. Time Complexity

The time complexity of our method has four main parts:

It is assumed that the dataset with $n$ samples is divided into core groups and boundary groups.
(a) Grouping requires time proportional to $n$ times the number of groups.
(b) Sorting the core group densities requires time that depends only on the number of groups.
(c) Finding the cluster centers and determining the core group sample labels requires time proportional to $n$ times the number of core groups.
(d) Determining the labels of the noncenter samples requires time proportional to $n$ times the number of boundary groups.

Based on the above four parts, the overall cost of our method is proportional to $n$ times the number of groups, and the number of groups is much smaller than $n$. Therefore, the time complexity of our method is less than the $O(n^2)$ of DPC.

4. Experimental Evaluation

This section reports precision and efficiency experiments. The source code of our method and the synthetic data are available at https://github.com/yongbiaoLi/GDPC.git.

4.1. Precision Experiment
4.1.1. Synthetic Datasets

In this section, we use K-means, FCM, DPC, and GDPC on the ThreeCircles, Lineblobs, Spiral, and Compound [22] synthetic datasets. Table 4 shows the details of these datasets.

From Figures 4, 5, 6, and 7, the cluster centers obtained by K-means and FCM are not necessarily samples in the data, whereas those obtained by DPC and GDPC are actual samples. For nonspherical data such as those in Figures 4 and 6, K-means and FCM cannot obtain the correct cluster centers and results. For data that combine spherical and nonspherical structures, such as those in Figures 5 and 7, K-means and FCM can only obtain partially correct cluster centers and results. DPC can obtain some of the cluster centers and results for the data in Figures 5 and 7, because DPC is effective for both spherical and nonspherical data. However, as Figures 4, 5, 6, and 7 show, DPC is sensitive to the distribution of nonspherical data and cannot obtain the correct cluster centers and results on it, whereas our method is not affected by these distributions.

4.1.2. Real-World Datasets

In this section, we use the Zoo [23], Thyroid [23], Ecoli [24], Machine [23], Hayes-Roth [23], Sobar-72 [25], Segment [23], and Pendigits [23] real-world datasets (details are shown in Table 5) to compare K-means, FCM, DPC, and our algorithm.

From Table 6, we can see that the ARI [26, 27], NMI [28], and homogeneity [29] of GDPC are all higher than those of K-means, FCM, and DPC on the Zoo and Thyroid datasets. On the Machine, Hayes-Roth, and Sobar-72 datasets, GDPC is higher on two of the three metrics. On the Ecoli, Segment, and Pendigits datasets, GDPC is the best on only one metric. The values in parentheses are the values of the grouping radius parameter.

The experimental results show that although GDPC is inferior to K-means, FCM, and DPC on some datasets and metrics, its overall performance is better than theirs.

4.2. Efficiency Experiment
4.2.1. Synthetic Datasets

In this section, we use the moons dataset to test the efficiency of DPC and of our method. The distribution of the moons dataset is shown in Figure 8. From Figure 9, we can see that as the amount of data increases, the time required by DPC increases significantly; our method is faster than DPC, and the gap becomes more and more apparent.

4.2.2. Real-World Datasets

In this section, we use the real-world datasets to test the efficiency of DPC and of our method. As shown in Table 7, our method is consistently more efficient than DPC. The larger the grouping radius, the faster our method is, and the larger the dataset, the greater the gap between our method and DPC.

4.3. Influence of Parameter

In this section, we use the moons dataset to test the influence of the grouping radius parameter on the clustering results. As shown in Table 8, the values 0.05, 0.2, 0.25, and 0.3 cannot make the clustering results completely correct. As the parameter increases from 0.05 to 0.3, the running time decreases from 277.5 seconds to 1.34375 seconds.

The parameter influences the clustering results in two ways. (a) Efficiency: the larger the parameter, the fewer groups remain after grouping and the faster the clustering. (b) Precision: a value that is too large leads to grouping errors, which reduces precision, while a value that is too small leads to too many noise samples. Combining these two aspects, we select the largest value that still ensures that all samples in each group belong to the same cluster.
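One hypothetical way to probe this trade-off is to run the gdpc() sketch above for several candidate radii and compare group counts, cluster counts, and runtime (the extra group_data() call is only for reporting).

import time

def sweep_radius(X, radii, theta):
    for r in radii:
        start = time.time()
        centers, labels = gdpc(X, r, theta)
        n_groups = len(group_data(X, r))
        print(f"r={r}: {n_groups} groups, {len(centers)} clusters, "
              f"{time.time() - start:.2f}s")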

According to the synthetic data experiments, the parameter is generally half of the distance between the core regions of the two nearest clusters. Because high-dimensional real data cannot be visualized, it is challenging to select the parameter for real data. How to obtain its value for real data is therefore a limitation of this method.

5. Conclusion

This paper presents a new method to improve DPC. We divide the data into small groups while ensuring that the sample labels in each group are the same, so we only need to calculate the similarity between groups, which reduces the amount of computation. At the same time, we isolate the core regions of different clusters, which makes it easy to find the cluster center at the densest location. Among the many methods that improve the efficiency of DPC, the automatic selection of cluster centers has not been solved. Our method not only improves the efficiency of DPC but also solves the problem of automatically selecting the cluster centers.

GDPC works well on some spherical and nonspherical data, but it does not perform well on some complex data with noise samples. Because GDPC uses only one grouping radius for a whole dataset, it is difficult to achieve a good grouping result on some complex data, which leads to unsatisfactory clustering results and efficiency. In addition, how to obtain the value of the grouping radius for real data is also a limitation of this method. These are the aspects we need to improve.

The advantages of density-based clustering methods on nonspherical data are difficult to replace. In the future, we still need to consider how to improve efficiency while reducing the number of parameters.

6. Application

Clustering can be applied to wireless sensor data annotation. We apply our clustering method to the activity data [30] of elderly people to identify four states: (a) sitting on a bed; (b) sitting on a chair; (c) lying; and (d) ambulating. The data were obtained by Torres et al. with a battery-free wearable sensor. As shown in Figure 10, the experimental results show that our proposed method has certain advantages for wireless sensor data annotation.

Data Availability

Previously reported data were used to support this study and are available at http://archive.ics.uci.edu/ml. These prior studies are cited at relevant places within the text as references.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by the National Natural Science Foundation of China under the project "Research on Retina Image Segmentation and Aided Diagnosis Technology Based on Deep Learning" (61962054).