Abstract

Density peaks clustering (DPC) is an efficient and effective algorithm owing to its ability to discover cluster centers rapidly. However, the quality of this method depends heavily on the cutoff distance. To improve the performance of DPC, the gravitation-based clustering algorithm (GDPC) was proposed; however, it cannot identify clusters of varying densities. We develop a novel density peaks clustering algorithm based on the magnitude and direction of the resultant force acting on each data point (RFDPC). RFDPC is based on the idea that the resultant forces acting on the data points in the same cluster are more likely to point towards the cluster center. The cluster centers are selected based on the force directional factor and the distance in the decision graph. Experimental results indicate the superior performance of the proposed algorithm in detecting clusters of different densities and irregular shapes and in determining the number of clusters.

1. Introduction

The main goal of clustering is to divide a data set into groups of data points such that points within the same group are close to one another and points from different groups are distinct [1]. Clustering is widely used in many fields, such as image analysis, medical applications, data mining, and bioinformatics [2–5]. In general, clustering algorithms can be categorized into partitional clustering [6], density-based clustering [7], hierarchical clustering [8], model-based clustering [9], grid-based clustering [10], and graph-based clustering [11].

The k-means algorithm partitions data into k clusters by iteratively minimizing the sum of distances between each data point and its cluster center [12]. Nevertheless, owing to its sensitivity to the initial cluster centers, the algorithm may converge to a local optimum. Among density-based methods, DBSCAN is one of the most widely adopted algorithms; it can discover arbitrarily shaped clusters using the neighboring relationships among data points [13]. However, its results depend heavily on the choice of the density threshold.

Rodriguez and Laio proposed a fast clustering algorithm based on the density peaks of the data (DPC) [14]. The algorithm assumes that cluster centers have a higher density than their neighbors and that they are relatively far from points with higher density. DPC can detect cluster centers rapidly and recognize clusters of certain shapes. However, DPC cannot detect clusters without an obvious center, and its performance is affected by the cutoff distance. The clustering results of DPC on the two-circles data set [1] are shown in Figure 1, where dc is the cutoff distance. We can see from Figure 1 that DPC cannot detect the two circles correctly when dc is set to 0.1 or 1.21.

Numerous methods have been proposed to improve the DPC algorithm [15–18]. Parmar et al. [19] proposed a residual error computation to measure local density in a neighborhood region. Guo et al. [20] proposed an improved DPC algorithm (DPC-CE) that estimates local center connectivity. Wang et al. [21] developed a hierarchical density peak clustering algorithm (McDPC) that assumes cluster centers are relatively far apart from one another. A model by Wang et al. [22] describes the local gravitational effects of data points, where each data point is viewed as a mass subject to a local force resulting from its neighbors. Jiang et al. [23] developed a density peaks clustering algorithm using gravity theory (GDPC), which assumes that gravity is inversely proportional to the distance between data points; in its decision graph, the horizontal axis is the density and the vertical axis is the reciprocal of gravity. GDPC can also detect outliers. However, the algorithm struggles to cluster data sets with wide variations in density among the clusters.

Figure 2 shows the clustering results of GDPC, where the cluster centers are represented by green stars. As shown in Figure 2(a), GDPC can identify clusters with similar densities. However, it cannot discover the expected clusters of varying densities shown in Figure 2(b). This is mainly because most of the data points in the upper-right corner of Figure 2(b) have lower densities than those in the lower-left corner, and the density affects the determination of the cluster centers to a certain extent. As a result, the two selected cluster centers are located in the same cluster, which leads to incorrect clustering results.

This paper proposes a density peaks clustering algorithm based on the resultant force, called RFDPC. As shown in Table 1, in comparison with algorithms such as k-means, DBSCAN, DPC, and GDPC, the proposed RFDPC algorithm can detect clusters of varying densities and irregular shapes and can determine the number of clusters. Our approach rests on two observations:
(1) The resultant forces acting on the data points in the same cluster are highly likely to point towards the cluster center; that is, for a cluster center ci, the resultant forces acting on its neighbors point towards ci.
(2) The cluster centers are generally located far from points with higher densities.

Therefore, a data point is more likely to be a cluster center if most of the resultant forces acting on the other points point towards it. In addition, the distance between two cluster centers is larger than the distance between points within the same cluster; thus, each noncentral point can be assigned to its nearest cluster center.

The key contributions of this paper are as follows:
(1) A data point can be viewed as a force object, and the resultant force acting on it can indicate whether the data point is located near a cluster center.
(2) The cluster centers are selected based on the force directional factor and the distance in the decision graph.
(3) The proposed clustering algorithm is less sensitive to the shape and density of clusters.

The rest of this paper is organized as follows: in Section 2, we review the DPC algorithm. We present our proposed algorithm in Section 3. The time complexity of the proposed algorithm is analyzed in Section 4. Section 5 presents the experimental evaluations. Section 6 provides further discussion of the proposed algorithm. Finally, Section 7 concludes the paper.

2. The DPC Algorithm

The DPC algorithm is based on the idea that cluster centers are surrounded by neighbors with lower density and lie at a relatively large distance from any point with a higher density. The density ρi of data point i can be defined as

$$\rho_i = \sum_{j \neq i} \exp\!\left(-\frac{d_{ij}^2}{d_c^2}\right), \tag{1}$$

where dc is the cutoff distance and dij is the Euclidean distance between points xi and xj:

$$d_{ij} = \lVert x_i - x_j \rVert_2. \tag{2}$$

In addition, the density of data point i can also be defined as

$$\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c), \tag{3}$$

where χ(x) = 1 if x < 0 and χ(x) = 0 otherwise.

Let δi be the minimum distance from point xi to any other point with higher density:

$$\delta_i = \min_{j:\, \rho_j > \rho_i} d_{ij}. \tag{4}$$

For the point with the highest density, δi is conventionally set to maxj dij.

The decision graph can be obtained after calculating the values of ρi and δi. The cluster centers are determined by selecting the data points with high values of both ρ and δ.
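To make this concrete, the following is a minimal Python/NumPy sketch of the ρ and δ computation in equations (1)–(4); the authors' released code is in MATLAB, so all names here are illustrative:

```python
import numpy as np

def dpc_rho_delta(X, dc, gaussian=True):
    """Compute the DPC density rho (equation (1) or (3)) and the
    distance delta to the nearest higher-density point (equation (4))."""
    n = X.shape[0]
    # Pairwise Euclidean distances, equation (2).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    if gaussian:
        rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0  # drop the self-term
    else:
        rho = (d < dc).sum(axis=1) - 1.0                # cutoff kernel
    delta = np.empty(n)
    for i in range(n):
        higher = np.flatnonzero(rho > rho[i])
        # The global density peak conventionally receives the maximum distance.
        delta[i] = d[i].max() if higher.size == 0 else d[i, higher].min()
    return rho, delta
```

Cluster centers then correspond to the points with simultaneously large ρ and δ in the decision graph.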

DPC suffers from several shortcomings: it cannot reliably identify clusters of varying densities, nor can it determine the number of clusters. The root cause is that the density of a data point is determined by the cutoff distance dc, which strongly affects the clustering results. To address this issue, the algorithm proposed in this paper replaces the density of the data point with the force directional factor, obtained by analyzing the resultant force acting on each data point.

3. Method

The RFDPC algorithm makes three assumptions: (1) all points attract each other with a gravitational force; (2) the direction of the resultant force acting on a data point is mainly along the line between the point and its cluster center; and (3) the intercluster distance is larger than the intracluster distance. RFDPC comprises three major steps:

Step 1: for each data point i, calculate the density ρi, the distance δi, and the resultant force Fi.
Step 2: calculate the angle αij between the vector connecting two points and the resultant force, as well as the force directional factor γi.
Step 3: determine the cluster centers and assign each remaining point to its nearest cluster.

3.1. Calculate and Sort the Densities of Data Points

First, calculate the Euclidean distance dij in equation (2) between every two data points i and j. The density ρi of each data point can then be calculated according to equation (1) or (3). The choice of the cutoff distance dc follows Ref. [14]: one can choose dc so that the average number of neighbors is about 1 to 2% of the total number of points in the data set.

Then, all data points are sorted according to their densities in descending order, and the distance δi from each data point i to the nearest point with higher density is calculated according to equation (4).

3.2. Resultant Force Acting on a Data Point

The concept of gravity is introduced on the basis of DPC. Newton's law of universal gravitation [23] states that any particle in the universe attracts any other particle with a force varying directly as the product of their masses and inversely as the square of the distance between their centers. The gravitational force is formulated as

$$F = G\,\frac{m_1 m_2}{r^2}, \tag{5}$$

where G is the gravitational constant, m1 and m2 are the two masses, and r is the distance between the centers of the masses. Table 2 provides a mapping between the parameters of Newton's law of universal gravitation and those of DPC.

In equation (6), the Euclidean distance is used to calculate the gravitational force, which avoids the algorithm's sensitivity to the value of dc. According to the mapping in Table 2, the gravitational force acting between data points i and j is defined in equation (6), where fij denotes the gravitational force between data point i and data point j, directed from i to j.

For each data point i, the gravitational force fij exerted by every other data point j is calculated according to equation (6). The magnitude and direction of the resultant force Fi acting on data point i are then obtained by decomposing and summing these gravitational forces, as shown in equation (7):

$$\mathbf{F}_i = \sum_{j \neq i} \mathbf{f}_{ij}. \tag{7}$$

An example in two-dimensional space is shown in Figure 3. Suppose that data point i is located at the origin of the coordinate system and that two gravitational forces F1 and F2 act on it. F1 can be decomposed into a force fx1 along the x axis and a force fy1 along the y axis; similarly, F2 can be decomposed into fx2 and fy2. Finally, the resultant force Fi acting on point i is obtained by summing the components of all forces along the x axis (fx1 and fx2) and along the y axis (fy1 and fy2).
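A sketch of this step is given below. Because Table 2 and the exact form of equation (6) are not reproduced in this text, the pairwise force magnitude is passed in as a function; the gravity-style choice at the end is only an illustrative assumption:

```python
import numpy as np

def resultant_forces(X, rho, pair_force):
    """Resultant force F_i on each data point (equation (7)): the vector
    sum of the pairwise attractions f_ij, each directed from i towards j."""
    n = X.shape[0]
    F = np.zeros_like(X, dtype=float)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            r = X[j] - X[i]                      # vector from i to j
            dist = np.linalg.norm(r)
            if dist == 0.0:                      # skip duplicate points
                continue
            F[i] += pair_force(rho[i], rho[j], dist) * (r / dist)
    return F

# Illustrative magnitude only: the paper defines equation (6) via the
# mapping in Table 2; a gravity-style choice is assumed here.
gdpc_like = lambda rho_i, rho_j, d: rho_i * rho_j / d ** 2
```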

Figure 4 shows the resultant force acting on each data point in the Flame data set, where the direction of an arrow represents the direction of the resultant force acting on the data point and its length represents the magnitude of that force. As can be seen from Figure 4, most resultant forces point towards the two cluster centers. Meanwhile, the farther a data point is from the cluster center, the less it is affected by the center. In addition, the magnitude of the resultant force is usually larger in higher-density areas and smaller in lower-density areas.

3.3. Force Directional Factor

Here, a parameter γi is introduced that represents the effect of data point i on the resultant forces acting on all other points. The detailed method is given as follows.

Take points i and j in Figure 4 as an example. For clarity, points i and j are magnified in Figure 5. As shown in Figure 5, αij denotes the angle between the vector pointing from point j to point i and the resultant force Fj, so αij lies in the range [0, π]. This range can be mapped to [−1, 1] by taking cos αij, as shown in equation (8):

$$\cos\alpha_{ij} = \frac{\mathbf{F}_j \cdot (x_i - x_j)}{\lVert \mathbf{F}_j \rVert \, \lVert x_i - x_j \rVert}. \tag{8}$$

cos αij equals 1 when αij equals 0, which means that the resultant force Fj acting on j points exactly towards i. The larger αij is, the smaller cos αij is, and the more the direction of the resultant force Fj deviates from data point i.

In addition, it can be seen from Figure 4 that the resultant force acting on data point k also points towards point i. However, i and k belong to different clusters, which indicates that clustering errors will occur if only cos αij is considered. To address this issue, not only the resultant force but also the distance between data points should be taken into account. Thus, a parameter combining cos αij with the distance dij between the points is defined in equation (9).

For data point i, this parameter is calculated between it and every other point. The force directional factor γi is then obtained by aggregating these values, as given in equation (10).

A larger value of γi indicates that point i has a stronger effect on the resultant forces acting on the other points. Hence, point i is more likely to be a cluster center or to be located near a cluster center.
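The sketch below computes cos αij via equation (8) and accumulates a directional score per point. The exact combination with distance in equations (9) and (10) is not reproduced in this text, so the inverse-distance weighting used as the default is an assumption:

```python
import numpy as np

def force_directional_factor(X, F, combine=lambda cos_a, d: cos_a / d):
    """gamma_i: how strongly the resultant forces acting on the other points
    point towards i. cos(alpha_ij) follows equation (8); the default
    `combine` (inverse-distance weighting) is an assumed stand-in for
    equations (9) and (10), which are not reproduced here."""
    n = X.shape[0]
    gamma = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            v = X[i] - X[j]                      # vector from j towards i
            d = np.linalg.norm(v)
            nf = np.linalg.norm(F[j])
            if d == 0.0 or nf == 0.0:            # e.g., a zero resultant force
                continue
            gamma[i] += combine((F[j] @ v) / (nf * d), d)
    return gamma
```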

3.4. Determine the Cluster Centers

For a data point i near the cluster center, the value of γi will be larger due to the effect of the cluster center. Thus, both γi and δi are taken into account: the cluster centers are selected according to the product of δi by γi. Each remaining point is then assigned to the same cluster as its nearest neighbor of higher density. Algorithm 1 summarizes the proposed resultant force-based clustering algorithm, and Figure 6 illustrates its flowchart.

Input: data set X = {x1, x2, …, xn} (n is the number of data points), number of clusters K
Output: a clustering result
 Step 1. Calculate the density ρi and the distance δi of each data point i.
  (1.1) Calculate dij from X according to equation (2).
  (1.2) Determine the dc value so that the average number of neighbors is about 1 to 2% of the total number of points in the data set.
  (1.3) Calculate ρi based on equation (1) or equation (3) and sort all ρi in descending order.
  (1.4) Calculate the distance δi according to equation (4).
 Step 2. Calculate the resultant force acting on each data point.
  (2.1) Calculate the gravitational forces acting on each data point according to equation (6) and the resultant force Fi according to equation (7).
  (2.2) Calculate γi according to equations (9) and (10).
 Step 3. Determine the cluster centers.
  (3.1) Select the K cluster centers according to the product of δi by γi; that is, the cluster centers are selected from the data points with the largest product values.
  (3.2) Assign each remaining point to the same cluster as its nearest neighbor with higher density.
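Steps 3.1 and 3.2 can be sketched as follows, again as a minimal illustration that assumes the quantities computed in the earlier sketches:

```python
import numpy as np

def rfdpc_assign(d, rho, delta, gamma, K):
    """Step 3.1: select the K points with the largest delta*gamma products
    as centers; Step 3.2: assign the remaining points, in descending
    density order, to the cluster of the nearest higher-density neighbor."""
    n = len(rho)
    labels = np.full(n, -1)
    centers = np.argsort(-(delta * gamma))[:K]
    labels[centers] = np.arange(K)
    order = np.argsort(-rho)                     # descending density
    for idx, i in enumerate(order):
        if labels[i] >= 0:
            continue
        # Every denser point was labeled earlier in this loop; this assumes
        # the global density peak is among the selected centers.
        higher = order[:idx]
        nearest = higher[np.argmin(d[i, higher])]
        labels[i] = labels[nearest]
    return labels
```

Processing points in descending density order guarantees that the nearest higher-density neighbor is already labeled when a point is reached.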

4. Complexity Analysis

The computational complexity of RFDPC is analyzed as follows:

Firstly, we calculate the density ρi and the distance δi of each data point i, which requires O(n²) calculations.

Next, we calculate the resultant force acting on each data point. The time complexity of calculating the gravitational forces acting on a data point and the resultant force is O(n²). The calculation of γi also takes O(n²).

Finally, we determine the cluster centers. Since K ≪ n, the time for selecting the K cluster centers can be ignored. It takes about O(n) to assign each remaining point to its nearest cluster.

As a result, the total time complexity of the RFDPC algorithm is approximately O(n²).

5. Experimental Results

To test the feasibility and effectiveness of the RFDPC algorithm, we compare it with k-means [6], DPC [14], GDPC [23], single linkage [8], spectral clustering [24], DPC-CE [20], and McDPC [21] on 11 synthetic data sets DS1–DS11 (https://github.com/milaan9/Clustering-Datasets) and 15 UCI real data sets [6] listed in Table 3. Both Shuttle and Eeg are large-scale data sets, and Gene is a high-dimensional data set. The code used in this paper has been released; it is written in MATLAB and available at https://github.com/djhahaha/cluster.

Single linkage is an example of a hierarchical clustering algorithm, and spectral clustering is a graph-based clustering algorithm. For k-means and spectral clustering, the best clustering result over 50 trial runs was selected according to the external clustering validity indices. In addition, the parameter dc in DPC, GDPC, DPC-CE, and RFDPC is set to 1.2. McDPC involves three parameters, whose values are determined by a heuristic algorithm [21]. All algorithms were implemented in MATLAB, and the experiments were carried out on a machine with an Intel Core i5 2.2 GHz CPU and 8 GB of RAM running Windows 10.

This paper uses the Rand Index (RI) [25], Adjusted Rand Index (ARI) [26], Jaccard coefficient (JC) [27], Purity (PR) [28], F-measure (F1) [23], and Normalized Mutual Information (NMI) [29] to evaluate the clustering performance of the compared algorithms. The larger the values of these indices, the better the clustering result.
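For reference, ARI and NMI are available in scikit-learn, and purity is straightforward to compute directly. A small sketch follows (assuming integer-coded ground-truth labels; RI, JC, and F-measure are omitted):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def purity(y_true, y_pred):
    """Purity: fraction of points belonging to the majority true class of
    their assigned cluster (assumes non-negative integer labels)."""
    hits = sum(np.bincount(y_true[y_pred == c]).max()
               for c in np.unique(y_pred))
    return hits / len(y_true)

# Example usage:
# ari = adjusted_rand_score(y_true, y_pred)
# nmi = normalized_mutual_info_score(y_true, y_pred)
```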

5.1. Detect Clusters of Varying Densities

There are two clusters in the DS1 data set, one with a higher density and the other with a lower density. DS2 has four clusters, one with a higher density and the other three with lower densities. Experiments are carried out on DS1 and DS2 to test the detection of clusters of varying densities. For DS1, Figure 7 shows that k-means, GDPC, single linkage, and spectral clustering cannot identify the clusters of varying densities, whereas the four remaining algorithms produce the correct clustering results. Figure 8 illustrates the clustering results on DS2. For DS2, the results of k-means, DPC-CE, and McDPC are imperfect but better than those of DPC, GDPC, and single linkage. Both spectral clustering and RFDPC identify the clusters correctly. RFDPC can aggregate clusters of varying densities properly because it finds the cluster centers based on the direction of the resultant force acting on each data point rather than on the density of the data point.

5.2. Detect Clusters of Irregular Shapes

We evaluate the performance of RFDPC using nine data sets (DS3–DS11). The DS3 data set comprises a spherical cluster and three half-moon clusters. The DS4 data set consists of three spiral clusters that are isolated from one another. The DS5 data set includes seven Gaussian-distributed clusters. The DS6 data set contains 31 two-dimensional Gaussian clusters, each with 100 data points. The DS7 data set contains a crescent-shaped cluster and a spherical cluster. The DS8 data set contains three spiral clusters and four Gaussian-distributed clusters. The DS9 data set consists of four sphere-shaped clusters and one nonspherical cluster. The DS10 data set contains a linear cluster, a ring-shaped cluster, and a compact rectangular cluster. The DS11 data set contains two round clusters along with one Gaussian-distributed cluster.

Figure 9 shows that spectral clustering and RFDPC perform better than the other algorithms on DS3. Figure 10 illustrates the clustering results on DS4: except for k-means and spectral clustering, all algorithms identify the expected clusters. Figure 11 shows the clustering results on DS5; DPC, GDPC, spectral clustering, DPC-CE, McDPC, and RFDPC give correct partitions, while k-means and single linkage do not. Figure 12 illustrates the clustering results on DS6, where GDPC, DPC-CE, and RFDPC discover all 31 clusters. We can see from Figure 13 that both DPC-CE and RFDPC produce the proper partitions. For DS8, shown in Figure 14, DPC, DPC-CE, and RFDPC discover the expected clusters. For DS9, shown in Figure 15, DPC, DPC-CE, McDPC, and RFDPC produce satisfactory partitions except for the outliers in the upper-right corner. Figure 16 shows the clustering results on DS10: DPC, single linkage, DPC-CE, and RFDPC generate the proper partitions. In addition, we can see from Figure 17 that the proper partitions are identified only by single linkage, spectral clustering, and DPC-CE.

We further explain why RFDPC cannot identify the proper partitions for DS11. Figure 18 shows the three cluster centers determined by RFDPC, represented by purple stars; each of the three centers is located in a different cluster. After the three cluster centers are determined, each remaining point is assigned to a cluster based on its nearest neighbor of higher density. An incorrectly assigned data point may trigger a domino effect: once one data point is assigned incorrectly, subsequent data points may be assigned incorrectly as well.

5.3. Determine the Number of Clusters

Figures 19–21 show the decision graphs for three data sets: Wine, Seeds, and DS3. It can be seen from Figure 19 that RFDPC is more accurate than both DPC and GDPC in detecting the number of clusters: DPC distinguishes only two cluster centers, and GDPC finds only one of the three cluster centers. In the decision graph in Figure 19(c), there are three distinct points, so RFDPC finds the three cluster centers correctly. Figure 20 shows that GDPC discovers only one cluster center. Figure 21(a) shows that DPC discovers four cluster centers; however, the point in the red circle can easily be mistaken for a center. In contrast, Figure 21(c) shows that RFDPC discovers the proper cluster centers.

5.4. Experiments on the Real Data Sets

Tables 4–18 list the performance comparison of the eight clustering algorithms on the 15 real data sets, with the optimal value of each index indicated in bold. It can be seen from Tables 4–18 that RFDPC obtains the best performance on most data sets. For the large-scale Shuttle data set, RFDPC achieves the best results for all indices except Purity. In addition, RFDPC achieves the best F-measure and NMI results on the Gene data set.

5.5. Parameter Setting

In this section, we discuss the value of dc used in RFDPC. Table 19 shows the F-measure on the 15 UCI real data sets when dc varies from 1 to 2. It can be seen from Table 19 that RFDPC performs better on the 15 data sets when dc is between 1.2 and 1.7; thus, the best setting for dc lies in this range.

5.6. Running Time Comparison

Table 20 lists the time complexities of k-means, DPC, GDPC, single linkage, spectral clustering, DPC-CE, McDPC, and RFDPC, where n, D, K, and I denote the number of data points, the dimensionality, the number of clusters, and the number of iterations of k-means, respectively. According to Table 20, single linkage and spectral clustering have the highest time complexities, while DPC, GDPC, DPC-CE, McDPC, and RFDPC all have a complexity of O(n²). Table 21 lists the running times of the eight algorithms on the 15 real data sets; the measured times are consistent with the complexities in Table 20. In addition, the running times of single linkage and spectral clustering on the three large-scale or high-dimensional data sets (Shuttle, Gene, and Eeg) are significantly higher than on the other data sets. Although the calculation of the resultant force adds overhead to RFDPC, we optimized its code; as a result, the running time of RFDPC on the Gene and Eeg data sets is less than that of DPC.

6. Discussion

To further explain the proposed algorithm, consider the test case in Figure 22. In Figure 22(a), 25 data points are embedded in a two-dimensional space and ranked in order of decreasing γ. The black arrows represent the directions of the resultant forces acting on the data points; all arrows in Figure 22(a) have the same length because the magnitude of the resultant force is not involved in the calculation of γ. Figure 22(b) shows the corresponding decision graph.

According to Step 1 of Algorithm 1, we calculate the density and the distance of each data point. Specifically, we calculate the Euclidean distance from each data point xi to the other points; for simplicity, the value of dc in this example is set to 2%. Then, we calculate ρi and δi for each data point according to equations (1) and (4), respectively. Next, we calculate the resultant force acting on each data point according to Step 2 of Algorithm 1. We can see from Figure 22(a) that, for the red cluster, most resultant forces point towards data point 1. In addition, no arrow is drawn for data point 1 itself, because the magnitude of the resultant force acting on it is equal to 0. Furthermore, the number of data points in the blue cluster is much smaller than that in the red cluster, so the resultant forces acting on the points in the blue cluster are strongly affected by the points in the red cluster; hence, for the blue cluster, the resultant forces do not point towards a single data point. Next, we calculate γi according to equations (9) and (10). Finally, we determine the cluster centers according to the product of δi by γi. It can be seen from Figure 22(b) that data points 1 and 17, which have the largest product values, are selected as the cluster centers. Each remaining point is then assigned to the same cluster as its nearest neighbor of higher density. The density of the red cluster is higher than that of the blue one in Figure 22(a), and the γ value of data point 17 is smaller than that of data point 1. However, our algorithm considers both δ and γ, which tackles the problem of varying densities: since the δ value of data point 17 is close to 1, data point 17 is still selected as a cluster center.
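Putting the sketches together, a toy run analogous to this example might look as follows (two Gaussian blobs of different densities; all function names refer to the illustrative Python sketches above, not to the released MATLAB code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(40, 2)),   # dense cluster
               rng.normal(3.0, 0.8, size=(15, 2))])  # sparse cluster

d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
dc = np.quantile(d[d > 0], 0.02)                 # ~2% neighborhood rule
rho, delta = dpc_rho_delta(X, dc)
F = resultant_forces(X, rho, gdpc_like)
gamma = force_directional_factor(X, F)
labels = rfdpc_assign(d, rho, delta, gamma, K=2)
print(np.bincount(labels))                       # cluster sizes
```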

7. Conclusions

This paper proposes a resultant force-based clustering algorithm built on gravitation theory. The clustering performance of RFDPC was evaluated on 11 synthetic data sets and 15 real data sets. The results indicate that RFDPC performs well in the following respects:
(i) Aggregating clusters with different shapes and densities in an efficient manner.
(ii) Detecting the number of clusters.

Experimental results indicate that RFDPC is superior to k-means, DPC, GDPC, single linkage, spectral clustering, DPC-CE, and McDPC. GDPC considers the magnitude of the gravitational force between two points but ignores the effect of the gravitational forces coming from the other points. We extend the gravitational force in GDPC to the resultant force and select the cluster centers based on both the force directional factor and the distance in the decision graph. Hence, RFDPC can accurately recognize the cluster centers.

A major limitation of RFDPC is its assignment scheme for the remaining data points, which is prone to producing consecutive assignment errors. Future work will aim to improve RFDPC by reducing such errors. In addition, ensemble clustering techniques can improve clustering robustness by fusing the information of multiple clustering results [30]; our algorithm may also be improved by using ensemble clustering.

Data Availability

The code and data that support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors are grateful for the support of the National Natural Science Foundation of China (61373004).