Abstract
Due to the adoption of global parameters, DBSCAN fails to identify clusters with different and varied densities. To solve the problem, this paper extends DBSCAN by exploiting a new density definition and proposes a novel algorithm called deviation density based DBSCAN (kDDBSCAN). Various datasets containing clusters with arbitrary shapes and different or varied densities are used to demonstrate the performance and investigate the feasibility and practicality of kDDBSCAN. The results show that kDDBSCAN performs better than DBSCAN.
1. Introduction
DBSCAN is a classical density-based clustering method [1] with many desirable features, including good robustness to noise and outliers. However, because it adopts global parameters, in particular the neighborhood radius Eps, DBSCAN fails to identify clusters with different and varied densities. Two main approaches have been proposed to solve this problem.
(1) Adaptive Local Density or Eps. GRIDBSCAN [2] and GMDBSCAN [3] use the grid technique to calculate the local density (Eps, MinPts), where MinPts is the minimum number of neighbors a point must have to be considered a core point. APSCAN [4] uses the Affinity Propagation (AP) algorithm to partition a dataset into patches and calculates the local density of each patch. VDBSCAN [5] uses a dist plot to select several Eps values for different densities. MultiDBSCAN [6] uses the must-link constraint and nearest distance to calculate Eps values for different densities. DBSCANDLP [7] partitions a dataset into many subsets with different density levels by analyzing the statistical characteristics of its density variation and then estimates the Eps value for each subset. DSetsDBSCAN regards the data in the dominant set as core points and those from extrapolation as border points, so Eps can be determined automatically from the dominant set [8]. After the local density or Eps is estimated, all these algorithms apply DBSCAN to merge data with similar density.
EDBSCAN [9] assigns varied values for Eps according to the local density based on the nearest neighbors, and the clustering process proceeds from the point with the highest local density towards the one with the lowest. DDSC [10] uses a homogeneity test to detect the density difference between regions; if the difference is less than a given threshold, the regions are merged into the same cluster.
(2) Redefinition of the Density with No Parameter Eps. density [11] estimates the local density of the nonnormalized probability distribution according to the neighborhood of radius , and the hierarchical agglomerative strategy is used to merge clusters according to the dissimilarity measures. In the multidensity DBSCAN, two adjacent spatial regions are separated into two clusters when the difference between DST and AVGDST violates a threshold, where DST is the average distance between one point and its nearest neighbors and AVGDST is the average distance between any point in one cluster and its neighbors [12]. In DBSCAN [13], the k-means clustering algorithm is employed to divide all points into level groups based on their density values (here, the density value is the average distance between the point and its nearest neighbors), and then DBSCAN is used to merge similar data according to the density levels.
In the first approach, Eps is calculated automatically according to the different densities; in the second, new definitions of density are proposed. Among the latter, one kind of definition is based on the nearest neighborhood method, such as the local density of the nonnormalized probability distribution [11], the average distance between one point and its nearest neighbors [12], density [13], and neighborhood density [7]. Based on these definitions, the varied densities can be represented, respectively, by dissimilarity measures [11], the difference between DST and AVGDST [12], and the density variation or density level [7].
The main objective of defining a density concept is to group objects with similar density into the same cluster. For example, in Figure 1, the blue circles A, B, C, and D can be viewed as one normal cluster, while the red circle E is an abnormal one. However, it is difficult to describe the density with Eps and MinPts when densities are different and varied. For example, in Figure 1(a), with a fixed Eps the densities differ across all these circle regions, and the difference between circle E and any other circle is not apparent.
[Figure 1: panels (a) and (b)]
Inspired by the above discussion, redefinition of the density based on the nearest neighbors seems a better solution to solve the problem of different and varied densities. Unfortunately, the density based on the average distance between one point and its nearest neighbors [7, 12, 13] is not good enough to describe the varied densities. For example, in Figure 1(b), if , the average distance is different in all these circle regions and circle E cannot be distinguished from other circles.
In this work, a new density definition called deviation density is proposed to describe different and varied densities. The deviation density is defined as the ratio of the maximum distance to the average distance. In Figure 1(b), under this definition, the densities of the circle regions A, B, C, and D are similar, while the density of the circle region E differs from that of A, B, C, or D. On this basis, a deviation density based clustering algorithm (kDDBSCAN) is put forward.
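The ratio described above can be illustrated with a minimal sketch. It assumes that the distances involved are those from a point to its k nearest neighbors and that Euclidean distance is used; the function names `knn_distances` and `deviation_density` are hypothetical and are not taken from the paper, and the formal definition in Section 2.1 takes precedence over this reading.

```python
import math


def knn_distances(points, i, k):
    """Return the sorted distances from points[i] to its k nearest neighbors."""
    dists = sorted(
        math.dist(points[i], q) for j, q in enumerate(points) if j != i
    )
    return dists[:k]


def deviation_density(points, i, k):
    """Assumed form: largest k-NN distance divided by the mean k-NN distance.

    Assumes points[i] has at least one neighbor at nonzero distance.
    """
    d = knn_distances(points, i, k)
    return max(d) / (sum(d) / len(d))


# Evenly spaced points: the neighbor distances are equal, so the ratio is 1.
uniform = [(float(x), 0.0) for x in range(5)]
# One far-away point inflates the maximum distance relative to the mean.
with_outlier = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (20.0, 0.0)]

print(deviation_density(uniform, 2, 2))       # neighbors at distances 1 and 1
print(deviation_density(with_outlier, 1, 3))  # neighbors at distances 1, 1, 19
```

Under this reading, a value close to 1 means the sampled neighbors are roughly equidistant, whereas a large value signals that at least one sampled neighbor lies much farther away than the rest.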
2. The kDDBSCAN Algorithm
The basic idea of this paper is that objects in the same cluster have similar, small deviation densities, which reflect the deviation of an object from the others. kDDBSCAN is built on the deviation density and DBSCAN. In this section, the basic concepts and definitions are given, and the process of the proposed algorithm is described in detail.
The points used to calculate the deviation are sampled through the nearest neighborhood (KNN) method. According to [14], the shared nearest neighborhood (SNN) method can be used because of its robustness on high-dimensional datasets. However, SNN is not efficient because of its complexity. In view of this, the neighborhood method is proposed to combine KNN and SNN.
Once points are sampled through the neighborhood method, the deviation density can be calculated. A smaller deviation density means that the samples are more likely to be in the same cluster. Hence, a deviation factor is introduced as the threshold for deciding whether the given samples are in the same cluster. If the deviation is greater than the deviation factor, the given samples are not in the same cluster. This relation is called directly density-reachable (DDR) and can be used to identify core points. Furthermore, density-reachable is introduced to decide whether two core points belong to the same cluster.
2.1. Basic Definitions
Definition 1 (nearest neighborhood). The nearest neighborhood (KNN) of a point is denoted by .
Definition 2 (mutual nearest neighborhood). Given two points and , if and , then and are mutual nearest neighbors. The mutual nearest neighborhood (mKNN) of a point is denoted by .
Definition 3 ( neighborhood). The neighborhood of a point is denoted by , where is the minimal number of the mutual neighborhood:
Definition 4 (deviation density). Let and be the distance from a point to its nearest neighbor. Then, the deviation density is defined as
Definition 5 (directly density-reachable). A point is directly density-reachable from a point if and (core point condition), where is the deviation factor.
Definition 6 (density-reachable). A point is density-reachable from a point if the following conditions are satisfied:
(1) ,
(2) ,
(3) ,
(4) ,
(5) , .
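Definitions 1 and 2 can be illustrated with a minimal sketch. It assumes Euclidean distance, excludes a point from its own neighborhood, and breaks distance ties by index order; the function names `knn` and `mutual_knn` are hypothetical, and the sketch does not attempt to cover the combined neighborhood of Definition 3 or the reachability conditions of Definitions 5 and 6.

```python
import math


def knn(points, i, k):
    """Indices of the k nearest neighbors of points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return set(others[:k])


def mutual_knn(points, i, k):
    """Indices j such that j is in KNN(i) and i is in KNN(j)."""
    return {j for j in knn(points, i, k) if i in knn(points, j, k)}


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]

print(knn(points, 3, 2))         # the isolated point still has 2 nearest neighbors
print(mutual_knn(points, 3, 2))  # but none of them lists it in return
print(mutual_knn(points, 1, 2))  # a point inside the dense group
```

The example shows why the mutual relation is stricter than plain KNN: an isolated point always has k nearest neighbors, but it has mutual neighbors only if those neighbors also rank it among their own k nearest.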
2.2. The Process of kDDBSCAN
The proposed kDDBSCAN, which is an extension of DBSCAN, is given as Algorithm 1. The ExpandCluster procedure used in Algorithm 1 is given as Algorithm 2.


3. Experiments
To thoroughly evaluate the effectiveness of kDDBSCAN, this section covers various types of datasets, including clusters with arbitrary shape, uniform density, or varied densities. Since clusters in 2D datasets are easy to visualize and to compare across algorithms, the performance comparison of kDDBSCAN with DBSCAN and the evaluation of parameter sensitivity are first conducted in two-dimensional space. Then we demonstrate that the proposed algorithm is applicable to multidimensional datasets. UCI datasets with ground truth are used to quantify the clustering results in terms of Homogeneity, Completeness, V-measure, ARI, FMI, and NMI. Finally, BSDS500, the Olivetti Face dataset, and MNIST are used to investigate the feasibility and practicality of kDDBSCAN. In kDDBSCAN, the default values are and .
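As a reminder of what two of these indexes measure, the following sketch computes Homogeneity and Completeness, together with their harmonic mean (commonly called the V-measure), from their standard entropy-based definitions. It is an illustrative implementation, not the evaluation code used in the paper; the function names are hypothetical, and noise labels are treated as an ordinary cluster, which is an assumption.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(a, b):
    """H(a | b) for two label sequences of equal length."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum(
        (c / n) * math.log(c / b_counts[bv]) for (_, bv), c in joint.items()
    )


def homogeneity_completeness_v(truth, pred):
    """Return (homogeneity, completeness, harmonic mean of the two)."""
    h_truth, h_pred = entropy(truth), entropy(pred)
    hom = 1.0 if h_truth == 0 else 1.0 - conditional_entropy(truth, pred) / h_truth
    com = 1.0 if h_pred == 0 else 1.0 - conditional_entropy(pred, truth) / h_pred
    v = 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)
    return hom, com, v


# Class 1 is split across two clusters: every cluster is pure (homogeneity 1),
# but not every class is kept together (completeness below 1).
truth = [0, 0, 1, 1]
pred = [0, 0, 1, 2]
print(homogeneity_completeness_v(truth, pred))
```

Splitting a class into several pure clusters keeps Homogeneity at its maximum while lowering Completeness, so the two indexes should be read together.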
3.1. Clustering on the Two-Dimensional Dataset
The two-dimensional datasets are chosen for their different characteristics. The Spiral dataset contains well-separated, nonspherical clusters. The Aggregation and Flame datasets have adjacent regions with uniform density. The D1 and Path based datasets contain embedded and adjacent clusters with different densities. The Jain dataset contains sparse data regions with different densities. The Compound dataset contains adjacent, embedded regions with varied and different densities.
3.1.1. Comparison with DBSCAN
Figures 2–8 show the results from DBSCAN and kDDBSCAN on those two-dimensional datasets, respectively, and the corresponding parameters for each algorithm on each kind of dataset are also given. Since the neighborhood method is different for and , the results when are given separately. When the parameters and are set as default values, kDDBSCAN has only one parameter .
[Figures 2–8: each figure has panels (a) DBSCAN, (b) kDDBSCAN, and (c) kDDBSCAN]
The Spiral dataset contains well-separated, nonspherical clusters. Figure 2 shows that both DBSCAN and kDDBSCAN obtain correct results.
Aggregation and Flame datasets contain adjacent regions with uniform density. Figure 4 shows that the KNN method performs worse than DBSCAN or mKNN method.
D1 and Path based datasets contain embedded and adjacent clusters with different densities.
In Figure 5, kDDBSCAN performs better than DBSCAN because the deviation density takes into account the deviation factor of a dataset with different densities. Figure 6 shows that mKNN performs better than DBSCAN and KNN, indicating that mKNN is effective for adjacent clusters.
Jain dataset contains sparse data regions with different densities.
In Figure 7, DBSCAN cannot obtain correct results because of data sparsity, whereas kDDBSCAN performs well because the deviation density takes into account the deviation factor of a dataset with sparse data.
The Compound dataset contains adjacent, embedded regions with varied and different densities. In Figure 8, DBSCAN treats the sparse data as noise, while kDDBSCAN does not.
Through the above discussion, it can be seen that the deviation density can handle the dataset with varied densities or sparse data. The parameter is effective in clustering the adjacent region according to the deviation density, as illustrated in Figures 3–5. The method of mKNN has the ability to cluster the wellseparated data, as illustrated in Figures 2, 7, and 8.
3.1.2. Parameter Sensitivity
In Figure 9, is sensitive to different densities in sparse data regions as the value decides the nearest neighborhood.
[Figure 9: panels (a) and (b), both kDDBSCAN]
In Figure 10, is sensitive to varied and different densities when adjacent regions exist.
[Figure 10: panels (a)–(d), all kDDBSCAN]
In Figure 11, as the parameter is effective in clustering the adjacent regions according to the deviation density, it is sensitive when clustering adjacent regions. However, is not sensitive to uniform density.
[Figure 11: panels (a)–(f), all kDDBSCAN]
3.2. Clustering on Multidimensional Datasets
To demonstrate the applicability of kDDBSCAN to multidimensional datasets, UCI datasets with ground truth are used to evaluate the clustering results in terms of Homogeneity, Completeness, V-measure, ARI, FMI, and NMI. The input parameters are listed in Table 1, and the corresponding results are shown in Table 2. Both algorithms obtain the same results on Iris, but kDDBSCAN is more effective on the other datasets.
3.3. Application of kDDBSCAN to Image Segmentation
The Berkeley Segmentation Data Set and Benchmarks 500 (BSDS500) consists of 500 natural images with ground-truth human annotations [15]. Three of these images are used to demonstrate the effectiveness of kDDBSCAN relative to DBSCAN, and the results are shown in Table 3 and Figure 12. Here, each data point is composed of the pixel location and the corresponding RGB value. In addition, the parameter sensitivity of kDDBSCAN is analyzed in this section, and the results are shown in Figures 13–18.
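The construction of data points from an image can be sketched as follows. The sketch assumes the image is given as rows of (r, g, b) tuples and that location and color are concatenated without any rescaling or weighting; the paper does not state whether such scaling is applied, and the function name `pixel_features` is hypothetical.

```python
def pixel_features(image):
    """Flatten an image (rows of (r, g, b) tuples) into (row, col, r, g, b) points."""
    return [
        (row, col, r, g, b)
        for row, line in enumerate(image)
        for col, (r, g, b) in enumerate(line)
    ]


# A 2 x 2 toy image: red and green on the top row, blue and white below.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

features = pixel_features(image)
print(len(features))  # one 5-dimensional point per pixel
print(features[0])
```

Each pixel becomes one 5-dimensional point, so neighboring pixels with similar colors lie close together in the feature space that the clustering algorithm operates on.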
3.3.1. Comparison with DBSCAN
As shown in Figure 12, the segmentation and boundary detection largely match the ground truth, although some noise requires further processing. kDDBSCAN produces more noise. Nevertheless, kDDBSCAN separates the flower of BSD35008 and the hill of BSD22090 more clearly than DBSCAN does. The results in Table 3 indicate that kDDBSCAN achieves better clustering performance than DBSCAN. In particular, the values of Homogeneity and FMI for kDDBSCAN are both above 0.7.
3.3.2. Parameter Sensitivity of kDDBSCAN
In this part, the visual image segmentation results by kDDBSCAN are discussed, and then the clustering results with different parameter values using kDDBSCAN are evaluated.
(1) Results of the Segmentation with Different Parameter Values. In Figure 13, with the increase of value, the likelihood that similar pixels are clustered into the same cluster also increases, because higher value means larger tolerance to the difference between pixels. Similarly, in Figure 14, with the increase of value, similar pixels have less likelihood to be clustered into the same cluster. In general, higher value and lower value lead to larger likelihood for similar pixels to be clustered into the same cluster.
As the parameter decides the number of sampled points, the deviation density of every core point does not vary monotonically when increases. Consequently, there must be multiple optimum values to achieve better performance when and are constant. This is in accordance with the different results of the last images with different values in Figure 15.
(2) Clustering Evaluation. Figures 16–18 show the results with different parameter values using kDDBSCAN for different evaluation indexes. Based on the analysis of Figures 13 and 14, higher value and lower value can lead to larger likelihood for similar pixels to be clustered into the same cluster. As can be seen from Figures 16 and 17, most evaluation indexes vary monotonically with and . By contrast, in Figure 18, the Homogeneity value does not vary monotonically with , so there is an optimum value required to be set manually.
3.4. Application of kDDBSCAN to Olivetti Face Dataset and MNIST
kDDBSCAN and DBSCAN are applied to the Face dataset to group the images of the same person into one cluster without any prior training. The Olivetti Face dataset is a widely used benchmark for machine learning algorithms. It contains ten different images for each of 40 distinct persons. Similar to [16], the similarity between two images is calculated according to the method in [17]. Figures 19 and 20 show the clustering results, where images of the same color correspond to the same cluster and images marked with a red point are taken as noise. It can be seen that DBSCAN identifies six persons as the same person “1,” whereas kDDBSCAN only identifies two persons as the same person “4.” The values of the clustering evaluation indexes Homogeneity, Completeness, V-measure, ARI, FMI, and NMI are given in Table 4. kDDBSCAN performs better than DBSCAN on these indexes.
The MNIST database of handwritten digits has a training set of 60,000 examples and a testing set of 10,000 examples. The digits have been size-normalized and centered in a fixed-size image [18]. For every digit, 500 examples are randomly chosen and fed into kDDBSCAN. Table 5 provides the results of the clustering evaluation, and Figure 21 displays the visual clustering results. As can be seen from Table 5, the maximum Homogeneity reaches 0.72, whereas the maximum ARI is only 0.323. A higher Homogeneity value in Table 5 corresponds to a larger maximum number of clusters in the top-left of Figure 21, while a higher ARI value means that more similar examples are grouped into the same cluster. These two clustering indexes therefore pull in opposite directions, and which one is preferable depends on the actual application. From Figure 21, it can be concluded that the digits and are easily grouped into the same cluster.
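The per-digit sampling step can be sketched as below. It assumes sampling without replacement and a fixed random seed for reproducibility, neither of which is specified in the text; the function name `sample_per_class` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def sample_per_class(labels, per_class, seed=0):
    """Pick `per_class` random indices for every distinct label, without replacement."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for label in sorted(by_class):
        # random.sample raises ValueError if a class has fewer than per_class items.
        chosen.extend(rng.sample(by_class[label], per_class))
    return chosen


# Toy stand-in for digit labels: 100 examples, ten per digit 0..9.
labels = [i % 10 for i in range(100)]
subset = sample_per_class(labels, per_class=3)
print(len(subset))  # 3 indices for each of the 10 digits
```

With the setting described in the text, `per_class` would be 500, giving 5,000 examples in total.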
[Figure panels (a)–(i)]
4. Conclusions
In summary, the basic idea of this paper is that objects in the same cluster have similar, small deviation densities, which reflect the deviation of an object from the others. On this basis, kDDBSCAN is proposed, combining the deviation density with DBSCAN to identify clusters with different and varied densities. Various datasets containing clusters with arbitrary shapes, uniform density, and different or varied densities are used to demonstrate the performance of kDDBSCAN. The results show that kDDBSCAN achieves better results than DBSCAN.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (no. 61502423), Zhejiang Provincial Natural Science Foundation (nos. Y14F020092, LY14F040002, and LY16G020012), Zhejiang Public Welfare Technology Research Project Foundation (no. 2017C31040), and Ningbo Natural Science Foundation (nos. 2015A610130 and 2013A610006).