Abstract

DBSCAN is a foundational algorithm for density-based clustering. It can discover clusters of different shapes and sizes in large amounts of data containing noise and outliers. However, it fails to handle the local density variation that exists within a cluster. A good clustering method should therefore allow significant density variation within a cluster, because enforcing homogeneous clustering may generate a large number of small, unimportant clusters. In this paper, an enhancement of the DBSCAN algorithm is proposed that detects clusters of different shapes and sizes that differ in local density. Our proposed method, VMDBSCAN, first finds the "core" of each cluster generated by DBSCAN. Then, it "vibrates" points toward the cluster that has the maximum influence on these points. As a result, our proposed method can find the correct number of clusters.

1. Introduction

Unsupervised clustering is an important data analysis task that tries to organize a data set into separate groups with respect to a distance or, equivalently, a similarity measure [1]. Clustering has been applied in many fields, including pattern recognition [2], image processing [3], machine learning [4], and bioinformatics [5].

Clustering methods can be categorized into two main types: fuzzy clustering and hard clustering. In fuzzy clustering, data points can belong to more than one cluster, each with a membership probability [6]. In hard clustering, data points are divided into distinct clusters, where each data point belongs to one and only one cluster. Data points can be grouped using many different techniques, such as partitioning, hierarchical, density-based, grid-based, and model-based methods.

Partitioning algorithms minimize a given clustering criterion by iteratively relocating data points between clusters until a (locally) optimal partition is attained. The most popular partition-based clustering algorithms are 𝑘-means [7] and 𝑘-medoid [8]. The advantage of partition-based algorithms is that they create clusters iteratively, but their limitations are that the number of clusters has to be specified by the user and that only spherical cluster shapes can be detected.

Hierarchical algorithms provide a hierarchical grouping of the objects. They can be divided into two approaches: the bottom-up, or agglomerative, approach and the top-down, or divisive, approach. In the agglomerative approach, each object initially represents a separate cluster, and at the end all objects belong to the same cluster. In the divisive approach, all objects initially belong to the same cluster, which is split until each object constitutes a separate cluster. Hierarchical algorithms create nested relationships between clusters, which can be represented as a tree structure called a dendrogram [9]. The resulting clusters are determined by cutting the dendrogram at a certain level. Hierarchical algorithms use distance measures between objects and between clusters. Many definitions can be used to measure the distance between objects, for example, the Euclidean, City-block (Manhattan), and Minkowski distances.

Between clusters, one can measure the distance as the distance between the two nearest objects in the two clusters (single-linkage clustering) [10], between the two furthest (complete-linkage clustering) [11], or between the medoids of the clusters. The disadvantage of hierarchical algorithms is that after an object is assigned to a given cluster, the assignment cannot be modified later; also, only spherical clusters can be obtained. Their advantage is that validation indices (correlation and inconsistency measures) defined on the clusters can be used for determining the number of clusters. Popular hierarchical clustering methods are CHAMELEON [12], BIRCH [13], and CURE [14].

Density-based algorithms like DBSCAN [15] and OPTICS [16] first find the core objects and then grow clusters from these cores by searching for objects that lie in a neighborhood within a radius epsilon of a given object. The advantage of these algorithms is that they can detect clusters of arbitrary shape and can filter out noise.

Grid-based algorithms quantize the object space into a finite number of cells (hyperrectangles) and then perform the required operations on the quantized space. The advantage of this approach is its fast processing time, which is in general independent of the number of data objects. Popular grid-based algorithms are STING [17], CLIQUE [18], and WaveCluster [19].

Model-based algorithms find good approximations of the model parameters that best fit the data. They can be either partitional or hierarchical, depending on the structure or model they hypothesize about the data set and the way they refine this model to identify partitionings. They are closer to density-based algorithms in that they grow particular clusters so that the preconceived model is improved. However, they sometimes start with a fixed number of clusters, and they do not use the same concept of density. The most popular model-based clustering method is EM [20].

Fuzzy algorithms assume that no hard clusters exist in the set of objects and that one object can be assigned to more than one cluster. The best-known fuzzy clustering algorithm is FCM (Fuzzy 𝐶-Means) [21].

Categorical data algorithms are specifically developed for data where Euclidean, or other numerical-oriented, distance measures cannot be applied.

The rest of the paper is organized as follows. Section 2 provides related work on density-based clustering. Section 3 presents the DBSCAN clustering algorithm. Section 4 describes the proposed algorithm. In Section 5, simulation results are presented and discussed. Finally, Section 6 presents conclusions and future work.

2. Related Work

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [15] is a pioneering density-based clustering algorithm. It requires two user-defined input parameters: a radius and the minimum number of objects within that radius. The density of an object is the number of objects in its 𝜀-neighborhood. DBSCAN does not specify an upper limit on the density of a core object, that is, on how many objects may be present in its neighborhood. As a result, the output clusters can have wide variation in local density, and a large number of small, unimportant clusters may be generated.

The OPTICS algorithm [16] is an improvement of DBSCAN that deals with clusters of varying density. OPTICS does not assign cluster memberships; instead, it computes an ordering of the objects based on their reachability distance to represent the intrinsic hierarchical clustering structure. Pei et al. [22] proposed a nearest-neighbor clustering method in which the density threshold (equivalent to Eps in DBSCAN) is computed via the expectation-maximization (EM) algorithm [20], and the optimal value of 𝑘 (equivalent to MinPts in DBSCAN) is decided by the lifetime of each individual 𝑘. As a result, the clustered points and noise are separated according to the density threshold and the optimal value of 𝑘.

In order to adapt DBSCAN to data consisting of multiple processes, an improvement is needed to find the differences in the 𝑚th-nearest distances of the processes. Roy and Bhattacharyya [23] developed a new DBSCAN-based algorithm that can find overlapping clusters of different densities; however, its parameters are still defined by the user. Lin et al. [24] introduced a new approach called GADAC, which can produce more precise classification results than DBSCAN. Nevertheless, in GADAC, the estimation of the radius depends on the density threshold 𝛿, which can only be determined interactively.

Pascual et al. [25] developed a density-based clustering method to deal with clusters of different sizes, shapes, and densities. However, the neighborhood radius 𝑅, which is used to estimate the density of each point, has to be defined using prior knowledge, and the method favors Gaussian-shaped clusters, so it is not always suitable for clusters of arbitrary shape.

Another enhancement of the DBSCAN algorithm is DENCLUE [25], which is based on an influence function that describes the impact of an object upon its neighborhood. The density function yields local density maxima, and these local maxima are used to form the clusters. DENCLUE produces good clustering results even when a large amount of noise is present.

EDBSCAN (Enhanced Density-Based Spatial Clustering of Applications with Noise) [26] is another extension of DBSCAN; it keeps track of the density variation that exists within a cluster. It calculates the density variance of a core object with respect to its 𝜀-neighborhood. If the density variance of a core object is less than or equal to a threshold value, and the core object also satisfies a homogeneity index with respect to its neighborhood, then the core object is allowed to expand. However, EDBSCAN calculates the density variance and homogeneity index only locally, in the 𝜀-neighborhood of a core object.

The DD_DBSCAN algorithm [27] is another enhancement of DBSCAN that finds clusters of different shapes and sizes that differ in local density, but it is unable to handle density variation within a cluster. DDSC [28] (a Density-Differentiated Spatial Clustering technique) is a further extension of DBSCAN; it detects clusters that occupy nonoverlapping spatial regions with reasonably homogeneous density variations within them.

In VDBSCAN [29] (Varied Density-Based Spatial Clustering of Applications with Noise), the authors also tried to improve on DBSCAN. The method computes the 𝑘-distance for each object, sorts these values in ascending order, and plots the sorted values; a sharp change in the plot corresponds to a suitable value of Eps.
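For illustration, the sorted 𝑘-distance values that such a plot is built from can be computed as in the following brute-force Python sketch (our own, not the authors' code; the function name is ours):

import numpy as np

def sorted_k_distances(X, k):
    # Pairwise Euclidean distances; column 0 of each sorted row is the
    # point itself at distance 0, so column k is the k-th nearest neighbor.
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    kth = np.sort(dists, axis=1)[:, k]
    # Plot these in ascending order; a sharp "knee" suggests a value for Eps.
    return np.sort(kth)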

CHAMELEON [12] finds the clusters in a data set using a two-phase algorithm. In the first phase, it generates a 𝑘-nearest-neighbor graph. In the second phase, it uses an agglomerative hierarchical clustering algorithm to find the clusters by merging subclusters.

Most clustering algorithms are not robust to noise and outliers, which makes density-based algorithms particularly valuable. However, most density-based clustering algorithms are unable to handle local density variation. DBSCAN [15] is one of the most popular algorithms because of the high quality of its noise-free output clusters, but it too fails to detect clusters of varied density; consequently, many enhancements of DBSCAN have been proposed to handle density variation within a cluster.

3. DBSCAN Algorithm

DBSCAN [30] performs density-based cluster formation. Its advantage is that it can discover clusters of arbitrary shapes and sizes. The algorithm regards clusters as dense regions of objects in the data space that are separated by regions of low object density. It has two input parameters, the radius 𝜀 and MinPts. To understand how the algorithm works, some concepts and definitions must be introduced. The definitions of dense objects are as follows.

Definition 1. The neighborhood within a radius 𝜀 of a given object is called the 𝜀-neighborhood of the object.

Definition 2. If the 𝜀-neighborhood of an object contains at least a minimum number 𝜎 of objects, then the object is called a 𝜎-core object.

Definition 3. Given a set of data objects, 𝐷, we say that an object 𝑝 is directly density-reachable from object 𝑞 if 𝑝 is within the 𝜀-neighborhood of 𝑞 and 𝑞 is a 𝜎-core object.

Definition 4. An object 𝑝 is density-reachable from object 𝑞 with respect to 𝜀 and 𝜎 in a given set of data objects, 𝐷, if there is a chain of objects p_1, p_2, …, p_n with p_1 = q and p_n = p such that p_{i+1} is directly density-reachable from p_i with respect to 𝜀 and 𝜎, for 1 ≤ i < n, p_i ∈ D.

Definition 5. An object 𝑝 is density-connected to object 𝑞 with respect to 𝜀 and 𝜎 in a given set of data objects, 𝐷, if there is an object o ∈ D such that both 𝑝 and 𝑞 are density-reachable from o with respect to 𝜀 and 𝜎.

According to the above definitions, clustering the data objects in an attribute space only requires finding all the maximal density-connected spaces; these density-connected spaces are the clusters. Every object not contained in any cluster is considered noise and can be ignored.

Explanation of DBSCAN Steps
(i) DBSCAN [31] requires two parameters: the radius epsilon (Eps) and the minimum number of points (MinPts). It starts with an arbitrary point that has not been visited and finds all neighbor points within distance Eps of that point.
(ii) If the number of neighbors is greater than or equal to MinPts, a cluster is formed. The starting point and its neighbors are added to this cluster, and the starting point is marked as visited. The algorithm then repeats the evaluation process for all the neighbors recursively.
(iii) If the number of neighbors is less than MinPts, the point is marked as noise.
(iv) Once a cluster is fully expanded (all points within reach have been visited), the algorithm proceeds to the remaining unvisited points in the data set.
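To make these steps concrete, the following is a minimal, illustrative Python sketch of DBSCAN (not the original implementation; the function and variable names are ours):

import numpy as np

def dbscan(X, eps, min_pts):
    # Labels: -2 = unvisited, -1 = noise, >= 0 = cluster id.
    n = len(X)
    labels = np.full(n, -2)
    cluster_id = -1

    def region_query(i):
        # All points within distance eps of point i (including i itself).
        return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

    for i in range(n):
        if labels[i] != -2:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:
            labels[i] = -1          # noise; may become a border point later
            continue
        cluster_id += 1             # start a new cluster from this core point
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:                # expand the cluster
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster_id   # border point, previously noise
            if labels[j] != -2:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)   # j is a core point; keep growing
    return labels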

4. The Proposed Algorithm

One of the problems with DBSCAN is that it cannot handle the wide density variation that may exist within a cluster.

To overcome this problem, a new algorithm, VMDBSCAN, based on the DBSCAN algorithm is proposed in this section. It first clusters the data objects using DBSCAN. Then, it finds the density functions of all data objects within each cluster. The data object that has the minimum density-function value becomes the core of its cluster. After that, it computes the density variation of each data object with respect to the density of the core of its own cluster and the densities of all other clusters' cores. According to the density variance, data objects are moved toward the new core, which is the core of another cluster that has the maximum influence on the tested data object.

We first present some definitions.

Definition 6. Suppose that 𝑥 and 𝑦 are two data objects in a 𝑑-dimensional feature space, 𝐷. The influence function of data object 𝑦 on 𝑥 is a function f_B^y : D → R_0^+ defined in terms of a basic influence function f_B as

f_B^y(x) = f_B(x, y). (1)
The influence function we choose is a function that determines the distance between two data objects, such as the Euclidean distance function:

f_Euclidean(x, y) = √((x − y)²). (2)

Definition 7. Given a 𝑑-dimensional feature space, 𝐷, the density function at a data object x ∈ D is defined as the sum of all the influences on 𝑥 from the rest of the data objects in 𝐷:

f_B^D(x) = Σ_{i=1}^{n} f_B^{x_i}(x), 1 ≤ i ≤ n. (3)

According to Definitions 6 and 7, we can calculate the density function for each data point in the space.
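As a sketch of Definitions 6 and 7 under the Euclidean influence function (2), the density function of every object can be computed as the sum of its distances to all other objects. The following brute-force Python fragment (our own, illustrative) assumes the data set fits in memory:

import numpy as np

def density_function(X):
    # Pairwise Euclidean distances; row sums give f_B^D(x) of (3).
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    return dists.sum(axis=1)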

Definition 8. Core: the core object of a cluster is the object that has the minimum density-function value according to Definition 7. That is, we calculate the density function of each object in the cluster, as given initially by DBSCAN, and the object that has the minimum total influence (distance) to all other objects becomes the core of that cluster.

Definition 9. The Total Density Function 𝐸 represents the difference among the data objects with respect to the core. That is, the Total Density Function 𝐸 for a data object x_i,

E_i = d(x_i, C_j), 1 ≤ i ≤ n, 1 ≤ j ≤ k, (4)

where n is the number of points and k is the number of cores, is the difference between the data object x_i and the core of its cluster.

In addition, given the initial clusters produced by the density-based clustering method, we can use the influence function (Definition 6) and the density function (Definition 7) to calculate the Total Density Function 𝐸 of a data object as the absolute difference between its density-function value and that of its core:

E_i = |f_B^D(x_i) − f_B^D(C)|. (5)
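Continuing the sketch, the core of each cluster (Definition 8) and the Total Density Function E of (5) can be derived from the density values computed above; here labels denotes the DBSCAN cluster ids (with -1 for noise), and the helper names are ours:

import numpy as np

def cores_and_E(labels, density):
    # Core of a cluster = the member with the minimum density-function value.
    cores = {}
    for c in np.unique(labels[labels >= 0]):
        members = np.where(labels == c)[0]
        cores[c] = members[np.argmin(density[members])]
    # E_i = |f_B^D(x_i) - f_B^D(C)| per (5); noise points get infinity.
    E = np.array([abs(density[i] - density[cores[l]]) if l >= 0 else np.inf
                  for i, l in enumerate(labels)])
    return cores, E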

4.1. Vibration Process

Our main idea is to vibrate data objects according to the density of each data object with respect to the core (Definition 8) that represents its cluster, as measured by the Total Density Function 𝐸 of each data object in (5). If the Total Density Function E_i of an object with respect to its own core is greater than its Total Density Function with respect to some other core, all points in that cluster are vibrated toward the core object that has the maximum influence on that object, according to

x(i+1) = x(i) + η (x(c) − x(i)) e^(−1/(2σ²)), (6)

where σ = E_i/T, x(i) is the current tested point, x(c) is the current tested core, η is the learning rate, and T controls the reduction in σ.

We use η in the vibration equation to control the pull of the winning core on the current cluster, and we can adapt it to obtain the best clustering result. 𝑇 is used in our formula to control the reduction in σ; that is, as time increases, the movement (vibration) of a point toward the new core is reduced.
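A single vibration step of (6) can be sketched in Python as follows; this is our reading of the reconstructed formula, with a small guard against a zero σ added for safety:

import numpy as np

def vibrate(x, core, E_i, eta, T):
    # sigma = E_i / T; the movement shrinks as T grows over time.
    sigma = max(E_i / T, 1e-12)
    # x <- x + eta * (core - x) * exp(-1 / (2 * sigma^2)), cf. (6).
    return x + eta * (core - x) * np.exp(-1.0 / (2.0 * sigma ** 2))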

Formally, we can describe our proposed algorithm as follows.
(1) Calculate the Density Function for all the data objects.
(2) Cluster the data objects using the traditional DBSCAN algorithm.
(3) Calculate the Density Function for all the data objects again, and then find the core of each generated cluster.
(4) For each data object, if its Total Density Function with respect to its own core is greater than that with respect to another core, vibrate the data objects in that cluster.

The proposed algorithm is described in pseudocode in Algorithm 1.

  VMDBSCAN()
(1) Begin initialize η
(2)   For i = 1 to n
(3)     d_i ← density(x_i)
(4)   end_For
(5)   Class ← DBSCAN()
(6)   For j = 1 to c
(7)     c_j ← core(x_j)
(8)   end_For
(9)   For i = 1 to n
(10)    E_i = c_i − density(x_i)
(11)    For j = 1 to c
(12)      E_ic = c_j − density(x_i)
(13)      if E_ic < E_i
(14)        vibrate the point
(15)      else no vibrate
(16)      end_If
(17)    end_For
(18)  end_For

The first step initializes the learning rate η, which takes small values in [0, 1]; n is the number of data points in the data set D. For each data point in the data set, we compute its Density Function according to (3) and store the results in an array of point densities (d). Line 5 of the algorithm calls the DBSCAN algorithm to produce the initial clustering. In lines 6–8, we find the core object of each cluster resulting from DBSCAN. Line 10 calculates the Total Density Function E of each point x_i with respect to its own core object. Line 12 calculates the Total Density Function E of that point x_i with respect to every other core object. Lines 13–16 check the effect of the core objects on the data object x_i: if the influence of its own core object is less than that of another core object c_j, then all points of the cluster to which the data object belongs are vibrated toward the core c_j.
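Putting the pieces together, the following illustrative Python driver follows the flow of Algorithm 1, using scikit-learn's DBSCAN for the initial clustering and the sketches given earlier (density_function, cores_and_E, vibrate). For brevity it vibrates points individually rather than whole clusters, and the parameter names are ours, not a definitive implementation:

import numpy as np
from sklearn.cluster import DBSCAN

def vmdbscan(X, eps, min_pts, eta, T=1.0):
    X = X.copy()
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)  # line 5
    density = density_function(X)                                 # lines 2-4
    cores, E = cores_and_E(labels, density)                       # lines 6-10
    for i, l in enumerate(labels):                                # lines 9-18
        if l < 0:
            continue                   # skip noise points
        # E of x_i with respect to every core; the smallest value
        # identifies the core with the maximum influence on x_i.
        E_to = {c: abs(density[i] - density[idx]) for c, idx in cores.items()}
        best = min(E_to, key=E_to.get)
        if best != l and E_to[best] < E_to[l]:                    # line 13
            X[i] = vibrate(X[i], X[cores[best]], E[i], eta, T)    # line 14
    return X, labels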

5. Simulation and Results

We evaluated our proposed algorithm on several artificial and real data sets.

5.1. Artificial Data Sets

We use three artificial two-dimensional data sets, since their results are easily visualized. The first data set, shown in Figure 1, consists of 226 data points forming one cluster.

Figure 1(a) shows a plot of the original data set. In Figure 1(b), after applying the DBSCAN algorithm with MinPts = 5 and Eps = 11.8, we get 2 clusters. In Figure 1(c), after applying our proposed algorithm with η = 0.0005, we get the correct number of clusters, namely, 1 cluster. We also note that the points deleted by DBSCAN, which it considered noise points, reappear after applying our proposed algorithm.

Figure 2(a) shows a plot of the original data set. Figure 2(b) shows the result of applying DBSCAN to the second data set with MinPts = 5 and Eps = 0.2, which produces 3 clusters. However, applying our proposed algorithm (Figure 2(c)) with η = 0.005 gives the correct number of clusters, namely, 2.

Figure 3(a) shows a plot of the original data set. In Figure 3(b), after applying the DBSCAN algorithm with MinPts = 5 and Eps = 8, we get 4 clusters. In this data set, DBSCAN treats some points as noise and removes them. In Figure 3(c), after applying our proposed algorithm with η = 0.0005, we get the correct number of clusters, namely, 5.

5.2. Real Data Sets

We use the Iris data set from the UCI repository (http://archive.ics.uci.edu/ml/datasets/Iris), which contains three clusters and 150 data points with 4 dimensions. To measure the accuracy of our proposed algorithm, we use an average error index, in which we count the misclassified samples and divide by the total number of samples. Applying the DBSCAN algorithm with Eps = 0.35 and MinPts = 5 gives an average error index of 45.33%, whereas applying the VMDBSCAN algorithm with η = 0.00005 gives an average error index of 20.00%.
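The average error index used here is simply the fraction of misclassified samples; a minimal sketch (assuming the predicted cluster labels have already been matched to the true class labels):

import numpy as np

def average_error_index(true_labels, pred_labels):
    # Fraction of samples whose predicted label differs from the true one.
    return np.mean(np.asarray(true_labels) != np.asarray(pred_labels))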

We apply another data set, the Haberman data set from UCI (http://archive.ics.uci.edu/ml/datasets/Haberman's+Survival), to show the efficiency of our proposed algorithm. The Haberman data set contains two clusters and 306 data points with 3 dimensions. The obtained results are shown in Table 1. We get an average error index of 33.33% when we apply the DBSCAN algorithm with Eps = 4.3 and MinPts = 5, whereas applying the VMDBSCAN algorithm with η = 0.0005 gives an average error index of 27.78%.

We apply a further data set, the Glass data set from UCI (http://archive.ics.uci.edu/ml/datasets/Glass+Identification). The Glass data set contains six clusters and 214 data points with 9 dimensions. The obtained results are shown in Table 1. We get an average error index of 66.82% when we apply the DBSCAN algorithm with Eps = 0.85 and MinPts = 5, whereas applying the VMDBSCAN algorithm with η = 0.0005 gives an average error index of 62.15%. We note that for this data set the error rate of both DBSCAN and VMDBSCAN is large. This is because, as the number of dimensions increases, clustering algorithms fail to find the correct number of clusters.

6. Conclusions and Future Work

We have proposed an enhancement of DBSCAN to cope with the problems of one of the most widely used clustering algorithms. Our proposed algorithm, VMDBSCAN, gives far more stable estimates of the number of clusters than the existing DBSCAN over many different types of data of different shapes and sizes. Future work will focus on determining the best value of the parameter η and on improving the results for high-dimensional data sets.