Abstract
Clustering is widely used in data analysis, and density-based methods have developed rapidly over the past 10 years. Although state-of-the-art density peak clustering algorithms are efficient and can detect clusters of arbitrary shape, they are essentially centroid-based methods of a nonspherical type. In this paper, a novel local density hierarchical clustering algorithm based on reverse nearest neighbors, RNNLDH, is proposed. By constructing and using a reverse nearest neighbor graph, the extended core regions are found and taken as initial clusters. Then, a new local density metric is defined to calculate the density of each object; meanwhile, the density hierarchical relationships among the objects are built according to their densities and neighbor relations. Finally, each unclustered object is assigned to one of the initial clusters or labeled as noise. Experimental results on synthetic and real data sets show that RNNLDH outperforms current clustering methods based on density peaks or reverse nearest neighbors.
1. Introduction
Clustering is the task of finding a set of groups such that similar objects are placed in the same group and dissimilar objects are separated into different groups. Since clustering can uncover inherent, potential, and unknown knowledge, principles, or rules in the real world, it has been widely used in many fields, including data mining, pattern recognition, machine learning, information retrieval, image analysis, and computer graphics [1–3]. According to the strategies used, clustering algorithms are traditionally classified into connectivity-based, centroid-based, distribution-based, and density-based approaches [1, 2]. Among these, density-based approaches can discover clusters with arbitrary shapes and different sizes without specifying the number of clusters.
In density-based clustering, clusters are considered to be dense regions of objects separated by low-density regions that represent noise. The clustering procedure can be broken into two steps: estimating the density of each object and grouping density-connected objects.
The first approach to adopt the density-based strategy was proposed by Ester et al. [4] in the paper “Density-Based Spatial Clustering of Applications with Noise,” and is known as DBSCAN. In this approach, the density of each object is defined as the number of objects contained in its eps neighborhood. If this number is greater than minpts, the object is regarded as a core object; otherwise, it is regarded as noise. Then, all objects that are reachable from one of the unclustered core objects are grouped into a cluster. However, it is difficult to select proper values for the two parameters eps and minpts of DBSCAN. Another drawback of DBSCAN is that adjacent clusters of different densities cannot be properly identified [5, 6].
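The density estimate and core test described for DBSCAN can be sketched as follows. This is a minimal illustration, not the reference implementation: the function names are hypothetical, the Euclidean distance is assumed, and the object itself is assumed not to count toward its own neighborhood.

```python
import math

def eps_neighborhood_sizes(points, eps):
    """For each point, count the other points within distance eps.

    Assumption: Euclidean distance, and a point is not its own neighbor.
    """
    counts = []
    for i, p in enumerate(points):
        c = sum(1 for j, q in enumerate(points)
                if j != i and math.dist(p, q) <= eps)
        counts.append(c)
    return counts

def core_mask(points, eps, minpts):
    """An object is treated as core when its count is greater than minpts."""
    return [c > minpts for c in eps_neighborhood_sizes(points, eps)]

# Four tightly packed points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
mask = core_mask(points, eps=0.2, minpts=2)
print(mask)
```

With these toy values, the four packed points are core and the isolated point is not, which also shows how strongly the outcome depends on the chosen eps and minpts.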
Density peak clustering (DPC) [7] is another well-known strategy for density-based clustering. It is based on the idea that cluster centres have higher densities than their neighbors and are far away from each other. This method identifies the cluster centres from a decision graph, which is constructed from the density and distance attributes of each object. Moreover, it needs only one parameter. Although it seems more convenient than DBSCAN, it has some inherent defects. First, according to its definition of density peak and its grouping strategy, DPC is essentially a centroid-based method of a nonspherical type, so in some cases complex shapes still cannot be recognized. Second, the cluster centres are picked from the decision graph manually, which limits the application of DPC. Besides, it is very difficult to select the true centres on some specific data sets. Third, errors are propagated in the subsequent assignment process.
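The two attributes plotted in the DPC decision graph can be sketched as follows. This is a minimal illustration under stated assumptions: the density is the cutoff-kernel count with a cutoff distance named `dc` here, ties in density are not treated as "higher", and the function name is hypothetical.

```python
import math

def dpc_attributes(points, dc):
    """Return (rho, delta) for each point.

    rho[i]   : number of other points closer than the cutoff dc.
    delta[i] : distance to the nearest point of strictly higher density;
               for a point with no denser point, the largest distance
               from it to any other point.
    """
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    rho = [sum(1 for j in range(n) if j != i and d[i][j] < dc)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [d[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(d[i]))
    return rho, delta

# Two one-dimensional groups; the middle point of each group is the peak.
points = [(0,), (1,), (2,), (10,), (11,), (12,)]
rho, delta = dpc_attributes(points, dc=1.5)
print(rho)    # [1, 2, 1, 1, 2, 1]
print(delta)  # [1.0, 11.0, 1.0, 1.0, 11.0, 1.0]
```

Points 1 and 4 combine a high density with a large distance, so they stand out in the decision graph as centre candidates. Picking them still requires a manual or heuristic threshold, which is the limitation discussed above.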
To remedy these limitations of DPC, many improved methods have been proposed [5, 8–14]. FKNNDPC [8] defines a uniform local density metric based on the k-nearest neighbors and uses a fuzzy technique to complete the assignment procedure after the cluster centres have been found manually. ADPC [5] calculates the local density of each object over its k-nearest neighbors by using a Gaussian kernel function and applies a divide-and-conquer strategy to find cluster centres and group other objects automatically. RECOME [10] defines a new density measure as the ratio of each object’s density to the maximum density of its k-nearest neighbors and also uses a divide-and-conquer strategy to partition a data set. Although these algorithms improve DPC in some aspects, they still suffer from some drawbacks of centroid-based methods.
In contrast to the algorithms listed above, RECORD [15], RNNDSC [16], ISDSC [17], and ISBDSC [6] use reverse nearest neighbors to define object density. From the viewpoint of graph theory, these algorithms use a directed graph to complete clustering. In the graph, each vertex is an object of the data set, and for any two vertexes a and b, there is a directed edge from a to b if b is one of the k-nearest neighbors of a. RECORD defines as core objects those vertexes whose in-degrees (that is, whose numbers of reverse k-nearest neighbors) are not lower than the input parameter k, and as outliers all others. Outliers are regarded as noise and eliminated from the graph, while core vertexes and their edges form a subgraph, whose strongly connected components are taken as the clustering result. The main distinction between RNNDSC and RECORD lies in the assignment of outliers. In the former, an outlier is grouped into the cluster that its nearest neighbor belongs to if the nearest neighbor is a core object. ISDSC defines the k-influence space of each object as the intersection of its k-nearest neighbor set and its reverse k-nearest neighbor set. The method first applies the STRATIFY algorithm to remove outliers and then performs a clustering procedure similar to that of RECORD on the remaining objects, but a vertex is defined as a core object if the size of its k-influence space is greater than 2k. ISBDSC also uses the k-influence space to create the subgraph, as ISDSC does, but the clustering procedure is applied to the whole data set. Compared with DPC, the advantage of these approaches is that they no longer need to find cluster centres. However, since RECORD, ISDSC, and ISBDSC employ a global threshold to predetermine outliers, they assign too many objects to noise. PIDC [18] uses the size of the unique closest neighbor set as an estimate of object density and uses growing strategies to complete clustering. Although this method is parameter independent, it is sensitive to noise and has high computational complexity.
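The graph-based view can be illustrated with a short sketch in the spirit of RECORD. It rests on assumptions that are not taken from the cited papers: Euclidean distance, ties broken by index, a core object being one with at least k reverse k-nearest neighbors, and hypothetical function names. The strongly connected components are found with Kosaraju's algorithm.

```python
import math

def knn_lists(points, k):
    """Indices of the k nearest neighbors of every point (self excluded)."""
    n = len(points)
    out = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: (math.dist(points[i], points[j]), j))
        out.append(order[:k])
    return out

def strongly_connected_components(nodes, succ):
    """Kosaraju's algorithm on the subgraph induced by `nodes`."""
    nodes = set(nodes)
    order, seen = [], set()
    for s in nodes:                      # first pass: finishing order
        if s in seen:
            continue
        seen.add(s)
        stack = [(s, iter(succ[s]))]
        while stack:
            v, it = stack[-1]
            for w in it:
                if w in nodes and w not in seen:
                    seen.add(w)
                    stack.append((w, iter(succ[w])))
                    break
            else:
                order.append(v)
                stack.pop()
    pred = {v: [] for v in nodes}        # transposed graph
    for v in nodes:
        for w in succ[v]:
            if w in nodes:
                pred[w].append(v)
    comps, assigned = [], set()
    for s in reversed(order):            # second pass on the transpose
        if s in assigned:
            continue
        assigned.add(s)
        comp, todo = [], [s]
        while todo:
            v = todo.pop()
            comp.append(v)
            for u in pred[v]:
                if u not in assigned:
                    assigned.add(u)
                    todo.append(u)
        comps.append(sorted(comp))
    return sorted(comps)

def record_like_clusters(points, k):
    """Core objects have >= k reverse neighbors; clusters are SCCs of cores."""
    nn = knn_lists(points, k)
    reverse_count = [0] * len(points)
    for neighbors in nn:
        for j in neighbors:
            reverse_count[j] += 1
    core = [i for i, c in enumerate(reverse_count) if c >= k]
    return strongly_connected_components(core, nn), core

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 5)]
clusters, core = record_like_clusters(points, k=2)
print(clusters)  # [[0, 1, 2], [3, 4, 5]]
```

The isolated point (index 6) has no reverse neighbors, so it is dropped as an outlier before the components are computed. This is the global-threshold behaviour that tends to produce a high noise ratio.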
In this paper, we propose an improved clustering approach that combines the k-reverse nearest neighbor graph model with a density hierarchical relationship model. Based on the reverse nearest neighbor model and the parameter k, a directed graph is constructed from the objects of a data set. By searching for strongly connected components in the graph, the data set is partitioned into several initial clusters. Then, we use density dependence and the k-nearest neighborhood to build the density hierarchical relationships of all objects. Each unclassified object is grouped into the cluster to which its parent belongs. The algorithm has the following advantages: (1) The method searches for core regions instead of density peaks, so it can find the true clusters automatically rather than obtaining false cluster centres. (2) A novel density measure based on the k-reverse nearest neighbors is proposed, which can reflect the aggregation relations of objects. (3) Classifying the unclustered objects by the local density hierarchical relationships is more efficient, and it also reduces the risk of misclassification caused by the order in which objects are processed.
The proposed algorithm is evaluated on synthetic and real-world data sets that are widely used for performance tests of clustering algorithms. The results of RNNLDH are compared with those of ISDSC, ISBDSC, RNNDSC, and ADPC in terms of three popular benchmarks: F-measure (F1) [19], adjusted mutual information (AMI), and adjusted Rand index (ARI) [20]. The ratio of the number of noise objects to the total number of objects is also taken as a benchmark.
The rest of the paper is organized as follows: Section 2 describes in detail the notations and definitions used in our algorithm. Section 3 describes the procedures of RNNLDH in detail. Section 4 gives our experimental results and briefly discusses the choice of parameter k. Section 5 draws some conclusions.
2. RNNLDH Algorithm
In this section, we give a detailed theoretical description of RNNLDH. Some definitions in this section were introduced in other papers and are modified for our method.
2.1. Notations and Definitions
The notations used in this paper are listed below:
(1) |·|: cardinality of a set
(2) X: set of data with d dimensions; d: the number of dimensions of the data
(3) x, y, z: any three objects in X
(4) dist(x, y): distance between two objects x and y
(5) NN_{k}(x): set of the k-nearest neighbors of object x; k: the input parameter, which takes an integer value
(6) In particular, NN_{1}(x) is the set containing only the one object that is nearest to object x
(7) RNN_{k}(x): set of the k-reverse nearest neighbors of object x, which is defined as RNN_{k}(x) = {y ∈ X | x ∈ NN_{k}(y)}
(8) std: standard deviation function
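The k-nearest neighbor set and the k-reverse nearest neighbor set can be computed as in the following sketch. The names `nn_k` and `rnn_k` are chosen for this illustration, the Euclidean distance is assumed, and ties are broken by index. The k-reverse nearest neighbors of x are taken to be the objects that have x among their own k-nearest neighbors, which is the usual meaning of the term.

```python
import math

def nn_k(points, k):
    """k-nearest neighbor index sets, one per object (self excluded)."""
    n = len(points)
    result = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: (math.dist(points[i], points[j]), j))
        result.append(set(order[:k]))
    return result

def rnn_k(nn):
    """k-reverse nearest neighbor sets: y is in rnn[x] when x is in nn[y]."""
    rnn = [set() for _ in nn]
    for y, neighbors in enumerate(nn):
        for x in neighbors:
            rnn[x].add(y)
    return rnn

points = [(0,), (1,), (2,), (10,)]
nn = nn_k(points, k=1)
rnn = rnn_k(nn)
print(nn)   # [{1}, {0}, {1}, {2}]
print(rnn)  # [{1}, {0, 2}, {3}, set()]
```

Every object has exactly k nearest neighbors, but the size of the reverse set varies from object to object: the far-away object at 10 has none. This asymmetry is what makes the reverse neighbor count usable as a density signal.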
Definition 1 (directly density reachable). Object x is directly density reachable from an object if(1) and (2) and
Definition 2 (density reachable). Object x is density reachable from object if there is a chain of objects , , and , which satisfies the conditions listed below:(1) is a core object(2) is directly density reachable from
Definition 3 (core region). A core region (R_{c}) is a nonempty subset of such that(1)(2), is density reachable from
Definition 4 (extended core region). Given a core region (R_{c}), its extended core region (R_{e}) is composed of all elements in R_{c} and any object x that satisfies the following conditions:(1) does not belong to any core region(2), and
Definition 5 (local density). The local density of an object x is defined as where .
Definition 6 (parent). The parent of an object is defined as where .
The parent represents the local density hierarchical relationship of an object to its k-nearest neighbors.
Definition 7 (hierarchical distance). The hierarchical distance of an object x is defined as
Definition 8 (inner distance). The inner distance of an extended core region is defined as where is the standard deviation of the distances of all objects to their closest neighbors. Usually, the distance from a core object to its closest core object is less than the distance to its closest boundary object. The offset b can help to capture the global distribution of objects.
Definition 9 (density connected). An object x is density connected to an extended core region if there exists an object u and a chain , , and such that(1)(2)
Definition 10 (cluster). Given an extended core region , a cluster C is the union of and all objects in X which are density connected to the .
Definition 11 (noise). An object x is noise if it does not belong to any cluster of X.
NR is the noise ratio, which is defined as the ratio of the number of noise objects to the total number of objects in X.
3. Procedures of the RNNLDH Algorithm
In this section, we discuss our algorithm in detail.
Algorithm 1 lists the procedures for performing RNNLDH, which accepts two inputs, the data set and the nearest neighbor parameter k, and outputs a label vector. The value of each element in the label vector indicates which cluster the corresponding object belongs to, and the object is noise if its label value is zero.

In the main procedure, function GetLDH is first called to get the local density hierarchical relationship (parent) of each object, and the result is saved into the array variable parent. Then, in steps 5 to 18, all extended core regions in data set X are found by calling the FindECR procedure and saved to the set variable ECRs; each extended core region is an initial cluster. Finally, in steps 20 to 29, each noise object connected to an initial cluster is iteratively assigned to that cluster if it satisfies the distance condition.
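The final assignment step can be sketched as follows. This is an illustration under assumptions rather than the algorithm as published: the arrays `parent` and `hdist` are taken as already computed, the distance condition is simplified to a single hypothetical threshold `max_dist`, and label 0 marks an object that is still unclustered (noise).

```python
def assign_by_parent(labels, parent, hdist, max_dist):
    """Iteratively give each unclustered object the label of its parent.

    labels[i] == 0 means unclustered. An object stays noise when its parent
    is itself, when its parent is never clustered, or when its hierarchical
    distance exceeds max_dist (a hypothetical stand-in for the distance
    condition of the paper).
    """
    labels = list(labels)
    changed = True
    while changed:
        changed = False
        for i, lab in enumerate(labels):
            p = parent[i]
            if lab == 0 and p != i and labels[p] != 0 and hdist[i] <= max_dist:
                labels[i] = labels[p]
                changed = True
    return labels

# Object 1 is in an initial cluster; 0 and 2 reach it through their parents.
labels = [0, 1, 0, 0, 0]
parent = [1, 1, 0, 3, 1]
hdist = [1.0, 0.0, 1.0, 0.0, 5.0]
result = assign_by_parent(labels, parent, hdist, max_dist=2.0)
print(result)  # [1, 1, 1, 0, 0]
```

Object 3 remains noise because it is its own parent, and object 4 remains noise because it is too far from its parent under the assumed threshold.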

The algorithm GetLDH is realized according to Definition 6. The main purpose of this algorithm is to find, among the k-nearest neighbors of each object, the object on which it is density dependent, which is called its parent. Meanwhile, the hierarchical distance of each object is calculated by formula (5) and saved into an array.
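A possible shape of such a parent search is sketched below. The selection rule used here is an assumption made for illustration only: the parent is taken to be the nearest of the k-nearest neighbors that has a strictly higher density, an object with no denser neighbor is its own parent, and the hierarchical distance is taken to be the distance to the parent. The densities are passed in as a precomputed list, and the function name is hypothetical.

```python
import math

def get_ldh_sketch(points, density, k):
    """Return (parent, hdist) under the assumptions stated above."""
    n = len(points)
    parent = list(range(n))   # default: every object is its own parent
    hdist = [0.0] * n
    for i in range(n):
        neighbors = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: (math.dist(points[i], points[j]), j))[:k]
        for j in neighbors:   # visited from nearest to farthest
            if density[j] > density[i]:
                parent[i] = j
                hdist[i] = math.dist(points[i], points[j])
                break
    return parent, hdist

points = [(0,), (1,), (2,), (3,)]
density = [1, 3, 2, 1]
parent, hdist = get_ldh_sketch(points, density, k=2)
print(parent)  # [1, 1, 1, 2]
print(hdist)   # [1.0, 0.0, 1.0, 1.0]
```

Following the parent links from any object leads to a local density maximum (object 1 here), which gives the hierarchy its tree-like structure.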
Figure 1 shows the result of the GetLDH algorithm on the Compound [21] data set. In this figure, a red circle represents a core object and a black circle represents a boundary or noise object. The larger the density of an object is, the bigger its circle is. A line with an arrow represents the local density hierarchical relationship between two connected objects.
Algorithm 3 presents the FindECR procedure for finding a new cluster. An unlabelled core object x is input as a seed and appended to a queue. The algorithm pops the first seed from the queue and iteratively searches the k-nearest neighbors of the seed. Among the unlabelled objects visited in the procedure, the core objects are assigned the same cluster number cid, while the others are labeled as noise. Steps 3–17 find a core region starting from core object x. Steps 18–21 discard this core region if its size is not greater than . Steps 22–29 extend the core region by using Definition 4.
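The queue-based search for a core region can be sketched as follows. The sketch covers only the growth and discard steps, not the extension by Definition 4. It assumes that a precomputed Boolean list `is_core` and neighbor lists `nn` are available. The marker `UNVISITED`, the parameter `min_size`, and the function name are hypothetical.

```python
from collections import deque

NOISE = 0        # label value for noise, as in the label vector
UNVISITED = -1   # hypothetical marker for objects not yet visited

def grow_core_region(seed, nn, is_core, labels, cid, min_size):
    """Breadth-first growth of a core region from an unlabelled core seed.

    Visited core objects receive the cluster number cid; visited non-core
    objects are labeled as noise. A region whose size is not greater than
    min_size is discarded and its members are relabeled as noise.
    """
    queue = deque([seed])
    labels[seed] = cid
    region = [seed]
    while queue:
        x = queue.popleft()
        for y in nn[x]:
            if labels[y] != UNVISITED:
                continue
            if is_core[y]:
                labels[y] = cid
                region.append(y)
                queue.append(y)
            else:
                labels[y] = NOISE
    if len(region) <= min_size:
        for x in region:
            labels[x] = NOISE
        return []
    return region

nn = [[1, 2], [0, 2], [1, 3], [2, 4], [3, 2]]
is_core = [True, True, True, False, True]
labels = [UNVISITED] * 5
region = grow_core_region(0, nn, is_core, labels, cid=1, min_size=2)
print(region)  # [0, 1, 2]
print(labels)  # [1, 1, 1, 0, -1]
```

Object 4 is a core object but is reachable only through the non-core object 3, so it is not visited and can seed a different region later.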

Figure 2 shows the result of the FindECR algorithm on the Compound [21] data set. In this figure, six extended core regions are found, and black dots represent boundary and noise objects. The final results of our algorithm and the comparison with the state-of-the-art methods on Compound are shown in the next section.
3.1. Choice of k
The five algorithms discussed in this paper all need one parameter k. ISDSC and ADPC did not describe how to set the value of k. ISBDSC compared its parameter setting with that of DBSCAN and concluded that it is more robust than DBSCAN under different settings of k, but it did not address the choice of k either. RNNDSC discussed two approaches to determine an appropriate value of k; each approach chose the best k from 1 to 100 by a criterion. RNNLDH can also use these two approaches to choose k. From the results of a large number of experiments, we have not yet found a theoretical basis for the choice of k. To achieve the best performance of each tested algorithm, we choose k in this range independently for each algorithm, such that the number of clusters found by the algorithm is as close as possible to the true class number and the F1 measure is the largest.
3.2. Complexity of the Algorithm
The time complexity of RNNLDH depends on the following aspects: (1) computing the distances between points (O(n^{2})); (2) sorting the distance vector of each object (O(n^{2})), which can be reduced to O(nlog(n)); (3) computing the local density with the k-reverse nearest neighbors (O(kn)), where k is not greater than n; (4) calculating the distance for each object (O(kn)); (5) finding extended core regions (O(n^{2})); and (6) classifying noise (O(n^{2})). So the overall time complexity of RNNLDH is O(n^{2}).
The above analysis shows that RNNLDH has the same complexity as RNNDSC and ADPC.
4. Results and Discussion
To evaluate the performance of RNNLDH, we perform a set of experiments on synthetic and real-world data sets that are commonly used to test the performance of clustering algorithms. We compare the performance of RNNLDH with that of well-known clustering algorithms, including RNNDSC [16], ISDSC [17], ISBDSC [6], and ADPC [5]. Three popular criteria, F1 measure (F1) [19], adjusted mutual information (AMI), and adjusted Rand index (ARI) [20], are used to evaluate the performance of these clustering algorithms. The upper bounds of these criteria are all 1.0. The better the clustering is, the larger the benchmark values are.
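As an illustration of how such criteria are computed, the following sketch implements the adjusted Rand index from the contingency counts of two label vectors, together with the noise ratio (the share of objects with label 0). It is a minimal standard-library version with hypothetical function names. How noise objects are treated when the benchmarks are computed is not specified here, so the sketch simply treats every label value, including 0, as an ordinary group.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index of two labelings of the same objects."""
    n = len(true_labels)
    pairs = Counter(zip(true_labels, pred_labels))
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_b = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:   # degenerate case, e.g. one group in both
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

def noise_ratio(labels):
    """Share of objects labeled as noise (label value 0)."""
    return sum(1 for lab in labels if lab == 0) / len(labels)

same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same)                       # 1.0
print(crossed)                    # -0.5
print(noise_ratio([1, 0, 2, 0]))  # 0.5
```

The first call shows that the index ignores how clusters are numbered, and the second shows that a labeling worse than chance yields a negative value, so 1.0 is an upper bound but 0 is not a lower bound.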
4.1. Synthetic Data
Table 1 shows the synthetic data sets used in this paper. These data sets are all composed of classes with different densities, shapes, and orientations. The first 6 data sets were obtained from [21], and the rest were downloaded from [22]. The results of each algorithm on some of these synthetic data sets are displayed in Figure 3, where clusters are plotted with different markers and colors and all noise objects are plotted as black points. The parameter setting (k), number of clusters found (C), noise ratio (NR), and the values of the benchmarks F1, AMI, and ARI are listed in Table 2.
There are 300 objects in the pathbased data set, which are divided into 3 classes. One class forms a 3/4 circular ring, and the other two classes are located at the two ends of the horizontal diameter of the ring. As shown in the first row of Figure 3, RNNLDH gets the best result, RNNDSC also gets the correct number of clusters, and the other three algorithms classify the data set incorrectly.
Compound has six classes with different densities. The two adjacent classes in the upper-left corner follow Gaussian distributions, and on the right of the figure, the class with the irregular shape is surrounded by the class with the lowest density. In the bottom-left corner, the smallest class is encircled by the ring-shaped class. As shown in the second row of Figure 3, our method partitions three classes exactly, which are labeled as yellow hexagrams, blue left-pointing triangles, and fuchsia upward-pointing triangles, and one object in the contiguous zone of the two classes in the upper-left corner is classified incorrectly. Only part of the objects (green diamonds) in the lowest density class are recognized, and the unrecognized objects are labeled as noise (black points). Although RNNDSC assigns every object to one of six classes, many objects are assigned incorrectly. Although ISDSC gets the best benchmarks, it finds only core objects and treats too many other objects as noise. ISBDSC and ADPC cannot even find the correct number of classes.
A particularly challenging feature of Frame, t7.10k, and t8.8k is that the classes have homogeneous distributions and are very close to each other. RNNLDH outperforms the other algorithms on these data sets. On the data set Frame, RNNLDH takes the two outliers in the upper-left corner as noise, while ADPC assigns these two objects to the upper class. RNNDSC misclassifies one object in the adjacent area of the two classes. On t8.8k, the result of RNNDSC is close to that of RNNLDH. Although ISBDSC has the highest benchmarks, Figure 3 shows that it partitions the data set incorrectly.
Spiral has 3 classes that wind around each other, and Dim1024 is a high-dimensional data set with 16 Gaussian classes and 1024 points. Table 2 shows that all the clustering algorithms get good results, but ISDSC has a high noise ratio. Jain has two moon-shaped classes with different densities. ADPC divides the high-density class into two parts and assigns the lower density class to the nearest part. The results for these three data sets are not displayed.
t4.8k has six classes with random noise. A thin sine curve runs across the classes. RNNLDH partitions the data set into 8 clusters because the sine curve is divided into several segments: the upper-left segment and the bottom-right segment are treated as two clusters, the bottom-left segment is regarded as noise, and the other segments are assigned to their nearest clusters. RNNDSC detects only one segment of this curve. The other three algorithms are unable to partition some of the main classes.
t5.8k has six label-like classes and a thick stick running across them. It also contains random noise. All label-like classes are found by the five algorithms. ISDSC again gets the highest benchmarks with the highest noise ratio. Our algorithm treats the stick as noise. RNNDSC finds one segment of the stick. ISBDSC finds 3 segments of the stick as 3 independent clusters and also assigns some noise to 3 independent clusters. ADPC partitions all objects into 6 clusters.
4.2. Real-World Data
Table 3 shows the real-world data sets used to test the algorithms, which were downloaded from the UCI website [23]. It should be noted that we preprocessed some of these data sets or selected subsets of them for the experiments, as listed below:
(i) All samples with null or uncertain values and all duplicates in the data sets were removed. The affected data sets are Breast_C_W, Echocardiogram, and InternetAds.
(ii) Most of the data sets have class attributes or character attributes, so Table 3 shows only the number of attributes used to compute the distances between samples.
(iii) The SPECTHeart data set has two subsets, and we took the SPECT.test subset to test the algorithms.
(iv) All text values in Chess were replaced by numbers; for example, “f” was replaced by 0 and “t” by 1.
(v) Attributes no. 1 and nos. 10–13 were removed from Echocardiogram, and the second attribute (“stillalive”) was selected as the clustering label.
(vi) Lungcancer is a sparse data set. Four values of the fifth attribute and one value of the ninth attribute were “?” (unknown). We replaced them with 0.
(vii) Heartdisease has 10 sub-data sets. We used “reprocessed hungarian data” to test the algorithms. This data set is also unbalanced because its largest cluster has more than 60% of the samples, while the smallest one has less than 6% of the samples.
Table 4 shows the experiment results of the five methods.
The attribute types of InternetAds, Echocardiogram, Heartdisease, and Liverdisorders are categorical, integer, and real. The first three data sets are unbalanced because the vast majority of their samples are in one class. InternetAds is also sparse. The benchmarks show that RNNLDH outperforms the other algorithms on InternetAds. For Echocardiogram, ISDSC gets the best benchmarks, but it classifies nearly half of the samples as noise. Compared with the other 3 algorithms, RNNLDH gets the best results on F1.
The attribute values of Breast_C_W, Lungcancer, and Wholesale are all integers. Our algorithm outperforms the others on all benchmarks for the first two data sets. For Lungcancer, the other four algorithms cannot get the correct number of clusters.
The attribute values of Imageseg, Wine, and Sonar are real. For Imageseg, RNNLDH does the best work among the algorithms. ISDSC gets the highest benchmarks but with the highest noise ratio and the wrong cluster number. For Wine and Sonar, RNNLDH outperforms the other algorithms on one benchmark.
The attribute values of SPECTHeart, Monk3, and Haysroth are categorical. SPECTHeart is also unbalanced. For these data sets, our method outperforms the other methods on F1. The remaining data sets have attributes of multiple types. On them, our method does better than RNNDSC, ISDSC, and ISBDSC.
The experimental results of RNNLDH are combined with the experimental results of RNNDSC, ISBDSC, and ADPC, respectively, into three data groups. Each data group has 2 columns and 135 rows. One column represents the algorithm RNNLDH, and the other column represents one of the other three methods. The 135 rows are divided among 5 labels: F1, AMI, ARI, NR, and CR. Label CR represents the correct ratio of cluster numbers, which is calculated by the following equation: where C represents the number of clusters found by the algorithm and TC represents the true number of clusters in the data set.
Friedman tests are carried out on these 3 data groups, and the value of each test is listed in Table 5. Because the results of ISDSC are not good, especially on the real-world data sets, we do not carry out the Friedman test between ISDSC and RNNLDH.
The values in Table 5 show that the results of our algorithm are significantly different from the results of the other algorithms.
5. Conclusions
In this paper, we proposed an improved density-based clustering algorithm, termed RNNLDH, which combines the k-reverse nearest neighbor model and the density hierarchical relationship. With the k-reverse nearest neighbor model, the proposed method initially partitions all observations of a data set into several unconnected core regions, with outliers around them. Compared with density peak clustering, our method is more robust in finding initial clusters. By using the density hierarchical relationship, each unclustered object is grouped into the cluster that its parent object belongs to. If an object’s parent is the object itself or the parent is not assigned to any cluster, the object is noise. In comparison with the RNN-based methods, our algorithm has a lower noise ratio than ISDSC and higher accuracy than ISDSC and RNNDSC.
Data Availability
The data sets used in this paper are standard test data sets, which are all available online and can be freely accessed. The synthetic data sets were downloaded from https://cs.joensuu.fi/sipu/datasets/ and http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download, and the real-world data sets were downloaded from http://archive.ics.uci.edu/ml.
Conflicts of Interest
There are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was supported in part by NSFC under Grant 61773022, Hunan Provincial Education Department (nos. 16B244, 17A200, and 18B504), and Natural Science Foundation of Hunan Province (nos. 2017JJ3287 and 2018JJ3479).