Abstract

Clustering is widely used in data analysis, and density-based methods have developed rapidly over the past decade. Although the state-of-the-art density peak clustering algorithms are efficient and can detect clusters of arbitrary shape, they are essentially centroid-based methods of a non-spherical type. In this paper, a novel local density hierarchical clustering algorithm based on reverse nearest neighbors, RNN-LDH, is proposed. By constructing and using a reverse nearest neighbor graph, extended core regions are identified as initial clusters. Then, a new local density metric is defined to calculate the density of each object; meanwhile, the density hierarchical relationships among the objects are built according to their densities and neighbor relations. Finally, each unclustered object is assigned to one of the initial clusters or labeled as noise. Results of experiments on synthetic and real data sets show that RNN-LDH outperforms current clustering methods based on density peaks or reverse nearest neighbors.

1. Introduction

Clustering is the task of finding a set of groups such that similar objects are placed in the same group and dissimilar objects are separated into different groups. Since clustering can uncover inherent, potential, and unknown knowledge, principles, or rules in real-world data, it has been widely used in many fields, including data mining, pattern recognition, machine learning, information retrieval, image analysis, and computer graphics [13]. According to the strategies used, clustering algorithms are traditionally classified into connectivity-based approaches, centroid-based approaches, distribution-based approaches, and density-based approaches [1, 2]. Among these, density-based approaches can discover clusters of arbitrary shapes and different sizes without specifying the number of clusters.

In density-based clustering, clusters are considered to be dense regions of objects separated by low-density regions representing noise. The clustering procedure can be broken into two steps: estimating the density of each object and grouping density-connected objects.

The first approach adopting the density-based strategy was proposed by Ester et al. [4] in the paper “Density-Based Spatial Clustering of Applications with Noise” and is dubbed DBSCAN. In this approach, the density of each object is defined as the number of objects contained in its eps neighborhood. If this number is greater than minpts, the object is regarded as a core object; otherwise, it is regarded as noise. Then, all objects that are reachable from one of the unclustered core objects are grouped into a cluster. However, it is difficult to select proper values for the two parameters eps and minpts. Another drawback of DBSCAN is that adjacent clusters of different densities cannot be properly identified [5, 6].
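
To make the eps/minpts mechanism concrete, the following minimal NumPy sketch marks the core objects of a small data set; the function name, the brute-force distance computation, and the >= convention for the neighborhood test are our own illustrative choices, not part of DBSCAN’s original presentation.

import numpy as np

def dbscan_core_mask(X, eps, minpts):
    """Mark DBSCAN-style core objects: an object is a core object when its
    eps-neighborhood contains at least minpts objects (whether the object
    itself is counted, and whether the test is > or >=, varies by convention)."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances; O(n^2) memory, acceptable for a sketch.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return (d <= eps).sum(axis=1) >= minpts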

Density peak clustering (DPC) [7] is another well-known strategy for density-based clustering, based on the idea that cluster centres have higher densities than their neighbors and are relatively far away from each other. This method identifies the cluster centres from the decision graph, which is constructed from the density and distance attributes of each object. Moreover, it needs only one parameter. Although it seems more convenient than DBSCAN, it has some inherent defects. First, by DPC’s definition of density peaks and its grouping strategy, it is essentially a centroid-based method of a non-spherical type, so in some cases complex shapes still cannot be recognized. Second, the cluster centres are picked out from the decision graph manually, which limits the application of DPC; moreover, it is very difficult to select the true centres on some data sets. Third, errors are propagated in the subsequent assignment process.
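
For reference, the two decision-graph attributes of DPC can be sketched as follows (the cutoff-kernel variant; the function name and the brute-force distance computation are illustrative choices of ours):

import numpy as np

def dpc_decision_graph(X, dc):
    """Compute the two decision-graph attributes of DPC (cutoff kernel):
    rho   - number of objects within the cutoff distance dc
    delta - distance to the nearest object of higher density (for the
            global density maximum, the largest distance to any object).
    Cluster centres are then picked as objects with both large rho and large delta."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    rho = (d < dc).sum(axis=1) - 1            # exclude the object itself
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = np.flatnonzero(rho > rho[i])
        delta[i] = d[i, higher].min() if higher.size else d[i].max()
    return rho, delta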

To remedy these limitations of DPC, many improved methods have been proposed [5, 8–14]. FKNN-DPC [8] defines a uniform local density metric based on the k-nearest neighbors and uses a fuzzy technique to complete the assignment procedure after the cluster centres have been identified manually. ADPC [5] calculates the local density of each object from its k-nearest neighbors using a Gaussian kernel function and applies a divide-and-conquer strategy to find cluster centres and group the remaining objects automatically. RECOME [10] defines a new density measure as the ratio of each object’s density to the maximum density of its k-nearest neighbors and also uses a divide-and-conquer strategy to partition a data set. Although these algorithms improve DPC in some respects, they still suffer from the drawbacks of centroid-based methods.

In contrast to the algorithms listed above, RECORD [15], RNN-DSC [16], IS-DSC [17], and ISB-DSC [6] use reverse nearest neighbors to define object density. From a graph-theoretic viewpoint, these algorithms perform clustering on a directed graph: each vertex is an object of the data set, and for any two vertexes a and b, there is a directed edge from a to b if b is one of the k-nearest neighbors of a. RECORD defines as core objects those vertexes whose in-degrees (i.e., reverse nearest neighbor counts) are not lower than the input parameter k, and the remaining vertexes as outliers. Outliers are regarded as noise and eliminated from the graph, while core vertexes and their edges form a subgraph, from which all strongly connected components are extracted as the clustering result. The main distinction between RNN-DSC and RECORD lies in the assignment of outliers: in the former, an outlier is grouped into the cluster of its nearest neighbor if that neighbor is a core object. IS-DSC defines the k-influence space of each object as the intersection of its k-nearest neighbor set and its reverse k-nearest neighbor set. It first applies the STRATIFY algorithm to remove outliers and then performs a clustering procedure on the remaining objects similar to RECORD, but a vertex is defined as a core object if the size of its k-influence space is greater than 2k. ISB-DSC also uses the k-influence space to create the subgraph like IS-DSC, but the clustering procedure is applied to the whole data set. Compared with DPC, the advantage of these approaches is that they no longer need to find cluster centres. However, since RECORD, IS-DSC, and ISB-DSC employ a global threshold to predetermine outliers, they assign too many objects to noise. PIDC [18] uses the size of the unique closest neighbor set as an estimate of object density and growing strategies to complete clustering. Although this method is parameter independent, it is sensitive to noise and has high computational complexity.
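
The shared graph construction behind these methods can be sketched as follows: build the directed k-nearest-neighbor graph and extract its strongly connected components. The outlier-removal step that RECORD-style methods apply beforehand is omitted here, and the function name is ours.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def knn_digraph_scc(X, k):
    """Build the directed k-NN graph (edge a -> b iff b is one of the k
    nearest neighbors of a) and return its strongly connected components."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # exclude self-neighbors
    knn = np.argsort(d, axis=1)[:, :k]           # k nearest neighbors of each object
    rows = np.repeat(np.arange(n), k)
    adj = csr_matrix((np.ones(n * k), (rows, knn.ravel())), shape=(n, n))
    n_comp, labels = connected_components(adj, directed=True, connection='strong')
    return n_comp, labels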

In this paper, we propose an improved clustering approach that combines the k-reverse nearest neighbor graph model with a density hierarchical relationship model. Based on the reverse nearest neighbor model and the parameter k, a directed graph is constructed from the objects of a data set. By searching strongly connected components in this graph, the data set is partitioned into several initial clusters. Then, we use density dependence and the k-nearest neighborhood to build the density hierarchical relationships of all objects, and each unclassified object is grouped into the cluster that its parent belongs to. The algorithm has the following advantages: (1) The method searches for core regions instead of density peaks, so it finds the true clusters automatically rather than producing false cluster centres. (2) A novel density measure is proposed based on the k-reverse nearest neighbors, which reflects the aggregation relations of objects. (3) Classifying the unclustered objects through the local density hierarchical relationships is more efficient and reduces the risk of misclassification caused by the order in which objects are processed.

The proposed algorithm is evaluated on synthetic and real-world data sets that are widely used for performance tests of clustering algorithms. The results of RNN-LDH are compared with those of IS-DSC, ISB-DSC, RNN-DSC, and ADPC in terms of three popular benchmarks: F-measure (F1) [19], adjusted mutual information (AMI), and Adjusted Rand Index (ARI) [20]. The ratio of the number of noise objects to the total number of objects is also reported as a benchmark.

The rest of the paper is organized as follows: Section 2 gives a detailed description of the notations and definitions used in our algorithm. Section 3 describes the procedures of RNN-LDH in detail. Section 4 presents our experimental results and briefly discusses the choice of the parameter k. Section 5 draws some conclusions.

2. RNN-LDH Algorithm

In this section, we give a detailed theoretical description of RNN-LDH. Some definitions in this section were introduced in other papers but are modified for our method.

2.1. Notations and Definitions

The notations used in this paper are listed below:
(1) |·|: the cardinality of a set
(2) X: the data set with d-dimensional objects; d: the dimension number of the data
(3) x, y, z: any three objects in X
(4) dist(x, y): the distance between two objects x and y
(5) NN_k(x): the set of the k-nearest neighbors of object x; k: the input parameter with an integer value
(6) In particular, NN_1(x) is the set containing only the one object that is nearest to object x
(7) RNN_k(x): the set of the k-reverse nearest neighbors of object x, which is defined as RNN_k(x) = {y ∈ X : x ∈ NN_k(y)}
(8) std: the standard deviation function.
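
For concreteness, a brute-force sketch of the NN_k and RNN_k sets follows (the function name is ours; an object is not counted as its own neighbor here):

import numpy as np

def knn_and_rnn(X, k):
    """Compute NN_k(x) and RNN_k(x) for every object x, where
    RNN_k(x) = {y in X : x in NN_k(y)}, i.e., the objects that count x
    among their own k nearest neighbors."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # an object is not its own neighbor
    nn = [list(np.argsort(d[i])[:k]) for i in range(n)]
    rnn = [[] for _ in range(n)]
    for y in range(n):
        for x in nn[y]:
            rnn[x].append(y)
    return nn, rnn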

Definition 1 (directly density reachable). Object x is directly density reachable from an object if (1) and (2) and

Definition 2 (density reachable). Object x is density reachable from object if there is a chain of objects , , and , which satisfies the conditions listed below: (1) is a core object (2) is directly density reachable from

Definition 3 (core region). A core region (Rc) is a nonempty subset of X such that (1) (2) , is density reachable from

Definition 4 (extended core region). Given a core region (Rc), its extended core region (Re) is composed of all elements in Rc and any object x satisfying the following conditions: (1) x does not belong to any core region (2) , and

Definition 5 (local density). The local density of an object x is defined as , where .

Definition 6 (parent). The parent of an object x is defined as , where .
The parent represents the local density hierarchical relationship of an object to its k-nearest neighbors.
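
The exact formulas of Definitions 5 and 6 are not reproduced in this text, so the sketch below is only an illustrative construction consistent with the surrounding description, not the authors’ exact metric: here the density of x grows with the size of RNN_k(x) and shrinks with the distances to its k nearest neighbors, and the parent of x is its closest k-neighbor of strictly higher density (or x itself when no such neighbor exists).

import numpy as np

def local_density_and_parent(X, k):
    """Illustrative (assumed) construction of a local density and the
    parent relation of Definition 6; the true formulas in the paper may differ."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    order = np.argsort(d, axis=1)[:, :k]                # k nearest neighbors
    rnn_size = np.zeros(n)
    for i in range(n):
        for j in order[i]:
            rnn_size[j] += 1                            # |RNN_k(j)|
    knn_dist = np.take_along_axis(d, order, axis=1)     # distances to the k nearest neighbors
    rho = rnn_size / (knn_dist.mean(axis=1) + 1e-12)    # assumed density form
    parent = np.arange(n)                               # default: an object is its own parent
    for i in range(n):
        cand = [j for j in order[i] if rho[j] > rho[i]]
        if cand:
            parent[i] = min(cand, key=lambda j: d[i, j])
    return rho, parent

Under this construction, an object whose parent is itself is a local density maximum within its neighborhood, which matches the role parents play in the assignment and noise rules described later in the paper.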

Definition 7 (hierarchical distance). The hierarchical distance of an object x is defined as

Definition 8 (inner distance). The inner distance of an extended core region is defined as , where is the standard deviation of the distances of all objects to their closest neighbors. Usually, the distance from a core object to its closest core object is less than its distance to its closest boundary object. The offset b helps to capture the global distribution of the objects.

Definition 9 (density connected). An object x is density connected to an extended core region if there exists an object u and a chain , , and such that (1) (2)

Definition 10 (cluster). Given an extended core region Re, a cluster C is the union of Re and all objects in X that are density connected to Re.

Definition 11 (noise). An object x is noise if it does not belong to any cluster of X.

NR denotes the noise ratio, which is defined as the number of noise objects divided by the total number of objects in X.

3. Procedures of the RNN-LDH Algorithm

In this section, we discuss our algorithm in detail.

Algorithm 1 lists the procedure for performing RNN-LDH, which accepts two inputs, the data set X and the nearest neighbor parameter k, and outputs a label vector. The value of each element in the label vector indicates which cluster the corresponding object belongs to; the object is noise if its label value is zero.

(1);
(2)GetLDH(X, k);
(3);
(4);
(5)for all
(6) if =UNLABELED
(7)  if
(8)   ;
(9)   if
(10)    ;
(11)    ;
(12)    ;
(13)   end if
(14)  else
(15)   ;
(16)  end if
(17) end if
(18)end for
(19);
(20)while (bChanged)
(21);
(22) for all
(23)  ;
(24)  if cid ≠ Noise &&
(25)   ;
(26)   bChanged ← TRUE;
(27)  end if
(28) end for
(29)end while
(30)return label;

In Algorithm 1, the function GetLDH is first called to obtain the local density hierarchical relationship (parent) of each object, and the result is saved in the array variable parent. Then, from steps 5 to 18, all extended core regions in data set X are found by calling the FindECR procedure and saved to the set variable ECRs; each extended core region is an initial cluster. Finally, from steps 20 to 29, each noise object connected to an initial cluster is iteratively assigned to that cluster if it satisfies the distance condition.
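
The iterative assignment phase (steps 20 to 29) can be sketched as follows; the exact distance condition used in the paper is not reproduced in this text, so the threshold test below is an assumption, and the function name and the 0-means-unlabelled convention are ours.

import numpy as np

def attach_by_parent(label, parent, dist_to_parent, max_dist):
    """Sketch of steps 20-29 of Algorithm 1: an unlabelled object inherits
    the cluster id of its parent once the parent is labelled, provided its
    hierarchical distance does not exceed a threshold (assumed condition).
    Iterate until no label changes."""
    label = np.asarray(label).copy()
    changed = True
    while changed:
        changed = False
        for x in range(len(label)):
            p = parent[x]
            if label[x] == 0 and label[p] != 0 and dist_to_parent[x] <= max_dist:
                label[x] = label[p]
                changed = True
    return label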

(1)for all
(2);
(3);
(4)for each
(5)  if
(6)   ;
(7)   ;
(8)  end if
(9)end for
(10)if
(11)  ;
(12)  ;
(13)else
(14)  ;
(15)  ;
(16)end if
(17)end for
(18)return parent;

Algorithm 2, GetLDH, is implemented according to Definition 6. Its main purpose is to find each object’s density-dependent object, dubbed the parent, among its k-nearest neighbors. Meanwhile, the hierarchical distance of each object is calculated by formula (5) and saved into an array.

Figure 1 shows the result of the GetLDH algorithm on the Compound [21] data set. In this figure, red circles represent core objects and black circles represent boundary or noise objects. The larger the density of an object, the bigger its circle. A directed arrow represents the local density hierarchical relationship between two connected objects.

Algorithm 3 gives the FindECR procedure for finding a new cluster. An unlabelled core object x is input as a seed and appended to a queue. The algorithm pops the first seed of the queue and iteratively searches the k-nearest neighbors of the seed. Among all unlabelled objects visited in this procedure, the core objects are set to the same cluster number cid, while the others are labeled as noise. Steps 3 to 17 find a core region starting from core object x. Steps 18–21 discard this core region if its size is not large enough. Steps 22–29 extend the core region using Definition 4. A brief sketch of this procedure is given after the listing.

(1);
(2) initialize an empty queue Q;
(3)Q.enqueue(x);
(4) while not empty Q
(5);
(6);
(7);
(8) for each
(9)  if  = UNLABELED &
(10)   if
(11)    Q.enqueue(o);
(12)   else
(13)    ;
(14)   end if
(15)  end if
(16) end for
(17)end while
(18)if
(19);
(20) return {};
(21)end if
(22)for each
(23) for each
(24)  if
(25)   ;
(26)   ;
(27)  end if
(28) end for
(29)end for
(30)return ;
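
The following sketch mirrors the FindECR flow just described; the minimum-size threshold, the extension test of Definition 4, and the helper data structures (knn lists, an is_core mask, label values 0 for unlabelled and -1 for provisional noise) are assumptions rather than the paper’s exact choices.

from collections import deque

def find_ecr(seed, knn, is_core, label, cid, min_size):
    """Sketch of FindECR (Algorithm 3): BFS from an unlabelled core seed
    through k-nearest-neighbor links to collect a core region, discard it
    if it is too small, and otherwise extend it with adjacent boundary
    objects (an assumed version of Definition 4)."""
    region, queue = set(), deque([seed])
    label[seed] = cid
    while queue:
        u = queue.popleft()
        region.add(u)
        for o in knn[u]:                       # k nearest neighbors of u
            if label[o] == 0:                  # unlabelled
                if is_core[o]:
                    label[o] = cid
                    queue.append(o)
                else:
                    label[o] = -1              # provisional boundary / noise
    if len(region) <= min_size:                # assumed size threshold
        for u in region:
            label[u] = -1
        return set()
    extended = set(region)
    for u in region:                           # extension step (assumed test)
        for o in knn[u]:
            if label[o] == -1:
                label[o] = cid
                extended.add(o)
    return extended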

Figure 2 shows the result of the FindECR algorithm on the Compound [21] data set. In this figure, six extended core regions are found, and black dots represent boundary and noise objects. The final results of our algorithm and the comparison with state-of-the-art methods on Compound are shown in the next section.

3.1. Choice of k

The five algorithms discussed in this paper all need one parameter k. IS-DSC and ADPC do not give a way to set the value of k. ISB-DSC compared its parameter setting with DBSCAN and concluded that it is more robust than DBSCAN to different settings of k, but it did not address the choice of k either. RNN-DSC discussed two approaches to determine an appropriate value of k, each choosing the best k from 1 to 100 according to a criterion; RNN-LDH can also use these two approaches. After analysing the results of a large number of experiments, we have not yet found a theoretical basis for choosing k. To achieve the best performance of each tested algorithm, we choose k independently within this range, so that the number of clusters found by the algorithm is as close as possible to the true class number and the F1 measure is the largest.

3.2. Complexity of the Algorithm

The time complexity of RNN-LDH depends on the following aspects: (1) computing the pairwise distances between points, O(n²); (2) sorting the distance vector of each object, which can be reduced from O(n²) to O(n log n) per object; (3) computing the local density with the k-reverse nearest neighbors, O(kn), where k is no greater than n; (4) calculating the hierarchical distance of each object, O(kn); (5) finding extended core regions, O(n²); and (6) classifying noise, O(n²). So the overall time complexity of RNN-LDH is O(n²).

The above analysis shows that RNN-LDH has the same complexity as RNN-DSC and ADPC.

4. Results and Discussion

To evaluate the performance of RNN-LDH, we perform a set of experiments on synthetic and real-world data sets that are commonly used to test the performance of clustering algorithms. Specifically, we compare the performance of RNN-LDH with well-known clustering algorithms, including RNN-DSC in [15], IS-DSC in [16], ISB-DSC in [6], and ADPC in [5]. Three popular criteria, the F1 measure (F1) [19], adjusted mutual information (AMI), and Adjusted Rand Index (ARI) [20], are used to evaluate the performance of the above clustering algorithms. The upper bounds of these criteria are all 1.0: the better the clustering is, the larger the benchmark values are.

4.1. Synthetic Data

Table 1 shows the synthetic data sets used in this paper. These data sets are all composed of classes with different densities, shapes, and orientations. The first 6 data sets were obtained from [21], and the remainder were downloaded from [22]. The results of each algorithm on some of these synthetic data sets are displayed in Figure 3, plotted with different marker shapes and colors; all noise is plotted as black points. The parameter setting (k), the number of clusters found (C), the noise ratio (NR), and the values of the F1, AMI, and ARI benchmarks are listed in Table 2.

There are 300 objects in the Path-based data set, classified into 3 classes. One class forms a 3/4 circular ring, and the other two classes are distributed at the two ends of the horizontal diameter of the ring. As shown in the first row of Figure 3, RNN-LDH gets the best result, RNN-DSC also gets the correct number of clusters, and the other three algorithms classify the data set incorrectly.

Compound has six classes with different densities. The two adjacent classes in the upper-left corner follow Gaussian distributions; on the right of the figure, the irregularly shaped class is surrounded by the class with the lowest density; and in the bottom-left corner, the smallest class is encircled by a ring-shaped class. As shown in the second row of Figure 3, our method partitions three classes exactly, which are labeled as yellow hexagrams, blue left-pointing triangles, and fuchsia upward-pointing triangles, while one object in the contiguous zone of the two classes in the upper-left corner is classified incorrectly. Only part of the objects (green diamonds) in the lowest-density class are recognized, and the unrecognized objects are labeled as noise (black points). Although RNN-DSC classifies all objects into one of six classes, many objects are partitioned incorrectly. Although IS-DSC gets the best benchmarks, it only finds core objects, and too many other objects are treated as noise. ISB-DSC and ADPC cannot even find the correct number of classes.

A particularly challenging feature of Frame, t7.10k, and t8.8k is that their classes have homogeneous distributions and are very close to each other. RNN-LDH outperforms the other algorithms on these data sets. On Frame, RNN-LDH treats two outliers in the upper-left corner as noise, while ADPC classifies these two objects into the upper class; RNN-DSC misclassifies one object in the adjacent area of the two classes. On t8.8k, the result of RNN-DSC is close to that of RNN-LDH. Although ISB-DSC has the highest benchmarks, we can see from Figure 3 that it partitions the data set incorrectly.

Spiral has 3 classes that embrace each other, and Dim1024 is a high-dimensional data set with 16 Gaussian classes and 1024 points. From Table 2, we can see that all the clustering algorithms get good results on them, but IS-DSC has a high noise ratio. Jain has two moon-shaped classes with different densities; ADPC divides the high-density class into two parts and assigns the lower-density class to the nearest part. The results for these three data sets are not displayed.

t4.8k has six classes with random noise, and a thin sine curve runs across the classes. RNN-LDH partitions the data set into 8 clusters because the sine curve is divided into several segments: the upper-left and bottom-right segments are treated as two clusters, the bottom-left segment is regarded as noise, and the other segments are classified into their nearest clusters. RNN-DSC detects only one segment of this curve. The other three algorithms are unable to partition some of the main classes.

t5.8k has six label-like classes and a thick stick running across them; it also contains random noise. All label-like classes are found by the five algorithms. IS-DSC gets the highest benchmarks with the highest noise ratio again. Our algorithm treats the stick as noise, RNN-DSC finds one segment of the stick, and ISB-DSC finds 3 segments of the stick as 3 independent clusters and also classifies some noise into 3 independent clusters. ADPC partitions all objects into 6 clusters.

4.2. Real-World Data

Table 3 shows the real-world data sets used to test the algorithms, which were downloaded from the UCI website [23]. It should be noted that we performed some preprocessing on a few of these data sets or selected subsets from them for the experiments, as listed below:
(i) All samples with null or uncertain values or duplicates were removed. Such data sets are Breast_C_W, Echocardiogram, and Internet-Ads.
(ii) Most of the data sets have class attributes or character attributes, so Table 3 only shows the number of attributes used to compute the distances between samples.
(iii) The SPECT-Heart data set has two subsets, and we took the SPECT.test subset to test the algorithms.
(iv) All text values in Chess were replaced by numbers; for example, “f” was replaced by 0 and “t” by 1.
(v) Attribute nos. 1 and 10–13 were removed from Echocardiogram, and the second attribute (“still-alive”) was selected as the clustering label.
(vi) Lung-cancer is a sparse data set; four values of the fifth attribute and one value of the ninth attribute were “?” (unknown), and we replaced them with 0.
(vii) Heart-disease has 10 sub-data sets, and we used the “reprocessed hungarian data” to test the algorithms. This data set is also unbalanced, because its largest cluster has more than 60% of the samples while the smallest one has less than 6%.

Table 4 shows the experiment results of the five methods.

The attribute types of Internet-Ads, Echocardiogram, Heart-disease, and Liver-disorders are categorical, integer, and real. The first three data sets are unbalanced because the vast majority of their samples are in one class, and Internet-Ads is also sparse. The benchmarks show that RNN-LDH outperforms the other algorithms on Internet-Ads. For Echocardiogram, IS-DSC gets the best benchmark, but it classifies nearly half of the samples as noise; compared with the other three algorithms, RNN-LDH gets the best result on F1.

The attribute values of Breast_C_W, Lung-cancer, and Wholesale are all integers. Our algorithm outperforms the others on all benchmarks for the first two data sets. For Lung-cancer, the other four algorithms cannot obtain the correct cluster number.

The attribute types of Image-seg, Wine, and Sonar are real. For Image-seg, RNN-LDH performs better than the others; IS-DSC gets the highest benchmarks but with the highest noise ratio and a wrong cluster number. For Wine and Sonar, RNN-LDH outperforms the other algorithms on one benchmark.

The attribute types of SPECT-Heart, Monk-3, and Hayes-roth are categorical, and SPECT-Heart is also unbalanced. For these data sets, our method outperforms the other methods on F1. The attribute types of the remaining data sets are mixed, and our method does better than RNN-DSC, IS-DSC, and ISB-DSC on them.

The experimental results of RNN-LDH are combined with the experimental results of RNN-DSC, ISB-DSC, and ADPC, respectively, into three data groups. Each data group has 2 columns and 135 rows: one column represents the algorithm RNN-LDH, and the other column represents one of the other three methods. The 135 rows are divided into 5 labels: F1, AMI, ARI, NR, and CR. The label CR represents the correct ratio of cluster numbers, which is calculated by the following equation, where C represents the cluster number found by the algorithm and TC represents the true cluster number of the data set.

The Friedman tests are carried out on these 3 data groups, and the value of each test is listed in Table 5. Because the results of IS-DSC are not good, especially on the real-world data sets, we do not run the Friedman test between IS-DSC and RNN-LDH.

The values in Table 5 show that the results of our algorithm are significantly different from the results of the other algorithms.

5. Conclusions

In this paper, we proposed an improved density-based clustering algorithm, termed RNN-LDH, that combines the k-reverse nearest neighbor model with the density hierarchical relationship. With the k-reverse nearest neighbor model, the proposed method initially partitions the observations of a data set into several unconnected core regions surrounded by outliers. Compared with density peak clustering, our method is more robust in finding initial clusters. Using the density hierarchical relationship, each unclustered object is grouped into the cluster that its parent object belongs to; if an object’s parent is itself, or if it cannot be assigned to any cluster, it is labeled as noise. In comparison with the RNN-based methods, our algorithm has a lower noise ratio than IS-DSC and higher accuracy than both IS-DSC and RNN-DSC.

Data Availability

The data sets used in this paper are standard test data sets that are all available online and can be freely accessed. The synthetic data sets were downloaded from https://cs.joensuu.fi/sipu/datasets/ and http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download, and the real-world data sets were downloaded from http://archive.ics.uci.edu/ml.

Conflicts of Interest

There are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by NSFC under Grant 61773022, Hunan Provincial Education Department (nos. 16B244, 17A200, and 18B504), and Natural Science Foundation of Hunan Province (nos. 2017JJ3287 and 2018JJ3479).