Advances in Multimedia

Volume 2017, Article ID 3695323, 9 pages

https://doi.org/10.1155/2017/3695323

## A Novel DBSCAN Based on Binary Local Sensitive Hashing and Binary-KNN Representation

^{1}College of Big Data and Information Engineering, Guizhou University, Guiyang 550025, China

^{2}Guizhou University, Guizhou Provincial Key Laboratory of Public Big Data, Guiyang, Guizhou 550025, China

Correspondence should be addressed to Qin Wei; weiq@gzu.edu.cn

Received 17 August 2017; Accepted 12 November 2017; Published 7 December 2017

Academic Editor: Fumin Shen

Copyright © 2017 Qing He et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

We revisit the classic DBSCAN algorithm by proposing a series of strategies to improve its robustness to various densities and its efficiency. Unlike the original DBSCAN, we first use binary locality sensitive hashing (LSH), which enables faster region queries for the neighbors of a data point. A binary data representation method based on the neighborhood is then proposed to map the dataset into the Hamming space for faster cluster expansion. We define a core point based on the binary influence space to enhance the robustness to various densities. Also, we propose a seed point selection method, based on the influence space and neighborhood similarity, to select some seed points instead of the whole neighborhood during cluster expansion. Consequently, the number of region queries can be decreased. The experimental results show that the improved algorithm can greatly improve the clustering speed while preserving clustering accuracy, especially for large-scale datasets.

#### 1. Introduction

Clustering studies [1, 2] play important roles in many fields including data mining and machine learning. The purpose of clustering is to partition a dataset into subclasses in such a way that the similarity of objects in the same subclass is maximized and the similarity of objects in different subclasses is minimized. The basic idea of DBSCAN, a density-based clustering algorithm [3, 4], is to connect adjacent regions whose density exceeds a given threshold. Different from k-means [5, 6], which is based on a partitioning strategy, DBSCAN does not need the number of clusters to be set in advance; it is insensitive to noise and can identify clusters of arbitrary shape. However, it uses the global parameters Eps and MinPts to measure density, so it does not perform well when densities and interclass distances are distributed inconsistently. If the density threshold implied by Eps and MinPts is set too high, the data points in clusters with relatively low density will be treated as boundary points, and a sparse cluster will be divided into several similar classes. On the contrary, if the threshold is set too low, denser clusters that lie close together will be merged. Consequently, DBSCAN performs well only on datasets with well-proportioned density. Also, to find the core points for cluster expansion, DBSCAN continuously queries neighborhoods and therefore incurs a considerable I/O cost, especially on large-scale datasets. To overcome these flaws of DBSCAN, scholars have done a lot of research and put forward many improvement methods. Zhu et al. [7] proposed the ReCon and ReScale approaches based on density-ratio to overcome the limitation of DBSCAN in finding clusters of varying densities. However, ReCon-DBSCAN uses a density estimator and ReScale-DBSCAN uses two additional parameters to compute the density-ratio, which increases the algorithms' reliance on parameters. Zhou et al. proposed the PDBSCAN and FDBSCAN algorithms to solve the defects of DBSCAN, respectively.
PDBSCAN [8] divides the data space into several uniform regions according to histogram-based statistical analysis of the data in one or more dimensions; it then uses a different Eps for each region to address the weakness on various densities. However, PDBSCAN relies on human-computer interaction to achieve the data partition, which limits its usefulness in practical applications even though the clustering quality is better. The FDBSCAN [9] algorithm uses only a small number of representative points, instead of all points in the neighborhood of a core point, as seed points to expand a cluster. Unlike DBSCAN, FDBSCAN reduces the execution frequency of region queries and decreases the I/O cost sharply. But this comes at the expense of clustering accuracy, and the efficiency of the region query itself is not improved, which means there is still room for improvement.

To address these issues of DBSCAN, this paper proposes a fast clustering algorithm, BLSH-DBSCAN, based on collision-ordered LSH and binary nearest neighbors. This algorithm makes the following contributions:

(i) It uses binary LSH to query the nearest neighbors, which greatly improves the speed of the region query compared with traditional linear search.

(ii) It constructs a binary-KNN representation method which can map the data into the Hamming space for the next clustering operation and greatly improve the speed of clustering.

(iii) It introduces a core point distinguishing method based on the influence space and designs the solution of influence space in the binary dataset to boost the clustering speed. At the same time, due to the density sensitivity of influence space, this improved method has much better clustering quality and efficiency compared with the original DBSCAN.

(iv) It introduces a seed point selection method, based on the influence space and neighborhood similarity, to select some seed points instead of the whole neighborhood during cluster expansion. This decreases the execution frequency of region queries and so realizes a faster clustering operation.

The rest of the paper is organized as follows.

In Section 2, we provide an explanation of locality sensitive hashing for region query and how to make the binary-KNN representation of a point. An improved density-based clustering algorithm is developed in Section 3. We introduce the influence space and its solving method in binary dataset; also, a seed point selection method is proposed. Section 4 reports the experimental results. Discussion and conclusions are provided in Section 5.

#### 2. Binary Locality Sensitive Hashing and Binary-KNN Representation

##### 2.1. About DBSCAN Algorithm

DBSCAN is a typical density-based spatial clustering algorithm. It has two important parameters, Eps and MinPts: Eps defines the radius of the neighborhood of a data object, and MinPts defines the minimum number of data points contained in that neighborhood. DBSCAN gives the following definitions.

Suppose that we are given a dataset D = {x1, x2, ..., xn}.

*Definition 1 (directly density reachable).* If a point p is in the Eps-neighborhood of q and q is a core object, then p is directly density reachable from q.

*Definition 2 (density reachable).* A point p is density reachable from a point q if there is a chain of points p1, p2, ..., pn, with p1 = q and pn = p, such that p(i+1) is directly density reachable from pi (1 ≤ i ≤ n − 1).

*Definition 3 (density connected).* A point p is density connected to a point q if there is a point o such that both p and q are density reachable from o.

*Definition 4 (core point).* In the Eps-neighborhood of a point p, if the number of points which are directly density reachable from p is greater than MinPts, then p is a core point.

*Definition 5 (border point).* If p is not a core point but is directly density reachable from a core point, then p is a border point.

*Definition 6 (noise point).* If a point p is neither a core point nor a border point, then p is a noise point.

To find the clusters, DBSCAN starts with an arbitrary object p in D and retrieves all points which are density reachable from p with respect to Eps and MinPts. If p is a core point, p and its Eps-neighborhood are marked as a new cluster. DBSCAN then continues to retrieve the Eps-neighborhoods of other points in the cluster and adds the Eps-neighborhoods of the core points to the current cluster until no new object can be added. When all points have been assigned to a cluster or labeled as noise, clustering ends.
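To make the expansion procedure above concrete, the following is a minimal Python sketch (ours, not the authors' implementation); `region_query` here is a placeholder linear scan, which is exactly the step this paper later replaces with binary LSH:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch: labels[i] = cluster id, -1 = noise."""
    n = len(X)
    labels = np.full(n, -1)            # -1 marks noise / unassigned
    visited = np.zeros(n, dtype=bool)
    cluster = 0

    def region_query(i):               # linear-scan Eps-neighborhood
        d = np.linalg.norm(X - X[i], axis=1)
        return list(np.where(d <= eps)[0])

    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        seeds = region_query(i)
        if len(seeds) < min_pts:       # not a core point (may join a cluster later)
            continue
        labels[i] = cluster
        while seeds:                   # expand the cluster from all seeds
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster    # border or core point joins the cluster
            if not visited[j]:
                visited[j] = True
                nbrs = region_query(j)
                if len(nbrs) >= min_pts:   # j is core: its neighbors become seeds
                    seeds.extend(nbrs)
        cluster += 1
    return labels
```

Because every core point triggers another `region_query`, the linear scan dominates the run time; this is the I/O cost the remainder of the paper attacks.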

However, the neighborhood query needs to calculate the distance between the query object and all other objects by linear search, which has a huge I/O cost. To solve this problem, we propose the following improvements: to accelerate the region query, binary LSH is used instead of linear search to find the nearest points; and the neighbor structure is used to represent data points, which maps the high-dimensional dataset into the Hamming space to expedite cluster expansion.

##### 2.2. Locality Sensitive Hashing

The LSH algorithm is usually used for fast neighbor queries. It involves two steps: index construction and object query. In index construction, a set of hash functions projects similar data points into the same hash bucket with high probability. In object query, a filter-and-refine framework hashes the query object into hash buckets through the same hash functions. All the data points in those buckets are adopted as candidates, whose similarity with the query object is then computed to find the nearest neighbors.

*Definition of LSH* (see [10]). Let r1 < r2 be two distances under a distance function d(·,·). A hash function family H = {h : S → U} is called (r1, r2, p1, p2)-sensitive when each function h in H satisfies the following two conditions: (1) if d(q, v) ≤ r1, then Pr[h(q) = h(v)] ≥ p1; (2) if d(q, v) ≥ r2, then Pr[h(q) = h(v)] ≤ p2,

where p1 > p2.

LSH uses different hash function families for different distance functions. In this paper, a binary hash function family based on the p-stable distribution, which applies to the Euclidean space under the l2 norm, is used. For each high-dimensional data point v, the hash function family is [11]

h(a,b)(v) = ⌊(a · v + b) / w⌋,

where a is a random vector that follows the p-stable distribution and has the same dimension as v, b is a real number drawn uniformly from [0, w), and w is the width of the quantization interval.
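For illustration, this hash can be written as a one-line function. The bucket width `w` and offset `b` follow the standard E2LSH-style formulation of [11] and are assumptions here, since the text above only names the random vector a:

```python
import numpy as np

def pstable_hash(v, a, b, w):
    """h_{a,b}(v) = floor((a . v + b) / w); a is drawn from a 2-stable
    (Gaussian) distribution, b uniformly from [0, w)."""
    return int(np.floor((np.dot(a, v) + b) / w))
```

Nearby points produce the same integer with high probability, so the returned integer serves directly as a bucket number.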

The index structure of LSH can be summarized as the following two steps [12, 13]:

(i) Given the hash function family H, define a new hash function family G = {g : S → U^k} by the AND construction: each function g in G is built by selecting k hash functions h1, h2, ..., hk from H, so that

g(v) = (h1(v), h2(v), ..., hk(v)).

(ii) Select an integer L; then randomly select L functions g1, g2, ..., gL from G to build L hash tables and map the data points into them.

When using the binary LSH for object query, for each query object q, the same hash functions used in index construction are applied to calculate the conflicting bucket numbers of q. It is clear that each set of functions gi yields one conflicting bucket number for q, namely gi(q).

Basic LSH adopts all data objects which have the same conflicting bucket number with the query object as the candidates, and then it compares the similarity between the candidates and the query object to find the nearest neighbors of the query object.

However, in the neighbor search of DBSCAN, high-density areas contain many data points. The computational cost of comparing the query object with all candidates would therefore be so large that the query efficiency could not meet the requirements of large-scale datasets. It has been proved in [10] that the more similar two objects are, the more times they will be mapped into the same hash bucket under the same set of LSH operations. Therefore, this paper uses the conflict count sorting strategy proposed in [14]: the candidates are sorted in descending order by their collision counts, and the first k candidates are selected as the neighbors of the query object.
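A hypothetical sketch of this index-and-query pipeline follows (function names like `build_tables` are ours, not from [14]): each of the L tables stores bucket keys produced by one k-bit g-function, and at query time candidates are ranked by how many tables they collide in:

```python
import numpy as np
from collections import Counter

def build_tables(X, projections, w):
    """Index step: one hash table per g-function.
    projections[i] = (A, b) with A of shape (k_bits, dim), b of shape (k_bits,)."""
    tables = []
    for A, b in projections:
        keys = np.floor((X @ A.T + b) / w).astype(int)
        table = {}
        for idx, key in enumerate(map(tuple, keys)):
            table.setdefault(key, []).append(idx)
        tables.append(table)
    return tables

def query_topk(q, tables, projections, w, k):
    """Count collisions over all L tables, keep the k most frequent candidates."""
    counts = Counter()
    for (A, b), table in zip(projections, tables):
        key = tuple(np.floor((A @ q + b) / w).astype(int))
        counts.update(table.get(key, []))
    return [idx for idx, _ in counts.most_common(k)]
```

With enough tables, a true near neighbor collides in almost every table while distant points collide only occasionally, so ranking by collision count avoids computing exact distances to all candidates.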

##### 2.3. Binary-KNN Representation

As is well known, the neighbor structure contains strong class information, and the similarity between data objects can be judged effectively through it [15, 16]. In this section, we propose a binary representation method based on the k nearest neighbors. It encodes the neighbor structure of each data point in binary, which maps the complex high-dimensional dataset into the Hamming space. Clustering in the Hamming space considerably decreases the run time of DBSCAN.

The details can be described as follows. For any data object xi in dataset D and its k-neighborhood NNk(xi) = {xi1, xi2, ..., xik} (ij is the subscript of the jth neighbor of xi, which can be found by using the LSH proposed in Section 2.2), we define its new expression as bi = (bi1, bi2, ..., bin). If and only if xj ∈ NNk(xi) or j = i, bij = 1; otherwise, its value is 0. With this, we obtain the binary-KNN representation bi of the object xi and, at last, the binary-KNN representation dataset B of the dataset D.
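This mapping can be sketched as follows (a hypothetical helper of ours, assuming the k-nearest-neighbor lists have already been obtained, e.g., from the LSH step above):

```python
import numpy as np

def binary_knn_matrix(neighbors, n):
    """neighbors[i] = indices of x_i's k nearest neighbors (e.g., from LSH).
    Returns B where B[i, j] = 1 iff x_j is among them or j == i."""
    B = np.zeros((n, n), dtype=np.uint8)
    for i, nbrs in enumerate(neighbors):
        B[i, nbrs] = 1   # mark the k neighbors of x_i
        B[i, i] = 1      # the point itself is also set to 1 (j = i case)
    return B
```

Each row bi is a length-n bit vector, so subsequent neighborhood comparisons reduce to fast bitwise operations in the Hamming space.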

Different from the original DBSCAN, this paper uses binary LSH rather than linear search to query the k nearest neighbors, which improves the efficiency of the neighbor query. It also transforms the neighbor structure information of each data point into a binary vector, by which the clustering can be operated in the Hamming space. Dividing the data points into core points, border points, and noise points is faster in the Hamming space, so the run time of cluster expansion is decreased sharply.

#### 3. Improved DBSCAN Clustering Algorithm

##### 3.1. Influence Space in Binary Dataset

Density-based clustering aims to find the areas where the density exceeds a threshold. In DBSCAN, the global parameters Eps and MinPts are used to measure density, which leads to lower clustering quality on datasets with various densities. To improve its robustness to various densities, we introduce a core point distinguishing method based on the influence space, together with its solving method in the binary-KNN dataset. Due to the local density sensitivity of the influence space, our method improves the robustness of DBSCAN on datasets with various densities. Moreover, by applying the core point distinguishing method in the binary Hamming space (Section 2.3), the efficiency is further improved.

For further explanation, we give the following definitions.

*Definition 7 (k-neighborhood-point set NNk(xi)).* For xi ∈ D, the k-neighborhood-point set NNk(xi) consists of the k nearest neighbors of xi, expressed as NNk(xi) = {xi1, xi2, ..., xik} (ij is the subscript of the jth neighbor of xi).

*Definition 8 (core point).* For xi ∈ D, if xi is a core point, then it meets the following condition:

|IS(xi)| ≥ ω · k.

*Definition 9 (border point).* For xi ∈ D, if xi is a border point, then it meets the following condition:

|IS(xi)| < ω · k, and xi belongs to the neighborhood of some core point.

*Definition 10 (noise point).* For xi ∈ D, if xi is a noise point, then it meets the following condition:

|IS(xi)| < ω · k, and xi does not belong to the neighborhood of any core point,

where IS(xi) is the influence space of xi, which contains the data points in NNk(xi) whose k nearest neighbors also include xi; |IS(xi)| is the number of points in the influence space; ω is the weight coefficient, with a general value of 2/3; and k is the number of neighbors.

The influence space was first proposed by Jin et al. [17] for estimating neighborhood density. Different from DBSCAN, which is weak on various densities, the influence space is very sensitive to changes in local density. By using the influence space, the clustering quality on datasets with various densities can be improved markedly.

Also, to calculate IS(xi) in the binary dataset introduced in Section 2.3, we design a straightforward method given by the following equation:

IS(xi) = bi ∧ (B^T)i,

where (B^T)i, the ith row of B^T (i.e., the ith column of B), takes the information of the data points whose NNk contains xi, and ∧ denotes the bitwise AND.

IS(xi) is just the intersection of NNk(xi) and the set of points whose k nearest neighbors contain xi. Meanwhile, benefiting from the symmetry of the influence space, acquiring it becomes simple and fast: only a transposition of B is needed, after which the points whose NNk contains xi are given by the ith row of B^T. In general, only one step is needed to calculate the influence space. This greatly simplifies the query step, and the algorithm efficiency is further improved.
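Under the binary-KNN representation of Section 2.3, this intersection amounts to a single vectorized bitwise AND of B with its transpose. The sketch below is our illustration, using ω = 2/3 as in Definition 8:

```python
import numpy as np

def influence_space_sizes(B):
    """|IS(x_i)| for every point: AND each row of B with the matching row
    of B^T, keeping neighbors whose own neighbor sets also contain x_i."""
    IS = B & B.T                 # one vectorized step for all points
    np.fill_diagonal(IS, 0)      # the point itself is not counted in its IS
    return IS.sum(axis=1)

def core_mask(B, k, w=2/3):
    """Definition 8 test: x_i is a core point when |IS(x_i)| >= w * k."""
    return influence_space_sizes(B) >= w * k
```

Because `B & B.T` touches all points at once, no per-point region query is needed to decide which points are cores; border and noise points can then be separated by checking membership in the neighborhoods of the cores.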

##### 3.2. Representative Objects Selection

To improve the efficiency of the algorithm, on the one hand, we need to improve the efficiency of the neighbor query which has been solved by LSH and binary-KNN representation in Section 2; on the other hand, we can also reduce the frequency of the neighbor query.

In the cluster expansion of DBSCAN, all points in the neighborhood are selected as the seeds for the next region query. However, our core point distinguishing the method proposed in Section 3.1 is based on influence space which contains the data points in one neighborhood whose neighborhood also includes the query object. For an object , it should be certain that there is overlap between ’s and of its neighborhood points. When is a core point, it is true that there are more points in its influence space. Theoretically, the more the points in influence space, the larger the overlapping area. There is even a case where the neighborhood of object is completely covered by the neighborhood of its which is shown in Figure 1. Although the object is a core object, if we choose all points in its neighborhood for the next cluster expansion, it will only increase the frequency of the neighbor query which is not conducive to the algorithm efficiency. Therefore, we need to select part of the data points rather than all neighbors of a core point as seed points for the clustering expansion.