Computational Intelligence and Neuroscience

Volume 2016 (2016), Article ID 4075257, 12 pages

http://dx.doi.org/10.1155/2016/4075257

## A Novel Accuracy and Similarity Search Structure Based on Parallel Bloom Filters

^{1}Faculty of Electric Power Engineering, Kunming University of Science and Technology, Kunming 650051, China

^{2}Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650051, China

^{3}Faculty of Transportation Engineering, Kunming University of Science and Technology, Kunming 650051, China

Received 21 April 2016; Revised 25 September 2016; Accepted 26 October 2016

Academic Editor: Hong Man

Copyright © 2016 Chunyan Shuai et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

In high-dimensional spaces, exact and similarity search with low computing and storage costs remains a difficult research topic, and there is a trade-off between efficiency and accuracy. In this paper, we propose a new structure, Similar-PBF-PHT, to represent the items of a high-dimensional set and to retrieve exact and similar items. The Similar-PBF-PHT contains three parts: parallel bloom filters (PBFs), parallel hash tables (PHTs), and a bitmatrix. Experiments show that the Similar-PBF-PHT is effective in membership query and *K*-nearest neighbors (*K*-NN) search. For exact queries, the Similar-PBF-PHT has a low false positive probability (FPP) and acceptable memory costs. For *K*-NN queries, the average overall ratio and rank-*i* ratio of the Hamming distance are accurate, and the ratios of the Euclidean distance are acceptable. Retrieving exact and similar items costs CPU time rather than I/O operations, and the structure can handle different data formats, not only numerical values.

#### 1. Introduction

In high-dimensional spaces, exact search methods, such as *k*d-tree approaches and *Q*-gram, are only suitable for small vectors because they demand huge computation resources. Similarity search algorithms, however, can drastically improve the search speed while maintaining good precision [1]; they include VA-files, best-bin-first, space-filling curves, *K*-means (see [2] and references therein), NV tree [3], *K*-nearest neighbors (*K*-NN), and locality-sensitive hashing (LSH) [4]. Most *K*-NN methods adopt the Euclidean distance and assume that all coordinates are numerical and share the same units and semantics. In some applications, however, a dimension may be a string or a category, which makes the Euclidean distance questionable and artificial.

Among query tools, the bloom filter (BF) [5], a space-efficient randomized data structure with constant query delay, has been widely applied to represent large sets and answer membership queries [6]. The standard BF, however, can only represent 1-dimensional elements of a set; references [7–9] extended it to represent high-dimensional and dynamic sets. These extensions can still only answer membership queries, not similarity queries. In [10, 11], LSH functions replace the random hash functions of the BF to implement similarity search, but these schemes can only deal with numerical coordinates and return the elements whose distances from the query are at most CR in Euclidean spaces, which leads to a false negative probability (FNP).

Here, by computing the Hamming distance, we propose a new structure, called Similar-PBF-PHT, based on BFs and hash tables (HTs), to answer membership queries as well as *K*-NN queries regardless of the radius CR. The Similar-PBF-PHT includes PBFs, PHTs, and a bitmatrix. The PBFs and PHTs apply BFs and HTs to store the dimensions, and the bitmatrix stores the dependences among the dimensions. The experiments show that the Similar-PBF-PHT performs better in Hamming spaces than other methods. Meanwhile, for *K*-NN search, it achieves a balanced performance and can process different data formats, whereas LSH-based methods can only deal with numerical values.
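To make the role of the Hamming distance concrete, the following is a minimal sketch of a brute-force *K*-NN search over records with mixed data formats. It assumes that the distance is counted per dimension (the number of dimensions whose values differ); the function names `hamming_distance` and `knn_hamming` and the sample records are hypothetical, and the linear scan only illustrates the distance, not the Similar-PBF-PHT itself.

```python
def hamming_distance(a, b):
    """Number of dimensions in which two equal-length records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of dimensions")
    return sum(1 for x, y in zip(a, b) if x != y)


def knn_hamming(dataset, query, k):
    """Return the k records closest to the query by a linear scan.

    sorted() is stable, so ties keep their original dataset order.
    """
    return sorted(dataset, key=lambda item: hamming_distance(item, query))[:k]


# Hypothetical records mixing strings and integers.
records = [
    ("red", "suv", 4),
    ("red", "sedan", 4),
    ("blue", "suv", 2),
]
nearest = knn_hamming(records, ("red", "suv", 2), k=2)
```

Because the comparison is an equality test per dimension, strings and categories need no artificial numerical encoding, which is the property the Euclidean distance lacks.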

#### 2. Related Work

Approximate search algorithms come in many kinds; we discuss them in three categories.

The first category is the space partition method, including IDistance [12] and MedRank [13]. The IDistance [12] clusters all high-dimensional elements into multiple subspaces and converts them into a 1-dimensional space. It costs linear space and supports data insertion and deletion; however, if the data are uniformly distributed or the dimensions are anisotropic, space partition and center selection become difficult. The MedRank [13] is a rank-aggregation and instance-optimal algorithm, which aggregates the given dataset into sorted lists, where every element has an entry of the form (Id, key). The number of lists depends on the number of elements in the dataset, and by probing the lists, the MedRank finds approximate NN items. The MedRank requires only linear space, but inserting or deleting an element requires updating the lists and sorting every list again.

The LSH and its variants form another well-known family of *K*-NN search algorithms [14], such as Rigorous-LSH [15], E2LSH [4, 16], Adhoc-LSH [17], LSB-tree [18], LSB-forest [18], BLSH [11], and others [19–21]. Let *D* be a set of points in *d*-dimensional space. The Rigorous-LSH [15] applies a C-approximate ball cover: a ball with radius *R* centered at the query point *q*, denoted as *B*(*q*, *R*). If *B*(*q*, *R*) contains at least one point in *D*, it returns a point that is at most CR away from *q* (*C* is a constant); otherwise, it returns nothing. The Rigorous-LSH is perfect in theory, but its query and space costs are expensive. In Euclidean spaces, E2LSH [4, 16] realizes the Rigorous-LSH through the *p*-stable distribution [22], which greatly reduces the CPU and memory costs of the Rigorous-LSH. The Adhoc-LSH [17] addresses the drawbacks of the Rigorous-LSH with a heuristic approach. Given a query *q* and a magic radius RM, the Adhoc-LSH returns the points within the radius RM. If RM equals the distance between *q* and the exact NN, the Adhoc-LSH works well; if not, an improper RM may lead to FNP. The beyond locality-sensitive hashing (BLSH) [11] scheme uses a two-level hashing algorithm to overcome the lower bound of the FNP [19] and finds the CR-NN in Euclidean spaces. Different from other LSH methods, the second level of the BLSH, parameterized by different center points, is a data-aware scheme. The outer hash table partitions the dataset into buckets of bounded diameter. For each bucket, the BLSH constructs an inner hash table, which uses the center of the minimum enclosing ball of the points in the bucket as its center point. However, the BLSH still has high memory costs and introduces FNP. The LSB-tree and LSB-forest [18] implement the *K*-NN search by space mapping and *Z*-order coding. Using the LSH functions, the LSB-tree [18] first maps the *d*-dimensional points to lower-dimensional points. It then computes the *Z*-order [23] value of each mapped point, which is indexed by a conventional B-tree. Multiple LSB-trees form an LSB-forest, which can be updated efficiently and satisfies the query accuracy, but its space costs are expensive.
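As an illustration of the *p*-stable idea behind E2LSH, the sketch below builds one hash function of the commonly used form h(v) = floor((a · v + b) / w), where the entries of a are drawn from a Gaussian (2-stable) distribution and b is a random offset between 0 and w. The factory name `make_pstable_hash`, the bucket width `w`, and the seeding are assumptions made for this example; it is not the E2LSH package itself, which combines many such functions into several hash tables.

```python
import math
import random


def make_pstable_hash(dim, w, seed):
    """Build one LSH function based on the 2-stable (Gaussian) distribution."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # random projection vector
    b = rng.uniform(0.0, w)                        # random offset between 0 and w

    def h(v):
        if len(v) != dim:
            raise ValueError("vector has the wrong dimension")
        projection = sum(ai * vi for ai, vi in zip(a, v))
        # Points whose projections fall into the same width-w slot collide.
        return math.floor((projection + b) / w)

    return h


h = make_pstable_hash(dim=4, w=4.0, seed=0)
bucket = h([1.0, 2.0, 3.0, 4.0])
```

Nearby points tend to receive the same bucket index because their projections onto a differ only slightly, whereas distant points collide with lower probability.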

BFs have also been introduced into high-dimensional search, for example, high-dimensional dynamic BFs (MDDBFs) [7], PBF-BF [8], PBF-HT [8], similarity sets [24], distance-sensitive BFs (DSBF) [10], and others [25]. The MDDBFs [7] apply parallel standard BFs (PBFs) to represent a *d*-dimensional dynamic dataset. By searching the PBFs, the MDDBFs answer membership queries, but they lack a way to verify the dependency among the multiple dimensions of an item, which causes high FPPs in membership retrieval. To reduce the FPP, PBF-BF and PBF-HT [8] add another BF and a hash table (HT), respectively, to the PBFs to store the verification value of the different dimensions. However, the BF-based methods above can only answer membership queries, not similarity queries.
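The missing dependency check can be demonstrated with a small sketch that keeps one bit array per dimension. The class name `ParallelBloomFilters`, the salted SHA-256 hashing, and the sample items are assumptions for illustration only; they are not taken from [7] or [8].

```python
import hashlib


def positions(value, m, k):
    """Yield k bit positions for a value by salting SHA-256 with the index."""
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{value}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % m


class ParallelBloomFilters:
    """One standard Bloom filter per dimension, with no cross-dimension check."""

    def __init__(self, d, m, k):
        self.m = m
        self.k = k
        self.filters = [[0] * m for _ in range(d)]

    def add(self, item):
        for bits, value in zip(self.filters, item):
            for pos in positions(str(value), self.m, self.k):
                bits[pos] = 1

    def query(self, item):
        # Every dimension is tested on its own filter, independently.
        return all(
            all(bits[pos] for pos in positions(str(value), self.m, self.k))
            for bits, value in zip(self.filters, item)
        )


pbf = ParallelBloomFilters(d=2, m=256, k=3)
pbf.add(("red", "small"))
pbf.add(("blue", "large"))
mixed = pbf.query(("red", "large"))  # True, although this item was never inserted
```

The query for `("red", "large")` succeeds because each value is present in its own dimension, even though the combination was never stored. Storing a verification value for the whole item, as PBF-BF and PBF-HT do, is one way to reject such combinations.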

Distance-sensitive BFs (DSBF) [10] replace the uniform hash functions in the BF with LSH functions to find similar strings. But the DSBF can only differentiate a query string that differs from all strings in the dataset on a (constant) *δ*-fraction of bits. The locality-sensitive bloom filter (LSBF) [26] uses two-level BFs to implement the approximate item query. The first-level BF replaces the random hash functions with locality-sensitive hash (LSH) functions, which are based on the *p*-stable distribution [22], and maps all items to bit arrays. To keep the integrity and reduce the FPP, the second-level BF stores the hash verification signature formed by the LSH functions in the first-level BF. To reduce the FNP, the LSBF needs to probe the neighboring bits in the first-level BF, which increases the query time. Meanwhile, since the LSH function concentrates most points around the mean and maps some neighboring points to remote bits, it increases both the FPP and the FNP.

#### 3. Structures and Working Mechanism

##### 3.1. Structures

A standard BF [5] applies an array of *m* bits (all initially set to 0) and *k* independent hash functions to represent a set *S* of *n* elements, as shown in Figure 1(a). If an element *x* is mapped into the BF by a hash function *h*, the corresponding bit *h*(*x*) is set to 1. Given a query *q*, the BF maps it with the same hash functions and answers whether *q* is a member of *S*, with a certain FPP. To support element deletion, the counting bloom filter (CBF) [27, 28] replaces the array of bits with counters.
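The mechanism just described can be sketched in a few lines. The code below is a minimal illustration, assuming an array of `m` bits and `k` hash functions derived by salting SHA-256 with the hash index; the class names, the parameter names `m` and `k`, and the salting scheme are assumptions for this example and are not the hash family used in the paper.

```python
import hashlib


class BloomFilter:
    """Standard Bloom filter with m bits and k hash functions (sketch)."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m  # all bits start at 0

    def _positions(self, item):
        # Derive k positions by salting SHA-256 with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


class CountingBloomFilter(BloomFilter):
    """Counters replace bits, so that inserted elements can be deleted."""

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] += 1

    def remove(self, item):
        # Only call this for items that were actually added.
        for pos in self._positions(item):
            if self.bits[pos] > 0:
                self.bits[pos] -= 1


bf = BloomFilter(m=1024, k=3)
bf.add("alice")
found = "alice" in bf  # True: an inserted element is never reported absent
```

A membership test can return a false positive when the bits of an absent element happen to be set by other elements, but it never returns a false negative. In the counting variant, deleting an element decrements only the counters it incremented, so other elements remain queryable.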