Abstract

In high-dimensional spaces, exact and similarity search at low computing and storage cost remains a difficult research topic, and there is a trade-off between efficiency and accuracy. In this paper, we propose a new structure, Similar-PBF-PHT, to represent the items of a high-dimensional set and to retrieve exact and similar items. The Similar-PBF-PHT contains three parts: parallel bloom filters (PBFs), parallel hash tables (PHTs), and a bitmatrix. Experiments show that the Similar-PBF-PHT is effective in membership query and K-nearest neighbors (K-NN) search. In exact querying, the Similar-PBF-PHT has a low hit false positive probability (FPP) and acceptable memory costs. In K-NN querying, the average overall ratio and rank-i ratio of the Hamming distance are accurate, and the ratios of the Euclidean distance are acceptable. Retrieving exact and similar items costs CPU time rather than I/O, and the structure can handle different data formats, not only numerical values.

1. Introduction

In high-dimensional spaces, exact search methods, such as kd-tree approaches and Q-gram, are only suitable for small vectors because they demand huge computation resources. Similarity search algorithms, in contrast, can drastically improve the search speed while maintaining good precision [1]; they include VA-files, best-bin-first, space filling curves, K-means (see [2] and references therein), NV tree [3], K-nearest neighbors (K-NN), and locality-sensitive hashing (LSH) [4]. Most K-NN methods adopt the Euclidean distance and assume that all coordinates are numerical and share the same units and semantics. In some applications, however, a dimension may be a string or a category, which makes the Euclidean distance questionable and artificial.

Among query tools, the bloom filter (BF) [5], a space-efficient randomized data structure with constant query delay, has been widely applied to represent big sets and answer membership queries [6]. A BF, however, can only represent 1-dimensional elements of a set; references [7–9] extended it to represent high-dimensional and dynamic sets. These methods can answer only the membership query, not the similarity query. In [10, 11], LSH functions replace the random hash functions of the BF to implement similarity search, but [10, 11] can deal only with numerical coordinates and return only the elements whose distances from the query are at most CR in Euclidean spaces, which leads to a false negative probability (FNP).

Here, by computing the Hamming distance, we propose a new structure, called Similar-PBF-PHT, based on BFs and hash tables (HTs), to search for members as well as the K-NN regardless of the radius CR. The Similar-PBF-PHT includes PBFs, PHTs, and a bitmatrix. The PBFs and PHTs apply BFs and HTs to store dimensions, and the bitmatrix stores the dependences among the dimensions. The experiments show that the Similar-PBF-PHT performs better in Hamming spaces than other methods. Meanwhile, in K-NN searching, it achieves balanced performance and can process different data formats, whereas other LSH-based methods can deal only with numerical values.

2. Related Work

There are different kinds of approximate search algorithms, and we divide them into three categories to discuss.

The first category is the space partition method, including IDistance [12] and MedRank [13]. The IDistance [12] clusters all high-dimensional elements into multiple spaces and converts them into a 1-dimensional space. It costs linear space and supports data insertion and deletion; however, if the data are distributed uniformly or the dimensions are anisotropic, space partition and center selection become difficult. The MedRank [13] is a rank aggregation and instance optimal algorithm, which aggregates the given dataset into sorted lists, where every element has an entry of the form (Id, key). The number of lists depends on the number of elements, and by probing the lists, the MedRank finds approximate NN items. The MedRank needs only linear space, but inserting or deleting an element requires updating the lists, and every list must be sorted again.

The LSH and its variants form another well-known family of K-NN search algorithms [14], such as Rigorous-LSH [15], E2LSH [4, 16], Adhoc-LSH [17], LSB-tree [18], LSB-forest [18], BLSH [11], and so on [19–21]. Let $D$ be a set of points in $d$-dimensional space. The Rigorous-LSH [15] applies a C-approximate ball cover: a ball of radius $R$ centered at the query point $q$, denoted as $B(q, R)$. If $B(q, R)$ contains at least one point in $D$ ($C$ is a constant), it returns a point that is at most CR distance to $q$; otherwise it returns nothing. The Rigorous-LSH is theoretically perfect, but its query and space costs are expensive. In Euclidean spaces, E2LSH [4, 16] implements the Rigorous-LSH through p-stable distributions [22], which greatly reduces the CPU and memory costs of the Rigorous-LSH. The Adhoc-LSH [17] addresses the drawbacks of the Rigorous-LSH by a heuristic approach. Given a query $q$ and a magic radius RM, the Adhoc-LSH returns the points within the radius RM. If RM equals the distance between $q$ and the exact NN, the Adhoc-LSH works well; if not, an improper RM may lead to FNP. The beyond locality sensitive hashing (BLSH) [11] scheme uses a two-level hashing algorithm to overcome the lower bound of the FNP [19] and finds the CR-NN in Euclidean spaces. Different from other LSH methods, the second level of the BLSH, parameterized by different center points, is a data-aware scheme. The outer hash table partitions the dataset into buckets of bounded diameter. For each bucket, the BLSH constructs an inner hash table, which uses the center of the minimum enclosing ball of the points in the bucket as a center point. However, the BLSH still has high memory costs and still incurs FNP. The LSB-tree and LSB-forest [18] implement the K-NN search by space mapping and Z-order coding. Using LSH functions, the LSB-tree [18] first maps $d$-dimensional points to lower-dimensional points and then computes the Z-order [23] value of each mapped point, which is indexed by a conventional B-tree. Multiple LSB-trees form an LSB-forest, which can be updated efficiently and satisfies query accuracy, but its space costs are expensive.

BFs have also been introduced into high-dimensional search, such as high-dimensional dynamic BFs (MDDBFs) [7], PBF-BF [8], PBF-HT [8], similarity sets [24], distance-sensitive BFs (DSBF) [10], and so on [25]. The MDDBFs [7] apply parallel standard BFs (PBFs) to represent a d-dimensional dynamic dataset. By searching the PBFs, the MDDBFs answer the membership query, but they lack a way to verify the dependency among the multiple dimensions of an item, which causes high FPPs in membership retrieval. To reduce the FPP, PBF-BF and PBF-HT [8] add another BF or a hash table (HT), respectively, to the PBFs to store the verification value of the different dimensions. However, the BF-based methods above can only answer the membership query, not the similarity query.

Distance-sensitive BFs (DSBF) [10] replace the uniform hash functions in the BF with LSH functions to find similar strings. But the DSBF can only differentiate a query string that differs from all strings in the dataset on a (constant) δ-fraction of bits. The locality-sensitive bloom filter (LSBF) [26] uses two-level BFs to implement the approximate item query. The first-level BF replaces the random hash functions with locality-sensitive hash (LSH) functions, which are based on p-stable distributions [22], and maps all items to the bit arrays of the bloom filter. To keep the integrity and reduce the FPP, the second-level BF stores the hash verification signature formed by the LSH functions in the first-level BF. In order to reduce the FNP, the LSBF needs to probe the neighboring bits in the first-level BF, which costs more query time. Meanwhile, since the LSH functions concentrate most points around the mean and map some neighboring points to remote bits, the LSBF suffers from larger FPP and FNP.

3. Structures and Working Mechanism

3.1. Structures

A standard BF [5] applies an array of $m$ bits (initially all set to 0) and $k$ independent hash functions $h_1, \ldots, h_k$ to represent a set $S$ of $n$ elements, as shown in Figure 1(a). If an element $x$ is mapped into the BF by $h_i$, the corresponding bit $h_i(x)$ is set to 1. Given a query $q$, by mapping it with the $k$ hash functions, the BF answers whether $q$ is a member of $S$ with a FPP. In order to support element deletion, the counting bloom filter (CBF) [27, 28] replaces the array of bits with counters.
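The behavior of a standard BF can be illustrated with a minimal Python sketch. It is not the implementation used in this paper: the class name `BloomFilter`, the parameters `m` (number of bits) and `k` (number of hash functions), and the salted SHA-256 hash family are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Minimal standard Bloom filter with m bits and k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for readability

    def _positions(self, item: str):
        # Derive k positions by salting SHA-256 with the hash index.
        # This hash family is an illustrative assumption.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))
```

An inserted element is always reported as present, while an element that was never inserted is reported as present only when all of its positions happen to be set by other elements.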

In this paper, parallel BFs (PBFs) and parallel hash tables (PHTs) are proposed to represent elements with $d$ dimensions. At the same time, a bitmatrix is introduced to keep the inherent dependency among the dimensions and reduce the FPP, as shown in Figure 1(b).

3.1.1. PBFs

To store the dimensions, this paper introduces one BF per dimension (Figure 1(b)); every BF has its own independent hash functions [29] and bit array. When the jth dimension of an element is mapped into the jth BF by the hash functions, the corresponding bits are set to 1. Since the attenuation method makes hash values spread broadly [8] and reduces the FPP, we apply the attenuated sum of the hash values as the verification value of the dimension.

3.1.2. PHTs

In order to find out which dimensions and how many dimensions of the elements in the set are similar to the query, this paper utilizes parallel hash tables (PHTs) and hash links to store the identifications (IDs) of the elements. Each hash table, denoted as HT, is in fact an array of hash links.

3.1.3. Bitmatrix

Since the dimensions are stored in the BFs and HTs separately, the integrity of the elements is destroyed, which leads to query confusion. Thus, an auxiliary structure, called the bitmatrix, is added to record the dimensions hit in the PBFs and PHTs. After the dimensions are checked in the PBFs and PHTs, the number of hit dimensions of each element is summed up in the bitmatrix. If the sum equals the number of dimensions $d$, the query is a member of the set with a FPP, as shown in (1). If the sum is 0, no dimension of the query is in the set, for example, (2). If the sum lies strictly between 0 and $d$, the query is a similar element with a FPP, as shown in (3).

The bitmatrix: (a) membership; (b) outlier; (c) similarity.
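The three cases can be sketched in Python as follows. This is only an illustration under stated assumptions: the bitmatrix is taken to be a list of rows with one row per dimension and one column per stored element, $d$ is the number of dimensions, and the function name `classify_columns` and its labels are hypothetical.

```python
def classify_columns(bitmatrix):
    """Label each stored element by the number of query dimensions it matched.

    bitmatrix[j][e] is 1 if dimension j of the query hit element e, else 0.
    """
    d = len(bitmatrix)
    n = len(bitmatrix[0]) if d else 0
    labels = []
    for e in range(n):
        hits = sum(bitmatrix[j][e] for j in range(d))  # column sum
        if hits == d:
            labels.append("member")    # every dimension matched
        elif hits == 0:
            labels.append("outlier")   # no dimension matched
        else:
            labels.append("similar")   # some, but not all, dimensions matched
    return labels
```

For a query with three dimensions checked against three stored elements, the matrix `[[1, 0, 1], [1, 0, 0], [1, 0, 1]]` yields one member, one outlier, and one similar element, in that column order.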

3.2. Working Mechanism

3.2.1. Initialization or Insertion

When the dimensions of an element are mapped into the BFs and HTs by the hash functions, the corresponding locations of the bit arrays in the PBFs are set to 1, and the attenuated hash values of each dimension are summed up. By mapping, the corresponding link in the jth HT is found, and a new hash node is added to the tail of the link to store the item’s ID.

3.2.2. Query

Only when a dimension returns 1 in the BF are its attenuated hash values summed up and located in the corresponding HT. The hit elements’ IDs are found, and the corresponding bits in the bitmatrix are set to 1. After all dimensions are mapped, the columns of the bitmatrix are summed up. If a column sum is in the range between 1 and $d$, a member or similar element is obtained.

3.2.3. Element Deletion

Since deleting bits from a BF would introduce errors for other elements that share those bits, the Similar-PBF-PHT only needs to delete the hash node in the corresponding HT.
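The insertion, query, and deletion steps above can be sketched in Python. This is a simplified illustration, not the authors' implementation. The class name `SimilarPBFPHT`, the parameters `m`, `k`, and `ht_len`, the salted SHA-256 hash family, and in particular the attenuation rule (shifting the ith hash value right by i bits before summing) are assumptions; the exact attenuation formula is not given in this text. The bitmatrix is kept implicitly as per-element column sums.

```python
import hashlib
from collections import defaultdict


class SimilarPBFPHT:
    def __init__(self, d, m=1024, k=3, ht_len=257):
        self.d, self.m, self.k, self.ht_len = d, m, k, ht_len
        self.bfs = [bytearray(m) for _ in range(d)]       # PBFs: one BF per dimension
        self.hts = [defaultdict(list) for _ in range(d)]  # PHTs: bucket -> list of IDs

    def _hashes(self, j, value):
        # k salted SHA-256 values for dimension j (illustrative hash family).
        out = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{j}:{i}:{value}".encode("utf-8")).digest()
            out.append(int.from_bytes(digest[:8], "big"))
        return out

    def _bucket(self, hashes):
        # Hypothetical attenuation: shift the i-th hash right by i bits, then sum.
        check = sum(h >> i for i, h in enumerate(hashes))
        return check % self.ht_len

    def insert(self, item_id, dims):
        for j, value in enumerate(dims):
            hs = self._hashes(j, value)
            for h in hs:
                self.bfs[j][h % self.m] = 1                 # set the BF bits
            self.hts[j][self._bucket(hs)].append(item_id)   # add a hash node at the tail

    def delete(self, item_id, dims):
        # Only the hash node is removed; the BF bits stay set.
        for j, value in enumerate(dims):
            bucket = self.hts[j][self._bucket(self._hashes(j, value))]
            if item_id in bucket:
                bucket.remove(item_id)

    def hit_counts(self, dims):
        # Column sums of the bitmatrix: item ID -> number of hit dimensions.
        counts = defaultdict(int)
        for j, value in enumerate(dims):
            hs = self._hashes(j, value)
            if all(self.bfs[j][h % self.m] for h in hs):    # dimension returns 1 in the BF
                for item_id in set(self.hts[j].get(self._bucket(hs), ())):
                    counts[item_id] += 1
        return dict(counts)

    def knn(self, dims, k_nn):
        # Hamming distance = d minus the number of hit dimensions.
        ranked = sorted((self.d - c, item_id)
                        for item_id, c in self.hit_counts(dims).items())
        return [(item_id, dist) for dist, item_id in ranked[:k_nn]]
```

In this sketch a stored element always reaches a hit count of `d` when queried with its own dimensions, so false negatives arise only from deletion, while BF and HT collisions can inflate the hit count of other elements, which corresponds to the FPP analyzed in the next section.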

4. Performance Analysis

Since the BF only has FPP but not FNP [24], we evaluate the performance of the Similar-PBF-PHT by the quality of results, FPP, query time, and space consumption.

4.1. False Positive Probability (FPP)

False Positive. A query is a false positive for the object set if it gets a positive answer while in fact it is not a member, or no neighbor object exists in Euclidean or Hamming spaces. The FPP is the probability of false positives.

4.1.1. FPP of a BF and a HT

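Theorem 1. When $n$ elements of the set have been mapped to a BF with $m$ bits by $k$ different independent hash functions, the FPP of the BF [24] is

$$f_{BF} = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}.$$

When $k = (m/n)\ln 2$, $f_{BF}$ attains its minimum value $(1/2)^{k}$, or approximately $(0.6185)^{m/n}$ [24].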
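As a numerical check of Theorem 1, the standard approximation of the BF false positive probability and the optimal number of hash functions can be evaluated with a short Python sketch. The symbols `n` (elements), `m` (bits), and `k` (hash functions) follow the usual BF notation and are assumed here; the function names are hypothetical.

```python
import math


def bloom_fpp(n: int, m: int, k: float) -> float:
    """Approximate FPP of a BF: (1 - exp(-k * n / m)) ** k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_k(n: int, m: int) -> float:
    """Real-valued number of hash functions that minimizes the FPP."""
    return (m / n) * math.log(2)
```

With `n = 1000` and `m = 10000`, `optimal_k` gives about 6.93 hash functions and the FPP is about 0.0082, which matches both $(1/2)^{k}$ and $(0.6185)^{m/n}$. In practice `k` is rounded to a nearby integer.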

Theorem 2. A HT possesses an upper bound on its FPP if the hashes of the BF follow a uniform distribution and the attenuated check value obeys a normal distribution.

Proof. In this paper, BFs are used to present the dimensions of elements, and each BF owns independent hash functions [19]. Let each random variable follow the uniform distribution with range , the expected value of , and variance of , and is ranged . According to central limit theorem [1], if is big enough, the random variable satisfies a normal distribution with the expected value of and variance of . Because the sum of attenuated value of hash functions can reduce the FPP, we store the attenuated value of jth dimension of element into the jth HT. It is difficult to estimate the probability density functions of , and according to birthday attack [2], when distributes uniformly, the collision will be minimum. For simplicity, we suppose satisfies normal distribution to estimate the upper bound of false positive.
Let ; due to , the discrete degree of is bigger than , while the collision probability of is smaller than . Let be the FPP of the verification value of the jth attribute of x. There exists . According to central limit theorem [30], satisfy normal distribution, with the expected value of and variance of .
Let there be items and be in the range of . In order to get the upper bound, we compute the maximum FPP of each dimension, that is, :

4.1.2. FPP of Similar-PBF-PHT

Theorem 3. With K-NN searching, the average FPP of dimensions misdetected, is , in which is the FPP of a dimension misdetected in both the BF and HT. When , it is membership search, and the hit FPP is .

Proof. The BF and HT are independent, so the FPP of a dimension is . With similarity querying, if any dimensions collide in the BFs and HTs and other dimensions are members, the collision happens, which satisfies binomial distribution, and all combinations areWith similarity searching, let , and the average FPP isWhen , there is , and the FPP isWhen , it is the membership search, namely, .

If dimensions are misdetected simultaneously, gets the minimum value. If only one dimension is falsely detected, is the maximum value. There is

In Figure 2, let ; with dimensions increasing, the maximum and minimum of the decay exponentially.

4.2. Average Overall Ratio

We evaluate the quality of a K-NN search result by the rank-i ratio and the average overall ratio (AOR) [18], which are used in most experiments. The rank-i ratio of a query $q$ is denoted by $R_i(q)$ and defined as

$$R_i(q) = \frac{d(o_i, q)}{d(o_i^{*}, q)},$$

where $1 \le i \le K$, $d(o_i, q)$ is the distance of the queried $i$th neighbor to $q$, and $d(o_i^{*}, q)$ is the distance of the actual $i$th neighbor to $q$. The overall approximation ratio is the mean of the ratios of all ranks, namely, $\frac{1}{K}\sum_{i=1}^{K} R_i(q)$.
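The two quality measures can be computed with a small Python sketch. It assumes the convention of [18] in which the rank-i ratio divides the distance of the returned ith neighbor by the distance of the true ith neighbor; the function names and the handling of zero distances are assumptions made for this illustration.

```python
def rank_ratio(returned_dist: float, exact_dist: float) -> float:
    """Rank-i ratio: returned i-th neighbor distance over true i-th neighbor distance."""
    if exact_dist == 0:
        # Assumed convention: an exact match of a zero-distance neighbor counts as 1.
        return 1.0 if returned_dist == 0 else float("inf")
    return returned_dist / exact_dist


def overall_ratio(returned, exact) -> float:
    """Mean of the rank-i ratios over all K ranks."""
    ratios = [rank_ratio(r, e) for r, e in zip(returned, exact)]
    return sum(ratios) / len(ratios)
```

For returned distances `[1.0, 2.2, 3.0]` and true distances `[1.0, 2.0, 3.0]`, the rank ratios are approximately 1, 1.1, and 1, so the overall ratio is about 1.033.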

4.3. Complexity

4.3.1. Storage Space

The storage spaces of the Similar-PBF-PHT contain three parts:

(i) When the FPP of a BF is not greater than and the number of hash functions is  optimal, to express the set of elements, the size of the BF array must be  . Then the spaces required by parallel BFs are   (bits)

(ii) A HT needs to store all IDs of the elements and the next node. Let be the  length of the HT (), a node takes up bits, and the HT requires spaces with a range  from to . The spaces range of the HTs is bits.

(iii) The bitmatrix needs bits.

The storage spaces of the Similar-PBF-PHT are (bytes) When , space gets the minimum value. and when , , and are constants, the space complexity of the Similar-PBF-PHT is

4.3.2. Search Time

When querying, the PBFs need times hash calculation, the PHTs require times to search the IDs, in which is the average length of the HT bucket links.

During the membership searching, all hit element's IDs need to be recorded, and the time complexity is

The bitmatrix traverses at most times. So the time complexity is

5. Experiments

5.1. Dataset and Setting

The BF is designed to represent a set, and there is no standard benchmark for it. Here we choose 4 datasets used in most experiments: Color [13], Mnist [12], Varden [18], and Reuters 21578 [31]. The data formats in Reuters 21578 are varied, including digits, characters, symbols, and their combinations. We use it to generate a 49396-item dataset with 1000 dimensions to test the performance of the Similar-PBF-PHT, including the query latency and the ability to process data. The experiments run on a computer with a 2.5 GHz Intel dual-core processor and 8 GB of RAM.

5.2. Membership Query

In this section, we will discuss performances of different methods in membership query.

Let , , , , and , where is bits of every verification values in the PBF-HT and Similar-PBF-PHT. Figure 3 displays the FPPs of the SBF [5], PBF [7], PBF-HT [8], PBF-BF [8], and Similar-PBF-PHT on Reuters 21578 data. With the increasing, the FPPs decrease, and to a constant , the FPPs will increase with the number of the elements growing, especially the SBF and PBF. When the number of the items exceeds a threshold (), the FPPs of the SBF and PBF are nearly equal to 1, which is consistent in the BF theory. In different , the Similar-PBF-PHT gets the lowest FPP; even when , the biggest FPP is not beyond 0.01, while the FPPs of others are almost 1.

Figure 4 demonstrates memory usages of the PBF, PBF-BF, PBF-HT, and Similar-PBF-PHT on the Reuters 21578 dataset, when , , , , and . According to formula (13), to fit a constant FPP, the memory usage will grow with the number of the items increasing. The hash tables and the bitmatrix reduce the FPP at the cost of memory, and the BF’s bits arrays take up 1/4 spaces of the CBFs in other 3 schemes. All these make the space consumption of the Similar-PBF-PHT just a little higher than the PBF but lower than the PBF-HT.

5.3. K-NN Search

To evaluate the accuracy of the K-NN search, we compare the average overall ratios of the Rigorous-LSH [15], MedRank [13], Adhoc-LSH [17], LSB-tree, and LSB-forest [18] with those of the Similar-PBF-PHT on the Color and Mnist datasets, as shown in Figure 5. The workload is set to 50, and 1–100 nearest neighbors are searched. In Hamming spaces, the ratios of the Similar-PBF-PHT are almost equal to 1. In Euclidean spaces, the ratios of the Similar-PBF-PHT are not stable: the overall ratios on Mnist are almost as good as those of the LSB-forest, but the ratios on Color are a little higher and increase with the number of nearest neighbors. The main reason is that the dimensions of Mnist are sparse (most values are 0), so most Hamming distances are 0, whereas the dimensions of Color are dense, so a small distance (0.0001) in Euclidean spaces is recognized as 1 in Hamming spaces. All these make the accuracy decrease, but the ratios are still beyond 0.98.
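The gap between the two distances on dense data can be illustrated with a short Python sketch. The Hamming distance here counts the dimensions whose values differ, which is how it is used in this paper; the function names and the example vectors are hypothetical.

```python
import math


def hamming(a, b) -> int:
    """Number of dimensions in which the two vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def euclidean(a, b) -> float:
    """Ordinary Euclidean distance between two numerical vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


dense_a = [0.5, 0.25, 0.75]
dense_b = [0.5001, 0.25, 0.75]   # differs from dense_a by 0.0001 in one dimension
sparse_a = [0, 0, 0, 7]
sparse_b = [0, 0, 0, 7]          # identical sparse vectors
```

The dense pair is almost identical in Euclidean terms, yet the differing dimension counts as a full unit of Hamming distance, while the identical sparse pair has Hamming distance 0.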

Figure 6 displays average rank-i ratios of the Euclidean and Hamming distance. presents Hamming distance; because of the FPP of the BF, the actual distance is less than the query distance; there exists . In Figure 6(a), with increasing, the rank-i ratios of Hamming distance are stable and not lower than 0.985. On the Mnist (Figure 6(b)), rank-i ratios of Euclidean distances of the Similar-PBF-PHT are minimum, almost equal to 1, while, on the Color (Figure 6(c)), the ratios of Euclidean distance increase slowly and are higher than the LSB-tree and LSB-forest’s; when the , it becomes lower than the MedRank.

Table 1 analyzes the memory consumption (MB) on the Varden dataset, setting bytes, , and . Although the memory costs of the LSB-tree are less, its FPPs are higher than the LSB-forest, so we abandon it. The memory usages of the Similar-PBF-PHT are minimum while the Rigorous-LSH are maximum, and all consumption increases with and . When the memory costs of the LSB-Forest are almost as big as the Adhoc-LSH.

The Similar-PBF-PHT can deal the dimensions with different formats and lengths, and the length of dimension and number of samples will affect the query time. In Figure 7(a), we set to 100, 500, and 1000, respectively, and every dimension contains 20 characters (big enough to most applications) to search 10-NN. With dimensions growing, the average query latency of the Similar-PBF-PHT increases linearly. Let , , and ; Figures 7(b) and 7(c) demonstrate effects of different dimension’s lengths on query delay with 10-NN searching. Average query latencies will increase with the numbers of the characters and dimensions. This is because most of the CPU time is wasted on processing the hash values of the characters.

In Figure 8, we analyze the effects of the parameters α and on the AORs and FPPs of the Similar-PBF-PHT. Let , , , and  -NN and let test workload be 10000. As shown in the Figure 8, under different α, even small α (0.01), and big (0.5), the Similar-PBF-PHT gets good query results and low FPPs. That means the PHT and the bitmatrix can effectively improve the detection accuracy. affects the query accuracy much more than the FPP of the BF. With α increasing, the FPP decreases and the AOR increases; at the same time the space consumption increases.

6. Conclusions

In this paper, we propose a comprehensive structure, called Similar-PBF-PHT, to represent and search for member and similar elements of a big dataset in high-dimensional spaces by computing the Hamming distance. We analyze its working mechanism, FPP, and space and time complexity in detail. The experiments show that, in membership searching, compared with the PBF, PBF-HT, and PBF-BF, the Similar-PBF-PHT has a lower hit FPP at a low memory cost. The Similar-PBF-PHT costs less storage than the schemes based on locality-sensitive hashing, including the Rigorous-LSH, LSB-forest, Adhoc-LSH, and BLSH. In K-NN querying, it costs CPU time rather than I/O, which gives it a lower query latency. Meanwhile, the Similar-PBF-PHT computes hash values of all characters in each dimension, so it can deal with different data formats (characters, numbers, symbols, and so on), although the number of characters affects the query time. The average overall ratios (query accuracy) and the average rank-i ratios of the Hamming distance are accurate. All these advantages make it appropriate for representing and searching items in high-dimensional spaces, such as similarity search over databases and documents.

Although the Similar-PBF-PHT performs well in Hamming spaces, its memory costs and its FPP in Euclidean spaces for K-NN searching are still somewhat high. In the future, we will study locality-sensitive hash functions to replace the random hash functions and further reduce the storage space.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This work is supported by the National Nature Science Foundation of China (no. 61562056), Yunnan Province Education Department (nos. 2015Z055 and 2015Y073), and Yunnan Province Nature Science Foundation (nos. KKSY201303125, KKSY201304129, and KKSY201404106).