Research Article  Open Access
A Novel Accuracy and Similarity Search Structure Based on Parallel Bloom Filters
Abstract
In high-dimensional spaces, exact (membership) and similarity search at low computing and storage cost remains a difficult research topic, and there is always a trade-off between efficiency and accuracy. In this paper, we propose a new structure, SimilarPBFPHT, to represent the items of a high-dimensional set and to retrieve both exact and similar items. The SimilarPBFPHT contains three parts: parallel bloom filters (PBFs), parallel hash tables (PHTs), and a bitmatrix. Experiments show that the SimilarPBFPHT is effective for membership queries and K-nearest neighbors (KNN) search. For exact queries, the SimilarPBFPHT has a low false positive probability (FPP) and acceptable memory cost. For KNN queries, the average overall ratio and rank-i ratio under the Hamming distance are accurate, and the ratios under the Euclidean distance are acceptable. Retrieving exact and similar items costs CPU time rather than I/O, and the structure can handle different data formats, not only numerical values.
1. Introduction
In high-dimensional spaces, exact search methods, such as k-d tree approaches and Q-gram, are only suitable for small vectors because they demand huge computation resources. Approximate similarity search algorithms, however, can drastically improve the search speed while maintaining good precision [1]; these include VA-files, best-bin-first, space-filling curves, K-means (see [2] and references therein), NV-tree [3], K-nearest neighbors (KNN), and locality-sensitive hashing (LSH) [4]. Most KNN methods adopt the Euclidean distance and assume that all coordinates are numerical and share the same units and semantics. In some applications, however, a dimension may be a string or a category, which makes the Euclidean distance questionable and artificial.
In query tools, a bloom filter (BF) [5], a space-efficient randomized data structure with constant query delay, has been widely applied to represent big sets and answer membership queries [6]. However, a BF can only represent the 1-dimensional elements of a set; references [7–9] extended it to represent high-dimensional and dynamic sets. These methods still answer only the membership query, not the similarity query. In [10, 11], LSH functions replace the random hash functions of the BF to implement similarity search, but these schemes can only handle numerical coordinates and return elements whose distance from the query is at most CR in Euclidean spaces, which leads to a false negative probability (FNP).
Here, by computing the Hamming distance, we propose a new structure, called SimilarPBFPHT, based on BFs and hash tables (HTs), to search for membership as well as the KNN regardless of the radius CR. The SimilarPBFPHT includes PBFs, PHTs, and a bitmatrix. The PBFs and PHTs apply BFs and HTs to store dimensions, and the bitmatrix stores the dependences among the dimensions. The experiments show that the SimilarPBFPHT performs better in Hamming spaces than other methods. Meanwhile, for KNN search, it achieves balanced performance and can process different data formats, whereas other LSH-based methods can only handle numerical values.
2. Related Work
There are different kinds of approximate search algorithms, and we divide them into three categories to discuss.
The best-known category is the space partition method, including IDistance [12] and MedRank [13]. The IDistance [12] clusters all high-dimensional elements into multiple subspaces and converts them into a 1-dimensional space. It costs linear space and supports data insertion and deletion; however, if the data are uniformly distributed or the dimensions are anisotropic, space partitioning and center selection become difficult. The MedRank [13] is a rank aggregation and instance-optimal algorithm, which aggregates the given dataset into sorted lists, where every element has an entry of the form (Id, key). The number of the lists equals , where is the number of the elements, and by probing the lists, the MedRank finds approximate NN items. The MedRank uses only linear space, but inserting or deleting an element requires updating the lists, and every list must be sorted again.
The LSH and its variants are another well-known family of KNN search algorithms [14], including RigorousLSH [15], E2LSH [4, 16], AdhocLSH [17], LSBtree [18], LSBforest [18], BLSH [11], and others [19–21]. Let be a set of points in d-dimensional space. The RigorousLSH [15] applies a C-approximate ball cover, which has radius and centers at the query point , denoted as . If contains at least one point in ( is a constant), it returns a point that is at most CR distance from the query; otherwise it returns nothing. The RigorousLSH is theoretically sound, but its query and space costs are expensive. In Euclidean spaces, E2LSH [4, 16] realizes the RigorousLSH through p-stable distributions [22], which greatly reduces the CPU and memory costs of the RigorousLSH. The AdhocLSH [17] addresses the drawbacks of the RigorousLSH with a heuristic approach. Given a query and a magic radius RM, the AdhocLSH returns the points within the radius RM. If RM equals the distance between the query and the exact NN, the AdhocLSH works well; if not, an improper RM may lead to FNP. The beyond locality-sensitive hashing (BLSH) [11] scheme uses a two-level hashing algorithm to overcome the lower bound of the FNP [19] and finds CRNN in Euclidean spaces. Unlike other LSH methods, the second level of the BLSH, parameterized by different center points, is a data-aware scheme. The outer hash table partitions the dataset into buckets of bounded diameter. For each bucket, the BLSH constructs an inner hash table, which uses the center of the minimum enclosing ball of the points in the bucket as a center point. However, the BLSH still has high memory costs and incurs FNP. The LSBtree and LSBforest [18] implement the KNN search by space mapping and Z-order coding. Using the LSH functions, the LSBtree [18] first maps the d-dimensional points to lower-dimensional points. Then the LSBtree computes the Z-order [23] value of each mapped point, which is indexed by a conventional B-tree. Multiple LSBtrees form an LSBforest, which can be updated efficiently and achieves satisfactory query accuracy, but its space cost is high.
BFs have also been introduced into high-dimensional search, as in high-dimensional dynamic BFs (MDDBFs) [7], PBFBF [8], PBFHT [8], similarity sets [24], distance-sensitive BFs (DSBF) [10], and others [25]. The MDDBFs [7] apply parallel standard BFs (PBFs) to represent a d-dimensional dynamic dataset. By searching the PBFs, the MDDBFs determine membership, but they lack a way to verify the dependency among the multiple dimensions of an item, which causes high FPPs in membership retrieval. To reduce the FPP, PBFBF and PBFHT [8] add another BF or a hash table (HT), respectively, to the PBFs to store the verification values of the different dimensions. However, the BF-based methods above can only answer the membership query, not the similarity query.
Distance-sensitive BFs (DSBF) [10] replace the uniform hash functions in the BF with LSH functions to find similar strings. However, the DSBF can only differentiate a query string that differs from all strings in the dataset on a (constant) δ-fraction of bits. The locality-sensitive bloom filter (LSBF) [26] uses two-level BFs to implement approximate item queries. The first-level BF replaces the random hash functions with locality-sensitive hash (LSH) functions, which are based on p-stable distributions [22], and maps all items to bit arrays. To keep the integrity and reduce the FPP, the second-level BF stores the hash verification signature formed by the LSH functions in the first-level BF. To reduce the FNP, the LSBF needs to probe the neighboring bits in the first-level BF, which costs more query time. Meanwhile, since the LSH function concentrates most points around the mean and maps some neighboring points to remote bits, it brings a bigger FPP and FNP.
3. Structures and Working Mechanism
3.1. Structures
A standard BF [5] uses an array of bits (initially all set to 0) and a group of independent hash functions to represent a set of elements, as shown in Figure 1(a). When an element is inserted, each hash function maps it to a position in the array, and the corresponding bit is set to 1. Given a query, the BF applies the same hash functions and answers whether the query is a member of the set, with a certain FPP. To support element deletion, the counting bloom filter (CBF) [27, 28] replaces the array of bits with counters.
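The insert and query operations of a standard BF can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the class name `BloomFilter`, the parameters `m` (number of bits) and `k` (number of hash functions), and the use of salted SHA-256 as the hash family are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Standard Bloom filter with m bits and k salted hash functions."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, kept simple for readability

    def _positions(self, item: str):
        # Derive k positions by salting SHA-256 with the hash index
        # (an assumed hash family, chosen only for the illustration).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1024, k=4)
for word in ("alpha", "beta", "gamma"):
    bf.add(word)
print("alpha" in bf)  # inserted items are always reported as present
```

A query for an item that was never inserted usually returns False, but it may return True when all of its positions happen to be set by other items; that case is the false positive discussed in Section 4.1.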
Figure 1: (a) Bloom filter structure; (b) SimilarPBFPHT structure.
In this paper, parallel BFs (PBFs) and parallel hash tables (PHTs) are proposed to represent high-dimensional elements. In addition, a bitmatrix is introduced to preserve the inherent dependency among the dimensions and reduce the FPP, as shown in Figure 1(b).
3.1.1. PBFs
To store the dimensions, this paper uses one BF per dimension (Figure 1(b)); every BF has its own independent hash functions [29] and its own bit array. When the jth dimension of an element is mapped into the jth BF, the bits at the positions given by its hash function values are set to 1. Since the attenuation method makes hash values distribute more broadly [8] and reduces the FPP, we use the attenuated sum of the hash values as the verification value of the dimension.
3.1.2. PHTs
In order to find out which dimensions, and how many dimensions, of the elements in the set are similar to the query, this paper uses parallel hash tables (PHTs) with hash links to store the identifications (IDs) of the elements. Each hash table, denoted as HT, is in fact an array of linked lists.
3.1.3. Bitmatrix
Since dimensions are stored in the BFs and HTs separately, the integrity of the elements is destroyed, which leads to query confusion. Thus, an auxiliary structure, called the bitmatrix, is added to record the dimensions hit in the PBFs and PHTs. After the dimensions are checked in the PBFs and PHTs, the numbers of hit dimensions are summed up in the bitmatrix. If all dimensions hit, the query is a member of the set with a FPP, as shown in (1). If the sum is 0, no dimension of the query is in the set, as in (2). Otherwise, the query is a similar element with a FPP, as shown in (3).
Equations (1)–(3): the bitmatrix in the cases of (a) membership, (b) outlier, and (c) similarity.
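The three cases can be illustrated with a small Python sketch that sums the columns of a toy bitmatrix. The layout (one row per dimension, one column per stored element) and the function name `classify_columns` are assumptions made for this example; the article's own matrices are not recoverable here.

```python
def classify_columns(bitmatrix):
    """bitmatrix[j][e] == 1 if dimension j of the query hit stored element e."""
    d = len(bitmatrix)       # number of dimensions (rows)
    n = len(bitmatrix[0])    # number of stored elements (columns)
    labels = []
    for e in range(n):
        hits = sum(bitmatrix[j][e] for j in range(d))
        if hits == d:
            labels.append("member")      # every dimension hit
        elif hits == 0:
            labels.append("outlier")     # no dimension hit
        else:
            labels.append(f"similar ({hits}/{d} dimensions)")
    return labels


# Toy example: 3 dimensions, 3 stored elements.
bitmatrix = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 0, 1],
]
print(classify_columns(bitmatrix))
# ['member', 'outlier', 'similar (2/3 dimensions)']
```

Each label is still subject to the FPP of the underlying BFs and HTs, so a "member" or "similar" verdict is probabilistic.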
3.2. Working Mechanism
3.2.1. Initialization or Insertion
When the dimensions of an element are mapped into the BFs and HTs by the hash functions, the corresponding locations of the bit arrays in the PBFs are set to 1, and the attenuated hash values of each dimension are summed up. Through this mapping, the corresponding link in the jth HT is found, and a new hash node is appended to the tail of the link to store the item’s ID.
3.2.2. Query
Only when a dimension returns 1 in its BF are the attenuated hash values summed up and looked up in the corresponding HT. The IDs of the hit elements are found, and the corresponding bits in the bitmatrix are set to 1. After all dimensions are mapped, the columns of the bitmatrix are summed up. If a column sum lies between 1 and the number of dimensions, the corresponding element is a member or a similar element.
3.2.3. Element Deletion
Since deleting bits from a BF would affect other elements that share those bits and introduce errors, the SimilarPBFPHT only needs to delete the hash node in the corresponding HT.
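The insertion, query, and deletion steps can be put together in a compact Python sketch. It is an illustration under stated assumptions, not the authors' code: the class name `SimilarPBFPHTSketch`, the salted SHA-256 hash family, the default sizes, and especially the attenuation rule in `check_value` (dividing the ith hash value by i + 1 before summing) are hypothetical choices, since the text does not spell out the formula.

```python
import hashlib
from collections import defaultdict


def hash_positions(value, k, m):
    """Return k salted hash positions in [0, m) for one dimension value."""
    positions = []
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{value}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:8], "big") % m)
    return positions


def check_value(positions):
    # ASSUMPTION: attenuation is modeled as dividing the i-th hash value
    # by (i + 1) before summing; the article's exact rule is not given.
    return sum(p // (i + 1) for i, p in enumerate(positions))


class SimilarPBFPHTSketch:
    def __init__(self, d, m=4096, k=4, buckets=257):
        self.d, self.m, self.k, self.buckets = d, m, k, buckets
        self.bfs = [bytearray(m) for _ in range(d)]       # PBFs: one BF per dimension
        self.hts = [defaultdict(list) for _ in range(d)]  # PHTs: bucket -> [(check, id)]

    def insert(self, item_id, dims):
        for j, value in enumerate(dims):
            positions = hash_positions(value, self.k, self.m)
            for p in positions:
                self.bfs[j][p] = 1
            check = check_value(positions)
            self.hts[j][check % self.buckets].append((check, item_id))

    def delete(self, item_id, dims):
        # Only the hash nodes are removed; the BF bits stay set.
        for j, value in enumerate(dims):
            check = check_value(hash_positions(value, self.k, self.m))
            bucket = self.hts[j][check % self.buckets]
            if (check, item_id) in bucket:
                bucket.remove((check, item_id))

    def query(self, dims):
        """Return {item_id: number of hit dimensions} (bitmatrix column sums)."""
        hits = defaultdict(int)
        for j, value in enumerate(dims):
            positions = hash_positions(value, self.k, self.m)
            if not all(self.bfs[j][p] for p in positions):
                continue  # BF miss: skip the HT lookup for this dimension
            check = check_value(positions)
            for stored, item_id in self.hts[j].get(check % self.buckets, []):
                if stored == check:
                    hits[item_id] += 1
        return dict(hits)

    def knn(self, dims, neighbors):
        # Estimated Hamming distance = d - (number of hit dimensions).
        ranked = sorted(self.query(dims).items(),
                        key=lambda kv: (self.d - kv[1], str(kv[0])))
        return [(item_id, self.d - h) for item_id, h in ranked[:neighbors]]


index = SimilarPBFPHTSketch(d=3)
index.insert("a", ["red", 7, "x1"])
index.insert("b", ["red", 8, "x2"])
index.insert("c", ["blue", 9, "x3"])
print(index.query(["red", 7, "x1"]))   # "a" hits in all 3 dimensions
print(index.knn(["red", 7, "zz"], 2))  # nearest items by estimated Hamming distance
```

In this sketch, elements that hit in no dimension never appear in the result, and a hash collision can inflate a hit count, which corresponds to the false positives analyzed in Section 4.1.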
4. Performance Analysis
Since the BF only has FPP but not FNP [24], we evaluate the performance of the SimilarPBFPHT by the quality of results, FPP, query time, and space consumption.
4.1. False Positive Probability (FPP)
False Positive. A query is a false positive with respect to the object set if it gets a positive answer while in fact it is not a member, or there does not exist a neighbor object in Euclidean or Hamming spaces. The FPP is the probability of false positives.
4.1.1. FPP of a BF and a HT
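Theorem 1. When $n$ elements of the set have been mapped to a BF with $m$ bits by $k$ different independent hash functions, the FPP of the BF [24] is $$f_{BF} = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}.$$ When $k = (m/n)\ln 2$, $f_{BF}$ attains its minimum value $(1/2)^{k}$, or $(0.6185)^{m/n}$ [24].
Theorem 2. A HT possesses an upper bound on its FPP if the hashes of the BF follow a uniform distribution and the attenuated check value obeys a normal distribution.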
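As a quick numerical check of Theorem 1, the following Python sketch evaluates the well-known approximation for the FPP of a standard BF and the corresponding optimal number of hash functions. The symbols `n` (elements), `m` (bits), and `k` (hash functions) follow the usual BF convention and are assumptions of this example, as are the function names.

```python
import math


def bloom_fpp(n: int, m: int, k: int) -> float:
    """Approximate FPP of a BF: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_k(n: int, m: int) -> int:
    """Number of hash functions minimizing the FPP: k = (m/n) * ln 2, rounded."""
    return max(1, round((m / n) * math.log(2)))


n, m = 1000, 10000        # 10 bits per element
k = optimal_k(n, m)       # 7 hash functions
print(k, bloom_fpp(n, m, k))
print(0.6185 ** (m / n))  # closed-form minimum, for comparison
```

With 10 bits per element the two printed FPP values agree to within rounding of `k`, both slightly above 0.008.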
Proof. In this paper, BFs are used to present the dimensions of elements, and each BF owns independent hash functions [19]. Let each random variable follow the uniform distribution with range , the expected value of , and variance of , and is ranged . According to central limit theorem [1], if is big enough, the random variable satisfies a normal distribution with the expected value of and variance of . Because the sum of attenuated value of hash functions can reduce the FPP, we store the attenuated value of jth dimension of element into the jth HT. It is difficult to estimate the probability density functions of , and according to birthday attack [2], when distributes uniformly, the collision will be minimum. For simplicity, we suppose satisfies normal distribution to estimate the upper bound of false positive.
Let ; due to , the discrete degree of is bigger than , while the collision probability of is smaller than . Let be the FPP of the verification value of the jth attribute of x. There exists . According to central limit theorem [30], satisfy normal distribution, with the expected value of and variance of .
Let there be items and be in the range of . In order to get the upper bound, we compute the maximum FPP of each dimension, that is, :
4.1.2. FPP of SimilarPBFPHT
Theorem 3. With KNN searching, the average FPP of dimensions misdetected, is , in which is the FPP of a dimension misdetected in both the BF and HT. When , it is membership search, and the hit FPP is .
Proof. The BF and HT are independent, so the FPP of a dimension is . With similarity querying, if any dimensions collide in the BFs and HTs and other dimensions are members, the collision happens, which satisfies binomial distribution, and all combinations areWith similarity searching, let , and the average FPP isWhen , there is , and the FPP isWhen , it is the membership search, namely, .
If dimensions are misdetected simultaneously, gets the minimum value. If only one dimension is falsely detected, is the maximum value. There is
In Figure 2, let ; with dimensions increasing, the maximum and minimum of the decay exponentially.
4.2. Average Overall Ratio
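We evaluate the quality of a KNN search result by the rank-i ratio and the average overall ratio (AOR) [18], which are used in most experiments. The rank-i ratio is denoted by $R_i(q)$ and defined as $$R_i(q) = \frac{\lVert o_i, q \rVert}{\lVert o_i^{*}, q \rVert},$$ where $1 \le i \le K$, $\lVert o_i, q \rVert$ is the distance of the queried ith neighbor to $q$, and $\lVert o_i^{*}, q \rVert$ is the distance of the actual ith neighbor to $q$. The overall approximation ratio is the mean of the ratios of all ranks, namely, $\frac{1}{K}\sum_{i=1}^{K} R_i(q)$.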
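The two quality measures can be computed as in the following Python sketch. It assumes the convention of [18], in which the rank-i ratio divides the distance of the returned ith neighbor by the distance of the true ith neighbor; the function names and the handling of zero distances are choices made for this example.

```python
def rank_ratios(returned, exact):
    """returned[i], exact[i]: distance of the i-th returned / true neighbor to the query."""
    ratios = []
    for got, best in zip(returned, exact):
        if best == 0:
            # ASSUMPTION: 0/0 counts as a perfect ratio; x/0 as infinitely bad.
            ratios.append(1.0 if got == 0 else float("inf"))
        else:
            ratios.append(got / best)
    return ratios


def average_overall_ratio(returned, exact):
    """Mean of the rank-i ratios over all ranks."""
    ratios = rank_ratios(returned, exact)
    return sum(ratios) / len(ratios)


returned = [1.0, 2.2, 3.0]   # distances of the neighbors found by the index
exact = [1.0, 2.0, 3.0]      # distances of the true nearest neighbors
print(rank_ratios(returned, exact))
print(average_overall_ratio(returned, exact))
```

Under this convention a ratio of 1 means the ith returned neighbor is as close as the true ith neighbor, and larger values mean a worse approximation.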
4.3. Complexity
4.3.1. Storage Space
The storage spaces of the SimilarPBFPHT contain three parts:
(i) When the FPP of a BF is not greater than and the number of hash functions is optimal, to express the set of elements, the size of the BF array must be . Then the spaces required by parallel BFs are (bits)
(ii) A HT needs to store all IDs of the elements and the next node. Let be the length of the HT (), a node takes up bits, and the HT requires spaces with a range from to . The spaces range of the HTs is bits.
(iii) The bitmatrix needs bits.
The storage spaces of the SimilarPBFPHT are (bytes) When , space gets the minimum value. and when , , and are constants, the space complexity of the SimilarPBFPHT is
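For a rough feel of the BF part of this storage cost, the Python sketch below applies the standard sizing rule for an optimally configured BF, m = n · log2(e) · log2(1/f) bits for n elements and target FPP f, and multiplies it by the number of dimensions. The rule is the textbook one and is assumed to match item (i); the function names and the symbols `n`, `d`, and `fpp` belong to this example only, and the HT and bitmatrix costs are not modeled.

```python
import math


def bf_bits(n: int, fpp: float) -> int:
    """Bits needed by one BF holding n elements at the target FPP (optimal k)."""
    return math.ceil(n * math.log2(math.e) * math.log2(1.0 / fpp))


def pbf_bits(n: int, d: int, fpp: float) -> int:
    """Bits needed by d parallel BFs, one per dimension."""
    return d * bf_bits(n, fpp)


# Sizes of the Reuters-derived dataset from Section 5.1; the 1% FPP is illustrative.
n, d, fpp = 49396, 1000, 0.01
print(bf_bits(n, fpp) / n)             # roughly 9.6 bits per element
print(pbf_bits(n, d, fpp) / 8 / 2**20) # PBF size in MiB
```

The per-element cost depends only on the target FPP, so the PBF part grows linearly in both the number of elements and the number of dimensions.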
4.3.2. Search Time
When querying, the PBFs need times hash calculation, the PHTs require times to search the IDs, in which is the average length of the HT bucket links.
During the membership searching, all hit element's IDs need to be recorded, and the time complexity is
The bitmatrix traverses at most times. So the time complexity is
5. Experiments
5.1. Dataset and Setting
The BF is designed to represent a set, and there is no standard benchmark for it. Here we choose 4 datasets used in most experiments: Color [13], Mnist [12], Varden [18], and Reuters 21578 [31]. The data formats in Reuters 21578 are varied, including digits, characters, symbols, and their combinations. We use it to generate a 49396-item dataset with 1000 dimensions to test the performance of the SimilarPBFPHT, including the query latency and the data processing ability. The experiments run on a computer with a 2.5 GHz Intel dual-core processor and 8 GB of RAM.
5.2. Membership Query
In this section, we will discuss performances of different methods in membership query.
Let , , , , and , where is bits of every verification values in the PBFHT and SimilarPBFPHT. Figure 3 displays the FPPs of the SBF [5], PBF [7], PBFHT [8], PBFBF [8], and SimilarPBFPHT on Reuters 21578 data. With the increasing, the FPPs decrease, and to a constant , the FPPs will increase with the number of the elements growing, especially the SBF and PBF. When the number of the items exceeds a threshold (), the FPPs of the SBF and PBF are nearly equal to 1, which is consistent in the BF theory. In different , the SimilarPBFPHT gets the lowest FPP; even when , the biggest FPP is not beyond 0.01, while the FPPs of others are almost 1.
Figure 4 demonstrates memory usages of the PBF, PBFBF, PBFHT, and SimilarPBFPHT on the Reuters 21578 dataset, when , , , , and . According to formula (13), to fit a constant FPP, the memory usage will grow with the number of the items increasing. The hash tables and the bitmatrix reduce the FPP at the cost of memory, and the BF’s bits arrays take up 1/4 spaces of the CBFs in other 3 schemes. All these make the space consumption of the SimilarPBFPHT just a little higher than the PBF but lower than the PBFHT.
5.3. KNN Search
To evaluate the accuracy of the KNN search, we compare the average overall ratios of the RigorousLSH [15], MedRank [13], AdhocLSH [17], LSBtree, and LSBforest [18] with those of the SimilarPBFPHT on the Color and Mnist datasets, as shown in Figure 5. The workload is set to 50, and 1–100 nearest neighbors are searched. In Hamming spaces, the ratios of the SimilarPBFPHT are almost equal to 1. In Euclidean spaces, the ratios of the SimilarPBFPHT are less stable: the overall ratios on Mnist are almost as good as those of the LSBforest, but the ratios on Color are a little higher and increase with the number of nearest neighbors. The main reason is that the dimensions of Mnist are sparse (most values are 0), so most Hamming distances are 0, whereas the dimensions of Color are dense, so a small difference (0.0001) in Euclidean spaces is recognized as 1 in Hamming spaces. This lowers the accuracy, but the ratios are still beyond 0.98.
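The gap between the two distances on dense data can be seen in a short Python example. The vectors below are made up for illustration, and the per-dimension Hamming distance (count of dimensions whose values differ) is the interpretation assumed here.

```python
import math


def hamming(a, b):
    """Number of dimensions in which the two items differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def euclidean(a, b):
    """Euclidean distance between two numerical vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Dense numerical data: a tiny Euclidean difference is a full Hamming mismatch.
dense_a = [0.5000, 0.2500, 0.7500]
dense_b = [0.5001, 0.2500, 0.7500]
print(euclidean(dense_a, dense_b), hamming(dense_a, dense_b))

# Sparse data: identical zero entries contribute nothing to either distance.
sparse_a = [0, 0, 3, 0]
sparse_b = [0, 0, 3, 0]
print(euclidean(sparse_a, sparse_b), hamming(sparse_a, sparse_b))

# Mixed formats: only the Hamming distance is defined.
mixed_a = ["usa", 42, "grain"]
mixed_b = ["usa", 43, "grain"]
print(hamming(mixed_a, mixed_b))
```

The last case shows why a Hamming-based index can handle string or categorical dimensions, where a Euclidean distance has no natural meaning.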
Figure 5: (a) i-nearest neighbors; (b) i-nearest neighbors (Mnist); (c) i-nearest neighbors (Color).
Figure 6 displays average ranki ratios of the Euclidean and Hamming distance. presents Hamming distance; because of the FPP of the BF, the actual distance is less than the query distance; there exists . In Figure 6(a), with increasing, the ranki ratios of Hamming distance are stable and not lower than 0.985. On the Mnist (Figure 6(b)), ranki ratios of Euclidean distances of the SimilarPBFPHT are minimum, almost equal to 1, while, on the Color (Figure 6(c)), the ratios of Euclidean distance increase slowly and are higher than the LSBtree and LSBforest’s; when the , it becomes lower than the MedRank.
Figure 6: (a) Rank-i; (b) Rank-i (Mnist); (c) Rank-i (Color).
Table 1 analyzes the memory consumption (MB) on the Varden dataset, setting bytes, , and . Although the memory costs of the LSBtree are less, its FPPs are higher than the LSBforest, so we abandon it. The memory usages of the SimilarPBFPHT are minimum while the RigorousLSH are maximum, and all consumption increases with and . When the memory costs of the LSBForest are almost as big as the AdhocLSH.
Table 1: (a) Space versus cardinality n; (b) space versus dimensionality d.

The SimilarPBFPHT can deal the dimensions with different formats and lengths, and the length of dimension and number of samples will affect the query time. In Figure 7(a), we set to 100, 500, and 1000, respectively, and every dimension contains 20 characters (big enough to most applications) to search 10NN. With dimensions growing, the average query latency of the SimilarPBFPHT increases linearly. Let , , and ; Figures 7(b) and 7(c) demonstrate effects of different dimension’s lengths on query delay with 10NN searching. Average query latencies will increase with the numbers of the characters and dimensions. This is because most of the CPU time is wasted on processing the hash values of the characters.
In Figure 8, we analyze the effects of the parameters α and on the AORs and FPPs of the SimilarPBFPHT. Let , , , and NN and let test workload be 10000. As shown in the Figure 8, under different α, even small α (0.01), and big (0.5), the SimilarPBFPHT gets good query results and low FPPs. That means the PHT and the bitmatrix can effectively improve the detection accuracy. affects the query accuracy much more than the FPP of the BF. With α increasing, the FPP decreases and the AOR increases; at the same time the space consumption increases.
Figure 8: (a)–(d) i-nearest neighbors under different parameter settings.
6. Conclusions
In this paper, we propose a comprehensive structure, called SimilarPBFPHT, to represent and search for member and similar elements of a big dataset in high-dimensional spaces by computing the Hamming distance. We analyze its working mechanism, FPP, and space and time complexity in detail. The experiments show that, for membership search, the SimilarPBFPHT achieves a lower FPP than the PBF, PBFHT, and PBFBF at a low memory cost. The SimilarPBFPHT also costs less storage than the schemes based on locality-sensitive hashing, including the RigorousLSH, LSBforest, AdhocLSH, and BLSH. For KNN queries, it costs CPU time rather than I/O, which gives it a lower query latency. Meanwhile, the SimilarPBFPHT computes hash values of all characters in each dimension, so it can handle different data formats (characters, numbers, symbols, and so on), although the number of characters affects the query time. The average overall ratios (query accuracy) and the average rank-i ratios under the Hamming distance are accurate. All these advantages make it appropriate for representing and searching items in high-dimensional spaces, such as database and document similarity search.
Although the SimilarPBFPHT performs well in Hamming spaces, its memory cost and its FPP in Euclidean spaces for KNN search are still somewhat high. In the future, we will study locality-sensitive hash functions as a replacement for the random hash functions and further reduce the storage space.
Competing Interests
The authors declare that they have no competing interests.
Acknowledgments
This work is supported by the National Nature Science Foundation of China (no. 61562056), Yunnan Province Education Department (nos. 2015Z055 and 2015Y073), and Yunnan Province Nature Science Foundation (nos. KKSY201303125, KKSY201304129, and KKSY201404106).
References
[1] H. Samet, Foundations of Multidimensional and Metric Data Structures, Morgan Kaufmann, Burlington, Mass, USA, 2006.
[2] H. Jégou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117–128, 2011.
[3] H. Lejsek, F. H. Ásmundsson, B. P. Jónsson, and L. Amsaleg, “NV-tree: an efficient disk-based index for approximate search in very large high-dimensional collections,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 869–883, 2009.
[4] A. Andoni and P. Indyk, “Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions,” Communications of the ACM, vol. 51, no. 1, pp. 117–122, 2008.
[5] B. H. Bloom, “Space/time trade-offs in hash coding with allowable errors,” Communications of the ACM, vol. 13, no. 7, pp. 422–426, 1970.
[6] S. Geravand and M. Ahmadi, “Bloom filter applications in network security: a state-of-the-art survey,” Computer Networks, vol. 57, no. 18, pp. 4047–4064, 2013.
[7] D. Guo, J. Wu, H. Chen, Y. Yuan, and X. Luo, “The dynamic bloom filters,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 1, pp. 120–133, 2010.
[8] B. Xiao and Y. Hua, “Using parallel bloom filters for multi-attribute representation on network services,” IEEE Transactions on Parallel and Distributed Systems, vol. 21, no. 1, pp. 20–32, 2010.
[9] M.-Z. Xiao, Y.-F. Dai, and X.-M. Li, “Split bloom filter,” Acta Electronica Sinica, vol. 32, no. 2, pp. 241–245, 2004.
[10] A. Kirsch and M. Mitzenmacher, “Distance-sensitive bloom filters,” in Proceedings of the 8th Workshop on Algorithm Engineering and Experiments (ALENEX '06), pp. 41–51, Philadelphia, Pa, USA, 2006.
[11] A. Andoni, P. Indyk, H. L. Nguyen, and I. Razenshteyn, “Beyond locality-sensitive hashing,” in Proceedings of the 25th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1018–1028, SIAM, Portland, Ore, USA, January 2014.
[12] H. V. Jagadish, B. C. Ooi, K.-L. Tan, C. Yu, and R. Zhang, “IDistance: an adaptive B+-tree based indexing method for nearest neighbor search,” ACM Transactions on Database Systems, vol. 30, no. 2, pp. 364–397, 2005.
[13] R. Fagin, R. Kumar, and D. Sivakumar, “Efficient similarity search and classification via rank aggregation,” in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '03), pp. 301–312, San Diego, Calif, USA, June 2003.
[14] J. Wang, H. T. Shen, J. Song et al., “Hashing for similarity search: a survey,” https://arxiv.org/abs/1408.2927.
[15] P. Indyk and R. Motwani, “Approximate nearest neighbors: towards removing the curse of dimensionality,” in Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pp. 604–613, 1998.
[16] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in Proceedings of the 20th Annual Symposium on Computational Geometry (SCG '04), pp. 253–262, ACM, June 2004.
[17] A. Gionis, P. Indyk, and R. Motwani, “Similarity search in high dimensions via hashing,” in Proceedings of the 25th International Conference on Very Large Data Bases (VLDB '99), pp. 518–529, Edinburgh, UK, September 1999.
[18] Y. Tao, K. Yi, C. Sheng, and P. Kalnis, “Efficient and accurate nearest neighbor and closest pair search in high-dimensional space,” ACM Transactions on Database Systems (TODS), vol. 35, no. 3, article 20, 2010.
[19] R. O'Donnell, Y. Wu, and Y. Zhou, “Optimal lower bounds for locality-sensitive hashing (except when q is tiny),” ACM Transactions on Computation Theory, vol. 6, no. 1, article 5, 2014.
[20] D. Gorisse, M. Cord, F. Precioso, and S. Philipp-Foliguet, “Fast approximate kernel based similarity search for image retrieval task,” in Proceedings of the IEEE 19th International Conference on Pattern Recognition (ICPR '08), pp. 1873–1876, Tampa, Fla, USA, December 2008.
[21] D. Gorisse, M. Cord, and F. Precioso, “Locality-sensitive hashing for chi2 distance,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 2, pp. 402–409, 2012.
[22] W. H. Lee, Continuous and discrete properties of stochastic processes [Ph.D. thesis], The University of Nottingham, Nottingham, UK, 2010.
[23] G. M. Morton, “A computer oriented geodetic data base and a new technique in file sequencing,” Tech. Rep., IBM, Ottawa, Canada, 1966.
[24] A. Broder and M. Mitzenmacher, “Network applications of Bloom filters: a survey,” Internet Mathematics, vol. 1, no. 4, pp. 485–509, 2004.
[25] S. Tarkoma, C. E. Rothenberg, and E. Lagerspetz, “Theory and practice of bloom filters for distributed systems,” IEEE Communications Surveys and Tutorials, vol. 14, no. 1, pp. 131–155, 2012.
[26] Y. Hua, B. Xiao, B. Veeravalli, and D. Feng, “Locality-sensitive Bloom filter for approximate membership query,” IEEE Transactions on Computers, vol. 61, no. 6, pp. 817–830, 2012.
[27] K.-Y. Whang, B. T. Vander-Zanden, and H. M. Taylor, “Linear-time probabilistic counting algorithm for database applications,” ACM Transactions on Database Systems, vol. 15, no. 2, pp. 208–229, 1990.
[28] L. Fan, P. Cao, J. Almeida, and A. Z. Broder, “Summary cache: a scalable wide-area Web cache sharing protocol,” IEEE/ACM Transactions on Networking, vol. 8, no. 3, pp. 281–293, 2000.
[29] General Purpose Hash Function Algorithms, http://partow.net/programming/hashfunctions/index.html.
[30] A. R. Barron, “Entropy and the central limit theorem,” The Annals of Probability, vol. 14, no. 1, pp. 336–342, 1986.
[31] Reuters-21578 Text Categorization Collection, http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html.
Copyright
Copyright © 2016 Chunyan Shuai et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.