Abstract

Locality Sensitive Hashing (LSH) has been proposed as an efficient technique for similarity joins on high dimensional data. The efficiency and approximation rate of LSH depend on the number of generated false positive instances and false negative instances. In many domains, reducing the number of false positives is crucial. Furthermore, in some application scenarios, balancing false positives and false negatives is favored. To address these problems, in this paper we propose Personalized Locality Sensitive Hashing (PLSH), in which a new banding scheme is embedded to tailor the number of false positives, false negatives, and the sum of both. PLSH is implemented in parallel using the MapReduce framework to deal with similarity joins on large scale data. Experimental studies on real and simulated data verify the efficiency and effectiveness of the proposed PLSH technique compared with state-of-the-art methods.

1. Introduction

A fundamental problem in data mining is to detect similar items. Finding similar pairs of instances is an essential component in mining numerous types of data, including document clustering [1, 2], plagiarism detection [3], image search [4], and recommender systems [5].

Identifying pairs of similar instances is also called similarity joins [6]. Given a set of data instances, a similarity threshold $t$, and a join attribute, the goal of similarity joins is to find all pairs of instances whose similarity on the join attribute is larger than the threshold $t$. There are various similarity measurements, including cosine similarity [7], edit distance [6, 8, 9], hamming distance [7, 10], dimension root similarity [2], and EDU-based similarity for elementary discourse units [11]. In this work we focus on Jaccard similarity, which has proven successful for high dimensional, sparse feature sets [6]. The Jaccard similarity of two feature vectors $x_i$ and $x_j$, viewed as sets of features, is defined as $sim(x_i, x_j) = |x_i \cap x_j| / |x_i \cup x_j|$. As an example, Table 1 illustrates the naive computation of similarity joins based on Jaccard similarity: the Jaccard similarity is computed for every pair of instances over the join attribute, and the pairs whose similarity exceeds the given threshold $t$ are reported as similar.
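
To make the definition concrete, the following Python sketch (our own illustration with hypothetical toy data, not the paper's code) computes Jaccard similarities for all instance pairs in the naive way:

from itertools import combinations

def jaccard(a, b):
    # Jaccard similarity of two feature sets: |a & b| / |a | b|.
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Toy instances represented as sets of feature ids (hypothetical data).
instances = {1: {0, 1, 2}, 2: {0, 1, 2, 3}, 3: {2, 4}}
threshold = 0.5  # assumed similarity threshold, for illustration only

similar_pairs = [(i, j)
                 for (i, a), (j, b) in combinations(instances.items(), 2)
                 if jaccard(a, b) > threshold]
print(similar_pairs)  # [(1, 2)]: instances 1 and 2 share 3 of 4 features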

A naive algorithm, which finds similar pairs by computing similarities for all instance pairs, is clearly impracticable on a large collection of instances with high dimensional features. To improve the efficiency and scalability of similarity joins, previous research efforts generally fall into two categories. On one hand, parallel algorithms are adopted on clusters of machines. Most of them are implemented using the MapReduce framework, including a 3-stage MapReduce approach for end-to-end set-similarity joins [12], fast computation of inner products for large scale news articles [13], and a new ant colony optimization algorithm parallelized using MapReduce [14] to select features in a high dimensional space. Others exploit the parallelism and high data throughput of GPUs, for example, the LSS algorithm [15]. On the other hand, algorithmic design can be improved to reduce the time and storage cost of similarity computation in high dimensional feature spaces. One type of such approaches uses dimension reduction technologies, including Principal Component Analysis and neural networks [16]. Another type is to hash and filter, so that the high dimensional feature space can be replaced by smaller representative signatures. The most popular hashing methods include minhashing [17], minwise hashing [18], and Locality Sensitive Hashing (LSH) [10]. The core idea of hashing is to map similar pairs to similar signatures with several hundred dimensions, each element of which is the result of hashing, and hence it mitigates the problem of high dimensionality. Hashing can also be a means of data clustering, because it enables similar features with vast dimensions to be hashed into the same buckets and thus partitions features into groups [19]. Filtering methods, including length filter [7], prefix filter [20], and suffix filter [21], are frequently utilized subsequently to eliminate dissimilar pairs while possibly similar pairs remain. As a result, fewer similarity computations are needed. In particular, the banding technique [22], a specific form of Locality Sensitive Hashing, which maps every band of a signature to an array of buckets so that the probability of collision is much higher for instances close to each other, is the most efficient filtering method.

Although previous works have demonstrated the importance and feasibility of hashing and filtering approaches, one critical issue remains underestimated. Hashing and filtering approaches produce approximate results. The similarities of selected pairs are not guaranteed to be larger than the predefined threshold. Meanwhile, the discarded pairs are not necessarily dissimilar; their similarities may well exceed the predefined threshold. The former case is called false positive, while the latter one is called false negative. A moderate number of false positives and false negatives is acceptable in many applications. However, the tolerance to false positives and false negatives may differ. In most application scenarios, such as clustering and information retrieval, keeping the number of false positives small is emphasized to increase efficiency and precision. In applications such as recommendation and bioinformatics systems [23-25], a small number of false negatives is more important.

In this paper, we address the problem of tailoring the numbers of false positives and false negatives for different applications. To the best of our knowledge, this is the first work in the literature to present such a detailed analysis. False positives and false negatives are caused by the scheme of pruning candidate pairs whose signatures map into disjoint bucket arrays. Intuitively, similar signatures are likely to have highly analogous bands, and analogous bands will be mapped into identical bucket arrays. Inspired by this intuition, we propose a new banding technique called Personalized Locality Sensitive Hashing (PLSH), in which pairs whose signatures are mapped to at least $k$ identical buckets are selected as candidates. We also explore the probability guarantees provided by the new banding technique for three cases, namely, false negatives, false positives, and the sum of both. According to these probabilities, we derive the upper and lower bounds of false positives and false negatives and accordingly show how to personalize the parameters involved in banding and hashing to fulfill different application demands.

The contributions of this paper are threefold:
(i) We improve the traditional banding technique with a new banding technique with a flexible threshold, to reduce the number of false positives and improve efficiency.
(ii) We derive the lower and upper bounds of the numbers of false negatives and false positives, as well as the balance between them, for our new banding technique.
(iii) We implement the new banding technique using the parallel MapReduce framework.

The rest of the paper is structured as follows. In Section 2, the backgrounds of minhashing and banding technique are presented. In Section 3, we introduce Personalized Locality Sensitive Hashing (PLSH). The implementation of PLSH using MapReduce is shown in Section 4. In Section 5, we present and analyze the experimental results. We survey the related works in Section 6. Finally, the conclusion is given in Section 7.

2. Background

In this section, we briefly introduce the minhashing algorithm and the consequent banding algorithm, which are the fundamental blocks of Locality Sensitive Hashing (LSH). The intuition of minhashing is to generate low dimensional signatures to represent high dimensional features. The intuition of banding is to filter candidates which are not likely to be similar pairs.

2.1. MinHashing

For large scale data sets, the feature space is usually high dimensional and very sparse; that is, only a tiny portion of features appears in a single instance. In order to reduce the memory used to store sparse vectors, we use a signature, an integer vector consisting of up to several hundred elements, to represent an instance. To generate a signature, we first randomly change the order of features. In other words, each permutation defines a hash function that shuffles the features. Each element of the signature is a minhash value [17], which is the position of the first nonzero feature in the permuted feature vector. For example, suppose the features of Table 1 are permuted into a new order; the permuted feature vectors of the five instances are illustrated in Table 2, and their minhash values are 1, 2, 1, 3, and 4, respectively.
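
As a small illustration (not the authors' code), the following Python sketch derives one minhash value from an explicit permutation of the feature indices:

def minhash(feature_set, permutation):
    # Position (1-based) of the first feature, in permuted order, that the
    # instance contains; this is the minhash value under this permutation.
    for position, feature in enumerate(permutation, start=1):
        if feature in feature_set:
            return position
    return None  # the instance contains no features

permutation = [3, 0, 4, 1, 2]   # hypothetical permutation of features 0..4
instance = {0, 1}               # the instance contains features 0 and 1
print(minhash(instance, permutation))  # -> 2, since feature 0 is hit first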

We can choose $n$ independent permutations $\pi_1, \pi_2, \ldots, \pi_n$. Suppose the minhash value of an instance $x$ under a certain permutation $\pi_i$ is denoted by $\min_{\pi_i}(x)$; then the signature of $x$, denoted by $sig(x)$, is

$sig(x) = (\min_{\pi_1}(x), \min_{\pi_2}(x), \ldots, \min_{\pi_n}(x)).$  (1)

The approximate similarity between two instances based on their signatures is defined as the percentage of positions at which the two signatures hold identical values. For example, if two signatures of length 100 agree at 80 positions, the approximate Jaccard similarity of the corresponding instances is 0.8.
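
A sketch of this estimate (toy data and parameter values are our own assumptions):

import random

def minhash(feature_set, permutation):
    # Smallest permuted position (1-based) among the features the instance contains.
    rank = {feat: pos for pos, feat in enumerate(permutation, start=1)}
    return min(rank[f] for f in feature_set)

def signature(feature_set, permutations):
    return [minhash(feature_set, p) for p in permutations]

def signature_similarity(sig1, sig2):
    # Fraction of positions at which the two signatures agree.
    return sum(1 for a, b in zip(sig1, sig2) if a == b) / len(sig1)

random.seed(0)
num_features, n = 1000, 100
perms = [random.sample(range(num_features), num_features) for _ in range(n)]
x = set(range(0, 600))        # toy instance 1
y = set(range(100, 700))      # toy instance 2; true Jaccard = 500/700, about 0.71
print(signature_similarity(signature(x, perms), signature(y, perms)))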

2.2. Banding

Given a large set of signatures generated in Section 2.1, it is still too costly to compare similarities for all signature pairs. Therefore, a banding technique is presented consequently to filter dissimilar pairs.

The banding technique divides each signature into $b$ bands, where each band consists of $r$ elements. For each band of every signature, the banding technique maps the vector of $r$ elements to a bucket in the corresponding bucket array.

As shown in Figure 1, the $i$th band of each signature maps to bucket array $i$. Intuitively, if a pair of signatures shares at least one common bucket, then the pair is likely to be similar; for example, signature 1 and signature 2 in Figure 1 form such a pair. A pair with a common bucket is considered to be a candidate pair and needs to be verified in the banding technique.
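
A minimal sketch of the original banding step (the use of Python's built-in hash and the toy signatures are our own choices for illustration):

from collections import defaultdict
from itertools import combinations

def candidates_lsh(signatures, b, r):
    # Original LSH banding: a pair becomes a candidate if any one of its b
    # bands (each of r elements) hashes to the same bucket.
    buckets = defaultdict(set)              # (band id, bucket id) -> instance ids
    for instance_id, sig in signatures.items():
        for band in range(b):
            chunk = tuple(sig[band * r:(band + 1) * r])
            buckets[(band, hash(chunk))].add(instance_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Toy signatures of length b * r = 6 (hypothetical values).
sigs = {1: [1, 2, 11, 3, 4, 23], 2: [2, 2, 13, 3, 4, 23], 3: [7, 8, 9, 9, 8, 7]}
print(candidates_lsh(sigs, b=2, r=3))       # {(1, 2)}: they share their second band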

3. Personalized LSH

3.1. New Banding Technique

The candidates generated by LSH are not guaranteed to be similar pairs. Chances are that a pair of signatures is projected to identical bucket arrays even if the Jaccard similarity between the pair of instances is not larger than the given threshold. In the meantime, a pair of instances can be filtered out from the candidates because their corresponding signatures are projected into disjoint bucket arrays, even if their Jaccard similarity is larger than the given threshold. The former case is called false positive, while the latter one is called false negative. A large number of false positives will deteriorate the computational efficiency of LSH, while massive false negatives will lead to inaccurate results. To enhance the precision and efficiency of the algorithm, we present here a new banding scheme to filter out more dissimilar instance pairs. Intuitively, if two instances are highly alike, it is likely that many bands of the two corresponding signatures are mapped to identical buckets. For example, in Figure 1, at least 3 bands of signature 1 and signature 2 (including the 1st and the 5th bands) map to the same buckets in the corresponding bucket arrays.

Therefore, we change the banding scheme as follows. For any pair of instances, if the two corresponding signatures do not map into at least $k$ identical buckets, the pair is filtered out. Otherwise, it is considered to be a candidate pair, and the exact Jaccard similarity is computed and verified. For the signatures shown in Figure 1, a larger $k$ filters out pairs that share only a small number of common buckets, even though such pairs would have been candidates under the original scheme.
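
Under the new scheme the only change is to count the number of colliding bands per pair and keep pairs reaching $k$. A sketch under the same assumptions as before:

from collections import Counter, defaultdict
from itertools import combinations

def candidates_plsh(signatures, b, r, k):
    # PLSH banding: a pair is a candidate only if at least k of its b bands
    # hash to identical buckets.
    buckets = defaultdict(set)
    for instance_id, sig in signatures.items():
        for band in range(b):
            chunk = tuple(sig[band * r:(band + 1) * r])
            buckets[(band, hash(chunk))].add(instance_id)
    collisions = Counter()                  # (id1, id2) -> number of shared bands
    for ids in buckets.values():
        for pair in combinations(sorted(ids), 2):
            collisions[pair] += 1
    return {pair for pair, count in collisions.items() if count >= k}

sigs = {1: [1, 2, 11, 3, 4, 23], 2: [2, 2, 13, 3, 4, 23], 3: [7, 8, 9, 9, 8, 7]}
print(candidates_plsh(sigs, b=2, r=3, k=2))  # empty set: pair (1, 2) shares only 1 band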

3.2. Number of False Positives

A candidate pair $\langle x_i, x_j \rangle$ is a false positive if $sim(x_i, x_j) < t$ while $x_i$ and $x_j$ share at least $k$ common bucket arrays. Since the efficiency of LSH mainly depends on the number of false positives, and most real applications demand a high precision, we first derive the possible number of false positives generated by the new banding technique.

Lemma 1. The upper bound of the number of false positives generated by the new banding technique equals that of the original LSH, and the lower bound is approximately 0.

Proof. According to the law of large numbers, the probability that the minhash values of two feature vectors (e.g., $x_i$ and $x_j$) are equal under a random permutation $\pi$ is very close to the frequency of observing identical values at the same position in two long signatures of the corresponding feature vectors. That is,

$P(\min_\pi(x_i) = \min_\pi(x_j)) \approx \dfrac{|\{p \mid sig(x_i)_p = sig(x_j)_p\}|}{n},$  (2)

where $n$ is the length of the signatures and $p$, $1 \le p \le n$, is a position in the signatures.
Also, the probability that a random permutation of two feature vectors produces the same minhash value equals the Jaccard similarity of those instances [17]. That is,

$P(\min_\pi(x_i) = \min_\pi(x_j)) = sim(x_i, x_j).$  (3)

Based on the above two equations, the probability that two instances with Jaccard similarity $s$ are considered to be a candidate pair by the new banding technique, denoted by $P(s, k)$, is

$P(s, k) = \sum_{i=k}^{b} \binom{b}{i} (s^{r})^{i} (1 - s^{r})^{b-i},$  (4)

where $s$ is the Jaccard similarity of the two instances, $r$ is the length of each band, and $b$ is the number of bands. We can prove that the derivative of $P(s, k)$ with respect to $s$ is greater than 0; hence $P(s, k)$ is a monotonically increasing function of $s$.
The number of false positives, denoted by $FP(k)$, is

$FP(k) = \sum_{s < t} N(s) \, P(s, k),$  (5)

where $N(s)$ denotes the total number of pairs whose Jaccard similarity is $s$ in the instance set. Given an instance set, $N(s)$ is a constant, and $t$ is the given similarity threshold.
The value of $FP(k)$ depends on the similarity distribution of the given instance set. The upper bound of $FP(k)$ equals that of the original LSH, $FP(1)$. Without knowledge of the similarity distribution of the data set, the lower bound of false positives cannot be directly derived. Hence, we introduce a threshold $\epsilon$ to ensure

$\dfrac{FP(k)}{FP(1)} \le \epsilon,$  (6)

where $FP(k)/FP(1)$ becomes close to zero with increasing $k$. If $\epsilon$ approaches 0, the lower bound of false positives approximates 0, which indicates that the candidates generated by the proposed new banding technique are almost all truly similar pairs.
To understand the zero lower bound, suppose there are two signatures with $n$ elements each, $k$ bands of which are mapped to the same buckets. At least $k \times r$ elements in the two signatures are identical, because a band includes $r$ elements. According to (2) and (3), the approximate similarity between the two corresponding instances is then greater than $kr/n$. Hence, if $k$ is chosen such that $kr/n \ge t$, the similarity of every candidate pair exceeds the threshold $t$ and no false positives exist.

The introduction of $\epsilon$ also enables us to personalize the number of false positives, that is, to vary the acceptable range of $k$ for different values of $\epsilon$. The range of $k$ for a desired $\epsilon$ is a function of $b$ and $r$ that can be numerically solved. For example, for fixed $b$ and $r$, Figure 2 shows the trend of $FP(k)/FP(1)$ as $k$ increases; given a desired $\epsilon$, the satisfying range of $k$ can be read directly from the curve.
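
The numerical search can be sketched as follows (we assume, purely for illustration, a uniform similarity distribution below the threshold and arbitrary example parameters):

from math import comb

def p_candidate(s, b, r, k):
    # Probability that a pair with Jaccard similarity s shares at least k of
    # its b bands, each band containing r elements (equation (4)).
    p_band = s ** r
    return sum(comb(b, i) * (p_band ** i) * ((1 - p_band) ** (b - i))
               for i in range(k, b + 1))

def fp_ratio(b, r, k, t, step=0.01):
    # FP(k) / FP(1) under an assumed uniform similarity distribution below t.
    grid = [i * step for i in range(1, round(t / step))]
    fp_k = sum(p_candidate(s, b, r, k) for s in grid)
    fp_1 = sum(p_candidate(s, b, r, 1) for s in grid)
    return fp_k / fp_1

b, r, t, epsilon = 20, 5, 0.8, 0.1
for k in range(1, b + 1):
    if fp_ratio(b, r, k, t) <= epsilon:
        print("smallest k satisfying the epsilon constraint:", k)
        break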

3.3. Number of False Negatives

False negatives are truly similar pairs that are pruned, that is, similar pairs whose signatures do not share at least $k$ common buckets. We also derive the upper and lower bounds of the number of false negatives generated by the proposed new banding technique.

Lemma 2. The upper bound of the number of false negatives generated by the new banding technique is proportional to the number of truly similar pairs. The lower bound is close to that of the original LSH.

Proof. Similar to Section 3.2, the number of false negatives, denoted by $FN(k)$, is

$FN(k) = \sum_{s \ge t} N(s) \, (1 - P(s, k)).$  (7)

$FN(k)$ is a monotonically increasing function of $k$. Its lower bound is achieved when $k = 1$, that is, for the original LSH. The upper bound of $FN(k)$ is obtained when $k$ equals $b$, the total number of bands. Hence, the upper bound of $FN(k)$ is proportional to the number of similar pairs:

$FN(b) = \sum_{s \ge t} N(s) \, (1 - s^{rb}) \le \sum_{s \ge t} N(s).$  (8)

For a desired number of false negatives, we take the ratio between $FN(k)$ and $FN(1)$:

$\dfrac{FN(k)}{FN(1)} \le \delta,$  (9)

where $\delta$ is a threshold which is always greater than 1. By deriving the numerical solution of (9), the range of $k$ for a desired $\delta$ is obtained. For example, for fixed $b$ and $r$, Figure 3 shows the trend of $FN(k)/FN(1)$ as $k$ increases; given a desired $\delta$, the satisfying range of $k$ can be read from the curve.

3.4. Balance False Positives and False Negatives

In some application scenarios, we want to have a balance between false positives and false negatives. Here we analyze a special case where we desire an aggregated number of false positives and false negatives. We use $FS(k)$ to denote the sum of false positives and false negatives, which is defined as follows:

$FS(k) = FP(k) + FN(k) = \sum_{s < t} N(s) \, P(s, k) + \sum_{s \ge t} N(s) \, (1 - P(s, k)).$  (10)

The lower bound of $FS(k)$ depends on the similarity distribution of the given data set. However, since in most real data sets dissimilar pairs far outnumber similar pairs, and thus $FP(1)$ dominates $FN(1)$, $FS(k)$ is less than $FS(1)$ when $k$ is appropriately chosen.

Inspired by Sections 3.2 and 3.3, we can also use a threshold $\gamma$ on the ratio $FS(k)/FS(1)$ to obtain the desired degree of precision. As shown in Figure 4, the ratio $FS(k)/FS(1)$ on a uniformly distributed data set first decreases as the value of $k$ increases, reaches its minimum, and then increases as $k$ becomes larger. If we are required to achieve a higher precision than the traditional banding technique in terms of the aggregated number of false negatives and false positives (i.e., $\gamma < 1$), then any $k$ whose ratio falls below $\gamma$ is acceptable.
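
The same numerical search applies to the aggregated error. A sketch under the same uniform-distribution assumption as before:

from math import comb

def p_candidate(s, b, r, k):
    # Probability of at least k band collisions (equation (4)).
    p_band = s ** r
    return sum(comb(b, i) * (p_band ** i) * ((1 - p_band) ** (b - i))
               for i in range(k, b + 1))

def fs_ratio(b, r, k, t, step=0.01):
    # FS(k) / FS(1): aggregated false positives and false negatives relative
    # to the original LSH, assuming a uniform similarity distribution.
    grid = [i * step for i in range(1, round(1 / step))]
    def fs(kk):
        fp = sum(p_candidate(s, b, r, kk) for s in grid if s < t)
        fn = sum(1 - p_candidate(s, b, r, kk) for s in grid if s >= t)
        return fp + fn
    return fs(k) / fs(1)

b, r, t = 20, 5, 0.8
ratios = {k: fs_ratio(b, r, k, t) for k in range(1, b + 1)}
best_k = min(ratios, key=ratios.get)
print(best_k, round(ratios[best_k], 3))   # the k minimizing the aggregated error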

4. MapReduce Implementation of PLSH

In this section, we first introduce the MapReduce framework. Then we present the details of implementing Personalized LSH with MapReduce, including minhashing, banding, and verification.

4.1. MapReduce

MapReduce [26] is a framework for executing parallel algorithms on large scale data sets using a cluster of computers. MapReduce allows for distributed processing of data, which is partitioned and stored in a distributed file system (HDFS). Data is stored in the form of ⟨key, value⟩ pairs to facilitate computation.

As illustrated in Figure 5, the MapReduce data flow consists of two key phases: the map phase and the reduce phase. In the map phase, each computing node works on its local input data and processes the input ⟨key, value⟩ pairs into a list of intermediate ⟨key, value⟩ pairs, possibly in a different domain. The pairs generated in the map phase are hash-partitioned and sorted by key, and then they are sent across the computing cluster in a shuffle phase. In the reduce phase, pairs with the same key are passed to the same reduce task. User-provided functions are processed in the reduce task on each key to produce the desired output.

In similarity joins, to generate the input ⟨key, value⟩ pairs, we first segment the join attributes of each instance into tokens. Each token is denoted by a unique integer id. In the following steps, the token id is used to represent each feature.

4.2. MinHashing

We use one MapReduce job to implement minhashing. Before the map function is called, the input ⟨key, value⟩ pairs are loaded. In the map task, each instance is represented by the set of tokens present in this instance.

In the reduce task, for each instance, the reducer produces a signature of length $n$. As described in Section 2, minhashing requires $n$ random permutations of features. But it is not feasible to permute massive features explicitly. Instead, a random hash function is adopted to simulate this task. Suppose the total number of features is $N$ and the feature ids form the integer set $\{0, 1, \ldots, N-1\}$; we choose the hash function as

$h(x) = (a \cdot x + c) \bmod N,$  (11)

where $a$ and $N$ must be relatively prime and $\bmod$ obtains the remainder of $(a \cdot x + c)$ divided by $N$. This function maps a number to another number with no collision. Hence the resulting list is a permutation of the original features.
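
A sketch of this simulation (the modulus, coefficients, and toy instance below are arbitrary examples of our own):

from math import gcd
import random

def make_permutation_hash(num_features, rng):
    # h(x) = (a*x + c) mod N with gcd(a, N) = 1, so h is a bijection on
    # {0, ..., N-1} and therefore simulates a random permutation.
    a = rng.randrange(1, num_features)
    while gcd(a, num_features) != 1:
        a = rng.randrange(1, num_features)
    c = rng.randrange(num_features)
    return lambda x: (a * x + c) % num_features

def minhash_value(feature_set, h):
    # Minhash value: the smallest permuted position of any present feature.
    return min(h(x) for x in feature_set)

rng = random.Random(42)
N, n = 10007, 100                       # number of features, signature length
hash_functions = [make_permutation_hash(N, rng) for _ in range(n)]
instance = {3, 17, 4242}                # toy instance: ids of the present features
signature = [minhash_value(instance, h) for h in hash_functions]
print(signature[:5])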

Furthermore, since producing a signature for each instance requires $n$ independent permutations, we prove that this family of hash functions yields more than $n$ different permutations.

Lemma 3. Given $N$ features and a desired signature length $n$, the hash function $h(x) = (a \cdot x + c) \bmod N$, where $a$ and $N$ are relatively prime and $0 \le c < N$, produces more than $n$ different permutations.

Proof. Assume a permutation is generated by the hash function with parameters $a$ and $c$. For a specified $a$, different integers $c$ produce different permutations, since changing $c$ shifts every hashed position. Euler's totient function $\varphi(N)$ is an arithmetic function that counts the number of totatives of the integer $N$, which indicates that the number of admissible values of $a$ is $\varphi(N)$. Therefore, there are $\varphi(N) \times N$ pairs $(a, c)$ which produce different permutations. Since $\varphi(N) \times N \gg n$, we prove that the hash function produces more than $n$ different permutations.

4.3. Banding

Banding technique filters dissimilar pairs. As shown in Figure 6, we implement the banding procedure in two MapReduce phases.

In the first phase, the signatures are input to the map function. The map function divides each signature into $b$ bands, each band consisting of $r$ elements, and then maps each band to a bucket. The outputs are in the form of ⟨(band id, bucket id), instance id⟩. In other words, the band id and the bucket id are combined as the key, and the instance id is assigned as the corresponding value. For example, as shown in Figure 6, the signature of instance 1 is (1 2 11 3 4 23 ...) and that of instance 2 is (2 2 13 3 4 23 ...). Suppose $r = 3$; then instance 1 is divided into at least 2 bands, (1 2 11) and (3 4 23). The two bands are mapped to a bucket in bucket array 1 and a bucket in bucket array 2, respectively. So the outputs of the map function for instance 1 include the ⟨(band id, bucket id), 1⟩ pairs of these two bands. Analogously, the corresponding pairs for instance 2 are part of the map outputs.

In the reduce task, all instances with the same band id and bucket id are assigned to the same reduce task. An output in the form of ⟨(instance id 1, instance id 2), 1⟩ is produced for every pair of such instances, where the fixed value 1 represents one observed co-occurrence of the pair. For instance, suppose ⟨(1, 5), 2⟩ and ⟨(1, 5), 23⟩ are among the aforementioned map outputs; since they share the same band id 1 and bucket id 5, the reduce task produces the pair ⟨(2, 23), 1⟩. That is to say, instance 2 and instance 23 are likely to be a candidate pair because their first bands are both mapped to the 5th bucket.

In the second phase, the map task simply forwards what was produced in the first phase. To minimize the network traffic between the map and reduce functions, we use a combine function to aggregate the outputs generated by the map function into partial local counts on each computing node. Subsequently, the reduce function computes the total count for each instance pair. Outputs of the reduce function are in the form of ⟨(instance id 1, instance id 2), count⟩ pairs, where count is the global frequency for the instance pair, that is, the number of bands in which the two instances share a bucket. Personalized LSH eliminates those pairs of instances whose count is less than the given threshold $k$. As shown in Figure 6, the count of instance pair (2, 23) is not less than $k$, so (2, 23) is a candidate pair.
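
To make the two phases concrete, the following single-machine sketch mimics the map and reduce logic described above (function names and toy signatures are our own; a real implementation would run as Hadoop jobs):

from collections import defaultdict
from itertools import combinations

def banding_phase1(signatures, b, r):
    # Map: emit ((band id, bucket id), instance id).
    # Reduce: for every bucket, emit each instance pair with a count of 1.
    grouped = defaultdict(list)
    for instance_id, sig in signatures.items():
        for band in range(b):
            bucket = hash(tuple(sig[band * r:(band + 1) * r]))
            grouped[(band, bucket)].append(instance_id)
    for ids in grouped.values():
        for pair in combinations(sorted(ids), 2):
            yield pair, 1

def banding_phase2(phase1_output, k):
    # Reduce: sum the per-bucket counts and keep pairs whose count reaches k.
    counts = defaultdict(int)
    for pair, one in phase1_output:
        counts[pair] += one
    return {pair: c for pair, c in counts.items() if c >= k}

sigs = {1: [1, 2, 11, 3, 4, 23], 2: [2, 2, 13, 3, 4, 23], 23: [2, 2, 13, 3, 4, 23]}
print(banding_phase2(banding_phase1(sigs, b=2, r=3), k=2))   # {(2, 23): 2}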

4.4. Verification

Candidate pairs generated in Section 4.3 need to be checked in the verification stage. For each candidate instance pair, the signature similarity is computed. Because of the massive number of instances, this is not a trivial task.

In order to reduce the storage cost of each reduce task, the set of signatures of all instances is partitioned into small files according to instance ids. In this way, each reduce task holds only two different small partitions, where the first partition is for the first instance in a candidate pair and the second partition is for the second instance. All candidate pairs whose instance ids fall into the same two partitions have to be assigned to the same reduce task. Hence, the map task calculates the reduce task id of each candidate pair according to its instance ids and produces an output ⟨reduce task id, (instance id 1, instance id 2)⟩. In the reduce task, the signature similarity of each pair of instances is computed. Finally, the reduce task outputs the pairs whose similarities are greater than the given threshold $t$. The outputs are in the form of ⟨(instance id 1, instance id 2), similarity⟩, where similarity is the approximate Jaccard similarity of the pair. As shown in Figure 7, the similarity between signature 2 and signature 5 is 0.73; since this exceeds the given threshold $t$, the output ⟨(2, 5), 0.73⟩ is produced.
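
A sketch of the verification logic (partitioning by instance id is omitted for brevity; the data below is hypothetical):

def signature_similarity(sig1, sig2):
    # Approximate Jaccard similarity: fraction of identical positions.
    return sum(1 for a, b in zip(sig1, sig2) if a == b) / len(sig1)

def verify(candidate_pairs, signatures, t):
    # Keep only the candidate pairs whose signature similarity exceeds t.
    results = {}
    for id1, id2 in candidate_pairs:
        sim = signature_similarity(signatures[id1], signatures[id2])
        if sim > t:
            results[(id1, id2)] = sim
    return results

sigs = {2: [1, 2, 3, 4], 5: [1, 2, 3, 9], 7: [9, 9, 9, 9]}
print(verify([(2, 5), (2, 7)], sigs, t=0.7))   # {(2, 5): 0.75}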

5. Experiment Evaluation

In this section, we design a series of experiments to analyze the performance of the proposed PLSH algorithms. We want to study the following research questions:
(i) The efficiency of PLSH: is PLSH fast compared with other similarity join algorithms? Can it scale to large scale data sets? Do different parameter values affect the efficiency of PLSH?
(ii) The effectiveness of PLSH: is PLSH accurate? Does it generate fewer false positives and false negatives?
(iii) The personalization of PLSH: how should we set the parameters of PLSH to generate the desired numbers of false positives and false negatives? Is the tailored PLSH more appropriate for different applications?

The experiments are conducted on a 6-node cluster. Each node has one Intel i7-3820 3.6 GHz processor with four cores, 32 GB of RAM, and a 100 GB hard disk. On each node, we install Ubuntu 12.04 64-bit server edition, Java 1.6 with a 64-bit server JVM, and Hadoop 1.0. Apache Hadoop is an open source implementation of MapReduce. We run 2 reduce tasks in parallel on each node.

We use the DBLP and CITESEERX datasets and increase the number of instances when needed. The original DBLP dataset has approximately 1.2 M instances, while the original CITESEERX dataset has about 1.3 M instances. As shown in Figure 8, when the number of instances of these two data sets is increased, CITESEERX occupies larger storage than DBLP.

In our experiments, we tokenize the data by word. The concatenation of the paper title and the list of authors is the join attribute (i.e., paper title and list of authors are two attributes of each instance). The default threshold of Jaccard similarity is fixed unless otherwise stated. The default length of the hash signature is 100, and each band has a fixed number of elements.

5.1. Efficiency of PLSH

The ground truth is the set of truly similar pairs generated by fuzzyjoin [12], a 3-stage MapReduce approach for end-to-end set-similarity joins, which selects instance pairs whose Jaccard similarities are greater than the given threshold.

We first compare the efficiency of our method PLSH with fuzzyjoin on the DBLP and CITESEERX data sets. The CPU times of the different algorithms are shown in Figure 9. We can conclude from Figure 9 that PLSH is generally faster than fuzzyjoin. When the number of instances is 7.5 M in DBLP and CITESEERX, the time cost of fuzzyjoin is nearly twice that of PLSH. When the data size increases, the efficiency improvement is more significant. This suggests that PLSH is more scalable to large scale data sets. Fuzzyjoin takes roughly equivalent CPU time on DBLP and CITESEERX of similar size, while PLSH works faster on DBLP than on CITESEERX. This suggests that PLSH is more affected by the similarity distribution in a data set.

Next we analyze the effect of the number of reduce tasks on algorithm efficiency. Because the number of reduce tasks in the verification step differs from that in the previous steps, we record the CPU time of the first three stages (minhashing, banding1, and banding2; note that the banding procedure is completed in two MapReduce phases) and of the verification step separately. In the first three steps, we vary the number of reduce tasks from 2 to 12 with a step size of 2. The data sets are DBLP datasets containing 0.6 M, 1.2 M, 2.5 M, 5 M, and 7.5 M instances, respectively.

From Figure 10 we have the following observations. In general, the time cost of PLSH decreases with more reduce tasks. This verifies our assumption that the parallel mechanism improves the algorithm efficiency. When there are fewer than 2.5 M instances, the CPU time decreases slowly as the number of reduce tasks increases. This is because the start-up time of MapReduce cannot be shortened even though the number of tasks increases. When the CPU time exceeds 2000 seconds, the start-up time is only a small fraction of the total time. Therefore, when the number of instances is relatively large, the improvement is more significant. This suggests that the parallel mechanism of PLSH is more suitable for large scale data sets.

We further study the speedup with respect to the number of reduce tasks. We set the time cost with 2 reduce tasks as the baseline and plot the ratio between the running time and the baseline for various numbers of reduce tasks. From Figure 11, we observe that when the size of the dataset increases, the speedup is more significant. But none of the curves is a straight line, which suggests that the speedup is not linear with respect to the number of reduce tasks. The limited speedup is due to two main reasons. First, there are 5 jobs for finding the similar instances, and the start-up time of each job cannot be shortened. Second, as the size of the dataset increases, more data is sent through the network and more data is merged and reduced; thus the communication cost cannot be reduced.

The number of reduce tasks is fixed in the verification stage. However, in this stage, the parameter $k$ of the proposed PLSH has a significant effect on the running time. We analyze the performance of PLSH with various values of $k$. From Figure 12, we have the following observations. In general, as the value of $k$ increases, the time cost of the verification stage decreases. This suggests that, with a larger $k$, the proposed PLSH can better prune dissimilar pairs and enhance efficiency. When the data set is small, that is, the number of instances is smaller than 2.5 M, the enhancement is not obvious. The underlying reason is that there are fewer candidates in a small data set. The speedup is significant on large scale data sets (with more than 5 M instances), which suggests the scalability of PLSH.

5.2. Effectiveness Evaluation

In this subsection, we study the effectiveness of PLSH and how the length of the signature, the similarity threshold, and the parameter $k$ affect effectiveness.

We first define four types of instance pairs. False positives (FP) are dissimilar pairs (similarity lower than the threshold) that are selected as candidates by PLSH. True positives (TP) are similar pairs (similarity higher than the threshold) that are selected as candidates by PLSH. False negatives (FN) are similar pairs (similarity higher than the threshold) that are pruned by PLSH. True negatives (TN) are dissimilar pairs (similarity lower than the threshold) that are pruned by PLSH.

The numbers of FP with various signature lengths and values of $k$, with a fixed similarity threshold, are shown in Figure 13. We have the following observations. For a fixed signature length, the number of false positives significantly decreases with larger $k$. For larger $k$, prolonging the signature results in fewer false positives. For small $k$ and shorter signatures (fewer than 400 elements), the number of false positives fluctuates. The above three observations indicate that the proposed PLSH can achieve a high precision with larger $k$ and longer signatures.

The numbers of FP with various similarity thresholds and values of $k$, with a fixed signature length, are shown in Figure 14. We have the following observations. For a fixed similarity threshold, the number of false positives significantly decreases with larger $k$. For a fixed $k$, the number of false positives significantly decreases with a larger similarity threshold. This is because, with a larger similarity threshold, there are fewer qualified candidates.

The numbers of FN with various signature lengths and values of $k$, with a fixed similarity threshold, are shown in Figure 15. We have the following observations. For a fixed signature length, the number of false negatives generated by the original LSH ($k = 1$) is smaller than that of PLSH. For the original LSH, shorter signatures generate more FN. In terms of FN, PLSH is not very sensitive to the length of the signature.

The numbers of FN with various similarity thresholds and values of $k$, with a fixed signature length, are shown in Figure 16. Although the number of false negatives for PLSH is larger than that of the original LSH, we observe that for large similarity thresholds the difference is minute. In most applications, we want to search for very similar pairs; thus, PLSH will still perform well in most scenarios.

Since the false negatives generated by PLSH are in general more numerous than those of LSH, we further use precision and specificity to evaluate the performance of PLSH. Precision is defined as $TP/(TP + FP)$ and specificity is defined as $TN/(TN + FP)$. In this subsection, we want to analyze the effects of $k$, the signature length, and the similarity threshold in terms of precision and specificity. We vary $k$ from 1 to 5 with a step size of 1, the signature length from 200 to 500 with a step size of 100, and the similarity threshold from 0.7 to 0.9 with a step size of 0.1.

From Figures 17 to 20, we have the following observations. The PLSH method performs better than LSH in terms of specificity and precision, which demonstrates the potential of PLSH in many data mining and bioinformatics applications. Generally, longer signatures outperform shorter signatures.

5.3. Personalized LSH

As proved in Section 3, an important characteristic of our proposed PLSH is that it is capable of tailoring the number of false positives and false negatives for different applications. In this subsection, we present a pilot study on the tailoring capability of PLSH. We first numerically derive the appropriate $k$ for different degrees of desired precision. As shown in Table 3, the required precision is measured in terms of the ratio of false positives versus conventional LSH ($\epsilon$), the ratio of false negatives versus conventional LSH ($\delta$), and the ratio of total errors (the sum of false positives and false negatives) versus conventional LSH ($\gamma$). For example, from Table 3 we can read off the value of $k$ required for a given target ratio.

We then use the different settings of $k$ to generate collaborator recommendations on the DBLP and CITESEERX data sets. We keep the authors who have published more than 25 papers. We use a modified collaborative filtering [25] to generate recommendations. For each paper published by an author $a$, PLSH is utilized to find the set of publications similar to it. We then gather the set of authors who wrote these similar publications. In recommendation systems, such similar authors are treated as nearest neighbors. Each nearest neighbor is assigned a score, which is the accumulated similarity between the publications of the nearest neighbor and the publications of the author $a$. For each author, the top 5 nearest neighbors with the largest accumulated similarities are returned as recommended collaborators. We manually annotate the results. The evaluation metric is the precision at the top 5 results (P@5). The average results are shown in Table 3. We have the following conclusions. When the number of false positives is reduced, we can generate more precise nearest neighbors; thus the recommendation performance is boosted. When the number of false positives is too small, then, due to the sparsity problem, the collaborative filtering framework is not able to generate enough nearest neighbors; thus the recommendation performance deteriorates. A recommendation system achieves its best performance with a fine-tuned parameter $k$, which optimizes the trade-off between false positives and false negatives.
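
A sketch of the neighbor-scoring step of this pilot study (data structures and names are hypothetical; PLSH is assumed to supply, for each query publication, its similar publications and their similarities):

from collections import defaultdict

def recommend_collaborators(author_pubs, similar_pubs, pub_authors, top_n=5):
    # Score each candidate collaborator by accumulating the similarities of
    # their publications to the query author's publications; return the top_n.
    scores = defaultdict(float)
    for pub in author_pubs:                               # query author's papers
        for other_pub, sim in similar_pubs.get(pub, []):  # pairs found by PLSH
            for neighbor in pub_authors.get(other_pub, []):
                scores[neighbor] += sim
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy inputs (hypothetical ids).
author_pubs = ["p1", "p2"]
similar_pubs = {"p1": [("p9", 0.8)], "p2": [("p9", 0.6), ("p7", 0.9)]}
pub_authors = {"p9": ["alice"], "p7": ["bob"]}
print(recommend_collaborators(author_pubs, similar_pubs, pub_authors))
# -> ['alice', 'bob']; alice accumulates 1.4, bob 0.9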

6. Related Works

There are fruitful research works on the similarity joins problem. Commonly adopted similarity measurements include cosine similarity [7], edit distance [8], Jaccard similarity [12], hamming distance [10], and synonym-based similarity [27].

The methodologies used in similarity joins can be categorized into two classes. The first class adopts dimension reduction technologies [16], so that the similarity computation is conducted in a lower dimensional space. The second class utilizes filtering approaches to avoid unnecessary similarity computations, including prefix filter [7, 20], length filter [7], suffix filter [21], and positional filter [8]. To take advantage of both categories, a series of hashing methods (such as minhashing [17], minwise hashing [18], and Locality Sensitive Hashing (LSH) [10, 19]) combine dimension reduction and filtering by first using a signature scheme [28] to project the original feature vectors to low dimensional signatures and then filtering unqualified signatures.

Recently, parallel algorithms have been adopted on clusters of machines to improve the efficiency of similarity joins for large scale and high dimensional data sets. We have seen an emerging trend of research efforts in this line, including a 3-stage MapReduce approach for end-to-end set-similarity joins [12, 13], fast computation of inner products for large scale news articles [13], a new ant colony optimization algorithm parallelized using MapReduce [14] to select features in a high dimensional space, and the LSS algorithm exploiting the parallelism and high data throughput of GPUs [15].

In the era of big data, approximate computation is also a possible solution, if an approximate result obtained with much less computation is sufficiently close to the accurate solution. Approximate computation can be implemented not only through hardware modules [29, 30] but also through approximate algorithms [31, 32]. Unlike exact similarity joins [25], approximate similarity joins and top-$k$ similarity joins [33] are better adapted to different application domains [34].

7. Conclusion

The problem of similarity joins based on Jaccard similarity is studied in this paper. We propose a new banding technique which tailors the number of false positives, the number of false negatives, and the sum of both. Our experiments show that the new method is both efficient and effective.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

Chen Lin is partially supported by China Natural Science Foundation under Grant nos. NSFC61102136 and NSFC61472335, Base Research Project of Shenzhen Bureau of Science, Technology, and Information (JCYJ20120618155655087), CCF-Tencent Open Research Fund under Grant no. CCF-Tencent20130101, and Baidu Open Research under Grant no. Z153283.