Mathematical Problems in Engineering

Volume 2016, Article ID 3919043, 12 pages

http://dx.doi.org/10.1155/2016/3919043

## Effective and Fast Near Duplicate Detection via Signature-Based Compression Metrics

^{1}Key Laboratory of Trustworthy Distributed Computing and Service, Ministry of Education, Beijing University of Posts and Telecommunications, Beijing, China

^{2}Department of Computer Science and Technology, Tsinghua University, Beijing, China

^{3}Institute of Electronic and Information Engineering in Dongguan, UESTC, Dongguan, China

Received 22 May 2016; Accepted 8 September 2016

Academic Editor: Yuqiang Wu

Copyright © 2016 Xi Zhang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Detecting near duplicates on the web is challenging due to its volume and variety. Most previous studies require the setting of input parameters, making it difficult for them to achieve robustness across various scenarios without careful tuning. Recently, a universal and parameter-free similarity metric, the normalized compression distance or NCD, has been employed effectively in diverse applications. Nevertheless, problems prevent NCD from being applied to medium-to-large datasets: it lacks efficiency and tends to get skewed by large object sizes. To make this parameter-free method feasible on a large corpus of web documents, we propose a new method called SigNCD which measures NCD based on lightweight signatures instead of full documents, leading to improved efficiency and stability. We derive various lower bounds of NCD and propose pruning policies to further reduce computational complexity. We evaluate SigNCD on both English and Chinese datasets and show an increase in *F*1 score compared with the original NCD method, along with a significant reduction in runtime. Comparisons with other competitive methods also demonstrate the superiority of our method. Moreover, no parameter tuning is required in SigNCD, except a similarity threshold.

#### 1. Introduction

With the rapid growth of information in the big data era, near duplicate detection algorithms face a number of challenges. Corresponding to the four V's of big data, namely, *Volume*, *Velocity*, *Variety*, and *Value*, near duplicates should be detected with scalability, efficiency, robustness, and effectiveness. Although massive research efforts have been devoted to this problem, it is still difficult to meet all these requirements. For example, most existing algorithms cannot be adapted to evolving scenarios and heterogeneous data types (e.g., image and video) without human intervention, especially when efficiency is considered. Feasible solutions that meet more of these requirements from a thorough and extensive point of view are still under exploration.

A quantitative way to define two objects as near duplicates is to use similarity or distance functions, such as methods based on Jaccard [1] or cosine similarities [2] and Hamming or edit distances [3]. To improve efficiency, a common approach is to extract feature vectors, or more lightweight signatures/fingerprints [4], from documents to perform similarity matching. However, it is challenging to choose suitable signatures or fingerprints, as the choice usually involves a tradeoff between effectiveness and efficiency. To achieve better performance, a set of complicated factors is taken into account, such as the spots and frequencies of occurrence, and delicately designed mapping functions (e.g., hashes) are also involved. This process generally requires careful parameter tuning for good performance. However, detection tasks keep evolving (which is common on the Internet), making it difficult for such methods to adapt to varying scenarios. Therefore, most of these signature-based approaches are task-specific and parameter-sensitive.

Apart from the above parameter-dependent methods, there also exist parameter-free methods based on a special similarity metric called normalized compression distance or NCD [5], which exploits off-the-shelf compressors to estimate the amount of information shared by any two documents. NCD has been proven to be universal and can naturally be applied to a variety of domains such as genomics, languages, music, and images [5–10]. However, most of these methods were only tested on small datasets, for two reasons. First, it is extremely time-consuming to compress each of the documents and each of the pairwise concatenations of documents, leading to a prohibitive O(n²) time complexity, where n is the number of documents. Second, as we verify in the experiments, NCD is prone to being skewed by long documents [11]. Thus, NCD is only effective for short documents. However, web documents span a very wide range of lengths, making the performance of NCD unpredictable.

In this paper, to deal with large collections of documents with a very wide range of lengths, we propose a new near duplicate detection algorithm called SigNCD which combines a signature extraction process with normalized compression distance. Specifically, we first propose a punctuation-spot signature extraction method, which is robust and can be applied to different languages. Then we use lightweight signatures (rather than the full documents) as the inputs of NCD, resulting in dramatically reduced complexity and significantly improved stability. To further improve the efficiency of SigNCD, we derive various lower bounds of SigNCD (or NCD) to filter out a large portion of unnecessary comparisons. In contrast to parameter-laden methods, no parameter tuning is required for SigNCD except a similarity threshold, making it simple to implement and deploy.
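The pipeline above can be sketched in a few lines of Python. Note that the signature rule below (keeping the word immediately preceding each punctuation mark) is a hypothetical stand-in used only for illustration; the paper's actual punctuation-spot extraction is defined later, and zlib stands in for whatever real compressor is used.

```python
import re
import zlib


def extract_signature(text: str) -> bytes:
    """Toy punctuation-spot signature: keep each word that directly
    precedes a punctuation mark. Hypothetical stand-in for the
    paper's extraction rule."""
    words = re.findall(r"\w+(?=[.,;:!?])", text)
    return " ".join(words).encode("utf-8")


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance with zlib as the compressor."""
    c = lambda b: len(zlib.compress(b, 9))
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)


def sig_ncd(doc_a: str, doc_b: str) -> float:
    """NCD computed on lightweight signatures instead of full documents."""
    return ncd(extract_signature(doc_a), extract_signature(doc_b))
```

Because the signatures are much shorter than the documents, both the per-pair compression cost and the skew caused by long inputs shrink, which is the core idea of SigNCD.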

Overall, the contributions of this paper are threefold:

(i) To address the drawbacks of both signature-based and compression-based approaches, we propose a novel framework, SigNCD, that enjoys the best of both worlds. SigNCD is robust and efficient and requires no parameter tuning except a similarity threshold.

(ii) Based on the derived tight lower bounds of SigNCD, we propose exact pruning policies for similarity search to significantly reduce the complexity of processing large collections of web documents.

(iii) Experimental evaluation over both English and Chinese web document datasets shows that SigNCD outperforms NCD in terms of both *F*1 score and runtime. Comparison with other competitive signature-based methods also shows that SigNCD produces better results.

#### 2. Preliminaries

##### 2.1. Definition of NCD

We first introduce the Kolmogorov complexity [12] on which the definition of NCD is based. Kolmogorov complexity is a concept in algorithmic information theory: the Kolmogorov complexity of an object, such as a piece of text, is the length of the shortest computer program (in a predetermined programming language) that produces the object as output. It is a measure of the computational resources needed to specify the object and is also known as descriptive complexity [13]. Consider the following two strings of 48 lowercase letters and digits:

(i) "abcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabc"

(ii) "4c1j5x8rx2y39umgw5q85s7b2p0cv4w1dqoxjausakcpvc"

The first string has a short description, namely, "abc 16 times," which consists of 12 characters. The second one has no obvious description simpler than writing down the string itself, which has 48 characters. Thus, the first string has lower Kolmogorov complexity than the second. Note that Kolmogorov complexity itself is uncomputable.
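Although Kolmogorov complexity is uncomputable, any real compressor yields a computable upper bound on it, which is the intuition NCD builds on. A quick check on the two strings above, using Python's zlib as an illustrative compressor:

```python
import zlib

# The two example strings: one highly regular, one random-looking.
repetitive = b"abc" * 16
random_like = b"4c1j5x8rx2y39umgw5q85s7b2p0cv4w1dqoxjausakcpvc"


def csize(data: bytes) -> int:
    """Compressed size: a computable upper bound on Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


# The repetitive string compresses far better than the random-looking one,
# mirroring the difference in their Kolmogorov complexities.
print(csize(repetitive), csize(random_like))
```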

Formally, the Kolmogorov complexity K(x) of a finite string x is defined as the length of the shortest program that generates x on a universal computer. Intuitively, the minimal information distance between x and y is the length of the shortest program for a universal computer to transform x into y and y into x. This measure is, up to a logarithmic additive term, equal to the maximum of the conditional Kolmogorov complexities, that is, max{K(x|y), K(y|x)}. The conditional Kolmogorov complexity K(x|y) of x given a finite string y is defined as the length of the shortest program that generates x when y is given as an auxiliary input. The information distance [5] is then defined as E(x, y) = max{K(x|y), K(y|x)}. A normalized version of E(x, y), called the normalized information distance or NID, is

NID(x, y) = max{K(x|y), K(y|x)} / max{K(x), K(y)}

It is shown in [5] that NID is a universal similarity metric. Unfortunately, NID is based on Kolmogorov complexity, which is incomputable in the Turing sense. Thus, it is necessary to approximate it using a real compressor C. The result of approximating NID with a real compressor is called the normalized compression distance (NCD), formally defined as

NCD(x, y) = (C(xy) − min{C(x), C(y)}) / max{C(x), C(y)}

Here, C(xy) denotes the compressed size of the concatenation of x and y, and C(x) and C(y) denote the compressed sizes of x and y, respectively. NCD is the real-world version of the ideal notion of NID. The idea is that if x and y share common information, they will compress better together than separately. NCD can be explicitly computed between any two strings or files x and y. In practice, NCD is a nonnegative number in the range 0 ≤ NCD(x, y) ≤ 1 + ε, where ε is caused by imperfections in compression techniques, but with most standard compression algorithms one is unlikely to see an ε above 0.1. The more similar the two files are, the smaller their NCD value is.
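The definition translates directly into code. A minimal sketch in Python, using zlib as the real-world compressor (any off-the-shelf compressor could be substituted):

```python
import zlib


def csize(data: bytes) -> int:
    """C(x): compressed size of x under the chosen compressor (here zlib)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}."""
    cx, cy = csize(x), csize(y)
    return (csize(x + y) - min(cx, cy)) / max(cx, cy)
```

For example, `ncd(x, x)` is close to 0 (but not exactly 0, due to compressor imperfections), while `ncd` between unrelated texts approaches 1.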

##### 2.2. Properties of NCD

We now give the properties of NCD used in our work. First, we state axioms determining a large family of compressors, which includes most real-world compressors.

*Definition 1.* A compressor C is normal if, up to an additive O(log n) term, with n the maximal binary length of an element involved, the following hold:

(i) Idempotency: C(xx) = C(x), and C(λ) = 0, where λ is the empty string

(ii) Monotonicity: C(xy) ≥ C(x)

(iii) Symmetry: C(xy) = C(yx)

(iv) Distributivity: C(xy) + C(z) ≤ C(xz) + C(yz)

We omit the illustrative material, which can be found in [5].

Lemma 2. *If the compressor is normal, then NCD is a normalized admissible distance satisfying the metric (in)equalities as follows:*

(i) *Idempotency:* NCD(x, x) = 0

(ii) *Monotonicity:* NCD(x, y) ≥ 0 for every x, y, with strict inequality for y ≠ x

(iii) *Symmetry:* NCD(x, y) = NCD(y, x)

(iv) *Triangle inequality:* NCD(x, y) ≤ NCD(x, z) + NCD(z, y)

The proof of the triangle inequality for NCD can also be found in [5].

To obtain NCD, off-the-shelf compressors such as gzip and Snappy can be used. Since the Kolmogorov complexity is not computable, it is impossible to compute how far away the NCD is from the NID. Nevertheless, previous works on various application domains have confirmed the effectiveness of NCD as a universal similarity metric.
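The normality axioms also make cheap pruning possible, which is the idea behind the lower bounds derived later in the paper. As a rough illustration (not necessarily one of the paper's bounds): monotonicity gives C(xy) ≥ max{C(x), C(y)}, up to compressor imperfections, so NCD(x, y) ≥ 1 − min{C(x), C(y)}/max{C(x), C(y)}. This bound uses only the individual compressed sizes, so pairs whose bound already exceeds the similarity threshold can be skipped without compressing any concatenation. A sketch with zlib as a stand-in compressor:

```python
import zlib


def csize(data: bytes) -> int:
    """C(x): compressed size under zlib (a stand-in real-world compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    cx, cy = csize(x), csize(y)
    return (csize(x + y) - min(cx, cy)) / max(cx, cy)


def ncd_lower_bound(cx: int, cy: int) -> float:
    # Monotonicity: C(xy) >= max(C(x), C(y)), hence
    # NCD(x, y) >= 1 - min(cx, cy) / max(cx, cy).
    return 1.0 - min(cx, cy) / max(cx, cy)


def candidate_pairs(docs, threshold):
    """Yield only the pairs whose lower bound does not already
    exceed the similarity threshold; the rest are pruned without
    compressing any concatenation."""
    sizes = [csize(d) for d in docs]
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if ncd_lower_bound(sizes[i], sizes[j]) <= threshold:
                yield i, j
```

Pairs with very different compressed sizes are filtered out immediately, which is exactly where full NCD computation would be most wasteful.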

#### 3. Analysis of Existing Methods

In this section, we analyze two existing methods for near duplicate detection. One is SpotSigs, representing the category of methods with complex parameters to tune. The other is NCD, a parameter-free method. We experimentally show their drawbacks.

Table 1 shows the performance of SpotSigs [1] under seven parameter settings. There are six parameters to tune in total (the specific meaning of each parameter can be found in [1]). One of them is the Jaccard similarity threshold, which lies in [0, 1]; since a fixed value of this threshold provides the best F1, we keep it unchanged across all seven settings. It can be observed that SpotSigs is sensitive to the settings; for example, the best setting (setting 1) outperforms the worst setting (setting 5) by 51% in terms of F1. SpotSigs is therefore parameter-dependent, and it is challenging to choose a suitable setting from such a large parameter space. Moreover, when tasks evolve, it is difficult for the method to adapt to varying scenarios.