Mathematical Problems in Engineering
Volume 2016, Article ID 3919043, 12 pages
http://dx.doi.org/10.1155/2016/3919043
Research Article

Effective and Fast Near Duplicate Detection via Signature-Based Compression Metrics

1 Key Laboratory of Trustworthy Distributed Computing and Service, Ministry of Education, Beijing University of Posts and Telecommunications, Beijing, China
2 Department of Computer Science and Technology, Tsinghua University, Beijing, China
3 Institute of Electronic and Information Engineering in Dongguan, UESTC, Dongguan, China

Received 22 May 2016; Accepted 8 September 2016

Academic Editor: Yuqiang Wu

Copyright © 2016 Xi Zhang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Linked References

  1. M. Theobald, J. Siddharth, and A. Paepcke, “SpotSigs: robust and efficient near duplicate detection in large web collections,” in Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, Singapore, July 2008.
  2. A. Broder, S. Glassman, M. Manasse, and G. Zweig, “Syntactic clustering of the Web,” Computer Networks and ISDN Systems, vol. 29, no. 8–13, pp. 1157–1166, 1997.
  3. G. Manku, A. Jain, and A. Sarma, “Detecting near-duplicates for web crawling,” in Proceedings of the 16th International Conference on World Wide Web (WWW '07), 2007.
  4. M. Mitzenmacher, R. Pagh, and N. Pham, “Efficient estimation for high similarities using odd sketches,” in Proceedings of the 23rd International Conference on World Wide Web (WWW '14), pp. 109–118, Florence, Italy, April 2014.
  5. R. Cilibrasi and P. Vitanyi, “Clustering by compression,” IEEE Transactions on Information Theory, vol. 51, no. 4, pp. 1523–1545, 2005.
  6. X. Chen, B. Francia, M. Li, B. McKinnon, and A. Seker, “Shared information and program plagiarism detection,” IEEE Transactions on Information Theory, vol. 50, no. 7, pp. 1545–1551, 2004.
  7. D. Cerra and M. Datcu, “A fast compression-based similarity measure with applications to content-based image retrieval,” Journal of Visual Communication and Image Representation, vol. 23, no. 2, pp. 293–302, 2012.
  8. R. Cilibrasi, P. Vitányi, and R. De Wolf, “Algorithmic clustering of music based on string compression,” Computer Music Journal, vol. 28, no. 4, pp. 49–67, 2004.
  9. D. Cerra, A. Mallet, L. Gueguen, and M. Datcu, “Algorithmic information theory-based analysis of earth observation images: an assessment,” IEEE Geoscience and Remote Sensing Letters, vol. 7, no. 1, pp. 8–12, 2010.
  10. M. Li, J. H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang, “An information-based sequence distance and its application to whole mitochondrial genome phylogeny,” Bioinformatics, vol. 17, no. 2, pp. 149–154, 2001.
  11. M. Cebrian, M. Alfonseca, and A. Ortega, “Common pitfalls using the normalized compression distance: what to watch out for in a compressor,” Communications in Information and Systems, vol. 5, no. 4, pp. 367–383, 2005.
  12. M. Li and P. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, Springer, New York, NY, USA, 2008.
  13. Kolmogorov Complexity, https://en.wikipedia.org/wiki/Kolmogorov_complexity
  14. Google Snappy Project, http://google.github.io/snappy
  15. J. Cho, H. Garcia-Molina, T. Haveliwala et al., “Stanford WebBase components and applications,” ACM Transactions on Internet Technology, vol. 6, no. 2, pp. 153–186, 2006.
  16. N. Tran, “The normalized compression distance and image distinguishability,” in Human Vision and Electronic Imaging XII, 64921D, vol. 6492 of Proceedings of SPIE, p. 11, February 2007.
  17. M. S. Charikar, “Similarity estimation techniques from rounding algorithms,” in Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC '02), pp. 380–388, ACM, 2002.
  18. J. Feng and S. Wu, “Detecting near-duplicate documents using sentence level features,” in Proceedings of the International Conference on Database and Expert Systems Applications, 2015.
  19. SpotSigs Source Code, http://adrem.ua.ac.be/~tmartin/
  20. A. Broder, “Identifying and filtering near-duplicate documents,” in Combinatorial Pattern Matching, R. Giancarlo and D. Sankoff, Eds., vol. 1848 of Lecture Notes in Computer Science, pp. 1–10, Springer, Berlin, Germany, 2000.
  21. C. Li, B. Wang, and X. Yang, “VGRAM: improving performance of approximate queries on string collections using variable-length grams,” in Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB '07), Vienna, Austria, September 2007.
  22. M. Henzinger, “Finding near-duplicate web pages: a large-scale evaluation of algorithms,” in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '06), pp. 284–291, Seattle, Wash, USA, August 2006.
  23. Y.-S. Lin, T.-Y. Liao, and S.-J. Lee, “Detecting near-duplicate documents using sentence-level features and supervised learning,” Expert Systems with Applications, vol. 40, no. 5, pp. 1467–1476, 2013.
  24. O. Alonso, D. Fetterly, and M. Manasse, “Duplicate news story detection revisited,” in Proceedings of the Asia Information Retrieval Symposium, pp. 203–214, 2013.
  25. C. Varol and S. Hari, “Detecting near-duplicate text documents with a hybrid approach,” Journal of Information Science, vol. 41, no. 4, pp. 405–414, 2015.
  26. Q. Zhang, H. Ma, W. Qian, and A. Zhou, “Duplicate detection for identifying social spam in microblogs,” in Proceedings of the IEEE International Congress on Big Data (BigData '13), pp. 141–148, July 2013.
  27. Y. Bachrach and E. Porat, “Fingerprints for highly similar streams,” Information and Computation, vol. 244, pp. 113–121, 2015.
  28. S. Sood and D. Loguinov, “Probabilistic near-duplicate detection using simhash,” in Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), pp. 1117–1126, Glasgow, UK, October 2011.
  29. P. Indyk, “A small approximately min-wise independent family of hash functions,” Journal of Algorithms, vol. 38, no. 1, pp. 84–90, 2001.
  30. T. Guha and R. K. Ward, “Image similarity using sparse representation and compression distance,” IEEE Transactions on Multimedia, vol. 16, no. 4, pp. 980–987, 2014.
  31. P. Foster, S. Dixon, and A. Klapuri, “Identifying cover songs using information-theoretic measures of similarity,” IEEE Transactions on Audio, Speech and Language Processing, vol. 23, no. 6, pp. 993–1005, 2015.
  32. E. Keogh, S. Lonardi, and C. A. Ratanamahatana, “Towards parameter-free data mining,” in Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 206–215, 2004.
  33. T. N. Huy, H. Shao, B. Tong, and E. Suzuki, “A feature-free and parameter-light multi-task clustering framework,” Knowledge and Information Systems, vol. 36, no. 1, pp. 251–276, 2013.
  34. Z. Xiao and X. Yuan, “B-bit normalized compression distance,” Journal of Computational Information Systems, vol. 8, no. 7, pp. 2701–2707, 2012.
  35. A. R. Cohen and P. M. B. Vitányi, “Normalized compression distance of multisets with applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 8, pp. 1602–1614, 2015.