Computational Intelligence and Neuroscience
Volume 2015, Article ID 217216, 13 pages
http://dx.doi.org/10.1155/2015/217216
Research Article

MapReduce Based Personalized Locality Sensitive Hashing for Similarity Joins on Large Scale Data

¹School of Information Science and Technology, Xiamen University, Xiamen 361005, China
²Shenzhen Research Institute of Xiamen University, Shenzhen 518058, China

Received 28 September 2014; Revised 24 February 2015; Accepted 2 March 2015

Academic Editor: J. Alfredo Hernandez

Copyright © 2015 Jingjing Wang and Chen Lin. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Linked References

  1. C. Carpineto, S. Osiński, G. Romano, and D. Weiss, “A survey of web clustering engines,” ACM Computing Surveys, vol. 41, no. 3, article 17, 2009.
  2. R. Saraçoğlu, K. Tütüncü, and N. Allahverdi, “A fuzzy clustering approach for finding similar documents using a novel similarity measure,” Expert Systems with Applications, vol. 33, no. 3, pp. 600–605, 2007.
  3. T. C. Hoad and J. Zobel, “Methods for identifying versioned and plagiarized documents,” Journal of the American Society for Information Science and Technology, vol. 54, no. 3, pp. 203–215, 2003.
  4. B. Kulis and K. Grauman, “Kernelized locality-sensitive hashing for scalable image search,” in Proceedings of the IEEE 12th International Conference on Computer Vision (ICCV '09), pp. 2130–2137, IEEE, Kyoto, Japan, October 2009.
  5. L. Li, L. Zheng, F. Yang, and T. Li, “Modeling and broadening temporal user interest in personalized news recommendation,” Expert Systems with Applications, vol. 41, no. 7, pp. 3168–3177, 2014.
  6. J. Wang, J. Feng, and G. Li, “Efficient trie-based string similarity joins with edit-distance constraints,” Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 1219–1230, 2010.
  7. R. J. Bayardo, Y. Ma, and R. Srikant, “Scaling up all pairs similarity search,” in Proceedings of the 16th International World Wide Web Conference (WWW '07), pp. 131–140, ACM, May 2007.
  8. C. Xiao, W. Wang, and X. Lin, “Ed-join: an efficient algorithm for similarity joins with edit distance constraints,” Proceedings of the VLDB Endowment, vol. 1, pp. 933–944, 2008.
  9. G. Li, D. Deng, J. Wang, and J. Feng, “Pass-join: a partition-based method for similarity joins,” Proceedings of the VLDB Endowment, vol. 5, no. 3, pp. 253–264, 2011.
  10. A. Gionis, P. Indyk, and R. Motwani, “Similarity search in high dimensions via hashing,” in Proceedings of the 25th International Conference on Very Large Data Bases (VLDB '99), pp. 518–529, Edinburgh, Scotland, September 1999.
  11. N. X. Bach, N. L. Minh, and A. Shimazu, “Exploiting discourse information to identify paraphrases,” Expert Systems with Applications, vol. 41, no. 6, pp. 2832–2841, 2014.
  12. R. Vernica, M. J. Carey, and C. Li, “Efficient parallel set-similarity joins using MapReduce,” in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '10), pp. 495–506, ACM, June 2010.
  13. T. Elsayed, J. Lin, and D. W. Oard, “Pairwise document similarity in large collections with MapReduce,” in Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pp. 265–268, Association for Computational Linguistics, Stroudsburg, Pa, USA, June 2008.
  14. M. J. Meena, K. R. Chandran, A. Karthik, and A. V. Samuel, “An enhanced ACO algorithm to select features for text categorization and its parallelization,” Expert Systems with Applications, vol. 39, no. 5, pp. 5861–5871, 2012.
  15. M. D. Lieberman, J. Sankaranarayanan, and H. Samet, “A fast similarity join algorithm using graphics processing units,” in Proceedings of the IEEE 24th International Conference on Data Engineering (ICDE '08), pp. 1111–1120, Cancun, Mexico, April 2008.
  16. M. Ture, I. Kurt, and Z. Akturk, “Comparison of dimension reduction methods using patient satisfaction data,” Expert Systems with Applications, vol. 32, no. 2, pp. 422–426, 2007.
  17. A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher, “Min-wise independent permutations,” Journal of Computer and System Sciences, vol. 60, no. 3, pp. 630–659, 2000.
  18. P. Li and C. König, “b-Bit minwise hashing,” in Proceedings of the 19th International World Wide Web Conference (WWW '10), pp. 671–680, ACM, April 2010.
  19. C. Chen, S.-J. Horng, and C.-P. Huang, “Locality sensitive hashing for sampling-based algorithms in association rule mining,” Expert Systems with Applications, vol. 38, no. 10, pp. 12388–12397, 2011.
  20. S. Chaudhuri, V. Ganti, and R. Kaushik, “A primitive operator for similarity joins in data cleaning,” in Proceedings of the 22nd International Conference on Data Engineering (ICDE '06), p. 5, IEEE, April 2006.
  21. C. Xiao, W. Wang, X. Lin, J. X. Yu, and G. Wang, “Efficient similarity joins for near-duplicate detection,” ACM Transactions on Database Systems, vol. 36, no. 3, article 15, 2011.
  22. A. Rajaraman and J. D. Ullman, Mining of Massive Datasets, Cambridge University Press, 2011.
  23. J. P. Lucas, S. Segrera, and M. N. Moreno, “Making use of associative classifiers in order to alleviate typical drawbacks in recommender systems,” Expert Systems with Applications, vol. 39, no. 1, pp. 1273–1283, 2012.
  24. J. P. Lucas, N. Luz, M. N. Moreno, R. Anacleto, A. A. Figueiredo, and C. Martins, “A hybrid recommendation approach for a tourism system,” Expert Systems with Applications, vol. 40, no. 9, pp. 3532–3550, 2013.
  25. J. Bobadilla, F. Ortega, A. Hernando, and A. Gutiérrez, “Recommender systems survey,” Knowledge-Based Systems, vol. 46, pp. 109–132, 2013.
  26. T. White, Hadoop: The Definitive Guide, O'Reilly Media, Sebastopol, Calif, USA, 2012.
  27. A. Arasu, S. Chaudhuri, and R. Kaushik, “Transformation-based framework for record matching,” in Proceedings of the IEEE 24th International Conference on Data Engineering (ICDE '08), pp. 40–49, Cancun, Mexico, April 2008.
  28. A. Arasu, V. Ganti, and R. Kaushik, “Efficient exact set-similarity joins,” in Proceedings of the 32nd International Conference on Very Large Data Bases, pp. 918–929, VLDB Endowment, September 2006.
  29. W.-T. J. Chan, A. B. Kahng, S. Kang, R. Kumar, and J. Sartori, “Statistical analysis and modeling for error composition in approximate computation circuits,” in Proceedings of the IEEE 31st International Conference on Computer Design (ICCD '13), pp. 47–53, IEEE, Asheville, NC, USA, October 2013.
  30. J. Huang, J. Lach, and G. Robins, “A methodology for energy-quality tradeoff using imprecise hardware,” in Proceedings of the 49th Annual Design Automation Conference (DAC '12), pp. 504–509, ACM, New York, NY, USA, June 2012.
  31. M. W. Mahoney, “Approximate computation and implicit regularization for very large-scale data analysis,” in Proceedings of the 31st Symposium on Principles of Database Systems (PODS '12), pp. 143–154, ACM, New York, NY, USA, 2012.
  32. J. Strassburg and V. Alexandrov, “On scalability behaviour of Monte Carlo sparse approximate inverse for matrix computations,” in Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '13), pp. 6:1–6:8, ACM, New York, NY, USA, November 2013.
  33. C. Xiao, W. Wang, X. Lin, and H. Shang, “Top-k set similarity joins,” in Proceedings of the 25th IEEE International Conference on Data Engineering (ICDE '09), pp. 916–927, IEEE, April 2009.
  34. K. Chakrabarti, S. Chaudhuri, V. Ganti, and D. Xin, “An efficient filter for approximate membership checking,” in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '08), pp. 805–817, ACM, June 2008.