Scientific Programming
Volume 2016, Article ID 6091385, 18 pages
http://dx.doi.org/10.1155/2016/6091385
Research Article

WSF2: A Novel Framework for Filtering Web Spam

Higher Technical School of Computer Engineering, University of Vigo, Polytechnic Building, Campus Universitario As Lagoas s/n, 32004 Ourense, Spain

Received 11 June 2015; Revised 26 October 2015; Accepted 12 November 2015

Academic Editor: Wan Fokkink

Copyright © 2016 J. Fdez-Glez et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Linked References

  1. R. Jaslow, “FDA shuts down over 1,600 online pharmacies,” CBSNews, 2013, http://www.cbsnews.com/news/fda-shuts-down-over-1600-online-pharmacies/.
  2. D. McCoy, A. Pitsillidis, G. Jordan et al., “Pharmaleaks: understanding the business of online pharmaceutical affiliate programs,” in Proceedings of the 21st USENIX Conference on Security Symposium (Security '12), p. 1, USENIX Association, 2012.
  3. N. Christin, “Traveling the silk road: a measurement analysis of a large anonymous online marketplace,” in Proceedings of the 22nd International Conference on World Wide Web (WWW '13), pp. 213–223, May 2013.
  4. Y.-M. Wang, M. Ma, Y. Niu, and H. Chen, “Spam double-funnel: connecting web spammers with advertisers,” in Proceedings of the 16th International World Wide Web Conference (WWW '07), pp. 291–300, ACM, May 2007.
  5. Google, “Fighting Spam—Inside Search—Google,” 2014, http://www.google.com/insidesearch/howsearchworks/fighting-spam.html.
  6. K. P. Karunakaran and S. Kolkur, “Review of web spam detection techniques,” International Journal of Latest Trends in Engineering and Technology, vol. 2, no. 4, pp. 278–282, 2013.
  7. N. Spirin and J. Han, “Survey on web spam detection: principles and algorithms,” ACM SIGKDD Explorations Newsletter, vol. 13, no. 2, pp. 50–64, 2011.
  8. M. Erdélyi and A. A. Benczúr, “Temporal analysis for web spam detection: an overview,” in Proceedings of the 1st International Temporal Web Analytics Workshop, pp. 17–24, March 2011.
  9. L. Han and A. Levenberg, “Scalable online incremental learning for web spam detection,” in Recent Advances in Computer Science and Information Engineering: Proceedings of the 2nd World Congress on Computer Science and Information Engineering, vol. 124 of Lecture Notes in Electrical Engineering, pp. 235–241, Springer, Berlin, Germany, 2012.
  10. SpamAssassin, “The Apache SpamAssassin Project,” 2011, http://spamassassin.apache.org/.
  11. N. Pérez-Díaz, D. Ruano-Ordas, F. Fdez-Riverola, and J. R. Méndez, “Wirebrush4SPAM: a novel framework for improving efficiency on spam filtering services,” Software: Practice and Experience, vol. 43, no. 11, pp. 1299–1318, 2013.
  12. C. Castillo and B. D. Davison, “Adversarial web search,” Foundations and Trends in Information Retrieval, vol. 4, no. 5, pp. 377–486, 2010.
  13. Z. Gyöngyi and H. Garcia-Molina, “Web spam taxonomy,” in Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb '05), pp. 39–47, Chiba, Japan, May 2005.
  14. S. Ghiam and A. N. Pour, “A survey on web spam detection methods: taxonomy,” International Journal of Network Security & Its Applications, vol. 4, no. 5, pp. 119–134, 2012.
  15. D. Fetterly, M. Manasse, and M. Najork, “Spam, damn spam, and statistics: using statistical analysis to locate spam web pages,” in Proceedings of the 7th International Workshop on the Web and Databases (WebDB '04), pp. 1–6, ACM, Paris, France, June 2004.
  16. D. Fetterly, M. Manasse, and M. Najork, “Detecting phrase-level duplication on the world wide web,” in Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '05), pp. 170–177, ACM, Salvador, Brazil, August 2005.
  17. A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly, “Detecting spam web pages through content analysis,” in Proceedings of the 15th International Conference on World Wide Web (WWW '06), pp. 83–92, ACM, May 2006.
  18. M. Erdélyi, A. Garzó, and A. A. Benczúr, “Web spam classification: a few features worth more,” in Proceedings of the Joint WICOW/AIRWeb Workshop on Web Quality (WebQuality '11), pp. 27–34, Hyderabad, India, March 2011.
  19. G.-G. Geng, X.-B. Jin, X.-C. Zhang, and D. X. Zhang, “Evaluating web content quality via multi-scale features,” in Proceedings of the ECML/PKDD 2010 Discovery Challenge, Barcelona, Spain, September 2010.
  20. A. Sokolov, T. Urvoy, L. Denoyer, and O. Ricard, “Madspam consortium at the ECML/PKDD discovery challenge 2010,” in Proceedings of the ECML/PKDD Discovery Challenge, Barcelona, Spain, September 2010.
  21. V. Nikulin, “Web-mining with wilcoxon-based feature selection, ensembling and multiple binary classifiers,” in Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD '10), September 2010.
  22. B. Davison, “Recognizing nepotistic links on the web,” in Proceedings of the AAAI Workshop on Artificial Intelligence for Web Search, pp. 23–28, Austin, Tex, USA, July 2000.
  23. G.-G. Geng, C.-H. Wang, Q.-D. Li, L. Xu, and X.-B. Jin, “Boosting the performance of web spam detection with ensemble under-sampling classification,” in Proceedings of the 4th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD '07), pp. 583–587, Haikou, China, August 2007.
  24. L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates, “Web spam detection: link-based and content-based techniques,” in Proceedings of the European Integrated Project Dynamically Evolving, Large Scale Information Systems (DELIS '08), pp. 99–113, Barcelona, Spain, February 2008.
  25. R. M. Silva, T. A. Almeida, and A. Yamakami, “Machine learning methods for spamdexing detection,” International Journal of Information Security Science, vol. 2, no. 3, pp. 1–22, 2013.
  26. K. M. Svore, Q. Wu, C. J. C. Burges, and A. Raman, “Improving web spam classification using rank time features,” in Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb '07), pp. 9–16, Banff, Canada, 2007.
  27. J. Abernethy, O. Chapelle, and C. Castillo, “Web spam identification through content and hyperlinks,” in Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb '08), pp. 41–44, ACM, Beijing, China, April 2008.
  28. M. Najork, “Web spam detection,” in Encyclopedia of Database Systems, pp. 3520–3523, Springer, 2009.
  29. B. Wu and B. D. Davison, “Detecting semantic cloaking on the Web,” in Proceedings of the 15th International Conference on World Wide Web (WWW '06), pp. 819–828, ACM, May 2006.
  30. K. Chellapilla and D. Chickering, “Improving cloaking detection using search query popularity and monetizability,” in Proceedings of the 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb '06), pp. 17–24, Seattle, Wash, USA, August 2006.
  31. G.-G. Geng, X.-T. Yang, W. Wang, and C.-J. Meng, “A Taxonomy of hyperlink hiding techniques,” in Web Technologies and Applications, vol. 8709 of Lecture Notes in Computer Science, pp. 165–176, Springer, 2014.
  32. B. Wu and B. D. Davison, “Cloaking and redirection: a preliminary Study,” in Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb '05), pp. 7–16, Chiba, Japan, May 2005.
  33. K. Chellapilla and A. Maykov, “A taxonomy of javascript redirection spam,” in Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb '07), vol. AIRWeb, pp. 81–88, Alberta, Canada, May 2007.
  34. D. Ruano-Ordás, J. Fdez-Glez, F. Fdez-Riverola, and J. R. Méndez, “Effective scheduling strategies for boosting performance on rule-based spam filtering frameworks,” Journal of Systems and Software, vol. 86, no. 12, pp. 3151–3161, 2013.
  35. B. Liu and F. Menczer, “Web crawling,” in Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, pp. 311–362, Springer, 2nd edition, 2011.
  36. C. Olston and M. Najork, “Web crawling,” Foundations and Trends in Information Retrieval, vol. 4, no. 3, pp. 175–246, 2010.
  37. V. Shkapenyuk and T. Suel, “Design and implementation of a high-performance distributed web crawler,” in Proceedings of the 18th International Conference on Data Engineering, pp. 357–368, March 2002.
  38. L. Araujo and J. Martínez-Romo, “Web spam detection: new classification features based on qualified link analysis and language models,” IEEE Transactions on Information Forensics and Security, vol. 5, no. 3, pp. 581–590, 2010.
  39. M. Mahmoudi, A. Yari, and S. Khadivi, “Web spam detection based on discriminative content and link features,” in Proceedings of the 5th International Symposium on Telecommunications (IST '10), pp. 542–546, Tehran, Iran, December 2010.
  40. C. Castillo, D. Donato, L. Becchetti et al., “A reference collection for web spam,” ACM SIGIR Forum, vol. 40, no. 2, pp. 11–24, 2006.
  41. H. Wahsheh, I. Abu Doush, M. Al-Kabi, I. Alsmadi, and E. Al-Shawakfa, “Using machine learning algorithms to detect content-based arabic web spam,” Journal of Information Assurance and Security, vol. 7, no. 1, pp. 14–24, 2012.
  42. S. Webb, J. Caverlee, and C. Pu, “Introducing the Webb spam corpus: using email spam to identify web spam automatically,” in Proceedings of the 3rd Conference on Email and AntiSpam (CEAS '06), 28, p. 27, July 2006.
  43. D. Wang, D. Irani, and C. Pu, “Evolutionary study of web spam: Webb Spam Corpus 2011 versus Webb Spam Corpus 2006,” in Proceedings of the 8th IEEE International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom '12), pp. 40–49, Pittsburgh, Pa, USA, October 2012.
  44. C. Castillo, K. Chellapilla, and L. Denoyer, “Web spam challenge 2008,” in Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb '08), Beijing, China, April 2008.
  45. C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri, “Know your neighbors: web spam detection using the web topology,” in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07), pp. 423–430, ACM, Amsterdam, The Netherlands, July 2007.
  46. T. G. Dietterich, “An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization,” Machine Learning, vol. 40, no. 2, pp. 139–157, 2000.
  47. Freund Lab Wiki, Adaboost.c, 2014, http://seed.ucsd.edu/mediawiki/index.php/AdaBoost.c.
  48. L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
  49. V. Metsis, I. Androutsopoulos, and G. Paliouras, “Spam filtering with naive Bayes-which naive bayes?” in Proceedings of the 3rd Conference on Email and Anti-Spam (CEAS '06), Mountain View, Calif, USA, July 2006.
  50. VFML, VFML 2013, http://www.cs.washington.edu/dm/vfml/.
  51. I. Yevseyeva, V. Basto-Fernandes, and J. R. Méndez, “Survey on anti-spam single and multi-objective optimization,” in Proceedings of the 3rd Conference on ENTERprise Information Systems (CENTERIS '11), pp. 120–129, Vilamoura, Portugal, October 2011.
  52. V. Basto-Fernandes, I. Yevseyeva, and J. R. Méndez, “Optimization of anti-spam systems with multiobjective evolutionary algorithms,” Information Resources Management Journal, vol. 26, no. 1, pp. 54–67, 2013.
  53. I. Yevseyeva, V. Basto-Fernandes, D. Ruano-Ordás, and J. R. Méndez, “Optimising anti-spam filters with evolutionary algorithms,” Expert Systems with Applications, vol. 40, no. 10, pp. 4010–4021, 2013.
  54. J. R. Méndez, M. Reboiro-Jato, F. Díaz, E. Díaz, and F. Fdez-Riverola, “Grindstone4Spam: an optimization toolkit for boosting e-mail classification,” Journal of Systems and Software, vol. 85, no. 12, pp. 2909–2920, 2012.