Research Article

WSF2: A Novel Framework for Filtering Web Spam

Table 2

Available datasets for web spam research.

Corpus name | Hosts | Pages | Domain crawled | % Spam | Available at
--- | --- | --- | --- | --- | ---
WebSpam UK2006 | 11,400 | 77.9M | United Kingdom (.uk) | 26% | http://chato.cl/webspam/datasets/uk2006/
WebSpam UK2007 | 114,529 | 105M | United Kingdom (.uk) | 5.30% | http://chato.cl/webspam/datasets/uk2007/
WebSpam UK2011 | n/a | 3,766 | United Kingdom (.uk) | 53% | https://sites.google.com/site/heiderawahsheh/home/web-spam-2011-datasets/uk-2011-web-spam-dataset
DC2010 | 99,000 | 23M | Europe (.eu) | 3.2% | https://dms.sztaki.hu/en/letoltes/ecmlpkdd-2010-discovery-challenge-data-set
Webb Spam Corpus 2006 | n/a | 350,000 | Links found in millions of spam e-mails | 100% | http://www.cc.gatech.edu/projects/doi/WebbSpamCorpus.html
Webb Spam Corpus 2011 | n/a | 330,000 | Links found in millions of spam e-mails | 100% | http://www.cc.gatech.edu/projects/doi/WebbSpamCorpus.html