Research Article

Using Natural Language Preprocessing Architecture (NLPA) for Big Data Text Sources

Table 2

List of available corpora for spam classification.

| Dataset name | Language | Type of contents | Size (ham / spam, or total) | Available at |
|---|---|---|---|---|
| CSDMC 2010 Spam Corpus | English | E-mail messages | 2,949 / 1,378 | https://github.com/hexgnu/spam_filter/tree/master/data |
| TREC 2007 Public Corpus | English | E-mail messages | 25,220 / 50,199 | http://plg.uwaterloo.ca/~gvcormac/treccorpus07/ |
| SpamAssassin | English | E-mail messages | 4,150 / 1,897 | http://spamassassin.apache.org/old/publiccorpus/ |
| Enron e-mail | English | E-mail messages | 619,446 / 0 | http://spamassassin.apache.org/old/publiccorpus/ |
| Bruce Guenter spam collection | English | E-mail messages | 0 / >3M | http://untroubled.org/spam/ |
| Ling spam | English | E-mail messages | 2,412 / 481 | http://csmining.org/index.php/ling-spam-datasets.html |
| SMS-spam-collection v.1 | English | SMS messages | 4,827 / 747 | http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ |
| British English SMS corpora | English | SMS messages | 450 / 425 | https://mtaufiqnzz.wordpress.com/british-English-sms-corpora/ |
| Webspam-UK 2007 | English | Web pages | 105,896,555 | http://chato.cl/webspam/datasets/index.php |
| Webspam-UK 2011 | English | Web pages | 1,769 / 1,998 | https://sites.google.com/site/heiderawahsheh/home/web-spam-2011-datasets/uk-2011-web-spam-dataset |
| DC 2010/EU 2010 | English, French, and German | Web pages | 23M | https://dms.sztaki.hu/en/letoltes/ecmlpkdd-2010-discovery-challenge-data-set |
| Webb spam 2011 | - | Web pages | 0 / 330,000 | http://www.cc.gatech.edu/projects/doi/WebbSpamCorpus.html |
| ClueWeb 09 | Multilingual | Web pages | 1,040M | http://www.lemurproject.org/clueweb09.php/ |
| ClueWeb 12 | English | Web pages | 870M | http://www.lemurproject.org/clueweb12.php/ |
| Common Crawl Data | Multilingual | Web pages | 09B | http://commoncrawl.org/ |
| YouTube Comments Dataset | Multilingual | YouTube comments | 5,950,137 / 481,334 | http://mlg.ucd.ie/yt/ |
| YouTube Spam Collection Dataset | English | YouTube comments | 951 / 1,005 | https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection |
| HSpam14.s2 | - | Twitter messages | 14M | http://www3.ntu.edu.sg/home/axsun/datasets.html |
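
The ham and spam counts in Table 2 make it easy to check how balanced each labeled corpus is before choosing one for a classification experiment. The following is a minimal Python sketch of that arithmetic, not part of the NLPA itself: the `CORPORA` dictionary and the `spam_fraction` helper are hypothetical names introduced here for illustration, and the counts are copied from the table for the corpora that report both classes as nonzero.

```python
# Ham and spam counts copied from Table 2. Only corpora that list both
# classes with nonzero counts are included; the selection is illustrative.
CORPORA = {
    "CSDMC 2010 Spam Corpus": (2949, 1378),
    "TREC 2007 Public Corpus": (25220, 50199),
    "SpamAssassin": (4150, 1897),
    "Ling spam": (2412, 481),
    "SMS-spam-collection v.1": (4827, 747),
    "British English SMS corpora": (450, 425),
    "YouTube Spam Collection Dataset": (951, 1005),
}


def spam_fraction(ham: int, spam: int) -> float:
    """Return the share of spam among all labeled items (hypothetical helper)."""
    total = ham + spam
    if total == 0:
        raise ValueError("corpus has no labeled items")
    return spam / total


if __name__ == "__main__":
    # Print the corpora from least to most spam-heavy.
    ranked = sorted(CORPORA.items(), key=lambda item: spam_fraction(*item[1]))
    for name, (ham, spam) in ranked:
        print(f"{name:<34} {spam_fraction(ham, spam):6.1%}")
```

A corpus whose spam fraction is far from one half, or a pair of single-class collections combined into one dataset, may call for resampling or class weighting; that choice depends on the experiment and is not prescribed by the table.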