Research Article
Using Natural Language Preprocessing Architecture (NLPA) for Big Data Text Sources
Table 2
List of available corpora for spam classification.
| Dataset name | Language | Type of contents | Total size | Ham | Spam | Available at |
|---|---|---|---|---|---|---|
| CSDMC 2010 Spam Corpus | English | E-mail messages | - | 2,949 | 1,378 | https://github.com/hexgnu/spam_filter/tree/master/data |
| TREC 2007 Public Corpus | English | E-mail messages | - | 25,220 | 50,199 | http://plg.uwaterloo.ca/~gvcormac/treccorpus07/ |
| SpamAssassin | English | E-mail messages | - | 4,150 | 1,897 | http://spamassassin.apache.org/old/publiccorpus/ |
| Enron e-mail | English | E-mail messages | - | 619,446 | 0 | http://spamassassin.apache.org/old/publiccorpus/ |
| Bruce Guenter spam collection | English | E-mail messages | - | 0 | >3M | http://untroubled.org/spam/ |
| Ling spam | English | E-mail messages | - | 2,412 | 481 | http://csmining.org/index.php/ling-spam-datasets.html |
| SMS-spam-collection v.1 | English | SMS messages | - | 4,827 | 747 | http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ |
| British English SMS corpora | English | SMS messages | - | 450 | 425 | https://mtaufiqnzz.wordpress.com/british-English-sms-corpora/ |
| Webspam-UK 2007 | English | Web pages | 105,896,555 | - | - | http://chato.cl/webspam/datasets/index.php |
| Webspam-UK 2011 | English | Web pages | - | 1,769 | 1,998 | https://sites.google.com/site/heiderawahsheh/home/web-spam-2011-datasets/uk-2011-web-spam-dataset |
| DC 2010/EU 2010 | English, French, and German | Web pages | 23M | - | - | https://dms.sztaki.hu/en/letoltes/ecmlpkdd-2010-discovery-challenge-data-set |
| Webb spam 2011 | - | Web pages | - | 0 | 330,000 | http://www.cc.gatech.edu/projects/doi/WebbSpamCorpus.html |
| ClueWeb 09 | Multilingual | Web pages | 1,040M | - | - | http://www.lemurproject.org/clueweb09.php/ |
| ClueWeb 12 | English | Web pages | 870M | - | - | http://www.lemurproject.org/clueweb12.php/ |
| Common Crawl Data | Multilingual | Web pages | - | 0 | 9B | http://commoncrawl.org/ |
| YouTube Comments Dataset | Multilingual | YouTube comments | - | 5,950,137 | 481,334 | http://mlg.ucd.ie/yt/ |
| YouTube Spam Collection Dataset | English | YouTube comments | - | 951 | 1,005 | https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection |
| HSpam14.s2 | - | Twitter messages | 14M | - | - | http://www3.ntu.edu.sg/home/axsun/datasets.html |
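The ham and spam counts in Table 2 differ widely between corpora, which matters when a corpus is chosen for training a classifier. As a rough illustration, the following Python sketch computes the spam share of a few corpora from the counts listed in Table 2. The names `CORPORA`, `spam_ratio`, and `imbalance_report` are hypothetical helpers introduced for this example only and are not part of the architecture described in the article.

```python
# Ham/spam counts copied from Table 2; only a few corpora that report
# both counts are included here for illustration.
CORPORA = {
    "CSDMC 2010 Spam Corpus": (2949, 1378),
    "TREC 2007 Public Corpus": (25220, 50199),
    "SpamAssassin": (4150, 1897),
    "Ling spam": (2412, 481),
    "SMS-spam-collection v.1": (4827, 747),
    "YouTube Spam Collection Dataset": (951, 1005),
}


def spam_ratio(ham: int, spam: int) -> float:
    """Return the fraction of spam messages in a corpus."""
    total = ham + spam
    if total == 0:
        raise ValueError("corpus is empty")
    return spam / total


def imbalance_report(corpora: dict) -> list:
    """Return (name, spam share) pairs, most balanced corpus first."""
    rows = [(name, spam_ratio(ham, spam)) for name, (ham, spam) in corpora.items()]
    # A spam share of 0.5 means ham and spam are perfectly balanced.
    return sorted(rows, key=lambda row: abs(row[1] - 0.5))


if __name__ == "__main__":
    for name, ratio in imbalance_report(CORPORA):
        print(f"{name:35s} spam share = {ratio:.1%}")
```

Corpora that contain only one class (for example, Enron e-mail or the Bruce Guenter spam collection) would have to be combined with another source before such a ratio is meaningful.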