Review Article

Machine Learning Techniques for Spam Detection in Email and IoT Platforms: Analysis and Research Challenges

Table 2. Comparison of supervised techniques for spam filtering.

| Authors | Algorithm | Dataset | Accuracy | Advantages | Limitations |
|---|---|---|---|---|---|
| DeBarr and Wechsler [42] | Random forest | Custom collection | 95.2% | Good accuracy achieved with an ensemble of multiple trees | The dataset used is not a standard benchmark |
| Rusland et al. [63] | Modified Naïve Bayes with selective features | Spambase and Spam Data | 88% on Spambase, 83% on Spam Data | Selected features reduce processing time | Lower accuracy, and the model is not very adaptive |
| Halu zu et al. [67] | Bayes Net, SVM, and NB | Twitter and Facebook dataset | 90% using SVM | A combined dataset is used for training and testing the classifiers | Multiple algorithms and a combined dataset increase training time |
| Hijawi et al. [41] | MLP, Naïve Bayes, random forest, and decision tree | SpamAssassin | 99.3% using random forest | A list of the most common spam features improves the detection rate | A sizeable corpus of 6050 emails is used, but only a small number of features are extracted from it |
| Banday and Jan [55] | Naïve Bayes, K-nearest neighbor, SVM, and additive regression tree | Real-life dataset | 96.69% using SVM | The spam filter is built from 8000 real-life spam emails | The model degrades because spammers continuously change the characteristics the filter relies on |
| Verma and Sofat [48] | ID3 algorithm and hidden Markov model | Enron dataset | 89% | A preclassified dataset reduces processing time | An 11% error rate is too high for a practical spam filter |
| Subasi et al. [40] | CART, C4.5, REP tree, LAD tree, and NB tree | UCI dataset | 95.1% | 10-fold cross-validation gives a more reliable evaluation | A small number of features is used |
| Zheng et al. [12] | SVM | Weibo social network data | 99.5% | Both user-content and behavior features are used to detect spammers | Feature extraction relies on statistical analysis and manual selection |
| Garavand et al. [72] | SVM, deep learning, and particle swarm optimization | Standard datasets from UCI (70% education data) | 93% using SVM | Deep learning models are used for feature extraction | Neural networks require substantial training time for feature extraction |
| Olatunji et al. [5] | ELM and SVM classifiers | Enron dataset | 94.06% using SVM | Higher accuracy than previous studies on the same dataset | SVM takes more time than ELM to reach the reported accuracy |
| Jamil et al. [10] | SVM, KNN, DT, and LR | Health fitness data | 92.1% using SVM | A smart-contract-enabled blockchain technique makes the system more secure | Interoperability of the proposed model with the IoT framework is not evaluated |
| Arif et al. [11] | XGBoost, bagged model, and generalized linear model with stepwise feature selection | Smart home dataset | 91.8% using the generalized linear model with stepwise feature selection | PCA is applied, which enhances the accuracy of the system | Climatic and surrounding features of IoT devices are not considered |
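
To make the comparison above concrete, the sketch below (a minimal illustration in Python with scikit-learn, not code from any of the cited studies) trains three of the compared classifiers, Naïve Bayes, SVM, and random forest, under the 10-fold cross-validation protocol reported by Subasi et al. [40]. The synthetic dataset, feature count, class balance, and hyperparameters are assumptions standing in for a labelled corpus such as UCI Spambase.

```python
# Minimal sketch: compare supervised spam classifiers with 10-fold cross-validation.
# Synthetic data stands in for a labelled corpus such as UCI Spambase; all
# settings below are illustrative assumptions, not the cited authors' choices.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder feature matrix: 2000 "emails" with 57 numeric attributes,
# mirroring the size of the Spambase feature set.
X, y = make_classification(n_samples=2000, n_features=57, n_informative=20,
                           weights=[0.6], random_state=0)

classifiers = {
    "Naive Bayes": GaussianNB(),
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
    "Random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

for name, model in classifiers.items():
    # 10-fold cross-validation yields a more reliable accuracy estimate
    # than a single train/test split.
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"{name:>16}: {scores.mean():.3f} ± {scores.std():.3f}")
```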
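
Several of the surveyed works (e.g., Hijawi et al. [41] and Zheng et al. [12]) depend on turning raw messages into features before classification. The following sketch is again only an assumption-laden illustration rather than any cited pipeline: TF-IDF word frequencies stand in for hand-crafted spam features and are fed to a linear SVM, and the four in-line messages and their labels are invented for demonstration.

```python
# Minimal sketch of a feature-extraction-plus-classifier spam filter.
# TF-IDF word frequencies stand in for the hand-crafted features used in the
# surveyed papers; the tiny corpus and labels below are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

messages = [
    "Congratulations, you have won a free prize, claim it now",
    "Meeting moved to 3 pm, see the attached agenda",
    "Cheap loans approved instantly, reply with your bank details",
    "Can you review the draft report before Friday?",
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

spam_filter = Pipeline([
    ("features", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("classifier", LinearSVC()),
])
spam_filter.fit(messages, labels)

# Unseen messages; with this toy corpus the first should score as spam.
print(spam_filter.predict(["Claim your free prize now"]))
print(spam_filter.predict(["Agenda for the Friday meeting"]))
```

In a realistic setting the same pipeline would be fitted on a corpus of the scale listed in Table 2 and evaluated with cross-validation as in the previous sketch.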