Review Article

Machine Learning Techniques for Spam Detection in Email and IoT Platforms: Analysis and Research Challenges

Table 2. Comparison of supervised techniques for spam filtering.

| Authors | Algorithm | Dataset | Accuracy | Advantages | Limitations |
|---|---|---|---|---|---|
| DeBarr and Wechsler [42] | Random forest | Custom collection | 95.2% | Good accuracy achieved with an ensemble of multiple trees | The dataset used is not a standard benchmark |
| Rusland et al. [63] | Modified Naïve Bayes with selective features | Spambase and Spam Data | 88% on Spambase, 83% on Spam Data | Selected features reduce processing time | Lower accuracy, and the model is not very adaptive |
| Halu zu et al. [67] | Bayes Net, SVM, and NB | Twitter and Facebook dataset | 90% using SVM | A combined dataset is used for training and testing the classifiers | Multiple algorithms and a combined dataset increase training time |
| Hijawi et al. [41] | MLP, Naïve Bayes, random forest, and decision tree | SpamAssassin | 99.3% using random forest | A list of the most common spam features improves the detection rate | A sizeable corpus of 6050 emails is used, but only a small number of features are extracted from it |
| Banday and Jan [55] | Naïve Bayes, K-nearest neighbor, SVM, and additive regression tree | Real-life dataset | 96.69% using SVM | The spam filter is built from 8000 real-life spam emails | The model degrades because spammers continuously change the characteristics the filter relies on |
| Verma and Sofat [48] | ID3 algorithm and hidden Markov model | Enron dataset | 89% | A preclassified dataset reduces processing time | An 11% error rate is too high for a practical spam filter |
| Subasi et al. [40] | CART, C4.5, REP tree, LAD tree, and NB tree | UCI dataset | 95.1% | 10-fold cross-validation gives a more reliable evaluation | A small number of features is used |
| Zheng et al. [12] | SVM | Weibo social network data | 99.5% | Both user-content and behavior features are used to detect spammers | Feature extraction relies on statistical analysis and manual selection |
| Garavand et al. [72] | SVM, deep learning, and particle swarm optimization | Standard datasets from UCI (70% education data) | 93% using SVM | Deep learning models are used for feature extraction | Neural networks require substantial training time for feature extraction |
| Olatunji et al. [5] | ELM and SVM classifiers | Enron dataset | 94.06% using SVM | Higher accuracy than previous studies on the same dataset | SVM takes more time than ELM to reach the reported accuracy |
| Jamil et al. [10] | SVM, KNN, DT, and LR | Health fitness data | 92.1% using SVM | A smart-contract-enabled blockchain technique makes the system more secure | Interoperability of the proposed model with the IoT framework is not evaluated |
| Arif et al. [11] | XGBoost, bagged model, and generalized linear model with stepwise feature selection | Smart home dataset | 91.8% using the generalized linear model with stepwise feature selection | PCA is applied, which enhances the accuracy of the system | Climatic and surrounding features of IoT devices are not considered |
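
To make the comparison above concrete, the sketch below (a minimal illustration in Python with scikit-learn, not code from any of the cited studies) trains three of the compared classifiers, Naïve Bayes, SVM, and random forest, under the 10-fold cross-validation protocol reported by Subasi et al. [40]. The synthetic dataset, feature count, class balance, and hyperparameters are assumptions standing in for a labelled corpus such as UCI Spambase.

```python
# Minimal sketch: compare supervised spam classifiers with 10-fold cross-validation.
# Synthetic data stands in for a labelled corpus such as UCI Spambase; all
# settings below are illustrative assumptions, not the cited authors' choices.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder feature matrix: 2000 "emails" with 57 numeric attributes,
# mirroring the size of the Spambase feature set.
X, y = make_classification(n_samples=2000, n_features=57, n_informative=20,
                           weights=[0.6], random_state=0)

classifiers = {
    "Naive Bayes": GaussianNB(),
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
    "Random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

for name, model in classifiers.items():
    # 10-fold cross-validation yields a more reliable accuracy estimate
    # than a single train/test split.
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"{name:>16}: {scores.mean():.3f} ± {scores.std():.3f}")
```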
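
Several of the surveyed works (e.g., Hijawi et al. [41] and Zheng et al. [12]) depend on turning raw messages into features before classification. The following sketch is again only an assumption-laden illustration rather than any cited pipeline: TF-IDF word frequencies stand in for hand-crafted spam features and are fed to a linear SVM, and the four in-line messages and their labels are invented for demonstration.

```python
# Minimal sketch of a feature-extraction-plus-classifier spam filter.
# TF-IDF word frequencies stand in for the hand-crafted features used in the
# surveyed papers; the tiny corpus and labels below are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

messages = [
    "Congratulations, you have won a free prize, claim it now",
    "Meeting moved to 3 pm, see the attached agenda",
    "Cheap loans approved instantly, reply with your bank details",
    "Can you review the draft report before Friday?",
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

spam_filter = Pipeline([
    ("features", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("classifier", LinearSVC()),
])
spam_filter.fit(messages, labels)

# Unseen messages; with this toy corpus the first should score as spam.
print(spam_filter.predict(["Claim your free prize now"]))
print(spam_filter.predict(["Agenda for the Friday meeting"]))
```

In a realistic setting the same pipeline would be fitted on a corpus of the scale listed in Table 2 and evaluated with cross-validation as in the previous sketch.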