Research Article  Open Access
N. Pérez-Díaz, D. Ruano-Ordás, F. Fdez-Riverola, J. R. Méndez, "Boosting Accuracy of Classical Machine Learning Antispam Classifiers in Real Scenarios by Applying Rough Set Theory", Scientific Programming, vol. 2016, Article ID 5945192, 10 pages, 2016. https://doi.org/10.1155/2016/5945192
Boosting Accuracy of Classical Machine Learning Antispam Classifiers in Real Scenarios by Applying Rough Set Theory
Abstract
Nowadays, spam deliveries represent a major obstacle to benefiting from the wide range of Internet-based communication forms. Despite the existence of different well-known intelligent techniques for fighting spam, only some specific implementations of the Naïve Bayes algorithm are finally used in real environments, for performance reasons. Since some of these algorithms suffer from a large number of false positive errors, in this work we propose a rough set postprocessing approach able to significantly improve their accuracy. In order to demonstrate the advantages of the proposed method, we carried out a straightforward study based on a publicly available standard corpus (SpamAssassin), which compares the performance of previously successful well-known antispam classifiers (i.e., Support Vector Machines, AdaBoost, Flexible Bayes, and Naïve Bayes) with and without the application of our developed technique. Results clearly evidence the suitability of our rough set postprocessing approach for increasing the accuracy of previously successful antispam classifiers when working in real scenarios.
1. Introduction and Motivation
Half a century ago, nobody could imagine the immense capabilities of current computing systems and network devices. Nowadays, they have drastically changed the way people share or exchange information and interact or communicate through full Internet access (24 hours a day) provided by last-generation devices. Actually, most Internet consumers use a smartphone (67.5%) or a tablet (42.3%) to access their email accounts [1, 2].
Since email can be read anywhere at any time, spammers found this service particularly appropriate for delivering spam content. On the one hand, the usage of the email service has experienced explosive growth, achieving an average of 538.1 million messages sent daily during 2015, which represents an interannual increase of 5% since 2010 [3]. On the other hand, the percentage of spam emails suffered a slight reduction, representing an interannual decrease of 3.4% since 2010 [4]. Taking this situation into account, it is easy to realize that spam deliveries remain a problem to be solved in modern society. To cope with this situation, the software industry (headed by Internet security enterprises) has been continuously improving existing antispam filtering techniques and systems in order to enhance both filtering throughput [5–7] and classification accuracy.
Regarding classification accuracy, during the last decade, different research works have introduced the definition of several antispam domain authentication schemes (e.g., SPF [8] and RBL/RWL [9]), the description of novel collaborative approaches (e.g., DCC [10]), and the usage of diverse machine learning (ML) alternatives. In this connection, previously successful techniques such as Artificial Immune Systems (AIS) [11, 12], Case-Based Reasoning (CBR) systems [13, 14], different topologies of artificial neural networks (ANN) [15, 16], some simple but effective algorithms like k-NN [17, 18], Support Vector Machines (SVM) [19, 20], and different implementations of the well-known Naïve Bayes (NB) algorithm should be mentioned [21–23].
However, despite the large number of ML classifiers that have proven to be useful to fight against spam, only NB has been typically included by default in popular antispam filtering products such as SpamAssassin [24] and Wirebrush4SPAM [5], due essentially to its adequate balance between the accuracy obtained and the associated computational cost [21, 22].
This is particularly true because in the antispam filtering domain the number of false positive (FP) errors made by the classifier while processing legitimate contents is of utmost importance [25]. This aspect still represents a major challenge for current techniques commonly applied in the area, especially when working in real and dynamic environments characterized by (i) the subjective nature of the spam concept, (ii) the adverse effects of concept drift, and (iii) the coexistence of multiple languages in individual mailboxes. To cope with this situation, Google (considered one of the most valuable brands in the world [26]) decided to equip Gmail with a user-guided learning mechanism. As described in [27], this technology makes use of an ANN that takes the Gmail users' classification criteria into account as feedback information for the neural network. In this context, it is obvious that the accuracy of this approach is directly proportional to the number of Gmail users. As a result, the large number of active Gmail accounts (more than 900 million in 2015 [28]) allows the Gmail antispam filtering system to achieve a classification accuracy of up to 99%. Consequently, it is easy to realize that, due to its dependence on the number of users (to achieve suitable classification results), this approach can only be applied to email services with a large number of active users.
As a direct consequence of the underlying operation mode, this strategy cannot be extrapolated to those email services belonging to SMEs (Small and Medium Enterprises), since the number of email users tends to be insufficient to achieve accurate classification rates. This situation has motivated SMEs to continue using typical antispam filtering frameworks such as SpamAssassin or Wirebrush4SPAM.
In such a situation, the continuous development and deployment of both existing and novel antispam techniques over classical filtering frameworks continues to be a necessity for the SME environment. Specifically, we consider the reduction of type I (false positive) errors extremely important. To this end, in this work, we propose the use of rough set (RS) theory due to its ability to deal with uncertainty and avoid type I errors [29].
In detail, RS theory was initially proposed by Pawlak in the 80s [30, 31], providing a formal methodology for the automatic transformation of data into knowledge [32]. The philosophy of this method is based on the assumption that any inexact concept (e.g., denoted by a class label) can be approximated from above and from below (by an upper and a lower approximation) using an indiscernibility relationship. As detailed in [33], one of the most important characteristics of RS theory is the ability to discover redundancy and dependencies between features.
Additionally, RS could provide interesting benefits to the correct classification of emails as they guarantee (i) effectiveness in discovering hidden patterns from data, (ii) the possibility of using both quantitative and qualitative information, (iii) capability to evaluate the significance of data, (iv) finding the minimal set of useful data that minimizes the overall classification complexity, (v) the automatic generation of a decision ruleset from scratch, and (vi) the identification of previously unknown relationships. All of these inherent features, together with some positive results achieved in previous works [29], suggested to us the possibility of creating a RS postprocessing algorithm applicable to any ML classifier working as a standalone antispam filter. In this line, the present work introduces the proposal of a postprocessing algorithm and shows the viability of the idea from an experimental point of view.
While this section has introduced and motivated our proposal, the rest of the paper is organized as follows: Section 2 summarizes previous related approaches that also make use of RS theory in the antispam filtering domain. Section 3 details the developed algorithm that applies RS theory to extract domain specific decision rules from data, which will later guide the final revision of the initial proposed classification. Section 4 provides a clear description of the experimental protocol and documents the benchmark results obtained from the executed experiments. Finally, Section 5 provides conclusions and identifies future research work.
2. Related Work: Applying RS to Antispam Filtering
As previously stated, and mainly motivated by the massive proliferation of spamming activities, many researchers have studied the effectiveness of different approaches applied to the detection of illegitimate emails and other forms of spam [5, 8–25]. In this context, although several ML alternatives have been successfully used to categorize different email corpora, recent studies have demonstrated the suitability of applying RS to specifically characterize messages comprising disjoint concepts (such as spam) [29].
In this line, Pérez-Díaz et al. [29] proposed three different execution schemes for using specific rules generated by applying RS theory. They compared these approaches against other well-known successful antispam techniques and reported a considerable reduction in the number of FP errors. Complementarily, Glymin and Ziarko [34] conducted a study to evaluate the use of variable precision RS (VPRS) [35] in the antispam filtering domain. In this work, a set of private Hotmail messages was collected during two years and VPRS was used to establish a decision table for classifying emails into two possible categories (i.e., spam or legitimate).
From a different perspective, some research studies focused their efforts on maintaining those rules generated through the use of RS [36–38]. These works proposed different frameworks to share generated rules from servers with the final goal of giving adequate support to a collaborative community interested in spam filtering. In the work of Chiu et al. [36], both the rule updating procedure and the policy for deleting obsolete rules are centralised in collaborative servers with the goal of immediately sharing available changes with the community. Additionally, the work of Lai et al. [37] introduces the generation of rules by means of RS, genetic algorithms, and reinforcement learning. Finally, the study carried out by Lai et al. [38] proposed novel methods to generate rules and validate their precision.
From another point of view, the work of Yang [39] proposed a framework (called RCFG) that combines RS and ant colony optimization to apply an initial filtering to the available data. Afterwards, the proposed approach uses a genetic algorithm to carry out feature selection. Finally, different classifiers (i.e., SVM, k-NN, ANN, and NB) are used to identify spam emails.
Furthermore, several works are also available that make use of RS to support three-way classification schemes. This type of alternative involves the definition of a third category (i.e., "suspicious") to include those messages that cannot be easily classified as spam or legitimate. Following this approach, Zhao and Zhu [40] made use of the forward selection method [41] to generate a training corpus formed by eleven attributes and demonstrated the superiority of their VPRS-based algorithm when compared with Naïve Bayes. In the same line, the authors of [42, 43] initially reduced the data attributes (also making use of the forward selection method), applying genetic algorithms to calculate RS reducts.
Complementarily, several researchers concentrated their efforts on applying the decision-theoretic RS (DTRS) model to three-way classification [44, 45]. In DTRS, the two thresholds that separate the spam, suspicious, and ham categories are initially calculated by using Bayesian theory in an automated way. Afterwards, classification with DTRS is carried out by means of a set of loss functions, which obtains the best classification with the minimal risk. In [44], a three-way decision model based on DTRS was compared with Naïve Bayes to evidence a reduction in error rates. Zhao et al. [45] proposed a novel approach based on the positive region of DTRS and compared the achieved results with Naïve Bayes and other RS-based models.
Finally, Jia and colleagues [46, 47] enumerated the many benefits of three-way decision approaches and introduced the further challenge of discovering what to do with suspicious emails and how they can be examined in detail.
3. Using RS to Extract and Apply Domain Specific Decision Rules for Improving Accuracy
As can be seen from the last section, during the last few years a wide variety of contributions showing the applicability of RS [30–33] to the antispam filtering domain have been presented. However, to the best of our knowledge, there is no valid approach able to combine the fast execution speed of some successful ML classifiers with the good accuracy achieved by RS alternatives.
Therefore, in this work, we propose an innovative way to review the final output given by standard classifiers (in the form of a postprocessing algorithm) with the goal of reducing the number of type I (FP) errors. In this line, the generation of our complementary RS decision rules is carried out by using the same data (email corpus) as in the case of the classifier (see Figure 1) but being applied only when a new incoming email is initially classified as spam. By following this straightforward approach, our method becomes potentially applicable to any classifier.
As shown in Figure 1, the whole filtering process involves an initial feature extraction phase used to gather the specific values needed for representing a new incoming email as an adequate input for the selected classifier. After that, the classification model guesses the class of the message, generating an initial output. In case the message is categorized as spam, it is further revised by our automatically generated RS decision rules before reaching a final classification. These revision rules are generated by our knowledge acquisition and representation module (shown in the right part of Figure 1), which is structured into two different stages: (i) feature selection and (ii) computation of RS rules.
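The revision workflow just described can be sketched in Java as follows. All class and interface names here are illustrative assumptions of ours; the paper's reference implementation (AdditionalFile1.java) covers only the rule extraction step, not this wrapper.

```java
// Sketch of the post-processing workflow in Figure 1: the RS revision rules
// are consulted ONLY when the base classifier initially says "spam", so
// legitimate verdicts are returned unchanged. Names are illustrative.
public class RevisionWorkflow {

    public enum Label { SPAM, LEGITIMATE }

    // Base classifier abstraction (e.g., NB, Flexible Bayes, AdaBoost, SVM).
    public interface Classifier {
        Label classify(boolean[] features);
    }

    // RS revision rules abstraction; returns the revised label.
    public interface RuleSet {
        Label revise(boolean[] features);
    }

    private final Classifier baseClassifier;
    private final RuleSet revisionRules;

    public RevisionWorkflow(Classifier c, RuleSet r) {
        this.baseClassifier = c;
        this.revisionRules = r;
    }

    // Apply the base classifier, then revise only initial spam verdicts.
    public Label filter(boolean[] features) {
        Label initial = baseClassifier.classify(features);
        if (initial == Label.SPAM) {
            return revisionRules.revise(features);
        }
        return initial;
    }
}
```

Because messages initially classified as legitimate bypass the rules entirely, the post-processing cost is only paid on the (minority) spam-labeled traffic.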
In order to carry out the initial feature selection stage, a dense dataset should be generated from the messages that comprise the email corpus. To do this, each column included in the dataset (condition attribute), a_i, represents the existence or absence of a given token (i.e., the smallest portion of text enclosed by two characters included in the [[:blank:]] class) in the email corpus. Therefore, the number of condition attributes of the newly generated dataset, n, is equal to the number of different tokens included in any message belonging to the email corpus. Moreover, the real (known) class of each message (decision attribute) is also included as the last column of the dataset, being represented using a binary variable. In this context, the set of instances stored in the dataset is denominated the universe, U, and its cardinality, |U|, is equal to the number of messages finally represented.
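This dataset construction step can be sketched as follows; the class and method names are illustrative assumptions (the paper's supplementary code covers rule extraction rather than preprocessing), but the tokenization criterion, splitting on the [[:blank:]] character class, follows the text above.

```java
import java.util.*;

// Sketch of the dense-dataset construction: each condition attribute records
// the presence/absence of a token (text split on blank characters), and the
// last column stores the known class (the decision attribute).
public class DenseDataset {

    // Tokenize on the [[:blank:]] character class (space and tab).
    public static Set<String> tokenize(String emailBody) {
        Set<String> tokens = new HashSet<>();
        for (String t : emailBody.split("[ \t]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Build the binary condition-attribute matrix plus decision column.
    // Rows = messages (the universe U); columns = distinct tokens (n) + class.
    public static boolean[][] build(List<String> emails, List<Boolean> isSpam,
                                    List<String> vocabularyOut) {
        SortedSet<String> vocab = new TreeSet<>();
        for (String e : emails) vocab.addAll(tokenize(e));
        vocabularyOut.addAll(vocab);

        boolean[][] table = new boolean[emails.size()][vocab.size() + 1];
        for (int i = 0; i < emails.size(); i++) {
            Set<String> present = tokenize(emails.get(i));
            int j = 0;
            for (String token : vocab) table[i][j++] = present.contains(token);
            table[i][vocab.size()] = isSpam.get(i); // decision attribute
        }
        return table;
    }
}
```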
During the feature selection stage, we reduce the dimensionality of the condition attributes that are part of the initial input dataset. To this end, we apply two complementary procedures: (i) stop word removal and (ii) feature ranking. The first one comprises the elimination of those tokens having fewer than 3 characters and/or being included in the stop word list provided by Baeza-Yates and Ribeiro-Neto [48]. Then, we take advantage of Information Gain (IG) [49–51] to evaluate the suitability of each attribute included in the dataset. From all the available columns, we select the 100 best-ranked attributes and discard the rest of the information [29]. Table 1 introduces an example of the result achieved after the execution of the feature selection stage, showing only 8 token attributes (a1–a8) and 8 emails due to the lack of space. Additionally, we maintain the decision attribute (x1), corresponding to the real (known) class, in the dataset (represented in the 9th column).
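The IG scoring used for the ranking step can be sketched as below. This is the standard information gain definition, H(class) − H(class | attribute), applied to one binary attribute column; it is a sketch with illustrative names, not code from the paper.

```java
// Sketch of the Information Gain computation used to rank binary token
// attributes against the spam/legitimate class before keeping the top 100.
public class FeatureSelection {

    // Binary entropy of a positive/negative count pair, in bits.
    private static double entropy(int pos, int neg) {
        int n = pos + neg;
        if (n == 0) return 0.0;
        double e = 0.0;
        for (int c : new int[]{pos, neg}) {
            if (c > 0) {
                double p = (double) c / n;
                e -= p * (Math.log(p) / Math.log(2));
            }
        }
        return e;
    }

    // IG of one binary attribute column with respect to the class column:
    // H(class) - [P(a=T) * H(class|a=T) + P(a=F) * H(class|a=F)].
    public static double informationGain(boolean[] attribute, boolean[] spam) {
        int n = attribute.length;
        int posAll = 0, t = 0, tPos = 0, fPos = 0;
        for (int i = 0; i < n; i++) {
            if (spam[i]) posAll++;
            if (attribute[i]) { t++; if (spam[i]) tPos++; }
            else if (spam[i]) fPos++;
        }
        int f = n - t;
        double cond = ((double) t / n) * entropy(tPos, t - tPos)
                    + ((double) f / n) * entropy(fPos, f - fPos);
        return entropy(posAll, n - posAll) - cond;
    }
}
```

An attribute that perfectly predicts the class scores 1 bit; an attribute independent of the class scores 0, so sorting by this value and truncating yields the reduced dataset.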

From the information stored in the dense dataset represented in Table 1, and applying RS theory, we designed a deterministic approach to generate a set of accurate revision rules [52], which will later be applied to the standard workflow represented in Figure 1. In this context, a rule establishes a specific combination of values for some condition attributes (i.e., rule.ai = value) that determines a solution for a certain decision attribute (rule.decision = solution). The proposed algorithm that carries out the rule extraction process is introduced in Algorithm 1. For representation purposes, a value of ? in a condition attribute, ai, means that this feature should not be taken into consideration.
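This rule representation can be sketched as follows; encoding the ? (don't care) value as null and the class names are our assumptions for illustration.

```java
import java.util.Arrays;

// Sketch of a revision rule: one value per condition attribute, where null
// plays the role of '?' (attribute not taken into consideration), plus the
// predicted decision and the cardinality of the rule's coverage set.
public class RevisionRule {

    private final Boolean[] conditions; // null = '?' (attribute ignored)
    public final boolean decision;      // predicted class for x1
    public final int coverage;          // number of matching training samples

    public RevisionRule(Boolean[] conditions, boolean decision, int coverage) {
        this.conditions = Arrays.copyOf(conditions, conditions.length);
        this.decision = decision;
        this.coverage = coverage;
    }

    // A message matches when every non-'?' condition agrees with its features.
    public boolean matches(boolean[] features) {
        for (int i = 0; i < conditions.length; i++) {
            if (conditions[i] != null && conditions[i] != features[i]) {
                return false;
            }
        }
        return true;
    }
}
```

For example, the rule "IF a8 = TRUE THEN x1 = TRUE" over 8 attributes keeps null in every position except the eighth.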

As shown in Algorithm 1, for each email stored in the dataset, a new rule is generated through the computation of the shortest reduct (computeShortestReduct function) for a given concept, X, which is defined as 1 for the same email, ? for messages of the same class, and 0 for other instances (lines (08)–(12) in Algorithm 1). In this context, a reduct is a minimal (irreducible) subset of features, B ⊆ A, having the same precision to guess a concept X as the whole set of condition attributes in A. In order to assess the potential for classification of a set of condition attributes, B, all the instances in U should be grouped into different subsets, where each subset contains all the indiscernible (indistinguishable) instances. This grouping is known as the set of equivalence classes, U/B.
Two instances are indiscernible regarding the condition attribute set, B, if they share the same values for all their attributes. Taking this into consideration, the potential for classification of the condition attributes included in B is measured by computing the lower approximation of the concept X, B_*(X). In this context, B_*(X) is the union of the equivalence classes of U/B having at least one positive instance e, X(e) = 1, and no negative instance e, X(e) = 0. Expression (1) shows the formal definition of the lower approximation of B for the decision concept X:

B_*(X) = ∪ { E ∈ U/B : (∃e ∈ E, X(e) = 1) ∧ (∄e ∈ E, X(e) = 0) }. (1)
If we now consider the example shown in Table 1, given that all the represented instances are discernible, |U/A| = |U|, the lower approximation of the concept X with the attributes included in A is the set of its positive instances. Moreover, a proper subset of features B ⊂ A is a reduct regarding the concept X whenever it preserves this lower approximation and, hence, B_*(X) = A_*(X).
Keeping in mind the existence of undefined values (?) for the concept X (considered in the algorithm shown in Algorithm 1), two lower approximations are equivalent if they only differ in those instances e having an undefined value for the concept.
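A minimal sketch of Expression (1) over a binary dataset, including the undefined ? concept value, is shown below. The data layout and names are illustrative assumptions; the logic mirrors the definition above, where a ? instance neither qualifies nor disqualifies its equivalence class.

```java
import java.util.*;

// Sketch of the lower approximation B_*(X): group instances into equivalence
// classes under the attribute subset B, then keep every class that contains a
// positive instance (X = 1) and no negative one (X = 0). Instances with the
// undefined value '?' (encoded here as null) do not block acceptance.
public class LowerApproximation {

    // Equivalence classes U/B: instances with identical values on subsetB.
    public static Collection<List<Integer>> equivalenceClasses(
            boolean[][] data, int[] subsetB) {
        Map<String, List<Integer>> classes = new LinkedHashMap<>();
        for (int i = 0; i < data.length; i++) {
            StringBuilder key = new StringBuilder();
            for (int a : subsetB) key.append(data[i][a] ? '1' : '0');
            classes.computeIfAbsent(key.toString(), k -> new ArrayList<>()).add(i);
        }
        return classes.values();
    }

    // Indices of instances in the lower approximation of concept X.
    // concept[i] is TRUE (1), FALSE (0), or null ('?', undefined).
    public static Set<Integer> lower(boolean[][] data, int[] subsetB,
                                     Boolean[] concept) {
        Set<Integer> result = new TreeSet<>();
        for (List<Integer> eq : equivalenceClasses(data, subsetB)) {
            boolean hasPositive = false, hasNegative = false;
            for (int i : eq) {
                if (Boolean.TRUE.equals(concept[i])) hasPositive = true;
                if (Boolean.FALSE.equals(concept[i])) hasNegative = true;
            }
            if (hasPositive && !hasNegative) result.addAll(eq);
        }
        return result;
    }
}
```

A reduct search can then compare `lower(data, candidateSubset, concept)` against the lower approximation obtained with the full attribute set.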
Therefore, using the reference implementation of the proposed technique (refer to AdditionalFile1.java from the Supplementary Material available online at http://dx.doi.org/10.1155/2016/5945192 for its Java implementation), we extracted the rules from the example data source included in Table 1. The extracted rules are shown as follows.
Revision Rules Generated by the Proposed Algorithm for the Example Shown in Table 1 (coverage set cardinality in parentheses):
(2) IF a6 = TRUE THEN x1 = FALSE
(2) IF a7 = TRUE THEN x1 = TRUE
(2) IF a8 = TRUE THEN x1 = TRUE
(4) IF a8 = FALSE THEN x1 = FALSE
As shown above, the rules generated by our proposed algorithm are simple and easy to execute. Therefore, the postprocessing stage (labeled as RS-based decision in Figure 1) will not involve the usage of a great amount of computational resources. In addition, each rule generated by our algorithm includes the number of samples from the training dataset that match it (also known as the coverage set cardinality). This information is very useful when a target message matches two or more conflicting rules. In this case, we use a voting scheme with the cardinality of the coverage set as vote weight. After that, if the obtained result is equal for both the spam and legitimate categories, the legitimate one is selected for the target email.
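The coverage-weighted voting with its tie-break toward the legitimate class can be sketched as follows; representing the matching rules as (decision, coverage) pairs is our simplification for illustration.

```java
import java.util.List;

// Sketch of the conflict-resolution step: when a message matches several
// conflicting rules, each rule votes with the cardinality of its coverage
// set; on a tie, the legitimate class wins (the safer outcome for the user).
public class RuleVoting {

    public static final class Vote {
        public final boolean spamDecision; // rule's predicted class
        public final int coverage;         // coverage set cardinality (weight)
        public Vote(boolean spamDecision, int coverage) {
            this.spamDecision = spamDecision;
            this.coverage = coverage;
        }
    }

    // Returns true when the weighted vote classifies the message as spam.
    public static boolean isSpam(List<Vote> matchingRules) {
        int spamWeight = 0, legitimateWeight = 0;
        for (Vote v : matchingRules) {
            if (v.spamDecision) spamWeight += v.coverage;
            else legitimateWeight += v.coverage;
        }
        // Strict inequality: a tie resolves to the legitimate category.
        return spamWeight > legitimateWeight;
    }
}
```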
4. Model Benchmarking
In order to demonstrate the suitability of applying RS theory for improving the accuracy of previously successful ML classifiers in the antispam filtering domain, we designed an experimental protocol to execute our testbed. In Section 4.1, we include a description of this protocol, introducing the reasons supporting our specific corpus selection, detailing several preprocessing issues, and defining the 10-fold cross validation scheme as well as the different evaluation measures used. Complementarily, in Section 4.2, we present and discuss the obtained results.
4.1. Experimental Protocol
With the goal of evidencing whether the combination of ML techniques with RS is adequate to reduce type I (FP) errors, we analyzed several publicly available datasets in order to select one able to ensure the validity of our experimental results. In this line, the most widespread are SpamAssassin [53], Ling-Spam [54], PU1 [54], PU2 [54], PU3 [54], PUA [54], TREC [55–57], and Spambase from the UCI repository [58]. Table 2 compiles relevant information about these corpora, including the percentage of legitimate and spam emails and the total number of available messages.
 
^{1}Available at https://labs-repos.iit.demokritos.gr/skel/i-config/downloads/. ^{2}Available at http://ftp.ics.uci.edu/pub/machine-learning-databases/spambase/. ^{3}Available at http://trec.nist.gov/data/spam.html. ^{4}Available at https://spamassassin.apache.org/publiccorpus/.
First of all, the Ling-Spam corpus contains legitimate messages collected from a linguistics mailing list merged with some spam messages directly compiled by its authors. It only includes 481 spam messages (16.6% of the total) and 2412 legitimate instances. Because of the small number of spam messages, most ML classifiers are affected by imbalanced learning [59] and, therefore, this dataset is not adequate for general experiments.
Secondly, the PU1, PU2, PU3, and PUA corpora are distributed into 10 separate parts to facilitate the execution of 10-fold cross validation experiments [60]. As shown in Table 2, these corpora present different percentages of spam messages (43.8%, 20%, 49%, and 50%, resp.), making them appropriate to avoid the imbalanced data problem. However, due to the format used for their original representation, the usage of stop word lists, stemming, and other techniques based on gathering information from the email header is not supported. Since our approach requires the application of preprocessing techniques (e.g., usage of a stop word list), we have ruled out their use.
In the case of the Spambase corpus, it contains 4601 messages (60.6% being spam) represented as feature vectors with information about 57 attributes. Due to the reduced dimensionality (number of attributes) of this corpus, we found it unsuitable for the study.
Next, as described in Table 2, the TREC conference provides three corpora grouped according to mailing date (2005, 2006, and 2007, resp.) with different percentages of spam and ham messages (43%, 35%, and 33.5%, resp.). These corpora were built following the standard Internet message format (described in RFC 2822 [61]), keeping the original content of the messages unaltered. The preprocessing of these corpora does not include the detection and removal of duplicates.
Finally, SpamAssassin is one of the corpora most widely used by the antispam filtering community. It includes a total of 9332 messages, of which 25.5% are spam emails. This standard corpus was built by the SpamAssassin developers without altering the original content of the messages. The preprocessing of this corpus (distributed in RFC 2822 format) included the removal of duplicates and the anonymization of specific data with the goal of guaranteeing receiver privacy. The ratio between the size of the corpus (medium-sized) and the proportion of spam and ham messages makes the SpamAssassin corpus the most suitable dataset for our experiments.
In order to demonstrate the benefits of our proposal in the antispam filtering domain, we selected four well-known and widely used ML classifiers: Naïve Bayes [62], Flexible Bayes [62], AdaBoost [63], and SVM [64–66]. Regarding their specific implementation, we chose the standard version of these classifiers included in the Weka Data Mining Software (available at http://www.cs.waikato.ac.nz/~ml/weka/). To successfully use the Naïve and Flexible Bayes Weka implementations, the dimensionality of the input feature vectors was limited to 1000 characteristics (using the IG feature ranker). Moreover, the Naïve Bayes classifier was executed using binary features, while Flexible Bayes was evaluated with continuous attributes (frequencies). Additionally, AdaBoost was configured to use Decision Stumps as base classifiers and 150 boosting iterations. Complementarily, using the IG method, we reduced the dimensionality of its input vectors down to 700 binary features. Finally, a 1-degree polynomial function was selected as kernel for the SMO algorithm (the Weka SVM implementation), which was executed using binary feature vectors with a size of 2000 (reduced using the IG feature ranker).
All these parameters were established taking into consideration the integral evaluation methodology proposed by Pérez-Díaz et al. [25] for accurately ranking different content-based spam filtering models. Additionally, in the work of Méndez et al. [49], IG showed the best performance for all the compared models, while in [25] the authors experimentally computed the best number of features (using the IG feature ranker) for all the available classifiers. Finally, with the goal of ensuring the validity of our results, all the experiments were conducted under a stratified 10-fold cross validation schema [60].
To correctly assess the performance achieved by applying our RS revision method when compared to the independent execution of the ML classifiers, we have chosen four groups of well-known measures: (i) percentage of correctly classified messages, false positive and false negative (FN) errors, (ii) F-score (also known as F1 score or F-measure) [67, 68], (iii) balanced F-score (Fβ) [68], and (iv) Total Cost Ratio (TCR) [22].
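For reference, the last three measure groups follow the standard definitions Fβ = (1 + β²)PR / (β²P + R), with precision P and recall R, and TCR = nspam / (λ·FP + FN). The sketch below uses these textbook formulas and is not taken from the paper's implementation.

```java
// Sketch of the evaluation measures used in Section 4: the weighted F-score
// (beta < 1 penalizes false positives more heavily than false negatives) and
// the Total Cost Ratio, which compares the cost of using the filter against
// letting all spam messages through.
public class SpamMetrics {

    // F_beta = (1 + b^2) * P * R / (b^2 * P + R), with P = tp/(tp+fp)
    // and R = tp/(tp+fn); beta = 1 gives the usual F1 score.
    public static double fScore(double beta, int tp, int fp, int fn) {
        double precision = (double) tp / (tp + fp);
        double recall = (double) tp / (tp + fn);
        double b2 = beta * beta;
        return (1 + b2) * precision * recall / (b2 * precision + recall);
    }

    // TCR = nSpam / (lambda * FP + FN); higher is better, and values above 1
    // mean the filter is preferable to no filtering at all.
    public static double tcr(double lambda, int nSpam, int fp, int fn) {
        return nSpam / (lambda * fp + fn);
    }
}
```

Raising λ (e.g., 9 or 999) models scenarios where losing a legitimate message is far more costly than letting one spam message through, which is the setting analyzed in Figure 2.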
4.2. Obtained Results and Discussion
By applying the experimental protocol defined in the previous section, we straightforwardly evaluate the suitability of our proposed approach to improve the performance of different widely recognized ML classifiers. In this context, Table 3 shows the percentage analysis of the different types of errors (FP and FN) as well as the hits achieved by the analyzed ML techniques, giving specific information about the performance gain obtained by the use of the proposed RS-based approach. As described in Section 3, RS rules are automatically applied to revise the output of each ML classifier when it initially classifies a given message as spam.

As initially shown in Table 3, the percentage of correct classifications (% OK) using ML techniques was improved when RS revision rules were applied, with the only exception of the Flexible Bayes algorithm. The particular behavior of the Flexible Bayes classifier can be explained by its very high number of FN errors, which cannot be successfully addressed by our proposal, as it is only applied in those cases in which an incoming email is initially classified as spam. In the light of these results, the overall combination of ML techniques with the proposed revision approach was able to reduce the number of misclassifications of legitimate emails. This behavior avoids the incorrect filtering of messages relevant to the end user with a minimal footprint in FN errors (ability to detect spam).
With the goal of having a more insightful perspective on these initial results, we also computed F-score and balanced Fβ score values, merging recall and precision for the different alternatives. Table 4 presents the obtained results.

As shown in Table 4, the combination of the precision and recall measures with the same weight (β = 1) evidences slightly worse results when applying RS in combination with Flexible Bayes and SVM. However, this weighting is unrealistic from a real user perspective, for which the two types of classification errors have very different importance. In this line, Table 4 reveals that when increasing the penalization of type I (FP) errors (using lower values of β), the RS-based revision approach achieves great evaluation results.
In this context, and with the goal of providing a further analysis about the real impact of type I errors from a costsensitive point of view, we carried out TCR evaluations for all the analyzed models. These results are shown in Figure 2.
(a) TCR score with λ = 1 and λ = 9
(b) TCR score with λ = 999
As clearly shown in Figure 2(a), if the cost of an FP error is considered as important as an FN misclassification (λ = 1), the SVM and Flexible Bayes classifiers do not achieve additional benefits. However, a significant improvement is obtained by the application of our automatic revision procedure when working in real scenarios (a situation modeled by assigning higher values to λ).
5. Conclusions and Future Work
In this work, we have presented an RS-based postprocessing technique able to reduce the type I (FP) errors made by different well-known classifiers previously applied in the antispam filtering domain. To this end, we have designed a straightforward algorithm able to extract simple and complementary revision rules exploiting the same corpus used to train the original classifiers. Our approach is only applied to those messages initially classified as spam, alleviating the use of valuable computational resources in real implementations.
The results achieved by the execution of the experimental protocol have demonstrated the effectiveness of our proposal for improving the performance of different ML classifiers. Particularly, different cost-sensitive measures (such as TCR or the balanced F-score) yielded accurate rates for our RS-based revision approach when dealing with type I errors. The main advantage of its combined execution is an increase in classification hits, which is an important issue for improving the end user experience.
Moreover, the impact on the time required for carrying out the final classification when our proposed method is applied is negligible because (i) the postprocessing is not applied on each classification (only for messages initially classified as spam) and (ii) the time and computer resources needed to evaluate the matching of rules are very low. Additionally, the knowledge acquisition and representation process represented in Figure 1 (as well as the training of the standard ML classifiers) can be executed on a different machine with the goal of saving computational resources on the hardware used to deploy the antispam filter.
The main drawback of our approach is the deterministic nature of the generated revision rules. In this regard, Pawlak and colleagues [52] have shown the limitations of deterministic RS approaches when compared to probabilistic ones that work with the information uncertainty inherent in many classification problems (such as spam). Additionally, the main advantage of probabilistic models lies in providing a unified approach for both deterministic and nondeterministic knowledge representation systems. Taking this idea into account, our main line of future research work includes searching for complementary probabilistic approaches able to generate rules that outperform the capabilities of our current algorithm. Moreover, in order to complement our current work, we also find the identification of novel feature selection and extraction methods interesting. To this end, we believe that regular expressions representing more than one token could be more effective than features made up of a single one. Finally, we also find interesting the idea of carrying out a dynamic validation of rules in order to detect when they become obsolete.
Competing Interests
The authors declare that there are no competing interests regarding the publication of this paper.
Acknowledgments
This work has been partially funded by (i) the 14VI05 Contract-Programme from the University of Vigo, (ii) the INOU1506 Project from the University of Vigo, and (iii) Agrupamento INBIOMED from DXPCTSUG-FEDER "unha maneira de facer Europa" (2012/273). The SING group thanks CITI (Centro de Investigación, Transferencia e Innovación) from the University of Vigo for hosting its IT infrastructure.
Supplementary Materials
AdditionalFile1.java is a reference implementation of the rule extraction method introduced in this work. The implementation has been developed in Java and can be easily executed using a Java Runtime Environment.
References
[1] J. van Rijn, "The ultimate mobile email statistics overview," 2015, http://www.emailmonday.com/mobileemailusagestatistics.
[2] J. Jordan, 53% of Emails Opened on Mobile, Litmus, 2015, https://litmus.com/blog/53ofemailsopenedonmobileoutlookopensdecrease33.
[3] The Radicati Group, Inc., Email Statistics Report, 2013–2017, 2015, http://www.radicati.com/wp/wpcontent/uploads/2013/04/EmailStatisticsReport20132017ExecutiveSummary.pdf.
[4] Statista, Global Email Spam Rate 2012–2015, 2016, http://www.statista.com/statistics/270899/globalemailspamrate/.
[5] N. Pérez-Díaz, D. Ruano-Ordás, F. Fdez-Riverola, and J. R. Méndez, "Wirebrush4SPAM: a novel framework for improving efficiency on spam filtering services," Software: Practice and Experience, vol. 43, no. 11, pp. 1299–1318, 2013.
[6] D. Ruano-Ordás, J. Fdez-Glez, F. Fdez-Riverola, and J. R. Méndez, "Effective scheduling strategies for boosting performance on rule-based spam filtering frameworks," Journal of Systems and Software, vol. 86, no. 12, pp. 3151–3161, 2013.
[7] D. Ruano-Ordás, J. Fdez-Glez, F. Fdez-Riverola, and J. R. Méndez, "Using new scheduling heuristics based on resource consumption information for increasing throughput on rule-based spam filtering systems," Software: Practice and Experience, 2015.
[8] S. Görling, "An overview of the Sender Policy Framework (SPF) as an anti-phishing mechanism," Internet Research, vol. 17, no. 2, pp. 169–179, 2007.
[9] J. M. M. da Cruz, Spam: Classement Statistique de Messages Électroniques: Une Approche Pragmatique, Presses des Mines, 2012.
[10] Rhyolite Inc., Distributed Checksum Clearinghouses, 2015, http://www.rhyolite.com/dcc/.
[11] J. Timmis, A. Hone, T. Stibor, and E. Clark, "Theoretical advances in artificial immune systems," Theoretical Computer Science, vol. 403, no. 1, pp. 11–32, 2008.
[12] J. Timmis, T. Knight, L. N. de Castro, and E. Hart, "An overview of artificial immune systems," in Computation in Cells and Tissues, pp. 51–91, Springer, Berlin, Germany, 2004.
[13] S. J. Delany, P. Cunningham, A. Tsymbal, and L. Coyle, "A case-based technique for tracking concept drift in spam filtering," Knowledge-Based Systems, vol. 18, no. 4-5, pp. 187–195, 2005.
[14] F. Fdez-Riverola, E. L. Iglesias, F. Díaz, J. R. Méndez, and J. M. Corchado, "SpamHunting: an instance-based reasoning system for spam labelling and filtering," Decision Support Systems, vol. 43, no. 3, pp. 722–736, 2007.
[15] C.-H. Wu, "Behavior-based spam detection using a hybrid method of rule-based techniques and neural networks," Expert Systems with Applications, vol. 36, no. 3, part 1, pp. 4321–4330, 2009.
[16] A. H. Mohammad and R. A. Abu Zitar, "Application of genetic optimized artificial immune system and neural networks in spam detection," Applied Soft Computing Journal, vol. 11, no. 4, pp. 3827–3845, 2011.
[17] S. Jiang, G. Pang, M. Wu, and L. Kuang, "An improved K-nearest-neighbor algorithm for text categorization," Expert Systems with Applications, vol. 39, no. 1, pp. 1503–1509, 2012.
[18] X. Zhou, Y. Hu, and L. Guo, "Text categorization based on clustering feature selection," Procedia Computer Science, vol. 31, pp. 398–405, 2014.
[19] V. Mitra, C.-J. Wang, and S. Banerjee, "Text classification: a least square support vector machine approach," Applied Soft Computing Journal, vol. 7, no. 3, pp. 908–914, 2007.
[20] H. Drucker, D. Wu, and V. N. Vapnik, "Support vector machines for spam categorization," IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1048–1054, 1999.
[21] V. Metsis, I. Androutsopoulos, and G. Paliouras, "Spam filtering with Naïve Bayes—which Naïve Bayes?" in Proceedings of the 3rd Conference on Email and Anti-Spam (CEAS '06), July 2006.
[22] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, G. Paliouras, and C. Spyropoulos, "An evaluation of naïve Bayesian anti-spam filtering," in Proceedings of the 11th European Conference on Machine Learning, Workshop on Machine Learning in the New Information Age, pp. 9–17, Barcelona, Spain, 2000.
[23] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian approach to filtering junk email," Tech. Rep. WS-98-05, AAAI Press, 1998.
[24] SpamAssassin Group, The Apache SpamAssassin Project, 2015, http://spamassassin.apache.org/.
[25] N. Pérez-Díaz, D. Ruano-Ordás, F. Fdez-Riverola, and J. R. Méndez, "SDAI: an integral evaluation methodology for content-based spam filtering models," Expert Systems with Applications, vol. 39, no. 16, pp. 12487–12500, 2012.
[26] Forbes, The World's Most Valuable Brands, 2015, http://www.forbes.com/powerfulbrands/list/.
[27] Official Gmail Blog, "The mail you want, not the spam you don't," 2015, https://gmail.googleblog.com/2015/07/themailyouwantnotspamyoudont.html.
[28] F. Lardinois, Gmail Now Has 900M Active Users, 2015, http://techcrunch.com/2015/05/28/gmailnowhas900mactiveusers75onmobile/.
[29] N. Pérez-Díaz, D. Ruano-Ordás, J. R. Méndez, J. F. Gálvez, and F. Fdez-Riverola, "Rough sets for spam filtering: selecting appropriate decision rules for boundary email classification," Applied Soft Computing Journal, vol. 12, no. 11, pp. 3671–3682, 2012.
[30] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic, New York, NY, USA, 1991.
[31] Z. Pawlak, J. Grzymala-Busse, R. Slowinski, and W. Ziarko, "Rough sets," Communications of the ACM, vol. 38, no. 11, pp. 88–95, 1995.
[32] Z. Pawlak, "Rough sets," International Journal of Computer & Information Sciences, vol. 11, no. 5, pp. 341–356, 1982.
[33] Z. Pawlak, "Rough sets: present state and the future," Foundations of Computing and Decision Sciences, vol. 18, no. 3-4, pp. 157–166, 1993.
[34] M. Glymin and W. Ziarko, "Rough set approach to spam filter learning," in Proceedings of the International Conference on Rough Sets and Intelligent Systems Paradigms (RSEISP '07), vol. 4585 of Lecture Notes in Computer Science, pp. 350–359, 2007.
[35] W. Ziarko, "Variable precision rough set model," Journal of Computer and System Sciences, vol. 46, no. 1, pp. 39–59, 1993.
[36] Y.-F. Chiu, C.-M. Chen, B. Jeng, and H.-C. Lin, "An alliance-based anti-spam approach," in Proceedings of the 3rd International Conference on Natural Computation (ICNC '07), pp. 203–207, August 2007.
[37] G.-H. Lai, C.-M. Chen, C.-S. Laih, and T. Chen, "A collaborative anti-spam system," Expert Systems with Applications, vol. 36, no. 3, pp. 6645–6653, 2009.
[38] G. Lai, C. Chou, C. Chen, and Y. Ou, "Anti-spam filter based on data mining and statistical test," Computer and Information Science, vol. 208, pp. 179–192, 2009.
[39] Y. Yang, "A novel framework based on rough set, ant colony optimization and genetic algorithm for spam filtering," International Journal of Advancements in Computing Technology, vol. 4, no. 14, pp. 516–525, 2012.
[40] W. Zhao and Y. Zhu, "Classifying email using variable precision rough set approach," in Rough Sets and Knowledge Technology, G.-Y. Wang, J. F. Peters, A. Skowron, and Y. Yao, Eds., vol. 4062 of Lecture Notes in Computer Science, pp. 766–771, Springer, 2006.
[41] D. C. Whitley, M. G. Ford, and D. J. Livingstone, "Unsupervised forward selection: a method for eliminating redundant variables," Journal of Chemical Information and Computer Sciences, vol. 40, no. 5, pp. 1160–1168, 2000.
[42] W. Zhao and Z. Zhang, "An email classification model based on rough set theory," in Proceedings of the International Conference on Active Media Technology (AMT '05), pp. 403–408, May 2005.
[43] W. Zhao and Y. Zhu, "An email classification scheme based on decision-theoretic rough set theory and analysis of email security," in Proceedings of the IEEE Region 10 Conference (TENCON '05), pp. 1–6, Melbourne, Australia, November 2005.
[44] B. Zhou, Y. Yao, and J. Luo, "A three-way decision approach to email spam filtering," in Advances in Artificial Intelligence: 23rd Canadian Conference on Artificial Intelligence (Canadian AI 2010), Ottawa, Canada, May 31–June 2, 2010, vol. 6085 of Lecture Notes in Computer Science, pp. 28–39, Springer, Berlin, Germany, 2010.
[45] C. Zhao, W. Zeng, M. Jiang, and Z. He, "A decision-theoretic rough set approach to spam filtering," in Proceedings of the 10th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD '13), pp. 130–134, July 2013.
[46] X. Jia and L. Shang, "Three-way decisions versus two-way decisions on filtering spam email," in Transactions on Rough Sets XVIII, J. F. Peters, A. Skowron, T. Li, Y. Yang, J. Yao, and H. S. Nguyen, Eds., vol. 8449 of Lecture Notes in Computer Science, pp. 69–91, Springer, 2014.
[47] X. Jia, K. Zeng, W. Li, T. Liu, and L. Shang, "Three-way decisions solution to filter spam email: an empirical study," in Rough Sets and Current Trends in Computing: 8th International Conference (RSCTC 2012), Chengdu, China, August 17–20, 2012, vol. 7413 of Lecture Notes in Computer Science, pp. 287–296, Springer, Berlin, Germany, 2012.
[48] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.
[49] J. R. Méndez, F. Fdez-Riverola, F. Díaz, E. L. Iglesias, and J. M. Corchado, "A comparative performance study of feature selection methods for the anti-spam filtering domain," in Advances in Data Mining. Applications in Medicine, Web Mining, Marketing, Image and Signal Mining: 6th Industrial Conference on Data Mining (ICDM 2006), Leipzig, Germany, July 14-15, 2006, vol. 4065 of Lecture Notes in Computer Science, pp. 106–120, Springer, Berlin, Germany, 2006.
[50] J. R. Méndez, I. Cid, D. Glez-Peña, M. Rocha, and F. Fdez-Riverola, "A comparative impact study of attribute selection techniques on naïve Bayes spam filters," in Advances in Data Mining. Medical Applications, E-Commerce, Marketing, and Theoretical Aspects: 8th Industrial Conference (ICDM 2008), Leipzig, Germany, July 16–18, 2008, vol. 5077 of Lecture Notes in Computer Science, pp. 213–227, Springer, Berlin, Germany, 2008.
[51] J. R. Méndez, E. L. Iglesias, F. Fdez-Riverola, F. Díaz, and J. M. Corchado, "Analyzing the impact of corpus preprocessing on anti-spam filtering software," Research on Computing Science, vol. 17, pp. 129–138, 2005.
[52] Z. Pawlak, S. K. M. Wong, and W. Ziarko, "Rough sets: probabilistic versus deterministic approach," International Journal of Man-Machine Studies, vol. 29, no. 1, pp. 81–95, 1988.
[53] SpamAssassin, SpamAssassin Public Corpus, 2003, https://spamassassin.apache.org/publiccorpus/.
[54] I. Androutsopoulos, G. Paliouras, and E. Michelakis, "Learning to filter unsolicited commercial email," Tech. Rep. 2004/2, NCSR "Demokritos", 2004.
[55] G. Cormack and T. Lynam, "TREC 2005 spam track overview," in Proceedings of the 14th Text REtrieval Conference (TREC '05), November 2005.
[56] G. Cormack, "TREC 2006 spam track overview," in Proceedings of the 15th Text REtrieval Conference (TREC '06), pp. 117–127, November 2006.
[57] G. V. Cormack, "TREC 2007 spam track overview," in Proceedings of the 16th Text REtrieval Conference (TREC '07), Gaithersburg, Md, USA, November 2007.
[58] S. Hettich, C. L. Blake, and C. J. Merz, "UCI Repository of machine learning databases," 1998, http://archive.ics.uci.edu/ml/datasets/Spambase.
[59] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, 2009.
[60] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in Proceedings of the 14th International Joint Conference on Artificial Intelligence, pp. 1137–1143, 1995.
[61] P. Resnick, RFC 2822—Internet Message Format, 2001, https://www.ietf.org/rfc/rfc2822.txt.
[62] G. H. John and P. Langley, "Estimating continuous distributions in Bayesian classifiers," in Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence (UAI '95), pp. 338–345, 1995.
[63] Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," in Proceedings of the 13th International Conference on Machine Learning (ICML '96), pp. 148–156, 1996.
[64] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[65] J. Platt, "Fast training of support vector machines using sequential minimal optimization," in Advances in Kernel Methods—Support Vector Learning, B. Schölkopf, C. Burges, and A. Smola, Eds., pp. 41–65, The MIT Press, 1998.
[66] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy, "Improvements to Platt's SMO algorithm for SVM classifier design," Neural Computation, vol. 13, no. 3, pp. 637–649, 2001.
[67] D. M. W. Powers, "Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation," International Journal of Machine Learning Technology, vol. 2, no. 1, pp. 37–63, 2011.
[68] C. J. van Rijsbergen, Information Retrieval, Butterworth-Heinemann, 1979.
Copyright
Copyright © 2016 N. Pérez-Díaz et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.