Abstract

Support vector machine (SVM) is one of the top picks for pattern recognition and classification tasks. It has been used successfully to classify linearly separable and nonlinearly separable data with high accuracy. However, in terms of classification speed, SVMs are outperformed by many other machine learning algorithms, especially when massive datasets are involved. SVM classification time scales linearly with the number of support vectors, and the number of support vectors grows with the size of the dataset. Hence, SVM classification speed can be greatly improved if the classifier is trained on a reduced dataset. Instance selection is one of the most effective approaches for minimizing SVM training time. In this study, two instance selection techniques for identifying relevant training instances are proposed. The techniques are evaluated on a dataset containing 4000 emails, and the results obtained are compared to those of other existing techniques. The results reveal an excellent improvement in SVM classification speed.

1. Introduction

Support vector machine (SVM) has performed remarkably well in classification and pattern recognition problems. Its high classification accuracy makes it one of the most preferred machine learning (ML) algorithms. However, SVM has a high classification complexity, which scales linearly with the number of support vectors (SVs), and the number of SVs grows with the number of instances in the dataset. That is, a massive dataset produces many SVs and consequently reduces SVM classification speed, making the classifier unsuitable for real-time systems. SVM training time scales between O(n^2) and O(n^3), where n is the number of training instances [1, 2]. Instance selection techniques have been successfully used to improve SVM classification speed and training complexity. These techniques minimize SVM training time by extracting relevant instances from the training set. The extracted instances (prospective SVs) are instances close to the decision boundary. Eliminating instances that are non-SVs does not have a negative impact on the SVM training result [1]. This study presents two instance selection techniques and applies them to phishing email classification. A brief introduction to instance selection and phishing is presented next.

(1) Instance Selection. Numerous ML-related problems require the automatic classification of new instances. Prior to classification, a classifier is typically trained on a set of instances, called the training set. Training datasets generally contain redundant instances; hence, removing them reduces the computational complexity of the classifier. Instance selection techniques are designed to remove irrelevant instances from a dataset, with the aim of reducing the training time of a classifier. Instance selection is particularly useful for instance-based classifiers, where classifying a single instance involves the use of the entire training set [3]. Instance selection can start with either an empty set (incremental technique) or the full training set (decremental technique) [3]. In the incremental technique, instances are added to an empty set, while in the decremental technique, instances are removed from the training set [3], as sketched below. Instance selection techniques can be classified into two groups: wrapper and filter [3]. Wrapper-based instance selection depends on the accuracy achieved by a classifier, whereas filter-based instance selection techniques do not depend on a classifier [3]. Filter-based techniques are typically faster than wrapper-based techniques [3]. In this study, two filter-based instance selection techniques are developed and applied to email classification. Their performances are evaluated, and they yield excellent results.
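The following is a minimal sketch (not from the paper) contrasting the incremental and decremental styles of instance selection; the relevance and redundancy criteria are placeholder callables that a concrete technique would replace with its own scoring rule.

```python
# Minimal sketch contrasting incremental and decremental instance selection.
# The criterion functions are hypothetical placeholders.
import numpy as np

def incremental_selection(X, y, is_relevant):
    """Start from an empty set and add instances judged relevant."""
    selected = []
    for i in range(len(X)):
        if is_relevant(X[i], y[i], selected):
            selected.append(i)
    return np.array(selected, dtype=int)

def decremental_selection(X, y, is_redundant):
    """Start from the full training set and drop instances judged redundant."""
    selected = list(range(len(X)))
    for i in range(len(X)):
        if is_redundant(X[i], y[i]):
            selected.remove(i)
    return np.array(selected, dtype=int)
```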

(2) Phishing. Phishing is an attempt to acquire sensitive information from users by electronic means, generally for fraud. Typically, phishing is perpetrated by creating a replica of a legitimate organization's website. Phishing attacks are among the major threats encountered by online users in recent times. Since the advent of electronic commerce in 1994, phishing has advanced at a fast pace [4]. Undoubtedly, the high patronage of online businesses is one of the primary causes of the rapid increase in online fraud. In 2014, 1,198 companies lost 179 million US dollars to email scams [5]. Between October 2013 and August 2015, 7000 companies in the USA lost about 750 million US dollars to phishing [5]. Moreover, between 2014 and 2016, the total loss to email scams (by organizations) is estimated at 2.3 billion US dollars [5]. The urgent need for a robust phishing detection system cannot be overemphasized. A secure phishing detection system should be capable of identifying and protecting users from both known and novel phishing attacks [6]. Many solutions have been proposed in the literature to handle phishing; however, ML-based techniques are among the few that yield high classification accuracy, because of their ability to detect both existing and emerging fraudulent attacks. This paper proposes two improved SVM-based solutions for phishing email classification.

2. Related Work

Many instance selection techniques have been proposed in the literature to reduce the computational complexity of SVM. Some proposed wrapper- and filter-based instance selection techniques are presented in this section.

2.1. Wrapper Based Instance Selection Techniques

Wrapper techniques perform instance selection using a classification model [3]. During instance selection, the dataset is divided into subsets, and each subset is used to train a model. Afterwards, each model is tested on a separate subset, and the weight of each subset is evaluated by counting the number of correctly classified instances. Finally, the subset with the best weight is selected and used to build the main model, as sketched below. Although wrapper-based techniques typically select optimal subsets, the selection process is time consuming. Some proposed wrapper-based instance selection techniques for SVM speed optimization are presented below.
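The following is a minimal sketch of this wrapper-style loop, assuming scikit-learn is available; the classifier, number of subsets, and split sizes are illustrative assumptions rather than the choices made in the reviewed studies.

```python
# Sketch of wrapper-based instance selection: each candidate subset is weighted
# by the accuracy of a model trained on it, and the best subset is kept.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def wrapper_select(X_train, y_train, X_val, y_val, n_subsets=5, seed=0):
    rng = np.random.default_rng(seed)
    best_idx, best_score = None, -1.0
    for _ in range(n_subsets):
        # Draw a random candidate subset of the training instances.
        idx = rng.choice(len(X_train), size=len(X_train) // n_subsets, replace=False)
        model = SVC(kernel="rbf").fit(X_train[idx], y_train[idx])
        score = accuracy_score(y_val, model.predict(X_val))  # subset weight
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx  # indices used to build the main model
```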

García et al. [7] introduced an evolutionary algorithm (EA) based technique for imbalanced classification in exemplar-based learning. In the study, the authors calculated the distance of each data point to different exemplars and used the EA to select the best exemplar, which was then used for training. In another work, Cano et al. [8] performed a study on the performance of EA-based instance selection techniques. The authors focused on four EA models and compared their performance to non-EA algorithms; the results revealed that the EA models performed better. Li et al. [9] proposed an SVM-based instance selection technique that combines SVM and a KNN-based instance selection technique (called DROP2 [10]). SVM was used to select SVs, DROP2 was used to further reduce the selected SVs, and the resulting dataset was then used to train the SVM. Garain [11] proposed an instance selection technique based on the Artificial Immune System (AIS), using the idea of AIS to select the fittest set of instances from a dataset. Zhang and Sun [12] proposed a tabu search based technique for instance selection, in which different subsets were selected, tabu search was applied to each subset, each subset was evaluated, and the subset that produced the best classification accuracy was selected.

2.2. Filter-Based Instance Selection Techniques

Filter-based instance selection techniques perform instance selection using a choice function [3]; instances are selected based on the scores assigned to them. Unlike wrapper-based techniques, the instance subsets produced by filter-based techniques are usually not tailored to a certain type of classification model; they are more general. Some filter-based techniques are discussed next.

Riquelme et al. [13] proposed an instance selection technique for selecting boundary instances. The authors designed a selection rule that discards weak instances far from a boundary. The weakness of an instance is determined by the weakness of all the attributes that describe the instance, that is, weakness(I) = Σ_{j=1}^{m} weakness(I, a_j), where m is the number of features describing instance I and a_j is the jth attribute. Lyhyaoui et al. [14] proposed a clustering-based instance selection technique for obtaining boundary instances in multiclass datasets; the authors obtained boundary instances by selecting cluster centers close to the opposite classes. In another work, De Almeida et al. [15] proposed a clustering-based technique using the k-means algorithm. The technique was designed on the assumption that training vectors close to a separating margin are prospective SVs, while training vectors far from the margin are likely non-SVs. The authors divided the training dataset into different clusters; training vectors in clusters containing only one class were discarded (only their cluster centers were retained), and training vectors in clusters containing more than one class were selected for training, on the assumption that clusters with multiple classes are near a separating margin and therefore possibly contain SVs. In another study, Chen et al. [16] proposed a clustering-based instance selection technique in which a clustering algorithm is used to obtain the cluster centers of the instances in the positive class; the cluster centers are then used as reference points to select boundary instances. The algorithm was designed on the assumption that negative instances close to the cluster centers of the positive class and positive instances far from those cluster centers are close to the boundary. In other words, positive instances close to the cluster centers contribute less to the decision surface, while negative instances close to the cluster centers contribute more.

Panda et al. [1] proposed an instance selection technique capable of selecting data instances close to a decision boundary; the selected boundary instances are believed to be SVs. The technique consists of two stages: the first stage identifies a set of nearest neighbors for every instance in the dataset, and the second stage selects the instances close to a boundary. The authors developed a scoring function that assigns high scores to instances closer to a decision boundary. In another study, García et al. [17] introduced an instance selection algorithm based on a memetic algorithm, which combines EA and local search. In the study, the local search was designed to select relevant instances and also improve classification accuracy.

In this study, for comparison purposes, two of the reviewed instance selection techniques were implemented and applied to phishing emails. The two techniques (Chen et al. [16] and Panda et al. [1]) and their results are presented next.

As aforementioned, Panda et al. [1] designed a scoring function for selecting instances close to a decision boundary. The scoring function is given in (1):

s_i(j) = exp(−d_ij / β),    (1)

where s_i(j) denotes the score accorded to instance j by instance i, d_ij is the squared distance between instances i and j, d_i is the squared distance from instance i to the closest instance of the opposite class on its neighborhood list, and β is the mean of d_i over all instances. During the implementation in this study, the squared Euclidean distance was used for distance computation. Pseudocode for the scoring function is shown in Algorithm 1 (an illustrative Python sketch follows the algorithm), and the selection steps are as follows [1]:
(i) Identify the k nearest neighbors (NN) of each instance in the dataset.
(ii) Compute an exponential decay score for each instance and its NN belonging to the opposite class.
(iii) Determine the score of each instance.
(iv) Based on the scores, select the boundary instances.
Results for the KNN-based technique are shown in Table 1. The table shows the results for varying numbers of nearest neighbors and varying numbers of selected subsets (i.e., boundary instances); they reveal an improvement in SVM classification speed. The clustering-based technique proposed by Chen et al. [16] was also implemented. The technique is shown in Algorithm 2 (a simplified sketch follows it), and its selection steps are as follows [16]:
(i) Select instances from the dataset, D, for the positive class, PC.
(ii) Select instances for the negative class, NC, where NC = D − PC.
(iii) Apply clustering to the positive class to obtain the cluster centers (or means).
(iv) Select boundary instances using the obtained cluster centers. To achieve this, do the following.
(v) For each cluster center,
(a) compute the distance between the cluster center and the selected positive instances;
(b) sort the distances and remove the positive instances that are closest to the cluster center (these are far from the boundary);
(c) compute the distance between the cluster center and the negative instances;
(d) add the negative instances that are closest to the cluster center (these are close to the boundary).
(vi) End For.
(vii) Use the selected instances for training.
As shown in Table 2, the algorithm improved SVM classification speed without degrading classification accuracy.

Algorithm 1: Exponential decay scoring function for boundary instance selection [1].
Notation
n = # of instances in dataset
k = # of nearest neighbors
score_i = normalized score of instance i
c_i = # of contributors to the score of instance i
NN_i = k nearest neighbors of instance i
d(i, j) = squared distance between instances i and j
procedure Determine_Scores
Input: n, k, NN, d
Output: score
/* Determine exponential decay parameter β */
sum = 0
counter = 0
for i = 1 to n
  nearest-opposite-neighbor-found = false
  for j = 1 to k
    if class(NN_i[j]) ≠ class(i)
      if (!nearest-opposite-neighbor-found)
        nearest-opposite-neighbor-found = true
        sum = sum + d(i, NN_i[j])
        counter = counter + 1
β = sum / counter
/* Determine the score of instances */
for i = 1 to n
  nearest-opposite-neighbor-found = false
  for j = 1 to k
    if class(NN_i[j]) ≠ class(i)
      if (!nearest-opposite-neighbor-found)
        nearest-opposite-neighbor-found = true
        score_{NN_i[j]} = score_{NN_i[j]} + exp(−d(i, NN_i[j]) / β)
        c_{NN_i[j]} = c_{NN_i[j]} + 1
for i = 1 to n
  score_i = score_i / c_i
return score
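For clarity, the following is a compact Python sketch of one reading of the scoring idea in Algorithm 1, assuming scikit-learn's NearestNeighbors is available for the neighbor search; it is illustrative and not the implementation used in the study.

```python
# Illustrative sketch of the exponential decay scoring in Algorithm 1: each
# instance accords a score exp(-d/beta) to its nearest opposite-class
# neighbour, and scores are normalized by the number of contributors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def boundary_scores(X, y, k=10):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)
    dist, idx = dist[:, 1:] ** 2, idx[:, 1:]   # drop self, use squared distances
    opp = y[idx] != y[:, None]                 # opposite-class mask per neighbour
    # beta: mean squared distance to the nearest opposite-class neighbour.
    nearest_opp = np.array([d[o][0] if o.any() else np.nan for d, o in zip(dist, opp)])
    beta = np.nanmean(nearest_opp)
    # Accumulate exponential decay scores and contributor counts.
    score, contrib = np.zeros(len(X)), np.zeros(len(X))
    for i in range(len(X)):
        if opp[i].any():
            j = idx[i][opp[i]][0]              # nearest opposite-class neighbour
            score[j] += np.exp(-dist[i][opp[i]][0] / beta)
            contrib[j] += 1
    return np.divide(score, contrib, out=np.zeros_like(score), where=contrib > 0)
```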
Algorithm 2: Clustering-based instance selection technique [16].
Input: training set D with m classes and n instances; the ratio of selected instances r
Output: S, the set of selected instances; each class c_i in turn is treated as positive and the remaining classes as negative
Procedure:
(1) for each class c_i, i = 1 to m
(2)  S_i = {x ∈ D | class(x) = c_i};
(3)  Perform k-means clustering on class c_i and get cluster centers o_1, …, o_k
(4)  for each center o_j
(5)   compute the distance d(o_j, x) between o_j and each instance x ∈ S_i;
(6)   if the number of instances to be removed from S_i is greater than zero
(7)    get the instances in S_i closest to o_j and delete them from S_i, where the number of deleted instances is determined by the selection ratio r
(8)   end if
(9)   for each negative class c_l, l ≠ i
(10)   search the instances of class c_l with the least distance to o_j and select them into S, where the number of selected instances is determined by r and the size of class c_l
(11)  end for
(12) end for
(13) end for
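The following is a simplified two-class Python sketch of the idea in Algorithm 2, using scikit-learn's KMeans; the number of clusters and the drop/add ratios are illustrative assumptions, not the values used by Chen et al.

```python
# Simplified two-class sketch of clustering-based selection: cluster the
# positive class, drop positives closest to each cluster centre (far from the
# boundary), and add negatives closest to each centre (near the boundary).
import numpy as np
from sklearn.cluster import KMeans

def cluster_select(X_pos, X_neg, n_clusters=5, drop_ratio=0.2, add_ratio=0.2):
    centres = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_pos).cluster_centers_
    keep_pos = np.ones(len(X_pos), dtype=bool)
    keep_neg = np.zeros(len(X_neg), dtype=bool)
    for c in centres:
        d_pos = ((X_pos - c) ** 2).sum(axis=1)   # squared distances to centre
        d_neg = ((X_neg - c) ** 2).sum(axis=1)
        keep_pos[np.argsort(d_pos)[: int(drop_ratio * len(X_pos) / n_clusters)]] = False
        keep_neg[np.argsort(d_neg)[: int(add_ratio * len(X_neg) / n_clusters)]] = True
    return X_pos[keep_pos], X_neg[keep_neg]
```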

3. Proposed Instance Selection Techniques

This section presents the two instance selection techniques proposed in this study. The first technique is based on the firefly algorithm, and the second is based on edge detection in image processing. Both techniques were evaluated on a dataset consisting of 3500 ham emails and 500 phishing emails. The ham emails were obtained from SpamAssassin [18], and the phishing emails were obtained from https://monkey.org/ [19]. The dataset contains a higher proportion of ham emails because, in the real world, mail users receive more legitimate emails than phishing emails. All the emails were labelled and evenly distributed into 10 folders, and 10-fold cross validation was performed. Dataset processing and feature extraction are described in Section 3.1, and brief introductions to the firefly algorithm (FFA) and edge detection are given in Sections 3.2 and 3.3, respectively.

3.1. Dataset Processing and Feature Extraction

Prior to classification, 16 features were first extracted from the emails in the dataset. The extracted features are similar to those used in one of our previous studies [20]. Furthermore, the extracted features were normalized, and the information gain (IG) of each feature was calculated. Afterwards, the nine best features were selected and converted to the input format required by libSVM [21], the SVM library used in this study. During classification, a Gaussian (z-score) transformation is used to scale the feature vectors, ensuring that each feature has a mean of zero and unit variance. The firefly parameters used in this study are similar to those suggested by Yang [22], and the SVM parameter selection technique is similar to the one recommended by Hsu et al. [23]. More details are provided in Tables 10 and 11. A sketch of this preprocessing pipeline is given below.
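The following is a minimal sketch of the preprocessing described above, assuming NumPy and scikit-learn are available. The feature extraction itself is omitted, mutual_info_classif is used here only as a stand-in estimate of information gain, and dump_svmlight_file is used as a convenient way to produce libSVM's input format; none of these are necessarily the exact tools used in the study.

```python
# Illustrative preprocessing sketch: z-score scaling, IG-style feature ranking,
# and export to the sparse "label idx:value" text format read by libSVM.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.datasets import dump_svmlight_file

def preprocess(X, y, n_best=9, out_path="train.libsvm"):
    # Scale each feature to zero mean and unit variance.
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    # Rank features by an estimate of information gain and keep the best ones.
    ig = mutual_info_classif(X, y, random_state=0)
    best = np.argsort(ig)[::-1][:n_best]
    # Write the reduced feature vectors in libSVM format.
    dump_svmlight_file(X[:, best], y, out_path, zero_based=False)
    return best
```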

3.2. Firefly Algorithm

FFA is a nature-inspired (NI) algorithm developed by Yang [24]. It is based on the flashing behavior of fireflies [25]. Most firefly species produce short, rhythmic flashes of light to attract mating partners and prey and to send warning signals to predators [24]. Firefly light intensity is inversely proportional to the square of the distance between fireflies; additionally, as distance increases, light is absorbed by the atmosphere and its intensity decreases further [24]. The flashing light can be formulated such that it is associated with the value of an objective function. FFA has many variants; however, this study focuses on the original algorithm, formulated using three idealized rules [24]:
(1) Fireflies are unisex; hence, they can be attracted to each other irrespective of their sex.
(2) Attractiveness is proportional to brightness, and both decrease with distance. Therefore, a brighter firefly attracts less bright fireflies, and fireflies move randomly if they are of equal light intensity.
(3) Firefly brightness is determined by the landscape of the objective function.
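For reference, in Yang's original formulation [24] the attractiveness of a firefly decreases with the distance r between two fireflies as β(r) = β0 exp(−γr^2), and a less bright firefly i moves towards a brighter firefly j according to x_i = x_i + β0 exp(−γ r_ij^2)(x_j − x_i) + α ε_i, where β0 is the attractiveness at r = 0, γ is the light absorption coefficient, α is a randomization parameter, and ε_i is a vector of random numbers. These are the attractiveness and movement rules referred to in Algorithm 3.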

3.2.1. FFA-Based Instance Selection Technique

This study introduces an instance selection technique (called FFA_IS) based on FFA. FFA_IS is designed with the objective of minimizing the number of instances used for training. Given a set of training instances, fireflies are evaluated (using the objective function defined in (2)), and the best firefly is selected and used to train the SVM. Each firefly consists of a binary array of instances (called an instance mask), where 1 indicates that an instance is selected and 0 indicates otherwise. During the experiment, the instance mask of each firefly is randomly initialized to 0s and 1s. Afterwards, the objective function of each firefly is evaluated and the global best is saved. Furthermore, fireflies are moved from one position to another, their attractiveness is calculated, and their fitness values are updated. The process is repeated until a predefined number of generations is reached. Finally, the best firefly is selected, its instance mask is processed, and instances with a value of 1 are selected and used to train the SVM. A constraint is added to ensure that at least Min instances are selected for training, where Min is user defined. Hence, if the total number of selected instances (NS) is less than Min, (Min − NS) additional instances are randomly selected. This constraint eliminates the possibility of having zero selected instances.

3.2.2. Objective Function for FFA_IS

The objective function for FFA_IS is given in (2). As aforementioned, the ultimate goal is to minimize the number of selected instances; hence, percentage reduction is the criterion used in designing the objective function:

f = ((TNI − TS) / TNI) × 100,    (2)

where TNI is the size of the instance mask and TS is the total number of selected instances. The objective function assigns more weight to fireflies that achieve a higher percentage reduction (i.e., select fewer instances), and the firefly with the highest weight is selected and used for training. As aforementioned, FFA_IS is designed to ensure that at least Min instances are selected for training. A sketch of the objective function and the Min constraint is given below.
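The following is a minimal Python sketch (not the authors' code) of the instance mask, the objective function in (2), and the Min constraint; the mask size and Min value are illustrative.

```python
# Illustrative sketch of the FFA_IS instance mask, objective (2), and the
# Min constraint described above.
import numpy as np

def objective(instance_mask):
    """Percentage reduction achieved by a firefly's binary instance mask."""
    tni = instance_mask.size               # TNI: size of the instance mask
    ts = int(instance_mask.sum())          # TS: number of selected instances
    return (tni - ts) / tni * 100.0

def enforce_min(instance_mask, min_selected, rng):
    """If fewer than Min instances are selected, randomly switch on more."""
    ns = int(instance_mask.sum())
    if ns < min_selected:
        zeros = np.flatnonzero(instance_mask == 0)
        extra = rng.choice(zeros, size=min_selected - ns, replace=False)
        instance_mask[extra] = 1
    return instance_mask

rng = np.random.default_rng(0)
mask = rng.integers(0, 2, size=200)        # random 0/1 initialization
print(objective(mask))
mask = enforce_min(mask, min_selected=50, rng=rng)
```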

3.2.3. Result and Discussion

The FFA_IS algorithm (shown in Algorithm 3) was evaluated on a dataset consisting of 4000 emails, and it yielded promising results. During the evaluation, different experiments were performed, each using a different instance mask size and a different number of fireflies. As shown in Table 4, the classification accuracy obtained ranges between 99.25% and 99.68%, and the classification time obtained ranges between 23.54 seconds and 213.17 seconds. Although the proposed technique is not designed to select boundary instances, as reflected in the results, it can be used to select relevant instances for SVM training and consequently improve SVM classification speed. During the experiments performed on the clustering-based technique (proposed by Chen et al. [16]), it was observed that over 80% of the training dataset was selected for training. Hence, in this study, FFA_IS was used to further reduce the number of training instances selected by the clustering-based technique. Two sets of experiments were performed to test the performance of the hybridized technique (called FFA_Clus). In the first set, 100% of the instances selected by the clustering-based technique were used to train the SVM, and in the second set, the instances selected by FFA_Clus were used. Table 3 shows the results of the experiments. The results reveal that FFA_Clus improved the classification speed of the clustering-based technique by 98% without degrading the classification accuracy. This implies that robust instance selection techniques can be developed by combining FFA_IS with clustering-based techniques.

Algorithm 3: FFA-based instance selection technique (FFA_IS).
Notation
m = number of fireflies
NS = number of selected instances
MaxGen = maximum number of generations
Min = minimum number of selected instances
β0 = initial attractiveness value
α = alpha (randomization parameter)
γ = gamma (light absorption coefficient)
f_i = objective function value of firefly i, where i = 1, …, m
IM = instance mask (subset of instances)
TNI = size of instance mask
D = dataset
GB = global best
TS = training subset
I_i = light intensity of firefly i
Input: D, m, MaxGen, Min, β0, α, γ
Output: TS
(1) Define f
(2) Initialize IM of each firefly
(3) Evaluate f to determine f_i for each firefly
(4) Select the firefly with the highest f_i and save it in GB
(5) while (generation < MaxGen)
  (5.1) for i = 1 to m
    (5.1.1) for j = 1 to m
      (5.1.1.1) if (I_j > I_i)
        (5.1.1.1.1) Move firefly i towards firefly j
      (5.1.1.2) end if
      (5.1.1.3) Calculate attractiveness, which varies with distance r as β = β0 exp(−γr^2)
      (5.1.1.4) Evaluate f to determine the new fitness value of firefly i
      (5.1.1.5) Update the light intensity of firefly i
    (5.1.2) end for
  (5.2) end for
  (5.3) Update GB
(6) end while
(7) Calculate NS of GB
(8) if NS in GB < Min
  (8.1) update GB by assigning 1 to (Min − NS) instances that were not selected
(9) end if
(10) for i = 1 to TNI
  (10.1) if instance i in GB is equal to 1
    (10.1.1) TS = TS ∪ {instance i}
  (10.2) end if
(11) Output TS

3.3. Edge Detection

Edge detection in image processing is a technique used to identify object boundaries in images [26]. Object boundaries are points in an image where the brightness changes sharply [26]. Images generally contain some quantity of redundant data that requires pruning for effective classification; hence, to reduce computational complexity, edge detection is an essential preprocessing step [27]. Edge detection is applied to images with the aim of identifying important features, removing less relevant information, and consequently reducing the image size. It is used for image segmentation, feature extraction, and feature detection in image processing, computer vision, and machine vision [26–28]. Edge detection preserves the essential structural properties of an image and conserves computer storage space [27]. Edge detection algorithms include the Canny, Sobel, and Roberts algorithms; a small example is given below. Figure 1 shows an example of an image and its detected edges.
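As a brief illustration of edge detection (not part of the proposed method), the following snippet applies the Canny algorithm with OpenCV; the file name and thresholds are arbitrary choices for demonstration.

```python
# Illustrative Canny edge detection on a grayscale image with OpenCV.
import cv2

image = cv2.imread("sample.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input file
edges = cv2.Canny(image, 100, 200)                       # lower/upper hysteresis thresholds
cv2.imwrite("sample_edges.jpg", edges)
```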

The concept of edge detection in image processing is similar to the concept of boundary detection in SVM classification: edge detection aims to identify pixels located at object boundaries, while boundary detection algorithms aim to select instances (prospective SVs) close to a decision boundary. In this study, an instance selection technique based on edge detection is proposed.

3.3.1. Edge Instance Selection Algorithm

This study proposes an instance selection technique called the Edge Instance Selection Algorithm (EISA). EISA borrows its idea from edge detection in image processing. Given a set of training instances, EISA identifies an edge instance and selects the instances close to it. EISA consists of two main stages: a distance computation stage and an edge instance selection stage. In the first stage, EISA computes the squared Euclidean distance between each instance and all other instances in the dataset; furthermore, each instance votes for a corresponding edge instance, namely, the instance farthest from it. In the second stage, the edge instance with the highest vote is selected; afterwards, the k-nearest neighbors of the voted edge instance are computed and used to train the SVM model. Algorithm 4 shows the full EISA algorithm, and an illustrative sketch follows it. Some experiments were performed to test the efficiency of EISA, and the results reveal that EISA significantly improves SVM classification speed.

Algorithm 4: Edge Instance Selection Algorithm (EISA).
Notation
n = number of dataset instances
D = dataset
k = number of nearest neighbors
E = edge instance voted for by an instance
EI = edge instances (selected set)
V = vote for each instance; V is an array of size n
Input: D, n, k
Output: EI
Initialize V
(1) for i = 1 to n
  (1.1) for j = 1 to n
    (1.1.1) Compute the distance between instance i and instance j, where j ≠ i
  (1.2) end for
  (1.3) Select the instance E with the largest distance from instance i
  (1.4) Increment V[E]
(2) end for
(3) Select the instance with the highest vote
(4) EI = {instance with the highest vote}
(5) for i = 1 to k
  (5.1) Select the ith nearest neighbor of the voted edge instance and save it in EI
(6) end for
(7) Return EI

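The following is a compact NumPy sketch of the EISA idea in Algorithm 4, assuming squared Euclidean distance and arbitrary tie-breaking; it is illustrative, not the authors' implementation, and the example matrix at the bottom is random placeholder data.

```python
# Illustrative EISA sketch: vote for the farthest instance from each point,
# pick the most-voted edge instance, and keep its k nearest neighbours.
import numpy as np

def eisa_select(X, k):
    # Pairwise squared Euclidean distances via the dot-product identity.
    sq_norm = (X ** 2).sum(axis=1)
    sq = sq_norm[:, None] + sq_norm[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(sq, -np.inf)           # ignore self-distances when voting
    votes = np.bincount(sq.argmax(axis=1), minlength=len(X))
    edge = int(votes.argmax())              # edge instance with the highest vote
    np.fill_diagonal(sq, np.inf)            # ignore self-distance for k-NN search
    neighbours = np.argsort(sq[edge])[:k]   # k nearest neighbours of the edge instance
    return np.concatenate(([edge], neighbours))

# Example: select 300 instances from a random 1000-by-9 feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 9))
selected = eisa_select(X, k=300)
```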
3.3.2. Results and Discussion

EISA was evaluated on a dataset consisting of 4000 emails. During the evaluation, different values of k were used, and as shown in Table 5, EISA produced a classification accuracy of 100% and FP and FN rates of 0.00% when k = 300. This implies that 300 edge instances are sufficient to build an excellent SVM classifier. Furthermore, the results obtained from EISA, FFA_IS, FFA_Clus, the KNN-based technique, and the clustering-based technique were compared to one another. Also, to evaluate the impact of instance selection on SVM, an additional experiment was performed in which all training instances were used to train the SVM; no instance selection technique was applied prior to training. The result obtained from this experiment was also compared to the techniques implemented in this study. As shown in Table 6, EISA produced the best classification accuracy, FP rate, and FN rate, followed by the clustering-based technique and FFA_Clus. Furthermore, all the results obtained show an enormous improvement in SVM classification speed: EISA improved SVM speed by 92.5%, FFA_Clus by 98.8%, and the clustering-based technique, the KNN-based technique, and FFA_IS by 43%, 98.1%, and 98.2%, respectively. Overall, EISA and FFA_Clus produced the best speed-accuracy trade-off compared to the other techniques.

Another set of experiments was performed on the Spambase dataset, consisting of 4600 emails and 57 features. Spambase was obtained from the UCI ML repository [29]. The experiments were performed with the aim of comparing the performance of the proposed techniques to other instance selection techniques in the literature. In the experiments, EISA, FFA_IS, and FFA_Clus were compared to five other instance selection techniques, namely, PSC [30], DROP 3 [10], DROP 5 [10], GCNN [31], and POCNN [32]. The results are shown in Table 7 and Figures 2 and 3. As shown, FFA_Clus and FFA_IS yielded the best performance in terms of classification accuracy and classification speed. Moreover, in terms of classification accuracy, EISA performed better than PSC, GCNN, DROP 3, and POCNN. Although DROP 5 performed better than EISA in terms of classification accuracy, EISA has a better speed-accuracy trade-off; EISA is also faster than POCNN.

Statistical analysis of the results was performed using a one-sample t-test. The goal of the analysis was to determine whether it can be concluded, at the 95% confidence level, that the proposed techniques perform better (in terms of classification speed and accuracy) than PSC, DROP 3, DROP 5, GCNN, and POCNN. As aforementioned, 10-fold cross validation was performed, hence the choice of the t-test. Since the number of samples is 10 (9 degrees of freedom), the critical value from the t-distribution table is 2.2622. The results of the analysis are reported in Tables 8 and 9, and a sketch of the test is given below. As shown in Table 8, in terms of classification accuracy, there is a statistically significant difference between EISA and PSC. There is also a statistically significant difference between FFA_IS and PSC, DROP 3, DROP 5, GCNN, and POCNN, and between FFA_Clus and PSC, DROP 3, DROP 5, GCNN, and POCNN. Furthermore, as shown in Table 9, in terms of classification speed, there is a statistically significant difference between EISA and both DROP 3 and DROP 5, between FFA_IS and PSC, DROP 3, DROP 5, GCNN, and POCNN, and between FFA_Clus and PSC, DROP 3, DROP 5, GCNN, and POCNN.
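The following is a minimal sketch of how such a test can be run with SciPy, assuming the per-fold accuracies of a proposed technique are compared against a competitor's mean accuracy; the numbers shown are placeholders, not the reported results.

```python
# Illustrative one-sample t-test over 10-fold cross-validation accuracies.
# The accuracy values below are placeholders, not the results from the tables.
import numpy as np
from scipy import stats

proposed_acc = np.array([99.4, 99.3, 99.5, 99.6, 99.2, 99.5, 99.4, 99.6, 99.3, 99.5])
competitor_mean = 98.7   # hypothetical mean accuracy of the technique compared against

t_stat, p_value = stats.ttest_1samp(proposed_acc, popmean=competitor_mean)
# With 10 folds (9 degrees of freedom), the two-tailed critical value at the
# 95% confidence level is about 2.262.
significant = abs(t_stat) > 2.262
print(t_stat, p_value, significant)
```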

4. Conclusion and Future Work

Instance selection techniques have been successfully used to reduce SVM speed complexity. The two main types of instance selection techniques are filter and wrapper, and filter-based techniques are generally faster than wrapper-based techniques. In this study, two filter-based instance selection techniques were introduced. The performance of the two techniques was evaluated, and the results were compared to those of other existing techniques. The results reveal an excellent improvement in SVM classification speed without a significant reduction in classification accuracy. Moreover, the two techniques produced balanced speed-accuracy trade-offs. In the future, the two proposed techniques will be tested on other ML algorithms, and more NI-based instance selection techniques will be developed and tested.

Competing Interests

The authors declare that they have no competing interests.