Abstract
The aim of machine learning is to develop algorithms that can learn from data and solve specific problems in some context as humans do. This paper presents several machine learning models applied to intrusion detection in WiFi networks. Firstly, we present an incremental semisupervised clustering algorithm based on a graph. Incremental clustering, or one-pass clustering, is very useful when working with data streams or dynamic data. For traditional clustering algorithms such as K-means, Fuzzy C-Means, and DBSCAN, many incremental versions have been developed. However, to the best of our knowledge, there is no incremental semisupervised clustering algorithm in the literature. Secondly, by combining the K-means algorithm with a local density score, we propose a fast outlier detection algorithm named FLDS. The complexity of FLDS is $O(n^{1.5})$, while the results obtained are comparable with those of the LOF algorithm. Thirdly, we introduce a multistage system based on machine learning techniques for mining intrusion detection data from 802.11 WiFi networks. Finally, experiments conducted on data sets extracted from 802.11 networks and on UCI data sets show the effectiveness of the proposed methods.
1. Introduction
Machine learning is a central topic in artificial intelligence. It is concerned with the development of algorithms and techniques that allow computers to learn. The principal kinds of machine learning include supervised learning, unsupervised learning, and semisupervised learning. Machine learning techniques are applied in many areas, for example, fraud detection in bank and transaction data, intrusion detection in networking, bioinformatics, natural language processing, and image analysis [1]. Additionally, machine learning is very useful when human expertise does not exist (robots on Mars, in the sea, etc.), when the solution changes over time (networking, surveillance), or when the solution needs to be adapted to particular cases. This paper focuses on developing machine learning techniques for intrusion detection systems in WiFi networks.
Intrusion detection is one of the most important emerging tasks in network connectivity. Each year, a large number of network attacks occur worldwide; consequently, the cost of dealing with them is very high and was reported to be about 500 billion USD in 2017. This problem is a challenge not only for governments and organizations but also for individuals in their daily lives. To protect a computer network system, methods such as firewalls, data encryption, and user authentication can generally be used. The firewall is one technique to protect the system, but nowadays, external mechanisms have emerged and quickly become popular. One important approach proposed in the literature for data mining in the intrusion detection problem is to use machine learning techniques [2–8]. An intrusion detection system (IDS) directly monitors the network transactions, where each transaction is either normal or malicious. The aim of an IDS is to detect attack transactions and alert network administrators. In some cases, the IDS can even block the connection immediately.
Generally, the data mining task in an IDS must detect two kinds of attacks: known attacks and outlier (anomaly) attacks. For known attacks, we can use a (semi)supervised learning method such as a neural network, support vector machine, random forest, decision tree, or naïve Bayes, to mention a few, to construct a classifier from training data (connections labeled as normal or attack) [4–7, 9]. The trained classifier is then used to classify new connections; this supervised learning model is illustrated in Figure 1. Outlier attacks, whose labels are not known, cannot be detected by the trained classifier. In this case, we have to use another kind of machine learning, namely unsupervised outlier detection, such as LOF [10], ODIN [11], and so on. The outlier detection process can be run offline over periods of time defined by users or experts. The general schema for outlier detection, which is the unsupervised learning model, is presented in Figure 2. The aim of this schema is to detect outliers in a period of time. For example, in an IDS, the users can set a period of time for capturing the data; the data are then transformed by the preprocessing step, and finally an outlier detection method is used to detect attacks in the observed data.
The contributions of our paper are as follows:
(i) We propose an incremental semisupervised graph-based clustering algorithm. To the best of our knowledge, this is the first incremental semisupervised clustering algorithm. The preliminary work is presented in [12].
(ii) We introduce a fast outlier detection method based on the local density score and the K-means clustering algorithm. The preliminary work is introduced in [13].
(iii) We propose a multistage system based on machine learning techniques which can boost the accuracy of the intrusion detection process for the 802.11 WiFi data set.
(iv) Experiments carefully conducted on data sets extracted from the Aegean WiFi Intrusion Dataset (AWID) [14] show the effectiveness of our proposed algorithms. AWID is a publicly available collection of data sets which contain real traces of both normal and intrusive 802.11 traffic. To date, AWID is one of the standard data sets for evaluating the capacity of IDSs.
This paper is organized as follows. Section 2 presents the related work. Section 3 introduces the new incremental semisupervised clustering method and the new fast outlier detection algorithm. Section 4 presents experiments for the proposed algorithms and proposes a hybrid framework applied to the AWID data set. Finally, Section 5 concludes the paper and presents some directions for further research.
2. Incremental Clustering and Outlier Detection
2.1. Incremental Clustering
Clustering is the task of partitioning a data set into k clusters in which the points in the same cluster are similar and the points in different clusters are dissimilar. The context of incremental clustering is as follows: given some current clusters, incremental clustering is a kind of one-pass clustering which aims to identify the cluster label of incoming data points. Incremental clustering is very useful for data streams or dynamic data (data warehouses). In general, incremental clustering combines two processes, insertion and deletion. Given a set of clusters, the insertion step aims to identify the label of a new data point based on the current clusters. In some cases, new clusters will be created; in others, the new data points will be integrated into the current clusters. In the deletion process, if we want to remove one or more data points, we need to re-form the clusters because some of them may be affected by the removal. For each kind of clustering, incremental algorithms have been proposed in the literature, such as Incremental K-means [15], IncrementalDBSCAN [16], and incremental graph clustering [17]. The key idea of these algorithms is to identify the possible cases that arise in the insertion step and the deletion step for each kind of algorithm. This problem is particularly meaningful for big data, where the data set is too big to fit into the available memory.
In [16], incremental density-based clustering (IncrementalDBSCAN) is introduced. Based on the notion of density-based clustering, IncrementalDBSCAN can efficiently add points to and delete points from the current clusters. When a new point is added, several cases can occur: the new point can be noise, it can be added to an existing cluster, or it can cause several clusters to merge. When a point is deleted, the point can be a noise point, its removal can split a cluster into several clusters, or it may not affect the current clusters at all. Some cases of the insertion and deletion processes of IncrementalDBSCAN are shown in Figure 3.
In [15], a single-pass incremental clustering algorithm for large data sets based on K-means, named GenIC, is introduced. GenIC updates each center with each new data point and merges clusters only at the end of a generation (i.e., a window of data). As a generalized incremental algorithm, GenIC moves a center in the list of centers using a weighted sum of the existing center and the new point presented. The idea of GenIC is to divide the stream of data into chunks or windows, as is common with streaming algorithms. Each chunk of n data points is viewed as a generation, and the “fitness” of a center is measured by the number of points assigned to it. In general, the fittest centers survive to the next generation, but occasionally new centers are selected and old centers are killed off. Compared with K-means, GenIC is shown to be faster and less affected by the choice of initial centers. In [18], a version of incremental K-means clustering is also proposed, in which clusters are built incrementally by adding one cluster center at a time. In [19], a novel two-phase static single-pass algorithm and a dynamic two-phase single-pass algorithm based on Fuzzy C-means are presented and show high utility. The idea behind the multistage methods reported in that paper is that an estimate of the partition matrix and the location of the cluster centers can be obtained by clustering a sample of the data. A small sample is expected to produce a fast yet less reliable estimation of the cluster centers. This leads to a multistage approach, which involves several stages of sampling (with replacement) of the data and estimating the membership matrix for the next stage. The experiments conducted show the effectiveness of the proposed method.
In [17], Chandrasekhar et al. propose an incremental local density clustering scheme (ILDC) for finding dense subgraphs in streaming data, i.e., when data arrive incrementally. The incremental clustering scheme captures redundancy in the streaming data source by finding dense subgraphs, which correspond to salient objects and scenes. The ILDC process performs greedy operations such as cluster expansion, cluster addition, and cluster merging based on the similarity defined between clusters. ILDC is shown to be effective in image-retrieval applications. In [20], an incremental semisupervised ensemble clustering algorithm named ISSCE is presented. ISSCE uses constraints to update incremental members. The authors develop an incremental ensemble member selection process based on a global objective function and a local objective function to remove the redundant ensemble members. The experimental results show the improvement of ISSCE over traditional semisupervised clustering ensemble approaches and conventional cluster ensemble methods on six real-world data sets from the UCI machine learning repository and 12 real-world data sets of cancer gene expression profiles. In the context of classification, we need to find the label of a new data object by using a classifier trained on training data. The problem of identifying the label of a new object in incremental clustering can be seen as similar to the classification context.
2.2. Outlier Detection Problem
Outlier (anomaly) detection is one of the important problems of machine learning and data mining. As mentioned in [21], outlier detection is the problem of finding patterns in data that do not conform to expected behavior. Outlier detection is used in many applications such as intrusion detection, credit fraud detection, video surveillance, weather prediction, and the discovery of criminal activities in electronic commerce [9, 21]. There are several kinds of outliers, including point outliers, contextual outliers, and collective outliers. In this paper, we focus on point outlier detection, which can be applied in a variety of applications. For a given data set, a point is called an outlier if it is different from a large number of the remaining points. To detect outliers, the principal methods in the literature include classification methods, nearest neighbor methods, clustering methods, statistical methods, and distance-based methods.
Classification-based outlier detection falls into two categories: multiclass and one-class anomaly detection methods. In multiclass classification techniques, we assume that the training data contain labeled points of all normal classes. The learner trains a supervised model on the labeled data, and the resulting classifier can distinguish between each normal class and the rest. A test point is called an outlier if it does not belong to any normal class. In one-class outlier detection methods, we assume that there is only one normal class. The classifier learns a model that captures the boundary of the normal class; if a test point does not fall within the boundary, it is called an outlier. Although many such techniques have been developed, their main disadvantage is that they depend on the availability of accurate labels for the normal classes, which are not easy to obtain in real applications.
Nearest neighbor-based outlier detection methods rely on the following assumption: normal points belong to dense regions, while outliers belong to sparse regions. The most famous method of this kind is the LOF algorithm [10]. The idea of LOF is based on a local density score evaluated for each point. Each point is assigned a score which is the ratio of the average local density of the k-nearest neighbors of the point to the local density of the point itself. Many variants of LOF can be cited here, such as COF [22], ODIN [11], and LOCI [23]. The main drawback of these methods is their computational complexity.
Clustering-based outlier detection techniques use clustering methods to group the data into clusters; the points that do not belong to any cluster are called outliers. Some clustering methods, such as DBSCAN [24] and SNN [25], can detect outliers. However, the purpose of clustering is to find clusters, so the outliers are just a by-product of the clustering process and hence are not carefully optimized. A further drawback is the computational complexity that clustering techniques require.
Statistical outlier detection methods are based on the following assumption: normal data points occur in high-probability regions of a stochastic model, while anomalies occur in its low-probability regions. Several methods of this kind have been proposed. In general, statistical methods fit a statistical model (a Gaussian distribution, a mixture of parametric statistical distributions, etc.) to the given data and then apply a statistical inference test to determine whether an unseen instance belongs to this model or not. The key limitation of these methods is the assumption about the distribution of the data points. This assumption often does not hold, especially when the dimension of the data is high [21].
In distance-based outlier detection methods, a point is considered an outlier if the data set does not contain enough points whose distance from this point is smaller than a threshold value [26].
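The distance-based criterion can be illustrated with a minimal sketch. The function name and the parameters `radius` and `min_neighbors` are illustrative assumptions, not notation taken from [26], and the brute-force search is chosen only for clarity.

```python
import math

def distance_based_outliers(points, radius, min_neighbors):
    """Return indices of points that have fewer than `min_neighbors`
    other points closer than `radius` (hypothetical parameter names)."""
    outliers = []
    for i, p in enumerate(points):
        # Count the other points that lie within the distance threshold.
        close = sum(1 for j, q in enumerate(points)
                    if j != i and math.dist(p, q) < radius)
        if close < min_neighbors:
            outliers.append(i)
    return outliers
```

For example, with three points near the origin and one point at (10, 10), a radius of 2 and a minimum of two neighbors flags only the distant point.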
3. Proposed Method
3.1. Semisupervised Graph-Based Clustering
In recent years, semisupervised clustering has been an important research topic, as illustrated by the number of studies introduced [27]. The purpose of semisupervised clustering is to integrate side information to improve clustering performance. Generally, there are two kinds of side information: constraints and seeds. Given a data set X, constraints consist of must-link and cannot-link constraints: a must-link constraint between two observations $x \in X$ and $y \in X$ means that x and y should be in the same cluster, and a cannot-link constraint means that x and y should not be in the same cluster. With seeds, a small set of labeled data (called seeds) is provided to the semisupervised clustering algorithm. In practice, this side information is available or can be collected from users [28–31]. We can cite here the work on semisupervised clustering for K-means [32], hierarchical clustering [33], graph-based clustering [34, 35], spectral clustering [36, 37], density-based clustering [38], etc. While many semisupervised clustering algorithms have been introduced, to the best of our knowledge, there are no incremental semisupervised clustering algorithms in the literature.
Our new incremental clustering algorithm, introduced in the next section, is based on semisupervised graph-based clustering using seeds (SSGC). We choose SSGC because it has several advantages: it uses only one parameter, and it can detect clusters in regions of varied density [35]. SSGC consists of two steps, described as follows (see Algorithm 1):

Step 1. Given a k-nearest neighbor graph representing a data set X, this step uses a loop in which, at each iteration, all edges whose weight is less than a threshold θ are removed. The value of θ is initialized to 0 at the first iteration and incremented by 1 after each iteration. The loop stops when each connected component has at most one kind of seed. The main clusters are identified by propagating the labels in each connected component that contains seeds.
Step 2. The remaining points (graph nodes) that do not belong to any main cluster are divided into two kinds: points that have edges connecting them to one or more clusters, and isolated points. In the first case, a point is assigned to the cluster with the largest related weight. The isolated points can either be removed as outliers or be labeled.
We note that, in SSGC, the weight of the edge (the similarity) between two points $x_i$ and $x_j$ in the k-nearest neighbor graph is equal to the number of nearest neighbors that the two points share, as in the following equation:
$$\omega(x_i, x_j) = \left| NN(x_i) \cap NN(x_j) \right|, \tag{1}$$
where $NN(\cdot)$ is the set of k-nearest neighbors of the specified point.
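The shared-neighbor weight and Step 1 of SSGC can be sketched as follows. This is a simplified illustration under stated assumptions rather than the authors' Algorithm 1: neighbors are found by brute force, the graph is treated as undirected, and the helper names (`knn_lists`, `snn_weight`, `components`, `ssgc_step1`) are hypothetical.

```python
import math

def knn_lists(points, k):
    """Brute-force k-nearest neighbor sets (by index), excluding the point itself."""
    n = len(points)
    nn = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        nn.append(set(order[:k]))
    return nn

def snn_weight(nn, i, j):
    """Number of nearest neighbors shared by points i and j."""
    return len(nn[i] & nn[j])

def components(n, edges):
    """Connected components of an undirected graph given as (i, j) pairs."""
    adj = {i: [] for i in range(n)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    seen, comps = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps

def ssgc_step1(points, k, seeds):
    """`seeds` maps a point index to its label.
    Returns (theta, labels); points outside the main clusters get None."""
    n = len(points)
    nn = knn_lists(points, k)
    weights = {}
    for i in range(n):
        for j in nn[i]:
            weights[(min(i, j), max(i, j))] = snn_weight(nn, i, j)
    theta = 0
    while True:
        # Remove every edge whose weight is below the current threshold.
        kept = [e for e, w in weights.items() if w >= theta]
        labels = [None] * n
        clean = True
        for comp in components(n, kept):
            kinds = {seeds[u] for u in comp if u in seeds}
            if len(kinds) > 1:  # two kinds of seeds still connected
                clean = False
                break
            if kinds:           # propagate the single seed label
                label = next(iter(kinds))
                for u in comp:
                    labels[u] = label
        if clean:
            return theta, labels
        theta += 1
```

The loop always terminates because a shared-neighbor weight never exceeds k, so once θ is larger than k every edge is removed and each component contains at most one seed.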
SSGC is efficient compared with semisupervised density-based clustering in detecting clusters for batch data; however, it is not suited to data stream or data warehousing environments, where many updates (insertions/deletions) occur.
3.2. Incremental Graph-Based Clustering Using Seeds
In this section, we propose IncrementalSSGC, based on the SSGC algorithm. In IncrementalSSGC, the seeds are first used with a k-nearest neighbor graph to construct the connected components and identify the value of θ, as in the SSGC algorithm. As in other incremental clustering algorithms, two procedures must be developed: insertion and deletion.
Algorithm 2 shows the insertion step of IncrementalSSGC for a new data point $x_{new}$. At first, the list of edges between $x_{new}$ and the current clusters is created, and all edges with weight smaller than θ are removed. If the list is empty, $x_{new}$ is an outlier with respect to the current situation and is therefore added to a temporary list $L_{temp}$. If there are edges between $x_{new}$ and some connected components, we need to remove edges until $x_{new}$ connects only to components with one kind of label. Finally, the label of $x_{new}$ is given by the label of its connected components. In Step 10, $x_{new}$ and its related edges are added to L; some edges between two points $x_i$ and $x_j$ are also recalculated if $x_{new}$ appears in the nearest neighbors list of $x_i$ or $x_j$. In Step 12, after some insertion steps, we can re-examine the points in $L_{temp}$.
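The insertion logic described above can be sketched in simplified form. This is not a transcription of Algorithm 2: the function and argument names are hypothetical, the weakest edges are dropped first when labels conflict (one possible reading of "remove some edges"), and the neighbor lists of existing points are not refreshed after the insertion.

```python
import math

def k_nearest(points, x, k, skip=None):
    """Indices of the k points nearest to x (optionally skipping one index)."""
    order = sorted((j for j in range(len(points)) if j != skip),
                   key=lambda j: math.dist(x, points[j]))
    return set(order[:k])

def insert_point(points, labels, nn, x_new, k, theta, pending):
    """Try to label x_new from the current clusters; return its label or None.

    `points`, `labels`, and `nn` are parallel lists that are extended in place
    when x_new is labeled; `pending` plays the role of the temporary list.
    """
    nn_new = k_nearest(points, x_new, k)
    # Edges between x_new and labeled neighbors, weighted by shared neighbors.
    edges = {j: len(nn_new & nn[j]) for j in nn_new if labels[j] is not None}
    # Remove edges whose weight is smaller than the threshold.
    edges = {j: w for j, w in edges.items() if w >= theta}
    # Drop the weakest edges until the remaining neighbors agree on one label.
    while len({labels[j] for j in edges}) > 1:
        weakest = min(edges.values())
        edges = {j: w for j, w in edges.items() if w > weakest}
    if not edges:
        pending.append(x_new)  # treated as an outlier for now
        return None
    label = labels[next(iter(edges))]
    points.append(x_new)
    labels.append(label)
    nn.append(nn_new)
    return label
```

Points left in `pending` can be passed to `insert_point` again after further insertions, which corresponds to re-examining the temporary list.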

Algorithm 3 presents the detailed steps of the deletion process. When we want to remove a point $x_{del}$ from the current clusters, we simply remove $x_{del}$ and all edges related to $x_{del}$ from the graph. Step 2 of the algorithm is the updating process, in which we need to update all edges affected by $x_{del}$: every edge between two points $x_i$ and $x_j$ must be updated if $x_{del}$ appears in their common list of nearest neighbors. Finally, Step 3 simply removes all edges that have weight less than θ.
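The three deletion steps can be sketched on a small dictionary-based graph. The data layout (`nn` as a dict of neighbor sets, `edges` keyed by `frozenset` pairs) and the function name are assumptions made for illustration; in particular, this sketch does not search for replacement neighbors after a point is removed.

```python
def delete_point(nn, edges, x_del, theta):
    """Remove x_del from a shared-nearest-neighbor graph, in place.

    nn:    dict mapping point id -> set of neighbor ids
    edges: dict mapping frozenset({i, j}) -> shared-neighbor weight
    """
    # Step 1: remove the point and every edge incident to it.
    nn.pop(x_del, None)
    for e in [e for e in edges if x_del in e]:
        del edges[e]
    # Step 2: points that listed x_del as a neighbor lose it, so the weight
    # of every edge between two such points is recomputed.
    affected = {i for i, neighbors in nn.items() if x_del in neighbors}
    for i in affected:
        nn[i].discard(x_del)
    for e in list(edges):
        i, j = tuple(e)
        if i in affected and j in affected:
            edges[e] = len(nn[i] & nn[j])
    # Step 3: drop edges whose weight fell below the threshold.
    for e in [e for e, w in edges.items() if w < theta]:
        del edges[e]
```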

3.2.1. The Complexity Analysis
Now, we analyse the complexity of IncrementalSSGC. Recall that the complexity of SSGC on a data set with n objects depends on n and on the number of nearest neighbors k. Assuming that the current clusters include n objects, we analyse the complexity of the insertion and deletion processes of IncrementalSSGC as follows.
For the insertion process, which aims to identify the cluster label of a new data point $x_{new}$, Step 1 creates the list of edges between $x_{new}$ and the current clusters and is the most expensive step. Steps 2, 6, and 7 are comparatively cheap. In Step 10, some edges between two points $x_i$ and $x_j$ are also recalculated if $x_{new}$ appears in the nearest neighbors list of $x_i$ or $x_j$; in practice, the number of such edges is also small. The cost of inserting a new point is therefore dominated by Step 1.
For the deletion process, Step 1 only removes $x_{del}$ and its edges. In Steps 2 and 3, the number of edges updated is the number of edges whose endpoints have $x_{del}$ as a common neighbor, and this number depends on the data set. Let q be the average of this value over the deletion processes; in practice, q is determined by performing experiments. The complexity of a deletion process therefore depends on q.
In summary, the analysis of the insertion and deletion processes above shows that IncrementalSSGC is very useful for data sets that need to be updated frequently. In the next section, we also present the running time of both SSGC and IncrementalSSGC for some data sets extracted from the intrusion detection problem.
3.3. A Fast Outlier Detection Method
Given a k-nearest neighbor graph (kNNG), the local density score LDS of a vertex $u \in$ kNNG is defined as follows [39]:
$$LDS(u) = \frac{\sum_{q \in NN(u)} \omega(u, q)}{k}, \tag{2}$$
in which ω is calculated as in equation (1), $NN(u)$ is the set of k-nearest neighbors of u, and k is the number of nearest neighbors used. The LDS is used as an indicator of the density of the region around a vertex u. The LDS value lies in the interval [0, k − 1]; the larger the LDS of u, the denser the region that u belongs to, and vice versa. We can therefore use the LDS to identify outliers. To detect outliers by this method, we use a parameter as a threshold: a point whose LDS value is smaller than the threshold is regarded as an outlier. Similar to LOF, the method requires $O(n^2)$ time.
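A minimal sketch of the LDS computation and the thresholding rule is given below. It assumes brute-force neighbor search and that the score is the average shared-neighbor weight between a point and its k nearest neighbors; the function names are hypothetical.

```python
import math

def knn_lists(points, k):
    """Brute-force k-nearest neighbor sets (by index), excluding the point itself."""
    n = len(points)
    nn = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        nn.append(set(order[:k]))
    return nn

def local_density_scores(points, k):
    """Average number of neighbors each point shares with its own neighbors."""
    nn = knn_lists(points, k)
    return [sum(len(nn[u] & nn[q]) for q in nn[u]) / k
            for u in range(len(points))]

def lds_outliers(points, k, threshold):
    """Indices of points whose score is below the threshold."""
    scores = local_density_scores(points, k)
    return [u for u, s in enumerate(scores) if s < threshold]
```

A point lying between two compact groups draws its neighbors from both groups, and those neighbors share few neighbors with it, so its score is low.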
To reduce the running time of the method, we propose a fast outlier detection method based on the local density score, called FLDS. The basic idea of FLDS is to use a divide-and-conquer strategy. Given a data set X in which to find outliers, the input data set is first split into k clusters using the K-means algorithm. Next, a k-nearest neighbor graph is built for each cluster, and outliers are identified locally within each cluster. Finally, the outliers found in all clusters are re-evaluated on the whole data set. The idea of divide-and-conquer with K-means as a preprocessing step has been successfully applied to problems such as fast spectral clustering [40], fast minimum spanning tree construction [41], and efficient and effective shape-based clustering [42]. The FLDS algorithm is described in Algorithm 4.
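The divide-and-conquer idea can be sketched as follows. This is an illustrative reading of the description above, not the authors' Algorithm 4: the K-means routine is a plain Lloyd iteration with evenly spaced initial centers, the number of clusters (`n_clusters`) and the number of neighbors (`k`) are kept as separate hypothetical parameters, and neighbor lists are recomputed on demand for clarity, whereas a practical implementation would cache them.

```python
import math

def knn(points, idx, pool, k):
    """Indices of the k nearest neighbors of points[idx] within `pool`."""
    others = [j for j in pool if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return set(others[:k])

def lds(points, idx, pool, k):
    """Local density score of points[idx], using only the points in `pool`."""
    nn_u = knn(points, idx, pool, k)
    return sum(len(nn_u & knn(points, q, pool, k)) for q in nn_u) / k

def kmeans(points, n_clusters, iterations=20):
    """Plain Lloyd iterations; returns one list of point indices per cluster."""
    dim = len(points[0])
    centers = [points[c * len(points) // n_clusters] for c in range(n_clusters)]
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in range(n_clusters)]
        for i, p in enumerate(points):
            nearest = min(range(n_clusters),
                          key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(i)
        for c, g in enumerate(groups):
            if g:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(points[i][d] for i in g) / len(g)
                                   for d in range(dim))
    return groups

def flds(points, n_clusters, k, threshold):
    """Local pass inside each K-means cluster, then a global recheck."""
    candidates = []
    for group in kmeans(points, n_clusters):
        if len(group) <= k:
            # Too small for a local k-nearest neighbor graph: recheck globally.
            candidates.extend(group)
        else:
            candidates.extend(i for i in group
                              if lds(points, i, group, k) < threshold)
    everyone = list(range(len(points)))
    return sorted(i for i in candidates
                  if lds(points, i, everyone, k) < threshold)
```

Only the candidates found locally are scored against the whole data set, which is where the saving over a single global pass comes from.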

The FLDS algorithm is an outlier detection method based on K-means and the local density score computed on a graph. Its complexity depends on the number of clusters k, which may be chosen up to $\sqrt{n}$ [41, 42]; with this choice, the complexity of FLDS is $O(n^{1.5})$.
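As a rough sanity check on the order of growth (a back-of-the-envelope estimate under stated assumptions, not the authors' derivation), suppose that neighbors are found by brute force, that K-means runs for a fixed number of iterations, that the k clusters are roughly balanced with about $n/k$ points each, and that $k = \sqrt{n}$:

```latex
\begin{aligned}
T_{\text{K-means}} &= O(nk), \\
T_{\text{local graphs}} &= k \cdot O\!\left(\left(\tfrac{n}{k}\right)^{2}\right)
                         = O\!\left(\tfrac{n^{2}}{k}\right), \\
k = \sqrt{n} \;&\Rightarrow\;
O\!\left(n\sqrt{n}\right) + O\!\left(\tfrac{n^{2}}{\sqrt{n}}\right) = O\!\left(n^{1.5}\right).
\end{aligned}
```

Rechecking m candidate outliers against the whole data set adds $O(mn)$ for the neighbor search, which stays within the same bound as long as m is on the order of $\sqrt{n}$.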
4. Experiment Results
This section evaluates the effectiveness of our proposed algorithms. We show the results of IncrementalSSGC, the results of FLDS, and the results obtained when our methods are used in a hybrid framework for the intrusion detection problem. IncrementalSSGC is compared with IncrementalDBSCAN, while FLDS is compared with LOF.
The data sets used in the experiments are mostly extracted from the Aegean WiFi Intrusion Dataset (AWID) [14]. AWID is a publicly available collection of data sets in an easily distributed format, which contain real traces of both normal and intrusive 802.11 traffic. AWID contains many kinds of attacks, which fall into three main categories: flooding, injection, and impersonation. AWID has 156 attributes, of which we use the 35 attributes extracted by an artificial neural network, as presented in [8]. We also use some supplementary data sets from UCI [43], as well as data sets with clusters of different size, shape, and density that contain noise points and special artifacts [44].
4.1. Experiment Setup
4.1.1. Data Sets for Incremental Clustering Algorithms
To show the effectiveness of IncrementalSSGC, two aspects are examined: running time and accuracy. Five UCI data sets and three data sets extracted from AWID are used for testing IncrementalSSGC and IncrementalDBSCAN. The details of these data sets are presented in Table 1.
To evaluate clustering results, the Rand Index is used. Given a data set X with n points for clustering, let $P_1$ be an array containing the true labels and $P_2$ an array containing the results of a clustering algorithm. The Rand Index (RI) is calculated as follows:
$$RI = \frac{2(a + b)}{n(n - 1)}, \tag{3}$$
in which a/b is the number of pairs that are in the same/different clusters in both partitions $P_1$ and $P_2$. The bigger the Rand Index, the better the result.
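The Rand Index can be computed directly from its definition by counting agreeing pairs. The sketch below uses the standard pair-counting form; the function and variable names are illustrative.

```python
from itertools import combinations

def rand_index(true_labels, pred_labels):
    """Fraction of point pairs on which the two partitions agree."""
    n = len(true_labels)
    a = b = 0
    for i, j in combinations(range(n), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            a += 1  # pair grouped together in both partitions
        elif not same_true and not same_pred:
            b += 1  # pair separated in both partitions
    return 2 * (a + b) / (n * (n - 1))
```

The measure depends only on how points are grouped, so a clustering that reproduces the true partition with different label names still scores 1.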
4.1.2. Data Sets for FLDS and LOF
We used five data sets extracted from AWID and four 2D data sets, DS1 (10000 points), DS2 (8000 points), DS3 (8000 points), and DS4 (8000 points) [44], for FLDS and LOF. These 2D data sets have clusters of different size, shape, and orientation, as well as random noise points and special artifacts. The details of the AWID data sets are presented in Table 2.
To compare LOF and FLDS on the AWID data sets, we use the ROC measure, which has two factors: the False Positive (False Alarm) Rate (FPR) and the False Negative (Miss Detection) Rate (FNR). These factors are defined by the following equations:
$$FPR = \frac{FP}{FP + TN}, \tag{4}$$
$$FNR = \frac{FN}{FN + TP}, \tag{5}$$
in which True Positive (TP) is the number of attacks correctly classified as attacks; True Negative (TN) is the number of normal records correctly detected as normal; False Positive (FP) is the number of normal records falsely classified as attacks, namely false alarms; and False Negative (FN) is the number of attacks falsely detected as normal.
To combine the FPR and FNR values, we calculate the Half Total Error Rate (HTER), similar to the evaluation method used in [11], defined as follows:
$$HTER = \frac{FPR + FNR}{2}. \tag{6}$$
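These error rates can be computed from labeled predictions as in the following sketch. The function name and the convention that the attack class is encoded as 1 are assumptions for illustration, and the sketch presumes that both classes occur in the true labels.

```python
def error_rates(y_true, y_pred, attack=1):
    """Return (FPR, FNR, HTER) for binary attack/normal predictions."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if truth == attack:
            if guess == attack:
                tp += 1  # attack correctly classified as attack
            else:
                fn += 1  # attack missed
        else:
            if guess == attack:
                fp += 1  # false alarm
            else:
                tn += 1  # normal correctly detected as normal
    fpr = fp / (fp + tn)
    fnr = fn / (fn + tp)
    return fpr, fnr, (fpr + fnr) / 2
```

With four attacks of which one is missed and six normal records of which one raises a false alarm, the rates are 1/6 and 1/4, and the HTER is their mean.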
4.2. Clustering Results
We note that there is no incremental semisupervised clustering algorithm in the literature, so we compare the performance of our algorithm with that of IncrementalDBSCAN. IncrementalDBSCAN can be seen as the state of the art among the incremental clustering algorithms proposed; it can detect clusters of different sizes and shapes in the presence of noise. Because SSGC and IncrementalSSGC produce the same results, we show only the results for IncrementalSSGC and IncrementalDBSCAN. The results are shown in Figure 4.
We can see from the figure that IncrementalSSGC obtains better results than IncrementalDBSCAN. This can be explained by the fact that IncrementalDBSCAN cannot detect clusters with different densities; as mentioned in [16], “…we assumed that the parameter values Eps and MinPts of DBSCAN do not change significantly when inserting and deleting objects…” This assumption means that IncrementalDBSCAN cannot work well on data sets having different densities. In contrast to IncrementalDBSCAN, IncrementalSSGC does not depend on the density of the data because the similarity measure used is based on shared nearest neighbors.
4.2.1. Running Time Comparison
Figure 5 presents the running time of IncrementalSSGC and IncrementalDBSCAN for three AWID data sets. The running times of the two algorithms are similar. This can be explained by the fact that both algorithms use k-nearest neighbors to find clusters at each incremental step. We also present the running time of the SSGC algorithm for reference. This experiment shows the advantage of the incremental clustering algorithms.
4.3. The Results of FLDS and LOF
Table 3 presents the results obtained by FLDS and LOF for the five AWID data sets. The results of FLDS are comparable with those of LOF. The parameters used for both methods are shown in Table 4. For the 2D data sets, Figure 6 presents the results obtained by FLDS and LOF. Intuitively, the outliers detected by the two methods are mostly similar. This can be explained by the fact that both methods decide whether a point is an outlier based on a local density score.
Figures 7 and 8 illustrate the running time comparison between FLDS and LOF. For the four data sets mentioned above, the figures show that FLDS is about 12 times faster than LOF, which is a significant improvement. This can be explained by the fact that the complexity of FLDS is just $O(n^{1.5})$, compared with $O(n^2)$ for LOF.
4.4. A Framework for Intrusion Detection in 802.11 Networks
In this section, we propose a multistage system based on machine learning techniques applied to the AWID data set. The details of our system are presented in Figure 9. Three components are used for the intrusion detection task: (1) a supervised learning model (J48, Bayes, random forest, support vector machine, neural network, etc.) trained on a labeled data set, which can be seen as the misuse detection component; (2) an outlier detection method (LOF, FLDS, etc.), optionally used to detect new attacks in some periods of time; and (3) an incremental clustering algorithm (IncrementalDBSCAN, IncrementalSSGC, etc.) for further detecting impersonation attacks, which are very difficult to detect in the AWID data sets as presented above.
In this experiment, we use J48 for the misuse detection process and IncrementalSSGC for detecting impersonation attacks. In the outlier detection step, we propose to use FLDS or LOF, whose results have been presented in the subsection above. Because the outlier detection step can be run offline over some periods of time, we show only the results obtained by combining J48 and IncrementalSSGC. The confusion matrix of these results is given in Table 5, and the total accuracy obtained can be compared with that reported in [14]. The results obtained by IncrementalSSGC can be explained by the fact that the algorithm uses a distance based on shared nearest neighbors, which overcomes the limits of traditional distance measures such as the Euclidean or Minkowski distance, and that the shared nearest neighbors measure does not depend on the density of the data. The proposed system is generally called a hybrid method, which is one of the best strategies for developing intrusion detection systems [7, 9], since no single classifier can exactly detect all kinds of classes.
We also note that in real applications, whenever an attack appears, the system needs to produce a warning immediately. The multistage system based on machine learning techniques provides users with a solution for constructing a real IDS/IPS, which is one of the most important problems in network security.
5. Conclusion
This paper introduces an incremental semisupervised graph-based clustering algorithm and a fast outlier detection method. Both methods can be used in a hybrid framework for the intrusion detection problem on WiFi data sets (AWID). Our proposed multistage system based on machine learning techniques provides a guideline for constructing a real IDS/IPS, which is one of the most important problems in network security. Experiments conducted on data sets extracted from AWID and UCI show the effectiveness of our proposed methods. In the near future, we will continue to develop other kinds of machine learning methods for the intrusion detection problem and test them in other experimental setups.
Data Availability
The data used to support the findings of this study can be downloaded from the AWID repository (http://icsdweb.aegean.gr/awid/download.html).
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.