Abstract

This paper focuses on an important research problem of cyberspace security. As an active defense technology, intrusion detection plays an important role in the field of network security. Traditional intrusion detection technologies suffer from low accuracy, low detection efficiency, and long processing time, and the shallow structures of machine learning can no longer respond in time. To solve these problems, deep learning-based methods have been studied to improve intrusion detection. The advantage of deep learning is its strong ability to learn features and to handle very complex data. Therefore, we propose a deep random forest-based network intrusion detection model. The first stage uses a sliding window to segment the original features into many small pieces and then trains a random forest to generate the concatenated class vector as a re-representation. This vector is used to train the multilevel cascade of parallel random forests in the second stage. Finally, the classification of the original data is determined by a voting strategy after the last layer of the cascade. Meanwhile, the model is deployed in a Spark environment and optimizes the cache replacement strategy of RDDs through efficiency sorting and partition integrity checks. The experimental results indicate that the proposed method can effectively detect anomalous network behaviors, with high F1-measure scores and high accuracy. The results also show that it can cut down the average execution time on clusters of different scales.

1. Introduction

The rapid development of cloud computing, edge computing, and 5G technologies has widely infiltrated politics, economics, culture, and other aspects of life. The massive data generated from these everyday scenarios will boost more valuable output from big data; meanwhile, such extensive applications could make the prospect of big data complicated and unsafe. Considering the complexity, high dimensionality, heterogeneity, and processing speed of large data volumes, potential risks exist not only in the system architecture but also in the data itself. Most traditional protection solutions can no longer satisfy the requirements of the big data environment because distributed data sources make it difficult to define the boundaries of a dataset, which threatens the authenticity of the data being analyzed.

Outliers are also known as anomalies or deviants in data mining and statistical analysis. In cyberspace security, outlier detection is the process of analyzing suspects whose key values or behavior patterns are significantly different from those of normal objects. The detection algorithm recognizes the abnormalities, and the confirmed data are then cleaned to ensure data security. Outlier detection is now a hot topic in academia and industry. For example, anomalous spots that appear in magnetic resonance imaging or other types of medical diagnostic devices typically indicate disease conditions, and outlier records in product payments from unusual locations or frequent large transactions help detect credit card fraud in financial settings. Other examples are rumor detection in social media and congestion detection in urban traffic management [1–3]. Among these challenges, network intrusion detection is critical for cyberspace security [4]. Network intrusion detection is a technology designed and configured to ensure the security of computer systems by detecting violations of security policies in computer networks. The combination of software and hardware for intrusion detection forms an intrusion detection system (IDS). An IDS is aimed at analyzing the network traffic or the activity of a single machine in order to discover unauthorized activities; such activities can originate from malware or be related to a human attack operated locally or remotely [5]. Many machine learning algorithms have already been widely used to identify outliers in networks [6–8]. However, most conventional methods, such as back propagation (BP), support vector machine (SVM), and random forest (RF) [9], are limited by unacceptable detection accuracy when network data are complex and high-dimensional; their accuracy on the UNSW-NB15 dataset does not exceed 91% [10]. This reveals that shallow machine learning structures struggle to respond. Although intrusion detection is a key issue, research on methods that combine deep learning and machine learning is still insufficient [11]. Deep neural networks are powerful, but they are very complicated, with many hyperparameters, and their learning performance depends heavily on careful tuning of these hyperparameters, so the training costs are huge. These factors usually make the training of deep neural networks very hard; sometimes, it is more like an art than engineering. This inspires us to explore other deep learning structures for network anomaly detection.

Ensemble learning is an important approach of machine learning, and random forest is one of the classic algorithms in ensemble learning. It fits high-dimensional data, has only a few parameters, and its training is not complicated. So, we consider using the forest as a layer to replace the neurons in our deep network structure. Besides, the training of each decision tree in a random forest is independent, which makes it natural for parallel deployment. Each layer of this deep forest structure can be deployed in parallel to speed up the training process.

In order to reduce the obstacles caused by the numerous hyperparameters of present deep learning-based intrusion detection methods and to further improve classification accuracy and scalability, this paper proposes a detection model based on feature segmentation and a deep structure of parallelized random forests (FS-DPRF). The main contributions of this paper are as follows.
(1) A deep cascade structure of random forests is proposed, and each layer is parallelized to improve accuracy and scalability and to fit massive data in detection tasks. Various types of attacks can be classified.
(2) A sliding window is introduced to segment the high-dimensional features into small feature vectors for training, which reduces the calculation volume of each computation while keeping the integrity of the original information.
(3) Compared with the classic parallel random forest in Spark, the approach optimizes the replacement of RDDs loaded in memory with efficiency sorting and partition integrity checks, which improves cluster task execution efficiency.

The performance of the proposed model is verified and compared with other algorithms in four network intrusion datasets, and the experiment results fully prove the effectiveness of the proposed model on network anomaly detection.

The remainder of the paper is organized as follows. Section 2 reviews the related work. In Section 3, we illustrate the detection model designed for intrusion detection. Section 4 introduces the memory optimization designed for model parallelization. Section 5 states the preliminary assumptions and hypotheses, and the model is evaluated with a series of experiments in Section 6. Section 7 discusses the limitations of the research, and the conclusion of this paper is presented in Section 8.

2. Related Work

At present, many scholars have studied the intrusion detection issue. A recent survey by Buczak and Guven [12] made a comprehensive review of the current data mining and machine learning methodologies for intrusion detection; it described the strengths and weaknesses of each algorithm and provided a clear outlook for future work. The classic algorithms can be categorized into artificial neural networks [13, 14], clustering-based methods, Bayesian networks, ensemble learning [15], SVM-based methods [16], and hidden Markov models (HMMs). Khalvati et al. [17] proposed the SVM hybrid learning method (distance sum-based SVM, DSSVM). In DSSVM, the distance sum is calculated based on the correlation between each data sample and the clustering feature dimension obtained from the dataset, and the SVM is then used for classification, achieving a high detection rate. Vinayakumar et al. [18] used a convolutional neural network (CNN) for network intrusion detection; the work models network traffic as a time series and then uses supervised learning methods to model TCP/IP protocol packets within a predefined time range. The effectiveness of this network structure in intrusion detection was proved on the KDD99 dataset. Potluri and Diedrich [19] proposed an accelerated deep neural network (A-DNN) structure, which is used to identify anomalies in network data and processes them with an accelerator platform. Experimental results show that this method is feasible and effective on NSL-KDD. Gao et al. [20] introduced the deep belief network into the field of anomaly detection; a multilayered Boltzmann machine is used to form a neural network classifier. When deep belief networks were compared with the SVM on the KDD99 dataset, the former showed better performance. Dominguez et al. [21] evaluated unsupervised algorithms from various research fields through extensive comparative experiments; unsupervised feature learning works in most cases but still lacks interpretability and requires manual analysis. Hundman et al. [22] proposed a model based on LSTM and a novel dynamic thresholding approach; the model does not rely on scarce labels or false parametric assumptions to deal with time series data and achieves high accuracy with good interpretability. Manzoor et al. [23] introduced a density-based ensemble method for the feature-evolving streams problem, which measures outliers at multiple scales or granularities and works especially well in high-noise environments.

In recent years, the stacking method [24] and boosting methods have become popular in ensemble learning. Liu et al. [25] proposed the isolation forest algorithm, which establishes an anomaly index based on the path length from the leaf node to the root node; its detection of global outliers is good, but it is weak at dealing with locally sparse points. The gradient boosting decision tree (GBDT) proposed by Friedman [26] generates a prediction model in the form of a set of base learners and combines them into a strong learner through iteration. Each time a model is established, the gradient descent direction of the loss function is determined first; in successive iterations, the residual is continuously reduced to produce a vertically deepened tree. It has the advantages of high prediction accuracy and strong robustness to outliers. Chen and Guestrin [27] put forward a scalable tree boosting system (XGBoost); its main idea is also boosting along the negative gradient direction of the loss function. The biggest difference is that the empirical error is expanded by a second-order Taylor expansion and regularization terms are added, which makes the loss function scalable and yields high precision and a good fitting effect, but there are too many hyperparameters, which makes classification quite dependent on the tuning result. Su et al. [28] proposed an intrusion detection method using the XGBoost algorithm on an unbalanced dataset; it uses an improved SMOTE algorithm to oversample the minority samples and downsample the majority samples. The method is premised on changing the original feature distribution of the data, which not only increases the calculation burden of the model but also easily loses some important information in the samples and affects the final detection performance. Farnaaz and Jabbar [8] proposed using the random forest algorithm to detect various types of attacks and verified the model on NSL-KDD data. The results prove that the detection accuracy of DOS, PROBE, U2R, and R2L is improved, but the capability of feature processing is weak. In the latest research, Roberto et al. [5] proposed a probabilistic-driven ensemble model (PDE) that uses logistic regression to evaluate the effect of ensemble learning classifiers. The model excludes predictors with lower probabilities from the classification process and combines the most effective algorithms by a weighted probability criterion. Experiments on the NSL-KDD dataset show that the PDE has high performance in detecting intrusions. Zhou et al. [29] proposed a novel ensemble system based on the modified AdaBoost with the area under the curve (M-AdaBoost-A) algorithm; strategies such as SMV and PSO are applied to combine multiple M-AdaBoost-A-based classifiers. It shows better performance on two intrusion detection problems, 802.11 wireless networks and traditional enterprise networks, but it lacks an evaluation of model time consumption. Khan et al. [30] proposed a deep learning model (TSDL) based on a stacked autoencoder with a soft-max classifier. Their deep learning model works in a cascade manner; it uses a probability score value as an additional feature in the final decision stage in order to detect the normal state and other classes of attacks. TSDL has achieved impressive results in the accuracy of multiclass detection on UNSW-NB15 and KDD99.

Deep learning models usually have good performance, but they have too many complicated hyperparameters to adjust, and in most cases it is difficult to achieve good performance with low complexity. To solve this problem, the model we propose introduces a sliding window and a deep structure into random forests to enhance the diversity of decision trees, thereby improving the generalization ability of ensemble learning and the accuracy of network intrusion detection with far fewer parameters. At the same time, our method optimizes the data cache replacement of RDDs on the Spark cluster and cuts down the execution time of detection tasks.

2.1. Algorithm Selection Criteria

The selection of the algorithm needs to refer to the IDS architecture, which can be divided into a centralized structure and a distributed structure.

Most IDS algorithms use a single-machine centralized structure, that is, data collection and analysis are performed on one host, and detection is based on host audit data. The centralized structure has the advantages of simplicity and easy implementation; its disadvantage is that processing is slow. So, it is suitable for small network systems.

The distributed structure comprises hierarchical and collaborative structures. The hierarchical structure is a tree-type hierarchical system, like the model proposed in this paper; it combines the simplicity of a centralized structure with the robustness of a distributed structure. The distributed structure also makes detection faster, which is suitable for larger-scale network systems.

3. Model Description

The proposed detection model consists of feature segmentation, a deep parallel random forest, and a voting strategy. Feature segmentation is the first stage of the model; it segments the original features to reduce the calculation volume of high-dimensional data in a single computation and generates a concatenated class vector as a new representation. In the second stage, the concatenated class vector is used to train the deep parallel random forest, which predicts a probability distribution over the original data types. Finally, the voting strategy after the last layer of the cascade confirms the outlier. Figure 1 shows the overview of the FS-DPRF model.

3.1. Feature Segment

The first stage of the model simplifies the original data features, as shown in Figure 2, by using a sliding window to segment the features into many equally sized feature vectors; the dimensionality of each feature vector is lower than that of the original features, which reduces the calculation volume of every single computation in the random forest. Assume that a linear feature vector has length n, the window length of a feature slice is m, and the window slides 1 unit at a time; then n − m + 1 m-dimensional feature vectors will be generated. Suppose a detection task contains c categories; after feature processing, a linear feature vector of length n will generate a new feature vector of length c(n − m + 1). Similarly, for image data, feature segmentation will generate a new feature vector of length c(n − m + 1)². For instance, consider an intrusion sample with 40 features and four types of attacks, namely DOS (denial of service), R2L (remote to local), U2R (user to root), and PROBE (surveillance and probing), with the slice window size set to 10. Then there will be a total of 31 feature vectors, each of which is 10-dimensional.
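For concreteness, a minimal NumPy sketch of this sliding-window segmentation is given below; the function name segment_features and the random sample are ours for illustration only.

import numpy as np

def segment_features(x, win_size):
    """Slide a window of length win_size (stride 1) over a 1-D feature vector x."""
    n = len(x)
    return np.array([x[i:i + win_size] for i in range(n - win_size + 1)])

# Example matching the setting above: 40 raw features, window size 10.
sample = np.random.rand(40)
slices = segment_features(sample, win_size=10)
print(slices.shape)  # (31, 10): 31 feature vectors, each 10-dimensional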

After that, each feature vector will sequentially be put into a single-layer random forest, and then, class probability vectors [31] will be generated. A detailed explanation of the generation process of the class probability vector is depicted in Figure 3.

The entropy of the feature vector is calculated by the Gini index before a node split. The Gini index is a model for calculating the entropy, defined in the following equation:

Gini(t) = 1 − ∑k P(Ck | t)²

where t is the target split node, and P(Ck | t) represents the probability that a sample at node t belongs to class Ck.

The class probability is derived from the group of samples that eventually fall on a leaf node, and the predictions of all decision trees in the forest are then averaged to obtain the output class vector. The 31 feature vectors above are thus transformed into 31 class vectors, each of which is 4-dimensional. Finally, as shown in Figure 2, all class vectors are concatenated to form a re-represented feature vector as an enhanced representation of the original data features. This new feature is used as input to train the cascade random forest in the next stage, as sketched below.
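The sketch below illustrates one plausible reading of this first stage with scikit-learn, assuming a separate forest is trained for each window position and that predict_proba supplies the class probability vectors; the estimator settings are illustrative, not the exact configuration of FS-DPRF.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rerepresent(slices_train, y_train, slices):
    """Concatenate per-window class probability vectors into the re-represented feature.
    slices_train, slices: arrays of shape (n_samples, n_windows, win_size)."""
    parts = []
    for w in range(slices_train.shape[1]):
        rf = RandomForestClassifier(n_estimators=100, random_state=0)
        rf.fit(slices_train[:, w, :], y_train)
        parts.append(rf.predict_proba(slices[:, w, :]))   # (n_samples, n_classes)
    return np.hstack(parts)                               # e.g. 31 windows x 4 classes = 124 features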

3.2. Deep Parallel Random Forest

The parallel random forest forms the deep forest structure by cascade stacking. Each new layer in the cascade structure concatenates the re-represented feature vector with the class vector of the previous layer as input. Specifically, each layer of the cascade PRF counts the prediction results of all decision trees on the input samples and generates the probability distribution over the classes as a class vector. Subsequently, this class probability vector is concatenated with the transformed feature formed by feature segmentation to train the next layer. For example, the re-represented feature vector from the first stage is input to train the cascade random forest; under the previous assumption, the first layer of the cascade outputs a 4-dimensional class vector, which is then concatenated with the input feature vector to train the next layer, and so on. The structure of the deep parallel random forest is shown in Figure 4. Compared with a single parallel random forest, the cascade PRF can improve the generalization ability of ensemble learning. It is worth noting that each time a new cascade layer is expanded, the cascade structure randomly extracts 80% of the training set for growing and uses the remaining 20% as a validation set to verify the performance gain of the new layer. When the performance improvement is lower than the threshold, the training process terminates automatically, and the number of cascaded PRF layers is thereby determined.
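A simplified sketch of this cascade-growing loop is shown below; it assumes scikit-learn forests, a fresh 80/20 split per layer, and validation accuracy as the performance-gain criterion, so it illustrates the idea rather than the exact FS-DPRF implementation.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def train_cascade(x_rerep, y, tol=1e-3, max_layers=10):
    """Grow cascade layers until the validation gain drops below tol."""
    layers, prev_acc = [], 0.0
    augmented = x_rerep                                    # layer input: re-represented features
    for _ in range(max_layers):
        x_tr, x_val, y_tr, y_val = train_test_split(augmented, y, test_size=0.2)
        rf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
        rf.fit(x_tr, y_tr)
        acc = accuracy_score(y_val, rf.predict(x_val))
        if layers and acc - prev_acc < tol:                # gain below threshold: stop growing
            break
        layers.append(rf)
        prev_acc = acc
        # next layer input: re-represented features + this layer's class vector
        augmented = np.hstack([x_rerep, rf.predict_proba(augmented)])
    return layers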

3.3. Voting Strategy

In ensemble learning, individual learners output the final prediction by combining their independent judgments through a voting method. An actual outlier detection task can be simplified as an anomaly classification task, and outliers are identified using the voting strategy. The prediction of the last layer in the cascade PRF is the final result: the output classes of all decision trees in the last layer are counted, and the decision is made by voting on the resulting probability distribution. Majority voting is used in anomaly detection tasks with high reliability requirements: if a sample receives more than half of the votes, it is predicted as an outlier; otherwise, the prediction is rejected. However, when a prediction must always be produced, majority voting degenerates to plurality voting; in this case, if several classes receive the same number of votes, one of them is selected. Majority voting and plurality voting are defined as

Majority voting: H(x) = cj if ∑i=1..T hij(x) > (1/2) ∑k=1..N ∑i=1..T hik(x), and the prediction is rejected otherwise.

Plurality voting: H(x) = cj*, where j* = arg maxj ∑i=1..T hij(x).

Here, hi represents decision tree i, T is the number of decision trees in the forest, N is the dimension of the probability vector, and cj is one of the class labels in the collection {c1, c2, c3, …, cN}. The basic learner hi makes a prediction that belongs to the set of class labels {c1, c2, c3, …, cN}, and the probability distribution of hi on the sample x is an N-dimensional vector (hi1(x), hi2(x), …, hiN(x)), where hij(x) is the probability output of hi on class label cj.
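The two rules can be illustrated with a few lines of NumPy; counting one hard vote per tree (the argmax of its probability vector) is our simplification of how the last layer's outputs are aggregated.

import numpy as np

def vote(prob_vectors, majority=True):
    """prob_vectors: (n_trees, n_classes) probability outputs of the last cascade layer
    for one sample. Returns the winning class index, or None if majority voting rejects."""
    votes = np.bincount(prob_vectors.argmax(axis=1), minlength=prob_vectors.shape[1])
    winner = int(votes.argmax())
    if majority and votes[winner] <= votes.sum() / 2:
        return None                      # no class received more than half of the votes
    return winner                        # plurality voting: ties broken by the first maximum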

The detailed steps of FS-DPRF are described in Algorithm 1.

Input: training dataset D = {(x1, y1), (x2, y2)… (xm, ym)};
x: potential anomaly data.
Output: H (x): voting result of sample x;
CPRF: Deep random forests where {PRFi |i = 1, 2, …, N}.
(1)CPRF = {∅}
(2)Initialize hyperparameters: tolerance t and slice window size winSize
(3)D′ = Feature Grained (D); //D′ is the newly generated feature vector.
(4)i = 1//layer i of the cascaded PRF.
(5)do
(6)  PRFi = {∅}
(7)  for j = 1, 2, …, T do
(8)   Dj ⟵ Bootstrap sampling (D′)
(9)   Treej ⟵ decision tree (Dj)
(10)   PRFi+ = {Treej}
(11)  end for
(12)  if (performance gain of PRFi ≥ t)
(13)   CPRF+ = {PRFi}
(14)  else
(15)   break
(16)  end if
(17)  i = i + 1
(18)while (TRUE)
(19)H (x) = voting method (x)//the last layer votes for classification
(20)Return CPRF

4. Parallelization on Spark

Spark is a distributed computing framework developed by UC Berkeley AMP Lab. Spark supports a variety of ways to combine with other big data platforms, which enables it to process large-scale data efficiently. Its memory-based Resilient Distributed Dataset (RDD) mechanism allows intermediate data to be cached in memory [32], saving a lot of I/O overhead, and is well suited to iterative and ensemble algorithms, so the framework has unique advantages for this kind of processing. Each decision tree of an RF is built independently of the others, and each subnode of a decision tree is also split independently; the structures of the FS-DPRF model and of decision-tree-based forests therefore give the computing tasks natural parallelism [33]. However, the training data in the parallel random forest generation process require multiple iterations, and a large number of RDD data blocks need to be reused in the iterations until convergence is reached. Spark's default least recently used (LRU) replacement algorithm cannot cope with our model's requirement for the reuse of RDD data blocks because it can easily swap high-reuse blocks out of the cache, causing inefficient job execution [34]. Based on these facts, a cache hierarchical replacement optimization for RDD objects is presented, which can effectively improve the cluster execution efficiency during the process of building FS-DPRF.

4.1. High Reuse Caching

First, Spark's cache mechanism assigns a cache manager to each worker to manage the RDDs and calculate the cache size. An RDD can be stored only when its data size is no larger than the remaining memory; otherwise, the replacement will be implemented. The size of an RDD is the sum of the sizes of its partitions:

Si = ∑j Sij

where Si represents the total size of all RDDi partitions, and Sij is the size of partition j of RDDi. The computing cost of the RDD partitions is another very important factor, which is defined as

CTij = ETij − STij

where STij is the start time and ETij is the end time of each RDD partition; both are obtained from the partition dependency mechanism in Spark. Note that CTij already includes the communication overhead. On this basis, a weight W(Rij) is assigned to each partition and aggregated into the weight W(Ri) of RDDi, where μ is an impact factor defined by the working environment, and f(Rij) represents the usage count of partition j of RDDi.

Second, processing time is in general linearly related to the size of a data block, so the execution time of an RDD partition can be represented by the proportion of the cluster memory that it occupies:

T(Rij) = Sij / Scache

where T(Rij) represents the execution time of partition Rij, Sij is the partition size of Rij, and Scache is the cluster memory. Since each partition of the RDD under a task set is executed in parallel, the total execution time T(Ri) of the RDD is the longest among all its partitions. Finally, the execution efficiency of an RDD can be quantified as the ratio of its weight to its execution time, and ε(Ri) is used to represent the execution efficiency of each RDD, which is defined in the following equation:

ε(Ri) = W(Ri) / T(Ri)
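For illustration, the sketch below turns per-partition statistics into an efficiency score. The definitions of CTij, T(Rij), and ε(Ri) follow the equations above; the way partition weights are formed (μ times the usage count times the computing cost) and summed into W(Ri) is our assumption, since the exact weight expression is not reproduced here.

from dataclasses import dataclass

@dataclass
class Partition:
    size: float    # Sij, in the same unit as the cluster cache size
    start: float   # STij
    end: float     # ETij
    usage: int     # f(Rij), reuse count of the partition

def efficiency(partitions, s_cache, mu=1.0):
    """Execution efficiency of one RDD: weight divided by execution time."""
    # Assumed weight: mu * usage count * computing cost, summed over partitions.
    w_rdd = sum(mu * p.usage * (p.end - p.start) for p in partitions)   # CTij = ETij - STij
    # T(Ri): the longest partition time, with T(Rij) = Sij / Scache.
    t_rdd = max(p.size / s_cache for p in partitions)
    return w_rdd / t_rdd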

The directed acyclic graph (DAG) of Spark divides the RDD stages and generates the RDD structure tree; we then calculate the execution efficiency of each RDD and cache highly reused RDDs in MapcacheList (rddi, ε). The steps of the high reuse cache method are described in Algorithm 2.

Input: rddList: RDD structure tree.
Output: MapcacheList (rddi, ε): cache collection of RDDs.
(1)for (i = 0 to rddList.Length − 1)
(2)  calc ε(Ri) //Calculate RDD execution efficiency
(3)  if (ε(Ri) > 1)
(4)   MapcacheList.add (rddi, ε(Ri))
(5)  end if
(6)end for

4.2. Hierarchical Replacement

Hierarchical replacement is the second step of the optimization for parallelization. It classifies the RDD targets before the replacement, giving priority to replacing incomplete RDDs. As shown in Figure 5, we design the IntegrityCheck function to verify the RDDs; the function checks each partition and records its integrity in a collection where a flag stores the partition status. If a partition of an RDD is incomplete, it is marked as FALSE and will be replaced; otherwise, it is marked as TRUE. Then, the RDDs with lower efficiency are replaced according to MapcacheList (rddi, ε). The process of hierarchical replacement is presented in Algorithm 3.

Input: MapcacheList (rddi, ε): cache collection of RDDs, Scache: cluster cache size, Scached: cached RDD size, rddnew: cache candidate, Snew: size of the candidate.
Output: updated MapcacheList (rddi, ε)
(1)calc Scached;
(2)IntegrityCheck (rdds); //mark the integrity of each cached RDD partition
(3)flag (rddi) ⟵ partition integrity status;
(4)if (Scached + Snew > Scache)
(5)for ((k, v) in MapcacheList)
(6)  if (flag (k) == FALSE)
(7)  Replace (k = rddnew, flag = TRUE)
(8)  end if
(9)end for
(10)if (Scached + Snew ≤ Scache)
(11)  MapcacheList.add (rddnew, ε(rddnew))
(12)else
(13)  for ((k, v) in MapcacheList)
(14)  v.QuickSort ();//Sort by execution efficiency
(15)  if (ε(k) < ε(rddnew))
(16)  Replace (k = rddnew, ε = ε(rddnew))
(17)  end if
(18)end for
(19)  MapcacheList.add (rddnew, ε(rddnew))
(20)end if
(21)end if
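As an illustration only, the eviction order of Algorithm 3 (incomplete partitions first, then RDDs in ascending efficiency order) can be sketched in plain Python as follows; this is not code against Spark's actual BlockManager API, and all identifiers are ours.

def hierarchical_replace(cache, integrity, s_cache, candidate, cand_size, cand_eff):
    """cache: dict rdd_id -> (size, efficiency); integrity: dict rdd_id -> bool."""
    used = sum(size for size, _ in cache.values())
    if used + cand_size <= s_cache:              # candidate fits without eviction
        cache[candidate] = (cand_size, cand_eff)
        return cache
    # Step 1: evict cached RDDs whose partition integrity flag is FALSE.
    for rdd_id in [k for k, ok in integrity.items() if not ok and k in cache]:
        used -= cache.pop(rdd_id)[0]
    # Step 2: if still not enough room, evict RDDs with lower efficiency than the candidate.
    for rdd_id, (size, eff) in sorted(cache.items(), key=lambda kv: kv[1][1]):
        if used + cand_size <= s_cache:
            break
        if eff < cand_eff:
            used -= size
            del cache[rdd_id]
    if used + cand_size <= s_cache:              # cache the candidate if it now fits
        cache[candidate] = (cand_size, cand_eff)
    return cache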

5. Preliminary Assumptions and Hypotheses

There are three preliminary assumptions for the excellent performance of deep learning models:
(i) Layer-by-layer processing
(ii) Feature transformation
(iii) Sufficient model complexity

Traditional machine learning methods such as decision trees process data layer by layer but lack sufficient complexity. Ensemble methods such as random forest increase the complexity, but they are still not complex enough because there is no feature transformation: processing is always performed in the same feature space. Therefore, our main hypothesis is that feature segmentation and the cascading structure endow the random forest with feature transformation ability and sufficient complexity, thereby improving its generalization ability. Another hypothesis is that the optimization of the Spark-based RDD cache replacement strategy reduces the training and detection time of the proposed model. The experiments require converting character features to numerical features before training the model. The results and analysis in the following subsections support these hypotheses.

6. Experiment

6.1. Dataset and Preprocessing

In order to evaluate the proposed model and report the experiment results, four intrusion datasets are selected, i.e., NSL-KDD [35], UNSW-NB15 [36], CICIDS2017, and CICIDS2018 [37].

NSL-KDD is an improvement of the KDD99 dataset, which was collected from a simulated US Air Force network environment over 9 weeks. The training set does not contain redundant records, and there are no duplicate records in the test set, which makes the detection rate more accurate. Each record contains 43 features including a label. The labels are divided into 5 classes: normal and four attack categories, namely DOS (denial of service attack), R2L (unauthorized access from a remote machine), U2R (unauthorized local superuser privileged access), and PROBE (port monitoring or scanning). Normal represents normal data.

The second dataset used in the experiment is UNSW-NB15. The dataset was collected in 2015 in the real network environment of the Australian Center for Cyber Security (ACCS). The network traffic records contain true normal activity and attack behavior. Each record of this dataset contains 49 network features including a class label, and there are 10 types of network behavior, including normal behavior and 9 kinds of intrusion attacks. The CICIDS2017 and CICIDS2018 datasets are recent datasets developed by the Canadian Institute for Cybersecurity; these two datasets are closer to the real network environment. CICIDS2017 contains 83 original features. We removed some features, such as the source and destination IP, ID, and timestamp, because using this information may lead to overtraining. Finally, we obtained a dataset containing 80 features and selected 2,515,416 samples for the experiments. Similarly, in CICIDS2018, we also removed some unnecessary features and selected an unbiased subset of the original dataset. All dataset details are shown in Tables 1–5.

The four datasets contain many numerical features and several character features; the character features cannot be used directly in the proposed intrusion detection model, so the experiment uses one-hot encoding to convert them from characters to numbers. For example, the second column of the NSL-KDD dataset, "protocol_type," contains three different values, "tcp," "udp," and "icmp," which are encoded as [0, 0, 1], [0, 1, 0], and [1, 0, 0], respectively. After encoding, the data are normalized so that the magnitude relationship between values does not affect the training results, and all feature values are mapped to the interval [0, 1]:

yi = (xi − min(x)) / (max(x) − min(x))

where yi is the normalized feature value, xi is the original feature value, and min(x) and max(x) are the minimum and maximum values within the range of the feature, respectively.
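A minimal pandas sketch of this preprocessing is given below; it assumes the class label column has already been separated out, and the column names are illustrative.

import pandas as pd

def preprocess(df, categorical_cols):
    """One-hot encode character features and min-max scale numeric features to [0, 1]."""
    num_cols = [c for c in df.columns if c not in categorical_cols]
    col_min, col_max = df[num_cols].min(), df[num_cols].max()
    rng = (col_max - col_min).replace(0, 1)                  # guard against constant columns
    df[num_cols] = (df[num_cols] - col_min) / rng            # yi = (xi - min(x)) / (max(x) - min(x))
    return pd.get_dummies(df, columns=categorical_cols)      # e.g. protocol_type -> tcp/udp/icmp columns

# Hypothetical usage on NSL-KDD-style data (label column dropped beforehand):
# data = preprocess(features_df, ["protocol_type", "service", "flag"])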

The experiment cluster is deployed in the High-Performance Computing Center of Hebei University, which consists of a master node and 50 slave nodes. The hardware of each slave node is 2 × Intel Xeon E5-2680 v2 (Ivy Bridge | 10C | 2.8 GHz) and 64 GB DDR3 ECC 1866 MHz four-channel memory. Moreover, the master node is equipped with 4 × Intel Xeon E7-4850 (Ivy Bridge | 10C | 2.0 GHz) and 512 GB DDR3 REG 1333 memory. The internal connection bandwidth is 56 Gbps IB (InfiniBand), with a chip transmission delay of 100 ns. The system setup for all nodes is CentOS-7-GenericCloud-1503.qcow2, Hadoop 2.6.3, Scala-2.10.5, and Spark 1.6.1.

6.2. Hyperparameter Setting

In this part, NSL-KDD is taken as an example dataset to illustrate the influence of hyperparameters in the proposed model and to demonstrate the process of parameter tuning.

Whenever bootstrap sampling is performed on m training samples, the probability that a given sample is never selected is

(1 − 1/m)^m → 1/e ≈ 0.368 (as m → ∞)

that is, about 1/3 of the samples will not appear in the sampled collection; these samples are called out-of-bag (OOB) data.

These data do not participate in the establishment of the decision tree and can replace the validation set to verify the model.
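A quick numerical check of this out-of-bag fraction (a throwaway script, not part of the model):

import numpy as np

m = 100_000
rng = np.random.default_rng(0)
sampled = rng.integers(0, m, size=m)                 # one bootstrap draw of m samples
oob_fraction = 1 - len(np.unique(sampled)) / m
print(oob_fraction, (1 - 1 / m) ** m)                # both are close to 1/e ≈ 0.368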

The OOB error rate is calculated to evaluate the effect of different sliding window sizes on the model. As can be seen from Figure 6, the average error rate is lowest when the window size is d/4 and highest at d/16, where d is the raw feature length. This result indicates that a more fine-grained window size is not necessarily better for enhancing the generalization performance of the model. As the number of decision trees increases, the error rate converges at about 0.06. Therefore, d/4 is finally recommended as the suitable window size.

Then, the remaining parameters are tuned using 10-fold cross-validation. For instance, n_estimators is the number of decision trees; generally speaking, more trees make the model more robust and give better performance. Considering a wider range of applicability, we searched the range (0, 500] with a step size of 50 and compared the cross-validation results to obtain the optimum value, as sketched below. Finally, Table 6 summarizes the remaining hyperparameter settings of FS-DPRF.
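A sketch of this tuning step with scikit-learn is shown below; the scoring metric and random seed are our assumptions, and X_train/y_train stand for the preprocessed training data.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Sweep n_estimators over (0, 500] in steps of 50 with 10-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=0),
    param_grid={"n_estimators": list(range(50, 501, 50))},
    cv=10,
    scoring="f1_macro",
)
# search.fit(X_train, y_train); print(search.best_params_)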

6.3. Model Evaluation

In this section, we compare FS-DPRF with the parallel random forest (PRF), DSSVM [17], and A-DNN [19]. The classification performance is measured by accuracy, precision, recall, F1-measure, detection rate (DR), and false alarm rate (FAR). The F1 score is an evaluation index that comprehensively considers recall and precision; a higher F1 score means better classification performance. The evaluation metrics are defined as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = DR = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)
FAR = FP / (FP + TN)

where TP is the number of true attacks predicted as attacks, FN is the number of true attacks predicted as normal, FP is the number of normal records predicted as attacks, and TN is the number of normal records predicted as normal. The comparison results are the averages of ten runs on each dataset. To verify the performance of the model on the binary classification of normal versus attack, all attack types are regarded as abnormal in this part, and every record is labeled as either normal or abnormal. Table 7 shows the comparison results of normal and abnormal classification on the given datasets.
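For reference, these metrics can be computed directly from confusion-matrix counts; the small helper below simply restates the formulas above.

def metrics(tp, fp, tn, fn):
    """Evaluation metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                 # also the detection rate (DR)
    f1 = 2 * precision * recall / (precision + recall)
    far = fp / (fp + tn)                    # false alarm rate
    return accuracy, precision, recall, f1, far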

On the NSL-KDD dataset, the accuracy of every comparison algorithm exceeds 90%, and the deep neural network achieves a good result of 98%, but it is still 1% lower than FS-DPRF. Similarly, precision and recall show that our model's ability to confirm normal network behavior is the best. On the UNSW-NB15 dataset, the shallow machine learning algorithms begin to perform weakly, with only 80%–85% accuracy, while the two deep learning-based methods reach accuracies above 94.2% and 97.7%, and FS-DPRF is better than the deep neural network. As the data become more complex and closer to the real network environment, the average accuracy of the shallow algorithms on CICIDS2017 and CICIDS2018 drops below 90%, and their precision and recall also lose competitiveness. The accuracy rates of A-DNN and FS-DPRF on the CICIDS2017 dataset reach 96.5% and 97.4%, respectively. Although the precision of A-DNN is higher, leading by 1%, the recall of FS-DPRF is higher; the higher the recall, the higher the probability that an attack is detected, which means that FS-DPRF has better attack detection capability on CICIDS2017. The last group of results is on CICIDS2018: the accuracy of FS-DPRF is 3% higher than that of A-DNN, and its precision and recall are higher than those of A-DNN by 1.3% and 2.4%, respectively.

Figure 7 shows the comparison results of FS-DPRF, the parallel random forest, DSSVM, and the deep neural network on the F1 index. The score of the new method on the NSL-KDD dataset is 3%–5% higher than that of the shallow machine learning methods and 1% higher than that of A-DNN. The results on the UNSW-NB15 dataset show a gap between the deep learning-based methods and PRF and DSSVM: the deep learning-based methods perform better, and the score of FS-DPRF is 3% higher than that of the deep neural network. On CICIDS2017 and CICIDS2018, the F1 scores of the four methods are slightly lower than those of the first two groups of experiments, but the deep learning-based methods still maintain their lead over PRF and DSSVM. The F1 score of FS-DPRF is 1% higher than that of A-DNN on CICIDS2017 and 1.9% higher on CICIDS2018. The F1 scores on these four datasets show that, in this normal/abnormal binary classification experiment, the forest-based deep learning network is better than the shallow machine learning methods and is competitive with the deep neural network.

In order to verify the detection ability of the model in multiclass classification, we performed another set of experiments on the NSL-KDD dataset. According to the original labels, all data are divided into five classes, as shown in Table 8. The data preprocessing and the parameter settings are the same as in the previous part, and ten experiments are performed to take the average. It is worth noting that, considering the support vector machine's binary classification limitation, we spent substantial labeling effort to test DR and FAR separately in a "one attack versus the rest" manner. It can be seen from the experimental results in Table 8 that the method proposed in this paper improves the detection rate on all 5 class labels, and the FAR is relatively low. Although the detection results for the U2R attack type are not particularly ideal, this is related to the imbalanced distribution of categories in the dataset. In summary, the method in this paper shows the best strength in the multiclass attack and normal classification experiments.

We then tested our model on average execution time and speedup, which measure the scalability of the parallel cluster. The speedup is defined as

Sp = T1 / Tp

where p is the number of CPUs, T1 refers to the execution time of the sequential algorithm, and Tp represents the execution time of the parallel algorithm on p nodes. As can be seen from Figure 8, the average execution time of the model on the four given datasets decreases as slave nodes are added. The decreasing trend differs across datasets owing to their different sizes. In all cases, the average execution time shows a strong correlation with the number of cluster nodes, which indicates that the proposed method has good scalability.

The speedup experiment tested the model with different numbers of slave nodes. As shown in Figure 9, the speedup on each dataset increases steadily when the number of nodes grows from 1 to 25 and tends to slow down as the number of nodes grows from 25 to 50. The result shows that the model has good speedup performance on datasets of different volumes and dimensions. However, it does not show the perfect linear growth of the definition above, which can be explained by the communication overhead and task scheduling costs becoming larger as the cluster scale increases.

Finally, we set up 20 slave nodes for the experiment and compared the cache performance of Spark's default LRU algorithm and our optimized method in FS-DPRF. As can be seen from Figure 10, FS-DPRF has less execution time: the time is reduced by 8.8% on NSL-KDD and 8.9% on UNSW-NB15 compared to LRU, and the execution time of FS-DPRF on CICIDS2017 and CICIDS2018 also decreases by 7.2% and 13.3%, respectively. The column chart indicates that the cache replacement strategy proposed in this paper successfully cuts down the execution time of anomaly detection tasks. Although the efficiency sorting and partition integrity check in the replacement sacrifice a part of memory, as shown in Figure 11, where FS-DPRF has a slightly higher memory occupancy rate than the LRU algorithm, acceptable real-time performance is more important for intrusion detection. Therefore, the optimization for parallelization successfully improves the task execution efficiency of the proposed intrusion detection model.

7. Limitations of the Research

The limitation of the research is that the model consumes a lot of memory, so obtaining a well-trained intrusion detection model requires powerful computing equipment. Although the model proposed in this paper achieves good results when trained by CPUs in the Spark distributed environment, the current structure is, unfortunately, not naturally suitable for GPUs. As a result, the model cannot currently be accelerated on GPUs in the way a deep neural network can.

8. Conclusion and Future Work

From the work of predecessors, ensemble learning-based methods have shown convincing performance on challenging missions that can be abstractly understood as classification problems. The main scientific contribution of this paper is to propose a deep learning model based on ensembles of decision trees. Inspired by deep neural networks, we used layers composed of random forests to imitate the hidden layers and fully connected layers of a neural network and to build a cascading model of random forests. The proposed model utilizes a sliding window to segment sample features into many small pieces, which reduces the calculation volume of high-dimensional data in every computation while keeping the integrity of the raw features. The cascade structure improves the generalization ability and achieves a higher accuracy rate, and the model has only a few hyperparameters while achieving good generalization ability. Another part of the contribution is a cache replacement strategy for RDDs in the Spark environment that determines the priority order of RDD loading by calculating weights and checking completeness; it effectively reduces the average execution time of intrusion detection tasks on distributed clusters. The experimental results on four datasets demonstrate that the proposed model performs better than the parallel random forest and support vector machine in F1-measure and accuracy and achieves competitive performance compared with the state-of-the-art approach of deep neural networks. Although the model reduces the average execution time, it increases memory consumption and does not yet support GPU acceleration. Therefore, the model is more suitable for deployment on a distributed cluster with sufficient memory, which also reflects its limitations. In the future, the work will focus more on optimizing the features of the training data to improve prediction accuracy and will further research the issue of unbalanced data distribution in intrusion detection tasks.

Data Availability

The datasets are available at https://www.unb.ca/cic/datasets/index.html, Cyber Range Lab of the Australian Center for Cyber Security (ACCS) (https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/), and Canadian Institute for Cybersecurity (https://www.unb.ca/cic/datasets/ids-2017.html https://www.unb.ca/cic/datasets/ids-2018.html).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Natural Science Foundation of Hebei Province, China (F2019201427) and Ministry of Education Fund Project of China (2017A20004).