Abstract

With the development of information technology, computer networks have become a part of people’s lives and work. However, computer viruses and malicious network attacks pose huge challenges to network security, and detecting attacks more accurately has become a focus of current computer science research. This paper proposes an intrusion detection model that is based mainly on XGBoost (Extreme Gradient Boosting) and uses the WOA (Whale Optimization Algorithm) to find the best parameters for it. The collected network data are first preprocessed with PCA (Principal Component Analysis) dimensionality reduction, and the preprocessed data are then fed into the WOA-XGBoost algorithm so that the trained model has better intrusion detection capability. The model is evaluated on the well-known KDD CUP 99 dataset from the computer network field. A comparison with the results obtained by tuning parameters in the traditional way shows that the intrusion detection model built with this method achieves better accuracy.

1. Introduction

Computer networks occupy an important position in today’s social life. Daily life is full of Internet-based services, such as online chat, online banking, and online games, which we use every day. However, as people use the Internet more and more frequently, the number of malicious activities on the Internet is also increasing.

With the gradual increase in the number of malicious attacks on networks, researchers have proposed dozens of data-driven detection techniques to determine whether an attack is taking place. These methods run quickly and rely on relatively complete sample databases [1]. The key requirement is a higher accuracy rate.

Among the numerous studies on intrusion detection, the ideas of Forrest et al. have received the most widespread attention. Their idea is to treat intrusion detection as a classification problem, so that pattern recognition ideas and methods can be used to distinguish normal data from abnormal data [2]. However, most traditional intrusion detection methods only perform comparative classification and ignore the importance of feature selection. As a result, much unknown attack traffic cannot be accurately identified and can gradually become a potential threat. At the same time, traditional data classification methods cannot flexibly process the massive amounts of data generated by the network.

Wang and Lu proposed a model that stacks the XGBoost model and the LSTM (Long Short-Term Memory) model to analyze the abnormal state of IoT devices. The model first collects the system call sequence, performs intrusion detection on it, and recognizes abnormal behavior. Tests on a real IP camera system under typical IoT attacks show that the model has good performance, stability, and generalization ability [3].

Bhattacharya et al. proposed a classification model based on PCA and XGBoost. The model first uses a PCA-firefly algorithm to reduce the dimensionality of the dataset and then uses the XGBoost algorithm for classification. Tests on a Kaggle dataset show that the model performs better than existing machine learning models [4].

The above methods all use XGBoost as the classifier for intrusion detection and combine it with other data processing methods (such as PCA dimensionality reduction) to obtain an intrusion detection model with better performance.

Mafarja et al. proposed a novel wrapper feature selection approach based on an augmented WOA. They introduced V-shaped and S-shaped transfer functions into WOA, used the resulting algorithm to detect IoT attacks, tested it with UCI datasets, and compared it with other optimizers. The experimental results show that the algorithm is superior to the other algorithms in many aspects, such as accuracy and fitness [5].

Haghnegandar and Wang proposed a power system intrusion detection model based on the WOA algorithm and an artificial neural network (ANN). The model uses the WOA algorithm to adjust the weight vector of the neural network so as to minimize the mean square error. The model was tested on the Mississippi State University Electric Power Attack Database and compared with other commonly used classifiers to demonstrate its superiority [6].

The above methods use the idea of classification to detect attacks or malicious activities in different networks, and they all use the WOA algorithm to optimize the parameters of the classifier. Among traditional metaheuristic optimization algorithms, the GA (Genetic Algorithm) requires more initial parameters to be set, its search speed is relatively slow, and it takes a relatively long time to obtain an accurate solution, while PSO (Particle Swarm Optimization) easily falls into a local optimum because of the simple way in which particle positions are updated. Therefore, this paper chooses the WOA algorithm to train the XGBoost model and improve the accuracy of intrusion detection. Combining WOA-optimized parameters with the classifier makes the detection method more effective and more accurate.

The machine learning model XGBoost has been used for intrusion detection in computer networks and the Internet of Things, but it has certain limitations in selecting or optimizing model parameters [7, 8]. Combining the XGBoost model with the metaheuristic algorithm WOA can effectively overcome these limitations [9]. Therefore, this paper proposes an intrusion detection algorithm based on a WOA-optimized XGBoost model. It uses the strong optimization capability of the WOA algorithm to tune key parameters of XGBoost, which improves the prediction accuracy of the XGBoost model and allows intrusions or attacks in the network environment to be detected more accurately.

2. Background Materials

2.1. Whale Optimization Algorithm (WOA)

The Whale Optimization Algorithm (WOA) is a metaheuristic optimization algorithm that simulates the bubble-net feeding and attack mechanisms of humpback whales [10–12]. The algorithm is mainly composed of three parts: encircling prey, bubble-net attack, and search for prey. Each whale hunts in a single- or multidimensional space by changing its position vector. A whale represents a candidate solution, and its position represents the parameter values of the problem being solved. Owing to its search mechanism, WOA can effectively balance the exploration and exploitation phases. The detailed process of the algorithm is as follows:

(1) Searching and encircling prey:

$$\vec{D} = \left| \vec{C} \cdot \vec{X}^{*}(t) - \vec{X}(t) \right|, \tag{1}$$

$$\vec{X}(t+1) = \vec{X}^{*}(t) - \vec{A} \cdot \vec{D}. \tag{2}$$

In the above formulas, $t$ represents the current iteration number, $\vec{X}^{*}$ is the best individual position of the whale currently obtained, and $\vec{A}$ and $\vec{C}$ are the coefficient vectors, which can be expressed by the following formulas:

$$\vec{A} = 2\vec{a} \cdot \vec{r} - \vec{a}, \tag{3}$$

$$\vec{C} = 2\vec{r}. \tag{4}$$

The value of $\vec{a}$ decreases linearly from 2 to 0 as the number of iterations increases, and the value of $\vec{r}$ is randomly generated in the range [0, 1], so that $\vec{A}$ takes random values in the range $[-a, a]$.

(2) Spiral updating position: according to the spiral update strategy shown in the following formula, a new position that conforms to the spiral motion can be identified between the original position of the whale and the current position of the prey:

$$\vec{X}(t+1) = \vec{D}' \cdot e^{bl} \cdot \cos(2\pi l) + \vec{X}^{*}(t), \tag{5}$$

where $\vec{D}' = \left| \vec{X}^{*}(t) - \vec{X}(t) \right|$ represents the distance from the $i$th whale to the target prey, $b$ is an internal parameter (a constant that defines the shape of the spiral), and $l$ is a random number in the range [−1, 1].

To simulate these simultaneous behaviors, each whale moves either along the shrinking circle or along the spiral path, each with a probability of 50%. The following formula expresses how the new position of the whale is generated:

$$\vec{X}(t+1) = \begin{cases} \vec{X}^{*}(t) - \vec{A} \cdot \vec{D}, & p < 0.5, \\ \vec{D}' \cdot e^{bl} \cdot \cos(2\pi l) + \vec{X}^{*}(t), & p \ge 0.5, \end{cases} \tag{6}$$

where $p$ is randomly generated in the range [0, 1].

(3) Search for prey: to balance exploration and exploitation, a whale in the current population is randomly selected to take the place of the current optimal solution, and the other whales update their positions relative to this randomly selected individual. The mathematical expression is as follows:

$$\vec{D} = \left| \vec{C} \cdot \vec{X}_{rand} - \vec{X}(t) \right|, \tag{7}$$

$$\vec{X}(t+1) = \vec{X}_{rand} - \vec{A} \cdot \vec{D}, \tag{8}$$

where $\vec{X}_{rand}$ is the position of the randomly selected whale.

In the WOA algorithm, each whale starts from a random position, and then, each individual whale updates its position according to the best individual whale position obtained after each iteration or a randomly selected individual whale.
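The update loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this paper: it assumes the commonly described WOA rule in which a whale follows the best position when |A| < 1 and a randomly chosen whale otherwise (the text does not state this switch), draws A and C as scalars per whale for simplicity, and uses a hypothetical `sphere` test function, population size, and iteration count.

```python
import numpy as np


def woa_minimize(objective, lower, upper, n_whales=20, n_iter=100, b=1.0, seed=0):
    """Minimize `objective` inside a box with a basic WOA loop (illustrative)."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    dim = lower.size

    # Each row is one whale, i.e. one candidate solution inside the bounds.
    whales = rng.uniform(lower, upper, size=(n_whales, dim))
    fitness = np.array([objective(w) for w in whales])
    best_idx = int(np.argmin(fitness))
    best_pos = whales[best_idx].copy()
    best_fit = float(fitness[best_idx])

    for t in range(n_iter):
        a = 2.0 - 2.0 * t / n_iter  # decreases linearly from 2 toward 0
        for i in range(n_whales):
            A = 2.0 * a * rng.random() - a  # random value in [-a, a]
            C = 2.0 * rng.random()          # random value in [0, 2]
            p = rng.random()
            if p < 0.5:
                if abs(A) < 1.0:
                    # Encircle the best whale found so far (assumed switch).
                    D = np.abs(C * best_pos - whales[i])
                    new_pos = best_pos - A * D
                else:
                    # Move relative to a randomly selected whale.
                    rand_pos = whales[rng.integers(n_whales)]
                    D = np.abs(C * rand_pos - whales[i])
                    new_pos = rand_pos - A * D
            else:
                # Spiral move around the best whale.
                l = rng.uniform(-1.0, 1.0)
                D_prime = np.abs(best_pos - whales[i])
                new_pos = D_prime * np.exp(b * l) * np.cos(2.0 * np.pi * l) + best_pos
            whales[i] = np.clip(new_pos, lower, upper)  # keep whales in bounds

        # Re-evaluate the population and keep the best position seen so far.
        fitness = np.array([objective(w) for w in whales])
        idx = int(np.argmin(fitness))
        if fitness[idx] < best_fit:
            best_fit = float(fitness[idx])
            best_pos = whales[idx].copy()

    return best_pos, best_fit


def sphere(x):
    """Hypothetical test objective with its minimum at the origin."""
    return float(np.sum(x ** 2))


best_pos, best_fit = woa_minimize(sphere, lower=[-5, -5, -5], upper=[5, 5, 5])
```

Clipping to the bounds is a design choice of this sketch; it keeps every candidate inside the allowed search range after each move.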

2.2. Extreme Gradient Boosting (XGBoost)

Extreme gradient boosting (XGBoost) was proposed by Chen and Guestrin in 2016. This algorithm improves the calculation method of the objective function on the basis of gradient boosting and reduces the calculation time [13]. During the training period, parallel computing is automatically realized to solve big data science problems quickly and accurately [14].

The core concept of XGBoost is to keep adding trees, each of which learns a new function that fits the residual of the previous prediction. Summing the scores that the trees assign to a sample gives the final prediction score of that sample. For n samples with m features, the prediction with K additive functions is as follows:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F},$$

where $\mathcal{F}$ is the space of regression trees, $f_k$ is one of the regression trees, and $f_k(x_i)$ represents the score given to sample $x_i$ by the independent structure of the $k$th tree, which has $T$ leaves.

XGBoost transforms the optimization of the objective function into the problem of finding the minimum of a quadratic function and uses the second-derivative information of the loss function to train the tree model. At the same time, the tree complexity is added to the objective function as a regularization term to avoid overfitting. The objective function of XGBoost is as follows:

$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k),$$

where $y_i$ is the actual value of the $i$th target; $\hat{y}_i$ is the predicted value of the $i$th target; $l(y_i, \hat{y}_i)$ is the loss measuring the difference between $y_i$ and $\hat{y}_i$; $n$ is the sample size; $\Omega(f_k)$ is the complexity of the $k$th tree; and $K$ is the number of trees.

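The iterative form of the objective function at step $t$ is as follows:

$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + C,$$

where $\Omega(f_t)$ is the complexity of the decision tree $f_t$ computed in the $t$th iteration, and $C$ is a constant.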

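If the second-order Taylor expansion of the loss function is carried out, and the loss function is set as the mean square error, then the objective function is

$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t) + C,$$

where $g_i$ and $h_i$ are the first and second derivatives of the mean square loss function with respect to $\hat{y}_i^{(t-1)}$, respectively.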
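As a worked illustration of this step, the derivatives can be written out for the squared-error loss. The notation is assumed here: $l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$ is the loss, $\hat{y}_i^{(t-1)}$ is the prediction after $t-1$ trees, and $f_t$ is the tree added in iteration $t$. For this particular loss, the second-order expansion is exact.

```latex
\begin{align*}
g_i &= \frac{\partial\, l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)}{\partial\, \hat{y}_i^{(t-1)}}
     = 2\bigl(\hat{y}_i^{(t-1)} - y_i\bigr), \\
h_i &= \frac{\partial^2 l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)}{\partial \bigl(\hat{y}_i^{(t-1)}\bigr)^2}
     = 2, \\
\bigl(y_i - \hat{y}_i^{(t-1)} - f_t(x_i)\bigr)^2
    &= \bigl(y_i - \hat{y}_i^{(t-1)}\bigr)^2
     + g_i\, f_t(x_i) + \tfrac{1}{2} h_i\, f_t^{2}(x_i).
\end{align*}
```

Because the first term on the right does not depend on $f_t$, only the terms in $g_i$ and $h_i$ and the complexity term matter when the new tree is chosen.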

2.3. Principal Component Analysis (PCA)

Principal Component Analysis (PCA) was first proposed by Pearson in 1901, and many scholars have since studied it in depth and gradually improved its theory [15–17]. PCA maps variables from a high-dimensional space to a lower-dimensional space, replacing the original high-dimensional variables with new variables that are linear combinations of the original ones. The variance of each new variable is calculated [18]. The larger the variance, the more information the new variable contains, and the variable with the largest variance is called the first principal component [19]. By analogy, the second, third, ..., nth principal components can be defined, and the covariance between principal components should be zero. The specific process of using PCA for data dimensionality reduction is as follows:

(1) Read the dataset in the form of a matrix $X$. Each row of the matrix represents a piece of data, and each column represents a characteristic attribute.

(2) Calculate the covariance matrix and solve for the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_m$ and the corresponding eigenvectors $v_1, v_2, \ldots, v_m$.

(3) Sort the eigenvalues in descending order to get $\lambda'_1, \lambda'_2, \ldots, \lambda'_m$, and reorder the corresponding eigenvectors to get $v'_1, v'_2, \ldots, v'_m$.

(4) Take the first $k$ of the sorted eigenvectors $v'_1, v'_2, \ldots, v'_m$ as columns to get the projection matrix $D$. Because each row of $X$ is a sample, $Y = XD$ is the data after dimensionality reduction to $k$ dimensions [20].
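The four steps can be sketched with NumPy as follows. This is an illustrative sketch, not the code used in the experiments: the function name `pca_reduce` and the random test matrix are hypothetical, the data are centered before the covariance matrix is computed, and the projection is written as `X_centered @ D` because samples are stored as rows.

```python
import numpy as np


def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)           # center each feature (column)
    cov = np.cov(X_centered, rowvar=False)    # m x m covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh suits symmetric matrices
    order = np.argsort(eigvals)[::-1]         # indices for descending eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    D = eigvecs[:, :k]                        # m x k projection matrix
    Y = X_centered @ D                        # n x k reduced data
    return Y, eigvals


# Made-up data: 200 samples, 5 features with very different spreads.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
Y, eigvals = pca_reduce(X, k=2)
```

The variance of each column of `Y` equals the corresponding eigenvalue, and the columns are uncorrelated, which matches the zero-covariance property of principal components.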

3. Proposed WOA-XGBoost Methodology

XGBoost is a machine learning algorithm that has performed well in recent years, with good running speed and accuracy, and it is widely used in classification problems. When XGBoost is used for classification, the parameters of the classifier must be tuned to improve its performance, because the choice of parameters determines the accuracy of the XGBoost model. The commonly used tuning method is grid search, but its search range is too narrow, and it does not easily find the optimal parameters. This paper proposes an XGBoost classification algorithm whose parameters are optimized by WOA. The Python toolkit XGBoost is used, and three important parameters of the XGBoost classifier are optimized: the learning rate (learning_rate, eta for short), the maximum depth of the tree (max_depth), and the sample sampling rate (subsample) [21, 22].

Learning_rate: when leaf nodes are updated, the weights are multiplied by eta. Shrinking the feature weights in this way makes the boosting process more conservative. The commonly used value range is [0, 1], and the default value is 0.3.

Max_depth: it controls the complexity of the decision tree. The larger the value, the more complex the model and the more likely it is to overfit. The default value is 6.

Subsample: the subsample ratio of the training set, that is, the fraction of training samples that XGBoost randomly selects before growing trees, which can effectively prevent overfitting. The default value is 1.

The proposed method based on WOA is detailed as follows [23–26]:

Step 1: preprocess the collected data with a normalization method.

Step 2: apply PCA dimensionality reduction to the preprocessed data to obtain lower-dimensional data, which facilitates subsequent training of the XGBoost model.

Step 3: initialize the WOA algorithm with its initial parameters. Place each whale in a 3-dimensional space, and encode the 3 dimensions as the key parameters eta, max_depth, and subsample, respectively.

Step 4: set the upper and lower limits of the XGBoost parameters to be optimized, and generate the initial population of whales so that the position of each whale is within a suitable range.

Step 5: for the current whale population, calculate the fitness of each whale position based on the XGBoost model.

Step 6: sort the fitness values to get the best whale position of the current population and save it as the current global best position.

Step 7: update the position of each whale through equations (1)–(8).

Step 8: enter iterative optimization and repeat Steps 5–7. When the number of iterations reaches the maximum, stop the loop and obtain the best parameters eta, max_depth, and subsample from the final best whale position.

Step 9: use the best parameters eta, max_depth, and subsample in the XGBoost model to obtain the best intrusion detection model after training.
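Steps 3 to 5 hinge on turning a 3-dimensional whale position into XGBoost parameters and scoring it. The following sketch shows one possible way to do this. The bounds, the rounding of max_depth, the use of cross-validated error as the fitness, and the names `decode_position` and `fitness` are all assumptions made for illustration; the text above does not specify them.

```python
import numpy as np

# Search bounds for (eta, max_depth, subsample). These ranges are illustrative
# assumptions, not values reported in the paper.
LOWER = np.array([0.01, 1.0, 0.1])
UPPER = np.array([1.0, 15.0, 1.0])


def decode_position(position):
    """Map a 3-D whale position to a dict of XGBoost parameters."""
    eta, depth, subsample = np.clip(position, LOWER, UPPER)
    return {
        "learning_rate": float(eta),
        "max_depth": int(round(depth)),  # tree depth must be an integer
        "subsample": float(subsample),
    }


def fitness(position, X, y, folds=5):
    """Cross-validated error at this whale position (lower is better).

    Assumed fitness definition; requires the xgboost and scikit-learn packages.
    """
    # Imported lazily so decode_position works without these packages.
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier

    model = XGBClassifier(**decode_position(position))
    accuracy = cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()
    return 1.0 - accuracy


# A position slightly outside the bounds is clipped before decoding.
params = decode_position(np.array([0.3, 6.4, 1.2]))
```

Rounding inside the decoder lets the optimizer work in a continuous space while the model still receives an integer depth.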

4. Experimental Studies

Experiments were carried out to verify the effectiveness of combining the PCA dimensionality reduction method with the WOA-XGBoost model proposed in this paper. The experimental environment is Anaconda 3, running on an Intel(R) i7-1165G7 @2.8 GHz with 16 GB RAM and a 64-bit Windows operating system.

4.1. Data Description

The dataset used in this experiment is the KDD CUP 99 dataset, which contains normal data and four types of attacks, namely, denial of service (DoS), remote-to-local (R2L), user-to-root (U2R), and probing (Probe) attacks. Each piece of data contains 41 features [27]. Table 1 lists the description of each feature [28, 29].

The dataset used in this experiment is the 10% version of KDD CUP 99, which contains 480,000 pieces of data. From it, 90,000 records, including both normal and attack data, are randomly selected to train and test the model. In addition, 30,000 records are randomly selected to test the trained model.

4.2. Data Preprocessing and PCA Dimensionality Reduction

The three nominal features (protocol_type, service, and flag) in the dataset are first converted to numeric codes (“0,” “1,” “2,” and so on), and then min-max normalization is used to scale the data. The processed dataset is shown in Table 2 below.
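A minimal sketch of this preprocessing is given below. The helper names, the alphabetical ordering of the codes, and the sample values are hypothetical; the paper does not state which code is assigned to which category.

```python
import numpy as np


def encode_nominal(values):
    """Replace each distinct category with an integer code (0, 1, 2, ...)."""
    categories = sorted(set(values))  # alphabetical order is an assumption
    mapping = {cat: code for code, cat in enumerate(categories)}
    return np.array([mapping[v] for v in values], dtype=float), mapping


def min_max_normalize(X):
    """Scale every column of X to [0, 1]; constant columns become 0."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero
    return (X - col_min) / col_range


protocol = ["tcp", "udp", "icmp", "tcp"]     # made-up protocol_type values
codes, mapping = encode_nominal(protocol)
duration = np.array([0.0, 10.0, 5.0, 20.0])  # made-up numeric feature
X = np.column_stack([codes, duration])
X_scaled = min_max_normalize(X)
```

After scaling, every feature lies in [0, 1], so features with large raw ranges do not dominate the variance that PCA later analyzes.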

PCA is then applied to the dataset, and the percentage of variance retained by each component after processing is calculated. The results are shown in Table 3.

From the results shown, it can be calculated that when 19 components are retained, the sum of their variance percentages is 99.96%. The data are therefore reduced to 19 dimensions using PCA. The processing results are shown in Table 4.
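The choice of how many components to keep can be expressed as a cumulative-sum rule, sketched below. The function name, the variance ratios, and the threshold are made up for illustration and are not the values in Table 3.

```python
import numpy as np


def components_for_variance(explained_ratio, threshold):
    """Smallest number of leading components whose ratios sum to >= threshold."""
    cumulative = np.cumsum(explained_ratio)
    # Index of the first cumulative value that reaches the threshold, plus one.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative))  # cap at the number of components


ratios = np.array([0.60, 0.25, 0.10, 0.04, 0.01])  # made-up, sorted descending
k = components_for_variance(ratios, threshold=0.90)
```

With these made-up ratios, the first two components reach 85% and the first three reach 95%, so three components are kept for a 90% threshold.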

4.3. Performance Evaluation Metrics

In this experiment, four quantities are recorded: the number of attacks detected as attacks, recorded as TP; the number of attacks detected as normal, recorded as FN; the number of normal records detected as normal, denoted as TN; and the number of normal records detected as attacks, denoted as FP [14].

To evaluate the performance of the classifier, this paper uses three indicators: classification accuracy, sensitivity, and specificity. Classification accuracy is the proportion of data that are classified correctly; sensitivity is the proportion of positive data that are classified as positive; specificity is the proportion of negative data that are classified as negative. These indicators can be described by the following formulas:

$$ACC = \frac{TP + TN}{TP + TN + FP + FN},$$

$$Sensitivity = \frac{TP}{TP + FN},$$

$$Specificity = \frac{TN}{TN + FP}.$$
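The three indicators follow directly from the four counts. The sketch below uses the standard definitions (accuracy as correct over total, sensitivity as TP over TP plus FN, specificity as TN over TN plus FP); the function name and the counts in the example are made up.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all correct decisions
    sensitivity = tp / (tp + fn)                # share of positives found
    specificity = tn / (tn + fp)                # share of negatives found
    return accuracy, sensitivity, specificity


# Made-up counts, not results from this paper.
acc, sens, spec = classification_metrics(tp=90, fn=10, tn=80, fp=20)
```

Reporting sensitivity and specificity alongside accuracy matters when the classes are imbalanced, because accuracy alone can hide poor performance on the smaller class.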

5. Experimental Results

The data obtained after PCA dimensionality reduction were fed into several classifiers for comparison. To obtain more accurate and reliable experimental results, we adopted a ten-fold cross-validation (CV) strategy. Because repeating CV gives a more reliable estimate of classifier performance, 10 runs of ten-fold CV are used, and their average is taken as the final result.
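The evaluation protocol can be sketched as an index generator that reshuffles the data before each of the 10 runs. This is an illustrative sketch only: the function name is hypothetical, and the folds here are not stratified by class, which the paper does not specify.

```python
import numpy as np


def repeated_kfold_indices(n_samples, n_splits=10, n_repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        shuffled = rng.permutation(n_samples)       # new shuffle for each run
        folds = np.array_split(shuffled, n_splits)  # k nearly equal folds
        for i in range(n_splits):
            test_idx = folds[i]
            train_idx = np.concatenate(folds[:i] + folds[i + 1:])
            yield train_idx, test_idx


# 10 runs x 10 folds = 100 train/test splits over a made-up sample count.
splits = list(repeated_kfold_indices(n_samples=95))
```

Averaging a metric over all 100 splits gives the kind of mean value reported for each classifier. scikit-learn's `RepeatedStratifiedKFold` offers a stratified variant of the same idea.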

For the dataset used in this paper, the average results of each metric over 10 runs of ten-fold CV are summarized in Table 5. It can be seen from the table that the WOA-XGBoost model performs best among all methods [24]. It is better than the other algorithms in terms of ACC, sensitivity, and specificity: the average ACC is 0.9906, the sensitivity is 0.9958, and the specificity is 0.9574. The results obtained with GridSearch-SVM [30–34] are the least satisfactory. This also shows that the WOA algorithm is better than the GridSearch method for parameter optimization.

Figure 1 shows the detailed comparison of the ACC of the four methods over 10 runs of ten-fold CV, Figure 2 shows the corresponding comparison of sensitivity, and Figure 3 shows the corresponding comparison of specificity. It can be seen from the figures that WOA-XGBoost is significantly better than the other three classifier models across the 10 runs of ten-fold CV.

The comparison of all methods on the indicators is shown in Figure 4. It can be seen that WOA-XGBoost has the best value on each indicator. In addition, the best confusion matrix of WOA-XGBoost over the 10 runs is shown in Figure 5. In it, 11 cases of normal data were judged as attack data, and 1,008 cases of attack data were judged as normal data.

6. Conclusions

This paper proposes a new, efficient intrusion detection algorithm, WOA-XGBoost, built on the machine learning framework XGBoost, and applies it to intrusion detection on the KDD CUP 99 dataset. The main innovation is to use the WOA optimization algorithm to automatically select and optimize the main parameters of XGBoost. Compared with traditional manual tuning or grid search, the WOA method can search a larger range of parameters and achieves better accuracy. The experimental results on the KDD CUP 99 dataset show that the WOA-XGBoost intrusion detection algorithm is significantly better than the methods based on GridSearch-XGBoost, WOA-SVM, and GridSearch-SVM. Therefore, the WOA-XGBoost classification method proposed in this paper can serve as a good tool for network intrusion detection. Because network development causes the characteristics of attack data to change over time, future work should extend this method to data streams in order to achieve real-time intrusion detection.

Data Availability

The dataset used in this experiment is the KDD CUP 99 dataset.

Conflicts of Interest

The authors declare that they have no conflicts of interest.