Abstract

Neural networks have been proven to perform well in network intrusion detection. To acquire better features of network traffic, more learning layers are required. However, according to previous research, adding layers to a neural network may fail to improve the classification results; in fact, after the number of layers reaches a certain threshold, the performance of the model tends to degrade. In this paper, we propose a network intrusion detection model based on residual learning. After transforming the UNSW-NB15 data set into images, deeper convolutional neural networks with residual blocks are built to learn more critical features. Instead of the cross-entropy loss function, a modified focal loss is calculated to address the class imbalance problem in the training set and identify minor attacks in the testing set. Batch normalization and global average pooling are used to avoid overfitting and enhance the model. Experimental results show that the proposed model improves attack detection accuracy compared with existing models.

1. Introduction

With the continuing expansion of network scale, network security confronts more sophisticated threats than ever before; hence, network security issues are attracting increasing attention. Commonly used network security systems that discover suspicious attacks include firewalls, intrusion detection systems (IDSs), and intrusion prevention systems (IPSs) [1]. Among them, the task of IDSs is to collect and identify abnormal behaviors in the network [2]. By analyzing captured data packets, IDSs can check for legitimate network behaviors, detect attacks, and report them for further containment.

Conventional IDSs tend to achieve low detection rates and high false positive rates because they rely on patterns of known attacks. Researchers have applied artificial intelligence (AI) algorithms in the design of IDSs to improve their performance. The performance of an IDS is closely related to the selected classifier, yet traditional machine learning algorithms tend to perform poorly in scenarios involving large volumes of network data packets. In recent years, deep learning has achieved outstanding results in multiple fields. Its advantage is that it can learn hierarchical features from large amounts of data to improve model efficiency [3]. Applying deep learning can reduce the costs of IDSs and strengthen their abilities to identify attacks.

Convolutional neural networks (CNNs) can extract deep and critical features from given data. A general perspective is that increasing the number of network layers helps the model learn better features and thus improves its performance. However, simply stacking more layers may fail to achieve this; once the number of layers exceeds a certain threshold, it may even lead to performance degradation. Residual learning was proposed to address this issue. A residual is the error between the actual value and the estimated value, and residual learning was originally derived from residual representations in image recognition [4]. Residual learning is realized by establishing a direct connection between the input and the output. CNNs based on residual learning have achieved outstanding results in image recognition [5] and are easier to train and optimize than common CNNs. In network intrusion detection, it is also vital to build deeper networks to improve the detection capabilities of IDSs. Because residual learning allows CNNs to be deeper, this paper introduces the concept of residual learning into IDSs.

UNSW-NB15, an imbalanced network intrusion detection data set, is selected to evaluate the model. In real-time network intrusion detection, the class imbalance problem seriously affects the classification results [6]: prediction models biased toward the dominant classes fail to identify the minor classes. Resampling techniques are common solutions to class imbalance problems, but they have their disadvantages. Oversampling may disrupt the original data and lengthens model training, while undersampling may lose vital information and thus weaken classification capabilities. Focal loss was originally proposed to balance the loss between samples. We apply a modified focal loss function in the proposed model to enhance its ability to detect minor classes without disrupting the training data.

The major contributions of the paper can be summarized as follows:
(1) Propose a deep learning-based intrusion detection model with higher accuracy than other existing models
(2) Introduce residual learning into the model to address the network degradation problem, allowing the model to learn deeper features
(3) Use a modified focal loss function to deal with the class imbalance problem in the training set

This paper is organized as follows:
(1) Section 1 gives a brief overview of network intrusion detection and the motivation for the proposed methodology
(2) Section 2 introduces the related work
(3) Section 3 presents the methodology and implementation process in detail
(4) Section 4 carries out the experiments and analyzes the testing results
(5) Section 5 concludes the paper

2. Related Work

Data preprocessing is a key step in network intrusion detection. It can extract key features that strongly influence the classification results, effectively reducing the size of the data and improving the efficiency of a given classifier. Zhang et al. [7] proposed an effective network traffic classification method, which used principal component analysis (PCA) to remove irrelevant features and applied Gaussian Naive Bayes as the classifier. Kasongo et al. [8] applied a filter-based feature reduction technique to the UNSW-NB15 data set using the XGBoost algorithm and then implemented several algorithms to classify the data. Results demonstrated that the feature selection method increased the test accuracy. Sun et al. [9] proposed an improved Naive Bayesian learning method which took the influence of different features into account; it achieved a higher accuracy than traditional machine learning algorithms. It can be seen that the performance of traditional classifiers depends heavily on the extracted features. However, traditional machine learning algorithms are shallow learning algorithms that require feature engineering. To build the best-fitting model, parameter optimization is also needed, and the size of the data set affects the efficiency of the models. These difficulties slow down the training of traditional machine learning algorithms and affect overall network security.

In recent years, deep learning models have been gradually applied to intrusion detection because of their high efficiency and ease of implementation. Among deep learning models, CNNs have achieved great success in many fields [10–12], and researchers have applied CNNs in intrusion detection. Qian et al. [13] analyzed network traffic with a CNN; in the training phase, the rectified linear unit (ReLU) served as the activation function and the adaptive moment estimation (Adam) algorithm was used to optimize the model. Lai et al. [14] also used a CNN as the intrusion detection model, achieving a higher accuracy rate than other deep learning models.

Regarding residual learning, the concept of the Residual Network (ResNet) was proposed by He et al. [15] of Microsoft Research to deal with the performance degradation problem that arises as the number of layers grows. ResNets have outperformed common CNNs in image classification and object recognition [16–18]. Thanks to residual learning, ResNets can be much deeper than traditional CNNs. In network intrusion detection, a deeper CNN can extract more critical features and obtain better classification results; therefore, CNNs based on residual learning have been attempted in network intrusion detection. Wu et al. [19] proposed a deep neural network built upon residual blocks to discover malicious network behaviors, achieving a low false alarm rate. Chouhan et al. [20] proposed a multipath residual learning-based CNN architecture evaluated on the NSL-KDD data set, showing significant improvements over previous research.

However, while residual learning can improve the overall performance of CNNs, in practice it does not improve a model's ability to detect minor attacks because original training data are lacking. The classes in most modern network intrusion detection data sets are imbalanced; therefore, most IDSs fail to perform well on attacks with fewer samples. Focal loss was proposed by Lin et al. [21] in 2017. It takes the varying training difficulty of samples into consideration and focuses more on difficult-to-train samples; therefore, it has been applied in many fields, such as object detection and imbalanced data classification [22–24]. To identify classes with fewer training samples more accurately, a modified focal loss function replaces the cross-entropy loss function in the proposed model.

Choosing a suitable data set is vital for building IDSs. In recent years, the most commonly used public data sets in network intrusion detection have been KDD99 [25], NSL-KDD [26], and UNSW-NB15 [27]. Despite being the most popular data sets in network intrusion detection, KDD99 and NSL-KDD are out-of-date because they contain old and redundant data, so evaluating IDSs on them does not yield satisfactory results. According to previous research and statistical analysis, compared with the other two data sets, UNSW-NB15 has more complex feature sets, contains more modern normal traffic scenarios, covers richer types of attack traffic, and contains fewer incomplete samples. Also, most new cyberattacks are variants of the known attacks in the UNSW-NB15 data set. UNSW-NB15 therefore more accurately reflects the characteristics of modern network traffic and is more suitable for evaluating IDSs, so we choose it as the evaluation set for the proposed model.

In summary, with its powerful capabilities of automatic feature extraction, deep learning has been applied to network intrusion detection. However, how to build deeper networks without triggering the performance degradation problem and how to address the class imbalance problem in the training set are two major challenges. In this paper, a residual learning-based CNN is constructed to learn deeper features of network traffic, and a modified focal loss function is introduced into the proposed model to detect minor attacks.

3. Methodology

The proposed methodology consists of three parts: data preprocessing, model constructing, and model evaluation. First, network flows are converted into images. Then, CNNs with residual learning are constructed. Finally, trained models are tested and evaluated. The main structure is shown in Figure 1.

3.1. Data Set

As stated before, UNSW-NB15 is a network intrusion detection data set processed and built by collecting different types of network connection data. The data set includes multiple types of contemporary attacks. Each flow contains 47 features, and the data set divides network behaviors into nine categories of attacks plus one category of normal behaviors. These attacks can be further divided into 177 subcategories according to the environments on which the specific attacks depend.

In this paper, the partitions known as the UNSW-NB15 training set and the UNSW-NB15 testing set are selected as the training data and testing data. These partitions are prepared for intrusion detection, with redundant flows and features already processed. The distribution is shown in Table 1.

3.2. Data Preprocessing

Features in UNSW-NB15 comprise numeric features and symbolic features, so the symbolic features must be digitized first. The processed features are then normalized to obtain a standardized data set and converted into matrices.

One-hot encoding is used to map symbolic features into numerical features, and labels are mapped into digits using label encoding. The specific implementations are as follows:
(1) One-Hot Encoding. One-hot encoding uses an N-bit state register to encode N states, and each state has its own independent register bit
(2) Label Encoding. Labels in the UNSW-NB15 data set are divided into 10 categories. The coding rules are shown in Table 1
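As a minimal sketch of this step, the following assumes the partition is loaded with pandas and that proto, service, and state are the symbolic feature columns; the file name and column names are illustrative assumptions, not prescribed by the paper.

```python
# Sketch of the encoding step on the UNSW-NB15 partition (assumed file name).
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

train = pd.read_csv("UNSW_NB15_training-set.csv")

symbolic_cols = ["proto", "service", "state"]  # assumed symbolic features

# One-hot encode the symbolic features: each distinct value gets its own bit.
onehot = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
symbolic = onehot.fit_transform(train[symbolic_cols])

# Label-encode the ten classes (nine attack categories plus Normal) as digits.
label_enc = LabelEncoder()
labels = label_enc.fit_transform(train["attack_cat"])
```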

This paper uses Min-Max normalization, which maps the feature values into the interval [0, 1]:

\[ x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, \]

where $x'$ is the normalized feature value, $x$ is the original feature value, $x_{\min}$ is the minimum value of the feature, and $x_{\max}$ is the maximum value of the feature. After numerical normalization, each flow of the new set contains 196 features, so the data are converted into 14 × 14 matrices; the matrices are then converted into black-and-white images.
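A sketch of the normalization and image conversion follows. Zero-padding each encoded flow to exactly 196 values is an assumption; the paper states only that each flow ends up with 196 features.

```python
# Scale features into [0, 1] and reshape each flow into a 14x14 image.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def flows_to_images(features: np.ndarray) -> np.ndarray:
    scaled = MinMaxScaler().fit_transform(features)
    # Zero-pad (or truncate) each flow to exactly 196 = 14 * 14 values.
    n, d = scaled.shape
    padded = np.zeros((n, 196), dtype=np.float32)
    padded[:, :min(d, 196)] = scaled[:, :196]
    return padded.reshape(n, 1, 14, 14)  # NCHW layout for PyTorch
```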

3.3. Network Construction

Figure 2 shows the overall structure and the parameters of the CNN model. The proposed model extracts the features of input data by the convolution layers and pooling layers. Feature maps are then input into a global average pooling layer. Finally, the model classifies the sample data with a softmax layer.

3.3.1. Convolution and Pooling

Convolution layers are the core parts of CNN models. The convolution layers in the proposed model extract spatial features from the given data and produce feature maps as the output. ReLU is often used as the activation function:

\[ f(x) = \max(0, x), \]

and the function of pooling layers is to reduce the size of the feature maps.
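A minimal convolution-plus-pooling stage in PyTorch is sketched below; the channel count and kernel sizes are illustrative assumptions, not the exact hyperparameters given in Figure 2.

```python
import torch
import torch.nn as nn

conv_stage = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1),
    nn.ReLU(),                    # f(x) = max(0, x)
    nn.MaxPool2d(kernel_size=2),  # halves the 14x14 map to 7x7
)
features = conv_stage(torch.randn(128, 1, 14, 14))  # -> (128, 32, 7, 7)
```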

3.3.2. Batch Normalization

In the training of CNNs, as the parameters of the previous layer change, the input distribution of the next layer changes correspondingly, making deeper neural networks more difficult to train. Batch normalization keeps the input of each layer in the same distribution during training and provides a solution to this difficulty, effectively improving the training speed of networks and helping to avoid overfitting. Input data are divided into batches; for instance, if the parameter "batch_size" is set to 128, then 128 pieces of data are input as a batch at a time. Batch normalization layers normalize each batch so that the distribution of the data remains unchanged. Suppose we have a batch of inputs:

\[ B = \{x_1, x_2, \ldots, x_m\}. \]

The output of batch normalization is computed by

\[ y_i = \gamma \hat{x}_i + \beta, \]

where $\gamma$ and $\beta$ are learned parameters and $\hat{x}_i$ is calculated through

\[ \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \]

where $\mu_B$ and $\sigma_B^2$ are the mean and variance of the batch and $\epsilon$ is a small constant for numerical stability.

In this paper, batch normalization layers are placed after convolution layers and before the activation functions.
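As a small illustration of this placement (the channel count is an assumption):

```python
# The conv -> batch norm -> ReLU ordering described above.
import torch.nn as nn

conv_bn_relu = nn.Sequential(
    nn.Conv2d(32, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),  # normalize each batch before the activation
    nn.ReLU(),
)
```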

3.3.3. Residual Learning

Compared with common CNNs such as LeNet-5 [28] and AlexNet [29], ResNets introduce residual learning into the construction of CNNs. The depth of a CNN has a great influence on the final classification results, so deeper networks are often constructed. However, as the network depth increases, gradient explosion may occur and the performance of the network may degrade. Previous experimental results show that simply adding convolution layers and pooling layers to a network does not improve its accuracy but instead deteriorates its performance. In this paper, residual learning is used to address this issue. The residual refers to the difference between the local input and output:

\[ F(x) = H(x) - x, \]

where $x$ is the input of a block and $H(x)$ is its desired underlying mapping.

In contrast to learning an identity mapping directly, the learning goal of residual learning is to drive $F(x)$ toward 0, that is, to reduce the difference between the input and the output: the original input is directly connected to a later network layer so that the network learns the residual. Residual learning is realized by a shortcut connection between the input and output of a block. It adds no extra parameters or computation to the network, yet it allows the parameters to be trained effectively and guarantees performance while the network learns deeper features. The two blocks used in this paper are shown in Figure 3. The normal block in (a) is used to construct the plain models, while the residual block in (b) is used to construct the residual networks.
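A sketch of the residual block of Figure 3(b) is given below, assuming two conv-BN stages with a ReLU in between and an identity shortcut; the exact layer sizes in the paper's figure may differ.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Shortcut connection: the stacked layers learn F(x) = H(x) - x.
        return self.relu(out + x)
```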

3.3.4. Global Average Pooling

At the end of CNN models, flatten layers are often adopted to flatten the data processed by the previous layers into a one-dimensional vector. The output size is gradually reduced through fully connected layers, and the final output is obtained through an activation function. Since every node in the flatten and fully connected layers is connected to every other, the large number of parameters may lead to overfitting. A global average pooling layer is an average pooling layer without a filter size: it averages each entire feature map. Using a global average pooling layer reduces the number of parameters to be learned and accordingly reduces the possibility of overfitting. In this paper, a global average pooling layer is used to replace the flatten layer. The principle of global average pooling is shown in Figure 4.
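Global average pooling in place of a flatten layer can be illustrated as follows; the channel count and map size are assumptions for the example.

```python
# Each feature map is reduced to its mean, giving one value per channel.
import torch
import torch.nn as nn

gap = nn.AdaptiveAvgPool2d(1)        # output size 1x1 per channel
maps = torch.randn(128, 32, 7, 7)    # batch of 32-channel feature maps
pooled = gap(maps).flatten(1)        # -> (128, 32), fed to the classifier
```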

3.3.5. Softmax and Loss Function

Finally, the probability distribution over the labels is calculated through the softmax layer:

\[ p_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}, \]

where $C$ denotes the total count of classes and $z_i$ denotes the input of the softmax layer for class $i$. The cross-entropy loss function is defined as

\[ L_{CE} = -\sum_{i=1}^{C} y_i \log(p_i), \]

where $p_i$ denotes the probability that the tested sample belongs to class $i$ and $y_i$ is the corresponding label value.

Given the obvious class imbalance in the training set, it is important to prevent the loss function from optimizing one category while suppressing the others. To increase the classification accuracy on minor classes, the model must pay more attention to them during training. Resampling is one of the most common ways to deal with imbalanced data, but undersampling may lose vital information, while oversampling may introduce new information that disrupts the data and greatly increases training time. In contrast to the cross-entropy loss function, focal loss addresses the class imbalance problem by shrinking the loss contribution of samples that are easy to train: when such samples are numerous, their contribution to the total loss stays small, so the loss focuses on the minor, harder samples. In our multilabel classification, focal loss is defined as

\[ FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t), \]

where $(1 - p_t)^{\gamma}$ is a modulating factor that reduces the loss contribution from easy samples. $p_t$ is calculated through

\[ p_t = \begin{cases} p, & y = 1, \\ 1 - p, & \text{otherwise}, \end{cases} \]

where $p$ represents the category prediction probability and $y$ is the label value. As $p_t \to 1$, $(1 - p_t)^{\gamma}$ goes to 0, so the weight of easy-to-train samples in the loss is reduced. $\alpha_t$ is a weighting factor that can be used to scale the minor classes separately. In this paper, we introduce the multilabel focal loss, where $\gamma$ is set to 2 and the class weight $\alpha_c$ is calculated through

\[ \alpha_c = 1 - \frac{n_c}{N}, \]

where $n_c$ denotes the number of samples belonging to class $c$ and $N$ denotes the total number of samples in the training data.
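The loss above can be sketched in PyTorch as follows. The class-weight formula $\alpha_c = 1 - n_c/N$ follows the reconstruction above and should be read as an assumption, and the class counts in the usage example are illustrative rather than the paper's exact Table 1 values.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """logits: (batch, C); targets: (batch,) class indices; alpha: (C,)."""
    log_pt = F.log_softmax(logits, dim=1).gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    # Easy samples (pt near 1) contribute little; minor classes get larger alpha.
    return (-alpha[targets] * (1 - pt) ** gamma * log_pt).mean()

# Illustrative per-class training counts n_c; alpha_c = 1 - n_c / N.
counts = torch.tensor([56000., 40000., 33393., 18184., 12264.,
                       10491., 2000., 1746., 1133., 130.])
alpha = 1 - counts / counts.sum()
loss = focal_loss(torch.randn(128, 10), torch.randint(0, 10, (128,)), alpha)
```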

4. Experiments

4.1. Experimental Environments

Experimental environments of this paper are shown in Table 2.

4.2. Evaluation Metrics

Accuracy, precision, recall, and F1-measure are adopted as evaluation metrics:
(1) Accuracy (Acc): the ratio of the number of correctly classified samples to the total number of samples tested
(2) Precision (P): the ratio of correctly classified positive samples to the total number of samples classified as positive
(3) Recall (R): the ratio of correctly identified positive samples to the total number of positive samples in the testing set
(4) F1-measure (F1): the harmonic mean of precision and recall
The formulas for these metrics are given below in terms of the confusion matrix.

True Positive (TP) denotes the number of positive samples correctly classified as positive; True Negative (TN) denotes the number of negative samples correctly classified as negative; False Positive (FP) denotes the number of negative samples misclassified as positive; and False Negative (FN) denotes the number of positive samples misclassified as negative. The confusion matrix is shown in Table 3.
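In terms of TP, TN, FP, and FN, the four metrics take their standard forms:

\[ \mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad P = \frac{TP}{TP + FP}, \]
\[ R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times P \times R}{P + R}. \]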

4.3. Experimental Performance Evaluation

In the training phase, 6 CNN models are constructed, including 3 plain models and 3 residual models. Model Pn (Rn) consists of n normal (residual) blocks as described in Section 3.3.3. Each model is trained on the processed training set for 100 epochs with the learning rate set to 0.01. After the loss is calculated, an optimizer is needed to update the network weights; Adam is selected as the optimizer. The performances of the 6 models are evaluated by the model accuracy and the weighted averages of precision, recall, and F1-measure. We choose the weighted average to evaluate overall performance because, compared with other averaging methods such as the micro average and the macro average, the weighted average takes the number of samples in each class into consideration, so its results reflect the performance of the model more convincingly. The weighted average is defined as

\[ M_{\mathrm{weighted}} = \frac{\sum_{c=1}^{C} n_c M_c}{\sum_{c=1}^{C} n_c}, \]

where $C$ denotes the total number of classes, $n_c$ denotes the number of testing samples in class $c$, and $M_c$ is the testing result of metric $M$ on class $c$.
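For reference, weighted-average metrics of this form can be computed directly with scikit-learn; the labels below are placeholders for the testing labels and model predictions.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 2, 2, 1, 0]   # illustrative testing labels
y_pred = [0, 1, 2, 1, 1, 0]   # illustrative model predictions

acc = accuracy_score(y_true, y_pred)
# average="weighted" weights each class's metric by its sample count n_c.
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
```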

We record the training loss of the above 6 models every 20 epochs to examine the effects of residual learning on CNNs. It can be seen from Figure 5 that, by utilizing residual learning in the blocks, the training loss is greatly reduced. By adding residual blocks to the CNN, we achieve a lower training loss, indicating that residual learning can address the network degradation problem. The figure also shows that the training loss at the very beginning is quite large but continues to decrease as training progresses; after epoch 20, the training loss decreases at a slower rate.

According to the comparison results in Table 4, the overall performance of the CNNs is significantly improved when residual blocks are added to the plain models, indicating that we can build deeper CNNs with residual learning. The results also demonstrate that, as the number of network layers increases, deeper residual networks achieve better overall performance than shallow ones. The model with 3 residual blocks (R3) achieves the highest overall classification accuracy of 88.695%. R3 (RLF-CNN) is further compared with state-of-the-art classification algorithms below.

To evaluate the proposed method's ability to detect attacks such as Shellcode and Worms, we conducted several experiments and compared the per-class recall values with those of the Multilayer Perceptron (MLP), Long Short-Term Memory (LSTM), LeNet-5, AlexNet, and a CNN with the plain cross-entropy loss function (RLC-CNN). Support Vector Machine (SVM) and Random Forest (RF) are commonly used machine learning algorithms in network intrusion detection [30, 31], so we also selected SVM and RF to compare their classification results with those of the deep learning algorithms. The number of training epochs for the deep learning models was likewise set to 100. Table 5 and Figure 6 present the recall values of all models on each class; Figure 6 also shows the overall accuracy of each model. Table 5 shows that the deep neural networks perform significantly better than the classic machine learning algorithms, which need manually designed network traffic features before the training phase, whereas deep learning algorithms extract features automatically.

Among the deep learning algorithms, the residual blocks allow us to build deeper networks that learn more critical features of network traffic. Table 5 and Figure 6 demonstrate that our model achieves better results than the other deep learning algorithms. With residual learning, CNNs can deliver better performance in network intrusion detection.

In terms of minor classes, all the other models perform poorly because of the class imbalance in the training set. Our model utilizes focal loss to address this issue. Although RLF-CNN's performance on some dominant classes weakens slightly because of their reduced weights in the loss function, it outperforms the other classifiers on the minor classes with higher recall values, indicating that focal loss is better suited to classifying imbalanced data sets and enhances detection capabilities.

To demonstrate our model's ability to detect normal flows and attacks, we compare it with other algorithms using metrics including the numbers of True Positives and True Negatives. The testing results are shown in Table 6.

Among these models, RLC-CCNN is an improved version of RLC-CNN that uses the same class weights as those in the focal loss of RLF-CNN. SMOTE-RF [32] is a Random Forest algorithm combined with SMOTE. Pelican [19] and S-ResNet [1] are improved residual networks with faster convergence and better testing results than other deep learning algorithms. As shown in Table 6, all the models above correctly identify over 99.3% of the 37000 normal samples. However, compared with other contemporary algorithms for network intrusion detection, RLF-CNN correctly identifies more attacks and thus shows a higher attack detection rate, which matters given that most of the attacks in the data set are minor samples.

Compared with SMOTE-RF, our model detects more attacks while avoiding the generation of new data. SMOTE-RF generates over 300000 training samples, consuming considerable time and memory. Also, traditional machine learning algorithms lack the ability to acquire data features automatically; in the absence of feature engineering techniques, SMOTE-RF is therefore inferior to the others in detecting attacks. Compared with the other residual networks, our model obtains better results in attack detection. It can also be seen that our model outperforms RLC-CCNN: focal loss enables the model to focus on samples that are harder to learn, and the testing results indicate that it learns complex samples more efficiently and is superior to plain class weights in the training phase.

5. Conclusions

In this paper, a network intrusion detection method based on residual learning and focal loss has been proposed. Experimental results show that models with residual learning are easier to train, achieving lower loss values on the training data and higher accuracy rates on the testing data. Owing to residual learning, RLF-CNN achieves better performance than other deep learning algorithms on several metrics. Our model also uses a modified focal loss function to deal with the class imbalance problem in the training data and shows better results than a CNN with the same class weights. Despite these results, this study has potential limitations. Although our model outperforms other deep learning algorithms in detecting minor attacks thanks to the focal loss, its performance on some dominant classes weakens because of their reduced weights. How to improve the model's performance on minor classes without affecting its ability to detect dominant classes is therefore an important issue to address in the future. Also, the UNSW-NB15 data set contains only a few types of attacks; given the low tolerance for errors in IDSs, we will combine other data sets to cover more types of attacks in future work. Finally, due to limited computing resources, deeper neural networks with more residual blocks and normal blocks could not be tested; with more powerful resources, we will continue to perform more experiments and potentially obtain better results in detecting network attacks.

Data Availability

The processed UNSW-NB15 data sets used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publishing of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (no. 61906099) and the Open Fund of Key Laboratory of Urban Land Resources Monitoring and Simulation, Ministry of Natural Resources (no. KF-2019-04-065).