Abstract

Network intrusion poses a severe threat to the Internet of Things (IoT). Thus, it is essential to study information security protection technology in IoT. Learning sophisticated feature interactions is critical to improving detection accuracy for network intrusion. Despite significant progress, existing methods are strongly biased towards single low- or high-order feature interactions. Moreover, they always extract all possible low-order interactions indiscriminately, introducing too much noise. To address the above problems, we propose a low-order correlation and high-order interaction (LCHI) integrated feature extraction model. First, we selectively extract the beneficial low-order correlations between same-type features by the multivariate correlation analysis (MCA) model and an attention mechanism. Second, we extract the complicated high-order feature interactions by the deep neural network (DNN) model. Finally, we emphasize both the low- and high-order feature interactions and incorporate them. Our LCHI model seamlessly combines the linearity of MCA in modeling lower-order feature correlations and the nonlinearity of DNN in modeling higher-order feature interactions. Conceptually, our LCHI is more expressive than the previous models. We carry out a series of experiments on public wireless and wired network intrusion detection datasets. The experimental results show that LCHI improves detection accuracy by 1.06%, 2.46%, 3.74%, 0.25%, 1.17%, and 0.64% on the AWID, NSL-KDD, UNSW-NB15, CICIDS 2017, CICIDS 2018, and DAPT 2020 datasets, respectively.

1. Introduction

The COVID-19 pandemic in 2020 moved the main scenes of people’s lives and work from offline to the Internet overnight. According to “Cisco’s 2021 Global Networking Trends Report [1],” 14.6 billion IoT devices will be connected to the network by 2022. The scale and complexity of IoT networks are rapidly expanding, and people are facing increasingly complex network attacks. Cisco checks 47 terabytes of network traffic every day, analyzes 28 billion flows, and records 1.2 trillion security incidents. Industrial control systems, networking equipment, and industrial cloud platforms are more vulnerable to intrusions under increasingly open network connections, causing substantial losses. Thus, it is essential to study information security protection technology in IoT. Among such technologies, network intrusion detection is the foundation and core of ensuring IoT security.

Network intrusion detection consists of abuse-based and anomaly-based detection [2]. Abuse-based detection systems usually set rules to match network behavior. Thus, they achieve a high detection rate and a low false-positive rate for known intrusions. However, unknown intrusions, their variants, and zero-day attacks can easily evade them. Moreover, maintaining an updated rules database is a complex and labor-intensive task, as the rule setting requires a large amount of network security expertise. On the contrary, anomaly-based detection techniques are more promising for detecting zero-day and unknown intrusions [3], and they have become the mainstream.

At present, the more effective network intrusion detection methods use deep learning models. As a powerful approach to learning feature representations, deep learning models have the potential to learn sophisticated feature interactions. These models (e.g., convolutional neural network (CNN) [4], recurrent neural network (RNN)/long short-term memory (LSTM) [5], and DNN [6]) fit the best classification curve through data mining and statistical analysis. Moreover, they are not constrained by expertise in network security; the profiles of legitimate behaviors are developed from data mining and statistical analysis. Thus, they achieve relatively satisfactory performance. However, these models have some disadvantages. For example, CNN-based models are biased towards interactions between neighboring features, while RNN-based models are more suitable for data with sequential dependency. The existing deep learning models only capture the high-order feature interaction and fully or partly ignore the low-order interaction (e.g., multivariate correlations among features) [7]. However, in reality, multivariate correlations are widely present in intrusion data. For example, the identification, flag, and offset features in the network layer cooperate to guide IP fragmentation reorganization; the transport layer’s SYN, FIN, ACK, PSH, and RST features work together to reflect the TCP status. Thus, ignoring the low-order feature correlation is not conducive to classification.

Meanwhile, some researchers have explored feature correlation to distinguish between different intrusions. In [7, 8], researchers used the MCA method for accurate network traffic characterization by extracting the geometrical correlations between features; they extracted every correlation between every two different features. Dong et al. [9] proposed an intrusion detection model based on MCA-LSTM built on the temporal correlation features of intrusion data. From the above literature, we notice that the current correlation analysis methods always extract the correlations between any two different features. There are two problems: (1) dimension disaster and too much noise: if they do not perform feature selection, any feature combination (i.e., correlation) is considered. Thus, the feature dimension of the generated multivariate correlations is too large, resulting in dimension disaster and overfitting. Moreover, among these feature combinations, only a tiny part is meaningful; most are noise. Network intrusion data usually contains protocol-type and statistical-type features. The correlation between features of the same type is strong, whereas the correlation between features of different types is weak. Thus, generating the correlations between different types of features brings in too much noise. (2) Too much useful correlation loss: if they perform feature selection, many useful feature combinations will be lost due to the limited number of retained features. Moreover, the weak feature correlations between different types are still generated. Too much correlation loss and noise are not conducive to feature extraction.

In summary, the existing models are biased towards low- or high-order feature interaction. And in the low-order correlation extraction, too much noise is generated since they seldom consider the differences among features and the influence of correlations generated by different types of features. To solve the above problems, we propose a low-order correlation and high-order interaction integrated feature extraction model. The following contributions are made in this paper:
(1) To extract features more completely, we integrate the low-order correlation captured by the MCA model and the high-order interaction obtained by the DNN model.
(2) To characterize the low-order correlation more effectively, we comprehensively analyze the differences among features and divide them into different types. We selectively extract the useful low-order correlations between features of the same type, avoiding dimension disaster as well as excessive noise or correlation loss.
(3) To account for the classification influence of the generated correlations, we employ an attention mechanism to estimate the importance of different correlations when incorporating their latent representations.
(4) To evaluate the effectiveness and robustness of our LCHI model, we conduct a series of experiments on public wireless (e.g., AWID) and wired datasets (e.g., NSL-KDD, UNSW-NB15, CICIDS 2017, CICIDS 2018, and DAPT 2020). The experimental results show that our model achieves higher accuracy than the previous models.

The rest of this paper is organized as follows: Section 2 describes the related works. In Section 3, we state the problems of low- and high-order feature interaction extraction. Section 4 presents the LCHI model and its key technologies. We present the theoretical analysis of our model and the experimental evaluation in Sections 5 and 6, respectively. Section 7 concludes this paper.

2. Related Works

To improve the performance of network intrusion detection, researchers try to extract features from different aspects. This section mainly explains the current situation from two perspectives: high-order feature interaction and low-order feature correlation.

2.1. High-Order Feature Interaction Extraction

Deep learning models are widely adopted in network intrusion detection to extract sophisticated hidden features [10–15]. In intrusion detection, Wang et al. [16] proposed an SDAE-ELM-based integrated deep intrusion detection model to overcome long training times and low classification accuracy and to achieve a timely response to intrusion behavior. They also constructed a DBN-Softmax-based integrated model for host intrusion detection. The experimental results show that the proposed models can effectively improve detection accuracy. Ge et al. [6] proposed a feed-forward neural network model with embedding layers (to encode high-dimensional categorical features) for multiclass classification. Using a second feed-forward neural network model, they applied transfer learning to encode high-dimensional categorical features to build a binary classifier. Kasongo et al. [17] proposed a feed-forward deep neural network (FFDNN) wireless IDS system using a Wrapper-Based Feature Extraction Unit (WFEU). The WFEU used the Extra Trees algorithm to generate a reduced optimal feature vector. They compared the WFEU-FFDNN to standard machine learning algorithms, including random forest (RF), support vector machine (SVM), Naïve Bayes (NB), decision tree (DT), and k-nearest neighbor (kNN), and obtained better detection performance.

Some researchers adopt RNN/LSTM and CNN models to process the temporal-spatial features of intrusion data. Zhang et al. [5] proposed a unified model combining a multiscale convolutional neural network with long short-term memory (MSCNN-LSTM). The model processed the spatial and temporal features by the MSCNN and LSTM components, respectively, and the resulting spatial-temporal features were employed to perform the classification. For the real-time detection of anomalies within the in-vehicle network, Khan et al. [18] developed an LSTM-based false information attack/anomaly detection model. They adopted the LSTM model to process the time-series data and capture the sequential variation patterns. In the face of zero-day attacks, Sms et al. [19] applied RNN models to find complex patterns in attacks and generated similar ones. They demonstrated that RNNs are helpful for generating new, unseen mutants of attacks as well as synthetic signatures from the most advanced malware to improve the intrusion detection rate. The authors in [20] proposed a feature representation method to transform the raw flow-based statistical features into more discriminative representations and developed an ensemble of machine learning-based classifiers optimized to discriminate the malicious flows from the benign ones.

Other deep learning models are also applied to intrusion detection. Balasundaram et al. [21] proposed a three-tier intrusion detection architecture based on novel bat-optimized extreme learning machines to detect various cyberattacks over the IoT network. The proposed system consists of scalable IoT-deployable software and a data analyzer mechanism. Shitharth et al. [22] proposed Intrusion Weighted Particle-Based Cuckoo Search Optimization (IWP-CSO) and Hierarchical Neuron Architecture-Based Neural Network (HNA-NN) techniques to detect encrypted and external intrusions. Gu et al. [23] proposed an SVM and naive Bayes-based intrusion detection framework. They implemented the naive Bayes feature transformation to generate new data of high quality and then adopted an SVM classifier trained on the transformed data to build the intrusion detection model. The above models usually achieve better performance in intrusion detection. However, these methods commonly suffer from high false-positive rates because they only focus on the high-order feature interaction, intrinsically neglecting, fully or partly, the low-order correlation.

2.2. Low-Order Feature Correlation Extraction

Some researchers have explored the low-order feature correlation. Tan et al. [7] used the MCA for traffic characterization by extracting the geometrical correlations between two different features and then used a traditional statistical analysis method to construct a normal profile to distinguish normal traffic flows from DDoS flows. Yu et al. [24] exploited the flow correlation coefficient to distinguish the DDoS from the flash event. Like [7], they generated the correlations between any two different features, and the generated feature dimension was too large, easily resulting in dimension disaster. Jamdagni et al. [25] developed a refined geometrical structure-based analysis technique, where the Mahalanobis distance was used to extract the correlations between the selected packet payload features. To improve the low detection performance of network intrusion detection models caused by high-dimensional data, Dong et al. [9] proposed an MCA-LSTM intrusion detection model based on the time correlation characteristics of intrusion data. They first performed feature selection to pick the optimal feature subsets based on the information gain, then used the MCA method to extract the correlations between different features, and finally adopted an LSTM model to extract the temporal features from the correlations generated by the MCA. However, this method did not effectively integrate the low- and high-order feature interactions. Like [25, 26], these methods extracted the correlations from a selected feature subset, resulting in too much loss of useful correlations.

Factorization machines (FMs) [27, 28] are a popular solution that combines the advantages of SVMs with factorization models. FMs model all interactions between variables using factorized parameters. The FM is a supervised learning approach that enhances the linear regression model by incorporating second-order feature interactions. However, the FM models feature interactions in a linear way, which can be insufficient for capturing the nonlinear and complex inherent structure of real-world data. To mitigate the emerging danger of IoT-based botnet attacks, Arul [29] proposed and evaluated a deep nonlinear regression least-squares polynomial fit to recognize anomalous traffic originating from compromised IoT devices. However, the method generated too much noise.

From the above analysis, we notice that the current low-order feature correlation extraction methods usually extract possible correlations or interactions between any two different features, resulting in too much noise. Few consider the difference between different features when extracting low-order correlation, resulting in low detection performance.

3. Problems Statement

The limited original features in practice are not enough to characterize different attacks. Thus, it is important for network intrusion detection to learn feature interactions. There are two types of feature interactions:

(1) Low-order feature interaction $f(x_i, x_j)$, where $x_i$ and $x_j$ are different features. The low-order interaction is easy to obtain and understand based on background knowledge. As shown in the left subfigure in Figure 1(a), the low-order interaction directly reflects the principle of a DDoS attack. Since attackers often exploit bots to carry out the DDoS attack, we can infer that certain feature pairs may effectively distinguish between normal and DDoS flows. However, among these low-order interactions, only a few are useful, and the others are noise. Thus, we need to selectively and effectively generate and evaluate meaningful low-order interactions.

(2) High-order feature interaction $g^{*}(x_1, x_2, \ldots, x_m; \theta)$, where the parameter $\theta$ contains the weights, biases, etc., and the symbol $*$ in the upper right corner denotes the Kleene star operator. The high-order interaction is unknown and hard to understand, and it can only be extracted by deep learning models. As shown in the right subfigure of Figure 1(a), the interaction rules among features in the APT attack are unknown and hard to capture. In general, such feature interactions can be highly sophisticated, where both low- and high-order feature interactions should play important roles.
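To make the distinction concrete, the following minimal NumPy sketch contrasts the two interaction types. The feature names (packet_rate, syn_ratio, payload_entropy) and the weights are hypothetical stand-ins for illustration, not values from our model.

```python
import numpy as np

# One flow record with three hypothetical features:
# [packet_rate, syn_ratio, payload_entropy]
x = np.array([120.0, 0.9, 0.3])

# Low-order interaction: an explicit, interpretable product of two
# features (e.g., a high packet rate together with a high SYN ratio
# hints at SYN flooding).
low_order = x[0] * x[1]

# High-order interaction: a learned nonlinear composition over all
# features; expressive but opaque. W and b stand in for trained weights.
rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 3)), rng.normal(size=4)
high_order = np.maximum(0.0, W @ x + b)  # one ReLU layer, the simplest case

print(low_order)   # a single interpretable value
print(high_order)  # a distributed, hard-to-interpret representation
```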

To distinguish the low-order feature interaction from the high-order feature interaction, we usually call the low-order interaction the low-order correlation. The low- and high-order feature interactions are both useful for distinguishing different attacks. However, existing research does not make full use of them. There are two problems:

Problem 1: the existing works are all biased towards a single low- or high-order feature interaction. However, the detection accuracy of a single low- or high-order interaction is not satisfactory, as shown in Figure 1(d). The environment setup is the same as in Section 6. The typical low- and high-order feature interaction models are shown in Figure 1(b). Figure 1(c) visualizes the effect of a single low- or high-order feature interaction on the normal, DDoS, and APT attacks. Without the high-order interaction, we can still distinguish between the DDoS and normal flows, as shown in the left subfigure of Figure 1(c). However, the APT attack is much more complex, and it is hard to distinguish by the low-order interaction alone; referring to the left subfigure of Figure 1(c), the APT flow is similar to the normal flow. Without the low-order interaction, we can still distinguish between the APT and normal flows based on the single high-order interaction, as shown in the right subfigure of Figure 1(c). However, lacking the harmonization of the low-order interaction, deep learning models sometimes overfit the complex and unknown APT attack because its samples are rare, leaving the DDoS and normal flows intertwined, as shown in the right subfigure of Figure 1(c).

Problem 2: the low-order correlation methods introduce too much noise because they always extract all possible correlations between any two different features indiscriminately. Among these correlations, only a small part is useful; most are noise. Network intrusion data contains different types of features (e.g., protocol-type and statistical-type). The correlation between features of the same type is strong. However, the correlation between different types of features is weak and easily becomes noise. Table 1 shows the effects of correlations generated between features of the same type, of different types, and of all types on the CNN, LSTM, and DNN models. The experimental setup is the same as in Section 6. The accuracy for the same type is much higher than the others, and the feature dimension is lower. The accuracy for different types is much lower, and the feature dimension is not low. Due to the adverse effect of the correlations generated between different types, the accuracy for all types is limited. Furthermore, the feature dimension for all types is the highest, which requires much more training time.

4. LCHI Feature Extraction Model

4.1. Framework of the LCHI Model

Figure 2 shows the overall architecture of the LCHI model. The processes are as follows. Firstly, we extract the low-order feature correlation with a proposed feature division-multivariate correlation analysis (FD-MCA) method. We divide the network intrusion features into different types based on the characteristics of intrusion data (e.g., protocol-type features and statistical-type features). Guided by network background knowledge, we classify correlated features into the same type as much as possible, so the correlation within a type is strong (e.g., SYN/FIN/ACK/PSH/RST) while the correlation between types is weak. We then extract the low-order correlation between features only within the same type by the MCA method. Next, we project the correlation vectors obtained from the different types into dense embeddings of the same dimension in a common latent space, laying the foundation for the attention mechanism, and fuse the correlations through attention, accounting for the different effects of the generated correlations. Secondly, we extract the high-order feature interaction with a DNN model, exploiting its fully connected structure. Finally, we incorporate the low-order correlation and the high-order interaction to classify different attacks.

4.2. Low-Order Correlation Extraction by MCA and Attention
4.2.1. FD-MCA Method

The network intrusion data we usually obtain mainly contains three types of features:
(1) The basic features of the TCP connection: these features are obtained from the protocol header fields and cooperate to guide network data transmission (e.g., protocol and service).
(2) The content features of the TCP connection: these features are usually embedded in the data payload of the packet and reflect the content characteristics of intrusion behavior (e.g., num_failed_logins and num_root).
(3) The statistical features of the network flow: these features are extracted from the network operation process and usually indicate the network operation status (e.g., same_srv_rate and dst_host_count).

Since the gap between different types of features is vast, the correlation between different-type features may become noise and should be ignored. Even though there may be some correlations between different-type features that we do not know, the noncorrelated cases far exceed the correlated ones among all the feature correlations, so the useful correlations generated would be far fewer than the noise generated. Therefore, we divide the intrusion data into the types shown above and only extract the multivariate correlations between same-type features. This not only avoids introducing too many useless correlations but also retains as many useful correlations as possible. We apply the MCA [7] to extract the correlations by constructing the triangle area map (TAM).

The network intrusion data is $X = \{x^1, x^2, \ldots, x^n\}$, where $x^i = [f^i_1, f^i_2, \ldots, f^i_m]^T$ denotes the $i$th $m$-dimensional record. We first extract the geometrical correlation between the same-type features and map each record $x^i$ to its triangle area map $TAM^i$. Triangle area mapping is used to extract the correlation between the $j$th and $k$th features. Then, we project the record $x^i$ on the $(j,k)$th two-dimensional Euclidean subspace:

$y^i_{j,k} = [\varepsilon_j, \varepsilon_k]^T x^i = [f^i_j, f^i_k]^T, \quad (1)$

where $1 \le j \le m$, $1 \le k \le m$, and $j \ne k$. $\varepsilon_j = [e_{j,1}, e_{j,2}, \ldots, e_{j,m}]^T$ and $\varepsilon_k = [e_{k,1}, e_{k,2}, \ldots, e_{k,m}]^T$, in which $e_{j,j} = 1$, $e_{k,k} = 1$, and the others are 0. The $y^i_{j,k}$ is defined as a point on the Cartesian coordinate system in the $(j,k)$th 2D Euclidean subspace with coordinate $(f^i_j, f^i_k)$. Then, we build a triangle formed by the origin and the projected points of the coordinate $y^i_{j,k}$ on the $f_j$-axis and $f_k$-axis. The triangle area $Tr^i_{j,k}$ is defined as

$Tr^i_{j,k} = \frac{|f^i_j| \times |f^i_k|}{2}. \quad (2)$

Thus, we construct the TAM of $x^i$ in every feature type, and all the triangle areas are arranged on the map with respect to their indexes. $Tr^i_{j,k}$ denotes the element in the $j$th row and the $k$th column of the $TAM^i$. According to Equation (2), $Tr^i_{j,k} = Tr^i_{k,j}$, and $Tr^i_{j,k} = 0$ when $j = k$. Thus, the $TAM^i$ is symmetric:

$TAM^i = \begin{bmatrix} 0 & Tr^i_{1,2} & \cdots & Tr^i_{1,m} \\ Tr^i_{2,1} & 0 & \cdots & Tr^i_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ Tr^i_{m,1} & Tr^i_{m,2} & \cdots & 0 \end{bmatrix}. \quad (3)$

Thus, we only take the upper triangle of the TAM:

$Tr^i = [Tr^i_{1,2}, Tr^i_{1,3}, \ldots, Tr^i_{1,m}, Tr^i_{2,3}, \ldots, Tr^i_{m-1,m}]^T, \quad (4)$

whose dimension is $m(m-1)/2$.

Finally, we get the generated correlations from the different feature types: the correlations between the basic features of the TCP connection ($c_1$), the correlations between the content features of the TCP connection ($c_2$), and the correlations between the statistical features of the network flow ($c_3$). As shown in Figure 2, the original feature dimension is $m = m_1 + m_2 + m_3$, where $m_1$, $m_2$, and $m_3$ are the dimensions of the three feature types. The dimension of the nonrepetitive and nonzero correlations generated by the existing MCA method is $m(m-1)/2$. However, in our FD-MCA method, the dimension is much lower, only $m_1(m_1-1)/2 + m_2(m_2-1)/2 + m_3(m_3-1)/2$.
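The per-type TAM computation above can be sketched in a few lines of NumPy. The NSL-KDD group boundaries follow the division given in Section 6; everything else (the random record, the helper names) is illustrative.

```python
import numpy as np

def tam_upper_triangle(record):
    """Triangle areas Tr(j,k) = |f_j| * |f_k| / 2 for all j < k (Eqs. (2)-(4))."""
    f = np.abs(record)
    tam = np.outer(f, f) / 2.0            # full symmetric TAM (Eq. (3))
    j, k = np.triu_indices(len(f), k=1)   # keep only the upper triangle
    return tam[j, k]

def fd_mca(record, groups):
    """Apply MCA within each feature group only, as in FD-MCA."""
    return [tam_upper_triangle(record[g]) for g in groups]

# NSL-KDD division from Section 6: basic (1st-9th), content (10th-22nd),
# and statistical (23rd-41st) features.
groups = [slice(0, 9), slice(9, 22), slice(22, 41)]
record = np.random.rand(41)
c1, c2, c3 = fd_mca(record, groups)

# FD-MCA dimension: 36 + 78 + 171 = 285, versus 41*40/2 = 820 for plain MCA.
print(len(c1), len(c2), len(c3))
```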

4.2.2. Attention Mechanism for the Generated Correlations

This section evaluates the different impacts of the generated correlations. Network intrusion data contains various low-order correlations. However, the three correlations have their own emphases and are not equally useful for distinguishing attacks; different attacks place different weights on these correlations. Our model would be hindered if it modeled these correlations with the same weight, ignoring the differences. Therefore, we improve our model by discriminating the importance of these correlations. The attention mechanism discriminates the importance of different components when compressing them into a single representation.

Since the dimensions of the generated correlations are not equal, we first transform them into the same dimension. The original dimensions of $c_1$, $c_2$, and $c_3$ are $m_1(m_1-1)/2$, $m_2(m_2-1)/2$, and $m_3(m_3-1)/2$, respectively. The dimension unification is as follows:

$v_i = W_i c_i + b_i, \quad i \in \{1, 2, 3\}, \quad (5)$

where $W_i$ and $b_i$ are parameters and $v_i$ is the unified $d$-dimensional embedding.

The single representation of the correlations is as follows:

$c = \sum_{i=1}^{3} a_i v_i, \quad (6)$

where $a_1$, $a_2$, and $a_3$ are the attention scores for the correlations $v_1$, $v_2$, and $v_3$, respectively. $c$ denotes the fused features. Here, the attention scores are calculated via a two-layer attention network:

$a'_i = h^T \sigma(W v_i + b), \quad (7)$

where $W$ and $b$ are the first layer parameters, $h$ is the second layer parameters, and $t$ is the size of the hidden layer, which we call the attention factor. The symbol $\sigma$ is the activation function, and we use the ReLU, which empirically shows good performance. The attention scores are normalized through the softmax function, a common practice in previous work, as follows:

$a_i = \frac{\exp(a'_i)}{\sum_{j=1}^{3} \exp(a'_j)}, \quad (8)$

where $0 < a_i < 1$ and $\sum_{i=1}^{3} a_i = 1$.
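A minimal NumPy sketch of the dimension unification and attention fusion in Equations (5)–(8); the shapes ($d = 32$, $t = 16$) and random weights are illustrative stand-ins for parameters that would be learned end-to-end.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_fuse(corrs, proj, W, b, h):
    """Fuse per-type correlation vectors into one representation.

    corrs: list of raw correlation vectors c_1, c_2, c_3.
    proj:  list of (W_i, b_i) pairs for dimension unification (Eq. (5)).
    W, b, h: two-layer attention network parameters (Eqs. (7)-(8)).
    """
    v = [Wi @ c + bi for c, (Wi, bi) in zip(corrs, proj)]             # Eq. (5)
    scores = np.array([h @ np.maximum(0.0, W @ vi + b) for vi in v])  # Eq. (7)
    a = softmax(scores)                                               # Eq. (8)
    return sum(ai * vi for ai, vi in zip(a, v))                       # Eq. (6)

# Toy shapes: correlation dims 36/78/171 (NSL-KDD), d = 32, t = 16.
rng = np.random.default_rng(0)
d, t, dims = 32, 16, (36, 78, 171)
corrs = [rng.random(di) for di in dims]
proj = [(rng.normal(size=(d, di)) * 0.01, np.zeros(d)) for di in dims]
W, b, h = rng.normal(size=(t, d)) * 0.1, np.zeros(t), rng.normal(size=t) * 0.1
c = attention_fuse(corrs, proj, W, b, h)
print(c.shape)  # (32,)
```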

4.3. High-Order Interaction Extraction by DNN

In this section, we introduce a deep learning model to extract the high-order feature interaction. Owing to the fully connected characteristic of the DNN model, we employ a DNN to extract the high-order feature interaction in an implicit and nonlinear manner, as follows:

$h^{(0)} = x, \quad (9)$

$h^{(l)} = \sigma(W^{(l)} h^{(l-1)} + b^{(l)}), \quad l = 1, 2, \ldots, L, \quad (10)$

$y_{DNN} = h^{(L)}, \quad (11)$

where $x$ denotes the original features, $h^{(l)}$ denotes the $l$th hidden layer, $L$ is the number of hidden layers, and $W^{(l)}$ and $b^{(l)}$ are the parameters. $y_{DNN}$ denotes the output of the DNN. It is worth mentioning that the FD-MCA part and the DNN part share the same input, which enables learning both low- and high-order feature interactions from the original features.
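Correspondingly, the DNN forward pass of Equations (9)–(11) can be sketched as follows, using the three-hidden-layer, 64-neuron configuration from Section 6 and random weights in place of trained ones.

```python
import numpy as np

def dnn_forward(x, params):
    """Stacked fully connected layers with ReLU (Eqs. (9)-(11))."""
    h = x                                   # Eq. (9): h^(0) = x
    for W, b in params:
        h = np.maximum(0.0, W @ h + b)      # Eq. (10): one hidden layer
    return h                                # Eq. (11): y_DNN = h^(L)

# Three hidden layers of 64 neurons over a 41-dimensional input (Section 6).
rng = np.random.default_rng(0)
sizes = [41, 64, 64, 64]
params = [(rng.normal(size=(o, i)) * 0.05, np.zeros(o))
          for i, o in zip(sizes[:-1], sizes[1:])]
y_dnn = dnn_forward(np.random.rand(41), params)
print(y_dnn.shape)  # (64,)
```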

4.4. Fusion of the LCHI

According to the literature [31], considering the low- and high-order feature interactions simultaneously brings additional improvement over considering either alone in recommender systems. Inspired by this, we further incorporate the low-order correlation and the high-order interaction for the resulting output of the LCHI model:

$\hat{y} = \mathrm{softmax}(W_o [c; y_{DNN}] + b_o), \quad (12)$

where $W_o$ and $b_o$ are the parameters and $[c; y_{DNN}]$ denotes the concatenation of the low- and high-order representations. The loss function of the LCHI model is

$Loss = -\frac{1}{N} \sum_{i=1}^{N} y_i \log \hat{y}_i, \quad (13)$

where $N$ is the number of records, $y_i$ denotes the label of the original input sample, and $\hat{y}_i$ denotes the predicted result.
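Completing the sketch, Equations (12) and (13) amount to a softmax over the concatenated representations followed by a cross-entropy loss (our reading of the loss described above); the class count and shapes below are illustrative.

```python
import numpy as np

def lchi_predict(c, y_dnn, W_o, b_o):
    """Concatenate low- and high-order parts and classify (Eq. (12))."""
    z = W_o @ np.concatenate([c, y_dnn]) + b_o
    e = np.exp(z - z.max())
    return e / e.sum()                      # softmax class probabilities

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy over one-hot labels (Eq. (13))."""
    return -np.mean(np.sum(y_true * np.log(y_pred + eps), axis=1))

# Toy: fuse the 32-dim attention output with the 64-dim DNN output
# into 5 classes (e.g., normal plus four NSL-KDD intrusion types).
rng = np.random.default_rng(0)
W_o, b_o = rng.normal(size=(5, 96)) * 0.05, np.zeros(5)
probs = lchi_predict(rng.random(32), rng.random(64), W_o, b_o)
loss = cross_entropy(np.eye(5)[[0]], probs[None, :])
print(probs.sum(), loss)  # probabilities sum to 1
```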

5. Comparison of Existing Models Theoretically

This section compares the proposed LCHI model with the typical low- (e.g., PR, FM, and MCA) and high-order interaction models (e.g., CNN, LSTM, and DNN). The characteristics of each model are shown in Table 2.

5.1. CNN

The kernel convolution operation is $h^{(l)} = \sigma(W^{(l)} * h^{(l-1)} + b^{(l)})$, where $h^{(l-1)}$ and $h^{(l)}$ are the local input and output of layer $l$ and $*$ denotes the convolution over a local receptive field. Thus, the local perception characteristic of the model makes it more inclined to the interaction between neighboring features, resulting in the loss of interactions between features that are separated by a larger distance.

5.2. LSTM

The kernel process is $h_t = f(W x_t + U h_{t-1} + b)$. The current state $h_t$ is affected by the current input $x_t$ and the previous state $h_{t-1}$ passed down. Thus, the model is more suitable for time-series data with sequential dependency. However, the timing relationship is sometimes not obvious in intrusion data.

5.3. DNN

As shown in Equation (11), the model is well suited to capturing the high-order interaction due to its fully connected characteristic. However, the CNN, LSTM, and DNN models all lose the low-order feature interaction.

5.4. PR

Polynomial regression (PR) expands the input with all monomials up to degree $d$: $\phi(x) = [1, x_1, \ldots, x_m, x_1 x_2, x_1^2, \ldots]$. Thus, many new features are generated when obtaining the feature interactions. However, the PR may introduce some noise and cause dimension disaster (i.e., the expanded dimension is $\binom{m+d}{d}$) when the original feature dimension $m$ or the degree $d$ is large. Thus, $d$ is usually set as 1 or 2. We show the result of the low-order feature interaction by PR when $d = 2$ in Table 3.
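The combinatorial growth can be verified with scikit-learn's PolynomialFeatures, a standard implementation of this expansion (the 41-dimensional input mirrors NSL-KDD):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.random.rand(1, 41)   # one NSL-KDD-sized record
for d in (1, 2, 3):
    dim = PolynomialFeatures(degree=d).fit_transform(x).shape[1]
    print(d, dim)           # 42, 903, 13244: C(41+d, d) grows quickly
```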

5.5. FM [26]

$\hat{y}(x) = w_0 + \sum_{i=1}^{m} w_i x_i + \sum_{i=1}^{m} \sum_{j=i+1}^{m} \langle v_i, v_j \rangle x_i x_j$. The former part reflects the linear regression, and the latter denotes the pairwise feature interactions, where each feature $i$ is associated with a latent vector $v_i$. Compared with our LCHI model, the FM can be hindered by its modeling of all feature interactions with the same weight, as not all feature interactions are equally useful.
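For reference, a compact NumPy sketch of the degree-2 FM prediction; it uses the well-known $O(km)$ reformulation of the pairwise term instead of the explicit double sum.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Degree-2 factorization machine.

    The pairwise term uses the identity
    sum_{i<j} <v_i, v_j> x_i x_j
      = 0.5 * sum_f ((sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2).
    """
    s = V.T @ x                                        # per-factor sums
    pairwise = 0.5 * np.sum(s ** 2 - (V ** 2).T @ (x ** 2))
    return w0 + w @ x + pairwise

rng = np.random.default_rng(0)
m, k = 41, 8                                           # features, latent factors
x = rng.random(m)
print(fm_predict(x, 0.1, rng.normal(size=m), rng.normal(size=(m, k))))
```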

5.6. MCA [7] and MCA-LSTM [9]

As shown in Equation (4), the MCA can capture the low-order correlation, and the MCA-LSTM can capture both low- and high-order feature interactions. However, they both ignore the background knowledge and capture the correlations between any two different features, introducing too much noise. In addition, the MCA-LSTM conducts feature selection, losing too many valuable correlations.

When extracting the low-order feature correlation, the MCA is more balanced compared with the PR and FM. While the PR generates features with higher dimensions and introduces more noise, the FM is more suitable for features with high sparsity. When modeling the high-order feature interaction, the CNN and LSTM models cause more loss, and the fully connected characteristic of the DNN model is more conducive to achieving this goal. Therefore, we adopt the MCA and DNN to extract low- and high-order feature interactions, respectively. The detection results in Table 3 show the advantages of the MCA and DNN models in low- and high-order feature interactions, respectively. The experimental environment is the same as Section 6.

5.7. Summarizations

The LCHI not only captures both the low- and high-order feature interactions simultaneously but also selectively generates useful low-order correlation. It makes full use of background knowledge to avoid too much noise or information loss. Moreover, the LCHI also evaluates the importance of different feature correlations by attention.

6. Model Evaluation

This section describes the model evaluation. We evaluate the effectiveness of our LCHI model from the following perspectives:
(1) Detection performance of the LCHI model: we first compare the overall detection performance of the LCHI model with existing models, and then we study the hyperparameters of the LCHI model.
(2) Effectiveness of the FD-MCA method: we compare the detection performance of our FD-MCA method with the existing MCA methods (i.e., MCA with the full features and with selected features).
(3) Superiority of the attention mechanism: we compare the detection performance of our attention mechanism for the different low-order correlations with directly splicing the correlations.

6.1. Dataset Description and Metrics

This section aims to demonstrate the datasets and performance metrics. To illustrate the robustness and applicability of our LCHI model, we use the public wireless network intrusion dataset of Aegean WiFi intrusion dataset (AWID) [32] and the wired network intrusion datasets of NSL-KDD [33], UNSW-NB15 [34], CICIDS 2017 [35], CICIDS 2018 [36], and DAPT 2020 for APT [37] to evaluate the model.

The AWID dataset contains one type of normal data and three types of typical intrusion data. Each sample contains a 154-dimensional feature vector and a class label. Compared with the KDD Cup 99, NSL-KDD, and UNSW-NB15 datasets, which are all general-purpose datasets commonly used for IDS research, the AWID is solely generated from wireless network traffic. Unlike data intercepted from the traditional TCP/IP communication protocol, traces in the AWID were not artificially generated; they were naturally produced from a wireless local area network and are more in line with the actual situation. As the total number of records in the AWID dataset is very large, in this study, we use 20% of the reduced AWID dataset (i.e., AWID-CLS-R-Trn and AWID-CLS-R-Tst), and Table 4 depicts the value distribution of the selected AWID dataset.

The NSL-KDD dataset contains one type of normal network flow and four types of intrusion data. Each sample contains a 41-dimensional feature vector and a class label. The UNSW-NB15 dataset contains one type of normal network flow and nine types of intrusion data. Each sample contains a 42-dimensional feature vector and a class label. In the CICIDS 2017 dataset, a part of the dataset “Friday-WorkingHours-Afternoon” is used to validate our model’s effectiveness. The selected dataset contains 97718 records of benign traffic and 128027 records of DDoS network traffic, and each sample consists of a 78-dimensional attribute vector and a class label. In the CICIDS 2018 dataset, a part of the dataset “data cicids2018 reconnaissance” is used for our model. The selected dataset contains 667626 records of benign traffic and 193360 records of DoS-SlowHTTPTest traffic, and each sample has a 79-dimensional attribute vector and a class label. In the DAPT 2020 dataset, we use the “data_custom_wednesday” dataset to test our model. The selected dataset contains 8855 records of benign traffic, 8588 records of establish-foothold traffic, and 44 records of reconnaissance traffic, and each sample has an 83-dimensional attribute vector and a class label. Unlike the previously addressed datasets, these three datasets do not divide the data into training and testing groups. Thus, we randomly select 66% of each dataset for training and use the remaining 34% for testing. The value distributions of these datasets are shown in Table 4.

In this study, we use the following four indicators to evaluate the detection performance: accuracy (ACC), F1-measure (F1), false-negative rate (FNR), and false-positive rate (FPR). The calculation formulas are as follows:

$ACC = \frac{\sum_i N_i}{N}, \quad F1 = \frac{2TP}{2TP + FP + FN}, \quad FNR = \frac{FN}{FN + TP}, \quad FPR = \frac{FP}{FP + TN},$

where $N_i$ denotes the number of $i$-type intrusions successfully detected and $N$ denotes the number of all the samples. TP, TN, FP, and FN denote the true positive, true negative, false positive, and false negative, respectively. The higher the ACC and F1 and the lower the FNR and FPR, the better the detection performance.
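For the binary case, these indicators reduce to the following small helper (our own illustration; the multiclass ACC simply sums the per-class hits as in the formula above):

```python
def detection_metrics(tp, tn, fp, fn):
    """ACC, F1, FNR, and FPR from binary confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    fnr = fn / (fn + tp)          # missed attacks among all attacks
    fpr = fp / (fp + tn)          # false alarms among all normal samples
    return acc, f1, fnr, fpr

# Example: 950 attacks caught, 50 missed, 30 false alarms on 1970 normals.
print(detection_metrics(tp=950, tn=1970, fp=30, fn=50))
```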

In this study, we deploy our LCHI model on a server with four GTX 1080Ti cards, an E5-2650 v2 CPU (32 cores), 64 GB of memory, a 440 GB SSD, and a 4 TB hard drive, and we conduct the experiments in Python 3.7 and TensorFlow 1.14.

6.2. Comparison of Detection Performance

In this section, we first compare the overall detection performance of our LCHI model with previous intrusion detection models. Then we study the hyperparameters of our LCHI model.

6.2.1. Overall Detection Performance

This section describes the overall performance of our LCHI model with existing models. In the AWID dataset, since the feature dimension is too large, we obtained 72 features after removing some all-zero features. And then we selected 10 protocol-type features as the basic features, and the remaining are considered statistical features. In the NSL-KDD dataset, the features were divided into three types: the basic features of TCP connection (1st–9th dimensions), the content features of TCP connection (10th–22nd dimensions), and the statistical features of network flow (23rd–41st dimensions). The UNSW-NB15 dataset was also divided into three types: the 1st–14th, 15th–22nd, and the remaining. In the CICIDS 2017 dataset, we selected 12 protocol-type features as the basic features and the remaining as the statistical features. In the CICIDS 2018 dataset, we obtained 68 features after removing some all-zero features as the feature dimension is too large. We selected 10 protocol-type features as the basic features and the remaining as the statistical features. In the DAPT 2020 dataset, we selected 11 protocol-type features as the basic features, and the remaining are considered statistical features.

In our LCHI model, the DNN consisted of three hidden layers, with 64 neurons per layer, and ReLU activation was adopted. The unified embedding dimension and the attention factor were set to fixed values across all datasets. Table 5 shows the average detection results of the LCHI model and previous models (i.e., SVM, BN, RF [38], CNN, LSTM, DNN, and MCA [7]).

We can see that in these six datasets, our LCHI model has higher ACC and F1 than the other models. For the FNR and FPR, our model maintains a much lower value. The details are as follows.

In the AWID dataset, our model achieves the highest ACC with a 99.60% rate, which is close to 100%; the second largest ACC is 98.54%. Thus, our model improves the ACC by 1.06%. Meanwhile, the FNRs of our model are all lower than those of the other models. Although the FPR of our model is not the lowest, it is only 0.005%.

In the NSL-KDD dataset, the ACC of our LCHI model is 83.97%, while the second highest is 81.51% from the DNN model. We improve the ACC by 2.46%. For the FNR and FPR, our model has the lower FNR. Although the FPR of our model is higher than that of BN and RF, it maintains a much lower value.

In the UNSW-NB15 dataset, the FNR and FPR of our model are lower than others. For the ACC, our model achieves the best outcome with an 80.78% rate, followed by DNN and LSTM, with 77.04% and 76.20%, respectively. We improve the ACC by 3.74%.

In the CICIDS 2017 dataset, our model achieves the best outcome for the ACC metric with a 99.78% rate, followed by DNN and RF, with 99.53% and 99.16%, respectively. The FNR and FPR in our model are not the lowest. However, the values are only 0.28% and 0.13%, respectively. Under the premise of enhancing ACC and F1, our model maintains a minimal value on the FNR and FPR.

In the CICIDS 2018 dataset, our model achieves the best ACC with a 99.61% rate, followed by DNN and CNN, with 98.44% and 97.29%, respectively. The ACC of our model is close to 100%. The FNR and FPR of our model are not the lowest. However, the values are only 1.55% and 0.05%, respectively. Under the premise of enhancing ACC and F1, our model maintains minimal FNR and FPR.

In the DAPT 2020 dataset, the ACC and F1 of our LCHI model are higher than those of the other models, at 99.11% and 99.12%. Although the FNR and FPR are not always lower than those of the other models, they remain very low, at only 1.17% and 0.60%, respectively.

Our model not only extracts the multivariate correlations between the same feature types but also integrates the low-order interaction and high-order interaction features. While extracting the effective correlation features, we try our best to avoid feature redundancy or too much information loss. The experimental results verify the effectiveness of our LCHI model.

6.2.2. Impact of Hyperparameters and Statistical Analysis

We then study the impact of different hyperparameters of our LCHI model on the UNSW-NB15 dataset. The order is (1) activation functions, (2) dropout rate, (3) number of hidden layers, and (4) number of neurons per layer. The losses of different parameters are shown in Figure 3.

ReLU, Sigmoid, and tanh are the most common activation functions for deep learning models. In this paper, we compare the performance of the model when applying these three activation functions. As shown in Figure 3(a), ReLU is more appropriate than Sigmoid and tanh in the LCHI model when the other parameters are the same.

Dropout refers to the probability that a neuron is randomly dropped from the network during training. It is a regularization technique that balances the precision and the complexity of the neural network. We set the dropout to 0, 0.1, 0.2, 0.3, 0.4, and 0.5. As shown in Figure 3(b), the model reaches the best performance when the dropout is set to 0.2.

As presented in Figure 3(c), increasing the number of hidden layers improves the performance of the model at the beginning. However, the performance degrades if the number of hidden layers keeps increasing, a phenomenon caused by overfitting. The loss is lowest when the number of hidden layers is set to 3.

When other factors remain the same, increasing the number of neurons per layer introduces complexity. We can observe from Figure 3(d) that increasing the number of neurons does not always benefit. For instance, the loss is the lowest when the number of neurons is 64. The model performs worse when we increase the number of neurons from 64 to 256 because an overcomplicated model is easy to overfit. In our tests, 64 neurons per layer is a good choice.

The above experiments show that our LCHI model gets lower loss when the activation function is ReLU, and the dropout is 0.2. The number of neurons per layer is set to 64, and the number of layers is set to 3. These parameters can effectively extract features, and the model is not too complicated, thus avoiding overfitting or long training time. Finally, we display the training loss and testing loss of different epochs for different models, as shown in Figure 4. We can see that our LCHI model gets lower loss than the other models both on the training and testing stages, indicating that LCHI can better fit the data and lead to more accurate detection.

Table 6 shows the training time per epoch of different models on the UNSW-NB15 and AWID datasets. The training time of our LCHI model is longer than that of the LSTM, DNN, and MCA models. However, although these models train faster than the LCHI model, they all achieve lower detection accuracy (see Table 5). Thus, at the cost of a moderately longer training time, our LCHI model offers higher detection accuracy and practical application value.

6.3. Effectiveness of the FD-MCA Method

To evaluate the effectiveness of the FD-MCA method, we conducted a series of experiments on different correlation extraction methods. We extracted the multivariate correlations between every two different features from the full feature set (i.e., full MCA), from the selected features (i.e., selected MCA), and only between features within the same type (i.e., FD-MCA). In these tests, we employed the feature selection algorithm of literature [17] and selected 20, 20, 22, and 28 features for the AWID, NSL-KDD, UNSW-NB15, and CICIDS 2018 datasets, respectively.

The detection results are shown in Figure 5. On these four datasets, the ACC and F1 of our FD-MCA method are both higher than those of the full MCA and selected MCA methods. Comparing the feature dimensions, the dimension of the full MCA is much higher than the others, and the dimension of the selected MCA is much lower. For the full MCA, although it generates the most correlations, its ACC and F1 are lower than the others: among the generated correlations, the noise may exceed the useful correlations, especially for the correlations generated between different feature types. For the selected MCA, the selected features are far fewer than the original features; thus, some useful features are removed, resulting in much loss of useful correlations. Our FD-MCA method avoids the defects of the above two methods and only generates the correlations between features of the same type without removing any original feature. Under the guidance of background knowledge, our FD-MCA method avoids excessive noise. The experimental results show that our method is superior to the others.

6.4. Superiority of the Attention Mechanism

This section compares the performance of different processing methods for the generated low-order correlations. Our model uses the attention mechanism to fuse these different correlations (i.e., attention). Alternatively, we can directly splice these correlations (i.e., spliced). We compared these two methods to verify the superiority of our attention method. As shown in Figure 6, the attention method’s training and testing losses are much lower than those of the spliced form, indicating that the attention method can better fit the data and detect more accurately. Figure 7 shows the loss of attention w.r.t. different attention factors. The attention method’s performance is relatively stable across attention factors. Thus, the experimental results justify the rationality of the attention design that estimates the importance scores of different correlations.

We then adopted the DNN model to train these two processing methods on the AWID, NSL-KDD, UNSW-NB15, and CICIDS 2018 datasets. The detection results are shown in Table 7. On these four datasets, the ACC and F1 of our attention method are higher than those of the spliced method. The ACC improved by 0.26%, 0.29%, 0.74%, and 0.11% on these four datasets, respectively. Compared with the spliced method, the attention method takes into consideration the effects of different low-order correlations. These low-order correlations have different distinguishing capabilities for different attacks and help distinguish the subtle differences between attacks; splicing the correlations loses these tiny differences. The experimental results verify the superiority of our attention mechanism.

7. Conclusions

This paper proposes a novel LCHI model to learn the low-order feature correlation and the high-order feature interaction simultaneously. Firstly, we consider the differences among features and divide them into different types. Since the feature correlation within the same type is much stronger than between different types, we selectively generate the low-order correlations between same-type features to avoid too much noise, and we do not conduct feature selection to avoid too much correlation loss. Moreover, considering the influence of different correlations, we apply the attention mechanism to them. Secondly, we employ the DNN model to extract the sophisticated high-order feature interaction. Finally, we incorporate the low-order feature correlation and the high-order feature interaction. Above all, our LCHI model seamlessly combines the linearity of MCA in modeling lower-order feature correlations and the nonlinearity of DNN in modeling higher-order feature interactions. Conceptually, LCHI is more expressive than the previous models. The experimental results indicate that our model achieves higher accuracy than the others. In the future, more effective integration of low- and high-order feature interactions in real IoT networks will be our next task.

Data Availability

The data used to support the findings of this study are included within the article. We have described the data and their references in detail in this article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant No. 61902013, No. U1636208, No.61862008) and the Beihang Youth Top Talent Support Program (Grant No. YWF-21-BJ-J-1039).