Abstract

To design and develop AI-based cybersecurity systems (e.g., an intrusion detection system (IDS)) that users can justifiably trust, one needs to evaluate the impact of trust in machine learning and deep learning technologies. To guide the design and implementation of trusted AI-based systems for IDS, this paper compares machine learning and deep learning models and investigates the impact of trust based on the accuracy of the trusted AI-based systems with respect to malicious data in the IDS. The four machine learning techniques are decision tree (DT), K nearest neighbour (KNN), random forest (RF), and naïve Bayes (NB). The four deep learning techniques are LSTM (one and two layers) and GRU (one and two layers). Two datasets are used to classify the IDS attack type: the wireless sensor network detection system (WSN-DS) dataset and the KDD Cup network intrusion dataset. A detailed comparison of the eight techniques' performance, using all features and selected features, is made by measuring accuracy, precision, recall, and F1-score. Considering the findings related to the data, methodology, and expert accountability, interpretability of AI-based solutions also becomes necessary to enhance trust in the IDS.

1. Introduction

A cybersecurity system is developed based on different peers, including technology, processes, and people. The relationships among these peers are the core of trust management in cybersecurity, for example, (1) the relationship between people and groups, (2) the relationship between people and organizations, and (3) the relationship between people and technology. The trusted peers are deployed in the cybersecurity system, which aims to detect cyberattacks [1].

Artificial intelligence (AI) refers to a set of techniques that simulate human intelligence in machines. The core idea of these techniques is extracting knowledge from a collection of data. Consequently, there is no guaranteed level of trust in AI-based techniques, due to three aspects: (1) the quality of the data, (2) the degree of complexity of the methodology used to design the AI system, and (3) the AI engineer's experience. In the context of this work, a set of peers of AI technologies can interact to establish trust for cybersecurity systems. Therefore, we cannot blindly trust AI technologies to protect cybersecurity systems against cyberattacks. Consequently, this paper studies trust in AI-based solutions, including machine learning and deep learning, in cybersecurity systems, considering (1) data, (2) methodology, and (3) expert accountability. To do so, we investigate AI-based solutions in the trust context of cybersecurity in terms of the quality of data, methodologies, and experience.

1.1. Trust in Intrusion Detection Systems

“Trust” is a commonly used word in cybersecurity; it describes a foundation that must be established for cybersecurity systems, including machine-to-machine (M2M) systems. In M2M systems, trust can be defined as the confidence between machines to identify and manage their information technology assets. To achieve a trust chain between machines, cryptography, digital signatures, electronic certificates, and AI-based solutions are used. Although these techniques make trust seem like a simple function, establishing it is often a fundamental challenge. In particular, the challenge of trust in cybersecurity is a broader notion concerning the quality of the information being exchanged among machines, the methodologies used to design these techniques, and the accountability of the experts who use them. In this work, we investigate AI-based solutions in the trust context of cybersecurity. Many efforts have been made in research and industry to protect critical systems from cyberattacks. IDS have received attention due to the continuously increasing cost of fighting cybercrime [2]. Cybercrime types include (1) malicious insiders, (2) denial of service, and (3) web-based attacks. Therefore, most companies and enterprises deploy cybersecurity systems (e.g., antivirus, firewall, and IDS).

The core function of an IDS is identifying malicious activities in advance, before they access information and harm the confidentiality of critical systems [3]. This demand to secure systems against both known and unknown threats poses a challenge for research communities and industry to design secure and trustworthy systems against cyberattacks [4]. It also raises the question of how to successfully defend against both known and unknown threats. There is no straightforward answer, because the number of threats increases every year [4]. Recently, AI-based technologies, including machine learning and deep learning, have played a vital role in learning from historical data collected from previous attacks. The knowledge extracted by these models is used to enhance trust in IDS [5].

1.2. Contribution

Our main contributions are summarized as follows:
(i) We develop an investigation methodology to study the trust impact in intrusion detection, including the data, methodology, and expert accountability, by analyzing machine learning and deep learning models' performance
(ii) We collect intrusion detection data from the wireless sensor network detection system (WSN-DS) and the KDD Cup network intrusion dataset
(iii) We apply different feature engineering techniques, including the correlation matrix
(iv) We compare four machine learning models (DT, KNN, RF, and NB) and four deep learning models (LSTM and GRU, using one and two layers) to study the trusted AI-based systems' accuracy regarding the malicious data to detect any intrusion in the system

1.3. Paper Organization

The rest of this paper is organized as follows. A review of relevant works is conducted in Section 2. The methodology is provided in Section 3. The experiments and results are described in Section 4. The discussion is introduced in Section 5. Finally, the paper is concluded in Section 6.

2. Related Work

Vinayakumar et al. [6] used a deep neural network to develop an IDS that predicts unforeseen and unpredictable cyberattacks. Almomani et al. [7] used an artificial neural network (ANN) to develop an IDS that classifies different DoS attacks. The authors in [8] used multistage machine learning-based intrusion detection to detect and classify four types of jamming attacks. Abhale and Manivannan [9] used different types of supervised machine learning to classify the anomaly type of an IDS. Alqahtani et al. [10] proposed genetic-based extreme gradient boosting (XGBoost) to detect minority classes of attacks in the highly imbalanced data traffic of wireless sensor networks. The authors in [11] introduced an ensemble learning scheme for classifying network intrusions. Farrahi and Ahmadzadeh [12] used various algorithms, such as k-means clustering, naïve Bayes, support vector machines, and the OneR algorithm, to classify regular traffic and DoS attacks. In addition, a genetic algorithm (GA) was implemented to detect different types of intrusions [13].

Some researchers have used feature selection methods to select essential features, which reduces the computational time of the algorithms. Because network data contain a large number of features, many IDS have been developed with feature selection [14]. Chebrolu et al. [15] identified the primary elements in constructing an IDS that are crucial for real-world intrusion detection. Zaman and Karray [16] implemented a feature selection technique to construct a lightweight IDS. Vimalkumar and Radhika [17] implemented a principal component analysis- (PCA-) based feature selection technique in a big data framework for IDS. Balakrishnan et al. [18] developed an IDS model with the gain ratio as a feature selection technique. Most IDS-based studies focused on the performance of the implemented model. Alkasassbeh et al. [19] concentrated on different types of attacks, such as HTTP flood, smurf, SIDDOS, and UDP flood. They implemented various machine learning algorithms to detect DoS intrusions and demonstrated a high accuracy of 98.36% using a multilayer perceptron (MLP). Peng et al. [20] proposed an IDS based on a decision tree to improve detection efficiency. Their method showed better performance than naïve Bayesian and KNN methods.

3. Research Methodology

This section describes our approach to investigating the trust impact in intrusion detection using machine learning and deep learning models. To do so, five phases are developed: (1) data collection, which describes the datasets and their characteristics; (2) splitting the datasets; (3) feature extraction; (4) optimization and training of the models; and (5) the evaluation metrics that will be used [19] for performance comparison (see Figure 1). Further details about these phases are described as follows.

3.1. Data Collection

In this section, we describe the datasets used to find the machine learning and deep learning models that obtain the best performance for attack type classification in IDS. Two datasets were collected: the wireless sensor network detection system (WSN-DS) dataset and the KDD Cup network intrusion dataset.

3.1.1. WSN-DS Dataset

The first dataset is WSN-DS, a specialized dataset for detecting intrusions in wireless sensor networks. The WSN-DS dataset was collected by [7] to help better detect and classify types of denial-of-service (DoS) attacks. In this work, we use the WSN-DS dataset to study the machine learning and deep learning models' performance with respect to sensor nodes that are able to distinguish attack patterns from normal traffic. We then compare the machine learning and deep learning models' performance to study the impact of trust in machine learning- and deep learning-based IDS.

The WSN-DS dataset contains 23 features extracted using LEACH routing protocol including Id, Time, Is_CH, who_CH, RSSI, Dist_To_CH, M_D_CH, A_D_CH, ADV_S, ADV_R, JOIN_S, JOIN_R, ADV_SCH_S, ADV_SCH_R, Rank, DATA_S, DATA_R, Data_Sent_BS, Dist_CH_BS, Send_code, Current_Energy, Consumed_Energy, and Attack_Type [7].

The dataset file has only 19 features, including the class label [10]. These 19 features are Id, Time, Is_CH, who_CH, Dist_To_CH, ADV_S, ADV_R, JOIN_S, JOIN_R, ADV_SCH_S, ADV_SCH_R, Rank, DATA_S, DATA_R, Data_Sent_BS, Dist_CH_BS, Send_code, Consumed_Energy, and Attack_Type. The number of samples in the WSN-DS dataset is 374,662. These samples are distributed among five main groups, four of which are types of DoS attack labeled as attacks (Blackhole, Grayhole, Flooding, and Scheduling attacks) plus Normal. The description of the attacks is as follows (see Table 1):
(i) Blackhole attack: a type of DoS attack where the attacker advertises itself at the beginning of the round to affect the LEACH protocol
(ii) Grayhole attack: a type of DoS attack where the attacker advertises itself as a CH for other nodes to affect the LEACH protocol
(iii) Flooding attack: a type of DoS attack where the attacker advertises itself by sending a large number of advertising CH messages to affect the LEACH protocol
(iv) Scheduling attack: a type of DoS attack where the attacker acts as a CH and assigns all nodes the same time slot to send data during the setup phase of the LEACH protocol
(v) Normal: no threat

Furthermore, Table 2 shows a set of descriptive statistics of the WSN-DS dataset using a set of statistical functions, including count, mean, std, min, and max. We have ignored Id because it is only used to provide a unique symbolic number for the sensor node, and it makes no sense to compute statistics for it. Therefore, 18 features will be used in the next phases.
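The statistics in Table 2 can be reproduced with a few lines of pandas; the sketch below is illustrative only, and the file name WSN-DS.csv is an assumption about how the dataset is stored locally.

    import pandas as pd

    # Load the WSN-DS dataset (file name assumed for illustration).
    df = pd.read_csv("WSN-DS.csv")

    # Drop the node identifier, which is only a unique label and carries
    # no statistical meaning (column name assumed).
    df = df.drop(columns=["Id"], errors="ignore")

    # count, mean, std, min, quartiles, and max for every numeric feature
    print(df.describe())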

3.1.2. KDD Dataset

The second dataset is the KDD Cup network intrusion dataset (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html). The data comes from the DARPA 98 Intrusion Detection Evaluation by Lincoln Laboratory at MIT. According to [21], these datasets were collected using multiple computers connected to the Internet to model a small US Air Force base of qualified personnel, using several simulated intrusions. In this work, we use the KDD dataset to study the machine learning and deep learning models' performance with respect to nodes that can distinguish attack patterns from normal traffic. We then compare the machine learning and deep learning models' performance to study the impact of trust in machine learning- and deep learning-based IDS. There are 42 attributes in this dataset. The number of samples in the KDD dataset is 311,029. These samples are distributed among five main groups, four of which are labeled as attacks (Denial of Service, User to Root, Remote to Local, and Probing) plus Normal. The description of the attacks is as follows (see Table 3):
(i) Denial-of-service (DoS) attack: carried out by illegal users, causing a resource constraint for the targeted system. Consequently, the targeted system is unable to provide efficient services to legal users.
(ii) User to root (U2R) attack: the attacker belongs to the same group and tries to access the root of the system using a normal account within the network.
(iii) Remote to local (R2L) attack: the remote user has no account to access a specific node within the network. The attacker tries to gain local access by sending packets to explore any vulnerabilities within the network.
(iv) Probe attack (Probe): the attacker collects data about the network configuration to discover vulnerabilities and then accesses the network through loopholes.
(v) Normal: no threat.

Furthermore, Table 4 shows a set of descriptive statistics of the KDD dataset using a set of statistical functions including count, mean, std, min, and max.

3.2. Splitting Dataset

In this step, the WSN-DS and KDD datasets are split into a 70% training set and a 30% testing set. The training set is fed into the machine learning/deep learning models so that the models can learn from this data, while the unseen test set is used to evaluate the machine learning/deep learning models. Tables 5 and 6 present the number of instances in these two sets for the WSN-DS and KDD datasets, respectively.
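As a minimal sketch of this step, scikit-learn's train_test_split can produce the two sets; the stratify and random_state arguments are our assumptions for reproducibility, not settings reported for the original experiments.

    from sklearn.model_selection import train_test_split

    # X holds the features and y the attack-type labels (assumed defined).
    # 70% of the samples are kept for training and 30% for testing;
    # stratify keeps the class proportions similar in both sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=42)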

3.3. Feature Extraction

The key benefit of using feature selection methods is determining the relevant features in the dataset. Feature selection is therefore necessary for the machine learning and deep learning processes, since irrelevant features can degrade the models' performance. In the context of this work, feature selection enhances the classification accuracy of the attack types and reduces the model execution time. We have used the correlation matrix and the p value to remove the features that have less significance for classifying the attack and that affect the models' performance.
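A minimal sketch of this filtering step is shown below, assuming the data are in a pandas DataFrame df whose class label column Attack_Type has been numerically encoded; the exact thresholds used in this work are described in Sections 3.3.1 and 3.3.2.

    import pandas as pd
    from scipy.stats import pearsonr

    # df is assumed to hold numeric features plus an encoded Attack_Type column.
    # Feature-to-feature correlation matrix, used to spot redundant features.
    corr_matrix = df.corr()

    # Pearson p value of each feature against the class label; a threshold
    # on these values decides which features are kept.
    p_values = {
        col: pearsonr(df[col], df["Attack_Type"])[1]
        for col in df.columns if col != "Attack_Type"
    }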

3.3.1. WSN-DS Dataset

For the WSN-DS dataset, we used a correlation matrix for feature analysis to calculate each feature's relation with the other features within the dataset, as depicted in Figure 2. It can be seen that the features within the WSN-DS dataset do not have high correlations. We removed only one feature and then calculated the p values for the remaining 17 features.

As the attack type will be classified by machine learning and deep learning models, Table 7 presents the p values of the 17 features used to choose the features most correlated with the attack type for the machine learning and deep learning models. Consequently, the features whose correlation with the attack type is above 0.005 were selected to be fed into the machine learning and deep learning models. In particular, 6 features were selected based on their high correlations, namely Time, Dist_To_CH, JOIN_R, Rank, DATA_S, and Send_code, and their p values are 7.00E − 93, 1.47E − 24, 5.93E − 06, 0.009842, 1.31E − 32, and 2.44E − 125, respectively.

3.3.2. KDD Dataset

For the KDD dataset, we also used a correlation matrix for feature analysis to calculate each feature's relation with the other features within the dataset. We found that the features within the KDD dataset do not have high correlations. We removed 12 features and then calculated the p values for the remaining 30 features.

As the attack type will be classified by machine learning and deep learning models, Table 8 presents the p values of the 30 features used to choose the features most correlated with the attack type for the machine learning and deep learning models. Consequently, the features whose correlation with the attack type is above 0.005 were selected to be fed into the machine learning and deep learning models. In particular, 14 features were selected based on their high correlations, namely duration, service, src_bytes, land, urgent, hot, num_compromised, su_attempted, num_file_creations, num_shells, num_access_files, num_outbound_cmds, is_host_login, and srv_diff_host_rate, and their p values are 2.33E − 60, 0.604708, 1.34E − 68, 9.51E − 128, 0.402631, 5.01E − 230, 2.69E − 43, 2.45E − 21, 3.48E − 11, 9.85E − 43, 6.50E − 26, 0.043069, 0.039867, and 1.56E − 70, respectively.

3.4. Machine Learning and Deep Learning Models

The regular machine learning models used in this paper are decision tree (DT), K nearest neighbour (KNN), random forest (RF), and naïve Bayes (NB). Moreover, among deep learning algorithms, we analyze the performance of LSTM (one and two layers) and GRU (one and two layers).
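The four regular classifiers can be instantiated directly from scikit-learn, as in the sketch below; GaussianNB is assumed here as the naïve Bayes variant, and the default hyperparameters shown are not the tuned values used in the experiments.

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.naive_bayes import GaussianNB  # naive Bayes variant assumed

    models = {
        "DT": DecisionTreeClassifier(),
        "KNN": KNeighborsClassifier(),
        "RF": RandomForestClassifier(),
        "NB": GaussianNB(),
    }

    # Fit each model on the training split and report its test accuracy.
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(name, model.score(X_test, y_test))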

3.5. Optimization and Training Models

In this section, two categories of optimization and training models will be presented, including machine learning and deep learning.

3.5.1. Regular Machine Learning Models

(1) K-Fold Cross-Validation. The dataset is divided into k equal-sized sections, of which k − 1 are used to train the classifier and the remaining part is used to test the performance at each stage. The validation process is repeated k times, and the output of the classifier is estimated based on the k tests. Various values of k can be selected for CV. In our analysis, we used k = 10 (the 10-fold CV process), with 70% of the data for training and 30% of the data for testing purposes.
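A minimal sketch of the 10-fold CV procedure with scikit-learn is shown below; clf stands for any of the four classifiers, and the use of StratifiedKFold with shuffling is our assumption rather than a documented choice of the original setup.

    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # clf is any classifier instance (e.g., RandomForestClassifier()).
    # Each of the 10 folds is used once for validation while the other
    # nine folds are used for training.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    scores = cross_val_score(clf, X_train, y_train, cv=cv, scoring="accuracy")
    print(scores.mean(), scores.std())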

(2) Hyperparameter Tuning. Hyperparameter tuning is used to pass various parameter values to the model. Grid search is the most widely used method for hyperparameter tuning. Initially, the user defines a set of candidate values for each hyperparameter. The model is then evaluated for the candidate values of each hyperparameter, and the values that achieve the best performance result are selected.
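The sketch below illustrates grid search with scikit-learn's GridSearchCV; the random forest grid is purely illustrative and does not reflect the actual candidate values used in this study.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # Every combination of these candidate values (values assumed for
    # illustration) is evaluated with cross-validation, and the
    # best-scoring combination is kept.
    param_grid = {
        "n_estimators": [50, 100, 200],
        "max_depth": [None, 10, 20],
    }
    search = GridSearchCV(RandomForestClassifier(), param_grid,
                          cv=10, scoring="accuracy")
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)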

3.5.2. Deep Learning Models

For hyperparameter optimization, we used the Keras Tuner library to pick the optimal set of hyperparameters for the hidden layers (LSTM or GRU) and dropout layers. We set different ranges for different parameters, namely the number of neurons, the reg_rate for the l2 regularization technique [22], and the dropout rate for the dropout layers [23]. We applied the Keras Tuner to the training dataset to select the best parameters, as shown in Table 9.
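A minimal sketch of this tuning step for the one-layer LSTM is given below; the search ranges, the choice of the RandomSearch tuner, and the variables timesteps, n_features, and n_classes are assumptions for illustration, not the exact configuration behind Table 9.

    import keras_tuner as kt
    from tensorflow import keras

    # timesteps, n_features, and n_classes are assumed to be defined from
    # the reshaped training data.
    def build_model(hp):
        # Hyperparameters searched: number of neurons, l2 rate, dropout rate.
        units = hp.Int("units", min_value=32, max_value=128, step=32)
        reg_rate = hp.Choice("reg_rate", [1e-2, 1e-3, 1e-4])
        dropout = hp.Float("dropout", 0.1, 0.5, step=0.1)
        model = keras.Sequential([
            keras.layers.LSTM(units, input_shape=(timesteps, n_features),
                              kernel_regularizer=keras.regularizers.l2(reg_rate)),
            keras.layers.Dropout(dropout),
            keras.layers.Dense(n_classes, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    tuner = kt.RandomSearch(build_model, objective="val_accuracy", max_trials=10)
    tuner.search(X_train, y_train, validation_split=0.2, epochs=10)
    best_hp = tuner.get_best_hyperparameters(1)[0]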

3.6. Evaluating Models

Several standard metrics were utilized to evaluate the models, where TP is true positive, TN is true negative, FP is false positive, and FN is false negative. For our experimental results, we consider four metrics: accuracy, precision, recall, and F1-score.
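For reference, the sketch below shows how the four reported metrics follow from the confusion-matrix counts; the numbers in the example call are made up for illustration.

    def metrics(tp, tn, fp, fn):
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1_score = 2 * precision * recall / (precision + recall)
        return accuracy, precision, recall, f1_score

    # Example with illustrative counts: 95 TP, 900 TN, 5 FP, 10 FN.
    print(metrics(95, 900, 5, 10))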

4. Experiments and Results

This section describes the results of applying four machine learning models (DT, KNN, RF, and NB) and four deep learning models (LSTM and GRU, using one and two layers), including cross-validation results and testing results. Each model's performance is discussed using two datasets: WSN-DS and KDD.

4.1. Experiment Setup

The machine learning and deep learning models applied to the collected datasets were developed in Python 3 using Anaconda. The experiments were conducted on a laptop with 20 GB of RAM, 7 cores, and a 100 GB disk. The machine learning and deep learning models were trained using 70% of each dataset, while the remaining 30% was used for testing. The machine learning models were implemented using the Sklearn library, while the deep learning models were implemented using the TensorFlow and Keras packages.

4.2. WSN-DS Dataset

This section presents the results of applying the four machine learning models (DT, KNN, RF, and NB) and four deep learning models (LSTM and GRU, using one and two layers); both cross-validation results and testing results are described. Each machine learning and deep learning model's performance is discussed using full features and selected features to classify five classes of attack types, including Normal, Grayhole, Blackhole, Scheduling, and Flooding. All the positive and negative rate results of the cross-validation and testing performances used to compute the accuracy, precision, recall, and F-score metrics are presented in Tables 10–13.

4.2.1. Regular Machine Learning Using All Features

(1) Cross-Validation Results. Table 10 shows the machine learning models’ performance using the unseen testing WSN-DS dataset. For the normal class, RF is the highest performance model (accuracy of 99.94%, precision of 98.45%, recall of 99.48%, and F-score 98.96%). DT, KNN, and NB models have achieved the second, third, and fourth ranks on the average of accuracy over unseen data by 99.93%, 99.46%, and 98.20%, respectively. For the Grayhole class, DT is the highest performance model (accuracy of 99.95%, precision of 97.32%, recall of 96.89%, and F-score 97.1%). At the same time, NB is the worst performing model (accuracy of 92.95%, precision of 8.51%, recall of 80.00%, and F-score 15.38%). Similar to Blackhole class, RF is the highest performance model (accuracy of 99.87%, precision of 98.3%, recall of 98.47%, and F-score 98.39%). Regarding scheduling and flooding classes, RF obtained the highest performance by accuracy of 99.74% and 99.89%, respectively. However, NB is the worst performing model for both classes, including accuracy of 84.10% for scheduling and 90.15% for flooding. Yet, the NB model using Scheduling classes has the lowest accuracy performance among all models and classes for unseen data in terms of accuracy, 84.65%.

(2) Testing Results. As shown in the results, for the normal class, the RF model has achieved the highest performances among other models (accuracy of 99.94%, precision of 98.45%, recall of 99.48%, and F-score of 98.96%). However, the NB model has recorded the worst performances among other models (accuracy of 98.20%, precision of 97.07%, recall of 34.69%, and F-score of 51.11%). For the grayhole class, DT and RF have achieved the highest performances among other models (accuracy of 99.95%). However, the NB model has recorded the worst performances among other models (accuracy of 92.95%, precision of 8.51%, recall of 80.00%, and F-score of 15.38%). Like the blackhole class, the RF model has achieved the highest performances among other models (accuracy of 99.87%, precision of 98.3%, a recall of 98.47%, and F-score of 98.39%). However, the NB model has recorded the worst performances among other models (accuracy of 94.40%, precision of 34.91%, recall of 47.76%, and F-score of 40.33%). Regarding scheduling class, the RF model has achieved the highest performances among other models (accuracy of 99.74%, precision of 99.84%, recall of 99.88%, and F-score of 99.86%). However, the NB model has recorded the worst performances among other models (accuracy of 84.10%, precision of 99.89%, recall of 82.58%, and F-score of 90.41%). Flooding class like the other classes, the RF model, has achieved the highest performances among other models (accuracy of 99.89%, precision of 99.78%, recall of 93.93%, and F-score of 96.77%). However, the NB model has recorded the worst performances among other models (accuracy of 90.15%, precision of 13.37%, recall of 84.01%, and F-score of 23.07%). Based on these results, RF and DT models for the grayhole class are the best performing models with respect to other models, while the NB model for scheduling class is the worst performing model.

4.2.2. Regular Machine Learning Using Selected Features

In this section, the machine learning performance results of applying feature selection using WSN-DS dataset are presented.

(1) Cross-Validation Results. This section discusses the 10-fold CV results of four machine learning models (DT, KNN, RF, and NB) over the WSN-DS dataset with selected features, as shown in Table 11. As shown in the normal class results, RF has achieved the highest performances among other models and other classes (accuracy of 99.93%, precision of 98.16%, recall of 99.09%, and F-score of 98.62%). The DT model has achieved the second one, KNN is the third one, and the fourth one is NB. For the grayhole class, RF has achieved the highest performances among other models (accuracy of 99.92%, precision of 92.81%, recall of 99.04%, and F-score of 95.82%). In contrast, NB has done the worst performances (accuracy of 91.47%, precision of 7.39%, recall of 74.91%, and F-score of 13.44%). Like blackhole class, RF has achieved the highest performances among other models (accuracy of 99.86%, precision of 98.24%, recall of 98.11%, and F-score of 98.17%). In contrast, NB has done the worst performances (accuracy of 93.45%, precision of 23.26%, recall of 29.6%, and F-score of 26.04%). Regarding scheduling class, RF has achieved the highest performances among other models (accuracy of 99.69%, precision of 99.82%, recall of 99.83%, and F-score of 99.83%). In contrast, NB has done the worst performances (accuracy of 84.61%, precision of 95.14%, recall of 87.52%, and F-score of 91.17%). Flooding class like the other classes, RF has achieved the highest performances among other models (accuracy of 99.84%, precision of 97.84%, recall of 92.96%, and F-score of 95.33%). In contrast, NB has done the worst performances (accuracy of 99.23%, precision of 95.83%, recall of 59.07%, and F-score of 73.08%). Based on these results, the RF model using normal class is the best performing model concerning other models, while NB using Scheduling class is the worst performing model.

(2) Testing Results. Table 10 shows the performance of the machine learning models using the unseen testing WSN-DS dataset. For the normal class, RF is the highest performance model (accuracy of 99.94%, precision of 98.45%, recall of 99.48%, and F-score of 98.96%). DT, KNN, and NB models have achieved the second, third, and fourth ranks on the average of accuracy over unseen data by 99.93%, 99.46%, and 98.20%, respectively. For the grayhole class, DT is the highest performance model (accuracy of 99.95%, precision of 97.32%, recall of 96.89%, and F-score of 97.1%). At the same time, NB is the worst performing model (accuracy of 92.95%, precision of 8.51%, recall of 80.00%, and F-score of 15.38%). Similar to blackhole class, RF is the highest performance model (accuracy of 99.87%, precision of 98.3%, recall of 98.47%, and F-score of 98.39%). Regarding scheduling and flooding classes, RF obtained the highest performance by accuracy of 99.74% and 99.89%, respectively. However, NB is the worst performing model for both classes, including accuracy of 84.10% for scheduling and 90.15% for flooding. Yet, the NB model using scheduling classes has the lowest accuracy performance among all models and classes for unseen data in terms of accuracy, 84.65%.

4.2.3. Deep Learning Using All Features

(1) Cross-Validation Results. This section discusses the 10-fold CV results of four deep learning models (LSTM and GRU, using one and two layers) over the WSN-DS dataset with all features, as shown in Table 12. As shown in the results, for the normal class, LSTM using one layer model has achieved the highest performances among other models and other classes (accuracy of 99.99%, precision of 100%, recall of 100%, and F-score of 100%). LSTM using two layers’ model has achieved the second one, and GRU with one layer and two layers’ models have almost similar performances which achieved the third one. For the grayhole class, LSTM using one layer model has achieved the highest performances among other models (accuracy of 99.98%, precision of 99.92%, recall of 99.95%, and F-score of 99.93%). In contrast, GRU with two layers has done the second one, and LSTM using one layer and GRU with two layers has done similar performances ranked as a third one. Like blackhole class, LSTM using one layer model has achieved the highest performances among other models and other classes (accuracy of 100%, precision of 100%, recall of 99.74%, and F-score of 99.87%). In contrast, GRU with one layer has done the worst performances (accuracy of 97.5%, precision of 75.67%, recall of 53.16%, and F-score of 62.23%). Regarding scheduling class, LSTM using one layer has achieved the highest performances among other models (accuracy of 99.98%, precision of 96.76%, recall of 96.81%, and F-score of 96.75%). Flooding class like the other classes, LSTM using one layer has achieved the highest performances among other models (accuracy of 100% and precision of 95%), while other models include LSTM using two layers, and GRU using one layer and two layers has achieved approximated performances in terms of accuracy such as 99.67%, 99.68%, and 99.7%, respectively. Based on these results, LSTM using one layer model for blackhole and flooding classes is the best performing model with respect to other models, while GRU using one and two layers for normal class is the worst performing model.

(2) Testing Results. Table 12 shows the performance of the deep learning models using the unseen testing WSN-DS dataset. As shown in the results, for the normal class, LSTM using one layer model has achieved the highest performances among other models and other classes (accuracy of 98.55%, precision of 66.47%, recall of 92.95%, and F-score of 77.51%). LSTM using two layers and GRU with one layer models have achieved the second rank, and GRU with two layers has achieved the third one. For the grayhole class, GRU using two layers' model has achieved the highest performances among other models (accuracy of 99.85%, precision of 86.93%, recall of 97.22%, and F-score of 91.79%). LSTM using one layer has performed the second one, LSTM using two layers and GRU with one layer have done the third one, and LSTM using one layer and GRU with two layers are ranked as a third one. Similar to blackhole class, GRU using two layers model has achieved the highest performances among other models and other classes (accuracy of 98.01%, precision of 79.48%, recall of 65.94%, and F-score of 72.08%), while LSTM with one layer has done the worst performances (accuracy of 97.67%, precision of 77.93%, recall of 56.23%, and F-score of 65.33%). Regarding scheduling class, GRU using two layers has achieved the highest performances among other models (accuracy of 99.23%, precision of 99.64%, recall of 99.52%, and F-score of 99.58%). For flooding class, GRU using one layer has achieved the highest performances among other models (accuracy of 99.75%, precision of 96.11%, recall of 89.34%, and F-score of 92.60%). Based on these results, GRU using two layers' model for grayhole is the best performing model with respect to other models, while GRU using one layer for blackhole class is the worst performing model.

4.2.4. Deep Learning Using Selected Features

(1) Cross-Validation Results. As shown in the result in Table 13, four deep learning models were over the WSN-DS dataset with selected features, and for the normal class, GRU using one layer model has achieved the highest performances based on its accuracy among other models and other classes (accuracy of 98.52%, precision of 72.69%, recall of 78.29%, and F-score of 72.62%). LSTM using one layer, GRU with two layers, and LSTM using two layers have been ranked as the second, third, and fourth models, respectively, based on their accuracy. For the grayhole class, GRU using one layer model has achieved the highest performances among other models (accuracy of 99.88%, precision of 90.36%, recall of 96.74%, and F-score of 93.43%). In contrast, LSTM with one layer, GRU with one layer, and GRU with two layers have recorded the second, third, and fourth models based on their accuracy. Like the blackhole class, GRU using one layer model has achieved the highest performances among other models and other classes (accuracy of 98.4%, precision of 83.72%, recall of 76.12%, and F-score of 78.58%). In contrast, GRU with two-layer has done the worst performances (accuracy of 97.21%, precision of 72.01%, recall of 46.96%, and F-score of 56.78%).

Regarding scheduling class, GRU using one layer has achieved the highest performances among other models (accuracy of 99.61%, precision of 99.78%, recall of 99.79%, and F-score of 99.78%). Flooding class like the other classes, GRU using one layer has achieved the highest performances among other models (accuracy of 99.85%, precision of 99.27%, recall of 92.27%, and F-score of 95.64%). Based on these results, GRU using one layer model for grayhole is the best performing model with respect to other models, while GRU using two layers for the blackhole class is the worst performing model.

(2) Testing Results. Table 13 shows the performance of the deep learning models using the unseen testing WSN-DS dataset. As shown in the results, for the normal class, GRU using one layer model has achieved the highest performances among other models and other classes (accuracy of 98.57%, precision of 65.87%, recall of 96.58%, and F-score of 78.32%). However, LSTM using two layers' model has recorded the worst performances among other models and other classes (accuracy of 98.38%, precision of 63.25%, recall of 94.90%, and F-score of 75.91%). For the grayhole class, LSTM using one layer model has achieved the highest performances among other models (accuracy of 99.88%, precision of 91.49%, recall of 94.81%, and F-score of 93.12%). GRU using two layers' model has recorded the worst performances among other models and other classes (accuracy of 99.83%, precision of 90.33%, recall of 90.22%, and F-score of 90.27%). Like the blackhole class, GRU using one layer model has achieved the highest performances among other models and other classes (accuracy of 98.46%, precision of 92.49%, recall of 65.77%, and F-score of 76.87%). In contrast, LSTM with two layers has the worst performances (accuracy of 97.62%, precision of 75.66%, recall of 57.50%, and F-score of 65.34%). Regarding the scheduling class, GRU using one layer model has achieved the highest performances among other models and other classes (accuracy of 99.60%, precision of 99.79%, recall of 99.78%, and F-score of 99.78%). In contrast, LSTM with two layers has done the worst performances (accuracy of 98.76%, precision of 99.44%, recall of 99.20%, and F-score of 99.32%). Flooding class like the other classes, GRU using one layer model has achieved the highest performances among different models and other classes (accuracy of 99.85%, precision of 99.61%, recall of 92.11%, and F-score of 95.7%). In contrast, LSTM with two layers has the worst performances (accuracy of 99.53%, precision of 92.62%, recall of 80.12%, and F-score of 85.92%). Based on these results, LSTM using one layer model for grayhole is the best performing model with respect to other models, while LSTM using two layers for the blackhole class is the worst performing model.

4.3. KDD Dataset

In this section, the results of applying four machine learning models (DT, KNN, RF, and NB) and four deep learning models (LSTM and GRU, using one and two layers), including cross-validation results and testing results, are described. Each machine learning model and deep learning model performance is discussed using full features and selected features to classify five classes of attack types, including DOS, R2L, U2R, Probe, and Normal. All the positive and negative rate results of the cross-validation and testing performances which are used to compute the accuracy, precision, recall, and F-score metrics are presented in Tables 14–17.

4.3.1. Regular Machine Learning Using All Features

(1) Cross-Validation Results. This section discusses the 10-fold CV results of four machine learning models (DT, KNN, RF, and NB) over the KDD dataset with all features, as shown in Table 14. As shown in the results, for the DOS class, RF has achieved the highest performances among other models and other classes (accuracy of 100%, precision of 100%, recall of 100%, and F-score of 100%). The DT model has achieved the second one and KNN is the third one. NB has reached the lowest performances (accuracy of 92.44%, precision of 93.93%, recall of 96.93%, and F-score of 95.26%). For the normal class, RF has achieved the highest performances among other models (accuracy of 99.98%, precision of 92.9%, recall of 99.98%, and F-score of 95.94%). In contrast, NB has the worst performances (accuracy of 92.72%, precision of 99%, recall of 59.7%, and F-score of 74.48%). Like the probe class, RF has achieved the highest performances among other models (accuracy of 100%, precision of 99.97%, recall of 99.59%, and F-score of 99.78%). In contrast, NB has the worst performances (accuracy of 96.74%, precision of 0.3%, recall of 6.96%, and F-score of 5.7%). Regarding the R2L class, RF has achieved the highest performances among other models (accuracy of 99.99%, precision of 99.36%, recall of 96.77%, and F-score of 98.04%). In contrast, NB has the worst performances (accuracy of 98.77%, precision of 0.25%, recall of 1.35%, and F-score of 4.22%). For the U2R class, RF has achieved the highest performance (accuracy of 100%, precision of 93.75%, and F-score of 77.8%), while DT has achieved the highest recall of 96.99%. KNN and NB have the worst performances. Based on these results, the RF model using the DOS class is the best performing model with respect to other models.

(2) Testing Results. Table 14 shows the performance of the machine learning models using the unseen testing KDD dataset. For the RF model, DOS class has achieved the highest accuracy among other models and classes (accuracy of 100%, precision of 100%, recall of 100%, and F-score of 100%). However, NB has the DOS class’s worst performances (accuracy of 94.17%, precision of 93.82%, recall of 99.37%, and F-score of 96.52%). For the normal class, RF is the highest performance model (accuracy of 99.98%, precision of 99.89%, recall of 99.99%, and F-score of 98.94%). DT, KNN, and NB models have achieved the second, third, and fourth ranks based on accuracy over unseen data by 99.97%, 99.85%, and 92.65%, respectively. Similar to the probe class, DT is the highest performance model (accuracy of 99.99%, precision of 100%, recall of 99.28%, and F-score of 96.64%). RF, KNN, and NB models have achieved the second, third, and fourth ranks on the average of accuracy over unseen data by 99.99%, 99.85%, and 98.65%, respectively. Regarding R2l and U2r classes, RF obtained the highest performance, while NB is the worst performing model for both classes. However, the NB model using normal classes has the lowest accuracy performance among all models and classes for unseen data in terms of accuracy, 92.65%.

4.3.2. Regular Machine Learning Using Selected Features

(1) Cross-Validation Results. This section discusses the 10-fold CV results of four machine learning models (DT, KNN, RF, and NB) over the KDD dataset with selected features, as shown in Table 15. As shown in the results, for the DOS class, RF and DT have achieved the highest performances among other models and other classes (accuracy of 99.72%, precision of 99.71%, recall of 99.94%, and F-score of 99.83%). The KNN model has achieved the second one. NB has achieved the lowest performances (accuracy of 65.13%, precision of 80.44%, recall of 75.3%, and F-score of 77.78%). For the normal class, RF and DT have achieved the highest performances among other models (accuracy of 99.93%, precision of 99.76%, recall of 99.86%, and F-score of 99.8%), while NB has the worst performances (accuracy of 82.46%, precision of 66.47%, recall of 2.65%, and F-score of 5.08%). For the probe class, RF has achieved the highest performances among other models (accuracy of 99.99%, precision of 99.74%, and F-score of 98.3%), while DT has achieved the highest recall of 97.6%. NB has the worst performances (accuracy of 96.87%, precision of 0.29%, recall of 7.03%, and F-score of 5.61%). Regarding the R2L class, RF has achieved the highest performances among other models (accuracy of 100%, precision of 87.04%, recall of 60%, and F-score of 71.85%), while NB has the worst performances (accuracy of 99.24%, precision of 0.48%, recall of 1.36%, and F-score of 3.56%). For the U2R class, RF and DT have achieved similarly the highest performance. NB has the worst performances (accuracy of 79.47%, precision of 0.09%, recall of 79.17%, and F-score of 0.19%). Based on these results, the RF and DT models are the best performing models for each class with respect to other models, while NB has the worst performances.

(2) Testing Results. Table 15 shows the performance of the machine learning models using the unseen testing KDD dataset. For the RF and DT model, DOS class has achieved the highest accuracy among other models and classes (accuracy of 99.74%, precision of 99.74%, recall of 99.94%, and F-score of 99.84%). However, NB has the DOS class’s worst performances (accuracy of 64.87%, precision of 80.27%, recall of 75.10%, and F-score of 77.60%). For the normal class, DT and RF are the highest performance model (accuracy of 99.93%, precision of 99.77%, recall of 99.82%, and F-score of 99.80%). KNN and NB models have achieved the second and third ranks based on accuracy over unseen data by 99.83% and 82.48%, respectively. Similar to the probe class, RF is the highest performance model (accuracy of 99.78%, precision of 96.15%, recall of 77.69%, and F-score of 85.94%). DT, KNN, and NB models have achieved the second, third, and fourth ranks on the average of accuracy over unseen data by 99.78%, 99.76%, and 98.87%, respectively. Regarding R2L and U2R classes, RF obtained the highest performance, while NB is the worst performing model for both classes.

4.3.3. Deep Learning Using All Features

(1) Cross-Validation Results. As shown in the results in Table 16, four deep learning models were evaluated over the KDD dataset with all features; for the DOS class, LSTM using two layers' model has achieved the highest performances based on its accuracy among other models and other classes (accuracy of 99.83%, precision of 99.97%, recall of 99.82%, and F-score of 99.9%). LSTM using one layer, GRU with one layer, and GRU with two layers have been ranked as the second, third, and fourth models, respectively, based on their accuracy. For the normal class, LSTM using two layers' model has achieved the highest performances among other models (accuracy of 99.73%, precision of 98.85%, recall of 99.65%, and F-score of 99.25%). In contrast, LSTM with one layer has achieved the lowest performance (accuracy of 99.79%, precision of 86.16%, recall of 90.36%, and F-score of 88.2%). GRU with one layer and GRU with two layers have recorded the second and third ranks based on accuracy. Like the probe class, LSTM using two layers' model has achieved the highest performances among other models and other classes (accuracy of 99.97%, precision of 99.58%, recall of 97.34%, and F-score of 98.44%). In contrast, LSTM with one layer has the worst performances (accuracy of 97.47%, precision of 74.48%, recall of 53.84%, and F-score of 62.28%). Regarding the R2L class, GRU using one layer has achieved the highest performances among other models (accuracy of 99.92%, precision of 84.57%, recall of 85.72%, and F-score of 85.05%). U2R class like the other classes, GRU using one layer has achieved the highest performances among other models (accuracy of 99.68%, precision of 93.07%, recall of 88.33%, and F-score of 90.62%). Based on these results, LSTM using two layers' model for DOS is the best performing model with respect to other models.

(2) Testing Results. Table 16 shows the performance of the deep learning models using the unseen testing KDD dataset. As shown in the results, for the DOS class, LSTM using one layer model has achieved the highest performances among other models and other classes (accuracy of 99.99%, precision of 100%, recall of 99.99%, and F-score of 100%). LSTM using two layers' model and the GRU models have similar performance. For the normal class, LSTM using one layer model has achieved the highest performances among other models (accuracy of 99.95%, precision of 99.80%, recall of 99.94%, and F-score of 99.87%). LSTM with two layers and GRU using one layer and two layers' models have recorded similar performances. Like the probe class, LSTM using one layer model has achieved the highest performances among other models and other classes (accuracy of 99.99%, precision of 99.55%, recall of 98.76%, and F-score of 99.15%). Based on these results, LSTM using one layer is the best performing model for the DOS, normal, and probe classes.

4.3.4. Deep Learning Using Selected Features

(1) Cross-Validation Results. As shown in the results in Table 17, four deep learning models were evaluated over the KDD dataset with selected features; for the DOS class, GRU using two layers' model has achieved the highest performances based on its accuracy among other models and other classes (accuracy of 96.14%, precision of 96.13%, recall of 99.23%, and F-score of 97.65%). LSTM using one layer model has achieved the lowest performances based on its accuracy among other models and other classes (accuracy of 92.98%, precision of 92.91%, recall of 99.05%, and F-score of 95.84%). For the normal class, GRU using two layers' model has achieved the highest performances among other models (accuracy of 96.56%, precision of 94.62%, recall of 85.6%, and F-score of 89.88%). In contrast, LSTM with one layer has reached the lowest performance (accuracy of 93.51%, precision of 92.35%, recall of 69.37%, and F-score of 77.59%). Similar to the probe class, GRU using two layers' model has achieved the highest performances among other models and other classes (accuracy of 99.24%, precision of 99.58%, recall of 13.17%, and F-score of 35.28%). In comparison, LSTM with one and two layers and GRU with one layer have registered similar performance. Regarding the R2L class, GRU using two layers has achieved the highest performances among other models (accuracy of 99.8%, precision of 80.4%, recall of 26.66%, and F-score of 39.92%). For the U2R class, all deep learning models have recorded the same performance (accuracy of 99.99%, precision of 0%, recall of 0%, and F-score of 0%). Based on these results, all models have the worst performance for the U2R class with respect to other classes.

(2) Testing Results. Table 17 shows the performance of the deep learning models using the unseen testing KDD dataset. As shown in the results, for the DOS class, GRU using two layers’ model has achieved the highest performances among other models and other classes (accuracy of 96.45%, precision of 96.20%, recall of 99.54%, and F-score of 97.84%). In contrast, LSTM with one layer has registered the lowest performance (accuracy of 94.98%, precision of 95.06%, recall of 98.94%, and F-score of 96.96%). For the normal class, GRU using two layers’ model has achieved the highest performances among other models (accuracy of 96.93%, precision of 96.59%, recall of 85.89%, and F-score of 90.92%). LSTM with one layer has recorded the worst performances (accuracy of 95.27%, precision of 92.09%, recall of 80.43%, and F-score of 85.87%). Like the probe class, GRU using two layers’ model has achieved the highest performances among other models and other classes (accuracy of 99.24%, precision of 91.67%, recall of 23.91%, and F-score of 38.05%). Regarding R2L class, GRU using two layers’ model has achieved the highest performances among other models and other classes (accuracy of 99.79%, precision of 83.33%, recall of 23.46%, and F-score of 36.61%). In comparison, LSTM with two layers has the worst performances (accuracy of 98.76%, precision of 99.44%, recall of 99.20%, and F-score of 99.32%). For U2R class, all models have achieved the worst performance. Based on these results, GRU using two layers’ model have registered the best performance.

5. Discussion

We examined the four machine learning models (DT, KNN, RF, and NB) and four deep learning models (LSTM and GRU, using one and two layers), including cross-validation and testing results, using the WSN-DS and KDD datasets. Table 18 summarizes the used datasets, including the number of all samples, the number of training samples, the number of testing samples, the number of all features, the number of selected features, and the number of classified classes.

5.1. Regular Machine Learning

The results of machine learning cross-validation performance based on the accuracy for the WSN dataset using all features and selected features are depicted in Figure 3. Considering all features' results for cross-validation performance, the normal class using RF has the best performance (accuracy of 99.95%), while the NB model using the scheduling class has achieved the worst performance among all models and classes (accuracy of 84.47%) (see Figure 3(a)). Similar to the selected features' results, the normal class using RF has the best performance (accuracy of 99.93%), while the scheduling class, using the NB model, has achieved the worst performance among all models and classes (accuracy of 84.61%) (see Figure 3(b)). The results of machine learning testing performance based on the accuracy for the WSN dataset using all features and selected features are depicted in Figure 4. Considering all features' results for the unseen dataset, the grayhole class using RF has the best performance (accuracy of 99.95%), while the scheduling class using the NB model has achieved the worst performance among all models and classes (accuracy of 84.1%) (see Figure 4(a)). Similar to the selected features' results, the grayhole class using RF has the best performance (accuracy of 99.94%), while the scheduling class using the NB model has achieved the worst performance among all models and classes (accuracy of 84.65%) (see Figure 4(b)).

The results of machine learning cross-validation performance based on the accuracy for the KDD dataset using all features and selected features are depicted in Figure 5. Considering all features' results for cross-validation performance, the DOS and U2R classes using RF have the best performance (accuracy of 100%), while the NB model using the DOS class has achieved the worst performance among all models and classes (accuracy of 92.44%) (see Figure 5(a)). Similar to the selected features' results, the U2R class using RF has the best performance (accuracy of 100%), while the DOS class using the NB model has achieved the worst performance among all models and classes (accuracy of 65.13%) (see Figure 5(b)). The results of machine learning testing performance based on the accuracy for the KDD dataset using all features and selected features are depicted in Figure 6. Considering all features' results for the unseen dataset, the DOS, normal, and U2R classes using RF have the best performance (accuracy of 100%), while the normal class using the NB model has achieved the worst performance among all models and classes (accuracy of 92.65%) (see Figure 6(a)). Similar to the selected features' results, the R2L and U2R classes, using RF, DT, and KNN, have the best performance (accuracy of 99.99%), while the DOS class, using the NB model, has achieved the worst performance among all models and classes (accuracy of 64.87%) (see Figure 6(b)).

5.2. Deep Learning

The results of deep learning cross-validation performance based on the accuracy for WSN dataset using all features and selected features are depicted in Figure 7. Considering all features results for cross-validation performance, the flooding class, using LSTM with one layer, has the best performance (accuracy of 100%), while the blackhole class, using GRU with the one layer model, has achieved the worst performance among all models and classes (accuracy of 97.5%) (see Figure 7(a)). Similar to the selected features’ results, the grayhole class, using LSTM with one layer, has the best performance (accuracy of 99.84%), while the blackhole class, using GRU with the two layers’ model, has achieved the worst performance among all models and classes (accuracy of 97.21%) (see Figure 7(b)). The results of deep learning testing performance based on the accuracy for WSN dataset using all features and selected features are depicted in Figure 8. Considering all features’ results for unseen dataset, the grayhole class, using GRU with the two layers’ model, has the best performance (accuracy of 99.85%), while the blackhole class, using GRU with the two layers’ model, has achieved the worst performance among all models and classes (accuracy of 98.01%) (see Figure 8(a)). Similar to the selected features results, the grayhole class, using LSTM with the one layer model, has the best performance (accuracy of 99.88%), while the blackhole class, using LSTM with the two layers’ model, has achieved the worst performance among all models and classes (accuracy of 97.62%) (see Figure 8(b)).

The results of deep learning cross-validation performance based on the accuracy for KDD dataset using all features and selected features are depicted in Figure 9. Considering all features results for cross-validation performance, U2R class using LSTM with two layers and GRU using one and two layers have the best performance (accuracy of 99.99%), while the DOS class using LSTM with one layer model has achieved the worst performance among all models and classes (accuracy of 98.46%) (see Figure 9(a)). Similar to the selected features results, U2R class using LSTM with one and two layers and GRU using one and two layers have the best performance (accuracy of 99.99%), while the normal class using LSTM with the one layer model has achieved the worst performance among all models and classes (accuracy of 93.51%) (see Figure 9(b)). The results of deep learning testing performance based on the accuracy for KDD dataset using all features and selected features is depicted in Figure 10. Considering all features’ results for unseen dataset, U2R and Dos classes using LSTM with one and two layers and GRU using one and two layers have the best performance (accuracy of 99.99%), while the normal class using GRU with the two layers’ model has achieved the worst performance among all models and classes (accuracy of 99.86%) (see Figure 10(a)). Similar to the selected features’ results, the U2R class using LSTM with one and two layers and GRU using one and two layers have the best performance (accuracy of 99.99%), while the DOS class using GRU with the one layer model has achieved the worst performance among all models and classes (accuracy of 95.72%) (see Figure 10(b)).

5.3. Summary

Many research studies in the AI-based IDS area have used machine learning and deep learning models. Each of these models has its strengths and weaknesses, making it suitable for a particular attack type. In this work, we not only use machine learning and deep learning models but also carry out an in-depth performance analysis on the chosen datasets. The performance analysis and comparison of these models on the IDS datasets show no superiority of one model across the chosen datasets using all features and selected features. Furthermore, these findings lead to better knowledge and understanding of the interpretability needed for choosing the right model to enhance IDS trust. In particular, the authors in [24] have addressed the explainable artificial intelligence (XAI) concept to improve trust management by exploring the decision tree model in the area of IDS. In comparison, our work provides a performance-based comparison of machine learning and deep learning models to investigate the trust impact based on the accuracy of the trusted AI-based systems regarding the IDS' malicious data.

6. Conclusion and Future Work

In this paper, a comparison study is introduced to investigate the trust impact in intrusion detection, including the data, methodology, and expert accountability, by analyzing machine learning and deep learning models' performance. The comparison study is twofold. The first comparison is made using four regular machine learning models: DT, KNN, RF, and NB. The second comparison is made using four traditional deep learning models: LSTM (one and two layers) and GRU (one and two layers). Two datasets are used to classify the attack type in IDS: WSN-DS and KDD. The experimental results demonstrate that weaknesses in the data, methodology, and expert accountability cause misleading predictions, make the system vulnerable to attacks, and lead to zero-trust security for critical systems. Therefore, for future work, we plan to use the XAI concept to enhance trust management by exploring machine learning and deep learning models in IDS.

Data Availability

The KDD dataset used to support the study is available at http://kdd.ics.uci.edu/databases/kddcup99/.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported by Taif University Researchers Supporting Project no. TURSP-2020/254, Taif University, Taif, Saudi Arabia, and by Science Foundation Ireland (SFI).