In an open network environment, industrial control systems face serious security risks and are frequently subject to network attacks. Existing anomaly detection methods for industrial control networks adapt poorly to new attacks and offer limited intelligence in detection and recognition. To overcome this problem, this article exploits the decision-making strengths of deep reinforcement learning and builds a system capable of continuous learning. Specifically, the characteristics of industrial control networks and of deep reinforcement learning are combined to design a dedicated reward and learning mechanism, and an industrial control anomaly detection system based on deep reinforcement learning is constructed. Finally, we verify the algorithm on the gas pipeline industrial control dataset of Mississippi State University. The experimental results show that the model converges significantly faster than traditional deep learning methods and achieves a higher F1 score.

1. Introduction

Industrial control systems (ICSs) are designed to monitor physical infrastructure equipment [1, 2]. These systems use dedicated communication equipment, operating systems, and hardware, and their networks are independent of other networks. They play an important role in monitoring critical infrastructure such as smart grids, oil and gas, aerospace, and transportation [3]. Therefore, the safety and security of ICSs are essential to national security.

However, with the development of intelligent and networked production, traditional isolated ICSs can no longer meet production requirements [4]. Most industrial control networks are gradually being connected to the Internet, which exposes ICSs to serious security risks [5]. For example, in March and July 2019, the computer control center of a Venezuelan hydropower station suffered cyber-attacks that caused nationwide power outages and huge economic losses to Venezuela. On October 30 of the same year, a nuclear power plant in India was maliciously attacked. As a result, India was forced to shut down nuclear power plants. Similar network security incidents involving ICSs occur frequently [6].

Safety has been considered from the beginning of the design of various infrastructures [7]. However, because of the openness and complexity of ICSs, the current solutions to their safety problems are not perfect. Intrusion detection for industrial control networks remains an urgent open problem [8]. Intrusion detection technology can effectively protect the information security of ICSs by monitoring the system online and in real time to discover possible attacks [9].

Many anomaly detection models have been designed to recognize malicious behavior in ICSs. For example, Yan et al. modeled the system from the physical layer and used physical knowledge to derive significant features from a large number of noisy physical measurements [10]. Machine learning methods have been used to identify the difference between normal activities and attack activities [11]. Wang et al. [12] proposed a network attack identification algorithm based on a temporal causality Bayesian network to extract cooperative attack modes of physical systems in power networks. A rule-based anomaly detection system that significantly improves the accuracy of attack identification was proposed in [13]. However, these methods are mainly effective against known attacks and cannot reliably identify new attack types. Attackers keep changing their methods in order to evade the detection system. Moreover, the above-mentioned traditional detection systems are mostly offline systems and are not suitable for the many industrial control anomaly detection problems that require high real-time performance. Therefore, the abnormal traffic detection system must be updated continually.

Deep reinforcement learning (DRL) is well known as an online approach with a strong decision-making ability [14]. DRL continuously learns new knowledge by interacting with the environment, so it can address the problem of continuous learning in an industrial control system. However, little work has been devoted to applying it to the anomaly detection problem of ICSs. This motivates us to use DRL for industrial control anomaly detection.

In this article, we use DRL to build a traffic-based industrial control anomaly detection model. The main work is summarized as follows.
(1) A unique environment mechanism and reward mechanism of DRL are designed for ICSs.
(2) A new convergence mechanism of DRL is put forward for ICSs.
(3) An industrial control anomaly detection system is built based on deep Q-learning.
(4) The industrial control anomaly detection system is applied to a gas pipeline dataset. In the testing phase, the accuracy of the model reaches 100%.

2. Related Work

The security of traditional ICSs was long protected because they operated in an isolated environment [15, 16]. But with the development of networking, ICSs are gradually being connected to the Internet. Therefore, ICSs are vulnerable to various attacks from the network [17, 18].

Most existing network attack detection technologies for ICSs are based on traditional anomaly detection technologies [19]. Two kinds of anomaly detection models are commonly used [20]. (1) Signature-based methods. A compact aggregate signature was designed for ICSs in Ref. [21]. Signature-based methods use fixed signatures to detect existing attacks, but they are less effective at detecting unknown or new attacks [19]. (2) Learning-based methods. Learning-based anomaly detection systems mainly include statistical analysis, machine learning, deep learning, and other methods. These methods learn the characteristics of traffic data and use them to decide whether traffic is normal or attack traffic. For example, Wu et al. [22] studied cyber-physical attacks by applying a machine learning method to physical data. However, these anomaly detection technologies rely mainly on existing datasets. Once the dataset is expanded, a new model needs to be retrained. In ICSs, the types of attacks are varied, and new attacks continually emerge. When they do, the original anomaly detection model can no longer protect the industrial control system. Therefore, a model with continuous learning ability is necessary for industrial control anomaly detection.

DRL can learn the characteristic information of traffic by continuously interacting with the environment [23]. A deep Q network (DQN) is a classic deep reinforcement learning algorithm [24]. It is composed of agents and environments. The agent continuously learns new knowledge by interacting with the environment. Since the agent is constantly learning, the model is constantly updated. In this way, the model improves its ability to detect new attacks.

Inspired by the above work, we design an anomaly detection model for ICSs based on DQN (ICSDQN). To account for the particular conditions of ICSs, we design a dedicated environment, reward, and convergence mechanism. Because industrial control data are complex and changeable, a deep neural network is designed to fit the Q function in DQN. By building an industrial control anomaly detection system based on DQN, the anomaly detection model can learn continuously, which addresses a key difficulty in the information security of industrial control systems.

3. Main Approach

3.1. The Description of DQN

The reinforcement learning (RL) model is mainly composed of two parts: agent and environment. The agent continuously interacts with the environment, generates an action through the Q function, then performs the action, and enters a new environment. The model will reward the agent based on the actions taken by the agent. The agent makes decisions by maximizing rewards. DQN mainly uses a deep neural network to fit the Q function on the basis of traditional reinforcement learning.

The main network structure of DQN is shown in Figure 1. Here, $s$ is the current environment, $s'$ represents the next environment, $a$ is the action taken in the current environment, and $r$ is the reward obtained by performing action $a$ in the current environment. It is worth noting that actions are selected according to the $\varepsilon$-greedy strategy, where $\varepsilon$ refers to how likely the current sampling is to make decisions based on the Q value generated by the current training network. If the decision is not made according to the Q value, an action is randomly generated for exploration.
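The exploration rule described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `select_action` and the parameter name `exploit_prob` are hypothetical, and `exploit_prob` follows the convention stated in the text (the probability of deciding from the Q values), which is the complement of the exploration probability used in many DQN descriptions.

```python
import random


def select_action(q_values, exploit_prob, rng=random):
    """Pick an action index from a list of Q values.

    With probability `exploit_prob` the action with the largest Q value is
    chosen; otherwise an action is drawn uniformly at random (exploration).
    """
    if rng.random() < exploit_prob:
        return max(range(len(q_values)), key=lambda i: q_values[i])
    return rng.randrange(len(q_values))
```

With `exploit_prob=1.0` the rule is purely greedy, and with `exploit_prob=0.0` every action is random.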

The experience pool stores the experience generated when the agent interacts with the environment. DQN uses experience replay to address the problems of sample correlation and non-stationary data distribution. Specifically, the samples obtained from the interaction between the agent and the environment at each time step are stored in the experience pool.

In Figure 1, pre_network is the current value network, which outputs the currently predicted Q value, and tar_network is the target value network, which generates the Q value used as the label. The parameters of the current value network are updated iteratively during training. The target value network is periodically synchronized with the current value network by copying its parameters every $n$ iterations. When training the current value network, samples are randomly selected from the experience pool as training data to break the correlation between samples.
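A minimal sketch of the experience pool and the periodic parameter copy might look as follows. The class and function names, the bounded `capacity`, and the representation of network parameters as a plain dictionary are all assumptions made for illustration; the article does not specify them.

```python
import copy
import random
from collections import deque


class ExperiencePool:
    """Fixed-size store of (s, a, r, s_next) transitions (capacity is assumed)."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out first

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size, rng=random):
        # Uniform random sampling breaks the correlation between consecutive samples.
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


def sync_target(step, n, current_params, target_params):
    """Return the target parameters, refreshed from the current network every n steps."""
    if step % n == 0:
        return copy.deepcopy(current_params)
    return target_params
```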

The loss function of DQN is the temporal-difference error between the current value network and the target value network:

$$L(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\right)^{2}\right]$$

where $\gamma$ is the discount factor, $\theta$ is the parameter value of the current value network, $\theta^{-}$ is the parameter value of the target value network, and $r$ represents the reward information. The objective of the loss function is for the current Q value to gradually approach the target Q value.
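To make the temporal-difference loss concrete, the following sketch evaluates the standard DQN loss on a small batch. It assumes the usual formulation (reward plus discounted maximum target Q value, squared difference from the current Q value, averaged over the batch); the function names and the batch layout are hypothetical.

```python
def td_target(reward, next_q_target, gamma):
    """Label produced with the target value network: r + gamma * max_a' Q(s', a')."""
    return reward + gamma * max(next_q_target)


def dqn_loss(batch, gamma):
    """Mean squared temporal-difference error.

    Each batch item is (q_sa, reward, next_q_target), where q_sa is the current
    network's Q value for the action taken and next_q_target is the list of
    target-network Q values for the next environment.
    """
    errors = [(td_target(r, nq, gamma) - q_sa) ** 2 for q_sa, r, nq in batch]
    return sum(errors) / len(errors)
```

For instance, with a reward of 1, a discount factor of 0.5, and target Q values `[0.5, 2.0]`, the label is 1 + 0.5 * 2.0 = 2.0, so a current Q value of 2.0 contributes zero loss.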

3.2. ICS Based on DQN

In this article, DQN is used to solve the problem of anomaly detection in ICSs. The reward, action, and environment in DQN are designed specifically for the characteristics of ICSs. Then, a special objective function is designed on the basis of the reward and action.

3.2.1. The Improved Action Mechanism in ICSDQN

In ICSs, network traffic is mainly divided into normal and abnormal traffic. Therefore, there are only two actions in the DQN model. One is alert, which means that the model predicts that the current traffic is malicious attack traffic. The other is normal, i.e., the current traffic is normal. The action can therefore be written as

$$a \in \{\text{alert}, \text{normal}\}$$

ICSDQN passes the network traffic to ICS if the action is normal. If the network traffic is a malicious attack, ICSDQN will intercept the data so that it cannot harm ICS. In addition, all the real-time data will be put into the experience pool for the next training.
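The gating behavior described above can be illustrated with a short sketch. The integer encoding of the two actions and the function name `gate_traffic` are assumptions for illustration, not details given in the article.

```python
NORMAL, ALERT = 0, 1  # assumed encoding of the two actions


def gate_traffic(sample, action, experience_pool):
    """Record the sample for later training and forward it only if judged normal.

    Returns the sample when it is passed on to the ICS, or None when it is
    intercepted as a suspected attack.
    """
    experience_pool.append(sample)  # all real-time data is kept for the next training
    return sample if action == NORMAL else None
```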

3.2.2. The Improved Reward Mechanism in ICSDQN

According to the action mechanism in Section 3.2.1, the reward $r$ is defined in terms of the action $a$ and the label $y$ of the sample, where the label marks the sample as either normal data or attack data. ICSDQN gives the agent a reward for every action the agent performs. The ultimate goal of ICSDQN is to achieve the highest reward per training session.
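One plausible form of such a reward is sketched below. The values are assumptions: a reward of 1 for a correct decision is consistent with the later statement that the maximum reward over 1000 samples is 1000, but the value for a wrong decision (0 here) and the 0/1 label encoding are illustrative choices, not taken from the article.

```python
NORMAL, ALERT = 0, 1  # assumed encoding shared by actions and labels


def reward(action, label, hit=1, miss=0):
    """Return `hit` when the action matches the sample label, otherwise `miss`."""
    return hit if action == label else miss


def session_reward(actions, labels):
    """Total reward collected over one training session."""
    return sum(reward(a, y) for a, y in zip(actions, labels))
```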

3.2.3. The Environment in ICSDQN

In the industrial control network, the data are all processed characteristic data. Therefore, we regard each sample as the environment in which the current industrial control network is located. ICSDQN uses the Q network to generate an action in the agent’s current environment and gives the agent a reward message.

3.2.4. The Improved Loss Function in ICSDQN

A new loss function is designed to enable the DRL model in this article to converge. The improved loss uses a modified target, obtained by replacing the entry at the position of the action $a$ in the Q-value vector with the reward $r$.
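The replacement step can be sketched as follows. This is a hedged reading of the description: the target is a copy of the Q-value vector with the entry for the chosen action overwritten by the reward. Using a mean squared error between the target and the network output is an assumption, as are the function names.

```python
def improved_target(q_values, action, reward):
    """Copy the Q-value vector and overwrite the chosen action's entry with the reward."""
    target = list(q_values)
    target[action] = reward
    return target


def improved_loss(q_values, action, reward):
    """Mean squared error between the network output and the modified target (assumed form)."""
    target = improved_target(q_values, action, reward)
    return sum((t - q) ** 2 for t, q in zip(target, q_values)) / len(q_values)
```

Only the entry of the chosen action differs between the target and the output, so the gradient pushes that single Q value toward the observed reward.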

In this way, the model keeps the original DRL objective of maximizing rewards, while the dedicated reward mechanism speeds up its convergence.

3.2.5. The Deep Q Network in ICSDQN

In DQN, the most important aspect is to design a neural network that fits the Q function. The structure of the Q network is shown in Figure 2. Its input is the environment $s$, and its output is the probability of each action, "alert" or "normal". By fitting the Q function, the deep neural network gives an action for each environment. Because new types of attack data may appear later, the network must be sufficiently deep. To handle the complex and variable traffic in ICSs, we design the network with five hidden layers of 64, 128, 64, 32, and 16 neurons, respectively. Both the current value network and the target value network have this structure.
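A forward pass through a network with these hidden-layer sizes can be sketched in plain Python. Several details are assumptions that the article does not state: the ReLU activation, the softmax output, the weight initialization, and the number of input features `n_features`.

```python
import math
import random

HIDDEN_SIZES = [64, 128, 64, 32, 16]  # hidden-layer widths given in the text
N_ACTIONS = 2                         # "alert" and "normal"


def init_network(n_features, seed=0):
    """Create (weights, biases) pairs for each fully connected layer."""
    rng = random.Random(seed)
    sizes = [n_features] + HIDDEN_SIZES + [N_ACTIONS]
    layers = []
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        scale = math.sqrt(2.0 / fan_in)  # assumed initialization scale
        weights = [[rng.gauss(0.0, scale) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        layers.append((weights, [0.0] * fan_out))
    return layers


def forward(layers, x):
    """Map an environment vector x to a probability for each action."""
    for index, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            x = [max(0.0, v) for v in x]  # ReLU on hidden layers (assumed)
    peak = max(x)
    exps = [math.exp(v - peak) for v in x]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]
```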

ICSDQN continuously adjusts the parameters of the current value network and the target value network through the loss function (4). The training process of ICSDQN is shown in Algorithm 1.

Require: Initialize the experience pool $D$, the current value network, the target value network, and the Q network. Training data $X$, labels $Y$, the parameter replacement interval $n$, epoch, and size.
(1) For $e$ in epoch:
(2) Select the initial environment $s$.
(3) For $i$ in size:
(4) Enter $s$ into the Q network to get the probability of each action. Select the action value corresponding to the maximum action probability.
(5) Use the $\varepsilon$-greedy strategy to choose an action $a$;
(6) Execute the action $a$. ICSDQN enters the next environment $s'$, and the reward $r$ is given to the agent;
(7) Set $s = s'$;
(8) Store $(s, a, r, s')$ in the experience pool;
(9) Randomly sample a batch of samples from the experience pool as the training set;
(10) Calculate the loss function and use the gradient descent algorithm to update the network parameters of the current value network by using the loss function (4);
(11) Assign the parameters of the current value network to the target value network every $n$ training steps.
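The control flow of Algorithm 1 can be sketched as a loop skeleton in which the network-specific parts are passed in as callables. Everything named here is hypothetical: `q_fn`, `update_fn`, and `sync_fn` stand in for the Q network, the gradient update, and the parameter copy. The sketch also assumes that the next environment is simply the next sample in the training data and that the reward is 1 for a correct action and 0 otherwise.

```python
import random


def train(data, labels, q_fn, update_fn, sync_fn, epochs, sync_every,
          batch_size, exploit_prob, seed=0):
    """Run the ICSDQN-style training loop and return the total reward per epoch."""
    rng = random.Random(seed)
    pool = []            # experience pool
    step = 0
    rewards_per_epoch = []
    for _ in range(epochs):
        total = 0
        for i in range(len(data) - 1):
            s, s_next = data[i], data[i + 1]
            q = q_fn(s)
            if rng.random() < exploit_prob:        # decide from the Q values
                a = max(range(len(q)), key=q.__getitem__)
            else:                                  # explore with a random action
                a = rng.randrange(len(q))
            r = 1 if a == labels[i] else 0         # assumed reward values
            total += r
            pool.append((s, a, r, s_next))
            batch = rng.sample(pool, min(batch_size, len(pool)))
            update_fn(batch)                       # gradient step on the current network
            step += 1
            if step % sync_every == 0:
                sync_fn()                          # copy parameters to the target network
        rewards_per_epoch.append(total)
    return rewards_per_epoch
```

Passing the update and synchronization steps as callables keeps the sketch independent of any particular deep learning library.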

In the testing phase, the traffic entering the ICS is inspected by ICSDQN. If the action performed by ICSDQN on the current traffic is "alert", the traffic is treated as an attack. If the action is "normal", the traffic is treated as normal. Only normal traffic is allowed to enter the ICS.

4. Experiment

4.1. The Preparation of Experiment

In this section, the gas pipeline dataset collected by Mississippi State University is utilized to evaluate our proposed method. The gas pipeline dataset comprises normal data and seven kinds of abnormal data, e.g., naive malicious response injection (NMRI), complex malicious response injection (CMRI), and so on. There are 274,628 records in the dataset, of which 210,528 are incomplete. The data would lose authenticity if the missing features were filled in with averages or by other methods. Therefore, in this article, we remove the records with missing features, which leaves a filtered dataset of 64,100 records. In each test, we extract 1000 records as the test set and use the rest as the training set. The hyperparameters are first set according to experience and then adjusted dynamically with automatic hyperparameter learning methods [25] to obtain the final model hyperparameters used in this article.
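The filtering and hold-out steps can be sketched as follows. The missing-value marker `"?"`, the shuffling before the split, and the function names are assumptions for illustration; the article states only that incomplete records are removed and that 1000 records are held out for testing.

```python
import random


def drop_incomplete(records, missing="?"):
    """Keep only records that contain no missing-value marker (marker is assumed)."""
    return [r for r in records if missing not in r]


def split_holdout(records, test_size, seed=0):
    """Shuffle a copy of the records and hold out `test_size` of them for testing."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]


# Record counts reported in the text: total minus incomplete gives the filtered size.
FILTERED_SIZE = 274_628 - 210_528
```

With the counts reported above, `FILTERED_SIZE` equals 64,100, so a 1000-record test set leaves 63,100 records for training.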

To enable the model to recognize the data in the dataset, we first map string-type data to numeric data by using a dictionary. Then the data is encoded by using one-hot encoding technology. Finally, each column of numerical data in the dataset is standardized and normalized.
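The first two preprocessing steps, dictionary mapping and one-hot encoding, can be sketched for a single column of string values. The function names and the example values are hypothetical.

```python
def build_dictionary(values):
    """Assign each distinct string an integer index in order of first appearance."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return mapping


def one_hot(index, size):
    """Return a vector of length `size` with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def encode_column(values):
    """Map a column of strings to one-hot vectors and return the dictionary used."""
    mapping = build_dictionary(values)
    encoded = [one_hot(mapping[v], len(mapping)) for v in values]
    return encoded, mapping
```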

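The formulas for normalization and standardization are as follows:

$$x_{\text{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, \qquad x_{\text{std}} = \frac{x - \mu}{\sigma}$$

where $x_{\min}$ represents the minimum value of the dataset, $x_{\max}$ is the maximum value in the dataset, $\mu$ is the average value of the data, and $\sigma$ represents its standard deviation. Then, $x_{\text{norm}}$ is the normalized value of the dataset and $x_{\text{std}}$ is the standardized value of the dataset.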
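As a sketch, the two column-wise scalings can be written with the Python standard library. It assumes min-max normalization and z-score standardization with the population standard deviation; the handling of constant columns (returning zeros) is an added safeguard, not something the article specifies.

```python
import statistics


def min_max_normalize(column):
    """Scale values to [0, 1] using the column minimum and maximum."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)  # constant column: avoid division by zero
    return [(x - lo) / (hi - lo) for x in column]


def standardize(column):
    """Subtract the mean and divide by the population standard deviation."""
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column)
    if sigma == 0:
        return [0.0] * len(column)  # constant column: avoid division by zero
    return [(x - mu) / sigma for x in column]
```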

In classification tasks, describing the performance of a classifier with overall accuracy alone is often misleading, because classes with a small sample size tend to be classified poorly without noticeably affecting it. Therefore, we use the precision, recall, and F1 score for comprehensive evaluation.

Taking normal samples as the positive class, True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) represent the number of normal samples predicted as normal samples, the number of abnormal samples predicted as normal samples, the number of abnormal samples predicted as abnormal samples, and the number of normal samples predicted as abnormal samples, respectively.

Precision represents the ratio of the number of correct positive predictions to the total number of positive predictions. The definition of the precision is

$$\text{Precision} = \frac{TP}{TP + FP}$$

The recall rate is a measure of coverage, which measures how many actual positive cases are predicted as positive. The formula of the recall rate is

$$\text{Recall} = \frac{TP}{TP + FN}$$

Precision and recall sometimes conflict with each other, so both need to be taken into consideration. As a comprehensive evaluation index, the F1 score is the harmonic mean of recall and precision, and it can be expressed as

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

A higher F1 score indicates that the corresponding model has a better performance.
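The three metrics can be computed from predictions with a short sketch. It assumes the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F1 as their harmonic mean) and an integer label encoding in which 0 marks the positive (normal) class; both the encoding and the function names are illustrative.

```python
def confusion_counts(y_true, y_pred, positive=0):
    """Count TP, FP, TN, FN, treating `positive` (assumed 0 = normal) as the positive class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1); each is 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```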

4.2. Comparative Experiment

In this section, the gas pipeline dataset is used to verify the performance of the ICSDQN algorithm. In order to train ICSDQN, we divide the processed dataset described in Section 4.1 into two parts: 1000 samples for testing and the other samples for training. In addition, in order to evaluate the performance of the model, the experimental results are average values obtained after ten repeated experiments (Figure 3).

It can be seen from Figure 4 that ICSDQN explores constantly in the initial stage to find an optimal decision, so the reward is low at the beginning. By the 70th training session, ICSDQN has found a better decision and obtains a higher reward. On this basis, ICSDQN continuously adjusts the network parameters to obtain the highest reward. Specifically, each training session covers 1000 samples, and the corresponding reward is obtained from ICSDQN's actions on these 1000 samples, so the maximum reward for each training session is 1000. However, ICSDQN copies the parameters of the current value network to the target value network every $n$ steps, which causes the network parameters to keep changing. As a result, the reward curve fluctuates around the highest reward once the network reaches the convergence stage. Figure 5 shows how the cost of ICSDQN changes with training. In Figure 5, the cost gradually decreases as the number of training sessions increases and finally converges to nearly 0. In the convergence stage, the small fluctuations in the cost have the same cause as those in the reward.

Figures 6-9 show the performance of ICSDQN on the test dataset. They show how the accuracy, recall, precision, and F1 score, respectively, change with the number of training sessions. From Figures 6 to 9, we can see that all four performance indicators increase as the number of training sessions increases. At convergence, all four indicators reach 1.00. This indicates that ICSDQN can fully distinguish attack data from normal data on this test set and has a superior performance.

To further verify the effectiveness of the ICSDQN algorithm, 6-fold cross-validation was used to draw the receiver operating characteristic (ROC) curves. As shown in Figure 9, the area under each ROC curve is 1.00, i.e., the algorithm is effective and stable.

Table 1 shows the detection performance of ICSDQN after 20 training sessions on training datasets of different sizes. It can be seen from Table 1 that the recognition performance of the ICSDQN model improves steadily as the amount of training data increases. Even when the data size is only 5,000, the model achieves good detection results. The F1 score reaches 0.98 when the sample size is 10,000. When the sample size is 30,000, the model correctly identifies the categories of all test samples. This shows that the ICSDQN model has a high decision-making ability and makes full use of sample information to make the best decision.

At the same time, the result of ICSDQN is compared with DNN, RF, DT, and AdaBoost-based classifiers along with multiple peer approaches in the current literature. In Table 2, accuracy, precision, recall, and F1 score are used to evaluate the performance of the different algorithms. By comparison, we clearly find that ICSDQN performs significantly better than the other algorithms on the gas pipeline dataset. The four detection indicators all reach 1.00 on ICSDQN. It is indicated that this algorithm will be able to fully identify attacks in the gas pipeline environment.

To further evaluate the computational cost of the proposed method, we compare the algorithm with LSTM; the comparison results are shown in Table 3.

It can be seen from Table 3 that the training time of the ICSDQN model is significantly higher than that of LSTM. This is caused by the learning strategy of deep reinforcement learning: the target network is updated only at an interval of $n$ steps, which results in a long training time. However, the ICSDQN model needs less time in testing, because its network structure is simpler and has fewer parameters than LSTM. ICSDQN therefore offers better real-time performance in terms of online cost.

5. Conclusion

Industrial control systems play an important role in monitoring critical infrastructure such as smart grids, oil and gas, aerospace, and transportation, and their safety is of vital importance to a country. To detect abnormal traffic in industrial control systems effectively, this article combines the perception ability of deep learning with the decision-making ability of reinforcement learning and designs an abnormal traffic detection model for industrial control systems based on deep reinforcement learning. The model uses a neural network to extract features from the preprocessed dataset, improves its decision-making ability through reinforcement learning, and adjusts its learning strategy according to reward feedback. In the experimental part, this article uses the natural gas pipeline dataset to verify the proposed DQN-based industrial control anomaly detection model. The results show that the model not only learns quickly but also achieves high detection accuracy on industrial control data. In addition, because the model is a real-time system, it can keep learning to adapt to new and changing attacks, thus improving the detection of abnormal behavior.

Data Availability

The data supporting the research results are from https://sites.google.com/a/uah.edu/tommy-morris-uah/ics-data-sets.

Conflicts of Interest

The authors declare that they have no conflicts of interest.


Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant no. 81961138010, in part by the Key Funding of Science and Technology Department of Jilin Province under Grant no. 20200401144GX, in part by the Fundamental Research Funds for the Central Universities under Grant no. QNXM20210036, in part by the Technological Innovation Foundation of Shunde Graduate School, USTB under Grant no. BK19BF006, in part by the Science and Technology Innovation Special Foundation of Foshan Municipal People’s Government under Grant no. BK21BF001, and in part by the Innovation and Transformation Foundation of Peking University Third Hospital under Grant no. BYSYZHKC2021107.