Abstract

Most existing stochastic game models of network attack and defense are based on the assumption of complete information, which limits their applicability. Based on the actual modeling requirements of the network attack and defense process, this paper proposes a network defense decision-making model that combines an incomplete-information stochastic game with deep reinforcement learning. The model treats the incomplete information between the attacker and the defender as the defender's uncertainty about the attacker's type and uses the Double Deep Q-Network (Double DQN) algorithm to overcome the difficulty of determining the network state transition probability, so that the network system can adjust its defense strategy dynamically. Finally, the proposed model is evaluated in a simulation experiment. The results show that, under the same experimental conditions, the proposed method converges to the defense equilibrium strategy faster than other methods. The model fuses traditional game-theoretic methods with artificial intelligence technology and provides new research ideas for applying artificial intelligence to cyberspace security.

1. Introduction

In recent years, with the rapid development of information technology, network attacks have increased and many new attack methods have emerged. Information network security has therefore remained a prominent issue [1], especially in traditional networks: to reduce network complexity, network equipment tends to be highly homogeneous, which makes it more vulnerable to attack. Once one network node is compromised, the entire network system may be paralyzed. When a malicious attacker exploits vulnerabilities in network equipment, the normal operation of the network is disrupted, major information may be leaked, and, in the worst case, the security of the entire network system is endangered [2]. Owing to the complexity of the network system and the concealment and stochasticity of attack techniques, existing defense technologies struggle to meet the security requirements of the network system, and defenders cannot guarantee its absolute security. Therefore, a new technology is needed that can analyze network attack and defense events so that network system defenders can adjust their defense strategies dynamically and adaptively [3].

Because game theory and network attack-defense events share many characteristics, such as the antagonism of the participants' goals, the noncooperative nature of their strategies, and their behavioral dependence, research applying game theory to network information security keeps emerging [4, 5]. Furthermore, the stochastic game is a dynamic game with state transitions that is composed of a series of stages, and it is well suited to describing multiple states; it has therefore quickly become a hot spot in current research on network offense and defense [6]. Wang et al. [7] proposed a method for the quantitative analysis of network security based on a stochastic game model, which analyzes and evaluates the security of the target network using a model generated from simulation. Fu et al. [8] approached the difficulties of the network attack and defense process from the perspective of stochastic games; by quantifying the benefits of both sides, they proposed a new selection algorithm to cope with the difficulty of responding to changes in attack intent and strategy. Huang and Zhang [9], addressing the defect that traditional deterministic game models cannot accurately describe the offensive and defensive process in a real network environment, proposed a security defense strategy selection model based on a stochastic offensive and defensive evolutionary game; their method, which uses Gaussian white noise and the stability judgment theorem of stochastic differential equations, made a breakthrough in the analysis of offensive and defensive strategies. Hu et al. [10] embedded noncooperative signaling game theory in network attack and defense and simulated the dynamic confrontation process with the aid of the dynamic attenuation effect of network deception signals, proposing new research ideas for active network defense. Wei et al. [11] applied game theory to the maintenance of power grid security, developed a new model framework for the interaction between offense and defense through stochastic game theory, and introduced new algorithms to enhance the grid's ability to defend against attacks. Lei et al. [12] proposed a new moving target defense strategy generation method based on incomplete-information Markov game theory, given that classic game theory with the complete-information hypothesis cannot describe the moving target defense confrontation problem well. In addition, with the development of emerging technologies such as artificial intelligence and machine learning in recent years, more and more intelligent algorithms have been applied to the field of network security [13, 14]. To enable networks to provide people with efficient and fast services, Hua et al. [15] integrated artificial intelligence technology into the field of network security and proposed a system detection algorithm based on artificial intelligence concepts, which strongly supports the use of artificial intelligence for security inspection of network systems.
To analyze the influence of bounded rationality on the network attack-defense stochastic game, Zhang and Liu [16] addressed the state-explosion problem that arises as the number of network nodes increases: an attack graph and a defense graph were designed to compress the state space and to extract the network states and defense strategies. On this basis, they introduced an intelligent learning algorithm and designed a defense decision-making algorithm with online learning capability to select, from the set of candidate strategies, the optimal defense strategy with the greatest benefit. However, this method is similar to the Q-learning algorithm and is prone to overestimation.

Although the above studies provide solutions for the analysis of network attack and defense events, they still have some shortcomings: (1) Most studies are based on the assumption of complete information; however, in a real network attack and defense event, the concealment of the attacker prevents the defender from fully grasping the attacker's information. (2) The payoff functions in the above works are all based on a known transition model, but in many cases the defender cannot know the probability of system state transitions. These two points limit the applicability of the models proposed in the abovementioned literature.

In response to the above problems, and to improve the applicability of the stochastic game model in the analysis of network attack and defense events, this paper proposes a defense strategy selection model based on an incomplete-information stochastic game. We draw on the idea of reinforcement learning and use the Double Deep Q-Network algorithm to analyze the stochastic game, so that the defender's payoff can be updated dynamically and the defense strategy adjusted adaptively; the system state transition probability does not need to be set in advance to obtain the Nash equilibrium of the two players. Finally, the validity of the proposed model is verified through experiments.

The main contributions of this article are as follows:
(1) We improve the existing network attack-defense stochastic game model and regard the incomplete information between the players as the defender's uncertainty about the attacker's type, so that the model matches real network attack and defense scenarios.
(2) We introduce the deep reinforcement learning algorithm Double DQN into the model, so that the defender's payoff can be learned and updated online, which improves the accuracy of the defense strategies the model produces.

This paper is organized as follows. Section 1 introduces the background and related work. Section 2 presents the stochastic game model with incomplete information. Section 3 discusses the deep reinforcement learning method, and Section 4 presents the experimental analysis. Section 5 provides a comparative analysis of related work, and Section 6 concludes the paper.

2. Stochastic Game Model with Incomplete Information

Owing to factors such as the network environment and the network entities, network attack and defense is a complicated and stochastic process in which the attacker and the defender have opposing goals and interdependent behaviors. The network attack and defense process can therefore be described as a stochastic game between the attacker and the defender. In addition, to protect their own interests, both parties always hide their own information from each other. Hence, the network attack and defense process in a real environment is a stochastic game with incomplete information [17].

2.1. Discrete Processing of Network Attack and Defense Process

To facilitate the modeling and analysis of the network attack and defense process, we first discretize it [18]. The whole process is regarded as a series of time slices; each time slice contains exactly one network state, and each time slice corresponds to one round of the attack-defense game. The process is illustrated in Figure 1.

As cyberattacks occur, the network system transfers from one state to another under the interaction of the entities, as shown in Figure 2. The state transition of the network system is stochastic: besides being driven by the actions of malicious users, it is also affected by other complex factors in the network. Our research goal is to find the defense strategy that yields the highest return for the defender in the network attack-defense stochastic game.
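
As a minimal illustration of the discretized process described above, the sketch below simulates state transitions across time slices. The state names, action labels, and transition probabilities are hypothetical placeholders, not values from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical network states; each loop iteration below is one time slice.
STATES = ["normal", "probed", "compromised"]

# Transition distribution conditioned on (state, attack action, defense action).
# All probabilities are made up purely for illustration.
TRANSITIONS = {
    ("normal", "scan", "monitor"):         [0.6, 0.4, 0.0],
    ("normal", "scan", "patch"):           [0.8, 0.2, 0.0],
    ("probed", "exploit", "monitor"):      [0.1, 0.4, 0.5],
    ("probed", "exploit", "isolate"):      [0.5, 0.4, 0.1],
    ("compromised", "persist", "restore"): [0.7, 0.2, 0.1],
}

def next_state(state, attack, defense):
    """Sample the network state of the next time slice."""
    probs = TRANSITIONS[(state, attack, defense)]
    return STATES[rng.choice(len(STATES), p=probs)]

# Walk a few time slices, picking (hypothetical) actions defined for the current state.
state = "normal"
for _ in range(3):
    attack, defense = next((a, d) for (s, a, d) in TRANSITIONS if s == state)
    state = next_state(state, attack, defense)
    print(f"attack={attack:8s} defense={defense:8s} -> new state: {state}")
```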

2.2. Network Attack and Defense Stochastic Game Model

Owing to the stochasticity and dynamics of network state transitions, the state of the network system changes continually over the long term, while the next state depends only on the current state. The transition of the network state can therefore be regarded as a Markov decision process; over the short term, the state transition probabilities of the network system can be treated as fixed [19].

Definition 1. The incomplete-information attack-defense stochastic game model (II-ADSGM) is a 9-tuple whose components are as follows:
(1) the set of players in the game, namely, the attacker and the defender;
(2) the state set of the network system, that is, the set of stochastic game states;
(3) the behavior set of the attacker, which collects, for each system state s_i, the attack behaviors available in that state;
(4) the behavior set of the defender, which collects, for each system state s_i, the defense behaviors available in that state;
(5) the set of attacker types;
(6) the set of the defender's probabilistic judgments (beliefs) about the attacker's type in each system state;
(7) the set of offensive and defensive strategies: for each system state and attacker type, the attacker's strategy assigns a probability to each of its available behaviors, and, similarly, for each system state, the defender's strategy assigns a probability to each of its available behaviors;
(8) the immediate return obtained by the offensive and defensive parties when they take their actions in a given system state against a given attacker type;
(9) the set of payoff functions of both parties, comprising the state-behavior payoff functions, which give the defender's payoff after both parties take their actions in a given system state against a given attacker type, and the state-strategy payoff functions, which give the defender's payoff after both parties adopt their strategies in that state against that attacker type.

According to the analysis of the network attack and defense process and the definition of the above model, the defender's payoff is an expected cumulative return; the defender's state-strategy payoff function (formula (1)) is therefore the expectation of the state-behavior payoff taken over the attacker's and the defender's mixed strategies.
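
To make Definition 1 concrete, the sketch below mirrors the nine components of II-ADSGM as a plain Python container. The field names and type aliases are our own shorthand for the components listed above, not the paper's notation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = str          # a network system state (component 2)
AttackAction = str   # an attacker behavior (component 3)
DefenseAction = str  # a defender behavior (component 4)
AttackerType = str   # an attacker type (component 5)


@dataclass
class IIADSGM:
    """Container mirroring the 9-tuple of Definition 1 (illustrative only)."""
    players: Tuple[str, str]                                   # (1) attacker and defender
    states: List[State]                                        # (2) stochastic game states
    attack_actions: Dict[State, List[AttackAction]]            # (3) attacker behaviors per state
    defense_actions: Dict[State, List[DefenseAction]]          # (4) defender behaviors per state
    attacker_types: List[AttackerType]                         # (5) set of attacker types
    type_belief: Dict[State, Dict[AttackerType, float]]        # (6) defender's belief about the type
    attack_strategy: Dict[Tuple[State, AttackerType], Dict[AttackAction, float]]  # (7) attacker mix
    defense_strategy: Dict[State, Dict[DefenseAction, float]]                     # (7) defender mix
    immediate_return: Dict[Tuple[State, AttackerType, AttackAction, DefenseAction], float]    # (8)
    state_action_value: Dict[Tuple[State, AttackerType, AttackAction, DefenseAction], float]  # (9)
    state_strategy_value: Dict[Tuple[State, AttackerType], float]                             # (9)


def state_strategy_value(model: IIADSGM, s: State, theta: AttackerType) -> float:
    """Expectation of the state-behavior payoff over both mixed strategies (in the spirit of formula (1))."""
    return sum(
        pa * pd * model.state_action_value[(s, theta, a, d)]
        for a, pa in model.attack_strategy[(s, theta)].items()
        for d, pd in model.defense_strategy[s].items()
    )
```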

3. Deep Reinforcement Learning and Bayesian Equilibrium Solution

3.1. Bayesian Nash Equilibrium in the Process of Network Attack and Defense

Since network offense and defense can be regarded as a stochastic game with incomplete information, its equilibrium solution is a Bayesian Nash equilibrium; that is, neither the attacker nor the defender can unilaterally change its strategy to improve its own payoff.

Definition 2 (Bayesian Nash equilibrium of network offense and defense). Consider (1) all attack strategies of the attacker and (2) all defense strategies of the defender. A strategy pair is a Bayesian Nash equilibrium in a given system state if, given the defender's belief about the attacker's type, neither side can increase its expected payoff by unilaterally changing its own strategy; the corresponding expected return is the equilibrium payoff in that state.
The equilibrium solution of the II-ADSGM proposed in this paper is the set of Bayesian Nash equilibria of the individual system states, and computing it can be formulated as a quadratic programming problem. In actual network attack and defense incidents, the decisions of both sides are made continually, so a decision affects not only the current but also the future benefits of both parties. As in the above definition, the payoff obtained by each side during the game should therefore include both current and future benefits, and it changes dynamically with the strategies. In the network attack-defense stochastic game model, the payoff of the offensive and defensive sides is accordingly defined as

U(s, a, d) = R(s, a, d) + γ Σ_{s'∈S} P(s' | s, a, d) U(s'),

where R(s, a, d) represents the current reward, the summation term represents the future reward, γ is the discount factor, and P(s' | s, a, d) represents the probability that the system transitions from state s to state s' under the influence of the joint action (a, d). The larger γ is, the more the payoff is affected by future returns; the smaller γ is, the more it is affected by the current return. Because the system state transition probability changes dynamically during the actual attack and defense process, it is difficult to determine P(s' | s, a, d), which greatly complicates the subsequent solution of the system's Nash equilibrium. In most existing studies, the transition probability is set in advance to facilitate the calculation, which is clearly not in line with the actual situation.
In our research, we aim to solve the Bayesian Nash equilibrium of the system while the unknown transition probability changes dynamically. In the constructed attack-defense stochastic game model, the equilibrium payoff must therefore be able to be updated online as the attack and defense process unfolds, so that, in terms of network system security requirements, defenders can adapt their defense strategies accordingly.
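
As an illustration of how the defender's equilibrium strategy for a single state can be computed once the expected payoffs are known, the sketch below first averages the defender's payoff matrix over attacker types using the defender's belief (the Harsanyi transformation) and then solves the resulting matrix game with a maximin linear program. For simplicity it treats the game as zero-sum, whereas the model in this paper is a general-sum quadratic programming problem; the payoff numbers and the belief are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical defender payoff matrices for one system state,
# rows = defender actions, columns = attacker actions, one matrix per attacker type.
payoff_by_type = {
    "high_level": np.array([[3.0, -1.0], [-2.0, 4.0]]),
    "low_level":  np.array([[2.0,  0.0], [-1.0, 3.0]]),
}
belief = {"high_level": 0.6, "low_level": 0.4}   # defender's belief about the attacker's type

# Harsanyi transformation: expected payoff matrix under the type belief.
U = sum(belief[t] * M for t, M in payoff_by_type.items())

# Maximin LP (zero-sum simplification): maximize v subject to
#   sum_i x_i * U[i, j] >= v  for every attacker column j,   sum_i x_i = 1,   x_i >= 0.
n_def, n_att = U.shape
c = np.zeros(n_def + 1)
c[-1] = -1.0                                    # minimize -v  <=>  maximize v
A_ub = np.hstack([-U.T, np.ones((n_att, 1))])   # v - sum_i x_i U[i, j] <= 0
b_ub = np.zeros(n_att)
A_eq = np.hstack([np.ones((1, n_def)), np.zeros((1, 1))])
b_eq = np.array([1.0])
bounds = [(0, None)] * n_def + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
defense_mix, value = res.x[:n_def], res.x[-1]
print("defender equilibrium strategy:", np.round(defense_mix, 3))
print("defender equilibrium payoff:  ", round(value, 3))
```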

3.2. Q-Learning Algorithm

Q-learning is a basic algorithm in reinforcement learning [20, 21]. Its value function Q(s_t, a_t) represents the expected return that the agent can obtain by taking behavior a_t in state s_t at time t; it depends on the returns and on the network environment and is updated as

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ],

where α is the learning rate, which governs how quickly the payoff estimate is adjusted. Through this dynamic adjustment process, the payoff can be obtained without relying on the system state transition probability, which makes up for the deficiency of the existing models.
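
A minimal tabular implementation of the update rule above is sketched below; the state and action labels, rewards, and hyperparameter values are illustrative assumptions.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9          # learning rate and discount factor (illustrative values)
Q = defaultdict(float)           # Q-table: (state, action) -> estimated return

def q_update(state, action, reward, next_state, next_actions):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(next_state, a)] for a in next_actions) if next_actions else 0.0
    td_target = reward + GAMMA * best_next
    Q[(state, action)] += ALPHA * (td_target - Q[(state, action)])

# Illustrative usage: the defender observes a transition and updates its estimate.
q_update(state="probed", action="isolate", reward=2.0,
         next_state="normal", next_actions=["monitor", "patch", "isolate"])
print(Q[("probed", "isolate")])   # 0.1 * (2.0 + 0.9 * 0 - 0) = 0.2
```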

The offensive and defensive confrontation between network entities is a complicated process. Regarding the defender's strategy selection problem, many existing studies simplify the attack-defense process as necessary [22–24]. We consider that the behaviors of the various entities affect each other and that the system state transitions caused by these interactions provide a reference for the entities when they select behaviors again. The behavior learning mechanism of the entities in a network attack and defense event is shown in Figure 3.

3.3. Solving State Transition Probability Parameters Based on Double Deep Q-Network Algorithm

Although the Q-learning algorithm is widely used in the analysis of network attack and defense events [25], it has shortcomings. Because Q-learning stores the Q-value of each state-behavior pair in a Q-table, it is effective when the state and behavior spaces are discrete and low-dimensional. When the state and behavior spaces are continuous or high-dimensional, the resulting spaces become too large: it is difficult to compute the values of all state-action pairs, and Q-learning cannot maintain such a large Q-table in memory.

In response to this problem, researchers have proposed using a model to represent the mapping from states and actions to the value function. Deep Q-Network (DQN) is an algorithm that combines deep learning with reinforcement learning [26]. Compared with Q-learning, it uses a neural network to approximate the behavior value function, turning the Q-table update into a function-fitting problem: a fitted function replaces the Q-values stored in the Q-table, as shown in formula (6), so that similar states yield similar output behaviors. In addition, the DQN algorithm introduces a target Q-network that is independent of, and updated more slowly than, the current value Q-network, as well as a replay memory unit. The structure of the DQN algorithm is shown in Figure 4. The DQN algorithm therefore extracts complex features better than the Q-learning algorithm.
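
The sketch below shows the three DQN ingredients mentioned above: a neural network that approximates the behavior value function, a slowly updated target network, and a replay memory. The layer sizes, synchronization period, and hyperparameters are illustrative assumptions, not values from this paper.

```python
import random
from collections import deque

import torch
import torch.nn as nn

N_STATE_FEATURES, N_ACTIONS = 8, 4           # illustrative dimensions

def make_q_net():
    # A small multilayer perceptron mapping a state vector to one Q-value per defense action.
    return nn.Sequential(nn.Linear(N_STATE_FEATURES, 64), nn.ReLU(),
                         nn.Linear(64, N_ACTIONS))

q_net = make_q_net()                          # current value network
target_net = make_q_net()                     # target network, updated more slowly
target_net.load_state_dict(q_net.state_dict())

replay_memory = deque(maxlen=10_000)          # playback memory: tuples of tensors (s, a, r, s', done)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
GAMMA, BATCH, SYNC_EVERY = 0.9, 32, 100       # illustrative hyperparameters

def dqn_train_step(step):
    if len(replay_memory) < BATCH:
        return
    s, a, r, s2, done = map(torch.stack, zip(*random.sample(replay_memory, BATCH)))
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)   # Q(s, a) from the online network
    with torch.no_grad():
        # Classic DQN target: maximize over the (slower) target network.
        target = r + GAMMA * target_net(s2).max(dim=1).values * (1 - done)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % SYNC_EVERY == 0:                # periodically copy the online weights into the target network
        target_net.load_state_dict(q_net.state_dict())
```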

Although the DQN algorithm is better suited than Q-learning to the analysis of network attack and defense events, it also has a shortcoming: it cannot overcome the overestimation inherent in Q-learning itself, that is, the estimated value function tends to be larger than the true value function, and the root cause is the maximization operation in Q-learning. From formula (6), the target used in action selection is r_{t+1} + γ max_a Q(s_{t+1}, a; θ⁻), and the max operation makes the estimated value function larger than the true value function.

In real network attack and defense events, however, the offensive and defensive parties do not always choose the action that maximizes the Q-value in a given state. In general, the real strategy is a stochastic (mixed) strategy; directly using the action with the largest Q-value to form the target value therefore often makes the target value larger than the true value.

To address this shortcoming, van Hasselt proposed the Double DQN method [27], which uses different value functions for action selection and action evaluation. The target is computed as

r_{t+1} + γ Q(s_{t+1}, argmax_a Q(s_{t+1}, a; θ_t); θ_t⁻),

where the online network with parameters θ_t selects the action and the target network with parameters θ_t⁻ evaluates it.

In addition, the payoff in the traditional DQN algorithm depends only on the environment and on the behavior of a single type of participant, whereas a cyberattack event involves two types of participants: the attacker and the defender. The payoff function in the Double DQN algorithm therefore needs to be extended from one type of participant to two, and formula (7) needs to be improved accordingly. Taking the defender as an example, the improved target uses the defender's immediate return, which depends on the system state, the attacker's type, and the actions of both sides, as given in formula (8).

In this way, using formula (8), the state-action payoff of the model no longer depends on the transition probability parameter; the equilibrium payoff is then obtained through the learning mechanism, governed by the learning rate α.
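
Under the same assumptions as the previous sketch, the function below computes a Double-DQN-style target for the defender: the online network selects the next defensive action, the target network evaluates it, and the immediate reward is the defender's payoff, which in this model depends on the system state, the attacker's type, and the actions of both sides. The inputs are assumed to be batched tensors; this is a sketch of the idea, not the paper's exact formula (8).

```python
import torch

def defender_double_dqn_target(reward, q_online_next, q_target_next, done, gamma=0.9):
    """
    reward:        defender's immediate payoff r(s, theta, a, d) for each sample in the batch
    q_online_next: online-network Q-values over defense actions in the next state, shape (B, |D|)
    q_target_next: target-network Q-values over defense actions in the next state, shape (B, |D|)
    done:          1.0 where the attack event ended, else 0.0
    """
    selected = q_online_next.argmax(dim=1, keepdim=True)       # action selection: online network
    evaluated = q_target_next.gather(1, selected).squeeze(1)   # action evaluation: target network
    return reward + gamma * evaluated * (1.0 - done)
```

Replacing the max-based target in the previous sketch with this function yields the Double-DQN-style update used here for the defender's payoff.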

In addition, this paper uses the ε-greedy algorithm to balance exploration and exploitation in the Double DQN algorithm; that is, with probability ε the algorithm stochastically selects the behavior for the next time slice, and with probability 1 − ε it follows the Nash equilibrium strategy.
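
A small sketch of this exploration scheme is given below, assuming the equilibrium strategy is available as a probability vector over the defender's actions; the value of ε and the strategy values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def epsilon_greedy_defense(equilibrium_mix, epsilon=0.1):
    """With probability epsilon explore uniformly; otherwise sample from the equilibrium strategy."""
    n_actions = len(equilibrium_mix)
    if rng.random() < epsilon:
        return rng.integers(n_actions)                  # exploration: random defensive action
    return rng.choice(n_actions, p=equilibrium_mix)     # exploitation: follow the Nash equilibrium mix

action = epsilon_greedy_defense(equilibrium_mix=[0.62, 0.38])
```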

3.4. Cyber Adaptive Defense Countermeasure Algorithm

For each time slice in a cyberattack event, the algorithm uses II-ADSGM to model and analyze the offensive and defensive process, solves the Bayesian Nash equilibrium according to the participants' payoff functions, and makes the defense decision; it then uses the improved Double DQN algorithm to learn online from the offensive and defensive confrontation and to update the payoffs. The specific method is as follows.

Let the network system have a certain number of states, and let the attacker and the defender each have a certain number of measures that they can implement in each state. The space complexity of Algorithm 1 is concentrated mainly in the storage of the payoff and strategy tables and thus depends on the number of states and on the numbers of attacker and defender measures per state. The time complexity of Algorithm 1 is concentrated mainly in the update of the payoffs after a strategy is selected, which we compute with the Lebg-plex algorithm; the average time complexity of that algorithm determines the per-round cost.

Input: the game model II-ADSGM; learning rate α; reward discount factor γ; exploration probability ε; convergence accuracy; stable duration
 Output: optimal defense strategy.
 Begin
(1) Initialization
(2) Solve the Bayesian Nash equilibrium
(3) Compute the network defense strategy payoff function
(4) Get the current network state
(5) repeat:
(6) Select the defensive action through the ε-greedy algorithm
(7) Output the selected defensive action
(8) Get the new network state
(9) Update and learn Q according to the phased results
(10) Update the Bayesian Nash equilibrium
(11) Set the current network state to the new state
(12) Advance to the next time slice
(13) Until the change in the defense strategy is below the convergence accuracy and remains stable for the specified duration
(14) Output the optimal defense strategy
(15)End
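
The sketch below strings the previous pieces together into the shape of Algorithm 1: solve an equilibrium from the current payoff estimates, act ε-greedily, observe the new network state, update the Q-estimates, and recompute the equilibrium until the defense strategy stabilizes. The helper callables (`solve_equilibrium`, `q_update`, the `env` object) are placeholders for the components described above, the stopping parameters are illustrative, and the same number of defense actions is assumed in every state.

```python
import numpy as np

def adaptive_defense_loop(env, solve_equilibrium, q_update, q_values,
                          epsilon=0.1, eps_conv=1e-3, stable_for=20, rng=None):
    """Skeleton of Algorithm 1 (illustrative; helper callables are assumed to exist)."""
    rng = rng or np.random.default_rng()
    state = env.current_state()                          # step (4): get the current network state
    defense_mix = solve_equilibrium(q_values, state)     # steps (1)-(3): initial equilibrium and payoffs
    stable_steps = 0
    while stable_steps < stable_for:                     # steps (5)-(13): repeat until stable
        n = len(defense_mix)
        if rng.random() < epsilon:                       # step (6): epsilon-greedy action selection
            action = rng.integers(n)
        else:
            action = rng.choice(n, p=defense_mix)
        reward, next_state = env.step(action)            # steps (7)-(8): act, observe the new state
        q_update(q_values, state, action, reward, next_state)    # step (9): learn Q online
        new_mix = solve_equilibrium(q_values, next_state)         # step (10): update the equilibrium
        if np.max(np.abs(np.asarray(new_mix) - np.asarray(defense_mix))) < eps_conv:
            stable_steps += 1                            # strategy barely changed this round
        else:
            stable_steps = 0
        state, defense_mix = next_state, new_mix         # steps (11)-(12): advance to the next time slice
    return defense_mix                                   # step (14): output the defense strategy
```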

4. Simulation Experiment and Analysis

To verify the correctness and rationality of the proposed model and of the equilibrium-solving method, the simulation experiment uses a network environment that draws on the typical experimental network constructed in [25]; its topology is shown in Figure 5.

The experimental data used in this paper come from the attack and defense behavior database of the MIT Lincoln Laboratory. For the case in which the system state transition probability is known, the detailed process of solving the Nash equilibrium strategy is shown in Appendix B. We used Python 2.7 to implement the algorithm that selects defense strategies when the system state transition probability is unknown. The performance of the algorithm is shown in Figure 6.

From Figure 6 we find that, under the parameter settings shown in Table 1, the probability values of the two different strategies selected by the defender converge to the Bayesian Nash equilibrium, which is consistent with the calculation results in Table 2 and indicates that the proposed model is accurate. At the same time, the defense decision values corresponding to the three groups of parameters converge at different speeds. The figure clearly shows that the third group reaches equilibrium after about 140 defense rounds, which is significantly faster than the other two groups; in this example, the model therefore performs better with the third group of parameters.

In addition, to further verify the performance of the proposed model, we compared our method with a variant that uses the DQN algorithm to handle the unknown state transition probability of the stochastic game model. The results are shown in Figure 7.

From Figure 7 we find that the defense strategy value obtained with the DQN algorithm is often higher than the value calculated with the Double DQN algorithm, because the DQN algorithm is prone to overestimation; this result is consistent with our expectations. We also find that the probability value of the defense strategy computed by the proposed model converges faster than that of the model using the DQN algorithm. These results show that the proposed model performs better.

5. Comparative Analysis of Related Work

The comparison between the model proposed in this paper and some typical methods is shown in Figure 8 and Table 3.

Figure 8 shows the probability of selecting defense actions under four different methods. It can be seen that the defense strategy of the method proposed in this article converges to an equilibrium strategy after about 140 rounds of learning. The method proposed in [8] uses a fixed state transition, which leads to a certain deviation between its result and the objective equilibrium strategy value; in addition, that method assumes complete information, which also makes its calculated value slightly larger than those of the other methods. The method proposed in [9] improves on [8] in certain respects, but the defense strategy it obtains is unstable and fluctuates strongly in the early stage, and its results are also biased. The method proposed in [28] outperforms [8, 9] in the early stage of defense; however, as the number of defense rounds increases, its disadvantage in processing high-dimensional data emerges. Moreover, the Q-learning algorithm is better suited to pure-strategy problems: when dealing with mixed-strategy selection, it does not perform well and the defense strategy keeps oscillating.

As shown in Table 3, we summarize the four methods. The methods of [8, 9] rely only on basic stochastic game theory, and their applicability is limited; the method of [28] combines the stochastic game with reinforcement learning, which improves applicability. This paper combines the stochastic game with deep reinforcement learning. Compared with [8], our model assumes incomplete information, which makes it more widely applicable; compared with [9], it no longer requires the system state transition probability to be fixed in advance, which better matches how real network systems evolve; and compared with [28], it is not only better suited to high-dimensional network state spaces but also more accurate. In summary, the proposed model is more suitable for modeling and analyzing the actual network attack and defense process.

6. Conclusions

This paper analyzes the modeling requirements of the actual network attack and defense process and proposes a defense strategy decision model based on an incomplete-information stochastic game and deep reinforcement learning. The model targets the game between the offensive and defensive sides of the network and uses deep reinforcement learning to compute the payoffs of the game entities. We have verified that the model better matches the modeling requirements of the network attack and defense process in a real environment and that it improves on existing models theoretically. The selected deep reinforcement learning algorithm is better suited to high-dimensional game state spaces, and the experiments show that, under the same experimental conditions, the proposed method converges to the defense equilibrium strategy faster than other methods. The research results of this paper therefore provide new ideas for the selection of network security defense strategies.

In the future, our research will focus on two aspects: (1) improving the fidelity with which the model reproduces the network attack and defense process and (2) improving the applicability of the model to meet the needs of more complex network environments.

Appendix

A. The Meanings of the Math Notations Used in This Paper Are Shown in Table 4

B. Example Calculation

Assume that the attacker has two types, a high-level attacker and a low-level attacker, with a given probability distribution over the types. The state set of the network system is given, and the network state transitions are shown in Figure 9.

The attacker's behavior set and the defender's behavior set specify the behaviors available to each side in the individual system states.

Next, we use the method proposed in this article, taking one system state as an example, to solve the Nash equilibrium strategy. The known conditions are as follows:
(1) the defender's immediate returns in that state;
(2) the attacker's assumed initial strategy in that state, reflecting the stochasticity of attack behavior in actual network attack and defense events;
(3) the parameter settings in Table 1;
(4) the transition probabilities of the system from that state.

Then, according to the above known conditions (1)–(4), using formula (1) and formula (2), the Bayesian Nash equilibrium in the experimental scene can be obtained, and the results are shown in Table 2.
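
Since the concrete payoffs, beliefs, and transition probabilities of this example are given in Tables 1 and 2 and Figure 9, the sketch below only illustrates the computational pattern on hypothetical numbers (the same ones as in the earlier maximin sketch, so the results can be cross-checked): average the defender's 2 x 2 payoff matrices over the two attacker types using a belief, then find the mixed equilibrium of the resulting matrix game from the indifference conditions under a zero-sum simplification. All values are made up, and an interior mixed solution is assumed.

```python
import numpy as np

# Hypothetical defender payoff matrices in one system state
# (rows = two defense behaviors, columns = two attack behaviors), one per attacker type.
M_high = np.array([[3.0, -1.0], [-2.0, 4.0]])   # high-level attacker
M_low  = np.array([[2.0,  0.0], [-1.0, 3.0]])   # low-level attacker
belief = {"high": 0.6, "low": 0.4}              # hypothetical belief about the attacker's type

U = belief["high"] * M_high + belief["low"] * M_low   # expected payoff matrix under the belief

# Mixed equilibrium of a 2x2 matrix game via indifference conditions
# (zero-sum simplification; valid when the resulting probabilities lie in [0, 1]).
den = U[0, 0] - U[0, 1] - U[1, 0] + U[1, 1]
p = (U[1, 1] - U[1, 0]) / den       # probability the defender plays its first behavior
q = (U[1, 1] - U[0, 1]) / den       # probability the attacker plays its first behavior
value = (U[0, 0] * p * q + U[0, 1] * p * (1 - q)
         + U[1, 0] * (1 - p) * q + U[1, 1] * (1 - p) * (1 - q))

print(f"defender mix: ({p:.3f}, {1 - p:.3f})")        # ~ (0.619, 0.381)
print(f"attacker mix: ({q:.3f}, {1 - q:.3f})")        # ~ (0.500, 0.500)
print(f"defender equilibrium payoff: {value:.3f}")    # ~ 1.000
```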

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by National Key Research and Development Program (no. 2018YFC0808306), Hebei Province Key Research and Development Program (19270318D), Hebei Province Internet of Things Monitoring Engineering Technology Research Center (no. 3142018055), and Qinghai Province Internet of Things Key Laboratory (no. 2017-ZJ-Y21) project.