Security and Communication Networks

Volume 2018, Article ID 2058429, 12 pages

https://doi.org/10.1155/2018/2058429

## A Dynamic Hidden Forwarding Path Planning Method Based on Improved Q-Learning in SDN Environments

School of Software, Beijing Institute of Technology, Beijing, China

Correspondence should be addressed to Kun Lv; kunlv@bit.edu.cn

Received 10 January 2018; Accepted 12 March 2018; Published 23 April 2018

Academic Editor: Zheng Yan

Copyright © 2018 Yun Chen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Currently, many methods are available to improve the security of a target network, but the vast majority of them cannot obtain an optimal attack path and interdict it dynamically and conveniently. Almost all defense strategies aim to repair known vulnerabilities or limit services in the target network. These methods cannot respond to attacks in real time because they sometimes need to wait for manufacturers to release the corresponding countermeasures. In this paper, we propose an improved Q-learning algorithm to plan an optimal attack path directly and automatically. Based on this path, we use a software-defined network (SDN) to adjust routing paths and create hidden forwarding paths dynamically to filter malicious attack requests. Compared to other machine learning algorithms, Q-learning only needs the target state as input to its agents, which avoids a complex early training process. We improve the Q-learning algorithm in two aspects. First, a reward function based on the weights of hosts and the attack success rates of vulnerabilities is proposed, which can adapt to different network topologies precisely. Second, we remove the actions and merge them into the states, which reduces the complexity of the algorithm. In experiments, after deploying hidden forwarding paths, the security of the target network is boosted significantly without having to repair network vulnerabilities immediately.

#### 1. Introduction

A defense strategy represents a series of defense methods in the target information system network that can reduce the attack success rate of attackers. Currently, many methods are available to generate a defense strategy. The most important problem is the trade-off between cost and performance. A defense strategy may perform excellently, but in most instances defenders must scan and recapture the information system, which is very uneconomical.

Generally speaking, whether in an SDN or a traditional network, we can plan a defense strategy by locating the optimal attack path. Regarding this method, a majority of previous papers generate a complete attack graph [1–3]; however, in a very large computer cluster, the state explosion problem tends to affect attack graph generation. Thus, the optimal attack path cannot be modeled quickly, and in extreme cases, it may not be possible to determine it at all. In [4], the authors use the ant colony optimization (ACO) approach to search for the optimal attack path based on the minimal attack path [5], but ACO can easily fall into a local optimum. Reference [6] proposes an HMM-based attack graph generation method, and the authors then use an ACO-based algorithm to compute the optimal attack path. Based on this path, the security of the target network can be evaluated and corresponding countermeasures can be planned. However, this method primarily handles known vulnerabilities. Reference [7] proposes a security model enacting method based on malicious nodes, but its performance in handling zero-day vulnerabilities is not strong enough.

Reinforcement learning [8] is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize a cumulative reward. It differs from standard supervised learning in that correct input/output pairs are never presented, nor are suboptimal actions explicitly corrected. Further, the focus is on online performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).

For a method to model the optimal defense strategy under many conditions, its algorithm must not depend on all states of the target network, and it should be able to decide dynamically which atomic attack to pick as the next state. These abilities depend on the characteristics of the specific algorithm.

In this paper, the optimal attack path between the source node and the target node is computed by the improved Q-learning algorithm. Concretely, in the first stage, we collect information on known vulnerabilities and the corresponding types of hosts from the National Vulnerability Database (NVD) [9]. Then, a fuzzy neural network is trained on these samples to obtain the host weight. Once the host weight is available, the reward function of the improved Q-learning algorithm can be built. Using this function, the optimal attack path between the source node and the target node can be located.

In Section 2, we will introduce Q-learning, which is followed by an overview of the main contributions of this paper. In Section 3, the definition of a network model will be discussed. We propose a reward function, and the optimal forwarding path will be built based on this function. In Section 4, we will discuss how to implement this method in a real information system network.

Section 5 provides the experimental results for the optimal protective path and discusses how to improve the Q-learning algorithm. The paper concludes with a summary and future work in Section 6.

##### 1.1. Related Work

Several works on defense strategy planning have appeared in recent years. The available methods for generating a defense strategy can be classified into three categories. First, we can compute the optimal attack path using an attack graph and enact policies to destroy that path. Second, we can locate the untrusted nodes in the target network and plan countermeasures to prevent these nodes from being exploited by attackers. Third, special strategies can be designed for specific network environments or specific attack types. However, all of these methods have inherent defects.

Regarding the first method, its essence is that the system generates an attack graph of the target network and then finds and interdicts the optimal attack path in that graph. Wang et al. [6] propose an HMM-based attack graph generation method and an ACO-based algorithm to evaluate the security of the target network and plan corresponding countermeasures. This method can compute the transition probability between every two states. Based on this probability and the ACO-based algorithm, the shortest attack path can be found, which can be used to evaluate security metrics. However, this method has some defects. Firstly, the complexity of the ACO algorithm, which depends on the iteration number, the number of vertices, and the number of ants in the ant colony, is too high to be practical for a computer cluster. Secondly, the technique does not perform well enough on APT and zero-day vulnerabilities, because the time-series interval of the HMM is slightly less than the interval of an APT and because the method relies on Common Vulnerabilities and Exposures (CVE) [10]. Ghosh et al. [4] propose an ACO-based defense strategy planning method that is similar to [6]. It uses a minimal attack graph to locate the optimal attack path, but this path may not be the global optimum, and the attack graph also suffers from state explosion if it is used in very large computer clusters.

For the second technique, the core idea is to find the malicious nodes. Akbar et al. [7] propose a security model building method based on a Support Vector Machine (SVM) and rough sets. In that paper, the authors use the SVM and rough sets to classify the nodes in the target network as trusted nodes, strange nodes, and malicious nodes. This technique can also acquire the transaction success rate. The method can handle zero-day vulnerabilities under some conditions, but it needs a large amount of sample data to train the SVM. In some network environments, a sufficient data set is impossible to obtain because the data take a long time to collect.

Regarding the third method, the key is to handle specific attack types or vulnerabilities. Hu et al. [11] characterize the interaction between a defender and an APT attacker, together with an information-trading game among insiders, as a two-layer game model. Through their analysis, the existence of a Nash equilibrium for both games is certified and the security metric can be evaluated. However, this method can only process APT, so its generalization is limited. Similar to [11], Wang et al. [12] propose a k-zero-day safety method. It starts from the worst-case assumption that it is not measurable which unknown vulnerabilities are more likely to exist in the target network, and it ends with the number of zero-day vulnerabilities required to compromise the network asset. However, the complexity of computing this metric is exponential in the size of the zero-day attack graph. Furthermore, the zero-day attack graph cannot reflect the condition of known vulnerabilities.

##### 1.2. Contribution

In this paper, we use an improved Q-learning algorithm to generate the optimal attack path. In Q-learning [8], the action to select is determined by a reward function. In other words, a large amount of sample data is not required, unlike in many other machine learning algorithms. Compared to temporal difference learning, Q-learning can directly iterate an optimal policy, which in this paper is the optimal attack path. Defining the reward function is the key issue in Q-learning. In this paper, we use the host weight and the attack success rate of atomic attacks to build a reward function. Specifically, the host weight is determined by the position of the host in the network and the services it offers. In addition, we improve the structure of the state matrix in Q-learning. The dimension of the matrix is reduced, which lowers the space complexity. Furthermore, the network model that reflects the configuration of the target network is used to analyze the result of Q-learning.
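To make the idea of a reward built from host weights and attack success rates concrete, the following is a minimal sketch rather than the reward function defined later in the paper. It assumes, purely for illustration, that the reward of an atomic attack is the product of the target host's weight and the attack success rate, and that the reward table is indexed by (current host, next host) so that no separate action dimension is needed. The names `edge_reward`, `build_reward_matrix`, `HOST_WEIGHTS`, and `ATTACKS`, as well as all numeric values, are hypothetical.

```python
from typing import Dict, List, Tuple

NO_EDGE = float("-inf")  # marks host pairs with no exploitable atomic attack


def edge_reward(host_weight: float, success_rate: float) -> float:
    """Assumed reward form: weight of the attacked host times success rate."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must lie in [0, 1]")
    return host_weight * success_rate


def build_reward_matrix(
    host_weights: List[float],
    attacks: Dict[Tuple[int, int], float],
) -> List[List[float]]:
    """Return an N x N table indexed by (current host, next host).

    Moving to the next host plays the role of the action, so the table
    needs only two dimensions.
    """
    n_hosts = len(host_weights)
    matrix = [[NO_EDGE] * n_hosts for _ in range(n_hosts)]
    for (src, dst), success_rate in attacks.items():
        matrix[src][dst] = edge_reward(host_weights[dst], success_rate)
    return matrix


# Hypothetical four-host network: host 0 is the attacker's entry point,
# host 3 is the most valuable asset.
HOST_WEIGHTS = [0.1, 0.4, 0.6, 1.0]
ATTACKS = {(0, 1): 0.5, (0, 2): 0.25, (1, 3): 0.5, (2, 3): 0.75}
REWARDS = build_reward_matrix(HOST_WEIGHTS, ATTACKS)
```

With a table of this shape, the memory cost grows with the square of the number of hosts, independently of how many distinct exploits exist per host pair; whether this matches the exact reduction achieved in the paper depends on the definitions given in the later sections.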

Our ultimate goal is to build a hidden forwarding path. In this path, we create virtual hosts in the SDN that provide specific defense strategies to filter specific attacks. These hosts can be created or deleted dynamically, so that computing hidden forwarding paths occupies minimal memory on the SDN controller when the routing path has to be changed. Furthermore, by using the hidden forwarding paths, attacks on vulnerabilities are filtered, which guarantees the security of the system without repairing vulnerabilities or limiting services on hosts.
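The dynamic creation and deletion of virtual hosts can be pictured with a small sketch. It is only an illustration under stated assumptions: a route is modeled as a plain list of host names, a virtual filtering host is inserted immediately in front of every host that lies on the computed attack path, and no real SDN controller API is involved. The helper names `add_hidden_hops` and `remove_hidden_hops` and the `vf-` naming scheme are hypothetical.

```python
from typing import Iterable, List

FILTER_PREFIX = "vf-"  # hypothetical naming scheme for virtual filter hosts


def add_hidden_hops(route: List[str], risky_hosts: Iterable[str]) -> List[str]:
    """Return a new route with a virtual filter host before each risky host."""
    risky = set(risky_hosts)
    hidden_route: List[str] = []
    for host in route:
        if host in risky:
            # Traffic reaches the filter first, then the protected host.
            hidden_route.append(FILTER_PREFIX + host)
        hidden_route.append(host)
    return hidden_route


def remove_hidden_hops(route: List[str]) -> List[str]:
    """Delete all virtual filter hosts, restoring the plain route."""
    return [host for host in route if not host.startswith(FILTER_PREFIX)]


# Hypothetical example: h2 and h4 lie on the optimal attack path.
ROUTE = ["h1", "h2", "h3", "h4"]
ATTACK_PATH_HOSTS = ["h2", "h4"]
HIDDEN_ROUTE = add_hidden_hops(ROUTE, ATTACK_PATH_HOSTS)
```

In this sketch, `HIDDEN_ROUTE` contains the extra hops `vf-h2` and `vf-h4`, and calling `remove_hidden_hops(HIDDEN_ROUTE)` restores the original route once the filters are no longer needed.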

The other contribution of this paper is to make the defense strategy economical. Our method does not need to scan or monitor hosts at all times. A host is scanned only if our algorithm considers it an untrusted node.

#### 2. Preliminary

##### 2.1. Description of Q-Learning Algorithm

Q-learning is a model-free reinforcement learning technique that derives from policy iteration. The flow diagram of policy iteration is shown in Figure 1. Specifically, Q-learning can be used to find an optimal action-selection policy for any given (finite) Markov decision process (MDP). Its definition rests on a single state transition: the state $s_t$ transfers to $s_{t+1}$ using action $a_t$, and the cost of this process is $r_t$. Above, $t$ represents the discrete time sequence.
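For orientation, the standard textbook form of the tabular Q-learning update is $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,[\,r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\,]$, where $\alpha$ is the learning rate and $\gamma$ the discount factor. The sketch below applies this standard rule, not the improved variant proposed in this paper, to a hypothetical four-state toy problem. Exhaustive sweeps over all state-action pairs replace random exploration so that the example is deterministic; the `TRANSITIONS` table, its reward values, and the function names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy MDP: TRANSITIONS[state][action] = (next_state, reward).
# State 3 is terminal and therefore has no outgoing actions.
TRANSITIONS: Dict[int, Dict[int, Tuple[int, float]]] = {
    0: {1: (1, 0.2), 2: (2, 0.15)},
    1: {3: (3, 0.5)},
    2: {3: (3, 0.75)},
    3: {},
}


def train_q_table(
    transitions: Dict[int, Dict[int, Tuple[int, float]]],
    alpha: float = 0.5,
    gamma: float = 0.8,
    sweeps: int = 50,
) -> Dict[int, Dict[int, float]]:
    """Run the standard Q-learning update over every state-action pair."""
    q = {s: {a: 0.0 for a in actions} for s, actions in transitions.items()}
    for _ in range(sweeps):
        for state, actions in transitions.items():
            for action, (next_state, reward) in actions.items():
                # Value of the best action in the successor (0 if terminal).
                best_next = max(q[next_state].values(), default=0.0)
                td_target = reward + gamma * best_next
                q[state][action] += alpha * (td_target - q[state][action])
    return q


def greedy_path(
    q: Dict[int, Dict[int, float]],
    transitions: Dict[int, Dict[int, Tuple[int, float]]],
    start: int,
    goal: int,
) -> List[int]:
    """Follow the highest-valued action from start until goal is reached."""
    path = [start]
    while path[-1] != goal:
        actions = q[path[-1]]
        if not actions or len(path) > len(q):
            raise ValueError("goal is not reachable by the greedy policy")
        best_action = max(actions, key=actions.get)
        path.append(transitions[path[-1]][best_action][0])
    return path


Q_TABLE = train_q_table(TRANSITIONS)
BEST_PATH = greedy_path(Q_TABLE, TRANSITIONS, start=0, goal=3)
```

In this toy setting the values converge to $Q(0,1) = 0.2 + 0.8 \cdot 0.5 = 0.6$ and $Q(0,2) = 0.15 + 0.8 \cdot 0.75 = 0.75$, so the greedy policy prefers the route through state 2 even though the first step towards state 1 has the larger immediate reward.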