Research Article

DQfD-AIPT: An Intelligent Penetration Testing Framework Incorporating Expert Demonstration Data

Algorithm 2: DQfD.
Input: the experience replay area built on a sum tree; the expert demonstration data area within the replay area; the interactive data area within the replay area; the weights of the policy network (randomly initialized); the weights of the target network (randomly initialized); the target network update frequency for pretraining; the target network update frequency for formal training; the batch size; the number of pretraining gradient updates; E, the number of training episodes; and S, the maximum number of steps per episode
Output: An agent trained with expert knowledge
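(1) Push the expert transition data into the expert demonstration data area and initialize their priorities
(2)for each of the pretraining gradient update steps do
(3)  Sample a batch of transitions from the experience replay area with prioritization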
(4)  Calculate loss using the target network
(5)  Perform a gradient descent step to update the weights for the policy network
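(6)  if the step index is a multiple of the pretraining target network update frequency then copy the policy network weights to the target network end if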
(7)end for
(8)for each of the E training episodes do
(9)  for each step, up to S steps per episode, do
(10)   Sample action A from the behaviour policy
(11)   The environment executes A and returns the reward R, and the agent observes the next state
(12)   Push the transition into the interactive data area, overwriting the oldest interaction transition if the capacity of that area is exceeded
(13)   Sample a batch of transitions from the experience replay area with prioritization
(14)   Calculate loss using the target network
(15)   Perform a gradient descent step to update the weights for the policy network
(16)   Advance the state: the current state becomes the next state
(17)  end for
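(18)  if the episode index is a multiple of the formal-training target network update frequency then copy the policy network weights to the target network end if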
(19)end for
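
The replay area used in steps (1), (3), (12), and (13) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `SumTree`, `DemoReplayBuffer`, `initial_priority`, and `interaction_capacity` are hypothetical, and sampling is assumed to be purely proportional to priority, with no priority exponent and no importance-sampling weights. Expert transitions occupy fixed slots that are never overwritten, while interaction transitions are written to a circular region that replaces the oldest entry once it is full.

```python
import random


class SumTree:
    """Array-backed binary tree: leaves hold priorities, the root holds their sum."""

    def __init__(self, capacity):
        self.capacity = capacity
        # nodes[1] is the root; leaf for slot k lives at index capacity + k.
        self.nodes = [0.0] * (2 * capacity)

    def update(self, slot, priority):
        i = slot + self.capacity
        self.nodes[i] = priority
        i //= 2
        while i >= 1:  # recompute the sums on the path up to the root
            self.nodes[i] = self.nodes[2 * i] + self.nodes[2 * i + 1]
            i //= 2

    def total(self):
        return self.nodes[1]

    def find(self, mass):
        """Descend from the root to the leaf whose priority interval contains mass."""
        i = 1
        while i < self.capacity:
            left = 2 * i
            # Go left if mass falls in the left subtree, or if the right one is empty.
            if mass < self.nodes[left] or self.nodes[left + 1] == 0.0:
                i = left
            else:
                mass -= self.nodes[left]
                i = left + 1
        return i - self.capacity


class DemoReplayBuffer:
    """Replay area with a permanent demonstration region and a circular interaction region."""

    def __init__(self, demo_transitions, interaction_capacity, initial_priority=1.0):
        self.demo_size = len(demo_transitions)
        self.interaction_capacity = interaction_capacity
        self.capacity = self.demo_size + interaction_capacity
        self.tree = SumTree(self.capacity)
        self.data = [None] * self.capacity
        self.cursor = 0  # next write position inside the interaction region
        for slot, transition in enumerate(demo_transitions):  # step (1)
            self.data[slot] = transition
            self.tree.update(slot, initial_priority)

    def push(self, transition, priority=1.0):
        """Step (12): store an interaction transition, overwriting the oldest one when full."""
        slot = self.demo_size + self.cursor
        self.data[slot] = transition
        self.tree.update(slot, priority)
        self.cursor = (self.cursor + 1) % self.interaction_capacity
        return slot

    def sample(self, batch_size, rng=random):
        """Steps (3) and (13): draw slots with probability proportional to priority."""
        total = self.tree.total()
        slots = [self.tree.find(rng.random() * total) for _ in range(batch_size)]
        return slots, [self.data[s] for s in slots]
```

In this sketch, a training loop would call `sample` once per gradient step and then call `tree.update` on the returned slots with priorities derived from the new loss; how those priorities are computed is left open here.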