DQfD-AIPT: An Intelligent Penetration Testing Framework Incorporating Expert Demonstration Data
Algorithm 2: DQfD.
Input: the experience replay area built by the sum tree; the expert demonstration data area and the interactive data area within the replay area; the weights for the policy network (randomly generated); the weights for the target network (randomly generated); the target network update frequency for pretraining; the target network update frequency for formal training; the batch size; the number of pretraining gradient updates; E: the number of training episodes; and S: the maximum number of steps per episode
Output: An agent trained with expert knowledge
(1) Push the expert transition data into the expert demonstration data area and initialize their priorities
(2) for each pretraining gradient update step do
(3)   Sample a batch of transitions from the experience replay area with prioritization
(4)   Calculate the loss using the target network
(5)   Perform a gradient descent step to update the weights of the policy network
(6)   if the step count is a multiple of the pretraining target network update frequency then copy the policy network weights to the target network end if
(7) end for
(8) for each episode, up to E episodes, do
(9)   for each step, up to S steps, do
(10)    Sample action A from the behaviour policy
(11)    The environment performs A and returns the reward R, and the agent observes the next state
(12)    Push the transition into the interactive data area, overwriting the oldest interaction transition if the area is over capacity
(13)    Sample a batch of transitions from the experience replay area with prioritization
(14)    Calculate the loss using the target network
(15)    Perform a gradient descent step to update the weights of the policy network
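The replay handling in the listing (pushing expert transitions once, overwriting only the oldest interaction transitions, and sampling with prioritization from a sum tree) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class names `SumTree` and `DemoReplayBuffer`, the default priority of 1.0, and the stratified sampling scheme are assumptions made for illustration, and the loss computation and network updates are left out.

```python
import random


class SumTree:
    """Array-backed binary tree: leaves hold priorities, parents hold sums."""

    def __init__(self, capacity):
        self.capacity = capacity
        # nodes[1] is the root; leaf k is stored at index capacity + k.
        self.nodes = [0.0] * (2 * capacity)

    def update(self, leaf, priority):
        i = leaf + self.capacity
        self.nodes[i] = priority
        i //= 2
        while i >= 1:  # recompute the sums on the path to the root
            self.nodes[i] = self.nodes[2 * i] + self.nodes[2 * i + 1]
            i //= 2

    def total(self):
        return self.nodes[1]

    def find(self, mass):
        """Descend from the root to the leaf whose priority range holds mass."""
        i = 1
        while i < self.capacity:
            left, right = 2 * i, 2 * i + 1
            # Going left when the right subtree is empty avoids landing on
            # an unused leaf because of floating-point rounding.
            if mass < self.nodes[left] or self.nodes[right] == 0.0:
                i = left
            else:
                mass -= self.nodes[left]
                i = right
        return i - self.capacity


class DemoReplayBuffer:
    """Replay area split into a permanent demonstration area and a
    ring-buffer interaction area (hypothetical layout)."""

    def __init__(self, demo_capacity, interaction_capacity):
        self.demo_capacity = demo_capacity
        self.capacity = demo_capacity + interaction_capacity
        self.tree = SumTree(self.capacity)
        self.data = [None] * self.capacity
        self.demo_count = 0
        self.next_interaction = demo_capacity

    def push_demo(self, transition, priority=1.0):
        if self.demo_count >= self.demo_capacity:
            raise ValueError("demonstration area is full")
        self.data[self.demo_count] = transition
        self.tree.update(self.demo_count, priority)
        self.demo_count += 1

    def push_interaction(self, transition, priority=1.0):
        slot = self.next_interaction
        self.data[slot] = transition  # overwrites the oldest interaction entry
        self.tree.update(slot, priority)
        self.next_interaction += 1
        if self.next_interaction == self.capacity:
            self.next_interaction = self.demo_capacity  # never wrap into demos

    def sample(self, batch_size, rng=random):
        """Stratified prioritized sampling: one draw per equal-mass segment."""
        segment = self.tree.total() / batch_size
        batch = []
        for k in range(batch_size):
            leaf = self.tree.find((k + rng.random()) * segment)
            batch.append((leaf, self.data[leaf]))
        return batch
```

In this sketch the demonstration slots are written once and never reused, so expert data stays available throughout formal training, while the interaction slots behave as a ring buffer. The leaf indices returned by `sample` would let a training loop write updated priorities back with `tree.update` after each gradient step.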