Security and Communication Networks

Research Article

DQfD-AIPT: An Intelligent Penetration Testing Framework Incorporating Expert Demonstration Data

DQfD.

	Input: the experience replay area built by the sum tree; the expert demonstration data area and the interactive data area within the replay area; the weights for the policy network (randomly generated); the weights for the target network (randomly generated); the target-network update frequency for pretraining; the target-network update frequency for formal training; the batch size; the number of pretraining gradient updates; E: episode number of training; and S: max steps per episode
	Output: An agent trained with expert knowledge
(1)	Push expert transition data into the expert demonstration data area and initialize their priority
(2)	for each of the pretraining gradient-update steps do
(3)	Sample a batch of transitions from the experience replay area with prioritization
(4)	Calculate loss using the target network
(5)	Perform a gradient descent step to update the weights for the policy network
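(6)	if the number of completed pretraining steps is a multiple of the pretraining target-update frequency then copy the policy network weights to the target network end if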
(7)	end for
(8)	for episode = 1 to E do
(9)	for step = 1 to S do
(10)	Sample action A from the behaviour policy
(11)	The environment performs A and returns the reward R, and the agent observes the next state
(12)	Push the transition into the interactive data area, overwriting the oldest interaction transition if the capacity of that area is exceeded
(13)	Sample a batch of transitions from the experience replay area with prioritization
(14)	Calculate loss using the target network
(15)	Perform a gradient descent step to update the weights for the policy network
(16)	Update the current state: the state transitions from the current state to the next state
(17)	end for
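(18)	if the episode number is a multiple of the formal-training target-update frequency then copy the policy network weights to the target network end if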
(19)	end for
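
The replay-area operations in steps (1), (3), (12), and (13) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name SumTreeReplay, its method names, the default priority of 1.0, and the flat-list tree layout are all assumptions made for the example. It shows a sum tree whose leaves are split into a demonstration area that is never overwritten and an interaction area that wraps around once full, with sampling proportional to priority.

```python
import random


class SumTreeReplay:
    """Replay area backed by a sum tree, split into a permanent demonstration
    area and a circular interaction area (all names are illustrative)."""

    def __init__(self, demo_capacity, inter_capacity):
        self.demo_capacity = demo_capacity
        self.capacity = demo_capacity + inter_capacity
        # Flat binary tree: node i has children 2i and 2i + 1;
        # leaves occupy indices [capacity, 2 * capacity), root is index 1.
        self.tree = [0.0] * (2 * self.capacity)
        self.data = [None] * self.capacity
        self.n_demo = 0    # filled demonstration slots
        self.n_inter = 0   # total interaction pushes so far

    def _set_priority(self, slot, priority):
        i = slot + self.capacity
        self.tree[i] = priority
        i //= 2
        while i >= 1:  # refresh the partial sums up to the root
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2

    def push_demo(self, transition, priority=1.0):
        if self.n_demo >= self.demo_capacity:
            raise ValueError("demonstration area is full")
        self.data[self.n_demo] = transition
        self._set_priority(self.n_demo, priority)
        self.n_demo += 1

    def push_interaction(self, transition, priority=1.0):
        # Wrap around inside the interaction area only, so the oldest
        # interaction transition is overwritten and demo data is untouched.
        inter_capacity = self.capacity - self.demo_capacity
        slot = self.demo_capacity + self.n_inter % inter_capacity
        self.data[slot] = transition
        self._set_priority(slot, priority)
        self.n_inter += 1

    def sample(self, batch_size, rng=random):
        """Draw transitions with probability proportional to priority."""
        batch = []
        for _ in range(batch_size):
            mass = rng.random() * self.tree[1]
            i = 1
            while i < self.capacity:  # descend until a leaf is reached
                left = 2 * i
                if mass < self.tree[left]:
                    i = left
                else:
                    mass -= self.tree[left]
                    i = left + 1
            batch.append(self.data[i - self.capacity])
        return batch
```

In this sketch, pretraining would call push_demo once per expert transition and then sample repeatedly, while formal training would call push_interaction after every environment step before sampling. Updating priorities from the computed loss and guarding against floating-point rounding in the descent are left out for brevity.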