Research Article

Learning Diverse Policies with Soft Self-Generated Guidance

Algorithm 1

Policy optimization with soft self-generated guidance and diverse exploration (POSE).
Input: number of agents, learning rate, on-policy training buffer for each agent, highly rewarded trajectory buffer for each agent, sequence length, and number of epochs.
(1) Initialize the policy weights of each agent.
(2) Initialize the prior good trajectory buffer of each agent.
(3) for epoch = 1 to the number of epochs do
(4) Collect rollouts for each agent and store them in its own on-policy training buffer.
(5) Update the soft self-imitation training batches for each agent.
(6) Compute advantage estimates for each agent.
(7) Estimate the distance between the current trajectories and the highly rewarded trajectories in the soft self-imitation replay buffer for each agent.
(8) Estimate the gradient in equation (11).
(9) Perform the policy improvement step for each agent.
(10) Estimate the quantities required for the policy exploration step.
(11) Perform the policy exploration step by updating the policy parameters according to equation (15).
(12) end for
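A minimal sketch of how Algorithm 1 can be organized in code is given below. Everything environment-specific is a toy stand-in: the linear policies, the collect_rollout, advantage_estimates, and distance_to_buffer helpers, the buffer capacity, and the surrogate improvement and exploration updates are assumptions made for illustration, not the authors' implementation of equations (11) and (15).

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_AGENTS = 4     # number of agents (illustrative value)
NUM_EPOCHS = 10    # number of training epochs
SEQ_LEN = 32       # rollout / sequence length
LR = 1e-2          # learning rate
PARAM_DIM = 8      # toy policy parameter dimension
BUFFER_CAP = 16    # capacity of the highly rewarded trajectory buffer (assumed)

# (1) Initialize the policy weights of each agent (toy linear policies).
policies = [rng.normal(size=PARAM_DIM) for _ in range(NUM_AGENTS)]
# (2) Initialize the prior good (highly rewarded) trajectory buffer of each agent.
good_buffers = [[] for _ in range(NUM_AGENTS)]


def collect_rollout(theta):
    """Toy stand-in for environment interaction: random features, scalar return."""
    traj = rng.normal(size=(SEQ_LEN, PARAM_DIM))
    ret = float((traj @ theta).mean())
    return traj, ret


def advantage_estimates(traj, theta):
    """Toy stand-in for GAE-style advantage estimation."""
    values = traj @ theta
    return values - values.mean()


def distance_to_buffer(traj, buffer):
    """Distance between the current trajectory and the buffered good trajectories."""
    if not buffer:
        return 0.0
    return min(float(np.linalg.norm(traj - g)) for g, _ in buffer)


for epoch in range(NUM_EPOCHS):                                   # (3)
    for i in range(NUM_AGENTS):
        # (4) Collect a rollout into the agent's own on-policy buffer.
        traj, ret = collect_rollout(policies[i])
        # (5) Update the soft self-imitation buffer with highly rewarded rollouts only.
        if len(good_buffers[i]) < BUFFER_CAP or ret > min(r for _, r in good_buffers[i]):
            good_buffers[i].append((traj, ret))
            good_buffers[i] = sorted(good_buffers[i], key=lambda x: -x[1])[:BUFFER_CAP]
        # (6) Compute advantage estimates.
        adv = advantage_estimates(traj, policies[i])
        # (7) Distance to the highly rewarded trajectories in the replay buffer.
        dist = distance_to_buffer(traj, good_buffers[i])
        # (8)-(9) Policy improvement step: a surrogate for the gradient of equation (11),
        # combining the on-policy signal with a distance-weighted guidance term.
        grad = (adv[:, None] * traj).mean(axis=0) - 0.01 * dist * policies[i]
        policies[i] = policies[i] + LR * grad
    # (10)-(11) Policy exploration step: a surrogate for equation (15), nudging each
    # agent away from the population mean to keep the learned policies diverse.
    mean_policy = np.mean(policies, axis=0)
    for i in range(NUM_AGENTS):
        policies[i] = policies[i] + LR * 0.1 * (policies[i] - mean_policy)
```

The per-step comments mirror the numbering of Algorithm 1 so the correspondence between the listing and the code structure is explicit; only the outer loop structure and the per-agent bookkeeping are taken from the algorithm itself.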