Learning Diverse Policies with Soft Self-Generated Guidance
Algorithm 1: Policy optimization with soft self-generated guidance and diverse exploration (POSE).
Input: number of agents, learning rate, an on-policy training buffer and a highly rewarded trajectory buffer for each agent, sequence length, and number of epochs.
(1) Initialize the policy weights of each agent.
(2) Initialize the prior good trajectory buffer of each agent.
(3) for each training epoch do
(4)   Collect rollouts and store them in each agent's own on-policy training buffer.
(5)   Update the soft self-imitation training batches for each agent.
(6)   Compute the advantage estimates.
(7)   Estimate the distance between the current trajectories and the highly rewarded trajectories in each agent's soft self-imitation replay buffer.
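To make the structure of the loop concrete, the following is a minimal Python sketch of one way steps (2)-(7) could be organized. All helper names (`collect_rollouts`, `train_pose`) and the concrete choices here, generalized advantage estimation for step (6) and a mean Euclidean state distance for step (7), are illustrative assumptions rather than the paper's exact implementation; the policy-update step itself is elided.

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Step 6: generalized advantage estimation for one trajectory
    (an assumed estimator). `values` holds len(rewards) + 1 entries,
    with the bootstrap value appended last."""
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

def trajectory_distance(traj, good_trajs):
    """Step 7: distance between a rollout and the highly rewarded buffer,
    here the smallest mean Euclidean state distance (an assumed metric)."""
    dists = []
    for good in good_trajs:
        horizon = min(len(traj["states"]), len(good["states"]))
        diff = (np.asarray(traj["states"][:horizon])
                - np.asarray(good["states"][:horizon]))
        dists.append(float(np.mean(np.linalg.norm(diff, axis=-1))))
    return min(dists) if dists else float("inf")

def collect_rollouts(rng, num=4, T=16, state_dim=3):
    """Step 4 placeholder: real code would run each agent's policy in the
    environment; here we fabricate random trajectories of length T."""
    rollouts = []
    for _ in range(num):
        rewards = rng.normal(size=T)
        rollouts.append({
            "states": rng.normal(size=(T, state_dim)),
            "rewards": rewards,
            "values": rng.normal(size=T + 1),  # includes bootstrap value
            "return": float(rewards.sum()),
        })
    return rollouts

def train_pose(num_agents=3, num_epochs=5, buffer_capacity=8, seed=0):
    # Step 1 (policy initialization) is omitted in this stub.
    rng = np.random.default_rng(seed)
    good_buffers = [[] for _ in range(num_agents)]  # step 2
    for epoch in range(num_epochs):                 # step 3
        for k in range(num_agents):
            rollouts = collect_rollouts(rng)        # step 4
            # Step 5: keep the highest-return trajectories as the soft
            # self-imitation training batch for agent k.
            good_buffers[k] = sorted(good_buffers[k] + rollouts,
                                     key=lambda tr: tr["return"],
                                     reverse=True)[:buffer_capacity]
            for tr in rollouts:
                tr["advantages"] = compute_gae(tr["rewards"], tr["values"])  # step 6
                tr["distance"] = trajectory_distance(tr, good_buffers[k])    # step 7
            # A real implementation would now take a policy-gradient step that
            # combines the on-policy advantages with the self-imitation signal.

if __name__ == "__main__":
    train_pose()
```

The per-agent good-trajectory buffer is kept small and sorted by return, so the self-imitation signal always comes from the best behavior seen so far, while the distance in step (7) supplies the diversity signal that keeps agents from collapsing onto the same trajectories.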