Research Article

Learning Diverse Policies with Soft Self-Generated Guidance

Algorithm 1

Policy optimization with soft self-generated guidance and diverse exploration (POSE).
Input: number of agents, learning rate, on-policy training buffer for each agent, highly rewarded trajectory buffer for each agent, sequence length, and number of epochs.
(1) Initialize the policy weights of each agent.
(2) Initialize the prior good trajectory buffer of each agent.
(3) for epoch = 1 to the number of epochs do
(4) Collect rollouts for each agent and store them in its own on-policy training buffer.
(5) Update the soft self-imitation training batches for each agent.
(6) Compute advantage estimates for each agent.
(7) Estimate the distance between the current trajectories and the highly rewarded trajectories in the soft self-imitation replay buffer for each agent.
(8) Estimate the gradient in equation (11).
(9) Perform the policy improvement step for each agent.
(10) Estimate the quantities required for the policy exploration step.
(11) Perform the policy exploration step by updating the policy parameters according to equation (15).
(12) end for
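A minimal sketch of how Algorithm 1 can be organized in code is given below. Everything environment-specific is a toy stand-in: the linear policies, the collect_rollout, advantage_estimates, and distance_to_buffer helpers, the buffer capacity, and the surrogate improvement and exploration updates are assumptions made for illustration, not the authors' implementation of equations (11) and (15).

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_AGENTS = 4     # number of agents (illustrative value)
NUM_EPOCHS = 10    # number of training epochs
SEQ_LEN = 32       # rollout / sequence length
LR = 1e-2          # learning rate
PARAM_DIM = 8      # toy policy parameter dimension
BUFFER_CAP = 16    # capacity of the highly rewarded trajectory buffer (assumed)

# (1) Initialize the policy weights of each agent (toy linear policies).
policies = [rng.normal(size=PARAM_DIM) for _ in range(NUM_AGENTS)]
# (2) Initialize the prior good (highly rewarded) trajectory buffer of each agent.
good_buffers = [[] for _ in range(NUM_AGENTS)]


def collect_rollout(theta):
    """Toy stand-in for environment interaction: random features, scalar return."""
    traj = rng.normal(size=(SEQ_LEN, PARAM_DIM))
    ret = float((traj @ theta).mean())
    return traj, ret


def advantage_estimates(traj, theta):
    """Toy stand-in for GAE-style advantage estimation."""
    values = traj @ theta
    return values - values.mean()


def distance_to_buffer(traj, buffer):
    """Distance between the current trajectory and the buffered good trajectories."""
    if not buffer:
        return 0.0
    return min(float(np.linalg.norm(traj - g)) for g, _ in buffer)


for epoch in range(NUM_EPOCHS):                                   # (3)
    for i in range(NUM_AGENTS):
        # (4) Collect a rollout into the agent's own on-policy buffer.
        traj, ret = collect_rollout(policies[i])
        # (5) Update the soft self-imitation buffer with highly rewarded rollouts only.
        if len(good_buffers[i]) < BUFFER_CAP or ret > min(r for _, r in good_buffers[i]):
            good_buffers[i].append((traj, ret))
            good_buffers[i] = sorted(good_buffers[i], key=lambda x: -x[1])[:BUFFER_CAP]
        # (6) Compute advantage estimates.
        adv = advantage_estimates(traj, policies[i])
        # (7) Distance to the highly rewarded trajectories in the replay buffer.
        dist = distance_to_buffer(traj, good_buffers[i])
        # (8)-(9) Policy improvement step: a surrogate for the gradient of equation (11),
        # combining the on-policy signal with a distance-weighted guidance term.
        grad = (adv[:, None] * traj).mean(axis=0) - 0.01 * dist * policies[i]
        policies[i] = policies[i] + LR * grad
    # (10)-(11) Policy exploration step: a surrogate for equation (15), nudging each
    # agent away from the population mean to keep the learned policies diverse.
    mean_policy = np.mean(policies, axis=0)
    for i in range(NUM_AGENTS):
        policies[i] = policies[i] + LR * 0.1 * (policies[i] - mean_policy)
```

The per-step comments mirror the numbering of Algorithm 1 so the correspondence between the listing and the code structure is explicit; only the outer loop structure and the per-agent bookkeeping are taken from the algorithm itself.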