Research Article

Learning Diverse Policies with Soft Self-Generated Guidance

Figure 1. The framework of POSE. Diverse exploration in POSE employs multiple agents to sample training batches and uses a diversity measure to encourage the agents to collect diverse trajectories, whereas traditional RL algorithms, e.g., PPO [20] or SAC [45], use a single agent to collect data. Meanwhile, each agent maintains a replay buffer in which it stores its past good trajectories. These trajectories can then guide the agents back to regions where rewards are more likely to be obtained.
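To make the two ingredients in the caption concrete, the following minimal Python sketch illustrates (a) a per-agent buffer that retains only high-return trajectories and (b) a diversity bonus computed against the other agents' rollouts. This is an illustrative assumption of how such components could be wired together, not POSE's actual interface; all names (`GoodTrajectoryBuffer`, `diversity_bonus`, `distance_fn`) are hypothetical.

```python
import random
from collections import deque

# Hypothetical sketch of the two mechanisms described in Figure 1:
# (1) each agent keeps a buffer of its past good trajectories, and
# (2) agents receive a diversity bonus for differing from one another.

class GoodTrajectoryBuffer:
    """Per-agent replay buffer that keeps only high-return trajectories."""

    def __init__(self, capacity=100, return_threshold=0.0):
        self.buffer = deque(maxlen=capacity)
        self.return_threshold = return_threshold

    def maybe_store(self, trajectory, episode_return):
        # Retain only trajectories whose return exceeds the threshold,
        # so the buffer holds the agent's "past good trajectories".
        if episode_return > self.return_threshold:
            self.buffer.append((trajectory, episode_return))

    def sample(self):
        # Draw a past good trajectory to guide the agent back toward
        # regions where it previously obtained rewards.
        return random.choice(self.buffer) if self.buffer else None


def diversity_bonus(trajectory, other_trajectories, distance_fn):
    """Illustrative diversity measure: mean distance from this agent's
    trajectory to the trajectories collected by the other agents."""
    if not other_trajectories:
        return 0.0
    total = sum(distance_fn(trajectory, t) for t in other_trajectories)
    return total / len(other_trajectories)


# Example of combining the pieces during training (beta is a
# hypothetical weight on the diversity bonus):
#   shaped_return = episode_return + beta * diversity_bonus(
#       trajectory, other_agents_trajectories, distance_fn)
#   buffer.maybe_store(trajectory, episode_return)
```

In this reading, the diversity bonus shapes each agent's objective so the population spreads out during exploration, while the per-agent buffer supplies self-generated guidance toward previously rewarding regions.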