Research Article

Learning Diverse Policies with Soft Self-Generated Guidance

Figure 1. The framework of POSE. Diverse exploration in POSE employs multiple agents to sample training batches and uses a diversity measure to encourage the agents to collect diverse trajectories, whereas traditional RL algorithms, e.g., PPO [20] or SAC [45], use a single agent to collect data. Meanwhile, each agent maintains a replay buffer in which it stores its past good trajectories. These trajectories can then guide the agents back to regions where rewards are more likely to be obtained.
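To make the two ingredients in the caption concrete, the following minimal Python sketch illustrates (a) a per-agent buffer that retains only high-return trajectories and (b) a diversity bonus computed against the other agents' rollouts. This is an illustrative assumption of how such components could be wired together, not POSE's actual interface; all names (`GoodTrajectoryBuffer`, `diversity_bonus`, `distance_fn`) are hypothetical.

```python
import random
from collections import deque

# Hypothetical sketch of the two mechanisms described in Figure 1:
# (1) each agent keeps a buffer of its past good trajectories, and
# (2) agents receive a diversity bonus for differing from one another.

class GoodTrajectoryBuffer:
    """Per-agent replay buffer that keeps only high-return trajectories."""

    def __init__(self, capacity=100, return_threshold=0.0):
        self.buffer = deque(maxlen=capacity)
        self.return_threshold = return_threshold

    def maybe_store(self, trajectory, episode_return):
        # Retain only trajectories whose return exceeds the threshold,
        # so the buffer holds the agent's "past good trajectories".
        if episode_return > self.return_threshold:
            self.buffer.append((trajectory, episode_return))

    def sample(self):
        # Draw a past good trajectory to guide the agent back toward
        # regions where it previously obtained rewards.
        return random.choice(self.buffer) if self.buffer else None


def diversity_bonus(trajectory, other_trajectories, distance_fn):
    """Illustrative diversity measure: mean distance from this agent's
    trajectory to the trajectories collected by the other agents."""
    if not other_trajectories:
        return 0.0
    total = sum(distance_fn(trajectory, t) for t in other_trajectories)
    return total / len(other_trajectories)


# Example of combining the pieces during training (beta is a
# hypothetical weight on the diversity bonus):
#   shaped_return = episode_return + beta * diversity_bonus(
#       trajectory, other_agents_trajectories, distance_fn)
#   buffer.maybe_store(trajectory, episode_return)
```

In this reading, the diversity bonus shapes each agent's objective so the population spreads out during exploration, while the per-agent buffer supplies self-generated guidance toward previously rewarding regions.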