Learning Diverse Policies with Soft Self-Generated Guidance

<div>The discrete state-action environments used in our experiments. (a) Huge grid world with sparse rewards (key-door-treasure domain): to maximize reward, the agent must pick up the key (+2) in the bottom-right room, open the blue door (+4), and collect the treasure (+4) in the top-middle room. (b) Huge grid world with deceptive rewards: an apple in the top-left room gives a small reward (+2), while a treasure in the top-middle room gives a higher reward (+10).</div>
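
The reward structure of the key-door-treasure domain in panel (a) can be illustrated with a minimal event-level sketch. It is not the environment used in the paper: the function name `episode_return`, the event strings, the rule that each reward is paid only once, and the rule that the treasure is reachable only after the door is open are all assumptions made for illustration. Only the reward values (+2, +4, +4) come from the caption.

```python
# Hypothetical sketch of the sparse rewards in the key-door-treasure domain.
# Reward values come from the figure caption; everything else is assumed.
REWARDS = {"key": 2, "door": 4, "treasure": 4}


def episode_return(events):
    """Return the total reward for a sequence of attempted events.

    Assumptions (not from the source): each reward is paid at most once,
    the door opens only if the key is held, and the treasure can be
    collected only after the door is open.
    """
    has_key = False
    door_open = False
    collected = set()
    total = 0
    for event in events:
        if event in collected or event not in REWARDS:
            continue  # ignore repeats and unknown events
        if event == "door" and not has_key:
            continue  # door stays shut without the key
        if event == "treasure" and not door_open:
            continue  # treasure is assumed to sit behind the door
        if event == "key":
            has_key = True
        elif event == "door":
            door_open = True
        collected.add(event)
        total += REWARDS[event]
    return total
```

Under these assumptions, an agent that completes the full key, door, treasure sequence earns 2 + 4 + 4 = 10, while an agent that heads for the door or treasure without the key earns nothing, which is what makes the rewards sparse.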

International Journal of Intelligent Systems

Figure 2: Learning Diverse Policies with Soft Self-Generated Guidance