Research Article

Learning Diverse Policies with Soft Self-Generated Guidance

Figure 6

State-visitation counts in the grid world for different algorithms: (a) PPO, (b) A2C, (c) Div-A2C, (d) PPO-SIL, (e) PPO-EXP, and (f) ours. PPO, A2C, and PPO-SIL are easily trapped in a local optimum. Div-A2C can visit regions where the apple is not located, but it cannot reach the treasure and obtain the optimal reward. PPO-EXP enables the agent to reach the treasure and obtain the highest reward; however, it spends a large amount of computation visiting parts of the state-action space that yield no useful reward. Our algorithm explores the state-action space systematically and collects the optimal reward quickly, enabling the agent to learn the optimal policy at a fast rate.