Research Article

Count-Based Exploration via Embedded State Space for Deep Reinforcement Learning

Table 1

Atari 2600: average total reward after training for 50 M time steps. Boldface numbers indicate best results. Italic numbers are the best among count-based exploration methods.

Method            Freeway  Frostbite  Gravitar  Montezuma  Solaris  Venture
TRPO (baseline)   16.5     2869       486       0          2758     121
Double-DQN        33.3     1683       412       0          3068     98
Dueling network   0        4672       588       0          2251     497
Gorila            11.7     605        1054      4          N/A      1245
DQN Pop-Art       33.4     3469       483       0          4544     1172
A3C+              27.3     507        246       142        2175     0
TRPO+AE           33.5     5214       482       75         4467     445
TRPO+BASS         28.4     3150       604       238        1201     616
TRPO+OURS         34       5537       712       196        4860     983