Research Article

Reducing Entropy Overestimation in Soft Actor Critic Using Dual Policy Network

Table 1

Hyperparameters.

Parameter                             Value
Optimizer                             Adam [21]
Max time steps                        1 million
Learning rate                         3 × 10^-4
Discount                              0.99
Replay buffer size                    10^6
# of hidden layers                    2
# of hidden units per layer           256
# of samples per minibatch            256
Entropy target                        −dim(A) (e.g., −8 for Ant-v2)
Nonlinearity                          ReLU
Target smoothing coefficient          0.005
Target update interval                1
Gradient steps (critic)               1
Reward scaling                        1.0
Initial alpha                         0.2
Alpha learning rate                   3 × 10^-4
Evaluation frequency                  5 × 10^3
Evaluation episodes                   10
Exploration time steps                25 × 10^3 for Ant-v2 and HalfCheetah-v2;
                                      1 × 10^3 for Hopper-v2 and Walker2d-v2
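The settings in Table 1 can be collected into a single configuration. The sketch below is illustrative, not code from the paper: the function and key names are assumptions, and the values mirror the table (which follows standard SAC defaults).

```python
# Hypothetical configuration helper mirroring Table 1 (names are illustrative).
def sac_config(env_name: str) -> dict:
    """Return the Table 1 SAC hyperparameters for a given MuJoCo environment."""
    # Environments listed in Table 1 with the longer exploration phase.
    long_exploration = {"Ant-v2", "HalfCheetah-v2"}
    return {
        "optimizer": "Adam",
        "max_time_steps": 1_000_000,
        "learning_rate": 3e-4,            # assumed standard SAC value
        "discount": 0.99,
        "replay_buffer_size": 1_000_000,
        "hidden_layers": 2,
        "hidden_units_per_layer": 256,
        "minibatch_size": 256,
        "nonlinearity": "relu",
        "target_smoothing_coefficient": 0.005,
        "target_update_interval": 1,
        "gradient_steps_critic": 1,
        "reward_scale": 1.0,
        "initial_alpha": 0.2,
        "alpha_learning_rate": 3e-4,
        "evaluation_frequency": 5_000,
        "evaluation_episodes": 10,
        # 25e3 exploration steps for Ant-v2/HalfCheetah-v2, 1e3 otherwise.
        "exploration_time_steps": 25_000 if env_name in long_exploration else 1_000,
    }

print(sac_config("Hopper-v2")["exploration_time_steps"])
```

Keeping the per-environment exploration budget inside one function makes it harder for the Ant-v2/HalfCheetah-v2 special case to drift out of sync across experiment scripts.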