Research Article
Reducing Entropy Overestimation in Soft Actor Critic Using Dual Policy Network
| Parameter | Value |
| --- | --- |
| Optimizer | Adam [21] |
| Max time steps | 1 million |
| Learning rate | |
| Discount | 0.99 |
| Replay buffer size | |
| Number of hidden layers | 2 |
| Number of hidden units per layer | 256 |
| Number of samples per minibatch | 256 |
| Entropy target | −dim(A) (e.g. −8 for Ant-v2) |
| Nonlinearity | ReLU |
| Target smoothing coefficient | 0.005 |
| Target update interval | 1 |
| Gradient steps (critic) | 1 |
| Reward scaling | 1.0 |
| Initial alpha | 0.2 |
| Alpha learning rate | 3 × 10⁻⁴ |
| Evaluation frequency | every 5 × 10³ steps |
| Evaluation episodes | 10 |
| Exploration time steps | 25 × 10³ for Ant-v2 and HalfCheetah; 1 × 10³ for Hopper-v2 and Walker2d |
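As a rough illustration of how these settings could be collected in a training script, the sketch below groups them in a Python dataclass. The class, field, and helper names are hypothetical (not taken from the paper), the learning rate and replay buffer size are left as placeholders because the table does not give their values, and the entropy-target helper simply encodes the standard SAC heuristic of −dim(action space) referenced in the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SACConfig:
    """Hypothetical container for the hyperparameters listed in the table above."""
    optimizer: str = "adam"                 # Adam [21]
    max_time_steps: int = 1_000_000         # 1 million environment steps
    learning_rate: Optional[float] = None   # placeholder: value not given in the table
    discount: float = 0.99
    replay_buffer_size: Optional[int] = None  # placeholder: value not given in the table
    num_hidden_layers: int = 2
    hidden_units_per_layer: int = 256
    minibatch_size: int = 256
    nonlinearity: str = "relu"
    target_smoothing_coef: float = 0.005    # tau for the target-network update
    target_update_interval: int = 1
    critic_gradient_steps: int = 1
    reward_scale: float = 1.0
    initial_alpha: float = 0.2
    alpha_learning_rate: float = 3e-4
    eval_frequency: int = 5_000             # evaluate every 5 x 10^3 steps
    eval_episodes: int = 10


def entropy_target(action_dim: int) -> float:
    """SAC heuristic: target entropy = -dim(action space), e.g. -8.0 for Ant-v2."""
    return -float(action_dim)


def exploration_time_steps(env_name: str) -> int:
    """Initial random-exploration steps, as listed per environment in the table."""
    if env_name in ("Ant-v2", "HalfCheetah-v2"):
        return 25_000
    return 1_000  # Hopper-v2 and Walker2d-v2
```

For example, `SACConfig(learning_rate=3e-4)` would instantiate the configuration with a chosen learning rate while keeping the tabulated defaults for everything else.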