Research Article
Reducing Entropy Overestimation in Soft Actor Critic Using Dual Policy Network
Table 2
Max evaluation reward achieved during learning.
| Environment | Algorithm | Max value | S.D. |
|---|---|---|---|
| Ant-v2 | SAC | 4536.18 | 1425.31 |
| Ant-v2 | TD3 | 4360.79 | 2081.05 |
| Ant-v2 | Ours | 5151.27 | 1600.29 |
| HalfCheetah-v2 | SAC | 8308.59 | 4298.95 |
| HalfCheetah-v2 | TD3 | -1.65 | 0 |
| HalfCheetah-v2 | Ours | 9158.21 | 4414.13 |
| Hopper-v2 | SAC | 2905.15 | 1388.34 |
| Hopper-v2 | TD3 | 2622.66 | 1245.89 |
| Hopper-v2 | Ours | 2812.76 | 1311.20 |
| Walker2d-v2 | SAC | 3611.99 | 1670.66 |
| Walker2d-v2 | TD3 | 3513.84 | 1635.82 |
| Walker2d-v2 | Ours | 4357.33 | 2090.74 |