1: Set initial policy parameters $\theta_1, \dots, \theta_n$, $Q$-function parameters $\phi_1$, $\phi_2$, an empty replay buffer $\mathcal{D}$, and discount factor $\gamma$
2: Set target parameters equal to main parameters: $\phi_{\text{targ},1} \leftarrow \phi_1$, $\phi_{\text{targ},2} \leftarrow \phi_2$
3: repeat
4: Observe state $s$ and select an action from the policy with minimum entropy:
5: $a \sim \pi_{\theta_k}(\cdot \mid s)$
6: {$k$ is the index of the policy with minimum entropy}
7: Execute $a$ in the environment
8: Observe next state $s'$, reward $r$, and done signal $d$ to indicate whether $s'$ is terminal
9: Store $(s, a, r, s', d)$ in replay buffer $\mathcal{D}$
10: if $s'$ is terminal then
11: reset environment state
12: end if
13: if it is time to update then
14: for $j$ in range(no. of updates required) do
15: Randomly sample a batch of transitions, $B = \{(s, a, r, s', d)\}$, from $\mathcal{D}$
16: Compute targets for the $Q$-functions:
17: $y(r, s', d) = r + \gamma (1 - d) \left( \min_{i=1,2} Q_{\phi_{\text{targ},i}}(s', \tilde{a}') - \alpha \log \pi_{\theta_k}(\tilde{a}' \mid s') \right)$
18: {$k$ is the index of the policy with minimum entropy}
19: where $\tilde{a}' \sim \pi_{\theta_k}(\cdot \mid s')$
20: Update $Q$-functions by one step of gradient descent using $\nabla_{\phi_i} \frac{1}{|B|} \sum_{(s,a,r,s',d) \in B} \left( Q_{\phi_i}(s, a) - y(r, s', d) \right)^2$
21: for $i = 1, 2$
22: Update policy by one step of gradient ascent using
23: $\nabla_{\theta_k} \frac{1}{|B|} \sum_{s \in B} \left( \min_{i=1,2} Q_{\phi_i}(s, \tilde{a}_{\theta_k}(s)) - \alpha \log \pi_{\theta_k}(\tilde{a}_{\theta_k}(s) \mid s) \right)$
24: {$k$ is the index of the policy with minimum entropy}
25: $\tilde{a}_{\theta_k}(s) \sim \pi_{\theta_k}(\cdot \mid s)$
26: {where $\tilde{a}_{\theta_k}(s)$ is a sample from $\pi_{\theta_k}(\cdot \mid s)$ which is differentiable w.r.t. $\theta_k$ via the reparametrization trick.}
27: Update target networks with:
28: $\phi_{\text{targ},i} \leftarrow \rho \phi_{\text{targ},i} + (1 - \rho) \phi_i$ for $i = 1, 2$
29: {where $\rho$ is the polyak coefficient (always between 0 and 1, usually close to 1).}
30: end for
31: end if
32: until convergence
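To make the action-selection step in lines 4-6 concrete, here is a minimal PyTorch sketch of picking the minimum-entropy policy among a set of candidates. The `policies` list, its `(mean, std)` output interface, and the helper name `select_min_entropy_action` are illustrative assumptions, not part of the algorithm above.

```python
import torch
from torch.distributions import Normal

def select_min_entropy_action(policies, state):
    # Assumption: each policy network maps a state to the (mean, std)
    # of a diagonal Gaussian over actions.
    dists = [Normal(*pi(state)) for pi in policies]
    # Entropy of a diagonal Gaussian, summed over action dimensions.
    entropies = torch.stack([d.entropy().sum(-1) for d in dists])
    k = int(torch.argmin(entropies))  # line 6: index of the min-entropy policy
    action = dists[k].rsample()       # line 5: a ~ pi_{theta_k}(.|s)
    return action, k
```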
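Similarly, the inner update (lines 13-31) can be sketched as a single gradient step. This is a simplified sketch under stated assumptions: diagonal-Gaussian policies with the same `(mean, std)` interface as above and no tanh squashing, Q networks `q1`/`q2` returning one value per batch element, the done flag `d` stored as a float, and hypothetical optimizers `q_optim` (over both Q networks) and `pi_optim` (over the selected policy). It is not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

def sac_update(batch, policies, k, q1, q2, q1_targ, q2_targ,
               q_optim, pi_optim, gamma=0.99, alpha=0.2, rho=0.995):
    s, a, r, s2, d = batch  # a minibatch B of transitions (line 15)

    # Targets (lines 16-19): clipped double-Q value plus entropy bonus,
    # with the next action drawn from the minimum-entropy policy pi_{theta_k}.
    with torch.no_grad():
        dist2 = Normal(*policies[k](s2))
        a2 = dist2.rsample()
        logp2 = dist2.log_prob(a2).sum(-1)
        q_targ = torch.min(q1_targ(s2, a2), q2_targ(s2, a2))
        y = r + gamma * (1 - d) * (q_targ - alpha * logp2)

    # Q update (lines 20-21): one gradient-descent step on the MSE to y.
    q_optim.zero_grad()
    q_loss = F.mse_loss(q1(s, a), y) + F.mse_loss(q2(s, a), y)
    q_loss.backward()
    q_optim.step()

    # Policy update (lines 22-26): one gradient-ascent step on the
    # entropy-regularized value, via a reparameterized action sample.
    pi_optim.zero_grad()
    dist = Normal(*policies[k](s))
    a_pi = dist.rsample()  # differentiable w.r.t. theta_k (line 26)
    logp = dist.log_prob(a_pi).sum(-1)
    pi_loss = (alpha * logp - torch.min(q1(s, a_pi), q2(s, a_pi))).mean()
    pi_loss.backward()     # gradients reaching q1/q2 here are cleared
    pi_optim.step()        # by q_optim.zero_grad() on the next call

    # Target networks (lines 27-29): polyak averaging with coefficient rho.
    with torch.no_grad():
        for net, targ in ((q1, q1_targ), (q2, q2_targ)):
            for p, p_t in zip(net.parameters(), targ.parameters()):
                p_t.mul_(rho).add_((1 - rho) * p)
```

A full implementation would typically use a tanh-squashed Gaussian with the corresponding log-probability correction, as in standard SAC; that detail is omitted here to keep the sketch short.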