Research Article

Reducing Entropy Overestimation in Soft Actor Critic Using Dual Policy Network

Algorithm 1

Proposed Soft Actor Critic Algorithm.
1: Set initial policy parameters $\theta_1$, $\theta_2$, $Q$-function parameters $\phi_1$, $\phi_2$, empty replay buffer $\mathcal{D}$, discount factor $\gamma$
2: Set target parameters equal to main parameters $\phi_{\text{targ},1} \leftarrow \phi_1$, $\phi_{\text{targ},2} \leftarrow \phi_2$
3: repeat
4: {Observe state $s$ and select action against minimum entropy}
5: $a \sim \pi_{\theta_m}(\cdot \mid s)$
6: $m = \arg\min_{j \in \{1,2\}} \mathcal{H}\big(\pi_{\theta_j}(\cdot \mid s)\big)$ {$m$ is the index of the policy with minimum entropy}
7: Execute $a$ in the environment
8: Observe next state $s'$, reward $r$, and done signal $d$ to indicate whether $s'$ is terminal
9: Store $(s, a, r, s', d)$ in replay buffer $\mathcal{D}$
10: if $s'$ is terminal then
11:  reset environment state
12: end if
13: if it is time to update then
14:  for $j$ in range (no. of updates required) do
15:   Randomly sample a batch of transitions $B = \{(s, a, r, s', d)\}$ from $\mathcal{D}$
16:   Compute targets for the $Q$ functions:
17:   $y(r, s', d) = r + \gamma (1 - d) \left( \min_{i=1,2} Q_{\phi_{\text{targ},i}}(s', \tilde{a}') - \alpha \log \pi_{\theta_m}(\tilde{a}' \mid s') \right)$
18:    {$m$ is the index of the policy with minimum entropy}
19:   $\tilde{a}' \sim \pi_{\theta_m}(\cdot \mid s')$
20:   Update $Q$-functions by one step of gradient descent using
21:   $\nabla_{\phi_i} \frac{1}{|B|} \sum_{(s, a, r, s', d) \in B} \left( Q_{\phi_i}(s, a) - y(r, s', d) \right)^2$ for $i = 1, 2$
22:   Update policy by one step of gradient ascent using
23:   $\nabla_{\theta_m} \frac{1}{|B|} \sum_{s \in B} \left( \min_{i=1,2} Q_{\phi_i}\big(s, \tilde{a}_{\theta_m}(s)\big) - \alpha \log \pi_{\theta_m}\big(\tilde{a}_{\theta_m}(s) \mid s\big) \right)$
24:    {$m$ is the index of the policy with minimum entropy}
25:   $\tilde{a}_{\theta_m}(s) \sim \pi_{\theta_m}(\cdot \mid s)$,
26:   {where $\tilde{a}_{\theta_m}(s)$ is a sample from $\pi_{\theta_m}(\cdot \mid s)$ which is differentiable w.r.t. $\theta_m$ via the reparametrization trick.}
27:   Update target networks with:
28:   $\phi_{\text{targ},i} \leftarrow \rho \, \phi_{\text{targ},i} + (1 - \rho) \phi_i$ for $i = 1, 2$
29:   {where $\rho$ is polyak. (Always between 0 and 1, usually close to 1.)}
30:  end for
31: end if
32: until convergence
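
To make the minimum-entropy selection rule of steps 4-6 concrete, the following is a minimal PyTorch sketch. The policy heads `pi_1` and `pi_2` are hypothetical placeholders assumed to map an unbatched state tensor to the mean and log-standard-deviation of a diagonal Gaussian; the comparison uses the analytic Gaussian entropy before tanh squashing. This is an illustrative sketch under those assumptions, not the authors' implementation.

```python
import torch
from torch.distributions import Normal

def select_action(state, pi_1, pi_2):
    """Sample an action from whichever of the two policies currently has lower entropy."""
    dists = []
    for pi in (pi_1, pi_2):
        mu, log_std = pi(state)                       # each policy head is assumed to return (mean, log-std)
        dists.append(Normal(mu, log_std.exp()))
    # Entropy of a diagonal Gaussian is the sum of its per-dimension entropies.
    entropies = [d.entropy().sum(dim=-1).item() for d in dists]
    m = 0 if entropies[0] <= entropies[1] else 1      # index of the minimum-entropy policy
    action = dists[m].rsample()                       # a ~ pi_{theta_m}(.|s), reparameterized
    return torch.tanh(action), m                      # tanh squashing to bounded actions, as in SAC
```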
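A similar sketch for the Q-targets and critic update of steps 16-21. It assumes hypothetical critic modules `q1`, `q2`, their target copies `q1_targ`, `q2_targ`, and a helper `sample_min_entropy_policy` that returns a reparameterized action and its log-probability under the minimum-entropy policy $\pi_{\theta_m}$:

```python
import torch
import torch.nn.functional as F

def compute_targets(batch, q1_targ, q2_targ, sample_min_entropy_policy, gamma=0.99, alpha=0.2):
    """Bellman targets y(r, s', d) for both Q-functions (steps 16-19)."""
    s, a, r, s2, d = batch                                  # states, actions, rewards, next states, done flags
    with torch.no_grad():
        a2, logp_a2 = sample_min_entropy_policy(s2)         # a' ~ pi_{theta_m}(.|s') and log pi_{theta_m}(a'|s')
        min_q_targ = torch.min(q1_targ(s2, a2), q2_targ(s2, a2))
        y = r + gamma * (1.0 - d) * (min_q_targ - alpha * logp_a2)
    return y

def q_loss(batch, y, q1, q2):
    """Mean-squared Bellman error for both critics (steps 20-21)."""
    s, a, r, s2, d = batch
    return F.mse_loss(q1(s, a), y) + F.mse_loss(q2(s, a), y)
```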
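Finally, a sketch of the policy-gradient step and the polyak averaging of steps 22-29, under the same hypothetical module names; minimizing `policy_loss` with an optimizer performs the gradient-ascent step of the algorithm:

```python
import torch

def policy_loss(s, q1, q2, sample_min_entropy_policy, alpha=0.2):
    """Negated SAC objective for the selected policy (steps 22-26)."""
    a_tilde, logp = sample_min_entropy_policy(s)    # reparameterized, differentiable w.r.t. theta_m
    min_q = torch.min(q1(s, a_tilde), q2(s, a_tilde))
    return (alpha * logp - min_q).mean()

@torch.no_grad()
def polyak_update(net, net_targ, rho=0.995):
    """phi_targ <- rho * phi_targ + (1 - rho) * phi for every parameter tensor (steps 27-29)."""
    for p, p_targ in zip(net.parameters(), net_targ.parameters()):
        p_targ.mul_(rho).add_((1.0 - rho) * p)
```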