1: Set initial policy parameters $\theta_1, \dots, \theta_n$, $Q$-function parameters $\phi_1$, $\phi_2$, an empty replay buffer $\mathcal{D}$, and discount factor $\gamma$
2: Set target parameters equal to main parameters: $\phi_{\text{targ},1} \leftarrow \phi_1$, $\phi_{\text{targ},2} \leftarrow \phi_2$
3: repeat
4: Observe state $s$ and select an action from the policy with minimum entropy:
5: $a \sim \pi_{\theta_k}(\cdot \mid s)$
6: {$k$ is the index of the policy with minimum entropy}
7: Execute $a$ in the environment
8: Observe next state $s'$, reward $r$, and done signal $d$ to indicate whether $s'$ is terminal
9: Store $(s, a, r, s', d)$ in replay buffer $\mathcal{D}$
10: if $s'$ is terminal then
11: reset environment state
12: end if
13: if it is time to update then
14: for $j$ in range(no. of updates required) do
15: Randomly sample a batch of transitions, $B = \{(s, a, r, s', d)\}$, from $\mathcal{D}$
16: Compute targets for the $Q$-functions:
17: $y(r, s', d) = r + \gamma (1 - d) \left( \min_{i=1,2} Q_{\phi_{\text{targ},i}}(s', \tilde{a}') - \alpha \log \pi_{\theta_k}(\tilde{a}' \mid s') \right)$
18: {$k$ is the index of the policy with minimum entropy}
19: where $\tilde{a}' \sim \pi_{\theta_k}(\cdot \mid s')$
20: Update $Q$-functions by one step of gradient descent using $\nabla_{\phi_i} \frac{1}{|B|} \sum_{(s,a,r,s',d) \in B} \left( Q_{\phi_i}(s, a) - y(r, s', d) \right)^2$
21: for $i = 1, 2$
22: Update policy by one step of gradient ascent using
23: $\nabla_{\theta_k} \frac{1}{|B|} \sum_{s \in B} \left( \min_{i=1,2} Q_{\phi_i}(s, \tilde{a}_{\theta_k}(s)) - \alpha \log \pi_{\theta_k}(\tilde{a}_{\theta_k}(s) \mid s) \right)$
24: {$k$ is the index of the policy with minimum entropy}
25: $\tilde{a}_{\theta_k}(s) \sim \pi_{\theta_k}(\cdot \mid s)$
26: {where $\tilde{a}_{\theta_k}(s)$ is a sample from $\pi_{\theta_k}(\cdot \mid s)$ which is differentiable w.r.t. $\theta_k$ via the reparametrization trick.}
27: Update target networks with:
28: $\phi_{\text{targ},i} \leftarrow \rho \phi_{\text{targ},i} + (1 - \rho) \phi_i$ for $i = 1, 2$
29: {where $\rho$ is the polyak coefficient (always between 0 and 1, usually close to 1).}
30: end for
31: end if
32: until convergence
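To make the action-selection step in lines 4-6 concrete, here is a minimal PyTorch sketch of picking the minimum-entropy policy among a set of candidates. The `policies` list, its `(mean, std)` output interface, and the helper name `select_min_entropy_action` are illustrative assumptions, not part of the algorithm above.

```python
import torch
from torch.distributions import Normal

def select_min_entropy_action(policies, state):
    # Assumption: each policy network maps a state to the (mean, std)
    # of a diagonal Gaussian over actions.
    dists = [Normal(*pi(state)) for pi in policies]
    # Entropy of a diagonal Gaussian, summed over action dimensions.
    entropies = torch.stack([d.entropy().sum(-1) for d in dists])
    k = int(torch.argmin(entropies))  # line 6: index of the min-entropy policy
    action = dists[k].rsample()       # line 5: a ~ pi_{theta_k}(.|s)
    return action, k
```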
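Similarly, the inner update (lines 13-31) can be sketched as a single gradient step. This is a simplified sketch under stated assumptions: diagonal-Gaussian policies with the same `(mean, std)` interface as above and no tanh squashing, Q networks `q1`/`q2` returning one value per batch element, the done flag `d` stored as a float, and hypothetical optimizers `q_optim` (over both Q networks) and `pi_optim` (over the selected policy). It is not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

def sac_update(batch, policies, k, q1, q2, q1_targ, q2_targ,
               q_optim, pi_optim, gamma=0.99, alpha=0.2, rho=0.995):
    s, a, r, s2, d = batch  # a minibatch B of transitions (line 15)

    # Targets (lines 16-19): clipped double-Q value plus entropy bonus,
    # with the next action drawn from the minimum-entropy policy pi_{theta_k}.
    with torch.no_grad():
        dist2 = Normal(*policies[k](s2))
        a2 = dist2.rsample()
        logp2 = dist2.log_prob(a2).sum(-1)
        q_targ = torch.min(q1_targ(s2, a2), q2_targ(s2, a2))
        y = r + gamma * (1 - d) * (q_targ - alpha * logp2)

    # Q update (lines 20-21): one gradient-descent step on the MSE to y.
    q_optim.zero_grad()
    q_loss = F.mse_loss(q1(s, a), y) + F.mse_loss(q2(s, a), y)
    q_loss.backward()
    q_optim.step()

    # Policy update (lines 22-26): one gradient-ascent step on the
    # entropy-regularized value, via a reparameterized action sample.
    pi_optim.zero_grad()
    dist = Normal(*policies[k](s))
    a_pi = dist.rsample()  # differentiable w.r.t. theta_k (line 26)
    logp = dist.log_prob(a_pi).sum(-1)
    pi_loss = (alpha * logp - torch.min(q1(s, a_pi), q2(s, a_pi))).mean()
    pi_loss.backward()     # gradients reaching q1/q2 here are cleared
    pi_optim.step()        # by q_optim.zero_grad() on the next call

    # Target networks (lines 27-29): polyak averaging with coefficient rho.
    with torch.no_grad():
        for net, targ in ((q1, q1_targ), (q2, q2_targ)):
            for p, p_t in zip(net.parameters(), targ.parameters()):
                p_t.mul_(rho).add_((1 - rho) * p)
```

A full implementation would typically use a tanh-squashed Gaussian with the corresponding log-probability correction, as in standard SAC; that detail is omitted here to keep the sketch short.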