(1) | Input parameters: $M$, $T$, $N$, $\tau$, $\gamma$, $m_{\min}$, $m_{\max}$
(2) | Randomly initialize the critic network $Q(s, a|\theta^Q)$ and the actor $\mu(s|\theta^\mu)$ with weights $\theta^Q$ and $\theta^\mu$
(3) | Initialize the target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^Q$ and $\theta^{\mu'} \leftarrow \theta^\mu$
(4) | Initialize the replay buffer $R$
(5) | For episode = 1 to $M$ do
(6) | Initialize a random process $\mathcal{N}$ for action exploration
(7) | Initialize the AUV simulation environment |
(8) | Receive the initial observation state $s_1$ from the AUV simulation environment
(9) | For step $t = 1$ to $T$ do
(10) | Select action $a_t = \mu(s_t|\theta^\mu) + \mathcal{N}_t$ according to the current policy and exploration noise
(11) | Execute action $a_t$ in the AUV simulation environment
(12) | If $|R| \geq m_{\min}$ then
(13) | Sample a random minibatch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$
(14) | Set $y_i = r_i + \gamma Q'\big(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})\,\big|\,\theta^{Q'}\big)$
(15) | Update the critic by minimizing the loss: $L = \frac{1}{N}\sum_i \big(y_i - Q(s_i, a_i|\theta^Q)\big)^2$
(16) | Update the actor policy using the sampled policy gradient: $\nabla_{\theta^\mu} J \approx \frac{1}{N}\sum_i \nabla_a Q(s, a|\theta^Q)\big|_{s=s_i,\, a=\mu(s_i)} \nabla_{\theta^\mu} \mu(s|\theta^\mu)\big|_{s_i}$
(17) | Update the target Q network: $\theta^{Q'} \leftarrow \tau\theta^Q + (1-\tau)\theta^{Q'}$
(18) | Update the target policy network: $\theta^{\mu'} \leftarrow \tau\theta^\mu + (1-\tau)\theta^{\mu'}$
(19) | End if
(20) | If $|R| > m_{\max}$ then
(21) | Remove the oldest stored data from the replay buffer $R$
(22) | End if |
(23) | Obtain the new state $s_{t+1}$
(24) | Obtain the reward $r_t$
(25) | Store the transition $(s_t, a_t, r_t, s_{t+1})$ in $R$
(26) | End for |
(27) | End for |
(28) | Output the trained critic $Q$ and actor $\mu$ with weights $\theta^Q$ and $\theta^\mu$
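
As a concrete reference point, below is a minimal PyTorch sketch of this training loop. It is an illustration under assumptions, not the paper's implementation: `AUVEnvStub` is a hypothetical stand-in for the AUV simulation environment, and the network sizes, exploration-noise scale, learning rates, and the values chosen for $M$, $T$, $N$, $\tau$, $\gamma$, $m_{\min}$, and $m_{\max}$ are placeholders. Only the control flow mirrors the pseudocode: updates begin once the buffer holds $m_{\min}$ transitions (step 12), the targets $y_i$ and the critic/actor updates follow steps 14–16, and the target networks are softly updated per steps 17–18.

```python
# Minimal DDPG sketch of the loop above; all hyperparameters and the
# environment are illustrative assumptions, not values from this work.
import collections
import random

import numpy as np
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2        # placeholder dimensions for the AUV task
M, T, N = 100, 200, 64              # episodes, steps per episode, minibatch size
TAU, GAMMA = 0.005, 0.99            # soft-update rate and discount factor
M_MIN, M_MAX = 1_000, 50_000        # replay-buffer thresholds (m_min, m_max)


class AUVEnvStub:
    """Hypothetical stand-in for the AUV simulation environment."""

    def reset(self):
        return np.zeros(STATE_DIM, dtype=np.float32)

    def step(self, action):
        next_state = np.random.randn(STATE_DIM).astype(np.float32)
        reward = -float(np.linalg.norm(action))   # placeholder reward signal
        return next_state, reward, False          # state, reward, done


def mlp(in_dim, out_dim, out_act=None):
    layers = [nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)


def soft_update(target, source):
    # theta' <- tau * theta + (1 - tau) * theta'   (steps 17-18)
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.mul_(1.0 - TAU).add_(TAU * sp.data)


actor = mlp(STATE_DIM, ACTION_DIM, nn.Tanh())       # mu(s | theta_mu)
critic = mlp(STATE_DIM + ACTION_DIM, 1)             # Q(s, a | theta_Q)
actor_tgt = mlp(STATE_DIM, ACTION_DIM, nn.Tanh())   # mu'
critic_tgt = mlp(STATE_DIM + ACTION_DIM, 1)         # Q'
actor_tgt.load_state_dict(actor.state_dict())       # theta_mu' <- theta_mu
critic_tgt.load_state_dict(critic.state_dict())     # theta_Q'  <- theta_Q

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
buffer = collections.deque(maxlen=M_MAX)  # evicts the oldest entry (steps 20-21)
env = AUVEnvStub()

for episode in range(M):
    state = env.reset()                                     # steps 7-8
    for t in range(T):
        # Steps 10-11: policy action plus Gaussian exploration noise.
        with torch.no_grad():
            action = actor(torch.as_tensor(state)).numpy()
        action = np.clip(action + 0.1 * np.random.randn(ACTION_DIM), -1.0, 1.0)
        next_state, reward, done = env.step(action)

        # Steps 12-18: update networks once the buffer holds m_min samples.
        if len(buffer) >= M_MIN:
            batch = random.sample(buffer, N)
            s, a, r, s2 = (torch.as_tensor(np.array(x), dtype=torch.float32)
                           for x in zip(*batch))
            with torch.no_grad():                           # step 14: targets y_i
                q_next = critic_tgt(torch.cat([s2, actor_tgt(s2)], 1)).squeeze(1)
                y = r + GAMMA * q_next
            q = critic(torch.cat([s, a], 1)).squeeze(1)
            critic_loss = ((y - q) ** 2).mean()             # step 15: MSE loss
            critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
            # Step 16: ascend the policy gradient = descend -Q(s, mu(s)).
            actor_loss = -critic(torch.cat([s, actor(s)], 1)).mean()
            actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
            soft_update(critic_tgt, critic)                 # step 17
            soft_update(actor_tgt, actor)                   # step 18

        # Steps 23-25: observe s_{t+1} and r_t, store the transition in R.
        buffer.append((state, np.asarray(action, dtype=np.float32),
                       float(reward), next_state))
        state = next_state
        if done:
            break
```

Note that a `deque` with `maxlen=M_MAX` realizes steps 20–21 implicitly: once the buffer is full, each `append` discards the oldest stored transition.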