Initialize ,,,,,,,,. |
Initialize experience pool and mini-batches . |
Initialize the parameter of the estimation network as . |
Initialize the parameter of the target network . |
1: For episode do |
2: For time-slot do |
3: Input into the estimation network and output the ; |
4: Select the action using the adaptive policy algorithm and update according to Equation (19); |
5: Execute action and generate the observation and ; |
6: Compute from ,; |
7: Store into the experience-replay pool . |
8: If then |
9: Randomly generate an index subset ; |
10: Sample from ; |
11: For each sample in do |
12: Compute the and obtain . |
13: End for |
14: Calculate the loss function according to Equation (16) and update according to Equation (17); |
15: Minimize the loss function with learning rate . |
17: End if |
18: Every time slots: Update by setting . |
19: End for |
20: End for |
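
The loop above follows the familiar deep Q-learning pattern: an estimation network picks actions, transitions go into an experience-replay pool, mini-batches drawn from the pool drive a loss-minimizing update, and the target network is periodically overwritten with the estimation network's parameters. The following is a minimal sketch of that pattern, not the authors' implementation. Everything named here is an assumption made for illustration: the linear Q-function `LinearQ` stands in for the neural networks, `toy_env_step` is a hypothetical environment, the decaying epsilon-greedy rule is a stand-in for the adaptive policy of Equation (19), the squared temporal-difference error is a stand-in for the loss of Equation (16), the update condition in step 8 is assumed to be "the pool holds at least one mini-batch", and all hyperparameter values are arbitrary.

```python
from collections import deque

import numpy as np


class LinearQ:
    """Linear Q-function Q(s) = W s + b (assumed stand-in for a neural network)."""

    def __init__(self, state_dim, n_actions, rng):
        self.W = rng.normal(scale=0.1, size=(n_actions, state_dim))
        self.b = np.zeros(n_actions)

    def q_values(self, state):
        return self.W @ state + self.b

    def copy_from(self, other):
        self.W = other.W.copy()
        self.b = other.b.copy()


def toy_env_step(state, action, rng):
    """Hypothetical environment: reward 1 when the action matches the sign of state[0]."""
    reward = 1.0 if action == int(state[0] > 0) else 0.0
    next_state = rng.normal(size=state.shape)
    return next_state, reward


def train(episodes=20, slots=50, batch_size=16, pool_size=500,
          gamma=0.9, lr=0.05, sync_every=10, seed=0):
    rng = np.random.default_rng(seed)
    state_dim, n_actions = 4, 2
    est = LinearQ(state_dim, n_actions, rng)   # estimation network
    tgt = LinearQ(state_dim, n_actions, rng)   # target network
    tgt.copy_from(est)
    pool = deque(maxlen=pool_size)             # experience-replay pool
    eps, step, losses = 1.0, 0, []

    for _ in range(episodes):
        state = rng.normal(size=state_dim)
        for _ in range(slots):
            q = est.q_values(state)                                # step 3
            if rng.random() < eps:                                 # step 4
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(q))
            eps = max(0.05, eps * 0.995)  # assumed decay, not Equation (19)

            next_state, reward = toy_env_step(state, action, rng)  # steps 5-6
            pool.append((state, action, reward, next_state))       # step 7
            state = next_state
            step += 1

            if len(pool) >= batch_size:                            # step 8 (assumed condition)
                idx = rng.choice(len(pool), size=batch_size, replace=False)  # step 9
                batch = [pool[int(i)] for i in idx]                # step 10
                grad_W = np.zeros_like(est.W)
                grad_b = np.zeros_like(est.b)
                loss = 0.0
                for s, a, r, s2 in batch:                          # steps 11-13
                    y = r + gamma * np.max(tgt.q_values(s2))       # target value
                    err = est.q_values(s)[a] - y
                    loss += err ** 2
                    grad_W[a] += 2.0 * err * s
                    grad_b[a] += 2.0 * err
                est.W -= lr * grad_W / batch_size                  # steps 14-15
                est.b -= lr * grad_b / batch_size
                losses.append(loss / batch_size)

            if step % sync_every == 0:                             # step 18
                tgt.copy_from(est)
    return est, tgt, losses
```

Calling `train(episodes=2, slots=10, batch_size=4, sync_every=5)` runs the full loop on the toy environment and returns both networks together with the per-update mean squared error. Keeping the target network fixed between synchronizations is what keeps the regression target `y` from shifting on every gradient step.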