Input: historical data set H, replay memory D, maximum number of training episodes N, a constant Z, an initialized evaluation network Q and target network Q̂
Output: trained evaluation network Q
1: for episode = 1 to N do
2:   Initialize the worker state s
3:   while s is not the termination state do
4:     if s is the initial state then
5:       Take the tasks with the corresponding period in the historical data set H as the action set of the current state s
6:     else
7:       Obtain the action set of the current state s from the historical data set H according to the spatiotemporal constraints of Equations (6) and (7)
8:     end if
9:     if the action set of s is empty then
10:      The worker executes the virtual task, the state transitions to s′, and the reward r is 0
11:      Store (s, a, r, s′, done) in the replay memory D, where s′ is the next state, r is the reward, and done indicates whether s′ is the termination state
12:    else
13:      Take s as input to the evaluation network Q to get the Q value of each state-action pair
14:      Use the ε-greedy method to select the corresponding action a from the current output Q values
15:      Get s′, r, done according to action a, and store (s, a, r, s′, done) in the replay memory D
16:    end if
17:    if the replay memory D is full then
18:      Overwrite the oldest piece of data in D and randomly sample a mini-batch for learning
19:      Calculate the target value y according to Equation (12)
20:      Update the parameters of the evaluation network Q by gradient descent on the loss function of Equation (11)
21:      Update the target network Q̂ parameters every Z steps
22:    end if
23:    s = s′
24:  end while
25: end for
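To make the control flow above concrete, the following is a minimal sketch of the training loop in Python with PyTorch. It assumes states are fixed-size feature vectors and actions are indices into a candidate task set; the spatiotemporal filtering of Equations (6) and (7), the environment transition, and the exact target and loss of Equations (12) and (11) are not given in this excerpt, so `feasible_actions`, `env_step`, and the standard DQN target/MSE loss below are placeholders rather than the paper's definitions.

```python
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, MAX_ACTIONS = 16, 32   # assumed sizes of the state vector / task set
N, Z = 500, 100                   # episodes, target-sync period (the constant Z)
GAMMA, EPS = 0.99, 0.1            # discount factor, exploration rate (assumed)
BATCH, CAPACITY = 64, 1_000       # mini-batch size, replay memory size (assumed)

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, MAX_ACTIONS))
target_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, MAX_ACTIONS))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=CAPACITY)   # a full deque discards its oldest entry (step 18)

def feasible_actions(state):
    # Placeholder for filtering H by the spatiotemporal constraints of Eqs. (6)-(7).
    return [a for a in range(MAX_ACTIONS) if random.random() < 0.5]

def env_step(state, action):
    # Placeholder transition: random next state/reward, ~10% chance of termination.
    reward = 0.0 if action is None else random.random()
    return torch.randn(STATE_DIM), reward, random.random() < 0.1

grad_steps = 0
for episode in range(N):                          # step 1
    state, done = torch.randn(STATE_DIM), False   # step 2 (placeholder initial state)
    while not done:                               # step 3
        actions = feasible_actions(state)         # steps 4-8
        if not actions:                           # steps 9-11: virtual task, reward 0
            next_state, _, done = env_step(state, None)
            replay.append((state, 0, 0.0, next_state, done))  # index 0 stands in for the virtual task
        else:                                     # steps 12-15
            with torch.no_grad():
                q_values = q_net(state)           # step 13: Q value per action
            if random.random() < EPS:             # step 14: epsilon-greedy selection
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: q_values[a].item())
            next_state, reward, done = env_step(state, action)
            replay.append((state, action, reward, next_state, done))
        if len(replay) == CAPACITY:               # steps 17-18: learn once D is full
            batch = random.sample(replay, BATCH)
            s = torch.stack([b[0] for b in batch])
            a = torch.tensor([b[1] for b in batch])
            r = torch.tensor([b[2] for b in batch], dtype=torch.float32)
            s2 = torch.stack([b[3] for b in batch])
            d = torch.tensor([float(b[4]) for b in batch])
            with torch.no_grad():                 # step 19: standard DQN stand-in for Eq. (12)
                y = r + GAMMA * (1 - d) * target_net(s2).max(dim=1).values
            q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            loss = nn.functional.mse_loss(q, y)   # step 20: MSE stand-in for Eq. (11)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            grad_steps += 1
            if grad_steps % Z == 0:               # step 21: periodic target-network sync
                target_net.load_state_dict(q_net.state_dict())
        state = next_state                        # step 23
```

A `deque` with `maxlen` reproduces the overwrite behaviour of step 18: appending to a full deque silently discards the oldest transition, so no explicit eviction logic is needed.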