Input: , , , , , , , , , , , |
(1) Initialize: , , , , |
(2) Loop |
(3) , , , |
(4) Repeat (for all episodes) |
(5) Choose according to |
(6) Execute the action: |
(7) Observe the reward and the next state: |
% Update the global model |
(8) Predict the next state: |
|
(9) Predict the reward: |
(10) Update the parameters : |
|
(11) Update the parameter : |
% Update the local model |
(12) If |
(13) Insert the real sample into the memory |
(14) Else If |
(15) Replace the oldest one in with the real sample |
(16) End If |
(17) Select L-nearest neighbors of the current state from to construct and |
(18) Predict the next state and the reward: |
(19) Update the parameter : |
(20) Compute the local error: |
|
(21) If |
(22) Call Local-model planning () (Algorithm 3) |
(23) End If |
% Update the value function |
(24) Update the eligibility: |
(25) Estimate the TD error: |
(26) Update the value-function parameter: |
% Update the policy |
(27) Update the policy parameter: |
(28) |
(29) Update the number of samples: |
(30) Until the ending condition is satisfied |
(31) Call Global-model planning () (Algorithm 4) |
(32) End Loop |
Output: , |
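The local-memory steps (12)-(17) and the value-function steps (24)-(26) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the symbols of the listing are not available here, so every name (`LocalMemory`, `capacity`, `num_neighbors`, `td_lambda_step`, `alpha`, `gamma`, `lam`) is hypothetical. The sketch further assumes Euclidean distance for the neighbor search, a linear value function over a feature vector, and accumulating eligibility traces; the original algorithm may differ on each of these points.

```python
from collections import deque
import math


class LocalMemory:
    """Fixed-capacity sample memory (hypothetical helper for steps 12-17)."""

    def __init__(self, capacity):
        # A deque with maxlen appends while there is room and, once full,
        # silently discards the oldest entry on each new append.
        self.samples = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        """Store one real sample (insert if not full, else replace the oldest)."""
        self.samples.append((tuple(state), action, reward, tuple(next_state)))

    def nearest(self, state, num_neighbors):
        """Return the stored samples whose states are closest to `state`.

        Euclidean distance is an assumption; the listing does not name a metric.
        """
        query = tuple(state)
        ranked = sorted(self.samples, key=lambda s: math.dist(s[0], query))
        return ranked[:num_neighbors]


def td_lambda_step(theta, trace, phi, phi_next, reward, alpha, gamma, lam):
    """One linear TD(lambda) update, in the order of steps 24-26.

    theta, trace, phi and phi_next are equal-length lists of floats.
    Linear features and accumulating traces are assumptions of this sketch.
    """
    # Step 24: decay the eligibility trace and add the current features.
    trace = [gamma * lam * e + f for e, f in zip(trace, phi)]
    # Step 25: TD error from the current and next value estimates.
    value = sum(t * f for t, f in zip(theta, phi))
    value_next = sum(t * f for t, f in zip(theta, phi_next))
    td_error = reward + gamma * value_next - value
    # Step 26: move the value-function parameter along the trace.
    theta = [t + alpha * td_error * e for t, e in zip(theta, trace)]
    return theta, trace, td_error
```

In use, each real transition would be passed to `add`, and `nearest(current_state, L)` would supply the L neighbors from which the local model is fitted; `td_lambda_step` would then be called once per time step with the feature vectors of the current and next states.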