(1) Initialize the parameter vector, the eligibility trace vector, the discount factor, and the step-size parameter |
(2) Repeat (for every episode): |
(3) Current state ← initial state |
(4) According to (10), compute , |
(5) According to the ε-greedy policy, select the activation action |
(6) According to (13), select action when state is |
(7) According to (16), compute , , |
(8) According to (17) and (18), compute |
(9) Repeat (for each step of the episode): |
(10) Update eligibility trace: , , |
(11) Take the selected action, then observe the next state and the reward |
(12) |
(13) According to the ε-greedy policy, select the activation action |
(14) According to (13), select action when state is |
(15) According to (16), compute , , |
(16) According to (10), compute , |
(17) According to (17) and (18), compute |
(18) |
(19) |
(20) |
(21) Until the current state is the terminal state |
(22) Until the preset number of episodes is reached or another termination condition is met
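
The listing above has the general shape of an on-policy temporal-difference method with eligibility traces: reset the trace at the start of each episode, select actions ε-greedily, decay and update the trace, then adjust the parameters using the TD error. As a rough illustration of that loop only, the Python sketch below implements plain Sarsa(λ) with accumulating traces and one-hot linear features on a hypothetical five-state chain task. It does not reproduce equations (10), (13), or (16)-(18), and it omits the separate activation-action selection. The environment, the feature map, the default hyperparameters, and all names (`sarsa_lambda`, `features`, `step`, `epsilon_greedy`, `q_value`) are assumptions made for the example.

```python
import random

N_STATES = 5    # hypothetical chain: states 0..4, state 4 is terminal
N_ACTIONS = 2   # 0 = move left, 1 = move right


def features(state, action):
    """One-hot feature vector over (state, action) pairs (an assumed feature map)."""
    phi = [0.0] * (N_STATES * N_ACTIONS)
    phi[state * N_ACTIONS + action] = 1.0
    return phi


def q_value(theta, state, action):
    """Linear action-value estimate: inner product of parameters and features."""
    return sum(t * f for t, f in zip(theta, features(state, action)))


def epsilon_greedy(theta, state, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one (ties broken randomly)."""
    if rng.random() < epsilon:
        return rng.randrange(N_ACTIONS)
    values = [q_value(theta, state, a) for a in range(N_ACTIONS)]
    best = max(values)
    return rng.choice([a for a, v in enumerate(values) if v == best])


def step(state, action):
    """Toy environment: reward 1 on entering the terminal state, 0 otherwise."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def sarsa_lambda(episodes=200, gamma=0.9, lam=0.8, alpha=0.1,
                 epsilon=0.1, max_steps=10_000, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * (N_STATES * N_ACTIONS)           # parameter vector
    for _episode in range(episodes):
        trace = [0.0] * len(theta)                   # eligibility trace, reset per episode
        state = 0                                    # initial state
        action = epsilon_greedy(theta, state, epsilon, rng)
        for _t in range(max_steps):                  # safety cap on episode length
            # Decay the trace and add the current features (accumulating trace).
            phi = features(state, action)
            trace = [gamma * lam * e + f for e, f in zip(trace, phi)]
            next_state, reward = step(state, action)
            terminal = next_state == N_STATES - 1
            # TD error; the bootstrap term is dropped at the terminal state.
            delta = reward - q_value(theta, state, action)
            if not terminal:
                next_action = epsilon_greedy(theta, next_state, epsilon, rng)
                delta += gamma * q_value(theta, next_state, next_action)
            # Move every parameter in proportion to its trace.
            theta = [t + alpha * delta * e for t, e in zip(theta, trace)]
            if terminal:
                break
            state, action = next_state, next_action
    return theta
```

Calling `sarsa_lambda()` returns the learned parameter list; `q_value(theta, 3, 1)` then gives the estimated value of moving right from the state adjacent to the goal. Accumulating traces are used here for simplicity; replacing traces would be an equally plausible choice for one-hot features.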