Research Article
Deep Q-Network with Predictive State Models in Partially Observable Domains
Input: learning rate
Output: network parameters

(1) Randomly select actions to generate N trajectories
(2) Compute the sufficient features of every trajectory: the history, future-action, future-observation, and extended-future feature vectors (n denotes the trajectory)
(3) Establish the recurrent predictive state representation:
(4) Initialize the PSR by two-stage regression:
(5)   Use the kernel Bayes rule to estimate the state update (filtering) function
(6)   Apply the least squares method to formula (2) to compute the model weights W
(7)   Set the initial predictive state q_1 to the average of the estimated states
(8) Local optimization:
(9) for i = 1, N do
(10)   Initialize the state q_1
(11)   for t = 1, m do
(12)     Compute the predictive observation ô_t from the current state q_t
(13)     Perform a gradient descent step on the prediction error ||ô_t - o_t||^2
(14)     Update the state: q_{t+1} = f(q_t, a_t, o_t)
(15)   end for
(16) end for
(17) Optimize the policy network:
(18) Initialize the reactive policy (the Q-network parameters θ) randomly
(19) for episode = 1, M do
(20)   Initialize the state q_1
(21)   for t = 1, T do
(22)     With probability 1 - ε select a_t = argmax_a Q(q_t, a; θ); otherwise select a random action a_t
(23)     Execute action a_t in the emulator and observe reward r_t and observation o_{t+1}
(24)     Set the filtered state q_{t+1} = f(q_t, a_t, o_{t+1})
(25)     Store the transition (q_t, a_t, r_t, q_{t+1}) in replay memory D
(26)     Sample a random minibatch of transitions (q_j, a_j, r_j, q_{j+1}) from D
(27)     Set y_j = r_j if the episode terminates at step j + 1; otherwise set y_j = r_j + γ max_{a'} Q(q_{j+1}, a'; θ)
(28)     Perform a gradient descent step on (y_j - Q(q_j, a_j; θ))^2 with respect to θ
(29)   end for
(30) end for
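Steps (1)-(7) initialize the predictive state model from the exploration trajectories by two-stage regression. The sketch below illustrates the generic two-stage least-squares recipe in Python; the feature matrices, the `ridge` helper, and the regularization constant are illustrative assumptions rather than the paper's exact construction, and the kernel Bayes rule estimate of step (5) is omitted.

```python
import numpy as np

def ridge(X, Y, reg=1e-3):
    """Closed-form solve for W minimizing ||W X - Y||^2 + reg ||W||^2."""
    d = X.shape[0]
    return Y @ X.T @ np.linalg.inv(X @ X.T + reg * np.eye(d))

def two_stage_regression(hist, future, ext_future, reg=1e-3):
    """hist: (d_h, T) history features; future: (d_f, T) future features;
    ext_future: (d_e, T) extended-future features, with the time steps of
    all N trajectories pooled into the columns."""
    # Stage 1: denoise the future and extended-future features by
    # regressing each of them on the history features.
    states = ridge(hist, future, reg) @ hist          # predictive states q_t
    ext_states = ridge(hist, ext_future, reg) @ hist  # extended states
    # Stage 2: least squares from states to extended states gives the
    # model weights W of step (6).
    W = ridge(states, ext_states, reg)
    # Step (7): initialize the predictive state as the average estimated state.
    q0 = states.mean(axis=1)
    return W, q0
```

Pooling the time steps of all N trajectories into the columns of each feature matrix keeps stage 1 a single least-squares solve, and the same `ridge` helper then serves for the stage-2 regression of step (6).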
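Steps (8)-(16) refine the regression initialization by gradient descent on the one-step observation prediction error, backpropagating through the filter. Below is a minimal PyTorch sketch assuming a tanh state update and a linear observation predictor, both hypothetical simplifications of the recurrent predictive state representation; for brevity it takes one gradient step per trajectory, whereas step (13) takes one per time step.

```python
import torch
import torch.nn as nn

class PSRFilter(nn.Module):
    """State update q_{t+1} = f(q_t, a_t, o_t) plus an observation predictor."""
    def __init__(self, state_dim, act_dim, obs_dim):
        super().__init__()
        self.update = nn.Linear(state_dim + act_dim + obs_dim, state_dim)
        self.predict = nn.Linear(state_dim, obs_dim)

    def forward(self, q, a, o):
        return torch.tanh(self.update(torch.cat([q, a, o], dim=-1)))

def local_optimization(model, trajectories, q0, lr=1e-3):
    """Steps (9)-(16): descend the summed one-step prediction error."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for actions, observations in trajectories:      # i = 1..N
        q = q0.clone()                              # step (10): initialize state
        loss = torch.zeros(())
        for a, o in zip(actions, observations):     # t = 1..m
            o_hat = model.predict(q)                # step (12): predictive observation
            loss = loss + ((o_hat - o) ** 2).sum()  # accumulate prediction error
            q = model(q, a, o)                      # step (14): update the state
        opt.zero_grad()
        loss.backward()                             # step (13): gradient descent step
        opt.step()
```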
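Steps (17)-(30) are a standard DQN training loop in which the Q-network consumes the filtered predictive state q_t instead of the raw observation. The sketch below assumes an `env` with `reset()` and a `step()` that returns a tensor observation, a scalar reward, and a done flag; the network sizes, replay capacity, target network synchronization, and hyperparameters are illustrative choices, not the paper's.

```python
import copy
import random
from collections import deque

import torch
import torch.nn as nn

def one_hot(i, n):
    v = torch.zeros(n)
    v[i] = 1.0
    return v

def train_dqn(env, psr_filter, q0, n_actions, episodes=500, horizon=200,
              gamma=0.99, eps=0.1, batch_size=32, lr=1e-3):
    state_dim = q0.numel()
    qnet = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                         nn.Linear(64, n_actions))
    qtarget = copy.deepcopy(qnet)            # frozen copy used for the targets y_j
    opt = torch.optim.Adam(qnet.parameters(), lr=lr)
    replay = deque(maxlen=10_000)            # replay memory D

    for episode in range(episodes):          # episode = 1..M (step 19)
        env.reset()
        q = q0.clone()                       # initialize state (step 20)
        for t in range(horizon):             # t = 1..T (step 21)
            # eps-greedy action over the predictive state (step 22)
            if random.random() > eps:
                a = qnet(q).argmax().item()
            else:
                a = random.randrange(n_actions)
            o, r, done = env.step(a)         # execute in emulator (step 23)
            # filter the new observation into the next state (step 24)
            q_next = psr_filter(q, one_hot(a, n_actions), o).detach()
            replay.append((q, a, r, q_next, done))    # store transition (step 25)
            q = q_next
            if len(replay) >= batch_size:
                batch = random.sample(replay, batch_size)   # minibatch (step 26)
                qs, acts, rs, qns, dones = zip(*batch)
                qs, qns = torch.stack(qs), torch.stack(qns)
                rs = torch.tensor(rs, dtype=torch.float32)
                cont = torch.tensor([not d for d in dones], dtype=torch.float32)
                with torch.no_grad():        # DQN target y_j (step 27)
                    y = rs + gamma * cont * qtarget(qns).max(dim=1).values
                pred = qnet(qs)[torch.arange(batch_size), torch.tensor(acts)]
                loss = ((y - pred) ** 2).mean()
                opt.zero_grad()
                loss.backward()              # gradient step on (y - Q)^2 (step 28)
                opt.step()
            if done:
                break
        if episode % 10 == 0:                # periodically sync the target network
            qtarget.load_state_dict(qnet.state_dict())
    return qnet
```

Because the filter output is detached before being stored, the replay memory holds fixed predictive states and only the Q-network parameters receive gradients in step (28); the target network is a common stabilization choice and is an assumption here, since the flattened listing does not specify one.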