Research Article

Deep Q-Network with Predictive State Models in Partially Observable Domains

Algorithm 1: RPSR-DQN.
Input: learning rate
Output: network parameters
(1) Randomly select actions to generate N trajectories
(2) Compute the sufficient features of every trajectory , , , , (n denotes the trajectory)
(3) Establish the recurrent predictive state representation:
(4) Initialize PSR: two-stage regression
(5) Use kernel Bayes rule to estimate
(6) Apply the least squares method to formula (2) to compute
(7) Set to the average of
(8) Local optimization:
(9) for i = 1, N do
(10)     Initialize state
(11)     for t = 1, m do
(12)         Predict observation
(13)         Perform a gradient descent step on
(14)         Update state
(15)     end for
(16) end for
(17) Optimize the policy network:
(18) Initialize reactive policy randomly
(19) for episode = 1, M do
(20)     Initialize state
(21)     for t = 1, T do
(22)         With probability , select , otherwise random
(23)         Execute action in emulator and observe reward and observation
(24)         Set filter
(25)         Store transition () in
(26)         Sample random mini batch of transitions () from
(27)         Set .
(28)         Perform a gradient descent step on
(29)     end for
(30) end for
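
The policy-optimization loop in steps (19) to (30) follows the usual DQN pattern: act epsilon-greedily, pass the new action and observation through a state filter, store the transition, sample a mini-batch, and take a gradient step toward a one-step target. The sketch below illustrates only that pattern and is not the authors' implementation. Every name in it is hypothetical: `toy_filter` is a simple stand-in for the learned predictive state filter, `toy_env_step` is a made-up emulator, a linear Q-function replaces the deep network, and the convention "random action with probability epsilon" is an assumption, since the listing does not show the symbols.

```python
import random
from collections import deque

def q_values(theta, q):
    """Linear Q-function (assumption): one weight row per action, applied to state q."""
    return [sum(w * x for w, x in zip(row, q)) for row in theta]

def epsilon_greedy(theta, q, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(theta))
    values = q_values(theta, q)
    return max(range(len(values)), key=values.__getitem__)

def td_target(theta, reward, q_next, done, gamma):
    """One-step target: r if terminal, else r + gamma * max_a Q(q_next, a)."""
    if done:
        return reward
    return reward + gamma * max(q_values(theta, q_next))

def sgd_step(theta, batch, gamma, lr):
    """One gradient step on half the mean squared TD error (target held fixed)."""
    grads = [[0.0] * len(row) for row in theta]
    for q, a, r, q_next, done in batch:
        err = q_values(theta, q)[a] - td_target(theta, r, q_next, done, gamma)
        for j, x in enumerate(q):
            grads[a][j] += err * x / len(batch)
    return [[w - lr * g for w, g in zip(row, grow)]
            for row, grow in zip(theta, grads)]

def toy_filter(q, action, obs):
    """Hypothetical stand-in for the PSR filter: running average of observations."""
    return [0.5 * qi + 0.5 * oi for qi, oi in zip(q, obs)]

def toy_env_step(action, rng):
    """Hypothetical emulator: action 1 pays reward 1, action 0 pays 0."""
    obs = [1.0, rng.random()]
    return float(action == 1), obs, False

def train(episodes=200, horizon=20, epsilon=0.1, gamma=0.9,
          lr=0.05, batch_size=8, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0], [0.0, 0.0]]   # 2 actions x 2 state features
    replay = deque(maxlen=1000)        # replay memory
    for _ in range(episodes):
        q = [1.0, 0.5]                 # initial state
        for _ in range(horizon):
            a = epsilon_greedy(theta, q, epsilon, rng)
            r, o, done = toy_env_step(a, rng)
            q_next = toy_filter(q, a, o)           # filter update
            replay.append((q, a, r, q_next, done)) # store transition
            batch = rng.sample(list(replay), min(batch_size, len(replay)))
            theta = sgd_step(theta, batch, gamma, lr)
            q = q_next
    return theta
```

Calling `train()` returns the learned weight rows; in this toy setting the greedy action at a state `q` can then be read off with `epsilon_greedy(theta, q, 0.0, random.Random())`. Swapping `toy_filter` for a learned filter and `q_values` for a neural network would bring the sketch closer to the listed algorithm, at the cost of needing an autodiff library.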