Abstract
While deep reinforcement learning (DRL) has achieved great success in some large domains, most of the related algorithms assume that the state of the underlying system is fully observable. However, many real-world problems are actually partially observable. For systems with continuous observations, most of the related algorithms, e.g., the deep Q-network (DQN) and the deep recurrent Q-network (DRQN), use history observations to represent states; however, they are often computationally expensive and ignore the information carried by actions. Predictive state representations (PSRs) offer a powerful framework for modelling partially observable dynamical systems with discrete or continuous state spaces, representing the latent state using completely observable actions and observations. In this paper, we present a PSR model-based DQN approach which combines the strengths of the PSR model and DQN planning. We use a recurrent network to establish the recurrent PSR model, which can fully learn the dynamics of a partially observable environment with continuous observations. Then, the model is used for the state representation and update of DQN, so that DQN no longer relies on a fixed number of history observations or on a recurrent neural network (RNN) to represent states in partially observable environments. The strong performance of the proposed approach is demonstrated on a set of robotic control tasks from OpenAI Gym by comparing it with the memory-based DRQN and the state-of-the-art recurrent predictive state policy (RPSP) networks. Source code is available at https://github.com/RPSRDQN/papercode.git.
1. Introduction
For agents operating in stochastic domains, how to determine the (near) optimal policy is a central and challenging issue. While (deep) reinforcement learning has provided a powerful framework for decision-making and control and has achieved great success in recent years in some large-scale applications, e.g., AlphaGo [1], most of the related approaches rely on the strong assumption that the agent has complete knowledge of the environment surrounding it, i.e., that the environment is fully observable. However, many real-world applications are actually partially observable Markov decision processes (POMDPs), where the state of the environment may be partially observable or even unobservable [2, 3].
Much effort has been devoted to planning in partially observable environments. Some of the work aims to learn the complete model of the underlying system. Huang et al. [4–6] propose planning methods based on the PSR model. Song et al. [7] and Somani et al. [8] propose planning methods based on the POMDP model. However, these methods are only suitable for systems with discrete observations. In this paper, we mainly focus on systems with continuous observations, for which there are two main approaches to dealing with partial observability. The first relies on recurrent neural networks to summarize the past, and the neural network is then trained in a model-free reinforcement learning manner [2, 9, 10]. However, training becomes a heavy burden when everything relies on the network. The second approach directly uses the past history, i.e., the past observations (frames), for the state representation. The main problem of this approach is that the number of observations (frames) used for the state representation can only be determined empirically. Also, too many observations may be computationally expensive, while too few may not be sufficient statistics of the past. Moreover, neither approach considers the effect of action information on the state representation.
Predictive state representations (PSRs) provide a general framework for modelling partially observable systems. Unlike latent-state-based approaches such as POMDPs, the core idea of PSRs is to work only with observable quantities, which leads to easier learning of a more expressive model [11–13]. PSRs can also be combined with recurrent networks for modelling and planning in partially observable dynamic systems with continuous state spaces [14, 15].
In this paper, building on the benefits of the PSR approach and the success of the deep Q-network in some real-world applications, we propose the RPSR-DQN approach. Firstly, a recurrent PSR model of the underlying partially observable system is built. Then, the true state, namely, the PSR state or the belief state, can be updated and provides sufficient information for DQN planning. Finally, the tuple $(q_t, a_t, r_t, q_{t+1})$, where $q_t$ is the information of the current state and $q_{t+1}$ is the information of the next state obtained by taking action $a_t$ under the current state, is stored and used as the data for training the deep Q-network. The performance of our proposed approach is first demonstrated on a set of robotic control tasks from OpenAI Gym by comparing it with the deep recurrent Q-network (DRQN) algorithm, which uses the current observation as the input and plans based on memory. Then, we compare our approach with the state-of-the-art recurrent predictive state policy (RPSP) networks [14]. Experimental results show that, with the benefits of the DQN framework and the separation of model learning from policy training, our approach outperforms the state-of-the-art baselines.
2. Related Work
A central problem in artificial intelligence is for agents to find optimal policies in stochastic, partially observable environments, which is a ubiquitous and challenging problem in science and engineering [16]. The commonly used technique for solving such partially observable problems is first to model the dynamics of the environment by using the POMDP approach or the PSR approach [3, 12] and then to solve the problem using the obtained model. Although POMDPs provide a general framework for solving partially observable problems, they rely heavily on a known and accurate model of the environment [17], and in real-world applications it is extremely difficult to build such a model [18]. Also, most POMDP-based approaches are difficult to extend to larger-scale real-world applications.
As mentioned previously, the PSR is an effective method for modelling partially observable environments, and many related works have been proposed based on the idea of running a fully observable RL method on the PSR state. In the work of Boots et al. [19], the main idea is first to build sufficiently accurate transformed PSRs with indicative and characteristic features and then to use the point-based value iteration technique [20] to find the planning solution. In that technique, a subset of the state space is first selected under some strict conditions so that it is both sufficiently small to reduce the computational difficulty and sufficiently large to obtain a good approximation function. In the work of Liu and Zheng [5, 21], the learned PSR model has been combined with Monte-Carlo tree search both online and offline, which achieves state-of-the-art performance on some environments. However, the application of these approaches is limited to domains with discrete state and action spaces.
For partially observable systems with continuous state spaces, most work relies on recurrent neural networks to summarize the past, and the network is then trained in a model-free reinforcement learning manner. To solve the customer relationship management (CRM) problem, which is considered to be partially observable, Li et al. [22] proposed a hybrid recurrent reinforcement learning approach (SL-RNN + RL-DQN) which uses an RNN to calculate the hidden states of the CRM environment. Our method differs from SL-RNN + RL-DQN in three respects. First, our method was tested on several control environments, as shown in the experiments, whereas both the SL-RNN + RL-DQN approach and its experiments focus on the CRM problem. Second, our method takes into account both past observations and past actions when representing the underlying states, whereas SL-RNN + RL-DQN does not consider the effect of the actions taken when calculating the state representation, which may lead to inaccurate representations of the underlying states. Third, RPSR-DQN builds a model of the underlying system, so it can be easily extended to model-based reinforcement learning approaches, whereas SL-RNN + RL-DQN can only be combined with model-free reinforcement learning frameworks. In the work of Hausknecht and Stone [9], recurrence is added to a deep Q-network (DQN) by replacing the first fully connected layer with a recurrent LSTM layer so that all historical information is considered. Igl et al. [2] extended the RNN-based approach to explicitly support belief inference. However, these RNN-based approaches with latent states rely on nonconvex optimization, which usually makes training more difficult than with convex optimization [14]. In our approach, by contrast, with suitable features, the mapping between the predictive state and the prediction of the observations given the actions is fully known and simple to learn consistently.
Recently, some works have proposed using the PSR state to replace, or to improve the quality of, the internal state of the RNN. In the work of Venkatraman et al. [15], recurrent neural networks are combined with predictive state decoders (PSDs), which add supervision to the network's internal state representation so that it targets predicting future observations. Hefny et al. [14] proposed recurrent predictive state policy (RPSP) networks, which consist of a recursive filter that tracks a belief about the state of the environment and a reactive policy that directly maps beliefs to actions to maximize the cumulative reward. While RPSP networks show promising performance on some benchmark domains, the recursive filter and the reactive policy are trained simultaneously, in an online manner, by defining a joint loss function. However, it is difficult to balance the loss of the recursive filter against the loss of the reactive policy, and in many cases, as also shown in the experiments, training the two objective functions simultaneously may lead to a worse final performance.
3. Background
This section is divided into three parts. In the first part, we briefly review predictive state representations (PSRs) [12]. Then, we introduce the recurrent PSRs that can be applied to continuous observation systems. Finally, we briefly describe the DQN algorithm.
3.1. Predictive State Representations
Predictive state representations (PSRs) offer a powerful framework for modelling partially observable and stochastic systems without prior knowledge by using completely observable events to represent states [23]. For discrete systems with a finite set of observations $O$ and actions $A$, at time $\tau$, the observable state representation of the system is a prediction vector composed of the probabilities of test occurrence conditioned on the current history, where a test $t = a_1 o_1 \cdots a_k o_k$ is a sequence of action-observation pairs that starts from time $\tau + 1$, a history $h$ at time $\tau$ is the sequence of action-observation pairs that starts from the beginning of time and ends at time $\tau$, and the prediction of a length-$k$ test $t$ at history $h$ is defined as $p(t \mid h) = \Pr(o_1, \ldots, o_k \mid h, a_1, \ldots, a_k)$ [24].
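Given the set of tests $Q = \{q_1, \ldots, q_n\}$, if the prediction vector $p(Q \mid h) = [p(q_1 \mid h), \ldots, p(q_n \mid h)]$ satisfies that, for any test $t$, there exists a function $f_t$ such that $p(t \mid h) = f_t(p(Q \mid h))$, then $p(Q \mid h)$ is considered to constitute a PSR. The set $Q$ is called the test core, and the prediction vector $p(Q \mid h)$ is called the PSR state. In this paper, we only consider linear PSRs, so the function $f_t$ can be represented as a weight vector $m_t$, i.e., $p(t \mid h) = p(Q \mid h)^T m_t$. When the action $a$ is performed from the history $h$ and the observation $o$ is obtained, the next PSR state $p(Q \mid hao)$ can be updated from $p(Q \mid h)$ as follows [12]:

$$p(Q \mid hao) = \frac{p(Q \mid h)^T M_{ao}}{p(Q \mid h)^T m_{ao}} \quad (1)$$

In formula (1), $T$ is the transposing operation, $m_{ao}$ is the weight vector of the test $ao$, and $M_{ao}$ is a matrix whose $i$th column is the weight vector $m_{aoq_i}$.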
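The linear PSR update described above can be illustrated with a minimal Python sketch. The function name `psr_update` and all numbers are hypothetical and chosen only for illustration; they do not come from a learned model, and the sketch assumes the standard linear PSR update in which the numerator is the current state multiplied by $M_{ao}$ and the denominator is the current state multiplied by $m_{ao}$.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def psr_update(state: Vector, M_ao: Matrix, m_ao: Vector) -> Vector:
    """Return p(Q|hao) = p(Q|h)^T M_ao / (p(Q|h)^T m_ao) for a linear PSR."""
    # Denominator: probability of observing o after taking a at history h.
    denom = sum(p_i * m_i for p_i, m_i in zip(state, m_ao))
    if denom <= 0.0:
        raise ValueError("action-observation pair has zero probability here")
    # Numerator: row vector p(Q|h)^T multiplied by the matrix M_ao.
    n_cols = len(M_ao[0])
    numer = [
        sum(state[i] * M_ao[i][j] for i in range(len(state)))
        for j in range(n_cols)
    ]
    return [x / denom for x in numer]


# Made-up numbers for a two-test core (illustrative only).
state = [0.5, 0.2]
M_ao = [[0.2, 0.1], [0.1, 0.3]]
m_ao = [0.4, 0.5]
next_state = psr_update(state, M_ao, m_ao)
```

Because the update only involves predictions of observable tests, no latent state has to be estimated at any point.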
3.2. Recurrent Predictive State Representation
The PSR model obtained by using the substate-space method [25] or the spectral learning algorithm [26] can only be applied to modelling systems with discrete observations. More recently, Ahmed et al. [27] proposed the recurrent predictive state representation (RPSR), which treats predictive state models as recurrent networks and is able to represent systems with continuous observations. Similar to the PSR, the RPSR state is the conditional distribution of future observations, so the mapping between the RPSR state and the predicted observation for a given action can be fully known or easily learned by selecting suitable features. This characteristic turns the learning of the network into a supervised learning problem, which makes the modelling simple and efficient [28, 29].
The state update process of RPSR can be divided into two steps. As can be seen from Section 3.1, if $Q$ is the test core, the prediction vector of $Q$, denoted $q_t$ here, is a sufficient state representation at time $t$. Establishing an extended test core then ensures that its prediction vector is a sufficient statistic of the distribution of the extended future. When the estimate of this extended prediction vector is given, $q_{t+1}$ can be obtained once $a_t$ and $o_t$ are known. The extended prediction vector is called the extended state $p_t$. The steps of the state update are as follows [14]:

(i) State extension: the state $q_t$ is transformed into the extended state $p_t$ through the linear map $W$, which is a parameter that needs to be learned:

$$p_t = W q_t \quad (2)$$

(ii) Conditioning: given $a_t$ and $o_t$, the next state $q_{t+1}$ can be calculated from the current extended state $p_t$ by the conditioning function $f_{cond}$, where the kernel Bayes rule with the Gaussian RBF kernel is used [30]:

$$q_{t+1} = f_{cond}(p_t, a_t, o_t) \quad (3)$$

The calculation proceeds as follows. As the extended feature is a Kronecker product of the immediate feature matrix and the future feature matrix, the extended state can be divided into two parts, which are derived from the skipped future observation and the present observation, respectively. Firstly, the feature vectors $\phi^a_t$ and $\phi^o_t$ are extracted for the given action $a_t$ and observation $o_t$. Secondly, $\phi^a_t$ and the second part of the extended state are multiplied to calculate the observation covariance after $a_t$ is executed, and the inverse observation covariance is multiplied by the first part of the extended state to change “predicting observation” into “conditioning on observation”, i.e., to transform the joint expectation of the immediate observation and the future into the conditional expectation of the future given the immediate observation. Finally, the conditional expectation is multiplied by $\phi^o_t$ and $\phi^a_t$ to obtain the next state $q_{t+1}$.

The RPSR model can be seen as a recursive filter which is implemented by transforming formulas (2) and (3) into a recurrent network. The output of the recurrent network is a predictive observation $\hat{o}_t = g(q_t, a_t)$, where $g$ is the predictive observation function that needs to be learned. The state $q_t$ and the extended state $p_t$ are represented in terms of observable quantities and can be estimated by supervised regression, and the linear map $W$ follows from the linearity of formula (2). Therefore, in the process of network training, the two-stage regression method [28] is used to initialize the state $q_t$, the extended state $p_t$, and the linear map $W$.
3.3. Deep Q-Network
DQN is a method combining deep learning and Q-learning, which has succeeded in handling environments with high-dimensional perceptual input [31]. It is a multilayered neural network which outputs a predicted future reward $Q(s, a; \theta)$ for each possible action $a$, where $\theta$ are the network parameters. In other words, DQN uses a neural network as an approximation of the action value.
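In DQN, the last four frames of observations are directly input to the CNN that forms the first layers of DQN to compute the current state information. Then, the state information is mapped to a vector of action values for the current state through a fully connected layer [32]. DQN optimizes the action value function by updating the network weights to minimize a differentiable loss function [9]:

$$L(\theta) = \mathbb{E}_{(s, a, r, s')}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\right)^{2}\right] \quad (4)$$

where $\gamma$ is the discount factor and $\theta^{-}$ denotes the parameters of the target network.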
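As a rough illustration, the squared temporal-difference loss that DQN minimizes can be computed on a small batch as in the sketch below. The names `td_target` and `dqn_loss` are hypothetical, the Q-values are passed in as plain numbers rather than produced by a network, and the handling of terminal transitions (no bootstrapping) is a common convention that is assumed here rather than taken from the text.

```python
from typing import List, Sequence, Tuple

# One sample: (Q(s, a; theta), reward, target-network values Q(s', .; theta^-),
# terminal flag). All values are plain floats in this sketch.
Sample = Tuple[float, float, Sequence[float], bool]


def td_target(reward: float, next_values: Sequence[float],
              gamma: float, terminal: bool) -> float:
    """Bootstrapped target r + gamma * max_a' Q(s', a'; theta^-)."""
    if terminal:
        # Assumption: terminal transitions do not bootstrap.
        return reward
    return reward + gamma * max(next_values)


def dqn_loss(batch: List[Sample], gamma: float) -> float:
    """Mean squared difference between targets and current estimates."""
    errors = [
        td_target(reward, next_values, gamma, terminal) - q_sa
        for q_sa, reward, next_values, terminal in batch
    ]
    return sum(e * e for e in errors) / len(errors)


# Two made-up transitions, for illustration only.
example_batch = [
    (1.0, 1.0, [0.5, 2.0], False),
    (0.0, 1.0, [5.0], True),
]
example_loss = dqn_loss(example_batch, gamma=0.9)
```

In a real implementation the gradient of this loss would be taken with respect to $\theta$ only, with the target-network values held fixed.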
4. RPSR-DQN
Building on the benefits of the RPSR approach and the success of deep networks, we propose a model-based method which combines the RPSR with deep learning. Firstly, we use a recurrent network to build a PSR model of the partially observable dynamic system. Then, the true state $q_t$, namely, the RPSR state, provides sufficient information for selecting the best action and is updated each time a new action is executed and a new observation is received. Finally, the tuple $(q_t, a_t, r_t, q_{t+1})$, where $r_t$ represents the reward returned for taking action $a_t$ in the current state $q_t$, is stored and used as the data for training the deep Q-network (DQN).
As depicted in Figure 1, the architecture of our method consists of the RPSR model part and the value-based policy part. In the RPSR model, the state $q_t$ is transformed into the extended state $p_t$ through the extension part, i.e., a linear map. Then, the extended state $p_t$ is updated to the next state $q_{t+1}$ according to the action $a_t$ and observation $o_t$. The whole state update process can be represented as formula (5). For the policy part, the deep Q-network is used to select the action that yields the better long-term reward according to the current state information calculated by the RPSR model:

$$q_{t+1} = f_{cond}(W q_t, a_t, o_t) \quad (5)$$
The learning process is divided into two stages: building the model and training the policy network. In the first stage, an exploration strategy is used to collect training data to build the model of the environment. We use the data-processing method proposed by Ahmed et al. [27]. We use 1000 random Fourier features (RFFs) [33] as approximate features of observations and actions. Then, we apply principal component analysis (PCA) [34] to project the features into 25 dimensions. Here, the number of features and dimensions depends on the complexity of the environment. We denote the feature function as $\phi$. The linear map $W$, the states $q_t$, and the extended states $p_t$ in the RPSR model are initialized by using a two-stage regression algorithm [28]. Let $\psi^o_t$, $\psi^a_t$, $\xi^o_t$, and $\xi^a_t$ denote sufficient features of future observations, future actions, extended future observations, and extended future actions at time $t$, respectively. Because $q_t$ and $p_t$ are represented in terms of observable quantities and follow from linearity of expectation, they are computed by using the kernel Bayes rule (stage-1 regression). The state extension function is $p_t = W q_t$, so we can linearly regress the extended state $p_t$ from the state $q_t$, using a least squares approach (stage-2 regression), to compute $W$. After initialization, the parameters of the RPSR model can be optimized by using backpropagation through time [35] to minimize the prediction error (see formula (6)), where $\hat{o}_t$ is the predictive observation of the RPSR model:

$$L = \sum_{t} \lVert \hat{o}_t - o_t \rVert^{2} \quad (6)$$
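To make the feature step more concrete, the following sketch builds random Fourier features that approximate a Gaussian RBF kernel. It is a minimal illustration under stated assumptions rather than the data-processing pipeline used in the experiments: the function names, the bandwidth, and the seed are hypothetical, and the subsequent PCA projection to 25 dimensions is omitted to keep the example within the standard library.

```python
import math
import random
from typing import Callable, List, Sequence


def make_rff(input_dim: int, num_features: int = 1000,
             bandwidth: float = 1.0,
             seed: int = 0) -> Callable[[Sequence[float]], List[float]]:
    """Return a feature map z with z(x).z(y) ~ exp(-|x - y|^2 / (2 bw^2))."""
    rng = random.Random(seed)
    # Frequencies are drawn from the Fourier transform of the RBF kernel.
    weights = [
        [rng.gauss(0.0, 1.0 / bandwidth) for _ in range(input_dim)]
        for _ in range(num_features)
    ]
    # Random phase offsets, uniform on [0, 2*pi].
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def features(x: Sequence[float]) -> List[float]:
        return [
            scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(weights, offsets)
        ]

    return features


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(x: Sequence[float], y: Sequence[float],
               bandwidth: float = 1.0) -> float:
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


# Compare the feature inner product with the exact kernel on two toy inputs.
phi = make_rff(input_dim=2)
x, y = [0.1, -0.3], [0.4, 0.2]
approx_kernel = dot(phi(x), phi(y))
exact_kernel = rbf_kernel(x, y)
```

With 1000 features the inner product is typically close to the exact kernel value, which is why a linear method such as PCA or least squares regression can then be applied to the features.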
After the model of the dynamic environment is established, the current state information of the partially observable environment can be expressed by the model, and the policy part is trained on this basis. In the process of policy training, we build an evaluation network and a target network, which are both composed of two fully connected layers. We use experience replay [32] to train the networks. When the agent interacts with the environment, we store transitions $(q_t, a_t, r_t, q_{t+1})$ in the data set $D$. Then, we sample random transitions to train the policy network by minimizing the value difference between the target network and the evaluation network. These losses are backpropagated into the weights of both the encoder and the Q-network. The value of the target network is $y_t = r_t + \gamma \max_{a'} Q(q_{t+1}, a'; \theta^{-})$, where $\theta^{-}$ denotes the parameters of an earlier Q-network. The details are shown in Algorithm 1.
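The experience replay step can be sketched as a fixed-capacity buffer of transitions from which random minibatches are drawn. The class name `ReplayBuffer`, the capacity, and the eviction of the oldest transitions are assumptions made for illustration; the text does not specify these details.

```python
import random
from collections import deque
from typing import Deque, List, Sequence, Tuple

# (current RPSR state, action index, reward, next RPSR state)
Transition = Tuple[Sequence[float], int, float, Sequence[float]]


class ReplayBuffer:
    """Fixed-capacity store of transitions with uniform random sampling."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        # deque(maxlen=...) silently drops the oldest item when full.
        self._data: Deque[Transition] = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def store(self, transition: Transition) -> None:
        self._data.append(transition)

    def sample(self, batch_size: int) -> List[Transition]:
        # Sample without replacement; shrink the batch if too few items.
        k = min(batch_size, len(self._data))
        return self._rng.sample(list(self._data), k)

    def __len__(self) -> int:
        return len(self._data)


# Store five toy transitions in a buffer that only keeps the last three.
buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.store(([float(step)], step, 1.0, [float(step + 1)]))
minibatch = buffer.sample(2)
```

Sampling uniformly from stored transitions breaks the temporal correlation between consecutive updates, which is the usual motivation for experience replay.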

5. Experiments
We select the following three Gym environments for evaluating the performance of RPSR-DQN (see Figure 2): the classic control environment CartPole-v1 and the MuJoCo robot environments Swimmer-v1 and Reacher-v1. These environments provide qualitatively different challenges. To fit our experimental setting, we make some changes to the three environments.
[Figure 2, panels (a) to (c): the three Gym environments used for evaluation.]
CartPole-v1: in this task, the cart is controlled by applying a force that pushes it to the left or to the right. The goal is to prevent the pole, which is attached to the cart by an unactuated joint, from falling over. A reward of +1 is provided for every time step in which the angle of the pole is less than 15 degrees. The episode terminates when the pole is more than 15 degrees from vertical or the cart moves more than 2.4 units from the center. There are two actions in this environment, corresponding to the direction of the force applied to the cart. To make the environment partially observable, we remove the observations that represent velocities, reducing the original four observations to two: the position of the cart and the angle of the pole. The agent must therefore be able to infer velocities from positions.
Swimmer-v1: this environment involves a 3-link swimming robot in a viscous fluid, where the goal is to make it swim forward as fast as possible by actuating the two joints. To make the environment partially observable, we remove the observations that represent velocities, reducing the original five observations to three, which are the angles of the three links. The action in this environment is a vector with three elements which represent the control of the three links. Each element takes a value from the arithmetic progression with an interval of 0.2 in the range [−1.4, 1.4].
Reacher-v1: this environment involves a 2-link robot arm which is connected to a central point. The goal of this task is to move the endpoint of the robot arm to the target location. The reward at each step is the negative of the sum of the control cost and the distance between the endpoint of the robot arm and the target point. The action in this task is a vector with two elements which represent the force applied to the arm. Each element takes a value from the arithmetic progression with an interval of 0.2 in the range [−1.4, 1.4]. To make the environment partially observable, we reduce the original six observable values to four, which represent the angles of the two links and the relative distance between the link and the target position. This task requires finding a balance between exploration and exploitation.
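The environment modifications described above can be sketched in a few lines. The helper names are hypothetical, and two points are assumptions rather than details given in the text: the CartPole observation is taken to be ordered as cart position, cart velocity, pole angle, and pole angular velocity (the standard Gym ordering), and the joint action set is taken to be the Cartesian product of the per-element levels.

```python
from itertools import product
from typing import List, Sequence, Tuple


def action_levels(low: float = -1.4, high: float = 1.4,
                  step: float = 0.2) -> List[float]:
    """Arithmetic progression of control values from low to high."""
    count = int(round((high - low) / step)) + 1
    # Rounding removes floating-point drift such as 0.6000000000000001.
    return [round(low + i * step, 10) for i in range(count)]


def joint_actions(num_elements: int) -> List[Tuple[float, ...]]:
    """All action vectors whose elements are taken from action_levels()."""
    return list(product(action_levels(), repeat=num_elements))


def mask_observation(observation: Sequence[float],
                     keep_indices: Sequence[int]) -> List[float]:
    """Keep only the selected entries, e.g. positions without velocities."""
    return [observation[i] for i in keep_indices]


levels = action_levels()
reacher_actions = joint_actions(2)   # two-element action vector
# CartPole: keep cart position (index 0) and pole angle (index 2).
partial_obs = mask_observation([0.1, 0.5, -0.02, 0.3], keep_indices=(0, 2))
```

Under these assumptions each action element has 15 possible values, so a two-element action vector yields 225 discrete actions and a three-element vector yields 3375.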
In this section, we compare methods using two metrics: the best reward, $R_{best} = \max_i R_i$, is the highest return over all iterations, where $R_i$ is the total return reward for the $i$th iteration, and the mean reward is the mean return reward over the last 25 iterations.
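The two metrics can be computed from a list of per-iteration returns as in the following sketch. The function names and the toy return sequence are hypothetical; only the window of 25 iterations is taken from the text.

```python
from typing import Sequence


def best_reward(returns: Sequence[float]) -> float:
    """Highest total return over all iterations."""
    return max(returns)


def mean_reward(returns: Sequence[float], window: int = 25) -> float:
    """Mean total return over the last `window` iterations."""
    tail = returns[-window:]
    return sum(tail) / len(tail)


# Toy learning curve: the return grows by 1 per iteration for 100 iterations.
toy_returns = [float(i) for i in range(100)]
toy_best = best_reward(toy_returns)
toy_mean = mean_reward(toy_returns)
```

The best reward captures peak performance, while the mean over the final iterations reflects how stable the policy is once training has settled.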
Comparison to model-free methods: we compared the performance of RPSR-DQN with model-free methods, namely DQN-1frame and DRQN. The results are shown in Figure 3. The comparison with DQN-1frame, which selects the best action using only the current observation, shows that the predictive state model can effectively track and update the state of the environment. Because RPSR-DQN has a model learning process, it learns faster than DRQN and converges to a more stable state with fewer iterations. Even with sufficient update iterations, RPSR-DQN still obtains better rewards than DRQN in the final stable situation. The first three rows of Tables 1 and 2 show the numerical results, which include the performance of the three methods on all tasks.
[Figure 3, panels (a) to (c): comparison of RPSR-DQN with the model-free methods DQN-1frame and DRQN.]
Comparison to policy-based methods: Figure 4 shows the results of comparing RPSR-DQN with the policy-based method RPSP [14]. Note that, as a policy-based method, RPSP can be applied to both continuous and discrete environments. In environments with discrete actions, our method obtains better mean rewards in the final stable situation than the policy-gradient method RPSP. In the Reacher-v1 task, the reasons for the poor performance of RPSP may be as follows. The initial random weights tend to produce highly positive or highly negative outputs, which means that most initial actions give the link the maximum or minimum acceleration. As a result, the link manipulator cannot stop rotating as long as the maximum force is applied to the joint. In this case, once the robot has started training, this meaningless state causes it to deviate from its current strategy. RPSP may not explore enough to select the action that stops the link manipulator from rotating. The last two rows of Tables 1 and 2 show the numerical results, which include the performance of the two methods on all tasks.
[Figure 4, panels (a) to (c): comparison of RPSR-DQN with the policy-based method RPSP.]
6. Conclusion
In this paper, we propose RPSR-DQN, a method that can learn a model and make decisions in partially observable environments. Combining the predictive state model with a value-based approach results in good performance in partially observable environments. We compare RPSR-DQN with DRQN in different partially observable environments and show that our method achieves better performance in terms of learning speed and expected rewards. We also compare our approach with the state-of-the-art recurrent predictive state policy (RPSP) networks, where the PSR model and a reactive policy are trained simultaneously in an end-to-end manner. Experimental results show that, with the benefits of the DQN framework and the separation of model learning from the training of the reactive policy, our approach outperforms the state-of-the-art baselines.
Data Availability
The experimental data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (nos. 61772438 and 61375077). This work was also supported by the China Scholarship Council (201906315049).