Abstract

While deep reinforcement learning (DRL) has achieved great success in some large domains, most of the related algorithms assume that the state of the underlying system is fully observable. However, many real-world problems are actually partially observable. For systems with continuous observations, most of the related algorithms, e.g., the deep Q-network (DQN) and the deep recurrent Q-network (DRQN), use history observations to represent states; however, they are often computationally expensive and ignore the information carried by actions. Predictive state representations (PSRs) offer a powerful framework for modelling partially observable dynamical systems with discrete or continuous state space, representing the latent state using completely observable actions and observations. In this paper, we present a PSR model-based DQN approach which combines the strengths of the PSR model and DQN planning. We use a recurrent network to establish the recurrent PSR model, which can fully learn the dynamics of a partially observable environment with continuous observations. The model is then used for the state representation and update of DQN, so that DQN no longer relies on a fixed number of history observations or on a recurrent neural network (RNN) to represent states in partially observable environments. The strong performance of the proposed approach is demonstrated on a set of robotic control tasks from OpenAI Gym by comparison with the memory-based DRQN and the state-of-the-art recurrent predictive state policy (RPSP) networks. Source code is available at https://github.com/RPSR-DQN/paper-code.git.

1. Introduction

For agents operating in stochastic domains, how to determine the (near) optimal policy is a central and challenging issue. While (deep) reinforcement learning has provided a powerful framework for decision-making and control and has achieved great success in recent years in some large-scale applications, e.g., AlphaGo [1], most of the related approaches rely on the strong assumption that the agent can completely know the environment surrounding it, i.e., that the environment is fully observable. However, many real-world applications are actually partially observable Markov decision processes (POMDPs), where the state of the environment may be partially observable or even unobservable [2, 3].

Much effort has been devoted to planning in partially observable environments. Some of this work aims to learn the complete model of the underlying system. Huang et al. [4–6] propose planning methods based on the PSR model. Song et al. [7] and Somani et al. [8] propose planning methods based on the POMDP model. However, these methods are only suitable for systems with discrete observations. In this paper, we mainly focus on systems with continuous observations, and there are two main approaches for dealing with the partially observable problem in such domains. One relies on recurrent neural networks to summarize the past, and the neural network is then trained in a model-free reinforcement learning manner [2, 9, 10]. However, placing the entire burden on the network makes its training difficult. The other approach directly uses past histories, i.e., the past observations (frames), for the state representation; its main problem is that the number of observations (frames) used for the state representation can only be determined empirically. Too many observations make the state representation computationally expensive, while too few observations may not constitute a sufficient statistic of the past. Moreover, neither approach considers the effect of action information on the state representation.

Predictive state representations (PSRs) provide a general framework for modelling partially observable systems. Unlike latent-state-based approaches such as POMDPs, the core idea of PSRs is to work only with observable quantities, which leads to easier learning of a more expressive model [11–13]. PSRs can also be combined with recurrent networks for modelling and planning in partially observable dynamical systems with continuous state space [14, 15].

In this paper, with the benefits of the PSR approach and the great success of the deep Q-network in some real-world applications, we propose the RPSR-DQN approach. Firstly, a recurrent PSR model of the underlying partially observable system is built; then the true state, namely, the PSR state or belief state, can be updated and provides sufficient information for DQN planning; finally, the tuple containing the current state information, the executed action, the received reward, and the next state information obtained by taking that action under the current state is stored and used as the data for training the deep Q-network. The performance of our proposed approach is first demonstrated on a set of robotic control tasks from OpenAI Gym by comparison with the deep recurrent Q-network (DRQN) algorithm, which uses the current observation as input and plans based on memory. Then, we compare our approach with the state-of-the-art recurrent predictive state policy (RPSP) networks [14]. Experiment results show that with the benefits of the DQN framework and the separation of model learning from policy training, our approach outperforms the state-of-the-art baselines.

2. Related Work

A central problem in artificial intelligence is for agents to find optimal policies in stochastic, partially observable environments, which is a ubiquitous and challenging problem in science and engineering [16]. The commonly used technique for solving such partially observable problems is to first model the dynamics of the environment using the POMDP approach or the PSR approach [3, 12]; the problem can then be solved using the obtained model. Although POMDPs provide a general framework for solving partially observable problems, they rely heavily on a known and accurate model of the environment [17]; however, in real-world applications, it is extremely difficult to build such an accurate model [18]. Also, most POMDP-based approaches are difficult to extend to larger-scale real-world applications.

As mentioned previously, PSR is an effective method for modelling partially observable environments, and many related works have been proposed based on the idea of running a fully observable RL method on the PSR state. The main idea in the work of Boots et al. [19] is to first build sufficiently accurate transformed PSRs with indicative and characteristic features and then use the point-based value iteration technique [20] to find the planning solution, where a subset of the state space is first selected under strict conditions so that it is both sufficiently small to reduce the computational difficulty and sufficiently large to obtain a good approximation function. In the work of Liu and Zheng [5, 21], the learned PSR model has been combined with Monte-Carlo tree search both online and offline, achieving state-of-the-art performance in some environments. However, the application of these approaches is limited to domains with discrete state and action spaces.

For partially observable systems with continuous state space, most work relies on recurrent neural networks to summarize the past, and the neural network is then trained in a model-free reinforcement learning manner. To solve the customer relationship management (CRM) problem, which is considered to be partially observable, Li et al. [22] proposed a hybrid recurrent reinforcement learning approach (SL-RNN + RL-DQN) which uses an RNN to calculate the hidden states of the CRM environment. While our method is tested on several control environments, as shown in the experiments, and takes both past observations and actions into account when representing the underlying states, the SL-RNN + RL-DQN approach and its experiments focus on the CRM problem. Also, SL-RNN + RL-DQN does not consider the effect of actions when calculating the state representation, which may lead to inaccurate representations of the underlying states. Moreover, while RPSR-DQN builds a model of the underlying system, which makes it easy to extend to model-based reinforcement learning approaches, SL-RNN + RL-DQN can only be combined with model-free reinforcement learning frameworks. In the work of Hausknecht and Stone [9], recurrence is added to a deep Q-network (DQN) by replacing the first fully connected layer with a recurrent LSTM, thereby considering all historical information. Igl et al. [2] extended the RNN-based approach to explicitly support belief inference. However, while in our approach, with suitable features, the mapping between the predictive state and the prediction of the observations given the actions is fully known and simple to learn consistently, these RNN-based approaches with latent states require nonconvex optimization, which usually leads to more difficult training than approaches using convex optimization [14].

Recently, some works have used the PSR state to replace or improve the quality of the internal state of an RNN. In the work of Venkatraman et al. [15], recurrent neural networks are combined with predictive state decoders (PSDs), which add supervision to the network's internal state representation by targeting the prediction of future observations. Hefny et al. [14] proposed recurrent predictive state policy (RPSP) networks, which consist of a recursive filter that tracks a belief about the state of the environment and a reactive policy that directly maps beliefs to actions so as to maximize the cumulative reward. While RPSP networks show promising performance on some benchmark domains, the recursive filter and the reactive policy are trained simultaneously in an online manner by defining a joint loss function. However, it is difficult to balance the loss of the recursive filter against the loss of the reactive policy, and in many cases, as also shown in our experiments, simultaneously training the two objectives may lead to worse final performance.

3. Background

This section is divided into three parts. In the first part, we briefly review predictive state representations (PSRs) [12]. Then, we introduce recurrent PSRs, which can be applied to systems with continuous observations. Finally, we briefly describe the DQN algorithm.

3.1. Predictive State Representations

Predictive state representations (PSRs) offer a powerful framework for modelling partially observable and stochastic systems without prior knowledge by using completely observable events to represent states [23]. For discrete systems with a finite set of observations $O$ and a finite set of actions $A$, at time $t$, the observable state representation of the system is a prediction vector composed of the probabilities of test occurrences conditioned on the current history, where a test $q = a_1 o_1 a_2 o_2 \cdots a_k o_k$ is a sequence of action-observation pairs that starts from time $t$, a history $h$ at time $t$ is a sequence of action-observation pairs that starts from the beginning of time and ends at time $t$, and the prediction of a length-$k$ test $q$ at history $h$ is defined as $p(q \mid h) = \Pr(o_1 \cdots o_k \mid h, a_1 \cdots a_k)$ [24].

Given a set of tests $Q = \{q_1, q_2, \ldots, q_k\}$, if the prediction vector $p(Q \mid h) = [p(q_1 \mid h), \ldots, p(q_k \mid h)]^{\top}$ satisfies that for any test $q$ there exists a function $f_q$ such that $p(q \mid h) = f_q(p(Q \mid h))$, then $Q$ is considered to constitute a PSR. The set $Q$ is called the test core, and the prediction vector $p(Q \mid h)$ is called the PSR state. In this paper, we only consider linear PSRs, so the function $f_q$ can be represented by a weight vector $m_q$ such that $p(q \mid h) = m_q^{\top} p(Q \mid h)$. When action $a$ is performed from the history $h$ and observation $o$ is obtained, the next PSR state $p(Q \mid hao)$ can be updated from $p(Q \mid h)$ as follows [12]:

$$p(Q \mid hao) = \frac{M_{ao}^{\top}\, p(Q \mid h)}{m_{ao}^{\top}\, p(Q \mid h)}. \tag{1}$$

In formula (1), $\top$ denotes the transpose operation, $m_{ao}$ is the weight vector of the one-step test $ao$, and $M_{ao}$ is a matrix whose $i$th column corresponds to the weight vector $m_{aoq_i}$.
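To make the update concrete, the following is a minimal NumPy sketch of the linear update in formula (1), assuming the weight vector $m_{ao}$ and matrix $M_{ao}$ have already been learned; the variable names are illustrative.

```python
import numpy as np

def psr_update(p_Q_h, m_ao, M_ao):
    """Linear PSR state update (formula (1)).

    p_Q_h : (k,) prediction vector p(Q|h) over the test core Q
    m_ao  : (k,) weight vector of the one-step test ao
    M_ao  : (k, k) matrix whose i-th column is the weight vector of the test a o q_i
    Returns the next prediction vector p(Q|hao).
    """
    numerator = M_ao.T @ p_Q_h    # predictions p(a o q_i | h) for every core test q_i
    denominator = m_ao @ p_Q_h    # prediction p(ao | h) of the one-step test
    return numerator / denominator
```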

3.2. Recurrent Predictive State Representation

The PSR model obtained by using the substate-space method [25] or the spectral learning algorithm [26] can only be applied to modelling systems with discrete observations. More recently, Ahmed et al. [27] proposed the recurrent predictive state representation (RPSR), which treats predictive state models as a recurrent network and is able to represent systems with continuous observations. Similar to the PSR state, the RPSR state is the conditional distribution of future observations, so, with a suitable selection of features, the mapping between the RPSR state and the predicted observation for a given action can be fully known or easy to learn. This characteristic turns the learning of the network into supervised learning, which makes the modelling simple and efficient [28, 29].

The state update process of the RPSR can be divided into two steps. As can be seen from Section 3.1, if $Q$ is the test core, then the predictive state $q_t$ is a sufficient state representation at time $t$. Establishing an extended test core then ensures that the extended prediction vector is a sufficient statistic of the distribution of future observations for any action. Given an estimate of the state $q_t$, the next state $q_{t+1}$ can be obtained once the action $a_t$ and observation $o_t$ are available; the intermediate quantity $p_t$ is called the extended state. The steps of the state update are as follows [14]:

(i) State extension: the state $q_t$ is transformed into the extended state $p_t$ through the linear map $W_{\text{ext}}$, which is a parameter that needs to be learned:

$$p_t = W_{\text{ext}}\, q_t. \tag{2}$$

(ii) Conditioning: given $a_t$ and $o_t$, the next state $q_{t+1}$ is calculated from the current extended state $p_t$ by the conditioning function $f_{\text{cond}}$, where the kernel Bayes rule with a Gaussian RBF kernel is used [30]:

$$q_{t+1} = f_{\text{cond}}(p_t)(a_t, o_t). \tag{3}$$

The calculation detail is as follows. As the extended feature is a Kronecker product of the immediate feature and the future feature, the extended state $p_t$ can be divided into two parts, derived from the skipped future observations and the present observation, respectively. First, the feature vectors $\phi^a(a_t)$ and $\phi^o(o_t)$ are extracted for the given action $a_t$ and observation $o_t$. Second, $\phi^a(a_t)$ and the second part of the extended state are multiplied to calculate the observation covariance after $a_t$ is executed, and the inverse of this observation covariance is multiplied by the first part of the extended state to change "predicting the observation" into "conditioning on the observation", i.e., transforming the joint expectation of the immediate observation and the skipped future into the conditional expectation of the skipped future given the immediate observation. Finally, this conditional expectation is multiplied by $\phi^a(a_t)$ and $\phi^o(o_t)$ to obtain the next state $q_{t+1}$.
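As an illustration only, the sketch below implements a simplified version of this two-step update in NumPy, under the assumption that the two parts of the extended state have already been reshaped into a "future-observation" tensor and an "observation-covariance" tensor indexed by action features; the tensor layout, the ridge term, and the function names are assumptions for exposition rather than the exact construction of [14, 30].

```python
import numpy as np

def extend_state(q_t, W_ext):
    """State extension (formula (2)): p_t = W_ext q_t."""
    return W_ext @ q_t

def condition_state(T_fo, T_oo, phi_a, phi_o, reg=1e-3):
    """Simplified conditioning step (formula (3)) in the spirit of the kernel Bayes rule.

    T_fo  : (d_future, d_obs, d_act) part of the extended state pairing skipped-future
            features with immediate observation features, indexed by action features
    T_oo  : (d_obs, d_obs, d_act) part of the extended state giving the immediate
            observation covariance, indexed by action features
    phi_a : (d_act,) feature vector of the executed action a_t
    phi_o : (d_obs,) feature vector of the received observation o_t
    """
    C_fo = np.tensordot(T_fo, phi_a, axes=([2], [0]))   # (d_future, d_obs), conditioned on a_t
    C_oo = np.tensordot(T_oo, phi_a, axes=([2], [0]))   # (d_obs, d_obs) observation covariance
    # Multiplying by the (regularized) inverse covariance turns "predicting o"
    # into "conditioning on o"; applying phi_o then yields the next predictive state.
    W = C_fo @ np.linalg.inv(C_oo + reg * np.eye(C_oo.shape[0]))
    return W @ phi_o
```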

The RPSR model can be seen as a recursive filter, implemented by transforming formulas (2) and (3) into a recurrent network. The output of the recurrent network is a predicted observation $\hat{o}_t = f_{\text{pred}}(q_t, a_t)$, where $f_{\text{pred}}$ is the predictive observation function that needs to be learned. The states $q_t$ and extended states $p_t$ are represented in terms of observable quantities and can be estimated by supervised regression, and $W_{\text{ext}}$ follows from the linearity of formula (2). Therefore, in the process of network training, the two-stage regression method [28] is used to initialize the state $q_t$, the extended state $p_t$, and the linear map $W_{\text{ext}}$.

3.3. Deep Q-Network

DQN is a method combining deep learning and Q-learning, which has succeeded in handling environments with high-dimensional perceptual input [31]. It is a multilayered neural network that, for a given state $s$, outputs a predicted action value $Q(s, a; \theta)$ for each possible action $a$, where $\theta$ are the network parameters. In other words, DQN uses a neural network as an approximation of the action-value function.

In DQN, the last four frames of observations are directly input to a CNN as the first layer of DQN to compute the current state information. Then, the state information is mapped to a vector of action values for the current state through fully connected layers [32]. DQN optimizes the action-value function by updating the network weights $\theta$ to minimize a differentiable loss function [9]:

$$L(\theta) = \mathbb{E}_{(s, a, r, s')}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\right)^{2}\right]. \tag{4}$$
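For reference, a minimal PyTorch sketch of this loss, assuming a small fully connected Q-network over state vectors (rather than the CNN over frames mentioned above); the architecture and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Small fully connected Q-network mapping a state vector to one value per action."""
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions))

    def forward(self, s):
        return self.net(s)

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Squared TD error between the evaluation and target networks (formula (4))."""
    s, a, r, s_next = batch                       # tensors: (B, d), (B,), (B,), (B, d)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * target_net(s_next).max(dim=1).values
    return ((target - q_sa) ** 2).mean()
```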

4. RPSR-DQN

With the benefits of the RPSR approach and the great success of the deep Q-network, we propose a model-based method which combines the RPSR with deep Q-learning. Firstly, we use a recurrent network to build a PSR model of the partially observable dynamical system. Then, the true state $q_t$, namely, the RPSR state, provides sufficient information for selecting the best action and is updated as a new action is executed and a new observation is received. Finally, the tuple $(q_t, a_t, r_t, q_{t+1})$, where $r_t$ represents the reward returned for taking action $a_t$ in the current state $q_t$, is stored and used as the data for training the deep Q-network (DQN).

As depicted in Figure 1, the architecture of our method consists of the RPSR model part and the value-based policy part. In the RPSR model, the state $q_t$ is transformed into the extended state $p_t$ through the extension part, i.e., a linear map. The extended state is then updated to the next state $q_{t+1}$ according to the action $a_t$ and observation $o_t$. The whole state update process can be represented as

$$q_{t+1} = f_{\text{cond}}(W_{\text{ext}}\, q_t)(a_t, o_t). \tag{5}$$

For the policy part, the deep Q-network is used to select the action that yields the best long-term reward according to the current state information calculated by the RPSR model.
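The sketch below illustrates how the two parts interact when acting in the environment, assuming a hypothetical `filt` object wrapping the RPSR state update of formula (5), a `feat` object providing the action/observation feature maps, and the classic gym step API; all of these names are placeholders rather than parts of our released code.

```python
import numpy as np
import torch

def run_episode(env, filt, q_net, feat, epsilon=0.05):
    """Roll out one episode: the RPSR filter tracks q_t, the Q-network selects actions."""
    obs = env.reset()
    q_t = filt.initial_state()           # hypothetical helper: initial predictive state q_1
    total_reward, done = 0.0, False
    while not done:
        if np.random.rand() < epsilon:   # epsilon-greedy exploration
            a = env.action_space.sample()
        else:
            with torch.no_grad():
                a = int(q_net(torch.as_tensor(q_t, dtype=torch.float32)).argmax())
        obs, r, done, _ = env.step(a)
        q_t = filt.update(q_t, feat.action(a), feat.obs(obs))   # formula (5)
        total_reward += r
    return total_reward
```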

The learning process is divided into two stages: building the model and training the policy network. In the first stage, an exploration strategy is used to collect training data to build the model of the environment. We use the data-processing method proposed by Ahmed et al. [27]. We use 1000 random Fourier features (RFFs) [33] as approximate features of observations and actions. Then, we apply principal component analysis (PCA) [34] to project the features into 25 dimensions. Here, the number of features and dimensions depends on the complexity of the environment. We denote the feature function as $\phi$. The linear map $W_{\text{ext}}$, the states $q_t$, and the extended states $p_t$ in the RPSR model are initialized by using a two-stage regression algorithm [28]. We use $\phi_t^o$, $\phi_t^a$, $\xi_t^o$, and $\xi_t^a$ to denote sufficient features of future observations, future actions, extended future observations, and extended future actions at time $t$, respectively. Because the states $q_t$ and extended states $p_t$ are represented in terms of observable quantities and follow from linearity of expectation, they are computed by using the kernel Bayes rule (stage-1 regression). Since the state extension is linear, $p_t = W_{\text{ext}} q_t$, we can then linearly regress the extended state $p_t$ on the state $q_t$ using a least squares approach (stage-2 regression) to compute $W_{\text{ext}}$. After initialization, the parameters of the RPSR model can be optimized by using backpropagation through time [35] to minimize the prediction error

$$L_{\text{pred}} = \sum_{t} \left\lVert \hat{o}_t - o_t \right\rVert^{2}, \tag{6}$$

where $\hat{o}_t$ is the predicted observation of the RPSR model.
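The feature pipeline and initialization can be sketched with standard tools, where scikit-learn's RBFSampler produces the random Fourier features and PCA the 25-dimensional projection; for brevity, the sketch stands in ridge regressions for the kernel-Bayes-rule (stage-1) and least-squares (stage-2) steps, so it is a schematic illustration of the procedure rather than the exact estimator of [27, 28].

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

def build_feature_map(raw_samples, n_rff=1000, n_components=25, gamma=1.0):
    """Fit RFF + PCA on raw observation/action samples and return a feature function."""
    rff = RBFSampler(gamma=gamma, n_components=n_rff).fit(raw_samples)
    pca = PCA(n_components=n_components).fit(rff.transform(raw_samples))
    return lambda x: pca.transform(rff.transform(np.atleast_2d(x)))

def two_stage_regression(history_feats, future_feats, extended_feats, reg=1e-2):
    """Schematic two-stage regression: stage 1 estimates states q_t and extended
    states p_t from history features; stage 2 regresses p_t on q_t to get W_ext."""
    q = Ridge(alpha=reg).fit(history_feats, future_feats).predict(history_feats)      # states q_t
    p = Ridge(alpha=reg).fit(history_feats, extended_feats).predict(history_feats)    # extended states p_t
    W_ext = Ridge(alpha=reg, fit_intercept=False).fit(q, p).coef_                     # formula (2)
    return q, p, W_ext
```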

After the model of the dynamic environment is established, the current state information of the partially observable environment can be expressed by the model, and the policy part is trained on this basis. In the process of policy training, we build an evaluation network and a target network, both composed of two fully connected layers. We use experience replay [32] to train the networks. When the agent interacts with the environment, we store transitions $(q_t, a_t, r_t, q_{t+1})$ in the data set $D$. Then, we sample random minibatches of transitions to train the policy network by minimizing the value difference between the target network and the evaluation network. These losses are backpropagated into the weights of both the encoder and the Q-network. The target value is $y_j = r_j + \gamma \max_{a'} Q(q_{j+1}, a'; \theta^{-})$, where $\theta^{-}$ denotes the parameters of an earlier Q-network. The details are shown in Algorithm 1.
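A minimal sketch of this training step, reusing the `dqn_loss` function sketched in Section 3.3; the buffer capacity, batch size, and optimizer choice are illustrative.

```python
import random
from collections import deque
import numpy as np
import torch

class ReplayBuffer:
    """Fixed-capacity store of (q_t, a_t, r_t, q_{t+1}) transitions."""
    def __init__(self, capacity=50_000):
        self.buf = deque(maxlen=capacity)

    def push(self, q_t, a_t, r_t, q_next):
        self.buf.append((q_t, a_t, r_t, q_next))

    def sample(self, batch_size):
        q, a, r, q_next = zip(*random.sample(self.buf, batch_size))
        return (torch.as_tensor(np.stack(q), dtype=torch.float32),
                torch.as_tensor(a, dtype=torch.long),
                torch.as_tensor(r, dtype=torch.float32),
                torch.as_tensor(np.stack(q_next), dtype=torch.float32))

def train_step(q_net, target_net, buffer, optimizer, batch_size=64, gamma=0.99):
    """One gradient step on the TD loss over a random minibatch from the buffer."""
    if len(buffer.buf) < batch_size:
        return
    loss = dqn_loss(q_net, target_net, buffer.sample(batch_size), gamma=gamma)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```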

Input: learning rate $\alpha$
Output: network parameters $\theta$
(1) Randomly select actions to generate N trajectories
(2) Compute the sufficient features $\phi_t^{o,n}$, $\phi_t^{a,n}$, $\xi_t^{o,n}$, $\xi_t^{a,n}$ of every trajectory ($n$ denotes the trajectory)
(3) Establish the recurrent predictive state representation:
(4) Initialize the PSR by two-stage regression:
(5) Use the kernel Bayes rule to estimate the states $q_t$ and extended states $p_t$
(6) Apply the least squares method to formula (2) to compute $W_{\text{ext}}$
(7) Set the initial state $q_1$ to the average of the estimated states
(8) Local optimization:
(9) for i = 1, N do
(10) Initialize state $q_1$
(11) for t = 1, m do
(12) Predict the observation $\hat{o}_t = f_{\text{pred}}(q_t, a_t)$
(13) Perform a gradient descent step on the prediction error $\lVert \hat{o}_t - o_t \rVert^2$
(14) Update the state: $q_{t+1} = f_{\text{cond}}(W_{\text{ext}}\, q_t)(a_t, o_t)$
(15) end for
(16) end for
(17) Optimize the policy network:
(18) Initialize the reactive policy (Q-network parameters $\theta$) randomly
(19) for episode = 1, M do
(20) Initialize state $q_1$
(21) for t = 1, T do
(22) With probability $1-\epsilon$, select $a_t = \arg\max_a Q(q_t, a; \theta)$; otherwise select a random action
(23) Execute action $a_t$ in the emulator and observe reward $r_t$ and observation $o_t$
(24) Update the filter: $q_{t+1} = f_{\text{cond}}(W_{\text{ext}}\, q_t)(a_t, o_t)$
(25) Store transition $(q_t, a_t, r_t, q_{t+1})$ in $D$
(26) Sample a random minibatch of transitions $(q_j, a_j, r_j, q_{j+1})$ from $D$
(27) Set $y_j = r_j + \gamma \max_{a'} Q(q_{j+1}, a'; \theta^{-})$
(28) Perform a gradient descent step on $(y_j - Q(q_j, a_j; \theta))^2$ with respect to $\theta$
(29) end for
(30) end for

5. Experiments

We select the following three Gym environments to evaluate the performance of RPSR-DQN (see Figure 2): the classic control environment CartPole-v1 and the MuJoCo robot environments Swimmer-v1 and Reacher-v1. These environments provide qualitatively different challenges. To fit our experimental setting, we make some changes to the three environments, as described below.

CartPole-v1: this task is controlled by applying a left or right force to the cart to move it left or right. A reward of +1 is provided for every time step in which the angle of the pole is less than 15 degrees. The episode is terminated when the pole is more than 15 degrees from vertical or the cart moves more than 2.4 units from the center. The goal is to prevent the pole, which is attached to the cart by an unactuated joint, from falling over. There are two action values in this environment, namely, the direction of the force applied to the cart. To make the environment partially observable, we remove the observations that represent velocities, reducing the original four observations to two: the position of the cart and the angle of the pole. The agent therefore needs the ability to infer velocity from successive positions.

Swimmer-v1: this environment involves a 3-link swimming robot in a viscous fluid, where the goal is to make it swim forward as fast as possible by actuating its joints. To make the environment partially observable, we remove the observations that represent velocities, reducing the original five observations to three, namely, the angles of the three links. The action in this environment is a vector with three elements representing the control of the three links. Each element takes a discrete value from the arithmetic progression with step 0.2 over the range [−1.4, 1.4].

Reacher-v1: this environment involves a 2-link robot arm connected to a central point. The goal of this task is to move the endpoint of the arm to the target location. The reward at each step is the negative of the sum of the distance between the endpoint of the arm and the target point and the control cost. The action in this task is a vector with two elements representing the forces applied to the arm. Each element takes a discrete value from the arithmetic progression with step 0.2 over the range [−1.4, 1.4]. To make the environment partially observable, we reduce the original six observations to four, which represent the angles of the two links and the relative distance between the link and the target position. This task requires finding a balance between exploration and exploitation.
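Removing velocity components can be done with a simple observation wrapper; the sketch below shows the idea for CartPole-v1 under the classic gym API, assuming the standard ordering [cart position, cart velocity, pole angle, pole angular velocity] (index positions may differ across gym versions, and the other two environments would use their own index sets).

```python
import gym
import numpy as np

class MaskVelocity(gym.ObservationWrapper):
    """Keep only the observation components at `keep_idx`, hiding velocities."""
    def __init__(self, env, keep_idx):
        super().__init__(env)
        self.keep_idx = np.asarray(keep_idx)
        low = env.observation_space.low[self.keep_idx]
        high = env.observation_space.high[self.keep_idx]
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

    def observation(self, obs):
        return np.asarray(obs, dtype=np.float32)[self.keep_idx]

# Keeping indices 0 and 2 leaves only the cart position and pole angle.
env = MaskVelocity(gym.make("CartPole-v1"), keep_idx=[0, 2])
```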

In this section, we compare methods using two metrics: the best reward, defined as $\max_i R_i$ over all iterations, where $R_i$ is the total return of the $i$th iteration, and the mean reward, defined as the mean return over the last 25 iterations.
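Both metrics are straightforward to compute from the list of per-iteration returns; a small sketch, where `returns` is assumed to hold the total return of each training iteration.

```python
import numpy as np

def best_and_mean_reward(returns, window=25):
    """Best return over all iterations and mean return over the last `window` iterations."""
    returns = np.asarray(returns, dtype=float)
    return returns.max(), returns[-window:].mean()
```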

Comparison with model-free methods: we compared the performance of RPSR-DQN with model-free methods, including DQN-1frame and DRQN. The results are shown in Figure 3. Compared with DQN-1frame, which selects the best action based only on the current observation, RPSR-DQN shows that the predictive state model is very effective at tracking and updating the state of the environment. Because RPSR-DQN has a model learning stage, it learns faster than DRQN and converges to stable performance with fewer iterations. Even with sufficient update iterations, RPSR-DQN still obtains better rewards than DRQN in the final stable situation. The first three rows of Tables 1 and 2 show the numerical results, including the performance of the three methods on all tasks.

Comparison with policy-based methods: Figure 4 shows the results of comparing RPSR-DQN with the policy-based method RPSP [14]. Note that as a policy-based method, RPSP can be applied to both continuous and discrete environments. In discrete-action environments, our method obtains better mean rewards in the final stable situation than the policy-gradient method RPSP. In the Reacher-v1 task, the reasons for the poor performance of RPSP may be as follows: the randomly initialized weights tend to produce strongly positive or negative outputs, which means that most initial actions give the link the maximum or minimum acceleration. This causes a problem: the link manipulator cannot stop its rotary movement as long as maximum force is applied to the joint. In this case, once the robot has started training, this meaningless state will cause it to deviate from its current strategy. RPSP may not explore enough to select the action that stops the link manipulator from rotating. The last two rows of Tables 1 and 2 show the numerical results, including the performance of the two methods on all tasks.

6. Conclusion

In this paper, we propose RPSR-DQN, a method that can learn the model of, and make decisions in, a partially observable environment. Combining the predictive state model with a value-based approach results in good performance in partially observable environments. We compare RPSR-DQN with DRQN in different partially observable environments and show that our method achieves better performance in terms of learning speed and expected rewards. We also compare our approach with the state-of-the-art recurrent predictive state policy (RPSP) networks, where the PSR model and a reactive policy are trained simultaneously in an end-to-end manner. Experiment results show that with the benefits of the DQN framework and the separation of model learning from policy training, our approach outperforms the state-of-the-art baselines.

Data Availability

The experimental data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (nos. 61772438 and 61375077). This work was also supported by the China Scholarship Council (201906315049).