Abstract

Count-based exploration algorithms have been shown to be effective in various deep reinforcement learning tasks. However, existing count-based exploration algorithms do not work well in high-dimensional state spaces due to the complexity of state representation. In this paper, we propose a novel count-based exploration method that can explore high-dimensional continuous state spaces and can be combined with any reinforcement learning algorithm. Specifically, we introduce an embedding network that encodes states and merges those with similar key characteristics, thereby compressing the high-dimensional state space. Using binary state codes to count state occurrences, we generate additional rewards that encourage the agent to explore the environment. Extensive experimental results on several commonly used environments show that our proposed method significantly outperforms strong baselines.

1. Introduction

Reinforcement learning (RL), which aims to learn an optimal control strategy that maximizes the reward obtained from the environment, has achieved great success in various complex tasks, such as video games [1] and robot control [2]. One of the core problems in RL is how the agent should trade off exploring new actions against selecting the best-known action based on existing knowledge. Although simple heuristic exploration methods with theoretical guarantees exist for tabular RL algorithms, such as the ε-greedy strategy [3] and entropy regularization [4], these methods cannot be easily extended to high-dimensional environments due to the large state space and the complexity of state representation. Therefore, developing a general and simple exploration method is an important research direction.

In this paper, we propose a novel count-based exploration method via embedded state space for reinforcement learning. The core idea is to compress the high-dimensional state space by extracting an embedded representation of the state space and merging similar states. We use an action prediction model to train the state embedding network so as to obtain a better state feature space. Consider a human player as an example: suppose someone is playing a racing game in which the screen changes as the car moves forward, as shown in Figure 1. The change of the track in the yellow box affects the next move, while the change of the sky in the red box does not. We should therefore focus on the state characteristics that affect the choice of actions. In summary, the contributions of this paper are as follows: (1) We propose a new count-based exploration method that is suitable for high-dimensional state spaces and can be directly combined with most RL algorithms. (2) We design a general mechanism for optimizing feature representations by introducing the embedding network and the action prediction model. (3) We conduct experiments on several Atari games [1] and achieve near state-of-the-art results with fewer training epochs.

The rest of this article is organized as follows: in Section 2, we review related work. In Section 3, we introduce notation and propose our exploration method. In Section 4, we present experimental results on different kinds of environments. Finally, in Section 5, we conclude the paper and point out future work.

2. Related Work

Many different approaches have been proposed in recent years to address the balance between exploration and exploitation. These methods can be divided into two types: count-based exploration methods and curiosity-based exploration methods. The former count how many times each state has occurred and convert this count into a reward that encourages the agent to visit rarely seen states. One of the best-known approaches is the UCB bandit algorithm [5], which at time $t$ selects the action $a$ that maximizes the upper confidence bound $\hat{r}(a) + \sqrt{2 \ln t / n(a)}$, where $\hat{r}(a)$ is the estimated reward and $n(a)$ is the number of times action $a$ has previously been chosen. The Model-Based Interval Estimation-Exploration Bonus (MBIE-EB) of [6] has a similar structure: it counts state-action pairs with a table and adds a bonus reward of the form $\beta / \sqrt{n(s, a)}$ to encourage exploring less-visited pairs. It is proved in [7] that this inverse-square-root dependence on the count is optimal. Tang et al. [8] use hash functions to encode the state space, subsume similar states into a single counter, and explore based on the counter's value. Martin et al. [9] generalize a probability density model through a characteristic representation of the state space and use this model to compute pseudocounts. Curiosity-based exploration methods offer additional rewards based on the principle of optimism in the face of uncertainty [10]. These methods encourage the agent to choose actions whose value estimates are highly uncertain. Classical examples use the upper confidence bound [11] and Thompson sampling [12] for the stochastic sampling of actions. Recent algorithms combine these ideas with finer-grained uncertainty estimates, making them suitable for large state spaces that require deep exploration [13–15]. The dynamic autoencoder (Dynamic-AE) [16] compresses the state space, computes the distance between the predicted and actual next state in this compressed space, and defines an intrinsic reward from this distance. Pathak et al. [17] use a self-supervised inverse dynamics model to learn state features and predict the next state from the current state-action pair; the error between prediction and reality is then used to generate curiosity. Savinov et al. [18] propose a new definition of curiosity that marks the novelty of states by reachability. Their episodic curiosity module (ECO) uses an episodic memory pool to store part of the visited states and compares each new state with the states in memory; if the current state is far from the states contained in memory, the agent receives an intrinsic reward.

Several recent studies have discussed the generalization of reinforcement learning and designed procedurally generated environments to test it [19–21]. More recent papers show that traditional exploration methods fall short in procedurally generated environments and address this issue with new exploration methods [22, 23]. [24] proposes a new perspective on exploration bonuses at the episode level and achieves state-of-the-art performance on procedurally generated benchmarks. In the field of multiagent reinforcement learning (MARL), research on exploration is still at a preliminary stage. Most of these exploration methods extend ideas from the single-agent setting and propose different mechanisms by integrating the characteristics of deep MARL. Compared with single-agent RL, the dimensionality of the state-action space in MARL grows rapidly as the number of agents increases. Zhou et al. [25] propose to treat the Q-function as a high-order, high-dimensional tensor and approximate it with factorized pairwise interactions. [26] adopts a similar factorization of the search space to solve this problem.

3. The Proposed Method

In this work, we propose a novel count-based exploration method via embedded state space for deep reinforcement learning. Our method consists of two major parts: (1) the Embedding Network and Action Prediction Model and (2) the Count-Based Extra Bonus Generator. For each RL task, we first collect state information by letting the agent interact randomly with the environment. These data are used to train an embedding network that represents state features better. Then, we count the occurrence number of states based on the embedded feature representation. Finally, we generate an extra bonus and add it to the RL algorithm's reward for training the agent.

3.1. Notations

Reinforcement learning (RL) [3] addresses the task of learning from interactions to achieve goals. It is usually formulated as a Markov decision process (MDP) $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma)$, where $\mathcal{S}$ is the set of states of the environment, $\mathcal{A}$ is the set of available actions, $\mathcal{P}(s_{t+1} \mid s_t, a_t)$ is the state transition distribution, $r(s_t, a_t)$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor. The agent is formally a policy $\pi(a \mid s)$ that maps a state to an action. At timestep $t$, the agent is in state $s_t$, receives a reward $r_t$, and takes an action $a_t$. We seek a policy that maximizes the expected sum of future rewards. The action-value of a state-action pair $(s, a)$ under a policy $\pi$ is the expected discounted sum of future rewards obtained by taking action $a$ in state $s$ and following $\pi$ thereafter: $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \mid s_t = s, a_t = a\right]$.
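As a small illustration of the quantity that $Q^{\pi}$ averages, the following Python snippet computes the discounted sum of rewards along one trajectory; the helper name discounted_return and the example values are ours, for illustration only.

def discounted_return(rewards, gamma=0.99):
    # Computes sum_k gamma^k * r_k for a single trajectory.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: three rewards of 1.0 give 1 + 0.99 + 0.99^2 ≈ 2.9701.
print(discounted_return([1.0, 1.0, 1.0]))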

3.2. Embedding Network and Action Prediction Model

When MDP states have a complex structure, as in the case of image observations, directly measuring their similarity in pixel space does not provide an effective metric. Previous works in computer vision [27–29] introduce manually designed feature representations of images that are suitable for semantic tasks such as detection and classification. More recent methods learn complex features directly from data by training convolutional neural networks [30–32]. In light of this research, it is difficult to merge similar states using raw pixels or the general state space.

As mentioned in the earlier F1 racing game example, some features of the state space are irrelevant to the agent, so we need to extract the useful features of the state space. To achieve this, we first divide the features of the state into three parts: (1) things that can be controlled by the agent (e.g., the car in the game), (2) things that the agent cannot control but that can affect the agent (e.g., the track), and (3) things out of the agent's control that do not affect the agent (e.g., the sky). We need to find a good feature space that includes the features of (1) and (2) and excludes the features of (3).

Our goal is to develop a general mechanism for learning feature representations rather than designing them manually for each environment. We propose that such a feature space can be learned by training a deep neural network with two submodules: the first submodule (the embedding network) encodes the raw state $s_t$ into a feature vector $\phi(s_t)$. The second submodule (the prediction network) takes the feature encodings $\phi(s_t)$ and $\phi(s_{t+1})$ of two consecutive states as inputs and predicts the action taken by the agent to move from $s_t$ to $s_{t+1}$. The whole model is illustrated in Figure 2. Training this neural network is equivalent to learning the function $g$ defined as

$\hat{a}_t = g(\phi(s_t), \phi(s_{t+1}); \theta_E),$    (1)

where $\hat{a}_t$ is the predicted estimate of the action $a_t$, and the neural network parameters $\theta_E$ are trained to optimize

$\min_{\theta_E} L_P(\hat{a}_t, a_t),$    (2)

where $L_P$ is the loss function that measures the discrepancy between the predicted and actual actions. In order to facilitate the subsequent processing of the state encoding, the embedding network takes the state $s_t$ as input and contains one special dense layer composed of sigmoid units with activations $e(s_t) \in [0, 1]^{K}$. By rounding the sigmoid activations of this layer to their closest binary values, $b(s_t) = \lfloor e(s_t) \rceil \in \{0, 1\}^{K}$, any state can be binarized. A problem with this architecture is that if an activation is near 0.5 in a particular dimension, the rounding error increases. A solution is to force the binary code layer to take on binary values. Therefore, we add another loss term to $L_P$, and the complete loss function is defined as

$L = L_P(\hat{a}_t, a_t) + \lambda \sum_{i=1}^{K} \min\{(1 - e_i(s_t))^2, \, e_i(s_t)^2\},$    (3)

where $\lambda$ weights the binarization term.

The tuple $(s_t, a_t, s_{t+1})$ is obtained while the agent interacts with the environment using its current policy $\pi_{\theta}$.
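To make the two submodules concrete, the following PyTorch sketch shows an embedding network with a sigmoid code layer and an action-prediction head, trained with a cross-entropy action loss plus the binarization penalty of Eq. (3). It is a minimal sketch under the assumptions of this section (discrete actions, states flattened to vectors); the layer sizes and the names CODE_DIM and BINARIZE_WEIGHT are illustrative choices of ours, not values from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

CODE_DIM = 512          # K: dimension of the binary code layer (assumed value)
BINARIZE_WEIGHT = 0.1   # lambda: weight of the binarization term (assumed value)

class EmbeddingNet(nn.Module):
    # Encodes a state into sigmoid activations e(s) in [0, 1]^K.
    def __init__(self, state_dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, CODE_DIM), nn.Sigmoid())

    def forward(self, s):
        return self.body(s)

class ActionPredictor(nn.Module):
    # Predicts the action a_t from the codes of two consecutive states.
    def __init__(self, num_actions):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * CODE_DIM, 256), nn.ReLU(),
            nn.Linear(256, num_actions))

    def forward(self, e_t, e_tp1):
        return self.head(torch.cat([e_t, e_tp1], dim=-1))

def embedding_loss(embed, predict, s_t, s_tp1, a_t):
    e_t, e_tp1 = embed(s_t), embed(s_tp1)
    action_loss = F.cross_entropy(predict(e_t, e_tp1), a_t)
    # Push each sigmoid activation toward 0 or 1 so that rounding loses little information.
    binarize_loss = torch.minimum((1.0 - e_t) ** 2, e_t ** 2).sum(dim=-1).mean()
    return action_loss + BINARIZE_WEIGHT * binarize_loss

The binary code $b(s_t)$ can then be obtained by thresholding the activations, e.g., (embed(s_t) > 0.5).float().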

3.3. Count-Based Extra Bonus Generator

We obtain the embedded state space through the embedding network. An exploration bonus is added to the reward function, defined as

$r^{+}(s_t) = \frac{\beta}{\sqrt{n(c(s_t))}},$    (4)

where $\beta > 0$ is the bonus coefficient and $c(s_t)$ is the binary code of state $s_t$ produced from the embedding (made precise below). Initially, the counts $n(\cdot)$ are set to 0 for the whole range of codes. For every state encountered at time step $t$, $n(c(s_t))$ is increased by 1. The agent is trained with the rewards $r(s_t, a_t) + r^{+}(s_t)$, while its performance is evaluated as the sum of rewards without bonuses.
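A minimal Python sketch of the bonus computation, assuming the code $c(s_t)$ is available as a hashable tuple; the names state_counts, beta, and exploration_bonus follow the notation above but are illustrative choices of ours.

from collections import defaultdict
import math

state_counts = defaultdict(int)   # n(.): one counter per state code, all starting at 0
beta = 0.01                       # bonus coefficient beta (illustrative value)

def exploration_bonus(code):
    # Increment n(c(s_t)) and return beta / sqrt(n(c(s_t))).
    state_counts[code] += 1
    return beta / math.sqrt(state_counts[code])

During training the agent receives r(s_t, a_t) + exploration_bonus(code), while evaluation reports the sum of the environment rewards only.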

We represent the policy $\pi_{\theta}(a \mid s)$ by a deep neural network with parameters $\theta$. Given the agent in state $s_t$, it executes an action $a_t$ sampled from the policy. $\theta$ is optimized to maximize the expected sum of rewards:

$\max_{\theta} \; \mathbb{E}_{\pi_{\theta}}\left[\sum_{t} \gamma^{t} \left(r(s_t, a_t) + r^{+}(s_t)\right)\right].$    (5)

Because the code dimension $K$ often needs to be large for correctly predicting the action, we apply a downsampling procedure to the resulting binary code $b(s_t)$, which can be done through random projection to a lower-dimensional space via SimHash:

$c(s_t) = \mathrm{sgn}\left(A \, b(s_t)\right) \in \{-1, 1\}^{D},$    (6)

where $A \in \mathbb{R}^{D \times K}$ is a matrix whose entries are drawn i.i.d. from a standard Gaussian distribution $\mathcal{N}(0, 1)$. The value of $D$ controls the granularity: higher values lead to fewer collisions and more clearly distinguished states. Algorithm 1 summarizes our method.

Initialize $A \in \mathbb{R}^{D \times K}$ with entries drawn i.i.d. from the standard Gaussian distribution $\mathcal{N}(0, 1)$;
Initialize a hash table with values $n(\cdot) \equiv 0$;
Initialize the policy network $\pi_{\theta}$ with parameter $\theta$ and the embedding network with parameter $\theta_E$;
for each iteration $j$ do {
  Collect a set of state-action samples $\{(s_t, a_t, s_{t+1})\}$ with policy $\pi_{\theta}$;
  Add the state samples to the replay buffer;
  if it is time to update the embedding network (every fixed number of iterations) then {
    Update the embedding network with the loss function in Eq. (3) using samples drawn from the replay buffer;
  }
  Compute $c(s_t) = \mathrm{sgn}(A \, b(s_t))$, the $D$-dim rounded hash code for each $s_t$ learned by the embedding network;
  Update the hash table counts as $n(c(s_t)) \leftarrow n(c(s_t)) + 1$;
  Update the policy $\pi_{\theta}$ using rewards $r(s_t, a_t) + \beta / \sqrt{n(c(s_t))}$ with any RL algorithm;
}
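For the hashing step in Algorithm 1, a SimHash-style projection could look like the following Python sketch; make_projection and simhash_code are illustrative names of ours, and the ±1 sign convention matches Eq. (6).

import numpy as np

def make_projection(code_dim, project_dim, seed=0):
    # A in R^{D x K} with i.i.d. standard Gaussian entries (K = code_dim, D = project_dim).
    return np.random.default_rng(seed).standard_normal((project_dim, code_dim))

def simhash_code(A, binary_code):
    # c(s) = sgn(A b(s)) in {-1, 1}^D, returned as a hashable tuple for the count table.
    return tuple(np.where(A @ binary_code >= 0.0, 1, -1))

# Tying it to the counter sketched above:
#   A = make_projection(code_dim=512, project_dim=64)
#   code = simhash_code(A, b_s)          # b_s: rounded binary code of a state
#   r_plus = exploration_bonus(code)     # bonus added to the environment reward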

4. Experiments

4.1. Experimental Setup

We test our method in multiple environments, ranging from Rllab to the Arcade Learning Environment (ALE) [33]. The experiments in Rllab verify that our method can be used in continuous control tasks. ALE has recently become a standard high-dimensional benchmark for RL. The reward signal is computed from the game score. The raw state is a frame of video (a 210 × 160 array of 7-bit pixels). There are 18 available actions. The ALE is a particularly interesting testbed in our context because the difficulty of exploration varies greatly among games. We choose six games in which exploration is hard. Trust Region Policy Optimization (TRPO) is chosen as the RL algorithm for all experiments because it can handle both discrete and continuous action spaces and is relatively insensitive to hyperparameter changes. The hyperparameter is and . All curves in the figures are smoothed.

4.2. Rllab Environment

The Rllab benchmark consists of various control tasks to test deep RL algorithms. We select several variants of the basic and locomotion tasks that use sparse rewards, as shown in Figure 3. These tasks are all highly difficult to solve with naive exploration strategies, such as adding Gaussian noise to the actions.

Figure 4 shows the results of TRPO (baseline), TRPO-SimHash [8], VIME [34], and our method on the classic task CartPoleSwingup, the locomotion task HalfCheetah, and the hierarchical task SwimmerGather. Count-based exploration with the embedded state space is capable of reaching the goal in all environments (which corresponds to a nonzero return), while the baseline TRPO with Gaussian control noise fails completely. Although our method picks up the sparse reward on HalfCheetah and receives a better reward than the other count-based exploration algorithm, it does not perform as well as VIME. In contrast, our method performs comparably to VIME on CartPoleSwingup and outperforms it on SwimmerGather.

4.3. Arcade Learning Environment

The Arcade Learning Environment (ALE) [33], which consists of Atari 2600 video games, is an important benchmark for deep RL due to its high-dimensional state space and wide variety of games. In order to demonstrate the effectiveness of the proposed exploration strategy, six games featuring long horizons and requiring significant exploration are selected: Freeway, Frostbite, Gravitar, Montezuma’s Revenge, Solaris, and Venture. The agent is trained for 500 iterations in all experiments, with each iteration consisting of 0.1 M steps (the TRPO batch size corresponds to 0.4 M frames).

We compare our results to double DQN [35], dueling network [36], A3C+ [37], double DQN with pseudocounts [37], Gorila [38], DQN Pop-Art [39], and TRPO-SimHash [8] under the “null op” metric. We summarize all results in Table 1.

As observed in Table 1, our approach performs better on most of the games than similar count-based exploration methods. This indicates that the embedded state space obtained by feature extraction with the action prediction network can select the parts of the state that have a more important influence on the agent’s decision-making. Our method achieves near state-of-the-art performance on Freeway, Frostbite, and Solaris. A reason why TRPO+BASS performs better than our method on Montezuma’s Revenge is that BASS is a hand-designed feature transformation for images in Atari 2600 games. The hash codes generated by our method distinguish between visually different states but fail to emphasize that the agent needs to explore different rooms, whereas the hand-designed feature transformation captures this key information explicitly.

4.4. Downsampling

We apply a downsampling process to the generated binary code as in equation (6), which can be done using SimHash’s random projection to a lower-dimensional space. However, there may be states that are distinct but fall into the same group after downsampling. Moreover, different downsampling dimensions have different effects on the final experimental results. We therefore conduct further experiments in different game environments. Figure 5 and Table 2 give an overview of the results.

As observed in the results, with low-dimensional downsampling the agent reaches a plateau fastest, but the reward is relatively lower. Conversely, with high-dimensional downsampling the convergence is slower, but the reward obtained is higher. This phenomenon can be explained as follows: downsampling to a low dimension greatly compresses the state space, so the novelty of states disappears quickly for the agent. Moreover, because many different states are mapped to the same code, the agent’s ability to explore is greatly reduced and it cannot obtain more effective rewards. We also find a correlation between the complexity of the environment and the optimal downsampling dimension. For complex games such as Montezuma’s Revenge, a higher-dimensional downsampled representation can better distinguish different states and achieve excellent returns. For games with relatively low complexity, such as Freeway, high-dimensional downsampling leads to a form of overfitting: overdistinguishing similar states lowers the agent’s overall reward.

5. Conclusion

In this paper, we propose a novel count-based exploration method for deep reinforcement learning. By introducing the embedding network and the action prediction model, the proposed method extracts the state features that have a positive impact on the agent’s decisions and encourages the agent to explore states with higher rewards. Extensive experiments demonstrate that our proposed method achieves promising performance on different tasks. In future work, we plan to optimize the representation of state features and to apply the state feature extraction framework to other kinds of reinforcement learning exploration methods.

Data Availability

Previously reported environment data were used to support this study and are available at 10.1613/jair.3912. These prior studies (and datasets) are cited at relevant places within the text as references. The experiment data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (No. 61972065).