Research Article | Open Access
Deep Ensemble Reinforcement Learning with Multiple Deep Deterministic Policy Gradient Algorithm
The deep deterministic policy gradient algorithm, which operates over a continuous space of actions, has attracted great attention in reinforcement learning. However, its exploration strategy through dynamic programming within the Bayesian belief state space is rather inefficient even for simple systems. Another problem is that the sequential and iterative training data of autonomous vehicles are subject to the law of causality, which violates the i.i.d. (independent and identically distributed) assumption on the training samples. This usually causes the standard bootstrap to fail when learning an optimal policy. In this paper, we propose a framework of m-out-of-n bootstrapped and aggregated multiple deep deterministic policy gradient to accelerate the training process and increase the performance. Experimental results on the 2D robot arm game show that the reward gained by the aggregated policy is 10%–50% better than those gained by the subpolicies. Experimental results on The Open Racing Car Simulator (TORCS) demonstrate that the new algorithm learns successful control policies with 56.7% less training time. An analysis of convergence is also given from the perspective of probability and statistics. These results verify that the proposed method outperforms the existing algorithms in both efficiency and performance.
Reinforcement learning is an active branch of machine learning, where an agent tries to maximize the accumulated reward when interacting with a complex and uncertain environment [1, 2]. Reinforcement learning combined with the deep neural network (DNN) technique [3, 4] has gained some success in solving challenging problems. One of the most noticeable results was achieved through the deep Q-network (DQN), which exploited deep neural networks to achieve maximum accumulated reward [5]. DQN has performed well over 50 different Atari games and inspired many deep reinforcement learning (DRL) algorithms [6–8].
However, DQN only deals with tasks that have small, discrete state and action spaces, while many reinforcement learning tasks have large, continuous, real-valued state and action spaces. Although such tasks could be solved with DQN by discretizing the continuous spaces, doing so may increase the instability of the control system. To overcome this difficulty, the deterministic policy gradient (DPG) algorithm [9] was combined with the DNN technique, producing the deep deterministic policy gradient (DDPG) algorithm [10]. Unfortunately, DDPG suffers from inefficient exploration and unstable training. Many existing works have attempted to solve these problems. Gu et al. proposed the Q-Prop method, which uses a Taylor expansion of the off-policy critic as a control variate to stabilize DDPG. Q-Prop combines on-policy Monte Carlo and off-policy DPG, achieving both sample efficiency and stability. Mnih et al. proposed A3C to stabilize the training process by training parallel agents with asynchronously accumulated updates. Interactive learning with the environment is performed in multiple threads at the same time, and each thread summarizes its learning results and stores them in a common place. In this way, A3C avoids the problem of overly strong correlation in experience replay and achieves an asynchronous concurrent learning model. However, this method consumes considerable computation resources. When implementation complexity is not a strong constraint, any of these policy gradient-related methods can be used to generate subpolicies and further improve our method, where the centralized experience replay buffer stores and shares experiences from all subpolicies, enabling more knowledge to be gained from the environment.
Additionally, researchers have recently attempted to overcome the unstable training of DDPG and speed up its convergence with the bootstrap technique. Osband et al. developed bootstrapped DQN as the critic of DDPG. Yang et al. employed a multiactor architecture for multitask purposes. DBDDPG and MADDPG both used a multiactor-critic structure to improve the exploration efficiency and increase the training stability. Shi et al. introduced deep soft policy gradient (DSPG), an off-policy, stable, model-free deep RL algorithm that combines policy-based and value-based methods under the maximum entropy RL framework. The authors discover that the standard bootstrap is likely to fail when learning an optimal policy, since in most reinforcement learning tasks the sequential and iterative training data are subject to the law of causality, which violates the i.i.d. (independent and identically distributed) assumption on the training samples. Hence, a novel bootstrap technique is needed for achieving the optimal policy.
In consideration of the above shortcomings of the previous work, this paper introduces a simple DRL algorithm with m-out-of-n bootstrap technique [19, 20] and aggregated multiple DDPG structures. The control policy will be gained by averaging all learned subpolicies. Additionally, the proposed algorithm uses the centralized experience replay buffer to improve the exploration efficiency. Since m-out-of-n bootstrap with random initialization produces reasonable uncertainty estimates at low computational cost, this helps in the convergence of the training. The proposed bootstrapped and aggregated DDPG can substantially reduce the learning time.
The remainder of this paper is organized as follows. Section 2 presents a brief background. Section 3 introduces the proposed method in detail and analyses the convergence of the algorithm. The experimental results of the proposed method are presented in Section 4. The paper is concluded in Section 5.
2.1. Reinforcement Learning
In a classical scenario of reinforcement learning, an agent aims at learning an optimal policy according to the reward function by interacting with the environment in discrete time steps, where a policy $\pi$ is a map from the state space $\mathcal{S}$ to the action space $\mathcal{A}$. At each time step $t$, the environment state $s_t$ is observed by the agent, and then it executes the action $a_t$ by following the policy $\pi$. Afterwards, a reward $r(s_t, a_t)$ is received immediately. The following equation defines the accumulated reward that the agent receives from step $t$:
$$R_t = \sum_{i=t}^{T} \gamma^{\,i-t}\, r(s_i, a_i), \qquad (1)$$
where $\gamma \in [0, 1]$ is a discount factor. As the agent maximizes the expected accumulated reward from the initial state, the optimal policy will be gained finally.
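The accumulated reward described above can be computed with a short backward recursion. The following is a minimal sketch, not code from this study; the function name `discounted_return` and the argument name `gamma` (the discount factor) are chosen here for illustration.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum of a reward sequence, starting from its first step."""
    total = 0.0
    # Walk backwards so that each step needs only one multiply-add:
    # R_t = r_t + gamma * R_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three rewards of 1 with gamma = 0.5 give 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```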
2.2. Deterministic Policy Gradient Algorithm
Policy gradient (PG) algorithms optimize a policy directly by maximizing the performance function with the policy gradient. Deterministic policy gradient algorithm which is originated from deterministic policy gradient theorem  is one of the policy gradient methods. It learns deterministic policies with the actor-critic framework, while the critic estimates the action-value function and the actor represents the deterministic policy function. The updates for the action-value function and the policy function are given below:where denotes the discounted state distribution . Since full optimization is expensive, stochastic gradient optimization is usually used instead. The following equation shows the deterministic policy gradient  which is used to update the parameter of the deterministic policy:
2.3. DDPG Algorithm
DDPG applies the DNN technique onto the deterministic policy gradient algorithm , which approximates deterministic policy function and action-value function with neural network, as shown in Figure 1.
There are two sets of weights in DDPG. are weights for main networks while are weights for target networks which are introduced in  for generating the Q-learning targets. We use and to denote the main networks while and represent the target networks. As equations (4) and (5) shows, weights of the main networks are updated according to the stochastic gradient, while weights of target networks are updated with “soft” updating rule , as shown in equation (6):
DDPG utilizes the experience replay technique to break the temporal correlation of the training samples, keeping them consistent with the i.i.d. (independent and identically distributed) assumption. Furthermore, the “soft” updating rule is used to increase the stability of the training process. DDPG updates the main actor network with the policy gradient, while the main critic network is updated by combining supervised learning with Q-learning, as in DQN. After training, the main actor network converges to the optimal policy.
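The “soft” updating rule is commonly written as a convex combination of target and main weights. The sketch below assumes that common form; the name `tau` and its default value of 0.001 are assumptions for illustration, not settings reported in this paper, and weights are represented as plain lists of floats.

```python
def soft_update(target_weights, main_weights, tau=0.001):
    """Move each target weight a small step toward the matching main weight.

    new_target = (1 - tau) * target + tau * main
    """
    return [(1.0 - tau) * t + tau * m
            for t, m in zip(target_weights, main_weights)]


# With tau = 0.1 the target moves 10% of the way toward the main weights.
target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
```

A small `tau` makes the target networks change slowly, which is what keeps the Q-learning targets stable between updates.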
3.1. Structure of Multi-DDPG
Compared with DQN, DDPG is more appropriate for reinforcement learning tasks with continuous action spaces. However, it takes a long time for DDPG to converge to the optimal policy. We propose a multi-DDPG structure combined with a bootstrap technique to train several subpolicies in parallel so as to cut down the training time.
We randomly initialize main critic networks and main actor networks with weights and , and then, we initialize target networks and with weights and initialize the centralized experience replay buffer .
The structure of multi-DDPG with the centralized experience replay buffer is shown in Figure 2. We name the proposed method which utilizes the multi-DDPG structure and bootstrap technique as bootstrapped aggregated multi-DDPG (BAMDDPG). Figure 3 demonstrates that BAMDDPG averages all action outputs of trained subpolicies to achieve the final aggregated policy. For clarity, the terms agent, main actor network, and subpolicy refer to the same thing and are interchangeable in this paper. Algorithm 1 presents the entire algorithm of BAMDDPG.
In Algorithm 1, “#Env” means the number of environment modules while “#selected DDPG” represents the number of selected DDPG components. During the training process, each DDPG component which exploits the actor-critic framework is responsible for training the corresponding subpolicy. Figure 2 demonstrates the training process of a DDPG component, containing the interaction procedure and the update procedure.
In the interaction procedure, the main actor network which represents an agent interacts with the environment. It receives the current environment state and outputs an action . The environment gives the immediate reward and the next state after executing the action. Then the transition tuple is stored into the central experience replay buffer. To efficiently explore the environment, noise sampled from an Ornstein–Uhlenbeck process is added to the action.
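Exploration noise from an Ornstein–Uhlenbeck process can be sketched with a simple Euler discretization. The class below is illustrative only: the class name `OUNoise` and the parameter values `theta=0.15` and `sigma=0.2` are assumptions, not values taken from this paper.

```python
import math
import random


class OUNoise:
    """Discretized Ornstein-Uhlenbeck process for temporally correlated noise."""

    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=1.0,
                 x0=None, seed=None):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = random.Random(seed)
        # Start at the long-run mean unless an initial state is given.
        self.x = list(x0) if x0 is not None else [mu] * dim

    def sample(self):
        # Euler step: x += theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        self.x = [
            x + self.theta * (self.mu - x) * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for x in self.x
        ]
        return list(self.x)


noise = OUNoise(dim=2, seed=0)
action = [0.1, -0.3]  # hypothetical actor output
noisy_action = [a + n for a, n in zip(action, noise.sample())]
```

Because each sample depends on the previous one, consecutive perturbations are correlated, which tends to give smoother exploration than independent Gaussian noise.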
In the update procedure, a random minibatch of transitions used for updating weights is sampled from the central experience replay buffer. The main critic network is updated by minimizing the loss function which is based on the Q-learning method , while the target networks are updated by having them slowly track the main networks. Weights of the main actor network are updated with the policy gradient along which the overall performance increases. By following such an update rule, each subpolicy of BAMDDPG gradually improves. The centralized experience replay buffer stores experiences from all subpolicies.
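A centralized experience replay buffer shared by all subpolicies can be sketched as a bounded queue with uniform minibatch sampling. This is a minimal illustration, not the authors' implementation; the class name `CentralReplayBuffer` and its methods are hypothetical, and the default capacity follows the 100,000 transitions mentioned later in the experiments.

```python
import random
from collections import deque


class CentralReplayBuffer:
    """One buffer that every subpolicy writes to and samples from."""

    def __init__(self, capacity=100_000, seed=None):
        # A bounded deque silently drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement within one minibatch.
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


# Three hypothetical agents store transitions in the same buffer,
# and any of them can then draw a minibatch from the pooled experience.
shared = CentralReplayBuffer(capacity=1000, seed=0)
for agent_id in range(3):
    shared.add(state=[agent_id], action=[0.0], reward=1.0, next_state=[agent_id + 1])
minibatch = shared.sample(batch_size=2)
```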
Figure 3 illustrates the aggregation details of subpolicies. We denote subpolicies approximated by main actor networks with and the outputs of these subpolicies with . In addition, the aggregated policy’s output is denoted as .
In practice, we train multiple subpolicies by setting a maximum number of episodes. Since episodes in BAMDDPG terminate earlier, with fewer steps, than those of the original DDPG algorithm, training the subpolicies takes less time than training a single optimal policy. The less-trained subpolicies can be expected to perform somewhat worse than the optimal policy, but we can aggregate the trained subpolicies to increase the performance and approach the optimal policy. Furthermore, we use averaging as the aggregation strategy in consideration of the equal status and real-valued outputs of all subpolicies. Specifically, the outputs of all subpolicies are averaged to produce the final output.
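The averaging step can be sketched as follows. Subpolicies are modeled here as plain callables that map a state to a list of real-valued action components; the function name `aggregate_action` is hypothetical.

```python
def aggregate_action(subpolicies, state):
    """Average the action vectors proposed by all subpolicies for one state."""
    actions = [policy(state) for policy in subpolicies]
    dim = len(actions[0])
    return [sum(a[j] for a in actions) / len(actions) for j in range(dim)]


# Three toy subpolicies with two-dimensional actions.
subpolicies = [
    lambda s: [1.0, 0.0],
    lambda s: [0.0, 1.0],
    lambda s: [2.0, 2.0],
]
print(aggregate_action(subpolicies, state=None))  # [1.0, 1.0]
```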
As Figure 2 demonstrates, the interaction procedure of a DDPG component requires an environment component to interact with the agent. Therefore, the multi-DDPG structure requires multiple environment modules. However, for some reinforcement learning tasks, the environment module does not support being copied for multiple DDPG components. In such a case, the environment component interacts with only one subpolicy in each time step. BAMDDPG supports reinforcement learning tasks with either one environment module or multiple environment modules by choosing one subpolicy or multiple subpolicies to interact with the environments in each time step. All subpolicies are then updated simultaneously with minibatches sampled from the centralized experience replay buffer. In the end, all trained subpolicies are averaged to form the final policy.
Intuitively, the centralized experience replay technique exploited in BAMDDPG enables each agent to use experiences encountered by other agents. This makes the training of BAMDDPG's subpolicies more efficient, since each agent has a wider view and more information about the environment.
3.2. Analysis on Convergence with Bootstrap and Aggregation
For ease of description, we suppose BAMDDPG trains subpolicies simultaneously and denote these subpolicies with . The aggregated policy is denoted as which can be formulated aswhere represents the aggregation of subpolicies. Let the optimal policy denoted as . Then the following formula holds where means the average bias of subpolicies and the optimal policy while represents bias of the aggregated policy and the optimal policy.
Equation (8) shows that the bias of the aggregated policy is no larger than the average bias of the subpolicies, so the aggregated policy approximates the optimal policy more closely than the subpolicies do on average. Under this conclusion, the aggregated policy approaches the optimal policy quickly once the subpolicies are trained to a certain extent.
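One standard way to obtain an inequality of this kind is Jensen's inequality for the convex squared norm. The derivation below is a sketch under stated assumptions: bias is measured by the squared Euclidean distance between actions, and the symbols $K$, $\pi_i$, $\bar{\pi}$, and $\pi^{*}$ are illustrative and may differ from the notation used by the authors.

```latex
% Assumed notation: K subpolicies \pi_1, ..., \pi_K, their average \bar{\pi},
% and the optimal policy \pi^{*}; bias measured by squared Euclidean distance.
\begin{align*}
\bar{\pi}(s) &= \frac{1}{K}\sum_{i=1}^{K}\pi_i(s), \\
\bigl\|\bar{\pi}(s)-\pi^{*}(s)\bigr\|^{2}
  &= \Bigl\|\frac{1}{K}\sum_{i=1}^{K}\bigl(\pi_i(s)-\pi^{*}(s)\bigr)\Bigr\|^{2}
  \le \frac{1}{K}\sum_{i=1}^{K}\bigl\|\pi_i(s)-\pi^{*}(s)\bigr\|^{2}.
\end{align*}
```

Under these assumptions the bound compares the aggregated policy with the average subpolicy error; it does not by itself guarantee an improvement over the single best subpolicy.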
Further, we analyze the convergence from the perspective of probability and statistics . Assume all policies are from the policy space . The subpolicies are sampled according to a distribution function in . Let denote the empirical cumulative distribution function aswhere is the number of the sampled subpolicies. is an indicator function which outputs 1 when the condition is satisfied, otherwise 0. The operator “” in “” means is a better policy than in , which indicates the agent acting by following the policy is able to gain more reward than those only adopting . According to the rule of Dvoretzky–Kiefer–Wolfowitz inequality , we getwhere represents probability, means upper bound, and is an arbitrary small positive integer.
Equation (10) shows that converges uniformly to the true distribution function exponentially fast in probability. Suppose we are interested in the mean , then the unbiasedness of the empirical measure extends to the unbiasedness of linear functions of the empirical measure. Actually, empirical cumulative distribution can be seen as a discrete distribution with equal probability for each component, which means we can get a policy from the empirical cumulative distribution by averaging multiple policies. Therefore, the aggregating policy subjects to empirical cumulative distribution and it subjects to true distribution. Since is a better policy than in , converges to the optimal policy of .
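For real-valued i.i.d. samples, the Dvoretzky–Kiefer–Wolfowitz inequality is usually stated as P(sup |F_n − F| > ε) ≤ 2·exp(−2nε²) for any real ε > 0. The sketch below evaluates that bound numerically to show how quickly it decays with the number of samples; applying it to policies assumes a scalar ordering of the policy space as described above, and the function names are hypothetical.

```python
import math


def dkw_bound(n, eps):
    """Upper bound 2 * exp(-2 * n * eps**2) on P(sup |F_n - F| > eps)."""
    if n <= 0 or eps <= 0:
        raise ValueError("n and eps must be positive")
    return min(1.0, 2.0 * math.exp(-2.0 * n * eps ** 2))


def samples_needed(eps, delta):
    """Smallest n for which the DKW bound is at most delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


for n in (10, 100, 1000):
    print(n, dkw_bound(n, eps=0.1))
```

The bound is vacuous (equal to 1) for very small sample sizes and then shrinks exponentially in n, which is the sense in which the empirical distribution converges exponentially fast.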
3.3. The m-out-of-n Bootstrap
The bootstrap is an important resampling technique in statistics, which generally works by random sampling with replacement. In this paper, we train multiple DDPG components with the bootstrap. It has been shown that this can be attained simply by initializing the network weights of different DDPG components with different methods. Therefore, we adopt this technique as a prior, and multiple DDPG components are trained in parallel on different subdatasets drawn from the experience replay buffer.
However, the standard bootstrap fails because the training data are subject to a long-tailed distribution rather than the usual normal distribution implied by the i.i.d. assumption. A valid alternative is the m-out-of-n bootstrap method, where the number of bootstrap samples m is much smaller than the size n of the training dataset. More specifically, we draw subsamples without replacement and use these subsamples as new training datasets. Multiple DDPG components are then trained with these newly produced training datasets.
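The subsampling step can be sketched as follows: each DDPG component receives its own subsample of size m drawn without replacement from the n stored transitions. The function name `m_out_of_n_subsamples` and its parameters are hypothetical, and the choice of m is left to the caller.

```python
import random


def m_out_of_n_subsamples(data, m, num_components, seed=None):
    """Draw one size-m subsample (without replacement) per DDPG component."""
    if m > len(data):
        raise ValueError("m must not exceed the dataset size n")
    rng = random.Random(seed)
    return [rng.sample(data, m) for _ in range(num_components)]


transitions = list(range(1000))  # stand-in for n stored transitions
subsets = m_out_of_n_subsamples(transitions, m=50, num_components=3, seed=0)
print([len(s) for s in subsets])  # [50, 50, 50]
```

Elements are unique within each subsample, while different components may still share some transitions because the subsamples are drawn independently.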
4. Results and Discussion
4.1. 2D Robot Arm
In order to illustrate the effectiveness of aggregation, we use BAMDDPG to learn a control policy for a 2D robot arm task.
4.1.1. Benchmark and Reward Function
As Figure 4 demonstrates, a 2D robot arm contains a two-link arm with one joint which is attempting to get to the blue block. The first link rotates around the root point while the second link rotates around the joint point. The action of an agent consists of two real-valued numbers denoting angular increment. We construct the reward according to the distance between the finger point of the arm (endpoint) and the blue block. The farther away the finger point being from the blue block, the lesser the reward is. Additionally, the reward adds one when the distance is less than the threshold . When the finger point stops within the blue block for a while (more than 50 iterations), the reward adds ten. The following equation presents the reward:where is an indicator function which outputs 1 when the condition is satisfied, otherwise 0.
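The verbal description of the reward can be turned into a small function. This is only a sketch of the described structure: the negative-distance base term, the `threshold` value of 0.05, and the parameter names are assumptions made here, since the exact formula and constants are not recoverable from the text.

```python
def arm_reward(distance, steps_on_target, threshold=0.05, hold_steps=50):
    """Reward for the 2D arm: closer is better, with two bonus terms."""
    reward = -distance                # farther from the block means less reward
    if distance < threshold:          # bonus of one when within the threshold
        reward += 1.0
    if steps_on_target > hold_steps:  # bonus of ten after holding for a while
        reward += 10.0
    return reward


print(arm_reward(distance=0.5, steps_on_target=0))   # -0.5
print(arm_reward(distance=0.0, steps_on_target=51))  # 11.0
```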
4.1.2. Performance of Aggregated Policy
During the training process of BAMDDPG, each agent interacts with its corresponding environment, producing multiple learning curves. Figure 5 demonstrates learning curves of 3 subpolicies with shared experience on 2D robot arm benchmark. The curve depicts the moving average of episode reward while the shaded area depicts the moving average ± partial standard deviation. As Figure 5 shows, the training process of BAMDDPG’s subpolicies is better than that of DDPG. The centralized experience replay buffer stores and shares experiences from all subpolicies, enabling more knowledge gained from the environment. Therefore, BAMDDPG’s subpolicies can gain more reward during the training process. After about 1000 episodes, the subpolicies of BAMDDPG and the policy of the original DDPG both converge.
The key of BAMDDPG is the aggregation of subpolicies. In this section, we show the comparison of performance between the aggregated policy and subpolicies so as to illustrate the effectiveness of aggregation. Suppose the action given by the ith subpolicy is , then the immediate reward of the ith subpolicy is given bywhere denotes the distance between the finger point of the arm and the blue block after executing action while it is an implicit function. The immediate reward of the aggregated policy can be expressed in the same way:where represents the action taken by the aggregated policy.
Table 1 shows the performance comparison of subpolicies and aggregated policy of BAMDDPG. The result demonstrates that reward gained by the aggregated policy is 10%–50% better than those gained by subpolicies.
4.2.1. Benchmark and Reward Function
The Open Racing Car Simulator (TORCS) is a highly portable car driving simulator that adopts a client-server architecture [24, 25]. It realistically simulates real cars by modeling the physical dynamics of the car engines, brakes, gearboxes, clutches, etc. It is a commonly used DRL benchmark and is appropriate for testing self-driving techniques. Using TORCS, a developer can easily access a simulated car’s sensor information. Therefore, the controller of the simulated car is able to get the current environment state and follow a policy to send control instructions for steering, brake, and throttle. Figure 6 presents TORCS’s client-server architecture. The controller connects to the race server through the user datagram protocol (UDP). At each time step, the current driving environment state is perceived by the simulated car and sent to the controller. The server then waits 10 ms for an instruction from the controller. The simulated car executes the actions corresponding to the current instruction, or to the last instruction if no new instruction is sent.
Designing a suitable reward function is key to using TORCS as the platform to test BAMDDPG, since it helps to learn a good policy for controlling the simulated car. We describe the details of designing the reward function in this section. As the driving environment state of TORCS can be perceived by various sensors of the simulated car, we create the reward function from the sensor data listed in Table 2.
Equation (14) presents our constructed reward function, which restricts the behavior of the simulated car in TORCS. Each time the simulated car interacts with the driving environment of TORCS, we expect to gain as large reward as possible through the following equation:where the term represents the car is expected to run as fast as possible so as to maximize the reward. The terms and mean is expected as zero so that the car can run along the track all the time. The term represents the car is on the track axis. represents an indicator function whose value is 1 or 0 depending on whether the condition is met or not. The following equation reformulates the first term of equation (14):
Equation (15) takes into account the speed constraints of the car whether the car encounters a turn or not. The car slows down when a turn is encountered and drives as fast as possible along a straight route. Here, is set to be the threshold of encountering a turn. The car is at a turn when and the corresponding reward is a quadratic function with respect to the speed of the car. Note that and are hyper parameters needing to fine-tune. Figure 7 illustrates the graph of the quadratic function when . The quadratic function reaches the maximum value when , which means the expected speed of the car at a turn is 90.5 km/h and the car will decelerate automatically when it encounters a turn.
Equation (16) reformulates the last term in equation (14). It restricts the distance between the track edge ahead and the car. This term means that the turn should be observed by the car in advance and the steering angles should be adjusted according to the turn:
4.2.2. Learning Curve and Training Time
We successfully achieve the optimal self-driving policy with BAMDDPG by aggregating multiple subpolicies in TORCS. During one episode of the training process, one subpolicy is selected. The corresponding agent perceives the driving environment state through various sensors and executes the action by following the selected subpolicy. Table 3 presents the detailed description of the action commands, including steering, brake, and throttle.
After the interaction, all subpolicies were updated using the minibatch from the centralized experience replay buffer. We have argued that less training time is demanded by BAMDDPG than DDPG. Figure 8(a) illustrates the comparison of learning curve between BAMDDPG and DDPG while Figure 8(b) demonstrates the comparison of training time.
In our experiments on TORCS, the simulated car was trained for 6000 episodes on the Aalborg track using BAMDDPG and DDPG, respectively. In Figure 8(a), the curve depicts the moving average of episode reward while the shaded area depicts the mean ± the standard deviation. Figure 8(a) demonstrates that BAMDDPG and DDPG both converge and oscillate around a specific mean episode reward after being trained for 6000 episodes. Figure 8(b) demonstrates that BAMDDPG takes less time to train, since the aggregated policy quickly approximates the optimal policy once the subpolicies are trained to a certain extent. Training for 6000 episodes takes 22.84 hours for BAMDDPG but 52.77 hours for DDPG, which shows that BAMDDPG cuts the training time by 56.7%. Figure 8(b) also shows that the training times of BAMDDPG and DDPG differ little in the first 1500 episodes. The reason is that the simulated car mostly explores the environment at first, and these initial episodes finish quickly. The exploration time spent by BAMDDPG and DDPG is nearly the same. From the perspective of network training, the first 1500 episodes can be considered as the initialization of the corresponding networks.
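The reported reduction in training time follows directly from the two measured durations, as the short check below shows. The variable names are chosen here for illustration.

```python
ddpg_hours = 52.77     # training time reported for DDPG
bamddpg_hours = 22.84  # training time reported for BAMDDPG

# Relative reduction = (baseline - new) / baseline
reduction = (ddpg_hours - bamddpg_hours) / ddpg_hours
print(f"{reduction:.1%}")  # 56.7%
```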
4.2.3. Effectiveness of Aggregation
The ability of the BAMDDPG algorithm to reduce training time is based on policy aggregation. Section 3.2 showed through theoretical analysis that the performance of the aggregated policy is better than that of the subpolicies. In addition, Section 4.1 has shown the effectiveness of aggregation on the 2D robot arm benchmark. In this section, we further illustrate the effectiveness of aggregation on TORCS.
To keep the comparison concise and clear, only three subpolicies are trained by the BAMDDPG algorithm in this experiment. The trained subpolicies and the aggregated policy control the same simulated car on the same track, the Aalborg track, for one lap. We then observe the total reward and whether the car can finish one lap on the track. Table 4 shows that the simulated car controlled by the aggregated policy finishes the Aalborg track and gains a much larger total reward than the subpolicies, whereas the cars controlled by the subpolicies all leave the track and cannot complete it, which indicates that the aggregation technique does increase the performance of the subpolicies.
Figure 9 further illustrates the difference in total reward between the subpolicies and the aggregated policy. As shown by the solid line, the total reward of the aggregated policy rises steadily as the number of steps increases. In contrast, the total reward of subpolicy 2 and subpolicy 3 increases steadily in the initial stage and then stops rising because the car left the track at some point. The performance of subpolicy 1 is the worst: its total reward is always the lowest and ultimately remains unchanged once the car leaves the track.
4.2.4. Effect from Number of Subpolicies
The final policy gained by BAMDDPG is based on the aggregation of subpolicies, but the algorithm does not prescribe a specific number of subpolicies. In theory, when the number of subpolicies is large enough, the aggregated policy successfully approximates the optimal policy. In practice, however, aggregating a large number of subpolicies is inefficient in terms of computing and storage resources.
To balance efficiency and performance, this section explores the appropriate range for the number of subpolicies through experiment. We consider up to 30 subpolicies and find the appropriate number by comparing the performance of aggregated policies built from different numbers of subpolicies. These aggregated policies are tested on the Aalborg track, and we compare their training time and total reward within 5000 steps. Furthermore, we compare the generalization performance of the aggregated policies by testing them on the CG1 and CG2 tracks. Experimental results are shown in Figure 10 and Table 5.
Figure 10 compares the total reward gained by aggregated policies with different numbers of subpolicies on the Aalborg track. Since an episode of TORCS may not terminate, we set the maximum number of steps in one episode to 5000. The aggregated policies with 3–10 subpolicies reach the maximum number of steps, while the others terminate early. Therefore, the policies with 3–10 subpolicies gain much larger reward than the aggregated policies with over 10 subpolicies.
Table 5 shows that, for policies aggregated from different numbers of subpolicies up to 30, the training time does not differ much, but the performance varies. The policies aggregated from 3 to 10 subpolicies reach the maximum of 5000 interaction steps on the Aalborg track, complete the training track with a larger total reward than the aggregated policies with over 10 subpolicies, and pass the test tracks CG1 and CG2 safely.
Generally speaking, when the number of subpolicies is 3–10, the corresponding aggregated policies perform well and have better generalization performance than the aggregated policies with over 10 subpolicies, which means 3–10 is the appropriate number of subpolicies for BAMDDPG in practical application.
However, the aggregated policies with over 10 subpolicies cannot reach the maximum number of steps on the Aalborg track and are not able to finish the CG1 track. The reason why they perform worse lies mainly in the limit of the centralized experience replay buffer. During training, we fixed the size of the centralized experience replay buffer to 100,000 transition tuples, in consideration of the feasibility and efficiency of implementation. This buffer could not manage to share all experiences among more than 10 subpolicies. As a result, the aggregated policies with over 10 subpolicies gained less knowledge and performed poorly. An experiment with a larger buffer size would be expected to show better performance when aggregating over 10 subpolicies. However, the memory setting has a nonmonotonic effect on reinforcement learning (RL) performance. The influence of the memory setting in RL arises from the trade-off between the correct weight update direction and the wrong direction.
4.2.5. Generalization Performance
Generalization performance is a research hotspot in the field of machine learning, and it is also a key index for evaluating the performance of algorithms. An overtrained model often performs well on the training set but poorly on the test set. In our experiments, self-driving policies are learned successfully on the Aalborg track using BAMDDPG, and the car controlled by these policies performs well on the training track. However, the generalization performance of the learned policies is not yet known. Hence, we test the aggregated policy learned with BAMDDPG on both the training and test tracks, namely Aalborg, CG1, and CG2, whose maps are illustrated in Figure 11.
The total reward of the aggregated policy shown in Table 6 differs in different tracks since the length of different tracks is not the same. On a long track, the car travels for a longer time, and the total reward will be larger. In our experiment, route CG2 is the longest and CG1 is the shortest.
Table 6 illustrates that the car controlled by the aggregated policy passes the test tracks successfully. It demonstrates that the learned aggregated policy from BAMDDPG achieves a good generalization performance.
This paper proposed a deep reinforcement learning algorithm that aggregates multiple deep deterministic policy gradient algorithms and uses an m-out-of-n bootstrap sampling method. The method is effective for sequential and iterative training data that exhibit a long-tailed distribution rather than the normal distribution implied by the i.i.d. data assumption. It can learn the optimal policies with much less training time for tasks with continuous action and state spaces.
Experimental results on the 2D robot arm game show that the reward gained by the aggregated policy is 10%–50% better than those gained by the nonaggregated subpolicies. Experimental results on TORCS demonstrate that the proposed method learns successful control policies with 56.7% less training time, compared to the normal sampling method and nonaggregated subpolicies.
The program and data used to support the findings of this study are currently under embargo while the research findings are commercialized. Requests for data, 12 months after publication of this article, will be considered by the corresponding author. The simulation platform (The Open Racing Car Simulator, TORCS) used to support the findings of this study is open-sourced and is available at http://torcs.sourceforge.net/.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was supported by NSFC (61672512 and 51707191), CAS Key Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institutes of Advanced Technology, and Shenzhen Engineering Laboratory for Autonomous Driving Technology.
Copyright © 2020 Junta Wu and Huiyun Li. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.