Abstract
The traditional Deep Deterministic Policy Gradient (DDPG) algorithm has been widely used in continuous action spaces, but it still suffers from falling easily into local optima and from large error fluctuations. To address these deficiencies, this paper proposes a dual-actor dual-critic DDPG algorithm (DN-DDPG). First, on the basis of the algorithm's original actor-critic network architecture, a second critic network is added to assist training, and the smaller Q value of the two critic networks is taken as the estimated action value in each update, which reduces the probability of local optima. Then, the idea of a dual-actor network is introduced to alleviate the value underestimation produced by the dual-critic network: the action with the greater value among those proposed by the two actor networks is selected for the update, stabilizing the training process. Finally, the improved method is validated on four continuous action tasks provided by MuJoCo, and the results show that, compared with the classical algorithm, the improved method reduces the fluctuation range of the error and improves the cumulative return.
1. Introduction
As artificial intelligence continues to thrive, reinforcement learning (RL), a learning process that combines exploration and action, has been well developed in discrete action spaces for decision control. By letting agents learn continuously through trial and error, RL pursues the maximum overall return while seeking the optimal action policy [1, 2]. However, when high-dimensional inputs or continuous action tasks are involved, traditional RL that relies on maximizing expected returns through trial and error may not work well. To tackle these kinds of problems, the concept of deep reinforcement learning (DRL) was introduced. In 2013, DeepMind proposed a method of using deep neural networks to play Atari games. It was the first successful and versatile DRL algorithm, although its scope of application was still limited to low-dimensional discrete action spaces. Topics dealing with continuous action tasks have since become a new set of research interests [3, 4].
The basic idea of deep reinforcement learning is to fit the value function and policy function of reinforcement learning with neural networks. Typical algorithms include Deep Q-Network (DQN) [5] for discrete action tasks and Deep Deterministic Policy Gradient (DDPG) [6] for continuous action tasks. DDPG and DQN are highly similar as algorithms; the main difference is that DDPG introduces a policy network to output continuous action values, so DDPG can be understood as an extension of DQN to continuous actions. The DDPG algorithm has been studied extensively, with a series of outcomes obtained. Mnih et al. [7] proposed a two-layer BP neural network and thereby improved the DDPG algorithm. The search efficiency of the BP network was improved by using an Armijo-Goldstein-based criterion and the BFGS method [8]. Nikishin et al. [9] reduced the influence of noise on the gradient by averaging methods under the premise of random weights. Parallel actor networks and prioritized experience replay have been used and tested in the continuous action space of bipedal robots [10]; the experimental results show that the revised algorithm can effectively improve the training speed. In addition, the storage structure of experience in DDPG has been optimized with a binary tree, which improves the convergence speed of the DDPG algorithm [11–13].
To sum up, the above methods propose improvements that address shortcomings of DDPG, and all have achieved good results. Although the performance of the improved algorithms is significantly better, the flaws of local optima and large error fluctuations still need to be addressed.
The main content of this paper is as follows: First, the basic principle of DDPG is introduced, and, combined with a description of the network structure and its associated parameters, the existing shortcomings are analyzed. Second, an improved algorithm is proposed to tackle these shortcomings. The improvement is divided into two aspects. First, to reduce the probability of local optima, a second critic network is added to assist training, and the smaller Q value of the two critic networks is taken as the estimated action value. Second, the dual-critic network selects the suboptimal Q value for the update in each round, and the suboptimal Q value also corresponds to a suboptimal action, which leads to continuous underestimation of the agent's action values. In response to this problem, this work introduces a dual-actor network on top of the dual-critic architecture; that is, the more valuable of the actions proposed by the two actor networks under the minimum Q value is selected for training, so as to improve the robustness of the network structure. Finally, the effectiveness of the improved method is verified in eight simulated experimental environments.
The rest of this paper is organized as follows: The basics of DDPG are introduced in Section 2. Section 3 elaborates the idea behind the improved algorithm. Section 4 presents experimental results and analysis. Section 5 summarizes the work and outlines future directions.
2. Deep Deterministic Policy Gradients
The problem that reinforcement learning needs to solve is how to let an agent learn which actions to take in an environment so as to obtain the maximum sum of reward values [12–14]. The reward is generally associated with the task goal defined for the agent. The DDPG algorithm is used to solve reinforcement learning problems in continuous action spaces [6, 15–17]. The main process is as follows: First, the experience data generated by the interaction between the agent and the environment are stored in the experience replay buffer. Second, the sampled data are learned and updated through the actor-critic architecture, and finally the optimal policy is obtained. The structure of the DDPG algorithm is shown in Figure 1 [15].
Based on the deterministic policy gradient, the DDPG algorithm uses neural networks to approximate the policy function and the Q function and combines deep learning methods to complete task training [16]. DDPG inherits the organizational structure of the DQN algorithm and uses actor-critic as its basic architecture [17]. By combining the concepts of the online network and the target network from DQN with the actor-critic method, both the actor and critic modules in DDPG have an online network and a target network [6, 18, 19].
During the training process, the agent in the current state S decides the action A that needs to be performed through the current actor network and then calculates the Q value of the current action and the expected return value according to the current critic network. Then, the actor target network selects the optimal action among the actions that can be performed according to the previous learning experience, and the value of the future action is calculated by the critic target network. The parameters of the target network are periodically updated by the online network parameters of the corresponding module.
DDPG adopts a "soft" method to update the target network parameters; that is, each update changes the parameters by only a small amount, which improves the stability of the training process [20–22]. Denoting the update coefficient by τ, the "soft" update can be expressed as

θ^Q′ ← τθ^Q + (1 − τ)θ^Q′, θ^μ′ ← τθ^μ + (1 − τ)θ^μ′,

where θ^Q and θ^μ are the online critic and actor parameters and θ^Q′ and θ^μ′ are the corresponding target parameters.
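As an illustration, the soft update above can be sketched in a few lines (a minimal sketch in which plain Python lists stand in for network weight tensors; the function name soft_update and the default τ = 0.005 are our own illustrative choices, not from the paper):

```python
def soft_update(online, target, tau=0.005):
    """Blend each target parameter toward its online counterpart:
    theta' <- tau * theta + (1 - tau) * theta', applied element-wise."""
    return [tau * w + (1.0 - tau) * wt for w, wt in zip(online, target)]

# With tau = 0.5 the target moves exactly halfway toward the online weights.
new_target = soft_update([1.0, 2.0], [0.0, 0.0], tau=0.5)  # -> [0.5, 1.0]
```

A small τ keeps the target networks slowly moving copies of the online networks, which is what stabilizes the bootstrapped targets during training.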
DDPG selects actions according to the deterministic policy μ(s|θ^μ) and approximates the state-action value function with a value network Q(s, a|θ^Q). The objective function is defined as the accumulated reward with a discount factor γ [23, 24], as shown in the following equation:

J(θ^μ) = E[∑_{t=1}^{T} γ^{t−1} r_t].
In the critic online network, the parameters are updated by minimizing the mean squared error loss [10], which can be expressed as

L(θ^Q) = (1/N) ∑_i (y_i − Q(s_i, a_i|θ^Q))², where y_i = r_i + γQ′(s_{i+1}, μ′(s_{i+1}|θ^μ′)|θ^Q′).
For the actor online network, the parameters are updated according to the policy gradient [10], as shown in the following equation:

∇_{θ^μ}J ≈ (1/N) ∑_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s|θ^μ)|_{s=s_i}.
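To make the critic update concrete, the following sketch computes the targets y_i and the mean squared loss for a minibatch (plain Python; the helper names critic_targets and mse_loss are our own illustrative choices, and next_q stands for Q′(s_{t+1}, μ′(s_{t+1})) as evaluated by the target networks):

```python
def critic_targets(rewards, next_q, dones, gamma=0.99):
    """y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})), zeroing the
    bootstrap term on terminal transitions."""
    return [r + gamma * q * (0.0 if d else 1.0)
            for r, q, d in zip(rewards, next_q, dones)]

def mse_loss(targets, q_values):
    """Critic loss: mean squared TD error over the minibatch."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / len(targets)

ys = critic_targets([1.0, 1.0], [2.0, 2.0], [False, True], gamma=0.5)
loss = mse_loss(ys, [1.0, 1.0])  # targets are [2.0, 1.0]
```

In a full implementation the loss would be minimized with respect to θ^Q by a gradient-based optimizer, and the actor would ascend the sampled policy gradient shown above.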
3. The DDPG Based on Dual-Actors and Dual-Critics
3.1. Error Analysis
Q-learning inevitably tends to overestimate action values [25–28]. In Q-learning, the estimated value of an action is updated under an ε-greedy policy, and because the maximum is taken over noisy estimates, the actual maximal action value is usually smaller than the estimated maximal value, as shown in the following equation:

max_{a′} E[Q(s′, a′)] ≤ E[max_{a′} Q(s′, a′)].
Equation (5) has already been proved [29, 30]. Even a zero-mean error in the initial state will lead to an overestimation of the action value through the update of the value function, and the adverse effect of this error is gradually amplified by repeated application of the Bellman equation.
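The overestimation effect described above is easy to reproduce numerically: even when every action has the same true value and the estimation error has zero mean, taking the maximum over noisy estimates yields a positive bias (a self-contained simulation; the number of actions, noise scale, and trial count are arbitrary illustrative choices):

```python
import random

random.seed(0)
TRUE_Q = [0.0, 0.0, 0.0]   # all actions equally valuable: max_a E[Q] = 0
NOISE = 1.0                # zero-mean Gaussian estimation error

def noisy_estimates():
    return [q + random.gauss(0.0, NOISE) for q in TRUE_Q]

# E[max_a Q_hat(s, a)] comes out clearly positive, even though the true
# maximum is 0 -- the max operator systematically overestimates.
trials = [max(noisy_estimates()) for _ in range(10000)]
bias = sum(trials) / len(trials)
```

Repeatedly bootstrapping on such biased targets is what lets the error compound through the Bellman updates.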
In the actor-critic structure, the update of the actor policy depends on the critic value function [31–33]. Given the online network parameter φ, let φ_approx denote the actor parameter obtained from the estimated value function Q_θ(s, a), and φ_true the parameter obtained from the true value function Q^π(s, a), which is unknown during training and represents the value function in the ideal case. Then φ_approx and φ_true can be expressed as in the following equation:

φ_approx = φ + (α/Z₁) E[∇_φ π_φ(s) ∇_a Q_θ(s, a)|_{a=π_φ(s)}],
φ_true = φ + (α/Z₂) E[∇_φ π_φ(s) ∇_a Q^π(s, a)|_{a=π_φ(s)}].
In Equation (6), α is the learning rate and Z₁, Z₂ normalize the gradients, i.e., Z⁻¹‖E[·]‖ = 1. Without gradient normalization, overestimation would still occur, only under a slightly stricter condition [34, 35].
Since the gradient is updated in the direction of a local maximum, there exists a very small number k₁ such that, when the learning rate of the neural network is less than k₁, the parameter φ_approx based on Q_θ and the parameter φ_true based on Q^π converge to the local optimum of the corresponding Q function; at this point the estimate of the true policy π_true is restricted to be below that of the approximate policy π_approx, as shown in the following equation:

E[Q_θ(s, π_approx(s))] ≥ E[Q_θ(s, π_true(s))].
Conversely, there exists an extremely small number k₂ such that, when the learning rate of the neural network is less than k₂, the parameters φ_approx and φ_true also converge to the local optimum of the corresponding Q function, and the estimate of the approximate policy under Q^π is limited to be below that of the true policy:

E[Q^π(s, π_true(s))] ≥ E[Q^π(s, π_approx(s))].
If the critic network is trained well, the estimated value of the true policy will be at least close to its actual value, as shown in the following equation:

E[Q_θ(s, π_true(s))] ≥ E[Q^π(s, π_true(s))].
At this time, if the learning rate of the network is smaller than min(k₁, k₂), then chaining Equations (7)–(9) shows that the action value will be overestimated, as shown in the following equation:

E[Q_θ(s, π_approx(s))] ≥ E[Q^π(s, π_approx(s))].
The existence of errors will lead to inaccurate estimation of the action value, making the suboptimal policy be taken as the optimal policy output of the online network, thereby affecting the performance of the algorithm.
3.2. Dual-Actors and Dual-Critics Network Structure
Due to the existence of the overestimated error, the estimation of the value function can be used as an approximate upper limit of the estimated value of the future state. If there is a certain error in every Q value update, the accumulation of errors will result in a suboptimal policy. Aiming at this kind of problem, an additional critic network is used in this work. The smallest Q value of the two networks is taken as the estimated value of the action in each update, so as to reduce the adverse effect of the overestimated error.
The smallest Q value obtained via the dual-critic network forms the update target, as shown in the following equation:

y = r + γ min_{i=1,2} Q_{θᵢ′}(s′, μ′(s′|θ^μ′)).
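A minimal sketch of this target computation with two critics (the function name and arguments are illustrative; q1_next and q2_next stand for the two target critics' estimates of the next state-action value):

```python
def min_q_target(reward, q1_next, q2_next, gamma=0.99, done=False):
    """Clipped target using the smaller of the two critics' estimates:
    y = r + gamma * min(Q1'(s', a'), Q2'(s', a'))."""
    if done:
        return reward  # no bootstrap term on terminal transitions
    return reward + gamma * min(q1_next, q2_next)

y = min_q_target(1.0, 3.0, 2.0, gamma=0.5)  # -> 1.0 + 0.5 * 2.0 = 2.0
```

Taking the minimum makes the target a pessimistic estimate, which is exactly how the dual-critic network counteracts the overestimation analyzed in Section 3.1.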
Although the dual-critic network can reduce the overestimation error of the algorithm and the probability of producing a locally optimal policy, in the actual training process the learning rate of the neural network is rarely smaller than min(k₁, k₂). Combined with the analysis in Section 3.1, this means the probability of overestimation is actually very low. The dual-critic network selects the suboptimal Q value for the update in each round. The suboptimal Q value also corresponds to a suboptimal action, which leads to continuous underestimation of the agent's action values and in turn slows the convergence of the critic network [36–38].
Aiming at the underestimation problem of the dual-critic network, this work presents a dual-actor network for training on top of the dual-critic architecture. The network selects the higher-valued of the two proposed actions under the minimum Q value, which reduces the influence of Q value underestimation and improves the robustness of the network structure.
The network structure of the dual-actors and dual-critics is shown in Figure 2.
For the two-actor network, training would suffer from the same issues if both actors used the same sample data and processing methods. To ameliorate this problem, the parameters of the two actor networks are updated with different policy gradients, which reduces the coupling between the two actors and further improves the convergence rate of the algorithm [39, 40].
Let the policies of the two actors be μ₁ and μ₂, and let the parameters of the dual-critic network be θ₁ and θ₂. The two candidate actions are a₁ = μ₁(s|θ^μ₁) and a₂ = μ₂(s|θ^μ₂); the action with the maximal value under the dual-actor network is then selected using the following equation:

a* = argmax_{a∈{a₁, a₂}} min_{i=1,2} Q_{θᵢ}(s, a).
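The selection rule can be sketched as follows (the actors and critics are passed in as plain callables for illustration; in the actual algorithm they would be the two actor networks and the two critic networks):

```python
def select_action(state, actor1, actor2, critic1, critic2):
    """Pick whichever actor's proposed action scores higher under the
    pessimistic value min(Q1, Q2), i.e. argmax over {a1, a2}."""
    candidates = [actor1(state), actor2(state)]

    def pessimistic_value(a):
        return min(critic1(state, a), critic2(state, a))

    return max(candidates, key=pessimistic_value)

# Toy check: actor2's proposal (0.8) beats actor1's (0.2) under min(Q1, Q2)
# when both critics value larger actions more highly.
a_star = select_action(0.0,
                       lambda s: 0.2, lambda s: 0.8,
                       lambda s, a: a, lambda s, a: a + 0.1)  # -> 0.8
```

Evaluating both candidates under the same clipped value keeps the pessimism of the dual-critic target while letting the better of the two actors drive the update.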
3.3. Modeling the Algorithm
Combining the ideas proposed in Section 3.2, this paper proposes a dual-actor and dual-critic DDPG algorithm (DN-DDPG). The process of the DN-DDPG algorithm is shown in Algorithm 1.

4. Experiments
4.1. Software and Hardware Setup
The software environment used in this work is Anaconda3 4.8.3 (Python 3.8); the integrated development environment (IDE) is PyCharm, and TensorFlow-GPU 1.8.0 is used as the learning framework. The Python virtual environment runs in Anaconda3. The hardware environment is an NVIDIA GeForce GTX 1650 with CUDA 11.1.
4.2. Experimental Setup
In this paper, the Arm environment is implemented with the Pyglet module. Two classic control tasks on the OpenAI Gym [20] platform and four continuous control tasks in the MuJoCo physics simulator [21] are used as the experimental environments. OpenAI Gym is an open-source toolkit that provides a variety of game environment interfaces to facilitate the research and development of artificial intelligence experiments.
The Arm environment used in this work includes the following items: (1) Arm_easy. A 400 × 400 two-dimensional space is constructed in the Arm environment. One end of a robot arm is fixed in the middle of the environment; the goal of training is to make the other end of the arm find the blue target point, as shown in Figure 3. (2) Arm_hard. This is similar to the Arm_easy environment; the only difference is that the target point is randomly generated in each round.
The two classic continuous control tasks used in this work are as follows: (1) Pendulum. The pendulum starts in a random position; the aim is to swing it up and keep it upright. (2) Mountain Car Continuous. The task is to drive a car to the top of a hill; however, the car's power is insufficient to drive straight up, so it must roll back and forth between the left and right slopes to accumulate momentum and reach the top, as shown in Figure 4.
The four MuJoCo continuous control tasks include: (1) Half Cheetah. Train a bipedal agent to learn to run, as shown in Figure 5. (2) Hopper. Train a one-legged robot to learn to jump forward. (3) Humanoid. Train a three-dimensional bipedal agent to learn to stand without falling. (4) Walker2d. Train a three-dimensional bipedal agent to walk forward as fast as possible.
This work compares the performance of DN-DDPG with the original DDPG algorithm. In order to study the respective effects of the dual-critic and dual-actor networks, the DCN-DDPG algorithm, a single-actor and dual-critic network, is included for comparison. The outcomes of the comparison are shown intuitively through experiments.
4.3. Parameter Setting
To ensure the accuracy and fairness of the experimental results, the common parameter values of the different algorithms are kept the same. The training rounds for the Arm environments and the two Gym classic control tasks are set to 2000, with a maximum of 300 training steps per round. The training rounds for the four MuJoCo continuous control tasks are set to 5000, and the number of training steps per round is the maximum episode length defined by the Gym environment. The agent continuously learns and explores in the environment; if the preset task is completed successfully or the number of steps in a round exceeds the maximum, the scene is reset and a new round begins. Some parameters of the MuJoCo tasks are shown in Table 1.
4.4. Experimental Outcomes
In this work, the performances of three algorithms, DN-DDPG, DCN-DDPG, and the original DDPG, are compared in two Arm environments, two Gym classic control environments, and four continuous tasks in MuJoCo. DN-DDPG and DCN-DDPG are both improvements of the DDPG method; the difference is that DCN-DDPG adds an extra critic network to the original DDPG, while DN-DDPG adds an extra actor network to DCN-DDPG to optimize training. Comparing these three algorithms gives a more intuitive display of the two improvements proposed in this article: dual critics and dual actors. The experimental results are shown in Figure 6.
(Figure 6, panels (a)–(h): one panel per experimental environment.)
The shaded part of each plot represents the standard deviation during training; that is, with the same hyperparameters and network model, different random seeds are used for random exploration. The upper edge of the shaded region is the best result. The x-axis represents the number of training rounds, the y-axis represents the cumulative reward obtained per round, and the experiment records the average reward every 100 rounds.
In the Arm_easy and Arm_hard environments, the average rewards of the three algorithms stay around the same value, and in some cases the rewards of both DCN-DDPG and DDPG exceed those of DN-DDPG. However, in terms of the overall training effect, DN-DDPG performs better than the other two algorithms, while DCN-DDPG is slightly better than DDPG. In the Pendulum experiment, the overall performance of DN-DDPG is the best, largely because the dual-critic network reduces the error while the dual-actor network selects the action of higher value. In Mountain Car Continuous, the average rewards of the three algorithms tend toward the same value; however, within the first 200 time steps, DN-DDPG converges faster than the other two algorithms. In addition, in Half Cheetah, Humanoid, Hopper, and Walker2d, DN-DDPG starts worse than DCN-DDPG and DDPG, which could be because the relatively simpler network structures of DCN-DDPG and DDPG handle complex environments more easily at first. DN-DDPG needs an initial training period, after which its average reward becomes clearly better than that of the other two algorithms. Again, the overall performance of DCN-DDPG is better than that of DDPG. Finally, comparing the shaded areas of the different algorithms, the area of DN-DDPG is smaller than those of DCN-DDPG and DDPG, which reflects that the training of DN-DDPG is more stable.
From the experimental results in Figure 6, the dual-critic method is able to increase the performance of the DDPG algorithm, but only to a limited extent. By introducing the dual-actor method on top of DCN-DDPG, the DN-DDPG network further increases the overall performance and training stability of the algorithm. Hence, compared with the original DDPG, DN-DDPG, based on dual actors and dual critics, achieves the largest performance improvement.
5. Conclusion
A deep deterministic policy gradient algorithm based on a dual-actor and dual-critic network has been proposed. To reduce the overestimation error of the original actor-critic network, a dual-critic target network is introduced, and the minimum of the action estimates produced by the two networks is selected to update the policy network. To alleviate the underestimation caused by the dual-critic network, a dual-actor network is added on top of the original network, and the higher-valued of the two actions generated by the dual-actor network is selected. The experimental results show that, compared with the original DDPG algorithm and the DDPG algorithm based on a single-actor and dual-critic network, the novel DN-DDPG algorithm based on the dual-actor and dual-critic network achieves a higher cumulative reward and a smaller standard deviation during training.
There is more to be explored in future work. First, to improve the optimization ability of the algorithm, more suitable deep learning methods can be explored and applied to the neural networks. Second, for the experience replay mechanism of the DDPG algorithm, it is worth exploring whether a better method of determining sample priority can improve the convergence speed during training.
Data Availability
The dataset can be accessed upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest to report regarding the present study.
Acknowledgments
This work was supported by (1) Training Project of Top Scientific Research Talents of Nantong Institute of Technology (XBJRC2021005); (2) the Universities Natural Science Research Projects of Jiangsu Province (17KJB520031, 21KJD210004, 22KJB520032, 22KJD520007, 22KJD520008, and 22KJD520009); and (3) the Science and Technology Planning Project of Nantong City (JC2021132, JCZ21058, JCZ20172, JCZ20151, and JCZ20148).