Research Article  Open Access
ZhiBin Zhang, XinHong Li, JiPing An, WanXin Man, GuoHui Zhang, "ModelFree Attitude Control of Spacecraft Based on PIDGuide TD3 Algorithm", International Journal of Aerospace Engineering, vol. 2020, Article ID 8874619, 13 pages, 2020. https://doi.org/10.1155/2020/8874619
ModelFree Attitude Control of Spacecraft Based on PIDGuide TD3 Algorithm
Abstract
This paper is devoted to modelfree attitude control of rigid spacecraft in the presence of control torque saturation and external disturbances. Specifically, a modelfree deep reinforcement learning (DRL) controller is proposed, which can learn continuously according to the feedback of the environment and realize the highprecision attitude control of spacecraft without repeatedly adjusting the controller parameters. Considering the continuity of state space and action space, the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm based on actorcritic architecture is adopted. Compared with the Deep Deterministic Policy Gradient (DDPG) algorithm, TD3 has better performance. TD3 obtains the optimal policy by interacting with the environment without using any prior knowledge, so the learning process is timeconsuming. Aiming at this problem, the PIDGuide TD3 algorithm is proposed, which can speed up the training speed and improve the convergence precision of the TD3 algorithm. Aiming at the problem that reinforcement learning (RL) is difficult to deploy in the actual environment, the pretraining/finetuning method is proposed for deployment, which can not only save training time and computing resources but also achieve good results quickly. The experimental results show that DRL controller can realize highprecision attitude stabilization and attitude tracking control, with fast response speed and small overshoot. The proposed PIDGuide TD3 algorithm has faster training speed and higher stability than the TD3 algorithm.
1. Introduction
With the rapid development of space technology, the structure and composition of OnOrbit Servicing Spacecraft (OOSS) are becoming more and more complex, and the performance is constantly improving. The effectiveness of attitude control determines the success or failure of the service mission and the life of the spacecraft. OnOrbit Servicing Spacecraft is characterized by flexible multibody structure, liquid sloshing, and fuel consumption and needs to change the structure and parameters according to the mission requirements. For example, in the case of mass loss due to fuel consumption, the rate of mass change of the spacecraft is known to be a function of control application and actuator hardware characteristics. The motion of space manipulator will cause mass displacement, resulting in the change of inertial parameters, and cause disturbances to the attitude stability of the service spacecraft body. On the other hand, the spacecraft is affected by gravity gradient torque, aerodynamic torque, radiated torque, and other unknown disturbance torques. These peculiarities make the attitude of the OOSS present dynamic characteristics such as uncertainty, timevarying, strong nonlinearity, and highorder multivariable coupling [1]. Even in many complex space missions, it is difficult to establish an accurate mathematical model, which increases the difficulty of attitude control.
The classical attitude control methods include PID control [2], adaptive control [3], sliding mode control [4], Lyapunov control [5], optimal control [6], and robust control [7]. These control algorithms have achieved good results in simulation experiments and practical applications. Traditional attitude control algorithms need to establish a relatively accurate system model and need to design the parameters of the controller accurately. When the system cannot be modeled completely or the environment changes greatly, the performance of the controller will degrade to some extent. For systems with unclear or completely unknown mathematical models, the intelligent control method with selflearning ability will be a promising choice.
Reinforcement learning (RL) is a kind of machine learning, which has a close relationship with dynamic programming and optimal control theory [8]. The basic idea of RL is to explore the optimal strategy through the interaction between agent and environment, so as to maximize the return [9]. Classical RL, such as Qlearning, discretizes the action and state space and uses the table method to solve the problem [10]. However, the actual control problem may have continuous action and state space, and it is difficult to discretize. Highdimensional continuous state and action space increase the computational burden and lead to the socalled Curse of Dimensions (CoD) problem. DRL solves the CoD problem by introducing neural networks to approximate value function and policy [11]. In 2015, Lillicrap et al. proposed the Deep Deterministic Policy Gradient (DDPG) algorithm to solve the control problem with continuous action space [12]. DRL became widely known in 2016 when Google AlphaGo defeated the top international Go player Lee Sedol in the Go competition.
Up to now, DRL has been used in robots [13], UAV [14, 15], energy [16, 17], transportation [18], and other control fields. DRL based on the Markov decision process (MDP) provides an effective way to realize intelligent control of spacecraft. In the process of selflearning, DRL optimizes the parameters of neural networks iteratively, which eliminates the trouble of design parameters, enables it to adapt to the changing software, hardware, and environment, and can continuously optimize the performance of the controller by changing the setting of reward function. As space exploration missions become more frequent and complex, spacecraft are getting farther and farther from the earth; DRL technology with fast learning ability and selfregulation ability will play an increasingly important role in spacecraft attitude control system.
Based on the above discussion, a modelfree attitude control scheme based on DRL is proposed, and a kind of mixed reward function for spacecraft attitude control is designed to realize attitude stabilization control and attitude tracking control of spacecraft. Aiming at the problem that the TD3 algorithm is relatively slow to explore the optimal strategy without using any prior knowledge, the PIDGuide TD3 algorithm is proposed, which uses the PID controller to guide the exploration process of the DRL algorithm, so as to significantly speed up the training speed and improve the convergence accuracy of the algorithm. Aiming at the problem that reinforcement learning (RL) is difficult to deploy in the actual environment, this paper proposes that the algorithm should be pretrained on the ground and then finetune the parameters on orbit, so as to save training time and computing resources and achieve better results quickly.
The rest of the paper is organized as follows. The problem statements and control objectives are given in Section 2. The DRL controller of spacecraft attitude is designed, and the PIDGuide TD3 algorithm is proposed in Section 3. Section 4 proves the effectiveness of the proposed algorithm through three simulation cases. Conclusions are given in Section 5.
2. Problem Statement and Preliminaries
2.1. Problem Statement
The paper mainly studies the attitude control problem of OOSS under complex situations such as uncertain inertia matrix, even unable to establish the motion model, and external disturbances. The control objective is to achieve attitude stabilization control under bounded disturbances, that is, where is the attitude angle, is the angular velocity, is the target attitude angle, and is the target angular velocity.
The characteristics of this control problem are as follows: (A) it is difficult to establish accurate dynamic and kinematic models; (B) there are unknown external disturbances, such as perturbation and solar wind; (C) the actuator is saturated.
2.2. Attitude Representation
The definition of the earthcentered inertial coordinate system , spacecraft bodyfixed coordinate system , and orbital coordinate system is shown in Figure 1. When the attitude of the spacecraft is in stable state, the spacecraft bodyfixed frame coincides with the orbital coordinate system. When attitude motion occurs, the orientation or pointing of the bodyfixed frame relative to the orbital coordinate system is the attitude of the spacecraft. There are usually five ways to describe the attitude of the spacecraft, including Euler angles, Rotation Matrix, Quaternions, Rodrigues Parameters (RP), and Modified Rodrigues Parameters (MRP). Different description methods can be converted between each other. The characteristics of different description methods can be found in [19, 20].
2.3. Spacecraft Dynamics
The kinematic equation of spacecraft attitude described by MRP is given by
The matrix form of the above equation is as follows: where is the Modified Rodrigues Parameters and is the rotation angular velocity of the spacecraft relative to the earthcentered inertial, which is expressed in the bodyfixed frame. is the thirdorder identity matrix. is the skewsymmetric matrix of , defined as
The attitude dynamics equation can be expressed by the following equation: where is the inertia matrix of the spacecraft and is the skewsymmetric matrix of . is the control torque. is the total disturbing torque, including the gravity gradient torque, aerodynamic torque, magnetic disturbing torque, and radiation torque.
Remark 1. The dynamic model mentioned here is only used as a simulation of the space environment and will not be used directly in the controller.
3. DRL Controller Design for Spacecraft
3.1. Background
Deep Deterministic Policy Gradient (DDPG) is an actorcritic method that uses neural networks as the approximation of policy function and Qfunction, namely, the policy network and the Q network. Therefore, it can deal with the problem with continuous action space. At the same time, the use of empirical playback and double network solves the problem that actorcritic is difficult to converge [12].
While DDPG can achieve great performance sometimes, it is frequently vulnerable to hyperparameters and other kinds of tuning [21]. A common failure mode for DDPG is that the learned Qfunction begins to overestimate values dramatically, which leads to the policy breaking, because the error in the Qfunction is introduced into the training of the policy network. The Twin Delayed DDPG (TD3) algorithm is an improved version of the DDPG, using three techniques to solve the problems of DDPG [22].
The first one is Double Q Networks. TD3 learns two Qfunctions and uses the smaller of the two values to form the targets in the Bellman error loss functions; thus, the overestimation of the approximation of the value function is reduced.
The second one is “Delayed” Policy Updates. TD3 updates the policy (and target networks) less frequently than the Qfunction.
The last one is Target Policy Smoothing. TD3 adds noise to the target action to reduce sensitivity and instability, making it harder for the policy to exploit Qfunction errors by smoothing out Q along changes in action. The combination of these three techniques significantly improves performance of TD3 relative to baseline DDPG.
3.2. EndtoEnd Attitude Control Based on TD3 Algorithm
The training goal of the DRL model is to maximize the total return when the agent interacts with the environment. DRL controller has a natural similarity with traditional control but uses different terms to represent the same concept, as shown in Figure 2. DRL is composed of agent, environment, state, action, and reward. The agent learns the optimal policy through interacting with the environment and takes the optimal action to maximize the longterm reward, while the traditional control is to design the controller (policy) through experts. The state feedback signal refers to the observation of the environment, and the reference signal is built into the reward function and observation.
(a)
(b)
3.2.1. Environment and State
In order to implement DRL, the simulation environment of spacecraft attitude motion should be established according to the attitude dynamic and kinematic equations of spacecraft. The simulation environment includes the environment dynamic model and the interface between the agent and the environment. The input of the environment is the action output by the agent, and the output is the observed state and reward signal after the action is performed. For the spacecraft attitude control problem, the system states are selected as attitude angle and angular velocity . The reference signal includes the target state and . The system state and the error signal are used as the observations, where . The action space is continuous threedimensional control torque .
3.2.2. Reward Function
Reward refers to the reward signal that the agent measures its performance according to the task goal. A reasonable reward function is the key for agent to learn effective policy, which determines the convergence speed and stability of RL. Typically, a positive reward is offered to encourage certain behaviors of the agent and a negative reward is offered to deter others [23]. The reward function can be divided into three types: continuous reward function, discrete reward function, and mixed reward function. The continuous reward function varies continuously with observation and action. In general, continuous reward signals can improve the convergence during training and simplify the network structure. The discrete reward function varies discontinuously with observation and action. This type of reward signals slow down the convergence rate and require a more complicated network structure. Discrete rewards are usually implemented as event that occurs in the environment. For example, if the agent exceeds a certain threshold, it may receive a positive reward; if a certain performance constraint is violated, it will be punished. The more commonly used reward signal is mixed reward signal. Mixed rewards include continuous rewards and discrete rewards. Discrete reward signals keep the system away from bad states, and continuous reward signals improve convergence by providing smooth rewards near the target state.
The following mixed reward function is designed for spacecraft attitude control: where , , and represent vector norm. is a continuous reward, which can stabilize the system state and reduce fuel consumption. The smaller the error, the greater the reward of . The second term in represents energy consumption. When the attitude of the spacecraft is very close to the target attitude, the error change is small, and the change of will be very insignificant. Therefore, the continuous reward is introduced to increase the reward gradient when the absolute value of each error component is less than 0.1, so as to guide the attitude angle to approach the target value quickly and accurately. is a discrete reward, which can control the attitude angle not to exceed the range and increase training speed.
3.2.3. TD3 Algorithm
The agent receives observations and rewards from the environment and sends actions to the environment. The TD3 algorithm is used as the learning method of agent in this paper. TD3 and DDPG are both offpolicy algorithms. TD3 creates an experience replay buffer to store historical experiences and then randomly sample transitions from it and feed those sample data to update actor and critic networks. The existence of experience replay buffer helps the agent to be able to learn previous experiences and improve the efficiency of sample utilization. Random sampling can break the correlation between samples and make the learning process of agents more stable [24].
TD3 uses a total of 6 neural networks, namely, actor network , actor target network , critic network , critic network , critic target network , and critic target network . The role and update rules of each network are as follows. (1)Actor network : responsible for the iterative update of actor network parameters and selects the current action according to the current state , which is used to interact with the environment to generate the next state and reward (2)Actor target network : responsible for selecting the optimal next action according to the next state sampled in the experience replay buffer. Network parameters are periodically copied from (3)Critic networks : responsible for the iterative update of Q network parameters and calculate the current value . The target value is the smaller of the two value, namely, (4)Critic target networks : responsible for calculating of the target value. Network parameters are periodically copied from
For the critic networks, the loss function is defined as
For the actor network, the deterministic policy is used to optimize the parameters, and the loss function is defined as
The target networks are updated by soft update method:
3.3. PIDGuide TD3 Algorithm
TD3 is a modelfree algorithm, which does not use any prior knowledge and constantly explores through the interaction with the environment to obtain the optimal strategy. It is timeconsuming for agent to find optimal policy without any prior knowledge due to the problems such as partial observability of environmental feedback, the sparsity of reward, and highdimensional state and action space.
In order to speed up the training speed and improve the convergence stability of the algorithm, a PIDGuide TD3 algorithm is proposed in this section. The core idea of the PIDGuide TD3 algorithm is as follows. In the current state , two actions are generated by the action network and PID controller, respectively. Then, the critical network is used to evaluate the two actions; the action with higher value will be actually executed. In fact, any modelfree controller can guide TD3 to conduct policy search. Figure 3 describes the structure of PIDGuide TD3.
PID is a modelfree controller based on the feedback of error. PID requires precise design of parameters, so the change of environment will lead to serious degradation of its performance. The formula of PID algorithm is where , , and are positive definite matrices containing the control parameters, which are, respectively, called proportional coefficient, integral coefficient, and differential coefficient.
The pseudocode of PIDGuide TD3 is given in Algorithm 1. The main steps of the PIDGuide TD3 algorithm are as follows: (1)Randomly initialize critic networks , , and actor network with weights , , and initialize target networks , , with weights , , (2)Initialize replay buffer (3)Start a new episode, set the target position, randomly reset the environment, and get the initial state (4)For every step, select an action according to the current policy with exploration noise:where the exploration noise (5)Select another action according to PID controller:(6)Execute the action and observe the reward and the new state (7)Store transition tuple in (8)If is terminal, reset environment state(9)If it is time to update the critic networks, then randomly sample a mini batch of transitions from (10)Compute target action:where is the scale of exploration noise and is the maximum control torque (11)Compute target value:where is the discount factor (12)Update critic networks by(13)If it is time to update the actor network, update by the deterministic policy gradient:(14)Update target networks bywhere is the update rate for target model

3.4. Pretraining and FineTuning
When the state space and action space are too large, the DRL algorithm is difficult to be applied in space tasks directly due to the low learning efficiency and the difficulty in training the networks. In order to shorten the training time and avoid dangerous states during the exploration, this paper proposes to use the pretraining and finetuning method in deep learning to further improve the learning efficiency.
Pretraining in deep learning is training the machines before they start performing a particular task. Pretraining imitates the way human beings process new knowledge. The weights saved from the previous network will be used as the initial weight for the new experiment. In this way, the old knowledge helps new models successfully perform new tasks from old experience instead of from scratch.
A wellestablished paradigm is to pretrain models using largescale data and then to finetune the models on target tasks that often have less training data. Pretraining has enabled stateoftheart results on many tasks, including object detection, image segmentation, and action recognition [25].
The pretraining and finetuning method makes it possible to deploy the DRL controller on orbit. The real space environment is different from the simulation environment; there are some uncertain information such as unknown disturbances and unknown inertial parameters in space. For the spacecraft attitude control task, the agent which has been pretrained on the ground only needs a small amount of onorbit training to finetune the parameters to obtain good performance.
4. Simulation and Results
In order to verify the effectiveness and superiority of the proposed PIDGuide TD3 algorithm, the simulations are carried out in this section. The following numerical simulations are organized.
Case 1: in the ideal environment without external disturbances, the agent is trained to realize the attitude stabilization control and attitude tracking control of spacecraft, respectively.
Case 2: on the basis of Case 1, the existence of unknown disturbance torques is considered.
Case 3: on the basis of Case 2, the PIDGuide TD3 algorithm is used to accelerate the training speed and convergence stability.
4.1. Simulation Setup
All training processes are done on the computer with an Intel Seon Silver 4210 CPU and an NVIDIA Quadro P2000 GPU. Python 3.7 is used as the project interpreter. The deep learning framework TensorFlow2.0.0 is used for training the networks under Windows system.
The experiments are conducted in the OpenAI Gym simulation environment. The step size of the Gym simulator, which specifies the duration of each physics update step, is set to 0.1 s to develop highly accurate simulations. The inertia matrix is set as . The initial states of system are selected as and rad/s. The target states are and rad/s. The maximum value of control torque is .
The policy and value functions are approximated by fourlayer neural networks with Relu activations on each hidden layer. The composition of the networks is shown in Table 1.

The hyperparameter settings during the implementation of the algorithm are given in Table 2. All network architecture and hyperparameters used in the three cases are the same.

4.2. Case 1: EndtoEnd TD3 Algorithm under Ideal Environment
In order to verify the performance of the EndtoEnd DRL controller and the effectiveness of the reward function proposed in Section 3.2, the agent is trained in an ideal environment without external disturbances to realize the attitude stabilization and attitude tracking control of spacecraft, respectively.
Figure 4 shows the learning process of the EndtoEnd TD3 algorithm. It can be seen that the algorithm basically achieves convergence after 100 episodes of training, and the rewards for each episode finally stabilized at 15000.
In this case, the control strategy output by the agent, namely, the control torque, is shown in Figure 5(a). Figures 5(b) and 5(c) show the parameter response of the attitude angle and attitude angular velocity under the abovementioned control torque, respectively. After 30 seconds, the attitude angle has been basically controlled to the target value, and the angular velocity is basically zero. The convergence speed is fast, and the amount of overshoot is small. The error curve of the attitude angle is shown in Figure 5(d). The error level drops to 10^{4}, which shows that the control accuracy is relatively high.
(a)
(b)
(c)
(d)
The objective of attitude tracking control is to track the desired attitude with and rad/s. The simulation results are illustrated in Figure 6. It can be seen that both the attitude and angular velocity converge to the target attitude rapidly, which indicate that the tracking objective is accomplished with the proposed EndtoEnd DRL controller.
(a)
(b)
(c)
(d)
4.3. Case 2: EndtoEnd TD3 Algorithm under Unknown External Disturbances
In this case, the spacecraft suffers from external disturbances . The learning process is illustrated in Figure 7. The simulation results are illustrated in Figure 8. Compared with the ideal environment, the training speed of the algorithm is slightly reduced, and it still has good convergence accuracy. It can be seen that the disturbance torques are completely compensated by the controller.
(a)
(b)
(c)
(d)
4.4. Case 3: PIDGuide TD3 Algorithm under Unknown External Disturbances
In order to verify the superiority of the PIDGuide TD3 algorithm proposed in Section 3.3, the training speed and stability performance of PIDGuide TD3, EndtoEnd TD3, and pretraining/finetuning method are compared, respectively. The training process and testing process are illustrated in Figures 9 and 10. The simulation results are illustrated in Figure 11.
(a)
(b)
(c)
(d)
A PIDlike controller designed for spacecraft attitude stabilization is used as the guide controller [26] . The weighting coefficients are chosen as , , and .
From Figures 9 and 10, it can be discerned obviously that all the three algorithms got good convergence performance after training. However, compared with EndtoEnd TD3 algorithm, the PIDGuide TD3 has significantly faster training speed and higher convergence accuracy. The PID controller itself does not have very high performance, but it can guide the TD3 to produce better sample data at the beginning of training, thus speeding up the training process.
Further, it can also be observed that the pretrained agent only needs a few episodes of training to adapt to the new environment and at the same time avoid the occurrence of dangerous states in the process of exploration. The benefits of pretraining extend beyond merely quick convergence, since pretraining can improve model robustness and uncertainty. Therefore, the pretrained DRL controller can be deployed to the spacecraft and then finetune parameters on orbit, instead of training from scratch.
5. Conclusions
A DRLbased control approach is proposed to handle the modelfree attitude control problem of OOSS under the guidance of the mixed reward system. The PIDGuide TD3 algorithm based on prior knowledge is proposed to increase the training speed and learning stability of the TD3 algorithm. In addition, the pretraining and finetuning method is proposed to realize the deployment of DRL controller in space. Simulation results show that the DRL controller can achieve highprecision attitude stabilization and attitude tracking control with fast response and small overshoot. The learning curves of the three DRL methods illustrate that the proposed PIDGuide TD3 algorithm has faster learning speed and convergence accuracy than the baseline TD3 algorithm. The pretraining and finetuning method can make the controller adapt to uncertain and unknown disturbances in the actual environment very quickly.
Data Availability
All data, models, or code generated or used during the study are available from the corresponding author by request.
Conflicts of Interest
The authors declare no conflicts of interest.
Acknowledgments
This research was supported by the National Defense Science and Technology Innovation Zone of China, grant number 00205501.
References
 W. Hu, Fundamental spacecraft dynamics and control, John Wiley & Sons, 2015.
 J. Su and K. Cai, “Globally stabilizing proportionalintegralderivative control laws for rigidbody attitude tracking,” Journal of Guidance, Control, and Dynamics, vol. 34, no. 4, pp. 1260–1264, 2011. View at: Publisher Site  Google Scholar
 S. J. Paynter and R. H. Bishop, “Adaptive nonlinear attitude control and momentum management of spacecraft,” Journal of Guidance, Control, and Dynamics, vol. 20, no. 5, pp. 1025–1032, 1997. View at: Publisher Site  Google Scholar
 M. Jafari and S. Mobayen, “Secondorder sliding set design for a class of uncertain nonlinear systems with disturbances: an LMI approach,” Mathematics and Computers in Simulation, vol. 156, pp. 110–125, 2019. View at: Publisher Site  Google Scholar
 D. Fragopoulos and M. Innocenti, “Stability considerations in quaternion attitude control using discontinuous Lyapunov functions,” IEE ProceedingsControl Theory and Applications, vol. 151, no. 3, pp. 253–258, 2004. View at: Publisher Site  Google Scholar
 Y. Park, “Robust and optimal attitude stabilization of spacecraft with external disturbances,” Aerospace Science and Technology, vol. 9, no. 3, pp. 253–259, 2005. View at: Publisher Site  Google Scholar
 F. Bayat, S. Mobayen, and T. Hatami, “Composite nonlinear feedback design for discretetime switching systems with disturbances and input saturation,” International Journal of Systems Science, vol. 49, no. 11, pp. 2362–2372, 2018. View at: Publisher Site  Google Scholar
 R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT press, 2018.
 R. S. Sutton, A. G. Barto, and R. J. Williams, “Reinforcement learning is direct adaptive optimal control,” IEEE Control Systems Magazine, vol. 12, no. 2, pp. 19–22, 1992. View at: Publisher Site  Google Scholar
 C. J. C. H. Watkins and P. Dayan, “Qlearning,” Machine Learning, vol. 8, no. 34, pp. 279–292, 1992. View at: Publisher Site  Google Scholar
 H. V. Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double qlearning in thirtieth AAAI conference on artificial intelligence,” pp. 2094–2100, 2016, https://arxiv.org/abs/1509.06461. View at: Google Scholar
 T. P. Lillicrap, J. J. Hunt, A. Pritzel et al., “Continuous control with deep reinforcement learning,” 2015, https://arxiv.org/abs/1509.02971. View at: Google Scholar
 H. Xiong, T. Ma, L. Zhang, and X. Diao, “Comparison of endtoend and hybrid deep reinforcement learning strategies for controlling cabledriven parallel robots,” Neurocomputing, vol. 377, pp. 73–84, 2020. View at: Publisher Site  Google Scholar
 D. Xu, Z. Hui, Y. Liu, and G. Chen, “Morphing control of a new bionic morphing UAV with deep reinforcement learning,” Aerospace Science and Technology, vol. 92, pp. 232–243, 2019. View at: Publisher Site  Google Scholar
 L. Gong, Q. Wang, C. Hu, and C. Liu, “Switching control of morphing aircraft based on Qlearning,” Chinese Journal of Aeronautics, vol. 33, no. 2, pp. 672–687, 2020. View at: Publisher Site  Google Scholar
 Q. Huang, R. Huang, W. Hao, J. Tan, R. Fan, and Z. Huang, “Adaptive power system emergency control using deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 11, no. 2, pp. 1171–1182, 2020. View at: Publisher Site  Google Scholar
 J. Duan, R. D. Di Shi, H. Li et al., “Deepreinforcementlearningbased autonomous voltage control for power grid operations,” IEEE Transactions on Power Systems, vol. 35, no. 1, pp. 814–817, 2020. View at: Publisher Site  Google Scholar
 Z. Ning, R. Y. K. Kwok, K. Zhang et al., “Joint computing and caching in 5Genvisioned Internet of vehicles: a deep reinforcement learningbased traffic control system,” IEEE Transactions on Intelligent Transportation Systems, pp. 1–12, 2020. View at: Publisher Site  Google Scholar
 N. A. Chaturvedi, A. K. Sanyal, and N. H. McClamroch, “Rigidbody attitude control,” IEEE Control Systems, vol. 31, no. 3, pp. 30–51, 2011. View at: Publisher Site  Google Scholar
 S. Bandyopadhyay, S. J. Chung, and F. Hadaegh, “Attitude control and stabilization of spacecraft with a captured asteroid,” in AIAA Guidance, Navigation, and Control Conference, no. 596, 2015. View at: Publisher Site  Google Scholar
 W. Shi, S. Song, C. Wu, and C. L. P. Chen, “Multi pseudo qlearningbased deterministic policy gradient for tracking control of autonomous underwater vehicles,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 12, pp. 3534–3546, 2019. View at: Publisher Site  Google Scholar
 S. Fujimoto, H. V. Hoof, and D. Meger, “Addressing function approximation error in actorcritic methods,” 2018, https://arxiv.org/abs/1802.09477. View at: Google Scholar
 D. Dewey, “Reinforcement learning and the reward engineering principle,” AAAI Spring Symposium Series, pp. 13–16, 2014. View at: Google Scholar
 Y. H. Wu, Z. C. Yu, C. Y. Li, M. J. He, B. Hua, and Z. M. Chen, “Reinforcement learning in dualarm trajectory planning for a freefloating space robot,” Aerospace Science and Technology, vol. 98, article 105657, 2020. View at: Publisher Site  Google Scholar
 K. He, R. Girshick, and P. Dollár, “Rethinking imagenet pretraining,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4918–4927, Seoul, Korea (South), Korea (South), 2019. View at: Publisher Site  Google Scholar
 C. Li, K. L. Teo, B. Li, and G. Ma, “A constrained optimal PIDlike controller design for spacecraft attitude stabilization,” Acta Astronautica, vol. 74, pp. 131–140, 2012. View at: Publisher Site  Google Scholar
Copyright
Copyright © 2020 ZhiBin Zhang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.