Abstract

This paper proposes an online solution to the optimal tracking control of robotic systems based on a single critic neural network (NN)-based reinforcement learning (RL) method. To this end, the robotic system model is rewritten in state-space form, which facilitates the optimal tracking control synthesis. To maintain the tracking response, a steady-state control is designed, and an adaptive optimal tracking control is then used to ensure that the tracking error converges in an optimal sense. To solve the resulting optimal control problem within the framework of adaptive dynamic programming (ADP), the command trajectory to be tracked and the modified tracking Hamilton-Jacobi-Bellman (HJB) equation are formulated. An online RL algorithm is then developed to solve the HJB equation by means of a critic NN with an online learning law. Simulation results are given to verify the effectiveness of the proposed method.

1. Introduction

In both control theory and practical applications, reinforcement learning (RL) [1, 2] and adaptive dynamic programming (ADP) [3, 4] play a critical role in addressing optimal control problems. The purpose of optimal control is to design a stabilizing control law that minimizes a predefined performance function. In the past years, a large body of work on the optimal regulation problem using RL/ADP algorithms has been reported [5, 6]. The objective is to find an optimal control that minimizes a cost built from the system output energy and the control effort, where the associated optimal control equations can be solved numerically via neural networks (NNs). From the perspective of both theoretical study and practical application, these results pave a new way to solve optimal control problems. Relevant surveys of recent developments on RL and ADP can be found in [7, 8].

RL was first developed in the intelligent control field to address discrete-time Markov decision problems and was later extended to continuous-time (CT) systems. With respect to optimal control designs, Abu-Khalaf et al. [9] suggested a policy iteration (PI) scheme for the optimal regulation of CT nonlinear systems with actuator saturation. To avoid using the time derivatives of the CT dynamics, Lewis et al. [10] developed an integral RL (IRL) technique for systems with partially known dynamics, where knowledge of the full system model is not required. In [11], the authors employed an actor-critic structure and developed a synchronous PI algorithm for CT systems, in which both the optimal cost function and the control policy are estimated using NNs whose weights are updated online simultaneously. For completely unknown system dynamics, the results in [12] showed that a model-free PI approach can be developed for CT linear systems, which computes the optimal solutions online using input/output measurements. This principle was subsequently extended to nonlinear systems in [13, 14]. Another learning mechanism used in RL, experience replay (ER), was recently incorporated into the ADP synthesis for optimal control in [15], where past system observations are utilized together with the current information to enhance the convergence speed of the online learning.

Moreover, most existing results on ADP-based optimal control focus only on the optimal regulation problem. In practical applications, however, the optimal tracking control problem (OTCP) arises more widely than the regulation problem [16–18], in particular for robotic applications. The OTCP is also more challenging to address, since its solution is usually composed of a feedforward action to guarantee perfect tracking and a feedback action to stabilize the closed-loop tracking error dynamics [19]. For linear systems, the solution of the OTCP is obtained by solving Riccati equations [20], while for nonlinear systems, the existing solution of the OTCP is derived with a feedforward term obtained by dynamics inversion and a feedback term obtained by solving a complex HJB equation [16, 21]. However, it is well known that deriving the solution of the OTCP is typically intractable, especially for online tracking control. Hence, only a few results have been reported on the OTCP in the literature, in particular for robotic systems.

According to the above facts, we propose a new RL algorithm to realize the optimal tracking control of robotic systems. To this end, the system model is rewritten in state-space form, which contributes to the realization of the optimal tracking control. Then, a steady-state control is designed, and an adaptive optimal tracking control is used to ensure that the tracking error converges in an optimal manner. To derive this optimal control, the command trajectory to be tracked and the modified tracking Hamilton–Jacobi–Bellman (HJB) equation are formulated. Finally, an online RL method is used to solve the derived HJB equation using a single critic NN approximation. Numerical simulations are also given to show the validity of the proposed approach. The contributions can be summarized as follows:
(1) To achieve the optimal tracking control, the robotic system model is transformed into a canonical state-space form, which contributes to the realization of the optimal tracking control.
(2) A critic NN is applied to reconstruct the cost function with guaranteed convergence, such that the actor NN used in existing ADP structures is avoided and the computational cost is reduced.
(3) An RL algorithm is proposed to obtain the solution of the derived HJB equation, which guarantees the convergence of the critic NN weights and thus the optimal convergence of the tracking error.

The paper is structured as follows. In Section 2, the system model is transformed into a canonical form, and a tracking performance function is constructed. In Section 3, an adaptive steady-state control is designed, and an optimal control is developed with RL to make the tracking error dynamics convergent; for this purpose, a single critic NN is applied to estimate the solution of the HJB equation and to update the optimal control action. Section 4 gives simulation results to show the validity of the developed control and learning techniques. Conclusions are summarized in Section 5.

Notations: $\mathbb{R}$ denotes the real number set, $\mathbb{R}^{n}$ is the set of $n$-dimensional real vectors, and $\mathbb{R}^{n\times m}$ is the set of $n\times m$ real matrices. $\|\cdot\|$ denotes the Euclidean norm of a vector in $\mathbb{R}^{n}$ or a matrix in $\mathbb{R}^{n\times m}$. $I$ is the identity matrix, and $0$ denotes the zero matrix. $\lambda_{\max}(\cdot)$ and $\lambda_{\min}(\cdot)$ are the maximal and minimum eigenvalues of a matrix, respectively. $\mathrm{diag}\{\cdot\}$ is a diagonal matrix with the indicated components, and $\partial$ defines the partial differential operation.

2. Preliminaries and Problem Statement

In this paper, we consider the general rigid body dynamics of a nonlinear robotic manipulator. On the basis of the Lagrangian formulation, the manipulator dynamics can be formulated as [22, 23]
$$M(q)\ddot{q} + C(q,\dot{q})\dot{q} + G(q) = \tau, \tag{1}$$
where $q \in \mathbb{R}^{n}$ denotes the generalized coordinates representing the joint position and $\dot{q}, \ddot{q} \in \mathbb{R}^{n}$ denote the derivatives of the joint position (i.e., velocity and acceleration) with respect to time $t$. Here, $n$ denotes the number of degrees of freedom, $M(q) \in \mathbb{R}^{n\times n}$ denotes a positive definite inertia matrix which is invertible, $C(q,\dot{q})\dot{q}$ are the Coriolis/centripetal dynamics, $G(q)$ represents the gravitational dynamics, and $\tau$ is the control torque.

For brevity of notation, we set $x_{1} = q$ and $x_{2} = \dot{q}$; then,
$$\dot{x}_{1} = x_{2}, \qquad \dot{x}_{2} = M^{-1}(x_{1})\left[\tau - C(x_{1},x_{2})x_{2} - G(x_{1})\right]. \tag{2}$$

Hence, we write system (1) in the state-space form
$$\dot{x} = f(x) + g(x)\tau, \tag{3}$$
where $x = [x_{1}^{T}, x_{2}^{T}]^{T}$ is the system state, $f(x)$ is the known system dynamics, which is a continuous function with $f(0) = 0$, $g(x)$ is the control gain matrix, and $\tau$ denotes the control torque.
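As a concrete illustration of this rewriting, the following Python sketch assembles $f(x)$ and $g(x)$ from user-supplied $M(q)$, $C(q,\dot{q})$, and $G(q)$ terms; the two-joint model passed in at the end is a simplified placeholder (unit inertia, no Coriolis or gravity), not the robot model considered later in Section 4.

```python
import numpy as np

def robot_state_space(M_fun, C_fun, G_fun):
    """Return f(x), g(x) for the state-space form x_dot = f(x) + g(x) * tau,
    assuming x = [q; q_dot] as in (2)-(3)."""
    def f(x):
        n = x.size // 2
        q, qd = x[:n], x[n:]
        Minv = np.linalg.inv(M_fun(q))
        return np.concatenate([qd, -Minv @ (C_fun(q, qd) @ qd + G_fun(q))])
    def g(x):
        n = x.size // 2
        q = x[:n]
        return np.vstack([np.zeros((n, n)), np.linalg.inv(M_fun(q))])
    return f, g

# Placeholder two-joint model (unit inertia, no Coriolis/gravity), for illustration only.
f, g = robot_state_space(lambda q: np.eye(2),
                         lambda q, qd: np.zeros((2, 2)),
                         lambda q: np.zeros(2))
x = np.array([0.1, -0.2, 0.0, 0.0])
tau = np.array([0.5, 0.5])
print(f(x) + g(x) @ tau)   # x_dot under the torque tau
```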

Assumption 1. The system dynamics $f(x)$ with $f(0) = 0$ are Lipschitz continuous. Hence, system (3) is stabilizable, i.e., a continuous control $\tau \in \Omega$ can be found to stabilize the system, where $\Omega$ is an admissible set.
This paper aims to find an optimal control to ensure that the system state $x$ tracks a desired trajectory $x_{d}$ by minimizing the following cost function:
$$J(e(t)) = \int_{t}^{\infty} r\big(e(\varsigma), u(\varsigma)\big)\, d\varsigma = \int_{t}^{\infty} \left(e^{T} Q e + u^{T} R u\right) d\varsigma, \tag{4}$$
where $Q$ and $R$ are positive definite symmetric matrices [2], $\Omega$ is the set of admissible policies [9], $r(e,u) = e^{T}Qe + u^{T}Ru$ is a positive utility function, and $e$ is the tracking error defined as
$$e = x - x_{d}, \tag{5}$$
where $x_{d}$ is the reference trajectory. In this paper, the reference trajectory $x_{d}$ and its derivative $\dot{x}_{d}$ are assumed to be continuous and bounded.

Remark 1. Several methods have been developed to solve the tracking control problem of robotic systems, e.g., [24, 25]. However, most existing robotic controllers are not designed in an optimal manner, i.e., the required control actions may be large. Different from these results, a novel RL algorithm is proposed in the following sections to design an optimal control for robotic systems that achieves trajectory tracking while reducing the required control energy.

Remark 2. For the linear OTCP, the derivation of the optimal solution is shown in [18]: it consists of a feedback term, whose gain involves the input matrix and can be solved by addressing a Riccati equation, together with a feedforward action. Different from the linear case, the tracking control problem for nonlinear systems is not trivial, since a uniform formulation as in the linear case cannot be obtained. This fact motivated the current study.

3. Online Dynamic Tracking Control

To realize the optimal tracking control design, we decompose the control input into two parts [18, 21] as
$$\tau = u_{s} + u_{e}, \tag{6}$$
where $u_{s}$ and $u_{e}$ are the steady-state control and the optimal control, which are applied to maintain the steady-state trajectory tracking and to stabilize the tracking error dynamics optimally, respectively. Figure 1 shows the proposed control system structure.

3.1. Steady-State Tracking Control

Since $u_{s}$ can be adopted to ensure that the tracking error converges to zero in the steady state, we have from (3) that
$$u_{s} = g^{+}(x)\left(\dot{x}_{d} - f(x) - K e\right), \tag{7}$$
where $K$ is the feedback gain set by the designers and $g^{+}(x)$ is defined as the generalized inverse of $g(x)$.

Then, based on (3) and (7), the tracking error dynamics can be written as
$$\dot{e} = \dot{x} - \dot{x}_{d} = -Ke + g(x)u_{e}. \tag{8}$$

From equations (7) and (8), we know that the tracking control of system (3) can be considered as the regulation problem of (8). Hence, an optimal control $u_{e}$ will be designed to stabilize the tracking error dynamics (8) in an optimal manner.

3.2. Approximate Optimal Tracking Control

The controller $u_{e}$ is designed to make (8) converge in an optimal sense. For this purpose, we rewrite the infinite horizon cost function (4) as follows:
$$J(e(t)) = \int_{t}^{\infty} r\big(e(\varsigma), u_{e}(\varsigma)\big)\, d\varsigma = \int_{t}^{\infty} \left(e^{T} Q e + u_{e}^{T} R u_{e}\right) d\varsigma. \tag{9}$$

Then, an admissible control policy $u_{e}$ should be found so that cost function (9) of system (8) can be minimized. To this end, the nonlinear Lyapunov equation associated with (9) along error dynamics (8) is given by
$$0 = r(e, u_{e}) + (\nabla J)^{T}\left(-Ke + g(x)u_{e}\right), \tag{10}$$
with $\nabla J = \partial J/\partial e$ being the partial differential of $J$ with respect to $e$.

Then, the optimal cost function is given as
$$J^{*}(e) = \min_{u_{e}\in\Omega}\int_{t}^{\infty} r\big(e(\varsigma), u_{e}(\varsigma)\big)\, d\varsigma, \tag{11}$$
and the derived HJB equation is shown as
$$0 = \min_{u_{e}\in\Omega}\left[r(e, u_{e}) + (\nabla J^{*})^{T}\left(-Ke + g(x)u_{e}\right)\right]. \tag{12}$$

The optimal control $u_{e}^{*}$ can be derived by applying the stationarity condition to (10) as
$$u_{e}^{*} = -\frac{1}{2}R^{-1}g^{T}(x)\nabla J^{*}. \tag{13}$$
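To make the step from (10) to (13) explicit, the first-order stationarity condition reads (using the notation above; this is the standard derivation restated here for convenience):
$$\frac{\partial}{\partial u_{e}}\left[e^{T}Qe + u_{e}^{T}Ru_{e} + (\nabla J^{*})^{T}\left(-Ke + g(x)u_{e}\right)\right] = 2Ru_{e} + g^{T}(x)\nabla J^{*} = 0,$$
which gives $u_{e}^{*} = -\frac{1}{2}R^{-1}g^{T}(x)\nabla J^{*}$, i.e., exactly (13).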

The remaining problem is to solve HJB equation (12) for the optimal cost function $J^{*}$ required in control (13).

3.2.1. Online Reinforcement Learning Algorithm

To calculate the above optimal control, we recall the policy iteration (PI) method. Inspired by [1, 26], a policy iteration algorithm can be given as follows:
(1) Select a small positive constant $\epsilon$. Let $i = 0$ and $J_{0} = 0$, and set an initial admissible control policy $u_{e}^{0}$.
(2) Solve the nonlinear Lyapunov equation $0 = r(e, u_{e}^{i}) + (\nabla J_{i+1})^{T}\left(-Ke + g(x)u_{e}^{i}\right)$ (14) for $J_{i+1}$ using the control policy $u_{e}^{i}$, with $J_{i+1}(0) = 0$.
(3) Improve the control policy by $u_{e}^{i+1} = -\frac{1}{2}R^{-1}g^{T}(x)\nabla J_{i+1}$ (15).
(4) If $\|J_{i+1} - J_{i}\| \leq \epsilon$, stop the iteration and take the approximate optimal control; else, let $i = i + 1$ and go back to Step 2.

The above PI scheme can guarantee convergence to the optimal cost function and control action, i.e., $J_{i} \to J^{*}$ and $u_{e}^{i} \to u_{e}^{*}$ as $i \to \infty$. The convergence proof of the PI algorithm is detailed in [9].
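To make the iteration concrete, the following Python sketch runs the same evaluate-improve loop on a linear-quadratic instance of error dynamics (8) (i.e., $\dot{e} = Ae + Bu_{e}$ with constant matrices), for which Step 2 reduces to a matrix Lyapunov equation and Step 3 to a gain update; this is Kleinman's classical algorithm. All matrices and the initial gain below are illustrative assumptions, not values taken from this paper.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Linear-quadratic instance of the PI scheme: error dynamics e_dot = A e + B u_e,
# cost integrand e'Qe + u_e'R u_e.  All numerical values are illustrative.
A = -1.0 * np.eye(2)                 # plays the role of -K in (8)
B = np.array([[0.0], [1.0]])         # plays the role of g(x), frozen for the LQ case
Q = np.eye(2)
R = np.eye(1)

L = np.array([[1.0, 1.0]])           # initial admissible (stabilizing) feedback u_e = -L e
eps = 1e-9
P_prev = np.zeros_like(Q)

for i in range(100):
    # Step 2 (policy evaluation): solve (A - B L)' P + P (A - B L) + Q + L' R L = 0
    # for P, so that J = e' P e is the cost of the current policy.
    Acl = A - B @ L
    P = solve_continuous_lyapunov(Acl.T, -(Q + L.T @ R @ L))
    # Step 3 (policy improvement): u_e = -R^{-1} B' P e, i.e., L <- R^{-1} B' P.
    L = np.linalg.solve(R, B.T @ P)
    # Step 4 (stopping rule)
    if np.linalg.norm(P - P_prev) <= eps:
        break
    P_prev = P

print("Iterations:", i + 1)
print("PI gain   :", L)
```

For the nonlinear error dynamics (8), Step 2 has no closed-form solution in general, which is exactly what motivates the critic NN approximation introduced in Section 3.2.2.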

3.2.2. Neural Network Approximation

The above policy iteration method is run offline. To implement online optimal control, we will introduce an online learning method in this section.

It is generally difficult to derive the solution of HJB equation (12). As shown in [27, 28], we will use a critic NN to estimate the ideal cost function $J^{*}(e)$. In this paper, the cost function is considered to be smooth; then a critic NN [26, 28] is applied to approximate $J^{*}(e)$ as
$$J^{*}(e) = W^{T}\sigma(e) + \varepsilon(e), \tag{16}$$
where $W$ is the ideal NN weight, $\sigma(e)$ is the activation function, $l$ is the number of neurons, and $\varepsilon(e)$ is the approximation error. Then, we have its derivative with respect to $e$ as
$$\nabla J^{*}(e) = \nabla\sigma^{T}(e)W + \nabla\varepsilon(e), \tag{17}$$
where $\nabla\sigma = \partial\sigma/\partial e$ and $\nabla\varepsilon = \partial\varepsilon/\partial e$ are the partial derivatives. Then, based on (17), equation (10) is represented as
$$0 = r(e, u_{e}) + \left(W^{T}\nabla\sigma + \nabla\varepsilon^{T}\right)\left(-Ke + g(x)u_{e}\right). \tag{18}$$

Assumption 2. (see [9, 26]). Consider the critic NN weight $W$ with the regressor $\sigma(e)$; then, the approximation error $\varepsilon$ and its gradient $\nabla\varepsilon$ are both bounded. Moreover, we have $\varepsilon \to 0$ and $\nabla\varepsilon \to 0$ as $l \to \infty$.
According to ideal cost function (16), the actual cost function can be given as
$$\hat{J}(e) = \hat{W}^{T}\sigma(e), \tag{19}$$
which can be used to estimate the practical cost function, where $\hat{W}$ is the estimated critic NN weight. For critic NN (19), we can select $\sigma(e)$ such that $\hat{J}(e) > 0$ for $e \neq 0$ and $\hat{J}(e) = 0$ for $e = 0$. Then, we have
$$\nabla\hat{J}(e) = \nabla\sigma^{T}(e)\hat{W}, \tag{20}$$
where $\nabla\hat{J} = \partial\hat{J}/\partial e$.
Then, the approximated Hamiltonian function can be derived as
$$H(e, \hat{W}, u_{e}) = r(e, u_{e}) + \hat{W}^{T}\xi = \delta, \qquad \xi = \nabla\sigma\left(-Ke + g(x)u_{e}\right), \tag{21}$$
where $\delta$ is the Bellman residual. For training the critic NN to obtain the control action, it is expected to estimate $\hat{W}$ to minimize the objective function $E = \frac{1}{2}\delta^{T}\delta$. Hence, the gradient descent algorithm can be used to update the critic NN weights by
$$\dot{\hat{W}} = -\alpha\frac{\partial E}{\partial \hat{W}} = -\alpha\,\xi\,\delta, \tag{22}$$
where $\alpha > 0$ is the learning gain.
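A minimal Python sketch of update law (22) is given below, assuming a quadratic polynomial basis for $\sigma(e)$ and a plain Euler discretization of the gradient step; both choices, and all numerical values, are illustrative assumptions rather than the configuration used in Section 4. In practice, the step is often normalized by a factor such as $(1+\xi^{T}\xi)^{2}$ to improve conditioning.

```python
import numpy as np

def sigma(e):
    """Quadratic basis sigma(e): upper-triangular entries of e e^T (illustrative choice)."""
    return np.outer(e, e)[np.triu_indices(e.size)]

def grad_sigma(e, h=1e-6):
    """Numerical Jacobian d sigma / d e (finite differences keep the sketch short)."""
    J = np.zeros((sigma(e).size, e.size))
    for j in range(e.size):
        de = np.zeros(e.size); de[j] = h
        J[:, j] = (sigma(e + de) - sigma(e - de)) / (2 * h)
    return J

def critic_step(W_hat, e, u_e, e_dot, Q, R, alpha, dt):
    """One Euler step of (22): W_hat_dot = -alpha * xi * delta, with
    delta = r(e, u_e) + W_hat^T xi  and  xi = grad_sigma(e) @ e_dot."""
    xi = grad_sigma(e) @ e_dot
    delta = e @ Q @ e + u_e @ R @ u_e + W_hat @ xi
    return W_hat - dt * alpha * xi * delta

# Toy call with made-up signals, just to show the quantities involved.
e = np.array([0.2, -0.1]); e_dot = np.array([-0.05, 0.02]); u_e = np.array([0.1, 0.0])
W = np.zeros(sigma(e).size)
W = critic_step(W, e, u_e, e_dot, np.eye(2), np.eye(2), alpha=5.0, dt=0.01)
print(W)
```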
Based on (18), the Hamiltonian function with the ideal weight is
$$H(e, W, u_{e}) = r(e, u_{e}) + W^{T}\xi = \varepsilon_{H}, \tag{23}$$
where $\varepsilon_{H} = -\nabla\varepsilon^{T}\left(-Ke + g(x)u_{e}\right)$ is the residual error.
Define $\tilde{W} = W - \hat{W}$ as the estimation error of the critic NN weights, and assume $\|\varepsilon_{H}\| \leq \varepsilon_{N}$ for a positive constant $\varepsilon_{N}$. Then, from (21) and (23), we have $\delta = \varepsilon_{H} - \tilde{W}^{T}\xi$. Hence, the estimation error dynamics are given by
$$\dot{\tilde{W}} = -\dot{\hat{W}} = -\alpha\,\xi\xi^{T}\tilde{W} + \alpha\,\xi\varepsilon_{H}. \tag{24}$$
The persistent excitation (PE) condition is required to retain the convergence of the critic NN weights, i.e., the condition $\int_{t}^{t+T}\xi(\varsigma)\xi^{T}(\varsigma)\,d\varsigma \geq \beta I$ should hold for some $T > 0$ and a positive constant $\beta$. This condition can be satisfied in this paper since we consider the tracking control problem, and thus the probing noise used in many existing ADP studies may not be necessary.
When implementing the online optimal control algorithm with critic NN (16), we have from (13) and (17) the optimal control as
$$u_{e}^{*} = -\frac{1}{2}R^{-1}g^{T}(x)\left(\nabla\sigma^{T}W + \nabla\varepsilon\right). \tag{25}$$
Then, the approximated control action with critic NN (19) is formulated as
$$\hat{u}_{e} = -\frac{1}{2}R^{-1}g^{T}(x)\nabla\sigma^{T}\hat{W}. \tag{26}$$
Equation (26) implies that with the updated critic NN weights $\hat{W}$, the approximated control action can be calculated directly. Consequently, the widely used actor-critic structure can be simplified, and only the critic NN is adopted in this paper, which reduces the computational cost.
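Given the current critic weights, the approximate control (26) can be evaluated directly, which is what removes the need for a separate actor NN. A short Python sketch follows, again assuming a quadratic basis for $\sigma(e)$; all numerical values are arbitrary placeholders.

```python
import numpy as np

def sigma_grad(e, h=1e-6):
    """Jacobian of the quadratic basis sigma(e) = upper-triangular of e e^T (illustrative)."""
    def sigma(v): return np.outer(v, v)[np.triu_indices(v.size)]
    J = np.zeros((sigma(e).size, e.size))
    for j in range(e.size):
        de = np.zeros(e.size); de[j] = h
        J[:, j] = (sigma(e + de) - sigma(e - de)) / (2 * h)
    return J

def u_e_hat(W_hat, e, g_x, R):
    """Approximated optimal control (26): u_e = -0.5 R^{-1} g(x)^T grad_sigma(e)^T W_hat."""
    grad_J = sigma_grad(e).T @ W_hat           # estimate of dJ/de, cf. (20)
    return -0.5 * np.linalg.solve(R, g_x.T @ grad_J)

# Example with a 2-dimensional error and a 2x2 input map (values are arbitrary).
e = np.array([0.2, -0.1])
print(u_e_hat(W_hat=np.ones(3), e=e, g_x=np.eye(2), R=np.eye(2)))
```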
Next, the stability of the proposed algorithm is given.

Lemma 1. For error system (8) with control (26) and learning law (22), the estimation error dynamics (24) are uniformly ultimately bounded (UUB).

Proof. We select the Lyapunov function candidate as $V = \frac{1}{2\alpha}\tilde{W}^{T}\tilde{W}$. Then, the time derivative of the Lyapunov function along the trajectory of error dynamics (24) is
$$\dot{V} = \frac{1}{\alpha}\tilde{W}^{T}\dot{\tilde{W}} = -\tilde{W}^{T}\xi\xi^{T}\tilde{W} + \tilde{W}^{T}\xi\varepsilon_{H}.$$
After some mathematical manipulations, we have
$$\dot{V} \leq -\left\|\xi^{T}\tilde{W}\right\|^{2} + \left\|\xi^{T}\tilde{W}\right\|\left\|\varepsilon_{H}\right\|.$$
Considering the Cauchy–Schwarz inequality and noticing the assumption $\|\varepsilon_{H}\| \leq \varepsilon_{N}$, we can conclude that $\dot{V} < 0$ as long as the PE condition on $\xi$ holds and $\|\xi^{T}\tilde{W}\| > \varepsilon_{N}$. According to the Lyapunov theory, we obtain that the estimation error $\tilde{W}$ is UUB.

4. Simulation

To demonstrate the validity of the developed method, a numerical simulation based on a SCARA robot plant is given. Consider the dynamics of a two-degree-of-freedom SCARA robot, written in form (1) as system (30), where $q$ and $\dot{q}$ are the SCARA robot's joint position and velocity vectors. The inverse of the inertia matrix $M^{-1}(q)$, the Coriolis dynamics $C(q,\dot{q})$, and the gravity dynamics $G(q)$ are given accordingly; the detailed modelling process can be found in [29].

Then, system (30) can be formulated in form (3), where $f(x)$ denotes the drift dynamics and $g(x)$ denotes the control gain. To complete the optimal tracking control, the steady-state control (7) for system (30) is constructed accordingly. The initial critic NN weights, the initial system states, and the learning gain $\alpha$ are then specified, the activation function $\sigma(e)$ of the critic NN is chosen, and the weighting matrices $Q$ and $R$ are selected as identity matrices as in [29]. The desired trajectories are given as sinusoidal signals. During the implementation of the policy iteration algorithm, we take these sinusoidal signals as the reference, and thus the persistent excitation condition is fulfilled. In this case, probing noise does not need to be introduced into the system.
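Since the exact simulation parameters are not reproduced above, the following self-contained Python sketch only illustrates how the pieces fit together in closed loop, using a simplified two-joint plant (unit inertia, no Coriolis or gravity terms), assumed gains and basis functions, and a sinusoidal reference; it mirrors the structure of the proposed scheme (steady-state control (7), critic update (22), and approximate optimal control (26)) rather than reproducing the SCARA simulation of this section.

```python
import numpy as np

# Simplified two-joint plant (placeholder for the SCARA model): M = I, C = 0, G = 0,
# so x = [q; qd], f(x) = [qd; 0], g(x) = [0; I].
n, dt, T = 2, 1e-3, 20.0
f = lambda x: np.concatenate([x[n:], np.zeros(n)])
g = lambda x: np.vstack([np.zeros((n, n)), np.eye(n)])
g_pinv = lambda x: np.linalg.pinv(g(x))

Q, R, K, alpha = np.eye(2 * n), np.eye(n), 2.0 * np.eye(2 * n), 10.0   # assumed gains

def sigma(e):                       # quadratic critic basis (assumed choice)
    return np.outer(e, e)[np.triu_indices(e.size)]

def grad_sigma(e, h=1e-6):          # finite-difference Jacobian of sigma
    J = np.zeros((sigma(e).size, e.size))
    for j in range(e.size):
        de = np.zeros(e.size); de[j] = h
        J[:, j] = (sigma(e + de) - sigma(e - de)) / (2 * h)
    return J

def reference(t):                   # sinusoidal reference x_d and its derivative
    qd  = np.array([np.sin(t), np.cos(t)])
    qdd = np.array([np.cos(t), -np.sin(t)])
    return np.concatenate([qd, qdd]), np.concatenate([qdd, -qd])

x = np.array([0.5, -0.5, 0.0, 0.0])          # assumed initial state
W = np.zeros(sigma(np.zeros(2 * n)).size)    # critic weights start at zero

for k in range(int(T / dt)):
    t = k * dt
    xd, xd_dot = reference(t)
    e = x - xd
    # Steady-state control (7) and approximate optimal control (26)
    u_s = g_pinv(x) @ (xd_dot - f(x) - K @ e)
    grad_J = grad_sigma(e).T @ W
    u_e = -0.5 * np.linalg.solve(R, g(x).T @ grad_J)
    # Critic update (22), with an added normalization factor as an implementation choice
    e_dot = -K @ e + g(x) @ u_e
    xi = grad_sigma(e) @ e_dot
    delta = e @ Q @ e + u_e @ R @ u_e + W @ xi
    W += -dt * alpha * xi * delta / (1.0 + xi @ xi) ** 2
    # Plant integration (forward Euler)
    x += dt * (f(x) + g(x) @ (u_s + u_e))

print("final tracking error:", np.round(x - reference(T)[0], 4))
```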

With these parameter configurations, numerical simulations are conducted, and the simulation results are given in Figures 2–6. The online profiles of the approximated critic NN weights under the proposed adaptive law (22) are displayed in Figure 2; they converge to constant values after a short transient stage. Clearly, the weights are convergent under the proposed online learning. With these online updated critic NN weights, the estimated control can converge to the ideal solution. The system tracking performance is shown in Figures 3 and 4, which indicate that the system states can track the desired trajectories with the proposed optimal control. To better show the motion tracking results, the tracking errors are given in Figure 5, which converge to zero with very smooth profiles. Moreover, the control input is shown in Figure 6 and is also bounded. It can be found from the above simulations that the proposed optimal control can realize precise tracking control. In particular, the proposed RL algorithm ensures a convergent response of the critic NN weights.

5. Conclusion

The optimal tracking control design for robotic systems using an RL algorithm has been presented in this paper. The system model is first transformed into a canonical form, which facilitates the realization of the optimal tracking control design. To maintain the tracking response, a steady-state control is designed, and an optimal tracking control is then used to ensure that the tracking error converges in an optimal manner. The online RL algorithm is dedicated to solving the HJB equation using a critic NN. Numerical simulations are given to show the effectiveness of the proposed technique. We will study the nonlinear optimal tracking problem with fully unknown system dynamics in future work.

Data Availability

The data used to support the findings of this study were curated by the authors and are available upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Key R&D Project (2016YFE0202300), National Natural Science Foundation of China (U1903214, 62071339, 61671332, and U1736206), and Hubei Province Technological Innovation Major Project (2019AAA049).