Abstract
Adaptive dynamic programming (ADP), which belongs to the field of computational intelligence, is a powerful tool to address optimal control problems. To overcome the bottleneck of solving Hamilton–Jacobi–Bellman equations, several stateoftheart ADP approaches are reviewed in this paper. First, two modelbased offline iterative ADP methods including policy iteration (PI) and value iteration (VI) are given, and their respective advantages and shortcomings are discussed in detail. Second, the multistep heuristic dynamic programming (HDP) method is introduced, which avoids the requirement of initial admissible control and achieves fast convergence. This method successfully utilizes the advantages of PI and VI and overcomes their drawbacks at the same time. Finally, the discretetime optimal control strategy is tested on a power system.
1. Introduction
Adaptive dynamic programming (ADP) [1–4], which integrates the advantages of reinforcement learning (RL) [5–8] and adaptive control, has become a powerful tool for solving optimal control problems. With decades of development, ADP has also provided many approaches to other control problems, such as robust control [9, 10], optimal control with input constraints [11, 12], optimal tracking control [13, 14], zero-sum games [15], and non-zero-sum games [16]. Furthermore, ADP methods have been widely applied to real-world systems, such as the water–gas shift reaction [17], battery management [18], microgrid systems [19, 20], and the Quanser helicopter [21]. The aforementioned works were all inspired by and developed from the basic works on ADP-based optimal control; i.e., optimal control is the core research topic of ADP.
The bottleneck in solving nonlinear optimal control problems is obtaining the solutions of Hamilton–Jacobi–Bellman (HJB) equations. However, these equations are generally difficult or even impossible to solve analytically. To overcome this difficulty, ADP has provided several important iterative learning frameworks, such as policy iteration (PI) [2, 22, 23] and value iteration (VI) [24–26]. The PI algorithm starts from an initial admissible control policy and then performs the policy evaluation step and the policy improvement step successively until convergence. The main advantage of PI is that it ensures all the iterative control policies are admissible and achieves fast convergence. The drawback of PI is also obvious: the requirement of an initial admissible control is a strict condition in practice, which seriously limits its applications. Different from PI, VI can start from an arbitrary positive semidefinite value function, which is an easy-to-realize initial condition. Although the easier initial condition makes VI more practical, it also leads to a longer iterative learning process; that is, VI converges much more slowly than PI. Thus, it is desirable to develop a new method that avoids the requirement of an initial admissible control and converges faster than the VI algorithm. To realize these purposes, the multistep heuristic dynamic programming (HDP) approach [27] is presented, which integrates the merits of the PI and VI algorithms while overcoming their drawbacks.
This paper reviews state-of-the-art ADP algorithms for the optimal control of discrete-time (DT) systems. The rest of this paper is arranged as follows. In Section 2, the problem formulation is presented. Three iterative model-based offline learning algorithms, along with comprehensive comparisons, are presented in Sections 3 and 4. The proposed DT optimal control strategy is tested on a power system in Section 5. Finally, a brief conclusion is drawn in Section 6.
2. Problem Formulation
In this paper, we consider the general nonlinear DT system
$$x_{k+1} = f(x_k) + g(x_k)u_k, \quad k = 0, 1, 2, \ldots, \tag{1}$$
where $x_k \in \mathbb{R}^n$ represents the system state, $u_k \in \mathbb{R}^m$ denotes the control input, and $f(\cdot)$ and $g(\cdot)$ are the system functions.
The purpose of the optimal control problem is to find a state feedback control policy $u(x_k)$ that can not only stabilize system (1) but also minimize the following performance index function:
$$J(x_0) = \sum_{k=0}^{\infty} U(x_k, u_k), \tag{2}$$
where $U(x_k, u_k) = x_k^{T} Q x_k + u_k^{T} R u_k$ is the utility function. The positive definite matrices $Q$ and $R$ weight the performance of the system states and control inputs, respectively. Given the admissible control policy $u(x_k)$, the value function can be described by
$$V(x_k) = \sum_{j=k}^{\infty} U\big(x_j, u(x_j)\big). \tag{3}$$
According to the definition of optimal control, the optimal value function can be defined by
$$J^{*}(x_k) = \min_{\underline{u}_k} \sum_{j=k}^{\infty} U(x_j, u_j), \tag{4}$$
where $\underline{u}_k = (u_k, u_{k+1}, \ldots)$ denotes the control sequence from time step $k$ onward.
By using the stationarity condition [28], the optimal control policy can be derived as
$$u^{*}(x_k) = -\frac{1}{2} R^{-1} g^{T}(x_k)\, \frac{\partial J^{*}(x_{k+1})}{\partial x_{k+1}}, \tag{5}$$
where $x_{k+1} = f(x_k) + g(x_k) u^{*}(x_k)$.
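As a brief reasoning step, assuming the standard quadratic utility $U(x_k, u_k) = x_k^{T} Q x_k + u_k^{T} R u_k$ and control-affine dynamics (so that $\partial x_{k+1} / \partial u_k = g(x_k)$), the stationarity condition sets the gradient of the minimized quantity with respect to $u_k$ to zero:

```latex
\frac{\partial}{\partial u_k}\left[ x_k^{T} Q x_k + u_k^{T} R u_k + J^{*}(x_{k+1}) \right]
  = 2 R u_k + g^{T}(x_k)\, \frac{\partial J^{*}(x_{k+1})}{\partial x_{k+1}} = 0 ,
```

and solving this linear equation for $u_k$ yields the optimal control policy.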
The key to obtaining the optimal control policy is to solve the following DT HJB equation [27]:
$$J^{*}(x_k) = \min_{u_k}\big\{ x_k^{T} Q x_k + u_k^{T} R u_k + J^{*}(x_{k+1}) \big\}. \tag{6}$$
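For intuition, consider the special case of a scalar linear system $x_{k+1} = a x_k + b u_k$ with cost $\sum_k (q x_k^2 + r u_k^2)$: the HJB equation admits a quadratic solution $J^*(x) = p x^2$ and reduces to a scalar discrete algebraic Riccati equation, which can be solved by fixed-point iteration. The sketch below uses hypothetical values of $a, b, q, r$ (not taken from the paper):

```python
# Scalar DT HJB: for x_{k+1} = a*x + b*u and cost q*x^2 + r*u^2,
# J*(x) = p*x^2 turns the HJB into the scalar Riccati equation
#   p = q + a^2*p - (a*b*p)^2 / (r + b^2*p),
# solved here by fixed-point iteration from p = 0.

def solve_hjb_scalar(a, b, q, r, iters=200):
    p = 0.0
    for _ in range(iters):
        # minimizing over u in the HJB gives this closed-form recursion
        p = q + a*a*p - (a*b*p)**2 / (r + b*b*p)
    k = a*b*p / (r + b*b*p)          # optimal feedback gain, u* = -k*x
    return p, k

p, k = solve_hjb_scalar(a=1.2, b=1.0, q=1.0, r=1.0)
```

The fixed point satisfies the HJB exactly, and the resulting closed loop $a - bk \approx 0.41$ is stable even though the open loop ($a = 1.2$) is not.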
Remark 1. Figure 1 shows the relationship and differences between discrete-time and continuous-time optimal control. Real-world systems generally evolve in continuous time. After mathematical modeling, they are formulated as continuous-time system models. Through sampling and discretization, the continuous-time system models are converted into discrete-time ones. Accordingly, the performance indexes and HJB equations of discrete-time systems are the discretized counterparts of those for continuous-time systems. The key to solving the discrete-time optimal control problem is the discrete-time HJB equation, which is a nonlinear partial difference equation. There are many more existing works on continuous-time systems than on discrete-time systems. In order to overcome this bottleneck, several ADP learning algorithms along with their neural network (NN) implementations will be introduced.
3. Model-Based PI Algorithm for the Optimal Control Problem of DT Systems
In this section, the model-based PI algorithm along with its NN implementation will be introduced in detail. The model-based PI algorithm [2, 23] is shown in Algorithm 1.
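To make the two steps of the PI iteration concrete, here is a minimal sketch on a scalar linear-quadratic instance (hypothetical values, not from the paper): for a policy $u = -kx$ on $x_{k+1} = ax_k + bu_k$, the policy evaluation step has the closed form $p = (q + rk^2)/(1 - c^2)$ with closed loop $c = a - bk$, and the improvement step is the usual Riccati-type gain update.

```python
# Model-based policy iteration for x_{k+1} = a*x + b*u, cost q*x^2 + r*u^2.
# PI requires an initial ADMISSIBLE gain k0, i.e. |a - b*k0| < 1.

def policy_iteration(a, b, q, r, k0, iters=10):
    k = k0
    for _ in range(iters):
        c = a - b*k                      # closed-loop dynamics under u = -k*x
        assert abs(c) < 1.0, "policy must remain admissible"
        p = (q + r*k*k) / (1.0 - c*c)    # policy evaluation (exact, V = p*x^2)
        k = a*b*p / (r + b*b*p)          # policy improvement
    return p, k

p, k = policy_iteration(a=1.2, b=1.0, q=1.0, r=1.0, k0=1.0)
```

Every improved policy stays admissible (the assert documents this PI property), and the gain converges to the optimal value in only a handful of iterations.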

The actor-critic dual-network structure with a gradient-descent updating law is employed to implement Algorithm 1. First, construct the critic NN to approximate the iterative value function:
$$\hat{V}(x_k) = \big(\hat{\omega}_c^{\,j}\big)^{T} \sigma_c(x_k), \tag{7}$$
where $\hat{\omega}_c^{\,j}$ and $\sigma_c(\cdot)$ denote the NN weights and NN activation functions of the critic network and $j$ is the iteration index for the following gradient-descent method.
Define the error function for the critic NN:
$$e_c^{\,j}(x_k) = \hat{V}(x_k) - V^{(i+1)}(x_k), \tag{8}$$
where $V^{(i+1)}(x_k) = \sum_{l=0}^{\infty} U\big(x_{k+l}, u^{(i)}(x_{k+l})\big)$ is the target value of the policy evaluation step. If we select a large enough integer $N$, then, with the admissible control $u^{(i)}$, one has $\sum_{l=N}^{\infty} U\big(x_{k+l}, u^{(i)}(x_{k+l})\big) \approx 0$ [2]; that is, $V^{(i+1)}(x_k)$ can be expressed as $\sum_{l=0}^{N-1} U\big(x_{k+l}, u^{(i)}(x_{k+l})\big)$.
In order to minimize the error performance $E_c^{\,j} = \frac{1}{2}\big(e_c^{\,j}(x_k)\big)^2$, the gradient-descent-based updating law for the critic NN is given by
$$\hat{\omega}_c^{\,j+1} = \hat{\omega}_c^{\,j} - \alpha_c \frac{\partial E_c^{\,j}}{\partial \hat{\omega}_c^{\,j}} = \hat{\omega}_c^{\,j} - \alpha_c\, e_c^{\,j}(x_k)\, \sigma_c(x_k), \tag{9}$$
where $\alpha_c > 0$ is the learning rate of the critic NN.
Similar to the design of the critic NN, the actor network, which is used to approximate the iterative control policy, is expressed as
$$\hat{u}(x_k) = \big(\hat{\omega}_a^{\,j}\big)^{T} \sigma_a(x_k), \tag{10}$$
where $\hat{\omega}_a^{\,j}$ and $\sigma_a(\cdot)$ are the NN weights and activation functions of the actor network.
The error function for the actor NN is defined as
$$e_a^{\,j}(x_k) = \hat{u}(x_k) - u^{(i+1)}(x_k), \tag{11}$$
where $u^{(i+1)}(x_k)$ can be attained according to Algorithm 1.
To minimize the error performance $E_a^{\,j} = \frac{1}{2}\big(e_a^{\,j}(x_k)\big)^{T} e_a^{\,j}(x_k)$, using the chain rule, the updating law for the actor NN is designed as
$$\hat{\omega}_a^{\,j+1} = \hat{\omega}_a^{\,j} - \alpha_a \frac{\partial E_a^{\,j}}{\partial \hat{\omega}_a^{\,j}} = \hat{\omega}_a^{\,j} - \alpha_a\, \sigma_a(x_k)\big(e_a^{\,j}(x_k)\big)^{T}, \tag{12}$$
where $\alpha_a > 0$ is the learning rate of the actor NN.
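As a minimal illustration of the gradient-descent update law (9), the following sketch trains a one-weight critic $\hat V(x) = w x^2$ (basis $\sigma_c(x) = x^2$) to evaluate a fixed policy $u = -kx$ on the same scalar linear-quadratic example; the learning rate, sample states, and sweep count are all illustrative choices, and the Bellman target is bootstrapped with the current weight rather than the truncated-sum target of the paper:

```python
# Gradient-descent critic update for a linear-in-weight critic
# V_hat(x) = w * x^2 evaluating the fixed policy u = -k*x on
# x_{k+1} = a*x + b*u with utility q*x^2 + r*u^2.

def train_critic(a, b, q, r, k, alpha=0.05, sweeps=2000):
    w = 0.0
    samples = [0.5, 1.0, 1.5, 2.0]             # training states
    for _ in range(sweeps):
        for x in samples:
            u = -k * x
            xn = a*x + b*u                     # model-based one-step prediction
            target = q*x*x + r*u*u + w*xn*xn   # bootstrapped Bellman target
            e = w*x*x - target                 # critic error
            w -= alpha * e * (x*x)             # gradient step on 0.5*e^2
    return w

w = train_critic(a=1.2, b=1.0, q=1.0, r=1.0, k=1.0)
```

For $k = 1$ the closed loop is $c = 0.2$, and $w$ converges to the exact policy value $(q + rk^2)/(1 - c^2) = 2/0.96 \approx 2.083$, matching what an exact policy evaluation would return. The actor update works analogously on the policy error.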
Remark 2. Figure 2 displays the NN implementation diagram of the PI algorithm. First, the NN weights of the actor network should be chosen to generate an admissible control. Second, the critic and actor networks are updated via the gradient-descent-based learning laws to realize the policy evaluation and policy improvement steps, respectively. After the iterative learning, the critic and actor networks achieve convergence, at which point the NN-based approximate optimal control can be obtained. Many stability proofs of the NN implementation procedure have been given in the existing works. Here, we introduce the following rigorous proof to demonstrate the optimality and convergence.
Theorem 1. Let the target iterative value function and control policy be described by $V^{(i+1)}(x_k) = \omega_c^{T}\sigma_c(x_k)$ and $u^{(i+1)}(x_k) = \omega_a^{T}\sigma_a(x_k)$, respectively. Let the critic and actor NNs be updated via (9) and (12), respectively. If the learning rates $\alpha_c$ and $\alpha_a$ are selected to be appropriately small, then the NN weights $\hat{\omega}_c^{\,j}$ and $\hat{\omega}_a^{\,j}$ will asymptotically converge to the ideal values $\omega_c$ and $\omega_a$, respectively.
Proof. Let $\tilde{\omega}_c^{\,j} = \hat{\omega}_c^{\,j} - \omega_c$ and $\tilde{\omega}_a^{\,j} = \hat{\omega}_a^{\,j} - \omega_a$. According to (9) and (12), it can be acquired that
$$\tilde{\omega}_c^{\,j+1} = \tilde{\omega}_c^{\,j} - \alpha_c\, \sigma_c(x_k)\sigma_c^{T}(x_k)\,\tilde{\omega}_c^{\,j}, \qquad \tilde{\omega}_a^{\,j+1} = \tilde{\omega}_a^{\,j} - \alpha_a\, \sigma_a(x_k)\sigma_a^{T}(x_k)\,\tilde{\omega}_a^{\,j}, \tag{13}$$
where $e_c^{\,j}(x_k) = \sigma_c^{T}(x_k)\tilde{\omega}_c^{\,j}$ and $e_a^{\,j}(x_k) = \big(\tilde{\omega}_a^{\,j}\big)^{T}\sigma_a(x_k)$.
Construct the following Lyapunov function candidate:
$$L^{\,j} = \big(\tilde{\omega}_c^{\,j}\big)^{T}\tilde{\omega}_c^{\,j} + \big(\tilde{\omega}_a^{\,j}\big)^{T}\tilde{\omega}_a^{\,j}. \tag{14}$$
The difference of the Lyapunov function (14) can be derived as
$$\Delta L^{\,j} = L^{\,j+1} - L^{\,j} = -\alpha_c\big(2 - \alpha_c\|\sigma_c(x_k)\|^2\big)\big(e_c^{\,j}(x_k)\big)^2 - \alpha_a\big(2 - \alpha_a\|\sigma_a(x_k)\|^2\big)\big\|e_a^{\,j}(x_k)\big\|^2. \tag{15}$$
If the learning rates are selected to satisfy $0 < \alpha_c < 2/\|\sigma_c(x_k)\|^2$ and $0 < \alpha_a < 2/\|\sigma_a(x_k)\|^2$, then one has $\Delta L^{\,j} < 0$ whenever the errors are nonzero, which implies the NN weights $\hat{\omega}_c^{\,j}$ and $\hat{\omega}_a^{\,j}$ will asymptotically converge to the ideal values.
This completes the proof.
4. Model-Based VI Algorithm and Multistep HDP Algorithm
With the help of the initial admissible control, the PI algorithm achieves fast convergence. However, the weakness of the PI algorithm is obvious: it requires the initial control policy to be admissible, which is a strict condition. How to find an initial admissible control policy is still an open problem, which limits the real-world applications of the PI algorithm. To relax this strict condition, the model-based VI algorithm [24–26] is shown in Algorithm 2, where the initial condition becomes much easier.

Remark 3. Different from the PI algorithm, the VI algorithm does not require an initial admissible control; one only needs to provide a specific initial value function, which makes the VI algorithm more practical in real-world applications. However, without the help of an initial admissible control, the VI algorithm generally suffers from a low convergence speed. From the aforementioned content, it can be observed that the PI and VI algorithms have their own advantages and disadvantages. The PI algorithm achieves fast convergence but requires an initial admissible control policy. The VI algorithm can start from an easy-to-realize initial condition but generally suffers from a low convergence speed. Thus, it is desirable to design a new approach that strikes a tradeoff between the PI algorithm and the VI algorithm.
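The speed gap can be quantified on a scalar linear-quadratic example (hypothetical values) by counting the iterations each scheme needs to reach a fixed tolerance: VI iterates the Riccati-type value recursion from $V_0 = 0$, while PI alternates exact evaluation and improvement from an admissible gain.

```python
# Iterations-to-convergence comparison for VI vs PI on the scalar
# system x_{k+1} = a*x + b*u with cost q*x^2 + r*u^2 (illustrative).

a, b, q, r = 1.2, 1.0, 1.0, 1.0

def count_iters(step, z0, tol=1e-9, max_iter=100000):
    z, n = z0, 0
    while n < max_iter:
        zn = step(z)
        n += 1
        if abs(zn - z) < tol:
            return n
        z = zn
    return n

def vi_step(p):                          # value-iteration Riccati recursion
    return q + a*a*p - (a*b*p)**2 / (r + b*b*p)

def pi_step(k):                          # evaluate u = -k*x exactly, then improve
    c = a - b*k
    p = (q + r*k*k) / (1.0 - c*c)
    return a*b*p / (r + b*b*p)

n_vi = count_iters(vi_step, 0.0)         # VI starts from the zero value function
n_pi = count_iters(pi_step, 1.0)         # PI starts from an admissible gain
```

On this instance PI needs markedly fewer iterations than VI, at the price of requiring the admissible initial gain.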
That is, it is desired to develop an algorithm that converges faster than the VI algorithm and does not require an initial admissible control policy. To realize this goal, the multistep HDP method [27] is introduced in Algorithm 3.
Construct the critic and actor NNs to approximate the iterative value function and control policy as follows:
$$\hat{V}^{(i)}(x_k) = \big(\hat{\omega}_c^{(i)}\big)^{T}\sigma_c(x_k), \qquad \hat{u}^{(i)}(x_k) = \big(\hat{\omega}_a^{(i)}\big)^{T}\sigma_a(x_k), \tag{16}$$
where $\hat{\omega}_c^{(i)}$ and $\hat{\omega}_a^{(i)}$ are the NN weights and $\sigma_c(\cdot)$ and $\sigma_a(\cdot)$ are the associated NN activation functions.
According to Algorithm 3, using the NNs to estimate the solutions will yield the following error:
$$e(x_k) = \big(\hat{\omega}_c^{(i+1)}\big)^{T}\sigma_c(x_k) - \bigg[\sum_{l=0}^{N-1} U\big(x_{k+l}, \hat{u}^{(i)}(x_{k+l})\big) + \hat{V}^{(i)}(x_{k+N})\bigg]. \tag{17}$$
Let $\sigma_k = \sigma_c(x_k)$ and $\theta_k = \sum_{l=0}^{N-1} U\big(x_{k+l}, \hat{u}^{(i)}(x_{k+l})\big) + \hat{V}^{(i)}(x_{k+N})$. Equation (17) becomes
$$e(x_k) = \big(\hat{\omega}_c^{(i+1)}\big)^{T}\sigma_k - \theta_k. \tag{18}$$
To minimize $e(x_k)$, we employ the least-square method to update $\hat{\omega}_c^{(i+1)}$. Collect $M$ different data sets $\{(\sigma_{k_1}, \theta_{k_1}), \ldots, (\sigma_{k_M}, \theta_{k_M})\}$ for training, where $M$ is a large enough number. Then, one has $\Sigma = [\sigma_{k_1}, \ldots, \sigma_{k_M}]$ and $\Theta = [\theta_{k_1}, \ldots, \theta_{k_M}]^{T}$. The least-square-based updating law for $\hat{\omega}_c^{(i+1)}$ is given by
$$\hat{\omega}_c^{(i+1)} = \big(\Sigma\Sigma^{T}\big)^{-1}\Sigma\,\Theta. \tag{19}$$
To minimize the actor NN error $e_a^{\,j}(x_k) = \hat{u}(x_k) - u^{(i+1)}(x_k)$, the gradient-descent-based updating law for the actor NN is given by
$$\hat{\omega}_a^{\,j+1} = \hat{\omega}_a^{\,j} - \alpha_a\, \sigma_a(x_k)\big(e_a^{\,j}(x_k)\big)^{T}. \tag{20}$$
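The effect of the multistep evaluation can be seen on the scalar linear-quadratic example (hypothetical values): rolling the Bellman recursion $N$ steps ahead under the current greedy policy before bootstrapping interpolates between a VI-style update ($N = 1$) and a PI-style exact evaluation (large $N$), while still starting from the easy initial value $V_0 = 0$.

```python
# Multistep-HDP-style iteration on the scalar system x_{k+1} = a*x + b*u,
# cost q*x^2 + r*u^2: N-step return of the quadratic utility under the
# greedy policy u = -k*x, plus a bootstrapped tail from the current value.

def multistep_hdp(a, b, q, r, N, iters=50):
    p = 0.0                                  # easy-to-realize initial condition
    for _ in range(iters):
        k = a*b*p / (r + b*b*p)              # greedy policy from current value
        c = a - b*k                          # closed loop under that policy
        g = c**(2*N)                         # weight of the bootstrapped tail
        s = (1.0 - g) / (1.0 - c*c)          # finite geometric sum of c^(2l)
        p = (q + r*k*k) * s + g * p          # N-step evaluation + bootstrap
    return p

p1 = multistep_hdp(1.2, 1.0, 1.0, 1.0, N=1)   # recovers plain VI
p5 = multistep_hdp(1.2, 1.0, 1.0, 1.0, N=5)   # PI-like multistep evaluation
```

Both settings converge to the same optimal value; the larger $N$ reaches it in fewer outer iterations, without ever needing an admissible initial policy.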

Remark 4. Table 1 and Figure 3 show the performance comparison and relationship among the PI algorithm, the VI algorithm, and multistep HDP. Thanks to the initial admissible control, the PI algorithm achieves fast convergence; however, the condition of an initial admissible control is difficult to realize. Different from the PI algorithm, the initial condition of the VI algorithm is easy to realize; however, the intermediate control policies may not be admissible, which can degrade stability during learning. Multistep HDP keeps the easy initial condition of the VI algorithm and develops a multistep policy evaluation step to exploit more history data. Therefore, multistep HDP is easy to realize and achieves fast convergence at the same time; that is, multistep HDP successfully combines the advantages of the PI and VI algorithms.
5. Application to a Benchmark Power System
The benchmark power system investigated in this paper is illustrated in Figure 4. This power system can be regarded as a microgrid, which is composed of non-polluting energy sources (subsystems I and II), load demand sides (subsystem III), and conventional generation (subsystem IV). The core control unit is the management center, which maintains frequency stability against load variations.
5.1. System Model and Application
In Figure 5, the real-world power system is first formulated as a state-space model via mathematical modeling. After sampling and discretization, the system model can be controlled by computers. Through iterative ADP learning, the approximate optimal control can be obtained, and substituting it into the system model yields the simulation results. To test the effectiveness of the proposed DT optimal control strategy, let us consider the following power system [19, 20]:
$$\dot{x}(t) = \begin{bmatrix} -1/T_p & K_p/T_p & 0 \\ 0 & -1/T_t & 1/T_t \\ -1/(R_g T_g) & 0 & -1/T_g \end{bmatrix} x(t) + \begin{bmatrix} 0 \\ 0 \\ 1/T_g \end{bmatrix} u(t), \tag{21}$$
where $\Delta f$ is the frequency deviation; $\Delta P_g$ denotes the turbine power; $\Delta X_g$ represents the governor position value; $T_t$, $T_g$, and $T_p$ denote the time constants of the turbine, governor, and power system, respectively; $K_p$ represents the gain of the power system; $R_g$ is the speed regulation coefficient; $u$ denotes the control input; and $x = [\Delta f, \Delta P_g, \Delta X_g]^{T}$ is the state variable. Let $x_k = x(k\Delta t)$ and $u_k = u(k\Delta t)$, where $\Delta t$ is the sampling period. Then, system (21) can be discretized into the form of (1). Set the weight matrices $Q$ and $R$ in the performance index function accordingly.
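A minimal sketch of the sampling-and-discretization step for a single-area load-frequency-control model of this kind, using forward-Euler discretization; the time constants, gain, and regulation coefficient below are illustrative placeholders, not the values used in the paper's simulation:

```python
# Forward-Euler discretization of a single-area load-frequency-control
# model with states x = [delta_f, delta_Pg, delta_Xg]. Parameter values
# (Tp, Tt, Tg, Kp, Rg) are illustrative placeholders.

def lfc_matrices(Tp, Tt, Tg, Kp, Rg):
    A = [[-1.0/Tp,       Kp/Tp,   0.0],      # frequency dynamics
         [ 0.0,         -1.0/Tt,  1.0/Tt],   # turbine dynamics
         [-1.0/(Rg*Tg),  0.0,    -1.0/Tg]]   # governor dynamics
    B = [0.0, 0.0, 1.0/Tg]
    return A, B

def euler_step(A, B, x, u, dt):
    # x_{k+1} = x_k + dt * (A x_k + B u_k)
    return [x[i] + dt * (sum(A[i][j] * x[j] for j in range(3)) + B[i] * u)
            for i in range(3)]

A, B = lfc_matrices(Tp=20.0, Tt=0.3, Tg=0.08, Kp=120.0, Rg=2.4)
x1 = euler_step(A, B, x=[0.1, 0.0, 0.0], u=0.0, dt=0.01)
```

With the sampled model in hand, the iterative ADP schemes of Sections 3 and 4 can be applied directly to the resulting DT system of the form (1).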
5.2. Simulation Results
Simulation results are shown in Figure 6. Figure 6(a) shows that the system states cannot be stabilized without control. Then, we apply the optimal control strategy to the system. Figure 6(b) indicates that the system states are stabilized after 8 time steps under the optimal control. Comparing the trajectories of the system states, the superior performance of the optimal control strategy can be observed. Figure 6(c) shows the 2D plot of the convergence trajectory in detail. Figure 6(d) provides the evolution of the control input. The aforementioned simulation results demonstrate the high stability, fast convergence, and low control cost of the DT optimal control strategy.
6. Conclusions
In this paper, several state-of-the-art ADP-based methods have been reviewed for addressing the optimal control problem of DT systems. A comprehensive comparison has been made between PI and VI. A multistep HDP method has been introduced to integrate the advantages of the PI and VI algorithms without either the strict requirement of an initial admissible control or the long iterative learning process. The simulation results have demonstrated the effectiveness of the reviewed schemes.
Data Availability
Data are available upon request to the corresponding author.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was supported by the Science and Technology Foundation of SGCC (Grant no. SGLNDK00DWJS1900036).