Abstract
Adaptive dynamic programming (ADP), which belongs to the field of computational intelligence, is a powerful tool to address optimal control problems. To overcome the bottleneck of solving Hamilton–Jacobi–Bellman equations, several stateoftheart ADP approaches are reviewed in this paper. First, two modelbased offline iterative ADP methods including policy iteration (PI) and value iteration (VI) are given, and their respective advantages and shortcomings are discussed in detail. Second, the multistep heuristic dynamic programming (HDP) method is introduced, which avoids the requirement of initial admissible control and achieves fast convergence. This method successfully utilizes the advantages of PI and VI and overcomes their drawbacks at the same time. Finally, the discretetime optimal control strategy is tested on a power system.
1. Introduction
Adaptive dynamic programming (ADP) [1–4], which integrates the advantages of reinforcement learning (RL) [5–8] and adaptive control, has become a powerful tool for solving optimal control problems. With decades of development, ADP has also provided many approaches to other control problems, such as robust control [9, 10], optimal control with input constraints [11, 12], optimal tracking control [13, 14], zero-sum games [15], and non-zero-sum games [16]. Furthermore, ADP methods have been widely applied to real-world systems, such as the water–gas shift reaction [17], battery management [18], microgrid systems [19, 20], and the Quanser helicopter [21]. The aforementioned works were all inspired by and developed from the basic works on ADP-based optimal control; i.e., optimal control is the core research topic of ADP.
The bottleneck in solving nonlinear optimal control problems is obtaining the solutions of Hamilton–Jacobi–Bellman (HJB) equations. However, these equations are generally difficult or even impossible to solve analytically. To overcome this difficulty, ADP has provided several important iterative learning frameworks, such as policy iteration (PI) [2, 22, 23] and value iteration (VI) [24–26]. The PI algorithm starts from an initial admissible control policy and then performs the policy evaluation step and the policy improvement step successively until convergence. The main advantage of PI is that it ensures all the iterative control policies are admissible and achieves fast convergence. The drawback of PI is also obvious: the requirement of an initial admissible control is a strict condition in practice, which seriously limits its applications. Different from PI, VI can start from an arbitrary positive semidefinite value function, which is an easy-to-realize initial condition. Although the easier initial condition makes VI more practical, it also leads to a longer iterative learning process; that is, VI converges much more slowly than PI. Thus, it is desirable to develop a new method that avoids the requirement of an initial admissible control and converges faster than the VI algorithm. To realize these purposes, the multistep heuristic dynamic programming (HDP) approach [27] is presented, which integrates the merits of the PI and VI algorithms while overcoming their drawbacks.
This paper reviews state-of-the-art ADP algorithms for the optimal control of discrete-time (DT) systems. The rest of this paper is arranged as follows. In Section 2, the problem formulation is presented. Three iterative model-based offline learning algorithms, along with comprehensive comparisons, are presented in Sections 3 and 4. The proposed DT optimal control strategy is tested on a power system in Section 5. Finally, a brief conclusion is drawn in Section 6.
2. Problem Formulation
In this paper, we consider the general nonlinear DT system
$$x_{k+1} = f(x_k) + g(x_k)u_k, \quad k = 0, 1, 2, \ldots, \tag{1}$$
where $x_k \in \mathbb{R}^n$ represents the system state, $u_k \in \mathbb{R}^m$ denotes the control input, and $f(\cdot)$ and $g(\cdot)$ are the system functions.
The purpose of the optimal control problem is to find a state feedback control policy $u(x_k)$ that can not only stabilize system (1) but also minimize the following performance index function:
$$J(x_0) = \sum_{k=0}^{\infty} U(x_k, u_k), \tag{2}$$
where $U(x_k, u_k) = x_k^{T} Q x_k + u_k^{T} R u_k$ is the utility function. The positive definite matrices $Q$ and $R$ weight the performance of the system states and control inputs, respectively. Given the admissible control policy $u(x_k)$, the value function can be described by
$$V(x_k) = \sum_{j=k}^{\infty} U\big(x_j, u(x_j)\big). \tag{3}$$
According to the definition of optimal control, the optimal value function can be defined by
$$J^{*}(x_k) = \min_{\underline{u}_k} \sum_{j=k}^{\infty} U(x_j, u_j), \tag{4}$$
where $\underline{u}_k = (u_k, u_{k+1}, \ldots)$ denotes the control sequence from time step $k$ onward.
By using the stationarity condition [28], the optimal control policy can be derived as
$$u^{*}(x_k) = -\frac{1}{2} R^{-1} g^{T}(x_k)\, \frac{\partial J^{*}(x_{k+1})}{\partial x_{k+1}}, \tag{5}$$
where $x_{k+1} = f(x_k) + g(x_k) u^{*}(x_k)$.
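As a brief reasoning step, assuming the standard quadratic utility $U(x_k, u_k) = x_k^{T} Q x_k + u_k^{T} R u_k$ and control-affine dynamics (so that $\partial x_{k+1} / \partial u_k = g(x_k)$), the stationarity condition sets the gradient of the minimized quantity with respect to $u_k$ to zero:

```latex
\frac{\partial}{\partial u_k}\left[ x_k^{T} Q x_k + u_k^{T} R u_k + J^{*}(x_{k+1}) \right]
  = 2 R u_k + g^{T}(x_k)\, \frac{\partial J^{*}(x_{k+1})}{\partial x_{k+1}} = 0 ,
```

and solving this linear equation for $u_k$ yields the optimal control policy.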
The key to obtaining the optimal control policy is to solve the following DT HJB equation [27]:
$$J^{*}(x_k) = \min_{u_k}\big\{ x_k^{T} Q x_k + u_k^{T} R u_k + J^{*}(x_{k+1}) \big\}. \tag{6}$$
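For intuition, consider the special case of a scalar linear system $x_{k+1} = a x_k + b u_k$ with cost $\sum_k (q x_k^2 + r u_k^2)$: the HJB equation admits a quadratic solution $J^*(x) = p x^2$ and reduces to a scalar discrete algebraic Riccati equation, which can be solved by fixed-point iteration. The sketch below uses hypothetical values of $a, b, q, r$ (not taken from the paper):

```python
# Scalar DT HJB: for x_{k+1} = a*x + b*u and cost q*x^2 + r*u^2,
# J*(x) = p*x^2 turns the HJB into the scalar Riccati equation
#   p = q + a^2*p - (a*b*p)^2 / (r + b^2*p),
# solved here by fixed-point iteration from p = 0.

def solve_hjb_scalar(a, b, q, r, iters=200):
    p = 0.0
    for _ in range(iters):
        # minimizing over u in the HJB gives this closed-form recursion
        p = q + a*a*p - (a*b*p)**2 / (r + b*b*p)
    k = a*b*p / (r + b*b*p)          # optimal feedback gain, u* = -k*x
    return p, k

p, k = solve_hjb_scalar(a=1.2, b=1.0, q=1.0, r=1.0)
```

The fixed point satisfies the HJB exactly, and the resulting closed loop $a - bk \approx 0.41$ is stable even though the open loop ($a = 1.2$) is not.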
Remark 1. Figure 1 shows the relationship and differences between discrete-time and continuous-time optimal control. Real-world systems generally evolve in continuous time. After mathematical modeling, they are formulated as continuous-time system models. Through sampling and discretization, the continuous-time system models are converted into discrete-time ones. Accordingly, the performance indexes and HJB equations of discrete-time systems are the discretized counterparts of those for continuous-time systems. The key to solving the discrete-time optimal control problem is the discrete-time HJB equation, which is a nonlinear partial difference equation. There are many more existing works on continuous-time systems than on discrete-time systems. In order to overcome this bottleneck, several ADP learning algorithms along with their neural network (NN) implementations will be introduced.
3. Model-Based PI Algorithm for the Optimal Control Problem of DT Systems
In this section, the model-based PI algorithm along with its NN implementation will be introduced in detail. The model-based PI algorithm [2, 23] is shown in Algorithm 1.
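To make the two steps of the PI iteration concrete, here is a minimal sketch on a scalar linear-quadratic instance (hypothetical values, not from the paper): for a policy $u = -kx$ on $x_{k+1} = ax_k + bu_k$, the policy evaluation step has the closed form $p = (q + rk^2)/(1 - c^2)$ with closed loop $c = a - bk$, and the improvement step is the usual Riccati-type gain update.

```python
# Model-based policy iteration for x_{k+1} = a*x + b*u, cost q*x^2 + r*u^2.
# PI requires an initial ADMISSIBLE gain k0, i.e. |a - b*k0| < 1.

def policy_iteration(a, b, q, r, k0, iters=10):
    k = k0
    for _ in range(iters):
        c = a - b*k                      # closed-loop dynamics under u = -k*x
        assert abs(c) < 1.0, "policy must remain admissible"
        p = (q + r*k*k) / (1.0 - c*c)    # policy evaluation (exact, V = p*x^2)
        k = a*b*p / (r + b*b*p)          # policy improvement
    return p, k

p, k = policy_iteration(a=1.2, b=1.0, q=1.0, r=1.0, k0=1.0)
```

Every improved policy stays admissible (the assert documents this PI property), and the gain converges to the optimal value in only a handful of iterations.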

The actor-critic dual-network structure with a gradient-descent updating law is employed to implement Algorithm 1. First, construct the critic NN to approximate the iterative value function:
$$\hat{V}(x_k) = \big(\hat{\omega}_c^{\,j}\big)^{T} \sigma_c(x_k), \tag{7}$$
where $\hat{\omega}_c^{\,j}$ and $\sigma_c(\cdot)$ denote the NN weights and NN activation functions of the critic network and $j$ is the iteration index for the following gradient-descent method.
Define the error function for the critic NN:
$$e_c^{\,j}(x_k) = \hat{V}(x_k) - V^{(i+1)}(x_k), \tag{8}$$
where $V^{(i+1)}(x_k) = \sum_{l=0}^{\infty} U\big(x_{k+l}, u^{(i)}(x_{k+l})\big)$ is the target value of the policy evaluation step. If we select a large enough integer $N$, then, with the admissible control $u^{(i)}$, one has $\sum_{l=N}^{\infty} U\big(x_{k+l}, u^{(i)}(x_{k+l})\big) \approx 0$ [2]; that is, $V^{(i+1)}(x_k)$ can be expressed as $\sum_{l=0}^{N-1} U\big(x_{k+l}, u^{(i)}(x_{k+l})\big)$.
In order to minimize the error performance $E_c^{\,j} = \frac{1}{2}\big(e_c^{\,j}(x_k)\big)^2$, the gradient-descent-based updating law for the critic NN is given by
$$\hat{\omega}_c^{\,j+1} = \hat{\omega}_c^{\,j} - \alpha_c \frac{\partial E_c^{\,j}}{\partial \hat{\omega}_c^{\,j}} = \hat{\omega}_c^{\,j} - \alpha_c\, e_c^{\,j}(x_k)\, \sigma_c(x_k), \tag{9}$$
where $\alpha_c > 0$ is the learning rate of the critic NN.
Similar to the design of the critic NN, the actor network, which is used to approximate the iterative control policy, is expressed as
$$\hat{u}(x_k) = \big(\hat{\omega}_a^{\,j}\big)^{T} \sigma_a(x_k), \tag{10}$$
where $\hat{\omega}_a^{\,j}$ and $\sigma_a(\cdot)$ are the NN weights and activation functions of the actor network.
The error function for the actor NN is defined as
$$e_a^{\,j}(x_k) = \hat{u}(x_k) - u^{(i+1)}(x_k), \tag{11}$$
where $u^{(i+1)}(x_k)$ can be attained according to Algorithm 1.
To minimize the error performance $E_a^{\,j} = \frac{1}{2}\big(e_a^{\,j}(x_k)\big)^{T} e_a^{\,j}(x_k)$, using the chain rule, the updating law for the actor NN is designed as
$$\hat{\omega}_a^{\,j+1} = \hat{\omega}_a^{\,j} - \alpha_a \frac{\partial E_a^{\,j}}{\partial \hat{\omega}_a^{\,j}} = \hat{\omega}_a^{\,j} - \alpha_a\, \sigma_a(x_k)\big(e_a^{\,j}(x_k)\big)^{T}, \tag{12}$$
where $\alpha_a > 0$ is the learning rate of the actor NN.
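As a minimal illustration of the gradient-descent update law (9), the following sketch trains a one-weight critic $\hat V(x) = w x^2$ (basis $\sigma_c(x) = x^2$) to evaluate a fixed policy $u = -kx$ on the same scalar linear-quadratic example; the learning rate, sample states, and sweep count are all illustrative choices, and the Bellman target is bootstrapped with the current weight rather than the truncated-sum target of the paper:

```python
# Gradient-descent critic update for a linear-in-weight critic
# V_hat(x) = w * x^2 evaluating the fixed policy u = -k*x on
# x_{k+1} = a*x + b*u with utility q*x^2 + r*u^2.

def train_critic(a, b, q, r, k, alpha=0.05, sweeps=2000):
    w = 0.0
    samples = [0.5, 1.0, 1.5, 2.0]             # training states
    for _ in range(sweeps):
        for x in samples:
            u = -k * x
            xn = a*x + b*u                     # model-based one-step prediction
            target = q*x*x + r*u*u + w*xn*xn   # bootstrapped Bellman target
            e = w*x*x - target                 # critic error
            w -= alpha * e * (x*x)             # gradient step on 0.5*e^2
    return w

w = train_critic(a=1.2, b=1.0, q=1.0, r=1.0, k=1.0)
```

For $k = 1$ the closed loop is $c = 0.2$, and $w$ converges to the exact policy value $(q + rk^2)/(1 - c^2) = 2/0.96 \approx 2.083$, matching what an exact policy evaluation would return. The actor update works analogously on the policy error.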
Remark 2. Figure 2 displays the NN implementation diagram of the PI algorithm. First, the NN weights of the actor network should be chosen to generate an admissible control. Second, the critic and actor networks are updated via the gradient-descent-based learning laws to realize the policy evaluation and policy improvement steps, respectively. After the iterative learning, the critic and actor networks achieve convergence, at which point the NN-based approximate optimal control can be obtained. Many stability proofs of the NN implementation procedure have been given in the existing works. Here, we introduce the following rigorous proof to demonstrate the optimality and convergence.
Theorem 1. Let the target iterative value function and control policy be described by $V^{(i+1)}(x_k) = \omega_c^{T}\sigma_c(x_k)$ and $u^{(i+1)}(x_k) = \omega_a^{T}\sigma_a(x_k)$, respectively. Let the critic and actor NNs be updated via (9) and (12), respectively. If the learning rates $\alpha_c$ and $\alpha_a$ are selected to be appropriately small, then the NN weights $\hat{\omega}_c^{\,j}$ and $\hat{\omega}_a^{\,j}$ will asymptotically converge to the ideal values $\omega_c$ and $\omega_a$, respectively.
Proof. Let $\tilde{\omega}_c^{\,j} = \hat{\omega}_c^{\,j} - \omega_c$ and $\tilde{\omega}_a^{\,j} = \hat{\omega}_a^{\,j} - \omega_a$. According to (9) and (12), it can be acquired that
$$\tilde{\omega}_c^{\,j+1} = \tilde{\omega}_c^{\,j} - \alpha_c\, \sigma_c(x_k)\sigma_c^{T}(x_k)\,\tilde{\omega}_c^{\,j}, \qquad \tilde{\omega}_a^{\,j+1} = \tilde{\omega}_a^{\,j} - \alpha_a\, \sigma_a(x_k)\sigma_a^{T}(x_k)\,\tilde{\omega}_a^{\,j}, \tag{13}$$
where $e_c^{\,j}(x_k) = \sigma_c^{T}(x_k)\tilde{\omega}_c^{\,j}$ and $e_a^{\,j}(x_k) = \big(\tilde{\omega}_a^{\,j}\big)^{T}\sigma_a(x_k)$.
Construct the following Lyapunov function candidate:
$$L^{\,j} = \big(\tilde{\omega}_c^{\,j}\big)^{T}\tilde{\omega}_c^{\,j} + \big(\tilde{\omega}_a^{\,j}\big)^{T}\tilde{\omega}_a^{\,j}. \tag{14}$$
The difference of the Lyapunov function (14) can be derived as
$$\Delta L^{\,j} = L^{\,j+1} - L^{\,j} = -\alpha_c\big(2 - \alpha_c\|\sigma_c(x_k)\|^2\big)\big(e_c^{\,j}(x_k)\big)^2 - \alpha_a\big(2 - \alpha_a\|\sigma_a(x_k)\|^2\big)\big\|e_a^{\,j}(x_k)\big\|^2. \tag{15}$$
If the learning rates are selected to satisfy $0 < \alpha_c < 2/\|\sigma_c(x_k)\|^2$ and $0 < \alpha_a < 2/\|\sigma_a(x_k)\|^2$, then one has $\Delta L^{\,j} < 0$ whenever the errors are nonzero, which implies the NN weights $\hat{\omega}_c^{\,j}$ and $\hat{\omega}_a^{\,j}$ will asymptotically converge to the ideal values.
This completes the proof.
4. Model-Based VI Algorithm and Multistep HDP Algorithm
With the help of the initial admissible control, the PI algorithm achieves fast convergence. However, the weakness of the PI algorithm is obvious: it requires the initial control policy to be admissible, which is a strict condition. How to find an initial admissible control policy is still an open problem, which limits the real-world applications of the PI algorithm. To relax this strict condition, the model-based VI algorithm [24–26] is shown in Algorithm 2, where the initial condition becomes much easier.

Remark 3. Different from the PI algorithm, the VI algorithm does not require an initial admissible control; one only needs to provide a specific initial value function, which makes the VI algorithm more practical in real-world applications. However, without the help of an initial admissible control, the VI algorithm generally suffers from a low convergence speed. From the aforementioned content, it can be observed that the PI and VI algorithms have their own advantages and disadvantages. The PI algorithm achieves fast convergence but requires an initial admissible control policy. The VI algorithm can start from an easy-to-realize initial condition but generally suffers from a low convergence speed. Thus, it is desirable to design a new approach that strikes a tradeoff between the PI algorithm and the VI algorithm.
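The speed gap can be quantified on a scalar linear-quadratic example (hypothetical values) by counting the iterations each scheme needs to reach a fixed tolerance: VI iterates the Riccati-type value recursion from $V_0 = 0$, while PI alternates exact evaluation and improvement from an admissible gain.

```python
# Iterations-to-convergence comparison for VI vs PI on the scalar
# system x_{k+1} = a*x + b*u with cost q*x^2 + r*u^2 (illustrative).

a, b, q, r = 1.2, 1.0, 1.0, 1.0

def count_iters(step, z0, tol=1e-9, max_iter=100000):
    z, n = z0, 0
    while n < max_iter:
        zn = step(z)
        n += 1
        if abs(zn - z) < tol:
            return n
        z = zn
    return n

def vi_step(p):                          # value-iteration Riccati recursion
    return q + a*a*p - (a*b*p)**2 / (r + b*b*p)

def pi_step(k):                          # evaluate u = -k*x exactly, then improve
    c = a - b*k
    p = (q + r*k*k) / (1.0 - c*c)
    return a*b*p / (r + b*b*p)

n_vi = count_iters(vi_step, 0.0)         # VI starts from the zero value function
n_pi = count_iters(pi_step, 1.0)         # PI starts from an admissible gain
```

On this instance PI needs markedly fewer iterations than VI, at the price of requiring the admissible initial gain.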
That is, it is desired to develop an algorithm that converges faster than the VI algorithm and does not require an initial admissible control policy. To realize this goal, the multistep HDP method [27] is introduced in Algorithm 3.
Construct the critic and actor NNs to approximate the iterative value function and control policy as follows:
$$\hat{V}^{(i)}(x_k) = \big(\hat{\omega}_c^{(i)}\big)^{T}\sigma_c(x_k), \qquad \hat{u}^{(i)}(x_k) = \big(\hat{\omega}_a^{(i)}\big)^{T}\sigma_a(x_k), \tag{16}$$
where $\hat{\omega}_c^{(i)}$ and $\hat{\omega}_a^{(i)}$ are the NN weights and $\sigma_c(\cdot)$ and $\sigma_a(\cdot)$ are the associated NN activation functions.
According to Algorithm 3, using the NNs to estimate the solutions will yield the following error:
$$e(x_k) = \big(\hat{\omega}_c^{(i+1)}\big)^{T}\sigma_c(x_k) - \bigg[\sum_{l=0}^{N-1} U\big(x_{k+l}, \hat{u}^{(i)}(x_{k+l})\big) + \hat{V}^{(i)}(x_{k+N})\bigg]. \tag{17}$$
Let $\sigma_k = \sigma_c(x_k)$ and $\theta_k = \sum_{l=0}^{N-1} U\big(x_{k+l}, \hat{u}^{(i)}(x_{k+l})\big) + \hat{V}^{(i)}(x_{k+N})$. Equation (17) becomes
$$e(x_k) = \big(\hat{\omega}_c^{(i+1)}\big)^{T}\sigma_k - \theta_k. \tag{18}$$
To minimize $e(x_k)$, we employ the least-square method to update $\hat{\omega}_c^{(i+1)}$. Collect $M$ different data sets $\{(\sigma_{k_1}, \theta_{k_1}), \ldots, (\sigma_{k_M}, \theta_{k_M})\}$ for training, where $M$ is a large enough number. Then, one has $\Sigma = [\sigma_{k_1}, \ldots, \sigma_{k_M}]$ and $\Theta = [\theta_{k_1}, \ldots, \theta_{k_M}]^{T}$. The least-square-based updating law for $\hat{\omega}_c^{(i+1)}$ is given by
$$\hat{\omega}_c^{(i+1)} = \big(\Sigma\Sigma^{T}\big)^{-1}\Sigma\,\Theta. \tag{19}$$
To minimize the actor NN error $e_a^{\,j}(x_k) = \hat{u}(x_k) - u^{(i+1)}(x_k)$, the gradient-descent-based updating law for the actor NN is given by
$$\hat{\omega}_a^{\,j+1} = \hat{\omega}_a^{\,j} - \alpha_a\, \sigma_a(x_k)\big(e_a^{\,j}(x_k)\big)^{T}. \tag{20}$$
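The effect of the multistep evaluation can be seen on the scalar linear-quadratic example (hypothetical values): rolling the Bellman recursion $N$ steps ahead under the current greedy policy before bootstrapping interpolates between a VI-style update ($N = 1$) and a PI-style exact evaluation (large $N$), while still starting from the easy initial value $V_0 = 0$.

```python
# Multistep-HDP-style iteration on the scalar system x_{k+1} = a*x + b*u,
# cost q*x^2 + r*u^2: N-step return of the quadratic utility under the
# greedy policy u = -k*x, plus a bootstrapped tail from the current value.

def multistep_hdp(a, b, q, r, N, iters=50):
    p = 0.0                                  # easy-to-realize initial condition
    for _ in range(iters):
        k = a*b*p / (r + b*b*p)              # greedy policy from current value
        c = a - b*k                          # closed loop under that policy
        g = c**(2*N)                         # weight of the bootstrapped tail
        s = (1.0 - g) / (1.0 - c*c)          # finite geometric sum of c^(2l)
        p = (q + r*k*k) * s + g * p          # N-step evaluation + bootstrap
    return p

p1 = multistep_hdp(1.2, 1.0, 1.0, 1.0, N=1)   # recovers plain VI
p5 = multistep_hdp(1.2, 1.0, 1.0, 1.0, N=5)   # PI-like multistep evaluation
```

Both settings converge to the same optimal value; the larger $N$ reaches it in fewer outer iterations, without ever needing an admissible initial policy.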

Remark 4. Table 1 and Figure 3 show the performance comparison and relationship among the PI algorithm, the VI algorithm, and multistep HDP. Thanks to the initial admissible control, the PI algorithm achieves fast convergence; however, the condition of an initial admissible control is difficult to realize. Different from the PI algorithm, the initial condition of the VI algorithm is easy to realize; however, the intermediate control policies may not be admissible, which can degrade stability during learning. Multistep HDP keeps the easy initial condition of the VI algorithm and develops a multistep policy evaluation step to exploit more history data. Therefore, multistep HDP is easy to realize and achieves fast convergence at the same time; that is, multistep HDP successfully combines the advantages of the PI and VI algorithms.
5. Application to a Benchmark Power System
The benchmark power system investigated in this paper is illustrated in Figure 4. This power system can be regarded as a microgrid, which is composed of non-polluting energy sources (subsystems I and II), load demand sides (subsystem III), and conventional generation (subsystem IV). The core control unit is the management center, which maintains frequency stability against load variations.
5.1. System Model and Application
In Figure 5, the real-world power system is first formulated as a state-space model via mathematical modeling. After sampling and discretization, the system model can be controlled by computers. Through iterative ADP learning, the approximate optimal control can be obtained, and substituting it into the system model yields the simulation results. To test the effectiveness of the proposed DT optimal control strategy, let us consider the following power system [19, 20]:
$$\dot{x}(t) = \begin{bmatrix} -1/T_p & K_p/T_p & 0 \\ 0 & -1/T_t & 1/T_t \\ -1/(R_g T_g) & 0 & -1/T_g \end{bmatrix} x(t) + \begin{bmatrix} 0 \\ 0 \\ 1/T_g \end{bmatrix} u(t), \tag{21}$$
where $\Delta f$ is the frequency deviation; $\Delta P_g$ denotes the turbine power; $\Delta X_g$ represents the governor position value; $T_t$, $T_g$, and $T_p$ denote the time constants of the turbine, governor, and power system, respectively; $K_p$ represents the gain of the power system; $R_g$ is the speed regulation coefficient; $u$ denotes the control input; and $x = [\Delta f, \Delta P_g, \Delta X_g]^{T}$ is the state variable. Let $x_k = x(k\Delta t)$ and $u_k = u(k\Delta t)$, where $\Delta t$ is the sampling period. Then, system (21) can be discretized into the form of (1). Set the weight matrices $Q$ and $R$ in the performance index function accordingly.
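A minimal sketch of the sampling-and-discretization step for a single-area load-frequency-control model of this kind, using forward-Euler discretization; the time constants, gain, and regulation coefficient below are illustrative placeholders, not the values used in the paper's simulation:

```python
# Forward-Euler discretization of a single-area load-frequency-control
# model with states x = [delta_f, delta_Pg, delta_Xg]. Parameter values
# (Tp, Tt, Tg, Kp, Rg) are illustrative placeholders.

def lfc_matrices(Tp, Tt, Tg, Kp, Rg):
    A = [[-1.0/Tp,       Kp/Tp,   0.0],      # frequency dynamics
         [ 0.0,         -1.0/Tt,  1.0/Tt],   # turbine dynamics
         [-1.0/(Rg*Tg),  0.0,    -1.0/Tg]]   # governor dynamics
    B = [0.0, 0.0, 1.0/Tg]
    return A, B

def euler_step(A, B, x, u, dt):
    # x_{k+1} = x_k + dt * (A x_k + B u_k)
    return [x[i] + dt * (sum(A[i][j] * x[j] for j in range(3)) + B[i] * u)
            for i in range(3)]

A, B = lfc_matrices(Tp=20.0, Tt=0.3, Tg=0.08, Kp=120.0, Rg=2.4)
x1 = euler_step(A, B, x=[0.1, 0.0, 0.0], u=0.0, dt=0.01)
```

With the sampled model in hand, the iterative ADP schemes of Sections 3 and 4 can be applied directly to the resulting DT system of the form (1).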
5.2. Simulation Results
Simulation results are shown in Figure 6. Figure 6(a) shows that the system states cannot be stabilized without control. Then, we apply the optimal control strategy to the system. Figure 6(b) indicates that the system states are stabilized after 8 time steps under the optimal control. Comparing the trajectories of the system states, the superior performance of the optimal control strategy can be observed. Figure 6(c) shows the 2D plot of the convergence trajectory in detail. Figure 6(d) provides the evolution of the control input. The aforementioned simulation results demonstrate the high stability, fast convergence, and low control cost of the DT optimal control strategy.
6. Conclusions
In this paper, several state-of-the-art ADP-based methods have been reviewed for addressing the optimal control problem of DT systems. A comprehensive comparison has been made between PI and VI. A multistep HDP method has been introduced to integrate the advantages of the PI and VI algorithms without either the strict requirement of an initial admissible control or the long iterative learning process. The simulation results have demonstrated the effectiveness of the reviewed schemes.
Data Availability
Data are available upon request to the corresponding author.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was supported by the Science and Technology Foundation of SGCC (Grant no. SGLNDK00DWJS1900036).