#### Abstract

Nuclear power control has recently attracted increasing attention, driven by the appeal of clean energy and the demand for power regulation when integrating into the power grid. However, a nuclear power system is a complicated discrete-time (DT) nonlinear system whose parameters are coupled with its intrinsic states. Furthermore, the heavy computational burden caused by the high-order nature of the nuclear reactor model creates many difficulties for power control in the nuclear industry. In this study, a new optimal tracking control scheme for DT nonlinear nuclear power systems is provided to accomplish the power control of a 2500-MW pressurized water reactor (PWR) nuclear power plant. The proposed approach is based on the value iteration method, a well-studied algorithm in the computational intelligence community, and adopts a basic actor-critic structure with neural networks (NNs). The approach is modified so that the cost function is approximated with high-order polynomials, which substitute for neural networks throughout the actor-critic architecture. Simulation results for the 2500-MW PWR nuclear power plant demonstrate the effectiveness of the developed method.

#### 1. Introduction

Considering issues of environmental deterioration, e.g., air pollution due to excessive fossil fuel consumption, it is important to develop clean energy technology to ease this situation. Nuclear energy is among the most rapidly developing clean power sources supplying the power grid. However, currently adopted control strategies must contend with the intrinsic nonlinearity of nuclear reactor systems and with parameters that vary with the power level. Over decades of development in nuclear power technology and control policy, many outstanding researchers have made excellent progress in this field.

Since the last century, the mature PID control strategy has deeply influenced power control in the nuclear industry [1, 2]. With advances in control technology, model predictive control (MPC) and multimodel adaptive control theories, which rely on local linearization to approximate the nonlinearity of nuclear power systems, have also been applied in this area [3–5]. The control algorithm most extensively applied in nuclear power control is fuzzy control, alone or combined with other control theories to address different demands. Wu leveraged parallel distribution compensation (PDC)-based T-S fuzzy control to handle the nonlinearity of a nuclear power system [6], and Eliasi designed a controller for the UTSG water level in nuclear power plants using a fuzzy control policy and the MPC algorithm [7]. Many researchers have used different methods to tackle various problems. For example, Gang employed a radial basis function neural network (RBFN) to guarantee correctness in identifying the nuclear steam generator process dynamics [8]. Wang applied the adaptive control method and a guaranteed cost control method to nuclear power control problems [9].

The aforementioned approaches mostly rely on a linearization procedure, which largely ignores the numerical error with respect to the nonlinear model; the intrinsic nonlinearity of the high-order nuclear model is thereby neglected. To better satisfy the demands of tracking control problems, a new control strategy is needed in this area. With developments in the intelligent control community, researchers are pursuing reinforcement learning (RL) algorithms to solve nonlinear problems in practice. Adaptive dynamic programming (ADP), proposed by Werbos, plays an important role in reinforcement learning-based control [10–13] and is well known as a self-learning optimal control policy. The well-studied iteration methods are the policy iteration (PI) algorithm and the value iteration (VI) algorithm. Among them, the value iteration algorithm is one of the most important iterative ADP algorithms and has been examined in many studies [14–16]. To find the optimal control policy of discrete-time affine nonlinear systems, Al-Tamimi and Lewis used heuristic dynamic programming (HDP) to fulfill the design purpose [17]. Wei [18] proposed a new value iteration method that mainly focuses on optimal control for DT nonlinear systems; that study also provided detailed proofs for the iterative control policy and showed that the value function is monotonically nondecreasing, which implies convergence to the optimum.

To satisfy the demands of industrial systems, optimal tracking control ADP methods have been deeply investigated. There are also ADP techniques [19–21] to obtain solutions of optimal tracking problems with various system dynamics, such as partially unknown system models or completely unknown system models. Related optimal tracking control techniques have been applied in many industrial plants in recent years [22–26].

In this study, a value-iteration optimal tracking control method is developed for DT nonlinear systems. The main contributions of this study are summarized as follows:
(1) Compared with traditional control methods for DT nonlinear models of 2500-MW pressurized water reactor (PWR) systems [1, 2, 4, 5], a self-learning optimal tracking controller is designed to handle the complex nonlinear behaviors of the 2500-MW PWR nuclear system.
(2) The developed value-iteration method guarantees that the control law converges to a near-optimal control solution, and the admissibility of the iterative control laws is analyzed.

In this study, our major work is to design an optimal tracking controller for a 2500-MW PWR nuclear power plant by combining the value iteration and actor critic algorithms. The 2500-MW PWR nuclear power plant and its discrete definition are introduced in Section 2. In Section 3, the details of the proposed algorithm are thoroughly described. The implementation of the proposed method and simulation results are provided in Section 4. Finally, conclusions are drawn in Section 5.

#### 2. Nonlinear 2500-MW PWR Nuclear Power Plant

The nuclear system model adopted here is based on Mann's model without xenon poisoning, which consists of one core fuel lump and two coolant lumps. The discrete version of the model and its transformation are also given in this section.

##### 2.1. Nonlinear 2500-MW PWR Nuclear Power Plant

This fifth-order nonlinear PWR model includes the point kinetics equations with six delayed neutron groups, two equations for the lumped coolant outlet temperature and average fuel temperature, and the reactivity equation of the control rod [27, 28]. The multigroup delayed neutron point reactor dynamic equations are described as follows:

To reduce the computational burden caused by the six-group delayed neutron point reactor dynamic equations, a simple method is to use single-group delayed neutron point kinetics equations to approximate them [29]. Thus, the entire PWR model is summarized as follows, where the states are the neutron density relative to the initial equilibrium density (%), the delayed neutron density relative to its initial equilibrium density (%), the average fuel temperature (°C), the coolant temperature at the core outlet (°C), and the reactivity contributed by the control rod movement, and the input is the speed of the control rod. The remaining specifications are given in the Nomenclature [30]. The relative neutron density closely approximates the relative reactor power in the PWR model, so the corresponding state can be described as a percentage of the full power level.
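As a hedged illustration of the single-group approximation, the sketch below gives the standard textbook point kinetics equations in normalized form; this is not the paper's exact model, and all parameter values are placeholders rather than the values from Tables 1 and 2:

```python
import numpy as np

def point_kinetics_rhs(n_r, c_r, rho,
                       beta=0.0065,   # placeholder: delayed neutron fraction
                       Lam=1e-4,      # placeholder: neutron generation time (s)
                       lam=0.08):     # placeholder: precursor decay constant (1/s)
    """Normalized single-group point kinetics (standard textbook form).

    n_r : neutron density relative to initial equilibrium
    c_r : delayed neutron precursor density relative to equilibrium
    rho : total reactivity
    """
    dn_r = (rho - beta) / Lam * n_r + beta / Lam * c_r
    dc_r = lam * (n_r - c_r)
    return dn_r, dc_r

# At the initial equilibrium (n_r = c_r = 1, rho = 0), both derivatives vanish.
```

A full sketch of the fifth-order model would couple these equations with the fuel/coolant thermal balances and the rod reactivity equation, whose coefficients depend on the power level as described below.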

In addition, five parameters vary with the power level, which causes severe instability of the nuclear reactor power model and increases the control complexity. The remaining parameters and the specific relations are shown in Tables 1 and 2. When the load of the PWR model is raised or lowered, the varying parameters lead to sharp differences in the solutions of the dynamic model. Thus, the model may become uncontrollable, and the solutions of this dynamic model may diverge. Linearizing nuclear systems to achieve various control goals has been common practice in traditional control policies. However, a new approach that treats the nonlinear system directly is needed to solve these problems.

##### 2.2. System Discretization and Transformation

The optimal tracking control problem can be considered as minimizing the error between the real dynamic trajectory and the desired trajectory. Based on the model, the control-oriented nuclear power system can be defined as follows, where the corresponding drift and gain functions can be derived as

As the above descriptions show, there is only one controllable signal, i.e., the speed of the control rod, which we take as the control variable.

According to this definition, we have the discretized version of the PWR power model:
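The discretization step can be sketched generically; a forward-Euler scheme is one common choice (the paper does not state which scheme it uses, so this is an assumption):

```python
import numpy as np

def euler_step(f, x, u, dt):
    """One forward-Euler step: x_{k+1} = x_k + dt * f(x_k, u_k)."""
    return np.asarray(x) + dt * np.asarray(f(x, u))

# Example with a scalar stable system dx/dt = -x + u:
x_next = euler_step(lambda x, u: -x + u, 1.0, 0.0, 0.1)  # -> 0.9
```

Any one-step integration scheme yields the same abstract DT form x_{k+1} = F(x_k, u_k) used in the derivations below.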

Basically, we define the optimal tracking control problem as obtaining an optimal control strategy so that the system can track the reference state, and the desired trajectory can be expressed as

The tracking error of the state is defined as . Thus, the dynamic error system can be described as follows:

Additionally, we must specify an initial steady-state control policy, with respect to which the control error can be defined; thus, the dynamic tracking error system can also be written as follows, with the error-system drift and gain functions defined correspondingly.

We define the utility function as follows:

Thus, the tracking error cost function is written as
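With the common quadratic choice of utility, U(e_k, v_k) = e_k^T Q e_k + v_k^T R v_k, a finite-horizon truncation of the tracking error cost can be accumulated as below (a sketch; the matrices Q and R here are illustrative, not the paper's values):

```python
import numpy as np

def tracking_cost(errors, controls, Q, R):
    """Sum of quadratic utilities U(e_k, v_k) = e_k^T Q e_k + v_k^T R v_k."""
    return sum(e @ Q @ e + v @ R @ v for e, v in zip(errors, controls))

# Illustrative two-step trajectory of a 2-state error system with 1 input.
Q = np.eye(2)
R = 0.1 * np.eye(1)
errors = [np.array([1.0, 0.0]), np.array([0.5, 0.5])]
controls = [np.array([1.0]), np.array([0.0])]
J = tracking_cost(errors, controls, Q, R)  # 1.0 + 0.1 + 0.5 + 0.0 = 1.6
```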

From the principle of optimality, the DT Hamiltonian function is derived as

Thus, the HJB equation can be written as

Then, the optimal tracking control law for the error system is derived:

Finally, we obtain the standard optimal control law as

For linear systems, the HJB equation reduces to the Riccati equation. However, due to the nonlinearity of the nuclear power system, it is intractable to solve the HJB equation (13) directly for the nonlinear system. Thus, the value iteration (VI) method based on the actor critic NN algorithm is adopted to find an approximate optimal solution of the HJB equation (13), which implies that no explicit knowledge of the model drift or the command generator is required.

#### 3. Algorithm Analysis

The detailed convergence properties of the proposed value iteration algorithm are illustrated in this section, and the details of the actor critic NN are also discussed.

##### 3.1. Analysis of the Value Iteration Algorithm for Tracking Control of Nonlinear Systems

Considering the nonaffine nonlinear system, for an infinite-horizon optimal tracking problem, the goal is to obtain an optimal controller such that the state tracks the specified reference trajectory.

*Remark 1.* For many nonlinear systems, there is a feedback control that makes (9) hold. For example, for DT nonlinear systems (7) with an invertible gain function, the desired control can be derived as

From equation (11) of Section 2.2, the quadratic cost function of tracking errors is defined as follows, where the weighting functions are positive definite.

To obtain an optimal tracking control law that tracks the reference state and minimizes the tracking error cost function (18), we can redefine the optimal tracking error cost function as follows:

Based on Bellman's principle of optimality, the optimal cost function satisfies the Bellman equation:

Then, the optimal tracking control law is obtained by

Given the above formulation, we can derive the tracking error performance index function as follows:

Let the initial tracking value function be a given positive definite function. The initial control law can then be obtained by

For each subsequent step of this iterative value function algorithm, the value function is updated through

and the control policy is improved by
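To make the iteration concrete (initialize with a positive definite value function, then alternate the value update and the policy improvement), the sketch below runs tabular value iteration on a scalar linear error system e_{k+1} = a e_k + b v_k with quadratic utility. The system, grids, and coefficients are illustrative stand-ins, not the nuclear model:

```python
import numpy as np

def value_iteration(a=0.9, b=0.5, q=1.0, r=0.1, iters=50):
    """Tabular VI: V_{i+1}(e) = min_v [q e^2 + r v^2 + V_i(a e + b v)]."""
    states = np.linspace(-1.0, 1.0, 41)
    actions = np.linspace(-2.0, 2.0, 81)
    V = np.zeros_like(states)                  # V_0 = 0, a valid initial choice
    for _ in range(iters):
        V_new = np.empty_like(V)
        for i, e in enumerate(states):
            # Evaluate all candidate controls; clip successors to the grid.
            e_next = np.clip(a * e + b * actions, -1.0, 1.0)
            V_new[i] = np.min(q * e**2 + r * actions**2
                              + np.interp(e_next, states, V))
        V = V_new                              # iterates are nondecreasing
    return states, V

states, V = value_iteration()
# V(0) stays 0 (zero error, zero control is optimal); V grows away from 0.
```

In the paper, this minimization over a discretized grid is replaced by the actor-critic function approximators described in Section 3.2.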

Theorem 1. *For the tracking error cost function and control law obtained by (22)–(25), suppose there exist constants satisfying the respective bounds. If the bound condition holds, then inequalities (26) and (27) are satisfied uniformly; thus, the iterative value function satisfies*

Theorem 2. *For the cost function and control law obtained by (22)–(25), suppose there exist constants satisfying the stated bound. If inequalities (26) and (27) hold uniformly, then the value function satisfies equation (28). The proof is thoroughly provided in [17].*

Corollary 1. *For each iteration, the value function and control law are obtained by (22)–(25). Let there be constants satisfying the respective bounds. If inequalities (26) and (27) hold uniformly, then the iterative value function converges to the optimal cost function, i.e.,*

*Based on the aforementioned analysis, we can conclude that the iterative tracking error performance index function converges to the optimum as the iteration index tends to infinity, independently of the initial value function.*

*According to the Lyapunov stability principle, the converged value function is a Lyapunov function. Since the utility function is positive definite, the value function is positive definite as well. We let the following hold, where the error tracking control law is admissible:*

##### 3.2. Actor Critic NN Implementation of the Value Iteration Algorithm for DT Nonlinear Systems

The actor critic NN has been employed in various fields to approximate the cost function and the optimal controller. For example, optimal tracking has been applied to partially unknown DT nonlinear systems [31], together with a rigorous proof of the method. The actor critic structure also performs well in tracking control problems for continuous-time nonlinear systems [32, 33].

With regard to optimal tracking control algorithms of DT nonlinear systems, it is quite difficult to directly obtain solutions by solving (13). Thus, the actor critic network structure with the flowchart of the nuclear power system is given in Figure 1, which describes the inner procedures of the method.

While the tracking error system is fed with a specific initial state and desired trajectory, the error is evaluated by the utility function. Simultaneously, the critic network is trained in a way that minimizes the utility function. Through this training process, the actor network comes to behave like an optimal controller. To avoid overfitting, a threshold is specified at the beginning of the training procedure so that training can be stopped in time.
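The early-stopping idea described above can be sketched as a loop that halts once successive value-function iterates change by less than a threshold; the contraction used in the example and the tolerance value are illustrative:

```python
def train_until_converged(update, V0, tol=1e-6, max_iters=500):
    """Iterate V <- update(V) until successive iterates differ by < tol
    (scalar sketch; use a norm for vector-valued value functions)."""
    V = V0
    for i in range(max_iters):
        V_new = update(V)
        if abs(V_new - V) < tol:
            return V_new, i + 1
        V = V_new
    return V, max_iters

# Example: the contraction v -> 0.5*v + 1 has fixed point 2.
V_star, n_iters = train_until_converged(lambda v: 0.5 * v + 1.0, 0.0)
```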

Inspired by Abu-Khalaf [34], an optimal control algorithm with high-order polynomials was proposed to substitute for the neural units. This technique is introduced into the actor critic NN structure here to obtain a better approximation.

To solve (24) and (25), we let the tracking error cost function be approximated by a critic NN as follows, in terms of an approximate activation function, a weight vector, and the number of neural units in the hidden layer of the critic NN:
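Following the polynomial substitution mentioned above, one plausible basis (an assumption; the paper does not list its exact terms) is the set of monomials of the tracking error up to a chosen degree, with the constant term dropped so the approximated value vanishes at zero error. The paper's 15 critic neurons happen to match the 15 quadratic monomials of a 5-dimensional error vector, though that identification is our inference:

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_basis(e, max_deg=4):
    """Monomials of the error vector e up to degree max_deg
    (no constant term, so that w^T sigma(0) = 0)."""
    feats = []
    for d in range(1, max_deg + 1):
        for idx in combinations_with_replacement(range(len(e)), d):
            feats.append(np.prod([e[i] for i in idx]))
    return np.array(feats)

def critic_value(w, e, max_deg=4):
    """V_hat(e) = w^T sigma(e)."""
    return w @ poly_basis(e, max_deg)
```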

Then, the iterative formulation can be obtained as follows:

For each sample related to the tracking error, we formulate the definition as follows, and the weights of the critic network are thereby obtained. Similarly, we let the control law be approximated by an actor NN with its own approximate activation function, weight vector, and number of neural units:

According to (24) and (30), we tune the weights of the critic NN at each iteration of this VI algorithm. Our goal is to minimize the residual error between successive iterations, which yields a new target function as follows:

Similarly, the actor NN is applied to evaluate and approximate the optimal tracking control policy. We tune the weights of the actor NN to solve (20) at each iteration of this VI algorithm. According to (33), we can rewrite (14) as follows, where the actor weights are updated by the same method as the weights of the critic NN. To obtain the weight approximations of the actor critic NNs, the least-squares (LS) method is utilized.
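The least-squares step can be sketched as follows: collect basis evaluations Φ over sampled errors and target values y from the right-hand side of the value update, then solve for the weights in one batch (a generic sketch; the sampling strategy and data shown are assumptions):

```python
import numpy as np

def ls_weights(Phi, y):
    """Batch least squares: w = argmin_w ||Phi w - y||_2."""
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

# Example: recover the weights of a known linear-in-weights model.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((100, 3))   # 100 samples, 3 basis functions
w_true = np.array([1.0, -2.0, 0.5])
y = Phi @ w_true                      # noiseless targets
w_hat = ls_weights(Phi, y)            # recovers w_true
```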

#### 4. Numerical Simulations

In this section, numerical results are given to demonstrate the validity of this value iteration optimal tracking method. Experimental simulations of the performance index and weights of actor critic NNs are provided. This developed method is an offline policy with an initial random control policy.

##### 4.1. Actor Critic NN’s Implementation of the Value Iteration Algorithm

It is generally known that NNs can be leveraged to approximate any function on a prescribed compact set. We choose an error compact set on which to train this actor critic NN to obtain an offline tracking policy. As predefined in Section 3, the critic NN is approximated with 15 neurons and the actor NN is chosen with 5 neurons, with the corresponding weight vectors.

At the beginning of the algorithm execution, the weighting matrices are defined in terms of the identity matrix. The tracking error compact set is randomly sampled as the difference between the initial state and the desired trajectory. We chose a fixed control period per iteration, and the total number of iteration epochs was 50. Simultaneously, we chose an initial working point as the state input and a power working point as the desired tracking trajectory. Under the proposed value iteration algorithm, the tracking error performance index value, as shown in Figure 2, is monotonically nondecreasing and converges, which is consistent with the analysis above.

Additionally, when the tracking error performance index function converges, the training process of the actor-critic NN stops. As shown in Figures 3 and 4, the weights of the actor-critic NN converge to a steady solution, which implies that a good approximation of the optimal tracking controller is obtained.

##### 4.2. Application on the PWR Nuclear Power System

In this section, we apply the calculated control law to this DT nonlinear PWR power model, and the implementation results are shown in Figures 5–10.

Generally, the optimal tracking control of PWR power plants focuses on power-level adjustment, but there is strong coupling among the different states in this nuclear power model. Thus, it is necessary to track all states of the PWR power model, and the tracking objectives should guarantee the stability of each state and the safety of nuclear plants.

As shown in Figures 5–9, we give a step increase signal to this PWR power system. These five figures demonstrate that the five states catch the desired trajectory within a short settling time. We use a 5th-order DT nonlinear PWR power system in this study, which means that xenon poisoning is not considered. Although the PWR power model is quite simplified, all states of this model are difficult to track.

As shown in Figure 5, the power level tracks the desired states without overshoot or oscillation. The average temperature of the reactor core and the coolant outlet temperature approximate the desired temperatures; compared with the reference temperatures, the maximum deviations are 0.031°C and 0.002°C, respectively. In addition, the reactivity contributed by the control rod tracks the desired curve within a short time. Based on the aforementioned results, we can also see that our tracking method applied to this nuclear system has no steady-state error and a short regulation time. As shown in Figure 10, the tracking errors progressively converge to 0 under the proposed value iteration optimal tracking method.

#### 5. Conclusion

It is well known that optimal tracking power-level control for DT nonlinear nuclear power systems is crucial for both regular operation and safety, and manual control is inefficient. However, the intrinsic nonlinearity and the parameters that vary with the states make power-level control and trajectory tracking difficult.

In this study, a value iteration-based actor critic NN algorithm is designed to obtain an optimal tracking control policy for the DT nonlinear nuclear power plant. The tracking control problem is formulated in terms of the HJB equation, which the proposed algorithm solves approximately. As shown in the simulation results, the algorithm performs well in tracking the states and can also swiftly calculate the optimal control law.

#### Data Availability

The data used to support the findings of this study are included within the article.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.

#### Acknowledgments

This work was supported in part by the Open Project Program of State Key Laboratory of Nuclear Power Safety Monitoring Technology and Equipment (no. K-A2020.406), National Key R&D Program of China (no. 2021YFE0206100), National Natural Science Foundation of China (no. 62073321), National Defense Basic Scientific Research Program (no. JCKY2019203C029), and Science and Technology Development Fund, Macau SAR (0015/2020/AMJ).