#### Abstract

This paper presents an attitude control scheme that incorporates adaptive dynamic programming (ADP) for reentry vehicles subject to high nonlinearity and disturbances. First, the nonlinear attitude dynamics are divided into inner and outer loops according to time scale separation and the cascade control principle, and a general sliding mode control method is employed to construct the main controllers for the two loops. Because the main controllers have limited ability to handle nonlinearity and sudden disturbances, an ADP structure is introduced into the outer attitude loop as an auxiliary controller. The ADP structure uses neural network estimators to minimize the cost function and generate optimal signals through online learning, compensating for the main controllers’ shortcomings in adaptation speed and accuracy. Then, stability is analyzed by the Lyapunov method, and a parameter selection strategy for the ADP structure is derived to guide implementation. In addition, techniques for speeding up ADP training are proposed. Finally, simulation results show that the control strategy with ADP possesses stronger adaptability and faster response than that without ADP for the nonlinear vehicle system.

#### 1. Introduction

Attitude control for reentry vehicles has long been an active research topic in aerospace. The complex operating conditions and the high nonlinearity of the vehicles themselves pose great challenges for attitude control. In response, researchers have continued to explore and improve control schemes, developing a range of practical control techniques.

Various schemes have been investigated for the control of space vehicles. Linear control methods, such as linear parameter varying (LPV) control [1] and the linear quadratic regulator (LQR), rely on linearizing the aircraft model. However, because the dynamics are highly nonlinear and coupled, the capabilities of these linear methods on actual vehicles are limited. Nonlinear control methods are also widely employed, such as nonlinear dynamic inversion [2], sliding mode control, and backstepping [3, 4]. Although these techniques can deal effectively with the nonlinear nature of vehicles, without auxiliary means they still lack adaptability in the face of complex and changing disturbances. Therefore, adaptive technologies have been increasingly favoured in recent vehicle control research [5].

To improve controller robustness through adaptive mechanisms, observer-based adaptive control and other intelligent methods (such as adaptive fuzzy control and iterative learning) have emerged in succession [6–8]. In recent years, with the rapid development of artificial intelligence, reinforcement learning (RL) has attracted growing attention and has shown strong performance in solving adaptive and optimal control problems [9–11]. In the control domain, reinforcement learning takes the form of approximate or adaptive dynamic programming (ADP), which learns by interacting with the environment to determine which actions minimize a cost function over a period of time [12]. One of the core approaches is the critic-action (CA) design, which approximates the cost function and obtains the optimal actions by solving the Hamilton–Jacobi–Bellman equation with function estimators [13]. ADP covers a variety of structures, including heuristic dynamic programming (HDP) [14], dual heuristic dynamic programming (DHP) [15], and action-dependent heuristic dynamic programming (ADHDP), which have seen preliminary exploration and results in the field of vehicle control [16]. Specifically, Luo et al. developed a direct heuristic dynamic programming (dHDP) scheme for longitudinal control of hypersonic vehicles and introduced fuzzy neural networks to enhance the learning ability and robustness of dHDP [17]. ADHDP has also been applied to the optimal control of attitude maneuvers for three-axis spacecraft [18]. Other researchers improve ADP by redefining the two optimization objectives and apply it to the in-orbit reconfiguration of the vehicle attitude system under multitask constraints through dual optimization indexes [19].
Moreover, ADP can be combined with traditional methods, such as nonlinear filters [20] and sliding mode control [21], to implement a data-driven ADHDP auxiliary control scheme for the speed and altitude system of an air-breathing hypersonic vehicle [21]. In [22], a switching adaptive active anti-interference control technique based on a reduced-order observer and ADP is proposed, considering the parameter uncertainty and external disturbance of variable structure near-space vehicles. Furthermore, for the guidance and control problem of the vertical take-off and landing (VTOL) system with multivariable disturbances, an online kernel DHP robust control strategy based on sparse kernel theory is designed for VTOL vehicles [23]. Most of the above control strategies with ADP use neural network estimators to approximate the cost function and optimal control law online, whereas Zhou et al. put forward an incremental ADP (iADP) that combines the advantages of the incremental control method and ADP [24]. This iADP is based on the Markov decision process and the Bellman optimality principle; it derives an explicit expression of the optimal control law directly, which greatly simplifies the design process of ADP, and it has been successfully applied to satellites [25] and aircraft [26]. Similarly, Sun and van Kampen propose an incremental model-based DHP technique for vehicle control, replacing the model network in traditional DHP with an incremental model [27, 28].

In short, the development of ADP in the field of vehicle control is rapidly deepening and expanding [16], but in the current literature ADP is still rarely applied to the control of the attitude angles of all three channels of the vehicle. Moreover, most of the literature rarely addresses the internal weight convergence, parameter selection, training speed, and related issues of ADP based on critic-action networks, although these issues matter in practice. Therefore, this paper applies the ADP framework to the control of all three-channel attitude angles of a reentry vehicle. Inspired by the use of ADP as an auxiliary controller [21], this paper presents a framework combining a conventional controller with ADP, in which ADP serves as the auxiliary means to enhance the rapidity and adaptivity of the whole attitude system. In addition, the internal convergence of the ADP structure and its parameter selection rules are discussed in depth. With implementation in mind, this paper also considers measures to speed up ADP training, which are offered to interested researchers for future discussion.

The rest of this paper is organized as follows. Firstly, the nonlinear dynamics of the three-channel attitude control system of the reentry vehicle is established in Section 2. Then, in Section 3, the control strategy based on the dual-loop main controller plus ADP is elaborated in detail. In Section 4, some issues about implementation are taken into consideration. Finally, the simulations and conclusions are presented in Sections 5 and 6, respectively.

#### 2. Nonlinear Model

To describe the attitude change of the reentry phase, we give the rotation equations of the vehicle around the center of mass, including rotation dynamics and attitude kinematics. They determine the attitude angles of the vehicle around the center of mass and the angular rate of the three channels during the flight. Considering the influence of Earth rotation on attitude control, a three degree of freedom nonlinear attitude model in the body coordinate system can be obtained [29]:where represent the angle of attack, sideslip, and bank angle, respectively; are the roll, pitch, and yaw rate, respectively. And denote the roll, pitch, and yaw control torques, respectively; is rotational inertia. are longitude, latitude, heading angle, and flight path angle, respectively; is the Earth rotation angular velocity.

In actual control, vehicles can be regarded as an ideal rigid body. Considering that the rotation rate of the Earth is far less than that of vehicles, the rotation of the Earth is ignored. Besides, orbital motion is much slower than attitude motion, so the orbital motion terms of vehicles are described as . Finally, simplified dynamics can be obtained:

Above attitude kinematics equation (2) is abbreviated aswhere and . are defined as

Similarly, rotational dynamics can be simplified aswhere denotes inertial matrix; is a vector of control torques. and are defined as

If there exist external disturbances, and are introduced into the vehicle system as follows:where and represent external disturbances.

Obviously, the attitude tracking control problem of the reentry vehicles can be described as

#### 3. Controller Design

In the previous section, the nominal attitude model of the reentry vehicle has been established by equations (3) and (5), which can be reorganized as equations (9a) and (9b). This section will devise a controller with an auxiliary according to this vehicle model:

It is well known that the attitude angles change more slowly than the angular rate. Therefore, according to the principle of time scale separation and cascade control, equations (9a) and (9b) can be divided into attitude angle slow loop equation (9a) and angle rate fast loop equation (9b), also known as outer loop and inner loop, respectively. In this section, the ADP-based controller will be presented, and the overall control strategy is shown in Figure 1.

As shown in Figure 1, there are two control loops. The outer loop is an attitude control loop with two controllers. The controller 1 generates the main angular rate instruction according to the guidance instruction , and the ADP controller outputs the control instruction according to the attitude angle error; both of which together yield the angular rate . Then, is a reference instruction for the inner angular rate loop so that the controller 2 of the inner loop generates the control torque , which acts on the vehicle to output the actual attitude angles and complete the control task.

In this paper, outer loop controller 1 and inner loop controller 2 are implemented with conventional sliding mode control and serve as the main controllers. To improve the performance of the main controller of the outer loop, the ADP controller acts as an auxiliary and adopts an action-dependent structure, namely ADHDP. Since ADHDP belongs to the category of ADP, it is simply called ADP in this paper. The output of the ADP serves as a supplementary reference signal for the inner loop. The focus of this paper is the auxiliary role of the ADP structure; the main controllers could also be designed with other methods, but their selection is not the focus here. The ADP auxiliary controller is introduced only into the outer loop, mainly because the outer loop variable is the attitude angle and the inner loop variable is the angular rate, and the attitude angle changes more slowly than the angular rate. Therefore, the iteration speed of the ADP is more easily matched to the update speed of main controller 1. An ADP auxiliary controller with the same structure could perhaps be introduced into the inner loop as well; its rationality and effectiveness will be investigated in future work.

In the following subsections, according to the cascade control strategy, the outer loop controllers are designed first, including main controller 1 and the ADP-based auxiliary controller. After the reference command signal is obtained from the outer loop controllers, inner loop controller 2 is presented.

##### 3.1. Outer Loop Controllers

###### 3.1.1. Main Controller 1

The control objective of the outer loop is to operate the actual attitude angle to track within the desired accuracy. First, take the tracking error . The sliding switching surface of the outer loop can be selected aswhere and are the parameters to be designed [30]. Obviously, on the sliding surface , the tracking error can be guaranteed to converge uniformly, that is,

In order to ensure the asymptotic convergence of the outer loop tracking error to the sliding surface, the virtual control law must be designed. First, take the derivative of as

Take the following Lyapunov function:and the derivative of is as

By Lyapunov stability, has to be guaranteed. Therefore, the sliding mode approach law can be chosen aswhere designed parameter and denotes a sign function.

According to equations (12) and (15), there exists

So, the virtual control law of the outer loop can be obtained as follows:

In order to avoid or reduce the sliding mode chattering caused by the sign function in equation (17), a continuous function can be adopted instead of the sign function. Because the saturation function is one of the simplest and most effective choices, the virtual control law is redesigned as follows:

where denotes a saturation function with width as follows:
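The effect of replacing the sign function by a saturation function can be illustrated numerically. The sketch below assumes the common boundary-layer form (linear inside the layer, equal to the sign function outside it); the function name `sat`, the sample values, and the width `delta` are illustrative choices, not values taken from this paper.

```python
import numpy as np

def sat(s, delta):
    """Saturation function with boundary-layer width delta > 0 (assumed form).

    Returns s / delta where |s| <= delta and sign(s) elsewhere.
    """
    s = np.asarray(s, dtype=float)
    return np.clip(s / delta, -1.0, 1.0)

# Hypothetical sliding-variable samples and width, for illustration only.
s = np.array([-0.5, -0.01, 0.0, 0.02, 0.3])
delta = 0.05
switching = np.sign(s)      # discontinuous term that causes chattering
smoothed = sat(s, delta)    # continuous replacement near s = 0
print(switching)
print(smoothed)
```

Inside the layer the switching gain scales with the sliding variable, so the control signal no longer jumps between its extreme values when the sliding variable crosses zero.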

Therefore, according to control law equation (18), the attitude angles can track the commands, and the error uniformly converges. Next, will be provided as the main reference signal to the inner loop.

###### 3.1.2. ADP Auxiliary Controller

The idea of ADP is to use function estimators to approximate the performance index functions and control strategies that satisfy the principle of optimality. In a critic-action structure, the critic network approximates the performance index (the cost function), which is defined as the forward accumulation of the utility function with the discount factor [20, 21]:

where is usually defined as a quadratic form. It can be seen that the cost function is then also a convex quadratic function, so its only local minimum is also its global minimum. The action network obtains the optimal control law by minimizing :

In this paper, the auxiliary ADP controller is added only to the outer loop, to compensate for the attitude angle error left by main controller 1. ADP outputs ( has the same dimension as ), and the sum of and is fed to the inner loop as its reference instruction. Obviously, the ADP controller is sensitive to the attitude angle error. One could let ADP start working only when the error exceeds a certain threshold and stay idle while the error meets the threshold requirement, trading accuracy against computation speed. However, this is not the focus of this paper; the selection and optimization of such a threshold may be discussed in future research.

In Figure 2, ADP adopts a network structure based on ADHDP, which includes an action network, a critic network, and attitude model (9a). The input of ADP is the attitude error, and the action network generates the control signal . At the same time, the critic network approximates . The specific design of each network is given below.

*(1) Critic Network*. In Figure 3, the critic network uses a single hidden-layer BP neural network with six input nodes, *M* hidden nodes, and one output node. The input contains the attitude angle error and generated by the action network. The output is the estimate of the cost function . is the weight matrix of the input layer to the hidden layer and represents the weight of the *i*-th input node to the *j*-th hidden node. is the weight matrix from the hidden-to-output layer, and represents the connection weight of the *j*-th hidden node to the output. and are the input and output vectors of hidden nodes, respectively. The activation functions of the hidden layer and the output layer are a bipolar sigmoid function and a linear function, respectively. The attitude error is as follows:

The input of the critic network is as

The training of the critic network consists of two parts: the forward calculation, and the error backpropagation that updates the network weights. The forward process of step is

Equation (24) can be rewritten in matrix form as
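The forward calculation of the critic network can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact formulation of equations (24) and (25): the names `critic_forward`, `Wc1`, and `Wc2`, the layout of `Wc1` as an M × 6 array (one column per input node), the hidden-layer size `M = 8`, and the sample inputs are all hypothetical. The bipolar sigmoid (1 − e^(−x))/(1 + e^(−x)) is written as tanh(x/2), which is the same function.

```python
import numpy as np

def bipolar_sigmoid(x):
    """Bipolar sigmoid (1 - exp(-x)) / (1 + exp(-x)), written as tanh(x / 2)."""
    return np.tanh(x / 2.0)

def critic_forward(e, u, Wc1, Wc2):
    """Forward pass of a single hidden-layer critic with a linear output node."""
    x = np.concatenate([e, u])   # six inputs: attitude error and action output
    h = Wc1 @ x                  # hidden-node inputs, shape (M,)
    g = bipolar_sigmoid(h)       # hidden-node outputs, each in (-1, 1)
    J_hat = float(Wc2 @ g)       # scalar estimate of the cost function
    return J_hat, g

M = 8                                        # hypothetical number of hidden nodes
rng = np.random.default_rng(0)
Wc1 = rng.uniform(-0.2, 0.2, size=(M, 6))    # input-to-hidden weights
Wc2 = rng.uniform(-0.2, 0.2, size=M)         # hidden-to-output weights
e = np.array([0.05, -0.02, 0.01])            # illustrative attitude error
u = np.array([0.0, 0.1, -0.1])               # illustrative action-network output
J_hat, g = critic_forward(e, u, Wc1, Wc2)
print(J_hat)
```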

Based on the Bellman optimality principle, the critic network approximates the cost function of the system. The actual is defined as the cumulative return from the current state to the future:where is a discount factor or forgetting factor, indicating the influence of the future state on the current strategy. is the utility function at each step, which is defined as a quadratic:
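To make the role of the discount factor concrete, the sketch below evaluates a discounted sum of quadratic utilities over a short, made-up error sequence and checks the one-step recursion J(t) = U(t) + γJ(t+1) that such a forward accumulation satisfies. The weighting matrix `Q`, the value of `gamma`, the finite horizon, and the choice of a utility that depends only on the attitude error are assumptions made for illustration; the paper's own indexing and utility may differ.

```python
import numpy as np

def utility(e, Q):
    """Quadratic utility e^T Q e for one time step (assumed form)."""
    return float(e @ Q @ e)

def discounted_cost(errors, Q, gamma):
    """Forward accumulation of the utility with discount factor gamma."""
    return sum(gamma**k * utility(e, Q) for k, e in enumerate(errors))

Q = np.eye(3)          # hypothetical weighting matrix
gamma = 0.5            # hypothetical discount factor
errors = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 2.0]])

J0 = discounted_cost(errors, Q, gamma)        # 1 + 0.5 * 1 + 0.25 * 4
J1 = discounted_cost(errors[1:], Q, gamma)    # 1 + 0.5 * 4
print(J0, utility(errors[0], Q) + gamma * J1)
```

The recursion is what allows the critic to be trained from one-step data instead of requiring the whole future error sequence.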

The following error can be defined, and the critic network can approximate by minimizing :

Therefore, network weights can be updated through backpropagation of .

*(2) Updating the Weights Wc2*. Using the gradient descent method, let be the gradient, sowhere each component of is represented aswhere is the learning rate. Equation (30) is combined and rewritten into a matrix form as

*(3) Updating the Weights Wc1*. Similarly, let be the gradient, so

Combine the above formula into a simplified matrix form as follows:

where the symbol “×” represents the Hadamard product of two matrices, that is, element-wise multiplication; “” represents the ordinary multiplication of matrices. These symbols have the same meaning in the later parts of this paper.
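The gradient descent updates of both critic weight matrices can be sketched in matrix form. Because the error definition is not reproduced here, the sketch assumes the temporal-difference error commonly used with ADHDP, e_c(t) = γĴ(t) − [Ĵ(t−1) − U(t)], with E_c = e_c²/2; this may differ from equation (28). All names, the layout of `Wc1` (M × 6), and the numerical values are hypothetical. The element-wise product `Wc2 * dg_dh` plays the role of the Hadamard product and `np.outer` of the ordinary matrix product.

```python
import numpy as np

def bipolar_sigmoid(x):
    return np.tanh(x / 2.0)

def critic_error(Wc1, Wc2, x, J_prev, U, gamma):
    """Assumed TD error e_c = gamma * J_hat(t) - (J_hat(t-1) - U(t))."""
    g = bipolar_sigmoid(Wc1 @ x)
    e_c = gamma * float(Wc2 @ g) - (J_prev - U)
    return e_c, g

def critic_gradients(Wc1, Wc2, x, J_prev, U, gamma):
    """Gradients of E_c = 0.5 * e_c**2 with respect to Wc1 and Wc2."""
    e_c, g = critic_error(Wc1, Wc2, x, J_prev, U, gamma)
    grad_Wc2 = gamma * e_c * g                          # shape (M,)
    dg_dh = 0.5 * (1.0 - g**2)                          # bipolar sigmoid derivative
    grad_Wc1 = gamma * e_c * np.outer(Wc2 * dg_dh, x)   # shape (M, 6)
    return grad_Wc1, grad_Wc2

def critic_step(Wc1, Wc2, x, J_prev, U, gamma, lr):
    """One gradient descent step on both weight matrices."""
    grad_Wc1, grad_Wc2 = critic_gradients(Wc1, Wc2, x, J_prev, U, gamma)
    return Wc1 - lr * grad_Wc1, Wc2 - lr * grad_Wc2

M = 8
rng = np.random.default_rng(1)
Wc1 = rng.uniform(-0.2, 0.2, size=(M, 6))
Wc2 = rng.uniform(-0.2, 0.2, size=M)
x = np.array([0.5, -0.2, 0.1, 0.0, 0.3, -0.4])   # illustrative critic input
J_prev, U, gamma, lr = 1.0, 0.5, 0.9, 0.01       # illustrative values

loss_before = 0.5 * critic_error(Wc1, Wc2, x, J_prev, U, gamma)[0] ** 2
Wc1_new, Wc2_new = critic_step(Wc1, Wc2, x, J_prev, U, gamma, lr)
loss_after = 0.5 * critic_error(Wc1_new, Wc2_new, x, J_prev, U, gamma)[0] ** 2
print(loss_before, loss_after)
```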

*(4) Action Network*. As shown in Figure 4, the action network adopts a single hidden-layer BP neural network with three input nodes, *N* hidden nodes, and three output nodes. The network’s input is , and output is . Other parameters are defined similarly to the critic network. The activation functions of the hidden and output layers are a bipolar sigmoid function and a linear function, respectively.

The training of the action network also includes forward calculation and error backpropagation. Firstly, the forward process is briefly presented as

The action network generates an optimal control strategy by minimizing the system cost function . This goal can be achieved by minimizing the defined error :

*(5) Updating the Weights Wa2*. With the gradient descent method, the update process of iswhere represents the learning rate. The connection weight from the j-th hidden node to the output node is denoted as , so

The middle term in equation (38) indicates that the path of the backpropagated signal passes through the critic network when training the action network [31]. Furthermore, by the output and input of the critic network, can be obtained:

So,where represents the (*i* + 3)-th column of . Equation (40) can be rewritten in matrix form:where represents columns 4 to 6 of , that is, the connection weights of the three input nodes corresponding to and all hidden nodes in the critic network. From equations (37)–(41), can be deduced as

*(6) Updating the Weights Wa1*. Similar to the , the update of is

Substituting equation (41) into equation (44), can be easily obtained.
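The way the backpropagated signal passes through the critic network when training the action network can be sketched as follows. The sketch assumes the action error e_a = Ĵ with a zero target and E_a = e_a²/2, which is a common ADHDP choice and may differ from equation (36). The weight layouts (`Wa1` as N × 3, `Wa2` as 3 × N, `Wc1` as M × 6) and all names and values are hypothetical; columns 4 to 6 of `Wc1` are the ones connected to the action inputs of the critic.

```python
import numpy as np

def bipolar_sigmoid(x):
    return np.tanh(x / 2.0)

def adp_forward(e, Wa1, Wa2, Wc1, Wc2):
    """Action network followed by critic network."""
    ga = bipolar_sigmoid(Wa1 @ e)                        # action hidden outputs (N,)
    u = Wa2 @ ga                                         # action output (3,), linear
    gc = bipolar_sigmoid(Wc1 @ np.concatenate([e, u]))   # critic hidden outputs (M,)
    J_hat = float(Wc2 @ gc)
    return J_hat, u, ga, gc

def action_gradients(e, Wa1, Wa2, Wc1, Wc2):
    """Gradients of E_a = 0.5 * J_hat**2 with respect to Wa1 and Wa2."""
    J_hat, u, ga, gc = adp_forward(e, Wa1, Wa2, Wc1, Wc2)
    e_a = J_hat                                          # assumed zero target
    # dJ/du passes through the critic: columns 4 to 6 of Wc1 and Wc2.
    dJ_du = Wc1[:, 3:6].T @ (Wc2 * 0.5 * (1.0 - gc**2))  # shape (3,)
    grad_Wa2 = e_a * np.outer(dJ_du, ga)                 # shape (3, N)
    back = (Wa2.T @ dJ_du) * 0.5 * (1.0 - ga**2)         # shape (N,)
    grad_Wa1 = e_a * np.outer(back, e)                   # shape (N, 3)
    return grad_Wa1, grad_Wa2

N, M = 6, 8
rng = np.random.default_rng(2)
Wa1 = rng.uniform(-0.2, 0.2, size=(N, 3))
Wa2 = rng.uniform(-0.2, 0.2, size=(3, N))
Wc1 = rng.uniform(-0.2, 0.2, size=(M, 6))
Wc2 = rng.uniform(-0.2, 0.2, size=M)
e = np.array([0.5, -0.2, 0.1])                           # illustrative attitude error

grad_Wa1, grad_Wa2 = action_gradients(e, Wa1, Wa2, Wc1, Wc2)
lr_a = 0.01                                              # illustrative learning rate
Wa1_new, Wa2_new = Wa1 - lr_a * grad_Wa1, Wa2 - lr_a * grad_Wa2
```

Only the action weights change in this step; the critic weights are held fixed and serve purely as the path for the gradient.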

So far, the training process is completed. And the optimal control signal output by the action network will be combined with output by outer loop main controller 1, that iswhere the angular rate signal will be input as the reference command of the inner loop controller 2, and the control torque output by controller 2 will operate the vehicle to complete the attitude control task.

##### 3.2. Inner Loop Controller

To ensure that the actual angular rate can stably track the expected reference angular rate , similar to controller 1, the sliding variable is selected for inner loop controller 2 as follows:where and with . In order to ensure the inner loop tracking error asymptotically converges to the sliding surface , the actual control law has to be designed.

The derivative of is

Take the following Lyapunov function :

By Lyapunov stability, has to be guaranteed. Therefore, the dynamics can be chosen aswhere designed parameter and denotes a sign function.

According to equations (47) and (49), there exists

So, the actual control law of the inner loop can be obtained as follows:

Similarly, a continuous saturation function is chosen to replace the sign function to reduce the chattering. Therefore, the actual control law is rewritten as follows:where denotes a saturation function with width .

Therefore, for actual control law as equation (52), holds. That is, the actual attitude angular rate converges asymptotically to the expected angular rate .

#### 4. Implementation Issues

In Section 3, the design of the ADP auxiliary controller was completed, but the parameter selection and training speed of ADP cannot be ignored in practical applications. This section therefore discusses some implementation issues of the ADP structure, including parameter selection for the networks and techniques for speeding up training.

##### 4.1. Network Parameters and Their Convergence

It is clear that the critic network with a single hidden layer and randomly initialized weights can approximate with arbitrarily small errors, that is, . Similarly, the action network with randomly initialized weights can minimize the cost function and its output can approximate to the optimal control law , that is,

In other words, both the critic network and action network evolve towards the optimal direction to achieve their goals. Furthermore, considering equations (25) and (34), it is because of the adjustment of network weights , , , and that the output of the networks reaches the desired optimal value. That is, when the optimal control strategy is obtained, the network weights will also reach the optimal weights as follows [32]:where and represent the optimal weights of the critic and action network, respectively.

Lemma 1. *In critic and action network, the weights and are finally uniformly stable and approach the optimal weights and .*

*Proof. *It is well known that the weights of the input to hidden layer are similar to the weights of the hidden to output layer. In order to facilitate the elaboration, this paper only presents the uniform stability proof about and , which are the weights of the hidden to output layer. Let the optimal weights corresponding to and be and , respectively, and they are bounded. , , and are positive constants.

Equation (28) can be rewritten asFrom equations (29) to (31), the update of can be rewritten as follows:Similarly, the update of iswhere .

First, the Lyapunov method is adopted to analyse the convergence of :where is the error between actual and optimal weights. Then, the first-order difference of is expressed asAccording to equation (56), (60) can be obtained:In addition, denote the approximation error between actual and optimal output asSubstituting equations (60) and (61) into equation (59), can be deduced:Furthermore, applying the Cauchy–Schwarz inequality [33], it can be deduced asSimilarly, set , .

Denote the approximation error of the action network between the actual and optimal output as . Referring to , satisfiesFurthermore, set , and thenFrom the above derivation, we can finally take the total Lyapunov function asSelecting some parameters as equation (67), then equation (68) holds:where representsFurthermore, applying the Cauchy–Schwarz inequality, we getwhere the subscript “max” represents the upper bound of the corresponding parameters’ 2-norm, such as .

Therefore, for any holds. This indicates that the actual weights will converge to the optimal weights. In other words, the weight errors are uniformly bounded. This also results in a stable ADP system and an optimal output.

Furthermore, note that the components of and are limited to [−1, 1] due to the activation functions of the hidden nodes, that areSo, there existAccording to equation (67), some networks’ parameters should satisfyEquation (74) provides a simple and intuitive guidance to select networks’ structure and learning rate, while maintaining the stability of weights and ADP structure.

##### 4.2. Improvement in Implementation

In the previous literature, training a feedforward network usually requires adjusting all weights, so there are strong dependencies between different layers. Moreover, gradient descent algorithms are widely applied to the learning of various feedforward neural networks. However, learning based on gradient descent is often slow and time-consuming when the learning step is poorly chosen, and it is prone to overtraining and to becoming trapped in local minima.

In order to make the training process as time-saving as possible and to better match the timing of online training with practical applications, two ideas can be considered. The first is based on Igelnik and Pao’s theory [34]: for a single hidden-layer feedforward neural network, if the weights from the input to the hidden layer are randomly initialized and kept constant, the approximation error of the network can be made arbitrarily small as long as the number of hidden nodes is sufficient. The second is based on the extreme learning machine (ELM) proposed by Huang et al. [35, 36]. For a single hidden-layer feedforward neural network, the weights from the input to the hidden layer are initialized randomly and kept constant, and the hidden nodes are arbitrarily selected. The weights from the hidden to the output layer are then determined analytically by the Moore–Penrose inverse, without deriving and calculating partial derivatives layer by layer as in the gradient descent method. Extreme learning methods have been reported to be tens or even thousands of times faster than ordinary gradient descent methods, and they can effectively reduce complexity and avoid local minima [37].
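The second idea can be illustrated on a toy regression problem. The sketch below is a generic ELM-style fit, not the ADP training used in this paper: the input-to-hidden weights are drawn randomly and frozen, and the hidden-to-output weights are obtained in one step from the Moore–Penrose pseudoinverse of the hidden-layer output matrix. The tanh activation, the bias term, the weight range, the target function, and the names `elm_fit` and `elm_predict` are assumptions made for illustration.

```python
import numpy as np

def elm_fit(X, T, n_hidden, rng):
    """Fit output weights analytically with frozen random hidden weights."""
    W_in = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # random, kept fixed
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                   # random, kept fixed
    H = np.tanh(X @ W_in + b)              # hidden-layer output matrix
    W_out = np.linalg.pinv(H) @ T          # least-squares solution, no iterations
    return W_in, b, W_out

def elm_predict(X, W_in, b, W_out):
    return np.tanh(X @ W_in + b) @ W_out

rng = np.random.default_rng(0)
X = np.linspace(-3.0, 3.0, 200).reshape(-1, 1)   # toy inputs
T = np.sin(X)                                    # toy targets
W_in, b, W_out = elm_fit(X, T, n_hidden=30, rng=rng)
rmse = float(np.sqrt(np.mean((elm_predict(X, W_in, b, W_out) - T) ** 2)))
print(rmse)
```

No learning rate or iteration count appears in the fit, which is the source of the speed advantage; the price is a batch solve over stored samples, which fits online ADP updates less naturally than a per-step gradient update.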

To facilitate implementation, this paper adopts the first idea; that is, the weights and are randomly initialized in a finite interval and kept constant, and only the weights and are adjusted by the gradient descent algorithm, which avoids excessive time consumption. The idea based on the extreme learning machine is only mentioned here without in-depth discussion, owing to the limited space of this paper and the lack of theoretical guidance for its application to vehicles. Further analysis and more rigorous theory to support its application in practical vehicle control are left to future research.

#### 5. Simulations

In this section, the control strategy with ADP derived above is applied to vehicle attitude control, and its effectiveness is verified by comparison with the conventional controller without ADP.

According to a vehicle model in laboratory, the inertia matrix is taken as

The common parameters are taken as follows: , ; the width in the saturation function; . The number of hidden nodes is . According to equation (74), the discount factor takes with learning rate , . Take , and all weights are randomly initialized in [−0.2, 0.2].

Set the initial flight state of the vehicle as and . The desired attitude instruction is , and the simulation step size is 0.02 s. To verify the performance of the controller, pulsed disturbances