Abstract
For nonautonomous nonlinear systems, the optimal control design is affected by the partial derivative of the value function with respect to time. If a reinforcement learning (RL) strategy is developed to approximate the optimal control scheme of a nonautonomous nonlinear system without accounting for this term, the closed-loop system may become unstable. Therefore, in this article, a direct RL law for a nonautonomous thermoacoustic generator (TAG) is investigated. We establish the mathematical model of the TAG by partial differential equations (PDEs) and then transform it into a time-varying nonlinear system. The direct RL technique with the Newton–Leibniz formula is implemented to account for the partial derivative term of the classical policy iteration (PI) method, by modifying the computation using data collected between two sampling times. Finally, several simulation studies with comparisons are conducted to validate the theoretical analyses.
1. Introduction
As a key topic in energy research, the thermoacoustic generator (TAG) has attracted the attention of many researchers [1]. The conversion of heat from a high-temperature source through a high-efficiency heat engine process, as well as the conversion of heat into electricity, was investigated in [1]. However, the control design problem of the TAG has received little attention. In practical engineering systems, optimal control is an effective technique for balancing tracking accuracy against a performance index. From a theoretical standpoint, achieving an optimal control design requires solving the Hamilton–Jacobi–Bellman (HJB) equation for nonlinear systems, or the Riccati equation for linear systems, subject to a user-defined performance index [2]. However, because these equations are difficult to solve analytically and because of model uncertainties, adaptive reinforcement learning (RL)-based techniques have been employed to approximate the optimal control solution [2–4]. On the other hand, other methods can be developed without RL, such as the approach in [5], which solves a robust LQR problem in the presence of constraints by min-max optimization.
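For the linear case, the Riccati equation can be illustrated in its simplest scalar form. The sketch below assumes a scalar plant dx/dt = a x + b u with cost ∫(q x² + r u²) dt (an illustrative assumption, not the TAG model), for which the algebraic Riccati equation 2aP − (b²/r)P² + q = 0 has a closed-form stabilizing root. The function name `scalar_care` is hypothetical.

```python
import math


def scalar_care(a: float, b: float, q: float, r: float) -> tuple[float, float]:
    """Stabilizing root of the scalar continuous-time algebraic Riccati equation
        2*a*P - (b**2 / r) * P**2 + q = 0,   with b != 0, q > 0, r > 0,
    and the corresponding LQR gain K, so that u = -K * x."""
    # Positive root of the quadratic in P; it makes a - b*K strictly negative.
    p = (a * r + math.sqrt((a * r) ** 2 + b ** 2 * q * r)) / b ** 2
    k = b * p / r
    return p, k


p, k = scalar_care(a=1.0, b=1.0, q=1.0, r=1.0)
print(p, k, 1.0 - 1.0 * k)  # the closed-loop pole a - b*K is negative
```

In the nonlinear case no such closed form exists in general, which is why the HJB equation is approximated by RL instead.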
As an important approach in modern control, the RL-based control method aims to minimize a performance index while guaranteeing the stability of the closed-loop system. Many important studies have been conducted for linear systems, using the Kronecker product [6] to conveniently compute quadratic forms, and for nonlinear systems, using neural networks [7] to approximate the Bellman function. However, because the closed-loop systems of practical plants such as robots and generators are time varying, it is challenging to develop RL-based control strategies for them. To overcome this challenge, the model transformation method [8] and direct RL techniques for nonautonomous systems [9–11] have been introduced. The transformation technique treats the desired trajectory as new state variables and changes the cost function to a different form [8]. After the autonomous system is obtained, an online actor-critic strategy was developed in [8], with adaptation laws for the weights of the actor and critic neural networks that approximate the optimal control and the Bellman function. Owing to the properties of the Hamiltonian, training was achieved by minimizing the square of the Hamiltonian. The classical actor-critic scheme can also be extended to nonlinear continuous-time time-delayed dynamical systems [12] by adding a time-delay-based integral term to the optimal value functional, which in turn makes the integral temporal difference error (ITDE) depend on the time delay [12]. However, the traditional actor-critic strategy [8] usually requires an exact model in the computation. For nonlinear systems containing unmodeled dynamics, Yang et al. [13] proposed a robust online actor-critic strategy that adds a robustifying term combined with a fuzzy logic system-based approximator.
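The Kronecker-product device mentioned above rests on the identity xᵀPx = (x ⊗ x)ᵀ vec(P), which makes a quadratic value function linear in the unknown entries of P, so those entries can be fitted by linear least squares from data. A small pure-Python check of the identity; the helper names are hypothetical.

```python
def kron(x: list[float], y: list[float]) -> list[float]:
    """Kronecker product of two vectors: [x0*y0, x0*y1, ..., x1*y0, ...]."""
    return [xi * yj for xi in x for yj in y]


def vec(p: list[list[float]]) -> list[float]:
    """Column-stacking vectorization of a square matrix."""
    n = len(p)
    return [p[i][j] for j in range(n) for i in range(n)]


def quad_form(x: list[float], p: list[list[float]]) -> float:
    """Direct evaluation of x^T P x."""
    n = len(x)
    return sum(x[i] * p[i][j] * x[j] for i in range(n) for j in range(n))


def quad_form_kron(x: list[float], p: list[list[float]]) -> float:
    """Same value written as (x kron x)^T vec(P), which is linear in the entries of P."""
    return sum(u * v for u, v in zip(kron(x, x), vec(p)))


x = [1.0, 2.0]
P = [[1.0, 2.0], [3.0, 4.0]]
print(quad_form(x, P), quad_form_kron(x, P))  # both 27.0
```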
Likewise, RL methods have been developed for the general HJB problem with an additional variable [3], for the HJI problem under disturbances [6, 14–16], and for a modified Hamiltonian [7]. Unlike traditional online actor-critic RL methods with simultaneous tuning, sequential learning using the value iteration (VI) algorithm was discussed in [3], in which the value function is computed directly from the previous step without solving a Lyapunov equation. Furthermore, the VI algorithm [3] does not require an admissible control at the first step, as the policy iteration (PI) algorithm does. Regarding the modified Hamiltonian [7], the method was extended to handle input constraints through an equivalent mapping based on a special function, and actor-critic learning was developed for the modified value function and the equivalent modified Hamiltonian. For perturbed systems subject to disturbances, an RL algorithm was developed under the generalized disturbance attenuation criterion [14]. A notable Q-learning algorithm was introduced for completely uncertain systems using a two-variable optimal value function [17]. However, the Q-learning technique in [17] is only appropriate for linear systems with a quadratic Bellman function. Recently, adaptive dynamic programming (ADP) control with an event-triggered mechanism (ETM) has been proposed for complicated systems, such as discrete-time boiler-turbine systems [18] and the Roller Kiln temperature field described by partial differential equations (PDEs) [19]. However, the RL control in [18] is only implemented for time-invariant systems, without considering the effect of a time-dependent desired trajectory. In summary, the above RL control designs rely on autonomous (time-invariant) models.
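The difference between VI and PI regarding the initial policy can be seen on a scalar linear-quadratic problem (plant dx/dt = a x + b u, cost ∫(q x² + r u²) dt, assumed for illustration). The sketch below is a generic VI-style recursion that follows the Riccati residual starting from P = 0; it is not the specific algorithm of [3], and the step size and tolerance are arbitrary choices.

```python
def value_iteration(a: float, b: float, q: float, r: float,
                    step: float = 0.01, iters: int = 10_000, tol: float = 1e-10) -> float:
    """VI-style iteration for the scalar Riccati equation (illustrative only).
    Each step moves P along the Riccati residual
        P <- P + step * (2*a*P - (b**2 / r) * P**2 + q),
    starting from P = 0, i.e. from the gain K = 0, which need not be stabilizing."""
    p = 0.0
    for _ in range(iters):
        p_next = p + step * (2 * a * p - (b ** 2 / r) * p ** 2 + q)
        if abs(p_next - p) < tol:
            return p_next
        p = p_next
    return p


# Open-loop unstable plant (a = 1): K = 0 is not admissible, yet the iteration converges.
print(value_iteration(a=1.0, b=1.0, q=1.0, r=1.0))  # close to 1 + sqrt(2)
```

A PI scheme started from K = 0 on the same plant would evaluate a destabilizing policy, whose cost is infinite.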
To handle nonautonomous closed-loop systems with RL control, existing works either avoid time-varying references or employ transformation techniques to obtain autonomous systems. Only very few studies have considered direct RL control solutions for time-varying systems [9–11], owing to the presence of the term  (the partial derivative of the value function with respect to time). The authors in [9] improved the conventional policy iteration (PI) technique by adding the partial derivative with respect to time at each step. Because of the nonlinearity, two NNs, namely an actor NN and a critic NN, are utilized, with weights trained by adding the partial derivative with respect to the states [9]. However, the direct RL techniques in [9–11] are purely mathematical analyses. A different approach to direct RL control is the off-policy technique [20, 21]. Because the input control is kept fixed while the RL algorithm is computed, the off-policy technique is able to address completely uncertain linear [20] and nonlinear [21] systems. Therefore, direct RL-based controllers for time-varying closed-loop practical systems remain a challenging issue, which motivates us to study this problem for TAG systems.
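To make the role of the time-derivative term concrete, the derivation below is a sketch in generic notation assumed here, not the article's own equations: an input-affine plant dx/dt = f(x,t) + g(x,t)u and a running cost Q(x,t) + uᵀRu with R positive definite, where V is the value of a fixed admissible policy and V* the optimal value. Because V depends explicitly on t, its total derivative along trajectories contains ∂V/∂t, which therefore appears in the HJB equation, while the optimal control depends only on ∂V*/∂x.

```latex
\begin{aligned}
V(x,t) &= \int_t^{\infty} \Big( Q\big(x(\tau),\tau\big) + u^{\top}(\tau) R\, u(\tau) \Big)\, d\tau, \\
\frac{dV}{dt} &= \frac{\partial V}{\partial t} + \Big(\frac{\partial V}{\partial x}\Big)^{\top} \big( f(x,t) + g(x,t)\,u \big) = -Q(x,t) - u^{\top} R\, u, \\
0 &= \frac{\partial V^{*}}{\partial t} + Q(x,t) + \Big(\frac{\partial V^{*}}{\partial x}\Big)^{\top} f(x,t)
   - \frac{1}{4} \Big(\frac{\partial V^{*}}{\partial x}\Big)^{\top} g(x,t)\, R^{-1} g^{\top}(x,t)\, \frac{\partial V^{*}}{\partial x}, \\
u^{*}(x,t) &= -\frac{1}{2} R^{-1} g^{\top}(x,t)\, \frac{\partial V^{*}}{\partial x}.
\end{aligned}
```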
Inspired by the above analysis, we investigate the application of a direct RL procedure to the control of time-varying TAG systems, which are described by PDEs. We use appropriate modifications to transform the PDEs of TAG systems into a time-varying dynamic equation. The major contributions of this article are as follows:
(1) First, unlike [2, 22], which realize RL procedures for robotic systems described by differential equations, a time-varying RL structure is presented to obtain the optimal control for time-varying TAG systems described by PDEs.
(2) Second, unlike existing research on TAG systems [1], an optimal control strategy is investigated, and the related comparison is discussed through theoretical analysis and simulation studies.
The remainder of this article is organized as follows. Section 2 gives the problem statement of this article. Section 3 presents the mathematical modelling of TAG systems. The direct RL-based control scheme for TAGs is discussed in Section 4. Simulation studies of a TAG control system are presented in Section 5, and concluding remarks are given in Section 6.
2. Preliminaries and Problem Statement
A thermoacoustic device can convert thermal energy into acoustic energy, or consume acoustic energy to transfer heat from a low-temperature source to a high-temperature source; in a thermoacoustic generator (TAG), the acoustic power is then converted into electricity through electromechanical converters. In this paper, we investigate a fundamental TAG structure, shown in Figure 1, which includes five parts: the regenerator (REG), the heat exchangers (HHX, CHX), the alternator (ALT), the stub (STUB), and the feedback pipe (FBP). Based on thermoacoustic theory and the partial element equivalent circuit (PEEC) method [23], we construct a theoretical model of the TAG from the models of its equivalent parts. Then, we apply the adaptive dynamic programming method to the TAG using the obtained mathematical model. Moreover, based on the physical phenomena of TAGs, some assumptions are required to represent TAGs by partial differential equations (PDEs) and equivalent circuits in the following sections.
Assumption 1. The following assumptions are needed to apply linear thermoacoustic theory to the modelling of the thermoacoustic generator:
(1) The material surface is smooth, heat radiation is negligible, and the plates are rigid and stationary.
(2) The acoustic pressure varies along one direction only, and the viscosity is independent of temperature.
(3) The length of the plates is small compared with the size of the resonator.
The control objective is to design the control input of TAG systems that minimizes the performance index, taking into account the time-varying nature of the system. The control design is carried out by RL-based approximation after the TAG model is obtained.
Remark 1. Unlike [2, 22], which study RL techniques for time-varying closed-loop robotic systems by indirect methods, this article develops a direct RL method for time-varying TAG systems.
3. Mathematical Modelling of Thermoacoustic Generator Systems
In this section, we model the thermoacoustic generator based on its operating principles combined with the partial element equivalent circuit (PEEC) method [23] to obtain an analogous circuit of the device. Then, we derive the mathematical model based on assumptions about the operating conditions of the thermoacoustic generator and mathematical transformations.
3.1. Regenerator
According to linear thermoacoustic theory [1], the interaction between the acoustic and temperature fields can be described by equation (1), where , , and  are the pressure, flow, and temperature of the gas, respectively;  and  are the specific weight and the adiabatic ratio of the gas;  and  are the angular frequency and the speed of sound; and  and  are the viscous function and the spatially averaged thermal function, respectively. For convenience of analysis, equation (1) can be separated into the following two dynamic equations [24]:
From the PEEC method and mathematical transformations, the equivalent current source and the resistance of the regenerator are obtained as follows, where  and  are the temperatures of the hot and cold heat sources;  and  are the diameter and length of the stack; and  is the absolute viscosity coefficient of the gas.
3.2. Alternator
A simple linear model [25] that treats the loudspeaker as a linear alternator is shown in Figure 2. The acoustic wave imposes an oscillating pressure on the diaphragm, which has an effective area S, as shown in Figure 2(a); Figure 2(b) shows the equivalent circuit of the physical model. The diaphragm and the coil, with a total mass M_m, undergo oscillatory motion. The loudspeaker has a mechanical stiffness  and a mechanical resistance . The coil has an electrical inductance  and an electrical resistance . The force factor is . A pure electrical resistance  is connected as a load to extract electrical power. The voltage across the load resistance is , and the current is .
Assume that all parameters are linear and frequency-independent. Neglecting hysteresis losses, the impedance of the alternator in Figure 2 can be written as [22].
Note that the pressure loss is much smaller than in traditional systems, since the reactive component of the impedance is very small even when the alternator is off resonance by a few Hz [26]:
Therefore, the equivalent impedance of the alternator can be rewritten as
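To illustrate how the listed parameters combine, the sketch below evaluates the standard lumped-parameter model of a linear alternator (e^{jωt} convention), assumed here rather than taken from the article's equation: the mechanical impedance of the moving mass plus the electrical branch reflected through the force factor, divided by the square of the piston area S. All parameter names and values are hypothetical.

```python
def alternator_impedance(omega: float, area: float, mass: float, stiffness: float,
                         r_mech: float, l_coil: float, r_coil: float,
                         r_load: float, bl: float) -> complex:
    """Acoustic impedance p/U seen by the gas at the alternator piston.
    Force balance: area*p = [r_mech + j(omega*mass - stiffness/omega)] v + bl*I,
    coil circuit:  bl*v = (r_coil + r_load + j*omega*l_coil) I,  and U = area*v."""
    z_mech = complex(r_mech, omega * mass - stiffness / omega)
    z_elec = complex(r_coil + r_load, omega * l_coil)
    return (z_mech + bl ** 2 / z_elec) / area ** 2


# Hypothetical values; with l_coil = 0 and omega = sqrt(stiffness / mass) the reactance vanishes.
z = alternator_impedance(omega=2.0, area=1.0, mass=1.0, stiffness=4.0,
                         r_mech=1.0, l_coil=0.0, r_coil=1.0, r_load=1.0, bl=2.0)
print(z)  # (3+0j)
```

Near mechanical resonance the imaginary part stays small, which is consistent with neglecting the reactive component as described above.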
3.3. Stub
The stub is a side branch placed perpendicular to the central conduit, as shown in Figure 1. The main purpose of the stub is to compensate for the phase shift between pressure and flow after the alternator. For a straight acoustic duct, the relationship between the input and output acoustic impedances can be expressed as [26]
Here,  is the length of the duct,  is the acoustic propagation coefficient, and  is the characteristic impedance of the duct. The stub is essentially a closed-end section of an acoustic duct, . Therefore, the input acoustic impedance of the stub can be approximated as follows, where
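The closed-end stub relation can be illustrated with the standard transmission-line form for a uniform duct, assumed here: Z_in = Z_0 (Z_L + Z_0 tanh(Γl)) / (Z_0 + Z_L tanh(Γl)), which for a closed end (Z_L → ∞) reduces to Z_0 coth(Γl). The e^{jωt} sign convention is assumed, the function names are hypothetical, and the form may differ in detail from the article's equation.

```python
import cmath
import math


def duct_input_impedance(z_load: complex, z0: complex, gamma: complex, length: float) -> complex:
    """Input impedance of a uniform duct terminated by z_load (transmission-line form)."""
    t = cmath.tanh(gamma * length)
    return z0 * (z_load + z0 * t) / (z0 + z_load * t)


def closed_stub_impedance(z0: complex, gamma: complex, length: float) -> complex:
    """Closed-end limit z_load -> infinity of the relation above: z0 * coth(gamma * length)."""
    return z0 / cmath.tanh(gamma * length)


# Lossless example: gamma = j*k, so the closed stub is purely reactive, -j*z0*cot(k*l).
k, length, z0 = 2.0, 0.3, 1.0
print(closed_stub_impedance(z0, 1j * k, length), -1j * z0 / math.tan(k * length))
print(duct_input_impedance(1e12, z0, 1j * k, length))  # very stiff end, nearly the same value
```

Choosing the stub length changes the sign and size of this reactance, which is what allows it to compensate the pressure-flow phase shift.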
3.4. Heat Exchanger and Feedback Pipe
Heat exchangers serve as heat recovery equipment and are suitable for intake and return air systems. Two heat exchangers are employed in the thermoacoustic generator to recover and maintain the temperatures of the hot and cold heat sources. A heat exchanger has a very low porosity and a relatively long length. Compared with the large cross-sectional area of the thermoacoustic core, the heat exchanger locally inserts a long channel with a small cross-section into the loop. As a result, from an acoustic standpoint, the heat exchanger exhibits a considerable inertance and a moderate acoustic resistance (Figure 3).
The rest of the feedback pipe is simply a lossy acoustic waveguide. Each unit section can be modelled as a combination of a resistance , an inductance , and a capacitance , all of which can be determined using equations (8) and (9).
After modelling the five elements of the thermoacoustic generator, we merge the results into the closed circuit shown in Figure 4.
Assumption 2. The following assumptions simplify the equivalent circuit of the thermoacoustic generator:
(1) The gas flow is ideal, i.e., all friction with the duct is neglected. As a result, the energy of the gas flow is conserved, which allows us to ignore the influence of the feedback pipe (FBP) throughout the circuit analysis.
(2) In the heat exchanger, the equivalent inductance is very small compared with the impedance; moreover, the impedance value can be adjusted through the fluid flow in the device. Therefore, the heat exchanger can be treated as a rheostat in the analogous circuit for modelling convenience.
Denote , where  is the gas pressure and  is the output power of the alternator. The mathematical model of the thermoacoustic generator is then as follows: The phase difference angles are defined as follows, where .
Remark 2. The thermoacoustic generator operates only when the temperature difference between the hot and cold sources is positive. This means that the control input, i.e., the temperature difference, must always be positive, which constitutes an input constraint of the system. Furthermore, the gas pressure (the state) is always positive during operation.
4. Direct Reinforcement Learning for Nonautonomous TAG Systems
In this section, we extend the classical RL algorithm [2, 22, 27] to time-varying RL for the following general class of time-varying nonlinear systems:
with the initial value , and the associated cost function defined as follows, where  is an arbitrary state vector and  is the vector of input signals. Additionally,  is a continuously differentiable, positive definite map, and the partial derivatives of  with respect to , , and  are continuous in . The optimal control objective is to minimize the cost function (14) through an input signal  belonging to the class of admissible signals defined as follows.
Definition 1 (see [4]). The control signal  is an admissible control signal if and only if:
(1)  is a piecewise continuous function of time.
(2) The solution of the time-varying nonlinear system (13) under the input signal  converges to zero as time tends to infinity.
(3) The integral (14) is finite, i.e.,  for all .
According to the dynamic programming (DP) principle [22, 27, 28], taking the time derivative of the optimal value along the trajectory of system (13) yields the following equation: where  is the Hamiltonian. Once the optimal control input  is obtained, the corresponding optimal value function can be defined as . Therefore, the time-varying HJB equation is expressed as follows: Based on relations (15) and (16), the following iterative algorithm (Algorithm 1) can be applied to the optimization problem of thermoacoustic generator systems.
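The evaluate-then-improve structure of policy iteration, and the admissibility requirement of Definition 1, can be illustrated on the time-invariant scalar linear-quadratic special case, where ∂V/∂t = 0 and policy evaluation reduces to a scalar Lyapunov equation. This is a Kleinman-style sketch under that simplifying assumption, not the article's time-varying Algorithm 1; the function name is hypothetical.

```python
def policy_iteration(a: float, b: float, q: float, r: float, k0: float,
                     iters: int = 50, tol: float = 1e-12) -> tuple[float, float]:
    """Kleinman-style PI for dx/dt = a*x + b*u, cost integral of (q*x**2 + r*u**2).
    The initial gain k0 must be admissible, i.e. a - b*k0 < 0."""
    if a - b * k0 >= 0:
        raise ValueError("initial gain k0 is not admissible (closed loop unstable)")
    k, p = k0, 0.0
    for _ in range(iters):
        # Policy evaluation: scalar Lyapunov equation 2*(a - b*k)*p + q + r*k**2 = 0.
        p = -(q + r * k ** 2) / (2 * (a - b * k))
        # Policy improvement: u = -k*x with k = b*p/r.
        k_next = b * p / r
        if abs(k_next - k) < tol:
            return p, k_next
        k = k_next
    return p, k


print(policy_iteration(a=1.0, b=1.0, q=1.0, r=1.0, k0=2.0))  # both close to 1 + sqrt(2)
```

Each improved gain remains stabilizing, which mirrors the property that admissibility is preserved from one iteration to the next.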
Following [10], the traditional Algorithm 1 can be extended to uncertain nonlinear systems by computing the difference between the time-varying value functions at two different sampling times  and , approximating it with the Newton–Leibniz formula, and combining the result with function approximation theory. Therefore, Step 3 of the classical Algorithm 1 is replaced by a data-driven computation consisting of the following two steps. First, the difference between the two time-varying value functions is calculated using the Newton–Leibniz formula under the control input  over the time interval : Second, since  and  are unknown functions, Step 3 of the traditional Algorithm 1 can be realized using the basis function method [29], with the following weight update law [10]: where  are approximation errors that converge to zero as the number of iterations tends to infinity. Based on the above analysis and transformation, a numerical RL algorithm is proposed.
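The data-driven step can be sketched in its simplest form: over each interval [t, t+T], the Newton–Leibniz formula turns V(x(t)) − V(x(t+T)) into the integral of the running cost, which is computed from sampled data, and the basis-function weights are fitted by least squares. The sketch below assumes a time-invariant scalar example with the hypothetical basis [x², x⁴] and a fixed policy u = −kx; in the time-varying case the basis functions would also depend on t. The plant model is used only to generate data, never by the estimator. It illustrates the idea and is not the article's Algorithm 2.

```python
def simulate(a: float, b: float, k: float, x0: float, dt: float, steps: int) -> list[float]:
    """Closed-loop samples of dx/dt = a*x + b*u under u = -k*x (forward Euler).
    The model (a, b) is used ONLY to generate data; the estimator never sees it."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x + dt * (a * x - b * k * x))
    return xs


def estimate_weights(trajectories: list[list[float]], k: float, q: float, r: float,
                     dt: float, n: int) -> tuple[float, float]:
    """Fit V(x) ~ w1*x**2 + w2*x**4 from the Newton-Leibniz relation
        V(x(t)) - V(x(t+T)) = integral over [t, t+T] of (q*x**2 + r*u**2),  T = n*dt,
    with the integral taken from the sampled data by the trapezoidal rule."""
    rows, rhs = [], []
    for xs in trajectories:
        for s in range(0, len(xs) - n, n):
            seg = xs[s:s + n + 1]
            cost = [(q + r * k ** 2) * x ** 2 for x in seg]  # u = -k*x on the data
            integral = dt * (sum(cost) - 0.5 * (cost[0] + cost[-1]))
            rows.append((seg[0] ** 2 - seg[-1] ** 2, seg[0] ** 4 - seg[-1] ** 4))
            rhs.append(integral)
    # 2x2 normal equations of the least-squares problem, solved by Cramer's rule.
    s11 = sum(d2 * d2 for d2, _ in rows)
    s12 = sum(d2 * d4 for d2, d4 in rows)
    s22 = sum(d4 * d4 for _, d4 in rows)
    t1 = sum(d2 * y for (d2, _), y in zip(rows, rhs))
    t2 = sum(d4 * y for (_, d4), y in zip(rows, rhs))
    det = s11 * s22 - s12 ** 2
    return (t1 * s22 - t2 * s12) / det, (s11 * t2 - s12 * t1) / det


trajs = [simulate(1.0, 1.0, 2.0, x0, 1e-3, 2000) for x0 in (0.5, 1.0, 2.0)]
print(estimate_weights(trajs, k=2.0, q=1.0, r=1.0, dt=1e-3, n=100))
# Model-based check: (q + r*k**2) / (2*(b*k - a)) = 2.5 for these values.
```

Several initial states are used so that the regression rows are not collinear; otherwise the least-squares problem is ill-posed.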
Next, we apply Algorithm 2 to the mathematical model (11) of the thermoacoustic generator, written as follows, with the associated cost function , where , , and  are the initial state, the input, and the output signal, respectively. Based on the HJB equation, we employ the basis function method to approximate the unknown functions as in (18), where  are the numbers of basis functions in the respective function vectors. The optimal weights are determined according to Algorithm 2, where . Furthermore, the terms  are obtained from the collected data as follows: Afterward, the optimal weight parameters are obtained by solving the problem in Algorithm 2, and the control signal is calculated via (18).
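Once the weights are available, the control signal follows from the gradient of the approximated value function. The sketch below assumes a scalar input-affine model with a quadratic input cost r u², so that u = −(1/2) r⁻¹ g(x,t) ∂V/∂x with V(x,t) ≈ Σ w_j φ_j(x,t). All names are hypothetical. The optional lower bound is a naive projection reflecting the positive-input constraint of Remark 2; it is not an optimal constrained law and is not the article's constraint handling.

```python
from typing import Callable, Optional, Sequence


def control_from_weights(
    weights: Sequence[float],
    basis_grad: Callable[[float, float], Sequence[float]],
    g: Callable[[float, float], float],
    r: float,
    x: float,
    t: float,
    u_min: Optional[float] = None,
) -> float:
    """u = -(1/2) * r^{-1} * g(x, t) * dV/dx, with V(x, t) ~ sum_j w_j * phi_j(x, t)."""
    # dV/dx from the weighted gradients of the (hypothetical) basis functions.
    dv_dx = sum(w * d for w, d in zip(weights, basis_grad(x, t)))
    u = -0.5 * g(x, t) * dv_dx / r
    if u_min is not None:
        # Naive projection onto u >= u_min (cf. Remark 2); optimality is not preserved.
        u = max(u, u_min)
    return u


grad = lambda x, t: (2 * x, 4 * x ** 3)  # gradients of the basis [x**2, x**4]
print(control_from_weights([2.5, 0.0], grad, lambda x, t: 1.0, r=1.0, x=1.0, t=0.0))
```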


Remark 3. The convergence of the control policy and the value function in Algorithm 2 is proved following [10] in two steps. First, it is shown that the admissibility of the control input [4] and the positive definiteness of the optimal value function are preserved at each iteration of Algorithm 1. Second, the decrease of the value function over the iterations is established, and the convergence of the control policy and the value function in Algorithm 2 is guaranteed once the errors between the control policies  and , and between the value functions  and , obtained by Algorithms 1 and 2, are estimated. However, unlike [10], the time-varying RL here is fully developed for a practical TAG system (19). Furthermore, unlike the time-varying RL method in [9], which builds on the actor-critic technique of [22], the RL procedure here is implemented with the data collection in Algorithm 2. Additionally, to handle the partial derivative term in the HJB equation of nonautonomous systems, the direct RL control for TAGs keeps the original dynamic model without system transformation, unlike indirect RL methods that study equivalent systems with additional state variables [8, 22].
5. Simulation Results
In this section, the TAG control system is simulated with the following parameters: , the sampling time , the initial state and initial control signal , respectively, the approximation error , the desired state value , and the desired output . To validate the effectiveness of the proposed time-varying RL Algorithm 2, we run two simulations: one using Algorithm 2 (Figures 4–6) and one using a traditional PID control scheme (Figures 7–9). The time-varying RL controller yields better control input, state, and output responses for the TAG than the traditional PID controller. Under the PID controller, the control signal of the TAG (Figure 7) oscillates at high frequency, which in turn causes oscillations in the output and state signals (Figures 8 and 9). In contrast, under the RL algorithm, the control signal and state are stable (Figures 4 and 5), and the average output signal converges to the desired value (Figure 6).
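For readers reproducing the baseline comparison, a textbook discrete PID controller is sketched below. It is a generic form with rectangular integration and a backward-difference derivative on the tracking error; the gains and sampling time are hypothetical, since the PID tuning used for Figures 7–9 is not given here.

```python
from typing import Optional


class DiscretePID:
    """Textbook discrete PID: rectangular integration and a backward-difference
    derivative on the error. Gains and sampling time are hypothetical."""

    def __init__(self, kp: float, ki: float, kd: float, dt: float) -> None:
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error: Optional[float] = None

    def update(self, error: float) -> float:
        """Return the control signal for the current tracking error."""
        self.integral += error * self.dt
        # No derivative term on the first sample: there is no previous error yet.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


pid = DiscretePID(kp=2.0, ki=1.0, kd=0.5, dt=0.1)
print(pid.update(1.0), pid.update(0.5))  # error = desired output minus measured output
```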
6. Conclusion
This article proposes a direct time-varying RL strategy to solve the optimal control problem for nonautonomous TAG systems with unknown model parameters. The PDE-based mathematical model of TAGs is transformed into a time-varying nonlinear dynamical system. Then, by collecting data between two sampling times under the control signals and applying the Newton–Leibniz approximation, the conventional RL technique is modified to handle nonautonomous TAG systems. Numerical simulations and a comparison with a traditional TAG control method verify the high performance of the proposed method.
Data Availability
This publication is supported by multiple datasets, which are available at locations cited in the reference section.
Conflicts of Interest
The authors declare that they have no conflicts of interest.