Abstract

This paper presents a U-model-based adaptive sliding mode control (SMC) using a deep deterministic policy gradient (DDPG) for uncertain nonlinear systems. The proposed methodology consists of a U-model framework and an SMC with a variable boundary layer. The U-model framework forms the outer feedback loop that shapes the overall performance of the nonlinear system, while the SMC serves as a robust dynamic inverter that cancels the nonlinearity of the original plant. In addition, to alleviate the chattering problem while maintaining the intrinsic advantages of SMC, a DDPG network is designed to adaptively tune the boundary layer thickness and the switching gain. From the control perspective, this controller combines the interpretability of the U-model and the robustness of the SMC. From the deep reinforcement learning (DRL) point of view, the DDPG calculates nearly optimal parameters for the SMC based on the current states, maximizing its favourable features while minimizing the unfavourable ones. The simulation results of the single-pendulum system are compared with those of a U-model-based SMC optimized by the particle swarm optimization (PSO) algorithm. The comparison, as well as model visualization, demonstrates the superiority of the proposed methodology.

1. Introduction

The U-model is a generic and systematic control method proposed by Zhu et al. [1]. Different from other model-based and model-free control methods, it is a model-independent method in that it uses the dynamic model of the plant to design the controller, while the final performance is independent of the target plant. In doing so, the U-model provides a general routine to separate the system design and control design processes [2]. The gist of the U-model lies in designing a robust dynamic inverter that transforms the original plant into an identity matrix [2, 3]. This brings about two advantages. First, by cancelling the dynamics and nonlinearity, the overall system performance can be prescribed by a unit negative feedback loop. Second, the phase delay between the control and the output is eliminated, increasing the response speed of the system [4]. Owing to these merits, the U-model has been combined with other control methods and has yielded satisfying outcomes, for example, a U-model-based adaptive neural network [5], U-model-based predictive control [6], and U-model-based fuzzy PID control [7]. However, the conventional U-controller has some drawbacks [8]. First, it does not take disturbances and uncertainties into account. Second, the difficulty of calculating the dynamic inversion in continuous time makes it hard to apply to continuous-time systems. Finally, the complexity of the U-control inverter depends on the target plant itself: if the plant is complex, then the U-model inverter is also hard to calculate. Therefore, finding a robust and simple dynamic inverter that can be concisely applied to continuous-time systems is a critical criterion for the success of the U-model.

Sliding mode control (SMC) is a robust nonlinear control method. Its implementation is usually based on the Lyapunov stability theorem, and it is distinguished from other controllers by its discontinuity [9]. By constructing a sliding mode variable, it forces the state variables of the system to slide to equilibrium along a given trajectory. The outstanding characteristics of SMC are robustness, quick response, and easy implementation [10]. Indeed, SMC is a good complement to U-model control, and the combination of the two methods has attained certain accomplishments [11, 12]. However, its inherently discontinuous control law gives rise to the chattering problem, which has been widely studied [13]. The chattering issue not only impairs the system's performance but also causes damage to physical instruments. Several solutions have been proposed to alleviate the chattering problem of SMC, including fuzzy systems [14], the boundary layer method [15], PID-based methods [16], high-order SMC [17], and a neural network combined with PSO [18]. Nonetheless, these approaches either compromise stability and precision or require tedious craftsmanship and expert experience.

Reinforcement learning (RL) is a model-free methodology that optimizes its actions on large-scale, complex problems through exploration and exploitation without explicit models [19]. Recently, with the development of deep learning, RL has been combined with deep neural networks to solve many control problems [20–22]. Actor-critic learning is one popular framework of RL. Compared with classical Q-learning [23] and deep Q-learning [24], the actor-critic (AC) framework works with continuous state and action spaces [25]. This is realized by using an actor network to output continuous actions and a critic network to estimate the Q-value. This characteristic guarantees AC's potential in combinatorial optimization problems [26–28]. Deep deterministic policy gradient (DDPG) is based on AC and was proposed in 2016 [29]. The appearance of DDPG enables the direct output of continuous actions in the RL realm. As its name suggests, the DDPG outputs a deterministic policy to the agent, with random noise added for exploration. DDPG has been successfully implemented in many control scenarios [30–32].

Based on the above discussion, the U-model framework has the potential to bridge the gap between linear and nonlinear systems, provided that a robust dynamic inverter can be designed. SMC is a special nonlinear control scheme that is highly regarded for its robustness. Although the combination of the U-model and SMC has been implemented and has yielded certain success, the chattering problem of SMC still urgently needs to be solved. Therefore, figuring out how to ease the chattering of SMC while maintaining its desired performance remains challenging. In this paper, we propose a U-model-based adaptive SMC tuned by DDPG to tackle this problem. The parameter tuning problem of SMC is modeled as a combinatorial optimization problem, which is solved by DDPG. During the training phase, the DDPG undertakes exploration and exploitation to automatically learn a near-optimal action based on the current state. By penalizing the tracking error and the DDPG output, the neural network tries to minimize the error at minimal cost. Thus, the proposed adaptive SMC can attenuate the chattering issue without loss of stability or precision. Besides, this method does not require an estimation of the upper bound of the overall disturbance.

The contributions of this paper can be summarized as follows:
(1) An adaptive SMC with a variable boundary layer thickness, implemented as the dynamic inverter and tuned by DDPG, is proposed.
(2) An SMC optimized by the PSO algorithm is provided as the baseline for comparison.
(3) A nonlinear single-pendulum environment revised from Gym [33] is provided for simulation.
(4) Simulation tests are conducted to illustrate the advantages and rationality of the proposed method.
(5) Explainable artificial intelligence (XAI) methods are implemented to explain the trained DDPG model.

The rest of this paper is organised as follows. Section 2 gives some preliminaries about the U-model framework and DDPG. Section 3 articulates the details of the controller calculation step by step, including a conventional SMC, a variable boundary layer thickness, a DDPG network, and an invariant controller. Section 4 presents the simulation results and analysis for two different target trajectories. The analysis focuses on settling time, accuracy, and chattering suppression. In addition, output visualization and the SHAP method are implemented to better understand the trained network. Section 5 gives a brief conclusion as well as suggestions for future work.

2. Preliminaries

2.1. U-Model
2.1.1. U-Model Control Framework

Consider the general U-model control framework shown in Figure 1. is the reference signal; is the output vector; is the error vector. The middle part of the pathway is composed of the invariant controller , the dynamic inverter , and the target plant . The primary task of U-model control is to design a robust dynamic inverter that cancels the dynamics of . In other words, if there exists such that , and is a unit matrix, then the overall system performance is determined only by the invariant controller. Assuming the desired transfer function , the invariant controller can be derived as . Therefore, the U-model-based control framework enables the assignment of system performance using linear system theory, regardless of the nonlinearity of the target plant [34].
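To make this argument concrete, the following is a minimal sketch of the derivation, written with assumed symbols ($G_p$ for the plant, $G_p^{-1}$ for the dynamic inverter, $G_{c1}$ for the invariant controller, $W$ and $Y$ for the reference and the output, and $G$ for the desired closed-loop transfer function), which may differ from the paper's original notation:

\[
G_p^{-1} G_p = I
\;\Longrightarrow\;
\frac{Y(s)}{W(s)} = \frac{G_{c1}(s)}{1 + G_{c1}(s)} = G(s)
\;\Longrightarrow\;
G_{c1}(s) = \frac{G(s)}{1 - G(s)}.
\]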

2.1.2. General U-Model Expression

A SISO continuous-time (CT) polynomial dynamic system can be expressed as follows [35]: where is the output, is the input, and is the time. and are the and orders of the derivatives of and , respectively. is a variable that collects all , and . contains all scalar coefficients. Throughout the study, it is assumed that the polynomial systems are strictly proper , which guarantees the causality of the systems. Accordingly, for linear polynomial systems, strict properness indicates that, when (1) is converted to its Laplace transform, the denominator polynomial has a higher order than the numerator in the resultant transfer function.

Extending (1) to a MIMO expression gives the following: where is a vector containing all outputs.

is an input vector with the power of all inputs. is the order of derivative of that is directly related to , when has the derivative order . is now a matrix, instead of being a scalar. For simplicity and without loss of generality, we will omit the dependent variables.

Consider a generalised continuous-time state-space model given as follows: where is the output, is the input, and . is the dynamics of the system that updates the state variables, and calculates the system output. Extending it to a MIMO state-space expression gives [7]: where and are time-varying parameters, and is a smooth mapping from the state vector to a specific output. Take the following system as an example:

Converting it to a U-model expression based on the following absorbing rule: where , .

2.1.3. U-Model Dynamic Inversion

The U-model dynamic inversion is calculated using the following equation [36]: where is the desired output vector and is a null vector. The prerequisites for the solution to exist are external stability and the system being minimum phase. Reconsidering (4), apply the derivative to the output with respect to :

Replace with a polynomial equation and we have

Repeat the above derivative and replacement procedures times. Rearranging the equation and combining similar terms gives the following: where is the vector of state variables and is a function of . Consequently, we need to solve the equation set (10) to retrieve the final control.

2.2. DDPG

Deep deterministic policy gradient (DDPG) is an off-policy algorithm built on the deterministic policy gradient proposed in 2014 by Silver et al. [37]. DDPG is based on the actor-critic framework, which learns an action network and a Q network simultaneously. At every step, given the current state, the actor network outputs an action, to which random noise is added. By executing the action, the agent receives a reward from the environment, and the critic network is updated accordingly using the temporal difference (TD) algorithm. The Q-value output by the critic network in turn guides the update of the actor network using the policy gradient algorithm [37]. To increase the stability of learning, a target actor network and a target critic network are added. They are updated through a soft update, namely a weighted average of the main networks and themselves. Besides, a replay buffer storing transition states and actions is implemented to increase the efficiency of sample utilization. Because of this configuration, DDPG can deal with situations in which both the action space and the state space are continuous [29]. The pseudo code of DDPG is shown in Algorithm 1.

(1)Initialize policy network , critic network and empty replay buffer
(2)Set target policy network and target critic network , with ,
(3)repeat
(4)   Observe state s and execute action a = clip , where
(5)   Observe next state s’, reward r, and done signal d to indicate whether s’ is terminal
(6)   Store (s, a, r, s’, d) in the replay buffer
(7)   If s’ is terminal, reset environment state
(8)   if it is time to update then
(9)        for the number of updates do
(10)            Randomly sample a batch of transitions, B = (s, a, r, s’, d) from
(11)            Compute targets
(12)            Update Q-function by one step of gradient descent using
(13)            Update policy by one step of gradient ascent using
(14)            Update target networks with
(15)        end for
(16)     end if
(17)   until convergence
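For concreteness, the following is a minimal PyTorch sketch of the update steps in Algorithm 1 (targets, Q-function descent, policy ascent, and soft update). The network sizes, learning rates, and the scaling of the actor output are illustrative assumptions, not the exact settings used in the paper.

import copy
import torch
import torch.nn as nn

def mlp(sizes, out_act=nn.Identity):
    # Small helper: fully connected network with ReLU hidden layers.
    layers = []
    for i in range(len(sizes) - 1):
        act = nn.ReLU if i < len(sizes) - 2 else out_act
        layers += [nn.Linear(sizes[i], sizes[i + 1]), act()]
    return nn.Sequential(*layers)

state_dim, act_dim, act_limit = 4, 2, 20.0              # illustrative dimensions and bounds
actor = mlp([state_dim, 64, 64, act_dim], nn.Sigmoid)   # policy network mu(s), outputs in (0, 1)
critic = mlp([state_dim + act_dim, 64, 64, 1])          # Q network Q(s, a)
actor_targ, critic_targ = copy.deepcopy(actor), copy.deepcopy(critic)
pi_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
q_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma, tau = 0.99, 0.005

def ddpg_update(batch):
    s, a, r, s2, d = batch                               # tensors sampled from the replay buffer
    with torch.no_grad():                                # line 11: y = r + gamma*(1-d)*Q'(s', mu'(s'))
        a2 = act_limit * actor_targ(s2)
        y = r + gamma * (1 - d) * critic_targ(torch.cat([s2, a2], dim=1))
    q = critic(torch.cat([s, a], dim=1))
    q_loss = ((q - y) ** 2).mean()                       # line 12: gradient descent on the Q-function
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    pi_loss = -critic(torch.cat([s, act_limit * actor(s)], dim=1)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()    # line 13: gradient ascent on Q(s, mu(s))

    with torch.no_grad():                                # line 14: soft (Polyak) update of target networks
        for net, targ in ((actor, actor_targ), (critic, critic_targ)):
            for p, p_targ in zip(net.parameters(), targ.parameters()):
                p_targ.mul_(1 - tau).add_(tau * p)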

3. Controller Design

This section describes the details of the conceptual framework of the controller, which will be implemented in the following simulation. The combined controller is composed of a U-model controller, a sliding mode controller with a variable boundary layer, and a DDPG network.

3.1. Framework Overview

Figure 2 illustrates the workflow of the proposed methodology. represents the reference signal, where is the degree of freedom (DoF) of the system. is the output vector, and represents the error vector. is the invariant controller, and is the original target dynamic plant. Between them is the dynamic inverter, realized by a sliding mode controller, which outputs the control vector . The parameters of the SMC are calculated by the DDPG module at every time step. Based on the gist of the U-model, if the dynamic inverter successfully performs the dynamic inversion regardless of uncertainty, then the nonlinearity of the plant is cancelled, which means . In this case, the system is equivalent to a unit negative feedback loop, whose performance is assigned by the transfer function of . In the remainder of this section, the details of the SMC, the DDPG, and will be introduced.

3.2. Sliding Mode Control with Variable Boundary Layer

In this section, we consider a general second-order dynamic system expression, based on which the SMC with variable boundary layer is designed using backstepping and the Lyapunov stability theorem.

3.2.1. Conventional Sliding Mode Control with Backstepping

For simplicity and without loss of generality, consider the following single-input-single-output (SISO) dynamic model: where are the state variables, is the control input, and is the overall disturbance. and are time-varying functions dependent on the state variables. For simplicity, the expressions hereafter omit the dependent variables.

Assumption 1. The overall disturbance is bounded and satisfies , in which is a finite positive scalar.

Remark 1. In practice, the disturbance is often closely related to the state variables of the system. For example, viscous friction is a function of velocity. Since the state variables usually have bounds in practice, the total disturbance is usually bounded as well. Similar assumptions have been made in many control theory scenarios [38, 39].
The virtual control variables are constructed as follows: where forms the reference signal, is a positive constant, and are virtual control variables. The first partial Lyapunov function is constructed as follows: Taking the derivative of (13) and combining it with (12), we have According to the Lyapunov stability theorem, since , if , then , and can converge to the equilibrium asymptotically. Regarding the subsystem of , design a second partial Lyapunov function as follows: Taking the derivative of (15) and combining it with (11)–(14), we have Select as the sliding mode variable and design the control input as where is the sign function, is a positive scalar, and is called the switching gain.
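The following is a hedged sketch of a standard backstepping SMC design for the second-order system (11), written with assumed symbols ($e_1$ for the tracking error, $\alpha$ for the virtual control, $s$ for the sliding mode variable, positive constants $c_1$, $c_2$, and switching gain $K$), which may differ from the paper's original notation:

\[
\begin{aligned}
& e_1 = x_1 - x_d, \qquad \alpha = \dot{x}_d - c_1 e_1, \qquad s = x_2 - \alpha,\\
& V_1 = \tfrac{1}{2} e_1^2, \qquad \dot{V}_1 = e_1 \dot{e}_1 = -c_1 e_1^2 + e_1 s,\\
& V_2 = V_1 + \tfrac{1}{2} s^2, \qquad \dot{V}_2 = -c_1 e_1^2 + s\left(e_1 + f + b u + d - \dot{\alpha}\right),\\
& u = \frac{1}{b}\left(-f + \dot{\alpha} - e_1 - c_2 s - K\,\mathrm{sgn}(s)\right)
\;\Longrightarrow\;
\dot{V}_2 = -c_1 e_1^2 - c_2 s^2 + s d - K|s| \le -c_1 e_1^2 - c_2 s^2 \quad \text{for } K \ge D,
\end{aligned}
\]

where $D$ denotes the disturbance bound of Assumption 1.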

3.2.2. Stability Analysis

Theorem 1. For a general second-order dynamic system as described in (11) and implementing a controller in (17), the system has asymptotic stability in the Lyapunov sense.

Proof. Substituting (17) into (16), we have According to Assumption 1, , so and . Based on the Lyapunov asymptotic stability theorem [40], and will converge to the equilibrium asymptotically. Therefore, the state variables and will follow the desired trajectory.

3.2.3. Adding Variable Boundary Layer

The switching gain plays a critical role in determining the performance of the SMC. If is large, the system converges quickly, but the chattering problem is exacerbated because of the discontinuity of the sign function. To alleviate the chattering issue, the sign function is replaced with a saturation function as follows [10]: where is the sliding mode variable and is the thickness of the boundary layer. Thus, (17) becomes

The introduction of the boundary layer constructs a region in which the controller outputs a continuous torque so that the system trajectory is smoother. However, the alleviation of chattering comes at the cost of lowered control accuracy. Selecting the optimal pair of relies largely on human experience.
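As a small illustration of (19) and (20), the saturation function and the boundary-layer control law can be coded as follows; the function and parameter names (and the control-law form, which follows the backstepping sketch above) are assumptions rather than the paper's exact implementation:

import numpy as np

def sat(s, phi):
    # Saturation function (19): linear inside the boundary layer |s| <= phi, +/-1 outside.
    return np.clip(s / phi, -1.0, 1.0)

def smc_torque(e1, s, f, b, dalpha, c2, K, phi):
    # Boundary-layer SMC (20): sgn(s) in (17) replaced by sat(s/phi).
    return (-f + dalpha - e1 - c2 * s - K * sat(s, phi)) / b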

3.3. DDPG Network Module

The purpose of introducing the DDPG module is to adaptively select the optimal pair of for the sliding mode controller so that, when the error is large, the system converges quickly, whereas when the system is near equilibrium, the chattering can be alleviated while ensuring the same level of accuracy. The DDPG belongs to the actor-critic framework of RL and can therefore deal with situations in which both the state and the action are continuous. In this paper, instead of directly outputting , it outputs , where is the inclination of the saturation function, as shown in Figure 19.

The actor and critic networks are constructed as fully connected neural networks. The details of the network structure are shown in Figures 4 and 5. The state is a vector , where are the state variables and are the corresponding errors. The output of the actor network is a vector . The action ranges of these two outputs are set to . The actor network takes the state vector as the input, which then passes through three fully connected hidden layers with 64 nodes each. The activation functions of the first two layers are the Rectified Linear Unit (ReLU) function, and the last layer uses the Sigmoid function to map every dimension of the output to . The critic network takes the concatenation of the state and the action as the input. It also uses several fully connected layers to process the information and then outputs a scalar, called the state-action value . This is an estimate of how good the action is given the current state.
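The following is a minimal PyTorch sketch of the two networks as described above (fully connected layers, ReLU activations, a Sigmoid output for the actor, and a concatenated state-action input for the critic); the output scaling and dimensions are assumptions for illustration:

import torch
import torch.nn as nn

class Actor(nn.Module):
    # State (theta, dtheta, e, de) -> two SMC parameters, each mapped to (0, act_limit).
    def __init__(self, state_dim=4, act_dim=2, act_limit=20.0):
        super().__init__()
        self.act_limit = act_limit
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, act_dim), nn.Sigmoid(),   # maps every output dimension to (0, 1)
        )

    def forward(self, s):
        return self.act_limit * self.net(s)

class Critic(nn.Module):
    # Concatenated (state, action) -> scalar state-action value Q(s, a).
    def __init__(self, state_dim=4, act_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + act_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))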

3.4. Invariant Controller

An invariant controller is utilized in the U-model to assign the system performance using linear system techniques. It can also be viewed as the implementation of a smooth transition process. Ideally, we hope the system can converge to equilibrium without oscillation or overshoot. A critically damped second-order differential equation meets these requirements and is therefore designed as the invariant controller. The ideal closed-loop transfer function can be written as follows: where is the damping ratio and is the natural frequency; they are chosen according to the required steady-state error and settling time. The invariant controller can then be calculated as follows:
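As a worked instance under assumed notation (damping ratio $\zeta$, natural frequency $\omega_n$, and the invariant controller taken as $G_{c1} = G/(1-G)$, an assumption consistent with the sketch in Section 2.1.1):

\[
G(s) = \frac{\omega_n^2}{s^2 + 2\zeta\omega_n s + \omega_n^2}, \qquad \zeta = 1 \ \text{(critically damped)},
\qquad
G_{c1}(s) = \frac{G(s)}{1 - G(s)} = \frac{\omega_n^2}{s^2 + 2\zeta\omega_n s}
= \frac{\omega_n^2}{s\,(s + 2\omega_n)}.
\]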

4. Simulation

4.1. Dynamic Model Establishment

The simulation is implemented on a single pendulum. The structure graph is shown in Figure 6.

is the angle of the pendulum, is the length, is the mass, and is the gravitational acceleration. For simplicity and without loss of generality, we assume that the mass of the pendulum is concentrated at the end of the link, and therefore the dynamic equation can be derived as follows: where is the angular acceleration, is the angular velocity, is the control torque, is the disturbance, and is the damping coefficient. Comparing (23) with (11), we have

Substituting (24) into (20), we obtain the sliding mode controller for the single-pendulum dynamics as follows:
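To show how (23)–(25) fit together in simulation, the following is a hedged Python sketch of one Euler integration step with the boundary-layer SMC; the dynamics $m l^2 \ddot{\theta} + k\dot{\theta} + m g l \sin\theta = \tau + d$, the parameter values, and all variable names are assumptions standing in for (23) and Table 2:

import numpy as np

m, l, g, k, dt = 1.0, 1.0, 9.81, 0.1, 0.005       # illustrative parameters, not the Table 2 values

def step(theta, dtheta, theta_d, dtheta_d, ddtheta_d, K, phi, c1=5.0, c2=5.0):
    e1 = theta - theta_d                            # angular tracking error
    alpha = dtheta_d - c1 * e1                      # virtual control from the backstepping design
    s = dtheta - alpha                              # sliding mode variable
    dalpha = ddtheta_d - c1 * (dtheta - dtheta_d)
    f = (-k * dtheta - m * g * l * np.sin(theta)) / (m * l**2)
    b = 1.0 / (m * l**2)
    sat = np.clip(s / phi, -1.0, 1.0)
    tau = (-f + dalpha - e1 - c2 * s - K * sat) / b    # SMC torque, cf. (25)
    d = np.random.uniform(-0.1, 0.1)                # bounded random disturbance
    ddtheta = (tau + d - k * dtheta - m * g * l * np.sin(theta)) / (m * l**2)
    return theta + dt * dtheta, dtheta + dt * ddtheta, tau, s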

4.2. DDPG Training Procedure

The training of DDPG follows the procedure of Algorithm 1, and the training-related parameters are shown in Table 1. The network is trained for 5000 episodes, with 200 time steps per episode. Due to limited computational resources, the sampling time during training is set to . The initial position of the pendulum is randomly set between rad, and the initial velocity between rad/s. The target position, velocity, and acceleration during training are all set to 0. We later verify that, although the training target is simplistic, the model learns an effective policy that generalizes to time-varying trajectories (e.g., a sine wave). The state transition reward is designed as follows:

The reward is composed of three parts. takes a form similar to the sliding mode variable. An absolute-value operation is applied to the angular error and the velocity error, which implies a stricter punishment than the sliding mode variable because it requires both of them to converge to 0 simultaneously. Furthermore, there are slight adjustments to the weighting of the two terms. Since the primary goal is to track the desired position, we want the reward to guide the model to pay more attention to the angular error than to the velocity error. calculates the inclination of the saturation function, and penalizes the interference of DDPG with the SMC controller. The final constant coefficients are determined through experiments. Intuitively, the model is instructed not to output unnecessarily large parameters to the SMC controller, thereby preventing chattering. Figure 7 illustrates the episode reward during training. We can see that, as training proceeds, the network gradually finds a nearly optimal policy and maximizes the episode reward.
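A minimal sketch of such a three-part reward is given below; the weights and exact functional forms are illustrative placeholders, since the paper's final coefficients are determined experimentally and are not reproduced here:

def reward(e, de, K, incline, w1=1.0, w2=0.5, w3=0.01, w4=0.01):
    # Three-part reward: tracking penalty, saturation-inclination penalty, output penalty.
    r1 = -(w1 * abs(e) + w2 * abs(de))   # absolute errors, angle weighted more than velocity
    r2 = -w3 * incline                   # penalize the inclination K/phi of the saturation function
    r3 = -w4 * K                         # penalize unnecessarily large DDPG outputs to the SMC
    return r1 + r2 + r3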

4.3. PSO Tuning Procedure

PSO is a meta-heuristic global optimization method. It treats the model as a black box and tries to find optimal solutions through its inputs and outputs. The update rule of PSO borrows ideas from swarm intelligence: by interacting with each other, all particles strike a balance between exploration and exploitation. PSO has been proven to be simple in rationale but effective in practice. In this paper, we use PSO to find the optimal parameters of the conventional SMC for given reference signals, which are then used as the baseline for comparison. The objective of PSO is to find a set of that produces the minimum value of a cost function . In this paper, the cost function for PSO to optimize is set as follows: which is the sum of the absolute sliding mode variable over one episode. The SMC parameter output by PSO is in (17). For a general PSO framework with particles in the swarm, the positions of all particles at time step can be represented as . The position update can be written as follows:

Here, is the velocity vector, which integrates the local best (the best position that the particle has ever visited) and the global best (the best position that any particle has ever visited). are the cognitive and social learning coefficients, which are related to the local and global optima, respectively. is the inertia weight, which adjusts the importance of the previous velocity. The pseudo code of PSO is shown in Algorithm 2.

(1)Initialization.
(2)for each one of the N particles do
(3)   Initialize the position and velocity
(4)   Set particles’ best position as current
(5)   Calculate the fitness of each particle and set the optimum as the
(6)end for
(7)Update.
(8)while stopping criterion not met do
(9)   Update particles’ velocity using
(10)   Update particles’ position using
(11)   Evaluate the fitness of all particles
(12)   If , update individual best
(13)   If , update global best
(14)end while
(15)Output best results
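The following is a compact Python sketch of Algorithm 2; the swarm size, iteration count, coefficients, and search bounds are illustrative, and episode_cost stands in for the cost function J defined above:

import numpy as np

def pso(episode_cost, dim=2, n_particles=20, iters=50, w=0.7, c1=1.5, c2=1.5, lo=0.1, hi=50.0):
    rng = np.random.default_rng(0)
    x = rng.uniform(lo, hi, (n_particles, dim))          # positions (candidate SMC parameters)
    v = np.zeros_like(x)                                 # velocities
    pbest = x.copy()                                     # individual best positions
    pbest_cost = np.array([episode_cost(p) for p in x])
    gbest = pbest[pbest_cost.argmin()].copy()            # global best position
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)   # inertia + cognitive + social
        x = np.clip(x + v, lo, hi)
        cost = np.array([episode_cost(p) for p in x])
        improved = cost < pbest_cost                     # update individual bests
        pbest[improved], pbest_cost[improved] = x[improved], cost[improved]
        gbest = pbest[pbest_cost.argmin()].copy()        # update global best
    return gbest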
4.4. Parameter Initialization

The parameters of single-pendulum dynamics are shown in Table 2. The parameters related to DDPG and sliding mode controller are shown in Table 1.

During testing, two target trajectories are assigned. One is a constant value target , and the other is a time-varying sine wave target . The sampling time is , and one episode takes 4000 steps. The disturbance is a random value between . For comparison, the initial state of the system is fixed to .

4.5. Results and Evaluation

In this section, testing is carried out using a constant value target and a sine wave target, respectively. The results, including the torque profile, angle profile, model outputs, sliding mode variable, and velocity error, are presented with analysis and discussion. The baseline is the conventional sliding mode control optimized by the PSO algorithm, and the optimization function is the sum of the absolute sliding mode variables over all time steps, namely . For the labels inside the figures, we use “USMC” to represent the U-model-based SMC and “USMC-RL” to represent the U-model-based adaptive SMC using DDPG. Through testing, the forward calculation time of our deep learning model is only about 600 , which satisfies real-time requirements.

4.5.1. Constant Value Target

Figures 8–19 show the simulation result on the tracking control of the constant value target . The optimal for the conventional SMC is given by PSO. The sampling time is 0.005 s and the simulation lasts for 2400 steps.

For the sliding mode variable in Figure 8, we can see that the USMC-RL renders a shorter settling time than the USMC and that the chattering issue is greatly alleviated in the vicinity of the equilibrium. While the chattering amplitude of the USMC is about 0.07, the output of the USMC-RL remains smooth. Figure 9 shows the angle output and confirms that the USMC-RL converges more quickly than the USMC. Figure 10 shows the velocity error profile. We can see that, at the beginning, the velocity error of the USMC-RL is greater than that of the USMC. This is because the faster response of the adaptive SMC produces a higher speed when converging to the equilibrium. This can also be seen in the torque profile in Figure 11, where the USMC-RL exhibits severe chattering upon arriving at the equilibrium. Besides, it is obvious that the USMC renders a much higher chattering amplitude (around 20 Nm) near the equilibrium. The model output in Figure 19 illustrates that, when the error is considerable at the beginning, the outputs of the model are near the upper bound of 20; however, when the system is near the equilibrium, the outputs are small (only about 2.5). This means that the model successfully learns to speed up the convergence while alleviating chattering according to the current situation. It also explains the chattering of the USMC-RL upon arriving at the equilibrium: because of the continuity of the model, it cannot jump from a very high output (20) to a very small output (2.5). Therefore, when it should produce a small output, it is still on its way down.

4.5.2. Sine Wave Target

Figures 13–18 show the simulation result of the tracking control of a sine wave target . The optimal for the conventional SMC is given by PSO. The sampling time is 0.005 s and the simulation lasts for 2400 steps.

The analysis of the results for the sine wave target is similar to that of the constant value target. For the sliding mode variables in Figure 13, we can see that the USMC-RL renders a shorter settling time than the USMC and that the chattering issue is greatly alleviated in the vicinity of the equilibrium. While the chattering amplitude of the USMC is about 0.03, the output of the USMC-RL remains smooth. Figure 14 shows the angle output. We can see that the outputs of both controllers follow the reference signal with some slight lag. This lag is caused by the invariant controller of the U-model. A similar lag is also observed in Figure 15. The same overshoot is also shown in Figure 15, which is induced by the chattering of the USMC-RL, as shown in Figure 16. Figure 18 illustrates a periodic pattern of the model outputs: the outputs are elevated when the absolute reference acceleration reaches its peaks.

4.6. DDPG Output Visualization and Interpretation

Visualization is a simple but direct technique to interpret a deep learning model [41]. To justify the effectiveness of the learned policy, part of the output of the actor network is visualized. Figures 18 and 19 illustrate the and outputs of the DDPG when the state of the pendulum is . The x axis is the angular error, ranging from rad to 2 rad, and the y axis is the velocity error, ranging from rad/s to 2 rad/s. It is clear that both outputs present a similar valley shape. When the errors are small, the outputs are small; when the errors grow larger, the outputs become larger. This fits the intuition because larger outputs mean larger control gains and, consequently, quicker responses and higher accuracy. We think this indicates the correctness of the learned policy. However, we can also see that the outputs along the line are smaller than , which is counterintuitive: if the angular error and the velocity error have the same sign, the error will keep increasing, and it is better to output larger values. We believe this is because, when , the velocity forces the absolute angular error to decrease, which gives a false stimulus to the model, spurring it to output larger values when . We believe this is caused by inadequate exploration.

4.7. DDPG Interpretation Using SHAP

While visualization renders a holistic picture of the model output, the information it provides is only qualitative. To understand the model behaviour in a quantitative manner, we implement the SHAP method [42] for model explanation, in terms of both global and local explanations. First, an introduction to SHAP is given, and then the global and local explanations are presented.

SHAP (SHapley Additive exPlanations) is a model-agnostic, post hoc explainable artificial intelligence (XAI) method. It explains the final output as the sum of attributions from all inputs. In this way, it can explain the influence of each input on the output, in a positive or negative direction. It borrows ideas from Shapley values [43] to calculate the marginal contribution of each input. To calculate Shapley values, we have to retrieve “background data” through sampling. Similar to the Monte Carlo method, the more data points we sample, the more accurate the estimation is. In this paper, the sampling ranges of the four input features are set as follows:

We took a uniform sample of 5,000 points from this domain, combined them with the respective model outputs, and formed our background data. Figure 20 illustrates the distribution of our background data using a box plot. The mean values of four input features are , and the corresponding standard deviations are .
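The SHAP computation can be sketched as follows; the feature ranges, sample sizes, and the stand-in actor network are assumptions for illustration, and only the first DDPG output is explained here for brevity:

import numpy as np
import shap
import torch

actor = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 2), torch.nn.Sigmoid())  # stand-in for the trained actor

rng = np.random.default_rng(0)
low, high = np.full(4, -2.0), np.full(4, 2.0)              # assumed sampling ranges of the 4 features
background = rng.uniform(low, high, size=(5000, 4))        # 5,000 uniformly sampled background points

def actor_first_output(x):
    # Black-box wrapper: first DDPG output given (angle, velocity, angular error, velocity error).
    with torch.no_grad():
        return actor(torch.as_tensor(x, dtype=torch.float32)).numpy()[:, 0]

explainer = shap.KernelExplainer(actor_first_output, background[:100])
sv = explainer.shap_values(background[:200])               # Shapley values, shape (200, 4)
global_importance = np.abs(sv).mean(axis=0)                # global: mean(|SHAP|) per input feature
shap.force_plot(explainer.expected_value, sv[0], background[0])   # local: force plot of one state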

(Global explanation). The global explanation focuses on gauging how much influence each input feature has on each of the model outputs. This is carried out by calculating the expectation of the absolute Shapley value of each input feature with respect to the outputs. Using the background data, we obtain Figure 21, which shows the estimated Shapley value of each input feature with respect to the two model outputs . The feature names are listed in descending order of Shapley value from top to bottom. In other words, the velocity error influences the outputs the most among all features. This fits the intuition because the velocity range is larger than the angle range, and the velocity error accounts for the dynamic process. The second most influential feature is the angular error, and the last two features are the angle and the velocity. We can see that the first two features are related to the error information, which determines the robustness term of the SMC controller, while the last two features are more related to the nominal control term. This result coincides with the rationale of the proposed method, because the model outputs determine the robustness term of the SMC and should therefore depend mainly on the errors.

(Local explanation). The local explanation quantitatively interprets why the model outputs certain values compared with the base value (the base value refers to the mean output over all the background data). Here, we take two extreme examples to show how our DDPG model can ease the chattering without compromising the settling time and accuracy. Figures 22 and 23 show the force plots of the two DDPG outputs with the input features being . We say this input is “under low error” because the errors are 0 and the system is in self-equilibrium. We can see from the figures that all four features drag the outputs from the base values (13.87 and 14.13) down to very low values (2.28 and 2.07). This behaviour fits common sense: since the system is now stable, the DDPG only needs small outputs to reject disturbances. In this way, the chattering issue is eased.

The other extreme is when the errors are large. Consider the initial condition of our simulations, which is . Figures 24 and 25 show the force plots under this situation. We can see that the velocity and the velocity error drag the outputs to lower values, while the angle and the angular error drag them higher. The reasons are twofold. First, the low values of the velocity and the velocity error suggest that no large outputs are required. Second, the high values of the angle and the angular error require larger outputs to stabilize the system. In this case, the system can converge quickly.

In summary, the analysis results of the SHAP method show that our DDPG model can make rational decisions: when the errors are large, it outputs large values to converge quickly; when the errors are small, it outputs small values to reduce chattering.

5. Conclusion

An adaptive sliding mode controller based on a deep deterministic policy gradient (DDPG) is proposed and combined with U-model control in this paper for the tracking control of uncertain nonlinear systems. The proposed methodology successfully integrates the simplicity of the U-model and the robustness of SMC. Besides, the online tuning by DDPG results in lower chattering without compromising settling time or accuracy compared with a conventional sliding mode controller. Simulation on the single pendulum demonstrates its superiority. We think that RL and SMC complement each other. First, from the controller's point of view, the implementation of RL is an alternative way to transform original controllers into adaptive controllers. Compared with typical adaptive controller design methods, RL can reduce manual effort and find near-optimal settings automatically. Second, combining RL with SMC preserves some explainability and robustness. On the one hand, while we cannot understand the DDPG output directly, we can obtain an intuitive understanding by combining it with the formula of the SMC. On the other hand, RL algorithms tend to overfit, and the robustness of SMC can help overcome this drawback. Even when the real environment and dynamics are not exactly the same as in training (due to wear, tear, hysteresis, etc.), the SMC maintains stability over a certain range, which prevents the RL policy from failing.

However, compared with other adaptive methods, some challenges remain for the RL-based SMC. First, while reducing human craftsmanship, the RL algorithm requires a large amount of data for training, which may cause wear and tear on machines in practice. Many methods can be implemented to increase data efficiency and accelerate convergence, e.g., model-based reinforcement learning and imitation learning for warm-starting. Second is the generalization problem: RL algorithms tend to overfit to specific scenarios. How to bridge the gap between simulation and reality, as well as how to transfer the model to another scenario, are open questions.

For future work, researchers may implement maximum entropy reinforcement learning to fully explore the potentially larger action space and avoid converging to a local optimum. In addition, to accommodate real-life problems, this methodology can be extended to multi-input-multi-output (MIMO) systems using multi-agent reinforcement learning (MARL). Moreover, the implementation of DDPG in this paper is only a simple version; more refined designs are needed to maximize the potential of reinforcement learning in future research. Last but not least, the DDPG-based method should be compared with other typical adaptive SMC controllers.

Data Availability

The simulation data are generated using Python, specifically sine wave functions and random values. The source code can be accessed at https://github.com/AndyRay1998/RL-SMC-U.

Conflicts of Interest

The authors declare that there are no conflicts of interest.