Abstract

A novel backstepping control scheme based on reinforcement fuzzy Q-learning is proposed for the control of container cranes. In this scheme, the modified backstepping controller can handle the underactuated dynamics of a container crane, and its control gain is tuned by a reinforcement fuzzy Q-learning mechanism that automatically searches for the optimal fuzzy rules to achieve a decrease in the value of the Lyapunov function. The effectiveness of the proposed control scheme was verified by a simulation in Matlab, and its performance was compared with that of a conventional sliding mode controller designed for container cranes. The simulation results indicate that the proposed control scheme achieves satisfactory performance for step-signal tracking under an uncertain rope length.

1. Introduction

A robotic container crane is a robot that lifts cargo off the ground with ropes and then carries it to a designated location. As the robot carries the cargo to its destination, the swing angle of the rope must be stabilized; excessive swinging may cause the cargo to fall or even roll over. Therefore, an appropriate control strategy is needed to ensure that the robot responds quickly to the reference command while suppressing the amplitude of the rope's sway angle. The uncertainties of the system, resulting from uncertain system parameters, pose challenges to the design of the controller, and previous control techniques relying on exact models have exhibited certain limitations [1–3]. Many control strategies aimed at uncertain systems have been proposed for this problem, such as sliding mode control [4–8], fuzzy control [9, 10], adaptive control [11, 12], and fuzzy PID control [13, 14].

Reinforcement learning (RL) is a learning method that gradually explores the optimal policy by interacting with the environment [15]. In reinforcement learning, the objective is usually to maximize the cumulative reward or minimize the cumulative cost over the entire learning process. A typical reinforcement learning process can be described as follows. The agent adopts an action in the initial state according to the current policy, and the adopted action transfers the system from the current state to the next state with a certain probability. The agent then repeatedly adopts actions, transferring the system from state to state until the end of learning. In this process, each action that transfers the system from the current state to the next state is evaluated with a reward or cost, also called the instant reward or cost. The instant rewards/costs of the actions taken in all visited states can then be used to dynamically explore the optimal policy, i.e., the policy of adopting actions that maximizes the reward or minimizes the cost over the entire process, which can be accomplished by temporal difference (TD) methods such as Q-learning [16] and SARSA [17].
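As a minimal illustration of the temporal-difference idea outlined above, the following Python sketch implements generic tabular Q-learning; the environment interface (env.reset, env.step) and the sizes of the state and action spaces are hypothetical and are not part of the crane controller developed later.

```python
import numpy as np

def tabular_q_learning(env, n_states, n_actions, episodes=200,
                       alpha=0.5, gamma=0.9, epsilon=0.1):
    """Generic tabular Q-learning on a hypothetical discrete environment."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, reward, done = env.step(a)
            # temporal-difference update toward the bootstrapped target
            td_target = reward + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q
```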

When reinforcement learning is applied to the control of continuous systems, the problem of the "curse of dimensionality" is inevitably encountered: the number of discrete states that the agent would have to visit grows without bound. Therefore, fuzzy logic can be used to fuzzify the system states, allowing the application of reinforcement learning methods originally developed for discrete systems [18].

In recent years, fuzzy reinforcement learning theory has been applied to the control of nonlinear systems. In a recent approach [19], for the coordinated control of multiple manipulators, a reinforcement learning method was used to deal with the uncertainties of the dynamic models. This approach minimized both the trajectory-tracking errors and the control effort of each robot, thereby resolving the inconsistencies between different manipulators. In [20], the control law was produced by a reinforcement learning mechanism in which the actions corresponding to each state were set to satisfy the stability requirement, and a neural network was used to alleviate the "curse of dimensionality."

In this paper, a modified backstepping controller is proposed for the underactuated system of robotic container cranes. The control gain of this controller is important because it influences the convergence of the tracking errors. However, it is difficult for the designer to obtain appropriate values of the control gain empirically, because such experience is expensive or even unavailable in practice. Therefore, fuzzy Q-learning is applied to automatically search for the optimal fuzzy rules that output appropriate values of the control gain. More precisely, the value of the Lyapunov function is used to judge the applied actions: control gains that result in a decrease of the Lyapunov function are given a high reward, and vice versa. The control target is thus achieved by the fuzzy Q-learning mechanism, which, after a suitable learning process, obtains the optimal fuzzy rules that output appropriate control gains for the applied controller, reducing the value of the Lyapunov function and thereby achieving the convergence of the tracking errors. The rest of this paper is organized as follows: in Section 2, a nonlinear dynamic model of robotic container cranes is established by the Lagrangian method. In Section 3, a reinforcement fuzzy Q-learning-based backstepping control scheme is detailed to control the position of the load and stabilize the sway angle of the rope; the stability proof is also presented in this section. In Section 4, a simulation is conducted to verify the effectiveness of the applied controller, and its performance is compared with that of a conventional sliding mode controller. The conclusion is given in Section 5.

2. Dynamical Model of the Robot

The robotic container crane model is shown in Figure 1, where x is the horizontal displacement of the robot; θ is the load swing angle; m1 and m2 are the masses of the robot body and the load, respectively; and L and F are the length of the rope and the driving force of the robot, respectively.

Assuming that the entire system is friction-free and that the rope is massless and undergoes no elastic deformation, the kinetic energy T and the potential energy U of the robot system can be expressed as follows, where g is the local gravitational acceleration. Applying the Lagrangian equation with the generalized control input of the system, the dynamic equations of the container crane are obtained.
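Because the display equations are not reproduced here, the following LaTeX block restates the standard planar gantry-crane model (constant rope length, massless rope, no friction) that matches the description above; the notation and numbering of the original equations (1)–(3) may differ.

```latex
% Kinetic and potential energy of the trolley-load system
T = \tfrac{1}{2} m_1 \dot{x}^2
  + \tfrac{1}{2} m_2 \left( \dot{x}^2 + 2 L \dot{x}\dot{\theta}\cos\theta
  + L^2 \dot{\theta}^2 \right), \qquad
U = - m_2 g L \cos\theta .

% Lagrangian equation with generalized coordinates q = [x, \theta]^T
\frac{\mathrm{d}}{\mathrm{d}t}\frac{\partial \mathcal{L}}{\partial \dot{q}}
  - \frac{\partial \mathcal{L}}{\partial q} = Q ,
\qquad \mathcal{L} = T - U , \qquad Q = [\, F ,\ 0 \,]^{\top} .

% Resulting dynamic equations
(m_1 + m_2)\,\ddot{x} + m_2 L \ddot{\theta}\cos\theta
  - m_2 L \dot{\theta}^{2}\sin\theta = F , \qquad
L\ddot{\theta} + \ddot{x}\cos\theta + g\sin\theta = 0 .
```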

The above dynamic equations can be rewritten in state-space form as follows:

The dynamics of a container crane, which is shown in equation (4), can also be presented in the block diagram as shown in Figure 2.
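To make the state-space form concrete for simulation, a minimal Python sketch is given below; it assumes the standard model stated above with state vector [x, ẋ, θ, θ̇], and the parameter values are placeholders rather than the ones listed in Table 1.

```python
import numpy as np

def crane_dynamics(state, F, m1=10.0, m2=5.0, L=1.0, g=9.81):
    """Right-hand side of the state-space model; state = [x, x_dot, theta, theta_dot]."""
    x, x_dot, theta, theta_dot = state
    s, c = np.sin(theta), np.cos(theta)
    # trolley acceleration obtained by solving the coupled Lagrangian equations
    x_ddot = (F + m2 * s * (L * theta_dot**2 + g * c)) / (m1 + m2 * s**2)
    # sway-angle acceleration follows from the second equation
    theta_ddot = -(x_ddot * c + g * s) / L
    return np.array([x_dot, x_ddot, theta_dot, theta_ddot])

# Example: one explicit Euler step of 1 ms from rest under a 1 N driving force
state = np.zeros(4)
state = state + 0.001 * crane_dynamics(state, F=1.0)
```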

3. The Design of Reinforcement Learning-Based Backstepping Controller

In this section, a backstepping controller for a crane robot is designed. A fuzzy reinforcement Q-learning mechanism is applied to determine the appropriate parameters of the controller to achieve stability. The control scheme is shown in Figure 3.

First, the state space form of the system dynamics equation can be rewritten as

To satisfy the above equations,

Furthermore,

The dynamic model (equation (7)) is written in a form to which the backstepping control method can be conveniently applied, where the auxiliary terms are defined accordingly.

The first Lyapunov function of the backstepping control is designed as follows, where the tracking errors are defined with respect to the desired trolley position along the X-axis and the desired sway angle. Taking the first derivative of equation (9),

It is noted that a negative derivative can be obtained by an appropriate choice of the virtual control, in which the gain is a positive number. Consequently, the second Lyapunov function is designed as follows, where the additional error term is defined accordingly. Taking the first derivative of equation (11) yields the following, with the auxiliary terms defined accordingly:

It is worth noticing that the crane robot system (equation (7)) is an underactuated system with two controlled variables (the position and the sway angle) and only one control input (the force F). As a result, the conventional backstepping control law for fully actuated systems is not applicable in this case. Moreover, the control law shown in equation (13), which ensures a negative derivative of the Lyapunov function, would result in impractically large control signals when its denominator approaches zero (i.e., when the corresponding terms take small values).

Consequently, a novel control law that avoids the issue resulting from a near-zero denominator is applied as follows, where the adjustable parameter in the law is the control gain. It is noted that the negative derivative of the Lyapunov function (equation (11)) is maintained as long as the control gain shown in equation (14) satisfies the following condition:

It is difficult to design this parameter in a deterministic way to satisfy equation (15), because a small value of the associated term would require an impractically large value of the gain. Consequently, reinforcement fuzzy Q-learning is applied to search for appropriate values of the control gain.

In equations (12) and (15), to ensure the stability of the system, the control gain is adjusted based on the values of the two quantities used as fuzzy inputs (defined in equations (16) and (17)). A reasonable linguistic adjustment rule is as follows:
(A) If the first input is big and the second input is positive, then the control gain is medium
(B) If the first input is big and the second input is negative, then the control gain is small
(C) If the first input is small and the second input is positive, then the control gain is very large
(D) If the first input is small and the second input is negative, then the control gain is large

The terms small, medium, and large in the above rules are all linguistic descriptions. The actual numerical output is obtained by fuzzy reasoning based on the numerical values of the actions and the parameters of the fuzzy structure. Each linguistic description, such as "large," corresponds to a group of candidate actions, and the numerical value of the selected action directly affects the performance of the controller. An action within the group is selected, based on the two fuzzy inputs, by the fuzzy Q-learning method to achieve the convergence of the Lyapunov function. In this control scheme, the two inputs serve as the state in Q-learning and are fuzzified by the triangular membership functions shown in Figure 4, with the details of the fuzzy sets given in equations (16) and (17), which specify the number of fuzzy sets for each input.
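To illustrate the fuzzification step, the sketch below implements triangular membership functions over a list of centers, as in Figure 4; the function names and the handling of the two outermost sets are assumptions made for this example.

```python
import numpy as np

def tri_mf(x, left, center, right):
    """Triangular membership function with breakpoints (left, center, right)."""
    if x <= left or x >= right:
        return 0.0
    if x <= center:
        return (x - left) / (center - left)
    return (right - x) / (right - center)

def fuzzify(value, centers):
    """Membership degree of `value` in each fuzzy set defined by its center."""
    degrees = []
    step_lo = centers[1] - centers[0]
    step_hi = centers[-1] - centers[-2]
    for i, c in enumerate(centers):
        left = centers[i - 1] if i > 0 else c - step_lo
        right = centers[i + 1] if i < len(centers) - 1 else c + step_hi
        degrees.append(tri_mf(value, left, c, right))
    return np.array(degrees)
```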

We define the n-th fuzzy rule Rn in the fuzzy Q-learning as follows:

Rn: If s1 is F1n and s2 is F2n and … and sp is Fpn, then u = UnL (u = u1 with q1L(n, 1), or u = u2 with q2L(n, 2), or u = u3 with q3L(n, 3), …, or u = up with q1L(p, 1)), where UnL is the chosen set of parameters under the n-th rule in the state s at the moment k. The rule Rn corresponding to the input state vector sk = {s1, s2, …, sp} yields the membership degrees {μ1(sk), …, μN(sk)} at a given time. Each u in the set UnL has a corresponding q value. Therefore, the reinforcement learning must continually update the q value of each action in every rule, based on the membership degrees and the rewards, in order to obtain the optimal policy of selecting actions in all rules. Next, the rewards are given according to the variation of the value of the Lyapunov function.

First, for each fuzzy rule, the Q-learning mechanism selects the action u corresponding to the smallest q value:

To prevent the selection of u from falling into a local optimum, a greedy mechanism with random exploration is introduced:

The numerical value of the control gain is obtained by defuzzifying the actions selected by the rules:

Under the greedy mechanism, the choice of u is partly random, which gives the reinforcement learning greater global exploration capability in the training process.
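A minimal sketch of the selection and defuzzification steps described above is given below; the array layout (one row of q values per rule) and the choice of exploiting the smallest q value follow the wording of the text and are assumptions about the implementation.

```python
import numpy as np

def select_actions(q, epsilon):
    """For every fuzzy rule, exploit the action with the smallest q value,
    or explore a random action with probability epsilon."""
    n_rules, n_actions = q.shape
    chosen = np.argmin(q, axis=1)
    explore = np.random.rand(n_rules) < epsilon
    chosen[explore] = np.random.randint(n_actions, size=int(explore.sum()))
    return chosen

def defuzzify_gain(firing, action_group, chosen):
    """Control gain as the firing-strength-weighted average of the selected
    actions (cf. equation (20))."""
    return float(np.dot(firing, action_group[chosen]) / np.sum(firing))
```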

The defuzzification of the Q value of the state vector at time k can be expressed by the following equation:

The defuzzification of the target value under state sk can be expressed as follows:

When the state vector sk of the system enters the next state sk+1 under the applied action, the generated cost information is ck, and the temporal-difference error of the process is given below, where the discount factor reflects the consideration of future rewards and the instant reward is given at instant k. In this case, the variation of the Lyapunov function (equation (11)) is used as the reward. More precisely, if the value of the Lyapunov function decreases during the period from instant k − 1 to instant k, a large reward value is given; otherwise, a small reward value is given. The function describing the reward is as follows:

It is clear that a larger reward (no greater than 1) is attained with a more dramatic decrease of the Lyapunov function. The largest reward value of 1 is achieved only if V(k) = 0, which means the system has reached the desired steady state.
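An illustrative reward function with the properties described above (bounded by 1, equal to 1 only when V(k) = 0, and larger for a sharper decrease of the Lyapunov function) is sketched below; the exact form of the reward in equation (24) may differ.

```python
import numpy as np

def lyapunov_reward(V_k, V_prev, beta=1.0):
    """Reward shaping based on the variation of the Lyapunov function."""
    if V_k == 0.0:
        return 1.0                      # desired steady state reached
    decrease = max(V_prev - V_k, 0.0)   # only an actual decrease is rewarded
    return float(np.tanh(beta * decrease / (V_k + 1e-9)))
```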

The iteration equation of the q value is as follows, where λ is the learning coefficient between 0 and 1:
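Putting the pieces together, one possible form of the q-value iteration is sketched below; weighting the update of each rule by its normalized firing strength is a common choice in fuzzy Q-learning and is assumed here, since equation (25) is not reproduced in full.

```python
def update_q(q, chosen, firing, cost, gamma, Q_next_target, Q_current, lam=0.5):
    """Temporal-difference update of the q value of the chosen action in every rule."""
    td_error = cost + gamma * Q_next_target - Q_current   # cf. equation (23)
    weights = firing / firing.sum()
    for n, a in enumerate(chosen):
        q[n, a] += lam * td_error * weights[n]
    return q
```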

The parameters of the proposed controller, which must be determined and tuned by the user, consist of the parameter in the backstepping part (the gain of the first virtual control, which appears in the terms of equation (14)) and the parameters in the reinforcement learning part (the fuzzy-set parameters in equations (16) and (17), the mutation probability in equation (19), the discount factor in equation (23), the learning rate in equation (25), and the action group of control gains). Several rules for tuning these parameters to achieve satisfactory control performance are given in Remarks 1–6.

Remark 1. In the backstepping part, the tuned parameter is the proportional gain of the first virtual control of the backstepping controller. A large value can be chosen to achieve a fast decrease of the Lyapunov function, which means fast convergence of the errors of both the trolley's position and the sway angle. However, an excessively large value risks amplifying the measurement noise in the tracking errors and their derivatives, which would degrade the control performance. In other words, there is a trade-off between fast convergence of the system errors and immunity to measurement noise. Hence, we suggest starting the trials with a small value (e.g., 0.01) and then gradually increasing it until a satisfactorily fast decrease of the Lyapunov function is achieved.

Remark 2. In the reinforcement learning part, the fuzzy sets of the inputs used for fuzzy reasoning (equations (16) and (17)) are important because they transform the numerical inputs into a group of firing degrees corresponding to the linguistic descriptions (e.g., small, medium, and big), which are applicable to fuzzy reasoning. Hence, we suggest choosing the bounds of the fuzzy sets so that they cover the ranges of the two inputs during the control task. We also suggest distributing the fuzzy sets evenly over the selected ranges in order to adequately represent the dynamics of the second Lyapunov function to the fuzzy inference.

Remark 3. In the reinforcement learning part, the mutation probability reflects the trade-off between exploring potentially better solutions and exploiting the good solutions already learnt. Since good solutions are generally obtained during the later stage of learning, we suggest using a large value (e.g., 0.6) during the initial stage of control and a small value (e.g., 0) during the later stage.
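A simple staged schedule consistent with Remark 3 is sketched below; the switching times and the intermediate value are illustrative choices, not the exact ones used in the paper.

```python
def mutation_probability(t, t1=60.0, t2=100.0, p_early=0.6, p_mid=0.3, p_late=0.0):
    """Piecewise-constant exploration probability: large early, small later."""
    if t < t1:
        return p_early
    if t < t2:
        return p_mid
    return p_late
```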

Remark 4. In the reinforcement learning part, the discount factor determines the attention paid to the control performance in future steps. In our case, although the variation of the Lyapunov function is influenced by previous control signals, the current variation is mainly determined by the current control signal; that is, the current control gain should be judged mainly by the current performance (the current variation of the Lyapunov function). Therefore, we suggest giving the discount factor a small value (e.g., 0.1).

Remark 5. In the reinforcement learning part, the learning rate reflects the efficiency of acquiring new knowledge and forgetting old knowledge. A large value achieves fast convergence of the q values, i.e., high learning efficiency. However, it is desirable for reinforcement learning to retain old knowledge to a certain extent because of the risk of learning false knowledge (e.g., when the data used for learning are contaminated by measurement noise or unknown disturbances). Therefore, we suggest giving the learning rate a medium value (e.g., 0.4∼0.6).

Remark 6. The action group of control gains can be regarded as the most important set of controller parameters because the actual control gains are calculated by equation (20) from the members of this group. The minimum value in this group should be small enough and the maximum value large enough to ensure that the optimal control gains satisfying equation (15) (and thereby guaranteeing the convergence of the Lyapunov function) lie inside the set of achievable control gains. However, excessively large control gains can cause chattering effects. Hence, we suggest setting the minimum value very small (e.g., 0) and starting the trials for the maximum value at a large value, gradually decreasing it until the chattering effect becomes insignificant. Moreover, the remaining members of the action group should be evenly distributed between the minimum and maximum values in order to produce smooth control gains.
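Following Remark 6, the action group can be generated as evenly spaced candidates between a very small minimum and a tuned maximum; the upper bound and the number of candidates below are placeholders, not the paper's values.

```python
import numpy as np

k_min, k_max, n_candidates = 0.0, 50.0, 50   # k_max is tuned until chattering is insignificant
action_group = np.linspace(k_min, k_max, n_candidates)
```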

4. Simulation Results

A simulation was run to verify the effectiveness of the proposed controller. Our control target was to accurately control the position of the load with as small a sway angle of the rope as possible. In other words, the load is supposed to reach the designated position while the angle of the rope is stabilized around 0 by the proposed control scheme.

After carefully selecting the parameters based on the rules given in Remarks 1–5, the detailed parameters of the controller were determined. The parameters of the robot system and the controller used in the simulation are shown in Table 1.

The number of linguistic variables (fuzzy sets, equations (16) and (17)) used to describe each input is set to 10. The number of action candidates in the action group for calculating the control gain is set to 50 for each fuzzy rule. After carefully selecting the parameters based on Remark 2, the membership-function centers of the first input are lin = {−0.0015, −0.0012, −0.0008, −0.0005, −0.0002, 0.0002, 0.0005, 0.0008, 0.0012, 0.0015}, and those of the second input are lin = {−0.02, −0.0156, −0.0111, −0.0067, −0.0022, 0.0022, 0.0067, 0.0111, 0.0156, 0.02}. After carefully selecting the parameters based on Remark 6, the action group of control gains on each rule is determined.
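For reference, the centers listed above for the second input correspond to ten evenly spaced points over [−0.02, 0.02] (the centers of the first input are only approximately uniform), as the short sketch below shows.

```python
import numpy as np

centers_input2 = np.round(np.linspace(-0.02, 0.02, 10), 4)
# -> [-0.02, -0.0156, -0.0111, -0.0067, -0.0022, 0.0022, 0.0067, 0.0111, 0.0156, 0.02]
```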

The desirable of trolley’s position is set as a constant so that , while the sway angle is supposed to be minimized so that . The length of the rope is the crucial element influencing the stability of the crane system [19], and we set an uncertainty of to the rope length to show the ability of the applied control scheme to handle the uncertainty of rope length. The probability to explore potentially optimal fuzzy rules is set as during initial 60 s, from 60 s to 100 s, and after 100 s.

In the simulation, the proposed controller is compared with the conventional sliding mode controller (SMC) mentioned in [21].

The performance of the proposed reinforcement learning-based BSC and the conventional SMC in tracking a constant trolley position is shown in Figure 5. With the proposed reinforcement learning-based BSC, significant overshoot is observed during the initial 20 s, and a longer time is needed to drive the trolley to the target position than with the conventional SMC, which can be attributed to the exploration of poor fuzzy rules during the initial stage of reinforcement learning. However, compared with the conventional SMC, the designed controller achieves a smaller steady-state error (SSE) after learning the optimal rules during the later stage of control (shown in the subplot on the right of Figure 5). In other words, the designed controller achieves more accurate position tracking of the trolley at the expense of fluctuations during the initial stage of control, which reflects the nature of reinforcement learning: optimal solutions are obtained only after trying many poor solutions. To avoid the observed overshoot and fluctuations, good fuzzy rules obtained from the experience of designers and prior knowledge of the crane system could be used as the initial rules of the reinforcement learning.

The performance of the designed controller and the conventional SMC in stabilizing the sway angle at 0 degrees is compared in Figure 6. Compared with the conventional SMC, although the fluctuation of the sway angle under the proposed reinforcement learning BSC lasts longer (the sway angle takes a longer time to reach 0), the sway angle stays around 0 degrees with less chattering and a smaller steady-state error during the late period of control, as shown in the two subplots of Figure 6. In other words, the applied controller starts with poor performance in stabilizing the sway angle (a longer settling time, 0∼60 s) and then achieves better performance than the conventional SMC during the late stage of control (60 s∼200 s). The reason for this behavior of the sway angle is the same as that for the trolley position: the nature of reinforcement learning is that optimal solutions are obtained at the cost of trying many poor solutions. In addition, the proposed controller also exhibits less overshoot of the sway angle during the initial stage than the conventional SMC.

Figure 7 shows the control forces generated by the reinforcement learning-based BSC and the conventional SMC. The applied reinforcement learning-based BSC provides outstanding chattering reduction and smaller control forces than the conventional SMC over the entire period of control. It is also observed in Figure 7 that the control force generated by the reinforcement learning BSC in the initial stage of control (0–80 s) exhibits more chattering than in the late stage (80–200 s), which is in accordance with Figures 5 and 6. The reason is the same as that for the trolley position and sway angle explained in the discussion of Figures 5 and 6.

The dynamics of the Lyapunov function are shown in Figure 8. The Lyapunov function clearly decreases, with the fluctuations dying out over time, which indicates that stability is achieved by the appropriate control gain produced by the reinforcement learning. Fluctuating values of the Lyapunov function are observed during the initial stage of control (0–50 s) because the reinforcement learning mechanism was still trying poor solutions, which is in accordance with the fluctuations of the trolley position, sway angle, and control force during the initial stage of control. However, a fluctuation of the Lyapunov function is still observed during the late stage (180–200 s) after learning, as shown in the subplot on the right of Figure 8. This is because the two inputs used to represent the dynamics of the second Lyapunov function are described by a limited number of linguistic variables (fuzzy sets), so there are not enough fuzzy rules to correctly determine the control gain that achieves convergence when the value of the Lyapunov function is close to zero. For example, when both inputs lie within their innermost fuzzy sets, their linguistic description is "zero," which means that only one fuzzy rule corresponds to all the states inside this range, even though these states would need different values of the control gain to achieve the convergence of the Lyapunov function. As a more detailed illustration, two different pairs of small input values share the same linguistic description "zero and zero" in the fuzzy reasoning; consequently, the same control gain is generated for both states even though they would need different control gains to achieve convergence of the Lyapunov function. Therefore, the Lyapunov function continues to fluctuate because the single fuzzy rule corresponding to the "zero and zero" state adaptively tries different action candidates in the different states inside this small range.

The outputs of the fuzzy logic inference at different times are shown in Figure 9. The output function of the fuzzy logic inference differs between times because the fuzzy rules on which the inference depends change during learning. More precisely, the initial fuzzy rules are set randomly, so the output function of the fuzzy inference is random at the beginning. The output function then varies significantly during the middle period of learning (t = 20 s; t = 100 s). After that, it tends to stabilize, with little change in its shape (t = 100 s, t = 160 s, and t = 180 s), which means the optimal rules have been learnt.

5. Conclusions

In this paper, we proposed a reinforcement learning-based backstepping control (BSC) scheme that can handle the underactuated system of container cranes. In this scheme, the control gain of the BSC, which influences the stability, is tuned by reinforcement fuzzy Q-learning, which automatically searches for the optimal fuzzy rules that generate appropriate control gains to achieve the decrease of the Lyapunov function. The simulation results show the effectiveness of the proposed control scheme: compared with the conventional SMC, it achieves less chattering and smaller steady-state errors under rope-length uncertainty.

Data Availability

The data used to support the findings of this study are included within the article and can be used for other research studies.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to thank LetPub (http://www.letpub.com) for their assistance on correcting writing issues during the preparation of this manuscript and on improving the quality of the figures used in the manuscript. This work was financially supported by major national planning projects of China’s Ministry of Industry and Information (Z135060009002-50).