#### Abstract

This paper proposes an adaptive formation tracking control algorithm, optimized by a Q-learning scheme, for multiple mobile robots. To handle model uncertainties and external disturbances, a linear extended state observer is designed and used to develop an adaptive formation tracking control strategy. The sliding mode control parameters are then adapted by a Q-learning scheme, which avoids a complex parameter tuning process. Furthermore, the stability of the closed-loop control system is rigorously proved by means of the matrix properties of graph theory and Lyapunov theory, and the formation tracking errors are guaranteed to be uniformly ultimately bounded. Finally, simulations show that the proposed algorithm offers a faster convergence rate, higher tracking accuracy, and better steady-state performance.

#### 1. Introduction

A multi-mobile robot system can exhibit intelligent behaviours through mutual cooperation and achieve a work efficiency and fault tolerance that a single robot cannot provide, so it can complete more difficult tasks. The coordinated formation control of multiple mobile robots has received extensive attention due to its important applications in the industrial and medical fields [1]. Most existing methods for the formation control of multiple mobile robots fall into three groups: behaviour-based control [2], virtual structures [3], and leader-follower architectures [4–6]. As a decentralized control strategy, the leader-follower formation structure has become the preferred choice because it is simple and scalable and requires fewer computation and communication resources than other strategies [7]. By movement type, mobile robots are divided into holonomic robots, such as the omnidirectional mobile robot (OMR), and nonholonomic robots [8]. Concerning the uncertainty and disturbance of robot docking, reference [9] proposed a novel robust containment architecture for nonholonomic mobile robot formations with docking capabilities, which realized multirobot formation maintenance/switching, docking, and collision avoidance. In [10], a dynamic control law was developed for the cooperative target encircling problem of multiple unicycle mobile robots subject to heterogeneous input disturbances generated by a linear exogenous system.

Since nonholonomic mobile robots have fewer controllable degrees of freedom (DOFs) than holonomic mobile robots, their motion is subject to geometric constraints. Common examples of robots with nonholonomic constraints are unicycles and car-like wheeled mobile robots. In contrast, holonomic mobile robots have as many controllable DOFs as total DOFs, which makes them very flexible and able to move within the workspace without geometric constraints (e.g., they can perform both rotation and lateral translation). A typical example of this class of vehicles is the omnidirectional robot with mechanical wheels. More details on the types and configurations of mobile robots can be found in [11]. References [12, 13] studied the formation control problem of multiple omnidirectional mobile robots. Reference [12] developed a distributed adaptive control law for a multirobot system in which some of the mobile robots obtain information from moving targets and backstepping control is used for formation control. Reference [13] proposed an improved collision avoidance and formation control scheme to configure a multirobot system optimized for omnidirectional visual simultaneous localization and mapping. However, the performance of the control schemes in [12, 13] deteriorates when there are uncertainties in the kinematics and dynamics of the omnidirectional mobile robots.

In research on the formation control of complex nonlinear systems, sliding mode control (SMC) is an effective robust technique for suppressing disturbances, because the system is insensitive to parameter changes and external disturbances once it enters the sliding mode. In [14], a nonsingular fast terminal SMC was proposed, which can drive the tracking error to zero in finite time. Reference [15] investigated the leader-follower formation control for multiwheel mobile robots by combining a motion controller with a dynamic controller based on sliding mode. Considering the bounded external disturbance and parameter uncertainty of mobile robots, reference [16] proposed a dual-loop attitude tracking robust controller for mobile robots, using SMC with a modified reaching law to ensure that the actual speed converges within a finite time. The main disadvantage of SMC is the chattering phenomenon caused by the discontinuity of the control law. To alleviate chattering, adaptive SMC [17] and higher order SMC [18, 19] have been proposed. However, these control methods may still produce serious chattering, or even instability, when the system is exposed to a dynamic environment with large uncertainties and disturbances. Moreover, the traditional SMC is conservative to some extent since it ignores the information on uncertainties and disturbances. An effective way to solve this problem is to use disturbance estimation and compensation to reduce the conservatism and improve the control performance. Reference [20] proposed a disturbance observer and super-twisting SMC for the multirobot formation. Reference [21] designed an adaptive high-gain observer to estimate the nonlinearity that appears in the dynamics of the wheeled robot. In [22], the active disturbance rejection control technology was used to estimate the external disturbance in the inner loop of the double closed-loop strategy. On the other hand, the velocities of omnidirectional mobile robots, which are needed in controller design, often cannot be measured because of a lack of sensors.

Extended state observer (ESO) based active disturbance rejection control (ADRC), proposed by Han [23], is a powerful tool to cope with uncertainties and external disturbances. The key idea of ADRC is that the total disturbances (including internal uncertain dynamics, cross-coupling, and external disturbances), regarded as an extended state of the system, can be estimated by the ESO and then compensated in the control signal. The ESO is thus a state observer that estimates both the system states and the total disturbances. Given this advantage, the ESO is adopted to estimate the total disturbances, and an ESO-based controller is then constructed to compensate for them. Reference [24] employed a nonlinear extended state observer (NESO) to estimate unknown states as well as uncertainties and designed a robust finite-time tracking control scheme to handle wheeled mobile robots with parameter uncertainties and disturbances. Reference [25] introduced NESO-based estimation and compensation signals into the closed-loop control method; an NESO-based decoupling control method was then proposed to solve the attitude control problem of hypersonic gliding vehicles. In [26], an NESO was used to estimate the system uncertainties, and a saturation-resistant adaptive SMC was designed based on the estimated values to achieve robust trajectory tracking for a wheeled mobile robot. However, it is difficult to find appropriate nonlinear functions to design an NESO in practice. For convenience of theoretical analysis, Gao proposed a linearized and bandwidth-based linear ADRC (LADRC) to simplify parameter tuning and standardize controller tuning [27]. A linear ESO (LESO) with nested inner and outer loops was used in [28] to actively estimate and eliminate generalized interference and improve the estimation accuracy in various practical models. LADRC makes the design and adjustment of the controller easier and more effective, and it reduces the tracking error further than some classic control structures [16], which greatly promotes the engineering application of ADRC. However, LADRC loses design flexibility because its parameters are adjusted based on bandwidth alone. A general LADRC with more tuning parameters was proposed in [29], and LADRC has since been applied to various engineering cases and has become much more popular [30, 31].

Reinforcement learning (RL) has developed rapidly in recent years. As one of the important algorithms of RL, Q-learning is off-policy, tabular, model-free, and based on temporal-difference methods [32]. Its advantages are that it does not rely on a model and that it learns well for complex systems. To improve control performance, some scholars have combined Q-learning with PID control and proposed many excellent control methods [33, 34]. For an autonomous underwater vehicle system, reference [35] proposed a Q-learning PID controller based on RBFNN, in which a Q-learning neural network was used to adaptively optimize the control parameters. Reference [36] combined model-based Q-learning with the predictive control setting to provide closed-loop stability in online learning and ultimately improve the performance of the limited range controller. Chen proposed a Q-learning-based parameter adaptation method for an active disturbance rejection controller, applied to ship heading control with multiple uncertainties due to wind, wave, and current interference [37].

Inspired by the above works, this paper investigates the formation tracking control of a multiple omnidirectional mobile robot system. The considered mobile robots are subject to internal modelling uncertainties and external disturbances (together regarded as total disturbances). To handle the disturbances, an LESO is constructed so that the total disturbances can be effectively estimated. Then, on the basis of a distributed formation tracking control architecture, an LESO-based SMC (LESO-SMC) is designed for each OMR to ensure that the observer errors and the formation neighbourhood errors are uniformly ultimately bounded (UUB). However, LESO-based control is not widely used in practice because adequate methods for LESO parameter adjustment are lacking. In view of this, and to obtain better control performance, Q-learning is employed to optimize the bandwidth parameters of the LESO and the control parameters of the SMC, which avoids a complex parameter tuning process. In addition, a simulation example is given to verify the effectiveness of the proposed method.

The main features of the proposed method are summarized as follows:

(1) An LESO is constructed to estimate the ‘total disturbances’ in real time, including both internal parameter uncertainties and external disturbances, and an LESO-SMC-based formation protocol is then developed for the OMR system. The LESO improves robustness against the ‘total disturbances’ by providing accurate input variables to the control system, namely the states of the mobile robot at each order and the extended state representing the ‘total disturbances’. The SMC is then modified accordingly to compensate for the influence of the ‘total disturbances’, which ensures a faster convergence rate and reduces the conservatism of the traditional SMC.

(2) To take full advantage of the LESO-SMC, an adaptive method in which the LESO-SMC parameters are optimized by the Q-learning algorithm is proposed for the formation tracking control of the OMR system. Q-learning performs online adaptation of the parameters (including the observer, sliding mode variable, and controller parameters), which yields better formation tracking performance and avoids a complex parameter tuning process.

The organization of this paper is as follows. In Section 2, the dynamic model and some preliminary knowledge are outlined. In Section 3, the proposed controller design based on LESO and SMC is presented. Both the Q-learning algorithm and the Q-learning parameter tuning process are also introduced. The results of numerical simulations are discussed in Section 4, followed by the conclusion of this paper.

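Notations: $A^{T}$ and $A^{-1}$ represent the transpose and inverse of matrix $A$, respectively. $\mathbb{R}^{n}$ represents the set of $n$-dimensional real column vectors. $I_{n}$ denotes an $n \times n$ unit matrix. $0_{n \times n}$ denotes an $n \times n$ zero matrix. $\otimes$ stands for the Kronecker product. $\operatorname{sign}(\cdot)$ is the sign function. $\|\cdot\|$ represents the Euclidean norm. $\operatorname{diag}\{a_1, \ldots, a_n\}$ denotes the diagonal matrix with its diagonal entries being $a_1, \ldots, a_n$. $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ represent the maximum and the minimum eigenvalues of matrix $A$, respectively.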

#### 2. Problem Formulation and Preliminaries

##### 2.1. Dynamic Model

The Euler–Lagrange equation of motion can be used to describe the dynamic behaviour of an OMR. The dynamic model of the ith OMR can be described as [31]where , represents the position and orientation angle of the ith robot in the world coordinate frame. is the inertia matrix, is the Coriolis and centrifugal term, is the gravitational force, and denotes the control input. It is assumed that the robots move on flat ground, where gravitational force is 0.

Considering the unknown dynamic disturbances and model uncertainties, a new dynamic model is obtained as follows:where and denote uncertainty terms, and denotes the unknown disturbance term.

The above equation can be rewritten aswhere denotes the uncertain term.

*Assumption 1. *The unknown disturbance term satisfies , where is an unknown positive constant.

*Assumption 2. *The uncertain terms and are bounded. Thus, the uncertain term satisfies , where is an unknown positive constant.

As for the LESO design in the following section, the total disturbance is extended as a new state. Define , , , where is the extended state of the total disturbances . The dynamic model (3) can be transformed into the following system:The dynamic equation of the virtual leader iswhere and indicate the position and velocity of the virtual leader, respectively. is the control input of the virtual leader which can be obtained by some followers.

*Assumption 3. *The derivative of the total disturbances is bounded by an unknown constant , i.e., .

*Remark 1. *Note that in the dynamic model (4) indicates an unknown term, such as external disturbances and model uncertainties for the mechanism of OMRs. In practical applications, the total disturbances mainly include wheel-ground sliding, modelling uncertainty due to robot load variation, etc.

*Remark 2. *In practice, both the speed of the DC motor which drives the OMR forward and its derivative have upper bounds, i.e., , , and are all bounded. is related to and , so we can conclude that and its derivative are bounded. Similarly, and its time derivative are bounded as well. Therefore, the assumptions of and in Assumptions 1 and 2 are reasonable.

##### 2.2. Algebraic Graph Theory

Consider an omnidirectional mobile robot system consisting of one virtual leader and $N$ followers. Assume that each robot is a node, and the information exchange among follower robots can be described by a directed graph $G = (V, E, \mathcal{A})$. The graph is composed of the node set $V = \{v_1, \ldots, v_N\}$, the edge set $E \subseteq V \times V$, and the adjacency matrix $\mathcal{A} = [a_{ij}] \in \mathbb{R}^{N \times N}$. If there is an edge between agents $i$ and $j$, i.e., $(v_i, v_j) \in E$, then $a_{ij} > 0$; otherwise $a_{ij} = 0$. The set of neighbours of node $v_i$ is denoted by $\mathcal{N}_i = \{j : (v_i, v_j) \in E\}$. The in-degree of node $v_i$ is defined as $d_i = \sum_{j=1}^{N} a_{ij}$. Then, the in-degree matrix of digraph $G$ is $D = \operatorname{diag}\{d_1, \ldots, d_N\}$, and the Laplacian matrix of digraph $G$ is $L = D - \mathcal{A}$. A path in graph $G$ from $v_i$ to $v_j$ is a sequence of distinct vertices starting with $v_i$ and ending with $v_j$ such that consecutive vertices are adjacent. Graph $G$ is connected if there is a path between any two vertices. Graph $G$ contains a directed spanning tree if there is a vertex (the root node) which can reach all the other vertices through a directed path. The virtual leader’s adjacency weight matrix is $B = \operatorname{diag}\{b_1, \ldots, b_N\}$, where $b_i$ represents the topological weight of the communication between agent $i$ and the virtual leader: $b_i > 0$ if there is communication between agent $i$ and the virtual leader; otherwise, $b_i = 0$.

*Assumption 4. *For the considered multiple mobile robot systems (2), graph for the follower robots contains a directed spanning tree; i.e., there is a vertex (the root node) which can reach all the other vertices through a directed path.

Lemma 1 (see [38]). *If $G$ is a directed graph which contains a directed spanning tree and at least the root-node agent has access to the virtual leader, then the matrix $L + B$ is of full rank.*
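As a small numerical illustration of Lemma 1, the sketch below builds the Laplacian of a hypothetical three-follower chain in which only the first follower receives the leader's information, and checks the rank of the Laplacian with and without the leader weight matrix added. The topology, the unit weights, and the names `laplacian`, `Adj`, and `B` are assumptions made for illustration; this is not the topology of Figure 2.

```python
import numpy as np


def laplacian(adjacency):
    """Return L = D - A, where D is the diagonal matrix of in-degrees (row sums)."""
    adjacency = np.asarray(adjacency, dtype=float)
    in_degree = np.diag(adjacency.sum(axis=1))
    return in_degree - adjacency


# Hypothetical chain: leader -> follower 1 -> follower 2 -> follower 3.
# Adj[i, j] = 1 means follower i+1 receives information from follower j+1.
Adj = np.array([[0, 0, 0],
                [1, 0, 0],
                [0, 1, 0]])
B = np.diag([1.0, 0.0, 0.0])  # only the root follower hears the virtual leader

L = laplacian(Adj)
rank_L = np.linalg.matrix_rank(L)
rank_LB = np.linalg.matrix_rank(L + B)
print("rank(L) =", rank_L, " rank(L + B) =", rank_LB)
```

The Laplacian alone is singular because each of its rows sums to zero; in this example, adding the leader weight on the root node removes the rank deficiency, which is the property stated in Lemma 1.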

#### 3. Main Results

In this section, to achieve better formation tracking control performance, the LESO-SMC scheme is designed for system (3) in the presence of unknown disturbances and model uncertainties, such that all follower OMRs track the virtual leader in the prescribed formation configuration and maintain the same speed as the virtual leader. Furthermore, a parameter adaptation method based on the Q-learning algorithm is incorporated into the LESO-SMC to avoid a complex parameter tuning process and to further improve the formation tracking control performance.

##### 3.1. Linear Extended State Observer Design

In this section, we use the LESO to estimate the OMR’s total disturbances , which include unknown disturbances and model uncertainties.

The LESO for system (4) is designed as follows:where , and are the estimations of , and , respectively, and is the observer gain to be determined. , where is the ESO bandwidth [39].
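To make the observer structure concrete, the following sketch simulates a linear ESO on a scalar double integrator with a constant unknown disturbance. It assumes the common bandwidth parameterization in which the three observer gains are $3\omega_o$, $3\omega_o^2$, and $\omega_o^3$; the exact gains of the LESO above are not recoverable from the text, so this choice, the unit input gain `b0`, the forward-Euler discretization, and all numerical values are assumptions for illustration only.

```python
import numpy as np


def leso_gains(omega_o):
    """Bandwidth-parameterized gains (assumed form): all observer poles at -omega_o."""
    return 3.0 * omega_o, 3.0 * omega_o**2, omega_o**3


def leso_step(z, y, u, omega_o, dt, b0=1.0):
    """One forward-Euler step of the observer.

    z = [position estimate, velocity estimate, total-disturbance estimate],
    y = measured position, u = control input.
    """
    beta1, beta2, beta3 = leso_gains(omega_o)
    e = y - z[0]  # output estimation error
    dz = np.array([z[1] + beta1 * e,
                   z[2] + beta2 * e + b0 * u,
                   beta3 * e])
    return z + dt * dz


def simulate(omega_o=20.0, dt=1e-3, t_end=3.0, disturbance=0.5):
    """Plant: x1' = x2, x2' = u + d, with d constant and unknown to the observer."""
    x = np.array([0.2, 0.0])  # true position and velocity
    z = np.zeros(3)           # observer starts from zero
    for _ in range(int(round(t_end / dt))):
        u = 0.0               # open loop; only the estimation is of interest here
        z = leso_step(z, x[0], u, omega_o, dt)
        x = x + dt * np.array([x[1], u + disturbance])
    return x, z


x_final, z_final = simulate()
print("true state:", x_final, " estimate:", z_final)
```

With a constant disturbance the third observer state settles on the disturbance value; for time-varying disturbances a residual error remains that shrinks as the bandwidth grows, at the cost of larger transient peaking.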

Define the estimation errors as follows: ; then the estimation error equation is given by

Let ; equation (7) can be rewritten aswhere is Hurwitz, , .
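The Hurwitz property and the associated Lyapunov equation can be checked numerically. The sketch below assumes that the scaled error matrix takes the form commonly obtained for a bandwidth-parameterized third-order LESO, with rows $[-3, 1, 0]$, $[-3, 0, 1]$, $[-1, 0, 0]$, and that the Lyapunov equation has the right-hand side $-I$; both are assumptions, since the matrices themselves are not visible in the text. The equation is solved by vectorization with the Kronecker product.

```python
import numpy as np

# Assumed scaled estimation-error matrix; its characteristic polynomial is (s + 1)^3.
A = np.array([[-3.0, 1.0, 0.0],
              [-3.0, 0.0, 1.0],
              [-1.0, 0.0, 0.0]])

eigs = np.linalg.eigvals(A)
print("eigenvalues of A:", eigs)

# Solve A^T P + P A = -I by vectorization:
# (kron(I, A^T) + kron(A^T, I)) vec(P) = -vec(I).
n = A.shape[0]
I = np.eye(n)
K = np.kron(I, A.T) + np.kron(A.T, I)
P = np.linalg.solve(K, -I.flatten()).reshape(n, n)

print("P =\n", P)
print("eigenvalues of P:", np.linalg.eigvalsh((P + P.T) / 2))
```

Because the assumed matrix is Hurwitz, the solution `P` is symmetric positive definite, which is what a quadratic Lyapunov function of the scaled estimation error requires.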

The convergence analysis of LESO (6), which differs slightly from the proof in [40], is given below.

Lemma 2. *Considering the estimation error dynamics (8), the LESO (6) is bounded stable if the observer bandwidth is designed to satisfy the condition .*

*Proof. *Since matrix is Hurwitz, there exists a positive definite matrix which satisfies .where is an unknown constant.

Consider a Lyapunov function candidate for (8) asThen we obtain the time derivative of aswhereUsing Young's inequality, one hasFrom Assumption 3, we have , where is an unknown constant. It can be obtained from (11) thatIf , then is bounded; hence the proposed LESO is bounded stable. The proof is completed.

To achieve better tracking control performance, an LESO-SMC-based formation control scheme is proposed in the next section on the basis of the dynamic models introduced in Section 2.

##### 3.2. Sliding Mode-Based Formation Controller Design

Define the system neighbourhood errors aswhere denotes the desired relative position for the ith robot and the virtual leader.
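A neighbourhood error of this kind can be sketched as follows. The code assumes the form widely used in leader-follower consensus, in which each follower sums the weighted differences between its own formation-offset pose and those of its neighbours and adds a leader term if it is connected to the leader. Whether this matches the definition above term by term cannot be confirmed from the text, so the function `neighbourhood_errors`, the topology, and the offsets are all hypothetical.

```python
import numpy as np


def neighbourhood_errors(q, q0, delta, adjacency, leader_weights):
    """Assumed form: e_i = sum_j a_ij * (qt_i - qt_j) + b_i * qt_i,
    where qt_i = q_i - delta_i - q0 is follower i's error relative to its slot.

    q: (N, 3) follower poses [x, y, theta]; q0: (3,) leader pose;
    delta: (N, 3) desired offsets from the leader.
    """
    q_tilde = q - delta - q0
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    # Row i of (L + B) @ q_tilde equals the sum written in the docstring.
    return (laplacian + np.diag(leader_weights)) @ q_tilde


# Hypothetical chain topology and triangular formation offsets.
Adj = np.array([[0.0, 0.0, 0.0],
                [1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0]])
b = np.array([1.0, 0.0, 0.0])
q0 = np.array([1.0, 2.0, 0.0])
delta = np.array([[1.0, 0.0, 0.0],
                  [-0.5, 0.9, 0.0],
                  [-0.5, -0.9, 0.0]])

q_ideal = q0 + delta                 # every follower exactly in its slot
e_ideal = neighbourhood_errors(q_ideal, q0, delta, Adj, b)

q_off = q_ideal.copy()
q_off[0, 0] += 0.1                   # follower 1 displaced by 0.1 along x
e_off = neighbourhood_errors(q_off, q0, delta, Adj, b)
print(e_ideal)
print(e_off)
```

In this form the error vanishes exactly when the formation is achieved, and a displacement of one follower shows up both in its own error and in the error of any follower that listens to it.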

According to (15), we design the ESO-based sliding mode surface of the overall formation for the ith agent aswhere is a positive constant.

The time derivative of is

The formation tracking controller based on ESO-SMC algorithm for the ith agent is designed aswhere and are positive constants, and is the control input of the virtual leader which can be obtained by some followers.

The designed control structure block diagram is shown in Figure 1.

##### 3.3. Stability Analysis of State Tracking Error Dynamics

The LESO has been introduced to estimate the states of each order and the uncertainties of each OMR, and the SMC-based formation controller has been designed on top of it. Next, the stability of the resulting closed-loop system is analysed.

Theorem 1. *Consider the OMR system composed of (4) and (5) that satisfies Assumptions 1–4. Under the adaptive formation tracking control algorithm optimized by Q-learning scheme with the LESO (6), the SMC (18) and the Q-learning-based parameter adaptive algorithm (see Section 3.4), the estimation errors of the proposed LESO, and the formation tracking errors are guaranteed to be UUB.*

Similar to [41], we comprehensively consider the closed-loop system composed of the estimation errors of the observer (6) and the sliding variable (16). The stability analysis of the closed-loop system is given below.

*Proof. *Define the positive definite Lyapunov function aswith the time derivative of the second term in the right-hand side of (19) beingThe system neighbourhood error (15) can be expressed in matrix form aswhere , , , , , .

The sliding surface (16) can be rewritten asThen the formation tracking controller (18) can be rewritten aswhere , .

Substituting (21)–(23) into (20), one can obtainSubstituting (14) and (24) into (19), then we havewhere , , , with , , . Since and are positive definite, one haswhere .

Based on Gronwall's inequality, one hasSince , one has .

Hence ; then we obtain the bound of the observer error asSince , one has . Similarly, we can conclude that . Therefore and are UUB. By choosing appropriate parameters, the neighbourhood error can be attracted into a small stable region containing the origin. The proof is completed.

##### 3.4. Q-Learning Based Parameter Optimization Process

The learning process of Q-learning is one of continuous interaction with the environment. First, at time $t$, the agent selects an action $a_t$. The agent then transfers from the original state $s_t$ to a new state $s_{t+1}$ with probability $P(s_{t+1} \mid s_t, a_t)$. Through this interaction with the environment, the agent receives a feedback return $r_t$; the time variable is then updated, and the agent repeats the above steps in the new state until the optimal strategy is finally obtained. The Q-learning algorithm is shown below.

Parameters: , small , .

*Step 1. *For each state-action pair , , initialize the table entry to zero. Observe the current state *s*.

*Step 2. *Loop for each episode: Select an action *a* for state using -greedy policy and execute it, use the same method to choose for state , then receive immediate reward , and observe the new state . Update the table entry for as follows:

, , until is the terminal state and all of are convergent.

*Step 3. *Output the final policy .
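The steps above can be sketched with a standard tabular Q-learning loop. The example uses the textbook off-policy update, in which the target is the reward plus the discounted maximum of the next state's table row; the update equation of Step 2 is not visible in the text, so this is an assumption and not necessarily the authors' exact rule. The five-state chain environment, the function names, and the hyperparameter values are hypothetical and serve only to make the loop runnable.

```python
import numpy as np


def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one (ties broken randomly)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    best = np.flatnonzero(q_row == q_row.max())
    return int(rng.choice(best))


def chain_step(state, action, n_states):
    """Toy environment: action 1 moves right, action 0 moves left; the last state is terminal."""
    move = 1 if action == 1 else -1
    next_state = min(max(state + move, 0), n_states - 1)
    done = next_state == n_states - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning(n_states=5, n_actions=2, alpha=0.5, gamma=0.9,
               epsilon=0.1, episodes=300, max_steps=200, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))       # Step 1: initialize the table to zero
    for _ in range(episodes):                 # Step 2: loop over episodes
        s = 0
        for _ in range(max_steps):
            a = epsilon_greedy(Q[s], epsilon, rng)
            s_next, r, done = chain_step(s, a, n_states)
            target = r if done else r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
            if done:
                break
    return Q


Q = q_learning()
policy = Q[:-1].argmax(axis=1)                # Step 3: greedy policy for non-terminal states
print(Q)
print("greedy policy:", policy)
```

In the formation problem the environment step would be replaced by running the closed-loop LESO-SMC for one stage with the parameter combination selected by the action.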

In this paper, Q-learning is applied to the optimization of the observer and controller parameters. The formation error and its derivative are taken as the state of Q-learning, the admissible range of each controller parameter is divided into discrete values, and each combination of these values is an optional action. The specific steps of the resulting algorithm are as follows:

(a) The state set is composed of .

(b) The action set is described as , , , . The interval of each of the four parameters is divided reasonably, and the divided values are randomly selected and combined to obtain the action set.

(c) The stage performance index, i.e., the average of the squared errors over 10 sampling times, is used to design the reward function. Rewards are given as follows:

The value function *Q* is then learned according to the algorithm flow described above, which yields the learned *Q* table and the optimal strategy for online parameter selection.

There are three termination conditions. If any one of them is fulfilled, the Q-value training is terminated.

(a) In practice, an excessively large formation error is undesirable, and it makes little sense to keep iterating the *Q* table in that case. So when , the training is terminated and reinitialized.

(b) The control process has reached a steady state, i.e., and ; then the training is terminated.

(c) The simulation time ; then the training is terminated.

Denote *L*_{t} as the number of training runs. When , the training is terminated and the trained *Q* table can be obtained for online control.

*Remark 3. *Both the formation error and the error variation rate are divided into 7 linguistic values, {NB, NM, NL, *Z*, PL, PM, PB}, so that the state set has $7 \times 7 = 49$ elements. Choose a reasonable range for each element in the parameter set . Based on the experience of adjusting the LESO-SMC controller parameters, set , , , . Let ; i.e., each parameter has 3 possible values; then the number of equivalent selectable actions is $3^4 = 81$. Therefore, the value function matrix is $Q \in \mathbb{R}^{49 \times 81}$ for the formation system.
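The sizes quoted in Remark 3 can be reproduced with a short sketch that discretizes the error and its rate into 7 levels each and enumerates all combinations of 3 candidate values for 4 parameters. The parameter names (`omega_o`, `c`, `k1`, `k2`), their ranges, the error bounds, and the uniform binning are hypothetical placeholders, since the actual values are not recoverable from the text.

```python
import itertools
import numpy as np

LABELS = ["NB", "NM", "NL", "Z", "PL", "PM", "PB"]


def level_index(value, bound):
    """Map a scalar to one of 7 uniform levels on [-bound, bound], saturating outside."""
    edges = np.linspace(-bound, bound, 8)[1:-1]  # 6 interior edges give 7 bins
    return int(np.digitize(value, edges))


def state_index(error, error_rate, e_bound=1.0, de_bound=1.0):
    """Combine the two level indices into a single state index in 0..48."""
    return 7 * level_index(error, e_bound) + level_index(error_rate, de_bound)


# Hypothetical ranges for the four tuned parameters (placeholders, not the paper's values).
param_ranges = {
    "omega_o": (10.0, 30.0),  # observer bandwidth
    "c": (1.0, 5.0),          # sliding surface slope
    "k1": (1.0, 10.0),        # controller gain
    "k2": (0.1, 1.0),         # switching gain
}
grids = [np.linspace(lo, hi, 3) for lo, hi in param_ranges.values()]
actions = list(itertools.product(*grids))   # every combination of candidate values

Q_table = np.zeros((len(LABELS) ** 2, len(actions)))
print("states:", Q_table.shape[0], " actions:", Q_table.shape[1])
print("level of zero error:", LABELS[level_index(0.0, 1.0)])
```

Each row of `Q_table` corresponds to one pair of linguistic values and each column to one parameter combination, so the greedy action in a row directly yields the parameter set to apply in that operating region.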

*Remark 4. *The introduction of Q-learning avoids the process of selecting and optimizing the controller, the observer, and the sliding mode parameters. After Q-learning optimization, the convergence performance of the controller and observer can be guaranteed.

#### 4. Numerical Simulations

In this section, numerical simulation examples are used to illustrate the previous conclusions and the effectiveness of the proposed control scheme. Consider a scenario in which a multiagent system composed of three OMRs (followers) simultaneously tracks a preset target (the virtual leader).

The communication topology *G* is given in Figure 2. The corresponding Laplacian matrix *L* and the adjacency weight matrix *B* can be described as

Each dynamics model of the three OMRs can be described by an Euler–Lagrange equation as follows:where , . Physical parameters and their values of the OMR are shown in Table 1. and denote the positions and the orientation angle of the ith OMR in the *x* and *y* directions in the ground coordinate system, respectively. represents the position and orientation angle of the ith robot in the world coordinate frame, represents the linear velocity and angular velocity of the ith robot in the world coordinate frame, and represents the linear acceleration and angular acceleration of the ith robot in the world coordinate frame.

We assume the initial position states of each OMR are randomly chosen within , the initial angle is 0, and the initial velocity is . , , and . The comparison simulations are carried out using the proposed LESO-SMC with and without Q-learning parameter adaptation, denoted by ‘Q-LESO-SMC’ and ‘LESO-SMC’, respectively. The parameters of LESO-SMC are chosen as , , , , and the parameters of Q-LESO-SMC are obtained after online optimization by the Q-learning algorithm in Section 3.4. The other parameters used in the simulation are chosen as .

In order to observe the performance of the two different controllers, the robot formation is implemented in two cases, including the case with constant disturbances and the case with sinusoidal disturbances.

*Case A. *The case with constant disturbances.

Consider the robot formation subject to constant external disturbances, denoted by step functions. The corresponding results are shown in Figures 3–9.

The estimation behaviours of the LESO in two dimensions for Case A with constant disturbances are shown in Figures 3–5. Here , and () denote the position, velocity, and the extended state (total disturbances) estimation in *x*-direction (in *y* direction) for the ith agent, *i* = 1,2,3, respectively. In each figure, the error plots are locally enlarged to highlight the specific convergence time and steady errors. From the comparison results, the estimation errors of all three states converge to a small stable region, and the estimation error of Q-LESO has faster convergence time and smaller steady estimation error than that of LESO under the constant disturbances.

Figure 6 shows the trajectories of three mobile robots and the virtual leader. ☆ and ○ represent the end positions of each follower robot and the virtual leader, respectively. It can be seen from the figure that the robots form a preset formation with the virtual leader as the center.

In order to make the comparison of control performance between LESO-SMC and Q-LESO-SMC, the formation tracking performance of each follower under these two controllers is depicted in Figures 7–8. Here and ( and ) denote the formation tracking position error and velocity error in *x*-direction (in *y* direction) for the ith agent, *i* = 1,2,3, respectively. It can be seen that the convergence speed of formation tracking error with Q-LESO-SMC is faster than that with LESO-SMC method, and Q-LESO-SMC method has stronger ability to suppress disturbance by comparing the magnitudes of the steady-state errors.

The position trajectories for three OMRs are shown in Figure 9, where ○ represents the starting position of each robot. It can be seen that 3 followers quickly track the virtual leader under the proposed Q-LESO-SMC method.

*Case B. *The case with sinusoidal disturbances.

For simplicity of simulation, consider the robot formation subject to nonlinear external disturbances, such as sine functions, e.g., . The corresponding results are shown in Figures 10–16.

Figures 10–12 show the comparison of estimation performance between Q-LESO and LESO with sinusoidal disturbances. The comparison of disturbance estimation behaviour is shown in Figure 12, where the ‘total disturbances’ of each follower are approximated by Q-LESO in a shorter period. In addition, the amplitude of the steady state of the estimation error of Q-LESO is smaller than that of LESO as can be seen in the locally enlarged plot.

Similar to Figure 6, Figure 13 shows the trajectories of the three mobile robots and the virtual leader in the same way. As can be seen from the figure, the target triangular formation is achieved with the virtual leader at the center.

Figures 14–15 show the comparison of formation tracking error between Q-LESO-SMC and LESO-SMC with sinusoidal disturbances. As can be seen, the followers track the virtual leader faster with the Q-LESO-SMC-based formation controller. The error steady-state part is locally enlarged, and comparing the magnitude of the steady-state error before and after optimization shows that the error amplitude of Q-LESO-SMC is smaller than that of LESO-SMC.

Similar to Figure 9 in Case A, the position trajectories for three OMRs are shown in Figure 16. It can be seen that the three followers quickly track the virtual leader under the proposed Q-LESO-SMC method, and the tracking error is smaller than that under the LESO-SMC method.

*Remark 5. *In Case B, both the LESO and Q-LESO methods achieve good formation control performance. The comparison results show that the Q-LESO exhibits a faster convergence rate, smaller tracking error, and better disturbance suppression performance. In addition, to achieve good performance, the LESO needs larger observer gains, which may exceed the bandwidths of practical systems and make the required control energy infeasible. Moreover, higher observer gains may lead to larger overshoot (the so-called peaking phenomenon); see Figures 10–12. Therefore, there is a trade-off between cost and performance.

#### 5. Conclusions

In this paper, considering the uncertainties and external disturbances in the formation process, an LESO-SMC with Q-learning adaptive optimization is proposed to achieve formation tracking of multiple OMRs. The simulation results show that the proposed control method has the advantages of faster convergence rate, higher tracking accuracy, and good steady-state performance. However, the LESO produces larger overshoot as the bandwidth increases. We will further investigate this problem in the future.

#### Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

#### Conflicts of Interest

The authors declare no conflicts of interest.

#### Acknowledgments

This work was supported by the Jiangsu Natural Science Foundation of PR China (Grant no. BK20171019), the National Natural Science Foundation of China (Grant no. 61703175), and the Natural Science Foundation of Huizhou University (Grant no. hzu201806).