#### Abstract

This paper presents an adaptive reinforcement learning- (ARL-) based motion/force tracking control scheme consisting of the optimal motion dynamic control law and force control scheme for multimanipulator systems. Specifically, a new additional term and appropriate state vector are employed in designing the ARL technique for time-varying dynamical systems with online actor/critic algorithm to be established by minimizing the squared Bellman error. Additionally, the force control law is designed after obtaining the computation of constraint force coefficient by the Moore–Penrose pseudo-inverse matrix. The tracking effectiveness of the ARL-based optimal control is verified in the closed-loop system by theoretical analysis. Finally, simulation studies are conducted on a system of three manipulators to validate the physical realization of the proposed optimal tracking control design.

#### 1. Introduction

In recent years, the tracking control problem for robotic systems has been remarkable, considered from not only academia but also industrial automation [1, 2]. Among different schemes, robotic systems can mainly be separated into two categories: (1) mobile robotics [3, 4] and (2) robot manipulators [5]. However, because of the growing requirements for high productivity, the networked robotic systems that are known as the collaboration of many robots can fulfill different tasks by mobile manipulators [6], tractor trailers [3], bilateral teleoperators [5], and cooperating mobile manipulators (CMMs) [7]. It is worth emphasizing that, unlike nonlinear multiagent systems [8, 9] studying the relation between agents, CMMs consider the interaction with the rigid object. In the literature of CMMs’ control objectives, they can be classified into two categories, i.e., multiple mobile robot manipulators in cooperation carrying a common object with unknown parameters and disturbances [7, 10–18]; one of them tightly holds the object by the end effector, and the remaining mobile manipulators’ end effector follows a trajectory on the surface of the object [19, 20]. Most of the existing control literature of networked robotics mainly focuses on the implementation of nonlinear controllers. According to Lyapunov’s direct technique, authors in [5] presented the robust controller for a bilateral teleoperation under time-varying delays and without relative motion. Recently, the backstepping technique of networked robotics has achieved much attention [3, 11, 21]. In [3], a backstepping method is utilized to implement an adaptive trajectory tracking control design for a tractor trailer, where the proposed control scheme can stabilize for cascade control systems. For the purpose of easily computing the backstepping method and handling the full-state constraint, authors in [21, 22] pointed out the appropriate term in the asymmetric barrier Lyapunov function (ABLF). This backstepping method is extended in [11] to deal with unknown dead zone nonlinearity, time-varying term using the input-driven filter term and Nussbaum function, respectively. Another consideration addressed in the existing literature is to mention the holonomic-nonholonomic constraint [12, 19, 20, 23, 24]. Motivated by this approach, the corresponding motion dynamic model and constraint force equation are established for finding appropriate controllers [12, 19, 20, 23, 24]. As a result, the robust adaptive controller was realized by estimating the derivative of the Lyapunov function and combining the constraint force control to obtain the motion/force control scheme for multimanipulator systems [12]. Due to dynamic uncertainties and external disturbance, fuzzy and neural network (NN) method are popularly considered to estimate dynamic uncertainties for guaranteeing system stability [7, 10, 20]. However, the work in [23] utilized an estimation in the absence of fuzzy and NNs for designing a decentralized controller for cooperating manipulators. In contrast to this technique, authors in [7, 12, 25] proposed the centralized control law based on the consideration of dynamics of manipulators. The scheme in [7] extends the neural network-based centralized coordination control to achieve the tracking of each mobile manipulator by the consideration of additional actuators using servo motors. However, these nonlinear control solutions do not take into account the challenges, such as input-state saturation, and control objectives are different from the traditional tracking control problem. Therefore, we propose motion/force control with a progressive optimality principle to improve the tracking performance.

In the implementation of the optimality principle, an important aspect is developed for robotic systems with the consideration of the robot-environment interaction [26–28]. In control design for robotic systems, and so on, there always exists an interaction between the end-effector and the environment, which may lead to various challenges of system performance. Therefore, the control scheme should maintain a solution for considering the system interactions. Needless to say, the control systems of the robot-environment interaction have been paid much attention by many researchers for several decades [28]. Due to the effect of the interaction between the robot’s end effector and the unknown environment, the desired trajectory in the joint space needs to be modified by the optimality approach [26–28]. In [26], the method to generate the virtual desired trajectory can be implemented by adaptive output feedback optimal control for the unknown impedance model. A different approach for finding the virtual reference trajectory can be developed by Q-learning [27]. Furthermore, based on ARL-based admittance control, authors in [28] proposed the completed control structure with the additional term of torque estimation and adaptive controller. An extension of the robot-environment interaction can be known as the physical human-robot interaction [15, 29]. In [29], the dynamic model of an dimensional manipulator system is established under the task space with the consideration of the end effector of the robot and human or environment. Moreover, the dynamic model established more a desired admittance model in the task space for obtaining virtual desired trajectory under constraint region by a soft saturation function [29]. A different discussion of the human-robot interaction with the estimation of human intention was mentioned in [15]. Furthermore, the motion/force control law was proposed using the computation of constraint force without the optimality problem [15]. An approach of optimization for the dual-arm manipulator with the consideration of relative Jacobian was mentioned in [16] by the QP solver. The extension from dual-arm manipulators to multiarm robots is discussed with the classical backstepping structure and optimal control law being constructed to tackle optimal behaviors involving obstacle avoidance and energy saving [11]. Additionally, authors in [17] studied multiarm systems under time-varying communication delay by task-space synchronization with a fixed threshold. For different networked systems of bilateral teleoperation systems, the optimality principle was also considered for implementing control design [30]. However, the disadvantage of the control law in [30] is no consideration in solving the Riccati equation. Furthermore, the optimality principle in robotic systems has been almost focused on the interaction of the human-robot’s end effector and the environment. This article proposes the consideration of optimal control in trajectory tracking in the motion/force control problem.

For finding an optimal control problem to minimize a given performance index in dynamical systems, it is necessary to solve the so-called Hamilton–Jacobi–Bellman (HJB) equation obtained from the dynamic principle. However, it is hard to analytically solve this HJB equation due to leading to a nonlinear partial differential equation. Among the numerical methods which have been considered to solve the HJB equation, the remarkable iterative structure has been developed to solve via online based on the reinforcement learning (RL) principle being inspired by machine learning [27, 31–39]. In order to implement the numerical algorithm for solving the HJB equation, there are two major directions, including online actor/critic [31, 32] and off-policy technique using integral reinforcement learning (IRL) [38, 39]. The first approach using online actor/critic is to develop based on the property of Hamiltonian being 0. As a result, the iterative algorithm can be considered by the Bellman error with solving the optimization problem. For designing the ARL scheme in linear dynamical systems [31] and in nonlinear systems [32, 40], the methods are realized to find the optimal control input being Kronecker product and approximating neural networks (NNs), respectively. Furthermore, this technique is extended for several situations, such as goal representation heuristic dynamic programming (GRHDP) with the multivariable tracking scheme [35] and uncertain discrete-time systems by using NNs [33]. The second approach using the off-policy technique is implemented by considering the deviation of the performance index in two different times with appropriate data collection [38, 39]. It is worth emphasizing that almost existing works discussed the ARL design in time-invariant systems [27, 33, 35, 36, 38, 39]. In contrast, for establishing the optimal control problem in robotic systems, it requires the consideration of ARL in time-varying systems because of the desired time-varying trajectory [31, 32]. Furthermore, the ARL-based optimal control with the actor/critic learning structure has just been discussed in [41] for time-invariant systems. The key idea is to establish simultaneously the learning of neural networks (NNs) in the actor and critic part [41]. The work in [32] developed the actor/critic learning structure for manipulators by combining the sliding mode control (SMC) technique in the absence of constraint force. Regarding the multimanipulator systems, the proposed control law in [12] only implemented the robust adaptive control based on the traditional nonlinear control technique and model separation technique without considering the optimality principle.

In this paper, we investigate the ARL-based motion/force tracking control problem for a multimanipulator system in the presence of disturbance. The control goal is to obtain the unification of the optimality principle and tracking problem. The main contributions of this paper can be summarized as follows:(1)First, it is obviously different from [26, 27, 34, 41, 42] studying optimal control for time-invariant systems; we propose ARL-based optimal control design in the situation of the time-varying nonlinear dynamical system under the influence of the time-varying desired trajectory by utilizing the online actor/critic technique for the motion dynamic model of multimanipulator systems. Furthermore, this work is still proposed for the motion/force control problem by extending more the force control term in the control scheme.(2)Second, different from numerous existing nonlinear control approaches [10, 12, 13, 17, 20, 23–25, 43, 44], the proposed optimal control algorithm is investigated with the unification of the optimization and tracking problem to be discussed through theoretical analysis and simulation studies.

The rest of this article is organized as follows. Section 2 provides the problem statement of this paper. Section 3 describes the ARL-based motion/force control design for multimanipulator systems with tracking analysis. Simulation studies on the multirobots with 3 manipulators are presented in Section 4, and the conclusion remarks are pointed out in Section 5.

#### 2. Preliminaries and Problem Statement

We investigate two-link manipulators, as shown in Figure 1, which includes robots interacted by each pair of manipulators, the unknown environment with the original Cartesian coordinate system , and the corresponding coordinate of the end effector of each robot. The interaction between each pair of manipulators leads to a constraint condition as described in the following equation:where is a vector with the first point being the original point and the end effector under the Cartesian coordinate system and is the distance between end effectors of the th robot manipulator and the th robot manipulator. According to (1), combining all the constraints in this multimanipulator, we can obtain the constraint vector as follows:where , is the joint vector of *n* two-link manipulator systems. It can be seen that there are two main cases in general robots. First, there exist no constraint forces due to no interaction between the robot and the rigid object. Second, the interaction leads to the existence of constraint force, and we only consider the constraint motion under the following assumption.

*Assumption 1. *The number of joints is more than the number of constraint conditions, i.e., .

In [12], the dynamic equation of *n* two-link robot manipulators with constraint force by the interaction between each couple of manipulators can be described aswhere is the vector of Lagrange multiplier.

The control objective is to obtain not only the tracking of joints to desired trajectories but also guaranteeing the constant distance between end effectors of each pair of manipulators. Moreover, the motion/force control objective also mentions the tracking of Lagrangian coefficient in (3). On the contrary, due to the proposed optimality-based motion/force control design, it requires the unification of the optimal control and tracking problem.

*Remark 1. *It is worth emphasizing that different from classical nonlinear control approaches [7, 17, 19, 20, 23, 24, 43], the approach in this paper is developed by the optimal control technique for uncertain systems based on adaptive/approximate reinforcement learning (ARL). Moreover, regarding the mathematical model of multimanipulators, this work is the extension of Chen et al. [11] by considering the holonomic constraint force with separation of the joint vector.

#### 3. Motion/Force Control Design with ARL

In this paper, a completed control structure in Figure 2 is investigated with 2 independent parts, including ARL-based optimal control for the motion dynamic model and the force control scheme for tracking the constraint coefficient. Moreover, the convergence and stability of the whole system (Figure 2) are discussed for determining the performance of the proposed motion/force controller.

##### 3.1. ARL-Based Motion Controller

For the purpose of designing the motion controller in multimanipulator systems, the motion dynamic model is discussed after eliminating the constraint forces. Moreover, it is due to the description of time-varying systems in trajectory tracking control; the modification of the tracking error model needs to be implemented for developing the ARL-based optimal control design.

Because of with independent constraint conditions, there exists an algebraic map satisfyingwhere is the vector of independent joint variables. Taking the derivative of (4), we imply thatwhere . With the purpose of eliminating the constraint force in the motion dynamic model, we use the Moore–Penrose pseudo-inverse matrix of being to achieve the matrix satisfying . Multiplying both sides of (3) with matrix , it leads to the following equation after utilizing :

According to (5) and (6), the motion dynamic can be expressed as

Regarding the independent joint coordinates, the dynamic model can be represented after multiplying with matrix on both sides of (6) as follows:where

*Remark 2. *This paper only discusses the situation that the number of constraint conditions is smaller than the number of joints as mentioned in Assumption 1. For the case of , it leads to no existence of independent joints. As a result, the above modifications are not implemented, and the control method is developed with different approaches [45]. Additionally, no constraint forces are considered in the special case .

The motion control design is proposed aswhereTherefore, with the assumption that is the function of as , we can obtain thatFurthermore, is the function of as is the function of . Combining with (12), a continuous-time affine system has been obtained with the typical formwhere the vector of state variables is .

*Remark 3. *Due to the time-varying reference in trajectory tracking, the closed system needs to be known as the time-varying property. However, the framework of using term (11) and choosing the state variable vector is able to obtain time-invariant systems (12) and (13). Therefore, in contrast to the existing methods in [36], the proposed control easily handles the situation of time-varying systems by the elimination of the term .

###### 3.1.1. The Optimal Control Problem

In this section, we consider the general class of time-varying systems:with the associated cost function to be defined aswhere under the weighting matrices and to be defined with appropriate dimensions. It can be seen that the optimal problem for the motion model of multimanipulators (12) and (13) is to realize an admissible control policy [31, 32] for the purpose of obtaining the minimal value. This is because that the control policy needs to guarantee the existence of the estimated Bellman function obtained from the HJB equation.

*Definition 1. *A control law is considered being “admissible” with respect to the performance index (15) on a compact set , which is known as a continuous signal , if the properties of the stabilization of (14) and the limitation of for every are satisfied.

It is worth emphasizing that, in the general case of time-varying system (14), the optimal control strategy needs to be considered as a time-varying function . Hence, the time-varying Bellman function with respect to arbitrary time can be given asThe HJB equation can be achieved by taking the time derivative of under two different methods. Firstly, it is directly computed along system (14) asSecondly, thanks to the dynamic programming principle, we also compute the derivative of with respect to time asAccording to (17) and (18), the first part of the HJB problem can be given:Moreover, the second part of the HJB problem can be developed with the performance index to be formulated asBecause of utilizing the dynamic programming principle, the performance index is represented asTherefore, we imply that the Bellman function can be rewritten asAs the convergence of , one can derive that

*Remark 4. *It is worth noting here that the term in (19) and (24) is the disadvantage in developing the ARL algorithm, and there has been very little research considering the direct method for this challenge. Authors in [36] developed the solutions of the ARL-based time-varying optimal controller using the framework of the data-driven and Newton–Leibniz formula-based function approximation method. In this paper, we indirectly handle the time-varying tracking error model through the consideration of systems (12) and (13). Therefore, the two parts of the HJB problem can be modified as

###### 3.1.2. ARL-Based Control Design for Independent Joints

It is because of difficulties in directly solving HJB equation (25) to obtain corresponding optimal control algorithm. Hence, the approximate solution with a neural network (NN) is mentioned to develop the ARL design in the motion dynamic model of multimanipulators. Thanks to the smooth property of functions and , they can be described over any compact domain :where is an uncertain ideal NN weight vector, is chosen as the number of neurons, denotes an activation function vector satisfying and , and is known as the estimation error of the Bellman function . Due to the unknown ideal NN weight vector , it is necessary to obtain the corresponding adjusting mechanism for estimating the actor/critic NNs to achieve the optimal control design in the absence of finding the HJB equation analytically (for more details, see [41]). Moreover, it is able to choose the smooth NN activation function vector based on the description of multimanipulators (see Section 4). In [41], thanks to the Weierstrass approximation theorem, we can uniformly approximate not only but also with as . For a fixed number , the approximated Bellman function of critic NN and the approximated optimal control policy of actor NN are represented as

It should be noted that properties (19) and (24) of Hamiltonian with the optimal controller and corresponding value function guarantee that the adjusting mechanisms of actor NN and critic weights are simultaneously tuned to minimize the squared Bellman error and the corresponding integral, respectively. According to the property of HJB problem (25), it leads to . After obtaining the error between approximated functions and optimal control input , , the Bellman error is represented aswhere is the regression vector of the critic part.

Based on the results in [31], the adjusting mechanism of the critic NN is computed:where are constant positive coefficients and is an estimated symmetric matrix being the solution of the differential equation aswhere denotes the resetting time satisfying that . In [31], because of the condition that is positive definite and preventing the covariance wind-up problem, the covariance matrix is chosen as

Additionally, the adjusting mechanism of the actor NN part is represented by minimizing the squared Bellman error:

*Remark 5. *ARL technique has been investigated with many different approaches, such as off-policy integral reinforcement learning (IRL) [46, 47] and Q-learning [42]. However, there is a wide difference between the online actor/critic technique in this paper and IRL and Q-learning. The off-policy IRL requires the complete data collection between the two sequential sampling times in computing the optimal law [46, 47]. Meanwhile, it is obviously different from the off-policy IRL technique; the online actor/critic method in (29), (30), (35), (33) requires the initial values to implement the computation. This could be exemplified by this article design control law under HJB problems (25) and (26), and the off-policy IRL method considers the relation between the deviation of integrals at two sampling times and the collected data under the dynamic programming law. In addition, the Q-learning method mentioned the *Q*-function in terms of not only state variables but also control inputs, being different from the Bellman function [42]. This property can result in the difficulties for implementing the Q-learning technique of the continuous-time system. Additionally, compared with the data-driven method in [48], this work provides a neural network-based technique to avoid the Kronecker product in estimating the actor/critic term. The actor/critic-based approaches have been discussed in [49, 50] for nonlinear affine systems using residual error . However, in view of the consideration of the identifier in [49, 50], it implies the difference in the computation of residual error and training laws in actor/critic weights between the proposed method and the work in [49, 50].

###### 3.1.3. The Control Design of Dependent Joints

After completing the tracking control design of independent joints, we can achieve the remaining dependent joints’ controller based on the relation between two groups of joint variables, including independent joints and dependent joints. According to (8), it leads to the following representation:

On the contrary, thanks to the relation , multiplying both sides of (6) with , we achieve the dependent joints to be computed aswhere

Substituting (36) into (37), we obtain the control law of dependent joint variables to be implemented as follows:

It is worth noting that the problem of synchronization between multiple joints can be considered by the relation between dependent and independent joints. Therefore, this problem is able to obtain the desired trajectories of independent joints and the trajectories of remaining joints depending on these independent joints. Furthermore, it leads to the relationship between the control input of independent joints and dependent joints (39).

##### 3.2. Constraint Force Control Design

For the purpose of designing the motion/force control scheme for *n* manipulators, the constraint force controller and motion control law are necessary to be established based on the computation of lambda coefficient in (3). Before demonstrating the motion/force control design, we need to compute the coefficient to be dependent on control input as follows. According to (3)–(5), we achieve the following equation:where

Moreover, we find the method to compute coefficient after eliminating the term by multiplying both sides with matrix satisfying that

Therefore, we can obtain the following coefficient :with to be chosen aswhere .

Based on the proposed motion/force, control design is described in the following equation:

This leads to the following relations that

It should be noted that the constraint force coefficient depends on the ARL-based motion dynamic control to be described in (47). Therefore, with the purpose of tracking the coefficient , we can propose the remaining term in (47) as given in the following:

The tracking problem will be proved in the next section.

##### 3.3. Convergence and Stability Analysis

With the purpose of considering the stability and convergence of the closed system under the proposed motion-force control design, we need to implement several following steps. First, the Bellman error , critic estimated weight error , and the persistence of excitation (PE) condition need to be given. Second, thanks to the optimal Bellman function and PE condition, the corresponding Lyapunov function candidate is chosen. Third, the stability of the whole system (Figure 1) and the role of constraint force control and dependent joints’ motion control are discussed.

According to (27)–(31), the Bellman error function is dependent on state variable vector as

Combining (49) with (32), we imply the dynamic equation of critic error to be given aswhere and is satisfied by

The nominal system is given after eliminating the term of NN errors:

Due to the purpose of exponential convergence of , it is necessary to guarantee the PE condition as follows:

The following main theorem is proposed with the assumptions being discussed in [41] to estimate the attraction region of tracking errors.

Theorem 1. *Consider the cooperating manipulators in Figure 1 with the assumption of the bound conditions of the NN including ideal weights, activation function , its derivative to be described in [41], and PE condition of the signal vector , and the selection of parameters is implemented with the following condition:where the coefficients in (54) are mentioned in (55), (56), and (35). Let us consider the control structure (Figure 2), (10) with ARL-based control scheme (46), (10), (39), and (30), the associated adjusting mechanisms (35) and (33) for the actual controller, and constraint force control vector (48); then, (1) the actor-critic weight errors and are UUB; (2) the tracking effectiveness of not only but also in the closed control system (Figure 1) is also UUB; (3) the tracking of constraint force coefficient vector and the remaining terms of joint variables’ vector is also UUB.*

*Proof. *First, the tracking effectiveness of motion tracking control is considered under the ARL-based optimal control scheme for corresponding tracking error motion models (12) and (13). Due to the satisfaction of PE condition (53), there exists a time-varying function satisfying several following inequalities:where are positive constant coefficients. The function is one part of the Lyapunov function candidate. Moreover, according to the bound conditions of the proposed NN [41], we can achieve the following results with the purpose of estimating the time derivative of the Lyapunov function in the next steps:Thanks to Bellman function (20) and time-varying (55), a Lyapunov function candidate can be chosen to determine the stability of the whole of the cascade control system (Figure 2) and the tracking of the weights of the actor NN and critic NN:It is worth noting that the term ( is a positive constant coefficient) is the additional term for the purpose of tracking of the whole motion control system. Because is a smooth function and positive definite, there exist two *K*-class functions such thatBecause of (55) and (8), we can determine the strict proper representation of Lyapunov function :Implementing the derivative of along motion system (13) under the control input , we imply thatwhereAccording to (19), the relation between the Bellman function and the optimal control law can be represented asAccording to (62) and optimal law (28), (35), and (55) in (60), we can conclude thatUsing Young’s inequality, one can obtain thatFrom (30), (19), and (56), the following inequality is given:and the term is bounded byFollowing (56) and (49), we haveIt is obvious thatMoreover, it can be seen thatReplacing (64) and (68) in (63), we can conclude thatBased on the estimation , we can obtain thatLet us select the coefficients to be satisfied such that .

With the state vector , it is able to determine that there exist two *K*-class functions satisfyingBased on (72), inequality (71) can be written asIt is obvious that is negative if lies outside the attraction region asAs a result of this, we can conclude that is UUB with attraction region (74) depending on the number of neurons in the critic NN.

Second, the tracking of constraint force is considered through the dynamic of obtained from (47) and (48) asAccording to and are UUB, are bounded; then, is UUB with the attraction region depending on and the bound of ARL-based motion dynamic control . The proposed controller is absolutely implemented for uncertain multi-manipulators with attraction region (74) depending on model error . This completes the proof.

#### 4. Simulation Results

In this section, an example with 3 manipulators (Figure 3) is given to validate the effectiveness of the proposed control structure (Figure 2) by the *m*-file script in MATLAB software. This example has been considered in [12] by the classical robust nonlinear controller. However, it is obviously different from the existing method in [12], and the proposed motion/force control based on ARL is first structured to obtain optimality satisfaction. Similar to the model in [12], the constrained multimanipulator system is represented by dynamic equation (3) with the following constraints:

In this example, the matrices in (3) can be represented as in [12]:

Moreover, to easily compare with the existing work [12], the parameters of multimanipulators are given as . Because of two constraints (76), this multimanipulator has 4 independent joints with the desired trajectories , and the initial values of these independent joints are . According to the given multimanipulator and above parameters, we achieve the necessary models for investigating the control design, including motion dynamic model (7), (37), and constraint force (43). In light of Theorem 1, the control parameters are given as follows: . The dynamic model of manipulators with the above parameters and ARL-based motion/force control scheme is established by the *m*-file script and Simulink in MATLAB software. The purpose of simulations is to verify the tracking effectiveness of not only joint variables but also constraint force coefficients. Additionally, due to the implementation of optimal control for the motion dynamic model by the radial basis function (RBF) network-based actor/critic reinforcement learning algorithm, the convergence of the adjusting mechanism of actor NN (35) and critic NN (33) should be verified, and the activation function is chosen depending on multimanipulator systems:

where the vector of state variables is . Moreover, the adjusting mechanisms of weights in actor and critic are implemented as in (32 and 35). On the contrary, in order to complete the proposed motion/force control structure, the remaining terms need to be realized, (39) and (48). In order to illustrate the performance of the ARL scheme, we examine to obtain the actor/critic weight response and the tracking effectiveness of joint variables. Referring to Figures 4 and 5, it is seen that the response of joint variables 1–6 with tracking errors converging to zero is 10 s, which leads to that the effectiveness of closed systems is highly precise under the ARL technique. Additionally, the convergence of adjusting mechanisms of the actor NN and critic NN to the ideal weight is also obtained (0.0006, −0.0055, 0.0045, 0.0038, 0.014, −0.018, −0.009, 0.019, 0.011, 0.0055, 0.000041, 0.00051, 0.0492, 0.002, 0.000015, 0.0024, 0.0047, 0.00019) (Figures 6–11). Finally, the high-performance responses of constraint force and control inputs are shown in Figures 12 and 13. It should be noted that these simulation results are conducted for three manipulators to be described in the work [12]. However, they are obviously different with [12]; due to the ARL-based approach, the proposed control scheme also considers the convergence of weights in actor/critic neural networks (Figures 6–11).

**(a)**

**(b)**

#### 5. Conclusions

This paper presented an ARL-based motion/force tracking control for multimanipulator systems, which consists of the ARL-based motion control law and nonlinear force controller. Both the minimization of the performance index and closed-loop stability are guaranteed by theoretical analysis. Moreover, the proposed control structure is able to handle the time-varying model using the additional term. The simulation studies are conducted to illustrate the efficacy of the proposed control structure. Future work is to develop completely uncertain multimanipulator systems by integrating model-free reinforcement learning.

#### Data Availability

This publication is supported by multiple datasets, which are available at locations cited in the reference section.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.

#### Acknowledgments

This research was supported by the Ministry of Education and Training, Vietnam, under Grant B2020-BKA-05.