Research Article  Open Access
Discrete Globalised Dual Heuristic Dynamic Programming in Control of the Two-Wheeled Mobile Robot
Abstract
Network-based control systems have been emerging technologies in the control of nonlinear systems over the past few years. This paper focuses on the implementation of the approximate dynamic programming algorithm in the network-based tracking control system of the two-wheeled mobile robot Pioneer 2DX. The proposed discrete tracking control system consists of the globalised dual heuristic dynamic programming algorithm, the PD controller, the supervisory term, and an additional control signal. The structure of the supervisory term derives from the stability analysis carried out using the Lyapunov stability theorem. The globalised dual heuristic dynamic programming algorithm consists of two structures, the actor and the critic, realised in the form of neural networks. The actor generates the suboptimal control law, while the critic evaluates the realised control strategy by approximating the value function from Bellman’s equation. The presented discrete tracking control system works online: the neural networks’ weights are adapted in every iteration step, and no preliminary learning procedure is required. The performance of the proposed control system was verified by a series of computer simulations and experiments realised using the wheeled mobile robot Pioneer 2DX.
1. Introduction
Mobile robotics applications have developed rapidly in the last few years. Autonomous wheeled mobile robots (WMRs) have attracted much attention among researchers and engineers, and the construction of robots, their sensory systems, and their control algorithms have all been developed. One of the most challenging tasks in the implementation of autonomous WMRs is the tracking control problem. It is widely discussed in the literature, where different control strategies [1–4] are presented, which shows how significant the problem is. Difficulties in realising the desired trajectory arise because WMRs are described by nonlinear dynamic equations in which some parameters of the model can be unknown or can change during the movement because of disturbances. This makes it necessary to apply computationally complex methods that can adjust their parameters during the realisation of the trajectory and assure the required quality of tracking. Artificial intelligence (AI) methods, like neural networks (NNs) [1, 2, 5, 6], are readily applied in control systems of robots because their weights can be adapted. The development of AI methods makes the implementation of Bellman’s dynamic programming (DP) [7] idea possible. This group of methods is called approximate dynamic programming (ADP) algorithms [8–12], adaptive critic designs (ACD), neurodynamic programming algorithms, or actor-critic structures. It belongs to the larger family of methods adapted using the reinforcement learning (RL) idea. According to [9, 12], the ADP algorithms family is composed of six main schemes: heuristic dynamic programming (HDP), dual heuristic dynamic programming (DHP), globalised DHP (GDHP), and the action-dependent versions of these algorithms: action-dependent HDP (ADHDP), ADDHP, and ADGDHP. Very good surveys on ADP are given in [9, 13–16].
ADP algorithms were first described for discrete-time systems [8, 9, 12] and, a few years later, for continuous-time systems [17–21].
Alongside the continuing high interest in RL algorithms, a growing number of their applications can be observed. Challenging applications of RL methods include the control of autonomous robots such as the helicopter [22] or the underwater vehicle [23]. There are implementations of RL algorithms in mobile robot path planning [24], urban traffic signal control [25], and power system control [26], but these are mostly implementations of the Q-learning algorithm [10]. There are not many recent articles concerning ADP algorithms; examples are the application of the ADHDP algorithm to a static compensator connected to a power system [27] and of the HDP and DHP algorithms in target recognition [28]. The application of ADP algorithms in the control of the wheeled mobile robot is presented in [4] and in the trajectory generating process in [29]. In [30, 31] the HDP algorithm is applied to the control of a nonlinear system, with some simulation results. Interesting results are shown in [32], where new kernel versions of the HDP and the DHP algorithms were proposed that can obtain better performance than the original ones; the performance was tested using the inverted pendulum and the ball and plate benchmark systems. The implementation of the GDHP algorithm for the control of a linear object is described in [33] and for the control of a nonlinear system in [3, 34, 35]; the control problem of the turbogenerator, solved using this algorithm, is presented in [36]. The article [37] summarises the novel developments in policy-gradient methods and presents a novel RL architecture, the natural actor-critic (NAC), together with a simulation test on the cart-pole balancing problem. Recent works on ADP algorithms have attempted to implement ADP-based control systems without knowledge of the system model [17–19].
Recent advances in this field also include the implementation of ADP algorithms for partially unknown nonlinear systems [19] and robust optimal tracking control for unknown nonlinear systems [38].
The paper presents the application of the ADP algorithm in the GDHP configuration [3, 33–35] in the tracking control problem of the WMR. The discrete tracking control system guarantees a high tracking performance and a stable realisation of the desired trajectory in the face of disturbances. The GDHP algorithm consists of two structures, the actor and the critic, both realised in the form of random vector functional link (RVFL) NNs [2]. Solutions of the tracking control problems presented in literature are often theoretical considerations; there are not many real applications of ADP algorithms in control problems. The proposed discrete neural tracking control system is used for the tracking control of the WMR Pioneer 2DX, where a series of computer simulations and experiments were realised to illustrate the performance of the control algorithm.
The results of the research presented in the paper continue the authors’ earlier works related to the problem of control of the ball and beam systems [39] and the robotic manipulator [40] using DHP algorithm, tracking control of the WMR [41–44] using different ADP algorithms, and the problem of trajectory generating using ADHDP [45]. The remainder of this paper is organised as follows. The WMR dynamics is given in Section 2. The ADP algorithms family is described in Section 3. In Section 4 the GDHP algorithm implemented in the proposed discrete tracking control system is presented and in the following section, the stability is analysed using the Lyapunov function. In Section 6, the effectiveness of the proposed control algorithm is demonstrated through a numerical illustration and an experiment realised using the WMR Pioneer 2DX. Finally, Section 7 gives the conclusion.
2. Dynamical Model of the Wheeled Mobile Robot Pioneer 2DX
The WMR Pioneer 2DX, shown in Figure 1(a), is the control object. It is a nonholonomic object whose dynamics are described using nonlinear equations. The WMR is composed of two driving wheels 1 and 2, a third, free-rolling castor wheel 3, and a frame 4 (Figure 1(b)). The movement of the WMR is analysed in the plane.
Point is a central point of the WMR’s frame, is an angle of the frame’s turn, , , , and are dimensions that result from the WMR’s geometry, , are rotation angles of driving wheels 1 and 2, and , are control signals. The dynamical model of the WMR was derived using Maggi’s formalism [2, 46] and assumed in the form where is the vector of angular velocities of the driving wheels, is the positive definite inertia matrix, is the vector of centrifugal and Coriolis forces/moments, is the friction vector, is the vector of disturbances, and is the control vector. Matrices , and the vector take the form where is the vector of the WMR’s parameters that result from the object’s geometry, mass distribution, and resistances to motion [2, 46]. The nominal parameters of the WMR Pioneer 2DX were assumed as .
The proposed tracking control system is discrete. A continuous model of the WMR’s dynamics (1) was discretised using Euler’s method and assumed in the form where is a vector that corresponds to the continuous vector of angular velocities , is an index of iteration steps, and is a time discretisation parameter. The state vector was assumed in the form . The discrete tracking errors of the rotation angles of the driving wheels and the errors of angular velocities were defined as where the desired trajectory () was generated earlier. On the basis of (4) the filtered tracking error was defined in the form where is a positive definite, fixed diagonal matrix.
Substituting the WMR dynamics model (3) and the tracking errors (4) into , calculated on the basis of (5), the filtered tracking error was assumed in the form where is the vector of desired angular accelerations that derives from the expansion of the vector using Euler’s method. The vector includes all nonlinearities of the controlled object.
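The discretisation and error definitions described above can be illustrated with a minimal sketch. It assumes a forward-Euler update of a generic second-order model and the common convention that the filtered tracking error is the velocity error plus a diagonal gain times the angle error. The names `euler_step`, `filtered_tracking_error`, and `toy_accel` are hypothetical, and the toy dynamics are not the Pioneer 2DX model.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def euler_step(z1: Vector, z2: Vector,
               accel: Callable[[Vector, Vector], Vector],
               u: Vector, h: float) -> Tuple[Vector, Vector]:
    """One forward-Euler step of a second-order model.

    z1: wheel rotation angles, z2: wheel angular velocities,
    accel(z2, u): hypothetical callable returning angular accelerations,
    h: time discretisation parameter.
    """
    a = accel(z2, u)
    z1_next = [p + h * v for p, v in zip(z1, z2)]
    z2_next = [v + h * ai for v, ai in zip(z2, a)]
    return z1_next, z2_next


def filtered_tracking_error(z1: Vector, z2: Vector,
                            zd1: Vector, zd2: Vector,
                            lam: Vector) -> Vector:
    """s = e2 + Lambda * e1 with a diagonal Lambda (assumed convention)."""
    e1 = [a - d for a, d in zip(z1, zd1)]   # angle errors
    e2 = [a - d for a, d in zip(z2, zd2)]   # angular velocity errors
    return [b + l * a for a, b, l in zip(e1, e2, lam)]


def toy_accel(z2: Vector, u: Vector) -> Vector:
    """Toy dynamics for illustration only: unit inertia, viscous friction."""
    return [ui - 0.5 * vi for ui, vi in zip(u, z2)]


z1, z2 = euler_step([0.0, 0.0], [1.0, 2.0], toy_accel, [0.5, 1.0], h=0.01)
s = filtered_tracking_error(z1, z2, [0.0, 0.0], [1.0, 2.0], [0.5, 0.5])
```

In this toy case the control exactly balances friction, so the velocities stay constant and the filtered error comes only from the accumulated angle error.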
3. Approximate Dynamic Programming
Bellman’s dynamic programming (DP) is based on the calculation of the value function, the control law, and the state of the object for every step of the process, from the last to the first, which is why it is not applicable in online control. ADP algorithms are also called adaptive critic designs (ACD) [8–16] or neurodynamic programming (NDP) algorithms. They derive from the application of NNs to Bellman’s approach to optimal control theory, where the value function and the optimal control law are approximated by the critic and the actor. This approach makes real-time control of dynamical objects possible. The ADP algorithms family is schematically shown in Figure 2. It is composed of six algorithms, which differ from each other in the critic’s structure and in the weights adaptation rules of the actor’s and the critic’s NNs.
The basic structure is the HDP algorithm, in which the critic approximates the value function and the actor generates the suboptimal control law. In the DHP algorithm the critic approximates the difference of the value function with respect to the state of the controlled system. The actor has the same structure as in HDP. Complexity of the critic grows proportionally to the size of the state vector, because the difference of the value function with respect to the dimensional state vector is approximated by critic’s NNs, and the critic’s weights adaptation law is also more complex. The DHP algorithm assures higher quality of tracking control in comparison to HDP [43]. The GDHP algorithm is built in the same way as HDP; its characteristic feature is the critic’s weights adaptation law. It is based on the minimisation of the value function and its difference with respect to the state and can be seen as a combination of the HDP and the DHP critic’s NN adaptation law. The actor structure is the same as in HDP. The difference in complexity of the three basic ADP algorithms is schematically shown in Figure 3.
In the HDP and the GDHP algorithms the critic is composed of one NN that approximates the value function, while in the DHP algorithm the critic consists of n NNs, where n is the size of the state vector. For example, in the case of the WMR, where the state vector for the system (6) is of size n = 2, the DHP algorithm consists of the actor and the critic realised in the form of two NNs each. In the GDHP algorithm, the actor is composed of two NNs, but the critic is realised in the form of only one NN. The advantage of GDHP over DHP in terms of the complexity of the critic is even more evident for a robotic manipulator with 6 degrees of freedom (n = 6). The DHP algorithm implemented in the control system for this object would be composed of the actor and the critic realised in the form of six NNs each, while the GDHP algorithm would be composed of the actor realised in the form of six NNs and only one NN in the critic structure. The difference in the complexity of the critic structure grows as the size of the state vector of the controlled object increases. The remaining ADP algorithms are action-dependent (AD) versions of the basic algorithms, where the control law generated by the actor’s NN is also an input to the critic’s NN.
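The network counts stated above can be summarised in a small helper. This is only a bookkeeping sketch: the function name `network_counts` is hypothetical, and it assumes, as in the two examples in the text, that the number of actor NNs equals the size of the state vector.

```python
def network_counts(algorithm: str, n: int) -> dict:
    """Number of actor and critic NNs for a state vector of size n.

    Follows the counts quoted in the text: one critic NN for HDP and GDHP,
    n critic NNs for DHP, and n actor NNs in every case (assumption taken
    from the WMR and manipulator examples).
    """
    name = algorithm.upper()
    if name in ("HDP", "GDHP"):
        critic = 1
    elif name == "DHP":
        critic = n
    else:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    return {"actor": n, "critic": critic}


wmr_dhp = network_counts("DHP", 2)            # two actor NNs, two critic NNs
wmr_gdhp = network_counts("GDHP", 2)          # two actor NNs, one critic NN
manipulator_gdhp = network_counts("GDHP", 6)  # six actor NNs, one critic NN
```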
4. Globalised Dual Heuristic Dynamic Programming in Tracking Control
The main part of the proposed tracking control system is the GDHP algorithm. There are not many applications of the GDHP algorithm in the literature, and existing publications are concerned mainly with theoretical studies [3, 33–36]. In this paper, both the numerical tests and the verification experiments of the neural tracking control system, realised using the WMR Pioneer 2DX, are presented. The GDHP structure generates the control law that minimises the value function [8–16], assumed in the form of equation where is a number of iteration steps, is a discount factor, , and is the local cost function for the th step, assumed in the form where is a positive definite, fixed diagonal matrix.
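As a rough illustration of such a value function, the sketch below sums discounted local costs over a finite sequence of filtered tracking errors. The quadratic cost 0.5·sᵀRs with a diagonal R is an assumption consistent with the description of the weighting matrix; `local_cost` and `discounted_value` are hypothetical names.

```python
from typing import List, Sequence


def local_cost(s: Sequence[float], r_diag: Sequence[float]) -> float:
    """Quadratic local cost 0.5 * s^T R s for a diagonal R (assumed form)."""
    return 0.5 * sum(r * si * si for r, si in zip(r_diag, s))


def discounted_value(errors: List[Sequence[float]],
                     r_diag: Sequence[float],
                     gamma: float) -> float:
    """Finite-horizon sum of gamma**k * local_cost(s_k)."""
    return sum((gamma ** k) * local_cost(s, r_diag)
               for k, s in enumerate(errors))


# Two steps, two wheels, unit weights, discount factor 0.5.
value = discounted_value([[1.0, 0.0], [0.0, 2.0]], [1.0, 1.0], 0.5)
```

Here the first step contributes 0.5 and the second contributes 0.5 · 2.0, so `value` equals 1.5.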
The GDHP algorithm, schematically shown in Figure 4(a), consists of the following:
(i) the predictive model that predicts the WMR’s closed-loop state , according to the equation where is the overall tracking control signal of the proposed control system. Its structure derives from the stability analysis presented in the next section. The controlled system’s dynamical model is necessary in the synthesis of the actor’s and the critic’s weights adaptation laws in the GDHP algorithm;
(ii) the actor, realised in the form of two RVFL NNs, that generate the suboptimal control law and are expressed by the formula where is an index of the internal loop iteration, is the input vector of the th actor’s NN, it consists of normalised values of the filtered tracking error , errors , desired () and realised () angular velocities of the driving wheels, , is the vector of output layer weights of the th actor’s NN, is the vector of sigmoidal bipolar neuron activation functions, and is the matrix of fixed input weights selected randomly in the NNs initialisation process. Actor’s NNs weights are adapted by the gradient method according to equation where is the fixed diagonal matrix of positive learning rates. The quality rating was assumed in the form where is the output of the critic’s NN, generated on the basis of the predicted state for the step ;
(iii) the critic, realised in the form of one RVFL NN, estimates the value function (8). It is expressed by the formula where is the input vector of the critic’s NN, , is the constant diagonal matrix of positive scaling coefficients, is the vector of output layer weights of the critic’s NN, and is the matrix of fixed input weights selected randomly in the critic’s NN initialisation process. The critic’s RVFL NN is schematically shown in Figure 4(b).
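The RVFL networks used for the actor and the critic can be sketched as follows: a layer of fixed, randomly drawn input weights, bipolar sigmoid activations, and a single vector of adaptable output weights. The class name `RVFLNetwork`, the uniform initialisation range, and the slope parameter `beta` are assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence


def bipolar_sigmoid(x: float, beta: float = 1.0) -> float:
    """Bipolar sigmoid 2 / (1 + exp(-beta * x)) - 1, written via tanh."""
    return math.tanh(0.5 * beta * x)


class RVFLNetwork:
    """Single-output RVFL network with fixed random input weights."""

    def __init__(self, n_inputs: int, n_neurons: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Input weights are drawn once at initialisation and never adapted.
        self.input_weights = [
            [rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
            for _ in range(n_neurons)
        ]
        # Output-layer weights are the only adapted parameters; start at zero.
        self.output_weights = [0.0] * n_neurons

    def hidden(self, x: Sequence[float]) -> List[float]:
        """Vector of neuron activations for the input vector x."""
        return [
            bipolar_sigmoid(sum(d * xi for d, xi in zip(row, x)))
            for row in self.input_weights
        ]

    def output(self, x: Sequence[float]) -> float:
        """Network output: inner product of output weights and activations."""
        return sum(w * a for w, a in zip(self.output_weights, self.hidden(x)))


net = RVFLNetwork(n_inputs=3, n_neurons=8)
activations = net.hidden([0.1, -0.2, 0.3])
y0 = net.output([0.1, -0.2, 0.3])  # zero, because output weights start at zero
```

Because only the output weights are adapted, the network is linear in its trainable parameters, which keeps gradient-based adaptation laws simple.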
The critic’s weights adaptation procedure in the GDHP algorithm is the most complex in the whole ADP family. It is based on the minimisation of the errors characteristic of the critic’s weights adaptation rules of the HDP algorithm () and the DHP algorithm (), expressed by the formula where is a constant vector, . Weights of the critic’s NN are adapted using the gradient method according to the equation where is the fixed diagonal matrix of positive learning rates and are positive constants.
The adaptation process of the NNs’ weights is an interesting feature of the ADP algorithms. It is realised in the form of an internal loop with the iteration index . In every step of the discrete control process, the calculations connected to the actor’s and the critic’s weights adaptation procedure are executed according to the scheme shown in Figure 5.
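For orientation, a combined critic error of this kind is commonly written as below. This is a sketch in assumed notation following the usual GDHP formulation, not a transcription of the paper's own equations: $\hat{V}$ denotes the critic output, $L_C$ the local cost, $\gamma$ the discount factor, $x$ the state vector, and $\eta_1$, $\eta_2$ the positive weighting constants. The derivative of $\hat{V}_{\{k+1\}}$ with respect to $x_{\{k\}}$ is understood as a total derivative taken through the predictive model and the control law.

```latex
% Assumed notation, introduced here for illustration only.
\begin{align}
e_{H\{k\}} &= L_{C\{k\}} + \gamma \hat{V}_{\{k+1\}} - \hat{V}_{\{k\}}, \\
e_{D\{k\}} &= \frac{\partial L_{C\{k\}}}{\partial x_{\{k\}}}
            + \gamma \frac{\partial \hat{V}_{\{k+1\}}}{\partial x_{\{k\}}}
            - \frac{\partial \hat{V}_{\{k\}}}{\partial x_{\{k\}}}, \\
E_{C\{k\}} &= \tfrac{1}{2}\,\eta_1\, e_{H\{k\}}^{2}
            + \tfrac{1}{2}\,\eta_2\, e_{D\{k\}}^{T} e_{D\{k\}} .
\end{align}
```

Setting $\eta_2 = 0$ would leave only the HDP-type temporal difference term, and setting $\eta_1 = 0$ would leave only the DHP-type derivative term.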
The actor-critic structure adaptation process is organised in the following way: at the beginning of every th iteration step . The actor’s NNs weights are adapted according to the assumed adaptation law (12) by minimisation of the error rate (13). This part of the algorithm, called the “control law improvement routine” [9], leads to the evaluation of the actor’s NNs weights . The next step consists of the adaptation of the critic’s NN weights; it is called the “value function determination operation.” The critic’s NN weights are adapted according to the assumed adaptation law, by the minimisation of the error rate (15), called the temporal difference error (TDE) [12], and the error rate (16). This leads to the calculation of the critic’s NN weights . Next, the internal loop iteration index is increased, and a new cycle of the ADP algorithm adaptation is started. In the presented algorithm, the internal loop breaks when the number of internal iterations , where is the maximal number of iteration cycles, or when the error is smaller than the assumed positive limit , , . When one of these conditions is satisfied, becomes and becomes . Next, the index is increased. The actor’s NNs generate control signals and the GDHP structure receives information about a new state of the controlled object. In the next sections the index is omitted for the sake of simplicity.
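The structure of this internal loop can be sketched as follows. The sketch only captures the ordering (actor update, then critic update) and the two stopping conditions described above; `actor_step`, `critic_step`, `l_max`, and `e_limit` are hypothetical names, and the error returned by `critic_step` stands in for whichever error the stopping test uses.

```python
from typing import Callable, List, Tuple


def adapt_internal_loop(actor_step: Callable[[int], None],
                        critic_step: Callable[[int], float],
                        l_max: int,
                        e_limit: float) -> Tuple[int, float]:
    """Run the internal adaptation loop for one control step.

    actor_step(l): adapts the actor's weights in cycle l.
    critic_step(l): adapts the critic's weights and returns its error.
    The loop stops after l_max cycles or once |error| < e_limit.
    Returns the number of executed cycles and the last error magnitude.
    """
    error = float("inf")
    cycle = 0
    while cycle < l_max:
        actor_step(cycle)                  # control law improvement
        error = abs(critic_step(cycle))    # value function determination
        cycle += 1
        if error < e_limit:
            break
    return cycle, error


actor_calls: List[int] = []
cycles, last_error = adapt_internal_loop(
    actor_step=actor_calls.append,
    critic_step=lambda l: 1.0 / (l + 1),   # toy error that shrinks each cycle
    l_max=10,
    e_limit=0.3,
)
```

With this toy error sequence (1, 1/2, 1/3, 1/4) the loop stops after four cycles, when the error first drops below 0.3.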
5. Stability Analysis
This paper focuses on the implementation of the ADP algorithm in the network-based tracking control system of the two-wheeled mobile robot Pioneer 2DX. The proposed discrete tracking control system consists of the GDHP algorithm, the PD controller, the supervisory term, and the additional control signal.
The filtered tracking error was defined in the form (5), where is a positive definite, fixed diagonal matrix selected in such a way that the eigenvalues are within a unit disc. Consequently, if the filtered tracking error (5) tends to zero, then all the tracking errors go to zero. The filtered tracking error can be expressed as (6), where the vector includes all nonlinearities of the controlled object.
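The implication that a vanishing filtered tracking error forces the tracking errors to zero can be checked numerically in a scalar toy case. The sketch assumes the filtered error has the form s = e2 + λ·e1 and that the angle error follows the Euler update e1 ← e1 + h·e2; under s = 0 this gives e1 ← (1 − hλ)·e1, which contracts whenever |1 − hλ| < 1. The function name and the numerical values are illustrative only.

```python
from typing import List


def simulate_angle_error(e1_0: float, lam: float, h: float,
                         steps: int) -> List[float]:
    """Angle error history when the filtered tracking error is held at zero."""
    e1 = e1_0
    history = [e1]
    for _ in range(steps):
        e2 = -lam * e1       # follows from s = e2 + lam * e1 = 0
        e1 = e1 + h * e2     # Euler update of the angle error
        history.append(e1)
    return history


history = simulate_angle_error(e1_0=1.0, lam=0.5, h=0.01, steps=2000)
```

With these values each step multiplies the error by 0.995, so the error decays monotonically towards zero.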
Let us define the control input as where is an estimate of the unknown function.
Then, the closed-loop system becomes where the functional estimation error is given by . Equation (19) relates the filtered tracking error to the functional estimation error. In general, the filtered tracking error system (19) can also be expressed as where . If the functional estimation error is bounded in such a way that , is a positive constant and , where is a positive constant, then the following stability result holds.
Let us consider the system given by (3). Let the control action be provided by (18) and assume that the functional estimation error and the unknown disturbance are bounded. The filtered tracking error system (6) is stable provided that where is the maximum eigenvalue of the matrix .
Let us consider the following Lyapunov function candidate: The first difference is Substituting the filtered tracking error dynamics (19) into (23) results in which implies that provided that where and are positive constants. This further implies that The closed-loop system is uniformly ultimately bounded (UUB) [47]. The PD controller parameter has to be selected using (21) in order for the closed-loop system to be stable. This outer-loop signal is viewed as the supervisor’s evaluation feedback to the actor and the critic. In the NN actor-critic control scheme derived in this paper there is no preliminary offline learning phase. The weights are simply initialised at zero, so that initially the control system is just the PD controller. Therefore, the closed-loop system remains stable until the NNs begin to learn.
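The meaning of uniform ultimate boundedness can be illustrated with a scalar toy simulation. The sketch assumes closed-loop error dynamics of the generic form r ← kv·r + w, where 0 < kv < 1 plays the role of the PD gain and w is a bounded term lumping the estimation error and the disturbance; for such a system the error ultimately stays within w_max / (1 − kv). The names and numbers are illustrative and are not taken from the paper.

```python
import random
from typing import List


def simulate_filtered_error(r0: float, kv: float, w_max: float,
                            steps: int, seed: int = 0) -> List[float]:
    """Simulate r <- kv * r + w with |w| <= w_max (assumed toy dynamics)."""
    rng = random.Random(seed)
    r = r0
    trajectory = [r]
    for _ in range(steps):
        w = rng.uniform(-w_max, w_max)   # bounded estimation error + disturbance
        r = kv * r + w
        trajectory.append(r)
    return trajectory


kv, w_max = 0.8, 0.1
trajectory = simulate_filtered_error(r0=5.0, kv=kv, w_max=w_max, steps=200)
ultimate_bound = w_max / (1.0 - kv)      # radius of the ultimate bound
```

The error starts far outside the bound, contracts geometrically, and then remains inside a ball whose radius grows with the size of the bounded term and shrinks as kv decreases.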
The proposed discrete tracking control system is composed of the GDHP structure that generates the control signal , the PD controller (), the supervisory term (), and the additional control signal . Structure of the supervisory term derives from the stability analysis performed using the Lyapunov stability theorem. The additional control signal derives from the process of the WMR dynamics model discretisation. The overall tracking control signal was assumed in the form where where is a fixed diagonal matrix of positive PD controller gains, is a diagonal matrix, with elements if or in the other case, , is a positive constant.
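The switching matrix of the supervisory term can be sketched as a per-wheel gate. The sketch assumes that a diagonal element equals 1 when the magnitude of the corresponding filtered tracking error reaches a positive threshold and 0 otherwise; the threshold name `rho`, the comparison used, and the raw supervisory signal `u_raw` are assumptions made for illustration.

```python
from typing import List, Sequence


def supervisory_gate(s: Sequence[float], rho: Sequence[float]) -> List[float]:
    """Diagonal of the switching matrix: 1.0 where |s_i| >= rho_i, else 0.0."""
    return [1.0 if abs(si) >= ri else 0.0 for si, ri in zip(s, rho)]


def gated_signal(s: Sequence[float], rho: Sequence[float],
                 u_raw: Sequence[float]) -> List[float]:
    """Apply the gate element-wise to a raw supervisory signal."""
    return [g * u for g, u in zip(supervisory_gate(s, rho), u_raw)]


gate = supervisory_gate([0.05, -0.4], [0.1, 0.1])
u_s = gated_signal([0.05, -0.4], [0.1, 0.1], [2.0, 3.0])
```

In this example only the second wheel's error exceeds its threshold, so the supervisory signal is active for that wheel alone.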
The scheme of the discrete neural tracking control system with the actor-critic structure in the GDHP configuration is shown in Figure 6.
The stability analysis was performed under the assumption that . Substituting (27) into (6), the closed-loop system equation is expressed by the formula The stability analysis was realised using the positive definite Lyapunov candidate function whose discretised derivative was assumed in the form Substituting (29) into (31), takes the form On the assumption that all elements of the vector of disturbances are bounded, , where is a positive constant, the difference of the Lyapunov candidate function takes the form The supervisory term’s control signal was assumed in the form where , is a positive constant, and is a positive constant. On the above assumptions the difference of the Lyapunov function (30) is negative definite.
6. Research Results
Performance of the proposed discrete tracking control system was tested during a series of computer simulations and then verified using the laboratory stand schematically shown in Figure 7.
The laboratory stand consists of the WMR Pioneer 2DX, the power supply, and a PC equipped with the dSpace DS1102 digital signal processing board and software: dSpace Control Desk and Matlab/Simulink. The WMR Pioneer 2DX is equipped with a sensory system composed of eight ultrasonic sensors and a scanning laser range finder. The movement of the robot is realised using two independently supplied DC motors with gears (ratio 19.7 : 1) and encoders (500 ticks per shaft revolution). The WMR weighs kg, its frame is m long and m wide, and its maximal velocity is equal to m/s.
6.1. Simulation Results
Performance of the proposed control system was tested during a series of numerical simulations performed using the Matlab/Simulink software environment. In this section the notation of variables is simplified and the index is omitted. The same set of parameters was used in the simulations as in the experiment. The time discretisation parameter was equal to s. In the GDHP structure NNs with eight neurons each were used. The output layer weights of the NNs were set to zero in the initialisation process. Parameters of the PD controller , were assumed. One must select using trial and error experiments or computer simulations. In practice, this has not proved to be a problem. The PD controller gains were selected heuristically to satisfy (21). Because of the noise in the signals of the driving wheels’ angular velocities (incremental encoders were used for measurement in the experiment), increasing the PD gains within the range allowed by condition (21) does not improve the tracking control quality and can lead to instability. The matrix , in the cost function, was set to , the discount factor was equal to , and the learning rates of the actor’s NNs and the critic’s NN were equal to and , respectively, , . Parameters of the supervisory term were set to and . The maximal velocity of point of the WMR’s frame was equal to m/s. During the movement of the WMR two parametric disturbances were simulated (marked on the diagrams by ellipses): the first at s, when the nominal set of parameters was changed to , and the second at s, when the nominal values of the parameters were restored. The first change of parameters corresponds to the situation when the WMR is loaded with an additional mass kg, and the return to the nominal set of parameters corresponds to the situation when the additional load is removed.
The desired trajectory of the WMR was computed earlier. In Figure 8(a) the desired rotation angles of driving wheels 1 and 2 are shown; in Figure 8(b) the desired angular velocities are presented. Realisation of the presented trajectory results in movement of point of the WMR along a path in the shape of the digit “8,” with a stop phase at the middle point.
The overall tracking control signal , shown in Figure 9(a), consists of the control signals generated by the actor’s NNs , (Figure 9(b)), the PD control signals , (Figure 9(c)), the supervisory term’s control signals , and the additional control signals , shown both in Figure 9(d). At the beginning of the numerical test, values of the PD control signals are big. Next, they are reduced during the NNs adaptation process. The control signals of the actor take the main part in the overall control signals. In time , when the first parametric disturbance occurs, a change in values of the generated control signals can be observed. The additional load changes the dynamics of the WMR; realisation of the desired trajectory requires generating higher values of the control signals. The influence of the disturbance on the WMR’s dynamics is compensated by the actor’s NNs control signals. Analogically, the change of the WMR’s parameters in time , which simulates removal of the additional load, is compensated in the generated control law by reduction of the actor’s NNs control signals values.
The desired and realised angular velocities of driving wheels 1 and 2 are shown in Figures 10(a) and 10(b), respectively. The biggest differences between the desired and realised angular velocities occur at the beginning of the numerical test. Small changes of realised angular velocities can be observed at the moment, when the parametric disturbances occur.
The desired trajectory was realised with the tracking errors shown in Figures 11(a) and 11(b) for the respective driving wheels. Figures 11(c) and 11(d) show the values of the filtered tracking errors and , which are minimised by the ADP structure. The highest values of the tracking errors occur at the beginning of the numerical test, when the values of the PD control signals are at their highest and the adaptation of the NNs’ weights from zero initial values starts. Next, the control signals of the actor’s NNs take the main part of the overall control signals, and the values of the tracking errors are reduced. A noticeable increase of the tracking error values occurs at the time of the simulated disturbances, but it is reduced by the change of the actor’s NNs control signals.
Values of the GDHP structure’s NNs weights are shown in Figure 12(a) for the first actor’s NN, in Figure 12(b) for the second one, and in Figure 12(c) for the critic’s NN. In the numerical test, zero initial weights values were used. At the time of the disturbances, changes of weights’ values occur as a result of the adaptation performed in order to reduce the tracking errors.
6.2. Verification Results
After the numerical tests were performed, a series of experiments was realised using the WMR Pioneer 2DX. The control algorithm operated in real time during the experiment, thanks to the application of the dSpace DS1102 digital signal processing board. In the experiment, the same parameters of the control system as in the simulation were used. The values of signals from the experiment were not filtered. The control signals are shown in Figure 13. The first disturbance occurs at time s and the second one at time s. The PD control signals (Figure 13(c)) are based on the tracking errors calculated from the realised trajectory, which is determined using signals from incremental encoders. These signals are noisy, which affects the overall control signals (Figure 13(a)). In contrast, the actor’s NNs control signals (Figure 13(b)) and the residual control signals (Figure 13(d)) are smooth. As was observed in the simulation, at the time of the disturbances the values of the actor’s NNs control signals changed to compensate for the effect of the change in the WMR’s dynamics.
The biggest differences between the desired and realised angular velocities, shown in Figure 14, occur at the beginning of the experiment, when the process of the actor’s NNs weights adaptation starts and at the time when the disturbances occur.
The tracking errors of wheels 1 and 2 are shown in Figures 15(a) and 15(b); filtered tracking errors are shown in Figures 15(c) and 15(d). Values of errors are noisy, because of the realised method of measurement of the movement parameters. The errors at the beginning of the experiment are at their highest. The change of the load transported by the WMR has noticeable influence on the trajectory realisation process. The method of placing the load on the WMR and removing it has a big influence on temporary values of errors. The increase of errors values results in the adaptation of the actor’s and the critic’s NNs weights in order to minimise tracking errors.
Values of the NNs’ weights are shown in Figure 16. When the WMR transports an additional load, the values of the actor’s NNs weights increase. This is a result of generating higher values of the actor’s control signals for the heavier WMR. The critic’s NN approximates the value function based on the filtered tracking errors; the values of its weights increase when the values of the filtered tracking errors increase.
The tracking quality of the proposed control system was compared to the results obtained by the tracking control systems presented earlier, where ADP algorithms in the HDP and DHP [43] configurations, or the PD controller (, ), were used. Every experiment was performed under the same conditions, using the same or analogous values of parameters and the same type of disturbance.
To evaluate the tracking control quality, the following quality ratings were used:
(i) average of maximal values of the filtered tracking error for wheels 1 () and 2 ():
(ii) average of root mean square error (RMSE) of the filtered tracking errors and : where .
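The two quality ratings can be computed as in the sketch below, assuming that the maximal value is taken over absolute values of the sampled filtered tracking error and that the RMSE is the square root of the mean squared error over all samples. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def max_abs(signal: Sequence[float]) -> float:
    """Maximal absolute value of a sampled error signal."""
    return max(abs(v) for v in signal)


def rmse(signal: Sequence[float]) -> float:
    """Root mean square of a sampled error signal."""
    return math.sqrt(sum(v * v for v in signal) / len(signal))


def quality_ratings(s1: Sequence[float],
                    s2: Sequence[float]) -> Tuple[float, float]:
    """Average of maximal values and average RMSE over wheels 1 and 2."""
    avg_max = 0.5 * (max_abs(s1) + max_abs(s2))
    avg_rmse = 0.5 * (rmse(s1) + rmse(s2))
    return avg_max, avg_rmse


ratings = quality_ratings([3.0, -4.0], [0.0, 0.0])
```

For these toy samples the first rating is 2.0, the mean of the maxima 4.0 and 0.0.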
Values of quality ratings are shown in Table 1.

Average of maximal values of the filtered tracking error for wheels 1 () and 2 () is shown in Figure 17(a), and values of RMSE of the filtered tracking errors and are shown in Figure 17(b).
The obtained results show a higher quality of tracking for the control systems with ADP algorithms than for the PD controller. The goal of this paper was not to demonstrate the maximal tracking control quality attainable with the highest feasible PD controller gains but to illustrate the increase in tracking control quality obtained by adding, to the control system, a part that compensates for the nonlinearities of the controlled object. Values of the quality ratings for the control system with the GDHP structure are close to the ones obtained by the control system with the DHP structure. At the same time, the values of the quality ratings are lower than those obtained using the HDP algorithm, which means that the application of a more complex critic’s NN weights adaptation rule improves the quality of control.
7. Conclusion
The paper presents the discrete tracking control system of the WMR Pioneer 2DX. The main element of the control system is the ADP algorithm in the GDHP configuration. It consists of the actor and the critic, realised in a form of RVFL NNs. The additional elements of the control system, like the PD controller or the supervisory term, assure stability of the tracking control in case of disturbances, or at the beginning of movement, in the case when values of the actor’s NNs weights are not adequately selected for the controlled system; for example, the process of preliminary learning was not realised, or zero initial weights were applied. PD controller gains were selected experimentally for the control system with the GDHP algorithm. Next the experiment for the control system with only the PD controller, with the same parameters, was performed to demonstrate the increase of the tracking control quality for the tracking control system compensating nonlinearities of the control object. It is important to indicate that in a case of realisation of the control system, with nonlinearities compensation, the primary part of the system is the nonlinear compensator. The nonlinear compensator, realised in the form of a GDHP algorithm, compensates for the nonlinearities of the controlled object, as well as the parametrical and the structural disturbances. The GDHP algorithm has the same structure as HDP and its critic’s structure is simpler than in DHP. In the GDHP algorithm the critic’s NN weights are adapted using a more complex adaptation law, which is composed of the critic’s NN weights adaptation rule of the HDP algorithm and the DHP algorithm. This feature assures a high quality of tracking, higher than the quality of tracking obtained when using the control system with the HDP algorithm, and close to the quality of tracking for the control system with the DHP algorithm, which is a significant advantage. 
The presented control system is stable; the values of the errors and of the NNs’ weights are bounded. Even when zero initial NN weights are applied, or in the presence of disturbances, the proposed control system guarantees a stable tracking process. The discrete tracking control system works online and does not require preliminary learning of the NNs. The performance of the control system was verified by a series of numerical tests and experiments realised using the WMR Pioneer 2DX.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
References
[1] R. Fierro and F. L. Lewis, “Control of a nonholonomic mobile robot using neural networks,” IEEE Transactions on Neural Networks, vol. 9, no. 4, pp. 589–600, 1998.
[2] M. Giergiel, Z. Hendzel, and W. Zylski, Modelling and Control of Wheeled Mobile Robots, PWN, Warsaw, 2002, (Polish).
[3] D. Liu, D. Wang, and X. Yang, “An iterative adaptive dynamic programming algorithm for optimal control of unknown discrete-time nonlinear systems with constrained inputs,” Information Sciences, vol. 220, pp. 331–342, 2013.
[4] R. Syam, K. Watanabe, and K. Izumi, “Adaptive actor-critic learning for the control of mobile robots by applying predictive models,” Soft Computing, vol. 9, no. 11, pp. 835–845, 2005.
[5] W. T. Miller, R. S. Sutton, and P. J. Werbos, Eds., Neural Networks for Control, A Bradford Book, The MIT Press, Cambridge, Mass, USA, 1990.
[6] Z. Wiesław and G. Piotr, “Verification of multilayer neural-net controller in manipulator tracking control,” Solid State Phenomena, vol. 164, pp. 99–104, 2010.
[7] R. Bellman, Dynamic Programming, Princeton University Press, Princeton, NJ, USA, 1957.
[8] A. G. Barto, R. S. Sutton, and C. W. Anderson, “Neuronlike adaptive elements that can solve difficult learning control problems,” IEEE Transactions on Systems, Man and Cybernetics, vol. 13, no. 5, pp. 834–846, 1983.
[9] A. G. Barto, W. B. Powell, J. Si, and D. Wunsch, Handbook of Learning and Approximate Dynamic Programming, Wiley-IEEE Press, New York, NY, USA, 2004.
[10] A. G. Barto and R. Sutton, Reinforcement Learning: An Introduction, MIT Press, Cambridge, Mass, USA, 1998.
[11] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, Wiley, Hoboken, NJ, USA, 2007.
[12] D. V. Prokhorov and D. C. Wunsch II, “Adaptive critic designs,” IEEE Transactions on Neural Networks, vol. 8, no. 5, pp. 997–1007, 1997.
[13] F.-Y. Wang, H. Zhang, and D. Liu, “Adaptive dynamic programming: an introduction,” IEEE Computational Intelligence Magazine, vol. 4, no. 2, pp. 39–47, 2009.
[14] F. L. Lewis, D. Liu, and G. G. Lendaris, “Guest editorial: special issue on adaptive dynamic programming and reinforcement learning in feedback control,” IEEE Transactions on Systems, Man, and Cybernetics B: Cybernetics, vol. 38, no. 4, pp. 896–897, 2008.
[15] F. L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.
[16] X. Xu, L. Zuo, and Z. Huang, “Reinforcement learning algorithms with function approximation: recent advances and applications,” Information Sciences, vol. 261, pp. 1–31, 2014.
[17] K. G. Vamvoudakis and F. L. Lewis, “Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem,” Automatica, vol. 46, no. 5, pp. 878–888, 2010.
[18] K. G. Vamvoudakis and F. L. Lewis, “Multi-player non-zero-sum games: online adaptive learning solution of coupled Hamilton-Jacobi equations,” Automatica, vol. 47, no. 8, pp. 1556–1569, 2011.
[19] D. Vrabie and F. Lewis, “Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems,” Neural Networks, vol. 22, no. 3, pp. 237–246, 2009.
[20] L. C. Baird III, “Reinforcement learning in continuous time: advantage updating,” in Proceedings of the IEEE International Conference on Neural Networks (ICNN '94), pp. 2448–2453, June 1994.
[21] T. Hanselmann, L. Noakes, and A. Zaknich, “Continuous-time adaptive critics,” IEEE Transactions on Neural Networks, vol. 18, no. 3, pp. 631–647, 2007.
[22] A. Y. Ng, H. J. Kim, M. I. Jordan, and S. Sastry, “Autonomous helicopter flight via reinforcement learning,” Advances in Neural Information Processing Systems, vol. 16, 2004.
[23] M. Carreras, J. Yuh, J. Batlle, and P. Ridao, “A behavior-based scheme using reinforcement learning for autonomous underwater vehicles,” IEEE Journal of Oceanic Engineering, vol. 30, no. 2, pp. 416–427, 2005.
[24] M. A. Kareem Jaradat, M. Al-Rousan, and L. Quadan, “Reinforcement based mobile robot navigation in dynamic environment,” Robotics and Computer-Integrated Manufacturing, vol. 27, no. 1, pp. 135–149, 2011.
[25] P. G. Balaji, X. German, and D. Srinivasan, “Urban traffic signal control using reinforcement learning agents,” IET Intelligent Transport Systems, vol. 4, no. 3, pp. 177–188, 2010.
[26] D. Ernst, M. Glavic, and L. Wehenkel, “Power systems stability control: reinforcement learning framework,” IEEE Transactions on Power Systems, vol. 19, no. 1, pp. 427–435, 2004.
[27] S. Mohagheghi, G. K. Venayagamoorthy, and R. G. Harley, “Adaptive critic design based neuro-fuzzy controller for a static compensator in a multimachine power system,” IEEE Transactions on Power Systems, vol. 21, no. 4, pp. 1744–1754, 2006.
[28] K. M. Iftekharuddin, “Transformation invariant online target recognition,” IEEE Transactions on Neural Networks, vol. 22, no. 6, pp. 906–918, 2011.
[29] J. del R. Millán, “Reinforcement learning of goal-directed obstacle-avoiding reaction strategies in an autonomous mobile robot,” Robotics and Autonomous Systems, vol. 15, no. 4, pp. 275–299, 1995.
[30] X. Zhang, H. Zhang, Q. Sun, and Y. Luo, “Adaptive dynamic programming-based optimal control of unknown nonaffine nonlinear discrete-time systems with proof of convergence,” Neurocomputing, vol. 91, pp. 48–55, 2012.
[31] D. Wang, D. Liu, and Q. Wei, “Finite-horizon neuro-optimal tracking control for a class of discrete-time nonlinear systems using adaptive dynamic programming approach,” Neurocomputing, vol. 78, no. 1, pp. 14–22, 2012.
[32] X. Xu, Z. Hou, C. Lian, and H. He, “Online learning control using adaptive critic designs with sparse kernel machines,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 5, pp. 762–775, 2013.
[33] D. Wang, D. Liu, D. Zhao, Y. Huang, and D. Zhang, “A neural-network-based iterative GDHP approach for solving a class of nonlinear optimal control problems with control constraints,” Neural Computing and Applications, vol. 22, no. 2, pp. 219–227, 2013.
[34] M. Fairbank, E. Alonso, and D. Prokhorov, “Simple and fast calculation of the second-order gradients for globalized dual heuristic dynamic programming in neural networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 10, pp. 1671–1676, 2012.
[35] D. Wang, D. Liu, Q. Wei, D. Zhao, and N. Jin, “Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming,” Automatica, vol. 48, no. 8, pp. 1825–1832, 2012.
[36] G. K. Venayagamoorthy, D. C. Wunsch, and R. G. Harley, “Adaptive critic based neurocontroller for turbogenerators with global dual heuristic programming,” in Proceedings of the IEEE Power Engineering Society Winter Meeting, vol. 1, pp. 291–294, Singapore, January 2000.
[37] J. Peters and S. Schaal, “Natural actor-critic,” Neurocomputing, vol. 71, no. 7–9, pp. 1180–1190, 2008.
[38] H. Zhang, L. Cui, X. Zhang, and Y. Luo, “Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method,” IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 2226–2236, 2011.
[39] Z. Hendzel, A. Burghardt, and M. Szuster, “Reinforcement learning in discrete neural control of the underactuated system,” Lecture Notes in Artificial Intelligence, vol. 7894, pp. 64–75, 2013.
[40] P. Gierlak, M. Szuster, and W. Zylski, “Discrete dual-heuristic programming in 3-DOF manipulator control,” in Artificial Intelligence and Soft Computing, vol. 6114 of Lecture Notes in Artificial Intelligence, pp. 256–263, 2010.
[41] Z. Hendzel, “An adaptive critic neural network for motion control of a wheeled mobile robot,” Nonlinear Dynamics, vol. 50, no. 4, pp. 849–855, 2007.
[42] Z. Hendzel and M. Szuster, “Discrete action dependent heuristic dynamic programming in wheeled mobile robot control,” Solid State Phenomena, vol. 164, pp. 419–424, 2010.
[43] Z. Hendzel and M. Szuster, “Discrete model-based adaptive critic designs in wheeled mobile robot control,” Lecture Notes in Computer Science, vol. 6114, no. 2, pp. 264–271, 2010.
[44] Z. Hendzel and M. Szuster, “Discrete neural dynamic programming in wheeled mobile robot control,” Communications in Nonlinear Science and Numerical Simulation, vol. 16, no. 5, pp. 2355–2362, 2011.
[45] Z. Hendzel and M. Szuster, “Neural dynamic programming in reactive navigation of wheeled mobile robot,” in Artificial Intelligence and Soft Computing, vol. 7268 of Lecture Notes in Computer Science, pp. 450–457, 2012.
[46] J. Giergiel and W. Zylski, “Description of motion of a mobile robot by Maggies Equations,” Journal of Theoretical and Applied Mechanics, vol. 43, no. 3, pp. 511–521, 2005.
[47] F. L. Lewis, J. Campos, and R. Selmic, Neuro-Fuzzy Control of Industrial Systems with Actuator Nonlinearities, Society for Industrial and Applied Mathematics, Philadelphia, Pa, USA, 2002.
Copyright
Copyright © 2014 Marcin Szuster and Zenon Hendzel. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.