Abstract
Autonomous underwater vehicles (AUVs) are widely used to accomplish various missions in the complex marine environment. The design of a control system for AUVs is particularly difficult due to their high nonlinearity, variations in hydrodynamic coefficients, and external forces from ocean currents. In this paper, we propose a controller based on deep reinforcement learning (DRL) in a simulation environment to study the control performance of a vectored thruster AUV. RL is an important method of artificial intelligence that learns behavior through trial-and-error interactions with the environment, so it does not require an accurate AUV control model, which is very hard to establish. The proposed RL algorithm uses only information that can be measured by sensors inside the AUV as input parameters, and the outputs of the designed controller are continuous control actions, namely the commands sent to the vectored thruster. Moreover, a reward function is developed for the deep RL controller that considers the factors actually affecting the accuracy of AUV navigation control. To confirm the algorithm's effectiveness, a series of simulations is carried out in the designed simulation environment, which saves time and improves efficiency. Simulation results prove the feasibility of the deep RL algorithm applied to the AUV control system. Furthermore, our work also provides an optional method for robot control problems to deal with rising technology requirements and complicated application environments.
1. Introduction
Oceans are a vital source of marine life, scarce minerals, marine chemicals, ocean energy, and transportation, and human societies are increasingly dependent on them. Therefore, exploring, developing, exploiting, and protecting the ocean have become hot issues of global development and technical equipment. Thus, an enormous amount of effort goes into the research and development of all kinds of instruments and equipment, such as underwater robots. Unmanned underwater vehicles are an ideal platform to carry out ocean surveying and monitoring [1]. Because the natural environment of the ocean is so harsh for human beings to investigate, autonomous underwater vehicles (AUVs), whose performance has improved significantly, are widely used for exploring and utilizing resources by carrying different detecting and operating instruments [2, 3]. Although the performance of AUVs has developed enormously, there are still many challenging problems in this field that greatly appeal to scientists and engineers. For example, conventional AUVs are unable to perform detailed inspection missions at zero and low forward speed because their control surfaces become ineffective under these conditions, as the control force depends on the forward speed [4–7]. These disadvantages greatly limit the application of AUVs. An important and effective approach to overcome this restriction is to use a vectored thruster, which uses the control force produced by the vectored thrust to control the AUV [8–10]. To perform underwater tasks, it is necessary to design a control system for the vectored thruster AUV that achieves precise trajectory tracking control.
However, AUVs are highly complex and coupled nonlinear systems with all kinds of unknown, structured, and unstructured uncertainties arising from the underwater environment [11, 12]; therefore, it is difficult to establish a precise control model for the designed AUV. Consequently, the control of AUVs has attracted considerable attention in recent years, as it must satisfy the demand for accurate trajectory tracking control against variations in hydrodynamic coefficients and external forces from ocean currents [13, 14].
Over the past few decades, various control methods have been proposed for AUVs to solve vehicle control issues while considering the aforementioned difficulties. Representative methods for AUV control, such as proportional-integral-derivative (PID) control, have been developed for low-level AUV control. In early work, Jalving [15] designed a PID controller for an AUV with steering, diving, and speed subsystems. A controller based on PID was proposed for the position and attitude tracking of AUVs, and the authors also proved the global convergence of the proposed algorithm [16]. Herman [17] proposed a decoupled PD set-point controller for underwater vehicles on the basis of previous studies [18, 19]. To solve the problem of windup due to uncertain dynamics together with actuator saturation, many researchers have devoted themselves to this aspect and have achieved many results in theory and application [20–23]. Furthermore, considering that the hydrodynamics of underwater vehicles is highly nonlinear and the model is uncertain, adaptive controllers have been proposed for controlling underwater vehicles to track desired trajectories [24–26]. Besides, many researchers have investigated AUV controllers combined with other algorithms and have achieved some progress [27–33].
In addition, other techniques have also been used for controlling AUVs to accomplish tasks, such as sliding mode control, backstepping, and model predictive control. Sliding mode control (SMC) is one of the most efficient and robust methods to deal with nonlinear uncertainties and external disturbances [34, 35]. In earlier literature, Young et al. proposed a controller based on SMC for the robust trajectory tracking control of AUVs [36] and carried out related experiments using adaptive SMC [37]. Cristi et al. designed a decoupled controller using adaptive SMC for AUV diving control [38]. Healey and Lienard proposed multivariable sliding mode control for the autonomous diving and steering of unmanned underwater vehicles [39]. Furthermore, in order to improve the performance of SMC, researchers designed a controller based on higher-order sliding mode control for diving control [40]. An adaptive robust control system was proposed by employing fuzzy logic, backstepping, and sliding mode control theory [41]. Zain and Harun proposed a nonlinear control method for stabilizing all attitudes and positions of an underactuated X4-AUV with four thrusters and six degrees-of-freedom (DOFs) according to Lyapunov stability theory and using the backstepping control method [42]. Steenson et al. developed a depth and pitch controller using the model predictive control method to manoeuvre the vehicle within the constraints of the AUV actuators [43]. Shen et al. proposed a nonlinear model predictive control scheme to control the depth of an AUV and to interact smoothly with a dynamic path planning method [44]. These studies evidence a growing need for better controllers for underwater vehicles to complete a variety of tasks under different complex, unknown environmental conditions.
However, traditional nonlinear controllers remain significantly dependent on the model, and the performance of a model-based controller degrades seriously in the absence of precise knowledge about nonlinearities, uncertainties, and unknown disturbances. Since it is obviously difficult to obtain an accurate dynamic model, conventional control methods can hardly ensure accurate and automatic control of the AUV [1]. In order to develop truly autonomous systems, researchers have turned their attention to artificial intelligence methods, such as using artificial neural networks in AUV control formulations [45]. Fujii developed a self-organizing neural-net-controller system as an adaptive motion control system, which can autonomously generate an appropriate controller according to evaluations of the motion of the vehicle [46]. A neural network adaptive controller for the diving control of an AUV is presented in [47]. Based on neural networks, Shojaei addressed a control formation for underactuated AUVs with limited torque input under environmental disturbances [48]. Many other researchers have also carried out extensive research and achieved fruitful results from different perspectives [49–51].
In more recent studies, Zhang et al. proposed an approach-angle-based three-dimensional path-following control scheme for an underactuated AUV experiencing unknown actuator saturation and environmental disturbances [52]. The work in [53] investigates the three-dimensional target tracking control problem of underactuated AUVs by using coordinate transformation and multilayer neural networks. The authors of [54] address the problem of reachable set estimation for continuous-time Takagi–Sugeno (T-S) fuzzy systems subject to unknown output delays, and a new controller design method based on the reachable set concept is also discussed for AUVs. In [55], a neural network (NN)-based adaptive trajectory tracking control scheme is designed for underactuated AUVs subject to unknown asymmetrical actuator saturation and unknown dynamics. The work in [56] investigates a neural network estimator-based fault-tolerant tracking control problem for AUVs with rudder faults and ocean current disturbances. A robust neural network approximation-based output-feedback tracking controller is proposed for autonomous underwater vehicles (AUVs) in six degrees-of-freedom in [57].
Reinforcement learning (RL) is another important artificial intelligence method for designing control systems [58]. RL algorithms are able to learn behavior through trial-and-error interactions with a dynamic environment [59]. RL can learn a control policy directly without requiring a model [60]. Gaskett and Wettergreen developed an autonomous underwater vehicle for exploration and inspection with onboard intelligent control, which can learn to control its thrusters in response to command and sensor inputs [61]. A hybrid coordination method was proposed for behavior-based control architectures, where the behaviors are learned online by reinforcement learning [62]. Carreras et al. presented a hybrid behavior-based scheme using reinforcement learning for the high-level control of autonomous underwater vehicles [63]. In the paper of El-Fakdi, a high-level RL control system using a Direct Policy Search method is proposed for solving the action selection problem of an autonomous robot in a cable tracking task [64]. Fjerdingen analyzed the application of several reinforcement learning techniques for continuous state and action spaces to pipeline following for an AUV [65]. Wu proposed an RL algorithm that learns a state-feedback controller from sampled trajectories of the AUV for tracking desired depth trajectories. In the work of Frost et al., a behavior-based architecture using a natural actor-critic RL algorithm forms the foundation of the system, with an extra layer that uses experience to learn a policy for modulating the behaviors' weights [66]. El-Fakdi and Carreras proposed a control system based on an actor-critic algorithm for solving the action selection problem of an autonomous robot in a cable tracking task [67]. On the other hand, the performance and application scope of RL algorithms are expanding rapidly due to the development of deep learning [68].
Based on deep reinforcement learning (DRL), a lot of research has been carried out with fruitful accomplishments, such as autonomous vehicle control [69–71]. Yu et al. proposed an underwater motion control system based on a modified deep deterministic policy gradient and proved that this algorithm is more accurate than traditional PID control in solving the trajectory tracking of an AUV [72]. Two reinforcement learning schemes, deep deterministic policy gradient and deep Q network, were investigated to control the docking of an AUV onto a fixed platform in a simulation environment [73]. In the work of Carlucho et al., a deep RL framework based on an actor-critic, goal-oriented deep RL architecture was developed for controlling the AUV's thrusters directly using sensory information as the input, and experiments on a real AUV demonstrate the applicability of the proposed deep RL approach [74].
Based on this research and literature review, we propose a deep RL controller based on the deep deterministic policy gradient algorithm for the low-level velocity control of a vectored thruster AUV. In the proposed control scheme, the input parameters are data that can be measured directly by the onboard sensors, and the outputs of the designed controller are the actions of the vectored thruster. Moreover, a reward function is developed for the deep RL controller that considers the factors actually affecting the accuracy of AUV navigation control. To confirm the algorithm's effectiveness, a series of simulations is carried out in the designed simulation environment, which saves time and improves efficiency. The simulation results demonstrate the feasibility of the proposed deep RL approach for AUV navigation control. Our work provides an optional method for AUV control problems to deal with rising technology requirements and complicated application environments, and significantly improves the control performance of AUVs. Furthermore, the simulation results also open up a vast range of prospects for the application of the deep RL method to complex engineering systems.
The organization of this paper is as follows. In Section 2, we briefly introduce the configuration of the vectored thruster AUV, investigate the kinematics and dynamics of the AUV, and design a control system based on the PID algorithm. In Section 3, we introduce the related knowledge of deep reinforcement learning. In Section 4, we develop our proposed controller based on deep RL. In Section 5, we carry out a series of simulations to confirm the algorithm's effectiveness. In Section 6, we conclude this paper and outline future work.
2. The Vectored Thruster AUV Model and Control Problems
The tilt angles of the ducted propeller in the AUV's yaw and pitch planes are limited to a fixed range. The vectored thruster AUV can perform missions at zero or low forward speed because the control force is provided by the vectored thruster. To achieve reliable and accurate control of the AUV, there are high demands on the autonomous control system design, and the kinematics and dynamics of the AUV are fundamental to designing such a system.
The study of AUV modeling and control involves many theories and methods of statics and dynamics. Generally, the motion study of AUVs can be divided into two major parts: the kinematics analysis model and the dynamics analysis model. The kinematics model of an AUV describes the position and orientation of the motion, while the dynamics model deals with the motion of the vehicle caused by forces and moments.
Generally, the motions of AUVs in the underwater environment involve six degrees-of-freedom (6 DOFs). To analyze the motion of the vectored thruster AUV in 6 DOFs concisely and efficiently, it is convenient to define two commonly used frames, namely, the earth-fixed frame and the body-fixed frame, as shown in Figure 1. These DOFs refer to the motions about the three coordinate axes of the AUV: surge, sway, heave, roll, pitch, and yaw. These motions determine the position and orientation of the AUV in the ocean corresponding to the six DOFs.
In this study, the earth-fixed frame is a global coordinate system that can be considered inertial and fixed at its origin. The body-fixed frame is a moving frame fixed to the AUV, whose origin coincides with the center of mass of the AUV. For convenience in investigating the vectored thruster AUV, the standard notation used to describe the motion of the AUV is defined in Table 1.
The general kinematics transformation of the AUV between the two independent coordinate systems can be represented as η̇ = J(η)ν, where η denotes the vector of position and orientation, J(η) represents the transformation matrix from the body-fixed frame to the earth-fixed frame, and ν represents the corresponding vector of linear and angular velocities.
The kinematic equations of the AUV above can be written separately for the linear and angular parameters as η̇1 = J1(η2)ν1 and η̇2 = J2(η2)ν2, where η1 = [x, y, z]ᵀ and η2 = [φ, θ, ψ]ᵀ denote the vectors of position and orientation; ν1 = [u, v, w]ᵀ and ν2 = [p, q, r]ᵀ represent the vectors of linear and angular velocity; and J1(η2) and J2(η2) denote the linear and angular velocity transformation matrices between the body-fixed frame and the earth-fixed frame, respectively, where s·, c·, and t· denote sin(·), cos(·), and tan(·), respectively.
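As a concrete illustration, the linear-velocity part of this transformation can be sketched in Python with the standard Z-Y-X Euler-angle rotation matrix. This is a minimal sketch assuming the conventional (φ roll, θ pitch, ψ yaw) ordering; the function name is chosen for this example.

```python
import numpy as np


def rotation_body_to_earth(phi, theta, psi):
    """Z-Y-X Euler-angle rotation matrix J1(eta2) mapping body-fixed
    linear velocity [u, v, w] to earth-fixed rates [x_dot, y_dot, z_dot]."""
    cphi, sphi = np.cos(phi), np.sin(phi)
    cth, sth = np.cos(theta), np.sin(theta)
    cpsi, spsi = np.cos(psi), np.sin(psi)
    return np.array([
        [cpsi * cth, -spsi * cphi + cpsi * sth * sphi,  spsi * sphi + cpsi * sth * cphi],
        [spsi * cth,  cpsi * cphi + spsi * sth * sphi, -cpsi * sphi + spsi * sth * cphi],
        [-sth,        cth * sphi,                       cth * cphi],
    ])


# With zero attitude the body and earth axes coincide, so pure surge
# velocity maps directly onto the earth-fixed x-axis.
eta_dot = rotation_body_to_earth(0.0, 0.0, 0.0) @ np.array([1.0, 0.0, 0.0])
```

Because the matrix is a rotation, it is orthogonal and its inverse is simply its transpose, which gives the earth-to-body transformation.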
The dynamic equations of motion of the underwater vehicle are derived from the Newton–Euler equation using the principle of virtual work and D'Alembert's principle. The equations of motion are established based on the traditional six-DOF model and can be expressed in the following form: Mν̇ + C(ν)ν + D(ν)ν + g(η) = τ + w, where ν and ν̇ denote the vectors of velocity and acceleration in the body-fixed frame. M is the inertia matrix of the vehicle, consisting of the rigid-body inertia matrix and the added mass. These terms are listed in (A.1) and (A.2) in the supplementary file, and the developed coefficients of the vehicle used in this paper are listed in Appendix B in the supplementary file.
C(ν) represents the Coriolis-centripetal matrix related to the Coriolis forces and the centripetal effects. It includes the rigid-body term C_RB(ν) and the added-mass term C_A(ν), as defined in the following equation: C(ν) = C_RB(ν) + C_A(ν). D(ν) refers to the hydrodynamic damping matrix of the vehicle, which is mainly composed of the linear damping matrix D_l and the nonlinear damping matrix D_n(ν). Hence, the hydrodynamic damping matrix of the underwater vehicle can be described as D(ν) = D_l + D_n(ν), where D_l denotes the damping matrix due to linear skin friction and D_n(ν) represents the nonlinear damping matrix mainly generated from potential damping, wave drift damping, damping due to vortex shedding, and lifting forces. The calculating formulas of the terms in equation (5) are given as (A.3) and (A.4) in the supplementary file.
g(η) is the vector of restoring forces and moments due to the gravity and buoyancy of the vehicle.
τ defines the resultant vector of the forces and moments applied to the vehicle in the body-fixed frame, and w represents the vector of forces and moments produced by environmental disturbances, including ocean currents and waves.
In general, the thrust is generated by a propeller mounted at the stern of AUV, and the direction of thrust is collinear with the cylindrical axis of the vehicle's hull. Hence, the applied forces and moments on the vehicle can be expressed as
On the other hand, the direction of the thrust provided by the designed vectored thruster can be adjusted according to the control needs of the AUV. The resultant vector of the applied forces and moments acting on the AUV can be expressed as follows:
Compared with a conventional AUV, the control differs because the direction of thrust is controlled by the thrust-vectoring mechanism. The deflection angle of the duct is a combination of the rudder angle δr and the elevator angle δe in the body-fixed frame, as presented in Figure 2.
The vector of thrust applied to the AUV along the axes of the body-fixed frame is defined as
Besides, since the thrust of this vectored thruster AUV is provided by the propeller, according to standard propeller theory the thrust can be described as T = ρ K_T n² D⁴, where ρ represents the density of water and K_T, n, and D denote the thrust coefficient, rotation speed, and diameter of the propeller, respectively. Referring to the definition of the deflection angles of the duct δr and δe in Figure 2 and equation (10), the vector of thrust can be calculated by
In order to study the relation between the vector of thrust and the deflection angles of the duct δr and δe, the unit vector is defined as follows:
Figure 3 shows the 3D graph of the thrust factor against the tilt angles δr and δe. The linear motion of the vehicle is controlled by the thrust vector through adjustment of the thrust T and the deflection angles of the duct δr and δe. Due to the particularity of the vectored thruster AUV, the pitch and yaw motions of the vehicle are controlled by the moments produced by the components of the thrust vector. The moments acting on the AUV are generated when the thrust is not collinear with the axis of the vehicle's hull. Referring to Figure 2, the moment due to the thrust acting about the center of mass can be expressed as M = r × F, where r denotes the position vector from the point of action of the thrust to the vehicle's center of gravity. The pitch and yaw motions of this AUV are controlled by the moment vector M. Because the moment is determined by the thrust T and the deflection angles of the duct δr and δe, and the value of the thrust depends only on the rotation speed of the propeller, the moment is independent of the forward speed and attitude of the AUV.
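The thrust law and its decomposition along the body axes can be sketched as follows. This is a minimal illustration: the decomposition convention and the function names `propeller_thrust`, `thrust_vector`, and `thrust_moment` are assumptions for the sketch, not the paper's exact formulation.

```python
import numpy as np

RHO = 1025.0  # assumed seawater density, kg/m^3


def propeller_thrust(k_t, n, d):
    """Standard propeller law: T = rho * K_T * n^2 * D^4."""
    return RHO * k_t * n**2 * d**4


def thrust_vector(t, delta_r, delta_e):
    """Decompose the thrust magnitude t along the body axes for duct
    deflection angles delta_r (rudder/yaw) and delta_e (elevator/pitch).
    This is one common convention, assumed here for illustration."""
    return t * np.array([
        np.cos(delta_e) * np.cos(delta_r),
        np.cos(delta_e) * np.sin(delta_r),
        -np.sin(delta_e),
    ])


def thrust_moment(f, r_thruster):
    """Moment about the center of mass: M = r x F,
    with r the vector from the CG to the thruster."""
    return np.cross(r_thruster, f)


# With zero deflection the thrust acts purely along the surge axis.
t = propeller_thrust(k_t=0.4, n=25.0, d=0.2)
f = thrust_vector(t, delta_r=0.0, delta_e=0.0)
```

With nonzero δr or δe the off-axis components of `f` produce the yaw and pitch moments described above via `thrust_moment`.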
With six motional DOFs, the highly nonlinear dynamics model of the AUV makes controller design a big challenge. In order to fully realize the designed functions of the underwater vehicle, the control system plays an important role in the AUV design process [75]. In general, the overall control process of the AUV can be represented as shown in Figure 4. In this process, the controller design is essential for manipulating the AUV. In our vectored thruster AUV, the control system consists of three control loops that govern its surge, pitch, and yaw motions. The inputs of the AUV controller are velocity errors, and the outputs from the controller are the control actions provided by the vectored thruster. In the surge control loop, the input to the controller is the linear velocity u and the output is the thrust T. Similarly, the input and output of the pitch controller are the angular velocity q and the elevator angle δe; the input and output of the yaw controller are the angular velocity r and the rudder angle δr. In order to meet the demands of real applications, controllers for AUVs are usually designed based on the proportional-integral-derivative (PID) algorithm, which can be expressed as u(t) = K_p e(t) + K_i ∫ e(τ)dτ + K_d de(t)/dt, where K_p, K_i, and K_d are the proportional, integral, and derivative gains, respectively, and e(t) is the error between the desired set point and a measured process variable.
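The PID law above can be sketched as a discrete-time controller. This is an illustrative sketch only: the gains and sampling time are arbitrary placeholder values, not the paper's tuning.

```python
class PID:
    """Discrete PID controller: u = Kp*e + Ki*integral(e) + Kd*de/dt."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt                      # accumulate integral term
        derivative = (error - self.prev_error) / self.dt      # finite-difference derivative
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# One surge-velocity step with placeholder gains: a 0.5 m/s error
# produces a thrust command combining P, I, and D contributions.
surge_pid = PID(kp=2.0, ki=0.1, kd=0.05, dt=0.1)
thrust_cmd = surge_pid.update(setpoint=1.5, measurement=1.0)
```

In the vectored thruster AUV, one such loop would drive the thrust command and two more would drive the elevator and rudder angles, matching the three control loops described above.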
Based on the control process in Figure 4 and the PID algorithm above, a controller is designed for the vectored thruster AUV. To verify the control performance of this system, a series of simulations is carried out according to the analysis of the AUV presented above. Before the simulation analysis, the reference velocity and the output thrust of the AUV are defined. For the first set of reference velocities, the simulation result is obtained as shown in Figure 5.
The simulation results in Figure 5 show that the proposed PID-based controller is practicable and effective for this reference velocity. For the second set of reference velocities, the simulation result is obtained as shown in Figure 6.
As shown in Figure 6, the linear velocity reaches the reference in a short time, but the angular velocity differs from the reference. Analysis shows that the thrust is inadequate to achieve the reference angular velocity. As described earlier, the thrust is determined by the surge controller, and hence the thrust does not increase even though the angular velocity does not reach the reference.
When a new requirement is introduced, such as a new set of reference velocities, the designed AUV controller becomes difficult to implement if the velocity setting is neglected. In order to solve this problem, the thrust of the AUV is set to a high value in advance. In this case, the simulation results are shown in Figure 7.
The simulation results and analysis above show the inadequacies of the designed PID-based AUV controller. In order to improve the performance and reduce energy consumption, it is essential to find a new method to design the AUV controller.
3. Deep Reinforcement Learning Control for AUV
3.1. Reinforcement Learning Statement
Reinforcement learning (RL) is a branch of machine learning that studies how an agent optimizes its behavior for a task by interacting with the environment. The environment produces a new state in response to the action executed in the current state. At the same time, the agent receives a reward from the environment, which serves as an index to evaluate the quality of the action. A series of data is generated by the agent and the environment through continuous loop iterations. The basic principle of reinforcement learning is presented in Figure 8.
The environment for agent training in RL can be described as a Markov Decision Process (MDP), where the environment is assumed to be fully observable. An MDP can be defined as a 5-tuple (S, A, P, R, γ), where S is the d-dimensional state space, A defines the action space, P(s′ | s, a) is the probability of transitioning to state s′ by taking action a in state s, γ denotes the discount factor for future rewards, and R(s, a) is the function expressing the reward for taking action a in a particular state s. The policy function π represents a mapping from states to actions and denotes a mechanism for choosing action a in the current state s.
The goal of the agent is to maximize the total amount of reward it receives [76, 77]. When a strategy is given, the discounted sum of immediate rewards is defined as the return: G_t = Σ_{k=0}^{∞} γᵏ r_{t+k}.
The purpose of the reinforcement learning method is to find the optimal policy π*, which maximizes the return obtained by following the policy. The optimal policy satisfies π* = argmax_π J(π), where the performance objective J(π) denotes the expected total reward under the policy π and γ is the discount factor.
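The discounted return G_t can be computed from a finite reward sequence as follows. This is a minimal sketch; the backward accumulation is a standard implementation trick, not specific to this paper.

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
    Accumulating backwards avoids recomputing powers of gamma."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

Smaller γ makes the agent myopic (only near-term rewards matter), while γ close to 1 weighs long-horizon rewards, which matters for trajectory tracking tasks.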
The state-value function V^π(s) is defined as the expected value of the cumulative discounted reward obtained from state s when following policy π.
Similar to the state-value function, the action-value function Q^π(s, a), also known as the Q function, can be defined as the expected return after taking action a in state s and thereafter following policy π.
The state-value function and the action-value function satisfy the Bellman equation.
When the agent follows the optimal policy π*, the optimal state-value function V*(s) and the optimal action-value function Q*(s, a) achieve the highest return. The optimal functions satisfy the following Bellman equation:
The purpose of RL problems is to learn the optimal policy π*. A greedy policy is derived from Q* by choosing, in each state, the action with the highest value. Once Q* is obtained through interaction, the optimal policy can be obtained directly by π*(s) = argmax_a Q*(s, a).
3.2. Reinforcement Learning in Continuous Domain
Existing RL algorithms mainly consist of value-based and policy-based methods. The first value-based method proposed was Q-learning, which has become one of the most popular and widely used RL algorithms. In use, Q-learning calculates the Q-value of each state-action pair and stores it in a table. Precisely because each iterative calculation looks up this table, the value-based algorithm is suitable for applications where the state and action spaces are discrete and the dimensionality is not too high. In order to handle state and action spaces that are too large, function approximation was proposed to estimate the value function. As research deepened, deep neural networks were used to develop a novel artificial agent, named the deep Q-network (DQN), which can learn successful policies from high-dimensional states [78]. Due to the use of deep neural networks, this kind of value-based algorithm has been successfully applied to all sorts of games and has achieved good results. With further research into RL theory and the extensive application of DQN, various variants naturally emerged, such as Double DQN, Dueling DQN, and Rainbow [79].
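A single tabular Q-learning update can be sketched as follows. This is an illustrative toy example with an assumed table size, not part of the paper's controller.

```python
import numpy as np


def q_learning_update(q_table, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(q_table[s_next])
    q_table[s, a] += alpha * (td_target - q_table[s, a])
    return q_table


q = np.zeros((4, 2))  # assumed toy problem: 4 discrete states, 2 discrete actions
q = q_learning_update(q, s=0, a=1, r=1.0, s_next=3)
# Only the visited (state, action) entry moves toward the TD target.
```

The `np.max` over the next state's row is exactly the step that breaks down in continuous action spaces, motivating the policy gradient methods discussed next.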
However, while value-based methods can resolve problems with high-dimensional state spaces, they can only tackle discrete-action applications and fail in continuous action spaces. This kind of RL algorithm cannot be applied to the continuous domain directly because it depends on finding the action that maximizes the action-value function, which requires an iterative optimization process at every step. Besides, if the discretization of the state-action space is made too coarse, the results become unacceptable; if it is made too fine, the problem becomes intractable. Hence, it may be impracticable to apply a value-based method to a continuous control domain such as the control of our vectored thruster AUV. Another important class of RL algorithms, the Policy Gradient (PG) methods, has a wide range of applications in continuous behavior control. PG methods perform gradient ascent on the policy objective function J(θ) with respect to the policy parameters θ; the policy objective function can be defined as J(θ) = E_{s∼ρ^π, a∼π_θ}[R(s, a)], where π_θ is a stochastic policy and ρ^π is the state distribution. The basic idea behind PG methods is to adjust the parameters θ of the policy in the direction of the performance gradient ∇_θ J(θ). Hence, the corresponding gradient given by the policy gradient theorem is ∇_θ J(θ) = E_{s∼ρ^π, a∼π_θ}[∇_θ log π_θ(a | s) Q^π(s, a)].
Then, the policy-based methods update the parameters as θ_{k+1} = θ_k + α ∇̂_θ J(θ_k), where α is the learning rate and ∇̂_θ J denotes a stochastic approximation of the gradient of the objective.
The equation above shows that the gradient is an expectation over possible states and actions. Rather than approximating a value function, PG methods approximate a stochastic policy using an independent function approximator with its own parameters that maximize the expected future reward. The main advantage of the PG method over value-based methods is the use of an approximator to represent the policy directly. In the process of PG learning, the probability distributions of the states and actions must be considered simultaneously; hence, the PG method integrates over both the state and action spaces during training. There can be no doubt that this consumes a large amount of computing resources for high-dimensional state and action spaces [80]. To overcome this drawback, deterministic policy gradient algorithms for reinforcement learning were presented [81]. Because the map from state space to action space is fixed in the deterministic policy gradient, there is no need to integrate over the action space. Consequently, the deterministic policy gradient needs far fewer samples to train than the stochastic policy gradient, which means that it can be estimated much more efficiently than the stochastic version. With a deterministic policy μ_θ with parameters θ and a discounted state distribution ρ^μ, the performance objective can be defined as an expectation: J(θ) = E_{s∼ρ^μ}[R(s, μ_θ(s))].
The gradient for the deterministic policy is ∇_θ J(θ) = E_{s∼ρ^μ}[∇_θ μ_θ(s) ∇_a Q^μ(s, a)|_{a=μ_θ(s)}].
In order to explore the environment fully, a stochastic policy is often necessary. To ensure adequate exploration for the deterministic policy gradient algorithm, an off-policy actor-critic learning algorithm was subsequently proposed. Actor-critic algorithms consist of two components, an actor and a critic. The actor and critic are two different networks with different roles. The critic estimates the value function, which can be the action value (the Q value) or the state value (the V value). The actor updates the policy distribution in the direction suggested by the critic (such as with policy gradients). The actor is a policy network that produces actions by exploring the space, while the critic is a value function that evaluates the actions made by the actor [82]. The critic network is updated by temporal-difference learning, and the actor network is updated by the policy gradient. The performance objective is averaged over the state distribution of the behavior policy β: J_β(μ_θ) = E_{s∼ρ^β}[Q^μ(s, μ_θ(s))], where ρ^β is the stationary distribution of the behavior policy and Q^μ is the action-value function. Then, the off-policy deterministic policy gradient can be presented as ∇_θ J_β(μ_θ) ≈ E_{s∼ρ^β}[∇_θ μ_θ(s) ∇_a Q^μ(s, a)|_{a=μ_θ(s)}].
Given this policy gradient direction, the update process of the off-policy actor-critic DPG can be presented as

δ_{t} = r_{t} + γQ^{w}(s_{t+1}, μ_{θ}(s_{t+1})) − Q^{w}(s_{t}, a_{t})
w_{t+1} = w_{t} + α_{w}δ_{t}∇_{w}Q^{w}(s_{t}, a_{t})
θ_{t+1} = θ_{t} + α_{θ}∇_{θ}μ_{θ}(s_{t}) ∇_{a}Q^{w}(s_{t}, a_{t})|_{a=μ_{θ}(s)}

where Q^{w} is the critic's approximation of the action-value function with parameter w, and α_{w}, α_{θ} are the learning rates.
The advantage of the actor-critic algorithm is its ability to perform single-step updates, which makes it more efficient. The performance of the actor-critic algorithm is determined by the critic's value judgment; nevertheless, convergence is very difficult to achieve, particularly when the actor also needs to update its parameters. To overcome these problems, the Deep Deterministic Policy Gradient (DDPG) has been presented.
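The actor-critic interplay above can be sketched on a toy one-step problem: a linear-in-features critic learns the reward surface while a deterministic actor ascends the critic's action-gradient. Everything here (the quadratic toy reward, learning rates, and feature choice) is our own illustrative assumption, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(3)     # critic weights for the features [1, a, a^2]
a_mu = 0.0          # actor parameter: the deterministic action itself

alpha_c, alpha_a = 0.05, 0.05
for _ in range(1000):
    a = a_mu + rng.normal() * 0.3       # behavior action: policy plus exploration noise
    r = -(a - 1.0) ** 2                 # toy one-step reward, maximized at a = 1
    feats = np.array([1.0, a, a * a])
    # Critic: least-mean-squares step moving Q(a) = w . feats toward the reward
    w += alpha_c * (r - w @ feats) * feats
    # Actor: ascend the critic's action-gradient dQ/da = w[1] + 2*w[2]*a
    a_mu += alpha_a * (w[1] + 2.0 * w[2] * a_mu)
```

Because the toy reward is exactly representable by the critic's features, the critic converges to the true surface and the actor then settles at the optimal action.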
3.3. DDPG in Continuous Domain
Deep reinforcement learning combines deep neural networks with reinforcement learning. This structure allows the algorithm to learn control policies directly from high-dimensional state-action spaces. Owing to its excellent performance over a wide range of applications, the deep neural network (DNN), an artificial neural network (ANN) with several layers between the input and output layers, has recently become a very popular research topic in machine learning. Thanks to huge successes in a variety of fields, such as medical imaging analysis, deep learning with artificial neural networks has attracted great interest. With the development of these network structures, they have been applied in different areas, including engineering control problems.
The basic unit of a neural network is the neuron, a mathematical function that models the behavior of a biological neuron. Because a DNN consists of a large number of layers with many neurons in each layer, it can approximate the mapping from inputs to outputs whether the relationship is linear or nonlinear. Layers in which each neuron is connected to all neurons in the next layer are called fully connected layers. In this network form, each connection carries a weight w, each neuron has a bias b, and the neuron applies an activation function σ to the weighted sum, producing σ(w·x + b), where x is the input vector. The learning process of a neural network continuously adjusts the relevant parameters, mainly the weights and biases, to reduce the error between the real and predicted results. Combining deep neural networks with reinforcement learning algorithms, deep reinforcement learning (DRL) can resolve previously unresolved problems [83]. In DRL, artificial neural networks are used as universal function approximators to estimate value functions or to represent policies.
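As a concrete illustration of a fully connected layer, the following sketch computes σ(Wx + b) for one layer of three neurons; the shapes and the ReLU choice are arbitrary assumptions for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def dense(x, W, b, activation=relu):
    # Each neuron applies the activation to its weighted sum plus bias
    return activation(W @ x + b)

rng = np.random.default_rng(1)
x = rng.standard_normal(4)          # input vector
W = rng.standard_normal((3, 4))     # weights: 3 neurons, 4 inputs each
b = np.zeros(3)                     # biases
h = dense(x, W, b)                  # layer output sigma(Wx + b)
```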
4. Control Based on DDPG for Vectored Thruster AUV
Deep Deterministic Policy Gradient (DDPG) is a model-free off-policy actor-critic algorithm using deep function approximators that can solve problems with high-dimensional, continuous action spaces. Because it builds on the concepts of DQN, DDPG also uses deep neural networks as function approximators, which makes it feasible in complex action-space applications [84]. DDPG contains two independent networks, an actor network and a critic network. With parameter θ^{μ}, the actor network represents the deterministic policy μ(s|θ^{μ}), corresponding to the actor in the actor-critic framework. The critic network with parameter θ^{Q} estimates the action-value function of the state-action pair and provides the gradient information. To make the algorithm stable and robust, DDPG adopts experience replay and target networks.
In order to achieve stable learning, DDPG deploys experience replay and target networks like DQN. Experience replay is a key technology behind many of the latest advancements in deep reinforcement learning [85]. It was introduced to avoid situations where the training samples are not independent and identically distributed: in the training process, samples are generated by sequential exploration of the environment and are therefore correlated. To break this correlation, experience replay stores transitions (s_{t}, a_{t}, r_{t}, s_{t+1}) sampled from the environment in a replay buffer, and the buffer is continually updated by replacing old samples with new ones when it is full. During training, the actor and critic are trained with minibatches sampled randomly from the replay buffer. The main effect of experience replay is to overcome the problems of correlated data and nonstationary distribution of the empirical data; this random sampling greatly increases sample utilization and improves the stability of the algorithm [85].
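A minimal replay buffer of the kind described above can be sketched as follows; the class name and capacity are our own assumptions. A bounded deque gives the "replace oldest when full" behavior for free.

```python
import collections
import random

class ReplayBuffer:
    """Bounded buffer: the oldest transitions are dropped when capacity is reached."""
    def __init__(self, capacity):
        self.buffer = collections.deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # A uniform random minibatch breaks the temporal correlation of samples
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for t in range(150):            # overfill: the 50 oldest transitions are evicted
    buf.store(t, 0.0, 0.0, t + 1)
batch = buf.sample(8)
```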
In order to optimize the action-value (critic) neural network, a loss function based on the mean squared error is used for backpropagation. In DDPG, the parameters of the deep neural network for the critic are updated by minimizing the loss function L defined as

L = (1/N)Σ_{i}(y_{i} − Q(s_{i}, a_{i}|θ^{Q}))²

where y_{i} is the target value generated by the target networks and can be defined as

y_{i} = r_{i} + γQ′(s_{i+1}, μ′(s_{i+1}|θ^{μ′})|θ^{Q′})
Then, the gradient of the loss function L with respect to the critic parameters is

∇_{θ^{Q}}L = −(2/N)Σ_{i}(y_{i} − Q(s_{i}, a_{i}|θ^{Q}))∇_{θ^{Q}}Q(s_{i}, a_{i}|θ^{Q})
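The target computation and mean-squared-error loss can be checked numerically on a toy minibatch; the arrays below are made-up stand-ins for network outputs, not values from the paper.

```python
import numpy as np

gamma = 0.99
r = np.array([1.0, 0.0, -0.5])              # minibatch rewards r_i
q_target_next = np.array([2.0, 1.0, 0.5])   # Q'(s_{i+1}, mu'(s_{i+1})) from the target nets
q_pred = np.array([2.5, 1.2, -0.2])         # Q(s_i, a_i | theta^Q) from the critic

y = r + gamma * q_target_next               # target values y_i
loss = np.mean((y - q_pred) ** 2)           # L = (1/N) * sum_i (y_i - Q_i)^2
dloss_dq = -2.0 * (y - q_pred) / len(r)     # dL/dQ_i, backpropagated into theta^Q
```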
The actor policy function is represented by the network parameter θ^{μ} and is updated using the critic network with parameter θ^{Q} to optimize the expected return. The objective function J(θ^{μ}) is the expected return R_{t} under the policy, which can be defined as

J(θ^{μ}) = E[Q(s, μ(s|θ^{μ})|θ^{Q})]
The policy gradient is obtained by differentiating this objective with respect to the network parameter θ^{μ} under the deterministic policy μ(s|θ^{μ}). The gradient of the objective function with respect to θ^{μ} can be expressed as

∇_{θ^{μ}}J = E[∇_{a}Q(s, a|θ^{Q})|_{a=μ(s|θ^{μ})}∇_{θ^{μ}}μ(s|θ^{μ})]
Hence, the parameter θ^{μ} of the actor online policy network in DDPG can be updated by using the sampled policy gradient

∇_{θ^{μ}}J ≈ (1/N)Σ_{i}∇_{a}Q(s, a|θ^{Q})|_{s=s_{i}, a=μ(s_{i})}∇_{θ^{μ}}μ(s|θ^{μ})|_{s=s_{i}}
In order to avoid divergence of the algorithm, separate target networks are created as copies of the original actor and critic networks. In the DDPG algorithm, two target networks, Q′(s, a|θ^{Q′}) and μ′(s|θ^{μ′}), are created for the main critic and actor networks, respectively. The target networks have the same architecture as the main networks but their own parameters θ^{Q′} and θ^{μ′}.
To improve the stability of learning, we use the "soft" update method, a variant of the target-network technique illustrated by Mnih et al., to update the parameters. The weights of the target networks are constrained to change slowly by tracking the main networks: θ′ ← τθ + (1 − τ)θ′, where τ ≪ 1. With this update, the stability of the network improves significantly. In addition, because the actor policy is deterministic, noise sampled from a noise process is added to the actions to improve the efficiency of exploration.
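The soft update rule and the exploration-noise process can be sketched as follows; the τ, θ, and σ values are illustrative assumptions, and the Ornstein–Uhlenbeck process shown is the one commonly paired with DDPG.

```python
import numpy as np

def soft_update(target, main, tau=0.001):
    # theta' <- tau*theta + (1 - tau)*theta': targets slowly track the main networks
    return tau * main + (1.0 - tau) * target

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated exploration noise."""
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.state = np.full(size, mu)
        self.rng = np.random.default_rng(seed)

    def sample(self):
        # Mean-reverting drift toward mu plus Gaussian diffusion
        dx = self.theta * (self.mu - self.state) \
             + self.sigma * self.rng.standard_normal(self.state.size)
        self.state = self.state + dx
        return self.state

target_w, main_w = np.zeros(3), np.ones(3)
for _ in range(10):
    target_w = soft_update(target_w, main_w, tau=0.1)

noise = OUNoise(size=2)
a = np.array([0.5, -0.5]) + noise.sample()   # perturbed deterministic action
```

After n soft updates from zero toward weights of one, the target equals 1 − (1 − τ)^n, which shows how slowly the target tracks the main network.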
As described above, the vectored thruster has one propeller and two deflection angles that need to be controlled. Hence, the control system needs to produce three continuous control outputs to achieve the desired reference dynamic state, such as the velocities of the AUV. According to the dynamics of the AUV, the designed control system must be able to accomplish nonlinear continuous control of the vectored thruster AUV in a complicated and changeable underwater environment. To solve this continuous control problem, an adaptive control system for the vectored thruster AUV based on the DDPG algorithm is proposed in this work. The aim of this study is to develop a new control algorithm able to handle the vectored thruster AUV under different operating conditions. The architecture of the control system based on the DDPG algorithm is illustrated in Figure 9. As can be seen, the designed control architecture is divided into four units: the AUV reference generator unit, the reward function unit, the DDPG controller unit, and the AUV environment unit.
As shown in Figure 9, the control architecture has an important part named the AUV reference speed generator, which generates a set of control data points so the AUV can be trained more effectively. The designed controller obtains the operating instructions provided by the reference generator to manipulate the AUV; with this approach, the controller can deal with different setting conditions. Another important source of dynamic information is the measurement data generated by the sensor system of the AUV. These measurements are compared with the reference state to produce an instantaneous error vector e_{t}, which represents the difference between the setting parameters and the practical measurement results and contributes instantaneous information to the merged state s_{t}. The reward function unit is the main indicator for evaluating the advantages and disadvantages of each action. The input of the reward function model is the instantaneous error vector e_{t}. In this way, the immediate reward r_{t} is defined by the reference state and the error vector e_{t} to evaluate the current activity and the feedback error. The AUV controller receives the information summarized in the system state s_{t} of the AUV and the immediate reward r_{t} of the current state. According to the input parameters s_{t} and r_{t}, the AUV controller produces the action a_{t} for the AUV simulation environment after extensive learning and iterative computation. In practical applications, the state information can be measured by the sensor systems of the AUV, such as the DVL and IMU. In our work, the AUV simulation environment is established based on the study described in Section 2; therefore, the state information can be obtained directly from the kinematic and dynamic analysis of the AUV.
This simulation method ensures development accuracy, increases productivity, and shortens development cycles. In addition, state information based on the simulation environment is especially useful at the early stage of developing the DDPG-based AUV controller. This method avoids the extraordinary effort and great cost in experiments and computation that would otherwise be required to complete the AUV controller training process.
Based on the presented DDPG architecture, the AUV controller for low-level control of the vectored thruster AUV is developed. To represent the control system more clearly, an algorithmic description is provided. The algorithm for vectored thruster AUV control is summarized in the pseudocode of Algorithm 1, and the algorithm workflow is shown in Figure 10.

In line 1 of Algorithm 1, the input parameters are the maximum number of training episodes M, the number of iterations of each episode T, the minibatch size N, the soft update factor for the target networks τ, the minimum and maximum sizes of the replay buffer m_{min} and m_{max}, and the discount factor γ. Since Algorithm 1 needs to learn a continuous control problem, the actor network and critic network are initialized randomly, and the target networks are initialized with the same parameters, θ^{Q′} ← θ^{Q} and θ^{μ′} ← θ^{μ}, as shown in lines 2 and 3. In addition, the replay buffer is initialized at the beginning (line 4). As a result of the analysis, the control problem of the vectored thruster AUV can be seen as a continuous control task. In Algorithm 1, each episode of the learning process contains a loop with a fixed number of steps T. In training, the algorithm (lines 5 to 28 of Algorithm 1) is carried out as a loop with a maximum number of cycles M. In line 6, a random Ornstein–Uhlenbeck stochastic process is initialized for action exploration. In lines 7 and 8, the AUV simulation environment is initialized with the setting parameters at time 0 s, and the observation state s_{1} is obtained directly from the AUV simulation environment. In the inner loop from line 9 to line 27, the core part of the algorithm is performed to control the AUV. A fixed sample time dt is set for each step of the inner loop, taking the practical application of the real AUV system and its efficiency into consideration; the time interval dt must therefore be chosen effectively and reasonably, so that the training process develops over time to meet the needs of the control system as the number of cycles increases. The control action a_{t} is obtained from the state s_{t} because the actor policy is deterministic (line 10).
The action a_{t} is immediately sent to the AUV simulation environment to complete the corresponding motion control (line 11).
In order to improve the efficiency and reliability of the training process, the experience replay buffer is used to update the actor and critic networks during training. To ensure the normal operation of experience replay, the buffer must store a sufficient number of transitions m_{min} before the networks are trained (line 12). When this condition is fulfilled, a random minibatch of N transitions is sampled from the buffer (lines 13 and 14), and the target state-action value y_{i} is calculated, where the target Q-value and target action are obtained from the target Q network and target policy network, respectively. The critic network is updated by minimizing the loss function L, yielding the critic network parameter θ^{Q} (line 15). In line 16, the actor network is updated using the sampled policy gradient, yielding the actor network parameter θ^{μ}. According to the network parameters θ^{Q} and θ^{μ} calculated in lines 14 to 16, the critic target Q network and the actor target policy network are updated in lines 17 and 18. In addition, when the size of the replay buffer reaches its maximum, the earliest stored experience is removed to improve efficiency and reduce costs (lines 20–22). The AUV receives the action a_{t}, and the new state s_{t+1} is then obtained directly from the sensor system on the AUV in a real application, while in this work the state information is calculated by the simulation environment (line 23). Then, the reward function calculates the immediate reward r_{t} to evaluate the effect of the action (line 24). Subsequently, the transition (s_{t}, a_{t}, r_{t}, s_{t+1}) is stored in the replay buffer for training the networks. At the end of the algorithm, the critic network represented by θ^{Q} and the actor policy network represented by θ^{μ} are output.
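The inner loop of Algorithm 1 can be sketched as follows, with stub networks and a stub environment standing in for the AUV dynamics; every function and constant here is a placeholder assumption, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def actor(s):                         # stand-in for the policy mu(s | theta^mu)
    return np.tanh(s[:3])

def env_step(s, a, dt=0.1):           # stand-in for the AUV simulation environment
    s_next = s + dt * np.concatenate([a, -0.1 * s[3:]])
    reward = -float(np.sum(s_next[:3] ** 2))    # toy tracking-error penalty
    return s_next, reward

replay = []
M, T, N, m_min, m_max = 3, 20, 8, 16, 64
for episode in range(M):
    s = 0.1 * rng.standard_normal(6)            # reset the simulated AUV state
    for t in range(T):
        a = actor(s) + rng.normal(scale=0.1, size=3)   # action + exploration noise
        s_next, r = env_step(s, a)
        replay.append((s, a, r, s_next))
        if len(replay) > m_max:                  # evict the oldest experience
            replay.pop(0)
        if len(replay) >= m_min:                 # train only once enough samples exist
            idx = rng.choice(len(replay), size=N, replace=False)
            batch = [replay[i] for i in idx]     # minibatch for the updates
            # The critic update (minimize L), actor update (sampled policy
            # gradient), and soft updates of both target networks would go here.
        s = s_next
```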
5. Simulation Results
In our designed control system, the simulation environment is used to simulate the real underwater physics of the AUV. During the simulations, the sampling time used in Algorithm 1 is set to dt = 0.1 s in consideration of the actual application. In this vectored thruster AUV, the control commands are applied to three quantities: the propulsive force T_{p} and the rudder angle α and elevator angle β representing the deflection angles of the duct. The commands received by the AUV can therefore be defined as a vector whose components are the force T_{p}, the rudder angle α, and the elevator angle β. These commands are generated by the actor policy network of the designed controller.
In this control algorithm, the state s_{t} in the Markov process represents the current state of the vectored thruster AUV in the underwater environment. In our AUV simulation environment, the state parameters are defined by the instantaneous measurements from the sensors in the AUV: the linear and angular velocities, which can be measured by the DVL and IMU; the linear and angular accelerations corresponding to those velocities; and the velocity error e_{t}, obtained from the real measured velocity and the set reference velocity at time t. The ultimate goal of this controller is to minimize the deviations of the real measured variables from the reference settings while minimizing the use of the vectored thruster to reduce energy consumption. In addition, the fluctuation of the controlled dynamic variables of the AUV should not be so large as to make practical control difficult. To accomplish this purpose, the reward function used in Algorithm 1 is essential for evaluating the effect of the executed action a_{t} on the system performance. In order to evaluate the advantages and disadvantages of reward functions more fully, a reward function with several distinct considerations is proposed in our study. This immediate reward function r_{t} is defined as

r_{t} = −ζ‖c ∘ e_{t}‖² − κ‖a_{t}‖ − ξ‖ā_{t−1} − a_{t}‖ − σ‖Σ_{k≤t}e_{k}‖

where the first term evaluates the squared error between the real measurement values and the references. Due to the motion characteristics of the AUV, a scale factor c needs to be applied to represent the error more efficiently; in the process of training, the parameters in the factor c are changed according to the motion characteristics of this AUV. The second term describes the actual usage degree of the vectored thruster. The third term is added to prevent the vectored thruster from producing sudden changes in propulsive force and duct deflection angles.
This third term is obtained by calculating the norm between the average of the past executed actions and the current action a_{t}; the average is computed over the actions executed up to time t − 1 at each iteration. The last term represents the error accumulation between the real measured dynamic variables and the references. This term is inspired by the integral term of the PID algorithm and helps eliminate the steady-state error. The parameters ζ, κ, ξ, and σ are scale factors.
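The four-term reward described above can be sketched as follows; the weights, helper arrays, and function signature are illustrative assumptions, not the paper's tuned values.

```python
import numpy as np

def reward(e_t, a_t, a_hist, e_accum, c, zeta=1.0, kappa=0.1, xi=0.1, sigma=0.01):
    err = -zeta * float(np.sum((c * e_t) ** 2))     # scaled squared tracking error
    effort = -kappa * float(np.linalg.norm(a_t))    # penalize thruster usage
    # Penalize jumps of the current action away from the mean of past actions
    smooth = -xi * float(np.linalg.norm(np.mean(a_hist, axis=0) - a_t))
    integral = -sigma * float(np.linalg.norm(e_accum))  # accumulated error (PID-like)
    return err + effort + smooth + integral

e_t = np.array([0.1, 0.0, -0.05])        # instantaneous velocity errors
a_t = np.array([0.5, 0.0, 0.1])          # current action: (T_p, alpha, beta)
a_hist = np.array([[0.4, 0.0, 0.1],      # previously executed actions
                   [0.6, 0.0, 0.1]])
e_accum = 5.0 * e_t                      # stand-in for the accumulated error
c = np.ones(3)                           # scale factor on each error channel
r_t = reward(e_t, a_t, a_hist, e_accum, c)
```

Setting κ, ξ, or σ to zero disables the corresponding term, which is how the ablation simulations described below can be reproduced in this sketch.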
In order to verify the feasibility of the proposed Algorithm 1, a numerical simulation is implemented in Python with TensorFlow. As described in Section 4, the policy network is a deep fully connected neural network with five layers: one input layer, three hidden layers, and one output layer. The size of the input layer is 18, the sizes of the hidden layers are 600 and 400, and the size of the output layer is 3. As for the activation functions, the hidden layers use ReLU, and the output layer uses tanh. The state-action value network uses a similar architecture apart from the size of the output layer. In addition, all parameters are set before carrying out the series of numerical simulations. The maximum episode count and step count were fixed as M and T. During training, the sampling time is set to dt, which takes into account both the calculation speed and accuracy of the designed simulation environment and the practical application of the AUV.
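A forward pass through an actor network with the stated sizes can be sketched in plain NumPy; note that the paper lists three hidden layers but gives two hidden sizes, so the sketch below uses the two stated widths (600 and 400), and the weight initialization is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [18, 600, 400, 3]                       # input, hidden, and output widths
params = [(0.01 * rng.standard_normal((m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

def actor_forward(s):
    h = s
    for W, b in params[:-1]:
        h = np.maximum(W @ h + b, 0.0)          # ReLU hidden layers
    W, b = params[-1]
    return np.tanh(W @ h + b)                   # tanh keeps actions in [-1, 1]

a = actor_forward(rng.standard_normal(18))
```

The tanh output layer bounds the three action components, which can then be rescaled to the physical ranges of the thrust and duct angles.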
The maximum and minimum sizes of the experience replay buffer were set as m_{max} and m_{min}. The learning rates for the actor and critic networks are L_{R−A} and L_{R−C}. The discount rate and the soft update rate for the target networks are γ and τ, respectively. The size of the state-transition minibatch was defined as N. The parameter settings of the DDPG controller are shown in Table 2.
According to the aforementioned Algorithm 1 and the related parameters, a series of simulations is carried out to study the effect of each term in reward function equation (36). In order to make a better comparison, the following simulations are all accomplished with the same reference state, defined by a set of reference velocities for the vectored thruster AUV in which all components are zero except the velocity in the x-direction. The parameters in the scale factor c are set according to these reference velocities. Then, the performance of the reward function with only the tracking-error term active can be simulated, and the results are shown in Figure 11.
As can be seen in Figures 11(a) and 11(b), the linear and angular velocities are very close to the desired reference velocities. The simulation results prove the feasibility and correctness of the scheme. The results also illustrate that the simple reward function, in which only the errors between the real measurement values and the references are taken into consideration, already achieves good practical performance. However, according to Figures 11(a)–11(d), the linear and angular velocities show high volatility because the thrust T_{p} and the duct deflection angles α and β change greatly. In addition, the linear and angular velocities are larger than the references, which results in an unnecessary loss of power. Considering that the reference velocities are set to zero except for the velocity in the x-direction, the position and orientation of the AUV should remain zero except for the displacement in the x-direction. These deviations also need to be considered in the reward function to improve the performance of Algorithm 1. In order to achieve the goal of reducing energy consumption, the factor κ in the reward function is given a nonzero value, with the other parameters unchanged as described above. The corresponding simulations are carried out, and the results are shown in Figure 12.
As can be seen in Figures 12(a)–12(d), the simulation results are obtained with the reward function using the new factors. Comparing the results in Figures 11 and 12, the usage of the vectored thruster has declined significantly. The simulation results also illustrate the reasonableness and validity of the second term of the reward function, which smooths out the velocity fluctuation and effectively enhances the reference tracking performance of the algorithm. The comparison of the amplitude variations of the thrust and duct angles under the same parameters shows that the reward function considering energy consumption penalizes the usage of the vectored thruster while reducing its fluctuation range. Meanwhile, this method also reduces the deviations of position and orientation, providing more accurate control of the AUV.
However, although the results prove that this reward function achieves a good result, the large changes of the thrust T_{p} and the duct angles α and β make it difficult for the vectored thruster to exploit the algorithm in a real application. Hence, the third term with the factor ξ, which penalizes fluctuation of the action outputs, is added to the reward function to evaluate the performance of the actions. In addition, in order to further assess whether the second term of the reward function is necessary, another simulation is carried out with the factor κ set to zero. The simulation results are shown in Figure 13.
Comparing the simulations of Figures 11–13, the result obtained with the third term of the reward function proves that this term is effective in reducing the range of the action outputs. According to Figures 12 and 13, it should also be noted that the second term of the reward function plays an important part in reducing energy consumption and improving the performance of this AUV.
In order to further improve the performance of the control system, the first three terms are all adopted in the reward function to better exploit the algorithm for the AUV in real applications. Hence, in the next simulation, the factors ζ, κ, and ξ are all given nonzero values, and the results are shown in Figure 14.
As can be seen in Figure 14, the ranges of the thrust T_{p} and the duct angles α and β are smaller than before, which makes this algorithm easier to use in real AUV applications. This performance provides strong evidence that the second term of the reward function also stabilizes and smooths the action outputs, even though it was originally introduced to reduce energy consumption. Although the simulations above indicate that this algorithm can achieve good results for the vectored thruster AUV, the bias between the control deflection angles of the duct and the goal is still large. Meanwhile, the biases of the duct angles α, β and the thrust T_{p} lead to a large deviation in the position and orientation of the AUV. In order to further improve the performance of the algorithm, this bias in the thrust and duct angles needs to be considered in the reward function. Based on the above comparison and consideration, the last term of the reward function, inspired by the integral term of the PID algorithm, is added to reduce the effect of error propagation. A new simulation is carried out with all four terms of the reward function considered, and the results are shown in Figure 15.
The results of the simulations are shown in Figure 15. As can be seen, the improved performance indicates the effect of using the reward function with the punishment term on error accumulation. As shown in Figures 15(e) and 15(f), the position and orientation of the AUV can also be obtained; in particular, the biases of the duct angles, position, and orientation of the AUV decrease effectively. Comparing the current results in Figure 15 with the other results above proves that the reward function considering all aspects yields good and stable performance. Comparison of the simulation results in Figures 5 and 15 shows a high degree of agreement between the designed controller based on Algorithm 1 and the traditional PID method. The results of the simulation comparing RL and PID are shown in Figure 16.
As can be seen, the simulation results indicate that the controller based on DDPG performs well in controlling the vectored thruster AUV. Comparing the simulation results, the designed controller based on DDPG is better than the PID controller in dynamic performance. In order to further study the performance of the designed controller under greater uncertainty, simulations are carried out to study the anti-jamming performance with Gaussian white noise excitations. The simulation results are shown in Figure 17.
Under the Gaussian white noise disturbances introduced into the simulation environment, both the controller based on DDPG and the PID controller could realize their functions. Based on the above research, the results show that the designed control scheme based on DDPG has good dynamic and static response and strong anti-interference ability. The simulation results in Figures 16 and 17 show that the proposed controller based on DDPG has better stability, a fast convergence rate, and good tracking ability.
In order to test the capability of the algorithm, further simulations with changed references are carried out. A new reference state commanding a nonzero angular velocity ω_{y} is set, and the simulation results are obtained after training the algorithm. The obtained results are shown in Figure 18.
As can be seen in Figure 18, the angular velocity ω_{y} reliably achieves the set velocity requirement. From Figures 18(c) and 18(d), the thrust output T_{p} is very stable within the limit of the ultimate thrust, and the duct angle β is 15°, which is the limiting deflection angle of the duct. Comparing the results of Figures 18 and 7, the designed controller and reward function accomplish angular velocity control for this vectored thruster AUV, and the control performance is better than that of the conventional PID controller.
We applied the DDPG method to the proposed controller for the vectored thruster AUV; the training reward and the time consumption are shown in Figures 19 and 20.
As can be seen in Figure 19, the accumulated reward tends to increase monotonically until about episode 1500, after which it stabilizes. This learning curve shows the development tendency of the proposed DDPG-based controller as training proceeds. As can be seen in Figure 20, the mean time per episode is 9.24 seconds, and our method costs almost 7.7 hours for the whole 3000 episodes of simulated time, which corresponds to about 3.5 days of computation in real time.
6. Conclusion and Future Work
In this paper, an AUV controller based on the Deep Deterministic Policy Gradient (DDPG) was proposed to improve the control performance of the vectored thruster AUV. The proposed algorithm uses the information measured by the internal sensors of the AUV to provide the control commands the AUV needs to fulfill its task. There is no requirement to provide the designed controller with a model of the large, complex nonlinear system of the vectored thruster AUV, which is essential in classic control theory. With only some input parameters of the AUV, the proposed algorithm is able to learn a control strategy that meets exact implementation requirements. In the learning process, the reward function is fundamental for the DDPG controller to realize the system goal and the related functions of the AUV. In this algorithm, a reward function is proposed by considering a series of control precision requirements and the influence of operational constraints. The designed reward function can effectively improve reliability and stability, reduce energy consumption, and suppress sudden changes of the vectored thruster. It should be particularly noted that the proposed DDPG-based control system was developed to realize the lower-layer motion control of the vectored thruster AUV, although a greater range of applications and more complex dynamic control systems can be addressed by this method. Therefore, the controller based on the DDPG algorithm has broad application and development prospects.
Furthermore, our proposed algorithm framework for the AUV uses only system states that can be measured directly by sensors as inputs, in contrast to earlier methods that use images as input parameters. In this paper, it is shown that the motions of the AUV can be controlled directly by sending low-level control commands to the vectored thruster. To confirm the algorithm's effectiveness, a series of simulations is carried out in a simulation environment established through the kinematic and dynamic analysis of the vectored thruster AUV. In this sense, using a simulation environment in place of the real underwater application environment proves to be cost-saving and efficient, and our work expands the application range of AUV control studies using deep reinforcement learning. Furthermore, our proposed control algorithm provides an alternative approach for controlling underwater vehicles and other kinds of robots.
Certainly, our present study has limitations despite its achievements. In our proposed algorithm, the simulations are carried out under ideal conditions, so realistic experiments need to be completed to verify the correctness and feasibility of the proposed method. Moreover, more influencing factors should be taken into account, such as time-delay uncertainty among the sensors, actuators, and controllers. In addition, how to further improve the performance and stability of the proposed controller is an important task for future research. Finally, control algorithms based on deep reinforcement learning have broad application prospects and important theoretical and practical engineering significance; therefore, the related research will become ever more important.
Data Availability
The data of hydrodynamic and thrust coefficients of the AUV used to support the findings of this study are included within the supplementary information file (Appendix B). The other data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Supplementary Materials
The supplementary file includes two parts: the element terms of the dynamic equations of motion and the hydrodynamic and thrust coefficients of the AUV. The two supplementary files are an important addition for modeling the complex dynamics of AUVs. (Supplementary Materials)