Journal of Robotics

Volume 2018, Article ID 5781591, 10 pages

https://doi.org/10.1155/2018/5781591

## Dynamic Path Planning of Unknown Environment Based on Deep Reinforcement Learning

Key Laboratory of Intelligent Ammunition Technology, School of Mechanical Engineering, Nanjing University of Science and Technology, Nanjing 210094, China

Correspondence should be addressed to Zhian Zhang; zzayoyo@163.com

Received 11 July 2018; Revised 6 August 2018; Accepted 2 September 2018; Published 18 September 2018

Academic Editor: Gordon R. Pennock

Copyright © 2018 Xiaoyun Lei et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Dynamic path planning in unknown environments has always been a challenge for mobile robots. In this paper, we apply the double deep Q-network (DDQN) reinforcement learning algorithm, proposed by DeepMind in 2016, to dynamic path planning in unknown environments. The reward and punishment function and the training method are designed to cope with the instability of the training stage and the sparsity of the environment state space. In different training stages, we dynamically adjust the starting position and target position. As the neural network is updated and the probability of the greedy rule increases, the local space searched by the agent expands. The Pygame module in Python is used to build dynamic environments. Taking the lidar signal and the local target position as inputs, convolutional neural networks (CNNs) are used to generalize the environment state. The Q-learning algorithm enhances the agent's ability of dynamic obstacle avoidance and local planning in the environment. The results show that, after training in different dynamic environments and testing in a new environment, the agent is able to reach the local target position successfully in an unknown dynamic environment.

#### 1. Introduction

Since deep reinforcement learning [1] was formally proposed in 2013, tremendous progress has been made in the field of artificial intelligence [2]. The deep Q-network agent was demonstrated to surpass the performance of all previous algorithms and to achieve a level comparable to that of a professional human games tester on Atari 2600 games [3–7]. AlphaGo Zero [8, 9] defeated all previous AlphaGo versions through self-play, without using any human game records. An agent can be trained to play an FPS game receiving only pixels and the game score as inputs [10]. These examples fully demonstrate the great potential of deep reinforcement learning in autonomous decision-making now that neural networks have overcome the curse of dimensionality. In [11, 12], deep reinforcement learning has been applied to autonomous navigation based on visual inputs, with remarkable success. In [11], Mirowski et al. highlighted the utility of un-/self-supervised auxiliary objectives, namely, depth prediction and loop closure, in providing richer training signals that bootstrap learning and enhance data efficiency. The authors analyze agent behavior in static mazes featuring complex geometry, random start position and orientation, and dynamic goal locations. Their results show that their approach enables the agent to navigate within large and visually rich environments with frequently changing start and goal locations, but the maze layout itself is static. The authors did not test the algorithm in environments with moving obstacles; with moving obstacles, the images collected by the cameras may be different in every episode. In [12], Zhu et al. try to find the minimum-length sequence of actions that moves an agent from its current location to a target specified by an RGB image.
To solve the lack of generalization, i.e., that the network must be retrained for new targets, they specify the task objective (i.e., the navigation destination) as an input to the model and address the problem by introducing shared Siamese layers into the network. However, as mentioned in the paper, in the deep Siamese actor-critic architecture the ResNet-50 layers are pretrained on ImageNet and kept fixed during training. This means that a large number of different target and scene images with a constrained background must be collected to pretrain the network before training the navigation model, indicating that the generalization ability is still conditioned on prior information about the maps.

In this paper, we present a novel path planning algorithm and solve the generalization problem by means of local path planning with DDQN deep reinforcement learning based on lidar sensor information. In recent deep reinforcement learning models, the original training mode fills the sample pool with a large number of states from movement in the free zone, and the resulting lack of trial-and-error punishment samples and target reward samples ultimately causes the algorithm to diverge. We therefore constrain the starting position and target position by randomly setting the target position in the area not occupied by obstacles, expanding the state-space distribution of the sample pool.
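The constrained sampling of start and target positions can be sketched as follows. This is a minimal toy illustration on an occupancy grid (the paper's actual maps are Pygame environments; the grid encoding below, 0 = free and 1 = obstacle, is an assumption for the sketch):

```python
import numpy as np

def sample_start_and_target(grid, rng):
    """Sample distinct start and target cells uniformly from the cells
    that are not occupied by obstacles (0 = free, 1 = obstacle)."""
    free = np.argwhere(grid == 0)                      # coordinates of free cells
    i, j = rng.choice(len(free), size=2, replace=False)
    return tuple(free[i]), tuple(free[j])

# Toy occupancy grid standing in for a Pygame map.
grid = np.array([[0, 1, 0],
                 [0, 0, 0],
                 [1, 0, 0]])
start, target = sample_start_and_target(grid, np.random.default_rng(0))
```

Drawing both positions only from unoccupied cells guarantees that every episode begins and ends in the navigable free space, so reward and punishment samples are reachable.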

To evaluate our algorithm, we use TensorFlow to build the DDQN training framework for simulation and demonstrate the approach in the real world. In simulation, the agent is trained in low-level and intermediate dynamic environments. The starting point and target point are randomly generated to ensure the diversity and complexity of the local environment, and the test environment is a high-level dynamic map. We show details of the agent's performance on an unseen dynamic map in the real world.

#### 2. Deep Reinforcement Learning with DDQN Algorithm

The conventional Q-learning algorithm [1] cannot effectively plan a path in a random dynamic environment because of its lack of generalization ability and its large Q table. To solve the curse of dimensionality in a high-dimensional state space, the optimal action-value function Q of Q-learning can be parameterized by an approximate value function:

$$Q(s,a) \approx Q(s,a;\theta) \tag{1}$$

where $\theta$ is the Q-network parameter. We approximate the value of Q for a definite environment state by a function, in the simplest case a linear approximation; thus, there is no need to build a large Q table storing the corresponding Q values for different state-action pairs. A neural network is used in place of the linear function, which yields a nonlinear approximation with generalization ability. The state-action pair $(s,a)$ is regarded as the input of the neural network and the output is the value of Q, where $\theta$ is the weight of the Q-network. The Q table is replaced by the Q-network. The training process constantly adjusts the network weights to reduce the bias between the output of the Q-network and the target value of Q. Assuming that the target value of Q is denoted by $y$, the loss function of the Q-network is

$$L_i(\theta_i) = \mathbb{E}_{s,a}\!\left[\left(y_i - Q(s,a;\theta_i)\right)^2\right] \tag{2}$$

where $i$ is the current iteration index and $(s,a)$ is a state-action pair.

The update mode of the value of Q is similar to that of Q-learning; that is,

$$y_i = r + \gamma \max_{a'} Q(s',a';\theta_{i-1}) \tag{3}$$

where $r$ is the current reward and $\gamma$ is the discount factor.

Stochastic gradient descent is adopted to train the neural network. Through backpropagation of the derivative of the loss function, the network weights are constantly adjusted so that the network output approaches the target value of Q. The network gradient can be derived from (2) and (3):

$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a,r,s'}\!\left[\left(r + \gamma \max_{a'} Q(s',a';\theta_{i-1}) - Q(s,a;\theta_i)\right)\nabla_{\theta_i} Q(s,a;\theta_i)\right] \tag{4}$$

Equation (4) indicates that the updates of the neural network and of Q-learning are simultaneous: the Q-network of the current iteration is updated toward the target value of Q computed with the parameters of the previous iteration.
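The target (3) and gradient step (4) can be sketched with a toy linear approximator in NumPy. The paper itself trains a CNN in TensorFlow; the feature matrix `phi` (one row per action) and all numeric values below are purely illustrative:

```python
import numpy as np

def q_values(theta, phi):
    # Linear approximation: phi has one feature row per action,
    # so phi @ theta yields Q(s, a; theta) for every action a.
    return phi @ theta

def td_step(theta, phi_s, a, r, phi_next, gamma, lr, done):
    """One stochastic-gradient step on the squared TD error of Eqs. (2)-(4)."""
    target = r if done else r + gamma * np.max(q_values(theta, phi_next))
    error = target - q_values(theta, phi_s)[a]
    # The gradient of 0.5 * error^2 w.r.t. theta is -error * phi_s[a],
    # so we step in the opposite direction.
    return theta + lr * error * phi_s[a]
```

Repeating `td_step` on a fixed transition drives the predicted Q value toward the target, which is the behavior Equation (4) describes.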

Reinforcement learning is known to be unstable or even to diverge when a nonlinear function approximator such as a neural network is used to represent the action-value (also known as Q) function [13]. This instability has several causes: the correlations present in the sequence of observations and the correlations between the action values (Q) and the target values. To address these instabilities, DeepMind presented a biologically inspired mechanism termed experience replay [14–16] that randomizes over the data. To perform experience replay, DeepMind stores the agent's experiences (environment state, action, and reward) at each time step $t$ in a dataset. The pool of data samples is extended by running an ε-greedy policy to search the environment. During learning, the network is trained with random samples drawn from the pool, thereby removing correlations in the observation sequence and improving the stability of the algorithm.

In 2015, DeepMind improved the original algorithm in the paper "Human-Level Control through Deep Reinforcement Learning" published in *Nature*. They added a target Q-network, whose update is delayed relative to the predicted Q-network, to compute the target values of Q, thereby restraining the large bias in Q that results from fast dynamic updating of the targets. The improved loss function is

$$L_i(\theta_i) = \mathbb{E}\!\left[\left(r + \gamma \max_{a'} Q(s',a';\theta_i^-) - Q(s,a;\theta_i)\right)^2\right] \tag{5}$$

where $\theta_i$ are the parameters of the Q-network at iteration $i$ and $\theta_i^-$ are the network parameters used to compute the target at iteration $i$. The target network parameters $\theta_i^-$ are only updated with the Q-network parameters every $C$ steps and are held fixed between individual updates, thus ensuring that the update of the target network is delayed and the estimate of Q is more accurate.

The update mode of the Q-learning algorithm leads to overestimation of action values. The algorithm estimates the value of a certain state too optimistically, so the Q value of a suboptimal action can become greater than that of the optimal action, changing the chosen action and reducing the accuracy of the algorithm. To address this overestimation problem, in 2016 DeepMind presented an improved algorithm in the paper "Deep Reinforcement Learning with Double Q-Learning", namely, the Double Q-network (DDQN) algorithm [17]. Instead of directly forming the target from the maximum value of the target Q-network, DDQN selects the action that maximizes the predicted Q-network and evaluates that action with the target Q-network.

The loss function of the improved algorithm is

$$L_i(\theta_i) = \mathbb{E}\!\left[\left(r + \gamma\, Q\big(s', \arg\max_{a'} Q(s',a';\theta_i);\, \theta_i^-\big) - Q(s,a;\theta_i)\right)^2\right] \tag{6}$$

The framework illustration of DDQN is shown in Figure 1.
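The difference between the standard DQN target of (5) and the Double DQN target of (6) can be sketched as follows (a minimal NumPy illustration; `q_next_online` and `q_next_target` are hypothetical Q-value vectors for the next state, not outputs of the paper's networks):

```python
import numpy as np

def dqn_target(r, q_next_target, gamma, done):
    """Standard DQN target: the target network both selects and evaluates
    the next action, which tends to overestimate the action value."""
    return r if done else r + gamma * np.max(q_next_target)

def ddqn_target(r, q_next_online, q_next_target, gamma, done):
    """Double DQN target (Eq. (6)): the online network selects the action,
    the target network evaluates it, decoupling selection from evaluation."""
    if done:
        return r
    a_star = int(np.argmax(q_next_online))    # selection: online network
    return r + gamma * q_next_target[a_star]  # evaluation: target network
```

When the online and target networks disagree about the best next action, the DDQN target is no larger than the DQN target, which is how the overestimation bias is reduced.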