Abstract

As the core technology in the field of mobile robots, robot obstacle avoidance technology substantially enhances the running stability of robots. Built on path planning or guidance, most existing obstacle avoidance methods underperform, with low efficiency, in complicated and unpredictable environments. In this paper, we propose an obstacle avoidance method with a hierarchical controller based on deep reinforcement learning, which can realize more efficient adaptive obstacle avoidance without path planning. The controller comprises multiple neural networks: an action selector and an action runner consisting of two neural network strategies and two single actions. The action selector and each neural network strategy are separately trained in a simulation environment before being deployed on a robot. We validated the method on a wheeled robot; more than 200 tests yielded a success rate of up to 90%.

1. Introduction

The robot obstacle avoidance method is a comprehensive approach integrating multiple submethods. Generally, robot obstacle avoidance is divided into three parts: the leading part, the core, and the back-end part. Among them, the algorithmic models for obstacle avoidance and path planning, equivalent to the human brain making decisions, serve as the core. The leading part obtains obstacle information (via a camera or a radar) to simulate human vision. Like the human nervous system, the back-end part handles automatic control, receiving signals from the brain to drive the body into action, as shown in Figure 1. Progress in any one of these three parts advances robot obstacle avoidance technology, allowing faster and better obstacle avoidance and expanding its adaptability to various environments.

From the perspective of navigation and path planning, most traditional obstacle avoidance methods plan one or more paths, based on the locations of obstacles in the current spatial environment, for avoidance. The artificial potential field method, proposed by Khatib [1], has been steadily refined through numerous improvements and extensions by many researchers, enabling a wide range of applications. Han introduced kinetic conditions [2], used to generate short and smooth routes for unmanned aerial vehicle (UAV) obstacle avoidance in aerial conditions.

Proposed as early as 1968 [3], the A* algorithm has also been applied extensively in the field of path planning through various improvements, such as the data-driven A* algorithm proposed by Ryo [4]. Recent years have also witnessed efforts on hybrid path planning algorithms such as [5] and various types of bionic path planning methods applied to different scenarios [6]. However, obstacle avoidance methods based on path planning often ignore the lack of environmental information in unfamiliar practical application environments. The same is true for a person walking, who can only make decisions according to the current information acquired by vision. Therefore, we consider building a controller that specializes in dealing with obstacles without taking path planning into account.

2. Method

2.1. Deep Reinforcement Learning

Recent years have witnessed widespread application of model-free deep reinforcement learning (DRL) in the field of robotics, offering creative ideas and solutions for obstacle avoidance technology. Its most distinguishing feature is that it enables the robot (program) to learn autonomously through interaction with the environment, which is quite similar to the human growth and learning process. The concept of “trial and error” as the core mechanism of reinforcement learning, with “reward” and “penalty” as the basic means of learning, was proposed by Waltz and Fu in their control theory as early as 1965 [7]. Deep reinforcement learning has made tremendous progress in recent years. The DQN (Deep Q Network) algorithm, proposed and refined by Mnih et al. in 2013 and 2015 [8, 9], provided a powerful “weapon” for reinforcement learning. This deep reinforcement learning algorithm is employed in several components of our controller.

However, the “trial and error” learning approach requires a large amount of training to obtain an effective model, and hardware systems such as robots cannot afford massive “trial and error” training for obstacle avoidance in terms of time or cost. As a result, training in a simulation environment emerges as the better option. Many recent studies have demonstrated that experience gained from training in a simulation environment is also applicable to a real environment [10, 11]. It is critical to minimize the gap between the simulation environment and reality during such training. Neunert analyzed the potential causes of the gap in [12], Li addressed some of them through system identification [13], and the DeepMind team enhanced fidelity by estimating model parameters [11]. Enlightened by these efforts, we rewrote the CarRacing-v0 environment based on the OpenAI Gym simulation environment [14] to enable the training of our selector.

Although training without any human guidance can yield a better model [15], training and learning under human guidance are more efficient. Furthermore, existing deep learning algorithms and neural networks are less effective when learning multiple strategies simultaneously [16]. Training subaction strategies and selector strategies separately [17, 18] was a common approach in Behavior-Based Robotics [19], making model training more efficient.

Based on the challenges discussed above and the efforts of previous researchers, we propose an obstacle avoidance controller with a hierarchical structure in this paper. It takes only the information observable from the robot’s current viewing angle as input, without considering path planning, which is closer to the way people actually move through unfamiliar surroundings. To avoid obstacles, we decomposed an obstacle avoidance action into multiple subactions and trained a DQN-based action selector to select and run the appropriate one among Turn Left, Turn Right, Gas, and Braking. We deployed this method on a wheeled robot, as illustrated in Figure 2, and carried out more than 200 experiments in a real physical environment with artificially arranged obstacles, with the success rate of obstacle avoidance reaching 90%, demonstrating the effectiveness of the controller.

2.2. Overview

Unlike traditional path planning methods, we adopted a controller to avoid obstacles. Figure 3 presents the general framework of the controller. The action selector, together with the four action runners described below, forms the core of the whole framework. Built on the DQN algorithm, the action selector made decisions according to the information in the environment and the actions ongoing in the current cycle. In each running cycle, the action selector sent an action decision to the action runner for execution.

We decomposed the obstacle avoidance action into four subactions: Turn Right, Turn Left, Gas, and Braking. Among them, the Turn Right and Turn Left subactions were also based on the DQN algorithm and determined the deflection angle in their direction from the decision of a neural network. The Gas and Braking subactions, as single actions, required no model and executed the selector’s command directly upon receipt. The action selector and each action were trained separately and in parallel in our rewritten simulation environment, which made the design of complex loss functions unnecessary and made parameter tuning and debugging more convenient.
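To make the hierarchy concrete, the following sketch outlines one possible running cycle; the selector object, turn policies, and robot interface are hypothetical names used only to illustrate how the selector dispatches to the runners.

```python
# Illustrative control loop for the hierarchical controller (names are hypothetical).
from enum import Enum

class Action(Enum):
    TURN_LEFT = 0
    TURN_RIGHT = 1
    GAS = 2
    BRAKE = 3

def control_cycle(selector, turn_left_policy, turn_right_policy, robot, state):
    """One running cycle: the selector picks a subaction, the runner executes it."""
    action = selector.select(state)                   # DQN-based action selector
    if action == Action.TURN_LEFT:
        angle = turn_left_policy.deflection(state)    # DQN-based runner returns an angle
        robot.turn(-angle)
    elif action == Action.TURN_RIGHT:
        angle = turn_right_policy.deflection(state)
        robot.turn(+angle)
    elif action == Action.GAS:
        robot.gas()                                   # single action, no model needed
    else:
        robot.brake()                                 # single action, no model needed
    return action
```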

2.3. Simulation Environment

In the CarRacing environment of OpenAI Gym [14] (Figure 4(a)), the designed task objective is for the racing car Agent to pass through the entire track as quickly as possible without leaving the track or crossing the boundary. The Gym simulation models the ground friction, the racing car’s wheels, the ABS sensors, etc. On this basis, we improved the CarRacing simulation environment; we call the result Car2D.

Inspired by [11–13], we readjusted the physical model of CarRacing (friction, mass of the wheeled robot, etc.) and added random noise to bring it closer to the actual conditions of our physical robot (Figure 4(b)) and the test site. The original road generation program is unsuitable for training a single obstacle avoidance task, as it generates a complete circular road instead of obstacles. We therefore rewrote the road generation part and added an obstacle generation function, with which the shape, number, and location of obstacles could be generated with random noise (Figures 4(c) and 4(d)). Randomization in Car2D enhances the robustness of the training object and narrows the gap between the simulation environment and the real environment, as has been demonstrated by studies [10, 20, 21].
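As an illustration of this kind of environment customization, the sketch below shows how a Gym environment with randomized obstacles and observation noise could be structured. Car2DEnv, its parameters, and its internals are hypothetical placeholders, not the published implementation.

```python
# Sketch of a CarRacing-style environment with randomized obstacles (hypothetical).
import gym
import numpy as np
from gym import spaces

class Car2DEnv(gym.Env):
    """Straight runway with randomly generated obstacles and noisy observations."""

    def __init__(self, max_obstacles=3, noise_std=0.02):
        super().__init__()
        self.action_space = spaces.Discrete(4)  # Turn Left / Turn Right / Gas / Brake
        self.observation_space = spaces.Box(0, 255, shape=(96, 96, 3), dtype=np.uint8)
        self.max_obstacles = max_obstacles
        self.noise_std = noise_std

    def reset(self):
        # Randomize number, size, and position of obstacles on a short straight runway.
        n = np.random.randint(1, self.max_obstacles + 1)
        self.obstacles = [
            {"x": np.random.uniform(-1, 1), "y": np.random.uniform(5, 40),
             "radius": np.random.uniform(0.2, 1.0)}
            for _ in range(n)
        ]
        return self._observe()

    def step(self, action):
        # The physics update and collision / off-track / success checks would go here.
        obs, reward, done = self._observe(), 0.0, False
        return obs, reward, done, {}

    def _observe(self):
        frame = self._render_front_view()        # limited viewing angle in front of the car
        noise = np.random.normal(0, self.noise_std * 255, frame.shape)
        return np.clip(frame + noise, 0, 255).astype(np.uint8)

    def _render_front_view(self):
        return np.zeros((96, 96, 3), dtype=np.float32)  # placeholder rendering
```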

The behavior of the robot’s four subactions in the simulation environment was designed based on the physical model of the real environment. Each time the Gas action command was run, the Agent would provide the car with an equal amount of forward power, reaching the maximum speed when the action lasted for several consecutive cycles. The power would gradually be lost to ground friction once the action stopped. Running the Braking action command caused the car to lose power, with the braking distance determined by the car’s mass and the details of the physical model.

The obstacle avoidance task objective required no long runway or varied curves; thus, only a limited straight runway was generated in Car2D. When the set conditions were met, the environment sent a stop signal to end the current training episode and restart a new one (Figure 5). In each cycle, Car2D presented an RGB image generated according to the limited viewing angle in front of the Agent and added random noise before providing it to the controller.

2.4. Action Selector

As the core component of the controller, the action selector was implemented based on the DQN algorithm. Compared with the Q-Learning algorithm [22], this algorithm approximates the value function with a convolutional neural network and leverages the experience replay mechanism to improve the efficiency of the neural network. In addition, the DQN algorithm introduces a neural network called the Main Net to generate the current Q value and a Target Net, with the same structure as the Main Net, to generate the target Q value. The parameters of the Main Net are copied to the Target Net every certain number of iterations. By reducing the correlation between the current Q value and the target Q value, the stability of the algorithm is improved. The pseudocode of the DQN algorithm is provided in Algorithm 1 [23]. For a detailed description of DQN, please refer to [8, 9, 24]; only the details relevant to this model are presented here.

Input: Pixels and reward
Output: Q action-value function
Initialization:
Initialize replay memory D to capacity N
Initialize the Q network (action-value function) with random weights θ
Initialize the target network (action-value function) with weights θ⁻ = θ
1: For episode = 1, M do
2:  Initialize sequence s_1 = {x_1} and preprocessed sequence φ_1 = φ(s_1)
3:  For t = 1, T do
4:   Following the ε-greedy policy, select a random action a_t with probability ε; otherwise select a_t = argmax_a Q(φ(s_t), a; θ)
5:   Run action a_t in the emulator and observe the reward r_t and image x_{t+1}
6:   Set s_{t+1} = (s_t, a_t, x_{t+1}) and preprocess φ_{t+1} = φ(s_{t+1})
7:   Store transition (φ_t, a_t, r_t, φ_{t+1}) in D
8:   Sample a random minibatch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
9:   Set y_j = r_j if the episode terminates at step j + 1; otherwise y_j = r_j + γ max_{a′} Q(φ_{j+1}, a′; θ⁻)
10:   Calculate the loss (y_j − Q(φ_j, a_j; θ))² and perform a gradient descent step on it with respect to θ
11:   Train and update the weights of Q; every C steps copy θ to the target network weights θ⁻
12:  End For
13: End For
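For illustration, a minimal PyTorch sketch of the update in steps 8–11 is given below; the network and buffer names are hypothetical, and the hyperparameter values are placeholders rather than the ones used in our training.

```python
# Minimal DQN update step (illustrative; names and hyperparameters are placeholders).
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

GAMMA = 0.99               # discount factor (placeholder value)
BATCH_SIZE = 32
TARGET_SYNC_EVERY = 1000   # copy Main Net weights to Target Net every C steps

replay_buffer = deque(maxlen=100_000)  # stores (state, action, reward, next_state, done)

def dqn_update(main_net: nn.Module, target_net: nn.Module,
               optimizer: optim.Optimizer, step: int) -> None:
    if len(replay_buffer) < BATCH_SIZE:
        return
    states, actions, rewards, next_states, dones = zip(*random.sample(replay_buffer, BATCH_SIZE))
    states = torch.as_tensor(np.stack(states), dtype=torch.float32)
    next_states = torch.as_tensor(np.stack(next_states), dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # Current Q values from the Main Net for the actions that were actually taken.
    q_values = main_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target Q values from the Target Net; no gradient flows through them.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + GAMMA * next_q * (1.0 - dones)

    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Periodically synchronize the Target Net with the Main Net.
    if step % TARGET_SYNC_EVERY == 0:
        target_net.load_state_dict(main_net.state_dict())
```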

In each running cycle, the action selector made its decision based on the environment observed by the Agent and the actions performed in the previous running cycle. Using the RGB image from the simulation environment directly as the input of the selector would make it difficult to apply the model to the real physical environment. To keep the model input identical in the real physical environment and the simulation environment, we processed the RGB image from the camera in front of the robot into a matrix describing the positions of the obstacles. The subactions of Turn Left and Turn Right were mutually exclusive; thus, only 12 states were available in the action space of the action selector. We used a one-dimensional vector of length 3 to describe the actions run in the previous cycle. The state of each running cycle therefore combined the obstacle position matrix with this previous-action vector.
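For concreteness, writing the obstacle position matrix as $O_t$ and the previous-action vector as $v_{t-1}$ (placeholder symbols, not the original notation), the state at cycle $t$ can be expressed as

$$s_t = \left(O_t,\; v_{t-1}\right), \qquad v_{t-1} \in \mathbb{R}^{3}.$$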

Table 1 presents the symbols and reward terms. The reward function gave the reward obtained by the Agent for the action run in the current state and was composed of the following terms.

The distance reward encouraged the Agent to move forward as much as possible; the obstacle collision penalty was imposed for hitting an obstacle; the off-the-track penalty was imposed for crossing the boundary; the time penalty encouraged the Agent to avoid obstacles as quickly as possible and not to stop in place; the obstacle avoidance reward was granted for successful obstacle avoidance; and the speed reward granted an additional reward for the Gas action, under the premise of successful obstacle avoidance, encouraging the Agent to move as fast as possible. It is important to note that these reward and penalty mechanisms were not all imposed on the Agent at the same time. We adopted the idea of curriculum learning (CL) [25], in which training started with simple, single, and small obstacles, and the current episode was not stopped even after hitting an obstacle. After the Agent was capable of completing the small task objectives, the task difficulty was increased, for example by increasing the area and number of obstacles or by randomizing the locations of obstacles.
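As a hedged sketch of how such a staged reward could be assembled (the numeric weights and the curriculum handling below are placeholders, not the values from Table 1):

```python
# Illustrative reward shaping with curriculum stages (all numeric values are placeholders).
def compute_reward(info, stage):
    """Combine the reward terms described above for one step of a Car2D-style episode."""
    r = 0.0
    r += 0.1 * info["distance_gained"]   # distance reward: move forward
    r -= 0.05                            # time penalty: do not stop in place
    if info["hit_obstacle"]:
        r -= 10.0                        # obstacle collision penalty
    if info["off_track"]:
        r -= 10.0                        # off-the-track penalty
    if info["obstacle_avoided"]:
        r += 20.0                        # obstacle avoidance reward
        r += 1.0 * info["speed"]         # speed reward, only after successful avoidance
    # Curriculum learning: early stages soften the collision penalty and keep the
    # episode running so the Agent can keep practising on small, simple obstacles.
    if stage == 0 and info["hit_obstacle"]:
        r += 5.0
    return r
```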

The action-value function of the current state, denoted $Q_{\pi}(s_t, a_t)$, evaluated the current action while taking future states into account: it gave the expectation of the sum of the rewards obtained by the Agent after running the action $a_t$ in the state $s_t$ until the end of the current episode. For the state $s_t$ and the action $a_t$ run under a policy $\pi$, there exists

$$Q_{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t, a_t\right],$$

where $\mathbb{E}_{\pi}$ represents the expectation and the discount factor $\gamma \in [0, 1]$ was used to measure the relative influence of future rewards and current rewards on the value. The closer $\gamma$ was to 1, the greater the influence of future rewards; a fixed $\gamma$ was taken for training. The optimal action-value function was denoted $Q^{*}(s_t, a_t)$, and there exists

$$Q^{*}(s_t, a_t) = \max_{\pi} Q_{\pi}(s_t, a_t);$$

then the state transition (update) equation can be obtained:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\right],$$

where $\alpha$ is the decaying learning rate of the update. Taking $y_t$ as the label, the loss function for network training can be obtained:

$$L(\theta) = \mathbb{E}\left[\left(y_t - Q(s_t, a_t; \theta)\right)^{2}\right],$$

where $y_t$ can be expressed as follows:

$$y_t = r_t + \gamma \max_{a'} Q\left(s_{t+1}, a'; \theta^{-}\right).$$

Figure 6 shows the structure of the deep convolutional neural network used in our Main Net; the Target Net had an identical structure with different parameters. Unlike the structure of typical DQN networks, ours inserted a Dropout layer at the last fully connected layer, which was enabled only during training and not when the model was in use.
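A minimal PyTorch sketch of such a network is shown below; the layer sizes are placeholders (Figure 6 gives the actual structure), and only the placement of the Dropout layer follows the description above.

```python
# Illustrative Main Net with Dropout at the last fully connected layer (sizes are placeholders).
import torch.nn as nn

class MainNet(nn.Module):
    def __init__(self, n_actions=12, dropout_p=0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.LazyLinear(256), nn.ReLU(),
            nn.Dropout(p=dropout_p),      # enabled by net.train(), disabled by net.eval()
            nn.Linear(256, n_actions),    # one Q value per selector action
        )

    def forward(self, x):
        return self.head(self.features(x))
```

Calling net.train() during training enables the Dropout layer, and net.eval() at deployment disables it, matching the behavior described above.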

2.5. Action Runner

In the action runner, only the two actions “Turn Right” and “Turn Left,” which provide the deflection angle of the robot, required training. Both were also based on the DQN algorithm, with the same input state space as the action selector, and are therefore not elaborated here. Their action space, however, had a total of 36 states, obtained by discretizing the deflection angle in the corresponding direction with a fixed step length. Experimental experience showed that an excessive deflection angle would cause the robot to move in the opposite direction and eventually leave the runway.
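For illustration only, a 36-value deflection-angle table could be built as follows; MAX_DEFLECTION_DEG is a hypothetical bound, not the value used in our experiments.

```python
# Hypothetical discretization of the deflection angle into 36 values.
import numpy as np

N_ANGLE_BINS = 36
MAX_DEFLECTION_DEG = 90.0   # placeholder upper bound, not the paper's value

# 36 evenly spaced deflection angles; the turn-policy DQN outputs an index into this table.
angle_table = np.linspace(MAX_DEFLECTION_DEG / N_ANGLE_BINS, MAX_DEFLECTION_DEG, N_ANGLE_BINS)

def deflection_from_q(q_values):
    """Map the turn policy's 36 Q values to a concrete deflection angle."""
    return float(angle_table[int(np.argmax(q_values))])
```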

3. Results and Discussion

This section presents the experimental results in the real environment and an analysis of the drawbacks of the model.

(a) Experiment Setting

We used a self-built wheeled robot, shown in Figure 7, to validate the model. The robot used an STM32F4 chip for the driver board, a Raspberry Pi 4B as the upper computer, and Ubuntu 18.04 as the upper computer’s operating system. An infrared camera and two independent infrared distance meters were used to collect environmental characteristics. The matrix describing the obstacle positions was constructed mainly through obstacle identification and ranging.

Tests showed that the running frequency configured during training and verification in the simulation environment directly affected the success rate of obstacle avoidance. Figure 8 shows the reward curves of a model trained at 37 Hz for 5,600 rounds and then validated 100 times at 90 Hz and at 37 Hz.

It can be seen from the curves that the same model accomplished the obstacle avoidance task in most cases when running at 90 Hz but failed at 37 Hz. This means that the model must run enough times per second for the robot to produce a coherent obstacle avoidance action. We therefore set the training and testing frequency in the simulation environment to 90 Hz, ensuring training efficiency while preventing the actions of the Agent from oscillating. When tested in the real environment, the controller’s actions would oscillate if it ran too fast. Therefore, the front-end hardware on the wheeled robot that provided environmental information ran at a frequency of 55 Hz to monitor obstacles in real time, while the running frequency of the controller was set to 30 Hz to allow smoother running of the robot and avoid oscillations.
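The split between the two rates can be pictured with a simple loop like the sketch below; this is a generic illustration, not the robot’s actual C++ control code, and the sensors, controller, and robot objects are hypothetical.

```python
# Generic fixed-rate loop: sensors polled at 55 Hz, controller stepped at 30 Hz (illustrative).
import time

SENSOR_HZ = 55
CONTROL_HZ = 30

def run(controller, sensors, robot):
    next_sensor = next_control = time.monotonic()
    state = None
    while True:
        now = time.monotonic()
        if now >= next_sensor:
            state = sensors.read()               # update obstacle information
            next_sensor += 1.0 / SENSOR_HZ
        if now >= next_control and state is not None:
            robot.apply(controller.step(state))  # one selector + runner cycle
            next_control += 1.0 / CONTROL_HZ
        time.sleep(0.001)                        # avoid busy-waiting
```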

As the deployed model could not reach its full performance due to limited computing power, we quantized the model to improve its running efficiency, with reference to some methods introduced or mentioned in [26–28]. During the tests, the robot started with a fixed initial speed, and the obstacle avoidance controller was activated to steer the robot around an obstacle once one was identified.
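As one representative example of this kind of optimization (not necessarily the exact method from [26–28]), PyTorch’s dynamic quantization converts the fully connected layers to int8:

```python
# Example: post-training dynamic quantization of the linear layers (representative technique).
import torch

# trained_main_net: the float32 selector network trained in simulation (hypothetical handle).
quantized_net = torch.quantization.quantize_dynamic(
    trained_main_net,
    {torch.nn.Linear},   # quantize only the fully connected layers
    dtype=torch.qint8,
)

# The quantized model replaces the original at inference time on the robot.
with torch.no_grad():
    q_values = quantized_net(obs_tensor)   # obs_tensor: a preprocessed observation (hypothetical)
```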

The models used in the simulation environment and during training were written in Python, but most of the programs on the robot were written in C++ and C to improve running efficiency.

(b) Experimental Results

To validate the effectiveness of the controller, we conducted the test in three cases: a single obstacle, multiple obstacles (Figure 9), and irregular obstacles (Figure 10), with a total of 75 different obstacle positions. In more than 200 tests, successful obstacle avoidance was observed in most cases for multiple or irregular obstacles, with a success rate of up to 90%.

Unlike our simulation environment, some robot tasks in the real environment, such as those on wild meadows or deserts, have no road limitations. To test the generalization ability of the controller, we removed the road restrictions from the real environment and conducted the test 100 times; the controller again completed the obstacle avoidance task, with a success rate reaching 91%. However, there were several cases in which the obstacle avoidance behavior failed to meet expectations. When the gap between two obstacles was small, the controller chose a longer path around them instead of passing through the gap (Figure 11, top half), similar to its behavior in the simulation environment with small or few obstacles (Figure 11, bottom half). It seemed that, when conditions permitted, the controller made the fullest use of the drivable space and stayed as far away from obstacles as possible. We set 20 different ratios of obstacle width to road width in the simulation environment and tested each 100 times. Figure 12 shows the resulting curve of the obstacle avoidance success rate.

It can be seen that the success rate of obstacle avoidance decreased as the ratio of obstacle width to road width increased. Obstacle avoidance failed when the Agent could not pass through the gap between obstacles, while the success rate remained at 90%. This validated our idea.

4. Conclusion

This paper proposed a hierarchical controller for robot obstacle avoidance during motion, which enabled the robot to avoid obstacles flexibly without path planning. By decomposing the obstacle avoidance control task into the subactions Gas, Turn Left, Turn Right, and Braking, one action strategy could be decomposed into multiple strategies for separate training. The idea of curriculum learning (CL) was used to improve training efficiency. Using only the limited information in front of the robot, our controller brought us a step closer to the way people actually deal with unknown environments.

The controller was trained in a simulation environment before being deployed on a wheeled robot, producing satisfactory results consistent with its performance in the simulation environment. The wheeled robot deployed with the controller achieved a success rate of up to 90% in more than 200 obstacle avoidance tests. The controller also performed well in a wild-like environment without road restrictions.

The controller was less effective in the case of dense obstacles with small gaps, and dynamic obstacles may require further adjustments at run time to yield better results. In such cases, traditional obstacle avoidance methods may outperform our controller when complete environmental information is available. We will continue to address these problems and improve the controller.

Data Availability

All data, models, and code generated or used during the study appear in the submitted article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The work described in this paper was fully supported by a grant from the Student’s Platform for Innovation and Entrepreneurship Training Program of ANHUI UNIVERSITY OF TECHNOLOGY.