Abstract
The pathplanning approach plays an important role in determining how long the mobile robots can travel. To solve the pathplanning problem of mobile robots in an unknown environment, a potential and dynamic Qlearning (PDQL) approach is proposed, which combines Qlearning with the artificial potential field and dynamic reward function to generate a feasible path. The proposed algorithm has a significant improvement in computing time and convergence speed compared to its classical counterpart. Experiments undertaken on simulated maps confirm that the PDQL when used for the pathplanning problem of mobile robots in an unknown environment outperforms the stateoftheart algorithms with respect to two metrics: path length and turning angle. The simulation results show the effectiveness and practicality of the proposal for mobile robot path planning.
1. Introduction
With the development of artificial intelligence, mobile robotics are widely used in various fields: medical robots that assist patients or people with disabilities [1, 2]; teaching robots that provide flexible platforms for educational research [3]; swarm intelligent robots capable of military applications [4, 5]; and cleaning robots that serve people’s productive lives [6]. These applications are all dependent on the path planning of mobile robots (MRs). Path planning is the calculation of a feasible path from a start node to a goal node in a map or grid without colliding with obstacles on the way [7]. It requires MR to be equipped with sensors, onboard computers, and motion systems to plan and move in partially or completely unknown environments.
Path planning refers to the problem of finding a route (path) between a start node and a goal node on a map (grid) [8]. Many wellestablished algorithms have been proposed to solve the pathplanning problems for MR. The former approach includes searchbased planning algorithms: Dijkstra [9], [10], [11], etc. Due to their universality and ease of implementation, these algorithms achieve significant results in searching for paths, but the search time exponentially grows with the resolution size and search depth of the map; samplingbased planning algorithms: rapidly randomexploring tree (RRT) [12], probabilistic roadmap (PRM) [13], etc. The advantage of these algorithms is that they are effective and fast in highdimensional path search, the disadvantage is that these algorithms usually sample the environment for random search to find paths, the results are often not optimal, and it is difficult to find a feasible path in environments with narrow passages. Artificial potential field (APF) [14] and BUG algorithm [15] are widely used in path planning for local obstacle avoidance. Although these algorithms are computationally simple for dynamic obstacle environments and fast in path search, the optimal path is often not obtained, and the search path may be erroneous when the obstacles are large in a complex environment. Another type of algorithm is the intelligent algorithm: it is an algorithm that people model by natureinspired or human mind to imitate solving problems [16]. Typical algorithms are particle swarm optimization (PSO) [17], ant colony optimization (ACO) [18], neural network (NN) [19], and other algorithms. Intelligent algorithms play an effective role in solving complex dynamic environments, but there are common problems such as slow computation speed, poor stability, poor realtime performance, and easy to fall into local optimality.
All of the abovementioned algorithms have their own advantages and disadvantages. Most of the studies treat MR as a mass point to simplify the pathplanning problem and ignore dynamic uncertainties during motion [20], resulting in a searched path too close to the obstacle.
1.1. Related Work
Recently, due to the development of artificial intelligence, Qlearning, the most commonly used algorithm in reinforcement learning, has been increasingly used in the field of path planning. Reinforcement learning is a class of unsupervised algorithms [21] that emphasize the learning process of the MR interacting with its environment. The goal of the MR is to evaluate the action to be chosen in each state of its environment by learning a stateaction value function for that state.
A common approach is to improve the initialization of Qlearning by combining it with other algorithms. In [22, 23], algorithms improved the initialization of Qlearning with a bionic algorithm [23] to accelerate the learning speed and significantly reduce the computation time of the mobile robot; to solve the problem of excessive dimensionality of Qtable, the studies [24–26] combine neural network with Qlearning, which converge to the optimal policy faster and reach the goal point with less distance. Zhang et al. [27] proposed a selfadaptive reinforcementexploration Qlearning (SAREQ) to address the problems of many repetitions and uneven exploration of classical Qlearning. Simulation experiments show that the algorithm has significant advantages over classical Qleaning in terms of the average number of turns, success rate, and the number of shortest planned routes. In order to solve the problems of slow convergence speed and long planning paths when robots use the Qlearning algorithm to plan paths in unknown environments, Zhao et al. [28] proposed the experiencememory Qlearning (EMQL) algorithm, which improves the autonomous learning capability of mobile robots by continuously updating the shortest distance from the current state node to the starting point. The comparison results from the planning time, the number of iterations, and path length show that the algorithm has obvious advantages in terms of convergence speed and optimization capability.
To overcome the above limitations, this study proposes an improved Qlearning [29] to solve the pathplanning problem of MR and refers to the proposed algorithm as the potential and dynamic Qlearning (PDQL). The proposed algorithm exploits the superiorities and minimizes the limitations of Qlearning in path planning for MR. The proposed approach enables the MR to escape from deadend paths blocked by obstacles. The effectiveness, superiority, and rapidity of the PDQL algorithm are demonstrated through simulation and comparison experiments.
1.2. Original Contributions
The PDQL algorithm proposed in this study is applied to solve the path planning for MRs in an unknown environment with the following main contributions:(1)The PDQL is a novel proposal that can find the feasible path for MR in an unknown environment, outperforming motionplanning proposals based on the stateoftheart pathplanning algorithms.(2)This study hybridizes Qlearning with APF to initialize Qtable, avoiding the random movement around the start point at the beginning of the algorithm.(3)The proposed algorithm changes the constant reward into a dynamic reward, speeds up the convergence of the algorithm, and avoids the algorithm from falling into a deadend path blocked by obstacles.(4)By comparing with other algorithms, it is proved that the proposed algorithm in this study can solve the MR pathplanning problems in an unknown environment considering the path length, collision avoidance, and smoothness.
The MR path planning is a broad research area involving planning, obstacle avoidance, and algorithms. To better describe the problem and prove the effectiveness of the proposed PDQL algorithm, the remainder of this study is organized as follows: first, the pathplanning problem is described in Section 2; then the proposed PDQL algorithm is described in Section 3; the comparative experimental and simulation results f are presented in Section 4; and finally, conclusions are drawn and future work is discussed in Section 5.
2. Preliminaries
In this section, this study first describes the MR pathplanning problem and then outlines the classical Qlearning.
2.1. Problem Formulation
The task studied in this study is the pathplanning problem of a mobile robot in an unknown environment. The information in the environment (obstacle location, shape, and orientation) is completely unknown before the mobile robot performs a pathplanning task, and only the starting and goal points are known. Mobile robots need to plan the shortest path length from the starting point to the goal point without colliding with obstacles in the environment [30].
This study studies path planning on an eightway connected twodimensional grid of nodes. The pathplanning problem can be defined in the following forms: for state space (MR workspace); for obstacle space (MR unreachable space); for free space (MR reachable space); for initial state (MR starting position); and for goal state (MR goal position). The purpose of MR is to calculate a path that does not collide with obstacles.
The MR moves from one grid to one of the eight adjacent grids , which is defined as a cost function as , as shown in equation (1).
The problem studied in this study is to plan a path that does not collide with obstacles so that the MR reaches the end point from the starting point with a shorter path length.
To illustrate the planning problem using the PDQL in this study, the pathplanning process is shown in Figure 1. The MR is defined as a circle of radius , and the working environment of MR is divided into a grid map with sides of . In a Cartesian coordinate, the MR aims to plan a feasible path from the start S (1, 1) to the goal G (7, 7) without collision with obstacles. For better obstacle avoidance, this study defines (Figure 1(a)) the following: when there are no obstacles around, the MR can move to the adjacent eight directions (north, northeast, east, southeast, south, southwest, west, and northwest). The MR cannot diagonally cross the obstacles’ barrier. When the obstacles are located around the MR, the movement directions of the MR are shown in Figure 1(b). Figure 1(c) shows the path planned by the proposed PDQL algorithm.
2.2. QLearning Algorithm
Qlearning [29] is a valuebased reinforcement learning algorithm proposed by Watkins in 1989. The MR path planning can be expressed as follows: at each discrete time series (offline strategic temporal difference) [31]. The framework of the reinforcement learning is shown in Figure 2.
The MR can receive a state (coordinates of the MR in the current state) from the environment and interact with the environment through action (direction and distance of MR movement). The environment will provide a new state (coordinates of the MR in the next state) and also give an immediate return , where is the set of states in the maritime environment, and is the set of actions available at the state . The MR interacts with the environment through continuous feedback, generating more data (states and returns) and using the new data to further improve its own behavior. Qtable is the expectation that can gain by taking action in state , which is updated by the equation (2).
The above function can also be written as equations (3) and (4):where () is a learning rate parameter, and () is a discount rate parameter.
After the convergence of the Qlearning, the mobile robot can obtain a convergent Qtable to guide it on how to obtain the maximum cumulative reward and eventually learn the optimal action (optimal strategy) to complete the pathplanning task. Watkins et al. [29] show that when the states and actions are finite, the Qvalues can be represented as a Qtable. Qlearning converges to the optimal policy under the condition that the mobile robot always selects an action after convergence when each state and action has an infinite number of visits to the Qtable, and the maximum Qvalues for each state are obtained as shown in equation (5)as follows:
The pseudocode of the classical Qlearning algorithm is summarized as in Algorithm 1.

In this study, the Qlearning is applied to the MR path planning for the following reasons:(1)Reinforcement learning has good interaction with the environment without the need for positive or negative labels [32]. The MR gains current knowledge by exploring and learning from the environment, and improves their operational strategies to adapt to the environment.(2)The Qlearning algorithm is highly exploratory and is an iterative trialanderror process, where multiple attempts are made for each possible pair of state actions to obtain the optimal policy as long as time allows.(3)Qlearning uses offpolicy [33], and the selection of actions according to the target strategy can be used to control the distance between the MR and the obstacle.
Although Qlearning has many advantages, there are several shortcomings as follows:(1)In the initial stage, Qtable is initialized to zero or normaldistributed numbers, the motion of MR is completely random and easy to collide with obstacles, resulting in the loss of MR, and the convergence speed is slow and timeconsuming.(2)Qlearning requires a certain memory to store the tracking Qtable. When MR has states and actions, the dimension of the constituted Q table is , and by choosing the maximum Q value to determine the next move’s direction, a total of times need to be compared, with the more complex state space and actions, which will exponentially increase the amount of computation, resulting in long computation time [30].(3)It will lead to local optimum when the environment is complex, and it is easy to fall into a deadend path blocked by the obstacles.
According to the superior characteristics of Qlearning and avoiding its shortcomings, this study proposes a novel algorithm PDQL based on Qlearning to solve the pathplanning problem for MR in an unknown environment. To provide a prior knowledge for MR, this algorithm combines APF with Qlearning to initialize Qtable; the dynamic reward is set to speed up the convergence of the Qlearning. The path length, computing time, and turning angle are improved compared to classical Qlearning.
3. The Proposed PDQL Algorithm
In this section, the proposed PDQL algorithm is elaborated in five stages for solving the pathplanning problem for MR in an unknown environment. To better describe the studied content, the MR pathplanning problem discussed in this study has the following premises:(1)The MR and the obstacle environment are threedimensional objects in practice. To simplify the problem, this study ignores the height of the MR and obstacles and treats them as objects in two dimensions.(2)The location, shape, and size of obstacles in the environment are unknown to MR, and the MR only knows the start and goal positions.(3)The task of the MR is to reach the target from the start with the shortest possible path length.
3.1. QTable Initialization
The Qlearning algorithm usually initializes the Qtable as zeros or normally distributed random numbers, and the absence of prior knowledge of the environment leads the MR to randomly choose actions in the exploration phase, resulting in slow convergence and long computation time of the algorithm. To optimize this problem, this study uses the APF combined with the Qlearning algorithm to optimize the initial Qtable for path planning. The reasons are as follows: it is easy to implement in grid maps, provides prior knowledge of the environment for MR, and speeds up the computation and convergence.
The information in the environment (obstacle shape, size, and location coordinates) is unknown for MR, so Coulomb’s law is used to model the APF for the grid environment. The repulsive force from the obstacle is not calculated, and only the gravitational force is generated by the starting point. The equations are shown in equations (6) and (7)as follows:where is the gravitational field producing gravitational force in state ; is the total potential energy of state ; is the Euclidean distance between state and the center of the target point; and is the scale factor.
The grid of the MR workspace is modeled according to the above method, and in order to make the potential energy range in space between (0, 1) and the highest potential energy at the goal point and the lowest potential energy in the obstacle region, the vector field is normalized using equation (8).where is the potential energy in state ; is the highest potential energy in state ; and through equation (8), a potential energy field is constructed for each grid in the known environment, with a grid potential energy of 1 at the target point and a potential energy of 0 at the obstacle grid, forming a monotonically increasing potential energy field from the starting point to the target point.
The Qtable is initialized using equation (8) as shown in equation (9).
By applying APF to Qtable initialization, the MR is provided with a prior knowledge of the known environment, which avoids the disadvantage of slow convergence caused by the random motion in the exploration phase. This section lays the foundation for defining the action and reward function in the next section.
3.2. Action Selection
In the grid map, the MR uses continuous learning based on the environmental information, and gets rewarded and punished by interacting with the environment. The MR gets rewarded when moving to the white grid, penalized when reaching the color grid, and converged to the optimal value by updating the formula through Qlearning. To avoid obstacles and reach the target point with the shortest path length, eight directions of movement are defined for the MR (north, northeast, east, southeast, south, southwest, west, and northwest). When the radius of MR is defined as , the corresponding movement and movement distance are as follows: action 1: action 2: action 3: action 4: action 5: action 6: action 7: action 8:
3.3. Reward Function
The reward function is used to determine the value of the behavior, and the MR interacts with the environment according to the reward function to adjust the action strategy by the reward value [24]. The right reward function helps to reinforce the desired behavior and punish the improper behavior. In previous reinforcementlearning path planning, the reward value is usually a static constant, which leads to its random search in the environment, resulting in increased convergence time. To solve this problem, a dynamic reward function is proposed in the PDQL algorithm, which provides the goal point and the current position as prior knowledge to the MR. When the MR is closer to the goal point, the larger the reward obtained, prompting it to move in the direction of the goal point and speeding up the convergence. To reduce the computation, this study uses Manhattan distance for calculation.
The reward functions are shown from equations (10) to (14).where is the static reward; is the dynamic reward; is the Manhattan distance from the target point in the state ; is the Manhattan distance from the target point in the next state ; is the coordinate in the state ; is the coordinate in the state ; and is the coordinate in the goal point.
3.4. Convergence for PDQL
To demonstrate the convergence of the proposed algorithm applied to the MR pathplanning task, assuming that is the estimate of for the nth update of the Qvalue, the error of is defined as shown in equation (15).
After n + 1th update of the Qvalue, the error of is shown as in equation (16).
From the proof of equation (16), the (n + 1)th error is smaller than the nth error . As the PDQL algorithm iterates, the Qvalue converges to a definite value. The MR will move from the starting point to the goal point with the optimal action strategy.
3.5. PDQL for Path Planning
In this study, the proposed PDQL algorithm is applied to the pathplanning problem of MR in an unknown environment, and its basic ideas are as follows: the APF is used to initialize Qtable, to provide the prior knowledge in the environment to the MR. The dynamic reward is combined to optimize the reward function, to induce the MR to move toward the goal point, and to control the distance between MR and obstacles.
4. Simulation Results
It is difficult to directly apply the PDQL algorithm to the pathplanning problem of MR, and it requires repeated training to obtain the optimal action strategy, therefore, this study demonstrates the performance and generality of the PDQL through several numerical simulations in this section.
4.1. Environments
In this study, the environment of the MR working is simplified by using a grid map. A circle of radius is used to represent the MR, a square of side length is used to represent each grid node in the working space, white grid indicates the feasible area, and black grid indicates the obstacle (the area where the MR is forbidden to reach). The center of each grid is marked by a Cartesian coordinate, with the x axis indicating the horizontal direction and the y axis indicating the vertical direction. Therefore, the first and second dimensions of the grid map represent the x coordinate and y coordinate of the grid in the map, respectively.
The simulation maps are shown in Figure 3 from M01 to M09, simulating the MR in different environments with locations. From simple to complex environments, numerous challenges in the field of path planning are covered: problems such as planning the short path over long distances and trap points due to local minima.
4.2. Performance Metrics
To test the effectiveness, safety, and speed of the proposed PDQL algorithm in a comprehensive and concrete way, the MR path is evaluated in three performance metrics: path length equation (17), turning angle equation (21), and computing time.
4.2.1. Path Length (m)
The path length is an important indicator to test the path performance. When the MR tracks the path, the shorter the distance traveled, the shorter time required to complete the task, and the less energy consumed. Therefore, this study defines the path length of the MR from the starting position to the goal position as shown in equation (17).where , when , the MR is at the starting position , when , the MR is at the target position , represents the coordinates of the current state of the MR, and represents the coordinates of the next state of the MR.
4.2.2. Turning Angle (rad)
The turning angle is the sum of the change in the heading angle of MR from the start to the goal. When the turning angle is smaller, the path is smoother, less energy is consumed, and the completion time of the mission is shorter. This study defines the turning angle of MR from the start to the goal position as shown in equations (18)∼(21).where , represents the coordinates of the current state of MR; represents the coordinates of the next state of MR.
4.2.3. Turning Angle (rad)
This measure corresponds to the time consumed by the algorithm through iterative computation.
Functions of the path length, turning angle, and computation time to evaluate the path can be used to fully evaluate the degree of excellence of the algorithm to compute the path. The smaller the path length, angle, and computation time, the better is the proposed algorithm.
4.3. Parameters Selection
Before the PDQL algorithm is executed, the values of the two parameters: learning rate () and decay rate () in the Qvalue update equation, need to be determined based on the evaluation function of Section 4. According to Watkins et al. [29], when the value is small, the agent goes through all states in the environment and calculates all possible actions, and the Q value converges to the optimal value. When the value is large, it can expand the exploration range of the agent and prevent the agent from falling into the problem of local optimum. Therefore, based on the above theory, when , this study records the number of iterations that path length converges to the optimum. The test is repeated 30 times to take the average value; considering the computing time of MR, both and are taken as 0.9.
To demonstrate the effectiveness and generalizability of the proposed PDQL algorithm in local path planning, the PDQL algorithm is analyzed and compared with the commonly used MR local pathplanning algorithms [7]: RRT, APF, and fuzzy [34]. The parameters are set as shown in Table 1.
is the maximum number of iterations. For RRT, is the random adoption probability, is the step length, and is the maximum number of sampling. For APF, is the obstacle influence factor, and and are the scale factors. For Fuzzy, is the maximum turning angle for each time.
4.4. PDQL for MR Path Planning
In this section, the proposed PDQL algorithm enables the MR to perform the task with the optimal path length from the starting point to the goal point. To verify the effectiveness and generalizability of the proposed algorithm, tests were conducted on the nine grid maps in Figure 3. Each map shows the best path obtained by the PDQL algorithm (black solid line), S is the starting point, and G is the goal point.
The optimal (suboptimal or optimal under optimal conditions) results of PDQL in 30 repeated tests (the path with the smaller turning angle and computing time under the same pathlength condition is taken) are shown in Figure 3 (the paths of other compared algorithms are not indicated in Figure 3 for a clearer representation of the paths of PDQL). Experimental data and analysis results are recorded in Table 2, and it is checked whether there is a significant difference between the PDQL algorithm and other comparison algorithms by ttest (considering a 0.05 significance level). Among the results, the path length and turning angle of APF and fuzzy are the same in each experimental result, and there is no ttest.
Compared mean path length, turning angle, and computing time of various pathplanning algorithms for MR are shown in Figures 4–6 (the interval on each bar denotes the standard deviation).
It can be seen that in the nine test maps, the PDQL can effectively solve the pathplanning problem of MR from the starting position to the goal position without collision with obstacles under short pathlength condition, and it can achieve excellent results in each environment. These results show that the proposed approach achieves the optimal path length in different environments, which demonstrates its practicality, generalizability, and robustness. Compared to the existing methods, the proposed approach can effectively find the optimal local path. The limitation is that the computing time is longer compared to the comparison algorithms, and it will exponentially grow with the size of the map. The proposed approach has the smallest standard deviation among the comparison algorithms in all cases, which demonstrates the superiority and stability of the PDQL algorithm under different environments. In the experimental process, the randomness of the path length and turning angle derived by the RRT algorithm is too large (standard deviation) and is not suitable for the MR path planning. For APF and fuzzy, although the stability of the path length and turning angle derived each time is good, it requires adjusting the corresponding parameters to avoid the problem of falling into local minima. The APF is not able to calculate the results when the obstacles are too dense.
These results of global path planning for MR show that our method achieves the optimal path length and turning angle in different environments, which demonstrates its practicality, generalizability, and robustness. Compared to the existing methods, the proposed approach can effectively escape from trap points due to local minima and find the optimal global path. The limitation is that the computing time is longer compared to the comparison algorithms, but it is within the acceptable range.
5. Conclusions
In this study, the PDQL for MR pathplanning algorithm is proposed, and simulation experiments are conducted in different environments. The environment in the path planning is unknown, and the PDQL can effectively solve the MR pathplanning task. By combining APF with the Qlearning algorithm to initialize the Qtable and setting the static and dynamic reward functions to provide the prior knowledge in the environment to the MR, the convergence speed and computing time of CQL are accelerated.
To demonstrate the effectiveness and generality of the proposed algorithm, simulation experiments are set up, fully considering the unknown environment in the actual work. It demonstrates that the proposed PDQL algorithm can effectively solve the pathplanning task, and the mean, standard deviation, and ttest of the data analysis show that the proposed algorithm yields the smallest path length and smoothness, which significantly speeds up the computation time and convergence speed of the CQL. The problem of MR falling into local minima and not being able to derive effective paths is avoided.
In future work, the direction of the work can be derived from the problem that Qlearning computation time exponentially grows with the range of the environment. To solve the problem of excessive dimensionality of Qtables in reinforcement learning, Qlearning is combined with neural networks in order to effectively reduce the computation time and the waiting time for MR to perform actions [24, 25], and on the other hand, for the multiMR pathplanning problem [35]. Qlearning minimizes the path length and arrival time of all robots in the environment to their respective destinations and reduces the turning angle of each robot.
Data Availability
The data used to support the findings of this study are available from the corresponding author.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was partially supported by the Basic Scientific Research Business Cost Scientific Research Project of Heilongjiang Provincial University (135509114) and the Joint Guiding Project of Natural Science Foundation of Heilongjiang Province under grant no. LH2019F038.