Abstract

Aiming at the formation and path planning of multirobot systems in an unknown environment, a path planning method for multirobot formation based on improved Q-learning is proposed. Based on the leader-following approach, the leader robot uses an improved Q-learning algorithm to plan the path, and the follower robot achieves a tracking strategy based on the gravitational potential field (GPF) by designing a cost function to select actions. Specifically, to improve Q-learning, the Q-value is initialized by environmental guidance from the target's GPF. Then, a virtual obstacle-filling avoidance strategy is presented to fill non-obstacle grids that are judged to tend toward concave obstacles with virtual obstacles. Besides, the simulated annealing (SA) algorithm, whose controlling temperature is adjusted in real time according to the learning situation of the Q-learning, is applied to improve the action selection strategy. The experimental results show that the improved Q-learning algorithm reduces the convergence time by 89.9% and the number of convergence rounds by 63.4% compared with the traditional algorithm. With the help of this method, multiple robots have a clear division of labor and quickly plan a globally optimized formation path in a completely unknown environment.

1. Introduction

As robots become more and more widely used in various industries, a single robot cannot be competent for complex tasks. Therefore, multirobot formation [1] and path planning have become research hotspots, with good applications [2, 3] in collaborative search, exploration, handling, rescue, and group operations. Path planning of multirobot formation requires multiple robots to form a formation and maintain this positional relationship while moving to the target. It is necessary not only to avoid obstacles safely but also to find a better path. In addition, compared with the simpler path planning problem in a known environment, an unknown environment places higher demands on the robots' ability to plan paths. There have been many implementation methods for multirobot formation, including the behavior-based method [4], the virtual structure method [5], and the leader-following method [6]. The behavior-based method designs sub-behaviors in advance and chooses which behavior to execute according to changes in the situation, but it is difficult to integrate the various behaviors accurately in a complex environment. The virtual structure method regards the formation as a fixed rigid structure and cannot effectively avoid obstacles. The leader-following method, with the advantage of a simple and flexible structure, mainly realizes collaboration by sharing information from the leader. For a robot's path planning, the A* algorithm [7] and reinforcement learning (RL) algorithms [8] are commonly used in global path planning; the former can effectively solve for the optimal path but needs to know all the environmental information in advance, while the latter can learn autonomously in the environment but takes more time. The artificial potential field (APF) method [9] is widely used in local path planning; it can cope with a real-time changing environment but lacks global planning ability.

For the problem of multirobot formation and path planning, Chen et al. [10] proposed a new leader-following control framework by introducing the RH method, which enables fast convergence of a formation task for a group of mobile robots. Based on the path planning of a single robot, Sruthi et al. [11] designed a nonlinear tracking controller to achieve multirobot formation. These two methods require rigorous modeling of the system and cumbersome theory, which limits their practical application. By combining formation control with leader-following and priority methods, Sang et al. [12] used the MTAPF algorithm with an improved A* algorithm for path planning. Das and Jena [13] implemented collision-free path planning for multiple robots by using an improved particle swarm algorithm and evolutionary operators. Qu et al. [14] used a modified genetic algorithm with an added co-evolution mechanism to plan paths for multiple robots. Lazarowska [15] used a discrete APF to find crash-free paths for robots in dynamic and static environments. Some of the above methods cannot operate in an unknown environment, and some cannot plan an optimal path.

At present, Q-learning is a widely applied reinforcement learning algorithm. The limitation of Q-learning is that it is trial-and-error learning, which requires constant iteration and is time-consuming. Thus, it needs to be improved to quickly plan a globally optimal path. Maoudj and Hentout [16] initialized the Q-table with a distance-based reward function to accelerate convergence. Soong et al. [17] integrated the prior knowledge gained from the FPA into traditional Q-learning, which provided a good exploration basis for accelerating the learning of mobile robots. Xu and Yuan [18] increased the movement step length and the number of movement directions of the robot to plan a fast and smooth path. Oh et al. [19] specified the initial Q-value of traditional Q-learning through fuzzy rule-based Q-learning, which sped up learning and stabilized convergence. Yan and Xiang [20] initialized the Q-table by using the inverse Euclidean distance from the current position to the goal position, which improves the efficiency of Q-learning. The above methods all initialize the Q-value simply with some prior information to improve the algorithm, without considering the avoidance of concave obstacles or the adjustment of the action selection strategy.

In summary, there are still many difficult problems in the formation and path planning of multiple robots in unknown environments. In this paper, we adopt the leader-following approach to study the multirobot dynamic formation problem. The innovations of this paper are as follows: an improved Q-learning algorithm is presented to plan paths, in which environmental guidance and a virtual obstacle-filling avoidance strategy are added to accelerate convergence and the SA algorithm is applied to improve the action selection strategy; the follower robot achieves a tracking strategy based on the GPF by designing a cost function to select actions.

2.1. Q-Learning Algorithm

The Q-learning algorithm [21] is an RL algorithm based on temporal difference, which combines the Monte Carlo sampling method and the bootstrapping idea of dynamic programming. It is described with the Markov decision process as follows. Firstly, a limited state space and action space are given. When the robot needs to accomplish a certain task, it selects and performs an action in the current state, which interacts with the environment. Then, the robot enters the next state and is given an instant reward as feedback by the environment. Finally, the value function is updated according to the update rule by using the reward that is passed to it. One round continues the above process until the robot reaches the target, and the rounds are iterated until the cumulative reward is maximized. The update equation of the Q function is

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ · max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]

where s_t is the state and a_t is the action at the current time t, s_{t+1} is the state and a_{t+1} is the action at the next time, r_t is the reward obtained by performing action a_t at state s_t, α is the learning rate, and γ is the discount factor.

In order to ensure the exploratory nature of the algorithm, the ε-greedy strategy is usually adopted, with probability 1 − ε of selecting the action that maximizes the value function, and a small probability ε still reserved for random exploration. The mathematical equation of the ε-greedy strategy is

π(s) = a_random,           if ξ < ε
π(s) = argmax_a Q(s, a),   otherwise

where π(s) is the selected strategy, ε is the greedy factor, ξ is a random number, a_random is a randomly selected action, and argmax_a Q(s, a) is the action that maximizes the Q-value at state s.

The classical Q-learning algorithm is described in Algorithm 1.

begin
  Initialization:
   Q(s, a) ← 0, s ∈ S, a ∈ A %Initialize the Q-value with 0, determine the state set S and the action set A containing four actions
  for (episode ≤ m) %The episode cannot exceed m, which is the maximum number of episodes
   Given initial state s;
   while (s is not the target state)
    (1) Select an action a at state s according to π; %π is the action selection strategy
    (2) Execute the action a, then enter state s′ and get a reward r; %Get immediate rewards by performing actions to interact with the environment
    (3) Update Q(s, a) using the update equation;
    % Update the value function according to the update equation by using the reward
    (4) s ← s′; %Update state
   end-while
   episode ← episode + 1; % Update episode
  end-for
end
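
As a concrete illustration of Algorithm 1, the following Python sketch runs tabular Q-learning with an ε-greedy policy on a small hypothetical grid world; the grid size, reward values, and parameter settings are illustrative assumptions and not the configuration used in this paper's experiments.

import random
from collections import defaultdict

ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # four actions: right, left, down, up
SIZE, GOAL = 5, (4, 4)                        # illustrative 5 x 5 grid with the goal in a corner
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2         # learning rate, discount factor, greedy factor

Q = defaultdict(float)                        # Q(s, a) initialized to 0

def step(state, action):
    """Apply an action, clip to the grid, and return (next_state, reward)."""
    nx = min(max(state[0] + action[0], 0), SIZE - 1)
    ny = min(max(state[1] + action[1], 0), SIZE - 1)
    nxt = (nx, ny)
    return nxt, (100.0 if nxt == GOAL else -1.0)

def choose_action(state):
    """Epsilon-greedy selection: explore with probability EPSILON, otherwise exploit."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

for episode in range(500):                    # iterate episodes until the policy stabilizes
    s = (0, 0)
    while s != GOAL:
        a = choose_action(s)
        s_next, r = step(s, a)
        best_next = max(Q[(s_next, b)] for b in ACTIONS)
        # Temporal-difference update of the action-value function
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s_next

print(max(ACTIONS, key=lambda a: Q[((0, 0), a)]))  # greedy action learned at the start state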
2.2. APF Method

The APF method constructs a virtual potential field artificially, including the GPF and the repulsive potential field. The target generates a gravitational force on the robot to make the robot move towards it. The GPF function is

U_att(X) = (1/2) · k_att · d²(X, X_g)

where k_att is the GPF factor, X is the state position of the robot, U_att(X) is the GPF at X, X_g is the state position of the target to be reached by the robot, and d(X, X_g) is the distance between the robot and the target, which can be measured by specific sensors in practice and is one-dimensional. Gravitation is the negative gradient of the GPF, and the gravitational function is defined as

F_att(X) = −∇U_att(X) = k_att · (X_g − X)

Obstacles generate a repulsive force on the robot to make the robot move away from them. The repulsive potential field function is

U_rep(X) = (1/2) · k_rep · (1/d(X, X_o) − 1/d_0)²,   if d(X, X_o) ≤ d_0
U_rep(X) = 0,                                        if d(X, X_o) > d_0

where k_rep is the repulsive potential field factor, X is the state position of the robot, U_rep(X) is the repulsive potential field at X, X_o is the obstacle state position, and d(X, X_o) is the distance between the robot and the obstacle, which can be measured by specific sensors in practice and is one-dimensional. d_0 is the influence radius of the obstacle, which is set artificially according to the experimental requirements in practice. Repulsion is the negative gradient of the repulsive potential field, and the repulsion function is defined as

F_rep(X) = k_rep · (1/d(X, X_o) − 1/d_0) · (1/d²(X, X_o)) · ∇d(X, X_o),   if d(X, X_o) ≤ d_0
F_rep(X) = 0,                                                             if d(X, X_o) > d_0
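
For reference, the following Python sketch evaluates the attractive and repulsive forces using the standard quadratic and Khatib-style forms summarized above; the factors K_ATT and K_REP and the influence radius D0 are placeholder values, not settings from this paper.

import numpy as np

K_ATT, K_REP, D0 = 1.0, 100.0, 2.0   # placeholder GPF factor, repulsive factor, influence radius

def attractive_force(x, x_goal):
    """Negative gradient of U_att = 0.5 * K_ATT * ||x - x_goal||^2, pointing at the target."""
    return -K_ATT * (np.asarray(x, dtype=float) - np.asarray(x_goal, dtype=float))

def repulsive_force(x, x_obs):
    """Negative gradient of the repulsive potential; zero outside the influence radius D0."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_obs, dtype=float)
    d = np.linalg.norm(diff)
    if d > D0 or d == 0.0:
        return np.zeros_like(diff)
    return K_REP * (1.0 / d - 1.0 / D0) * (1.0 / d**2) * (diff / d)

def total_force(x, x_goal, obstacles):
    """Combined force that steers the robot in the traditional APF method."""
    return attractive_force(x, x_goal) + sum(repulsive_force(x, o) for o in obstacles)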

Therefore, the traditional APF method guides the robot's direction of movement based on the combined force of gravitation and repulsion, but its shortcomings are as follows:
(1) When the robot is far away from the target, the gravitational force is much greater than the repulsive force, and the robot may hit an obstacle.
(2) When the robot is close to the target, nearby obstacles repel the robot so strongly that it cannot reach the target.
(3) When the two opposing forces just cancel out, a local optimum or oscillation may appear.

Because of the above shortcomings, the APF method generally cannot be used directly and needs to be improved before use.

2.3. SA Algorithm

The idea of the SA algorithm comes from the solid annealing process; it is an algorithm that jumps out of local optima to obtain the global optimum. The algorithm uses a temperature parameter to control convergence in finite time. Firstly, the initial temperature T_0 and the end temperature are set. The algorithm starts from the initial state and takes it as the current state. Then, it generates a new state in the neighborhood of the current state and determines whether to accept the new state based on the Metropolis criterion. The generation of new states iterates while the temperature T decays, until T reaches the end temperature. Finally, the algorithm ends with a global approximate optimal solution.

The Metropolis criterion is that when the system enters a state j due to a certain change in state i, the energy of the system correspondingly changes from E(i) to E(j), and the acceptance probability of the system moving from i to j is

P(i → j) = 1,                        if E(j) < E(i)
P(i → j) = exp(−(E(j) − E(i)) / T),  if E(j) ≥ E(i)

When E(j) < E(i), the new state is accepted as the current state. When E(j) ≥ E(i), if exp(−(E(j) − E(i)) / T) > ξ, the new state is accepted as the current state; otherwise, the new state is not accepted and the system remains in the current state. ξ is a randomly generated number in [0, 1).
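
A compact Python sketch of the Metropolis acceptance rule and a simple geometric cooling loop is given below for a generic minimization problem; the energy and neighbor functions and the decay rate are illustrative assumptions.

import math
import random

def accept(energy_old, energy_new, temperature):
    """Metropolis criterion: always accept an improvement, otherwise accept with exp(-dE/T)."""
    if energy_new < energy_old:
        return True
    return math.exp(-(energy_new - energy_old) / temperature) > random.random()

def anneal(initial_state, energy, neighbor, T0=10.0, T_end=1e-3, decay=0.95):
    """Illustrative SA loop: cool geometrically from T0 down to the end temperature T_end."""
    state, T = initial_state, T0
    while T > T_end:
        candidate = neighbor(state)          # generate a new state in the neighborhood
        if accept(energy(state), energy(candidate), T):
            state = candidate                # move to the accepted state
        T *= decay                           # temperature decay
    return state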

3. Improved Q-Learning Proposed for Path Planning of the Leader Robot

3.1. Environmental Guidance Based on GPF of Target

The traditional Q-learning algorithm has no prior knowledge. In the early learning process, the robot's aimless exploration causes many invalid iterations and slow convergence. Therefore, the idea of the APF method is introduced to guide movement. In this paper, the robot plans a path in an unknown environment where only the start and the target of the task are known. Because of the limited environmental information and the shortcomings of the traditional APF method, only the GPF of the target is introduced to initialize the Q-value, without considering the effect of the repulsive potential field. In order to make the target direction consistent with the increasing direction of the Q-value, the GPF function is constructed as

U_att(s) = k_att / (d(s) + C)

where k_att is the GPF factor, which is negative and makes the value inversely proportional to the distance, d(s) is the distance from the current position to the target, and C is a positive constant that prevents the denominator from being 0.

When the robot moves, the instant reward is detected by sensors and the Q-table is initialized at the same time. Therefore, the instant reward of environmental information is added to the Q-value initialization. The purpose of RL is to maximize the cumulative reward by maximizing the Q-value. The robot always tends to choose the action with the maximum Q-value, which guides the robot to move toward the target while avoiding obstacles. The mathematical equation of Q-value initialization with environmental guidance based on the GPF of the target is

Q_0(s, a) = λ · [ r + γ · U_att(s′) ]

where λ is the scale coefficient adjusted according to the actual algorithm, γ is the discount factor, and U_att(s′) is the GPF at state s′.
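
Under the reading of the initialization given above (an inverse-distance GPF with a negative factor combined with the sensed reward), a minimal Python sketch could look as follows; the function names and parameter values are hypothetical placeholders rather than the paper's exact formulation.

import math

K_ATT, C = -10.0, 1.0        # negative GPF factor and positive constant (placeholder values)
LAMBDA, GAMMA = 0.1, 0.9     # scale coefficient and discount factor (placeholder values)

def gpf(state, goal):
    """Inverse-distance GPF of the target: U(s) = K_ATT / (d(s, goal) + C)."""
    d = math.dist(state, goal)
    return K_ATT / (d + C)

def initial_q(reward, next_state, goal):
    """Guided initial Q-value combining the sensed instant reward and the target's GPF at s'."""
    return LAMBDA * (reward + GAMMA * gpf(next_state, goal))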

3.2. Virtual Obstacle-Filling Avoidance Strategy

There will be concave obstacles in a more complex environment. The traditional Q-learning algorithm can escape from such obstacles through continuous exploration, but this greatly extends the learning time. In addition, after adding GPF guidance, the robot is even more likely to fall into concave obstacles and be unable to escape. In the grid map environment, obstacle grids are infeasible areas and the rest are feasible areas. Setting certain key grids, which are feasible but lie on paths that possibly tend toward concave obstacles, as infeasible areas can effectively fill and avoid concave obstacles. Therefore, a virtual obstacle-filling avoidance strategy is established for concave obstacles. The strategy is to judge whether the current grid possibly tends toward a concave obstacle by adding real-time detection information based on the target tendency before the robot takes the next step. Then, it fills non-obstacle grids on the path that possibly tends toward the concave obstacle with virtual obstacles until the concave shape is filled. The filled concave obstacle as a whole becomes an infeasible area, so the robot will not fall into the concave obstacle in subsequent iterations. This strategy makes full use of the sensors and the environmental information that has been learned. It not only prevents the robot from falling into concave obstacles but also reduces invalid exploration of some infeasible positions, which improves the efficiency of path planning.

The specific implementation of the virtual obstacle-filling avoidance strategy is as follows.

Firstly, the sensor is used to detect position status and distance in real time, and a current position-action array is established to store the feasible adjacent positions of the current position. Before the robot moves, the Euclidean distance from each grid position adjacent to the robot's current position to the target position is calculated in turn. The specific steps for judging whether the current grid possibly tends toward a concave obstacle, according to the calculated distances, are as follows.

If the distance is less than the distance from the current position to the target position, it is further judged whether this adjacent position is an obstacle or not. If the adjacent position is not an obstacle, it is feasible and will be added to the corresponding position of the current position-action array.

If the adjacent position is further away from the target, or it is judged to be an obstacle, it is an infeasible position and will not be added to the corresponding position of the array.

If the final current position-action array is empty, it indicates that the current position tends entirely toward infeasible areas and may be inside a concave obstacle. The current position will then be filled with a virtual obstacle.

Finally, this judgement is made at each step of the robot until the concave obstacles are filled.

One-step filling of the virtual obstacle-filling avoidance strategy is shown in Figure 1. In the figure, the red grid is the robot, and the yellow grid is the target. As Figure 1(a) shows, the robot enters the grey concave obstacle during the path planning process. According to the distance calculation, the adjacent positions below, to the right of, and to the lower right of the robot's current position are determined to be nearer the target; they are the dark grey grids in Figure 1(b). These three adjacent positions are further judged to be obstacles, i.e., infeasible positions, indicating that the current position tends entirely toward an infeasible area and may be inside a concave obstacle. Thus, the current position is filled with a light grey virtual obstacle, as seen in Figure 1(c).
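
The one-step judgement described above can be sketched in Python as follows; the grid representation as coordinate tuples, the set-based obstacle storage, and the 8-neighbourhood are assumptions made for illustration.

import math

NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]  # eight actions

def fill_if_trapped(pos, goal, obstacles, virtual_obstacles):
    """Return the adjacent positions of pos that move closer to the goal and are obstacle-free.
    If none exist, pos is judged to tend toward a concave obstacle and is filled virtually."""
    d_current = math.dist(pos, goal)
    feasible = []
    for dx, dy in NEIGHBORS:
        adj = (pos[0] + dx, pos[1] + dy)
        closer = math.dist(adj, goal) < d_current           # target-tendency test
        blocked = adj in obstacles or adj in virtual_obstacles
        if closer and not blocked:
            feasible.append(adj)                            # entry of the current position-action array
    if not feasible:
        virtual_obstacles.add(pos)                          # one-step virtual obstacle filling
    return feasible

Keeping the filled cells in a separate set mirrors the distinction the paper draws between actual (black) obstacles and filled (gray) virtual obstacles.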

3.3. Action Selection Strategy Improved by SA

In the process of path planning with Q-learning, the robot expands its range of movement by exploring the environment and accumulates knowledge of environmental rewards and punishments. Finally, it uses the value function to select the optimal action. In the robot's iterative learning process, more exploration is required in the early stage, but too much or too prolonged exploration greatly extends the learning time and reduces learning efficiency. On the contrary, too little exploration leads to insufficient experience, and the finally selected action may be suboptimal. Thus, it is necessary to balance exploration and exploitation. The traditional ε-greedy strategy often used in the Q-learning algorithm balances exploration and exploitation to a certain extent by setting ε. However, a fixed greedy factor throughout the learning process means that random actions are selected with the same probability each time, which causes slow convergence and fluctuations after convergence. Therefore, the greedy factor needs to be adjusted dynamically as learning proceeds. One method commonly used in experiments is to let ε decrease at a fixed rate, but setting a fixed decrease rate based on experience is not universal.

In response to the above problems, an ε-greedy strategy improved by SA, which adjusts ε dynamically, is proposed. The controlled temperature of SA is adjusted in real time according to the learning situation of the Q-learning algorithm. The algorithm explores as much as possible in the early stage of path planning to gain more prior knowledge and prevent local optima, and it cancels unnecessary exploration when approaching convergence later. The steps of the action selection strategy improved by the SA algorithm are as follows:
(1) Define the temperature parameter T and set the initial value T_0. Then, use the sample standard deviation of the step numbers of the last n consecutive iterations to control the cooling temperature. The mathematical equation of T is

T = k_T · sqrt( (1/(n − 1)) · Σ_{i=1}^{n} (step_i − avg_step)² ) + c_T

where step_1, step_2, …, step_n are the numbers of steps of the n consecutive iterations, avg_step is the average number of steps of the n consecutive iterations, k_T is the control factor, which is obtained by repeated adjustment according to the experimental effect and keeps T in a suitable range, and c_T is a small non-zero constant that prevents T from being 0 after convergence.
(2) Calculate the acceptance probability of a randomly selected action according to the Metropolis criterion and use it to redefine the greedy factor ε at state s. The mathematical equation of ε is

ε = exp( −( Q(s, a*) − Q(s, a_r) + c_ε ) / T )

where Q(s, a_r) is the Q-value of the random action selected at state s, Q(s, a*) is the Q-value of the optimal action at state s, c_ε is a non-zero constant that prevents the numerator from being 0, and T is the temperature parameter.
(3) If ε > ξ, choose the action randomly; otherwise, choose a* = argmax_a Q(s, a), where ξ is a randomly generated number (a code sketch of this selection procedure follows these steps).
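
Under the reconstruction of T and ε written above, the SA-driven action selection could be sketched in Python as follows; the constants K_T, C_T, and C_EPS and the exact combination of terms are assumptions rather than the paper's exact settings.

import math
import random
import statistics

K_T, C_T, C_EPS = 0.03, 0.1, 1.0     # control factor and small constants (placeholder values)

def temperature(recent_steps):
    """Cooling temperature driven by the sample standard deviation of the last n step counts."""
    return K_T * statistics.stdev(recent_steps) + C_T

def select_action(q_row, temperature_value):
    """SA-improved epsilon-greedy: accept a random action with a Metropolis-style probability."""
    actions = list(q_row)
    a_rand = random.choice(actions)
    a_best = max(actions, key=lambda a: q_row[a])
    eps = math.exp(-(q_row[a_best] - q_row[a_rand] + C_EPS) / temperature_value)
    return a_rand if random.random() < eps else a_best

As the step counts of consecutive iterations stabilize, the standard deviation shrinks, the temperature falls toward C_T, and ε collapses toward 0, which is the intended transition from exploration to exploitation.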

3.4. Improved Q-Learning Algorithm

Compared with the original algorithm, there are three innovations in the improved Q-learning algorithm proposed in this paper.

Firstly, the Q-table of the original Q-learning algorithm is initially a zero-value table without any prior knowledge. The improved Q-learning algorithm uses the GPF of the target, which is known in the task, to initialize the Q-table, which adds environmental guidance and reduces invalid exploration.

Secondly, in the original algorithm the robot moves immediately after selecting an action. The improved algorithm applies the virtual obstacle-filling avoidance strategy for judgement before each step and fills non-obstacle grids that are judged to tend toward concave obstacles with virtual obstacles.

Finally, the original algorithm uses the traditional ε-greedy strategy to select actions. The ε-greedy strategy improved by the SA algorithm is proposed in the new algorithm. It adjusts ε dynamically by adjusting the temperature in real time according to the learning situation of Q-learning.

The steps of the improved Q-learning algorithm are shown in Algorithm 2.

begin
  Initialization:
    Q-table, current position-action array Act(s), n, T ← T_0, S, A
    %Establish the Q-table and the current position-action array, define n as the number of consecutive iterations, define T_0 as the initial temperature, determine the state set S and the action set A containing eight actions
  for (episode ≤ m) %The episode cannot exceed m, which is the maximum number of episodes
    Given initial state s;
    If episode > n Then use the step numbers of the last n episodes to calculate T and update ε; %Adjust ε dynamically using the temperature T of the SA algorithm
    while (s is not the target state)
     (1) If Q(s, a) exists in the Q-table then continue to the next step;
      Else use the GPF-based initialization to initialize Q(s, a);
      %Initialize the Q-table
     (2) If an adjacent position is a feasible area towards the target then add it to the corresponding position of Act(s);
      Else the corresponding position of Act(s) is kept empty;
      % Add the feasible adjacent positions of the current position to the current position-action array
     (3) If Act(s) is empty then s tends entirely toward the infeasible area, which possibly belongs to a concave obstacle, and fill it with a virtual obstacle;
      Select action a in state s according to the ε-greedy strategy improved by SA;
      %Fill concave obstacles using the virtual obstacle-filling avoidance strategy while selecting actions
     (4) Execute a in s, enter s′ and get r;
     (5) Update Q(s, a) using the update equation;
     (6) s ← s′;
    end-while
    episode ← episode + 1;
  end-for
end

4. A Path Planning Method for Multirobot Formation

4.1. Tracking Strategy Based on GPF for Follower Robot

The steps of the tracking strategy based on GPF for the follower robot are as follows:

Step 1: if the follower robot obtains the coordinates broadcast by the leader robot, it will determine the next target state according to the formation, i.e., the desired target position at this time. Otherwise, it means that the formation has reached the target position and the path planning ends.

Step 2: the follower robot moves toward the target position. Firstly, the robot uses the cost function to calculate the cost of each of the eight neighboring states of the current state and determines the state with the smallest cost. Then, it selects the corresponding action and executes it. At the same time, it adopts the virtual obstacle-filling avoidance strategy in parallel with the leader robot to share information. Specifically, the cost function is designed by using the idea of the GPF. The GPF of the target at the current position is measured by the Euclidean distance from the current position to the target position, to which it is proportional. When the checked state is an obstacle, the penalty function is given a positive value; otherwise, the value is 0. The equation for measuring the GPF is

U(s) = sqrt( (x − x_g)² + (y − y_g)² )

The cost function equation is

cost(s_t, a_t) = μ · U(s_{t+1}) + P(s_{t+1})

where U is the measured GPF, x is the horizontal coordinate of the current state, y is the vertical coordinate of the current state, x_g is the horizontal coordinate of the target at this moment, y_g is the vertical coordinate of the target at this moment, cost(s_t, a_t) is the cost function at time t, s_t is the state at time t, a_t is the action at time t, s_{t+1} is the neighboring state reached by a_t, μ is the adjustment coefficient, and P is the penalty function.

Step 3: if the state with the minimum cost entered is the target state at this time, return to step 1 and continue. If the state is not the target state at this time, go to step 2 and continue.
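
A minimal Python sketch of this greedy one-step tracking rule is given below; the cost combination cost = μ·U + P, the penalty value, and the neighbourhood representation are assumptions based on the description above.

import math

MU, PENALTY = 1.0, 1000.0   # adjustment coefficient and obstacle penalty (placeholder values)
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def follower_step(pos, desired_target, obstacles):
    """Move the follower one cell toward its desired formation position by minimizing the cost."""
    def cost(cell):
        u = math.dist(cell, desired_target)               # GPF measured by Euclidean distance
        p = PENALTY if cell in obstacles else 0.0         # penalty function for obstacle cells
        return MU * u + p
    candidates = [(pos[0] + dx, pos[1] + dy) for dx, dy in NEIGHBORS]
    return min(candidates, key=cost)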

4.2. Design Scheme of Path Planning for Leader-following Formation

Adopting the leader-following method, the design scheme of path planning for leader-following formation proposed in this paper includes three parts:
(1) Initialization: the grid environment is adopted, and the starting positions and target positions of the multiple robots are determined. A leader-following formation is designed, and the robots are divided into two types: leader and follower. One robot is selected as the leader, or a virtual robot is assumed to act as the leader, and the rest are follower robots. Multiple robots have eight actions: up, down, left, right, upper left, upper right, lower left, and lower right. Each robot is equipped with a sensor, which can detect the environmental information of the grids centered on its position. The multirobot actions, step length, and detection range of the sensor are shown in Figure 2.
(2) Path planning of the leader robot: the leader robot is responsible for planning the path. It uses the improved Q-learning algorithm to plan a globally optimal path that avoids simple obstacles and concave obstacles after trial-and-error training. At the same time, it broadcasts the position of each step and some environmental information to the follower robots. The process of the leader robot's path planning is shown in Figure 3.
(3) Local following of the follower robot: the follower robot is responsible for following the leader robot to maintain the formation according to the requirements. When the follower robot receives the position information broadcast by the leader robot, it determines the desired target depending on the formation (a simple sketch of this computation follows this list). Then, it follows locally using the tracking strategy based on the GPF and avoids obstacles autonomously. The process of the follower robot's path planning is shown in Figure 4.
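
As a small illustration of part (3), the desired position of a follower could be derived from the leader's broadcast position and a fixed formation offset, as in the hypothetical sketch below; the offset values and naming are assumptions, not the formation geometry used in Section 5.2.

# Hypothetical triangular formation: each follower keeps a fixed offset from the leader.
FORMATION_OFFSETS = {"follower_1": (-1, -1), "follower_2": (-1, 1)}   # assumed grid offsets

def desired_position(leader_position, follower_name):
    """Desired target of a follower at the current step, given the leader's broadcast position."""
    dx, dy = FORMATION_OFFSETS[follower_name]
    return (leader_position[0] + dx, leader_position[1] + dy)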

5. Experimental Analysis

According to the design scheme of path planning for multiple robots, the method is tested experimentally. The experiments use the Python standard GUI toolkit Tkinter to establish the simulation environments.

5.1. Comparison Experiments of the Improved Q-Learning Algorithm

For the improved Q-learning algorithm, a grid map with three elements, namely the starting point, the target point, and obstacles, is first established. The map size is set to 20 × 20 grids, and each grid is drawn at a fixed resolution in pixels. The starting position of the robot, represented by a red grid, is set at (0, 0), and the target position, represented by a yellow grid, is at (19, 19). Obstacles, which are black grids, are randomly placed on the map, including concave and simple obstacles. To distinguish actual obstacles from the virtual obstacles filled during the operation of the algorithm, virtual obstacles are shown as gray grids.

The experiment is carried out in a comparative way, and five algorithms are implemented: Q-L1 is the traditional Q-learning algorithm; Q-L2 is the Q-learning algorithm with the dynamic greedy factor of SA; Q-L3 adds the environmental guidance of the GPF on the basis of Q-L2; Q-L4 is the Q-learning algorithm proposed in this paper with the improvements of Sections 3.1, 3.2, and 3.3; Q-L5 is the Q-learning algorithm with a modified reward function based on Q-L4. The implementation details of the Q-L1 to Q-L5 algorithms are shown in Table 1.

The common parameter settings of the algorithms are as follows: the maximum number of iteration rounds m is 10000, the learning rate α is 0.01, and the discount factor γ is 0.9. For algorithms using the traditional ε-greedy strategy, the greedy factor ε is 0.2, and convergence is determined when the standard deviation of the step numbers over 10 consecutive iterations is less than 5. The parameter settings for algorithms using SA are as follows: the initial temperature T_0 is set to 10, the number of consecutive iterations n is set to 10, the constant c_T is set to 0.1, the control factor k_T is set to 0.03, and the non-zero constant c_ε is set to 1. In the algorithms improved with the GPF method, the GPF factor k_att is set to −10, the constant C is set to 735, and the scale coefficient λ is set to 0.1. The reward function is set as follows:

After setting the parameters, the simulation experiments are conducted. The path planning maps, the curves of cumulative reward versus round, and the curves of path planning step numbers versus round are obtained. From these figures, the path planning and convergence of each algorithm can be seen. Figures 5(a)–5(e), respectively, show the path planning maps of the robot under algorithms Q-L1 to Q-L5. Figures 6(a)–6(e), respectively, show the change of cumulative reward with rounds for the robot under algorithms Q-L1 to Q-L5. Figures 7(a)–7(e), respectively, show the change of step numbers with rounds for the robot under algorithms Q-L1 to Q-L5.

Figure 5(c) shows that the robot is trapped in a concave obstacle and cannot escape. Figure 6(c) shows that the cumulative reward curve changes irregularly during the iterative process. Figure 7(c) shows that the step number curve changes irregularly during the iterative process. These three results of Q-L3 indicate that the algorithm does not converge during the iterative process and that the robot cannot reach the target when only the GPF of the target is added and concave obstacles are encountered.

Figures 5(a), 5(b), 5(d), and 5(e) show that the robot using the Q-L1, Q-L2, Q-L4, and Q-L5 algorithms effectively avoids the black obstacles and plans a red path from the start to the target, but the paths planned by the Q-L1 and Q-L2 algorithms are more tortuous, the Q-L4 algorithm plans a smoother feasible path, and the Q-L5 algorithm plans the optimal path. The cumulative reward curves shown in Figures 6(a), 6(b), 6(d), and 6(e) and the step number curves shown in Figures 7(a), 7(b), 7(d), and 7(e) both become stable after iterating to a certain round, indicating that the algorithms gradually converge as the iterations proceed.

However, the curves in Figures 6(a) and 7(a) converge with small fluctuations, the curves in Figures 6(b) and 7(b) converge smoothly, and the curves in Figures 6(d), 7(d), 6(e), and 7(e) reach smoothness in fewer rounds of iteration. This shows that by adding the improvements proposed in this paper to the Q-L4 and Q-L5 algorithms, the experiments achieve better results. The robot initializes the Q-value through the environmental guidance based on the GPF of the target while moving, so the robot is guided by the target direction all the time; this removes invalid movement and shortens the convergence time. When the robot encounters a concave obstacle, it effectively identifies the infeasible area and fills it with light gray virtual obstacles to prevent the robot from falling into the concave obstacle. The SA method is used to dynamically adjust the greedy factor, which accelerates the convergence of the algorithm and keeps it stable after convergence. In addition, the Q-L5 algorithm adjusts the reward function on the basis of the Q-L4 algorithm by giving each step a small penalty, and the robot learns the optimal path with the maximum cumulative reward.

Table 2 compares the performance of the above five algorithms after path planning. The data in the table are averages over several experiments. The analysis is as follows: based on the traditional Q-learning algorithm, Q-L2 uses the SA algorithm to improve the ε-greedy strategy; although the number of convergence rounds increases, the convergence time is shortened and the overall stability of path planning is improved. Comparing the algorithms Q-L3 and Q-L4, if the environmental guidance of the GPF is added without the virtual obstacle-filling avoidance strategy, the algorithm has difficulty converging when encountering concave obstacles. Comparing the algorithms Q-L2 and Q-L4, adding the environmental guidance and the virtual obstacle-filling avoidance strategy reduces the convergence time by 98.5% and the number of convergence rounds by 96.6%; the total step number and the length of the path stabilize at 26 and 32.6274, respectively. Comparing the algorithms Q-L4 and Q-L5, reasonably adjusting the reward function on top of the improved Q-learning algorithm proposed in this paper makes the robot quickly plan the optimal path, with a convergence time 89.9% shorter than that of the traditional Q-learning algorithm; the number of convergence rounds is reduced by 63.4%, the step number is reduced to 22, and the length of the path is reduced to 28.6274.

5.2. Experiments of Path Planning for Multirobot Formation

After the experiments on the improved algorithm for the leader robot's path planning, follower robots are added to verify the effectiveness of the path planning method for multirobot formation proposed in this paper. The experiment uses a triangular formation and three robots. The leader robot is represented by a red grid, and the follower robots are represented by a blue grid and a green grid, respectively; their starting positions and the leader robot's target position, which also determines the target positions of the follower robots, are set on the grid map. The leader robot uses the improved Q-learning algorithm Q-L5 to plan the optimal path with the same parameter settings as in the simulation experiments of Section 5.1. The two follower robots, respectively, use the tracking strategy based on the GPF to follow, with the penalty function value and the adjustment coefficient μ set accordingly.

Figure 8(a) shows that the three robots quickly reach the target in a fixed triangular formation in an obstacle-free environment. In an environment with static obstacles, the leader robot moves first, and the two follower robots immediately move to their corresponding positions in the formation. The green follower robot first encounters a black lateral obstacle; it moves along the obstacle toward the target direction and smoothly avoids the obstacle, and then it accelerates to the corresponding position of the formation at the current time to maintain the formation. Finally, the leader robot plans a red path, and the two follower robots avoid obstacles by themselves during the following process and plan a green path and a blue path, respectively. The three robots reach the target at the same time and complete the formation task, as shown in Figure 8(b).

6. Conclusion

In this paper, by combining the improved Q-learning algorithm with the idea of the GPF method, a method for multirobot formation and path planning is proposed. The division of labor among the multiple robots is clear. The leader robot uses the improved Q-learning algorithm to plan the path. It is found that adding the environmental guidance of the target's GPF and the virtual obstacle-filling avoidance strategy effectively accelerates iterative convergence and avoids concave obstacles, and that the action selection strategy improved by the SA method is stable and efficient. At the same time, the follower robot uses the tracking strategy based on the GPF to follow in real time, which is simple and efficient. This formation method effectively solves the formation and path planning problems of multiple robots in an unknown environment with concave obstacles. In the future, multirobot formation will be further studied in the context of dynamic environments and privacy protection.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

The work is supported by the National Natural Science Foundation of China (61673200), the Major Basic Research Project of Natural Science Foundation of Shandong Province of China (ZR2018ZC0438), and the Key Research and Development Program of Yantai City of China (2019XDHZ085).